Skip to content

Ingest from Public ATS APIs

Use HonestRoles ingestion commands to fetch public postings from supported ATS APIs and write canonical parquet input for pipeline runs.

When to use

Use this when you do not already have parquet input and want deterministic, ToS-safe ingestion from official public ATS endpoints.

Prerequisites

  • HonestRoles installed.
  • Internet access to supported ATS APIs.
  • Valid --source-ref values for the sources you ingest.

Supported --source values:

  • greenhouse (board token)
  • lever (site/company handle)
  • ashby (job board name)
  • workable (company subdomain with public careers API access)

Steps

  1. Run one source directly:
$ honestroles ingest sync --source greenhouse --source-ref stripe --format table
  1. Optional: tune operational controls (timeout, retries, and backoff):
$ honestroles ingest sync \
  --source lever \
  --source-ref netflix \
  --max-pages 10 \
  --max-jobs 1000 \
  --timeout-seconds 20 \
  --max-retries 5 \
  --base-backoff-seconds 0.5 \
  --user-agent "honestroles-batch/1.0" \
  --format table
  1. Optional: run validation-only quality checks before writing latest parquet:
$ honestroles ingest validate \
  --source greenhouse \
  --source-ref stripe \
  --quality-policy ingest_quality.toml \
  --strict-quality \
  --format table

For manual live smoke checks, use ingest_quality_smoke.toml to keep quality gates strict while relaxing freshness windows for long-lived archived postings.

  1. Optional: use batch ingestion with ingest.toml:
[defaults]
state_file = ".honestroles/ingest/state.json"
max_pages = 25
max_jobs = 5000
quality_policy_file = "ingest_quality.toml"
strict_quality = false
merge_policy = "updated_hash"
retain_snapshots = 30
prune_inactive_days = 90

[[sources]]
source = "greenhouse"
source_ref = "stripe"

[[sources]]
source = "lever"
source_ref = "netflix"
max_pages = 10
$ honestroles ingest sync-all --manifest ingest.toml --format table
  1. Optional: force a refresh or write raw payload:
$ honestroles ingest sync --source ashby --source-ref notion --full-refresh
$ honestroles ingest sync --source workable --source-ref your-company --write-raw
  1. Point your runtime pipeline at the latest parquet output:
[input]
kind = "parquet"
path = "dist/ingest/greenhouse/stripe/jobs.parquet"
  1. Run the runtime pipeline:
$ honestroles run --pipeline-config pipeline.toml

Expected result

Per source, ingestion writes:

  • latest parquet: dist/ingest/<source>/<source_ref>/jobs.parquet
  • per-run snapshot parquet: dist/ingest/<source>/<source_ref>/snapshots/<stamp>-<run>.parquet
  • catalog parquet: dist/ingest/<source>/<source_ref>/catalog.parquet
  • sync report: dist/ingest/<source>/<source_ref>/sync_report.json
  • optional raw payload: dist/ingest/<source>/<source_ref>/raw.jsonl (or adjacent to --output-parquet when that flag is explicitly set)
  • state: .honestroles/ingest/state.json

Batch runs also write:

  • dist/ingest/sync_all_report.json (or custom --report-file)

Incremental semantics:

  • Filtering uses posted+updated watermarks and recent IDs.
  • --full-refresh bypasses incremental filtering.
  • Tombstones are applied only on coverage-complete runs.
  • Truncated runs (hitting max-pages or max-jobs) do not tombstone missing records.
  • Pagination loop/repeat protection emits INGEST_PAGE_REPEAT_DETECTED and marks run coverage incomplete.
  • Merge policy controls latest conflict resolution (updated_hash, first_seen, last_seen).
  • Snapshot retention keeps newest retain_snapshots; older snapshots are pruned after successful sync.
  • Catalog compaction prunes inactive rows older than prune_inactive_days.
  • Quality policy checks run before latest overwrite; with --strict-quality, non-pass quality fails the command.
  • URL dedup preserves identity query keys (gh_jid, job_id, jobid, posting_id, position_id) and removes tracking params.

Normalization highlights by connector:

  • greenhouse: company_name -> company, first_published|updated_at -> posted_at, content -> description_text.
  • lever: workplaceType -> remote/work_mode, categories.organization|team -> company fallback.
  • ashby: publishedAt -> posted_at, descriptionPlain -> description_text, isRemote/workplaceType -> remote/work_mode.
  • workable: published_on|created_at -> posted_at, application_url -> apply_url, city/state/country/locations -> location, telecommuting -> remote.

Next steps