Ingest from Public ATS APIs¶
Use HonestRoles ingestion commands to fetch public postings from supported ATS APIs and write canonical parquet input for pipeline runs.
When to use¶
Use this when you do not already have parquet input and want deterministic, ToS-safe ingestion from official public ATS endpoints.
Prerequisites¶
- HonestRoles installed.
- Internet access to supported ATS APIs.
- Valid
--source-refvalues for the sources you ingest.
Supported --source values:
greenhouse(board token)lever(site/company handle)ashby(job board name)workable(company subdomain with public careers API access)
Steps¶
- Run one source directly:
$ honestroles ingest sync --source greenhouse --source-ref stripe --format table
- Optional: tune operational controls (
timeout, retries, and backoff):
$ honestroles ingest sync \
--source lever \
--source-ref netflix \
--max-pages 10 \
--max-jobs 1000 \
--timeout-seconds 20 \
--max-retries 5 \
--base-backoff-seconds 0.5 \
--user-agent "honestroles-batch/1.0" \
--format table
- Optional: run validation-only quality checks before writing latest parquet:
$ honestroles ingest validate \
--source greenhouse \
--source-ref stripe \
--quality-policy ingest_quality.toml \
--strict-quality \
--format table
For manual live smoke checks, use ingest_quality_smoke.toml to keep quality gates
strict while relaxing freshness windows for long-lived archived postings.
- Optional: use batch ingestion with
ingest.toml:
[defaults]
state_file = ".honestroles/ingest/state.json"
max_pages = 25
max_jobs = 5000
quality_policy_file = "ingest_quality.toml"
strict_quality = false
merge_policy = "updated_hash"
retain_snapshots = 30
prune_inactive_days = 90
[[sources]]
source = "greenhouse"
source_ref = "stripe"
[[sources]]
source = "lever"
source_ref = "netflix"
max_pages = 10
$ honestroles ingest sync-all --manifest ingest.toml --format table
- Optional: force a refresh or write raw payload:
$ honestroles ingest sync --source ashby --source-ref notion --full-refresh
$ honestroles ingest sync --source workable --source-ref your-company --write-raw
- Point your runtime pipeline at the latest parquet output:
[input]
kind = "parquet"
path = "dist/ingest/greenhouse/stripe/jobs.parquet"
- Run the runtime pipeline:
$ honestroles run --pipeline-config pipeline.toml
Expected result¶
Per source, ingestion writes:
- latest parquet:
dist/ingest/<source>/<source_ref>/jobs.parquet - per-run snapshot parquet:
dist/ingest/<source>/<source_ref>/snapshots/<stamp>-<run>.parquet - catalog parquet:
dist/ingest/<source>/<source_ref>/catalog.parquet - sync report:
dist/ingest/<source>/<source_ref>/sync_report.json - optional raw payload:
dist/ingest/<source>/<source_ref>/raw.jsonl(or adjacent to--output-parquetwhen that flag is explicitly set) - state:
.honestroles/ingest/state.json
Batch runs also write:
dist/ingest/sync_all_report.json(or custom--report-file)
Incremental semantics:
- Filtering uses posted+updated watermarks and recent IDs.
--full-refreshbypasses incremental filtering.- Tombstones are applied only on coverage-complete runs.
- Truncated runs (hitting
max-pagesormax-jobs) do not tombstone missing records. - Pagination loop/repeat protection emits
INGEST_PAGE_REPEAT_DETECTEDand marks run coverage incomplete. - Merge policy controls latest conflict resolution (
updated_hash,first_seen,last_seen). - Snapshot retention keeps newest
retain_snapshots; older snapshots are pruned after successful sync. - Catalog compaction prunes inactive rows older than
prune_inactive_days. - Quality policy checks run before latest overwrite; with
--strict-quality, non-pass quality fails the command. - URL dedup preserves identity query keys (
gh_jid,job_id,jobid,posting_id,position_id) and removes tracking params.
Normalization highlights by connector:
greenhouse:company_name -> company,first_published|updated_at -> posted_at,content -> description_text.lever:workplaceType -> remote/work_mode,categories.organization|team -> companyfallback.ashby:publishedAt -> posted_at,descriptionPlain -> description_text,isRemote/workplaceType -> remote/work_mode.workable:published_on|created_at -> posted_at,application_url -> apply_url,city/state/country/locations -> location,telecommuting -> remote.
Next steps¶
- Full flags and payload fields: CLI Reference
- Batch schema details: Ingest Manifest Schema
- Quality policy schema: Ingest Quality Policy Schema
- Source identifier lookup: Ingest Source-Ref Glossary
- Failure handling and retry tuning: Common Errors