Skip to content

Stage Contracts

Canonical stage behavior and data contracts.

Stage Order

Enabled stages execute in this fixed order:

  1. clean
  2. filter
  3. label
  4. rate
  5. match

Source Data Contract

Runtime normalizes columns to include these names:

  • id
  • title
  • company
  • location
  • remote
  • description_text
  • description_html
  • skills
  • salary_min
  • salary_max
  • apply_url
  • posted_at

Before normalization, runtime resolves source aliases into canonical names using:

  • Declarative source adapter from [input.adapter] (runs first)
  • Built-in aliases: location_raw -> location, remote_flag -> remote
  • Optional pipeline aliases from [input.aliases]

Conflict policy:

  • Canonical field wins.
  • Adapter/alias conflicts are recorded in diagnostics.

Validation requires:

  • title
  • At least one of description_text or description_html

Stage Responsibilities

  • Stage input/output object: JobDataset
  • clean: clean text/html and apply text-level policy such as dropping null titles
  • filter: apply remote_only, salary threshold, keyword filters, then run filter plugins
  • label: derive base labels, then run label plugins
  • rate: compute bounded rate_completeness, rate_quality, and rate_composite, then run rate plugins
  • match: compute bounded fit_score, sort descending, enforce top_k, generate application_plan

Output Invariants

  • rate_* metrics are bounded to [0.0, 1.0].
  • fit_score is bounded to [0.0, 1.0].
  • fit_rank starts at 1 and reflects descending fit_score.
  • application_plan is aligned to ranked top top_k rows and returned as ApplicationPlanEntry[].
  • Plugins must return a valid canonical JobDataset with all canonical fields and canonical logical dtypes preserved.

Diagnostics Additions

Runtime diagnostics include input_aliasing:

{
  "input_aliasing": {
    "applied": {"location": "location_raw", "remote": "remote_flag"},
    "conflicts": {"remote": 2},
    "unresolved": ["skills", "salary_min", "salary_max"]
  }
}

Runtime diagnostics also include input_adapter:

{
  "input_adapter": {
    "enabled": true,
    "applied": {"remote": "remote_flag", "posted_at": "date_posted"},
    "conflicts": {"remote": 2},
    "coercion_errors": {"posted_at": 1},
    "null_like_hits": {"location": 10},
    "unresolved": ["salary_max"],
    "error_samples": [
      {
        "field": "posted_at",
        "source": "date_posted",
        "value": "32/13/2024",
        "reason": "date_parse_failed"
      }
    ]
  }
}