Explore Data with EDA

Generate deterministic EDA artifacts from parquet inputs, compare runs, and enforce CI gate policies.

When to use

Use this guide when you need reproducible profile/diff artifacts for PR review and automated quality checks.

Prerequisites

  • HonestRoles installed
  • For dashboard mode: pip install "honestroles[eda]"
  • pyarrow is not required for dashboard table rendering.
  • Keep raw parquet inputs in data/ (for example, data/jobs_baseline.parquet) and write outputs under dist/eda/ so that generated artifacts do not clutter the repository root.

Steps

Generate baseline and candidate artifacts:

$ honestroles eda generate --input-parquet data/jobs_baseline.parquet --output-dir dist/eda/baseline
$ honestroles eda generate --input-parquet data/jobs_candidate.parquet --output-dir dist/eda/candidate

Create diff artifacts:

$ honestroles eda diff --baseline-dir dist/eda/baseline --candidate-dir dist/eda/candidate --output-dir dist/eda/diff

Evaluate gate policy (CI-friendly):

$ honestroles eda gate --candidate-dir dist/eda/candidate --baseline-dir dist/eda/baseline --rules-file eda-rules.toml
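
The generate, diff, and gate commands above can be chained into a single CI step. The following is a minimal sketch, not an official HonestRoles script: the function name run_eda_gate and its positional arguments are hypothetical, while the subcommands, flags, and default paths are the ones shown in this guide. It assumes that generate and diff exit nonzero on error; this guide only states the exit codes for gate.

```shell
# Hypothetical wrapper (not part of HonestRoles):
#   run_eda_gate BASELINE_PARQUET CANDIDATE_PARQUET [OUT_ROOT] [RULES_FILE]
run_eda_gate() {
  local baseline_parquet="$1"
  local candidate_parquet="$2"
  local out_root="${3:-dist/eda}"
  local rules_file="${4:-eda-rules.toml}"

  # Profile both inputs; stop early if either run fails.
  honestroles eda generate --input-parquet "$baseline_parquet" \
    --output-dir "$out_root/baseline" || return $?
  honestroles eda generate --input-parquet "$candidate_parquet" \
    --output-dir "$out_root/candidate" || return $?

  # Compare the two profiles.
  honestroles eda diff --baseline-dir "$out_root/baseline" \
    --candidate-dir "$out_root/candidate" \
    --output-dir "$out_root/diff" || return $?

  # The gate runs last, so its exit status becomes the function's status
  # and a policy failure fails the CI step.
  honestroles eda gate --candidate-dir "$out_root/candidate" \
    --baseline-dir "$out_root/baseline" \
    --rules-file "$rules_file"
}
```

After sourcing the function, a CI job could call it with the paths used in this guide:

$ run_eda_gate data/jobs_baseline.parquet data/jobs_candidate.parquet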

Review artifacts:

$ cat dist/eda/candidate/report.md
$ cat dist/eda/candidate/summary.json
$ cat dist/eda/diff/diff.json

Launch the dashboard view layer:

$ honestroles eda dashboard --artifacts-dir dist/eda/candidate --diff-dir dist/eda/diff --host 127.0.0.1 --port 8501

Expected result

  • dist/eda/candidate contains profile artifacts (manifest.json, summary.json, report.md, tables/, figures/)
  • dist/eda/diff contains diff artifacts (manifest.json, diff.json, tables/)
  • eda gate exits 0 on pass and 1 on policy failure
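
The artifact layout listed above can be checked mechanically before later CI steps consume it. This is a minimal sketch that relies only on the file and directory names listed above; check_eda_artifacts is a hypothetical helper, not a HonestRoles command.

```shell
# Hypothetical helper (not part of HonestRoles):
#   check_eda_artifacts DIR ENTRY...
# An ENTRY ending in "/" must be a directory; any other ENTRY must be a regular file.
check_eda_artifacts() {
  local dir="$1"
  shift
  local missing=0
  local entry
  for entry in "$@"; do
    case "$entry" in
      */) [ -d "$dir/$entry" ] || { echo "missing directory: $dir/$entry" >&2; missing=1; } ;;
      *)  [ -f "$dir/$entry" ] || { echo "missing file: $dir/$entry" >&2; missing=1; } ;;
    esac
  done
  # Returns 0 when every entry exists, 1 otherwise.
  return "$missing"
}
```

With the function sourced, the two expected layouts can be checked as follows:

$ check_eda_artifacts dist/eda/candidate manifest.json summary.json report.md tables/ figures/
$ check_eda_artifacts dist/eda/diff manifest.json diff.json tables/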

Next steps

  • Tune thresholds in eda-rules.toml for your CI bar.
  • Track diff.json deltas over time to catch regressions early.
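
One way to track diff.json over time is to keep a labeled copy from each run, for example as a stored CI artifact. This is a minimal sketch: archive_diff, the dist/eda/history directory, and the label scheme are assumptions made for illustration, not HonestRoles features.

```shell
# Hypothetical helper (not part of HonestRoles):
#   archive_diff DIFF_DIR HISTORY_DIR LABEL
# Copies DIFF_DIR/diff.json to HISTORY_DIR/diff-LABEL.json and prints the new path.
archive_diff() {
  local diff_dir="$1"
  local history_dir="$2"
  local label="$3"
  local target="$history_dir/diff-$label.json"

  [ -f "$diff_dir/diff.json" ] || { echo "no diff.json in $diff_dir" >&2; return 1; }
  mkdir -p "$history_dir" || return $?
  # Refuse to overwrite an earlier snapshot that used the same label.
  [ ! -e "$target" ] || { echo "already archived: $target" >&2; return 1; }
  cp "$diff_dir/diff.json" "$target" || return $?
  echo "$target"
}
```

Labeling each snapshot with the short commit hash keeps the history easy to line up with the repository log:

$ archive_diff dist/eda/diff dist/eda/history "$(git rev-parse --short HEAD)"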