Explore Data with EDA
Generate deterministic EDA artifacts from parquet inputs, compare runs, and enforce CI gate policies.
When to use
Use this guide when you need reproducible profile/diff artifacts for PR review and automated quality checks.
Prerequisites
- HonestRoles installed
- For dashboard mode: pip install "honestroles[eda]"
- pyarrow is not required for dashboard table rendering.
- Keep raw parquet inputs in data/ (for example data/jobs_baseline.parquet) and write outputs under dist/eda/ rather than cluttering the repository root.
Steps
Generate baseline and candidate artifacts:
$ honestroles eda generate --input-parquet data/jobs_baseline.parquet --output-dir dist/eda/baseline
$ honestroles eda generate --input-parquet data/jobs_candidate.parquet --output-dir dist/eda/candidate
Create diff artifacts:
$ honestroles eda diff --baseline-dir dist/eda/baseline --candidate-dir dist/eda/candidate --output-dir dist/eda/diff
Evaluate gate policy (CI-friendly):
$ honestroles eda gate --candidate-dir dist/eda/candidate --baseline-dir dist/eda/baseline --rules-file eda-rules.toml
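For CI, the generate, diff, and gate steps above can be chained from one script. The following is a minimal sketch, not part of HonestRoles: the helper names build_eda_commands and run_pipeline are hypothetical, and the only CLI flags used are the ones shown in the commands above.

```python
import subprocess
from pathlib import Path


def build_eda_commands(baseline_parquet, candidate_parquet, out_root, rules_file):
    """Return argv lists for generate (x2), diff, and gate, in run order.

    Hypothetical helper: it only assembles the commands documented above.
    """
    out = Path(out_root)
    baseline_dir = (out / "baseline").as_posix()
    candidate_dir = (out / "candidate").as_posix()
    diff_dir = (out / "diff").as_posix()
    return [
        ["honestroles", "eda", "generate",
         "--input-parquet", str(baseline_parquet), "--output-dir", baseline_dir],
        ["honestroles", "eda", "generate",
         "--input-parquet", str(candidate_parquet), "--output-dir", candidate_dir],
        ["honestroles", "eda", "diff",
         "--baseline-dir", baseline_dir, "--candidate-dir", candidate_dir,
         "--output-dir", diff_dir],
        ["honestroles", "eda", "gate",
         "--candidate-dir", candidate_dir, "--baseline-dir", baseline_dir,
         "--rules-file", str(rules_file)],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code."""
    for argv in commands:
        result = runner(argv)
        if result.returncode != 0:
            return result.returncode
    return 0
```

A CI job could call run_pipeline(build_eda_commands("data/jobs_baseline.parquet", "data/jobs_candidate.parquet", "dist/eda", "eda-rules.toml")) and pass the returned value to sys.exit, so a failing step fails the job.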
Review artifacts:
$ cat dist/eda/candidate/report.md
$ cat dist/eda/candidate/summary.json
$ cat dist/eda/diff/diff.json
Launch the dashboard view layer:
$ honestroles eda dashboard --artifacts-dir dist/eda/candidate --diff-dir dist/eda/diff --host 127.0.0.1 --port 8501
Expected result
- dist/eda/candidate contains profile artifacts (manifest.json, summary.json, report.md, tables/, figures/)
- dist/eda/diff contains diff artifacts (manifest.json, diff.json, tables/)
- eda gate exits 0 on pass and 1 on policy failure
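The expected layout can be checked automatically before a reviewer opens the artifacts. This is a minimal sketch under the assumption that the names listed above are the complete set to verify; missing_artifacts is a hypothetical helper, and a trailing slash marks an entry that should be a directory.

```python
from pathlib import Path

# Names taken from the expected-result list; a trailing "/" means directory.
PROFILE_ARTIFACTS = ["manifest.json", "summary.json", "report.md", "tables/", "figures/"]
DIFF_ARTIFACTS = ["manifest.json", "diff.json", "tables/"]


def missing_artifacts(root, expected):
    """Return the expected entries that are absent under root."""
    root = Path(root)
    missing = []
    for name in expected:
        path = root / name.rstrip("/")
        present = path.is_dir() if name.endswith("/") else path.is_file()
        if not present:
            missing.append(name)
    return missing
```

For example, missing_artifacts("dist/eda/candidate", PROFILE_ARTIFACTS) returns an empty list when every profile artifact is in place.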
Next steps
- Tune thresholds in eda-rules.toml for your CI bar.
- Track diff.json deltas over time to catch regressions early.
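One way to track diff.json over time is to append each run's diff to a line-delimited history file. The sketch below treats diff.json as opaque JSON, because its schema is not described here; append_diff_snapshot, the history file location, and the label argument (for instance a commit SHA) are all assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_diff_snapshot(diff_path, history_path, label):
    """Append one diff.json payload to a JSON Lines history file."""
    # The diff is stored as-is; no assumptions are made about its keys.
    diff = json.loads(Path(diff_path).read_text(encoding="utf-8"))
    record = {
        "label": label,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "diff": diff,
    }
    history = Path(history_path)
    history.parent.mkdir(parents=True, exist_ok=True)
    with history.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Each line of the history file is then a self-contained record, so later runs can be compared with any JSON Lines tooling.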