IO¶
honestroles.io provides read/write helpers for Parquet and DuckDB, plus a
minimal DataFrame validation utility.
Modules¶
- parquet.py: Parquet read/write.
- duckdb_io.py: DuckDB read/write.
- dataframe.py: validate_dataframe.
- contract.py: Source-data normalization and contract validation.
Public API reference¶
read_parquet(path: str | Path, validate: bool = True) -> pd.DataFrame¶
Reads a Parquet file. Raises FileNotFoundError if the path does not exist.
When validate=True, checks required columns.
write_parquet(df: pd.DataFrame, path: str | Path) -> None¶
Writes a Parquet file, creating parent directories if needed.
read_duckdb_table(conn: duckdb.DuckDBPyConnection, table: str, validate: bool = True) -> pd.DataFrame¶
Reads from DuckDB by validated table name. When validate=True, checks
required columns.
read_duckdb_query(conn: duckdb.DuckDBPyConnection, query: str, validate: bool = True) -> pd.DataFrame¶
Reads from DuckDB using a validated read-only SQL query (SELECT/WITH,
single statement only). When validate=True, checks required columns.
write_duckdb(df: pd.DataFrame, conn: duckdb.DuckDBPyConnection, table: str, *, overwrite: bool = True) -> None¶
Writes a DataFrame to a DuckDB table. Validates table name using a regex.
Uses CREATE OR REPLACE when overwrite=True.
validate_dataframe(df: pd.DataFrame, required_columns: Iterable[str] | None = None) -> pd.DataFrame¶
Ensures required columns are present (defaults to schema.REQUIRED_COLUMNS).
Raises ValueError if any are missing.
validate_source_data_contract(df: pd.DataFrame, required_columns: Iterable[str] | None = None, require_non_null: bool = True, enforce_formats: bool = True) -> pd.DataFrame¶
Validates the source-data contract used by honestroles:
- required columns must exist
- required columns must be non-null when require_non_null=True
- format/type validation when enforce_formats=True:
- timestamp columns must be parseable
- apply_url must be a valid http/https URL
- known array columns must be arrays of strings
- known boolean columns must be booleans
- salary metadata values must match expected format/ranges
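As one example of the format checks above, the apply_url rule can be sketched with the standard library's URL parser (illustrative; the library's actual check may be stricter):

```python
from urllib.parse import urlparse


def is_valid_apply_url(value: str) -> bool:
    """Illustrative check for the documented rule: a valid http/https URL."""
    parsed = urlparse(value)
    # Require an http(s) scheme and a non-empty host.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```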
normalize_source_data_contract(df: pd.DataFrame, timestamp_columns: Iterable[str] | None = None, array_columns: Iterable[str] | None = None) -> pd.DataFrame¶
Normalizes common source-data format issues:
- timestamp-like fields -> ISO-8601 UTC strings
- array-like fields encoded as strings -> Python lists
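The two normalizations can be sketched per value with the standard library. This is a simplified stand-in (ISO-8601 input, naive timestamps assumed UTC, Python-literal array encoding); the library's normalizer likely handles more input formats:

```python
import ast
from datetime import datetime, timezone


def to_iso8601_utc(value: str) -> str:
    """Parse a timestamp-like string and render it as ISO-8601 in UTC.

    Naive timestamps are assumed to already be UTC (an assumption of this
    sketch, not necessarily the library's behavior).
    """
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()


def to_string_list(value):
    """Decode an array-like field stored as a string into a list of strings."""
    if isinstance(value, list):
        return [str(item) for item in value]
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, list):
        raise ValueError(f"not an array: {value!r}")
    return [str(item) for item in parsed]
```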
Usage examples¶
import honestroles as hr
df = hr.read_parquet("jobs_current.parquet")
hr.write_parquet(df, "jobs_clean.parquet")
import duckdb
from honestroles.io import read_duckdb_query, read_duckdb_table, write_duckdb
conn = duckdb.connect()
df = read_duckdb_table(conn, "jobs_current")
subset = read_duckdb_query(conn, "select * from jobs_current where source = 'greenhouse'")
write_duckdb(df, conn, "jobs_scored", overwrite=True)
from honestroles.io import (
normalize_source_data_contract,
validate_source_data_contract,
)
df = normalize_source_data_contract(df)
df = validate_source_data_contract(df)
Design notes¶
- Contract validation is strict by default for known field types/formats.
- read_duckdb_table and read_duckdb_query are the preferred explicit APIs.
- write_duckdb uses a temporary registered DataFrame and validates table names.