Skip to content

DuckDB Source Schema

This document defines the schema for the DuckDB database used by the scraper. It serves as the source-of-truth for downstream transformations.

Tables

companies_discovered

Stores companies discovered via crawling (e.g., Common Crawl), identifying potential ATS endpoints.

Column Type Constraints Description
ats_type VARCHAR PK Type of ATS detected
ats_org VARCHAR PK ATS organization slug detected
company_name VARCHAR Extracted company name
company_slug VARCHAR Slugified company name
company_name_source VARCHAR Source of the company name
board_url VARCHAR URL where the board was found
company_domain VARCHAR Domain of the company
first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP First time seen
last_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP Last time seen
last_enriched_at TIMESTAMP Timestamp of last enrichment

jobs_current

Stores the current state of active jobs.

Column Type Constraints Description
job_key VARCHAR PRIMARY KEY Unique job identifier (company::source::job_id)
company VARCHAR NOT NULL Company identifier
source VARCHAR NOT NULL ATS Source (e.g., greenhouse)
job_id VARCHAR NOT NULL ID from the ATS
title VARCHAR NOT NULL Job title
team VARCHAR Team or department name
location_raw VARCHAR NOT NULL Raw location string
remote_flag BOOLEAN DEFAULT FALSE Simple remote flag
employment_type VARCHAR Employment type (e.g., full-time)
posted_at TIMESTAMP Date posted
updated_at TIMESTAMP Date updated
apply_url VARCHAR NOT NULL URL to apply
description_html TEXT HTML description
description_text TEXT Plain text description
ingested_at TIMESTAMP NOT NULL Ingestion timestamp
content_hash VARCHAR NOT NULL Hash of stable content content
salary_min DOUBLE Minimum salary
salary_max DOUBLE Maximum salary
salary_currency VARCHAR Salary currency
salary_interval VARCHAR Salary interval
city VARCHAR Enriched city
country VARCHAR Enriched country
remote_type VARCHAR Enriched remote type
skills VARCHAR[] Extracted skills
last_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP Last time seen already processed
salary_text VARCHAR Raw salary string
languages VARCHAR[] Extracted languages requirements
benefits VARCHAR[] Extracted benefits
visa_sponsorship BOOLEAN Visa sponsorship mentioned

jobs_historical

Stores historical snapshots of jobs, including Common Crawl provenance.

Column Type Constraints Description
id INTEGER PRIMARY KEY Auto-incrementing ID
job_key VARCHAR NOT NULL Job identifier
crawl_id VARCHAR Common Crawl ID
crawl_timestamp TIMESTAMP Common Crawl timestamp
crawl_url VARCHAR URL processed by crawler
source_url VARCHAR Source URL for validation
... (Inherits all columns from jobs_current)

job_events

Tracks lifecycle events for jobs.

Column Type Constraints Description
id INTEGER PRIMARY KEY Auto-incrementing ID
job_key VARCHAR NOT NULL Job identifier
event_type VARCHAR NOT NULL 'opened', 'closed', 'changed'
event_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP Event timestamp
old_hash VARCHAR Previous state hash
new_hash VARCHAR New state hash

crawl_completions

Tracks progress of historical crawls.

Column Type Constraints Description
crawl_id VARCHAR PRIMARY KEY Common Crawl snapshot ID
completed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP Completion time
ats_types_processed VARCHAR List of ATS types handled
total_slugs_found INTEGER DEFAULT 0 Metrics