provero (Esperanto): to test, to put to proof.
A vendor-neutral, declarative data quality engine.
```shell
pip install provero
provero init
```

Edit `provero.yaml`:

```yaml
source:
  type: duckdb
  table: orders

checks:
  - not_null: [order_id, customer_id, amount]
  - unique: order_id
  - accepted_values:
      column: status
      values: [pending, shipped, delivered, cancelled]
  - range:
      column: amount
      min: 0
      max: 100000
  - row_count:
      min: 1
```

Run:

```shell
provero run
```

```text
┌─────────────────┬──────────────┬──────────┬──────────────────┬──────────────────┐
│ Check           │ Column       │ Status   │ Observed         │ Expected         │
├─────────────────┼──────────────┼──────────┼──────────────────┼──────────────────┤
│ not_null        │ order_id     │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ not_null        │ customer_id  │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ not_null        │ amount       │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ unique          │ order_id     │ ✓ PASS   │ 0 duplicates     │ 0 duplicates     │
│ accepted_values │ status       │ ✓ PASS   │ 0 invalid values │ only [pending..] │
│ range           │ amount       │ ✓ PASS   │ min=45, max=999  │ min=0, max=100k  │
│ row_count       │ -            │ ✓ PASS   │ 5                │ >= 1             │
└─────────────────┴──────────────┴──────────┴──────────────────┴──────────────────┘

Score: 100/100 | 7 passed, 0 failed | 22ms
```
- 16 check types: `not_null`, `unique`, `unique_combination`, `completeness`, `accepted_values`, `range`, `regex`, `email_validation`, `type`, `freshness`, `latency`, `row_count`, `row_count_change`, `anomaly`, `custom_sql`, `referential_integrity`
- 3 connectors: DuckDB (files + in-memory), PostgreSQL, Pandas/Polars DataFrames
- SQL batch optimizer: compiles N checks into a single query
- Data contracts: schema validation, SLA enforcement, contract diff
- Anomaly detection: Z-Score, MAD, IQR (stdlib only, no scipy needed)
- HTML reports: `provero run --report html`
- Webhook alerts: notify Slack, PagerDuty, or any HTTP endpoint on failure
- Result store: SQLite with time-series metrics and `provero history`
- Data profiling: `provero profile --suggest` auto-generates checks
- Configurable severity: info, warning, critical, blocker per check
- JSON Schema validation for `provero.yaml`
- Airflow provider: `ProveroCheckOperator` + `@provero_check` decorator
- SodaCL migration: `provero import soda` converts configs in one command
- dbt interop: `provero export dbt` generates `schema.yml` test definitions
- Continuous monitoring: `provero watch` polls checks on an interval
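To make the batch optimizer idea concrete, here is a minimal sketch of the technique — folding several checks into one aggregate `SELECT` so the table is scanned once — using `sqlite3` for illustration. The check names and SQL fragments here are hypothetical, not Provero's generated queries.

```python
import sqlite3

# Each check becomes one aggregate expression; all of them share one scan.
checks = {
    "not_null_order_id": "SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END)",
    "unique_order_id": "COUNT(order_id) - COUNT(DISTINCT order_id)",
    "range_amount": "SUM(CASE WHEN amount < 0 OR amount > 100000 THEN 1 ELSE 0 END)",
    "row_count": "COUNT(*)",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 99.5), (3, 45.0)])

# One round trip evaluates every check at once.
sql = "SELECT " + ", ".join(checks.values()) + " FROM orders"
row = conn.execute(sql).fetchone()
results = dict(zip(checks, row))
print(results)
# → {'not_null_order_id': 0, 'unique_order_id': 0, 'range_amount': 0, 'row_count': 3}
```

A zero for a violation-counting expression means the check passed; `row_count` is compared against its bounds afterwards.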
| Check | Description | Example |
|---|---|---|
| `not_null` | Column has no null values | `not_null: order_id` |
| `unique` | Column has no duplicate values | `unique: order_id` |
| `unique_combination` | Composite uniqueness across columns | `unique_combination: [date, store_id]` |
| `completeness` | Minimum percentage of non-null values | `completeness: { column: email, min: 95% }` |
| `accepted_values` | Column values are within allowed set | `accepted_values: { column: status, values: [a, b] }` |
| `range` | Numeric values within min/max bounds | `range: { column: amount, min: 0, max: 100000 }` |
| `regex` | Values match a regular expression | `regex: { column: email, pattern: ".+@.+" }` |
| `email_validation` | Values are valid email addresses | `email_validation: { column: email }` |
| `type` | Column data type matches expected | `type: { column: amount, expected: numeric }` |
| `freshness` | Most recent timestamp within threshold | `freshness: { column: updated_at, max_age: 24h }` |
| `latency` | Time between two timestamp columns | `latency: { source_column: created_at, target_column: processed_at, max_latency: 1h }` |
| `row_count` | Table row count within bounds | `row_count: { min: 1, max: 1000000 }` |
| `row_count_change` | Row count change vs previous run | `row_count_change: { max_decrease: 10% }` |
| `anomaly` | Statistical anomaly detection | `anomaly: { column: amount, method: zscore }` |
| `custom_sql` | Custom SQL query returns truthy value | `custom_sql: "SELECT COUNT(*) > 0 FROM orders"` |
| `referential_integrity` | FK values exist in reference table | `referential_integrity: { column: customer_id, reference_table: customers, reference_column: id }` |
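Most of these checks reduce to a counting query. As an illustration of what `referential_integrity` has to compute — not Provero's actual generated SQL — here is the classic anti-join that counts orphaned foreign keys, demonstrated with `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
# Order 12 references customer 3, which does not exist.
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 2), (12, 3)])

# Count FK values in orders.customer_id with no match in customers.id.
orphans = conn.execute(
    """
    SELECT COUNT(*) FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    WHERE o.customer_id IS NOT NULL AND c.id IS NULL
    """
).fetchone()[0]
status = "PASS" if orphans == 0 else "FAIL"
print(orphans, status)  # → 1 FAIL
```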
A provero.yaml file defines your data source, checks, alerts, and contracts:
```yaml
# Source configuration
source:
  type: duckdb          # duckdb, postgres, dataframe
  table: orders         # table name or file expression
  # connection: postgres://...  # connection string for databases

# Quality checks
checks:
  - not_null: [order_id, customer_id]
  - unique: order_id
  - range:
      column: amount
      min: 0
      max: 100000
  - freshness:
      column: updated_at
      max_age: 24h
  - anomaly:
      column: amount
      method: zscore    # zscore, mad, iqr
      threshold: 3.0
      window: 30        # lookback window in days
  - referential_integrity:
      column: customer_id
      reference_table: customers
      reference_column: id

# Severity levels: info, warning, critical, blocker
# Blocker checks cause a non-zero exit code

# Alert notifications
alerts:
  - type: webhook
    url: https://hooks.slack.com/services/YOUR/WEBHOOK
    trigger: on_failure # on_failure, on_success, always

# Data contracts (optional)
contracts:
  - name: orders_contract
    owner: data-team
    table: orders
    schema:
      columns:
        - name: order_id
          type: integer
          checks: [not_null, unique]
    sla:
      freshness: 24h
```

Provero includes built-in statistical anomaly detection that works without external dependencies (no scipy needed).
Supported methods:
| Method | Description | Best for |
|---|---|---|
| `zscore` | Standard Z-Score | Normally distributed metrics |
| `mad` | Median Absolute Deviation | Robust to outliers |
| `iqr` | Interquartile Range | Skewed distributions |
```yaml
checks:
  - anomaly:
      column: daily_revenue
      method: mad
      threshold: 3.5
      window: 30
```

Anomaly detection uses the result store to compare current values against historical data. Run `provero run` regularly to build up the baseline.
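All three methods can indeed be implemented with the standard library alone. The following is an illustrative sketch of the underlying math (not Provero's internal code), flagging a new value against a history window using the stdlib `statistics` module; the 0.6745 constant is the usual factor that scales MAD to be comparable with a standard deviation:

```python
import statistics

def zscore_outlier(history, value, threshold=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) / stdev > threshold if stdev else False

def mad_outlier(history, value, threshold=3.5):
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    # Modified Z-Score: 0.6745 * deviation / MAD
    return abs(0.6745 * (value - med) / mad) > threshold if mad else False

def iqr_outlier(history, value, k=1.5):
    q1, _, q3 = statistics.quantiles(history, n=4)  # quartiles
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

history = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
print(zscore_outlier(history, 150))  # → True
print(mad_outlier(history, 150))     # → True
print(iqr_outlier(history, 101))     # → False
```

MAD and IQR are the robust choices: a single extreme value in the history window barely moves the median or the quartiles, while it can inflate the mean and standard deviation enough to mask later anomalies.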
| Command | Description |
|---|---|
| `provero init` | Create a new provero.yaml template |
| `provero run` | Execute quality checks |
| `provero validate` | Validate config syntax without running |
| `provero profile` | Profile a data source |
| `provero history` | Show historical check results |
| `provero contract validate` | Validate data contracts against live data |
| `provero contract diff` | Compare two contract versions |
| `provero watch` | Continuously run checks on interval |
| `provero import soda` | Convert SodaCL config to Provero format |
| `provero export dbt` | Generate dbt schema.yml from checks |
| `provero version` | Show version |
Send webhook notifications when checks fail:
```yaml
source:
  type: duckdb
  table: orders

checks:
  - not_null: order_id
  - row_count:
      min: 1

alerts:
  - type: webhook
    url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
    trigger: on_failure
  - type: webhook
    url: ${PAGERDUTY_WEBHOOK}
    headers:
      Authorization: "Bearer ${PD_TOKEN}"
```

Triggers: `on_failure` (default), `on_success`, `always`.
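A webhook alert is just an HTTP POST with a JSON body. The sketch below shows the general shape using only `urllib`; the payload fields (`event`, `failed`, `details`) are hypothetical and not Provero's actual webhook schema:

```python
import json
import urllib.request

def build_alert(failed_checks, webhook_url):
    # Hypothetical payload shape -- Provero's real webhook body may differ.
    payload = {
        "event": "check_failure",
        "failed": [c["name"] for c in failed_checks],
        "details": failed_checks,
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_alert(
    [{"name": "not_null(order_id)", "observed": "3 nulls", "expected": "0 nulls"}],
    "https://hooks.example.com/alert",
)
print(req.get_method(), req.full_url)  # → POST https://hooks.example.com/alert
# urllib.request.urlopen(req) would actually deliver it
```

Slack incoming webhooks and PagerDuty events endpoints both accept this kind of JSON POST, which is why a single generic webhook alert type covers them.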
Define and enforce schema contracts:
```yaml
contracts:
  - name: orders_contract
    owner: data-team
    table: orders
    on_violation: warn
    schema:
      columns:
        - name: order_id
          type: integer
          checks: [not_null, unique]
        - name: status
          type: varchar
    sla:
      freshness: 24h
      completeness: "95%"
```

```shell
provero contract validate
```
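To build intuition for what `provero contract diff` is comparing, here is a toy illustration of a schema diff between two contract versions — the column maps and output format are invented for this example, not Provero's diff output:

```python
# Column name → declared type, for two versions of a contract.
old = {"order_id": "integer", "status": "varchar"}
new = {"order_id": "bigint", "status": "varchar", "discount": "numeric"}

added = sorted(set(new) - set(old))
removed = sorted(set(old) - set(new))
changed = sorted(c for c in set(old) & set(new) if old[c] != new[c])

print("added:", added)      # → added: ['discount']
print("removed:", removed)  # → removed: []
print("changed:", changed)  # → changed: ['order_id']
```

Removed columns and changed types are the breaking category for downstream consumers; added columns are usually safe.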
```shell
provero contract diff old.yaml new.yaml
```

| Connector | Status | Install |
|---|---|---|
| DuckDB | Stable | included |
| PostgreSQL | Stable | `pip install provero[postgres]` |
| DataFrame | Stable | `pip install provero[dataframe]` |
| Snowflake | Beta | `pip install provero[snowflake]` |
| BigQuery | Beta | `pip install provero[bigquery]` |
| MySQL | Beta | `pip install provero[mysql]` |
| Redshift | Beta | `pip install provero[redshift]` |
DuckDB supports file expressions: `read_csv('data.csv')`, `read_parquet('*.parquet')`.
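Since the `table` field accepts a file expression (as noted in the configuration reference), a file-backed source could be sketched like this — an assumed example, following the same config shape shown above:

```yaml
source:
  type: duckdb
  table: read_parquet('exports/*.parquet')

checks:
  - row_count:
      min: 1
```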
```python
from provero.core.engine import Engine

engine = Engine("provero.yaml")
results = engine.run()

for result in results:
    print(f"{result.check_name}: {result.status}")
```

Or build the engine from a dict instead of a YAML file:

```python
from provero.core.engine import Engine

engine = Engine.from_dict({
    "source": {"type": "duckdb", "table": "orders"},
    "checks": [
        {"not_null": "order_id"},
        {"row_count": {"min": 1}},
    ],
})
results = engine.run()
```

```shell
pip install provero-airflow
```

```python
from provero.airflow.operators import ProveroCheckOperator

check_orders = ProveroCheckOperator(
    task_id="check_orders",
    config_path="dags/provero.yaml",
    suite="orders_daily",
)
```

Full documentation is available on GitHub Pages.
- Getting Started
- Configuration
- Check Types
- Connectors
- CLI Reference
- Architecture
- Contributing
- Governance
- Security Policy
- Support
Apache License 2.0. See LICENSE.