Parquet-first data quality validation and repair.
qfix is a Rust CLI and library that scans tabular files, reports schema drift and row-level validation issues, and repairs them into clean output plus a quarantine audit trail. It ships with preset rule sets for common warehouses (BigQuery, Redshift, PostgreSQL) and is designed to be extended to custom targets. Built for agents and scripts as well as interactive use: every command supports --format json, repair supports --dry-run, and a built-in MCP server lets tools like Cursor or Claude invoke qfix directly.
Early version.
qfix is under active development. The Parquet scan/repair path is solid; CSV/JSON/NDJSON are basic compatibility paths and ORC is experimental. Expect changes to CLI flags and report shapes as the project matures.
- Parquet-first scan reports with schema variants, column profiles, and issue counts.
- Row-level repair that writes clean rows to one file and quarantined rows to CSV with audit columns.
- Preset validation rules for BigQuery, Redshift, and PostgreSQL; extensible to custom targets.
- Schema drift detection and string-width profiling across files.
- Agent-friendly CLI: `--format human|json|quiet`, `--dry-run`, structured exit codes.
- Built-in MCP server (`qfix mcp`) exposing `scan`, `repair`, `describe`, and `rules_example` tools.
| Format | Status | Notes |
|---|---|---|
| Parquet | First-class | Full scan, repair, schema, and profile support. |
| CSV | Basic | Good for imports and quarantine output. |
| JSON / NDJSON | Basic | Simple interchange and inspection. |
| ORC | Experimental | Explicit paths only; verify before relying on it. |
cargo build --release

The binary is written to target/release/qfix.
Requirements: Rust 1.85+. uv is only needed for the optional Python performance script in scripts/perf_test.py.
Default scan discovers Parquet files under ./data, then the current directory, and writes report.json:
qfix

Scan an explicit file:
qfix scan data/sample_bad.parquet --target redshift --output report.json

Repair into default outputs:
qfix repair data/sample_bad.parquet
# writes out/cleaned.parquet, out/quarantine.csv, out/repair_report.json

Preview a repair without writing files:
qfix repair data/sample_bad.parquet --dry-run

Describe schemas and row counts as JSON:
qfix describe data/sample_ok.parquet

Machine-readable output for agents and CI:
qfix scan data/sample_bad.parquet --format json
qfix repair data/sample_bad.parquet --format json --quiet

qfix is built to be driven by agents and scripts.

- `--format json` emits a final summary object on stdout and sends logs to stderr.
- `--quiet` suppresses non-error output.
- `--dry-run` lets a repair be previewed without side effects.
- Exit codes: `0` = clean, `1` = command failure, `2` = success but issues found.
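As a minimal sketch of how a script might act on the exit codes listed above, the Python below maps each code to its documented meaning. The helper names `interpret_exit_code` and `run_scan` are illustrative and not part of qfix; `run_scan` assumes a `qfix` binary is on `PATH`.

```python
import subprocess

# Exit-code meanings as documented above.
EXIT_MEANINGS = {
    0: "clean",
    1: "command failure",
    2: "success but issues found",
}


def interpret_exit_code(code: int) -> str:
    """Map a qfix exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_scan(path: str) -> tuple[int, str]:
    """Run `qfix scan` on one path and return (exit code, stdout).

    Hypothetical wrapper: with --format json the summary is on stdout
    and logs are on stderr, so only stdout is returned for parsing.
    """
    proc = subprocess.run(
        ["qfix", "scan", path, "--format", "json"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout
```

A CI job would typically treat `1` as a hard failure and decide per pipeline whether `2` should block a merge or only raise a warning.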
Start the MCP server:
qfix mcp

The server speaks JSON-RPC 2.0 over stdio and exposes:

- `scan`: validate files and produce a JSON report.
- `repair`: repair files with optional `dry_run`.
- `describe`: return schema and row metadata.
- `rules_example`: return example repair rules.
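To illustrate what a client sends over stdio, here is a rough Python sketch that builds a JSON-RPC 2.0 `tools/call` request, the method MCP uses to invoke a tool. The `paths` argument name is an assumption; the real input schema should be read from the server's tool listing. A real client also has to complete the MCP `initialize` handshake before calling tools.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as one JSON-RPC 2.0 line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport is newline-delimited, so the JSON stays on one line.
    return json.dumps(message, separators=(",", ":")) + "\n"


# Hypothetical arguments: `paths` is an assumed name, `dry_run` is the
# option mentioned for the repair tool above.
line = build_tool_call(
    1, "repair", {"paths": ["data/sample_bad.parquet"], "dry_run": True}
)
```

The resulting line would be written to the stdin of a running `qfix mcp` process, and the response read back from its stdout.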
`qfix scan` reads input files and writes a JSON validation report. It does not modify input data.

qfix scan [PATH ...] --target bigquery --output report.json

If paths are omitted, qfix auto-discovers Parquet files under ./data, then the current directory, then falls back to CSV/JSON/NDJSON. ORC must be passed explicitly.
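The discovery order described above can be sketched in Python as follows. This is not qfix's implementation: it assumes a non-recursive search, and it assumes the CSV/JSON/NDJSON fallback looks in the same two directories in the same order. The function name `discover_inputs` is illustrative.

```python
from pathlib import Path

PARQUET = (".parquet",)
FALLBACK = (".csv", ".json", ".ndjson")  # ORC is never auto-discovered


def discover_inputs(cwd: Path) -> list[Path]:
    """Return the first non-empty group of inputs in order of preference."""

    def files_with(directory: Path, suffixes: tuple[str, ...]) -> list[Path]:
        if not directory.is_dir():
            return []
        return sorted(
            p for p in directory.iterdir()
            if p.is_file() and p.suffix.lower() in suffixes
        )

    # Parquet under ./data, Parquet in cwd, then the fallback formats.
    search_order = (
        (cwd / "data", PARQUET),
        (cwd, PARQUET),
        (cwd / "data", FALLBACK),  # assumed fallback location
        (cwd, FALLBACK),           # assumed fallback location
    )
    for directory, suffixes in search_order:
        found = files_with(directory, suffixes)
        if found:
            return found
    return []
```

Passing paths explicitly sidesteps all of these assumptions, and is the only way to scan ORC files.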
`qfix repair` normalizes data toward the most common schema, removes invalid rows, and writes:

- Cleaned dataset: `out/cleaned.parquet` by default.
- Quarantine CSV: `out/quarantine.csv` by default.
- Repair report: `out/repair_report.json` by default.
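The idea of normalizing toward the most common schema can be sketched as below. This assumes "most common" means shared by the largest number of files, which the text above does not specify (it could equally be weighted by rows). The schema representation and function names are illustrative.

```python
from collections import Counter

# A schema here is an ordered tuple of (column name, type name) pairs.
Schema = tuple[tuple[str, str], ...]


def most_common_schema(schemas: dict[str, Schema]) -> Schema:
    """Pick the schema shared by the largest number of files."""
    counts = Counter(schemas.values())
    # most_common(1) returns [(schema, count)]; ties go to the first seen.
    return counts.most_common(1)[0][0]


def drifted_files(schemas: dict[str, Schema]) -> list[str]:
    """List the files whose schema differs from the most common one."""
    target = most_common_schema(schemas)
    return sorted(name for name, schema in schemas.items() if schema != target)


# Illustrative file names and column types.
schemas = {
    "a.parquet": (("id", "int64"), ("name", "string")),
    "b.parquet": (("id", "int64"), ("name", "string")),
    "c.parquet": (("id", "string"), ("name", "string")),
}
target = most_common_schema(schemas)
outliers = drifted_files(schemas)
```

In this toy input, `c.parquet` is the file that would need its `id` column coerced, with rows that cannot be coerced going to quarantine.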
qfix repair [PATH ...] \
--target bigquery \
--output out/cleaned.parquet \
--quarantine out/quarantine.csv \
--report out/repair_report.json

Supported output extensions are .parquet, .csv, .json, .ndjson, and .orc. Parquet is recommended for clean output.
Quarantine rows include _source_file, _row_index, _issue_types, and _issue_messages so the source and reason for every failure is preserved.
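Because the audit columns travel with every quarantined row, a short script can summarize where failures come from. The sketch below uses only the column names listed above; the sample rows, the `id` and `name` data columns, and the issue names are made up for illustration. The `_issue_types` value is treated as an opaque string since its delimiter is not documented here.

```python
import csv
import io
from collections import Counter


def summarize_quarantine(handle) -> Counter:
    """Count quarantined rows per (source file, issue types) pair."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # _issue_types is kept as-is; no delimiter is assumed.
        counts[(row["_source_file"], row["_issue_types"])] += 1
    return counts


# Illustrative data only; real quarantine files come from `qfix repair`.
sample = io.StringIO(
    "id,name,_source_file,_row_index,_issue_types,_issue_messages\n"
    "7,abc,data/a.parquet,3,string_too_long,name exceeds limit\n"
    "9,def,data/a.parquet,8,string_too_long,name exceeds limit\n"
    "4,ghi,data/b.parquet,0,timestamp_range,ts out of range\n"
)
summary = summarize_quarantine(sample)
```

For a real run, replace `sample` with `open("out/quarantine.csv", newline="")`.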
`qfix describe` returns file format, row count, file size, schema id, and column names/types as JSON.

qfix describe data/sample_ok.parquet

`qfix mcp` starts the Model Context Protocol server over stdio.

qfix mcp

qfix tui

qfix includes preset rule sets for common warehouses. New targets can be added by implementing the validator trait.
- Timestamp range validity.
- String size limit checks.
- String-length outlier and drift detection.
- CSV quote and row shape diagnostics.
- `VARCHAR` byte-length checks.
- String-length outlier and drift detection.
- CSV quote and row shape diagnostics.
- PostgreSQL string size checks.
- General data quality validation through the shared validator interface.
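The `VARCHAR` byte-length check in the lists above matters because Redshift measures `VARCHAR` limits in bytes, not characters, so multibyte UTF-8 text can overflow a column that looks wide enough. A minimal Python sketch of the idea, with illustrative function names that are not part of qfix:

```python
def utf8_byte_length(value: str) -> int:
    """Return the number of bytes the string occupies when UTF-8 encoded."""
    return len(value.encode("utf-8"))


def exceeds_varchar(value: str, max_bytes: int) -> bool:
    """True when the UTF-8 encoding would not fit in VARCHAR(max_bytes)."""
    return utf8_byte_length(value) > max_bytes


# "naïve" is 5 characters, but "ï" takes 2 bytes in UTF-8.
too_long = exceeds_varchar("naïve", 5)
```

A check based on `len(value)` alone would pass this string for a `VARCHAR(5)` column even though the load would fail.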
Planned: S3 and remote storage
S3 support exists in the codebase but is not a polished, first-class surface yet. It is retained as an expandable path for future releases.
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-east-1
qfix scan s3://my-bucket/input/
qfix repair s3://my-bucket/input/ \
--output s3://my-bucket/output/cleaned.parquet \
--quarantine s3://my-bucket/output/quarantine.csv \
--report s3://my-bucket/output/repair_report.json

For local S3-compatible testing, see tests/README.md.
qfix loads .qfix.toml from the current directory or project root, then merges it with ~/.config/qfix/config.toml when present.
[validation]
max_string_bytes = 10485760
drift_multiplier = 4.0
outlier_multiplier = 6.0
aws_region = "us-east-1"
[output]
default_dir = "out"
preferred_format = "parquet"
[storage]
s3_endpoint = "http://localhost:8333"

cargo fmt
cargo check
cargo clippy --locked --all-targets -- -D warnings
cargo test --locked --all-targets

S3 integration tests:
docker compose -f docker-compose.s3proxy.yml up -d
cargo test --test s3_integration_test -- --ignored
docker compose -f docker-compose.s3proxy.yml down

Optional performance benchmark:
uv run --with pandas --with numpy --with pyarrow python scripts/perf_test.py

See CONTRIBUTING.md.
MIT — see LICENSE.
Copyright (c) 2026 David J Haggerty.