qfix

Parquet-first data quality validation and repair.

qfix is a Rust CLI and library that scans tabular files, reports schema drift and row-level validation issues, and repairs them into clean output plus a quarantine audit trail. It ships with preset rule sets for common warehouses (BigQuery, Redshift, PostgreSQL) and is designed to be extended to custom targets. It is built for agents and scripts as well as interactive use: every command supports --format json, repair supports --dry-run, and a built-in MCP server lets tools like Cursor or Claude invoke qfix directly.

Early version. qfix is under active development. The Parquet scan/repair path is solid; CSV/JSON/NDJSON are basic compatibility paths and ORC is experimental. Expect changes to CLI flags and report shapes as the project matures.

Features

Parquet-first scan reports with schema variants, column profiles, and issue counts.
Row-level repair that writes clean rows to one file and quarantined rows to CSV with audit columns.
Preset validation rules for BigQuery, Redshift, and PostgreSQL; extensible to custom targets.
Schema drift detection and string-width profiling across files.
Agent-friendly CLI: --format human|json|quiet, --dry-run, structured exit codes.
Built-in MCP server (qfix mcp) exposing scan, repair, describe, and rules_example tools.

Format support

Format	Status	Notes
Parquet	First-class	Full scan, repair, schema, and profile support.
CSV	Basic	Good for imports and quarantine output.
JSON / NDJSON	Basic	Simple interchange and inspection.
ORC	Experimental	Explicit paths only; verify before relying on it.

Install

cargo build --release

The binary is written to target/release/qfix.

Requirements: Rust 1.85+. uv is only needed for the optional Python performance script in scripts/perf_test.py.

Quick start

Running qfix with no arguments performs a default scan: it looks for Parquet files under ./data, then in the current directory, and writes report.json:

qfix

Scan an explicit file:

qfix scan data/sample_bad.parquet --target redshift --output report.json

Repair into default outputs:

qfix repair data/sample_bad.parquet
# writes out/cleaned.parquet, out/quarantine.csv, out/repair_report.json

Preview a repair without writing files:

qfix repair data/sample_bad.parquet --dry-run

Describe schemas and row counts as JSON:

qfix describe data/sample_ok.parquet

Machine-readable output for agents and CI:

qfix scan data/sample_bad.parquet --format json
qfix repair data/sample_bad.parquet --format json --quiet

Agent integration

qfix is built to be driven by agents and scripts.

--format json emits a final summary object on stdout and sends logs to stderr.
--quiet suppresses non-error output.
--dry-run lets a repair be previewed without side effects.
Exit codes: 0 = clean, 1 = command failure, 2 = success but issues found.
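
As a rough illustration of the exit-code contract above, the following sketch shows how a Rust script might invoke qfix and branch on the result. The `Outcome` enum and `classify` function are hypothetical names introduced here, not part of the qfix library; the sketch assumes a `qfix` binary on PATH and treats any unlisted exit code (or termination by a signal) as a failure.

```rust
use std::process::Command;

/// Outcome of a qfix run, following the exit codes listed above.
/// (Hypothetical type for this sketch; not part of qfix itself.)
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Clean,
    IssuesFound,
    Failed,
}

/// Map a process exit code to an outcome.
/// `None` (for example, killed by a signal) and unlisted codes count as failures.
fn classify(code: Option<i32>) -> Outcome {
    match code {
        Some(0) => Outcome::Clean,
        Some(2) => Outcome::IssuesFound,
        _ => Outcome::Failed,
    }
}

fn main() {
    // Assumes `qfix` is on PATH; the input is the sample file from Quick start.
    let result = Command::new("qfix")
        .args(["scan", "data/sample_bad.parquet", "--format", "json"])
        .output();

    match result {
        Ok(output) => {
            let outcome = classify(output.status.code());
            // With --format json the summary object is on stdout; logs go to stderr.
            let summary = String::from_utf8_lossy(&output.stdout);
            println!("{:?}: {}", outcome, summary.trim());
        }
        Err(err) => eprintln!("could not start qfix: {err}"),
    }
}
```

Keeping the mapping in a small pure function makes it easy to unit-test a CI gate separately from the process call, for example failing a pipeline only on `Failed` while merely reporting `IssuesFound`.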

Start the MCP server:

qfix mcp

The server speaks JSON-RPC 2.0 over stdio and exposes:

scan — validate files and produce a JSON report.
repair — repair files with optional dry_run.
describe — return schema and row metadata.
rules_example — return example repair rules.
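
To make the wire format concrete, here is a minimal sketch of building the JSON-RPC 2.0 request lines a client could write to the server's stdin. `tools/list` and `tools/call` are standard MCP method names; the helper `jsonrpc_request` is a hypothetical name, and the empty `arguments` object passed to rules_example is an assumption, since the tools' input schemas are not documented here. A real client would normally complete the MCP initialize handshake before sending these.

```rust
/// Build one JSON-RPC 2.0 request as a single line of JSON.
/// `params_json` must already be valid JSON; this sketch does no escaping or validation.
fn jsonrpc_request(id: u64, method: &str, params_json: Option<&str>) -> String {
    match params_json {
        Some(params) => format!(
            r#"{{"jsonrpc":"2.0","id":{id},"method":"{method}","params":{params}}}"#
        ),
        None => format!(r#"{{"jsonrpc":"2.0","id":{id},"method":"{method}"}}"#),
    }
}

fn main() {
    // Ask the server which tools it exposes.
    println!("{}", jsonrpc_request(1, "tools/list", None));

    // Call a tool by name. The empty `arguments` object is an assumption:
    // check the schema returned by tools/list for the real parameters.
    let params = r#"{"name":"rules_example","arguments":{}}"#;
    println!("{}", jsonrpc_request(2, "tools/call", Some(params)));
}
```

In practice an MCP client library or a JSON crate would handle serialization and escaping; the hand-built strings only show the shape of the messages.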

Commands

`scan`

Reads input files and writes a JSON validation report. Does not modify input data.

qfix scan [PATH ...] --target bigquery --output report.json

If paths are omitted, qfix auto-discovers Parquet files under ./data, then the current directory, then falls back to CSV/JSON/NDJSON. ORC must be passed explicitly.

`repair`

Normalizes data toward the most common schema, moves invalid rows out of the cleaned output and into quarantine, and writes:

Cleaned dataset: out/cleaned.parquet by default.
Quarantine CSV: out/quarantine.csv by default.
Repair report: out/repair_report.json by default.

qfix repair [PATH ...] \
  --target bigquery \
  --output out/cleaned.parquet \
  --quarantine out/quarantine.csv \
  --report out/repair_report.json

Supported output extensions are .parquet, .csv, .json, .ndjson, and .orc. Parquet is recommended for clean output.

Quarantine rows include _source_file, _row_index, _issue_types, and _issue_messages, so the source of and reason for every failure are preserved.
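
A downstream script may want to confirm that a quarantine file carries these audit columns before processing it. The sketch below checks a CSV header line for them; `missing_audit_columns` is a hypothetical helper, and it assumes header names contain no commas or embedded quotes, so a plain split is sufficient. It makes no assumption about where the audit columns sit relative to the data columns.

```rust
/// Audit columns listed above for every quarantine row.
const AUDIT_COLUMNS: [&str; 4] = [
    "_source_file",
    "_row_index",
    "_issue_types",
    "_issue_messages",
];

/// Return the audit columns that are absent from a CSV header line.
/// Assumes simple header names (no commas or embedded quotes).
fn missing_audit_columns(header: &str) -> Vec<&'static str> {
    let present: Vec<&str> = header
        .trim_end()
        .split(',')
        .map(|name| name.trim().trim_matches('"'))
        .collect();

    AUDIT_COLUMNS
        .iter()
        .copied()
        .filter(|col| !present.contains(col))
        .collect()
}

fn main() {
    // Hypothetical header; `id` and `name` stand in for real data columns.
    let header = "id,name,_source_file,_row_index,_issue_types,_issue_messages";
    let missing = missing_audit_columns(header);
    if missing.is_empty() {
        println!("all audit columns present");
    } else {
        println!("missing audit columns: {}", missing.join(", "));
    }
}
```

For full parsing of quarantine rows, where values such as _issue_messages may contain commas or quotes, a proper CSV parser is the safer choice.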

`describe`

Returns file format, row count, file size, schema id, and column names/types as JSON.

qfix describe data/sample_ok.parquet

`mcp`

Starts the Model Context Protocol server over stdio.

qfix mcp

`tui` (experimental)

qfix tui

Validation presets

qfix includes preset rule sets for common warehouses. New targets can be added by implementing the validator trait.

BigQuery

Timestamp range validity.
String size limit checks.
String-length outlier and drift detection.
CSV quote and row shape diagnostics.

Redshift

VARCHAR byte-length checks.
String-length outlier and drift detection.
CSV quote and row shape diagnostics.
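
Byte-length checks matter because Redshift measures VARCHAR length in bytes rather than characters, so a multi-byte UTF-8 value can overflow a column that looks wide enough. The following is a minimal sketch of that distinction, not qfix's actual validator; `exceeds_byte_limit` is a hypothetical helper.

```rust
/// True when `value` needs more than `max_bytes` bytes in UTF-8.
fn exceeds_byte_limit(value: &str, max_bytes: usize) -> bool {
    // `str::len` returns the UTF-8 byte length, not the character count.
    value.len() > max_bytes
}

fn main() {
    // "café" written with an explicit escape: 4 characters, 5 bytes,
    // because U+00E9 encodes as 2 bytes in UTF-8.
    let value = "caf\u{e9}";
    println!("chars = {}, bytes = {}", value.chars().count(), value.len());
    println!("exceeds VARCHAR(4)? {}", exceeds_byte_limit(value, 4));
}
```

A character-count check would accept this value for a 4-byte column, which is the kind of mismatch a byte-length rule is meant to catch before load time.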

PostgreSQL

String size checks.
General data quality validation through the shared validator interface.

Planned: S3 and remote storage

S3 support exists in the codebase but is not yet a polished, first-class feature. It is kept as a foundation to build on in future releases, so treat the commands below as provisional.

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-east-1

qfix scan s3://my-bucket/input/
qfix repair s3://my-bucket/input/ \
  --output s3://my-bucket/output/cleaned.parquet \
  --quarantine s3://my-bucket/output/quarantine.csv \
  --report s3://my-bucket/output/repair_report.json

For local S3-compatible testing, see tests/README.md.

Configuration

qfix loads .qfix.toml from the current directory or project root, then merges it with ~/.config/qfix/config.toml when present.

[validation]
max_string_bytes = 10485760
drift_multiplier = 4.0
outlier_multiplier = 6.0
aws_region = "us-east-1"

[output]
default_dir = "out"
preferred_format = "parquet"

[storage]
s3_endpoint = "http://localhost:8333"

Development

cargo fmt
cargo check
cargo clippy --locked --all-targets -- -D warnings
cargo test --locked --all-targets

S3 integration tests:

docker compose -f docker-compose.s3proxy.yml up -d
cargo test --test s3_integration_test -- --ignored
docker compose -f docker-compose.s3proxy.yml down

Optional performance benchmark:

uv run --with pandas --with numpy --with pyarrow python scripts/perf_test.py

Contributing

See CONTRIBUTING.md.

License

MIT — see LICENSE.
