dhaggertycs/qfix

qfix

Parquet-first data quality validation and repair.

qfix is a Rust CLI and library that scans tabular files, reports schema drift and row-level validation issues, and repairs the data into clean output plus a quarantine audit trail. It ships with preset rule sets for common warehouses (BigQuery, Redshift, PostgreSQL) and is designed to be extended to custom targets. It is built for agents and scripts as well as interactive use: every command supports --format json, repair supports --dry-run, and a built-in MCP server lets tools like Cursor or Claude invoke qfix directly.

Early version. qfix is under active development. The Parquet scan/repair path is solid; CSV/JSON/NDJSON are basic compatibility paths and ORC is experimental. Expect changes to CLI flags and report shapes as the project matures.

Features

  • Parquet-first scan reports with schema variants, column profiles, and issue counts.
  • Row-level repair that writes clean rows to one file and quarantined rows to CSV with audit columns.
  • Preset validation rules for BigQuery, Redshift, and PostgreSQL; extensible to custom targets.
  • Schema drift detection and string-width profiling across files.
  • Agent-friendly CLI: --format human|json|quiet, --dry-run, structured exit codes.
  • Built-in MCP server (qfix mcp) exposing scan, repair, describe, and rules_example tools.

Format support

| Format        | Status       | Notes                                            |
|---------------|--------------|--------------------------------------------------|
| Parquet       | First-class  | Full scan, repair, schema, and profile support.  |
| CSV           | Basic        | Good for imports and quarantine output.          |
| JSON / NDJSON | Basic        | Simple interchange and inspection.               |
| ORC           | Experimental | Explicit paths only; verify before relying on it. |

Install

cargo build --release

The binary is written to target/release/qfix.

Requirements: Rust 1.85+. uv is only needed for the optional Python performance script in scripts/perf_test.py.

Quick start

Run with no arguments, qfix performs a default scan: it discovers Parquet files under ./data, then the current directory, and writes report.json:

qfix

Scan an explicit file:

qfix scan data/sample_bad.parquet --target redshift --output report.json

Repair into default outputs:

qfix repair data/sample_bad.parquet
# writes out/cleaned.parquet, out/quarantine.csv, out/repair_report.json

Preview a repair without writing files:

qfix repair data/sample_bad.parquet --dry-run

Describe schemas and row counts as JSON:

qfix describe data/sample_ok.parquet

Machine-readable output for agents and CI:

qfix scan data/sample_bad.parquet --format json
qfix repair data/sample_bad.parquet --format json --quiet

Agent integration

qfix is built to be driven by agents and scripts.

  • --format json emits a final summary object on stdout and sends logs to stderr.
  • --quiet suppresses non-error output.
  • --dry-run lets a repair be previewed without side effects.
  • Exit codes: 0 = clean, 1 = command failure, 2 = success but issues found.
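
The flags and exit codes above can be combined in a small wrapper script. The following is a minimal sketch, not part of qfix: it assumes that stdout contains only the final summary object when --format json is used, and the function name run_qfix_scan and its runner parameter are hypothetical.

```python
import json
import subprocess

# Exit-code meanings as listed above.
EXIT_MEANINGS = {0: "clean", 1: "command failure", 2: "issues found"}


def run_qfix_scan(path, runner=subprocess.run):
    """Run `qfix scan PATH --format json` and return (status, summary).

    `runner` is injectable so the function can be exercised without the
    qfix binary installed.
    """
    proc = runner(
        ["qfix", "scan", path, "--format", "json"],
        capture_output=True,
        text=True,
    )
    status = EXIT_MEANINGS.get(proc.returncode, "unknown")
    summary = None
    if proc.returncode in (0, 2) and proc.stdout.strip():
        # Assumption: stdout holds only the summary object; logs go to stderr.
        summary = json.loads(proc.stdout)
    return status, summary
```

Treating exit code 2 as a successful run with findings, rather than as a failure, lets a CI job distinguish "the data has issues" from "the tool could not run".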

Start the MCP server:

qfix mcp

The server speaks JSON-RPC 2.0 over stdio and exposes:

  • scan — validate files and produce a JSON report.
  • repair — repair files with optional dry_run.
  • describe — return schema and row metadata.
  • rules_example — return example repair rules.
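
For clients that talk to the server directly, the sketch below shows how a JSON-RPC 2.0 request for one of these tools might be built. It uses the standard MCP tools/call method; the "paths" argument name is an assumption (only dry_run is named above), and a real session must first complete the MCP initialize handshake, which is omitted here.

```python
import json
from itertools import count

_ids = count(1)  # monotonically increasing request ids


def jsonrpc_request(method, params=None):
    """Build one JSON-RPC 2.0 request serialized as a single line."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def call_tool(name, arguments):
    """Build an MCP `tools/call` request for the named tool."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


# Hypothetical argument names: "paths" is assumed, dry_run is named above.
print(call_tool("repair", {"paths": ["data/sample_bad.parquet"], "dry_run": True}))
```

Each message would be written to the stdin of the qfix mcp process, one JSON object per line, with responses read back from its stdout.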

Commands

scan

Reads input files and writes a JSON validation report. Does not modify input data.

qfix scan [PATH ...] --target bigquery --output report.json

If paths are omitted, qfix auto-discovers Parquet files under ./data, then the current directory, then falls back to CSV/JSON/NDJSON. ORC must be passed explicitly.
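
One way to read that discovery order is sketched below. This is an illustration, not qfix's implementation: it assumes a non-recursive search, that the first location yielding any files wins, and that the CSV/JSON/NDJSON fallback looks in the same two directories. The function names are hypothetical.

```python
from pathlib import Path

PARQUET = (".parquet",)
FALLBACK = (".csv", ".json", ".ndjson")


def files_with_suffix(directory, suffixes):
    """Sorted files directly inside `directory` with one of `suffixes`."""
    d = Path(directory)
    if not d.is_dir():
        return []
    return sorted(
        p for p in d.iterdir() if p.is_file() and p.suffix.lower() in suffixes
    )


def discover(cwd="."):
    """Return the first non-empty candidate set, in the order described above."""
    cwd = Path(cwd)
    search_order = (
        (cwd / "data", PARQUET),
        (cwd, PARQUET),
        (cwd / "data", FALLBACK),  # assumed fallback location
        (cwd, FALLBACK),           # assumed fallback location
    )
    for directory, suffixes in search_order:
        found = files_with_suffix(directory, suffixes)
        if found:
            return found
    return []  # ORC is never discovered; it must be passed explicitly
```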

repair

Normalizes data toward the most common schema, removes invalid rows, and writes:

  • Cleaned dataset: out/cleaned.parquet by default.
  • Quarantine CSV: out/quarantine.csv by default.
  • Repair report: out/repair_report.json by default.

qfix repair [PATH ...] \
  --target bigquery \
  --output out/cleaned.parquet \
  --quarantine out/quarantine.csv \
  --report out/repair_report.json

Supported output extensions are .parquet, .csv, .json, .ndjson, and .orc. Parquet is recommended for clean output.

Quarantine rows include _source_file, _row_index, _issue_types, and _issue_messages so the source and reason for every failure is preserved.
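
Because the audit columns are plain CSV fields, the quarantine file can be summarized with a few lines of scripting. The sketch below counts issue types per source file. It assumes that multiple issue types in one _issue_types cell are separated by ";", which is not confirmed by this README, so the separator is a parameter; the function name is hypothetical.

```python
import csv
from collections import Counter


def summarize_quarantine(lines, separator=";"):
    """Count (source file, issue type) pairs from quarantine CSV rows.

    `lines` is any iterable of CSV text lines, such as an open file.
    `separator` is an assumed delimiter between issue types in one cell.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        for issue in row["_issue_types"].split(separator):
            issue = issue.strip()
            if issue:
                counts[(row["_source_file"], issue)] += 1
    return counts


# Usage against the default quarantine path:
# with open("out/quarantine.csv", newline="") as f:
#     for (source, issue), n in summarize_quarantine(f).most_common():
#         print(source, issue, n)
```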

describe

Returns file format, row count, file size, schema id, and column names/types as JSON.

qfix describe data/sample_ok.parquet

mcp

Starts the Model Context Protocol server over stdio.

qfix mcp

tui (experimental)

qfix tui

Validation presets

qfix includes preset rule sets for common warehouses. New targets can be added by implementing the validator trait.

BigQuery

  • Timestamp range validity.
  • String size limit checks.
  • String-length outlier and drift detection.
  • CSV quote and row shape diagnostics.

Redshift

  • VARCHAR byte-length checks.
  • String-length outlier and drift detection.
  • CSV quote and row shape diagnostics.
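
Byte-length checks matter because a Redshift VARCHAR(n) limit counts bytes, not characters, so multi-byte UTF-8 text can overflow a column that looks wide enough. The sketch below illustrates the difference; the helper names are hypothetical and are not qfix's API.

```python
def utf8_byte_length(value: str) -> int:
    """Number of bytes `value` occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


def fits_varchar(value: str, max_bytes: int) -> bool:
    """True if `value` fits a VARCHAR whose limit is expressed in bytes."""
    return utf8_byte_length(value) <= max_bytes


# Character count versus UTF-8 byte count for three sample strings.
for text in ("cafe", "caf\u00e9", "\u65e5\u672c\u8a9e"):
    print(text, len(text), utf8_byte_length(text))
```

The second string has four characters but needs five bytes, and the third has three characters but needs nine, so neither length can be judged from the character count alone.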

PostgreSQL

  • String size checks.
  • General data quality validation through the shared validator interface.

Planned: S3 and remote storage

S3 support exists in the codebase but is not a polished, first-class surface yet. It is retained as an expandable path for future releases.

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-east-1

qfix scan s3://my-bucket/input/
qfix repair s3://my-bucket/input/ \
  --output s3://my-bucket/output/cleaned.parquet \
  --quarantine s3://my-bucket/output/quarantine.csv \
  --report s3://my-bucket/output/repair_report.json

For local S3-compatible testing, see tests/README.md.

Configuration

qfix loads .qfix.toml from the current directory or project root, then merges it with ~/.config/qfix/config.toml when present.

[validation]
max_string_bytes = 10485760
drift_multiplier = 4.0
outlier_multiplier = 6.0
aws_region = "us-east-1"

[output]
default_dir = "out"
preferred_format = "parquet"

[storage]
s3_endpoint = "http://localhost:8333"

Development

cargo fmt
cargo check
cargo clippy --locked --all-targets -- -D warnings
cargo test --locked --all-targets

S3 integration tests:

docker compose -f docker-compose.s3proxy.yml up -d
cargo test --test s3_integration_test -- --ignored
docker compose -f docker-compose.s3proxy.yml down

Optional performance benchmark:

uv run --with pandas --with numpy --with pyarrow python scripts/perf_test.py

Contributing

See CONTRIBUTING.md.

License

MIT — see LICENSE.

Copyright (c) 2026 David J Haggerty.
