feat: MCP refactor Phases 1-7 — enterprise engineering drawing extraction stack by shrikantaprasad · Pull Request #55 · NanoNets/docstrange

shrikantaprasad · 2026-05-27T10:02:30Z

Summary

7-phase incremental refactor converting DocStrange into a production-ready engineering drawing extraction stack with MCP, FastAPI, and a full test suite.

| Phase | What changed |
|---|---|
| 1 | Modular extractors (`title_block`, `dimensions`, `notes`, `gdt`, `bom`, `revisions`) + `OCRServiceFactory` fallback chain |
| 2 | Pydantic v2 schemas: typed `ExtractionMetadata`, `page` on all elements, `BOMRow`/`RevisionEntry` text auto-fill |
| 3 | `OverlayGenerator` with per-page filtering + critical OCR corruption bug fix (see below) |
| 4 | Async FastAPI service: validated uploads (415/400/413), typed response models, CORS |
| 5 | Modular MCP server split into `tools.py` / `cache.py` / `server.py`: LRU stat-based cache, `page` param on overlay tools |
| 6 | Overlay JSON: deterministic `change_id` (chg_001…), `bbox` field, `summary.by_type` counts |
| 7 | 47-test extraction accuracy suite covering edge cases, bbox precision, multi-page stamping, schema validation |
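As a rough illustration of what an "LRU stat-based cache" (Phase 5) can look like, here is a minimal sketch. It assumes the cache key combines the file path with `os.stat` fields (modification time and size), so an edited file misses the cache. The class name `StatLRUCache` and the method `get_or_compute` are hypothetical and are not taken from `cache.py`.

```python
import os
from collections import OrderedDict
from typing import Any, Callable, Tuple


class StatLRUCache:
    """Hypothetical LRU cache keyed on (path, mtime_ns, size)."""

    def __init__(self, max_entries: int = 8) -> None:
        self.max_entries = max_entries
        self._entries: "OrderedDict[Tuple[str, int, int], Any]" = OrderedDict()

    @staticmethod
    def _key(path: str) -> Tuple[str, int, int]:
        # A changed mtime or size yields a new key, so stale results are not reused.
        st = os.stat(path)
        return (os.path.abspath(path), st.st_mtime_ns, st.st_size)

    def get_or_compute(self, path: str, compute: Callable[[str], Any]) -> Any:
        key = self._key(path)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = compute(path)
        self._entries[key] = value
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value
```

Entries for superseded versions of a file are not removed eagerly in this sketch; they simply age out through normal LRU eviction.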

New module map

docstrange/
├── extractors/          # 6 modular extractors + base
├── schemas/engineering.py
├── pipelines/engineering.py
├── overlays/generator.py
├── api/                 # FastAPI — routes.py, models.py, main.py
└── mcp_server/          # tools.py, cache.py, server.py
tests/
├── test_extractors.py, test_schemas.py, test_pipeline.py
├── test_e2e_engineering.py, test_api.py
├── test_mcp_server.py, test_overlays.py
└── test_extraction_accuracy.py

Critical bug fix — OCR text corruption (Phase 3)

`layout_detector._post_process_text()` was silently corrupting engineering data:

- `|` → `I`: destroyed all GD&T feature control frames (`|⊥|0.05|A|` → `I⊥I0.05IAI`)
- `0` → `o` and `1` → `l`: destroyed dimension values (`25.40` → `25.4o`)

These substitutions have been removed entirely.
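To make the failure mode concrete, the sketch below applies blanket character replacements to a feature control frame and a dimension value, then shows that downstream parsing breaks. `apply_rules`, `parse_frame`, and `is_number` are hypothetical helpers written for this illustration; the removed code in `layout_detector` may have applied its substitutions differently.

```python
from typing import Dict, List

PIPE_RULE = {"|": "I"}
DIGIT_RULES = {"0": "o", "1": "l"}


def apply_rules(text: str, rules: Dict[str, str]) -> str:
    """Blanket single-character replacement (illustrative stand-in only)."""
    return text.translate(str.maketrans(rules))


def parse_frame(text: str) -> List[str]:
    """Split a feature control frame such as |⊥|0.05|A| into its cells."""
    return [cell for cell in text.split("|") if cell]


def is_number(text: str) -> bool:
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


frame = "|⊥|0.05|A|"
dimension = "25.40"

corrupted_frame = apply_rules(frame, PIPE_RULE)
corrupted_dimension = apply_rules(dimension, DIGIT_RULES)

print(parse_frame(frame))            # three cells: symbol, tolerance, datum
print(parse_frame(corrupted_frame))  # one unsplittable cell, the delimiters are gone
print(is_number(dimension), is_number(corrupted_dimension))
```

Once the pipe delimiters are rewritten, the frame can no longer be split into symbol, tolerance, and datum, and the dimension no longer parses as a number.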

Breaking changes

Overlay annotation JSON shape changed in Phase 6:

| Field | Before | After |
|---|---|---|
| annotation ID | `"id": "uuid-..."` | `"change_id": "chg_001"` |
| pixel bbox | `"bbox_pixels": {...}` | `"bbox": {...}` |
| top-level | (absent) | `"summary": {"by_type": {...}, "total": N}` added |

Any existing consumers of the overlay JSON will need to update field references.
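Consumers that must read both old and new payloads during a transition could use small accessor helpers like the ones sketched here. This assumes only that each annotation is a JSON object (a Python `dict`); the helper names `annotation_id` and `annotation_bbox` are hypothetical and not part of this PR.

```python
from typing import Any, Dict, Optional


def annotation_id(annotation: Dict[str, Any]) -> Optional[str]:
    """Prefer the new `change_id`; fall back to the pre-Phase-6 `id`."""
    return annotation.get("change_id", annotation.get("id"))


def annotation_bbox(annotation: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Prefer the new `bbox`; fall back to the pre-Phase-6 `bbox_pixels`."""
    return annotation.get("bbox", annotation.get("bbox_pixels"))


# Example payload fragments in the old and new shapes (values are made up).
old_style = {"id": "uuid-1234", "bbox_pixels": {"x": 10, "y": 20}}
new_style = {"change_id": "chg_001", "bbox": {"x": 10, "y": 20}}

print(annotation_id(old_style), annotation_id(new_style))
print(annotation_bbox(old_style) == annotation_bbox(new_style))
```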

Quick start

# FastAPI service
pip install -e ".[dev]"
uvicorn docstrange.api.main:app --reload
# → http://localhost:8000/docs

# MCP server (Claude Desktop)
python -m docstrange.mcp_server

Test plan

168 tests passing across all phases (< 3 seconds)
python -m pytest tests/test_extraction_accuracy.py tests/test_overlays.py tests/test_mcp_server.py tests/test_api.py tests/test_pipeline.py tests/test_e2e_engineering.py tests/test_extractors.py tests/test_schemas.py -q

🤖 Generated with Claude Code

- Phase 1: Modular extractors (title_block, dimensions, notes, gdt, bom, revisions) with OCRServiceFactory fallback chain
- Phase 2: Pydantic v2 schemas with typed ExtractionMetadata, page stamping, BOMRow/RevisionEntry text auto-fill
- Phase 3: OverlayGenerator with per-page filtering and unified element dispatch; fixed layout_detector OCR corruption bug
- Phase 4: Async FastAPI service with validated uploads, response models, CORS, /extract/* and /generate/overlays endpoints
- Phase 5: Modular MCP server (tools.py, cache.py, server.py) with LRU stat-based cache, page param on overlay tools, structured error JSON; 100 tests passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Replace random UUID `id` with sequential deterministic `change_id` (chg_001, chg_002, ...) that is stable across re-runs for the same extraction ordering
- Rename `bbox_pixels` → `bbox` to match master-prompt target format; keep `bbox_normalized`
- Add `summary.by_type` dict and `summary.total` to every overlay response
- Update OverlayAnnotation and OverlayResponse API models accordingly
- 21 new tests in test_overlays.py; 121 total passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
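The deterministic IDs and summary counts described in this commit can be sketched as follows. This is a minimal illustration, assuming each extracted element carries a `type` and a `bbox`, and that annotations sit under an `annotations` key; those names and the function `build_overlay` are assumptions, not the `OverlayGenerator` implementation.

```python
from collections import Counter
from typing import Any, Dict, List


def build_overlay(elements: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assign sequential change IDs in input order and summarise by type."""
    annotations = []
    for index, element in enumerate(elements, start=1):
        annotations.append(
            {
                "change_id": f"chg_{index:03d}",  # chg_001, chg_002, ...
                "type": element["type"],
                "bbox": element["bbox"],
            }
        )
    by_type = Counter(annotation["type"] for annotation in annotations)
    return {
        "annotations": annotations,
        "summary": {"by_type": dict(by_type), "total": len(annotations)},
    }


# Made-up sample elements for illustration.
sample_elements = [
    {"type": "dimension", "bbox": {"x": 0, "y": 0, "width": 5, "height": 2}},
    {"type": "gdt", "bbox": {"x": 9, "y": 4, "width": 6, "height": 2}},
    {"type": "dimension", "bbox": {"x": 3, "y": 8, "width": 5, "height": 2}},
]
overlay = build_overlay(sample_elements)
```

Because the ID depends only on position in the input list, two runs over the same extraction ordering produce identical output, which a random UUID cannot guarantee.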

47 tests in test_extraction_accuracy.py covering gaps not in prior phases:

- Dimension edge cases: imperial inches, bilateral tolerance (+0.5/-0.2), DIA. text, Ø symbol, unitless, large-integer guard
- GD&T edge cases: position (⊙), parallelism/straightness abbreviations, feature control frames with multiple datums
- Bounding box precision: exact x/y/width/height preserved end-to-end through every extractor
- Multi-page stamping: _run_extractors stamps page numbers, _merge_page_results preserves per-page values, metadata.pages count
- Malformed input: empty elements, zero confidence, picture-type skip, long text, section-ending edge case
- Note continuation: indented lines merged into previous numbered note
- Schema validation: confidence bounds (Pydantic raises on < 0 or > 1), auto-fill validators, dict → ExtractionMetadata coercion
- Result invariants: all elements page >= 1, confidence in [0,1], unicode GD&T JSON-serialisable, selective extractor isolation

Also fixed: test_note_section_ends_gracefully. Note texts in that test must not contain the word "note", because the _NOTE_HEADER regex (NOTES?:?) matches it case-insensitively.

168 total tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
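The `_NOTE_HEADER` pitfall mentioned above is easy to reproduce with the standard `re` module. The sketch below compiles the quoted pattern with `re.IGNORECASE` and uses `search`; whether the extractor uses `search`, `match`, or anchors the pattern is not stated in this PR, so treat that part as an assumption.

```python
import re

# Pattern as quoted in the commit message; the flag and the use of search()
# are assumptions made for this illustration.
NOTE_HEADER = re.compile(r"NOTES?:?", re.IGNORECASE)


def looks_like_note_header(line: str) -> bool:
    """Return True if the line contains anything the header pattern matches."""
    return NOTE_HEADER.search(line) is not None


print(looks_like_note_header("NOTES:"))                   # a real header
print(looks_like_note_header("See note 3 for finish."))   # body text, still matches
print(looks_like_note_header("Deburr all sharp edges."))  # no match
```

Any test fixture whose note body contains "note" can therefore be mistaken for a new header, which is why the fixed test avoids the word.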

solutionsDigibull and others added 3 commits May 27, 2026 14:39

shrikantaprasad wants to merge 3 commits into NanoNets:main from shrikantaprasad:feature/mcp-refactor
