feat: MCP refactor Phases 1-7 — enterprise engineering drawing extraction stack#55
Open
shrikantaprasad wants to merge 3 commits into
Open
feat: MCP refactor Phases 1-7 — enterprise engineering drawing extraction stack#55shrikantaprasad wants to merge 3 commits into
shrikantaprasad wants to merge 3 commits into
Conversation
- Phase 1: Modular extractors (title_block, dimensions, notes, gdt, bom, revisions) with OCRServiceFactory fallback chain - Phase 2: Pydantic v2 schemas with typed ExtractionMetadata, page stamping, BOMRow/RevisionEntry text auto-fill - Phase 3: OverlayGenerator with per-page filtering and unified element dispatch; fixed layout_detector OCR corruption bug - Phase 4: Async FastAPI service with validated uploads, response models, CORS, /extract/* and /generate/overlays endpoints - Phase 5: Modular MCP server (tools.py, cache.py, server.py) with LRU stat-based cache, page param on overlay tools, structured error JSON; 100 tests passing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace random UUID `id` with sequential deterministic `change_id` (chg_001, chg_002, ...) that is stable across re-runs for the same extraction ordering - Rename `bbox_pixels` → `bbox` to match master-prompt target format; keep `bbox_normalized` - Add `summary.by_type` dict and `summary.total` to every overlay response - Update OverlayAnnotation and OverlayResponse API models accordingly - 21 new tests in test_overlays.py; 121 total passing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
47 tests in test_extraction_accuracy.py covering gaps not in prior phases: - Dimension edge cases: imperial inches, bilateral tolerance (+0.5/-0.2), DIA. text, Ø symbol, unitless, large-integer guard - GD&T edge cases: position (⊙), parallelism/straightness abbreviations, feature control frames with multiple datums - Bounding box precision: exact x/y/width/height preserved end-to-end through every extractor - Multi-page stamping: _run_extractors stamps page numbers, _merge_page_results preserves per-page values, metadata.pages count - Malformed input: empty elements, zero confidence, picture-type skip, long text, section-ending edge case - Note continuation: indented lines merged into previous numbered note - Schema validation: confidence bounds (Pydantic raises on < 0 or > 1), auto-fill validators, dict → ExtractionMetadata coercion - Result invariants: all elements page >= 1, confidence in [0,1], unicode GD&T JSON-serialisable, selective extractor isolation Also fixed: test_note_section_ends_gracefully — note texts must not contain the word "note" as _NOTE_HEADER regex (NOTES?:?) matches it case-insensitively. 168 total tests passing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
7-phase incremental refactor converting DocStrange into a production-ready engineering drawing extraction stack with MCP, FastAPI, and a full test suite.
title_block,dimensions,notes,gdt,bom,revisions) +OCRServiceFactoryfallback chainExtractionMetadata,pageon all elements,BOMRow/RevisionEntrytext auto-fillOverlayGeneratorwith per-page filtering + critical OCR corruption bug fix (see below)tools.py/cache.py/server.py— LRU stat-based cache,pageparam on overlay toolschange_id(chg_001…),bboxfield,summary.by_typecountsNew module map
Critical bug fix — OCR text corruption (Phase 3)
layout_detector._post_process_text()was silently corrupting engineering data:|→I— destroyed all GD&T feature control frames (|⊥|0.05|A|→I⊥I0.05IA)0→oand1→l— destroyed dimension values (25.40→25.4o)These substitutions have been removed entirely.
Breaking changes
Overlay annotation JSON shape changed in Phase 6:
"id": "uuid-...""change_id": "chg_001""bbox_pixels": {...}"bbox": {...}"summary": {"by_type": {...}, "total": N}addedAny existing consumers of the overlay JSON will need to update field references.
Quick start
Test plan
python -m pytest tests/test_extraction_accuracy.py tests/test_overlays.py tests/test_mcp_server.py tests/test_api.py tests/test_pipeline.py tests/test_e2e_engineering.py tests/test_extractors.py tests/test_schemas.py -q🤖 Generated with Claude Code