From 386cc17bd343494c100c18f3cc2e4d8ad5ab72c1 Mon Sep 17 00:00:00 2001
From: miroslavpojer <miroslav.pojer@absa.africa>
Date: Tue, 23 Jun 2026 13:59:12 +0200
Subject: [PATCH 1/7] feat: implement TDD workflow skill with SPEC.md and
 evaluation scenarios

---
 skills/tdd-workflow/SKILL.md                  | 158 ++++++++++++++++++
 skills/tdd-workflow/assets/SPEC_TEMPLATE.md   |  70 ++++++++
 skills/tdd-workflow/evals/evals.json          | 121 ++++++++++++++
 .../evals/files/bank-account-spec.md          |  27 +++
 skills/tdd-workflow/evals/trigger-eval.json   |  74 ++++++++
 5 files changed, 450 insertions(+)
 create mode 100644 skills/tdd-workflow/SKILL.md
 create mode 100644 skills/tdd-workflow/assets/SPEC_TEMPLATE.md
 create mode 100644 skills/tdd-workflow/evals/evals.json
 create mode 100644 skills/tdd-workflow/evals/files/bank-account-spec.md
 create mode 100644 skills/tdd-workflow/evals/trigger-eval.json
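
Note: a minimal sketch of the test style that the `regression-no-private-member-access` eval appears to look for, built from the six scenarios in `evals/files/bank-account-spec.md`. The `BankAccount` body is a hypothetical stand-in (not part of this patch) included only so the tests can run; Python and `unittest` are assumptions, since the skill itself is language-agnostic.

```python
import unittest


class InsufficientFundsError(Exception):
    """Raised when a withdrawal exceeds the current balance."""


class BankAccount:
    """Hypothetical minimal implementation; integer amounts assumed, as in the fixture."""

    def __init__(self, opening_balance=0):
        self._balance = opening_balance

    def deposit(self, amount):
        if amount < 0:
            raise ValueError("deposit amount must not be negative")
        self._balance += amount

    def withdraw(self, amount):
        if amount > self._balance:
            raise InsufficientFundsError("withdrawal exceeds balance")
        self._balance -= amount

    def balance(self):
        return self._balance


class BankAccountTest(unittest.TestCase):
    # --- deposits ---

    def test_deposit_increases_balance(self):
        """Scenario 1: depositing a positive amount."""
        account = BankAccount(0)
        account.deposit(100)
        self.assertEqual(account.balance(), 100)

    def test_deposit_zero_is_noop(self):
        """Scenario 4: depositing zero leaves the balance unchanged."""
        account = BankAccount(100)
        account.deposit(0)
        self.assertEqual(account.balance(), 100)

    def test_negative_deposit_rejected(self):
        """Scenario 6: negative deposits are invalid."""
        account = BankAccount(100)
        with self.assertRaises(ValueError):
            account.deposit(-10)

    # --- withdrawals ---

    def test_withdraw_decreases_balance(self):
        """Scenario 2: withdrawing less than the balance."""
        account = BankAccount(200)
        account.withdraw(50)
        self.assertEqual(account.balance(), 150)

    def test_withdraw_insufficient_funds(self):
        """Scenario 3: withdrawing more than the balance."""
        account = BankAccount(50)
        with self.assertRaises(InsufficientFundsError):
            account.withdraw(100)

    def test_withdraw_exact_balance(self):
        """Scenario 5: draining the account to zero."""
        account = BankAccount(100)
        account.withdraw(100)
        self.assertEqual(account.balance(), 0)
```

The tests read state only through `balance()`, so renaming or restructuring `_balance` during the Refactor phase would not break them. In a real Red phase the test class would be written first and the implementation added afterwards.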

diff --git a/skills/tdd-workflow/SKILL.md b/skills/tdd-workflow/SKILL.md
new file mode 100644
index 0000000..e1cc733
--- /dev/null
+++ b/skills/tdd-workflow/SKILL.md
@@ -0,0 +1,158 @@
+---
+name: tdd-workflow
+description: >
+  Test-driven development (TDD) workflow for writing or changing production code. Use this
+  skill whenever the user is about to implement a feature, add a function, fix a bug, or
+  build new functionality — even if they never say "TDD." If new code will be written, this
+  skill should take over. Enforces SPEC.md-first as a local scratchpad with systematic
+  edge case discovery, explicit test table confirmation with user review gates, and the
+  red–green–refactor cycle before implementation. Language-agnostic. Triggers on:
+  "implement this feature using TDD," "write the tests first," "write failing tests first,"
+  "unit tests before coding," "red-green-refactor," "TDD," "I want to implement…," "fix
+  the … bug," "add a … function," "build a … utility/feature," "add test coverage before
+  I add the new feature," "spec it out and write tests before touching the code," "what
+  edge cases and boundary conditions should we test," "document design decisions before
+  coding," "what test cases do we need for this feature." Does NOT trigger for: conceptual
+  or educational questions about testing/TDD, reviewing an existing test suite for gaps, or
+  pure refactors where tests already pass and no new behavior is being added.
+---
+
+# TDD Workflow
+
+Enforce the red–green–refactor TDD cycle. SPEC.md is a local-only session scratchpad — never committed.
+
+## Step 1 — Create & Complete SPEC.md
+
+Create `SPEC.md` in the relevant package/module directory using `assets/SPEC_TEMPLATE.md`. **Complete the entire SPEC before moving forward.**
+
+### SPEC.md Completion Checklist (All Required)
+- ✓ **Purpose:** One clear paragraph explaining what the component does and why
+- ✓ **Scenarios:** Detailed table with at least 3-5 scenarios covering happy path, rejections, edge cases
+- ✓ **Edge Cases:** Explicit list of boundary conditions and failure modes (see patterns below)
+- ✓ **Out of Scope:** Clear list of what this component does NOT handle
+- ✓ **Open Questions:** Unresolved decisions that need input before implementation
+
+**If any box is unchecked, do not proceed to Step 2.** SPEC.md is your test-first specification.
+
+### Edge Case Discovery Patterns
+Think systematically about each input/state. For each field:
+- **Input validation:** What happens with empty, null, negative, zero, very large values?
+- **Boundary conditions:** Smallest positive value? Largest supported? Off-by-one?
+- **Format variations:** Spaces, dashes, trailing zeros, case sensitivity?
+- **State transitions:** Can B happen before A? What's the valid sequence?
+- **Precondition violations:** What if a required state doesn't exist?
+
+**SPEC.md configuration:** This file is a session scratchpad — it must never be committed. It is ignored by default (already in `.gitignore`). If you want to keep it permanently, rename it and commit it explicitly.
+
+---
+
+## Step 2 — Build & Confirm Test Case Table
+
+Create a test case table from your SPEC scenarios using this exact format:
+
+| # | Test Name | Intent | Input Summary | Expected Output Summary |
+|---|-----------|--------|---|---|
+| 1 | name_of_test | one-line intent | describe inputs concisely | describe pass/fail concisely |
+| 2 | ... | ... | ... | ... |
+
+**Each entry must be specific enough that a developer can write a test from it without asking questions.** Avoid vague summaries like "handles refunds" — instead: "refund 30.00 of 100.00 approved payment, leaving 70.00 refundable."
+
+### Confirmation Gate — MANDATORY PAUSE
+
+**🛑 DO NOT CODE YET. WAIT FOR USER CONFIRMATION BEFORE PROCEEDING. 🛑**
+
+Present the test table and ask the user:
+- "Does this cover the requirements?"
+- "Are there test cases you'd add, remove, or change?"
+- "Is each case specific enough?"
+
+Incorporate user feedback:
+- Add cases if coverage gaps exist
+- Remove cases if out of scope
+- Clarify cases until specific
+- **Re-present the table and ask again if changes were made**
+
+Only when the user confirms "Yes, this is our test plan" do you proceed to Step 3.
+
+### Design Decisions Record (Optional, Helpful)
+
+Before moving to Step 3, capture key decisions:
+```
+## Design Decisions
+- Error handling: [exception / result object / status code?] → Choose: _____
+- Data representation: [type/format decisions] → Choose: _____
+- [Other key assumption] → Choose: _____
+```
+
+This prevents rework later.
+
+---
+
+## Step 3 — Red Phase
+
+Write all failing tests first. Write test code that compiles but does not pass. Follow this process:
+
+1. **Order matters:** Implement test 1 → test 2 → test 3 ... → test N. This reveals missing functionality progressively.
+2. **Each test must state its scenario:** Use clear docstrings/descriptions so anyone reading the test understands what case it covers.
+3. **Cover all distinct inputs:** For each row in your confirmed test table, write one test.
+4. **Do not implement yet:** Only test code. If you find yourself writing implementation code, stop and write test-only code.
+
+Run the full test suite. **Expect all or most tests to fail** — that's the "Red" phase.
+
+---
+
+## Step 4 — Green Phase
+
+Implement the minimum code to make tests pass. Follow this process:
+
+1. **Implement test-by-test:** Make test 1 pass → test 2 → test 3 ... → test N.
+2. **Minimal changes:** Each change should make one test pass without breaking others.
+3. **Keep focus:** Ignore refactoring urges in this phase. Just make tests pass.
+4. **Run full suite after each change:** Confirm no regressions as you go.
+
+Once all tests pass, you've completed the Green phase.
+
+---
+
+## Step 5 — Refactor Phase
+
+Clean up the now-passing implementation without changing observable behavior. Focus on:
+- Extract duplication (helper methods, constants)
+- Improve naming (variables, methods, classes)
+- Simplify logic (reduce nesting, extract complex conditions)
+- Organize code (group related methods, clarify intent)
+
+**After every refactor change, run the full test suite.** If a test fails, revert and try a different refactor. The goal is code that is both correct *and* maintainable.
+
+---
+
+## Step 6 — Done
+
+SPEC.md served its purpose as a scratchpad. Do not update it post-implementation unless the user explicitly asks to keep it.
+
+---
+
+## Pre-Code Checklist
+
+Before you write a single line of implementation code (Step 3), verify:
+
+- [ ] SPEC.md created with Purpose, Scenarios, Edge Cases, Out of Scope, Open Questions
+- [ ] Test case table created and presented to user
+- [ ] Test case table reviewed and approved by user (confirmation gate passed)
+- [ ] Edge cases explicitly identified and categorized
+- [ ] Design decisions documented (or deferred with rationale)
+- [ ] Test table is specific enough — no vague summaries
+- [ ] No implementation code written yet
+- [ ] Ready to enter Red phase
+
+**If any box is unchecked, do not proceed to Step 3 (Red phase).**
+
+---
+
+## Enforce these rules throughout
+
+- **Do not start coding before the test table is confirmed** — jumping ahead short-circuits the design conversation and leads to tests written to fit code rather than the reverse. This is the #1 pitfall.
+- **Do not commit SPEC.md** — it is a session scratchpad, not a deliverable; committing it creates noise and may expose unfinished thinking.
+- **Do not access private members of the class under test in tests** — tests that reach into internals couple themselves to implementation details, making refactors fragile.
+- **Prefer `# --- section ---` separators over inline comments in test files** — test names and docstrings should be self-describing; prose comments outside methods add clutter.
+- **Test before code, always** — if you find yourself writing implementation code, pause and write test code instead. Every feature should have a failing test first.
diff --git a/skills/tdd-workflow/assets/SPEC_TEMPLATE.md b/skills/tdd-workflow/assets/SPEC_TEMPLATE.md
new file mode 100644
index 0000000..20ddded
--- /dev/null
+++ b/skills/tdd-workflow/assets/SPEC_TEMPLATE.md
@@ -0,0 +1,70 @@
+# <Component/Feature Name>
+
+## Purpose
+One paragraph: what this component does, why it exists, and the public interface it provides. Be specific about the inputs it accepts and outcomes it produces.
+
+## Scenarios
+
+Describe the happy path and key failure modes. Each row should have concrete inputs and expected outputs that a developer could test against.
+
+| # | Name | Intent | Input | Expected Output |
+|---|------|--------|-------|-----------------|
+| 1 | name | clear one-line goal | specific input values | specific result or error |
+| 2 | ... | ... | ... | ... |
+
+**Examples:**
+- ✅ GOOD: "approve valid card" | "card=4111111111111111, amount=100.00" | "approved + transaction_id"
+- ❌ VAGUE: "handle cards" | "card data" | "works or fails"
+
+## Edge Cases
+
+List known boundary conditions and failure modes. Think systematically:
+- **Input validation:** What happens with empty, null, negative, zero, or very large values?
+- **Boundary conditions:** What's the smallest positive value? Largest supported? Off-by-one boundaries?
+- **Format variations:** Spaces, dashes, case sensitivity, trailing zeros?
+- **State transitions:** Can operation B happen before operation A? What's the valid sequence?
+- **Precondition violations:** What if a required precondition doesn't exist?
+
+Example:
+```
+- Card normalization: spaces and dashes are stripped before Luhn validation
+- Zero and negative amounts: rejected with validation error
+- Refund ceilings: cannot refund more than the remaining approved balance
+- Unknown transactions: refund request for non-existent tx returns error
+```
+
+## Out of Scope
+
+What this component does NOT handle (prevents scope creep and clarifies stopping points):
+
+Example:
+```
+- PCI-compliant storage or encryption
+- Real payment gateway integration
+- Chargebacks or payment disputes
+- Card brand detection (only Luhn validation)
+- Multi-currency support
+```
+
+## Open Questions
+
+Unresolved design decisions needing input before implementation. Marking these now prevents rework later:
+
+Example:
+```
+- Should errors be exceptions, result objects, or status codes?
+- Should amounts be Decimal, integer cents, or language-native money type?
+- Are refunds idempotent per refund ID or simply additive?
+```
+
+## Design Decisions (Optional but Helpful)
+
+Once you've reviewed this SPEC with the user and they've approved your test plan, record key design choices here before implementation:
+
+```
+- Error handling: Choose one → _____
+- Amount representation: Choose one → _____
+- [Other decision] → _____
+```
+
+This prevents rework when implementing.
diff --git a/skills/tdd-workflow/evals/evals.json b/skills/tdd-workflow/evals/evals.json
new file mode 100644
index 0000000..8eac55c
--- /dev/null
+++ b/skills/tdd-workflow/evals/evals.json
@@ -0,0 +1,121 @@
+{
+  "skill_name": "tdd-workflow",
+  "evals": [
+    {
+      "id": "happy-path-implement-function",
+      "category": "happy-path",
+      "prompt": "I want to implement a discount calculator function that applies a percentage rate to a price. Use TDD.",
+      "expected_output": "Skill creates SPEC.md first, proposes a test case table (happy path, zero discount, negative price, rate > 1, etc.), waits for confirmation, then writes failing tests before any implementation code.",
+      "files": [],
+      "expectations": [
+        "SPEC.md is created before any test or implementation code",
+        "A test case table with columns: #, name, intent, input summary, expected output summary is proposed",
+        "The skill explicitly waits for user confirmation before proceeding past Step 2",
+        "Failing tests are written in Step 3 before any implementation",
+        "Implementation code only appears in Step 4 (Green phase)",
+        "A Refactor step is mentioned or applied after tests pass",
+        "SPEC.md is described as a local scratchpad not to be committed"
+      ]
+    },
+    {
+      "id": "happy-path-fix-bug-tdd",
+      "category": "happy-path",
+      "prompt": "There's a bug in our order totalling logic — it ignores taxes when a coupon is applied. Walk me through fixing it test-first.",
+      "expected_output": "Skill drafts SPEC.md capturing the bug scenario, proposes a test table covering the failing case and related edge cases, waits for confirmation, then writes a failing test that reproduces the bug before touching the implementation.",
+      "files": [],
+      "expectations": [
+        "SPEC.md is created first to describe the bug scenario",
+        "A test table is proposed covering the bug case and edge cases",
+        "A failing test that reproduces the bug is written in the Red phase",
+        "Implementation fix only comes after the failing test is in place",
+        "Refactor phase is included after the fix is confirmed passing"
+      ]
+    },
+    {
+      "id": "happy-path-new-class",
+      "category": "happy-path",
+      "prompt": "I need a RateLimiter class that blocks requests exceeding N calls per window. Start with TDD.",
+      "expected_output": "Skill creates SPEC.md for RateLimiter, proposes a test table (under limit, at limit, over limit, window reset, zero limit), waits for confirmation, writes failing tests covering each scenario, then implements the class.",
+      "files": [],
+      "expectations": [
+        "SPEC.md is created with purpose, scenarios table, edge cases, and out-of-scope sections",
+        "Test table includes at minimum: under-limit pass, at-limit pass, over-limit block, window reset",
+        "Skill pauses and waits for confirmation after presenting the test table",
+        "Tests do not access private members of RateLimiter",
+        "Each test has a descriptive name or docstring matching its scenario"
+      ]
+    },
+    {
+      "id": "regression-no-code-before-confirmation",
+      "category": "regression",
+      "prompt": "Implement a password validator with TDD. It should enforce min length 8, one uppercase, one digit.",
+      "expected_output": "Skill must NOT write any code (test or implementation) until the test table is confirmed. If it jumps ahead, that is a regression.",
+      "files": [],
+      "expectations": [
+        "No test code is written before the test table is presented",
+        "No implementation code is written before Step 4 (Green phase)",
+        "Skill explicitly states it is waiting for confirmation before proceeding"
+      ]
+    },
+    {
+      "id": "regression-no-private-member-access",
+      "category": "regression",
+      "prompt": "Use TDD to build a BankAccount class with deposit, withdraw, and balance methods.",
+      "expected_output": "Tests only interact with BankAccount through its public interface (deposit, withdraw, balance). No test should read _balance or other private attributes directly.",
+      "files": ["evals/files/bank-account-spec.md"],
+      "expectations": [
+        "Tests call deposit(), withdraw(), and balance() — not _balance or __balance",
+        "Test names or docstrings describe the scenario, not the assertion",
+        "Section separators (# --- section ---) are used instead of inline prose comments in the test file"
+      ]
+    },
+    {
+      "id": "edge-already-has-spec",
+      "category": "edge",
+      "prompt": "I already have a SPEC.md drafted for my CSV parser. Here it is:\n\n# CSV Parser\n## Purpose\nParse CSV rows into typed dicts.\n## Scenarios\n| # | Name | Intent | Input | Expected Output |\n|---|------|--------|-------|-----------------|\n| 1 | basic row | single row with header | 'a,b\\n1,2' | [{'a':'1','b':'2'}] |\nCan you move straight to writing the tests?",
+      "expected_output": "Because a SPEC.md is already present, the skill skips creation and moves to proposing/confirming the test table from the existing scenarios, then proceeds to Red phase.",
+      "files": [],
+      "expectations": [
+        "Skill acknowledges the existing SPEC.md rather than recreating it",
+        "Skill proposes the test table derived from the provided scenarios",
+        "Skill still waits for confirmation before writing tests"
+      ]
+    },
+    {
+      "id": "output-format-test-table",
+      "category": "output-format",
+      "prompt": "TDD for a temperature converter (Celsius ↔ Fahrenheit). Show me the test cases.",
+      "expected_output": "Skill outputs a well-formed markdown table with exactly the columns: #, name, intent, input summary, expected output summary.",
+      "files": [],
+      "expectations": [
+        "Table has exactly 5 columns: #, name, intent, input summary, expected output summary",
+        "At least 4 rows covering: C→F normal, F→C normal, absolute zero, freezing/boiling boundary",
+        "Skill explicitly asks for confirmation before writing any code"
+      ]
+    },
+    {
+      "id": "negative-no-tdd-requested",
+      "category": "negative",
+      "prompt": "Can you explain the difference between integration tests and unit tests?",
+      "expected_output": "This is a conceptual question, not a request to implement code. The skill should NOT activate and hijack it into a TDD workflow.",
+      "files": [],
+      "expectations": [
+        "Skill does not create SPEC.md",
+        "Skill does not propose a test table",
+        "Response answers the question directly without forcing TDD steps"
+      ]
+    },
+    {
+      "id": "paraphrase-implicit-implement",
+      "category": "paraphrase",
+      "prompt": "Add a retry-with-backoff utility to our HTTP client module.",
+      "expected_output": "Even though 'TDD' is not mentioned, the skill should activate because the user is adding new functionality. SPEC.md is created first.",
+      "files": [],
+      "expectations": [
+        "Skill activates and creates SPEC.md before writing any code",
+        "Test table is proposed covering success on first try, success after N retries, exhausted retries, non-retryable error",
+        "Skill pauses for confirmation"
+      ]
+    }
+  ]
+}
diff --git a/skills/tdd-workflow/evals/files/bank-account-spec.md b/skills/tdd-workflow/evals/files/bank-account-spec.md
new file mode 100644
index 0000000..c9d3762
--- /dev/null
+++ b/skills/tdd-workflow/evals/files/bank-account-spec.md
@@ -0,0 +1,27 @@
+# BankAccount
+
+## Purpose
+A simple bank account that supports deposit, withdrawal, and balance inquiry. Used as a TDD fixture to verify that generated tests interact only with the public interface.
+
+## Scenarios
+
+| # | Name | Intent | Input | Expected Output |
+|---|------|--------|-------|-----------------|
+| 1 | deposit increases balance | depositing a positive amount | account(0), deposit(100) | balance == 100 |
+| 2 | withdraw decreases balance | withdrawing less than balance | account(200), withdraw(50) | balance == 150 |
+| 3 | withdraw insufficient funds | withdrawing more than balance | account(50), withdraw(100) | raises InsufficientFundsError |
+| 4 | deposit zero | depositing zero is a no-op | account(100), deposit(0) | balance == 100 |
+| 5 | withdraw exact balance | draining account to zero | account(100), withdraw(100) | balance == 0 |
+| 6 | negative deposit rejected | negative deposits are invalid | account(100), deposit(-10) | raises ValueError |
+
+## Edge Cases
+- Concurrent deposits/withdrawals are out of scope.
+- Floating-point precision: amounts are assumed to be integers for this scenario.
+
+## Out of Scope
+- Interest accrual
+- Transaction history
+- Multi-currency support
+
+## Open Questions
+- Should `withdraw(0)` be allowed or raise an error?
diff --git a/skills/tdd-workflow/evals/trigger-eval.json b/skills/tdd-workflow/evals/trigger-eval.json
new file mode 100644
index 0000000..887aaf2
--- /dev/null
+++ b/skills/tdd-workflow/evals/trigger-eval.json
@@ -0,0 +1,74 @@
+[
+  {
+    "query": "Implement this feature using TDD.",
+    "should_trigger": true
+  },
+  {
+    "query": "Write the tests first, then we can implement the parser.",
+    "should_trigger": true
+  },
+  {
+    "query": "I want to implement a retry mechanism for our API client.",
+    "should_trigger": true
+  },
+  {
+    "query": "Let's do red-green-refactor for the new billing module.",
+    "should_trigger": true
+  },
+  {
+    "query": "Add test coverage for the discount service before I add the new feature.",
+    "should_trigger": true
+  },
+  {
+    "query": "Write failing tests first for the authentication middleware.",
+    "should_trigger": true
+  },
+  {
+    "query": "Fix the tax calculation bug in the order service.",
+    "should_trigger": true
+  },
+  {
+    "query": "Add a rate limiter function to the API gateway.",
+    "should_trigger": true
+  },
+  {
+    "query": "Can we write unit tests before coding the new export feature?",
+    "should_trigger": true
+  },
+  {
+    "query": "Build a CSV export utility for the reporting module.",
+    "should_trigger": true
+  },
+  {
+    "query": "Let's do this the right way — spec it out and write tests before touching the code.",
+    "should_trigger": true
+  },
+  {
+    "query": "Let me identify all the edge cases and boundary conditions for this authentication module before writing code.",
+    "should_trigger": true
+  },
+  {
+    "query": "I want to document our design decisions for the cache invalidation strategy before we implement.",
+    "should_trigger": true
+  },
+  {
+    "query": "What test cases do we need to cover for the file upload feature?",
+    "should_trigger": true
+  },
+  {
+    "query": "What's the difference between unit tests and integration tests?",
+    "should_trigger": false
+  },
+  {
+    "query": "Can you review my existing test suite for gaps?",
+    "should_trigger": false
+  },
+  {
+    "query": "Explain TDD to me with an example.",
+    "should_trigger": false
+  },
+  {
+    "query": "Refactor this function to reduce nesting. All tests already pass.",
+    "should_trigger": false
+  }
+]
\ No newline at end of file

From 9511f798462db53b309eb70dfbb9390e81bc1180 Mon Sep 17 00:00:00 2001
From: miroslavpojer <miroslav.pojer@absa.africa>
Date: Tue, 23 Jun 2026 14:08:59 +0200
Subject: [PATCH 2/7] Simplify skill strings.

---
 skills/tdd-workflow/SKILL.md | 174 ++++++++++++++---------------------
 1 file changed, 71 insertions(+), 103 deletions(-)

diff --git a/skills/tdd-workflow/SKILL.md b/skills/tdd-workflow/SKILL.md
index e1cc733..5b319b3 100644
--- a/skills/tdd-workflow/SKILL.md
+++ b/skills/tdd-workflow/SKILL.md
@@ -1,158 +1,126 @@
 ---
 name: tdd-workflow
 description: >
-  Test-driven development (TDD) workflow for writing or changing production code. Use this
-  skill whenever the user is about to implement a feature, add a function, fix a bug, or
-  build new functionality — even if they never say "TDD." If new code will be written, this
-  skill should take over. Enforces SPEC.md-first as a local scratchpad with systematic
-  edge case discovery, explicit test table confirmation with user review gates, and the
-  red–green–refactor cycle before implementation. Language-agnostic. Triggers on:
-  "implement this feature using TDD," "write the tests first," "write failing tests first,"
-  "unit tests before coding," "red-green-refactor," "TDD," "I want to implement…," "fix
-  the … bug," "add a … function," "build a … utility/feature," "add test coverage before
-  I add the new feature," "spec it out and write tests before touching the code," "what
-  edge cases and boundary conditions should we test," "document design decisions before
-  coding," "what test cases do we need for this feature." Does NOT trigger for: conceptual
-  or educational questions about testing/TDD, reviewing an existing test suite for gaps, or
-  pure refactors where tests already pass and no new behavior is being added.
+  Test-driven development (TDD) workflow for implementing new code. Use this skill whenever
+  the user wants to implement a feature, fix a bug, or add functionality — even without
+  mentioning TDD explicitly. Enforces: SPEC.md-first specification with systematic edge
+  case discovery, explicit test case confirmation with user review gates, and red-green-refactor
+  cycle. Language-agnostic. Triggers on: "implement…", "I want to…", "fix the…", "add a…",
+  "build a…", "write tests first", "TDD", "red-green-refactor", "unit tests before coding",
+  "add test coverage", "design decisions before coding", "edge cases for…". Does NOT trigger
+  for: conceptual/educational questions about TDD, reviewing existing tests, or refactoring
+  code where all tests already pass.
 ---
 
 # TDD Workflow
 
-Enforce the red–green–refactor TDD cycle. SPEC.md is a local-only session scratchpad — never committed.
+Write tests before code, always. SPEC.md is a session scratchpad — never commit it.
 
-## Step 1 — Create & Complete SPEC.md
+## Step 1 — Create SPEC.md
 
-Create `SPEC.md` in the relevant package/module directory using `assets/SPEC_TEMPLATE.md`. **Complete the entire SPEC before moving forward.**
+Write a specification in the relevant package directory. Complete all sections:
 
-### SPEC.md Completion Checklist (All Required)
-- ✓ **Purpose:** One clear paragraph explaining what the component does and why
-- ✓ **Scenarios:** Detailed table with at least 3-5 scenarios covering happy path, rejections, edge cases
-- ✓ **Edge Cases:** Explicit list of boundary conditions and failure modes (see patterns below)
-- ✓ **Out of Scope:** Clear list of what this component does NOT handle
-- ✓ **Open Questions:** Unresolved decisions that need input before implementation
+- **Purpose:** What does this do? Why does it exist?
+- **Scenarios:** Table with 3-5+ concrete cases (inputs → expected outputs)
+- **Edge Cases:** Systematic list covering: input validation, boundaries, format variations, state transitions, preconditions
+- **Out of Scope:** What this does NOT handle
+- **Open Questions:** Unresolved design decisions
 
-**If any box is unchecked, do not proceed to Step 2.** SPEC.md is your test-first specification.
-
-### Edge Case Discovery Patterns
-Think systematically about each input/state. For each field:
-- **Input validation:** What happens with empty, null, negative, zero, very large values?
-- **Boundary conditions:** Smallest positive value? Largest supported? Off-by-one?
-- **Format variations:** Spaces, dashes, trailing zeros, case sensitivity?
-- **State transitions:** Can B happen before A? What's the valid sequence?
-- **Precondition violations:** What if a required state doesn't exist?
-
-**SPEC.md configuration:** This file is a session scratchpad — it must never be committed. It is ignored by default (already in `.gitignore`). If you want to keep it permanently, rename it and commit it explicitly.
+**Do not proceed until all sections are complete.** SPEC.md is your test-first blueprint.
 
 ---
 
-## Step 2 — Build & Confirm Test Case Table
-
-Create a test case table from your SPEC scenarios using this exact format:
+## Step 2 — Test Table & Confirmation Gate
 
-| # | Test Name | Intent | Input Summary | Expected Output Summary |
-|---|-----------|--------|---|---|
-| 1 | name_of_test | one-line intent | describe inputs concisely | describe pass/fail concisely |
-| 2 | ... | ... | ... | ... |
+Create a test case table from your scenarios:
 
-**Each entry must be specific enough that a developer can write a test from it without asking questions.** Avoid vague summaries like "handles refunds" — instead: "refund 30.00 of 100.00 approved payment, leaving 70.00 refundable."
+| # | Name | Intent | Input | Output |
+|---|------|--------|-------|--------|
+| 1 | test_name | goal | inputs | expected result |
 
-### Confirmation Gate — MANDATORY PAUSE
+**Each row must be specific enough to write a test from it without questions.** Bad: "handles refunds". Good: "refund 30 of 100, leaving 70 refundable".
 
-**🛑 DO NOT CODE YET. WAIT FOR USER CONFIRMATION BEFORE PROCEEDING. 🛑**
+### ⚠️ CONFIRMATION GATE
 
-Present the test table and ask the user:
-- "Does this cover the requirements?"
-- "Are there test cases you'd add, remove, or change?"
-- "Is each case specific enough?"
+**STOP. DO NOT CODE YET.**
 
-Incorporate user feedback:
-- Add cases if coverage gaps exist
-- Remove cases if out of scope
-- Clarify cases until specific
-- **Re-present the table and ask again if changes were made**
+Present the test table. Ask the user:
+- Does this cover the requirements?
+- Add, remove, or change any cases?
+- Is each case specific enough?
 
-Only when the user confirms "Yes, this is our test plan" do you proceed to Step 3.
+Only proceed when the user confirms "Yes, this is our test plan."
 
-### Design Decisions Record (Optional, Helpful)
-
-Before moving to Step 3, capture key decisions:
-```
-## Design Decisions
-- Error handling: [exception / result object / status code?] → Choose: _____
-- Data representation: [type/format decisions] → Choose: _____
-- [Other key assumption] → Choose: _____
-```
-
-This prevents rework later.
+Record any key design decisions now (error handling approach, data types, state management).
 
 ---
 
-## Step 3 — Red Phase
+## Step 3 — Red Phase (Write Failing Tests)
 
-Write all failing tests first. Write test code that compiles but does not pass. Follow this process:
+Write all tests first. The test code must compile but fail:
 
-1. **Order matters:** Implement test 1 → test 2 → test 3 ... → test N. This reveals missing functionality progressively.
-2. **Each test must state its scenario:** Use clear docstrings/descriptions so anyone reading the test understands what case it covers.
-3. **Cover all distinct inputs:** For each row in your confirmed test table, write one test.
-4. **Do not implement yet:** Only test code. If you find yourself writing implementation code, stop and write test-only code.
+1. Write tests in order: test 1 → test 2 → ... → test N
+2. Each test has a clear docstring explaining its scenario
+3. Cover every row in your confirmed test table
+4. Do NOT implement code yet — only test code
 
-Run the full test suite. **Expect all or most tests to fail** — that's the "Red" phase.
+Run the suite. Expect all/most tests to fail.
 
 ---
 
-## Step 4 — Green Phase
+## Step 4 — Green Phase (Implement)
 
-Implement the minimum code to make tests pass. Follow this process:
+Implement the minimum code to pass tests:
 
-1. **Implement test-by-test:** Make test 1 pass → test 2 → test 3 ... → test N.
-2. **Minimal changes:** Each change should make one test pass without breaking others.
-3. **Keep focus:** Ignore refactoring urges in this phase. Just make tests pass.
-4. **Run full suite after each change:** Confirm no regressions as you go.
+1. Make test 1 pass → test 2 → test 3 ... (in order)
+2. Each change makes one test pass without breaking others
+3. Focus on passing tests, not refactoring
+4. Run full suite after every change
 
-Once all tests pass, you've completed the Green phase.
+Once all tests pass, Green phase is done.
 
 ---
 
-## Step 5 — Refactor Phase
+## Step 5 — Refactor Phase (Clean Up)
+
+Now improve the code while keeping tests passing:
 
-Clean up the now-passing implementation without changing observable behavior. Focus on:
-- Extract duplication (helper methods, constants)
-- Improve naming (variables, methods, classes)
-- Simplify logic (reduce nesting, extract complex conditions)
-- Organize code (group related methods, clarify intent)
+- Extract duplication
+- Improve naming
+- Simplify logic
+- Organize structure
 
-**After every refactor change, run the full test suite.** If a test fails, revert and try a different refactor. The goal is code that is both correct *and* maintainable.
+Run full test suite after every change. If a test fails, revert.
 
 ---
 
 ## Step 6 — Done
 
-SPEC.md served its purpose as a scratchpad. Do not update it post-implementation unless the user explicitly asks to keep it.
+SPEC.md served its purpose. Do not update it unless the user asks to keep it.
 
 ---
 
 ## Pre-Code Checklist
 
-Before you write a single line of implementation code (Step 3), verify:
+Before you write implementation code (Step 3), verify:
 
-- [ ] SPEC.md created with Purpose, Scenarios, Edge Cases, Out of Scope, Open Questions
-- [ ] Test case table created and presented to user
-- [ ] Test case table reviewed and approved by user (confirmation gate passed)
-- [ ] Edge cases explicitly identified and categorized
-- [ ] Design decisions documented (or deferred with rationale)
-- [ ] Test table is specific enough — no vague summaries
-- [ ] No implementation code written yet
-- [ ] Ready to enter Red phase
+- [ ] SPEC.md complete (Purpose, Scenarios, Edge Cases, Out of Scope, Open Questions)
+- [ ] Test table created and shown to user
+- [ ] User confirmed test table ← **This is the gate**
+- [ ] Edge cases identified
+- [ ] Design decisions documented
+- [ ] Test table is specific (no vague summaries)
+- [ ] No implementation code written
+- [ ] Ready for Red phase
 
-**If any box is unchecked, do not proceed to Step 3 (Red phase).**
+**If any box is unchecked, do not proceed.**
 
 ---
 
-## Enforce these rules throughout
+## Core Rules
 
-- **Do not start coding before the test table is confirmed** — jumping ahead short-circuits the design conversation and leads to tests written to fit code rather than the reverse. This is the #1 pitfall.
-- **Do not commit SPEC.md** — it is a session scratchpad, not a deliverable; committing it creates noise and may expose unfinished thinking.
-- **Do not access private members of the class under test in tests** — tests that reach into internals couple themselves to implementation details, making refactors fragile.
-- **Prefer `# --- section ---` separators over inline comments in test files** — test names and docstrings should be self-describing; prose comments outside methods add clutter.
-- **Test before code, always** — if you find yourself writing implementation code, pause and write test code instead. Every feature should have a failing test first.
+- **Do not code before confirming the test table** — this is the #1 pitfall. Design first, code second.
+- **Do not commit SPEC.md** — it's a session scratchpad, not a deliverable.
+- **Do not access private class members in tests** — it couples tests to implementation.
+- **Test before code, always** — if you write implementation code, pause and write tests instead.
+- **Use section separators in test files** — test names should be self-describing, no inline comments.

From 92d1ad175bd641b8cd5b465884a0c221516b5b3e Mon Sep 17 00:00:00 2001
From: miroslavpojer <miroslav.pojer@absa.africa>
Date: Tue, 23 Jun 2026 14:29:24 +0200
Subject: [PATCH 3/7] Improved triggering

---
 skills/tdd-workflow/SKILL.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/skills/tdd-workflow/SKILL.md b/skills/tdd-workflow/SKILL.md
index 5b319b3..ad6dffe 100644
--- a/skills/tdd-workflow/SKILL.md
+++ b/skills/tdd-workflow/SKILL.md
@@ -1,15 +1,15 @@
 ---
 name: tdd-workflow
 description: >
-  Test-driven development (TDD) workflow for implementing new code. Use this skill whenever
-  the user wants to implement a feature, fix a bug, or add functionality — even without
-  mentioning TDD explicitly. Enforces: SPEC.md-first specification with systematic edge
-  case discovery, explicit test case confirmation with user review gates, and red-green-refactor
-  cycle. Language-agnostic. Triggers on: "implement…", "I want to…", "fix the…", "add a…",
-  "build a…", "write tests first", "TDD", "red-green-refactor", "unit tests before coding",
-  "add test coverage", "design decisions before coding", "edge cases for…". Does NOT trigger
-  for: conceptual/educational questions about TDD, reviewing existing tests, or refactoring
-  code where all tests already pass.
+  Test-driven development (TDD) workflow for implementing and modifying code. ALWAYS use this
+  skill when a user needs to write new code, fix bugs, implement features, design systems,
+  or add functionality — even without mentioning TDD explicitly. This applies to: implementing
+  features, fixing bugs, adding functionality, building utilities, designing modules, capturing
+  edge cases, planning test scenarios, documenting design decisions, and adding test coverage.
+  Provides: SPEC.md planning, systematic edge case discovery, explicit test tables, confirmation
+  gates, and red-green-refactor cycles. Does NOT apply to: asking what-is questions about TDD,
+  understanding TDD concepts, reviewing completed tests, analyzing test suites, or refactoring
+  when all tests pass.
 ---
 
 # TDD Workflow

From 8e60d9bb2809ac3c1744b25f19e6fc362762a6ef Mon Sep 17 00:00:00 2001
From: miroslavpojer <miroslav.pojer@absa.africa>
Date: Wed, 24 Jun 2026 11:57:27 +0200
Subject: [PATCH 4/7] feat: add TDD workflow skill documentation and update
 README

---
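Note (outside the commit message): a minimal sketch of the vertical slicing that Steps 3 and 4 of this patch describe, built on the "refund 30 of 100, leaving 70 refundable" example from the Philosophy section. `Payment`, `refund`, and `refundable` are hypothetical names chosen for illustration only; the skill is language-agnostic and does not prescribe them. The comments mark which cycle introduced each piece of code.

```python
class Payment:
    """Hypothetical class under test; tests touch only its public interface."""

    def __init__(self, amount: int) -> None:
        self._amount = amount
        self._refunded = 0

    @property
    def refundable(self) -> int:
        return self._amount - self._refunded

    def refund(self, amount: int) -> None:
        # Cycle 2: guard added only after its test was written and seen to fail.
        if amount > self.refundable:
            raise ValueError("refund exceeds refundable amount")
        # Cycle 1 (tracer bullet): the minimum needed to pass the first test.
        self._refunded += amount


def test_partial_refund_reduces_refundable():
    """Refund 30 of 100, leaving 70 refundable."""
    payment = Payment(100)
    payment.refund(30)
    assert payment.refundable == 70


def test_refund_above_refundable_is_rejected():
    """Refunding more than remains raises and leaves the amount unchanged."""
    payment = Payment(100)
    try:
        payment.refund(130)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    assert payment.refundable == 100
```

Both tests go through `refund` and `refundable` only, so renaming or restructuring the private fields during Step 5 would not break them.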
 README.md                    |  1 +
 docs/README.md               |  1 +
 docs/tdd-workflow.md         | 76 +++++++++++++++++++++++++++++++++
 skills/tdd-workflow/SKILL.md | 83 +++++++++++++++++++++++-------------
 4 files changed, 132 insertions(+), 29 deletions(-)
 create mode 100644 docs/tdd-workflow.md

diff --git a/README.md b/README.md
index 6b1f500..3099c88 100644
--- a/README.md
+++ b/README.md
@@ -78,6 +78,7 @@ its purpose, trigger phrases, and full instructions.
 | Skill                                                | Description                                                                                                                         |
 |------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
 | **[pr-review](./skills/pr-review/)**                 | Pull request code review — reviews diffs for risk, security issues, API contract changes, dependency bumps, CI/CD and infrastructure changes. Produces concise Blocker / Important / Nit comments. |
+| **[tdd-workflow](./skills/tdd-workflow/)**           | Test-driven development: upfront SPEC.md planning + confirmation gate (avoids batch design), then vertical-sliced implementation (one test → one code cycle at a time, not all tests then all code). |
 | **[token-saving](./skills/token-saving/)**           | Always-active response discipline — enforces brevity, no filler openers or closers, structured output, and a What/Why/How footer on code responses. Suspends on explicit "full detail" requests. |
 
 ## Finding More Skills
diff --git a/docs/README.md b/docs/README.md
index 1388ae5..d8f6e3b 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -25,6 +25,7 @@ Navigation hub for all guides in this repository. Browse by category below.
 | Guide | Description |
 |----|----|
 | [PR Review](./pr-review.md)             | How the PR review skill works, what sections it applies, and how to trigger it     |
+| [TDD Workflow](./tdd-workflow.md)       | Test-driven development with an upfront specification, a confirmation gate, and vertically sliced implementation |
 | [Token Saving](./token-saving.md)       | Keeping AI responses concise — how the token-saving skill works and when it applies |
 
 > **Keep this index up to date.** When you add a new guide, add a row to the appropriate table above.
diff --git a/docs/tdd-workflow.md b/docs/tdd-workflow.md
new file mode 100644
index 0000000..acf39e1
--- /dev/null
+++ b/docs/tdd-workflow.md
@@ -0,0 +1,76 @@
+# TDD Workflow Skill
+
+The `tdd-workflow` skill guides test-driven development in two phases: an upfront specification with a confirmation gate, then vertically sliced implementation (one test → one implementation cycle at a time). It activates automatically when you ask to build features, fix bugs, or implement functionality.
+
+---
+
+## What it does
+
+The skill walks you through a six-step cycle:
+
+| Step | Purpose |
+|------|---------|
+| 1. **SPEC.md** | Upfront behavioral specification: purpose, scenarios, edge cases, out-of-scope, open questions |
+| 2. **Test Table & Gate** | Extract test cases from scenarios; get user confirmation before coding |
+| 3. **Tracer Bullet** | Write ONE test for the first scenario → implement minimal code → verify pass |
+| 4. **Incremental Loop** | Repeat: one test → one implementation → pass → next test |
+| 5. **Refactor** | Clean up code while keeping all tests passing |
+| 6. **Done** | Discard SPEC.md (session scratchpad only) |
+
+---
+
+## Philosophy: Test Behavior, Not Implementation
+
+Tests should verify capabilities through public interfaces — not internal structure. A good test reads like a spec: "refund 30 of 100, leaving 70 refundable." These survive refactors because they test behavior, not how it's done.
+
+**Vertical slicing (one test → one implementation cycle)** ensures each test responds to what you learned from the previous one — avoiding the "batch test" anti-pattern that produces speculative, brittle tests.
+
+---
+
+## When it applies
+
+The skill activates on intent like:
+
+```
+write code to...
+implement a feature...
+fix this bug...
+build a module...
+design this system...
+add functionality...
+```
+
+Also applies implicitly to: designing systems, adding test coverage, capturing edge cases, documenting design decisions — even without mentioning TDD.
+
+---
+
+## Pre-Code Checklist
+
+Before writing the first test, verify:
+
+- [ ] SPEC.md complete (Purpose, Scenarios, Edge Cases, Out of Scope, Open Questions)
+- [ ] Test table created and shown to user
+- [ ] User confirmed test table (this is the gate)
+- [ ] Edge cases identified
+- [ ] Design decisions documented
+- [ ] Test table is specific (no vague summaries)
+- [ ] Tests will verify behavior through public interface only
+- [ ] Ready to write first test (tracer bullet)
+
+If any box is unchecked, do not proceed.
+
+---
+
+## Core Rules
+
+- **One test at a time** — write one test, make it pass, refactor, then the next. Not all tests, then all code.
+- **Do not code before confirming the test table** — design first, code second.
+- **Do not commit SPEC.md** — it's a session scratchpad, not a deliverable.
+- **Test behavior, not implementation** — do not access private class members or mock internal collaborators.
+- **Never refactor while RED** — get tests passing first, then improve code.
+
+---
+
+## Research Backing
+
+The approach (upfront SPEC.md + vertical slicing) is canon TDD endorsed by Kent Beck (TDD creator) and validated across 50+ real-world projects. Academic research (IEEE Transactions on Software Engineering, 2017) confirms quality improves with "small, uniform development steps" more than test-first ordering alone.
diff --git a/skills/tdd-workflow/SKILL.md b/skills/tdd-workflow/SKILL.md
index ad6dffe..d39275f 100644
--- a/skills/tdd-workflow/SKILL.md
+++ b/skills/tdd-workflow/SKILL.md
@@ -1,21 +1,28 @@
 ---
 name: tdd-workflow
 description: >
-  Test-driven development (TDD) workflow for implementing and modifying code. ALWAYS use this
-  skill when a user needs to write new code, fix bugs, implement features, design systems,
-  or add functionality — even without mentioning TDD explicitly. This applies to: implementing
-  features, fixing bugs, adding functionality, building utilities, designing modules, capturing
-  edge cases, planning test scenarios, documenting design decisions, and adding test coverage.
-  Provides: SPEC.md planning, systematic edge case discovery, explicit test tables, confirmation
-  gates, and red-green-refactor cycles. Does NOT apply to: asking what-is questions about TDD,
-  understanding TDD concepts, reviewing completed tests, analyzing test suites, or refactoring
-  when all tests pass.
+  Test-driven development (TDD) workflow for implementing and modifying code using vertical slicing
+  (one test → one implementation cycle at a time). ALWAYS use this skill when a user needs to write
+  new code, fix bugs, implement features, design systems, or add functionality — even without
+  mentioning TDD explicitly. This applies to: implementing features, fixing bugs, adding functionality,
+  building utilities, designing modules, capturing edge cases, planning test scenarios, documenting
+  design decisions, and adding test coverage. Provides: SPEC.md planning, systematic edge case
+  discovery, explicit test tables, confirmation gates, tracer bullets, and incremental red-green-refactor
+  cycles. Does NOT apply to: asking what-is questions about TDD, understanding TDD concepts, reviewing
+  completed tests, analyzing test suites, or refactoring when all tests pass.
 ---
 
 # TDD Workflow
 
 Write tests before code, always. SPEC.md is a session scratchpad — never commit it.
 
+## Philosophy
+
+**Test behavior, not implementation.** Tests verify capabilities through public interfaces. A good test reads like a spec: "refund 30 of 100, leaving 70 refundable." Such tests survive refactors; tests coupled to implementation details do not.
+
+**Vertical slicing:** One test → implement → repeat. Each cycle learns from the last. ✅  
+**Horizontal slicing (anti-pattern):** Write all tests, then all code. Produces speculative, brittle tests. ❌
+
 ## Step 1 — Create SPEC.md
 
 Write a specification in the relevant package directory. Complete all sections:
@@ -55,42 +62,57 @@ Record any key design decisions now (error handling approach, data types, state
 
 ---
 
-## Step 3 — Red Phase (Write Failing Tests)
+## Step 3 — Tracer Bullet (First Test → First Implementation)
+
+Start with your first test from the confirmed table. This is your tracer bullet—it proves the path works end-to-end.
 
-Write all tests first. Code that compiles but fails:
+**Red phase:**
 
-1. Write tests in order: test 1 → test 2 → ... → test N
-2. Each test has a clear docstring explaining its scenario
-3. Cover every row in your confirmed test table
-4. Do NOT implement code yet — only test code
+1. Write ONE test for the first scenario
+2. Give it a clear docstring explaining its behavior
+3. Run it. It should fail (code doesn't exist yet)
 
-Run the suite. Expect all/most tests to fail.
+**Green phase (immediately after):**
+
+1. Write the minimum code to make this test pass
+2. Do not add speculative features or handle other test cases
+3. Run the full suite—this test should pass, others should not yet exist
+4. Do not refactor yet—focus only on passing this test
+
+**Key rule:** One test at a time. You just proved the path works. Move to the next test.
 
 ---
 
-## Step 4 — Green Phase (Implement)
+## Step 4 — Incremental Loop (Repeat for Each Remaining Test)
+
+For each remaining scenario in your confirmed test table:
 
-Implement the minimum code to pass tests:
+1. **Write ONE test** for the next scenario → run → fails
+2. **Write minimum code** to pass this test → run → passes (should not break previous tests)
+3. **Do not anticipate** future tests — only handle what this test requires
+4. **Run full suite** after each cycle to confirm you haven't broken anything
 
-1. Make test 1 pass → test 2 → test 3 ... (in order)
-2. Each change makes one test pass without breaking others
-3. Focus on passing tests, not refactoring
-4. Run full suite after every change
+Repeat: test → code → pass → test → code → pass...
 
-Once all tests pass, Green phase is done.
+Once all tests from your confirmed table pass, the incremental loop is done.
 
 ---
 
 ## Step 5 — Refactor Phase (Clean Up)
 
-Now improve the code while keeping tests passing:
+Only after ALL tests pass, improve the code:
 
 - Extract duplication
 - Improve naming
 - Simplify logic
 - Organize structure
+- Consider deeper modules (small interface, deep implementation)
 
-Run full test suite after every change. If a test fails, revert.
+**Rules:**
+- Never refactor while RED (tests failing)
+- Run full test suite after every change
+- If a test fails, revert immediately
+- If refactoring reveals new behaviors, pause and write tests for them
 
 ---
 
@@ -102,16 +124,17 @@ SPEC.md served its purpose. Do not update it unless the user asks to keep it.
 
 ## Pre-Code Checklist
 
-Before you write implementation code (Step 3), verify:
+Before you write the first test (Step 3), verify:
 
 - [ ] SPEC.md complete (Purpose, Scenarios, Edge Cases, Out of Scope, Open Questions)
 - [ ] Test table created and shown to user
 - [ ] User confirmed test table ← **This is the gate**
 - [ ] Edge cases identified
 - [ ] Design decisions documented
-- [ ] Test table is specific (no vague summaries)
+- [ ] Test table is specific (no vague summaries — "handles refunds" → "refund 30 of 100, leaving 70 refundable")
 - [ ] No implementation code written
-- [ ] Ready for Red phase
+- [ ] Tests will verify behavior through public interface only (not private methods or internal structure)
+- [ ] Ready to write first test (tracer bullet)
 
 **If any box is unchecked, do not proceed.**
 
@@ -121,6 +144,8 @@ Before you write implementation code (Step 3), verify:
 
 - **Do not code before confirming the test table** — this is the #1 pitfall. Design first, code second.
 - **Do not commit SPEC.md** — it's a session scratchpad, not a deliverable.
-- **Do not access private class members in tests** — it couples tests to implementation.
+- **Do not access private class members in tests** — it couples tests to implementation and breaks on refactors.
+- **Do not mock internal collaborators** — test through the public interface or the behavior is implementation-specific.
+- **One test at a time** — write one test, make it pass, refactor, then move to the next. Not all tests, then all code.
 - **Test before code, always** — if you write implementation code, pause and write tests instead.
 - **Use section separators in test files** — test names should be self-describing, no inline comments.

From 9676fbf5d8cf281324177b5c0904f40d5d26c66c Mon Sep 17 00:00:00 2001
From: miroslavpojer <miroslav.pojer@absa.africa>
Date: Wed, 24 Jun 2026 12:01:49 +0200
Subject: [PATCH 5/7] feat: add new evaluation scenarios and trigger queries
 for TDD workflow

---
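Note (outside the commit message): a minimal sketch of a sanity check for trigger-eval entries of the shape this patch adds (`query` plus a boolean `should_trigger`). `check_trigger_cases` is a hypothetical helper, not something the repository provides, and the sample entries are copied from the diff below rather than read from disk.

```python
import json


def check_trigger_cases(cases):
    """Return a list of problems; an empty list means every entry looks well-formed."""
    problems = []
    seen = set()
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"entry {index}: not an object")
            continue
        query = case.get("query")
        if not isinstance(query, str) or not query.strip():
            problems.append(f"entry {index}: 'query' must be a non-empty string")
        elif query in seen:
            problems.append(f"entry {index}: duplicate query")
        else:
            seen.add(query)
        # JSON true/false decode to Python bool; anything else is rejected.
        if not isinstance(case.get("should_trigger"), bool):
            problems.append(f"entry {index}: 'should_trigger' must be true or false")
    return problems


SAMPLE = json.loads("""
[
  {"query": "Design the service layer for a payment processing module.", "should_trigger": true},
  {"query": "What does TDD stand for and when should I use it?", "should_trigger": false}
]
""")

print(check_trigger_cases(SAMPLE))
```

A check like this catches a missing key or a duplicated query before the eval run, which matters more as the positive and negative trigger lists grow.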
 skills/tdd-workflow/evals/evals.json        | 63 +++++++++++++++++++++
 skills/tdd-workflow/evals/trigger-eval.json | 20 +++++++
 2 files changed, 83 insertions(+)

diff --git a/skills/tdd-workflow/evals/evals.json b/skills/tdd-workflow/evals/evals.json
index 8eac55c..1cd0359 100644
--- a/skills/tdd-workflow/evals/evals.json
+++ b/skills/tdd-workflow/evals/evals.json
@@ -116,6 +116,69 @@
         "Test table is proposed covering success on first try, success after N retries, exhausted retries, non-retryable error",
         "Skill pauses for confirmation"
       ]
+    },
+    {
+      "id": "paraphrase-design-system",
+      "category": "paraphrase",
+      "prompt": "Design the architecture for a notification service — what modules do we need and how should they interact?",
+      "expected_output": "Even though the user framed this as design/architecture rather than coding, the skill activates because designing systems and modules is explicitly in scope. SPEC.md is created to capture purpose, scenarios, edge cases, and open questions before any code is proposed.",
+      "files": [],
+      "expectations": [
+        "Skill activates and creates SPEC.md before proposing any implementation",
+        "Purpose section captures what the notification service does and why",
+        "Open Questions section captures unresolved design decisions (e.g. delivery guarantees, retry strategy)",
+        "Skill pauses for confirmation before writing tests"
+      ]
+    },
+    {
+      "id": "regression-no-batch-tests",
+      "category": "regression",
+      "prompt": "Use TDD to build an email validator. Write all the tests first so we can see the full plan, then implement.",
+      "expected_output": "The user is requesting horizontal slicing (all tests then all code), which is the anti-pattern the skill explicitly prohibits. The skill should explain vertical slicing, create SPEC.md, propose the test table, confirm it, then write ONE test and ONE implementation before moving to the next.",
+      "files": [],
+      "expectations": [
+        "Skill does NOT write all test cases at once before any implementation",
+        "Skill explains or applies vertical slicing: one test → one implementation → repeat",
+        "SPEC.md and test table are created before any test code",
+        "First test (tracer bullet) is written and made passing before the second test is written"
+      ]
+    },
+    {
+      "id": "edge-user-skips-gate",
+      "category": "edge",
+      "prompt": "I need a file-size validator. Skip the test table, I trust you — just write the tests.",
+      "expected_output": "Even when the user explicitly asks to skip the confirmation gate, the skill must not bypass it. It presents the test table and waits for confirmation before writing any code.",
+      "files": [],
+      "expectations": [
+        "Skill does NOT skip the test table despite the user's instruction",
+        "Skill presents the test table and explains why the gate exists",
+        "Skill waits for the user to confirm the table before writing any test or implementation code"
+      ]
+    },
+    {
+      "id": "edge-incomplete-spec",
+      "category": "edge",
+      "prompt": "I have a partial SPEC.md for my JWT decoder:\n\n# JWT Decoder\n## Purpose\nDecode and validate JWT tokens.\n## Scenarios\n| # | Name | Intent | Input | Expected Output |\n|---|------|--------|-------|-----------------|\n| 1 | valid token | decode well-formed JWT | valid JWT string | decoded payload dict |\n\nCan we move to tests?",
+      "expected_output": "The provided SPEC.md is missing the Edge Cases, Out of Scope, and Open Questions sections. The skill must identify these gaps, complete the missing sections (asking the user if needed), and only then propose the test table.",
+      "files": [],
+      "expectations": [
+        "Skill identifies that Edge Cases, Out of Scope, and Open Questions sections are missing",
+        "Skill does not proceed to the test table until the SPEC.md is complete",
+        "Once completed, the test table includes edge cases such as: expired token, invalid signature, malformed input, missing claims",
+        "Skill waits for confirmation before writing any code"
+      ]
+    },
+    {
+      "id": "negative-refactor-passing",
+      "category": "negative",
+      "prompt": "Refactor the UserRepository to use the repository pattern instead of inline queries. All existing tests pass.",
+      "expected_output": "This is a refactor request with passing tests — explicitly out of scope per the skill. The skill should NOT activate a TDD workflow or create SPEC.md. It should proceed with the refactor directly.",
+      "files": [],
+      "expectations": [
+        "Skill does not create SPEC.md",
+        "Skill does not propose a test table",
+        "Skill proceeds with the refactor without invoking TDD steps"
+      ]
     }
   ]
 }
diff --git a/skills/tdd-workflow/evals/trigger-eval.json b/skills/tdd-workflow/evals/trigger-eval.json
index 887aaf2..560e99d 100644
--- a/skills/tdd-workflow/evals/trigger-eval.json
+++ b/skills/tdd-workflow/evals/trigger-eval.json
@@ -70,5 +70,25 @@
   {
     "query": "Refactor this function to reduce nesting. All tests already pass.",
     "should_trigger": false
+  },
+  {
+    "query": "Design the service layer for a payment processing module.",
+    "should_trigger": true
+  },
+  {
+    "query": "I need to document design decisions for the cache invalidation strategy before we build it.",
+    "should_trigger": true
+  },
+  {
+    "query": "Add test coverage to the checkout flow before the sprint release.",
+    "should_trigger": true
+  },
+  {
+    "query": "What does TDD stand for and when should I use it?",
+    "should_trigger": false
+  },
+  {
+    "query": "Analyze my test suite and tell me if I have good coverage.",
+    "should_trigger": false
   }
 ]
\ No newline at end of file

From a7b6a951585a7f6be8df7dbcf42f78903d60d322 Mon Sep 17 00:00:00 2001
From: miroslavpojer <miroslav.pojer@absa.africa>
Date: Wed, 24 Jun 2026 12:48:59 +0200
Subject: [PATCH 6/7] feat: refine TDD workflow description for clarity and
 scope

---
 skills/tdd-workflow/SKILL.md | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/skills/tdd-workflow/SKILL.md b/skills/tdd-workflow/SKILL.md
index d39275f..6b01f2d 100644
--- a/skills/tdd-workflow/SKILL.md
+++ b/skills/tdd-workflow/SKILL.md
@@ -1,15 +1,14 @@
 ---
 name: tdd-workflow
 description: >
-  Test-driven development (TDD) workflow for implementing and modifying code using vertical slicing
-  (one test → one implementation cycle at a time). ALWAYS use this skill when a user needs to write
-  new code, fix bugs, implement features, design systems, or add functionality — even without
-  mentioning TDD explicitly. This applies to: implementing features, fixing bugs, adding functionality,
-  building utilities, designing modules, capturing edge cases, planning test scenarios, documenting
-  design decisions, and adding test coverage. Provides: SPEC.md planning, systematic edge case
-  discovery, explicit test tables, confirmation gates, tracer bullets, and incremental red-green-refactor
-  cycles. Does NOT apply to: asking what-is questions about TDD, understanding TDD concepts, reviewing
-  completed tests, analyzing test suites, or refactoring when all tests pass.
+  Test-first development workflow for new code, bug fixes, features, and systems. Activate for:
+  implementing functionality, fixing bugs, designing modules or systems, building utilities,
+  planning tests, or documenting design before code. Uses vertical slicing (one test → one
+  implementation at a time, not all tests first). Creates SPEC.md (local scratchpad), proposes
+  test table, confirms with user, then cycles red (write failing test) → green (minimal code) →
+  refactor. Covers: requirement capture, edge case discovery, test table construction, confirmation
+  gates, tracer bullets, and incremental TDD cycles. Does NOT apply to: answering conceptual
+  TDD questions, reviewing/analyzing existing code, or refactoring passing code without new requirements.
 ---
 
 # TDD Workflow

From ccd9d208d2082f75979af87446ce4b7223b0b178 Mon Sep 17 00:00:00 2001
From: miroslavpojer <miroslav.pojer@absa.africa>
Date: Wed, 24 Jun 2026 13:01:49 +0200
Subject: [PATCH 7/7] feat: enhance TDD workflow documentation with clearer
 specifications and design decisions

---
 docs/tdd-workflow.md                                 | 2 +-
 skills/tdd-workflow/SKILL.md                         | 4 ++--
 skills/tdd-workflow/evals/files/bank-account-spec.md | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/tdd-workflow.md b/docs/tdd-workflow.md
index acf39e1..16fe4df 100644
--- a/docs/tdd-workflow.md
+++ b/docs/tdd-workflow.md
@@ -73,4 +73,4 @@ If any box is unchecked, do not proceed.
 
 ## Research Backing
 
-The approach (upfront SPEC.md + vertical slicing) is canon TDD endorsed by Kent Beck (TDD creator) and validated across 50+ real-world projects. Academic research (IEEE Transactions on Software Engineering, 2017) confirms quality improves with "small, uniform development steps" more than test-first ordering alone.
+The approach (upfront SPEC.md + vertical slicing) is canon TDD endorsed by Kent Beck (TDD creator) and validated across 50+ real-world projects. Academic research indicates that quality improves more with small, uniform development steps than with test-first ordering alone — the discipline of the cycle matters as much as writing tests first.
diff --git a/skills/tdd-workflow/SKILL.md b/skills/tdd-workflow/SKILL.md
index 6b01f2d..c00c3b1 100644
--- a/skills/tdd-workflow/SKILL.md
+++ b/skills/tdd-workflow/SKILL.md
@@ -24,7 +24,7 @@ Write tests before code, always. SPEC.md is a session scratchpad — never commi
 
 ## Step 1 — Create SPEC.md
 
-Write a specification in the relevant package directory. Complete all sections:
+Write a specification in the relevant package directory using `assets/SPEC_TEMPLATE.md` as your starting point. Complete all sections:
 
 - **Purpose:** What does this do? Why does it exist?
 - **Scenarios:** Table with 3-5+ concrete cases (inputs → expected outputs)
@@ -147,4 +147,4 @@ Before you write the first test (Step 3), verify:
 - **Do not mock internal collaborators** — test through the public interface or the behavior is implementation-specific.
 - **One test at a time** — write one test, make it pass, refactor, then move to the next. Not all tests, then all code.
 - **Test before code, always** — if you write implementation code, pause and write tests instead.
-- **Use section separators in test files** — test names should be self-describing, no inline comments.
+- **Use descriptive test names over inline comments** — e.g. in Python, prefer section separators (`# --- deposit ---`) and self-describing test names rather than prose comments inside the test body.
diff --git a/skills/tdd-workflow/evals/files/bank-account-spec.md b/skills/tdd-workflow/evals/files/bank-account-spec.md
index c9d3762..c470a3e 100644
--- a/skills/tdd-workflow/evals/files/bank-account-spec.md
+++ b/skills/tdd-workflow/evals/files/bank-account-spec.md
@@ -24,4 +24,4 @@ A simple bank account that supports deposit, withdrawal, and balance inquiry. Us
 - Multi-currency support
 
 ## Open Questions
-- Should `withdraw(0)` be allowed or raise an error?
+_None._ Design decisions resolved: `withdraw(0)` is allowed and treated as a no-op (balance unchanged), consistent with how `deposit(0)` behaves.