The problem
When coding agents write and run tests, the verification itself becomes throwaway. Green CI doesn't mean reliable software: a failure gets diagnosed, fixed, and the reasoning evaporates — so the next time a similar failure appears, an agent re-derives the root cause from scratch. Agents also tend to mark their own homework, and a flaky or cascading failure can hide a real regression behind a false green. AQA is built to make verification institutional — a durable, attributable, queryable record of what was tested, what failed, and why.
The approach
AQA is a TestLink-style test-management system designed for the agentic coding loop, reachable by agents over MCP (not just humans in a dashboard). It sits one layer above the test runner and tracks plans, cases, dependencies, run history, and requirements-to-test traceability. Its distinguishing feature is verification memory: a failed run's root-cause reasoning is stored and semantically recalled, so a recurrence comes back with its earlier diagnosis instead of being re-investigated.
Architecture
One transport-agnostic service layer is exposed three ways — a REST API (FastAPI), an
MCP server (FastMCP, 43 tools), and a CLI (Typer) — so humans and agents operate on the
same state. Data lives in PostgreSQL with pgvector; failure root-cause prose is embedded
locally with all-MiniLM-L6-v2 for semantic similarity search, and artifacts (traces, logs,
screenshots) go to S3-compatible blob storage.
Four capabilities define it:
- Regression memory —
search_similar_failures/get_known_regressionsrecall prior diagnoses and cached fix-paths, and log the re-investigations they avoid. - Blind-spot radar — requirements → coverage links surface tests that don't exist yet.
- Doer ≠ checker — a claim/verify protocol with identity enforcement, so work isn't self-certified.
- Dependency gating — a dependency-aware run manifest and cascade-blocking that kills false-green CI when an upstream case fails.
A note on rigor
AQA dogfoods itself: a Stop hook blocks "done" while coverage gaps are open, and CI records its own runs. It raises the reliability floor — no silent regressions, no invisible blind spots, no self-graded work — a ratchet that compounds across commits and sessions.