
AQA — System of Record for Agentic Verification

  • 2026 – Present
ai, mcp, testing, devtools

A TestLink-style test-management and verification-memory platform built for the agentic coding loop, exposed over MCP (43 tools). It tracks plans, cases, dependencies, run history, and requirements-to-test traceability, with a claim/verify protocol (doer ≠ checker) and semantic regression memory: a failed run's root-cause reasoning is stored and recalled when a similar failure recurs, so agents get the prior diagnosis instead of re-deriving it. One service layer, surfaced three ways — REST API, MCP server, and CLI.

The problem

When coding agents write and run tests, the verification itself becomes throwaway. Green CI doesn't mean reliable software: a failure gets diagnosed, fixed, and the reasoning evaporates — so the next time a similar failure appears, an agent re-derives the root cause from scratch. Agents also tend to mark their own homework, and a flaky or cascading failure can hide a real regression behind a false green. AQA is built to make verification institutional — a durable, attributable, queryable record of what was tested, what failed, and why.

The approach

AQA is a TestLink-style test-management system designed for the agentic coding loop and reachable by agents over MCP, not only by humans through a dashboard. It sits one layer above the test runner and tracks plans, cases, dependencies, run history, and requirements-to-test traceability. Its distinguishing feature is verification memory: a failed run's root-cause reasoning is stored and semantically recalled, so a recurrence comes back with its earlier diagnosis instead of being re-investigated.
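The store-and-recall idea behind verification memory can be sketched in a few lines. This is a minimal illustration, not AQA's code: the `FailureMemory` class, the `Diagnosis` record, the case IDs, and the 0.3 threshold are all hypothetical, and a toy bag-of-words vector stands in for a real sentence-embedding model so the example runs on the standard library alone.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class Diagnosis:
    case_id: str      # hypothetical test-case identifier
    root_cause: str   # the prose reasoning saved when the run failed


class FailureMemory:
    """Hypothetical in-memory store; a real system would persist the vectors."""

    def __init__(self) -> None:
        self._entries = []  # list of (vector, Diagnosis) pairs

    def record(self, diagnosis: Diagnosis) -> None:
        self._entries.append((embed(diagnosis.root_cause), diagnosis))

    def recall(self, failure_text: str, threshold: float = 0.3) -> list:
        """Return (score, Diagnosis) pairs at or above the threshold, best first."""
        query = embed(failure_text)
        scored = [(cosine(query, vec), diag) for vec, diag in self._entries]
        hits = [(score, diag) for score, diag in scored if score >= threshold]
        return sorted(hits, key=lambda hit: hit[0], reverse=True)


memory = FailureMemory()
memory.record(Diagnosis("TC-12", "login test timed out because the session cookie expired before redirect"))
memory.record(Diagnosis("TC-40", "export job failed due to missing write permission on the artifacts bucket"))

# A new failure with similar wording brings back the earlier diagnosis for TC-12.
hits = memory.recall("login test timed out waiting for redirect after cookie expired")
```

Word overlap is a much weaker signal than a learned embedding, but the flow is the same: embed the root-cause prose when it is recorded, then rank stored diagnoses by similarity when a new failure arrives.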

Architecture

One transport-agnostic service layer is exposed three ways — a REST API (FastAPI), an MCP server (FastMCP, 43 tools), and a CLI (Typer) — so humans and agents operate on the same state. Data lives in PostgreSQL with pgvector; failure root-cause prose is embedded locally with all-MiniLM-L6-v2 for semantic similarity search, and artifacts (traces, logs, screenshots) go to S3-compatible blob storage.
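The "one service layer, three surfaces" shape can be sketched as follows. This is an assumption-laden illustration rather than AQA's code: `RunService`, `record_run`, and the three adapter functions are hypothetical names, and plain standard-library functions stand in for the FastAPI, FastMCP, and Typer layers so the example runs without those frameworks installed.

```python
import argparse
import json
from dataclasses import dataclass, field


@dataclass
class RunService:
    """Hypothetical service layer: owns state and rules, knows nothing about transports."""
    runs: list = field(default_factory=list)

    def record_run(self, case_id: str, status: str) -> dict:
        if status not in {"passed", "failed"}:
            raise ValueError(f"unknown status: {status}")
        run = {"id": len(self.runs) + 1, "case_id": case_id, "status": status}
        self.runs.append(run)
        return run


service = RunService()


def rest_record_run(body: str) -> str:
    """REST-style adapter: JSON string in, JSON string out."""
    payload = json.loads(body)
    return json.dumps(service.record_run(payload["case_id"], payload["status"]))


def tool_record_run(case_id: str, status: str) -> dict:
    """Tool-style adapter: keyword arguments in, dict out."""
    return service.record_run(case_id, status)


def cli_record_run(argv: list) -> str:
    """CLI-style adapter: argv in, human-readable line out."""
    parser = argparse.ArgumentParser(prog="record-run")
    parser.add_argument("case_id")
    parser.add_argument("--status", required=True)
    args = parser.parse_args(argv)
    run = service.record_run(args.case_id, args.status)
    return f"run {run['id']}: {run['case_id']} {run['status']}"


# All three adapters write to the same service instance, so they share one state.
rest_record_run('{"case_id": "TC-1", "status": "passed"}')
tool_record_run(case_id="TC-2", status="failed")
cli_line = cli_record_run(["TC-3", "--status", "passed"])
```

Each adapter only translates input and output formats. Validation and state changes live in the service, which is what lets a human at the CLI and an agent calling a tool see the same run history.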

Four capabilities define it:

  • Regression memory: search_similar_failures / get_known_regressions recall prior diagnoses and cached fix-paths, and log the re-investigations they avoid.
  • Blind-spot radar: requirements → coverage links surface tests that don't exist yet.
  • Doer ≠ checker: a claim/verify protocol with identity enforcement, so work isn't self-certified.
  • Dependency gating: a dependency-aware run manifest and cascade-blocking that kills false-green CI when an upstream case fails.
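Two of the capabilities above, doer ≠ checker and dependency gating, reduce to small rules that are easy to sketch. The code below is a hypothetical illustration, not AQA's implementation: `Claim`, `verify`, `effective_statuses`, the status strings, and the case names are all invented for the example, and the dependency graph is assumed to be acyclic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    case_id: str
    claimed_by: str                    # identity that did the work
    verified_by: Optional[str] = None  # identity that checked it


def verify(claim: Claim, checker: str) -> Claim:
    """Reject self-certification: the checker must differ from the doer."""
    if checker == claim.claimed_by:
        raise PermissionError("doer and checker must be different identities")
    claim.verified_by = checker
    return claim


def effective_statuses(results: dict, depends_on: dict) -> dict:
    """Mark a case 'blocked' when any upstream case, directly or transitively, did not pass."""
    resolved = {}

    def resolve(case_id: str) -> str:
        if case_id not in resolved:
            upstream = [resolve(dep) for dep in depends_on.get(case_id, [])]
            if any(status != "passed" for status in upstream):
                resolved[case_id] = "blocked"
            else:
                resolved[case_id] = results.get(case_id, "not_run")
        return resolved[case_id]

    for case_id in results:
        resolve(case_id)
    return resolved


# The raw results look mostly green, but the upstream failure cascades.
results = {"db_migrate": "failed", "api_login": "passed", "ui_checkout": "passed"}
depends_on = {"api_login": ["db_migrate"], "ui_checkout": ["api_login"]}
statuses = effective_statuses(results, depends_on)
gate_open = all(status == "passed" for status in statuses.values())

claim = verify(Claim("api_login", claimed_by="agent-a"), checker="agent-b")
```

In this sketch a "passed" result downstream of a failed dependency is not trusted: it is reported as blocked, so two green results out of three cannot open the gate.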

A note on rigor

AQA dogfoods itself: a Stop hook blocks "done" while coverage gaps are open, and CI records its own runs. It raises the reliability floor — no silent regressions, no invisible blind spots, no self-graded work — a ratchet that compounds across commits and sessions.
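The rule the Stop hook enforces can be sketched as a pure function over requirement-to-test links. This is a guess at the shape, not the actual hook: the hook's input and output contract is not described here, so `coverage_gaps`, `stop_decision`, the returned dict, and the REQ/TC identifiers are all hypothetical.

```python
def coverage_gaps(requirements: list, links: dict) -> list:
    """Return the requirements that have no linked test case."""
    return [req for req in requirements if not links.get(req)]


def stop_decision(requirements: list, links: dict) -> dict:
    """Hypothetical hook result: refuse to finish while any requirement is uncovered."""
    gaps = coverage_gaps(requirements, links)
    if gaps:
        return {"allow_stop": False, "reason": "uncovered requirements: " + ", ".join(gaps)}
    return {"allow_stop": True, "reason": ""}


requirements = ["REQ-1", "REQ-2", "REQ-3"]
links = {"REQ-1": ["TC-1"], "REQ-3": []}  # REQ-2 has no entry, REQ-3 has no tests

decision = stop_decision(requirements, links)
```

A requirement with an empty link list counts as a gap just like one with no entry at all, so both REQ-2 and REQ-3 hold the session open in this example.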

Outcomes

  • 43 MCP tools spanning authoring, planning, execution, builds/CI, and failure memory
  • Semantic failure search via pgvector + local MiniLM embeddings
  • Cascade-blocking merge gate that kills false-green CI

Tech Stack

Python
API
Embeddings

Copyright © 2026 Nishant Tiwari All Rights Reserved