I built an AI that scores my fit for a job — then I tried to make it lie

There's an MCP server behind this site. Point Claude or ChatGPT at mcp.nishanttiwari.com/mcp and it can query my real skills and projects, and run a tool — match_role — that scores how well I fit a job description. I built it to be useful to a recruiter's AI. So the question that matters isn't "does it work?" It's "does it lie?"

Here's how I found out it did.

The test that mattered

Every test I'd run fed it roles I'm a good fit for — cloud architecture, AI platforms, DevOps. All came back strong, all true. None tested the case that actually matters: a role I can't do.

So I passed it a Senior Rust Engineer job — required skills Rust, Tokio, Go, Elixir, Systems Programming. I've never shipped Rust. The honest answer is "no."

It returned: overall_score 71.52, verdict good_match, hire_signal "likely_yes", zero skill gaps. "Likely yes" for a Rust role, to a recruiter, about a man with no Rust — through the exact interface a real recruiter's AI would use. That's the worst thing an AI-facing profile can do.

What was actually happening

match_role had a semantic layer: a small local embedding model (MiniLM, 22M parameters) that, for any skill it couldn't match exactly, found the "nearest" skill I do have and counted it as covered. Rust to Linux (cosine around 0.30). Tokio to Boto3. Go to Google Cloud. Elixir to Lambda. It then overwrote the job's requirements with my own skills and checked them all off. Of course there were no gaps — it had quietly swapped the question for one it could answer.

"Just use a better model" doesn't work

I measured it — cosine separation between false pairs (Rust/Linux) and true paraphrases (Software Architecture/Solution Architecture):

MiniLM: false and true pairs overlap — no clean line.
all-mpnet-base-v2: separates by a razor-thin 0.047 — too narrow to trust.
bge-base: worse — it inflates everything (Rust/Linux jumps to 0.66).

Short skill strings live in too tight a band for any embedding model to cleanly tell "same skill, different words" from "different skill, same field." A bigger model just re-tunes the same coin flip.

The actual fix: know who's smart in the room

Here's the thing I'd missed: the model calling my tool is a frontier LLM — already in the loop, already reading the job description. It knows perfectly well that Rust is a systems language and Linux is an OS. I was using a 22-million-parameter model to make a judgment the model on the other end makes effortlessly.

So I deleted the embedding layer. Now the server does one honest job — deterministic, verifiable matching (exact skill names plus vendor aliases like S3 to AWS). A requirement it can't verify is reported as an honest gap, with an explicit note: this is evidence, not a verdict — you make the fit call. The fuzzy part goes to the model that's actually good at it.

The result

Same Rust job, today: weak_match, "unlikely," all five requirements flagged as gaps. And a real LLM client, run end-to-end, read the evidence and concluded on its own: "not a good fit — zero verified Rust; his strengths are cloud, Python, and AI work, not systems languages." Correct. The server stopped pretending to be smart, and the answer got smarter.

The lesson I keep relearning

The same principle bit me twice here. Don't make the server parse the job description — the calling LLM is the better reader. Don't make the server judge skill-equivalence — the calling LLM is the better judge. The server's job is to be the source of truth, stated plainly; judgment belongs to whoever in the pipeline is best at it, and increasingly that's the model already in the loop, not the bespoke code.

I shipped this to my own résumé tool, which now happily tells you when I'm wrong for a role. That makes it more trustworthy, not less. The bug was never the embarrassing part — shipping a confident lie and never checking would have been.