
Foundry — LoRA Fine-Tuning & Adapter-Lifecycle Platform

  • 2025 – Present
ai, mlops, llm, fine-tuning, research

Foundry is a self-built MLOps platform that trains, tracks, evaluates, and serves LoRA adapters for small LLMs across heterogeneous compute: Modal serverless GPUs, Kaggle, Colab, and local Apple-Silicon MLX. It exposes a Click CLI and a FastAPI service over a pluggable compute-backend abstraction, with run tracking in SQLite, eval-gated deployment, and OpenAI-compatible inference. It was built to train ChakravyuhRift's pentest-specialist adapters, but the architecture is domain-agnostic.

A private research build — not publicly hosted. This write-up focuses on the architecture and the engineering.

The problem

Fine-tuning small LLM adapters is easy to start and painful to operate. The same LoRA job has to run reproducibly on whatever GPU you can get (a Modal A10G today, a Kaggle T4 tomorrow, an Apple-Silicon laptop offline); every run's hyperparameters, loss, cost, and eval results need to be recorded; bad adapters must be stopped before they ship; and you want to serve dozens of them without paying for dozens of GPUs. Foundry is the control plane for all of that.

The approach

Foundry is a self-built MLOps platform, an "adapter lifecycle tracker", with two surfaces (a Click CLI and a FastAPI service plus web UI) over shared data, training, and inference layers, tied together by a pluggable compute-backend abstraction. One command takes a curated corpus to a tracked, evaluated, deployable adapter.

Architecture

  • Compute backends — a single ComputeBackend protocol (trigger / status / logs / download / cancel) implemented for Modal (serverless A10G GPUs + Volumes), Kaggle, Colab, and local Apple-Silicon MLX, with a polling RunManager that reconciles runs across all of them.
  • Training layer — a provider-agnostic TrainingConfig and a train_lora dispatcher that auto-detects hardware capability and picks framework + precision: Unsloth + TRL SFTTrainer with 4-bit LoRA on CUDA, HuggingFace transformers/PEFT, or MLX locally — including a workaround for the Turing-T4 bf16/GradScaler bug.
  • Inference: two modes from the same adapters. A Modal + vLLM OpenAI-compatible server handles cloud serving, and a local MLX multi-adapter server loads one base model and hot-swaps LoRA tensors per request (~100 ms), choosing the adapter via the OpenAI model field or an X-Adapter header. ("Deploy thousands of models for the cost of one.")
  • Tracking & governance — every run persisted in SQLite (hyperparameters, loss history, cost), an LLM-judge eval-policy engine, and eval-gated deployment that refuses to promote an adapter unless its latest verdict is PASSED.
  • Data layer — corpus discovery + a human-in-the-loop curation workflow (per-record keep / drop / flag → filtered corpus), a scrape→parse→merge→quality-scan data pipeline with versioning, and idempotent migrations that reconstruct historical experiment rows.
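
The backend abstraction described above can be sketched roughly as follows. This is a minimal illustration, not Foundry's actual code: only the five operation names (trigger, status, logs, download, cancel) and the ComputeBackend and RunManager names come from the write-up. The method signatures, the status strings, the run-record shape, and the in-memory FakeBackend are all assumptions made so the example runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class ComputeBackend(Protocol):
    """The five operations named in the write-up; signatures are assumed."""

    def trigger(self, config: dict) -> str: ...
    def status(self, job_id: str) -> str: ...
    def logs(self, job_id: str) -> list[str]: ...
    def download(self, job_id: str, dest: str) -> str: ...
    def cancel(self, job_id: str) -> None: ...


# Hypothetical status vocabulary; a real backend may report different states.
TERMINAL = {"succeeded", "failed", "cancelled"}


class FakeBackend:
    """In-memory stand-in for Modal/Kaggle/Colab/MLX (hypothetical).

    A job reports "running" for `polls_until_done` polls, then "succeeded".
    """

    def __init__(self, polls_until_done: int = 2) -> None:
        self._polls = polls_until_done
        self._remaining: dict[str, int] = {}
        self._cancelled: set[str] = set()

    def trigger(self, config: dict) -> str:
        job_id = f"job-{len(self._remaining) + 1}"
        self._remaining[job_id] = self._polls
        return job_id

    def status(self, job_id: str) -> str:
        if job_id in self._cancelled:
            return "cancelled"
        left = self._remaining[job_id]
        if left == 0:
            return "succeeded"
        self._remaining[job_id] = left - 1
        return "running"

    def logs(self, job_id: str) -> list[str]:
        return [f"{job_id}: no logs in the fake backend"]

    def download(self, job_id: str, dest: str) -> str:
        return f"{dest}/{job_id}-adapter"

    def cancel(self, job_id: str) -> None:
        self._cancelled.add(job_id)


@dataclass
class RunManager:
    """Tracks runs across backends and reconciles them by polling."""

    backends: dict[str, ComputeBackend]
    runs: dict[str, dict] = field(default_factory=dict)

    def submit(self, backend_name: str, config: dict) -> str:
        job_id = self.backends[backend_name].trigger(config)
        run_id = f"{backend_name}:{job_id}"
        self.runs[run_id] = {
            "backend": backend_name,
            "job_id": job_id,
            "status": "queued",
        }
        return run_id

    def reconcile(self) -> dict[str, str]:
        """Poll each non-terminal run once and return all known statuses."""
        for run in self.runs.values():
            if run["status"] in TERMINAL:
                continue
            backend = self.backends[run["backend"]]
            run["status"] = backend.status(run["job_id"])
        return {run_id: run["status"] for run_id, run in self.runs.items()}
```

Because the protocol is structural, a new backend only needs to provide these five methods; it does not have to inherit from anything, which is one plausible reason to model the abstraction as a Protocol rather than a base class.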

Where it fits

Foundry is domain-agnostic at the core (its example adapter is literally "buddhism") but was built to train the pentest-specialist adapters that ChakravyuhRift's PDCA traces produce — the training half of that self-improvement loop.

Outcomes

  • Provider-agnostic LoRA training layer that auto-detects hardware capability and dispatches to the right framework/precision (Unsloth+TRL on CUDA, HF transformers/PEFT, or Apple MLX), including a Turing-T4 bf16/GradScaler workaround
  • ComputeBackend protocol with Modal, Kaggle, Colab, and local-MLX implementations plus a polling RunManager — one CLI command uploads data to a Modal Volume, spawns an A10G Unsloth+TRL SFT job, streams loss events, and downloads adapter weights
  • Multi-adapter inference: a Modal+vLLM OpenAI-compatible server for cloud, and a local MLX server that hot-swaps LoRA tensors over one shared base model (~100 ms), selected via the OpenAI model field or the X-Adapter header
  • Experiment-tracking + governance: SQLite-persisted runs (hyperparameters, loss history, cost), an LLM-judge eval-policy engine, and eval-gated deployment that blocks promotion unless the latest verdict is PASSED
  • Human-in-the-loop data-curation workflow (per-record keep/drop/flag → filtered corpus) across CLI and FastAPI, with idempotent evidence-backfill migrations reconstructing historical experiment rows
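
The eval-gated promotion rule from the outcomes above can be illustrated with a small SQLite sketch. The table layout, the function names, the PromotionBlocked exception, and the "FAILED" verdict string are assumptions for illustration; only the rule itself (promotion is refused unless the latest verdict is PASSED) and the use of SQLite come from the write-up.

```python
from __future__ import annotations

import sqlite3
from typing import Optional

# Hypothetical schema: one row per eval verdict, newest row has the highest id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS evals (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    adapter TEXT NOT NULL,
    verdict TEXT NOT NULL
);
"""


class PromotionBlocked(Exception):
    """Raised when an adapter's latest verdict is not PASSED."""


def record_verdict(conn: sqlite3.Connection, adapter: str, verdict: str) -> None:
    conn.execute(
        "INSERT INTO evals (adapter, verdict) VALUES (?, ?)",
        (adapter, verdict),
    )


def latest_verdict(conn: sqlite3.Connection, adapter: str) -> Optional[str]:
    row = conn.execute(
        "SELECT verdict FROM evals WHERE adapter = ? ORDER BY id DESC LIMIT 1",
        (adapter,),
    ).fetchone()
    return row[0] if row else None


def promote(conn: sqlite3.Connection, adapter: str) -> str:
    """Return the adapter name if it may ship; otherwise refuse."""
    verdict = latest_verdict(conn, adapter)
    if verdict != "PASSED":
        # Covers both a failing verdict and an adapter that was never evaluated.
        raise PromotionBlocked(
            f"{adapter}: latest verdict is {verdict!r}, not 'PASSED'"
        )
    return adapter


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record_verdict(conn, "buddhism", "PASSED")
promote(conn, "buddhism")  # allowed: the latest verdict is PASSED
```

Checking only the most recent row means a later failing eval revokes an earlier pass, and an adapter with no eval history is blocked by default.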

Tech Stack

Python
LLM
Embeddings
API

Copyright © 2026 Nishant Tiwari All Rights Reserved