EXPERIMENTS · 4 RUNNING · 2 QUEUED
Active research.
Every experiment is a hypothesis the lab is actively testing. Runs get their own GPU slice, their own artefact bucket, and a status row that's tailable from the console. Results hit the research log when they land.
// active
In-flight experiments
Recursive critique tower
Two-model debate loop. Researcher proposes, critic refutes, three rounds, judge ranks. Tracking inter-rater agreement vs. round depth.
SDXL stylebank explorer
Sweeping 64 controlnet conditioning combos × 12 schedulers. 4 images per combo. Saving thumbnail grids for review.
Whisper diarisation eval
Comparing v3-large turbo against pyannote-3.1 on a 14-speaker meeting corpus. Measuring DER, JER, word-attribution accuracy.
Embedding cache half-life
Production trace replay against a TTL'd embed cache. Looking at hit-rate vs. evict policy: LRU, LFU, ARC, random.
Tool-router fine-tune
LoRA on Llama-8b → tool-call routing classifier. Train on 1.2M synthetic dispatches. Eval on hand-labelled 4k set.
Lab-wide drift watch
Continuous K-S test on response-length and tool-choice distributions. Alerts if today's drift exceeds 2σ from a 30-day baseline.
Retriever recall sweep
Sweeping chunk size × overlap × encoder for an internal docs corpus. Measuring recall@k against a hand-curated query set.
Multilingual code summariser
Fine-tune for code → 1-paragraph English summary across 8 languages. Eval on BLEU + human rubric.
Streaming tool-call protocol
Spec'ing a streaming protocol for tool calls: partial-result hooks, cancellation, recovery. Reference impl + conformance suite.
Vision agent for charts
Multi-modal agent for reading dashboard screenshots and answering natural-language questions about the data shown.
// archive
Completed last 30 days
-
EX-004Cold-start cache priming // 34% p95 reduction across LLM endpoints 2026-05-19 -
EX-003Multi-turn safety probe // 178 prompts. 9 break-throughs (4.7%), written up 2026-05-12 -
EX-002Context window stress-test // Long-context recall degrades sharply past 96k 2026-05-04 -
EX-001Tool-call latency budget // Spec landed: 800ms hard ceiling per tool 2026-04-28