EXPERIMENTS · 4 RUNNING · 2 QUEUED

Active research.

Every experiment is a hypothesis the lab is actively testing. Runs get their own GPU slice, their own artefact bucket, and a status row that's tailable from the console. Results hit the research log when they land.

running: 4 (NOM) · min 3 · max 9
queued: 2 (NOM) · min 0 · max 12
neurons / hr · k: 188 (NOM) · min 80 · max 320
avg wall: 1.8 h (NOM) · min 0.4 · max 6
artefacts / hr: 68 (NOM) · min 12 · max 240
fail rate: 4.0 % (NOM) · min 0 · max 24

// active

In-flight experiments

EX-014 RUNNING

Recursive critique tower

Two-model debate loop. Researcher proposes, critic refutes, three rounds, judge ranks. Tracking inter-rater agreement vs. round depth.

64%
model llama-3.3-70b · ETA 00:08:42
#agents #judging #rlhf
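The card does not say which inter-rater agreement statistic EX-014 tracks. A minimal sketch, assuming two judges assign one categorical label per item (for example, which side won a round) and that Cohen's kappa is an acceptable stand-in. The function name and label values are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (assumed metric)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical verdicts from two judges over four debates.
judge_1 = ["researcher", "critic", "researcher", "critic"]
judge_2 = ["researcher", "critic", "critic", "critic"]
kappa = cohens_kappa(judge_1, judge_2)
```

Computing this once per round depth would give the agreement-vs-depth curve the card describes, under the assumptions above.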
EX-013 RUNNING

SDXL stylebank explorer

Sweeping 64 controlnet conditioning combos × 12 schedulers. 4 images per combo. Saving thumbnail grids for review.

28%
model sdxl-refiner-1.0 · ETA 01:47:11
#vision #sweep #styles
EX-012 PAUSED

Whisper diarisation eval

Comparing Whisper large-v3-turbo against pyannote-3.1 on a 14-speaker meeting corpus. Measuring DER, JER, word-attribution accuracy.

42%
model whisper-large-v3
#audio #eval
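For reference, DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. A minimal sketch of that ratio; the function name and the durations in the usage lines are hypothetical, and how EX-012 derives the error components (collar, overlap handling) is not stated on the card.

```python
def diarisation_error_rate(missed, false_alarm, confusion, total_reference):
    """DER = (missed + false alarm + confusion) / total reference speech.

    All arguments are durations in the same unit (e.g. seconds).
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference

# Hypothetical durations in seconds for one meeting recording.
der = diarisation_error_rate(missed=6.0, false_alarm=3.0,
                             confusion=3.0, total_reference=120.0)
```

Because false alarms are counted in the numerator but not the denominator, DER can exceed 100% on noisy recordings.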
EX-011 DONE

Embedding cache half-life

Production trace replay against a TTL'd embed cache. Looking at hit-rate vs. evict policy: LRU, LFU, ARC, random.

100%
model bge-large-en-v1.5
#embeddings #cache
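A minimal sketch of what a trace replay for one of the policies above (LRU) could look like, assuming the trace is a list of (timestamp, key) events and that entries expire a fixed TTL after insertion. The trace format, function name, and parameters are hypothetical, not the experiment's actual harness.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity, ttl):
    """Replay (timestamp, key) events through an LRU cache with a fixed TTL."""
    cache = OrderedDict()  # key -> time the entry was inserted
    hits = 0
    for now, key in trace:
        inserted = cache.get(key)
        if inserted is not None and now - inserted < ttl:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
            continue
        # Miss or expired entry: (re)insert with a fresh timestamp.
        cache[key] = now
        cache.move_to_end(key)
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used key
    return hits / len(trace) if trace else 0.0

# Hypothetical six-event trace, two-slot cache, ten-second TTL.
trace = [(0, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "b"), (20, "a")]
rate = lru_hit_rate(trace, capacity=2, ttl=10)
```

Swapping the eviction line for a frequency-based or random choice would give comparable replays for the other policies.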
EX-010 QUEUED

Tool-router fine-tune

LoRA on llama-3.1-8b → tool-call routing classifier. Train on 1.2M synthetic dispatches. Eval on hand-labelled 4k set.

0%
model llama-3.1-8b · ETA 03:00:00
#fine-tune #routing
EX-009 FAILED

Lab-wide drift watch

Continuous K-S test on response-length and tool-choice distributions. Alerts if today's drift exceeds 2σ from a 30-day baseline.

89%
model qwen-72b
#ops #monitoring
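A minimal sketch of the two pieces this card names, assuming numeric samples such as response lengths. The card does not define how "2σ from a 30-day baseline" is computed; the reading used here (today's K-S statistic compared with the mean and standard deviation of the previous 30 daily statistics) is an assumption, as are all function names.

```python
import bisect
import statistics

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def drift_alert(today_stat, baseline_stats, sigmas=2.0):
    """Assumed rule: alert when today's statistic exceeds baseline mean + sigmas * std."""
    mean = statistics.fmean(baseline_stats)
    std = statistics.pstdev(baseline_stats)
    return today_stat > mean + sigmas * std

# Hypothetical response lengths: reference window vs. today.
stat = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

The empirical CDFs only change at observed values, so evaluating the gap at each distinct sample point is enough to find the maximum.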
EX-008 RUNNING

Retriever recall sweep

Sweeping chunk size × overlap × encoder for an internal docs corpus. Measuring recall@k against a hand-curated query set.

51%
model bge-large-en-v1.5 · ETA 00:32:19
#rag #sweep
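A minimal sketch of recall@k as it is usually defined: the fraction of a query's relevant documents that appear in the top k retrieved results, averaged over the query set. Document ids, function names, and the shape of the query set are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

# Hypothetical results for two queries from a curated set.
runs = [
    (["d1", "d2"], ["d2"]),  # relevant doc found at rank 2
    (["d3", "d4"], ["d5"]),  # relevant doc not retrieved
]
score = mean_recall_at_k(runs, k=2)
```

Running this once per chunk size, overlap, and encoder combination would yield one score per cell of the sweep.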
EX-007 QUEUED

Multilingual code summariser

Fine-tune for code → 1-paragraph English summary across 8 languages. Eval on BLEU + human rubric.

0%
model qwen-coder-32b · ETA 08:00:00
#code #fine-tune
EX-006 DONE

Streaming tool-call protocol

Spec'ing a streaming protocol for tool calls: partial-result hooks, cancellation, recovery. Reference impl + conformance suite.

100%
model
#agents #protocol
EX-005 RUNNING

Vision agent for charts

Multi-modal agent for reading dashboard screenshots and answering natural-language questions about the data shown.

73%
model llava-1.6 · ETA 00:18:04
#vision #agents

// archive

Completed last 30 days

  • EX-004 Cold-start cache priming // 34% p95 reduction across LLM endpoints 2026-05-19
  • EX-003 Multi-turn safety probe // 178 prompts. 9 breakthroughs (5.1%), written up 2026-05-12
  • EX-002 Context window stress-test // Long-context recall degrades sharply past 96k 2026-05-04
  • EX-001 Tool-call latency budget // Spec landed: 800ms hard ceiling per tool 2026-04-28