# Model selection

> **Audience.** Buyers evaluating the Bilbs Box and current clients
> planning a new deployment. 6-minute read.
> ⟵ [docs index](./README.md) · [the Bilbs Box](./bilbs-box.md) ·
> [per-tier specs](./bilbs-box.md#03--the-four-sizes)

**TL;DR.** We fine-tune open-weight base models on your data, then
burn the result onto your Box. We don't pre-train from scratch and we
don't ship closed-weight models. Your Box tier determines which
bases comfortably fit; your use case determines which base we pick
from the ones that fit.

---

## 01 · How model choice works

Every engagement makes three picks, in this order:

1. **Task shape** — what should the model actually do? (Chat? Agents?
   Code? Transcription? Structured extraction?)
2. **Open-weights base** — the foundation we fine-tune on top of.
   We only ship bases whose licences permit commercial use on your
   hardware (see sections 02 and 08).
3. **Fine-tune approach** — supervised fine-tune (SFT) on your
   corpus, optional preference alignment (DPO / ORPO), then
   quantisation to the smallest precision that still passes your eval
   harness.

The output is a customer-specific weight bundle — signed, pinned,
shipped on the Box — that behaves like its base but knows your data,
your vocabulary, and your guardrails.

## 02 · Base models we ship

These are the families we actively ship in production as of
**April 2026**. Licensing, stability and community momentum drive the
shortlist; we revise it quarterly.

| Family | Sizes we use | Licence | Good at | Caveat |
|--------|-------------|---------|---------|--------|
| **Llama 3.3 / 4** (Meta) | 8B · 14B · 70B · 405B MoE | Llama Community License (commercial OK under 700M MAU) | General chat, RAG, broad tool use | Licence has MAU gate; watch at scale |
| **Qwen 2.5 / Qwen 3** (Alibaba) | 7B · 14B · 32B · 72B | Apache 2.0 (most SKUs) | Code, reasoning, multilingual | Some SKUs pinned to China Mirror — we verify |
| **Gemma 2 / Gemma 3** (Google) | 2B · 9B · 27B | Gemma Terms of Use | Safe outputs, strong guardrail behaviour | "No competitive training" clause — compliant for end-user workloads |
| **Mistral / Mixtral / Mistral Small 3** | 7B · 8×7B MoE · 8×22B MoE · 24B | Apache 2.0 on OSS SKUs | European customers, multilingual, low-latency | Some newer Mistral SKUs (e.g. Mistral Medium 3) are closed-weight; we skip those |
| **DeepSeek V3** | 671B MoE (37B active) | DeepSeek licence (commercial OK) | Hard reasoning, math, long-context coding | Only ships on Box Cluster (VRAM) |
| **Phi-4** (Microsoft) | 14B dense | MIT | Tiny-footprint reasoning, small-RAM fallback | Smaller context (16k default) |

**Embedding / retrieval models** (shipped alongside whatever serving
model we pick):

| Model | Use | Notes |
|-------|-----|-------|
| **BGE-M3** | Default embeddings for RAG | Multilingual, 8192 context, sparse+dense |
| **Nomic Embed v2** | Alternate | Smaller, good on resource-constrained Minis |
| **bge-reranker-v2-m3** | Cross-encoder reranker | Improves RAG precision ~8–15 pp |
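
To show where the reranker sits relative to the embedding model, here is a minimal two-stage retrieval sketch. The two scoring callables are hypothetical toy stand-ins, not BGE-M3 or bge-reranker-v2-m3 APIs; in a real deployment the first would be embedding similarity and the second a cross-encoder score. The values of `k` and `n` are illustrative assumptions.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score


def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    embed_score: Scorer,
    rerank_score: Scorer,
    k: int = 20,
    n: int = 5,
) -> list[str]:
    """Stage 1: a cheap scorer keeps the top-k candidates.
    Stage 2: a more expensive scorer reorders them and keeps the top-n."""
    candidates = sorted(docs, key=lambda d: embed_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:n]


# Toy stand-ins for illustration only (assumptions, not real model calls).
def overlap(query: str, doc: str) -> float:
    """Number of shared lowercase words: a crude proxy for embedding similarity."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def exact_phrase(query: str, doc: str) -> float:
    """1.0 if the whole query appears verbatim: a crude proxy for a cross-encoder."""
    return 1.0 if query.lower() in doc.lower() else 0.0
```

The first stage is cheap enough to run over the whole corpus; the reranker only sees the `k` survivors, which is why it can afford a slower, more precise model.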

**Audio**:

| Model | Use | Notes |
|-------|-----|-------|
| **Whisper Large v3** | Transcription (all tiers) | Customer fine-tune on domain vocabulary available |
| **Whisper Medium** | Mini-tier fallback | 1.2× realtime on Box Mini |

## 03 · Use case → recommended base

These are our default picks. They're not gospel — we confirm against
your data during the Build's week-1 spike.

| Use case | Default base | Why |
|----------|-------------|-----|
| Internal chat assistant | Llama 3.3 14B or Qwen 2.5 14B | Balanced quality ↔ footprint; fits Mini when quantised, serves 200+ concurrent users on Pro |
| RAG over your corpus | Qwen 2.5 14B + BGE-M3 | Qwen's instruction-following + M3's multilingual retrieval is the sharpest default combo |
| Multi-step agents with tools | Llama 3.3 70B | Strong tool-use + JSON-mode fidelity |
| Code assistance on your monorepo | Qwen 2.5 Coder 32B | Purpose-trained for code, outperforms general models on code-eval |
| Clinical / legal doc summarisation | Llama 3.3 70B or Qwen 2.5 72B | Long context; both adapt well to clinical or legal fine-tuning |
| Redaction / PII handling | Phi-4 14B | Small, sharp, deterministic on structured extraction |
| Transcription (HIPAA-compliant) | Whisper Large v3 + Llama 3.3 8B formatter | Whisper handles audio, small Llama shapes the output into your template |
| Invoice / ID OCR | Qwen 2.5 VL 7B | Multimodal, strong structured-output |
| Multilingual across 20+ languages | Qwen 3 32B | Best open-weight multilingual today |
| Safety-critical refusal behaviour | Gemma 3 27B | Tightest refusal distribution out of the box |
| Hard math / formal reasoning | DeepSeek V3 | Only on Cluster; rare use case |

## 04 · What fits on which Box

This is where model choice meets hardware. Cells marked ✓ run at
production-grade latency on the tier. ◐ means "fits but slow —
only batch workloads." Blank means "won't fit."

| Base model | Mini | Box | Pro | Cluster |
|------------|:----:|:---:|:---:|:-------:|
| Llama 3.3 **8B** (FP16) | ✓ | ✓ | ✓ | ✓ |
| Llama 3.3 **8B** (FP8) | ✓ | ✓ | ✓ | ✓ |
| Llama 3.3 **14B** (INT4) | ✓ | ✓ | ✓ | ✓ |
| Llama 3.3 **14B** (FP16) | ◐ | ✓ | ✓ | ✓ |
| Llama 3.3 **70B** (INT4) | | ✓ | ✓ | ✓ |
| Llama 3.3 **70B** (FP16, TP) | | | ✓ (TP2) | ✓ |
| Llama 3.3 **405B MoE** (FP8) | | | | ✓ |
| Qwen 2.5 **14B** (INT4) | ✓ | ✓ | ✓ | ✓ |
| Qwen 2.5 **32B** (INT4) | ◐ | ✓ | ✓ | ✓ |
| Qwen 2.5 **72B** (INT4) | | ◐ | ✓ | ✓ |
| Qwen 2.5 Coder **32B** (INT4) | ◐ | ✓ | ✓ | ✓ |
| Qwen 2.5 VL **7B** (FP16) | ✓ | ✓ | ✓ | ✓ |
| Mistral **24B** (INT4) | ◐ | ✓ | ✓ | ✓ |
| Mixtral **8×22B** (INT4 MoE) | | ◐ | ✓ | ✓ |
| DeepSeek V3 (INT4) | | | | ✓ |
| Gemma 3 **27B** (INT4) | ◐ | ✓ | ✓ | ✓ |
| Phi-4 **14B** (INT4) | ✓ | ✓ | ✓ | ✓ |
| Whisper Large v3 | ◐ | ✓ | ✓ | ✓ |
| BGE-M3 + reranker | ✓ | ✓ | ✓ | ✓ |
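
The matrix can be read as a simple lookup: for a given model and precision, find the smallest tier whose cell is ✓ (or ◐ if batch-only is acceptable). The sketch below encodes a handful of rows copied from the table; the names `FIT` and `smallest_tier` are illustrative, not part of any Bilbs tooling.

```python
from typing import Optional

TIERS = ("Mini", "Box", "Pro", "Cluster")  # smallest to largest

# A few rows copied from the table: "ok" = ✓, "batch" = ◐, None = won't fit.
FIT = {
    "Phi-4 14B (INT4)":     ("ok", "ok", "ok", "ok"),
    "Qwen 2.5 32B (INT4)":  ("batch", "ok", "ok", "ok"),
    "Llama 3.3 70B (INT4)": (None, "ok", "ok", "ok"),
    "Qwen 2.5 72B (INT4)":  (None, "batch", "ok", "ok"),
    "DeepSeek V3 (INT4)":   (None, None, None, "ok"),
}


def smallest_tier(model: str, allow_batch: bool = False) -> Optional[str]:
    """Return the smallest tier that runs `model`, or None if no tier does.

    With allow_batch=True, "fits but slow" (◐) cells count as acceptable.
    """
    accepted = {"ok", "batch"} if allow_batch else {"ok"}
    for tier, cell in zip(TIERS, FIT[model]):
        if cell in accepted:
            return tier
    return None
```

For example, `smallest_tier("Qwen 2.5 72B (INT4)")` gives `"Pro"`, while passing `allow_batch=True` gives `"Box"`.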

### What "fits" means in practice

- **Mini (5 concurrent)** — aim for 8–14B quantised. You get one
  serving model + embeddings + reranker + Whisper Medium on one GPU.
- **Box (25 concurrent)** — 14B full-precision or 32B quantised is
  the sweet spot. Can host serving + embedding + side agent.
- **Pro (200 concurrent)** — 70B quantised at p95 sub-500 ms, or
  2–3 smaller models simultaneously.
- **Cluster (1000+ concurrent)** — anything, including 405B-MoE or
  DeepSeek V3, with room left for simultaneous training.
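
As a rough first pass, the concurrency figures above map a target user count to a tier. This is a sketch under two assumptions: the figures are treated as hard ceilings, and Cluster ("1000+") is treated as the catch-all above Pro. Real sizing also depends on the model and the latency target, so treat the result as a starting point only.

```python
# Nominal concurrency ceilings from the list above, smallest tier first.
# Cluster is listed as "1000+", so it is used as the catch-all (an assumption).
TIER_CONCURRENCY = [("Mini", 5), ("Box", 25), ("Pro", 200)]


def tier_for_concurrency(concurrent_users: int) -> str:
    """Return the smallest tier whose nominal ceiling covers the load."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    for tier, ceiling in TIER_CONCURRENCY:
        if concurrent_users <= ceiling:
            return tier
    return "Cluster"
```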

## 05 · Quantisation and its trade-offs

By default we ship **INT4 (AWQ)** — the cost / quality trade-off is
healthiest. We ship FP16 where the customer's eval harness shows
≥ 2-percentage-point regression at INT4, and FP8 on Hopper (H100/H200)
when we can.

| Precision | VRAM factor | Throughput factor | Quality delta vs FP16 / notes |
|-----------|:-----------:|:-----------------:|----------------------:|
| FP16 | 1.0× | 1.0× | baseline |
| FP8 (Hopper) | 0.5× | 1.4× | ≤ 0.5 pp |
| INT4 AWQ | 0.25× | 1.8× | typically 0.3–1.5 pp |
| INT4 GPTQ | 0.25× | 1.6× | fallback if AWQ regresses |
| GGUF q4_K_M | 0.27× | 1.2× | CPU-friendly, for llama.cpp fallback |
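
The VRAM factors translate into a quick back-of-envelope estimate for weight memory: FP16 stores 2 bytes per parameter, and each row scales that figure. The sketch below covers weights only; the KV cache, activations, and runtime overhead come on top, so actual footprints are higher. The function name is illustrative.

```python
# VRAM factors copied from the table above (relative to FP16).
VRAM_FACTOR = {
    "FP16": 1.0,
    "FP8": 0.5,
    "INT4 AWQ": 0.25,
    "INT4 GPTQ": 0.25,
    "GGUF q4_K_M": 0.27,
}

FP16_BYTES_PER_PARAM = 2  # 16 bits per weight


def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * FP16_BYTES_PER_PARAM * VRAM_FACTOR[precision]
```

By this estimate a 70B model needs about 140 GB of weights at FP16 and about 35 GB at INT4, which is consistent with the fit matrix placing 70B FP16 only on the larger tiers.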

Which one you ship on is not a customer decision — it's driven by your
eval-harness gate. We'll tell you in the week-4 hardening report.
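
One way to picture the gate is as a walk from the smallest precision upward, stopping at the first one that clears the regression threshold. This is a sketch, not the actual harness: the candidate order and the choice to apply the same 2 pp threshold to GPTQ and FP8 are assumptions, and `pick_precision` is a hypothetical name.

```python
from typing import Mapping

REGRESSION_GATE_PP = 2.0  # from the text: a drop of 2 pp or more at INT4 fails


def pick_precision(delta_pp: Mapping[str, float], hopper: bool) -> str:
    """Pick the smallest precision that passes the gate.

    delta_pp maps precision name -> quality drop vs FP16 in percentage
    points (positive = worse), as measured by the eval harness.
    Precisions with no measurement are skipped.
    """
    candidates = ["INT4 AWQ", "INT4 GPTQ"]  # AWQ first, GPTQ as fallback
    if hopper:
        candidates.append("FP8")  # FP8 is only considered on Hopper GPUs
    for precision in candidates:
        drop = delta_pp.get(precision)
        if drop is not None and drop < REGRESSION_GATE_PP:
            return precision
    return "FP16"  # baseline when nothing smaller passes
```

For example, if AWQ regresses by 2.4 pp but GPTQ by only 1.1 pp, the sketch returns `"INT4 GPTQ"`; if both fail on non-Hopper hardware, it falls back to `"FP16"`.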

## 06 · Multimodal & voice

- **Vision-in** — Qwen 2.5 VL 7B / 32B (default), LLaVA-NeXT for
  research-grade tasks. Runs on Box and up.
- **Speech-to-text** — Whisper Large v3 (default); Parakeet on
  Cluster for faster-than-real-time transcription at very high throughput.
- **Text-to-speech** — F5-TTS or XTTS v2. On Mini it adds ~150 ms of
  latency per sentence. On Pro and up it streams.
- **Image-out** — we deliberately do *not* ship diffusion models as
  default. If you need one, we scope it in a separate Build.

## 07 · When a model gets deprecated

We drop a base from the shortlist when any of these happens:
- Licence change that breaks commercial use.
- Upstream author stops shipping patches for ≥ 180 days.
- A newer same-family release wins on our own eval suite by
  ≥ 3 pp on the tasks we care about.
- Security vulnerability in the tokeniser or inference runtime we
  can't patch downstream.

Deprecated bases keep working on your Box — deprecation means *we
stop starting new engagements on them*, not *your Box breaks*. If
you're on the [Updates plan](./plans/updates-plan.md) we'll propose a
migration path inside the next maintenance window.
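
The four triggers listed above amount to an "any of" check. The sketch below expresses them as a function over a small status record; the `BaseStatus` fields and the function name are hypothetical, and only the 180-day and 3 pp thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class BaseStatus:
    """Hypothetical snapshot of a base model's standing (illustrative fields)."""
    licence_allows_commercial_use: bool
    days_since_last_upstream_patch: int
    successor_eval_gain_pp: float  # gain of a newer same-family release; 0.0 if none
    unpatchable_security_issue: bool


def deprecation_reasons(status: BaseStatus) -> list[str]:
    """Return every trigger that applies; a non-empty list means 'deprecate'."""
    reasons = []
    if not status.licence_allows_commercial_use:
        reasons.append("licence no longer permits commercial use")
    if status.days_since_last_upstream_patch >= 180:
        reasons.append("no upstream patches for 180 days or more")
    if status.successor_eval_gain_pp >= 3.0:
        reasons.append("newer same-family release wins by 3 pp or more")
    if status.unpatchable_security_issue:
        reasons.append("security issue we cannot patch downstream")
    return reasons
```

Returning the list of reasons rather than a bare boolean makes it easy to explain a deprecation in a migration proposal.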

## 08 · What we will *not* ship

- **Closed-weight hosted models** (GPT-4, Claude, Gemini via API).
  We will help you migrate *off* these; we won't put them on the Box
  because the weights aren't yours.
- **Base models with "no competitive use" clauses that exclude your
  industry.** We read every licence. We'll tell you if one is ruled
  out.
- **Models trained on known copyrighted corpora without a path to
  indemnification.** Case-by-case; founder judgement call.
- **A brand-new model family from a single anonymous contributor on
  Hugging Face.** We need a maintainer we can reach.

## 09 · FAQ

**Q: Can I specify the base model?**  
Yes. If you have a preference from the approved list, we'll default
to it. If it's not on the list, we'll evaluate during the Build's
week-1 spike and either add it or explain why not.

**Q: Can I stack multiple fine-tunes?**  
Yes, via LoRA adapters. Pro and Cluster can hot-swap adapters per
request; Mini and Box need a ~90-second switch.

**Q: How often does my model get re-trained?**  
If you're on the Updates plan, we refresh when a new base model
drops. Otherwise only when you ask; ad-hoc re-fine-tunes are priced
per SOW.

**Q: If Gemma 5 ships next month, can I get it?**  
Yes — the [Updates plan](./plans/updates-plan.md) covers base-model
refreshes. The new weights get re-fine-tuned on your corpus,
validated against your eval harness, and pushed to the Box.

**Q: Can I run multiple models at once?**  
Depends on the tier. Mini: one serving model + embeddings + reranker
+ Whisper Medium. Box: add a side agent. Pro: up to 3 serving models
+ retrieval stack.
Cluster: effectively unbounded.

**Q: Do you use customer data to improve other customers' models?**  
No. Never. Your corpus only trains your weights; see the
[Privacy policy](https://bilbs.ai/privacy) and the MSA.

---

See also:
[bilbs-box.md](./bilbs-box.md) · [tier spec sheets](./bilbs-box.md#03--the-four-sizes) ·
[Updates plan](./plans/updates-plan.md) ·
[internal model-selection playbook](./internal/13-model-selection.md) *(team only)*.