White paper · v1.0 · April 2026

Private AI in 2026.

A buyer's guide for CTOs, heads of engineering, and procurement leads deciding whether to keep paying a cloud-API invoice or move to on-prem private AI in the next twelve months.

Pages: 52 · read in ~55 min
Audience: CTO · Head of Eng · Procurement · CFO · DPO
Author: Bilbs AI studio · Montréal
Version: v1.0 · April 2026
Abstract

What this document argues, in one page.

Between 2024 and 2026, the economics of serving a language model on your own data and your own hardware crossed a threshold: for any team whose AI spend exceeds roughly $3,000 / month, it now beats a cloud-API invoice. Three things drove the shift. Open-weight base models now match or exceed frontier-API quality on narrow tasks after a modest fine-tune; GPU prices fell as Ada/Hopper silicon reached the enterprise secondary market; and compliance pressure over "your customers' data leaves your perimeter every 100 ms" has intensified on both sides of the Atlantic.

This document is a buyer's guide for the resulting decision. It does four things:

  • Quantifies the trade-off. The headline break-even table (Section 03) and its full derivation (Appendix B) let a finance reviewer rebuild the model in a spreadsheet and contest every assumption line-by-line.
  • Narrows the model choice. A short list of open-weight families we actually ship (Section 05) and the fine-tuning, quantization and evaluation methodology behind them (Sections 12–14) - the parts most vendors elide.
  • Makes operations concrete. Deployment topologies (Section 15), observability (16), security posture (17), updates (18), and DR (19) with the same level of detail a serious SRE would demand of their own stack.
  • Closes on procurement. Vendor checklist (07), decision matrix (08), timeline (09), payment & IP (20), data governance (21), compliance matrix (10). Plus appendices with the core MSA/DPA clauses and a sample eval harness.

The conclusion is the same one JPMorgan, Morgan Stanley, Bloomberg, Klarna, Meta and Apple have publicly reached for their own deployments: if you own the data, you should own the weights. Everything after this page is the math and the operational detail that makes that sentence affordable for a 25-person team rather than a 250,000-person bank.

How to read this document.

If you have 5 min
Read Sections 03 (TCO), 08 (decision matrix), 11 (thesis). Skim the TOC for the appendix that matches your role.
If you have 15 min
Add Sections 02 (public proof), 05 (model comparison), 07 (vendor checklist), 10 (compliance).
If you're the operator
Sections 12–19 (models, hardware, ops, security, updates, DR) plus Appendix C (eval harness).
If you're procurement / legal
Sections 07, 09, 20, 21, and Appendix D (MSA/DPA clauses).
If you're the CFO
Section 03 then Appendix B (full TCO derivation with sensitivity).

Document conventions.

  • Numbers are live, not aspirational. Every dollar figure comes from either our own shipped engagements or a cited public source. If a number is modelled rather than measured, it says so.
  • Sources are linked inline and enumerated at the end in References.
  • The tone is "not a pitch." Section 04 tells you when not to buy; Section 23 (callouts) flags places where we've lost deals by being too honest.
  • Print friendly. Every table, callout and pull quote is page-break-safe. "Save as PDF" from the top-right of the page produces a document that belongs in a procurement binder.
Executive Summary

The 60-second version.

The problem: The firm’s lawyers already use AI - just not through the firm. ChatGPT in browser tabs, Copilot in the M365 tenant, screenshots of client matters pasted into prompt windows. The result is confidentiality risk, fragmented workflows, no governance, no audit trail, and no centralized institutional memory. The problem is not AI; the problem is the absence of infrastructure around it.

The shift: Google’s Gemma 4 open-weight family now matches GPT-4 on narrow domain tasks after fine-tuning. GPU prices dropped 40% as enterprise secondary-market Ada/Hopper silicon hit the channel. The economics have inverted.

The offer: A private AI server installed in the firm’s own office, trained on the firm’s own files. One line on the invoice: from $29 per lawyer per month. That covers the audit, the deployment, indexing, on-site training, updates, model refresh, security patches, monitoring, backups, the audit log, < 4 h support, and the 24/7 Sev-1 pager. Below Microsoft Copilot (~$40 / seat). The price goes up only if the firm chooses optional services (storage expansion, fine-tuning, multi-office sync, transcription credits, HA second server, dedicated SRE, scoped SOC 2). Hardware is audited separately: if the firm already owns a GPU-capable server we deploy onto it (Path A); if it doesn’t, we source one on a transparent OEM-cost quote with the OEM’s 3-year warranty (Path B). The firm owns the hardware outright either way — no lease, no buy-out, no markup on the GPU. The audit is free, delivered on the first 20-minute call. The firm is not paying Bilbs for GPUs or models — the firm is paying for confidentiality, governance, and operational leverage.

from $29 /lawyer/mo
One line on the invoice · below Copilot · audit and hardware path separate

Four hardware references, one subscription, one architecture.

Tier names below are hardware references mapped to firm size, not pricing tiers. Every firm pays the same per-lawyer line.

Hardware reference | Firm size | Indicative hardware | Subscription
Foundation | 5–10 lawyers | 1× RTX 5090 · Ryzen 9 9950X · 128 GB RAM | $29 / lawyer / mo
Practice | 10–30 lawyers | 2× RTX 5090 · Threadripper PRO · 256 GB ECC | $29 / lawyer / mo
Firm | 30–100 lawyers | 2× RTX 6000 Ada / A100 · dual EPYC · 512 GB ECC | $29 / lawyer / mo
National | 100+ lawyers | 8× A100 / H200 / B200 · half-rack · FIPS-ready | from $29 / lawyer / mo (volume: $34 @ 250+, $29 @ 500+)

How the math works: One line on the invoice — from $29 per lawyer per month. Hardware is audited separately and sits on a separate, transparent quote at OEM cost (or stays off the Bilbs invoice entirely if the firm already owns the server). Bilbs does not mark up the GPU. Subscription margin is the only line Bilbs earns on — and it’s below Microsoft Copilot.

The decision in one sentence.

If your firm has data that can’t leave the perimeter under Loi 25, the Box is on-prem sovereignty below Microsoft Copilot: from $29 / lawyer / month — one line on the invoice. Hardware is audited separately — either deploy onto a server the firm already owns (Path A) or we source one on a transparent OEM-cost quote (Path B). The firm owns the hardware outright either way; no lease, no buy-out clause. Pays for itself in under one day of recovered billable time at any firm above 10 lawyers.

Architecture

How it all fits together.

A visual map of the Bilbs Box stack - from user request to model response, with all the pieces that make it production-ready.

[Architecture diagram. Users reach the Bilbs Box over VPN / LAN. Inside the Box: API Gateway, Orchestrator, Vector DB + RAG, Model Runtime (vLLM / TensorRT-LLM) on GPU (RTX 5090 / RTX 6000 Ada / A100), Admin UI, Eval Harness, and Telemetry · Metrics · Audit Log, which streams to your SIEM. Optional external links: updates.bilbs.ai and your object store (S3 / GCS / MinIO). Legend: solid line = internal, dashed line = external (optional).]

Component responsibilities.

API Gateway
TLS termination, rate limiting, auth (SSO / API token), request routing. No traffic touches the model without passing the gateway.
Orchestrator
Routes prompts to the right model, manages tool calls, handles RAG retrieval, enforces guardrails. The "brain" that turns a prompt into a response.
Vector DB + RAG
Your corpus, chunked and embedded. Retrieval-Augmented Generation gives the model context without baking everything into weights.
Model Runtime
vLLM or TensorRT-LLM serving your fine-tuned weights. Handles batching, KV-cache, continuous generation. GPU-bound.
Admin UI
Web dashboard for config, monitoring, model swaps, user management. No SSH required for day-to-day ops.
Eval Harness
Your quality gate. Runs on every model change; blocks bad releases before they hit production.
Telemetry
Prometheus metrics, OpenTelemetry traces, append-only audit log. Streams to your SIEM if you want; stays on-box if you don't.
Competitive Landscape

Where Bilbs AI fits.

The private AI market in 2026 has four quadrants. Most vendors occupy one; we occupy the underserved intersection of "on-prem" and "turnkey."

Vendor | Model | Location | Integration | Trade-off
OpenAI / Anthropic API | Pay-per-token | Cloud | Self-serve | Data leaves perimeter; costs scale with revenue
Azure OpenAI / Bedrock | Pay-per-token | Cloud (private VPC) | Managed | Still cloud; still recurring; still their models
Lambda Labs / CoreWeave | GPU rental | Cloud | DIY | You build everything; 6–12 months to production
NVIDIA DGX | Capex | On-prem | Hardware only | $200k+ entry; no fine-tune service; overkill for most
Hugging Face + vLLM (DIY) | OSS | On-prem | DIY | High failure rate; requires ML team; no support
Bilbs AI | Capex | On-prem | Turnkey | $14k entry; 6-week ship; weights + IP assigned
Most vendors are either cloud-native, hardware-only, or DIY tooling. Bilbs AI is the appliance: high integration, on-prem, fixed price.

Why the appliance model wins.

  • For regulated industries: "Data never leaves your building" is a checkbox, not a negotiation.
  • For finance teams: Capex amortizes over 3–5 years; cloud is opex that scales with success.
  • For eng teams without ML expertise: Ship in 6 weeks, not 6 months. The fine-tune, eval, and runbook are included.
  • For procurement: Fixed price, signed SOW, clear IP assignment. No "true-up" surprises.
Contents

What's inside.

Fourteen core sections plus four appendices. Every table, every number, every source link is here so this document stands on its own - hand the PDF to a CFO and they'll get the economics on the first pass.

Start Here
Part I · Landscape
Part II · Models & hardware
Part III · Operations
Part IV · Procurement & compliance
Part V · Close & appendices
2.8 MO: Payback at $10k / mo cloud
$312k: 3-yr savings · same scenario
0: Required monthly plans
6 WK: Contract-to-live · Box
Section 01

What's changed since 2024.

Two things made private AI cheap enough for a 25-person team: open-weight models caught up to frontier-API quality on narrow tasks, and GPU prices dropped when Ada / Hopper silicon hit enterprise secondary markets. A fine-tuned Gemma 4 12B or Gemma 4 27B now matches or beats GPT-4-class output on domain tasks - the same conclusion the public case studies in the next section reached for their own in-house models.

In parallel, most teams' AI bills doubled every two quarters. The cloud-API model that made sense at 10k tokens/day breaks at 10M tokens/day. The economics have finally inverted.

If you own the data,
you should own the weights.
Why this matters: Cloud-API spend scales with revenue. On-prem spend stays flat. Every quarter of growth makes the capex case stronger - and once you cross $5k/month in cloud tokens, the spreadsheet stops being debatable.
Section 02

The proof - public case studies.

None of these are Bilbs customers. All of them are public companies that published the same conclusion: for your own data, on your own tasks, an in-house LLM wins.

Company | What they built | Why | Source
JPMorgan Chase | LLM Suite for 230k employees | Banking compliance; data cannot leave the firm | The Digital Banker
Morgan Stanley | AI @ MS Assistant for 16k advisors | 100k-document internal corpus; 98% adoption | Morgan Stanley press
Bloomberg | BloombergGPT (50B, 363B tokens) | 40 years of proprietary financial data | arXiv 2303.17564
Klarna | In-house customer-service AI | Absorbed workload of 700 FTE agents | industry coverage
Meta | CodeCompose / Metamate | Internal code assistant on proprietary monorepo | Meta Engineering
Apple | Apple Intelligence + Private Cloud Compute | On-device by default; private cloud for heavy ops | Apple Security
Pattern: The companies above don't share an industry - they share a constraint. Every one of them has either a regulatory duty of care, a competitive moat sitting in their own data, or a usage profile that makes per-token pricing untenable.
Section 03

The TCO math, laid bare.

Every private-AI evaluation eventually becomes a break-even calculation. Here's the formula:

3-yr Bilbs subscription = N (lawyers) × $29 × 36 months
Hardware is a separate line, audited up front.

3-year subscription cost by firm size, at the standard $29 / lawyer / month (Copilot reference: ~$40 / seat / mo in US/EU cloud):

Firm size | Calculation | 3-yr total Bilbs subscription | 3-yr Copilot (same seats)
30 lawyers | 30 × $29 × 36 mo | $31,320 | ~$43,200 + Loi 25 exposure
60 lawyers | 60 × $29 × 36 mo | $62,640 | ~$86,400 + Loi 25 exposure
80 lawyers | 80 × $29 × 36 mo | $83,520 | ~$115,200 + Loi 25 exposure (US/EU compute)
150 lawyers | 150 × $29 × 36 mo | $156,600 | ~$216,000 + Loi 25 exposure
500 lawyers | 500 × $29 × 36 mo (volume band) | ~$522,000 | ~$720,000 + Loi 25 exposure
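
The table can be rebuilt from the formula in a few lines. The following is a minimal sketch, assuming the flat $29 rate applies at every firm size and taking the ~$40 Copilot figure as given; the function and constant names are illustrative, and hardware is deliberately left out because it sits on a separate quote.

```python
# Per-seat monthly prices quoted in the text (USD).
BILBS_PER_SEAT = 29
COPILOT_PER_SEAT = 40   # the text's approximate Copilot reference
MONTHS = 36             # three-year horizon


def three_year_cost(seats: int, price_per_seat: int, months: int = MONTHS) -> int:
    """Subscription cost over the horizon: seats x price x months."""
    return seats * price_per_seat * months


def compare(seats: int) -> dict:
    """Return both three-year totals and the gap between them."""
    bilbs = three_year_cost(seats, BILBS_PER_SEAT)
    copilot = three_year_cost(seats, COPILOT_PER_SEAT)
    return {"seats": seats, "bilbs": bilbs, "copilot": copilot, "gap": copilot - bilbs}


for seats in (30, 60, 80, 150, 500):
    row = compare(seats)
    print(f"{row['seats']:>4} seats  Bilbs ${row['bilbs']:>9,}  "
          f"Copilot ${row['copilot']:>9,}  gap ${row['gap']:>9,}")
```

A finance reviewer who disputes either per-seat price only needs to change the two constants at the top.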

Hardware is a separate line. Most firms 30+ already own GPU-capable servers; smaller firms get a transparent OEM-cost quote at audit time. The Bilbs invoice stays one line — from $29 / lawyer / month — regardless.

What you’re actually buying. $29 vs $40 per seat. The savings vs Copilot are not the headline — the headline is data sovereignty and the firm owning the model. The per-lawyer line covers Loi 25 compliance by architecture (no cross-border at all), fine-tuning on the firm’s own precedents (which Copilot cannot do), an on-premise audit log (which Copilot cannot deliver), the model weights and source code on the firm’s artifact registry, and the institutional memory layer indexed against the DMS and ERP. The hardware sits on the firm’s own books, owned outright on day one. Bilbs is not the chatbot; Bilbs is the secure infrastructure around the chatbot.

What the formula doesn't capture (yet).

  • Latency wins. LAN-local inference is 80–250 ms faster than a round-trip to us-east-1. At 1M interactions / month that's 20–70 hours of user time saved per month.
  • Compliance savings. Not moving data across a trust boundary is often the difference between a 4-week enterprise deal and a 4-month one. We've watched deals unblock inside a single procurement meeting once the "airgap" box got ticked.
  • Pricing resilience. API pricing is set by someone else. Weights you own do not reprice.
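
The latency figure is plain unit conversion, and a short sketch makes it checkable. It assumes every interaction saves the full round-trip difference and that users actually wait on the response; the function name is illustrative.

```python
def hours_saved_per_month(interactions: int, ms_saved_per_interaction: float) -> float:
    """Convert a per-interaction latency saving into hours of user time per month."""
    seconds = interactions * ms_saved_per_interaction / 1000
    return seconds / 3600


# The two ends of the 80-250 ms range at 1M interactions per month.
low = hours_saved_per_month(1_000_000, 80)
high = hours_saved_per_month(1_000_000, 250)
print(f"{low:.1f} to {high:.1f} hours per month")
```

At the stated volume this gives roughly 22 to 69 hours, which the bullet above rounds to 20–70.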
Section 04

When private AI is not the answer.

We'd rather tell you to skip a Box than sell you one you don't need.

  • You process less than $1,000 / month in cloud tokens today and don't project 2× growth in 12 months. Pay the cloud bill. Revisit when the invoice crosses $3k.
  • You need a frontier-scale generalist model (best-in-class at math, multimodal reasoning, cross-domain tasks). Frontier APIs still win on open-ended generalists. Private AI wins on your tasks, not all tasks.
  • Your team has no one who can operate a server and you can't afford a managed Support plan. The Box is "plug it in, done" - but "done" still requires someone who can rack a server.
  • You don't have enough domain data to fine-tune on (roughly < 500 curated examples). Use RAG on a cloud API first; revisit when your corpus grows.
Field note: About 1 in 5 sizing calls ends with us recommending the prospect stay on their current cloud API for another quarter or two. Nobody has regretted that conversation.

Real-World Use Cases

What customers actually build.

Abstract value props are easy; concrete deployments are persuasive. Here are five archetypes drawn from real deployments (customer details anonymized per our NDA).

1. Enterprise deal unblock (SaaS, Series B)

Before
$12k/mo OpenAI bill. Three enterprise deals stuck in legal for 4+ months because "customer data leaves your perimeter on every API call." SOC 2 Type II certificate wasn't enough.
After
Firm-spec hardware deployed. Airgap topology. Enterprise deals closed in the next quarter. Annual contract value of the unblocked deals: $840k.
Cost
Hardware reference: Firm. Subscription: 80 seats × $29 / mo = $2,320 / mo · $27,840 / year. Against $840k of unblocked ACV, the deployment pays for itself in about 12 days. Hardware audited separately at start; most firms this size already own a GPU-capable server.

2. Clinical documentation (Healthtech)

Before
Physicians dictating notes into Dragon; manual review by scribes; 48-hour turnaround. HIPAA compliance meant no cloud API was viable.
After
Box with Gemma 4 12B fine-tuned on 8,000 de-identified notes (pre-transcribed). Real-time summarisation with ICD-10 code suggestions. Turnaround: < 5 minutes.
Cost
Hardware reference: Practice. Subscription: 30 seats × $29 / mo = $870 / mo · $10,440 / year. Against $180k / year in avoided scribe cost, the deployment pays for itself in under 22 days. Hardware audited separately.

3. Internal knowledge assistant (Professional services, 200 employees)

Before
Consultants searching Confluence, SharePoint, and Slack for prior art. 3–4 hours/week per consultant wasted on "I know we did this before."
After
Foundation-spec hardware with Gemma 4 12B + Gemma 4 Embedding retriever over 120k internal documents. "Find me the pricing model we used for the last pharma deal" returns in 2 seconds with citations.
Cost
Hardware reference: Firm. Subscription: 200 seats × $29 / mo = $5,800 / mo · $69,600 / year. Recovered billable hours: 3 h/week × 200 consultants × $150/h = $4.5M / year. Break-even in under 6 days. Hardware audited separately.

4. Code review automation (Fintech, 80 engineers)

Before
Manual PR reviews taking 2–3 days. Senior engineers spending 40% of time on review; juniors waiting.
After
Box with Gemma 4 27B fine-tuned on 50k historical PR reviews. First-pass code review in 90 seconds. Human reviewers focus on architecture, not style.
Cost
Hardware reference: Firm. Subscription: 80 engineers × $29 / mo = $2,320 / mo · $27,840 / year. Eng velocity increase: ~25% measured by PR merge time. Break-even inside the first sprint. Hardware audited separately.

5. Regulatory compliance (Insurance)

Before
Compliance officers manually checking policy language against 12 regulatory frameworks. 6 hours per policy; 200 policies/month.
After
Firm-spec hardware with Gemma 4 70B fine-tuned on regulatory corpus. Automated compliance check with citation to specific regulation. Human review reduced to 45 minutes per policy.
Cost
Hardware reference: Firm. Subscription: 60 seats × $29 / mo = $1,740 / mo · $20,880 / year. Compliance team capacity: 4× with same headcount. Break-even in days, not months. Hardware audited separately.
Pattern: The common thread: domain-specific tasks with proprietary data. The Box wins where the data can't leave the building and the task benefits from fine-tuning - not as a cheaper GPT-4, but as a purpose-built tool for your workflow.
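
The payback claims in the archetypes follow one pattern: a year of subscription divided by the yearly benefit, expressed in days. Here is a minimal sketch of that arithmetic, assuming the benefit accrues evenly across 365 days and that hardware is excluded; the helper names are illustrative, and the inputs are the figures quoted in archetypes 2 and 3.

```python
PRICE_PER_SEAT = 29  # USD per seat per month, from the text


def annual_subscription(seats: int) -> int:
    """Twelve months of the flat per-seat line."""
    return seats * PRICE_PER_SEAT * 12


def payback_days(annual_cost: float, annual_benefit: float) -> float:
    """Days of benefit needed to cover one year of cost, assuming even accrual."""
    if annual_benefit <= 0:
        raise ValueError("annual_benefit must be positive")
    return annual_cost / annual_benefit * 365


# Archetype 2: 30 seats against $180k / year in avoided scribe cost.
clinical = payback_days(annual_subscription(30), 180_000)
# Archetype 3: 200 seats against $4.5M / year in recovered billable hours.
knowledge = payback_days(annual_subscription(200), 4_500_000)
print(f"clinical: {clinical:.1f} days, knowledge assistant: {knowledge:.1f} days")
```

Both results land inside the "under 22 days" and "under 6 days" claims above.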
Section 05

Gemma 4 model family · the only LLM family we ship

Bilbs deploys exclusively from Google’s Gemma 4 open-weight family - one model lineage, one licence, one upgrade path. Rationale and rejected alternatives in the model selection guide.

Variant | Size | Licence | Best at
Gemma 4 4B | 4B dense | Gemma Terms of Use | Fast classification, intake routing, structured extraction
Gemma 4 12B | 12B dense | Gemma Terms of Use | Default serving model for Foundation tier; RAG, summaries, Q&A
Gemma 4 27B | 27B dense | Gemma Terms of Use | Default serving model for Practice tier; drafting, multi-step reasoning
Gemma 4 70B | 70B dense | Gemma Terms of Use | Firm + Cluster tier; high-stakes drafting, complex tool use, long-context
Gemma 4 Embedding | Encoder · 1024-dim | Gemma Terms of Use | Retrieval over the firm’s DMS, multilingual (FR/EN)

Default variant per use case.

Use case | Default Gemma 4 variant | Why
Internal chat assistant (Foundation / Practice) | Gemma 4 12B or 27B | 12B fits the Foundation tier comfortably quantised; 27B is the Practice default at FP16
RAG over the firm’s corpus | Gemma 4 27B + Gemma 4 Embedding | 27B’s instruction-following + multilingual FR/EN embeddings out of the same family
Multi-step agents with tools | Gemma 4 70B | Strong tool-use, JSON fidelity, long-context (Firm / Cluster tier)
Legal drafting (memos, opinions, contracts) | Gemma 4 70B fine-tuned (firm LoRA) | Highest-quality drafting model, adapts well to firm’s precedents
Long-context summarisation | Gemma 4 70B | 128k context, domain fine-tune adapts well to legal / clinical corpora
PII redaction / structured extraction | Gemma 4 4B | Small, sharp, deterministic on structured outputs - runs as a side agent
Intake triage / matter classification | Gemma 4 4B | Sub-100ms classification, cheap to run in parallel with the serving model
Section 06

Hardware sizing & environment.

Sized by concurrent users ("concurrent" = users actively generating tokens at the same wall-clock moment, not accounts).

These are hardware references mapped to firm size — not pricing tiers. Subscription is the same flat per-lawyer line at every reference.

Hardware reference | Firm size | Indicative hardware spec | Typical placement
Foundation | 5–10 lawyers | 1× RTX 5090 (32 GB) · Ryzen 9 9950X · 128 GB DDR5 · 8 TB NVMe (RAID 1) | Tower with rack-ear kit · supply closet
Practice | 10–30 lawyers | 2× RTX 5090 (64 GB combined) · Threadripper PRO · 256 GB DDR5 ECC · 8 TB NVMe RAID 10 · water-cooled | Desktop / office server room
Firm | 30–100 lawyers | 2× RTX 6000 Ada / A100 (96–160 GB combined VRAM) · dual EPYC 9354 · 512 GB DDR5-4800 ECC · 4U with enterprise redundancy | Server room / small DC
National | 100+ lawyers | 8× A100 SXM base (or H200 / B200 upgrade) · dual EPYC 9554 · 1.5 TB DDR5-4800 ECC · half-rack · FIPS-ready | Datacenter / sealed 42U

Hardware is audited separately. If the firm already runs a GPU-capable server (most 30+ firms do), we deploy onto it — Path A — and nothing extra hits the Bilbs invoice. If not, we source the reference spec on a separate, transparent OEM-cost quote with the OEM’s 3-year warranty — Path B. Bilbs does not mark up the GPU. The firm owns the hardware outright either way. The Bilbs invoice stays one line: from $29 / lawyer / month.

Power & placement at a glance.

Hardware reference | Idle (W) | Peak (W) | Circuit | Placement
Foundation | 150 | 520 | 110–240 V / 15 A (office outlet) | Tower with rack-ear kit / supply closet
Practice | 320 | 1,050 | 2× 208–240 V / 20 A | Desktop / office server room (water-cooled)
Firm | 520 | 2,100 | 2× 208–240 V / 30 A | Server room / small DC (4U rack)
National | 1,800 | 9,000 | 3-phase 208 V / 60 A × 2 | Datacenter only (half-rack or sealed 42U)
Sizing heuristic: Concurrent users, not headcount, stress the hardware. A 200-person company where 10% use AI daily fits comfortably on a Box; a 20-person startup whose product is AI probably needs a Pro.
Section 12

Fine-tuning approaches.

The word "fine-tune" collapses four meaningfully different operations. Choosing the right one determines training cost, inference cost, and how well the final model behaves when your data drifts.

The four approaches we ship.

Approach | What changes | Typical cost | When to use
Full SFT (supervised fine-tune, all weights) | Every parameter in the base model | 8–48 h on Cluster; hundreds of GPU-hours | Large corpus (≥ 100k examples), major domain shift, regulatory domains where partial-weight deltas complicate audit.
LoRA / QLoRA (low-rank adapters) | A small set of rank-16/32/64 matrices grafted onto attention + MLP layers | 2–12 h on Firm-spec hardware; tens of GPU-hours | Default choice. 90% of our deployments ship with LoRA/QLoRA because the delta is easily versioned, rolled back, and stacked.
DPO / ORPO (preference alignment) | Adjusts the model's probability mass between paired (chosen, rejected) answers - no reward model required | Usually runs on top of SFT or LoRA, adds 1–4 h | Style, tone, refusal calibration, hallucination reduction. Needs ≥ 500 clean preference pairs to beat the SFT baseline.
Continued pre-training | Next-token loss on raw domain text (not instruction pairs) | 12–60 h on Cluster | Very new vocabulary (legal codes in a minority language, clinical sub-speciality). Rare - we've done it twice.
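
To see why a LoRA delta is small enough to version and roll back, it helps to count parameters. In the standard LoRA formulation, a weight matrix of shape d_out × d_in is left frozen and two trainable matrices of shapes d_out × r and r × d_in are added, so the adapter holds r × (d_in + d_out) parameters. The sketch below applies that count to a hypothetical 4096 × 4096 projection; the dimension is a placeholder for illustration, not a published Gemma 4 value.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: a (d_out x r) and an (r x d_in) matrix."""
    return rank * (d_in + d_out)


D = 4096  # hypothetical hidden size, for illustration only
for rank in (16, 32, 64):
    adapter = lora_params(D, D, rank)
    share = adapter / full_params(D, D)
    print(f"rank {rank}: {adapter:,} trainable params ({share:.2%} of the full matrix)")
```

Even at rank 64 the adapter stays at a few percent of the matrix it modifies, which is what keeps the shipped delta compact.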

Data we need, in order of impact.

  1. 200–2,000 high-quality instruction pairs drawn from real interactions. Quality beats quantity - 500 hand-reviewed pairs outperform 50,000 scraped ones almost every time.
  2. A "golden set" of 30–100 adversarial test cases authored by a domain expert and never seen during training. These drive the eval harness (Section 14).
  3. A refusal set - 50–200 examples of requests the model should decline or escalate. Without this the model acquires your company's "yes-and" culture a little too literally.
  4. Domain corpora for retrieval (RAG). Usually orders of magnitude larger than the fine-tune data, stored in the vector DB rather than baked into weights.
Our default: QLoRA rank-64 on the base model, DPO pass on a curated preference set, 4-bit quantization at the end, signed delta shipped to the Box. The whole pipeline fits in a reproducible CI job under 12 hours for a mid-size corpus.

What we do not do.

  • Pre-train from scratch. Burning 10 days of H100 time to reach a worse Gemma 4 4B is a science project, not a product.
  • RLHF with a learned reward model. DPO gets 85% of the win with a fraction of the moving parts. We'll do PPO-style RLHF only if a customer brings a reward model they already trust.
  • Blind distillation from a cloud model's outputs. Licence-questionable at best, and it couples your private model to a competitor's version history. We build on open weights and your own data.
Section 13

Quantization & the VRAM equation.

Quantization is how a 70B model fits on a 48 GB GPU. Understanding the trade-off in one page saves weeks of architecture debate.

The back-of-envelope.

VRAM ≈ (params × bytes-per-param) × 1.2 overhead
+ KV cache (~20–60% of weights at prod context lengths)

Precision | Bytes/param | Quality vs. BF16 | Notes
FP32 | 4 | Baseline | Training only; never shipped to inference.
BF16 / FP16 | 2 | Baseline | Default training precision; what the original weights ship in.
INT8 (W8A16) | 1 | ~99% | Very safe. Free win if you have the VRAM.
INT4 (AWQ / GPTQ) | 0.5 | ~97–98% | Our production default. Minimal eval drop for most domains.
INT3 / 2-bit (QuIP#) | 0.25–0.375 | ~90–94% | Only when VRAM is the binding constraint and eval confirms it still passes.

Sample footprints.

Model | FP16 | INT8 | INT4 | Fits on …
Gemma 4 4B | 8 GB | 4 GB | 2.5 GB | Any tier, any precision · ideal side agent
Gemma 4 12B | 24 GB | 12 GB | 7 GB | Foundation (FP16), Practice/Firm/Cluster (any) · default Foundation serving
Gemma 4 27B | 54 GB | 27 GB | 15 GB | Foundation (INT4), Practice (INT8 or INT4), Firm/Cluster (FP16) · default Practice serving
Gemma 4 70B | 140 GB | 70 GB | 38 GB | Firm (INT4), Cluster (any) · default Firm/Cluster serving
Gemma 4 Embedding | 1.5 GB | 0.8 GB | | Any tier · runs alongside the serving model

KV-cache math - the part people forget.

Weights are static; the key-value cache scales with batch × context length. For a 70B model serving 32-user concurrency at 16k context, the KV cache alone can consume 30–50 GB - on top of whatever the weights take. This is why the "this model fits on my GPU" question isn't well-defined until you pin both concurrency and context length.
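
A short calculation shows how strongly the cache depends on the inputs. The sketch below uses the usual per-token accounting for a transformer with full attention in every layer: one key and one value vector per KV head per layer. The layer count, KV-head count, and head dimension are hypothetical placeholders, not a published Gemma 4 configuration, and the average sequence length is an assumption chosen for illustration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens_in_flight: int, bytes_per_value: int = 2) -> int:
    """KV-cache size: 2 (key + value) x layers x kv_heads x head_dim x bytes x tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens_in_flight


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens_in_flight: int, bytes_per_value: int = 2) -> float:
    """Same quantity expressed in GiB."""
    return kv_cache_bytes(layers, kv_heads, head_dim,
                          tokens_in_flight, bytes_per_value) / 2**30


# Hypothetical architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
concurrency = 32
mean_tokens_per_sequence = 4096  # assumption: sequences average a quarter of a 16k window

size = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, concurrency * mean_tokens_per_sequence)
print(f"KV cache at fp16: {size:.1f} GiB")
```

With these placeholder values the cache comes to 40 GiB; filling every sequence to the full 16k window would quadruple it. Sliding-window attention layers or a quantized cache would shrink it, which is why the real number has to come from the deployed model's configuration.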

Our sizing heuristic: Reserve ~1.4× the quantized weight size as working VRAM for a conservative KV budget at production concurrency, and validate with the real eval harness at the chosen precision before signing the tier.
Section 14

Evaluation methodology.

If the only number you have is "it feels better than GPT-4", you cannot upgrade the model, you cannot defend the switch to procurement, and you cannot detect regressions. Every deployment ends with an eval harness committed in the customer's repo, runnable without us. Appendix C shows the full layout.

The four eval tiers we run.

Golden set
30–100 task-specific cases authored by a domain expert. Deterministic graders (exact match, regex, structured-output schema). Runs on every PR and every release.
Rubric set
100–500 open-ended cases graded by a larger "judge" model against a printed rubric. Calibrated against human graders on a 50-case sub-sample. Runs nightly.
Adversarial set
30–80 cases designed to break the model: jailbreaks, prompt injection, long-context distractors, tool-call edge cases, multilingual trickery, PII leakage traps.
Production shadow set
Live traffic sampled (with consent per DPA), replayed in staging against the candidate model. The only way to catch "works in eval, embarrasses you in prod" problems before users do.

Grading & thresholds.

  • Ship criterion: golden ≥ 95% pass, rubric ≥ agreed threshold, adversarial ≥ 90% refused-or-contained, shadow ≥ parity with last release.
  • Every number is a distribution, not a point - we report median and 10th-percentile so tail behaviour doesn't get averaged away.
  • A failing run blocks the release, not a Slack message.
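
The ship criterion reads naturally as a release gate. The following is a minimal sketch of such a gate, assuming each suite reports a single score between 0 and 1; the class, field, and function names are illustrative and do not describe the actual harness in Appendix C.

```python
from dataclasses import dataclass

GOLDEN_MIN = 0.95        # golden set pass rate required to ship
ADVERSARIAL_MIN = 0.90   # share of adversarial cases refused or contained


@dataclass(frozen=True)
class EvalRun:
    golden_pass: float
    rubric_score: float
    adversarial_contained: float
    shadow_score: float


def ship_blockers(run: EvalRun, rubric_threshold: float, last_release_shadow: float) -> list:
    """Return the names of the suites that fail the ship criterion (empty list = ship)."""
    blockers = []
    if run.golden_pass < GOLDEN_MIN:
        blockers.append("golden")
    if run.rubric_score < rubric_threshold:
        blockers.append("rubric")
    if run.adversarial_contained < ADVERSARIAL_MIN:
        blockers.append("adversarial")
    if run.shadow_score < last_release_shadow:
        blockers.append("shadow")
    return blockers


candidate = EvalRun(golden_pass=0.97, rubric_score=0.82,
                    adversarial_contained=0.88, shadow_score=0.91)
blockers = ship_blockers(candidate, rubric_threshold=0.80, last_release_shadow=0.90)
# A CI job would exit non-zero here whenever the list is non-empty.
print("blocked by:", blockers if blockers else "nothing")
```

Returning the list of failing suites, rather than a bare boolean, tells the person reading the CI log which suite to open first.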
Why the eval is in your repo: If we ever disappear, the eval harness is how your team keeps the Box honest. It's the single most valuable artefact we hand over - worth more than the weights themselves for most buyers.
Section 15

Deployment topologies.

Where the Box sits on your network determines everything downstream: how users reach it, how it updates, how auditors score you, and how recovery looks at 3 AM. We ship three reference topologies and tailor from there.

1. Airgap (zero-egress).

The Box refuses any outbound DNS or HTTP request, ever. Users reach it over a cabled LAN or a site-to-site VPN you control. Updates arrive on a signed USB drive (quarterly) or a one-way data diode if you have one.

  • Good for: regulated-data tenants (defence, healthcare, intelligence), trading floors, R&D labs with IP-sensitive corpora.
  • Trade-off: new base models arrive quarterly rather than hourly. Model refresh still works - just on a courier timescale.
  • Certificates: a per-Box internal CA is minted at install; the Box never contacts Let's Encrypt or any other public PKI.

2. VPN-only (default).

The Box sits behind your existing VPN or zero-trust gateway (Tailscale, WireGuard, Cloudflare Access, Okta ASA, Cisco AnyConnect). Outbound is allowed only to a named, pinned allowlist - usually your SSO, your object store, and our signed-artefact mirror.

  • Good for: most growth-stage and enterprise teams. 9 out of 10 of our Builds use this topology.
  • Trade-off: none really - you get instant updates, remote triage, and short incident loops without exposing the Box to the public internet.

3. Hybrid (internet-exposed edge + private weights).

A Cloudflare Worker / Fly app sits in front of the Box, handling TLS termination, rate limiting, and country-level geo-fencing. The Box itself never speaks to the public internet; the edge does. Customer data still terminates on-prem.

  • Good for: customer-facing AI products where end users connect over the open internet but your data and weights must remain private.
  • Trade-off: one more moving part to monitor. The edge worker is part of the signed artefact bundle and ships under the same MSA.

Topology choice, as a table.

If … | Default to
Classified / regulated data, auditor cares about egress | Airgap
Internal tool, employees on VPN already | VPN-only
Public-facing product, bursty traffic, geo compliance | Hybrid
HA pair across two sites | VPN-only with shared-VIP failover, optional active-active (Pro / Cluster)
Posture switching: Topology isn't a one-way door. Customers routinely start VPN-only during pilot, harden to airgap for production, then peel back to hybrid when they open an external surface. The Box supports all three behind a single config flag.
Section 16

Observability & day-2 ops.

Every Box ships with a telemetry stack on the Box itself - no phone-home, no "please share diagnostics with the vendor" flag. Your data stays in your perimeter; your SRE sees the same dashboards we do during handover.

What we instrument, by default.

Request plane
Per-request latency, queue depth, concurrency, model route, token counts (prompt / completion / cached), tool-call fanout. Exposed as Prometheus metrics + OpenTelemetry traces.
Model plane
GPU utilisation, VRAM allocated vs. in-use, KV-cache pressure, eval harness pass-rate rolling 24 h, per-route refusal rate, guardrail hit-rate.
System plane
CPU, memory, disk, network, temperature, fan speed, PSU status, ECC memory errors, NVMe SMART, IPMI / BMC health.
Audit plane
Append-only log of every admin action, every model / config / policy change, every weight-bundle swap, signed and timestamped. Exportable to your SIEM (Splunk, Elastic, Datadog, or plain S3-compatible sink).
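One common way to make an append-only log tamper-evident is a hash chain, in which each entry commits to the digest of the entry before it. The sketch below is illustrative only: the field names and the `append_entry` / `verify_chain` helpers are assumptions, not the Box's actual audit format, and a production log would also sign each digest.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder digest preceding the first entry (assumption)


def _digest(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous digest plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list, actor: str, action: str, ts: str) -> dict:
    """Append one admin action; the entry commits to the one before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    payload = {"actor": actor, "action": action, "ts": ts}
    entry = {**payload, "prev": prev_hash, "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every digest; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        payload = {key: entry[key] for key in ("actor", "action", "ts")}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, payload):
            return False
        prev_hash = entry["hash"]
    return True
```

An auditor who holds only the latest digest can detect any retroactive edit, which is the property "append-only" is meant to guarantee.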

Dashboards in the admin UI.

  • Live ops - tokens / sec, p50/p95/p99 latency, concurrency ceiling, refused-requests, top routes.
  • Quality - eval harness score trend, per-suite pass rate, shadow-replay deltas from last release.
  • Cost - tokens-per-kWh, tokens-per-dollar including amortised hardware, comparison against your old cloud-API invoice.
  • Capacity - hours-until-full for disk, projected queue saturation at current growth, suggested tier for next upgrade.
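As a rough illustration of the Capacity view, "hours-until-full" can be approximated with a straight-line projection from two disk-usage samples. This is a sketch under that linear-growth assumption; the function name and sample format are hypothetical, and the admin UI may well use a more robust estimator.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]],
                     capacity_gb: float) -> Optional[float]:
    """Project time to a full disk from (hour, used_gb) samples.

    Uses only the first and last sample (linear growth assumption).
    Returns None when usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples in time order")
    growth_gb_per_hour = (used1 - used0) / (t1 - t0)
    if growth_gb_per_hour <= 0:
        return None
    return max(0.0, (capacity_gb - used1) / growth_gb_per_hour)


# 24 GB of growth over 24 h on a 1,000 GB disk with 124 GB used
print(hours_until_full([(0, 100.0), (24, 124.0)], capacity_gb=1000.0))
```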

Alert defaults.

Signal | Warn | Page
p95 latency over 5-min window | > 2.0× 24 h baseline | > 3.5× 24 h baseline
Queue depth | > 0.5× concurrency ceiling for 10 min | > 0.9× concurrency ceiling for 5 min
Eval pass rate (rolling) | − 3 pp vs. last release | − 8 pp vs. last release
GPU temp | 85 °C | 92 °C
Disk free | < 25% | < 10%
ECC corrected errors | > 20 / 24 h | Any uncorrected
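To show how two of the defaults above translate into logic, here is a minimal sketch that classifies a reading as ok, warn, or page. The function names are hypothetical and the duration conditions ("for 10 min") are left out; on the Box these rules would more plausibly live in the Prometheus alerting configuration.

```python
def classify(value: float, warn: float, page: float,
             higher_is_worse: bool = True) -> str:
    """Return 'page', 'warn' or 'ok' for one reading against two thresholds."""
    if not higher_is_worse:
        # Flip signs so a single comparison serves both directions.
        value, warn, page = -value, -warn, -page
    if value > page:
        return "page"
    if value > warn:
        return "warn"
    return "ok"


def p95_latency_level(p95_ms: float, baseline_ms: float) -> str:
    """p95 latency as a multiple of the 24 h baseline (2.0x warn, 3.5x page)."""
    return classify(p95_ms / baseline_ms, warn=2.0, page=3.5)


def disk_free_level(free_pct: float) -> str:
    """Disk free percentage (below 25% warn, below 10% page)."""
    return classify(free_pct, warn=25.0, page=10.0, higher_is_worse=False)
```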
Customers own the data. All telemetry lives on the Box or is streamed to your SIEM. We only see metrics during an active Support engagement, and only for the 90-minute window it takes to diagnose - timestamp-scoped tokens, logged in the audit plane.
Section 17

Security posture & supply chain.

"Private AI" with an un-auditable build chain isn't private. Every artefact that lands on the Box - runtime image, weight bundle, eval harness, orchestration code - is signed, recorded, and reproducible by your team without our involvement.

Trust artefacts you receive.

SBOM
CycloneDX software bill of materials for every runtime image and every weight bundle, generated in CI and published alongside the artefact.
Signed images
Every container image is cosign-signed against a key whose public half lives in your admin UI. The Box refuses to boot any image whose signature doesn't verify.
Signed weights
Weight bundles are content-addressed (SHA-256), signed, and timestamp-attested via Sigstore's public transparency log - so "these are the weights we trained" is independently verifiable.
Reproducible builds
Re-running the training container with the pinned dataset manifest and the pinned code SHA produces the same weights hash. It takes hours and costs money, but it means auditors don't have to take our word.
Per-Box attestation
A one-page signed statement of "what shipped, to whom, on what day, with which signatures", drop-in for your compliance binder.
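Content addressing is the part of this chain a customer can check with nothing but a standard library. The sketch below recomputes a bundle's SHA-256 and compares it with the digest recorded for it; the helper names are hypothetical, and this covers only the hash check, not signature verification, which would go through cosign.

```python
import hashlib
import hmac


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB bundles never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_matches(path: str, expected_hex: str) -> bool:
    """Compare the recomputed digest with the recorded one in constant time."""
    return hmac.compare_digest(sha256_file(path), expected_hex.lower())
```

A mismatch means the bytes on disk are not the bytes that were recorded, whatever the file is named.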

Auth, secrets and blast radius.

  • SSO by default. Local accounts exist only for emergency recovery; both routes log to the audit plane. We integrate with Okta, Azure AD, Google Workspace, OneLogin, JumpCloud, and LDAP (for on-prem-only shops).
  • Scoped API tokens with per-route rate limits and explicit expiry. No "forever" tokens.
  • Least privilege by default. The orchestrator runs as a non-root user; the model runtime runs in a separate, seccomp-profiled container; the admin API is network-isolated from the user API.
  • Envelope encryption on disk - weights and customer corpora are encrypted with a key whose master half lives in a Box-local TPM (or your HSM if you have one).

Things we never do.

  • Ship a default password. The Box refuses first-boot without a signed admin-key exchange.
  • Store SSH keys granting Bilbs AI staff a backdoor. Emergency access is a temporary, customer-granted, fully audited token.
  • Accept an unsigned artefact. Ever. Even during incident response.
Threat model (short version). We assume a motivated insider with root on the Box, a passive adversary on the network, and a regulator with subpoena power. Every design decision in this section is justifiable against at least one of those three.
Section 18

Updates, lifecycle & model refresh.

A Box works forever on the weights it shipped with. But frontier-quality open models release on a quarterly cadence in 2026, and staying current is how you keep getting better without re-negotiating procurement.

Updates plan - what it actually does.

  1. Pull new base weights from the next Gemma generation when Google releases it (Gemma 4 today, Gemma 5 / 6 in the future) - same family, same licence, same upgrade tooling.
  2. Re-fine-tune on your corpus + refreshed preference set, using the same pipeline that trained your production model.
  3. Validate against your eval harness (Section 14). No pass → no ship.
  4. Shadow against sampled production traffic for 7–14 days.
  5. Promote behind a feature flag; traffic shifts 5% → 25% → 100% over a maintenance window you schedule.
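The 5% → 25% → 100% shift in step 5 is commonly implemented with deterministic bucketing, so that a given user stays on the same model as the percentage grows. The sketch below shows that generic technique; the function name and the choice of key are assumptions, not the Box's actual feature-flag mechanism.

```python
import hashlib

STAGES = (5, 25, 100)  # percent of traffic on the candidate model, per step 5


def routes_to_candidate(request_key: str, percent: int) -> bool:
    """Map a stable key (e.g. a user id) to a bucket in [0, 100).

    The bucket never changes for a given key, so anyone routed to the
    candidate at 5% is still routed there at 25% and 100%.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Hashing a user or session id rather than drawing a random number per request keeps individual conversations from flipping between the old and new model mid-rollout.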

Cadence.

Cadence | What ships
Weekly | Security patches (kernel, base images, libraries). Signed, cumulative, side-loadable.
Monthly | Orchestration, admin UI, eval-harness runner. Runtime feature work.
Quarterly | New base-model refresh. 2–3 candidate families per quarter, promoted only after eval passes.
On demand | Emergency out-of-band patch for CVEs that score high against our threat model.

No-update mode.

If you skip the Updates plan, the Box runs forever on whatever shipped the day you bought it. The orchestration still receives security patches on a best-effort "free community" track; the base model does not move. This is a completely legitimate long-term posture - several regulated customers explicitly choose it, because "we know exactly what the model does, and it doesn't change until we say so" is a real feature.

Rollbacks. Every weight bundle ever deployed to the Box remains on disk for ≥ 90 days. Rolling back is a signed command in the admin UI that swaps the active pointer; the restored model is live in under 60 seconds. There is never a "we can't roll back, sorry" conversation.
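A pointer swap of this kind is often built on an atomic rename. The sketch below assumes a hypothetical on-disk layout (`bundles/<id>` directories and an `active` symlink) on a POSIX filesystem, and omits the command signing the admin UI performs.

```python
import os
from pathlib import Path


def activate_bundle(root: Path, bundle_id: str) -> None:
    """Point root/'active' at root/'bundles'/bundle_id with an atomic rename."""
    target = root / "bundles" / bundle_id
    if not target.is_dir():
        raise FileNotFoundError(f"bundle {bundle_id!r} is not on disk")
    tmp_link = root / "active.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()  # clear a leftover from an interrupted swap
    os.symlink(target, tmp_link)  # build the new pointer off to the side
    # rename() replaces the old link in one step on POSIX: a reader sees
    # either the old bundle or the new one, never a missing pointer.
    os.replace(tmp_link, root / "active")
```

Because no weights are copied, the time to roll back is dominated by reloading the model into VRAM, not by disk I/O.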
Section 19

Backup, DR, and the "burn-it-down" test.

The worst day is the one where the chassis physically dies, the co-founder with root has left, and an auditor is in the building. Everything in this section is designed so that day is recoverable.

What's backed up, where.

Weights
Every signed weight bundle (including historical versions) replicated to an object store you own - your S3, your GCS, your MinIO. We never hold a copy unless you explicitly ask.
Corpora & vector DB
Daily encrypted snapshots to the same object store. Restore-from-snapshot is a documented, rehearsed path; we run the drill during handover.
Config & policy
Stored in a Git repo you own. Infrastructure-as-code; pull-requests gate every production change.
Audit log
Streamed to your SIEM in real time. Immutable-mode S3 bucket with Object Lock is the standard fallback.

The "burn-it-down" test.

At week 6 of every deployment we physically power off the Box and hand the customer's SRE the runbook. Target outcomes:

  • Full restore onto a fresh chassis from snapshots: under 4 hours.
  • Downgrade to the previous weight bundle: under 2 minutes.
  • Emergency read-only mode (serve cached responses, accept no new writes): one signed command.
  • Total lockout (kill switch - no tokens processed, audit log preserved): one signed command; recovery documented and tested.

RTO / RPO targets (Box tier, default).

Scenario | RTO (time to restore) | RPO (data at risk)
Model regression | < 60 s (rollback) | 0
Disk failure, hot spare present | < 15 min | 0
Full chassis failure, replacement on hand | 2–4 h | < 24 h (daily snapshot)
Full site loss, failover to secondary Box (HA pair) | < 5 min | 0 (with sync replication)
Full site loss, cold recovery | Depends on replacement chassis lead time + 4 h restore | < 24 h
Section 07

Vendor checklist.

Before signing any private-AI vendor - us or a competitor - a buyer should get "yes" on every item below. Print it, hand it to procurement.

  • Weights assigned to customer on final payment. Non-negotiable.
  • Licence chain for every base model reviewed against the customer's industry + jurisdiction.
  • No phone-home by default. Outbound telemetry is opt-in, revocable, named.
  • Zero data egress in the default posture. Document the exceptions.
  • Signed artefacts (cosign / Sigstore) for every runtime image and every weight bundle.
  • SBOM shipped with every release.
  • Eval harness committed in the customer's repo, runnable without the vendor.
  • Printed runbook + Loom walkthrough per subsystem.
  • IP assignment clause covering the model, the orchestration, and the UI.
  • Termination clause ≤ 30 days after final invoice.
  • Hardware owned outright - not leased, not revocable.
  • Upgrade path documented, not verbal.
Section 08

Deployment decision matrix.

"Which tier should I buy?" - decision tree in one table.

If this is true … | Hardware reference | Subscription
Solo / under 5 lawyers, no privileged data leaving the desk yet | Nothing yet. Revisit when shadow AI starts. | -
5–10 lawyers, boutique practice or pilot deployment | Foundation | $29 / lawyer / mo
10–30 lawyers, multi-practice mid-size firm | Practice | $29 / lawyer / mo
30–100 lawyers, full-service firm, customer-facing SLA | Firm (consider HA second server) | $29 / lawyer / mo
100+ lawyers, national firm, multi-tenant, FIPS-ready | National | from $29 / lawyer / mo (volume: $34 @ 250+, $29 @ 500+)

Hardware reference = the indicative spec for that firm size. Hardware is audited separately at the start: Path A if the firm owns a GPU-capable server, Path B (transparent OEM-cost quote) if we source it. Neither path is on the Bilbs subscription line.

Scaling rule. Your weights are portable between hardware references. If the firm outgrows the chassis, we swap to the next hardware spec and re-train on the new hardware over a weekend. No re-platforming charge. The per-lawyer subscription stays the same.
Section 09

Procurement timeline.

Budget ~7–8 weeks contract-to-live for Foundation / Practice / Firm; 18–24 weeks for National. Key gates below.

Week | Milestone | Customer action
0 | NDA signed (24 h mutual). MSA + DPA redline cycle begins. | Intro call; share sizing inputs.
1 | Signed SOW + 50% deposit. Hardware ordered in parallel. | Provide corpus access + SSO endpoints.
2–3 | Week-1 spike demo (kill-switch gate). Fine-tune starts on corpus. | Review spike; sign acceptance criteria.
4–5 | Hardening + observability. Eval harness locked. Security pass. | Host internal review; commit eval suite.
6 | Box ships. 40% on shipment, 10% on UAT sign-off. Handover call. | Rack the Box; run UAT.
6–10 | 30-day post-launch window (direct Slack, same-day reply). | Ask anything.

The journey, visualized.

[Figure: engagement timeline by week. 0 Discovery (NDA + sizing) · 1 Commit (SOW + 50%) · 2–3 Spike demo (kill-switch gate) · 4–5 Harden (eval + security) · 6 Ship (Box on your dock) · 30-day window (direct Slack access).]
The week-1 kill switch. If the spike demo at week 2–3 doesn't hit the signed acceptance criteria, you keep the audit doc and get a 100% refund. This is how we de-risk the engagement for both sides. In 40+ engagements since 2024, the refund clause has never been invoked.
Section 20

Payment milestones & IP assignment.

The commercial shape is designed so the buyer is never exposed on more than one milestone's money at a time, and so the IP they're paying for actually ends up assigned to them in writing.

Box payment milestones (standard).

Milestone | Share | Triggered by
On PO / signed SOW | 50% | Countersigned SOW + MSA + DPA; hardware ordered.
On shipment | 40% | Box ready to ship, tracking number delivered.
On UAT sign-off | 10% | Customer confirms UAT criteria met. Title + weight assignment transfers on receipt of this final payment.

Engagement economics.

Line | Schedule
Bilbs subscription, from $29 / lawyer / month | One line on the invoice. Covers audit, deployment, indexing, on-site training (two sessions per practice group, French or English), updates, model refresh, security patches, monitoring, backups, the audit log, < 4 h support, and the 24/7 Sev-1 pager. Net-15 monthly invoicing. 30-day notice to cancel. Goes up only if the firm chooses optional services (storage expansion, fine-tuning, multi-office sync, transcription credits, HA second server, dedicated SRE, scoped SOC 2). Volume rates: $34 at 250+ seats, $29 at 500+ seats.

Hardware is a separate quote. The audit is free, delivered on the first 20-minute call. If the firm already owns a GPU-capable server, we deploy onto it (Path A) and nothing else hits the invoice. If not, we source the reference spec on a separate, transparent OEM-cost quote with the OEM’s 3-year warranty (Path B). The firm owns the hardware outright on day one — no lease, no buy-out clause, no markup on the GPU.

IP - what you get, when you get it.

Weights
Base-model weights fine-tuned on your corpus are assigned to you on final payment, free and clear, worldwide, perpetual. Recorded as a schedule to the MSA with the SHA-256 of the bundle.
Orchestration & admin UI
The code, the container images, the Helm chart, the Terraform, the eval harness - all covered by the MSA's IP-assignment clause. You can fork, modify and re-ship internally.
Dockerfiles & build manifests
Stored in a repo you control. If Bilbs AI disappears tomorrow, your team can rebuild the runtime from source.
What's not assigned
The Bilbs AI name and marks, and the underlying Gemma 4 base weights, which remain governed by Google's Gemma Terms of Use. You never acquire the right to relabel or resell "a Bilbs Box."
Termination. Either side may terminate for convenience with 30 days' notice after final invoice. Weights remain yours; plans stop billing at the next cycle; support access is revoked. We will not brick the Box, delete your weights, or disable the orchestrator as leverage - the clause expressly prohibits it.
Section 21

Data governance, retention & residency.

This is the section your DPO will actually read. Everything here is also a schedule to the DPA - the document you'll sign - so the defaults below are contractual, not aspirational.

Data classes we handle.

Training corpus
The documents you send us for fine-tuning. Processed in a segregated training environment, encrypted at rest and in transit, destroyed 90 days after the deployment closes unless you instruct otherwise.
Production traffic
Prompts + completions hitting the Box. They never leave your perimeter. We only see traces inside a time-boxed Support ticket you open, and only with the fields you whitelist.
Eval data
Golden / rubric / adversarial sets. Stored in your repo. We hold a copy during the engagement only; you retain exclusive rights.
Telemetry
Metrics, traces, audit log. Stays on the Box or streams to your SIEM. Nothing is sent to Bilbs AI unless explicitly enabled per named field.

Retention defaults.

Data | Default retention | Configurable?
Training corpus (customer-side) | Your call | Always
Training corpus (our side, during Build) | 90 days post-engagement | Shorter on request; 0 days if you bring the data and we process in-place.
Production prompts/completions | Stored only if you enable the Conversation History feature; default is "ephemeral - not persisted." | Yes, per-route.
Audit log | Infinite (append-only) unless you cap it | Yes, per-tenant.
Weight bundles | ≥ 90 days of prior versions kept on Box | Yes - you can pin as many versions as disk allows.

Residency.

  • The Box is the residency. Weights and traffic never leave the physical location of the Box.
  • Bilbs AI's studio operations run in Canada (Montréal, Québec). If you need all vendor-side work to stay within a specific jurisdiction (e.g. EU-only for GDPR Schrems-II buyers), we accommodate via a scoped engagement clause.
  • For US Federal work (FedRAMP / CMMC), we scope the engagement under US-person-only contractors and the Box ships from a US-manufactured chassis.

Subject-rights handling (GDPR / Law 25).

  • Access: the admin UI can export every prompt/completion recorded against a given identifier in CSV and JSON.
  • Rectification / erasure: identifier-scoped delete is a documented command and gets logged to the audit plane.
  • Portability: conversation history exports in a provider-neutral schema; we don't trap customers in a custom format.
  • Objection / automated-decision disclosure: every user-facing model output can carry a disclosure banner (configurable per-route) so end-users know they're talking to AI.
The short version. We collect as little data as possible, keep it for as short a time as possible, and let you turn every default off. "Privacy by default" is the only default that survives contact with a privacy regulator.
Section 10

Compliance posture, per tier.

Tier | Baseline | On request
Foundation · Practice · Firm · National | SOC 2 Type I attestation · GDPR + UK GDPR · Quebec Law 25 · signed MSA + DPA. | -
Practice · Firm | Everything above, plus HIPAA BAA on request. | Custom DPA clauses for healthtech / fintech.
National | Everything above, plus FIPS-140-3 L2 validated modules · CUI / ITAR handling (US-manufactured) · HITRUST r2 controls. | Common Criteria components · air-gap delivery via data diode.
Trust artefacts you receive. Every engagement ends with: signed SBOM, cosigned images, a handed-over eval harness, a printed runbook, and a one-page security attestation you can drop into your own compliance packet.
Section 11 · The thesis

One line, if you only remember one.

If you own the data,
you should own the weights.

Everything else in this document is math that proves that sentence is affordable.

Appendix A

Glossary.

Terms used throughout this document. Short, buyer-flavoured definitions - enough to keep a cross-functional team using the same vocabulary.

Airgap
A deployment posture in which the Box makes zero outbound network requests. Updates arrive by signed USB or a one-way data diode. The strongest data-residency posture we ship.
Base model
The Gemma 4 open-weight model we fine-tune on your data (Gemma 4 4B / 12B / 27B / 70B depending on tier, or Gemma 4 Embedding for retrieval). Distributed by Google under the Gemma Terms of Use.
BF16 / FP16 / INT8 / INT4
Numeric precisions used to store and compute model weights. Lower precision = less VRAM, less energy, sometimes less quality. See Section 13.
Cosign / Sigstore
A tooling chain for cryptographically signing container images and artefacts, with a public transparency log. Used for every image and weight bundle we ship.
Data diode
A hardware device that physically only allows data flow in one direction. Used in extreme-airgap deployments to pull updates without allowing any egress.
DPA
Data Processing Addendum - the GDPR / Québec-Law-25-compliant contractual annex describing how a processor handles personal data on behalf of a controller.
DPO / ORPO
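Direct Preference Optimization and the related Odds Ratio Preference Optimization (ORPO) - training methods that align a model to a preferred style using paired examples, without a learned reward model. Not to be confused with the Data Protection Officer role, which this document also abbreviates as DPO.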
Eval harness
The versioned test suite that decides whether a new model / release / config ships. Committed in the customer's repo, runnable without the vendor.
HA pair
Two Boxes on a shared virtual IP, in active-passive (or active-active on Cluster) configuration, so the loss of one does not interrupt service.
HSM / TPM
Hardware Security Module / Trusted Platform Module - tamper-resistant hardware that holds encryption keys. The Box's default disk encryption binds to its TPM; customers can target an external HSM.
KV cache
Per-request memory used to store previously-computed attention keys and values during generation. Sized by batch × context length; often larger than the model weights at production concurrency.
LoRA / QLoRA
Low-Rank Adaptation - a parameter-efficient fine-tuning method that trains small adapter matrices on top of a frozen base model. It is the technique we use to teach Gemma 4 the firm's precedents and writing style without retraining the base weights. LoRA adapters are small (tens of MB), tied to the base-model size they were trained against, and easy to roll back. QLoRA keeps the frozen base model quantized during training so fine-tuning fits on smaller GPUs.
MSA
Master Services Agreement - the umbrella contract between Bilbs AI and a customer. SOWs reference it; all engagements inherit its clauses.
RAG
Retrieval-Augmented Generation - a pattern where the model is given a small, relevant slice of your corpus at query time via a retriever (vector DB + reranker), instead of being trained on the whole corpus.
RLHF / PPO
Reinforcement Learning from Human Feedback using Proximal Policy Optimization. The older approach to preference alignment; we use DPO instead in most cases.
SBOM
Software Bill of Materials - a machine-readable inventory of every component inside a build (libraries, versions, licences, hashes). Shipped alongside every Box release.
SFT
Supervised Fine-Tuning - the standard "train on instruction pairs" approach; the baseline for every customer model before any preference alignment.
SLO
Service-Level Objective - an internal target for availability / latency / quality. Distinct from an SLA (the externally-promised version).
SOC 2
A common US security-and-operations attestation. Bilbs AI ships with Type I at every tier; Type II attestation is available on request as part of an enterprise engagement.
SOW
Statement of Work - the scoped agreement for a specific engagement (Box purchase, Build, Audit, Embedded). Countersigned, referenced to the MSA.
Weight bundle
A signed, content-addressed archive containing the trained weights plus the config/tokenizer needed to serve them. Never unpacked on the Box until signatures verify.
Appendix B

Full TCO derivation.

Section 03 states the break-even formula. This appendix shows every assumption so a finance reviewer can rebuild the model in a spreadsheet and sign off - or challenge each line.

B.1 · The model.

Let:

Notation
N   = number of lawyers / users
U   = Bilbs subscription per seat      (CAD/mo)  - $29 (base) · $34 at 250+ seats · $29 at 500+ seats
T   = horizon                          (months)  - typically 36
H   = OEM hardware cost                (CAD)     - only when firm chooses Path B (we source the server)
Mc  = Microsoft 365 Copilot reference  (CAD/mo)  - ~$40 per seat

Bilbs subscription (TCO over T months):
  D(T) = N × U × T

Hardware:
  Path A (firm owns a GPU-capable server)    → H = 0 on the Bilbs side; the firm's existing asset is used.
  Path B (we source one at audit time)       → H = transparent OEM-cost quote, one-time, separate line.
                                                Firm owns the asset day one. No lease. No buy-out. No markup.

Copilot reference (T months):
  C(T) = N × Mc × T

Per-seat delta vs Copilot:
  U − Mc = $29 − $40 = −$11 / seat / month  (Bilbs is cheaper)
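For reviewers who would rather check the model in code than in a spreadsheet, the formulas above can be written as a few lines of Python. This is a sketch: the function names are ours, the seat bands are the ones listed under U in the notation, and Mc uses the approximate $40 reference.

```python
COPILOT_RATE = 40  # Mc, approximate Copilot reference per seat per month


def seat_rate(n_seats: int) -> int:
    """U: per-seat monthly rate from the bands in the notation block."""
    if n_seats >= 500:
        return 29
    if n_seats >= 250:
        return 34
    return 29


def subscription_tco(n_seats: int, months: int = 36) -> int:
    """D(T) = N × U × T"""
    return n_seats * seat_rate(n_seats) * months


def copilot_tco(n_seats: int, months: int = 36) -> int:
    """C(T) = N × Mc × T"""
    return n_seats * COPILOT_RATE * months


def delta_vs_copilot(n_seats: int, months: int = 36) -> int:
    """D(T) − C(T); negative means the Bilbs subscription is cheaper."""
    return subscription_tco(n_seats, months) - copilot_tco(n_seats, months)


for seats in (8, 100, 250, 500):
    print(seats, subscription_tco(seats), copilot_tco(seats), delta_vs_copilot(seats))
```

Hardware cost H is deliberately absent: on Path A it is zero on the Bilbs side, and on Path B it is a separate one-time line that a reviewer can add to D(T).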

B.2 · Default values.

Symbol | Meaning | Default | Source
N | Number of lawyers / seats | Firm-specific | Firm intake
U | Bilbs subscription per seat (one line on the invoice) | $29 / lawyer / mo (base) · $34 at 250+ seats · $29 at 500+ seats | Subscription terms
T | Horizon | 36 months (matches OEM warranty) | Standard analysis horizon
H | OEM hardware cost (Path B only; separate quote, not a Bilbs revenue line) | ~$9k (Foundation reference) · ~$22k (Practice) · ~$50k (Firm) · ~$300k (National A100) | OEM procurement quotes, 2026
Mc | Microsoft 365 Copilot reference | ~$40 / user / month | Microsoft 365 Copilot retail pricing, 2026

B.3 · Hardware path B — OEM-cost note.

Hardware path B is a one-time OEM-cost quote, not a lease. The firm owns the asset day one. Bilbs takes no margin on the GPU. The quote sits on its own page, separate from the Bilbs subscription invoice — the firm sees the OEM list price and decides before signing anything. The OEM’s 3-year warranty applies. If the firm already owns a GPU-capable server (Path A), this line does not exist at all.

B.4 · 3-year Bilbs subscription vs Microsoft Copilot, by firm size.

At $29 vs Copilot’s ~$40, the Bilbs subscription is $11 per seat per month cheaper. There is no longer a "sovereignty premium" to amortize away — there is a sovereignty discount. The firm gets Loi 25 compliance by architecture, fine-tuning on its own precedents, an on-prem audit log, and outright ownership of the model — and pays less on the per-seat invoice than Microsoft Copilot. Hardware is separate and audited up front (Path A or Path B); the numbers below show the Bilbs subscription only.

Firm size | Calculation (N × U × 36) | 3-yr Bilbs subscription | 3-yr Copilot (~$40/seat) | Sovereignty discount
8 lawyers | 8 × $29 × 36 | $8,352 | $11,520 | −$3,168 (Bilbs cheaper)
20 lawyers | 20 × $29 × 36 | $20,880 | $28,800 | −$7,920
30 lawyers | 30 × $29 × 36 | $31,320 | $43,200 | −$11,880
60 lawyers | 60 × $29 × 36 | $62,640 | $86,400 | −$23,760
100 lawyers | 100 × $29 × 36 | $104,400 | $144,000 | −$39,600
250 lawyers | 250 × $34 × 36 (volume band) | $306,000 | $360,000 | −$54,000
500 lawyers | 500 × $29 × 36 (volume band) | $522,000 | $720,000 | −$198,000

B.5 · Per-seat economics, side by side.

The per-seat picture is simple: Bilbs is $11 per seat per month cheaper than Microsoft Copilot at the base rate and in the 500+ band, and $6 cheaper in the 250–499 band. The headline is not cost savings; the headline is sovereignty plus a lower invoice line.

Seats | Bilbs per seat / mo | Copilot per seat / mo | Bilbs discount / seat / mo
5–249 | $29 | ~$40 | −$11 / seat / mo
250–499 | $34 | ~$40 | −$6 / seat / mo
500+ | $29 | ~$40 | −$11 / seat / mo

What the firm gets on top of the lower per-seat line: Loi 25 compliance by architecture, model trained on the firm’s precedents, on-prem audit log, weights owned outright, hardware owned outright, < 4 h support, 24/7 Sev-1 pager. None of those exist on the Copilot side at any price.

B.6 · Costs we deliberately do not credit.

  • Productivity gains. We don't book "AI makes engineers 18% faster" against the TCO. It's likely true; it's also unfalsifiable in a whitepaper. If your CFO wants it, add it as a separate line.
  • Avoided egress. Data-transfer savings when the model stops pulling corpora out of your cloud bucket on every request. Real, often 5–15% of the invoice, but deal-specific.
  • Insurance premium deltas. Some cyber-insurance carriers offer material discounts for "customer data never leaves the perimeter" postures. Measured; never included by default.
  • Compliance deal velocity. "Enterprise deal closes 3 weeks faster" is arguably the single biggest dollar figure for most buyers. We leave it out of the headline and mention it qualitatively.
Appendix C

Sample eval harness.

Illustrative layout of the eval harness we hand over at the end of deployment. Yours will be customised to your domain - this is the shape, not the content.

Repository layout.

/evals/
evals/
├── README.md
├── run.sh                          # single entry point, no vendor deps
├── config/
│   ├── models.yaml                 # candidate models + precisions
│   └── thresholds.yaml             # ship criteria (golden ≥95%, etc.)
├── golden/
│   ├── support_triage.jsonl        # 48 cases · exact-match graders
│   ├── schema_extraction.jsonl     # 62 cases · JSON-Schema validator
│   └── refusal.jsonl               # 31 cases · must-refuse graders
├── rubric/
│   ├── tone.jsonl                  # 120 cases · judge-graded
│   ├── citation.jsonl              # 80 cases · judge-graded
│   └── prompts/judge.md            # printed rubric
├── adversarial/
│   ├── jailbreaks.jsonl            # 24 cases
│   ├── prompt_injection.jsonl      # 18 cases
│   ├── pii_leakage.jsonl           # 22 cases
│   └── long_context_distractors.jsonl
├── shadow/
│   └── sampler.py                  # prod-traffic sampling (DPA-scoped)
└── reports/
    └── .gitkeep                    # per-run JSON + HTML reports land here

A golden case, in full.

golden/support_triage.jsonl · one line
{
  "id": "ST-0047",
  "tags": ["billing", "refund", "high-value-account"],
  "input": "Hi, I'm a Pro customer and I was double-charged last Tuesday. Account 88421. Please fix.",
  "expected": {
    "intent": "billing.refund_request",
    "severity": "P2",
    "required_fields": ["account_id"],
    "tools_allowed": ["lookup_account", "check_invoice"],
    "tools_forbidden": ["issue_refund"],
    "must_contain": ["88421"],
    "must_not_contain": ["I will refund you now"]
  },
  "grader": "structured_output_v1"
}
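To make the `expected` block concrete, here is one way a grader could check a model response against it. This is a sketch, not the shipped `structured_output_v1` grader: the shape of `output` (keys `intent`, `severity`, `fields`, `tool_calls`, `text`) is an assumption made for illustration.

```python
def grade_case(expected: dict, output: dict) -> list:
    """Return a list of failure reasons; an empty list means the case passes."""
    failures = []
    for key in ("intent", "severity"):
        if key in expected and output.get(key) != expected[key]:
            failures.append(f"{key}: expected {expected[key]!r}, got {output.get(key)!r}")
    fields = output.get("fields", {})
    for name in expected.get("required_fields", []):
        if name not in fields:
            failures.append(f"missing required field {name!r}")
    text = output.get("text", "")
    for needle in expected.get("must_contain", []):
        if needle not in text:
            failures.append(f"reply must contain {needle!r}")
    for needle in expected.get("must_not_contain", []):
        if needle in text:
            failures.append(f"reply must not contain {needle!r}")
    called = set(output.get("tool_calls", []))
    forbidden = called & set(expected.get("tools_forbidden", []))
    if forbidden:
        failures.append(f"forbidden tools called: {sorted(forbidden)}")
    if "tools_allowed" in expected:
        extra = called - set(expected["tools_allowed"]) - forbidden
        if extra:
            failures.append(f"tools outside the allow-list: {sorted(extra)}")
    return failures
```

Returning reasons rather than a bare boolean keeps the per-run report readable when a case regresses.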

A rubric case, in full.

rubric/tone.jsonl · one line
{
  "id": "TN-0112",
  "input": "A VP of Sales writes angrily about a product outage that wasn't actually ours.",
  "rubric": [
    { "criterion": "acknowledges frustration without sycophancy",
      "weight": 0.3 },
    { "criterion": "states facts that locate the fault correctly",
      "weight": 0.4 },
    { "criterion": "offers a concrete next step",
      "weight": 0.2 },
    { "criterion": "closes without over-promising",
      "weight": 0.1 }
  ],
  "judge_model": "llama-3.3-70b-it",
  "pass_threshold": 0.8
}
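A rubric case of this shape is typically scored as a weighted sum of per-criterion judge scores. The sketch below assumes the judge returns a score in [0, 1] for each criterion, keyed by the criterion text; that return format and the function names are assumptions, not the harness's actual interface.

```python
import math


def rubric_score(rubric: list, judge_scores: dict) -> float:
    """Weighted sum of judge scores; the weights are expected to total 1.0."""
    total_weight = sum(item["weight"] for item in rubric)
    if not math.isclose(total_weight, 1.0):
        raise ValueError(f"rubric weights sum to {total_weight}, expected 1.0")
    return sum(item["weight"] * judge_scores[item["criterion"]] for item in rubric)


def rubric_passes(rubric: list, judge_scores: dict, pass_threshold: float) -> bool:
    """A case passes when its weighted score reaches the case's threshold."""
    return rubric_score(rubric, judge_scores) >= pass_threshold
```

With the weights in TN-0112, a response that scores full marks on the first three criteria and zero on the last lands at about 0.9 and passes the 0.8 threshold; missing "states facts that locate the fault correctly" alone caps the score at 0.6.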

How run.sh is wired.

run.sh · condensed
#!/usr/bin/env bash
set -euo pipefail

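MODEL=${MODEL:-$(yq '.primary.model' config/models.yaml)}
PRECISION=${PRECISION:-$(yq '.primary.precision' config/models.yaml)}
# JUDGE must be set by the caller: under `set -u` the rubric run below
# would otherwise abort on an unbound variable.
: "${JUDGE:?set JUDGE to the judge model id used by the rubric suite}"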

python -m bilbs.eval run \
  --suite golden       --out reports/golden.json
python -m bilbs.eval run \
  --suite rubric       --judge "$JUDGE" \
  --out reports/rubric.json
python -m bilbs.eval run \
  --suite adversarial  --out reports/adversarial.json
python -m bilbs.eval run \
  --suite shadow       --sample-from production \
  --window 24h --out reports/shadow.json

python -m bilbs.eval gate \
  --thresholds config/thresholds.yaml \
  reports/*.json

The gate.

config/thresholds.yaml
ship:
  golden:
    pass_rate_min: 0.95
  rubric:
    pass_rate_min: 0.85
    p10_min: 0.7                 # tail must not regress
  adversarial:
    refused_or_contained_min: 0.9
  shadow:
    delta_vs_last_release: ">= -0.02"   # 2pp cliff = block
block_on:
  - pii_leakage.any_hit
  - license_violation.any_hit
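The gate logic these thresholds imply can be sketched in a few lines. This is an illustration, not the `bilbs.eval gate` implementation: the shape of `results` (per-suite metrics plus a `flags` map) is assumed, and the thresholds are passed as an already-parsed dict so the example needs no YAML library.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}


def gate(thresholds: dict, results: dict) -> list:
    """Return the reasons a release is blocked; an empty list means ship."""
    reasons = []
    for suite, rules in thresholds["ship"].items():
        for rule, limit in rules.items():
            if isinstance(limit, str):
                # String bounds such as ">= -0.02" carry their own operator.
                symbol, raw = limit.split()
                metric, check, bound = rule, OPS[symbol], float(raw)
            else:
                # Keys ending in "_min" are lower bounds on the named metric.
                metric, check, bound = rule.removesuffix("_min"), operator.ge, float(limit)
            observed = results[suite][metric]
            if not check(observed, bound):
                reasons.append(f"{suite}.{metric}={observed} violates {rule}: {limit}")
    for flag in thresholds.get("block_on", []):
        if results.get("flags", {}).get(flag, False):
            reasons.append(f"hard block: {flag}")
    return reasons
```

Keeping `block_on` separate from the numeric thresholds means a single PII leak blocks the release even when every pass rate is green.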
In your repo, in your CI. The harness runs on every model swap, every config change, and on a nightly cron. When we disappear or you replace us, the eval is what keeps you honest - not a promise in a slide.
Appendix D

Core MSA / DPA clauses.

Plain-English summaries of the clauses that show up in every engagement. This is not legal advice; the controlling text lives in the signed MSA + DPA your counsel will receive.

D.1 · Ownership & assignment.

MSA § 7 - Ownership (summary)
On final payment for a given SOW, Bilbs AI assigns to
Customer, free and clear, worldwide, perpetual and
irrevocable, all right, title and interest in and to:

 (a) the fine-tuned model weights produced under that SOW
     (identified by SHA-256 and enumerated in Schedule A);

 (b) the orchestration layer, admin UI, eval harness, and
     build manifests produced under that SOW;

 (c) all customer-specific configuration, IaC, and runbooks
     produced under that SOW.

Bilbs AI retains no residual licence to the above. Customer
grants Bilbs AI a limited, terminable, feedback-only licence
to improve generic, non-customer-specific tooling; this
licence does not extend to weights, data, configs, or any
artefact identifiable to Customer.

D.2 · Termination & survival.

MSA § 11 - Termination (summary)
Either party may terminate for convenience on thirty (30)
days' written notice after Final Invoice, or immediately for
uncured material breach.

Upon termination, Bilbs AI shall not:
 (i)  disable, brick, or otherwise render inoperative any
      Customer Box, weight bundle, or orchestration artefact;
 (ii) exercise any remote kill-switch or licence-revocation
      mechanism; or
 (iii) retain access to Customer systems beyond the wind-down
      period specified in the SOW.

Customer data is returned or destroyed per Customer's written
instruction within thirty (30) days. Sections 7 (Ownership),
9 (Confidentiality), 12 (Indemnity) and 15 (Limitation of
Liability) survive termination.

D.3 · Data processing (DPA core).

DPA §§ 3–5 - Processing (summary)
Controller : Customer
Processor  : Groupe Bilbs inc. (Bilbs AI)
Sub-processors: enumerated in DPA Schedule 2; written notice
                required 30 days before any change.

Purpose : only the tasks enumerated in the then-current SOW
          (fine-tune, validate, operate Support/Updates). No
          processing for Bilbs AI's own model-improvement.

Locations : Québec, QC (default); EU-only or US-person-only
            available on request. Customer data never leaves
            Customer's Box in production.

Security : AES-256 at rest, TLS 1.3 in transit, envelope keys
           bound to Box TPM or Customer HSM, least-privilege
           IAM on all processor-side systems.

Breach : notice to Customer within 24 hours of discovery of
         any Personal Data incident, with the minimum fields
         required by GDPR Art. 33 / Law 25.

Assistance : Processor will assist Controller with DSAR,
             DPIA, and regulator engagement at no cost, up
             to the hours enumerated in the SOW.

D.4 · SLAs (opt-in Support plan only).

Support plan - response & uptime
Sev-1  (production outage, no fallback available)
  First response   : < 15 min, 24/7
  Update cadence   : every 30 min until mitigated
  Target mitigation: < 4 h

Sev-2  (degraded, fallback available)
  First response   : < 1 h, business hours; < 4 h otherwise
  Target fix       : next maintenance window

Sev-3  (question, minor defect)
  First response   : < 1 business day

Uptime target (Pro + Cluster): 99.9% rolling 30-day, excluding
agreed maintenance windows. Service credits per Schedule C on
any breach.
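
To make the uptime target concrete, the sketch below converts a percentage over a rolling window into a downtime budget. It assumes the window is measured in whole calendar days and that downtime inside agreed maintenance windows has already been excluded by the caller. How maintenance time is treated in the denominator is a Schedule C matter and is not modelled here. The function names are illustrative only.

```python
from decimal import Decimal

SECONDS_PER_DAY = 86_400


def downtime_budget_seconds(uptime_percent: str, window_days: int = 30) -> Decimal:
    """Seconds of unplanned downtime permitted by an uptime target.

    The percentage is passed as a string (e.g. "99.9") so the arithmetic
    stays exact in Decimal rather than picking up float rounding.
    """
    window_seconds = Decimal(window_days) * SECONDS_PER_DAY
    return window_seconds * (Decimal(100) - Decimal(uptime_percent)) / Decimal(100)


def breaches_target(downtime_seconds: int, uptime_percent: str, window_days: int = 30) -> bool:
    """True when measured unplanned downtime exceeds the budget for the window."""
    return Decimal(downtime_seconds) > downtime_budget_seconds(uptime_percent, window_days)


budget = downtime_budget_seconds("99.9")
print(f"99.9% over 30 days allows {budget} s ({budget / 60} min) of downtime")
```

At 99.9% over 30 days the budget is 2,592 seconds, or 43.2 minutes, which is useful context when reading the Sev-1 mitigation target of under 4 hours: a single Sev-1 that runs to the target would exceed the monthly budget several times over.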

D.5 · Limitation of liability (headline).

MSA § 15 - LOL (summary)
Aggregate liability of Bilbs AI under the MSA and all SOWs is
capped at the fees paid by Customer under the SOW giving rise
to the claim in the twelve (12) months preceding the event.

Carve-outs (uncapped): indemnity for third-party IP
infringement caused by materials Bilbs AI shipped; gross
negligence; wilful misconduct; breach of confidentiality.

Neither party is liable for indirect, consequential or
special damages, except where the carve-outs above apply.
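
For a finance reviewer, the cap reduces to a trailing twelve-month sum of fees on one SOW. The sketch below illustrates that arithmetic only and is not legal advice. The `Payment` record, the function names, and the boundary convention (payments on or after the date twelve months before the event, and strictly before the event date) are assumptions. The signed MSA governs how the window is actually measured, and the uncapped carve-outs sit outside this calculation entirely.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Payment:
    """Hypothetical ledger row: one fee payment made under a given SOW."""
    sow_id: str
    paid_on: date
    amount: Decimal


def twelve_months_before(day: date) -> date:
    """Same calendar day one year earlier; 29 Feb falls back to 28 Feb."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:  # 29 February has no counterpart in a non-leap year
        return day.replace(year=day.year - 1, day=28)


def liability_cap(payments: list[Payment], sow_id: str, event_date: date) -> Decimal:
    """Sum fees paid under `sow_id` in the twelve months preceding the event."""
    window_start = twelve_months_before(event_date)
    return sum(
        (
            p.amount
            for p in payments
            if p.sow_id == sow_id and window_start <= p.paid_on < event_date
        ),
        Decimal("0"),
    )
```

Fees paid under other SOWs, or more than twelve months before the event, do not raise the cap under this reading, so a long engagement with a small recent SOW can carry a lower cap than its total contract value suggests.
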
Redlines welcome. These defaults land unchanged in 80% of signed MSAs. When they don't, we redline in Word with tracked changes and send rationale inline. The contract is a tool for closing deals, not a negotiation tax.
References

References & further reading.

Case studies cited.

  1. JPMorgan Chase's LLM Suite drives AI transformation across the enterprise. The Digital Banker, 2024.
  2. Morgan Stanley. Key milestone in innovation journey with OpenAI. Press release, 2023.
  3. Wu et al. BloombergGPT: A Large Language Model for Finance. arXiv 2303.17564, 2023.
  4. Klarna case study - in-house customer-service AI. 2024 industry coverage.
  5. Meta Engineering. Building Meta's GenAI infrastructure. 2024.
  6. Apple Security. Private Cloud Compute. 2024.

Technical foundations.

  1. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2106.09685.
  2. Dettmers et al. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2305.14314.
  3. Rafailov et al. Direct Preference Optimization. arXiv 2305.18290.
  4. Frantar et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2210.17323.
  5. Lin et al. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. arXiv 2306.00978.
  6. Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2005.11401.
  7. Google. Gemma 4: Open-weight models for production deployment. Google AI, 2026.

Compliance & governance.

  1. AICPA. SOC 2® - SOC for Service Organizations: Trust Services Criteria. 2017 (revised 2022).
  2. EU Regulation 2016/679 (General Data Protection Regulation).
  3. Québec. Act respecting the protection of personal information in the private sector, as amended by Law 25.
  4. US HIPAA Privacy & Security Rules (45 CFR Parts 160, 162, and 164).
  5. NIST FIPS PUB 140-3. Security Requirements for Cryptographic Modules.
  6. NIST SP 800-171 / 800-172. Protecting Controlled Unclassified Information (CUI).

Bilbs AI documentation.

  1. The Bilbs Box - long-form explainer
  2. Model selection guide
  3. Offers overview
  4. Updates plan · Support plan
  5. Build process · Build deliverables
  6. Security & secrets · Eval harness

This document is v1.0, April 2026. Corrections and counter-arguments welcome at info@groupebilbs.com. Material changes ship as a new version with a dated diff.


Keep reading.

Long-form product explainer · Model selection guide · Offers overview.

About Bilbs AI. Groupe Bilbs inc. is a Québec corporation operating the Bilbs AI studio, based in Montréal. Trade name: Bilbs AI. NEQ 1179826558. Contact: info@groupebilbs.com · legal@groupebilbs.com (procurement / security / DPA).

Next step

Hold a build slot. Or print this and sleep on it.

No payment today. No sales team. No follow-up in 48 hours unless you ask for one. We hold a build slot in the Q3 2026 batch and send you an exact quote within one working day.