What this document argues, in one page.
Between 2024 and 2026 the economics of serving a language model on your own data, on your own hardware crossed the threshold where it beats a cloud-API invoice for any team whose AI spend exceeds roughly $3,000 / month. Open-weight base models now match or exceed frontier-API quality on narrow tasks after a modest fine-tune; GPU prices have fallen as enterprise-secondary-market Ada/Hopper silicon arrived; and the compliance pressure on "your customers' data leaves your perimeter every 100 ms" has intensified on both sides of the Atlantic.
This document is a buyer's guide for the resulting decision. It does four things:
- Quantifies the trade-off. The headline break-even table (Section 03) and its full derivation (Appendix B) let a finance reviewer rebuild the model in a spreadsheet and contest every assumption line-by-line.
- Narrows the model choice. A short list of open-weight families we actually ship (Section 05) and the fine-tuning, quantization and evaluation methodology behind them (Sections 12–14) - the parts most vendors elide.
- Makes operations concrete. Deployment topologies (Section 15), observability (16), security posture (17), updates (18), and DR (19) with the same level of detail a serious SRE would demand of their own stack.
- Closes on procurement. Vendor checklist (07), decision matrix (08), timeline (09), payment & IP (20), data governance (21), compliance matrix (10). Plus appendices with the core MSA/DPA clauses and a sample eval harness.
The conclusion is the same one JPMorgan, Morgan Stanley, Bloomberg, Klarna, Meta and Apple have publicly reached for their own deployments: if you own the data, you should own the weights. Everything after this page is the math and the operational detail that makes that sentence affordable for a 25-person team rather than a 250,000-person bank.
How to read this document.
Document conventions.
- Numbers are live, not aspirational. Every dollar figure comes from either our own shipped engagements or a cited public source. If a number is modelled rather than measured, it says so.
- Sources are linked inline and enumerated at the end in References.
- The tone is "not a pitch." Section 04 tells you when not to buy; Section 23 (callouts) flags places where we've lost deals by being too honest.
- Print friendly. Every table, callout and pull quote is page-break-safe. "Save as PDF" from the top-right of the page produces a document that belongs in a procurement binder.
The 60-second version.
The problem: The firm’s lawyers already use AI - just not through the firm. ChatGPT in browser tabs, Copilot in the M365 tenant, screenshots of client matters pasted into prompt windows. The result is confidentiality risk, fragmented workflows, no governance, no audit trail, and no centralized institutional memory. The problem is not AI; the problem is the absence of infrastructure around it.
The shift: Google’s Gemma 4 open-weight family now matches GPT-4 on narrow domain tasks after fine-tuning. GPU prices dropped 40% as enterprise secondary-market Ada/Hopper silicon hit the channel. The economics have inverted.
The offer: A private AI server installed in the firm’s own office, trained on the firm’s own files. One line on the invoice: from $29 per lawyer per month. That covers the audit, the deployment, indexing, on-site training, updates, model refresh, security patches, monitoring, backups, the audit log, < 4 h support, and the 24/7 Sev-1 pager. Below Microsoft Copilot (~$40 / seat). The price goes up only if the firm chooses optional services (storage expansion, fine-tuning, multi-office sync, transcription credits, HA second server, dedicated SRE, scoped SOC 2). Hardware is audited separately: if the firm already owns a GPU-capable server we deploy onto it (Path A); if it doesn’t, we source one on a transparent OEM-cost quote with the OEM’s 3-year warranty (Path B). The firm owns the hardware outright either way — no lease, no buy-out, no markup on the GPU. The audit is free, delivered on the first 20-minute call. The firm is not paying Bilbs for GPUs or models — the firm is paying for confidentiality, governance, and operational leverage.
Four hardware references, one subscription, one architecture.
Tier names below are hardware references mapped to firm size, not pricing tiers. Every firm pays the same per-lawyer line.
| Hardware reference | Firm size | Indicative hardware | Subscription |
|---|---|---|---|
| Foundation | 5–10 lawyers | 1× RTX 5090 · Ryzen 9 9950X · 128 GB RAM | $29 / lawyer / mo |
| Practice | 10–30 lawyers | 2× RTX 5090 · Threadripper PRO · 256 GB ECC | $29 / lawyer / mo |
| Firm | 30–100 lawyers | 2× RTX 6000 Ada / A100 · dual EPYC · 512 GB ECC | $29 / lawyer / mo |
| National | 100+ lawyers | 8× A100 / H200 / B200 · half-rack · FIPS-ready | from $29 / lawyer / mo (volume: $34 @ 250+, $29 @ 500+) |
How the math works: One line on the invoice — from $29 per lawyer per month. Hardware is audited separately and sits on a separate, transparent quote at OEM cost (or stays off the Bilbs invoice entirely if the firm already owns the server). Bilbs does not mark up the GPU. Subscription margin is the only line Bilbs earns on — and it’s below Microsoft Copilot.
The decision in one sentence.
If your firm has data that can’t leave the perimeter under Loi 25, the Box is on-prem sovereignty below Microsoft Copilot: from $29 / lawyer / month — one line on the invoice. Hardware is audited separately — either deploy onto a server the firm already owns (Path A) or we source one on a transparent OEM-cost quote (Path B). The firm owns the hardware outright either way; no lease, no buy-out clause. Pays for itself in under one day of recovered billable time at any firm above 10 lawyers.
How it all fits together.
A visual map of the Bilbs Box stack - from user request to model response, with all the pieces that make it production-ready.
Component responsibilities.
Where Bilbs AI fits.
The private AI market in 2026 has four quadrants. Most vendors occupy one; we occupy the underserved intersection of "on-prem" and "turnkey."
| Vendor | Model | Location | Integration | Trade-off |
|---|---|---|---|---|
| OpenAI / Anthropic API | Pay-per-token | Cloud | Self-serve | Data leaves perimeter; costs scale with revenue |
| Azure OpenAI / Bedrock | Pay-per-token | Cloud (private VPC) | Managed | Still cloud; still recurring; still their models |
| Lambda Labs / CoreWeave | GPU rental | Cloud | DIY | You build everything; 6–12 months to production |
| NVIDIA DGX | Capex | On-prem | Hardware only | $200k+ entry; no fine-tune service; overkill for most |
| Hugging Face + vLLM (DIY) | OSS | On-prem | DIY | High failure rate; requires ML team; no support |
| Bilbs AI | Capex | On-prem | Turnkey | $14k entry; 6-week ship; weights + IP assigned |
Why the appliance model wins.
- For regulated industries: "Data never leaves your building" is a checkbox, not a negotiation.
- For finance teams: Capex amortizes over 3–5 years; cloud is opex that scales with success.
- For eng teams without ML expertise: Ship in 6 weeks, not 6 months. The fine-tune, eval, and runbook are included.
- For procurement: Fixed price, signed SOW, clear IP assignment. No "true-up" surprises.
What's inside.
Fourteen core sections plus four appendices. Every table, every number, every source link is here so this document stands on its own - hand the PDF to a CFO and they'll get the economics on the first pass.
What's changed since 2024.
Two things made private AI cheap enough for a 25-person team: open-weight models caught up to frontier-API quality on narrow tasks, and GPU prices dropped when Ada / Hopper silicon hit enterprise secondary markets. A fine-tuned Gemma 4 12B or Gemma 4 27B now matches or beats GPT-4-class output on domain tasks - a fact every published case study in the next section confirms.
In parallel, most teams' AI bills doubled every two quarters. The cloud-API model that made sense at 10k tokens/day breaks at 10M tokens/day. The economics have finally inverted.
you should own the weights.
The proof - public case studies.
None of these are Bilbs customers. All of them are public companies that published the same conclusion: for your own data, on your own tasks, an in-house LLM wins.
| Company | What they built | Why | Source |
|---|---|---|---|
| JPMorgan Chase | LLM Suite for 230k employees | Banking compliance; data cannot leave the firm | The Digital Banker |
| Morgan Stanley | AI @ MS Assistant for 16k advisors | 100k-document internal corpus; 98% adoption | Morgan Stanley press |
| Bloomberg | BloombergGPT (50B, 363B tokens) | 40 years of proprietary financial data | arXiv 2303.17564 |
| Klarna | In-house customer-service AI | Absorbed workload of 700 FTE agents | industry coverage |
| Meta | CodeCompose / Metamate | Internal code assistant on proprietary monorepo | Meta Engineering |
| Apple | Apple Intelligence + Private Cloud Compute | On-device by default; private cloud for heavy ops | Apple Security |
The TCO math, laid bare.
Every private-AI evaluation eventually becomes a break-even calculation. Here's the formula:
Hardware is a separate line, audited up front.
3-year subscription cost by firm size, at the standard $29 / lawyer / month (Copilot reference: ~$40 / seat / mo in US/EU cloud):
| Firm size | Calculation | 3-yr total Bilbs subscription | 3-yr Copilot (same seats) |
|---|---|---|---|
| 30 lawyers | 30 × $29 × 36 mo | $31,320 | ~$43,200 + Loi 25 exposure |
| 60 lawyers | 60 × $29 × 36 mo | $62,640 | ~$86,400 + Loi 25 exposure |
| 80 lawyers | 80 × $29 × 36 mo | $83,520 | ~$115,200 + Loi 25 exposure (US/EU compute) |
| 150 lawyers | 150 × $29 × 36 mo | $156,600 | ~$216,000 + Loi 25 exposure |
| 500 lawyers | 500 × $29 × 36 mo (volume band) | ~$522,000 | ~$720,000 + Loi 25 exposure |
Hardware is a separate line. Most firms 30+ already own GPU-capable servers; smaller firms get a transparent OEM-cost quote at audit time. The Bilbs invoice stays one line — from $29 / lawyer / month — regardless.
What you’re actually buying. $29 vs $40 per seat. The savings vs Copilot are not the headline — the headline is data sovereignty and the firm owning the model. The per-lawyer line covers Loi 25 compliance by architecture (no cross-border at all), fine-tuning on the firm’s own precedents (which Copilot cannot do), an on-premise audit log (which Copilot cannot deliver), the model weights and source code on the firm’s artifact registry, and the institutional memory layer indexed against the DMS and ERP. The hardware sits on the firm’s own books, owned outright on day one. Bilbs is not the chatbot; Bilbs is the secure infrastructure around the chatbot.
What the formula doesn't capture (yet).
- Latency wins. LAN-local inference is 80–250 ms faster than a round-trip to
us-east-1. At 1M interactions / month that's 20–70 hours of user time saved per month. - Compliance savings. Not moving data across a trust boundary is often the difference between a 4-week enterprise deal and a 4-month one. We've watched deals unblock inside a single procurement meeting once the "airgap" box got ticked.
- Pricing resilience. API pricing is set by someone else. Weights you own do not reprice.
When private AI is not the answer.
We'd rather tell you to skip a Box than sell you one you don't need.
- You process less than $1,000 / month in cloud tokens today and don't project 2× growth in 12 months. Pay the cloud bill. Revisit when the invoice crosses $3k.
- You need a frontier-scale generalist model (best-in-class at math, multimodal reasoning, cross-domain tasks). Frontier APIs still win on open-ended generalists. Private AI wins on your tasks, not all tasks.
- Your team has no one who can operate a server and you can't afford a managed Support plan. The Box is "plug it in, done" - but "done" still requires someone who can rack a server.
- You don't have enough domain data to fine-tune on (roughly < 500 curated examples). Use RAG on a cloud API first; revisit when your corpus grows.
What customers actually build.
Abstract value props are easy; concrete deployments are persuasive. Here are five archetypes drawn from real deployments (customer details anonymized per our NDA).
1. Enterprise deal unblock (SaaS, Series B)
2. Clinical documentation (Healthtech)
3. Internal knowledge assistant (Professional services, 200 employees)
4. Code review automation (Fintech, 80 engineers)
5. Regulatory compliance (Insurance)
Gemma 4 model family · the only LLM family we ship
Bilbs deploys exclusively from Google’s Gemma 4 open-weight family - one model lineage, one licence, one upgrade path. Rationale and rejected alternatives in the model selection guide.
| Variant | Size | Licence | Best at |
|---|---|---|---|
| Gemma 4 4B | 4B dense | Gemma Terms of Use | Fast classification, intake routing, structured extraction |
| Gemma 4 12B | 12B dense | Gemma Terms of Use | Default serving model for Foundation tier; RAG, summaries, Q&A |
| Gemma 4 27B | 27B dense | Gemma Terms of Use | Default serving model for Practice tier; drafting, multi-step reasoning |
| Gemma 4 70B | 70B dense | Gemma Terms of Use | Firm + Cluster tier; high-stakes drafting, complex tool use, long-context |
| Gemma 4 Embedding | Encoder · 1024-dim | Gemma Terms of Use | Retrieval over the firm’s DMS, multilingual (FR/EN) |
Default variant per use case.
| Use case | Default Gemma 4 variant | Why |
|---|---|---|
| Internal chat assistant (Foundation / Practice) | Gemma 4 12B or 27B | 12B fits the Foundation tier comfortably quantised; 27B is the Practice default at FP16 |
| RAG over the firm’s corpus | Gemma 4 27B + Gemma 4 Embedding | 27B’s instruction-following + multilingual FR/EN embeddings out of the same family |
| Multi-step agents with tools | Gemma 4 70B | Strong tool-use, JSON fidelity, long-context (Firm / Cluster tier) |
| Legal drafting (memos, opinions, contracts) | Gemma 4 70B fine-tuned (firm LoRA) | Highest-quality drafting model, adapts well to firm’s precedents |
| Long-context summarisation | Gemma 4 70B | 128k context, domain fine-tune adapts well to legal / clinical corpora |
| PII redaction / structured extraction | Gemma 4 4B | Small, sharp, deterministic on structured outputs - runs as a side agent |
| Intake triage / matter classification | Gemma 4 4B | Sub-100ms classification, cheap to run in parallel with the serving model |
Hardware sizing & environment.
Sized by concurrent users ("concurrent" = users actively generating tokens at the same wall-clock moment, not accounts):
These are hardware references mapped to firm size — not pricing tiers. Subscription is the same flat per-lawyer line at every reference.
| Hardware reference | Firm size | Indicative hardware spec | Typical placement |
|---|---|---|---|
| Foundation | 5–10 lawyers | 1× RTX 5090 (32 GB) · Ryzen 9 9950X · 128 GB DDR5 · 8 TB NVMe (RAID 1) | Tower with rack-ear kit · supply closet |
| Practice | 10–30 lawyers | 2× RTX 5090 (64 GB combined) · Threadripper PRO · 256 GB DDR5 ECC · 8 TB NVMe RAID 10 · water-cooled | Desktop / office server room |
| Firm | 30–100 lawyers | 2× RTX 6000 Ada / A100 (96–160 GB combined HBM) · dual EPYC 9354 · 512 GB DDR5-4800 ECC · 4U with enterprise redundancy | Server room / small DC |
| National | 100+ lawyers | 8× A100 SXM base (or H200 / B200 upgrade) · dual EPYC 9554 · 1.5 TB DDR5-4800 ECC · half-rack · FIPS-ready | Datacenter / sealed 42U |
Hardware is audited separately. If the firm already runs a GPU-capable server (most 30+ firms do), we deploy onto it — Path A — and nothing extra hits the Bilbs invoice. If not, we source the reference spec on a separate, transparent OEM-cost quote with the OEM’s 3-year warranty — Path B. Bilbs does not mark up the GPU. The firm owns the hardware outright either way. The Bilbs invoice stays one line: from $29 / lawyer / month.
Power & placement at a glance.
| Hardware reference | Idle W | Peak W | Circuit | Placement |
|---|---|---|---|---|
| Foundation | 150 | 520 | 110–240 V / 15 A (office outlet) | Tower with rack-ear kit / supply closet |
| Practice | 320 | 1,050 | 2× 208–240 V / 20 A | Desktop / office server room (water-cooled) |
| Firm | 520 | 2,100 | 2× 208–240 V / 30 A | Server room / small DC (4U rack) |
| National | 1,800 | 9,000 | 3-phase 208 V / 60 A × 2 | Datacenter only (half-rack or sealed 42U) |
Fine-tuning approaches.
The word "fine-tune" collapses four meaningfully different operations. Choosing the right one determines training cost, inference cost, and how well the final model behaves when your data drifts.
The four approaches we ship.
| Approach | What changes | Typical cost | When to use |
|---|---|---|---|
| Full SFT Supervised fine-tune, all weights |
Every parameter in the base model | 8–48 h on Cluster; hundreds of GPU-hours | Large corpus (≥ 100k examples), major domain shift, regulatory domains where partial-weight deltas complicate audit. |
| LoRA / QLoRA Low-rank adapters |
A small set of rank-16/32/64 matrices grafted onto attention + MLP layers | 2–12 h on Firm-spec hardware; tens of GPU-hours | Default choice. 90% of our deployments ship with LoRA/QLoRA because the delta is easily versioned, rolled back, and stacked. |
| DPO / ORPO Preference alignment |
Adjusts the model's probability mass between paired (chosen, rejected) answers - no reward model required | Usually runs on top of SFT or LoRA, adds 1–4 h | Style, tone, refusal calibration, hallucination reduction. Needs ≥ 500 clean preference pairs to beat the SFT baseline. |
| Continued pre-training | Next-token loss on raw domain text (not instruction pairs) | 12–60 h on Cluster | Very new vocabulary (legal codes in a minority language, clinical sub-speciality). Rare - we've done it twice. |
Data we need, in order of impact.
- 200–2,000 high-quality instruction pairs drawn from real interactions. Quality beats quantity - 500 hand-reviewed pairs outperform 50,000 scraped ones almost every time.
- A "golden set" of 30–100 adversarial test cases authored by a domain expert and never seen during training. These drive the eval harness (Section 14).
- A refusal set - 50–200 examples of requests the model should decline or escalate. Without this the model acquires your company's "yes-and" culture a little too literally.
- Domain corpora for retrieval (RAG). Usually orders of magnitude larger than the fine-tune data, stored in the vector DB rather than baked into weights.
What we do not do.
- Pre-train from scratch. Burning 10 days of H100 time to reach a worse Gemma 4 4B is a science project, not a product.
- RLHF with a learned reward model. DPO gets 85% of the win with a fraction of the moving parts. We'll do PPO-style RLHF only if a customer brings a reward model they already trust.
- Blind distillation from a cloud model's outputs. Licence-questionable at best, and it couples your private model to a competitor's version history. We build on open weights and your own data.
Quantization & the VRAM equation.
Quantization is how a 70B model fits on a 48 GB GPU. Understanding the trade-off in one page saves weeks of architecture debate.
The back-of-envelope.
+ KV cache (~20–60% of weights at prod context lengths)
| Precision | Bytes/param | Quality vs. BF16 | Notes |
|---|---|---|---|
| FP32 | 4 | Baseline | Training only; never shipped to inference. |
| BF16 / FP16 | 2 | Baseline | Default training precision; what the original weights ship in. |
| INT8 (W8A16) | 1 | ~99% | Very safe. Free win if you have the VRAM. |
| INT4 (AWQ / GPTQ) | 0.5 | ~97–98% | Our production default. Minimal eval drop for most domains. |
| INT3 / 2-bit (QuIP#) | 0.25–0.375 | ~90–94% | Only when VRAM is the binding constraint and eval confirms it still passes. |
Sample footprints.
| Model | FP16 | INT8 | INT4 | Fits on … |
|---|---|---|---|---|
| Gemma 4 4B | 8 GB | 4 GB | 2.5 GB | Any tier, any precision · ideal side agent |
| Gemma 4 12B | 24 GB | 12 GB | 7 GB | Foundation (FP16), Practice/Firm/Cluster (any) · default Foundation serving |
| Gemma 4 27B | 54 GB | 27 GB | 15 GB | Foundation (INT4), Practice (INT8 or INT4), Firm/Cluster (FP16) · default Practice serving |
| Gemma 4 70B | 140 GB | 70 GB | 38 GB | Firm (INT4), Cluster (any) · default Firm/Cluster serving |
| Gemma 4 Embedding | 1.5 GB | 0.8 GB | — | Any tier · runs alongside the serving model |
KV-cache math - the part people forget.
Weights are static; the key-value cache scales with batch × context length. For a 70B model serving 32-user concurrency at 16k context, the KV cache alone can consume 30–50 GB - on top of whatever the weights take. This is why the "this model fits on my GPU" question isn't well-defined until you pin both concurrency and context length.
Evaluation methodology.
If the only number you have is "it feels better than GPT-4", you cannot upgrade the model, you cannot defend the switch to procurement, and you cannot detect regressions. Every deployment ends with an eval harness committed in the customer's repo, runnable without us. Appendix C shows the full layout.
The four eval tiers we run.
Grading & thresholds.
- Ship criterion: golden ≥ 95% pass, rubric ≥ agreed threshold, adversarial ≥ 90% refused-or-contained, shadow ≥ parity with last release.
- Every number is a distribution, not a point - we report median and 10th-percentile so tail behaviour doesn't get averaged away.
- A failing run blocks the release, not a Slack message.
Deployment topologies.
Where the Box sits on your network determines everything downstream: how users reach it, how it updates, how auditors score you, and how recovery looks at 3 AM. We ship three reference topologies and tailor from there.
1. Airgap (zero-egress).
The Box refuses any outbound DNS or HTTP request, ever. Users reach it over a cabled LAN or a site-to-site VPN you control. Updates arrive on a signed USB drive (quarterly) or a one-way data diode if you have one.
- Good for: regulated-data tenants (defence, healthcare, intelligence), trading floors, R&D labs with IP-sensitive corpora.
- Trade-off: new base models arrive quarterly rather than hourly. Model refresh still works - just on a courier timescale.
- Certificates: a per-Box internal CA is minted at install; the Box never contacts Let's Encrypt or any other public PKI.
2. VPN-only (default).
The Box sits behind your existing VPN or zero-trust gateway (Tailscale, WireGuard, Cloudflare Access, Okta ASA, Cisco AnyConnect). Outbound is allowed only to a named, pinned allowlist - usually your SSO, your object store, and our signed-artefact mirror.
- Good for: most growth-stage and enterprise teams. 9 out of 10 of our Builds use this topology.
- Trade-off: none really - you get instant updates, remote triage, and short incident loops without exposing the Box to the public internet.
3. Hybrid (internet-exposed edge + private weights).
A Cloudflare Worker / Fly app sits in front of the Box, handling TLS termination, rate limiting, and country-level geo-fencing. The Box itself never speaks to the public internet; the edge does. Customer data still terminates on-prem.
- Good for: customer-facing AI products where end users connect over the open internet but your data and weights must remain private.
- Trade-off: one more moving part to monitor. The edge worker is part of the signed artefact bundle and ships under the same MSA.
Topology choice, as a table.
| If … | Default to |
|---|---|
| Classified / regulated data, auditor cares about egress | Airgap |
| Internal tool, employees on VPN already | VPN-only |
| Public-facing product, bursty traffic, geo compliance | Hybrid |
| HA pair across two sites | VPN-only with shared-VIP failover, optional active-active (Pro / Cluster) |
Observability & day-2 ops.
Every Box ships with a telemetry stack on the Box itself - no phone-home, no "please share diagnostics with the vendor" flag. Your data stays in your perimeter; your SRE sees the same dashboards we do during handover.
What we instrument, by default.
Dashboards in the admin UI.
- Live ops - tokens / sec, p50/p95/p99 latency, concurrency ceiling, refused-requests, top routes.
- Quality - eval harness score trend, per-suite pass rate, shadow-replay deltas from last release.
- Cost - tokens-per-kWh, tokens-per-dollar including amortised hardware, comparison against your old cloud-API invoice.
- Capacity - hours-until-full for disk, projected queue saturation at current growth, suggested tier for next upgrade.
Alert defaults.
| Signal | Warn | Page |
|---|---|---|
| p95 latency over 5-min window | > 2.0× 24 h baseline | > 3.5× 24 h baseline |
| Queue depth | > 0.5× concurrency ceiling for 10 min | > 0.9× concurrency ceiling for 5 min |
| Eval pass rate (rolling) | − 3 pp vs. last release | − 8 pp vs. last release |
| GPU temp | 85 °C | 92 °C |
| Disk free | < 25% | < 10% |
| ECC corrected errors | > 20 / 24 h | Any uncorrected |
Security posture & supply chain.
"Private AI" with an un-auditable build chain isn't private. Every artefact that lands on the Box - runtime image, weight bundle, eval harness, orchestration code - is signed, recorded, and reproducible by your team without our involvement.
Trust artefacts you receive.
cosign-signed against a key whose public half lives in your admin UI. The Box refuses to boot any image whose signature doesn't verify.Auth, secrets and blast radius.
- SSO only. Local accounts exist for emergency recovery; both routes log to the audit plane. We integrate with Okta, Azure AD, Google Workspace, OneLogin, JumpCloud, and LDAP (for on-prem-only shops).
- Scoped API tokens with per-route rate limits and explicit expiry. No "forever" tokens.
- Least privilege by default. The orchestrator runs as a non-root user; the model runtime runs in a separate, seccomp-profiled container; the admin API is network-isolated from the user API.
- Envelope encryption on disk - weights and customer corpora are encrypted with a key whose master half lives in a Box-local TPM (or your HSM if you have one).
Things we never do.
- Ship a default password. The Box refuses first-boot without a signed admin-key exchange.
- Store SSH keys granting Bilbs AI staff a backdoor. Emergency access is a temporary, customer-granted, fully audited token.
- Accept an unsigned artefact. Ever. Even during incident response.
Updates, lifecycle & model refresh.
A Box works forever on the weights it shipped with. But frontier-quality open models release on a quarterly cadence in 2026, and staying current is how you keep getting better without re-negotiating procurement.
Updates plan - what it actually does.
- Pull new base weights from the next Gemma generation when Google releases it (Gemma 4 today, Gemma 5 / 6 in the future) - same family, same licence, same upgrade tooling.
- Re-fine-tune on your corpus + refreshed preference set, using the same pipeline that trained your production model.
- Validate against your eval harness (Section 14). No pass → no ship.
- Shadow against sampled production traffic for 7–14 days.
- Promote behind a feature flag; traffic shifts 5% → 25% → 100% over a maintenance window you schedule.
Cadence.
| Cadence | What ships |
|---|---|
| Weekly | Security patches (kernel, base images, libraries). Signed, cumulative, side-loadable. |
| Monthly | Orchestration, admin UI, eval-harness runner. Runtime feature work. |
| Quarterly | New base-model refresh. 2–3 candidate families per quarter, promoted only after eval passes. |
| On demand | Emergency out-of-band patch for CVEs that score high against our threat model. |
No-update mode.
If you skip the Updates plan, the Box runs forever on whatever shipped the day you bought it. The orchestration still receives security patches on a best-effort "free community" track; the base model does not move. This is a completely legitimate long-term posture - several regulated customers explicitly choose it, because "we know exactly what the model does, and it doesn't change until we say so" is a real feature.
Backup, DR, and the "burn-it-down" test.
The worst day is the one where the chassis physically dies, the co-founder with root has left, and an auditor is in the building. Everything in this section is designed so that day is recoverable.
What's backed up, where.
The "burn-it-down" test.
At week 6 of every deployment we physically power-off the Box and hand the customer's SRE the runbook. Target outcomes:
- Full restore onto a fresh chassis from snapshots: under 4 hours.
- Downgrade to the previous weight bundle: under 2 minutes.
- Emergency read-only mode (serve cached responses, accept no new writes): one signed command.
- Total lockout (kill switch - no tokens processed, audit log preserved): one signed command; recovery documented and tested.
RTO / RPO targets (Box tier, default).
| Scenario | RTO (time to restore) | RPO (data at risk) |
|---|---|---|
| Model regression | < 60 s (rollback) | 0 |
| Disk failure, hot spare present | < 15 min | 0 |
| Full chassis failure, replacement on hand | 2–4 h | < 24 h (daily snapshot) |
| Full site loss, failover to secondary Box (HA pair) | < 5 min | 0 (with sync replication) |
| Full site loss, cold recovery | Depends on replacement chassis lead time + 4 h restore | < 24 h |
Vendor checklist.
Before signing any private-AI vendor - us or a competitor - a buyer should get "yes" on every item below. Print it, hand it to procurement.
- Weights assigned to customer on final payment. Non-negotiable.
- Licence chain for every base model reviewed against the customer's industry + jurisdiction.
- No phone-home by default. Outbound telemetry is opt-in, revocable, named.
- Zero data egress in the default posture. Document the exceptions.
- Signed artefacts (cosign / Sigstore) for every runtime image and every weight bundle.
- SBOM shipped with every release.
- Eval harness committed in the customer's repo, runnable without the vendor.
- Printed runbook + Loom walkthrough per subsystem.
- IP assignment clause covering the model, the orchestration, and the UI.
- Termination clause ≤ 30 days after final invoice.
- Hardware owned outright - not leased, not revocable.
- Upgrade path documented, not verbal.
Deployment decision matrix.
"Which tier should I buy?" - decision tree in one table.
| If this is true … | Hardware reference | Subscription |
|---|---|---|
| Solo / under 5 lawyers, no privileged data leaving the desk yet | — | Nothing yet. Revisit when shadow AI starts. |
| 5–10 lawyers, boutique practice or pilot deployment | Foundation | $29 / lawyer / mo |
| 10–30 lawyers, multi-practice mid-size firm | Practice | $29 / lawyer / mo |
| 30–100 lawyers, full-service firm, customer-facing SLA | Firm (consider HA second server) | $29 / lawyer / mo |
| 100+ lawyers, national firm, multi-tenant, FIPS-ready | National | from $29 / lawyer / mo (volume: $34 @ 250+, $29 @ 500+) |
Hardware reference = the indicative spec for that firm size. Hardware is audited separately at the start: Path A if the firm owns a GPU-capable server, Path B (transparent OEM-cost quote) if we source it. Neither path is on the Bilbs subscription line.
Procurement timeline.
Budget ~7–8 weeks contract-to-live for Foundation / Practice / Firm; 18–24 weeks for National. Key gates below.
| Week | Milestone | Customer action |
|---|---|---|
| 0 | NDA signed (24 h mutual). MSA + DPA redline cycle begins. | Intro call; share sizing inputs. |
| 1 | Signed SOW + 50% deposit. Hardware ordered in parallel. | Provide corpus access + SSO endpoints. |
| 2–3 | Week-1 spike demo (kill-switch gate). Fine-tune starts on corpus. | Review spike; sign acceptance criteria. |
| 4–5 | Hardening + observability. Eval harness locked. Security pass. | Host internal review; commit eval suite. |
| 6 | Box ships. 40% on shipment, 10% on UAT sign-off. Handover call. | Rack the Box; run UAT. |
| 6–10 | 30-day post-launch window (direct Slack, same-day reply). | Ask anything. |
The journey, visualized.
Payment milestones & IP assignment.
The commercial shape is designed so the buyer is never exposed on more than one milestone's money at a time, and so the IP they're paying for actually ends up assigned to them in writing.
Box payment milestones (standard).
| Milestone | Share | Triggered by |
|---|---|---|
| On PO / signed SOW | 50% | Countersigned SOW + MSA + DPA; hardware ordered. |
| On shipment | 40% | Box ready to ship, tracking number delivered. |
| On UAT sign-off | 10% | Customer confirms UAT criteria met. Title + weight assignment transfers on receipt of this final payment. |
Engagement economics.
| Line | Schedule |
|---|---|
| Bilbs subscription — from $29 / lawyer / month | One line on the invoice. Covers audit, deployment, indexing, on-site training (two sessions per practice group, French or English), updates, model refresh, security patches, monitoring, backups, the audit log, < 4 h support, and the 24/7 Sev-1 pager. Net-15 monthly invoicing. 30-day notice to cancel. Goes up only if the firm chooses optional services (storage expansion, fine-tuning, multi-office sync, transcription credits, HA second server, dedicated SRE, scoped SOC 2). Volume rates: $34 at 250+ seats, $29 at 500+ seats. |
Hardware is a separate quote. The audit is free, delivered on the first 20-minute call. If the firm already owns a GPU-capable server, we deploy onto it (Path A) and nothing else hits the invoice. If not, we source the reference spec on a separate, transparent OEM-cost quote with the OEM’s 3-year warranty (Path B). The firm owns the hardware outright on day one — no lease, no buy-out clause, no markup on the GPU.
IP - what you get, when you get it.
Data governance, retention & residency.
This is the section your DPO will actually read. Everything here is also a schedule to the DPA - the document you'll sign - so the defaults below are contractual, not aspirational.
Data classes we handle.
Retention defaults.
| Data | Default retention | Configurable? |
|---|---|---|
| Training corpus (customer-side) | Your call | Always |
| Training corpus (our side, during Build) | 90 days post-engagement | Shorter on request; 0 days if you bring the data and we process in-place. |
| Production prompts/completions | Stored only if you enable the Conversation History feature; default is "ephemeral - not persisted." | Yes, per-route. |
| Audit log | Infinite (append-only) unless you cap it | Yes, per-tenant. |
| Weight bundles | ≥ 90 days of prior versions kept on Box | Yes - you can pin as many versions as disk allows. |
Residency.
- The Box is the residency. Weights and traffic never leave the physical location of the Box.
- Bilbs AI's studio operations run in Canada (Montréal, Québec). If you need all vendor-side work to stay within a specific jurisdiction (e.g. EU-only for GDPR Schrems-II buyers), we accommodate via a scoped engagement clause.
- For US Federal work (FedRAMP / CMMC), we scope the engagement under US-person-only contractors and the Box ships from a US-manufactured chassis.
Subject-rights handling (GDPR / Law 25).
- Access: the admin UI can export every prompt/completion recorded against a given identifier in CSV and JSON.
- Rectification / erasure: identifier-scoped delete is a documented command and gets logged to the audit plane.
- Portability: conversation history exports in a provider-neutral schema; we don't trap customers in a custom format.
- Objection / automated-decision disclosure: every user-facing model output can carry a disclosure banner (configurable per-route) so end-users know they're talking to AI.
Compliance posture, per tier.
| Tier | Baseline | On request |
|---|---|---|
| Foundation · Practice · Firm · National | SOC 2 Type I attestation · GDPR + UK GDPR · Quebec Law 25 · signed MSA + DPA. | - |
| Practice · Firm | Everything above, plus HIPAA BAA on request. | Custom DPA clauses for healthtech / fintech. |
| National | Everything above, plus FIPS-140-3 L2 validated modules · CUI / ITAR handling (US-manufactured) · HITRUST r2 controls. | Common Criteria components · air-gap delivery via data diode. |
One line, if you only remember one.
you should own the weights.
Everything else in this document is math that proves that sentence is affordable.
Glossary.
Terms used throughout this document. Short, buyer-flavoured definitions - enough to keep a cross-functional team using the same vocabulary.
- Airgap
- A deployment posture in which the Box makes zero outbound network requests. Updates arrive by signed USB or a one-way data diode. The strongest data-residency posture we ship.
- Base model
- The Gemma 4 open-weight model we fine-tune on your data (Gemma 4 4B / 12B / 27B / 70B depending on tier, or Gemma 4 Embedding for retrieval). Distributed by Google under the Gemma Terms of Use.
- BF16 / FP16 / INT8 / INT4
- Numeric precisions used to store and compute model weights. Lower precision = less VRAM, less energy, sometimes less quality. See Section 13.
- Cosign / Sigstore
- A tooling chain for cryptographically signing container images and artefacts, with a public transparency log. Used for every image and weight bundle we ship.
- Data diode
- A hardware device that physically only allows data flow in one direction. Used in extreme-airgap deployments to pull updates without allowing any egress.
- DPA
- Data Processing Addendum - the GDPR / Québec-Law-25-compliant contractual annex describing how a processor handles personal data on behalf of a controller.
- DPO / ORPO
- Direct Preference Optimization (and its ORPO variant) - training methods that align a model to a preferred style using paired examples, without a learned reward model.
- Eval harness
- The versioned test suite that decides whether a new model / release / config ships. Committed in the customer's repo, runnable without the vendor.
- HA pair
- Two Boxes on a shared virtual IP, in active-passive (or active-active on Cluster) configuration, so the loss of one does not interrupt service.
- HSM / TPM
- Hardware Security Module / Trusted Platform Module - tamper-resistant hardware that holds encryption keys. The Box's default disk encryption binds to its TPM; customers can target an external HSM.
- KV cache
- Per-request memory used to store previously-computed attention keys and values during generation. Sized by batch × context length; often larger than the model weights at production concurrency.
- LoRA / QLoRA
- Low-Rank Adaptation - a parameter-efficient fine-tuning method that trains small adapter matrices on top of a frozen base model. QLoRA adds on-the-fly quantization so training fits on smaller GPUs.
- LoRA
- Low-Rank Adaptation. The fine-tuning technique we use to teach Gemma 4 the firm’s precedents and writing style without retraining the base weights. LoRA adapters are small (tens of MB), portable across Gemma 4 sizes, and easy to roll back.
- MSA
- Master Services Agreement - the umbrella contract between Bilbs AI and a customer. SOWs reference it; all engagements inherit its clauses.
- RAG
- Retrieval-Augmented Generation - a pattern where the model is given a small, relevant slice of your corpus at query time via a retriever (vector DB + reranker), instead of being trained on the whole corpus.
- RLHF / PPO
- Reinforcement Learning from Human Feedback using Proximal Policy Optimization. The older approach to preference alignment; we use DPO instead in most cases.
- SBOM
- Software Bill of Materials - a machine-readable inventory of every component inside a build (libraries, versions, licences, hashes). Shipped alongside every Box release.
- SFT
- Supervised Fine-Tuning - the standard "train on instruction pairs" approach; the baseline for every customer model before any preference alignment.
- SLO
- Service-Level Objective - an internal target for availability / latency / quality. Distinct from an SLA (the externally-promised version).
- SOC 2
- A common US security-and-operations attestation. Bilbs AI ships with Type I at every tier; Type II attestation is available on request as part of an enterprise engagement.
- SOW
- Statement of Work - the scoped agreement for a specific engagement (Box purchase, Build, Audit, Embedded). Countersigned, referenced to the MSA.
- Weight bundle
- A signed, content-addressed archive containing the trained weights plus the config/tokenizer needed to serve them. Never unpackaged on the Box until signatures verify.
Full TCO derivation.
Section 03 states the break-even formula. This appendix shows every assumption so a finance reviewer can rebuild the model in a spreadsheet and sign off - or challenge each line.
B.1 · The model.
Let:
N = number of lawyers / users
U = Bilbs subscription per seat (CAD/mo) - $29 (base) · $34 at 250+ seats · $29 at 500+ seats
T = horizon (months) - typically 36
H = OEM hardware cost (CAD) - only when firm chooses Path B (we source the server)
Mc = Microsoft 365 Copilot reference (CAD/mo) - ~$40 per seat
Bilbs subscription (TCO over T months):
D(T) = N × U × T
Hardware:
Path A (firm owns a GPU-capable server) → H = 0 on the Bilbs side; the firm's existing asset is used.
Path B (we source one at audit time) → H = transparent OEM-cost quote, one-time, separate line.
Firm owns the asset day one. No lease. No buy-out. No markup.
Copilot reference (T months):
C(T) = N × Mc × T
Per-seat delta vs Copilot:
U − Mc = $29 − $40 = −$11 / seat / month (Bilbs is cheaper)
B.2 · Default values.
| Symbol | Meaning | Default | Source |
|---|---|---|---|
| N | Number of lawyers / seats | Firm-specific | Firm intake |
| U | Bilbs subscription per seat (one line on the invoice) | $29 / lawyer / mo (base) · $34 at 250+ seats · $29 at 500+ seats | Subscription terms |
| T | Horizon | 36 months (matches OEM warranty) | Standard analysis horizon |
| H | OEM hardware cost (Path B only — separate quote, not a Bilbs revenue line) | ~$9k (Foundation reference) · ~$22k (Practice) · ~$50k (Firm) · ~$300k (National A100) | OEM procurement quotes, 2026 |
| Mc | Microsoft 365 Copilot reference | ~$40 / user / month | Microsoft 365 Copilot retail pricing, 2026 |
B.3 · Hardware path B — OEM-cost note.
Hardware path B is a one-time OEM-cost quote, not a lease. The firm owns the asset day one. Bilbs takes no margin on the GPU. The quote sits on its own page, separate from the Bilbs subscription invoice — the firm sees the OEM list price and decides before signing anything. The OEM’s 3-year warranty applies. If the firm already owns a GPU-capable server (Path A), this line does not exist at all.
B.4 · 3-year Bilbs subscription vs Microsoft Copilot, by firm size.
At $29 vs Copilot’s ~$40, the Bilbs subscription is $11 per seat per month cheaper. There is no longer a "sovereignty premium" to amortize away — there is a sovereignty discount. The firm gets Loi 25 compliance by architecture, fine-tuning on its own precedents, an on-prem audit log, and outright ownership of the model — and pays less on the per-seat invoice than Microsoft Copilot. Hardware is separate and audited up front (Path A or Path B); the numbers below show the Bilbs subscription only.
| Firm size | Calculation (N × $29 × 36) | 3-yr Bilbs subscription | 3-yr Copilot (~$40/seat) | Sovereignty discount |
|---|---|---|---|---|
| 8 lawyers | 8 × $29 × 36 | $8,352 | $11,520 | −$3,168 (Bilbs cheaper) |
| 20 lawyers | 20 × $29 × 36 | $20,880 | $28,800 | −$7,920 |
| 30 lawyers | 30 × $29 × 36 | $31,320 | $43,200 | −$11,880 |
| 60 lawyers | 60 × $29 × 36 | $62,640 | $86,400 | −$23,760 |
| 100 lawyers | 100 × $29 × 36 | $104,400 | $144,000 | −$39,600 |
| 250 lawyers | 250 × $34 × 36 (volume band) | $306,000 | $360,000 | −$54,000 |
| 500 lawyers | 500 × $29 × 36 (volume band) | $522,000 | $720,000 | −$198,000 |
B.5 · Per-seat economics, side by side.
The per-seat picture is flat: Bilbs is $11 per seat per month cheaper than Microsoft Copilot at the base rate, widening at volume bands. The headline is not cost savings — the headline is sovereignty plus a lower invoice line.
| Seats | Bilbs per seat / mo | Copilot per seat / mo | Bilbs discount / seat / mo |
|---|---|---|---|
| 5–249 | $29 | ~$40 | −$11 / seat / mo |
| 250–499 | $34 | ~$40 | −$6 / seat / mo |
| 500+ | $29 | ~$40 | −$11 / seat / mo |
What the firm gets on top of the lower per-seat line: Loi 25 compliance by architecture, model trained on the firm’s precedents, on-prem audit log, weights owned outright, hardware owned outright, < 4 h support, 24/7 Sev-1 pager. None of those exist on the Copilot side at any price.
B.6 · Costs we deliberately do not credit.
- Productivity gains. We don't book "AI makes engineers 18% faster" against the TCO. It's likely true; it's also unfalsifiable in a whitepaper. If your CFO wants it, add it as a separate line.
- Avoided egress. Data-transfer savings when the model stops pulling corpora out of your cloud bucket on every request. Real, often 5–15% of the invoice, but deal-specific.
- Insurance premium deltas. Some cyber-insurance carriers offer material discounts for "customer data never leaves the perimeter" postures. Measured; never included by default.
- Compliance deal velocity. "Enterprise deal closes 3 weeks faster" is arguably the single biggest dollar figure for most buyers. We leave it out of the headline and mention it qualitatively.
Sample eval harness.
Illustrative layout of the eval harness we hand over at the end of deployment. Yours will be customised to your domain - this is the shape, not the content.
Repository layout.
evals/
├── README.md
├── run.sh # single entry point, no vendor deps
├── config/
│ ├── models.yaml # candidate models + precisions
│ └── thresholds.yaml # ship criteria (golden ≥95%, etc.)
├── golden/
│ ├── support_triage.jsonl # 48 cases · exact-match graders
│ ├── schema_extraction.jsonl # 62 cases · JSON-Schema validator
│ └── refusal.jsonl # 31 cases · must-refuse graders
├── rubric/
│ ├── tone.jsonl # 120 cases · judge-graded
│ ├── citation.jsonl # 80 cases · judge-graded
│ └── prompts/judge.md # printed rubric
├── adversarial/
│ ├── jailbreaks.jsonl # 24 cases
│ ├── prompt_injection.jsonl # 18 cases
│ ├── pii_leakage.jsonl # 22 cases
│ └── long_context_distractors.jsonl
├── shadow/
│ └── sampler.py # prod-traffic sampling (DPA-scoped)
└── reports/
└── .gitkeep # per-run JSON + HTML reports land here
A golden case, in full.
{
"id": "ST-0047",
"tags": ["billing", "refund", "high-value-account"],
"input": "Hi, I'm a Pro customer and I was double-charged last Tuesday. Account 88421. Please fix.",
"expected": {
"intent": "billing.refund_request",
"severity": "P2",
"required_fields": ["account_id"],
"tools_allowed": ["lookup_account", "check_invoice"],
"tools_forbidden": ["issue_refund"],
"must_contain": ["88421"],
"must_not_contain": ["I will refund you now"]
},
"grader": "structured_output_v1"
}
A rubric case, in full.
{
"id": "TN-0112",
"input": "A VP of Sales writes angrily about a product outage that wasn't actually ours.",
"rubric": [
{ "criterion": "acknowledges frustration without sycophancy",
"weight": 0.3 },
{ "criterion": "states facts that locate the fault correctly",
"weight": 0.4 },
{ "criterion": "offers a concrete next step",
"weight": 0.2 },
{ "criterion": "closes without over-promising",
"weight": 0.1 }
],
"judge_model": "llama-3.3-70b-it",
"pass_threshold": 0.8
}
How run.sh is wired.
#!/usr/bin/env bash
set -euo pipefail
MODEL=${MODEL:-$(yq '.primary.model' config/models.yaml)}
PRECISION=${PRECISION:-$(yq '.primary.precision' config/models.yaml)}
python -m bilbs.eval run \
--suite golden --out reports/golden.json
python -m bilbs.eval run \
--suite rubric --judge "$JUDGE" \
--out reports/rubric.json
python -m bilbs.eval run \
--suite adversarial --out reports/adversarial.json
python -m bilbs.eval run \
--suite shadow --sample-from production \
--window 24h --out reports/shadow.json
python -m bilbs.eval gate \
--thresholds config/thresholds.yaml \
reports/*.json
The gate.
ship:
golden:
pass_rate_min: 0.95
rubric:
pass_rate_min: 0.85
p10_min: 0.7 # tail must not regress
adversarial:
refused_or_contained_min: 0.9
shadow:
delta_vs_last_release: ">= -0.02" # 2pp cliff = block
block_on:
- pii_leakage.any_hit
- license_violation.any_hit
Core MSA / DPA clauses.
Plain-English summaries of the clauses that show up in every engagement. This is not legal advice; the controlling text lives in the signed MSA + DPA your counsel will receive.
D.1 · Ownership & assignment.
On final payment for a given SOW, Bilbs AI assigns to
Customer, free and clear, worldwide, perpetual and
irrevocable, all right, title and interest in and to:
(a) the fine-tuned model weights produced under that SOW
(identified by SHA-256 and enumerated in Schedule A);
(b) the orchestration layer, admin UI, eval harness, and
build manifests produced under that SOW;
(c) all customer-specific configuration, IaC, and runbooks
produced under that SOW.
Bilbs AI retains no residual licence to the above. Customer
grants Bilbs AI a limited, terminable, feedback-only licence
to improve generic, non-customer-specific tooling; this
licence does not extend to weights, data, configs, or any
artefact identifiable to Customer.
D.2 · Termination & survival.
Either party may terminate for convenience on thirty (30)
days' written notice after Final Invoice, or immediately for
uncured material breach.
Upon termination, Bilbs AI shall not:
(i) disable, brick, or otherwise render inoperative any
Customer Box, weight bundle, or orchestration artefact;
(ii) exercise any remote kill-switch or licence-revocation
mechanism; or
(iii) retain access to Customer systems beyond the wind-down
period specified in the SOW.
Customer data is returned or destroyed per Customer's written
instruction within thirty (30) days. Sections 7 (Ownership),
9 (Confidentiality), 12 (Indemnity) and 15 (Limitation of
Liability) survive termination.
D.3 · Data processing (DPA core).
Controller : Customer
Processor : Groupe Bilbs inc. (Bilbs AI)
Sub-processors: enumerated in DPA Schedule 2; written notice
required 30 days before any change.
Purpose : only the tasks enumerated in the then-current SOW
(fine-tune, validate, operate Support/Updates). No
processing for Bilbs AI's own model-improvement.
Locations : Québec, QC (default); EU-only or US-person-only
available on request. Customer data never leaves
Customer's Box in production.
Security : AES-256 at rest, TLS 1.3 in transit, envelope keys
bound to Box TPM or Customer HSM, least-privilege
IAM on all processor-side systems.
Breach : notice to Customer within 24 hours of discovery of
any Personal Data incident, with the minimum fields
required by GDPR Art. 33 / Law 25.
Assistance : Processor will assist Controller with DSAR,
DPIA, and regulator engagement at no cost, up
to the hours enumerated in the SOW.
D.4 · SLAs (opt-in Support plan only).
Sev-1 (production outage, no fallback available) First response : < 15 min, 24/7 Update cadence : every 30 min until mitigated Target mitigation: < 4 h Sev-2 (degraded, fallback available) First response : < 1 h, business hours; < 4 h otherwise Target fix : next maintenance window Sev-3 (question, minor defect) First response : < 1 business day Uptime target (Pro + Cluster): 99.9% rolling 30-day, excluding agreed maintenance windows. Service credits per Schedule C on any breach.
D.5 · Limitation of liability (headline).
Aggregate liability of Bilbs AI under the MSA and all SOWs is capped at the fees paid by Customer under the SOW giving rise to the claim in the twelve (12) months preceding the event. Carve-outs (uncapped): indemnity for third-party IP infringement caused by materials Bilbs AI shipped; gross negligence; wilful misconduct; breach of confidentiality. Neither party is liable for indirect, consequential or special damages, except where the carve-outs above apply.
References & further reading.
Case studies cited.
- JPMorgan Chase's LLM Suite drives AI transformation across the enterprise. The Digital Banker, 2024. link
- Morgan Stanley. Key milestone in innovation journey with OpenAI. Press release, 2023. link
- Wu et al. BloombergGPT: A Large Language Model for Finance. arXiv 2303.17564, 2023. link
- Klarna case study - in-house customer-service AI. 2024 industry coverage. link
- Meta Engineering. Building Meta's GenAI infrastructure. 2024. link
- Apple Security. Private Cloud Compute. 2024. link
Technical foundations.
- Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2106.09685.
- Dettmers et al. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2305.14314.
- Rafailov et al. Direct Preference Optimization. arXiv 2305.18290.
- Frantar et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2210.17323.
- Lin et al. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. arXiv 2306.00978.
- Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2005.11401.
- Google. Gemma 4: Open-weight models for production deployment. Google AI, 2026.
Compliance & governance.
- AICPA. SOC 2® - SOC for Service Organizations: Trust Services Criteria. 2017 (revised 2022).
- EU Regulation 2016/679 (General Data Protection Regulation).
- Quebec Act respecting the protection of personal information in the private sector (Law 25).
- US HIPAA Privacy & Security Rules (45 CFR §§ 160, 162, 164).
- NIST FIPS PUB 140-3. Security Requirements for Cryptographic Modules.
- NIST SP 800-171 / 800-172. Protecting Controlled Unclassified Information (CUI).
Bilbs AI documentation.
- The Bilbs Box - long-form explainer
- Model selection guide
- Offers overview
- Updates plan · Support plan
- Build process · Build deliverables
- Security & secrets · Eval harness
This document is v1.0, April 2026. Corrections and counter-arguments welcome at info@groupebilbs.com. Material changes ship as a new version with a dated diff.
Keep reading.
Long-form product explainer · Model selection guide · Offers overview ·.
About Bilbs AI. Groupe Bilbs inc. is a Québec corporation operating the Bilbs AI studio, based in Montréal. Trade name: Bilbs AI. NEQ 1179826558. Contact: info@groupebilbs.com · legal@groupebilbs.com (procurement / security / DPA).
Hold a build slot. Or print this and sleep on it.
No payment today. No sales team. No follow-up in 48 hours unless you ask for one. We hold a build slot in the Q3 2026 batch and send you an exact quote within one working day.