# 10 · Security and secrets

> **Audience.** Every engineer. Security is everyone's job; this page
> codifies how we practise it.
> ⟵ [internal index](./README.md)

## 01 · Threat model

The threats we care about, in priority order:

1. **Supply-chain compromise** of a weight bundle we ship to a
   customer (hostile code running in the customer's rack).
2. **Signing-key theft** enabling the above.
3. **Customer-data exfiltration** during an engagement.
4. **Runtime vulnerabilities** in our platform shipping to production.
5. **Insider threat** — an engineer turns hostile.
6. **Ransom / extortion** against our build infrastructure.

We do not treat DDoS of `bilbs.ai` (Cloudflare handles it) or
classical phishing (every account is YubiKey-gated) as primary threats.

## 02 · Signing keys

The most valuable secret we own. If a bad actor gets the signing key,
they can push a tampered bundle that customer updaters will accept.

### Hierarchy

- **Root key** (offline). Ceremonially generated. Used only to sign
  intermediate keys. Stored as **Shamir 3-of-5** on:
  - 2× Tangem hardware wallets (separate jurisdictions).
  - 3× printed paper shards in sealed envelopes at different physical
    sites (including legal counsel).
- **Intermediate signing keys**, one per year (e.g., `prod/2026`).
  They live in HashiCorp Vault PKI and are available to CI via
  short-lived OIDC tokens.
- **Ephemeral per-release keys**, minted by Vault for each CI run,
  TTL 30 min.

Every layer is logged in our Rekor mirror at `log.bilbs.ai`.
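
To make the 3-of-5 threshold concrete, here is a minimal sketch of Shamir secret sharing over a prime field. It is illustrative only: the field size, function names, and share encoding are assumptions, not a description of the tooling used in the actual ceremony.

```python
import secrets

# Assumption for illustration: a Mersenne prime comfortably larger than a 256-bit secret.
PRIME = 2**521 - 1


def split_secret(secret: int, threshold: int = 3, shares: int = 5) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range for the field")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to get the constant term."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total
```

Any three of the five shards reconstruct the secret; two shards alone reveal nothing about it, which is why losing a single wallet or envelope is survivable.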

### Rotation

- Intermediate: yearly, with a 90-day overlap so bundles signed late
  in year N still verify after year N+1 starts.
- Root: on compromise, otherwise every 5 years.

Rotation runbook: `ops/sops/rotate-signing-key.md` (internal).
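
The overlap rule can be sketched as a small date check. This assumes, purely for illustration, that intermediates roll over on 1 January and that the 90 days are counted from that date; the runbook is the authority on the real schedule.

```python
from datetime import date, timedelta

OVERLAP = timedelta(days=90)  # from the rotation policy above


def accepted_key_years(today: date) -> list[int]:
    """Years whose intermediate key a verifier should still accept on `today`.

    Assumption: year N's key stays acceptable for the first 90 days of year N+1.
    """
    years = [today.year]
    if today < date(today.year, 1, 1) + OVERLAP:
        years.insert(0, today.year - 1)
    return years
```

For example, `accepted_key_years(date(2027, 1, 15))` returns `[2026, 2027]`, so a bundle signed with `prod/2026` in late December still verifies in mid-January.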

### Verification on the customer side

- Updater ships with our Sigstore root pinned in the binary.
- On every artefact pull, signature is verified offline.
- Tampered bundles never reach disk.
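
The "never reach disk" property comes from ordering: verify in memory first, write second. A minimal sketch of that ordering follows. The SHA-256 pin is only a stand-in for the real Sigstore signature check, and `install_artifact` and `digest_pinned` are hypothetical names, not the updater's actual API.

```python
import hashlib
import hmac
import os
import tempfile
from pathlib import Path
from typing import Callable


def digest_pinned(expected_hex: str) -> Callable[[bytes], bool]:
    """Stand-in verifier: compares a SHA-256 digest (NOT a signature check)."""
    def check(data: bytes) -> bool:
        return hmac.compare_digest(hashlib.sha256(data).hexdigest(), expected_hex)
    return check


def install_artifact(payload: bytes, is_valid: Callable[[bytes], bool], dest: Path) -> bool:
    """Write `payload` to `dest` only if `is_valid(payload)` returns True."""
    if not is_valid(payload):
        return False  # rejected while still in memory; nothing is written
    # Write to a temp file in the same directory, then rename atomically,
    # so a crash mid-write never leaves a partial artefact at `dest`.
    fd, tmp = tempfile.mkstemp(dir=dest.parent)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
        os.replace(tmp, dest)
    except BaseException:
        os.unlink(tmp)
        raise
    return True
```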

## 03 · Secrets management

**HashiCorp Vault** is the single source of truth. Dev secrets are
also kept in 1Password as a convenience layer for humans.

### Secrets tiering

| Tier | Where | Rotation | Who accesses |
|------|-------|---------:|--------------|
| Human-interactive dev | 1Password `Bilbs — Engineering` | 90 d | all engineers |
| CI / runtime | Vault `kv/ci/` and `kv/runtime/` | per-release (short-lived) | CI + platform services |
| Customer-specific | Vault `kv/clients/<slug>/` | per-engagement | engineers on that engagement only |
| Signing | Vault PKI `pki/signing/` | yearly | automation only, human access requires 2-person |
| Root | offline Shamir | 5-year | ceremony only |

### Vault policies

Per-engineer policy files in `deploy/vault/policies/`. Reviewed
quarterly. Dead-man: if an engineer doesn't authenticate for 90
days, their policy expires; they re-onboard via the security process
below.
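
The dead-man rule reduces to a comparison against the last authentication time. A minimal sketch, assuming a hypothetical mapping of engineer to last-seen timestamp (how that data is pulled from Vault is not shown) and treating exactly 90 days as expired:

```python
from datetime import datetime, timedelta, timezone

INACTIVITY_LIMIT = timedelta(days=90)  # from the policy above


def expired_policies(last_auth: dict[str, datetime], now: datetime) -> list[str]:
    """Names whose last authentication is 90 or more days before `now`."""
    return sorted(
        name for name, seen in last_auth.items()
        if now - seen >= INACTIVITY_LIMIT
    )
```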

### Local `.env` policy

- `.env` in any repo is gitignored + blocked by a pre-commit hook
  (`ripsecrets`).
- Template is `.env.example`, committed.
- `direnv` loads `.env` into shells per project.
- We never paste secrets into Slack, even in private channels.
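
As a complement to content scanning, a filename guard is easy to reason about. The sketch below is not how `ripsecrets` works (that tool scans file contents for secret patterns); it is a hypothetical extra check, and treating variants such as `.env.local` as blocked is an assumption.

```python
from pathlib import PurePosixPath

ALLOWED = {".env.example"}  # the committed template


def blocked_env_files(staged_paths: list[str]) -> list[str]:
    """Return staged paths that look like real env files and must not be committed."""
    blocked = []
    for path in staged_paths:
        name = PurePosixPath(path).name
        if name in ALLOWED:
            continue
        # Assumption: `.env` and any `.env.<suffix>` variant count as secrets files.
        if name == ".env" or name.startswith(".env."):
            blocked.append(path)
    return blocked
```

A hook would call this with the staged file list and exit non-zero whenever the result is non-empty.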

## 04 · SBOM and supply chain

- Every OCI image ships with an SBOM (`syft`-generated, CycloneDX).
- Every binary release gets an SBOM attachment on GitHub Releases.
- `grype` runs on every PR and nightly on `main`. Any **high** or
  **critical** CVE in a dependency blocks merge unless suppressed
  with a documented reason and a deadline.
- `deps-update` runs a weekly automated PR rolling minor + patch
  versions.
- Yearly, a human reviews every direct dependency for ownership /
  maintenance status (the "are they still here?" audit).
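
The merge gate for CVEs can be sketched as a filter over findings and suppressions. The data shapes here are hypothetical and do not mirror `grype`'s actual output format; the CVE identifiers used with it should be read as placeholders.

```python
from dataclasses import dataclass
from datetime import date

BLOCKING = {"high", "critical"}


@dataclass(frozen=True)
class Finding:
    cve: str
    severity: str


@dataclass(frozen=True)
class Suppression:
    cve: str
    reason: str
    deadline: date


def blocking_findings(
    findings: list[Finding], suppressions: list[Suppression], today: date
) -> list[Finding]:
    """High/critical findings not covered by a documented, unexpired suppression."""
    active = {
        s.cve for s in suppressions
        if s.reason.strip() and today <= s.deadline
    }
    return [
        f for f in findings
        if f.severity.lower() in BLOCKING and f.cve not in active
    ]
```

A suppression with an empty reason or a passed deadline no longer counts, so the finding blocks merge again until someone renews or fixes it.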

SLSA compliance:

- Platform releases aim for **SLSA Level 3**.
- Customer weight bundles that include customer data aim for Level 3
  with a documented caveat about reproducibility (non-deterministic
  NCCL).
- Level 4 (as defined in SLSA v0.1; the v1.0 Build track stops at L3)
  is aspirational; we don't claim it.

## 05 · The two-person rule

Required for:
- Signing a root-level release.
- Rotating a signing key.
- Accessing a customer's production Box in a non-Support tunnel.
- Deploying to `registry.bilbs.ai` or `updates.bilbs.ai` from
  anywhere other than CI.
- Exporting a customer's fine-tune data.

Enforced via Vault policies that require a second-approver on the
relevant paths.
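
The invariant behind the rule is simple: the approver must be someone other than the requester. The sketch below only illustrates that invariant; it does not show how Vault enforces it, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalRequest:
    action: str
    requester: str
    approvals: set[str] = field(default_factory=set)

    def approve(self, approver: str) -> None:
        """Record an approval; self-approval is rejected."""
        if approver == self.requester:
            raise PermissionError("requester cannot approve their own request")
        self.approvals.add(approver)

    @property
    def authorised(self) -> bool:
        # One distinct second person is enough for a two-person rule.
        return len(self.approvals) >= 1
```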

## 06 · Access to customer infrastructure

### Build phase

- Read access to a customer's corpus via a **bastion host they
  control**. We do not replicate their corpus to our laptops or our
  lab box in bulk.
- Writes to their cloud account via **time-bound IAM role** we
  assume with their approval.
- Never: persistent service-principals with long-lived keys.

### Production (post-handover)

- Box has **no default outbound** to us.
- WireGuard support tunnel is off unless both hold:
  - Customer is on the Support plan AND
  - Admin clicks "Open tunnel" in their UI (explicit opt-in).
- Once open, the tunnel auto-revokes after 4 h at most.
- All actions during a tunnel session are logged in the customer's
  audit log with our engineer as `actor_email`.
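
The tunnel conditions above can be sketched as a single predicate. Parameter names are hypothetical, and the real check runs on the customer's Box rather than in code like this.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_TUNNEL_LIFETIME = timedelta(hours=4)  # auto-revoke ceiling from the policy


def tunnel_allowed(
    on_support_plan: bool, opened_at: Optional[datetime], now: datetime
) -> bool:
    """True only while every condition for an open support tunnel holds.

    `opened_at` is when the admin clicked "Open tunnel"; None means no opt-in.
    """
    if not on_support_plan or opened_at is None:
        return False
    return timedelta(0) <= now - opened_at < MAX_TUNNEL_LIFETIME
```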

## 07 · Endpoint security

Every engineering laptop:
- Full-disk encryption (FileVault / LUKS).
- YubiKey-based login + Vault unlock.
- MDM (Mosyle for macOS; Fleet for Linux).
- Automatic OS updates within 7 days.
- No local admin on customer-shared projects without a ticket.
- Signed Git commits (`git config commit.gpgsign true`).

Lost-laptop runbook: `ops/sops/lost-laptop.md`. Wipe remotely within
15 min.

## 08 · Identity & access

- Everyone has a YubiKey (two, one backup).
- Every service: SSO via Google Workspace where possible; direct
  YubiKey otherwise.
- No shared accounts. Ever. If a service doesn't support SSO and
  also doesn't support per-person accounts (rare), document it and
  revisit.
- Google Workspace audit logs reviewed monthly.

## 09 · Vulnerability disclosure

Public policy: `/.well-known/security.txt` on `bilbs.ai`, which
points to `security@groupebilbs.com`.

Response SLA:
- Acknowledge report < 24 h.
- Triage < 72 h.
- Fix critical in < 7 d, high in < 30 d.
- Publish advisory post-fix.
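
To turn the SLA into concrete timestamps, a small helper is enough. This sketch assumes every clock starts when the report is received, which the policy does not state explicitly, and it returns no fix deadline for severities the SLA does not list.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_WITHIN = timedelta(hours=24)
TRIAGE_WITHIN = timedelta(hours=72)
FIX_WITHIN = {"critical": timedelta(days=7), "high": timedelta(days=30)}


def sla_deadlines(received: datetime, severity: str) -> dict[str, Optional[datetime]]:
    """Deadlines for one report; `fix_by` is None below high severity."""
    fix = FIX_WITHIN.get(severity.lower())
    return {
        "acknowledge_by": received + ACK_WITHIN,
        "triage_by": received + TRIAGE_WITHIN,
        "fix_by": received + fix if fix is not None else None,
    }
```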

Bug bounty: none today (we are a single-operator company). We pay at
our discretion for thoughtful reports; this is rare but has precedent.

## 10 · Data handling

- Customer corpus data is **processed, not stored** during an
  engagement. Outputs (weights, evals) are artefacts; the inputs are
  referenced, not copied.
- When inputs must be cached locally (e.g., tokeniser sees them), the
  cache is on the customer's bastion or an ephemeral cloud instance
  that is destroyed after the run.
- Deletion obligations in the DPA are honoured at D+30 absent a
  retention override.
- We log data access at bastion level; customers get that log stream
  on request.

## 11 · Physical security

- Hardware in transit between OEM → integrator → customer:
  - Sealed flight cases with tamper-evident stickers.
  - ShockWatch + TiltWatch indicators.
  - Insured freight with tracking.
- Integrator floor: alarmed, 24/7 video, limited personnel.
- Lab box: behind a locked door, on an isolated VLAN, no physical
  access for any customer without a handler present.

## 12 · Incident: security breach

If we believe a breach has occurred:

1. **Contain** — revoke tokens, rotate keys, isolate host.
2. **Preserve** — take disk + memory snapshots before remediation.
3. **Notify** — founder + affected customers within 72 h of
   reasonable certainty. We do not sit on news.
4. **Report** — regulator notifications where legally required
   (Québec's CAI, GDPR supervisory authorities, US state AGs, etc.).
5. **Remediate** — patch, re-issue, re-sign.
6. **Disclose** — public advisory after remediation with specifics.
7. **Postmortem** — blameless, public where appropriate.

Legal counsel is on retainer for incident response; contact details
are in 1Password.

## 13 · Things you will be tempted to do that you must not

- Storing a customer's data on your laptop "just for a minute."
- Using `sudo` on a customer's Box outside a WireGuard tunnel session.
- Hard-coding a secret in a config file "temporarily."
- Committing a `.env` to a private repo because "it's only us."
- Skipping cosign verification on an artefact because "CI already
  checked."
- Sharing a YubiKey with anyone for any reason.
- Running a production migration without the two-person approval.
- Using a personal AI tool (like an IDE assistant that calls out to a
  cloud model) on a file containing customer data.

Each of these has ended in an incident somewhere in the industry. We
are not above the list.

## 14 · Compliance summary (for our own reference)

- **SOC 2 Type I**: attested. Type II: targeted Q4 2026.
- **ISO 27001**: not today; revisit at team of 3+.
- **GDPR / UK GDPR / Law 25**: yes, DPA template ready.
- **HIPAA BAA**: available as add-on; not default.
- **FedRAMP**: no.
- **PCI-DSS**: not as a processor; Cluster supports PCI-adjacent data
  paths if configured.
- **FIPS 140-3**: supported on Cluster on request (validated
  cryptographic modules).

Compliance calendar and renewal cadence in `ops/compliance/calendar.md`.

---

Next: [11-engagement-playbook.md](./11-engagement-playbook.md).
