The first time a senior clinician asks "where exactly did that patient's data go when the model produced this answer?", the production AI deployment either has a forensically defensible answer or it does not. There is no middle ground. We have shipped healthcare AI across 30+ facilities (clinical chatbots, ambient scribing, and medical RAG agents at 98.4% citation accuracy in production), and every one of them passes the same 12 controls before a clinician ever sees it. The list is short, boring, and load-bearing.
This is the checklist we run on the first kickoff call of every healthcare engagement. If a control on the list is not in scope, we reshape the engagement or decline it. HIPAA does not negotiate, and the cost of skipping any one of these controls is rebuilding the entire deployment under audit pressure six months in. Better to design for it in week one.
1. BAA signed before any PHI touches the engagement
Step zero. A signed Business Associate Agreement covers Aiinfox and any sub-processor (cloud provider, model vendor, vector store) that will touch protected health information. We will not look at a sample dataset, run a discovery call against real records, or accept screenshots that contain PHI before the BAA is countersigned. If the vendor stack you are evaluating includes any provider that will not sign a BAA — some hosted LLM providers explicitly refuse — that provider is out of scope for HIPAA work and the architecture pivots to self-hosted before any other decision is made.
2. Self-hosted model inference inside the customer VPC
Patient data should not leave the customer's network during inference. For HIPAA-scoped engagements we deploy Llama 3 70B or 8B on vLLM inside the customer's AWS, Azure, or GCP VPC — or on-prem on customer-owned hardware for the highest-sensitivity engagements. Hosted LLM APIs are used only when (a) a BAA is in place with the provider, (b) the engagement risk tolerance accepts the egress, and (c) the data classification has been reviewed by the customer's privacy officer. The default is self-hosted. The exception is documented.
3. Vector store and retrieval index inside the perimeter
Self-hosting the LLM while using a managed vector database outside the customer cloud still leaks PHI. The chunked, embedded representation of PHI is itself PHI for HIPAA purposes, and the retrieval-time query carries patient identifiers. Standard pattern: pgvector inside the existing customer Postgres, or Qdrant deployed on the customer's Kubernetes. Embedding models run locally on a CPU or small GPU instance. No data crosses out of the VPC during ingestion, indexing, or query time.
4. Identity, access, and least-privilege role design
Every user, service, and model endpoint authenticates against the customer's identity provider (Okta, Azure AD, AWS IAM Identity Center). Application roles are scoped to the smallest data set each role legitimately needs — a triage chatbot sees demographic + reason-for-visit fields, not the full chart. Service-account credentials rotate on a 30-day cadence at minimum. No long-lived tokens, no shared service accounts, no hardcoded keys in the model's prompt or tool definitions.
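As a minimal sketch of the field-level scoping described above: the role names, field names, and the `scope_record` helper are all hypothetical, and a real deployment would enforce the same rule in the database or API layer rather than in application code alone. The point is the deny-by-default shape, where an unknown role sees nothing.

```python
# Hypothetical role-to-field map; role and field names are illustrative only.
ROLE_FIELDS: dict[str, frozenset[str]] = {
    "triage_chatbot": frozenset({"age", "sex", "reason_for_visit"}),
    "scribe": frozenset({"age", "sex", "reason_for_visit", "visit_notes"}),
}


def scope_record(role: str, record: dict) -> dict:
    """Return only the fields the role may see; unknown roles see nothing."""
    allowed = ROLE_FIELDS.get(role, frozenset())
    return {key: value for key, value in record.items() if key in allowed}


chart = {
    "age": 54,
    "sex": "F",
    "reason_for_visit": "chest pain",
    "medications": ["warfarin"],
    "visit_notes": "Follow-up after ED visit.",
}
triage_view = scope_record("triage_chatbot", chart)
```

Here `triage_view` keeps the three triage fields and drops medications and notes before anything reaches the model's context.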
5. PII / PHI redaction at every untrusted boundary
Any LLM call that crosses a less-trusted boundary (hosted model API, third-party analytics, log shipping to an external observability vendor) goes through a redaction layer that strips PHI before transmission. The redaction layer is rule-based for high-confidence identifiers (MRN format, SSN, DOB, names matched against the patient table) and supplemented by a small classifier model for free-text identifiers. The redacted version is what crosses the boundary; the original (encrypted at rest with a customer-managed KMS key) stays inside the VPC.
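The rule-based half of that redaction layer can be sketched as follows. This is an illustration under stated assumptions, not our production redactor: the regex patterns are hypothetical (MRN formats differ per facility, and DOB appears in many date formats), the `redact` function name is ours for this example, and the free-text classifier is omitted entirely.

```python
import re

# Hypothetical patterns; a real deployment tunes these per facility.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(text: str, patient_names: list[str]) -> str:
    """Replace high-confidence identifiers with bracketed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    # Longest names first so "Jane Doe" is replaced before "Jane".
    for name in sorted(patient_names, key=len, reverse=True):
        if name:  # skip empty strings, which would match everywhere
            text = re.sub(
                rf"\b{re.escape(name)}\b", "[NAME]", text, flags=re.IGNORECASE
            )
    return text


note = "Jane Doe, MRN: 12345678, born 04/12/1961, SSN 123-45-6789"
safe_note = redact(note, ["Jane Doe"])
```

Only `safe_note` would cross the boundary; `note` stays inside the VPC.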
6. End-to-end audit logging on every model and tool call
When a clinician three weeks from now asks why the agent produced a specific answer, the audit log must reconstruct the full chain: input prompt, retrieved context (with document IDs and citation links), model version + prompt version + temperature, tool calls and their arguments and responses, the final output, and the user identity that triggered it. The log is tamper-evident (append-only, with cryptographic chaining where the customer's compliance posture requires it) and retention matches the customer's medical-records retention policy — typically six to ten years in the US, jurisdiction-dependent elsewhere.
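For the cryptographic-chaining option, one common approach is a hash chain, where each entry's hash covers the previous entry's hash, so editing or deleting any earlier record invalidates everything after it. The sketch below assumes an in-memory list and hypothetical field names; a real log would sit on append-only storage and the record schema would follow the customer's requirements.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    payload = json.dumps(record, sort_keys=True)  # stable serialisation
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], record: dict) -> dict:
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, record),
    }
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_entry(audit_log, {
    "user": "clinician-17",            # hypothetical identifiers throughout
    "prompt_version": "v12",
    "model_version": "llama-3-70b",
    "temperature": 0.0,
    "retrieved_doc_ids": ["formulary-204"],
    "output": "See cited formulary entry.",
})
append_entry(audit_log, {"user": "clinician-17", "tool": "lookup", "args": {"id": 9}})
```

Calling `verify_chain(audit_log)` during an audit shows whether any entry was altered after the fact.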
7. Refusal layer for safety-critical query categories
A confidently wrong AI answer in healthcare can kill people. Every clinical AI deployment ships with an explicit refusal layer scoped to safety-critical query categories: drug dosage, drug interactions, contraindications, triage severity, and dangerous procedures. On those categories, the model is configured to say "I cannot answer this; escalating to a clinician" rather than guess, even when retrieval returned a partially relevant chunk. The refusal threshold is tuned against a clinician-reviewed eval set, not against a generic confidence score the model emits.
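A rough sketch of how such a threshold might be derived from the eval set rather than from model confidence. Everything named here is an assumption for illustration: the category labels, the `support_score` (imagined as a retrieval-support score from the eval rubric), and the deliberately conservative rule of refusing at or below the highest score a clinician marked as a must-refuse case.

```python
SAFETY_CRITICAL = frozenset({
    "drug_dosage", "drug_interaction", "contraindication",
    "triage_severity", "dangerous_procedure",
})
REFUSAL_MESSAGE = "I cannot answer this; escalating to a clinician"


def tune_threshold(examples: list[tuple[float, bool]]) -> float:
    """examples are (support_score, clinician_marked_refuse) pairs.

    Returns the highest score among must-refuse cases, so every case the
    clinician marked for refusal is refused at this threshold.
    """
    must_refuse = [score for score, refuse in examples if refuse]
    return max(must_refuse) if must_refuse else float("-inf")


def should_refuse(category: str, support_score: float,
                  thresholds: dict[str, float]) -> bool:
    if category not in SAFETY_CRITICAL:
        return False
    # A safety-critical category with no signed-off threshold always refuses.
    return support_score <= thresholds.get(category, float("inf"))


eval_examples = [(0.9, False), (0.4, True), (0.6, True), (0.8, False)]
thresholds = {"drug_dosage": tune_threshold(eval_examples)}
```

The trade-off in this rule is intentional: it will sometimes refuse answerable questions, which is the cheaper error on these categories.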
8. Required citations on every clinical answer
Every retrieval-augmented answer links every clinical claim to a retrieved source — formulary entry, clinical guideline section, protocol document, or prior visit note. Answers without citations are rejected before display. The citation is rendered inline with the answer (not as a footer the user can ignore) and clicking it surfaces the source passage in context. This is the structural defence against hallucination; the model is constrained by what retrieval surfaced, and the user can verify any individual claim.
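The reject-before-display rule can be sketched as a small gate. This assumes the model returns structured output in which each claim carries its source IDs; the `Claim` type and function names are hypothetical. The gate checks two things: every claim cites something, and every cited ID was actually in the retrieved set.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    source_ids: list[str] = field(default_factory=list)


def failing_claims(claims: list[Claim], retrieved_ids: set[str]) -> list[Claim]:
    """Claims with no citation, or citing a document retrieval never returned."""
    return [
        claim for claim in claims
        if not claim.source_ids or not set(claim.source_ids) <= retrieved_ids
    ]


def passes_citation_gate(claims: list[Claim], retrieved_ids: set[str]) -> bool:
    """An answer is displayable only if it has claims and none of them fail."""
    return bool(claims) and not failing_claims(claims, retrieved_ids)


retrieved = {"formulary-204", "guideline-7.2"}
good_answer = [Claim("Adjust dose for renal impairment.", ["formulary-204"])]
bad_answer = [
    Claim("Adjust dose for renal impairment.", ["formulary-204"]),
    Claim("No monitoring is needed."),                    # uncited
    Claim("Safe in pregnancy.", ["guideline-99"]),        # not retrieved
]
```

Note that this gate only proves a citation exists and points at retrieved material; whether the source actually supports the claim is a separate check in the eval rubric.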
9. Clinician-reviewed eval set in week one
The eval set is built before the model is selected, before the retrieval architecture is finalised, and before any user sees the system. A senior clinician on the customer side reviews 200-500 representative queries, agrees on the correct answer and the correct refusal cases, and signs off on the safety-critical categories' acceptance thresholds. This becomes the contract for the build. Every prompt change, every model swap, every chunking tweak runs against the eval set before merging. Regression past threshold blocks deployment — see our [eval harness blog post](/blog/ai-evals-from-scratch) for the underlying pattern.
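The "regression past threshold blocks deployment" rule reduces to a per-category comparison in CI. The sketch below assumes each eval result is a dict with `category` and `passed` keys and that the clinician-signed thresholds are minimum pass rates; both the shapes and the numbers are illustrative.

```python
from collections import defaultdict


def category_pass_rates(results: list[dict]) -> dict[str, float]:
    """Fraction of eval cases passed, per category."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for result in results:
        totals[result["category"]] += 1
        passed[result["category"]] += int(result["passed"])
    return {category: passed[category] / totals[category] for category in totals}


def blocking_categories(results: list[dict],
                        thresholds: dict[str, float]) -> list[str]:
    """Categories below their signed-off minimum; a missing category counts as 0."""
    rates = category_pass_rates(results)
    return sorted(
        category for category, minimum in thresholds.items()
        if rates.get(category, 0.0) < minimum
    )


eval_results = [
    {"category": "drug_dosage", "passed": True},
    {"category": "drug_dosage", "passed": False},
    {"category": "general", "passed": True},
    {"category": "general", "passed": True},
]
signed_thresholds = {"drug_dosage": 1.0, "general": 0.9}
blocked = blocking_categories(eval_results, signed_thresholds)
```

In a CI job, a non-empty `blocked` list would translate to a non-zero exit code so the merge cannot proceed.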
10. Production drift monitoring against the eval set
CI evals catch known regressions; production evals catch unknown ones. Sample 5-10% of production traffic, run it through the clinician-validated rubric (LLM-as-judge plus deterministic checks on citation correctness, refusal correctness, retrieval recall), and alert when category-level metrics shift more than two standard deviations from baseline. Drug-dosage and triage categories get tighter thresholds (alert on any single-standard-deviation shift). The customer's clinical informatics team receives the drift report weekly.
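The alerting rule above is a z-score check with per-category limits. A minimal sketch, assuming a list of recent baseline values per metric and hypothetical category names; the sample standard deviation from Python's `statistics` module stands in for whatever baseline statistics the monitoring stack actually computes.

```python
from statistics import mean, stdev

DEFAULT_LIMIT = 2.0  # standard deviations
TIGHT_LIMITS = {"drug_dosage": 1.0, "triage_severity": 1.0}


def sigma_shift(baseline: list[float], current: float) -> float:
    """Distance of `current` from the baseline mean, in sample std deviations."""
    centre = mean(baseline)
    spread = stdev(baseline)  # requires at least two baseline points
    if spread == 0:
        return 0.0 if current == centre else float("inf")
    return abs(current - centre) / spread


def should_alert(category: str, baseline: list[float], current: float) -> bool:
    limit = TIGHT_LIMITS.get(category, DEFAULT_LIMIT)
    return sigma_shift(baseline, current) > limit


weekly_citation_accuracy = [0.96, 0.97, 0.98, 0.97, 0.96]
this_week = 0.955  # roughly 1.5 standard deviations below the baseline mean
```

With these numbers the same reading alerts for `drug_dosage` but not for a general category, which is the asymmetry the tighter thresholds are meant to produce.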
11. Encryption, key management, and customer-managed keys
Data at rest is encrypted with customer-managed KMS keys — not the cloud provider's default keys. The customer retains control of key rotation, key access policies, and the ability to revoke. Data in transit is TLS 1.2+ end-to-end including between internal services (no plaintext on the internal network). Database backups are encrypted with the same customer-managed keys. The customer can audit which services have decrypt access at any point in time.
12. Documented runbooks, on-call, and incident response
The HIPAA Security Rule requires documented incident response. We ship every healthcare engagement with runbooks covering common operational scenarios (model degradation, retrieval failure, refusal-rate spike, latency regression), an on-call rotation that names the responsible engineer for each business-hours window, and a breach-notification runbook aligned to the customer's compliance team and built around the Breach Notification Rule's deadline: notify without unreasonable delay, and no later than 60 days after discovery. If the customer's existing incident-response plan covers AI systems, we slot into it; if not, we draft the AI-specific addendum during the engagement.
Three deal-breakers worth calling out
Three things will end the engagement before the rest of the checklist matters.
- Sending PHI to a hosted LLM provider that will not sign a BAA. Non-negotiable. Architecture pivots to self-hosted Llama or another open-weight model.
- Designing an AI that takes irreversible clinical action without a human in the loop. Every Aiinfox healthcare deployment defers final clinical decisions to a clinician — period.
- Logging PHI to a generic observability vendor (Datadog, Sentry) without redaction. Logs are PHI when they contain identifiers. Redact at ingress to the logging layer, not at the dashboard.
Wrapping up
HIPAA-compliant AI is mostly a discipline problem, not a technology problem. Pick the controls in week one, design the architecture around them, and gate every release against the clinician-reviewed eval set. The teams we have seen succeed treat compliance as a first-class engineering requirement; the teams that retrofit it after a working demo end up rebuilding under audit pressure. Run the checklist before the build. Build for the checklist. Ship the AI clinicians actually adopt.
Frequently asked questions
Do we need a BAA with our LLM provider for a HIPAA-compliant deployment?
Yes — if any PHI will cross to the LLM provider during inference. Anthropic, OpenAI (specific Enterprise tiers), and Google Cloud all offer BAAs under their healthcare SKUs; Azure OpenAI is BAA-covered by default. For deployments where a BAA cannot be put in place, the architecture must pivot to a self-hosted open-weight model (Llama 3, Mistral) running inside the customer's VPC. See our [healthcare AI development page](/healthcare-ai-development) for the standard self-hosted pattern.
Can we use ChatGPT or Claude directly in a clinical workflow?
Only if a BAA is in place with the provider, the engagement risk tolerance accepts external inference, and PHI redaction is wired in at the boundary. For most clinical workflows we recommend [a self-hosted Llama 3 fine-tune](/case-studies/healthcare-llm-finetuning) inside the customer VPC instead — the egress concern disappears and the model can be tuned on de-identified clinical data without a third-party processor.
How long does a HIPAA-compliant AI deployment take to ship?
Six to eight weeks for a clinical chatbot or RAG agent pilot in a single facility. Eight to twelve weeks for an AI HMS implementation including HL7 / FHIR migration and clinician training. Twelve weeks for a fine-tuned healthcare LLM with curated training set and self-hosted deployment. Multi-facility rollouts replicate per site after the first — typically four weeks each.
Will the AI replace clinicians in your healthcare deployments?
No. Every Aiinfox healthcare deployment is built so the AI assists, scribes, summarises, retrieves, and flags — but every clinical decision remains with the clinician. The system defers explicitly on safety-critical queries via the refusal layer described above. This is both a compliance requirement and the only design clinicians actually adopt long-term. To scope a HIPAA-aligned AI build for your facility, [book a 30-minute discovery call](/contact-us).
