The US AI services market in 2026 is full of people calling themselves AI development companies. The pattern is familiar from the 2014 mobile-app gold rush and the 2018 blockchain wave: a flood of vendors with thin demos, a smaller number of teams that can actually ship, and almost no public way for a Series B CTO to tell them apart on the first call. The cost of getting this wrong is high. A failed engagement at this stage is not just a sunk fee — it is six months of internal calendar, a regulatory blind spot, and a board narrative that takes a year to repair.
I have led delivery on more than 50 production AI systems in the US, EU, and Asia, and I have audited around a dozen more after a previous vendor walked away. The verification questions below come directly from those takeover audits. None of them are clever. All of them are load-bearing. If a vendor cannot answer them in the first 30-minute call, the engagement will not survive the regulatory review six months in.
1. Verify the team — not the company
The single largest source of failed US AI engagements is bait-and-switch staffing. A senior engineer joins the sales call, the proposal lands, the contract is signed, and the engagement is then quietly handed to a different pool of junior engineers who were not on the call. Three months later the client discovers that the architecture they thought they bought was being learned on the job by people who had not touched a production LLM before.
The verification is direct. Ask which named engineer will write production code. Ask how many years they have shipped LLM systems specifically — not generic software, not data science. Ask for their commits on prior public or referenceable work. Ask whether they will be on every standup. If the answer is hedged, the engagement is staffed by a junior bench and you are paying senior rates for it.
2. HIPAA — go beyond the BAA tick-box
Every US healthcare AI vendor will say they sign Business Associate Agreements. That is the minimum, not the differentiator. The differentiators are the architectural decisions a vendor makes to honour the BAA, and most of those decisions get made in the first two weeks of the engagement — which is why they need to be verified before signing.
- Will inference run inside our VPC, or will PHI cross to a hosted LLM provider? If the latter, which provider, and is there a separately signed BAA with that provider?
- How is PHI redacted at every boundary that crosses a less-trusted system (logging, analytics, external APIs)?
- How are model and tool-call audit logs structured so that a clinician can reconstruct any specific answer six weeks after the fact?
- How does the refusal layer handle safety-critical query categories (drug dosage, triage severity, contraindications)?
- What is the eval set used to validate clinical answers, and who from our clinical staff signs off on it?
A US vendor with real HIPAA experience can answer these in narrative on the call. A vendor reading from a compliance checklist will hedge. See our [HIPAA AI development services page](/hipaa-ai-development-usa) and the [12-point HIPAA AI deployment checklist](/blog/hipaa-ai-deployment-checklist) for the underlying pattern we run.
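To make the redaction question concrete, here is a minimal sketch of scrubbing identifiers before a log line crosses into a less-trusted system. The patterns, the `redact` helper, and the `RedactingFilter` class are illustrative assumptions, not a complete PHI detector; a real deployment would likely pair pattern rules like these with a dedicated de-identification step and a clinician-reviewed test set.

```python
import logging
import re

# Illustrative patterns only (assumed formats). Real PHI detection needs far
# broader coverage: names, addresses, dates, free-text clinical notes.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}


def redact(text: str) -> str:
    """Replace every match of a known pattern with a bracketed label."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message in place."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, so values passed as log arguments are scrubbed too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the record, now redacted
```

Attaching the filter to each handler (`handler.addFilter(RedactingFilter())`) puts the scrub at the boundary itself, so it does not depend on every call site remembering to redact.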
3. SOC 2 — the type matters, the scope matters more
SOC 2 Type II is the bar most US enterprise procurement teams expect. A vendor saying they are SOC 2 Type I attested is a yellow flag: a Type I report describes how controls were designed at a point in time, not how they operated. A Type II report covers an observation window, typically somewhere between three and twelve months, and that operating evidence is what enterprise security teams actually want.
More important than the report type is the scope of the report. A vendor's SOC 2 report covers a specific set of systems and services. If the AI engagement will run inside the customer's VPC, the SOC 2 scope needs to cover the development practices, code review, secret handling, and access controls that operate in the customer environment. Ask for the SOC 2 report under NDA and read the scope section, not just the cover page. Our [SOC 2-aligned AI development page](/soc-2-ai-development-usa) details the pattern we run for enterprise procurement reviews.
4. CCPA and the multi-state privacy patchwork
California's CCPA / CPRA is the most-cited US state privacy regime, but it is no longer alone. As of 2026, 19 US states have comprehensive consumer-privacy laws on the books: California plus Virginia, Colorado, Connecticut, Utah, Texas, Oregon, Montana, Iowa, Tennessee, Indiana, Delaware, New Hampshire, New Jersey, Kentucky, Maryland, Minnesota, Rhode Island, and Nebraska. A vendor that designs a US-scope AI system and ignores the multi-state patchwork is shipping a system that is unlikely to survive a state attorney general review.
The specific verifications: how are deletion rights propagated through embeddings and retrieval indexes (deleting a row in the source database does not delete the embedded chunk); how is the consumer right to opt out of automated decisioning honoured in the agent's decision logic; how is the cross-state data flow logged for the audit trail the state regulators ask for. Vendors that treat privacy as a legal-team problem rather than an engineering-team problem will fail these verifications.
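The deletion-propagation point is easiest to see in code. The sketch below assumes every embedded chunk carries the ID of the source row it was derived from; `ToyIndex` is a hypothetical in-memory stand-in for a vector store, not any real library's API, and `delete_record` is an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    source_record_id: str  # lineage back to the row the text came from
    embedding: list[float]


@dataclass
class ToyIndex:
    """Hypothetical in-memory stand-in for a vector store."""

    chunks: dict[str, Chunk] = field(default_factory=dict)

    def add(self, chunk: Chunk) -> None:
        self.chunks[chunk.chunk_id] = chunk

    def delete_by_source(self, source_record_id: str) -> int:
        """Remove every chunk derived from one source row; return the count."""
        doomed = [
            cid
            for cid, chunk in self.chunks.items()
            if chunk.source_record_id == source_record_id
        ]
        for cid in doomed:
            del self.chunks[cid]
        return len(doomed)


def delete_record(db: dict[str, str], index: ToyIndex, record_id: str) -> dict:
    """Delete the source row and its derived chunks; return an audit entry."""
    row_deleted = db.pop(record_id, None) is not None
    chunks_deleted = index.delete_by_source(record_id)
    return {
        "record_id": record_id,
        "row_deleted": row_deleted,
        "chunks_deleted": chunks_deleted,
    }
```

The design choice worth asking a vendor about is the `source_record_id` field: without that lineage stored at ingestion time, there is no reliable way to find the chunks a deletion request should remove.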
5. Eval-first delivery, or the engagement will drift
Almost every failed US AI engagement I have audited had the same missing piece — there was no eval harness, or there was one but it was built after the system was already in production and nobody trusted it. The model that worked in the prompt playground started misbehaving the moment real users hit it, and the team had no way to know which prompt change broke what.
Ask the vendor in the kickoff call: when does the eval set get built? If the answer is anything other than "week one, before architecture choices are finalised", the engagement will drift. The eval set is the contract: 200-500 representative queries with correct answers, correct citations, and correct refusal flags, run in CI as a gate on every prompt change. See the [LLM eval harness post](/blog/ai-evals-from-scratch) for the structural pattern.
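As a rough illustration of what a CI gate over an eval set can look like, here is a minimal sketch. The `EvalCase` and `Response` shapes, the substring match on answers, and the 0.95 threshold are all assumptions made for the example; a production harness would use richer scoring and per-category thresholds.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    query: str
    expected_answer: str    # substring the answer must contain (assumed scoring rule)
    expected_citation: str
    should_refuse: bool


@dataclass(frozen=True)
class Response:
    answer: str
    citation: str
    refused: bool


def case_passes(case: EvalCase, resp: Response) -> bool:
    """A refusal case passes only on refusal; others need answer and citation."""
    if case.should_refuse:
        return resp.refused
    return (
        not resp.refused
        and case.expected_answer.lower() in resp.answer.lower()
        and resp.citation == case.expected_citation
    )


def run_gate(
    cases: Sequence[EvalCase],
    system: Callable[[str], Response],
    threshold: float = 0.95,  # assumed pass bar, set per project
) -> tuple[bool, float]:
    """Run every case through the system; return (gate_ok, pass_rate)."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(case_passes(case, system(case.query)) for case in cases)
    rate = passed / len(cases)
    return rate >= threshold, rate
```

In CI, the job would call `run_gate` with the system under test and exit non-zero when the first element is `False`, which blocks the prompt change from merging.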
6. Time-zone honesty — US-hours coverage versus US-located
Many US buyers conflate "US-located" with "US-hours coverage". They are different. A 100% US-located vendor working 9-5 PT and dark on weekends overlaps with only the afternoon of an East Coast customer's working day, and with none of a European partner's. A globally distributed vendor whose named senior engineer commits to the customer's working hours can deliver more usable coverage than a US-located team in a different US time zone with a different on-call schedule.
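The coverage gap is simple arithmetic, sketched below. The example assumes fixed standard-time UTC offsets (PT as UTC-8, ET as UTC-5, Central European Time as UTC+1) and 9-5 working days for every party; production code would more likely use the standard library's `zoneinfo` so daylight-saving shifts are handled correctly.

```python
from datetime import date, datetime, timedelta, timezone


def utc_window(day: date, start_hour: int, end_hour: int, utc_offset_hours: int):
    """Convert a local working window on `day` into a (start, end) pair in UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime(day.year, day.month, day.day, start_hour, tzinfo=tz)
    end = datetime(day.year, day.month, day.day, end_hour, tzinfo=tz)
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc)


def overlap_hours(a, b) -> float:
    """Hours shared by two (start, end) windows; 0.0 when they do not overlap."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max((end - start).total_seconds() / 3600, 0.0)


day = date(2026, 1, 14)            # arbitrary winter weekday, standard time assumed
vendor_pt = utc_window(day, 9, 17, -8)
customer_et = utc_window(day, 9, 17, -5)
partner_cet = utc_window(day, 9, 17, 1)

print(overlap_hours(vendor_pt, customer_et))   # 5.0
print(overlap_hours(vendor_pt, partner_cet))   # 0.0
```

Five shared hours with the East Coast customer and none with the European partner is the number to put next to any "US-located" claim.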
What to verify: which named engineer is on what hours; how is the on-call rotation structured during the build phase versus post-launch; how is the customer's clinical or compliance team's review feedback turned around within their working window. The vendor that has thought through this for US customers — not just announced an "East Coast availability" pseudo-policy — has shipped enough US engagements to take the time-zone math seriously. Our [USA AI development page](/ai-development-company-usa) and our regional pages for [New York](/ai-development-company-new-york), [San Francisco](/ai-development-company-san-francisco), and [Austin](/ai-development-company-austin) detail how coverage works per market.
7. Takeover audit signs — the system you will inherit
Around a third of the AI engagements I have led in the last two years started as takeover audits. A prior vendor shipped, walked away, the system started drifting, and the client called us in to evaluate. The pattern of what makes a takeover painful versus straightforward is consistent enough to be a verification checklist for the vendor you are about to hire.
- Is the eval set documented, versioned, and checked into the customer's repo? No → takeover will be painful.
- Are prompt versions tagged with model version, retrieval config, and tool definitions? No → takeover will be painful.
- Is observability instrumented with per-step latency, cost per turn, and category-level accuracy? No → the system is opaque after handoff.
- Are runbooks, on-call docs, and incident-response playbooks part of the deliverables? No → the customer's team cannot operate the system.
- Is every secret rotated and managed via a KMS / secret store, not hardcoded in code or prompts? No → security review will fail.
Ask the vendor whether all five are part of the deliverable. Ask to see a redacted example from a prior engagement. A vendor that ships these as standard has thought about what happens after the engagement ends. A vendor that treats them as a phase-2 retainer upsell has not.
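For the prompt-versioning item in particular, one way to make "tagged with model version, retrieval config, and tool definitions" checkable is to fingerprint all four together. The sketch below is an assumption about how that could be done, not a description of any specific vendor's tooling; `PromptRelease` and the 12-character truncation are illustrative choices.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptRelease:
    """Everything that determines behaviour for one prompt version."""

    prompt_text: str
    model_version: str
    retrieval_config: dict
    tool_definitions: list

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so the same content
        # always hashes the same, regardless of dict insertion order.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Logging the fingerprint with every model call lets a takeover team tie any production answer back to the exact prompt, model, retrieval settings, and tools that produced it.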
8. Fixed-price scope versus T&M — what risk is being transferred
Time-and-materials engagements transfer schedule and scope risk to the customer. Fixed-price engagements transfer it to the vendor — assuming the scope is well-defined and the vendor stands behind the overrun. The right question is not "which model is cheaper?" because at the right scope they are equivalent on cost; it is "which party is incentivised to ship on time?"
Fixed-price done badly is a vendor that pads the estimate by 40% to absorb risk, which is worse for the customer than honest T&M. Fixed-price done well is a vendor that scopes carefully, commits to a six-week target, and eats the overrun cost if they miss for reasons on their side. Ask explicitly: if you miss the six-week target for reasons on your side, who pays? The vendors that flinch at that question have not figured out how to deliver fixed-price honestly.
9. Industry-specific verification — healthcare and fintech
Generic AI vendor verification is necessary but not sufficient for regulated US industries. Healthcare adds the BAA, the clinician-validated eval set, the safety-critical refusal layer, and the FDA digital-health guidance for clinical decision support. Fintech adds the OCC / FFIEC model risk management framework, the SR 11-7 model validation expectations, and the explainability requirements for ECOA-covered decisioning.
A vendor with real healthcare experience can talk about FHIR mappings, HL7 integration patterns, and refusal calibration on drug-dosage queries — not just say the word HIPAA. A vendor with real fintech experience can talk about SR 11-7 model validation documentation, adverse action notice generation for credit decisions, and OCC examiner expectations — not just say the word SOC 2. See our [healthcare AI development for the US](/healthcare-ai-development-usa) and [fintech AI development for the US](/fintech-ai-development-usa) pages for the regulatory patterns we operate against.
10. References — talk to a customer who terminated
Every vendor will hand you references that loved them. Useful, but selection-biased. Ask instead for a reference from a customer who ended the engagement after launch — not as a failure, but as a successful handoff to the customer's internal team. A vendor that ships engagements designed to be operable by the customer's own team will have those references. A vendor that ships engagements designed to retain the customer indefinitely on a support retainer will not.
This is also where you verify the senior-engineer claim in real-world terms. Ask the reference: did the senior engineer who was on the sales call stay on the engagement end-to-end? Did the same humans run the project across the 6-12 week build window? If the reference hedges, the staffing model is bait-and-switch.
What to walk away from
Three patterns are immediate disqualifiers in a US AI vendor evaluation in 2026. First, any vendor that cannot name the senior engineer who will write production code on the first 30-minute call. Second, any vendor that proposes shipping the system before building the eval set. Third, any vendor that quotes "AI strategy" or "AI discovery" as a billable phase before the production build — the discovery is a 30-minute call, not a six-week engagement, and the vendors who bill for it have not figured out how to monetise the actual build.
Wrapping up
The US AI services market in 2026 has a wide quality range and a thin set of useful filters. The verification questions above are the ones that survive contact with the regulatory review six months in. Run them on the first call, insist on written commitments, and prioritise vendors whose engineering discipline matches the regulatory weight of the workload you are buying.
If you are scoping an AI build for a US healthcare, fintech, or enterprise SaaS workload — and you want a 30-minute conversation where we answer the questions above on the call rather than after the contract — [book a discovery call](/contact-us). One conversation, one fixed-price scope inside 72 hours, and an honest read on whether the engagement is the right fit.
