Industry June 2, 2026 12 min read

AI Vendor Takeover Audit: 7 Signs Your Current Vendor Isn't Shipping

Seven warning signs that an AI engagement is stuck (no evals, no observability, no audit trail, junior-pool swap, perpetual discovery, scope creep without code, demo-only output) and the takeover process to recover.


Manjeet Singh

Senior engineering team · Aiinfox

Roughly a third of the AI engagements I have led in the last two years started as takeover audits. The pattern is consistent enough to be predictable: a prior vendor was hired in good faith, the engagement looked healthy for the first six to eight weeks, and then the velocity quietly stalled. By month four the customer was paying for a system that looked like progress on the standup but never seemed to move past "final integration testing." By month six the customer realised the vendor was running out the clock.

The seven signs below are the symptoms we look for in the first audit call. If three or more are present, the engagement is not going to recover under the current vendor. The economics of waiting it out are worse than the economics of switching — and the longer the takeover is delayed, the harder the rebuild gets. Below are the symptoms, what they mean structurally, and the takeover process we run when a customer makes the call.
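
The three-or-more rule can be written down as a small checklist. This is only an illustrative sketch: the sign identifiers are hypothetical names derived from the section headings in this article, and the threshold is the "three or more" stated above.

```python
# Sign identifiers are illustrative names taken from the headings in this article.
SIGNS = (
    "no_eval_harness",
    "no_llm_observability",
    "no_audit_trail",
    "senior_engineer_absent",
    "perpetual_discovery",
    "scope_creep_without_code",
    "demo_only_output",
)

TAKEOVER_THRESHOLD = 3  # "three or more" from the text above


def audit(observed: set) -> dict:
    """Count which of the seven signs are present and apply the threshold."""
    unknown = observed - set(SIGNS)
    if unknown:
        raise ValueError(f"unknown signs: {sorted(unknown)}")
    present = [sign for sign in SIGNS if sign in observed]
    return {
        "present": present,
        "count": len(present),
        "takeover_recommended": len(present) >= TAKEOVER_THRESHOLD,
    }


result = audit({"no_eval_harness", "perpetual_discovery", "demo_only_output"})
print(result["count"], result["takeover_recommended"])
```

The value of writing it down is less the arithmetic than the discipline: each sign gets a yes or no answer backed by evidence from the repo, the dashboards, or the deployment config.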

Sign 1: No eval harness in the repo

Open the customer's repo and search for a tests directory containing the LLM eval set. If there isn't one — or there's a stub that hasn't been touched in three months — the engagement has no automated way to measure whether the system is getting better or worse. Every prompt edit is a leap of faith, every model swap is unmeasured, and every regression goes silent until a user complains.

This is the single highest-leverage symptom because it explains so many of the others. A team running a real [eval harness](/blog/llm-eval-harness-101) catches drift in days. A team without one finds out about regressions from customer complaints in months. The absence of the eval is the absence of a feedback loop, and without a feedback loop the team is iterating blind.
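
For readers who have not seen one, a minimal sketch of what an eval harness does follows. Everything here is an assumption made for illustration: the `EvalCase` shape, the substring check, the category names, the 0.9 thresholds, and the `fake_model` stand-in that replaces a real model call so the example runs offline. Real suites use richer graders and far more cases.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    category: str
    prompt: str
    must_contain: str  # simplest possible check; real suites use richer graders


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Return the pass rate per category for a model callable."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in cases:
        total[case.category] = total.get(case.category, 0) + 1
        answer = model(case.prompt)
        if case.must_contain.lower() in answer.lower():
            passed[case.category] = passed.get(case.category, 0) + 1
    return {cat: passed.get(cat, 0) / n for cat, n in total.items()}


def gate(pass_rates: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the categories whose pass rate falls below the gating threshold."""
    return [cat for cat, floor in thresholds.items() if pass_rates.get(cat, 0.0) < floor]


# Stand-in for the real model call, so the sketch runs offline.
def fake_model(prompt: str) -> str:
    if "refund" in prompt:
        return "You can request a refund within 30 days."
    return "I am not sure."


CASES = [
    EvalCase("refunds", "What is the refund window?", "30 days"),
    EvalCase("refunds", "How do I get a refund?", "refund"),
    EvalCase("shipping", "How long does shipping take?", "business days"),
]

rates = run_evals(fake_model, CASES)
failing = gate(rates, {"refunds": 0.9, "shipping": 0.9})
print(rates, failing)
```

Run on every prompt edit and model swap, a gate like this turns a silent regression into a failed check before users see it.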

Sign 2: No observability instrumented

Ask the vendor to show the production dashboard. If the answer is a Grafana page with CPU and memory graphs and nothing LLM-specific — no per-call traces, no token-cost tracking, no latency histograms broken down by span, no refusal-rate dashboard, no tool-call success rate — the system is a black box in production. The vendor cannot answer the question "why was last Tuesday slower than the Tuesday before?" because the data does not exist.

See our [agent observability checklist](/blog/ai-agent-observability-what-to-instrument) for what should be instrumented. The presence or absence of LLM-specific observability is the second-strongest predictor of whether the engagement is in good hands.
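
As a rough illustration of the LLM-specific numbers such a dashboard rests on, the sketch below aggregates per-call trace records into cost, refusal rate, tool-call success rate, and p95 latency by span. The `CallTrace` fields and the per-token prices are hypothetical placeholders, not any provider's real schema or rates.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CallTrace:
    span: str                      # e.g. "retrieval" or "generation"
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    refused: bool
    tool_call_ok: Optional[bool]   # None when the call used no tool


# Hypothetical prices per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K_PROMPT = 0.003
PRICE_PER_1K_COMPLETION = 0.015


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def summarise(traces: List[CallTrace]) -> dict:
    """Aggregate trace records into the headline LLM metrics."""
    if not traces:
        raise ValueError("no traces to summarise")
    cost = sum(
        t.prompt_tokens / 1000 * PRICE_PER_1K_PROMPT
        + t.completion_tokens / 1000 * PRICE_PER_1K_COMPLETION
        for t in traces
    )
    tool_calls = [t for t in traces if t.tool_call_ok is not None]
    by_span: Dict[str, List[float]] = {}
    for t in traces:
        by_span.setdefault(t.span, []).append(t.latency_ms)
    return {
        "cost_usd": round(cost, 6),
        "refusal_rate": sum(t.refused for t in traces) / len(traces),
        "tool_call_success_rate": (
            sum(bool(t.tool_call_ok) for t in tool_calls) / len(tool_calls)
            if tool_calls else None
        ),
        "p95_latency_ms_by_span": {span: p95(v) for span, v in by_span.items()},
    }


traces = [
    CallTrace("retrieval", 120.0, 0, 0, False, None),
    CallTrace("generation", 900.0, 1000, 200, False, True),
    CallTrace("generation", 1500.0, 2000, 100, True, False),
    CallTrace("retrieval", 80.0, 0, 0, False, None),
]
summary = summarise(traces)
print(summary)
```

If records like these exist for every production call, the "why was last Tuesday slower" question becomes a query instead of a guess.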

Sign 3: No structured audit trail of model calls

In any regulated environment — healthcare, finance, legal, government — the customer's compliance team will eventually ask: "show me the full input/output for the model call that produced this answer to this user at this timestamp." If the vendor cannot produce the trace in under 30 seconds, the audit will fail. If the vendor says "we will need to enable logging and then we can reproduce going forward," the regulator will not be satisfied.
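
A minimal sketch of the kind of record that makes that request answerable is shown below. It is an assumption-laden illustration: the `AuditLog` class, its field names, and the in-memory list standing in for durable, access-controlled storage are all hypothetical, and a real deployment would also need retention and redaction policies.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List


class AuditLog:
    """Append-only log of model calls; an in-memory list stands in for durable storage."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def record(self, user_id: str, model: str, prompt: str, output: str, at: datetime) -> str:
        """Store the full input and output of one call; `at` must be timezone-aware."""
        entry = {
            "user_id": user_id,
            "timestamp": at.astimezone(timezone.utc).isoformat(),
            "model": model,
            "input": prompt,
            "output": output,
        }
        body = json.dumps(entry, sort_keys=True)
        entry_id = hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]
        self._lines.append(json.dumps({"id": entry_id, **entry}, sort_keys=True))
        return entry_id

    def find(self, user_id: str, start: datetime, end: datetime) -> List[dict]:
        """Return every recorded call for a user inside a time window."""
        hits = []
        for line in self._lines:
            entry = json.loads(line)
            ts = datetime.fromisoformat(entry["timestamp"])
            if entry["user_id"] == user_id and start <= ts <= end:
                hits.append(entry)
        return hits


log = AuditLog()
t0 = datetime(2026, 6, 2, 14, 30, tzinfo=timezone.utc)
log.record("user-42", "example-model", "Is my claim covered?", "Yes, under section 3.", t0)
log.record("user-7", "example-model", "Reset my password", "Follow the link sent by email.", t0)

hits = log.find(
    "user-42",
    datetime(2026, 6, 2, 14, 0, tzinfo=timezone.utc),
    datetime(2026, 6, 2, 15, 0, tzinfo=timezone.utc),
)
print(hits[0]["input"], "->", hits[0]["output"])
```

The key property is that the record is written at call time. Logging enabled after the fact cannot reconstruct what the model was shown or what it said.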

Sign 4: The senior engineer from the sales call disappeared

This is the bait-and-switch staffing pattern. A senior engineer joined the sales call, the proposal landed, the contract was signed, and then a different pool of junior engineers actually built the system. The senior shows up for occasional steering calls but does not write the code. By month three the customer realises the architecture they thought they bought was being learned on the job by engineers without LLM production experience.

Verification: ask for the git log filtered by author for the last three months. If the senior named in the SOW has fewer than 20% of the commits, they are not the engineer on the build. See our [evaluating offshore senior AI engineers post](/blog/evaluating-offshore-ai-senior-engineers) for the verification patterns that catch this before signing — and our [hiring AI development company in USA post](/blog/hiring-ai-development-company-usa-2026) for the contract structure that prevents it.
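
The commit-share check can be scripted. The sketch below assumes `git` is on the PATH and uses `git log` with `--since`, `--no-merges`, and `--format=%ae` to list author emails; the email addresses and sample counts are made up so the example runs without a repository. Commit counts are a crude proxy (squash merges, pairing, and multiple email addresses all distort them), so treat the result as a prompt for a conversation rather than a verdict.

```python
import subprocess
from collections import Counter
from typing import List


def author_emails(repo_path: str, since: str = "3 months ago") -> List[str]:
    """Return one author email per non-merge commit in the window."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}", "--no-merges", "--format=%ae"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line.strip().lower() for line in out.splitlines() if line.strip()]


def commit_share(emails: List[str], author_email: str) -> float:
    """Fraction of commits in the window written by one author."""
    if not emails:
        return 0.0
    return Counter(emails)[author_email.lower()] / len(emails)


# Made-up sample data; in practice use author_emails("/path/to/repo").
sample = (
    ["senior@vendor.example"] * 3
    + ["junior1@vendor.example"] * 14
    + ["junior2@vendor.example"] * 8
)
share = commit_share(sample, "senior@vendor.example")
print(f"{share:.0%}", "below the 20% line" if share < 0.20 else "at or above the 20% line")
```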

Sign 5: Perpetual discovery — no shipped milestone in 60+ days

The vendor is in week 14 of the engagement and there is still no production-deployed milestone. The standups are running, the Slack channel is busy, the documents are being written, but nothing has actually shipped to production. Every two-week sprint produces "final integration with the X system, will be ready next sprint." Then the next sprint produces "we found an issue with the Y dependency, will be ready next sprint."

Perpetual discovery is the engagement pattern where motion is mistaken for progress. The economics favour the vendor — they keep billing — and the customer keeps paying because the alternative (start over) feels worse than "two more sprints." The honest read is that two more sprints will produce the same two-more-sprints conversation. The takeover decision is almost always the right call once this pattern has held for 60 days.

Sign 6: Scope creep without code

The original scope was a RAG agent for customer support. By month four the scope has expanded to include voice integration, mobile app screens, CRM workflows, and an admin dashboard. The contract value has doubled. The deployed scope is still zero. This is scope creep without code — the vendor is selling more billable work without delivering on the original deliverable.

Healthy engagements ship the v1 scope to production before expanding to v2. Engagements that take on v2 scope while v1 is still incomplete almost always fail to deliver either. The honest scope discipline — "ship v1 first, then evaluate whether v2 is worth scoping" — is the discipline takeover-prone vendors lack.

Sign 7: Demo-only output, no production-deployed system

The vendor demos beautifully on every steering call. The chatbot answers test questions perfectly. The voice agent has a polished happy-path demo. The agent autonomously books a calendar appointment in the demo environment. But the system is not deployed to production users — it is running on a vendor-controlled staging environment with hand-picked test queries. The first time real users hit it, the failure rate is 4x higher than the demo suggested.

The honest read: a demo on a curated query set is not evidence the system works. The evidence is production traffic, with full observability, against a real eval set, sustained for at least four weeks. A vendor in month four with no production traffic is a vendor whose system has not been honestly tested.

The takeover process — how we run a recovery

The takeover process Aiinfox runs when a customer calls us in to recover a stuck engagement has three phases. The phases are deliberately ordered to surface the truth fast and avoid the rebuild-everything trap (which is the other failure mode in takeover engagements).

Phase 1: Read the code, instrument the gaps (weeks 1-2)

We start by reading the existing codebase end-to-end. Not interviewing the prior vendor, not reading the documents — reading the code. The git log tells the truth about who wrote what; the test files (or absence) tell the truth about what is measured; the deployment config tells the truth about what is shipped. Two engineers, two weeks, an honest written read on what the customer actually owns versus what the SOW described.

In parallel we instrument the basic observability gaps — token cost tracking, per-call traces, latency histograms — so by week two the customer can see what their system is actually doing in production. The instrumentation is non-invasive (no architecture changes) and produces a baseline measurement we can iterate against.
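
One way to keep instrumentation non-invasive is to wrap the existing call site rather than change it. The sketch below is a hypothetical illustration, not the tooling described above: the `traced` decorator, the `TRACES` list standing in for a real trace sink, and the whitespace-based token count are all assumptions.

```python
import functools
import time
from typing import Callable, List

TRACES: List[dict] = []  # stand-in for a real trace sink


def traced(span: str, count_tokens: Callable[[str], int] = lambda s: len(s.split())):
    """Record latency and rough token counts without changing the wrapped call.

    The default token counter splits on whitespace, which is only a crude
    approximation of a real tokenizer.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(prompt: str, *args, **kwargs):
            start = time.perf_counter()
            result = None
            error = None
            try:
                result = fn(prompt, *args, **kwargs)
                return result
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                TRACES.append({
                    "span": span,
                    "latency_ms": (time.perf_counter() - start) * 1000,
                    "prompt_tokens": count_tokens(prompt),
                    "completion_tokens": count_tokens(result) if isinstance(result, str) else 0,
                    "error": error,
                })
        return wrapper
    return decorator


# The existing function body stays untouched; only the decorator is added.
@traced("generation")
def answer(prompt: str) -> str:
    return "Refunds are accepted within 30 days."


reply = answer("What is the refund window?")
print(reply, TRACES[0]["prompt_tokens"], TRACES[0]["completion_tokens"])
```

Because the wrapper returns the original result and re-raises the original exception, callers see no behavioural change, which is what makes a before-and-after baseline comparison meaningful.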

Phase 2: Ship the smallest valuable change (weeks 3-5)

After the read, we pick the single smallest change that produces visible value for the customer and ship it to production. Not the rebuild, not the v2 scope — the smallest valuable change. Often this is a prompt refactor that reduces cost 40%, a retrieval-config change that lifts citation accuracy 8 points, or an eval set we backfill from the customer's support logs that immediately surfaces three production regressions.

The purpose of the smallest valuable change is to demonstrate to the customer that the takeover team can ship, and to demonstrate to the team that the system is workable. If the smallest valuable change ships in weeks 3-5, the customer's confidence is restored and the rest of the rebuild plan can be scoped honestly. If it does not ship, the customer learns early that the system needs a deeper rebuild than the audit suggested.

Phase 3: Rebuild plan, fixed-price, no scope creep (week 6+)

By week 6 we have read the code, instrumented the gaps, shipped a valuable change, and formed an honest picture of what needs rebuilding. We deliver a written rebuild plan: fixed-price, defined scope, weekly milestones, eval gates on every milestone, full handoff at the end. The plan is sized to the actual gap, not to maximise vendor billing. Customers typically come out of takeover engagements 4-8 weeks later with a system that ships and a delivery pattern they understand.

The cost of waiting vs the cost of switching

Customers stuck in a failing engagement usually delay the takeover decision by 60-90 days past the point where it is obvious. The reasoning is understandable — switching feels expensive, the sunk cost is real, and the chance the current vendor pulls it out feels like it must be non-zero. The honest math:

  • Wait 90 more days: ~$120,000-300,000 of additional billing to the failing vendor, no production system, eval debt accumulates, regulatory exposure widens.
  • Switch now to a senior takeover team: $40,000-120,000 fixed-price for the audit + recovery, production system shipping in 4-8 weeks, eval harness and observability instrumented, handoff-ready deliverables.
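
The ranges in the two bullets above can be compared directly. The sketch below uses only the figures quoted there. The last calculation adds one assumption that the bullets do not state: that a customer who waits 90 days still ends up paying for a takeover afterwards.

```python
# Dollar ranges (low, high) taken from the two bullets above.
WAIT_90_DAYS = (120_000, 300_000)  # additional billing to the failing vendor
SWITCH_NOW = (40_000, 120_000)     # fixed-price audit + recovery

# Most favourable comparison for waiting: cheapest wait vs most expensive switch.
best_case_for_waiting = WAIT_90_DAYS[0] - SWITCH_NOW[1]

# Least favourable comparison for waiting: most expensive wait vs cheapest switch.
worst_case_for_waiting = WAIT_90_DAYS[1] - SWITCH_NOW[0]

# Assumption (not stated in the bullets): the customer waits and then switches anyway.
wait_then_switch = (
    WAIT_90_DAYS[0] + SWITCH_NOW[0],
    WAIT_90_DAYS[1] + SWITCH_NOW[1],
)

print(best_case_for_waiting)   # 0
print(worst_case_for_waiting)  # 260000
print(wait_then_switch)        # (160000, 420000)
```

On these figures, waiting only breaks even in dollars when the cheapest wait meets the most expensive takeover, and even then there is no production system at the end of the 90 days.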

The wait-and-see path is almost always more expensive in dollars and far more expensive in calendar time. The pattern among customers we have helped: the takeover engagements that started 60 days into the stall recovered to production faster than the ones that started six months in. The structural reason: there was less wrong code to inherit, fewer broken assumptions to unwind, and less customer trust to rebuild.

What to ask the next vendor — to avoid the same trap

If you are coming out of a failed engagement and scoping the next one, the avoidance checklist is the same seven signs above run as questions in the kickoff:

  • Show me the eval set you would build in week one for our scope. What categories, how many cases, what gating thresholds?
  • Show me your observability instrumentation defaults — per-call traces, token cost, latency, refusal rate, tool-call success.
  • Show me a redacted audit-trail example from a prior production deployment.
  • Commit in writing that the senior named in the SOW will write 60%+ of the merged PRs.
  • Show me the milestone schedule — when is the first production-deployed milestone? (Should be week 4-6, not week 14.)
  • Show me the IP assignment language — does it cover prompts, evals, retrieval indexes, runbooks?
  • Show me the takeover deliverable — what does my team own at the end of the engagement?

Wrapping up

The seven signs are not subtle. They are the symptoms of an engagement where the vendor's delivery discipline does not match the customer's expectation. The hard part is not detecting them — most customers detect them by month three. The hard part is making the takeover decision before the sunk-cost reasoning extends the failing engagement by another quarter.

If you are evaluating whether a current AI engagement is recoverable or worth taking over, [book a discovery call](/contact-us). We will run a 30-minute audit against the seven-sign checklist and return an honest written read: is this engagement salvageable with the current vendor, or is the takeover math better? No upsell pressure on the call; we have walked away from takeovers where the current vendor was on track. The honest read is the whole product.

Tagged: AI vendor takeover, AI engagement audit, stuck AI project, AI vendor red flags, AI rescue engagement, failed AI project