Most AI development RFPs read like generic software procurement templates with the word "AI" sprinkled in. They ask for company history, team size, project lists, and a fixed-price quote against a one-paragraph scope. They do not ask the questions that actually predict whether the engagement will ship. The result is procurement teams selecting on price and brand recognition, and CTOs inheriting failed engagements six months later.

I have written and audited dozens of AI vendor RFPs across the US, UK, Canada, and Australia in the last three years. The 12 questions below come from those audits. Each one separates vendors who have shipped production AI from vendors who have read about it. Demand written answers, not slide decks. If a vendor cannot answer a question in writing inside 48 hours of receiving the RFP, the engagement will be discovery-phase for the rest of its life.

1. Who exactly will write the production code?

Ask for named engineers, their years of LLM production experience (not generic software), and a written commitment that the named engineer will be on every standup and write the merged PRs. Bait-and-switch staffing is the single largest source of failed AI engagements — a senior on the sales call, a junior pool on the build. Anchor the contract on names, not headcount.

2. Will an eval harness exist in week one — or week ten?

Eval-first means the test set is built before architecture choices are finalized. Vendors who build the eval set in week one anchor every subsequent decision on measurable accuracy. Vendors who defer it to phase two have a reason — they do not want their early prompt iterations measured. By the time the eval lands, the architecture is locked in and the regressions are silent. See our [LLM eval harness deep-dive](/blog/ai-evals-from-scratch) for the structural pattern.

Ask: in which sprint does the eval set get drafted, reviewed, and frozen?
Ask: how many test cases will the v1 eval set contain, broken down by category (factual, refusal, format, latency, cost)?
Ask: which member of our team will sign off on the eval set as representative of production traffic?
Ask: is the eval gated in CI so prompt or model changes that regress accuracy cannot merge?
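
The CI gate in that last question can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's actual harness: the `EvalCase` structure, the `run_model` callable, the exact-match grading, and the 0.02 tolerance are all hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    category: str  # e.g. "factual", "refusal", "format"
    prompt: str
    expected: str  # assumption: graded by exact match to keep the sketch short


def score_by_category(
    cases: List[EvalCase], run_model: Callable[[str], str]
) -> Dict[str, float]:
    """Return the fraction of passing cases in each category."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.category] += 1
        if run_model(case.prompt).strip() == case.expected:
            passed[case.category] += 1
    return {cat: passed[cat] / total[cat] for cat in total}


def regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # assumption: allowed drop before the gate fails
) -> List[str]:
    """List categories where the candidate scores more than `tolerance` below baseline.

    A category missing from the candidate run is treated as a score of 0.0.
    """
    return sorted(
        cat
        for cat, base in baseline.items()
        if candidate.get(cat, 0.0) < base - tolerance
    )
```

In a CI job, a non-empty list from `regressions` would fail the build, so a prompt or model change that lowers any category score cannot merge unnoticed. Scoring per category rather than overall keeps a gain in one category from hiding a loss in another.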

3. Where does inference run — and which provider sees the data?

The deployment architecture is the single most consequential decision in an AI engagement, and the decision gets made in the first two weeks. Ask whether inference will run inside your VPC, on a shared SaaS provider, or on the vendor's infrastructure. Ask which LLM provider receives the prompts and what data residency commitments are in writing. Ask how secrets, embeddings, and audit logs are isolated per tenant.

A vendor with real deployment experience will distinguish between hosted-LLM-with-DPA, hosted-LLM-with-BAA, dedicated tenancy, and self-hosted Llama/Mistral inside the customer VPC — and they will recommend the right pattern for the workload's regulatory posture. See our [HIPAA AI deployment checklist](/blog/hipaa-ai-deployment-checklist) and [UK GDPR practical guide](/blog/uk-gdpr-ai-development-practical-guide-2026) for the recurring decision points by region.

4. Who owns the IP, the prompts, and the eval set?

Standard work-for-hire language covers the source code. AI engagements have three other assets that frequently fall outside the contract: the prompt library, the eval set, and the fine-tuned model weights. Vendors who quietly retain ownership of these assets keep the customer locked in long after the engagement ends. Ask for contract language that explicitly assigns all three to the customer.

5. What are the takeover terms?

The takeover audit checklist is the single best predictor of a healthy engagement. Ask whether the vendor will deliver, as part of the standard deliverable set: versioned prompt library with model and retrieval config tags, eval set checked into the customer's repo, observability instrumented and dashboards exported, runbooks for incident response, secret rotation procedures, and a written handoff plan for the customer's team to operate the system without the vendor.

A vendor that ships these as standard has thought about what happens when the engagement ends. A vendor that treats them as a phase-2 retainer upsell is selling lock-in. See our deeper post on [vendor takeover audit signs](/blog/ai-vendor-takeover-audit-signs) for the symptoms that surface when this question is dodged.

6. Compliance posture — beyond the certification badges

Every vendor says they are SOC 2 attested, HIPAA-ready, GDPR-compliant, and ISO 27001 certified. The question is what those attestations and certifications actually cover. Ask for the SOC 2 Type II report under NDA and read the scope section: does it cover the development environment that will write your code, or only the corporate SaaS the company uses internally?

US: SOC 2 Type II scope must cover AI development practices, code review, secret handling. HIPAA BAA available if PHI is involved. CCPA + 18-state patchwork patterns documented. See [SOC 2 AI development](/soc-2-ai-development-usa).
UK: UK GDPR + DPIA workflow documented; ICO guidance on automated decisioning referenced in writing. See [UK GDPR AI development](/uk-gdpr-ai-development).
Canada: PIPEDA + Quebec Law 25 + OSFI E-23 for federally regulated banks. See [PIPEDA AI development](/pipeda-ai-development-canada).
Australia: Privacy Act 1988 + APP-compliant data flows + APRA CPS 234/230 if financial services. See [Privacy Act AI development](/privacy-act-ai-development-australia).

7. What real metrics do you ship — not vanity stats?

Every AI vendor's pitch deck has the same five metrics: "50% productivity gain," "10x faster," "90% accuracy," "ROI in 3 months," and a customer quote. Almost none of them are measured against a real eval set. Ask for production metrics with categories: citation accuracy (for RAG), tool-call success rate (for agents), refusal rate, p95 latency, cost per request, and category-level eval scores. Ask for them on a recent engagement, not the lighthouse project from 2023.

Concrete examples from our own production stack: 98.4% citation accuracy on a [medical-inquiry RAG agent](/case-studies/medical-inquiry-system) at 6% refusal rate, 68% L1 deflection on a [telco SMS bot](/case-studies/twilio-chatbot) handling 110k conversations/week, 1,400 hours/month saved on an EU insurance back-office build, 47% completion lift on an [adaptive interview agent](/case-studies/interview-agent), sub-2-second voice latency on a 4,000-calls-a-day [outbound voice agent](/case-studies/voice-agent). All measurable, all reproducible against the eval sets that produced them.

8. What does observability look like on day one?

Observability is not an afterthought for AI systems — it is the operating manual. Without per-call audit logs, token-cost tracking, latency p50/p95/p99, refusal-rate dashboards, and tool-call success rates, the system is opaque the moment users hit it. Ask the vendor which observability stack they instrument by default and what the day-one dashboard looks like. See our [agent observability post](/blog/ai-agent-observability-what-to-instrument) for the structural inventory.

A vendor with production experience will name Braintrust, Langfuse, Arize Phoenix, or an equivalent, and they will have an opinion on which is the right call for your traffic profile. A vendor without production experience will say "we will instrument observability as part of the production hardening phase," which means it does not exist yet.
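
Whatever platform the vendor names, the day-one dashboard reduces to a handful of aggregates over per-call logs. The sketch below shows one way to compute them; the `CallRecord` fields are hypothetical, and the nearest-rank percentile is one of several common definitions, so a real observability platform may report slightly different values.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CallRecord:
    # Hypothetical per-call log fields; real schemas vary by platform.
    latency_ms: float
    cost_usd: float
    refused: bool
    tool_calls: int
    tool_failures: int


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: List[CallRecord]) -> Dict[str, float]:
    """Aggregate per-call logs into the headline dashboard numbers."""
    if not records:
        raise ValueError("no records to summarize")
    latencies = [r.latency_ms for r in records]
    tool_calls = sum(r.tool_calls for r in records)
    tool_failures = sum(r.tool_failures for r in records)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "refusal_rate": sum(r.refused for r in records) / len(records),
        "cost_per_request_usd": sum(r.cost_usd for r in records) / len(records),
        # Assumption: with no tool calls at all, report a success rate of 1.0.
        "tool_call_success_rate": 1 - tool_failures / tool_calls if tool_calls else 1.0,
    }
```

If a vendor cannot say where each of these fields would come from in their stack, the dashboard probably does not exist yet.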

9. What is the eval cadence after launch?

Production AI drifts. Model providers update versions, prompt edits creep in, retrieval indexes go stale, and the eval set itself needs expanding as new query categories emerge. A vendor that ships once and walks away leaves you with a frozen system in a moving world. Ask what the post-launch eval cadence is: for example, weekly automated runs against the golden set, a monthly drift review, and quarterly expansion of the eval set itself.

The honest answer involves either a retainer with a defined eval cadence, or a written handoff plan for the customer's team to run the eval cadence themselves. Either is acceptable. "We will check in if you have issues" is not.

10. How does escalation work when something breaks?

Every production AI system will have incidents — a model provider outage, an embedding index corruption, a prompt regression that slipped past the eval gate. The question is who answers the page at 2am. Ask for the on-call coverage windows, the response-time SLA, the named engineers in the rotation, and the communication channel (Slack Connect, dedicated PagerDuty integration, email-only).

11. Cost transparency — what changes the bill?

AI engagements have moving cost components that traditional software does not: per-token LLM costs, embedding generation costs, vector store hosting, voice STT/TTS billing, observability platform fees, and managed-eval-platform subscriptions. Ask the vendor for a written cost-per-component breakdown: per LLM call, per 1k tokens of context, per voice-agent minute, per RAG ingestion run.

The same vendor should be able to project the monthly run-rate for your projected traffic volume, broken down by component. A vendor who quotes a single all-in monthly number is either masking the cost structure or has not done the math. See our [RAG vs fine-tuning cost breakdown](/blog/rag-vs-fine-tuning-cost-breakdown-2026) and [voice agent ROI math](/blog/voice-agent-roi-cost-math-2026) for the underlying cost models.
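
The component-level projection is simple arithmetic once the unit prices are on paper. The sketch below is illustrative only: the field names and the grouping of fixed fees are assumptions, and no real provider rates are implied. The vendor should supply the actual unit prices in writing.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class UnitPrices:
    # All prices are placeholders the vendor fills in; none are real rates.
    usd_per_1k_input_tokens: float
    usd_per_1k_output_tokens: float
    usd_per_voice_minute: float
    usd_per_ingestion_run: float
    fixed_monthly_usd: float  # assumption: vector store, observability, eval platform fees


@dataclass(frozen=True)
class MonthlyTraffic:
    llm_calls: int
    avg_input_tokens: int
    avg_output_tokens: int
    voice_minutes: int
    ingestion_runs: int


def monthly_run_rate(prices: UnitPrices, traffic: MonthlyTraffic) -> Dict[str, float]:
    """Project the monthly bill per component, plus a total."""
    breakdown = {
        "llm_input": traffic.llm_calls * traffic.avg_input_tokens / 1000
        * prices.usd_per_1k_input_tokens,
        "llm_output": traffic.llm_calls * traffic.avg_output_tokens / 1000
        * prices.usd_per_1k_output_tokens,
        "voice": traffic.voice_minutes * prices.usd_per_voice_minute,
        "ingestion": traffic.ingestion_runs * prices.usd_per_ingestion_run,
        "fixed": prices.fixed_monthly_usd,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown
```

Running the same function at two or three traffic levels shows which component dominates as volume grows, which a single all-in monthly number hides.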

12. What is the contract structure — and what risk is each party taking?

The contract structure determines the incentive structure for the rest of the engagement. Time-and-materials transfers schedule and scope risk to the customer; the vendor is incentivized not to finish. Fixed-price transfers risk to the vendor, but only if the scope is well-defined and the vendor stands behind overruns. Hybrid milestone-based contracts can work if the milestones are real acceptance criteria, not progress reports.

Fixed-price with written acceptance criteria + a vendor-absorbs-overrun clause: best for v1 scoped builds.
Milestone-based with eval-set acceptance gates: best for multi-phase deliveries with regulatory review.
Time-and-materials with weekly burn caps and signed weekly scope: acceptable for true exploratory work.
Pure T&M with no cap and no fixed scope: avoid unless the engagement is genuinely research.

How to score the responses

Send the 12 questions to four to six vendors and score the written responses on a simple rubric: 2 points for a substantive answer with a named example, 1 point for a hedge, 0 points for a dodge or a checklist response. The vendors who score 18+ out of 24 have shipped production AI. The vendors who score under 12 are pitching slideware. That narrowing happens before the first sales call, the point at which unfit vendors become most expensive to engage with.
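
The rubric is easy to keep in a small script so every reviewer scores the same way. This is a minimal sketch: the band labels are hypothetical, and the rubric above does not say how to treat scores from 12 to 17, so the middle band is marked as an assumption.

```python
from typing import List

NUM_QUESTIONS = 12
MAX_SCORE = 2 * NUM_QUESTIONS  # 24


def total_score(points: List[int]) -> int:
    """Sum per-question points: 2 = substantive, 1 = hedge, 0 = dodge or checklist."""
    if len(points) != NUM_QUESTIONS:
        raise ValueError(f"expected {NUM_QUESTIONS} scores, got {len(points)}")
    if any(p not in (0, 1, 2) for p in points):
        raise ValueError("each score must be 0, 1, or 2")
    return sum(points)


def classify(score: int) -> str:
    """Map a total score to a band; the label strings are illustrative."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be between 0 and {MAX_SCORE}")
    if score >= 18:
        return "shortlist"
    if score < 12:
        return "reject"
    # Assumption: the rubric leaves 12 to 17 undefined, so flag for judgment.
    return "needs review"
```

For example, a vendor with nine substantive answers and three dodges totals 18 and lands in the top band.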

The 12 questions also act as the spine of the SOW. The vendor's written answers become contractual commitments — named engineers, eval cadence, deployment architecture, IP assignment, takeover deliverables. If the vendor's actual delivery later diverges from the written RFP response, you have explicit grounds for escalation or termination.

The vendor you want is the one who answers fast

A vendor that has actually shipped 50+ AI systems can answer the 12 questions in writing inside 48 hours. The answers exist as standard language in their pre-sales materials because they have written them dozens of times before. A vendor that needs three weeks to draft answers is figuring out the answers for the first time during your RFP. That delta — speed of substantive response — is one of the cleanest signals you will get in the entire selection process.

If you are scoping an AI engagement and want a 30-minute call that runs the 12-question rubric against your specific build — [book a discovery call](/contact-us). We will return a written one-pager inside 72 hours with the SOW shape, the eval-set proposal, the deployment architecture, and the fixed-price number. The same artifact your other shortlisted vendors should be able to produce.


AI Development RFP Template: 12 Questions Every Vendor Should Answer in Writing
