Jun 9, 2026 · AI

LLMs in production: what we learned shipping them in client software

Over the past year we've put language models into support, content generation, classification and internal assistants. The gap between "a demo that wows" and "a feature that survives in production" is enormous, and it almost always comes down to the same four things: real cost, latency, output reliability and evaluation. These are the notes we wish we'd had before starting.

1. Cost is not the price per token

The bill isn't driven by the model: it's driven by context. Sending the full conversation history or whole documents on every call multiplies spend without improving the answer. What works: retrieve only the relevant chunks (RAG), cache system prompts, summarize long history, and use a small, cheap model for simple tasks, saving the big one for what truly needs it. With that discipline a typical feature costs tens or a few hundred euros a month, not thousands.
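To make the arithmetic concrete, here is a rough sketch of how context size drives the monthly bill. The prices, token counts and call volume are invented for illustration and do not reflect any provider's real rates; only the ratio between the two scenarios matters.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    input_tokens: int     # prompt + context sent on every call
    output_tokens: int    # tokens generated per call
    calls_per_month: int


def monthly_cost_eur(profile: CallProfile,
                     eur_per_m_input: float,
                     eur_per_m_output: float) -> float:
    """Estimated monthly spend for one feature, in euros."""
    input_cost = profile.input_tokens * eur_per_m_input / 1_000_000
    output_cost = profile.output_tokens * eur_per_m_output / 1_000_000
    return (input_cost + output_cost) * profile.calls_per_month


# Hypothetical prices in EUR per million tokens (assumption, not a real price list).
PRICE_IN, PRICE_OUT = 3.0, 15.0

# Same feature, same volume: only the amount of context per call differs.
full_history = CallProfile(input_tokens=20_000, output_tokens=300, calls_per_month=10_000)
retrieved_only = CallProfile(input_tokens=2_000, output_tokens=300, calls_per_month=10_000)

cost_full = monthly_cost_eur(full_history, PRICE_IN, PRICE_OUT)
cost_rag = monthly_cost_eur(retrieved_only, PRICE_IN, PRICE_OUT)
print(round(cost_full, 2))  # roughly 645 with these assumed numbers
print(round(cost_rag, 2))   # roughly 105 with these assumed numbers
```

With these assumed figures the output cost is identical in both cases; the whole difference comes from the input side, which is the part that retrieval, caching and history summaries reduce.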

2. Treat output as untrusted input

An LLM is not a source of truth; it's a plausible-text generator. In a product that means: ask for structured JSON and validate it against a schema, explicitly allow "I don't know", ground answers in real data via RAG, and add a deterministic layer that verifies critical facts (prices, dates, permissions) before showing them or acting. You don't eliminate hallucination; you contain it with engineering around the model.
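A minimal sketch of that containment, using only the standard library. The JSON shape (`known`, `product_id`, `price_eur`), the `PriceAnswer` type and the in-memory `catalog` are hypothetical names chosen for this example; in a real product the schema would likely be enforced with a validation library and the catalog would be your system of record.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PriceAnswer:
    known: bool                  # False is the explicit "I don't know"
    product_id: Optional[str]
    price_eur: Optional[float]


def parse_answer(raw: str) -> PriceAnswer:
    """Parse model output; reject anything that doesn't match the expected shape."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on free text
    if not isinstance(data, dict) or not isinstance(data.get("known"), bool):
        raise ValueError("missing or invalid 'known' flag")
    if not data["known"]:
        return PriceAnswer(False, None, None)
    product_id, price = data.get("product_id"), data.get("price_eur")
    if not isinstance(product_id, str):
        raise ValueError("invalid product_id")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        raise ValueError("invalid price_eur")
    return PriceAnswer(True, product_id, float(price))


def verified_price(answer: PriceAnswer, catalog: Dict[str, float]) -> Optional[float]:
    """Deterministic layer: only return a price that matches the system of record."""
    if not answer.known:
        return None
    real = catalog.get(answer.product_id)
    if real is None or abs(real - answer.price_eur) > 0.005:
        return None  # the model's claim disagrees with real data: don't show it
    return real
```

The caller treats `None` as "don't display a price", so a hallucinated figure is dropped instead of reaching the user.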

3. Latency: streaming for synchronous, queues for asynchronous

For live chat, token streaming changes perception: 1-3 seconds to the first word is fine if the text flows. For work that doesn't need instant attention (nightly summaries, bulk classification, report generation) use a queue with workers: the user isn't waiting and you can afford stronger models. The classic mistake is blocking the UI while waiting synchronously for a long response.
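The asynchronous half can be sketched with an in-process queue and a small pool of workers. This is an illustration under assumptions: `slow_model_call` is a stand-in for a real provider call, and a production setup would more likely use a persistent queue so jobs survive restarts.

```python
import asyncio
from typing import Dict


async def slow_model_call(text: str) -> str:
    """Stand-in for a slow, strong model (hypothetical; no real provider is called)."""
    await asyncio.sleep(0.01)
    return f"summary of: {text}"


async def worker(queue: asyncio.Queue, results: Dict[str, str]) -> None:
    """Pull jobs off the queue until cancelled."""
    while True:
        job_id, text = await queue.get()
        try:
            results[job_id] = await slow_model_call(text)
        finally:
            queue.task_done()  # always mark the job done so join() can return


async def run_batch(jobs: Dict[str, str], concurrency: int = 2) -> Dict[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: Dict[str, str] = {}
    for job in jobs.items():
        queue.put_nowait(job)
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()           # wait until every queued job has been processed
    for task in workers:
        task.cancel()            # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


results = asyncio.run(run_batch({"a": "ticket 1", "b": "ticket 2", "c": "ticket 3"}))
```

The `concurrency` parameter caps how many model calls run at once, which also helps stay under provider rate limits.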

4. Versioned prompts and evaluation, or you don't know if you broke something

The prompt is code: it lives in the repo, has a version and gets reviewed in a PR. Without a set of evaluation cases (real inputs with their expected output), any prompt or model change is blind. You don't need an expensive platform: a collection of 30-100 examples and a script that measures the pass rate catches regressions before they reach the client. That's the difference between iterating with confidence and praying.
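Such a script can be very small. The sketch below assumes a classification feature; the three cases, the label names and `classify_stub` (a keyword rule standing in for the real prompt plus model call) are all invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical evaluation cases: real inputs paired with the expected label.
EVAL_CASES: List[Dict[str, str]] = [
    {"input": "I was charged twice this month", "expected": "billing"},
    {"input": "The app crashes when I log in", "expected": "bug"},
    {"input": "How do I export my data?", "expected": "how_to"},
]


def classify_stub(text: str) -> str:
    """Stand-in for the prompt + model under test (assumption for this sketch)."""
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "how_to"


def run_eval(classify: Callable[[str], str],
             cases: List[Dict[str, str]]) -> Tuple[float, List[Dict[str, str]]]:
    """Return the pass rate and the list of failing cases."""
    failures = [c for c in cases if classify(c["input"]) != c["expected"]]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


pass_rate, failures = run_eval(classify_stub, EVAL_CASES)
print(f"pass rate: {pass_rate:.0%}, failures: {len(failures)}")
```

Run it in CI on every prompt or model change and fail the build when the pass rate drops below a threshold you choose; the list of failing cases tells you what regressed.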

5. Mistakes we keep seeing

Starting with fine-tuning. Almost always premature: prompting + RAG covers 90% of cases and is cheaper to maintain.
Measuring nothing. Without evaluation or quality logs, the system silently degrades when the model or data changes.
One model for everything. Mixing trivial and complex tasks in the most expensive model burns budget.
No fallback plan. When the provider goes down or rate-limits, you want an alternate model or an honest degraded response.
Sensitive data without control. Decide what gets sent to the provider, anonymize, and comply with GDPR from day one.
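As an illustration of the fallback point in the list above, here is a minimal sketch. `ProviderError`, the callables and the degraded message are hypothetical; real provider SDKs raise their own exception types, which you would catch instead.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error standing in for provider outages and rate limits."""


DEGRADED_MESSAGE = "The assistant is temporarily unavailable. A person will follow up."


def complete_with_fallback(prompt: str,
                           models: Sequence[Callable[[str], str]]) -> Tuple[str, bool]:
    """Try each model in order; return (text, degraded_flag)."""
    for call_model in models:
        try:
            return call_model(prompt), False
        except ProviderError:
            continue  # this model is down or rate-limited: try the next one
    # Every model failed: answer honestly instead of hanging or inventing text.
    return DEGRADED_MESSAGE, True


def primary(prompt: str) -> str:
    raise ProviderError("rate limited")  # simulate an outage of the main model


def backup(prompt: str) -> str:
    return f"[backup] {prompt}"


text, degraded = complete_with_fallback("Summarize ticket 42", [primary, backup])
```

Returning the `degraded` flag lets the UI and the logs distinguish a real answer from the honest fallback message.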

FAQ

What does an LLM really cost in production?

Price per token is only part of it; context usually dominates. With prompt caching, selective retrieval (RAG) and a small model for simple work, a typical feature usually costs €20-300 a month at mid volumes.

Do I need to train my own model?

Almost never at first. A general model with a good prompt, RAG over your data and validation covers most cases. Fine-tuning only pays off with a very repetitive pattern, labeled data and a cost or latency target that prompting can't reach.

How do you stop it making things up?

By not treating it as a source of truth: RAG, validated structured output, an "I don't know" option, and a deterministic layer that checks critical facts before display.

Want to add AI to your product without shooting yourself in the foot?

We design and ship LLMs in production with cost control, evaluation and fallback. Fixed price by milestones.

Related resources

Automate support without breaking the CRM — human handoff, idempotency and queues.
SaaS multi-tenant backend — where AI fits in a B2B SaaS.
The stack we use in 2026 — what we run around the model.

Published: June 9, 2026 · Written by the RoviDev studio.