Jun 9, 2026 · AI

LLMs in production: what we learned shipping them in client software

Over the past year we've put language models into support, content generation, classification and internal assistants. The gap between "a demo that wows" and "a feature that survives in production" is enormous, and it almost always comes down to the same four things: real cost, latency, output reliability and evaluation. These are the notes we wish we'd had before starting.

1. Cost is not the price per token

The bill isn't driven by the model: it's driven by context. Sending the full conversation history or whole documents on every call multiplies spend without improving the answer. What works: retrieve only the relevant chunks (RAG), cache system prompts, summarize long history, and use a small, cheap model for simple tasks, saving the big one for what truly needs it. With that discipline, a typical feature costs tens to a few hundred euros a month rather than thousands.
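To make the point about context concrete, here is a back-of-the-envelope sketch of the arithmetic. Every number in it (prices per 1,000 tokens, call volume, token counts) is a made-up placeholder chosen for illustration, not real vendor pricing or data from our projects; swap in your own figures.

```python
# Rough monthly cost estimate for one LLM feature.
# All prices and volumes are hypothetical placeholders, not real vendor pricing.
PRICE_PER_1K_INPUT_TOKENS_EUR = 0.003
PRICE_PER_1K_OUTPUT_TOKENS_EUR = 0.015


def monthly_cost_eur(calls_per_month, input_tokens_per_call, output_tokens_per_call):
    """Return the estimated monthly spend in euros for a given call profile."""
    input_cost = calls_per_month * input_tokens_per_call / 1000 * PRICE_PER_1K_INPUT_TOKENS_EUR
    output_cost = calls_per_month * output_tokens_per_call / 1000 * PRICE_PER_1K_OUTPUT_TOKENS_EUR
    return input_cost + output_cost


calls = 20_000  # assumed volume

# Same feature, same answers: only the amount of context sent per call changes.
full_context = monthly_cost_eur(calls, input_tokens_per_call=12_000, output_tokens_per_call=300)
trimmed_context = monthly_cost_eur(calls, input_tokens_per_call=1_500, output_tokens_per_call=300)

print(f"full history on every call: {full_context:.0f} EUR/month")
print(f"retrieved chunks only:      {trimmed_context:.0f} EUR/month")
```

With these placeholder inputs the model and the price per token are identical in both runs; the only thing that moves the total is how many input tokens each call carries.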

2. Treat output as untrusted input

An LLM is not a source of truth; it's a plausible-text generator. In a product that means: ask for structured JSON and validate it against a schema, explicitly allow "I don't know", ground answers in real data via RAG, and add a deterministic layer that verifies critical facts (prices, dates, permissions) before showing them or acting. You don't eliminate hallucination; you contain it with engineering around the model.
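A minimal sketch of that containment layer, using only the Python standard library. The reply shape (`status`, `product_id`, `price_eur`), the catalog dictionary and the function names are all hypothetical, invented for this example; in a real product the schema check would likely use a dedicated validation library and the catalog would be a database lookup.

```python
import json

# "unknown" is the explicit "I don't know" escape hatch for the model.
ALLOWED_STATUSES = ("answered", "unknown")

# Hypothetical source of truth for prices (stand-in for a real database).
CATALOG_PRICES_EUR = {"plan-basic": 19.0, "plan-pro": 49.0}


def parse_model_reply(raw: str) -> dict:
    """Validate the model's raw text; raise ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("status") not in ALLOWED_STATUSES:
        raise ValueError("status must be 'answered' or 'unknown'")
    if data["status"] == "unknown":
        return {"status": "unknown"}
    product_id = data.get("product_id")
    price = data.get("price_eur")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(product_id, str) or isinstance(price, bool) \
            or not isinstance(price, (int, float)):
        raise ValueError("answered replies need product_id (str) and price_eur (number)")
    return {"status": "answered", "product_id": product_id, "price_eur": float(price)}


def verified_reply(raw: str) -> dict:
    """Deterministic layer: never display a price the catalog does not confirm."""
    try:
        reply = parse_model_reply(raw)
    except ValueError:
        return {"status": "unknown"}  # malformed output degrades to "I don't know"
    if reply["status"] == "answered":
        real_price = CATALOG_PRICES_EUR.get(reply["product_id"])
        if real_price is None:
            return {"status": "unknown"}  # the model named a product that does not exist
        reply["price_eur"] = real_price  # trust the catalog, not the model
    return reply


# The model claims 39 EUR; the catalog value is what ends up in the reply.
print(verified_reply('{"status": "answered", "product_id": "plan-pro", "price_eur": 39}'))
```

The design choice here is that every failure mode (broken JSON, wrong shape, unknown product) collapses into the same safe "unknown" answer, so the UI only has two cases to handle.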

3. Latency: streaming for synchronous, queues for asynchronous

For live chat, token streaming changes how the wait feels: 1-3 seconds to the first word is acceptable as long as the text then flows steadily. For work that doesn't need an immediate answer (nightly summaries, bulk classification, report generation), use a queue with workers: nobody is waiting on the result, so you can afford slower, stronger models. The classic mistake is blocking the UI while waiting synchronously for a long response.
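The queue-with-workers pattern can be sketched in a few lines with `asyncio`. This is an in-process illustration only: `fake_model_call` is a hypothetical stand-in that sleeps instead of calling a real model, and a production setup would normally use a persistent queue so jobs survive a restart.

```python
import asyncio


async def fake_model_call(text: str) -> str:
    """Hypothetical stand-in for a slow LLM request; no real API is called."""
    await asyncio.sleep(0.01)
    return f"summary of: {text}"


async def worker(jobs: asyncio.Queue, results: dict) -> None:
    """Pull jobs off the queue forever and store each result by job id."""
    while True:
        job_id, text = await jobs.get()
        try:
            results[job_id] = await fake_model_call(text)
        finally:
            jobs.task_done()  # mark the job finished even if the call failed


async def run_batch(documents: dict, worker_count: int = 3) -> dict:
    """Process all documents with a fixed number of concurrent workers."""
    jobs: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    for job_id, text in documents.items():
        jobs.put_nowait((job_id, text))
    workers = [asyncio.create_task(worker(jobs, results)) for _ in range(worker_count)]
    await jobs.join()  # wait until every queued job has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


results = asyncio.run(run_batch({"a": "report one", "b": "report two"}))
print(results)
```

The `worker_count` parameter is the knob that matters: it caps how many model calls run at once, which keeps you inside rate limits regardless of how many jobs are queued.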

4. Versioned prompts and evaluation, or you don't know if you broke something

The prompt is code: it lives in the repo, has a version and gets reviewed in a PR. Without a set of evaluation cases (real inputs with expected outputs), any prompt or model change is made blind. You don't need an expensive platform: a collection of 30-100 examples and a script that counts how many outputs match the expected ones catches regressions before they reach the client. That's the difference between iterating with confidence and praying.
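A script like that can be very small. The sketch below is one possible shape, not a prescribed tool: the four cases, the `classify` function (keyword rules standing in for "prompt + model call") and the 0.9 threshold are all invented for the example.

```python
# Tiny regression check for a prompt or model change.
# Cases, labels and threshold are hypothetical; real cases should come from real inputs.
EVAL_CASES = [
    {"input": "I was charged twice this month", "expected": "billing"},
    {"input": "I can't log in after resetting my password", "expected": "account"},
    {"input": "Where can I download my invoice?", "expected": "billing"},
    {"input": "Do you have an office in Madrid?", "expected": "other"},
]

MIN_ACCURACY = 0.9  # assumed bar; pick one that fits the feature


def classify(text: str) -> str:
    """Placeholder for the real prompt + model call; keyword rules stand in."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "password" in lowered or "log in" in lowered:
        return "account"
    return "other"


def run_eval(predict, cases):
    """Return (accuracy, failing cases) for a predict function over the cases."""
    failures = [case for case in cases if predict(case["input"]) != case["expected"]]
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures


accuracy, failures = run_eval(classify, EVAL_CASES)
print(f"accuracy: {accuracy:.2f} ({len(failures)} failing of {len(EVAL_CASES)})")
for case in failures:
    print("FAILED:", case["input"])

# A non-zero exit code makes the check usable as a CI gate.
if accuracy < MIN_ACCURACY:
    raise SystemExit(1)
```

Running it in CI on every PR that touches a prompt gives the same safety net a unit test gives ordinary code; printing the failing inputs is what makes a regression quick to diagnose.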

5. Mistakes we keep seeing

FAQ

What does an LLM really cost in production?

Price per token matters least; context dominates. With caching, selective RAG and a small model for simple work, a typical feature usually costs €20-300/month at mid volumes.

Do I need to train my own model?

Almost never at first. A general model + good prompt + RAG covers most cases. Fine-tuning only pays off with very repetitive patterns and clear cost/latency targets.

How do you stop it making things up?

By not treating it as a source of truth: RAG, validated structured output, an "I don't know" option, and a deterministic layer that checks critical facts before display.

Want to add AI to your product without shooting yourself in the foot?

We design and ship LLMs in production with cost control, evaluation and fallback. Fixed price by milestones.

Published: June 9, 2026 · Written by the RoviDev studio.
