Right-size your in-house AI server — measure before you buy.
The most expensive mistake in bringing AI in-house is buying the hardware first. A GPU server is a real capital purchase, and the instinct is to either over-buy "to be safe" or under-buy to keep the quote down. Both are guesses, and both cost money — one in idle silicon, the other in a box that chokes the first time the whole team logs on. There's a better order of operations: measure your actual usage first, then spec the machine to fit it. Think of it as designing the server from real demand instead of a hunch.
What actually determines the size of the box
"How much GPU do I need?" has no one-size answer because the load is shaped by a handful of variables that are specific to how your team works:
- Concurrency — not how many people have access, but how many are generating at the same moment during a peak. A 25-person office rarely has 25 simultaneous requests; it might have five.
- Context length — how much text goes into each request. Summarizing long documents costs far more than answering short questions.
- Model size — a larger model needs more VRAM and runs slower; a smaller, task-tuned model may do the job on a fraction of the hardware.
- Peak versus average — you size for the peak you need to feel snappy, not the daily average, or the system feels slow exactly when everyone's using it.
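To make the gap between seats, average load, and peak load concrete, here is a minimal Python sketch. It assumes you can export a log of request start and end times in seconds; the function names and the log format are hypothetical and not tied to any particular product. It counts how many generations overlap at the busiest instant and compares that with the average over the same window.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds) for one generation


def peak_concurrency(requests: Iterable[Interval]) -> int:
    """Return the largest number of requests in flight at any instant."""
    events: List[Tuple[float, int]] = []
    for start, end in requests:
        events.append((start, 1))   # a generation begins
        events.append((end, -1))    # a generation finishes
    # Tuples sort by time first, then by delta, so at an identical timestamp
    # a finish (-1) is processed before a start (+1). Back-to-back requests
    # are therefore not counted as overlapping.
    events.sort()
    in_flight = 0
    peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)
    return peak


def average_concurrency(requests: Iterable[Interval], window_seconds: float) -> float:
    """Return total busy time divided by the length of the observation window."""
    busy = sum(end - start for start, end in requests)
    return busy / window_seconds


# Illustrative log, not real data: four generations inside a 60-second window.
log = [(0, 10), (2, 8), (3, 5), (20, 25)]
print("peak:", peak_concurrency(log))
print("average:", round(average_concurrency(log, 60), 2))
```

In this made-up log the peak is 3 while the average is well under 1, which is the pattern the last bullet warns about: a box sized to the average would feel slow at the busy moment.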
The numbers worth measuring
Sizing on vibes is guesswork. Real sizing comes from a few concrete numbers, captured from actual use:
- Time to first token (TTFT) — how long a user waits before the answer starts. This is what "feels fast" actually measures.
- Tokens per second — how quickly the answer streams once it starts.
- Tokens by prompt component — how the token budget splits across the system prompt, the user's input, and the model's response. This tells you whether your cost is in long instructions, long inputs, or long outputs — and each points to a different optimization.
This is exactly the data the Interchange AI Gateway captures. Run it in front of whatever AI your team uses today — even cloud tools during a trial — and it records real TTFT, throughput, concurrency, and the token breakdown by component. After a few weeks you're no longer guessing; you have a demand profile.
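For readers who want the definitions pinned down, the sketch below shows one way these numbers could be derived from raw timings. It is an illustration of the metrics only, not the gateway's implementation: the `RequestTrace` record and its field names are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RequestTrace:
    """Hypothetical per-request record; the field names are illustrative."""
    sent_at: float            # when the request was submitted, in seconds
    token_times: List[float]  # arrival time of each streamed output token
    system_tokens: int        # tokens in the system prompt
    input_tokens: int         # tokens in the user's input


def time_to_first_token(trace: RequestTrace) -> Optional[float]:
    """Seconds the user waited before the answer started, or None if nothing streamed."""
    if not trace.token_times:
        return None
    return trace.token_times[0] - trace.sent_at


def tokens_per_second(trace: RequestTrace) -> float:
    """Streaming rate measured from the first output token to the last."""
    count = len(trace.token_times)
    if count < 2:
        return 0.0
    elapsed = trace.token_times[-1] - trace.token_times[0]
    return (count - 1) / elapsed if elapsed > 0 else 0.0


def token_shares(trace: RequestTrace) -> Dict[str, float]:
    """Fraction of the total token count in each prompt component."""
    output_tokens = len(trace.token_times)
    total = trace.system_tokens + trace.input_tokens + output_tokens
    if total == 0:
        return {"system": 0.0, "input": 0.0, "output": 0.0}
    return {
        "system": trace.system_tokens / total,
        "input": trace.input_tokens / total,
        "output": output_tokens / total,
    }


# Illustrative trace, not real data: first token after 0.5 s,
# then one token every 20 ms until 500 tokens have streamed.
example = RequestTrace(
    sent_at=0.0,
    token_times=[0.5 + 0.02 * i for i in range(500)],
    system_tokens=300,
    input_tokens=1200,
)
print(time_to_first_token(example), round(tokens_per_second(example), 1), token_shares(example))
```

Keeping the streaming rate separate from TTFT is a deliberate choice here: a request can start quickly and stream slowly, or the reverse, and the two problems call for different fixes.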
A worked example
Here's how this plays out for a representative 25-person office. (The numbers below are illustrative — every business profiles differently, which is the whole point of measuring. Treat them as a shape, not a guarantee.)
Say a few weeks of measurement show: 25 seats, but a busy-hour peak of about 5 concurrent generations; average input around 1,500 tokens with responses near 500 tokens; and a target of TTFT under a second with comfortable streaming speed. That profile doesn't need a multi-GPU rack. It points to a single modern 24 GB GPU running a quantized mid-size model, which is enough to keep responses feeling instant for five simultaneous users, with headroom for growth, at a fraction of the "to be safe" quote. A different office with long-document summarization and ten concurrent users would profile larger, and you'd see that in the data instead of finding out after the purchase.
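The memory side of that conclusion can be checked with back-of-envelope arithmetic. The sketch below is a rough estimate under loudly stated assumptions: the model shape (14 billion parameters, 40 layers, 8 key/value heads of dimension 128) is hypothetical and chosen only to stand in for "a quantized mid-size model", with 4-bit weights and a 16-bit KV cache. Only the concurrency and token counts come from the profile above.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int, tokens_in_flight: int) -> float:
    """Approximate KV-cache memory for all tokens held at once, in GB."""
    # Each token stores one key and one value vector per layer per KV head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * tokens_in_flight / 1e9


# Measured profile from the worked example.
concurrent_requests = 5
tokens_per_request = 1500 + 500  # average input plus average response
tokens_in_flight = concurrent_requests * tokens_per_request

# HYPOTHETICAL model shape, for illustration only.
weights = weights_gb(params_billion=14, bytes_per_param=0.5)  # 4-bit weights
kv = kv_cache_gb(layers=40, kv_heads=8, head_dim=128,
                 bytes_per_value=2, tokens_in_flight=tokens_in_flight)
total = weights + kv

print(f"weights {weights:.1f} GB + KV cache {kv:.2f} GB = {total:.2f} GB")
```

Under these assumed numbers the estimate comes to roughly 8.6 GB, which is consistent with a 24 GB card leaving headroom. The formula ignores activations, runtime overhead, and memory fragmentation, so treat it as a floor and not a spec. It also says nothing about speed: whether TTFT stays under a second at five concurrent users is something to measure, not compute.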
Start on the gateway, graduate to your own box
This gives you a natural, low-risk path. Start with the gateway in front of your current tools: you get usage visibility and policy control immediately, and you accumulate the demand profile that de-risks the hardware decision. When the numbers justify it, you commission an inference server sized to what you actually measured — not to a salesperson's worst case. The gateway then keeps running in front of your own hardware, doing the same job it did during measurement.
It's the difference between designing for your business and buying for a brochure. If you're weighing whether to bring AI in-house at all, start with what private AI actually means.