We're Training Our Own LLM. Here's What It Actually Takes.



Omer Gotlieb, Cofounder and CEO
Lior Mechlovich
6 min read
March 31, 2026

A few weeks ago, we started training our own language model.

Not as a research exercise. Not for a blog post. We're actually trying to replace GPT-4 in production — for one very specific task: having real-time sales conversations with website visitors.

We called it Navon (Hebrew for "wise"). Here's what I've learned so far about what it actually takes to build your own model, why we decided to do it, and the honest trade-offs nobody warns you about.

Why would anyone do this?

Our AI agents run on GPT-4. They work well — our conversation evaluations average 92+ across thousands of live sessions. So why mess with something that works?

Three reasons pushed us over the edge.

Intercom proved the playbook. When Fergal Reid announced Fin Apex — their custom model powering over a million support conversations per week — the message was clear. Vertical AI companies with enough domain data can build specialized models that beat frontier models at their specific task. If it works for customer support, it should work for sales.

We're sitting on the data. 48,000+ live sessions. 27,000 scoring 85+ on our evaluation system. 630,000 reasoning traces from our multi-agent architecture. Real conversations between AI agents and real prospects, with concrete outcomes: did they book a demo or not? This isn't synthetic data. It's the real thing.

The economics will only get better. At current volume, GPT-4 costs are manageable. At 10x volume, they won't be. A custom model on our own infrastructure could deliver the same quality at a fraction of the cost. Intercom saw 10x savings. We expect similar.

What you actually need before you start

I see a lot of teams excited about fine-tuning without understanding the prerequisites. Here's what we had before writing a single line of training code:

Enough high-quality data. We set minimum thresholds: 5,000+ evaluated sessions, 1,000+ scoring 85+, 500+ with clear conversion outcomes. We exceeded every threshold by 8-27x. If your data doesn't clear these bars, fine-tuning will disappoint you.
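To make those bars concrete, here's a minimal sketch of the threshold check. The session fields (`score`, `converted`) and the helper itself are simplified stand-ins, not our actual data model:

```python
# Minimum data thresholds before fine-tuning is worth attempting.
MIN_EVALUATED = 5_000      # sessions with an eval score
MIN_HIGH_SCORING = 1_000   # sessions scoring 85+
MIN_WITH_OUTCOME = 500     # sessions with a clear conversion outcome

def meets_thresholds(sessions):
    """Return (ok, stats) for a list of session dicts."""
    evaluated = [s for s in sessions if s.get("score") is not None]
    high = [s for s in evaluated if s["score"] >= 85]
    outcomes = [s for s in evaluated if s.get("converted") is not None]
    stats = {
        "evaluated": len(evaluated),
        "high_scoring": len(high),
        "with_outcome": len(outcomes),
    }
    ok = (
        stats["evaluated"] >= MIN_EVALUATED
        and stats["high_scoring"] >= MIN_HIGH_SCORING
        and stats["with_outcome"] >= MIN_WITH_OUTCOME
    )
    return ok, stats
```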

A strong evaluation signal. Every one of our conversations gets scored 0-100 across four dimensions: accuracy, sales effectiveness, human-like quality, and professional judgment. Each evaluation includes structured feedback — what the AI did well and specific issues to fix. Without this, you're training blind.
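As a rough illustration, an evaluation record along these lines — the field names are simplified stand-ins for our real schema, not the production format:

```python
from dataclasses import dataclass

# The four scoring dimensions, each 0-100.
DIMENSIONS = ("accuracy", "sales_effectiveness", "human_like", "professional_judgment")

@dataclass
class ConversationEval:
    scores: dict      # dimension name -> 0-100
    strengths: list   # what the AI did well
    issues: list      # specific problems to fix

    @property
    def overall(self):
        """Unweighted mean across the four dimensions."""
        return sum(self.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

The structured feedback lists are the part that matters for training: a bare number tells you a session was good, but not what to imitate.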

Reasoning traces, not just inputs and outputs. Most companies only have conversation transcripts. We have full reasoning chains from our LangGraph architecture — what context the AI considered, what rules it applied, how it chose its response strategy. This lets us train on how to think about sales, not just what to say.

A benchmark you trust. Before training anything, we built an evaluation benchmark: 500 known-good sessions, 200 known-bad ones, 100 edge cases. Any model we train has to beat our current system on this benchmark before it touches production. Build the eval before you build the model.
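A sketch of what that gate can look like in code. The bucket names mirror the benchmark above; the pass rule (match or beat the baseline's mean score in every bucket) is an assumption, not our exact criterion:

```python
# Fixed benchmark composition: 500 known-good, 200 known-bad, 100 edge cases.
BUCKETS = {"known_good": 500, "known_bad": 200, "edge_case": 100}

def passes_gate(candidate, baseline):
    """candidate/baseline: dicts mapping bucket name -> list of eval scores.
    The candidate must at least match the baseline mean in every bucket."""
    for bucket, expected_n in BUCKETS.items():
        c, b = candidate[bucket], baseline[bucket]
        if len(c) != expected_n or len(b) != expected_n:
            raise ValueError(f"wrong benchmark size for {bucket!r}")
        if sum(c) / len(c) < sum(b) / len(b):
            return False
    return True
```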

The honest pros and cons

I'll be direct about what's good and what's hard.

The good:

  • It's surprisingly accessible. We trained a 14B-parameter model on a single A10G GPU (24GB) using QLoRA. Total GPU cost: about $25. The tooling — Unsloth, PEFT, HuggingFace — is mature enough that the ML part is actually the easy part.
  • Domain specificity is a real advantage. A 14B model trained on your data can match a frontier model at your specific task. We don't need PhD-level reasoning. We need excellent judgment about sales conversations. That's a narrower, more learnable problem.
  • You own the whole stack. No API rate limits. No surprise pricing changes. No dependency on a vendor's model updates potentially breaking your product. Once it works, it's yours.
  • Latency wins are free. Our model runs 37% faster than GPT-4 on the same task. When you control the inference, you can optimize for your exact use case.

The hard:

  • Infrastructure is 80% of the work. It took nine attempts to get training running. Every failure was infrastructure: wrong Python version, VRAM limits, SSM timeouts, dependency conflicts. The ML configuration was straightforward once the devops cooperated.
  • Evaluation is treacherous. Our first eval showed the model losing to GPT-4 83% of the time. We nearly killed the project. Turns out the eval was wrong — we were testing without production context. When we replayed through the full pipeline, it was 80% ties. You can easily convince yourself a good model is bad (or a bad model is good) with the wrong evaluation setup.
  • It never feels "done." Sequence length, prompt budgets, token allocation, streaming, caching, routing — each one is a rabbit hole. You're not just training a model. You're building a production ML system with its own operational surface area.
  • The opportunity cost is real. Every hour spent on model training is an hour not spent on product, sales, or customer work. For a startup, that trade-off is sharp.

The moment we almost killed it

I want to be honest about this because I think it's the most important part of the story.

Our first evaluation showed GPT-4 winning 83% of head-to-head comparisons. Navon won 17%. Zero ties. The numbers looked devastating.

But something felt off. Navon's responses weren't bad — they were often more specific, referencing product details that only made sense with knowledge base context. We had tested the model without giving it the same context it would receive in production.

It was like judging a pilot's skill by making them fly blindfolded.

When we rebuilt the evaluation to replay through the actual production pipeline — full knowledge base retrieval, org settings, qualification criteria, conversation history — the results flipped completely:

  • 80% ties — the judge couldn't tell the difference
  • 15% Navon wins
  • 5% GPT-4 wins

Same model. Same weights. Completely different conclusion. If we'd trusted the first eval, we would have abandoned a model that actually works.

The lesson: if your evaluation doesn't match your production setup, your results are meaningless. And "close enough" isn't close enough.
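In code, the fix amounted to one rule: build eval inputs with the same pipeline production uses. A simplified sketch, with invented stand-ins (`retrieve_kb`, `org_settings`) for the real pipeline stages:

```python
def build_production_prompt(session, retrieve_kb, org_settings):
    """Assemble the same context the model would see in production."""
    return {
        "kb_context": retrieve_kb(session["visitor_message"]),
        "org_settings": org_settings(session["org_id"]),
        "history": session["history"],
        "message": session["visitor_message"],
    }

def head_to_head(sessions, model_a, model_b, judge, build_prompt):
    """Replay sessions through both models and tally judge verdicts.
    judge() returns "a", "b", or "tie" for each pair of responses."""
    tally = {"a": 0, "b": 0, "tie": 0}
    for session in sessions:
        prompt = build_prompt(session)
        tally[judge(model_a(prompt), model_b(prompt))] += 1
    return tally
```

The earlier, misleading eval was equivalent to passing a bare `{"message": ...}` into both models: same judge, same sessions, completely different verdict.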

What surprised me

Sequence length matters more than anything. Going from 2,048 to 4,096 tokens was the single biggest quality improvement — more impactful than any hyperparameter change. The model was already good enough; it just needed to see more context. We built a dynamic budget allocator that prioritizes knowledge base content over lower-value sections, squeezing the most out of every token.
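A stripped-down version of that allocator idea: fill sections in priority order until the window is spent. The section names and the greedy truncation rule are simplified assumptions, not the real allocator:

```python
def allocate_budget(sections, max_tokens=4096):
    """sections: list of (name, tokens_needed) pairs in priority order,
    highest-value first. Returns a dict of name -> tokens granted;
    lower-priority sections absorb the truncation."""
    granted, remaining = {}, max_tokens
    for name, needed in sections:
        take = min(needed, remaining)
        granted[name] = take
        remaining -= take
    return granted
```

With a 4,096-token window, knowledge base content placed ahead of conversation history gets its full ask, and history gets whatever is left.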

SFT alone gets you very far. We expected to need reinforcement learning, DPO, and synthetic data augmentation. So far, plain supervised fine-tuning on 18K high-quality examples ties GPT-4 on our specific task. Research from Chroma, Cursor, and Kimi all points the same way: SFT first, RL second. We're still on step one and it's already competitive.

Three days from zero to production. Day one: data exploration, export, benchmark. Day two: nine training attempts, first successful model, inference server. Day three: evaluation reframe, model routing, production deploy, streaming. I genuinely didn't expect to go from "should we try this?" to "it's serving real traffic" in three days.

Where we are right now

Navon is live on a whitelisted customer org. Real visitors are having real conversations powered by our custom model. Every response gets compared against what GPT-4 would have said.

The eval metrics look strong: 80% ties, 37% faster, same production infrastructure. But eval metrics aren't the real test.

The real test is conversion rate. Does this model book the same number of demos as GPT-4? More? Fewer? We're collecting that data right now.

What's next

If the conversion data holds up:

  • More agents. We have training data for all four specialized agents in our architecture. The Discovery agent was first because it has the highest volume. Technical Consultant is likely next.
  • RL with conversion rewards. SFT teaches the model to imitate good conversations. RL teaches it to optimize for outcomes. We've designed a multi-signal reward: conversion outcome as primary, eval score as auxiliary, penalties for hallucination and missed opportunities.
  • Pipeline consolidation. Right now we have separate systems for retrieval, reranking, and generation. Research from Chroma's Context-1 suggests a single model doing all three beats the pipeline approach. That's a longer-term bet.
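For the reward design, a rough sketch of how those signals might combine. The weights and penalty values here are placeholders for illustration, not our tuned reward:

```python
def reward(converted, eval_score, hallucinated, missed_opportunity,
           w_conv=1.0, w_eval=0.2, p_halluc=0.5, p_missed=0.3):
    """Multi-signal reward: conversion outcome is primary, the 0-100 eval
    score is auxiliary, and hallucinations / missed opportunities subtract."""
    r = w_conv * (1.0 if converted else 0.0)
    r += w_eval * (eval_score / 100.0)
    if hallucinated:
        r -= p_halluc
    if missed_opportunity:
        r -= p_missed
    return r
```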

If the conversion data doesn't hold up — we'll learn from that too. The beauty of per-org routing is we can switch back to GPT-4 with an environment variable change. Zero risk to the rest of our customers.
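The routing itself can be as simple as this sketch (the environment variable name and model labels are hypothetical):

```python
import os

def pick_model(org_id):
    """Route whitelisted orgs to Navon; everyone else stays on GPT-4.
    Clearing NAVON_ORG_IDS switches every org back in one deploy."""
    allowed = set(filter(None, os.environ.get("NAVON_ORG_IDS", "").split(",")))
    return "navon" if org_id in allowed else "gpt-4"
```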


Building your own model isn't for everyone. You need the data, the evaluation infrastructure, and the stomach for a roller coaster of results that will make you question the whole thing at least once.

But if you're a vertical AI company sitting on thousands of domain-specific conversations with clear outcome signals — the path is more accessible than you think. A single GPU, a few days, and about $25 in compute got us to a model that ties GPT-4 at our core task.

Stay tuned. The conversion data will tell us whether "ties on quality" translates to "ties on revenue." That's the only number that actually matters.

If you want to see what our AI agent looks like in action — whether it's running on GPT-4 or Navon — try it on our site. You might not be able to tell the difference. That's the point.
