Two Years of B2B Sales Conversations Built Our Data Moat. Here's How We Use It.

Omer Gotlieb, Cofounder and CEO
Lior Mechlovich
6 min read
March 31, 2026

Two years ago, we deployed our first AI sales agent on a customer's website. One org. A few hundred conversations. A basic prompt and a knowledge base.

Today, we power real-time AI sales conversations across dozens of B2B websites. Thousands of sessions per month. Every industry from cybersecurity to fintech to HR tech. Every type of buyer question you can imagine — pricing, security, integrations, "how are you different from X?"

The product got better. But the real asset isn't the product. It's the data underneath it.

Every conversation we power makes the next one smarter. Not in a hand-wavy "AI learns" way. In a very specific, measurable way that compounds across every layer of the system.

Here's how.

The numbers behind the moat

Let me be concrete about what two years of live conversations actually produced:

  • 48,000+ live sessions across dozens of organizations
  • 40,000+ evaluated conversations — each scored 0-100 with structured feedback
  • 27,000+ gold-standard sessions scoring 85 or above
  • 630,000+ reasoning traces — full decision chains showing what context the AI considered and how it chose its response
  • 162,000+ conversation turns exported and anonymized for training

This isn't a static dataset. It grows every day. Every new org that goes live adds a new domain, new buyer personas, new edge cases. The data gets broader and deeper over time.

Layer 1: Better retrieval through cross-org learning

The most immediate way conversation data compounds is in retrieval — how the AI finds the right knowledge base content to answer a buyer's question.

Early on, our retrieval was simple: embed the question, embed the knowledge base entries, rank by cosine similarity. It worked for obvious questions. It failed on nuanced ones.

The problem with cosine similarity: it measures how close two pieces of text are in meaning, not whether one actually answers the other. "Data governance overview" might score higher than "data encryption standards" for a question about security — because the words are semantically closer, even though the second document is the real answer.
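To make the failure mode concrete, here is a minimal cosine-similarity ranker. The three-dimensional vectors are toy stand-ins for real embeddings, chosen to illustrate the problem: the "governance" entry sits closer to the query in embedding space even though the "encryption" entry holds the actual answer.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank(query_vec, entries):
    # entries: list of (title, embedding) pairs; highest similarity first.
    return sorted(entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)

# Toy embeddings standing in for a question about security.
query = [0.9, 0.1, 0.0]
entries = [
    ("data governance overview", [0.8, 0.2, 0.1]),
    ("data encryption standards", [0.3, 0.9, 0.2]),
]
print([title for title, _ in rank(query, entries)])
# → ['data governance overview', 'data encryption standards']
```

Pure embedding distance puts the wrong document first; nothing in the score asks "does this passage answer that question?" That is the gap a cross-encoder closes.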

We trained a cross-encoder reranker to fix this. The key insight: we trained it on conversations from our entire customer base, not just one org. 315,000 training pairs. Mixed negatives — hard negatives from the same org's KB, random negatives from different orgs' KBs.

That cross-org training is what makes it work. A model trained on a single org's data learns that org's vocabulary. A model trained across dozens of orgs learns what "relevant" actually means across domains. It understands that a cybersecurity buyer asking about "compliance" needs different content than an HR tech buyer asking the same word.
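The mixed-negative construction described above can be sketched roughly like this. Field names, the sampling ratio, and the helper itself are illustrative, not our actual pipeline; the point is the mix of hard negatives (same org's KB) and random negatives (other orgs' KBs).

```python
import random

def build_training_pairs(sessions, kb_by_org, hard_ratio=0.5, seed=0):
    """Build (query, passage, label) pairs for a cross-encoder reranker.

    sessions: dicts with 'org', 'question', 'answer_chunk'.
    kb_by_org: org -> list of KB chunks.
    Each positive pairs a question with the chunk that answered it;
    each negative is hard (same org's KB) or random (another org's KB).
    """
    rng = random.Random(seed)
    pairs, orgs = [], list(kb_by_org)
    for s in sessions:
        pairs.append((s["question"], s["answer_chunk"], 1))
        if rng.random() < hard_ratio:
            # Hard negative: a different chunk from the same org's KB.
            pool = [c for c in kb_by_org[s["org"]] if c != s["answer_chunk"]]
        else:
            # Random negative: a chunk from some other org's KB.
            other = rng.choice([o for o in orgs if o != s["org"]])
            pool = kb_by_org[other]
        pairs.append((s["question"], rng.choice(pool), 0))
    return pairs
```

With a 50/50 hard ratio, roughly half the negatives force the model to distinguish near-misses within one org's vocabulary, while the random half anchor what "irrelevant" looks like across domains — the "mixed" part of mixed negatives.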

The result: 34% more relevant knowledge base entries surfaced compared to our previous model. Same architecture, same latency. Just better training data.

Layer 2: Evaluation as a continuous feedback loop

Every conversation that flows through our system gets evaluated. Not by a human — that doesn't scale. By an LLM judge scoring across four dimensions:

  • Accuracy (30%) — did the AI get the facts right?
  • Sales effectiveness (25%) — did it move the conversation toward qualification?
  • Human-like quality (25%) — did it sound natural, not robotic?
  • Professional judgment (20%) — did it know when to push and when to back off?
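The weighted composite is just a dot product of dimension scores and the weights listed above. The dimension names and the example scores here are illustrative; producing the per-dimension scores is the LLM judge's job.

```python
WEIGHTS = {
    "accuracy": 0.30,
    "sales_effectiveness": 0.25,
    "human_like_quality": 0.25,
    "professional_judgment": 0.20,
}

def composite_score(dims):
    # dims: dimension name -> 0-100 score from the LLM judge.
    assert set(dims) == set(WEIGHTS), "judge must score every dimension"
    return round(sum(WEIGHTS[k] * v for k, v in dims.items()), 1)

def is_gold_standard(dims, threshold=85):
    # Sessions at or above the threshold feed the fine-tuning set.
    return composite_score(dims) >= threshold

example = {
    "accuracy": 92,
    "sales_effectiveness": 88,
    "human_like_quality": 90,
    "professional_judgment": 80,
}
print(composite_score(example))  # → 88.1
```

Note how the weighting plays out: this session clears the 85 gold-standard bar despite a mediocre professional-judgment score, because accuracy carries the largest weight.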

Each evaluation includes structured feedback: what the AI did well, and specific issues requiring attention. Over two years, these evaluations have surfaced clear patterns across our entire customer base:

  • Missed opportunity — 33% of flagged issues
  • Poor follow-up — 14%
  • Factual inaccuracy — 8%
  • Robotic engagement — 7%

These aren't abstract quality metrics. They're direct training signals. Every time we retrain a model or adjust a prompt, we can measure whether "missed opportunity" went from 33% to 25%. The evaluation data tells us exactly what to improve and whether we actually improved it.
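Tracking those failure-mode percentages across retrains reduces to a counter over structured feedback. The `issues` schema here is illustrative; the category labels come from the list above.

```python
from collections import Counter

def issue_distribution(evaluations):
    """Percentage of flagged issues by category across evaluated sessions.

    evaluations: iterable of dicts with an 'issues' list of category labels.
    """
    counts = Counter(issue for e in evaluations for issue in e["issues"])
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}

batch = [
    {"issues": ["missed_opportunity"]},
    {"issues": ["missed_opportunity", "poor_follow_up"]},
    {"issues": ["factual_inaccuracy"]},
]
print(issue_distribution(batch))
# → {'missed_opportunity': 50.0, 'poor_follow_up': 25.0, 'factual_inaccuracy': 25.0}
```

Run the same aggregation before and after a retrain and the delta per category is the measurement the paragraph above describes.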

A new customer deploying today benefits from patterns learned across every conversation we've ever had. Their AI agent starts smarter on day one, because our evaluation system has already identified — and we've already corrected — the most common failure modes.

Layer 3: Reasoning traces change the game

Most AI systems only record inputs and outputs. The buyer asked X, the AI responded Y. Useful, but limited.

Our multi-agent architecture records the full reasoning chain for every turn. What knowledge base entries were retrieved. How the orchestrator decided which specialist agent should handle the message. What qualification criteria were considered. Which conversation phase the AI determined it was in.

630,000 of these traces.

This matters because it lets us train on how to think about sales, not just what to say. When a new model sees thousands of examples of "buyer asked a technical question → orchestrator routed to Technical Consultant → retrieved security docs → crafted response with specific implementation details," it learns the decision-making pattern. Not just the final answer.
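A single trace might be shaped like the record below. All field names are illustrative, not our actual schema; the point is that the whole decision chain is captured, where most systems would log only the last field.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    # What the buyer said on this turn.
    buyer_message: str
    # KB entries the retrieval layer surfaced for this turn.
    retrieved_chunks: list
    # Which specialist agent the orchestrator routed to, and why.
    routed_agent: str
    routing_rationale: str
    # Conversation phase the AI determined it was in.
    phase: str
    # Qualification criteria considered on this turn.
    qualification_signals: list = field(default_factory=list)
    # The final response — the only thing most systems would log.
    response: str = ""

trace = ReasoningTrace(
    buyer_message="How do you handle SSO?",
    retrieved_chunks=["security/sso.md", "security/saml.md"],
    routed_agent="technical_consultant",
    routing_rationale="technical question about authentication",
    phase="evaluation",
    qualification_signals=["security_requirements"],
    response="We support SAML 2.0 and OIDC.",
)
```

Serialized, each trace becomes one training example of the "question → routing → retrieval → response" pattern described above, rather than a bare question/answer pair.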

These traces are also how we debug at scale. When a conversation goes wrong, we can trace exactly where the reasoning broke down — was it retrieval? Routing? The response itself? That diagnostic capability feeds back into the system.

Layer 4: Chunking and RAG that improves with usage

Chunking and retrieving knowledge base content sounds like a solved problem. It's not.

Over two years, we've iterated from a basic "chunk by paragraph" approach to a system that understands what types of content buyers actually need. Our retrieval pipeline currently fires multiple search strategies per turn — semantic search, conversation history matching, pain point matching, industry-specific retrieval.
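Fusing the results of several parallel strategies can be sketched with reciprocal rank fusion. The strategy names come from the list above; RRF is a standard fusion choice used here for illustration, not necessarily our production algorithm.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of chunk IDs from parallel retrieval strategies.

    Each chunk's fused score sums 1/(k + rank) over every list it appears
    in, so chunks surfaced by multiple strategies rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["pricing_faq", "sso_docs", "case_study"]
history = ["sso_docs", "integration_guide"]
pain_point = ["sso_docs", "pricing_faq"]
print(reciprocal_rank_fusion([semantic, history, pain_point]))
# → sso_docs first: it appears in all three lists
```

A chunk that three independent strategies all surface almost certainly matters; fusion makes that agreement explicit instead of leaving it to whichever single query happened to rank it highest.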

Every conversation teaches us which chunks get surfaced and which actually help. When a session scores 95 and the AI used a specific knowledge base entry to handle an objection, that's a signal about chunk quality. When a session scores 40 because the right content was in the KB but never got retrieved, that's a signal about retrieval strategy.

This data drove our decision to train a custom reranker. And it's driving our next move: consolidating from 15 parallel retrieval queries per turn down to 3, with a reranker compensating for the narrower initial retrieval. We can only make that trade-off confidently because we have two years of data showing which retrieval strategies actually contribute to good outcomes.

Layer 5: Training our own model

Everything above was a prerequisite for the most ambitious use of our data: training our own language model.

We fine-tuned a 14B-parameter open-weight model on 18,000 gold-standard conversation turns. The resulting model matches GPT-4 on our task 80% of the time, at 37% lower latency and a fraction of the inference cost.
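In sketch form, preparing that training set is a filter plus a reformat into chat messages. The 85 threshold comes from the gold-standard definition earlier; the field names and the JSONL chat-messages schema follow a common fine-tuning convention, not necessarily our exact format.

```python
import json

def to_sft_examples(turns, threshold=85):
    """Convert evaluated conversation turns into chat-format SFT records.

    turns: dicts with 'session_score', 'buyer_message', 'ai_response',
    and optional 'system_context' (e.g. retrieved KB text for the turn).
    Only turns from sessions at or above the threshold survive.
    """
    examples = []
    for t in turns:
        if t["session_score"] < threshold:
            continue  # keep only gold-standard sessions
        examples.append({"messages": [
            {"role": "system", "content": t.get("system_context", "")},
            {"role": "user", "content": t["buyer_message"]},
            {"role": "assistant", "content": t["ai_response"]},
        ]})
    return examples

turns = [
    {"session_score": 95, "buyer_message": "Do you integrate with Salesforce?",
     "ai_response": "Yes, native two-way sync.", "system_context": "KB: integrations"},
    {"session_score": 40, "buyer_message": "Pricing?", "ai_response": "It depends."},
]
jsonl = "\n".join(json.dumps(e) for e in to_sft_examples(turns))
```

Including the retrieved context in the system message is what lets the fine-tuned model learn to answer *from* retrieved knowledge rather than memorizing one org's facts.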

But here's the thing: that model only exists because of the data layers underneath it.

  • Without 27,000 gold-standard sessions, we wouldn't have enough positive examples to train on
  • Without the evaluation system, we wouldn't know which sessions are gold-standard
  • Without reasoning traces, we'd only be training on inputs and outputs — not decision-making
  • Without cross-org diversity, the model would overfit to one company's vocabulary
  • Without the reranker, the model would receive lower-quality context and produce worse responses

Each layer makes the next layer possible. That's the flywheel.

Why this compounds and is hard to replicate

A competitor starting today faces a cold start problem at every layer.

They'd need thousands of live conversations before their evaluation system produces meaningful patterns. They'd need months of eval data before they can reliably identify gold-standard sessions. They'd need gold-standard sessions across multiple orgs and industries before they can train a reranker that generalizes. And they'd need all of the above before fine-tuning a model is even worth attempting.

Each layer takes time to build — not because the engineering is impossibly hard, but because the data takes time to accumulate. You can't shortcut "two years of live B2B sales conversations across dozens of organizations." There's no synthetic data trick that replicates the variety of real buyer questions from real industries with real outcomes.

The moat isn't the model. The moat isn't the retrieval system. The moat is the data that makes both of them better with every conversation — and the compounding effects between layers that multiply the value of each new data point.

What the next two years look like

The flywheel is spinning, but we're still early.

Reinforcement learning from outcomes. Right now, we train on imitation: learn from conversations that scored well. The next step is training on outcomes: learn from conversations that actually converted. Supervised fine-tuning (SFT) teaches the model to sound good. Reinforcement learning teaches it to close.

Continuous retraining. As more orgs onboard and conversation volume grows, we can periodically retrain with fresh data. The model improves without manual intervention. Every new industry vertical adds new patterns to the training set.

Collapsing the pipeline. Today, retrieval, reranking, and response generation are separate systems. Research suggests a single model doing all three outperforms the pipeline approach. Our data — retrieval traces, reranker scores, conversation outcomes — is exactly what you'd need to train that unified model.
Two years of powering B2B sales conversations didn't just build a product. It built a data engine that makes the product better automatically.

Every conversation teaches the retrieval system what "relevant" means. Every evaluation teaches the system what "good" looks like. Every reasoning trace teaches the next model how to think. Every new org widens the aperture.

That's the moat. Not code. Not models. Data that compounds — and a system designed to turn every conversation into fuel for the next one.

Want to see what this looks like in practice? Visit our site and start a conversation. You'll be talking to the product of 48,000 conversations that came before yours — and yours will make the next one even better.