
Adding AI to our own SaaS: the three questions that shaped NetPath's agent

We're building NetPath, a multi-vendor network observability platform, and we recently shipped an LLM-powered assistant into it. Here's what we asked ourselves before writing a single prompt.

By Coplango Engineering · 8 min read

There's a specific flavour of LLM integration advice that goes around the consultancy circuit: three buzzwords on a slide, a RAG diagram, a vague story about "enhanced productivity". We've sat through enough of those presentations to know what they're worth.

The reason we're writing this one is that we're in the middle of doing the work ourselves. NetPath — the multi-vendor network observability and analysis platform we're building — shipped an LLM-powered assistant into its chat interface this quarter. It's our own SaaS. We're our own client. And before we wrote a single prompt, we ran ourselves through the same three-question framework we walk client engagements through. This post is what came out of that exercise.

If you're thinking about adding AI to a platform you already run, or if you're scoping an AI pilot internally, the framework is directly applicable. If you're just curious how we're building the AI side of NetPath, that's what the rest of the post is about.

The three questions (and our answers)

1. What's the counterfactual?

Every proposed AI feature has a counterfactual: what does a user do today, without the model?

For NetPath, the baseline is our UI. A network engineer can already answer most questions about their topology by clicking through our existing interface. "How many routers are in the network?" is a number at the top of the dashboard. "What's the shortest path from router-a to router-b?" is a topology view with a path overlay. "Show me all critical alerts" is a filtered list.

None of that is hard. But it's twenty seconds of clicks every time, and for compound questions — "find me all routers with asymmetric OSPF neighbors on interfaces above 80% utilization" — there's no single view. The engineer either builds a mental model from three dashboards, writes a SQL query against our Postgres, or gives up.
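
To make the friction concrete, here is roughly what that compound query means if you have to express it yourself. This is a minimal Python sketch over a hypothetical in-memory topology shape (the field names are illustrative, not our actual schema):

```python
def routers_with_asymmetric_neighbors(routers, threshold=0.80):
    """Hypothetical compound filter: routers with at least one OSPF
    adjacency seen in only one direction, on an interface whose
    utilization is above the threshold."""
    hits = []
    for r in routers:
        for iface in r["interfaces"]:
            if iface["utilization"] <= threshold:
                continue
            # Asymmetric: we see a neighbor on this interface, but the
            # neighbor's own adjacency list doesn't include us back.
            for nbr in iface["ospf_neighbors"]:
                if r["hostname"] not in nbr["sees"]:
                    hits.append((r["hostname"], iface["name"]))
    return hits
```

Three nested conditions, none of which any single dashboard view expresses, which is exactly why engineers fall back to SQL or give up.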

So the counterfactual wasn't "nothing". It was "the existing UI, with friction that compounds on compound questions". The bar we set for the AI agent:

  • Save at least 30 seconds on single-concept queries (the kind you'd otherwise click through).
  • Make compound queries possible for users who can't or won't write SQL.
  • Never be the only source of truth. Every answer the model gives has to be verifiable in the existing UI with one click.

That last bullet is the one that matters the most, and it's what decides the answer to question two.

2. What's the failure mode we can afford?

LLMs are probabilistic. Even ours, with strict tool calling and retrieval over real topology data, will occasionally hallucinate a hostname, misread a path cost, or confidently answer a question with stale data from the wrong tool call.

In our three-category framing — Category A (user catches errors at zero cost), Category B (human review before propagation), Category C (output goes straight to production) — we landed firmly on Category A.

The reasoning: network engineers aren't casual users. They will absolutely notice if the model says "there are 47 routers" and the topology view in the next tab says 48. They're paid to notice. The failure mode we could afford was "the model is wrong and the engineer sees it immediately, with one click to verify". What we could not afford was "the model takes an action — reroutes traffic, changes BGP local-pref, triggers a discovery job — based on a wrong assumption."

Everything downstream of that decision flows from it:

  • The NetPath agent has ten read-only tools: topology summary, router / link / interface queries, path calculation, failure simulation, SPOF analysis, centrality, asymmetry, alerts, metrics. Zero tools that mutate state. Zero tools that commit config.
  • Every tool call returns structured data the engineer can cross-check. Path calculations return the hop list; the engineer can click any hop and land in the deterministic topology view for that router. If the model misread the path, the click catches it.
  • Every answer is persisted with its tool calls and raw tool outputs. When an engineer says "I don't trust that answer", we can pull the exact data the model saw and diff it against what the UI would have shown. No guessing.
  • The agent runs at most 10 tool-calling rounds per user message, after which it hands off with a clear "I can't answer this reliably — try rephrasing" response. We'd rather fail loudly than loop silently burning tokens.
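
The round-capped loop in that last bullet is simple enough to sketch. This is an illustrative skeleton, not our production code, and the message shapes (`ask_model`, the `"tool"`/`"answer"` reply types) are hypothetical:

```python
MAX_ROUNDS = 10  # hard cap on tool-calling rounds per user message

def run_agent(ask_model, tools, user_message):
    """Round-capped tool-calling loop. Every tool in `tools` is
    read-only; after MAX_ROUNDS the agent fails loudly instead of
    silently looping and burning tokens."""
    history = [{"role": "user", "content": user_message}]
    for _ in range(MAX_ROUNDS):
        reply = ask_model(history)  # final text, or a tool request
        if reply["type"] == "answer":
            return reply["text"]
        # KeyError here means the model asked for a tool we don't expose.
        tool = tools[reply["tool"]]
        result = tool(**reply["args"])  # queries only, never mutations
        history.append({"role": "tool", "name": reply["tool"],
                        "content": result})
    return "I can't answer this reliably — try rephrasing."
```

Persisting `history` as-is at the end of each run is what gives us the audit trail described above: the exact data the model saw, diffable against what the UI would have shown.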

That's the architecture a Category A assistant gets. If we'd gone straight for a Category C agent — auto-remediation, auto-failover, auto-config — we'd be writing a very different system with ten times the engineering cost and much stricter eval requirements. We're not there yet, and we're not sure we will be this year.

3. What's our moat?

This is the question almost nobody asks at kickoff. Every LLM engagement starts with "we'll use GPT-5" or "we'll use Claude" or "we'll run Llama on our own hardware". None of that is the moat. The foundation model is a commodity that any competitor can buy from three vendors for the price of an API call.

Our moat, for the NetPath agent, is in three places:

The tools, not the prompts. The ten tools we wrote are the real product. A generic chatbot can't run a single-point-of-failure analysis against your live network topology. A generic chatbot doesn't know what an OSPFv3 interface ID is, or why it doesn't match ifIndex on Juniper but does on Arista. Our tools are bound to our topology graph, our multi-vendor normalization layer, our BGP-LS ingestion pipeline, and our ClickHouse analytics store. Swapping the LLM behind them is a weekend of work. Replicating the tools is months of network-engineering expertise.

The evaluation harness, built second. We made a deliberate decision early on to build the eval harness before the production chat UI. It's a set of ~80 real questions — the kind a NOC engineer would actually ask — each with a known correct answer that we can compute directly from the database. Every prompt change, every model upgrade, every new tool runs through this harness. We know, with a number, whether a change improved things or regressed them. Most AI teams we see are flying blind on this; they'll ship a prompt change because "it feels better on the example we tested" and then wonder why something else broke two weeks later.
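
The harness itself is structurally boring, which is the point. A minimal sketch of the scoring loop (hypothetical function names; the real harness also handles fuzzy matching and tool-call inspection):

```python
def run_harness(questions, answer_fn, ground_truth_fn):
    """Score the agent against questions with computable answers.
    `ground_truth_fn` derives the known-correct answer directly from
    the database; `answer_fn` is the agent under test. Returns the
    pass rate and the failing cases for triage."""
    passed, failures = 0, []
    for q in questions:
        expected = ground_truth_fn(q)
        got = answer_fn(q)
        if got == expected:
            passed += 1
        else:
            failures.append((q, expected, got))
    return passed / len(questions), failures
```

A single pass-rate number per change is what turns "it feels better" into "it went from 0.86 to 0.91" or "it regressed four cases, here they are".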

The feedback loop, wired in from day one. Every chat session is persisted with its tool call history. We review logs weekly looking for two patterns: questions the model answered wrong (becomes a new eval case) and questions the model tried to answer but couldn't because no existing tool covered the query (becomes a new tool or a tool-description refinement). The flywheel is that real usage data refines the system — which is exactly the moat that a competitor bolting on GPT-5 in a week doesn't have.
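
The weekly triage over persisted sessions reduces to sorting questions into those two buckets. A toy sketch, assuming a hypothetical session record shape (`flagged_wrong`, `tool_calls`) rather than our actual log schema:

```python
def triage_sessions(sessions):
    """Weekly log review: wrong answers become new eval cases;
    questions the model had no applicable tool for become
    candidates for a new tool or a tool-description fix."""
    new_eval_cases, tool_gaps = [], []
    for s in sessions:
        if s.get("flagged_wrong"):
            new_eval_cases.append(s["question"])
        elif not s["tool_calls"]:  # model never found a tool to call
            tool_gaps.append(s["question"])
    return new_eval_cases, tool_gaps
```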

Where we're at, concretely

The agent is in active alpha. It's wired into the chat interface that sits beside the topology view. It answers questions about the connected network in natural language, runs path calculations and failure simulations on demand, and surfaces results as tables with deep-links back into the deterministic UI.

What we're learning in live usage:

  • Tool descriptions matter more than the system prompt. The first cut of our system prompt was long and careful. It told the model exactly when to use each tool. That prompt has since been trimmed to about a third of its original size, because we found that most of our "the model used the wrong tool" incidents were fixed by rewriting the tool's own description to be clearer about its scope and its limits.
  • Structured output beats prose. Network engineers don't want paragraphs. They want a table with hostnames and numbers they can verify. We nudged the system prompt hard toward tables-first, and user satisfaction jumped.
  • The top 20 questions are 80% of usage. We expected a long tail. We got a very short tail: the same handful of queries account for almost all sessions. That means our eval harness can be small and focused, and it means tool coverage for the top 20 questions is worth more than clever reasoning for the rest.
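
The tool-description point is easiest to show with a before/after. This is an invented example in the spirit of the fixes we made, not one of our actual tool schemas:

```python
# Before: vague scope. The model regularly picked this tool for
# path and OSPF questions it couldn't actually answer.
before = {
    "name": "get_link_metrics",
    "description": "Get metrics for links.",
}

# After: states what the tool covers AND what it doesn't. Most of
# our "wrong tool" incidents were fixed by edits like this one,
# not by growing the system prompt.
after = {
    "name": "get_link_metrics",
    "description": (
        "Return utilization and error counters for a single link, "
        "identified by its two endpoint routers. Does NOT compute "
        "paths and does NOT cover interface-level OSPF state."
    ),
}
```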

There's a whole other article to write about the decisions we made at the tool level — how we handled vendor-specific quirks, why we chose SSE streaming over WebSockets for the chat UI, how we persist tool results for auditability without blowing up the database. Not today.

What this framework is good for

If you're thinking about adding an LLM-powered feature to a product you already run, or you're scoping an internal AI project, these three questions are a solid starting point:

  1. What's the counterfactual? Is there a number attached to the friction the AI is supposed to remove?
  2. What failure mode can you afford? Are you building for the right category — and are you willing to spend the engineering that category actually requires?
  3. What's your moat? If it's just "we'll use GPT-5", you don't have one yet. The moat is in your tools, your evaluation harness, and your feedback loop — and all three of those have to be built deliberately, usually before the glamorous demo-able chat UI.

If the answers are clear, you're past the 90th percentile of the AI projects we see. If they're not, the fix isn't a bigger model — it's going back to the scoping conversation and admitting the answers aren't obvious yet.

If you want a second pair of eyes on that conversation — or you're curious how we're applying the framework to NetPath and want a closer look — get in touch. Our AI engagements start with a scoping call exactly like this one. No slides, no platform pitch, just the three questions.
