Launch HN: Sentrial (YC W26) – Catch AI agent failures before your users do
Hacker News (score: 17)
Here's a demo if you're interested: https://www.youtube.com/watch?v=cc4DWrJF7hk. When agents fail, choose the wrong tools, or blow through cost budgets, there's no good way to know why: usually just logs and guesswork. As agents move from demos to production with real SLAs and real users, that's not sustainable.
Neel and I lived this while building agents at SenseHQ and Accenture, where we found that debugging agents was often harder than building them. Agents are untrustworthy in prod because there's no good infrastructure to verify what they're actually doing.
In practice this looks like:

- A support agent that began misclassifying refund requests as product questions, which meant customers never reached the refund flow.
- A document drafting agent that would occasionally hallucinate missing sections when parsing long specs, producing confident but incorrect outputs.

There's no stack trace or 500 error, and you only figure this out when a customer is angry.
We both realized teams were flying blind in production, and that agent-native monitoring was going to be foundational infrastructure for every serious AI product. We started Sentrial as a verification layer designed to solve this.
How it works: you wrap your client with our SDK in only a couple of lines. From there, we detect drift for you:

- Wrong tool invocations
- Misunderstood intents
- Hallucinations
- Quality regressions over time

You see it on our platform before a customer files a ticket.
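The post doesn't show the SDK's actual API, but the "wrap your client" pattern it describes can be sketched in a few lines. Everything below (the `MonitoredAgent` name, the `wrong_tool_rate` metric, the toy tools) is hypothetical, purely to illustrate how a wrapper can record tool invocations and surface wrong-tool drift:

```python
# Hypothetical sketch of the "wrap your client" pattern; Sentrial's real
# SDK API is not shown in the post. The wrapper records every tool call
# so drift (e.g. the wrong tool being chosen) can be measured afterward.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class MonitoredAgent:
    """Wraps an agent's tool dispatcher and records each invocation."""
    call_tool: Callable[[str, dict], Any]
    events: list = field(default_factory=list)

    def __call__(self, tool: str, args: dict) -> Any:
        result = self.call_tool(tool, args)
        self.events.append({"tool": tool, "args": args})
        return result

    def wrong_tool_rate(self, expected: str) -> float:
        """Fraction of recorded calls that did not use the expected tool."""
        if not self.events:
            return 0.0
        wrong = sum(1 for e in self.events if e["tool"] != expected)
        return wrong / len(self.events)


# Toy dispatcher standing in for a real agent's tool layer.
tools = {"refund": lambda a: "refund flow", "faq": lambda a: "faq answer"}
agent = MonitoredAgent(lambda name, args: tools[name](args))

agent("refund", {"order": 1})
agent("faq", {"order": 2})  # a misrouted refund request
print(agent.wrong_tool_rate("refund"))  # → 0.5
```

In a real deployment the recorded events would be shipped to a backend rather than kept in memory, but the wrapping shape is the same: intercept, record, forward.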
There's also a quick MCP setup; just give Claude Code:

claude mcp add --transport http Sentrial https://www.sentrial.com/docs/mcp
We have a free tier (14 days, no credit card required). We'd love feedback from anyone running agents, whether for personal use or in a professional setting.
We’ll be around in the comments!
Launch HN: IonRouter (YC W26) – High-throughput, low-cost inference
Hey HN — I'm Veer and my cofounder is Suryaa. We're building Cumulus Labs (YC W26), and we're releasing our latest product, IonRouter (https://ionrouter.io/), an inference API for open-source and fine-tuned models. You swap in our base URL, keep your existing OpenAI client code, and get access to any model (open source or fine-tuned for you) running on our own inference engine.

The problem we kept running into: every inference provider is either fast-but-expensive (Together, Fireworks — you pay for always-on GPUs) or cheap-but-DIY (Modal, RunPod — you configure vLLM yourself and deal with slow cold starts). Neither felt right for teams that just want to ship.

Suryaa spent years building GPU orchestration infrastructure at TensorDock and production systems at Palantir. I led ML infrastructure and Linux kernel development for Space Force and NASA contracts where the stack had to actually work under pressure. When we started building AI products ourselves, we kept hitting the same wall: GPU infrastructure was either too expensive or too much work.

So we built IonAttention — a C++ inference runtime designed specifically around the GH200's memory architecture. Most inference stacks treat the GH200 as a compatibility target (make sure vLLM runs, use CPU memory as overflow).
We took a different approach and built around what makes the hardware actually interesting: a 900 GB/s coherent CPU-GPU link, 452 GB of LPDDR5X sitting right next to the accelerator, and 72 ARM cores you can actually use.

Three things came out of that which we think are novel: (1) using hardware cache coherence to make CUDA graphs behave as if they have dynamic parameters at zero per-step cost — something that only works on GH200-class hardware; (2) eager KV block writeback driven by immutability rather than memory pressure, which drops eviction stalls from 10ms+ to under 0.25ms; (3) phantom-tile attention scheduling at small batch sizes that cuts attention time by over 60% in the worst-affected regimes. We wrote up the details at cumulus.blog/ionattention.

On multimodal pipelines we get better performance than the big players (588 tok/s vs. Together AI's 298 on the same VLM workload). We're honest that p50 latency is currently worse (~1.46s vs. 0.74s) — that's the tradeoff we're actively working on.

Pricing is per token, with no idle costs: GPT-OSS-120B is $0.02 in / $0.095 out, Qwen3.5-122B is $0.20 in / $1.60 out. The full model list and pricing are at https://ionrouter.io.

You can try the playground at https://ionrouter.io/playground right now, no signup required, or drop in your API key and swap the base URL — it's one line. We built this so teams can see the power of our engine and eventually come to us for their fine-tuned model needs using the same solution.

We're curious what you think, especially if you're running fine-tuned or custom models — that's the use case we've invested the most in. What's broken, and what would make this actually useful for you?
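The "swap the base URL, keep your OpenAI client code" integration can be sketched as an OpenAI-compatible HTTP request. The `/v1` path, the `/chat/completions` endpoint, and the model id below are assumptions (the post only gives the base domain), and stdlib `urllib` is used here so the sketch stays self-contained:

```python
# Sketch of pointing existing OpenAI-style code at IonRouter.
# Assumptions: the "/v1" path, the "/chat/completions" endpoint, and the
# model id are inferred from the OpenAI-compatible API convention, not
# confirmed by the post — check ionrouter.io for the real values.
import json
import urllib.request

IONROUTER_BASE_URL = "https://ionrouter.io/v1"  # assumed base path


def build_chat_request(api_key: str, model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completions request against IonRouter."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{IONROUTER_BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "YOUR_API_KEY",
    "gpt-oss-120b",  # hypothetical model id; see ionrouter.io for the list
    [{"role": "user", "content": "Hello"}],
)
# req is ready to send with urllib.request.urlopen(req)
```

With the official `openai` Python package the same swap is the `base_url` constructor argument, which is what makes it a one-line change for existing client code.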