🛠️ All DevTools
Showing 1–20 of 3738 tools
Last Updated: March 12, 2026 at 04:03 PM
Show HN: Axe – A 12MB binary that replaces your AI framework
Hacker News (score: 57)[Other] Show HN: Axe – A 12MB binary that replaces your AI framework
Show HN: We analyzed 1,573 Claude Code sessions to see how AI agents work
Hacker News (score: 84)[Monitoring/Observability] Show HN: We analyzed 1,573 Claude Code sessions to see how AI agents work

We built rudel.ai after realizing we had no visibility into our own Claude Code sessions. We were using it daily but had no idea which sessions were efficient, why some were abandoned, or whether we were actually improving over time.

So we built an analytics layer for it. After connecting our own sessions, we ended up with a dataset of 1,573 real Claude Code sessions, 15M+ tokens, and 270K+ interactions.

Some things we found that surprised us:
- Skills were used in only 4% of our sessions
- 26% of sessions are abandoned, most within the first 60 seconds
- Session success rate varies significantly by task type (documentation scores highest, refactoring lowest)
- Error cascade patterns appear in the first 2 minutes and predict abandonment with reasonable accuracy
- There is no meaningful benchmark for "good" agentic session performance; we are building one

The tool is free to use and fully open source. Happy to answer questions about the data or how we built it.
google-ai-edge/LiteRT
GitHub Trending[Other] LiteRT, the successor to TensorFlow Lite, is Google's on-device framework for high-performance ML & GenAI deployment on edge platforms via efficient conversion, runtime, and optimization.
InsForge/InsForge
GitHub Trending[Other] Give agents everything they need to ship fullstack apps. The backend built for agentic development.
langflow-ai/openrag
GitHub Trending[Other] OpenRAG is a comprehensive, single-package Retrieval-Augmented Generation platform built on Langflow, Docling, and OpenSearch.
Show HN: Autoresearch@home
Hacker News (score: 49)[Other] Show HN: Autoresearch@home

autoresearch@home is a collaborative research collective where AI agents share GPU resources to collectively improve a language model. Think SETI@home, but for model training.

How it works: agents read the current best result, propose a hypothesis, modify train.py, run the experiment on your GPU, and publish results back. When an agent beats the current best validation loss, that becomes the new baseline for every other agent. Agents learn from great runs and from failures, since we're using Ensue as the collective memory layer.

This project extends Karpathy's autoresearch by adding the missing coordination layer so agents can actually build on each other's work.

To participate, you need an agent and a GPU. The agent handles everything: cloning the repo, connecting to the collective, picking experiments, running them, publishing results, and asking you to verify you're a real person via email.

Send this prompt to your agent to get started: Read https://github.com/mutable-state-inc/autoresearch-at-home, follow the instructions, join autoresearch, and start contributing.

This whole experiment is to prove that agents work better when they can build off other agents' work. The timeline is live, so you can watch experiments land in real time.
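The coordination loop the post describes (read the current best, run an experiment, publish, and promote the result to the new baseline when it beats the best validation loss) can be sketched in a few lines. This is a self-contained toy, not the project's actual code; `Collective`, `agent_step`, and the fake experiment are all illustrative names.

```python
import random

class Collective:
    """Toy stand-in for the shared coordination/memory layer (illustrative only)."""
    def __init__(self, baseline_loss):
        self.best_loss = baseline_loss
        self.history = []  # every published (hypothesis, loss), wins and failures alike

    def publish(self, hypothesis, loss):
        self.history.append((hypothesis, loss))
        if loss < self.best_loss:        # beat the baseline: becomes everyone's new target
            self.best_loss = loss
            return True
        return False

def agent_step(collective, run_experiment):
    """One agent iteration: read current best, propose, run, publish."""
    target = collective.best_loss
    hypothesis = f"try to beat {target:.4f}"
    loss = run_experiment(hypothesis)
    return collective.publish(hypothesis, loss)

# Toy 'experiments': noisy attempts around the current baseline.
random.seed(0)
c = Collective(baseline_loss=3.0)
for _ in range(10):
    agent_step(c, lambda h: c.best_loss + random.uniform(-0.05, 0.05))
```

The key property is monotonicity: `best_loss` only moves down, so every agent always builds on the strongest published result.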
Show HN: A context-aware permission guard for Claude Code
Hacker News (score: 55)[Other] Show HN: A context-aware permission guard for Claude Code

We needed something like --dangerously-skip-permissions that doesn't nuke your untracked files, exfiltrate your keys, or install malware.

Claude Code's permission system is allow-or-deny per tool, but that doesn't really scale. Deleting some files is fine sometimes, and git checkout is sometimes not fine. Even when you curate permissions, 200 IQ Opus can find a way around them. Maintaining a deny list is a fool's errand.

nah is a PreToolUse hook that classifies every tool call by what it actually does, using a deterministic classifier that runs in milliseconds. It maps commands to action types like filesystem_read, package_run, db_write, and git_history_rewrite, and applies policies: allow, context (depends on the target), ask, or block.

Not everything can be classified, so you can optionally escalate ambiguous calls to an LLM, but that's not required. Anything unresolved you can approve manually, then configure the taxonomy so you don't get asked again.

It works out of the box with sane defaults, no config needed, but you can customize it fully if you want to.

No dependencies, stdlib Python, MIT.

pip install nah && nah install

https://github.com/manuelschipper/nah
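The core idea (a deterministic command-to-action-type classifier with per-action policies) can be sketched with the stdlib. The rules and taxonomy below are invented for illustration and are not nah's actual classifier:

```python
import shlex

# Illustrative taxonomy and policies -- NOT nah's actual rules.
ACTION_RULES = [
    (("cat", "ls", "head", "grep"), "filesystem_read"),
    (("rm", "mv", "shred"), "filesystem_write"),
    (("git",), "git"),                    # refined by subcommand below
    (("pip", "npm", "npx"), "package_run"),
]
POLICY = {
    "filesystem_read": "allow",
    "filesystem_write": "context",        # decision depends on the target path
    "git": "allow",
    "git_history_rewrite": "ask",
    "package_run": "ask",
}

def classify(command: str) -> str:
    argv = shlex.split(command)
    if not argv:
        return "unknown"
    for names, action in ACTION_RULES:
        if argv[0] in names:
            # Same binary, different blast radius: history rewrites get their own type.
            if action == "git" and len(argv) > 1 and argv[1] in ("rebase", "reset", "filter-branch"):
                return "git_history_rewrite"
            return action
    return "unknown"

def policy_for(command: str) -> str:
    # Unclassified commands escalate (to an LLM or the user) rather than default-allow.
    return POLICY.get(classify(command), "ask")
```

The deny-list problem the post mentions is exactly what this structure avoids: the default for anything unrecognized is "ask", not "allow".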
CRusTTY: A pedagogical C interpreter with time-travel debugging capabilities
Hacker News (score: 11)[Other] CRusTTY: A pedagogical C interpreter with time-travel debugging capabilities
Launch HN: Sentrial (YC W26) – Catch AI agent failures before your users do
Hacker News (score: 17)[Monitoring/Observability] Launch HN: Sentrial (YC W26) – Catch AI agent failures before your users do

Hey HN! We're Neel and Anay, and we're building Sentrial (https://sentrial.com). It's production monitoring for AI products. We automatically detect failure patterns: loops, hallucinations, tool misuse, and user frustration the moment they happen. When issues surface, Sentrial diagnoses the root cause by analyzing conversation patterns, model outputs, and tool interactions, then recommends specific fixes.

Here's a demo if you're interested: https://www.youtube.com/watch?v=cc4DWrJF7hk. When agents fail, choose the wrong tools, or blow cost budgets, there's no way to know why; usually you have just logs and guesswork. As agents move from demos to production with real SLAs and real users, that is not sustainable.

Neel and I lived this while building agents at SenseHQ and Accenture, where we found that debugging agents was often harder than building them. Agents are untrustworthy in prod because there's no good infrastructure to verify what they're actually doing.

In practice this looks like:
- A support agent that began misclassifying refund requests as product questions, which meant customers never reached the refund flow.
- A document drafting agent that would occasionally hallucinate missing sections when parsing long specs, producing confident but incorrect output.

There's no stack trace or 500 error; you only figure this out when a customer is angry.

We both realized teams were flying blind in production, and that agent-native monitoring was going to be foundational infrastructure for every serious AI product. We started Sentrial as a verification layer designed to take care of this.

How it works: you wrap your client with our SDK in only a couple of lines. From there, we detect drift for you:
- Wrong tool invocations
- Misunderstood intents
- Hallucinations
- Quality regressions over time

You see it on our platform before a customer files a ticket.

There's a quick MCP setup; just give Claude Code: claude mcp add --transport http Sentrial https://www.sentrial.com/docs/mcp

We have a free tier (14 days, no credit card required). We'd love feedback from anyone running agents, whether for personal use or in a professional setting.

We'll be around in the comments!
Show HN: I built a tool that watches webpages and exposes changes as RSS
Hacker News (score: 48)[Other] Show HN: I built a tool that watches webpages and exposes changes as RSS

I built Site Spy after missing a visa appointment slot because a government page changed and I didn't notice for two weeks.

It watches webpages for changes and shows the result as a diff. The part I think HN might find interesting is that it can monitor a specific element on a page, not just the whole page, and it can expose changes as RSS feeds.

So instead of tracking an entire noisy page, you can watch just a price, a stock status, a headline, or a specific content block. When it changes, you can inspect the diff, browse the snapshot history, or follow the updates in an RSS reader.

It's a Chrome/Firefox extension plus a web dashboard.

Main features:
- Element picker for tracking a specific part of a page
- Diff view plus full snapshot timeline
- RSS feeds per watch, per tag, or across all watches
- MCP server for Claude, Cursor, and other AI agents
- Browser push, email, and Telegram notifications

Chrome: https://chromewebstore.google.com/detail/site-spy/jeapcpanagdgipcfnncmogeojgfofige
Firefox: https://addons.mozilla.org/en-GB/firefox/addon/site-spy/
Docs: https://docs.sitespy.app

I'd especially love feedback on two things:
- Is RSS actually a useful interface for this, or do most people just want direct alerts?
- Does element-level tracking feel meaningfully better than full-page monitoring?
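The diff-between-snapshots idea at the core of this tool can be sketched with the stdlib's difflib. This is an illustration of the technique, not Site Spy's implementation; the snapshot strings are made up:

```python
import difflib

def element_diff(old: str, new: str) -> list[str]:
    """Unified diff between two text snapshots of a watched element."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="snapshot@t0", tofile="snapshot@t1", lineterm=""))

# Two captures of the same watched element, e.g. a price block:
old = "Price: $49.99\nIn stock"
new = "Price: $39.99\nIn stock"
for line in element_diff(old, new):
    print(line)
```

Watching one element rather than the whole page means unrelated churn (ads, timestamps, recommendation widgets) never appears in the diff, which is what keeps the resulting RSS feed quiet.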
Launch HN: Prism (YC X25) – Workspace and API to generate and edit videos
Hacker News (score: 23)[API/SDK] Launch HN: Prism (YC X25) – Workspace and API to generate and edit videos

Hey HN! We're Rajit, Land, and Alex. We're building Prism (https://www.prismvideos.com), an AI video creation platform and API.

Here's a quick demo of remixing any video with Prism: https://youtu.be/0eez_2DnayI

Here's a quick demo of automating UGC-style ads with Openclaw + Prism: https://www.youtube.com/watch?v=5dWaD23qnro

Accompanying skill.md file: https://docs.google.com/document/d/1lIskVljW1OqbkXFyXeLHRsfMictCfuxGGwczAnB1vhk

Making an AI video today usually means stitching together a dozen tools (image generation, image-to-video, upscalers, lip-sync, voiceover, and an editor). Every step turns into export/import and file juggling, so assets end up scattered across tabs and local storage, and iterating on a multi-scene video is slow.

Prism keeps the workflow in one place: you generate assets (images and video clips) and assemble them directly in a timeline editor without downloading files between tools. Practically, that means you can try different models (Kling, Veo, Sora, Hailuo, etc.) and settings for a single clip, swap it on the timeline, and keep iterating without re-exporting and rebuilding the edit elsewhere.

We also support templates and one-click asset recreation, so you can reuse workflows from us or the community instead of rebuilding each asset from scratch. Those templates are exposed through our API, letting your AI agents discover templates in our catalog, supply the required inputs, and generate videos in a repeatable way without manually stitching the workflow together.

We built Prism because we were making AI videos ourselves and were unsatisfied with the available tools. We kept losing time to repetitive "glue work": constantly downloading files, keeping track of prompts and versions, and stitching clips together in separate video editing software. We're trying to make the boring parts of multi-step AI video creation less manual so users can generate, review, edit, assemble, and export all inside one platform.

Pricing is based on usage credits, with a free tier (100 credits/month) and free models, so you can try it without providing a credit card: https://prismvideos.com.

We'd love to hear from people who've tried making AI videos: where does your workflow break, which parts are the most tedious, and what do you wish video creation tools could do?
Show HN: Klaus – OpenClaw on a VM, batteries included
Show HN (score: 5)[DevOps] Show HN: Klaus – OpenClaw on a VM, batteries included

We are Bailey and Robbie, and we are working on Klaus (https://klausai.com/): hosted OpenClaw that is secure and powerful out of the box.

Running OpenClaw requires setting up a cloud VM or local container (a pain) or giving OpenClaw root access to your machine (insecure). Many basic integrations (e.g. Slack, Google Workspace) require you to create your own OAuth app.

We make running OpenClaw simple by giving each user their own EC2 instance, preconfigured with keys for OpenRouter, AgentMail, and Orthogonal. And we have OAuth apps to make it easy to integrate with Slack and Google Workspace.

We are both HN readers (Bailey has been on here for ~10 years) and we know OpenClaw has serious security concerns. We do a lot to make our users' instances more secure: we run on a private subnet, we automatically update the OpenClaw version our users run, and because you're on our VM, by default the only keys you leak if you get hacked belong to us. Connecting your email is still a risk. The best defense we know of is Opus 4.6's resilience to prompt injection. If you have a better solution, we'd love to hear it!

We learned a lot about infrastructure management in the past month. Kimi K2.5 and MiniMax M2.5 are extremely good at hallucinating new ways to break openclaw.json and otherwise wreaking havoc on an EC2 instance. The week after our launch we spent 20+ hours fixing broken machines by hand.

We wrote a ton of best practices for using OpenClaw on AWS Linux into our users' AGENTS.md, got really good at un-bricking EC2 machines over SSM, added a command-and-control server to every instance to facilitate hotfixes and migrations, and set up a Klaus instance to answer FAQs on Discord.

In addition to all of this, we built ClawBert, our AI SRE for hotfixing OpenClaw instances automatically: https://www.youtube.com/watch?v=v65F6VBXqKY. ClawBert is a Claude Code instance that runs whenever a health check fails or the user triggers it in the UI. It can read that user's entries in our database and execute commands on the user's instance. We expose a log of ClawBert's runs to the user.

We know that setting up OpenClaw is easy for most HN readers, but I promise it is not for most people. Klaus has a long way to go, but it's still very rewarding to see people who've never used Claude Code get their first taste of AI agents.

We charge $19/mo for a t4g.small, $49/mo for a t4g.medium, and $200/mo for a t4g.xlarge with priority support. You get $15 in tokens and $20 in Orthogonal credits one-time.

We want to know what you are building on OpenClaw so we can make sure we support it. We are already working with companies like Orthogonal and OpenRouter that are building things to make agents more useful, and we're sure there are more tools out there we don't know about. If you've built something agents want, please let us know. Comments welcome!
Searching for the Agentic IDE
Hacker News (score: 25)[Other] Searching for the Agentic IDE https://xcancel.com/karpathy/status/2031616709560610993
Show HN: Ink – Deploy full-stack apps from AI agents via MCP or Skills
Show HN (score: 6)[DevOps] Show HN: Ink – Deploy full-stack apps from AI agents via MCP or Skills

Hi HN, I built Ink, a full-stack deployment platform where the primary users are AI agents, not humans.

We all know AI can write code, but deploying it still requires a human to wire things up: hosting, databases, DNS, and secrets. Ink gives agents those tools directly.

The agent calls "deploy" and the platform auto-detects the framework, builds it, deploys it, and returns a live URL at *.ml.ink. Here's a demo with Claude Code: https://www.youtube.com/watch?v=F6ZM_RrIaC0.

What Ink does that I haven't seen elsewhere:
- One agent skill for compute + databases + DNS + secrets + domains + usage + metrics + logs + scaling. The agent doesn't juggle separate providers: one account, one auth, one set of tools.
- DNS zone delegation. Delegate a zone once (e.g. dev.acme.com) and agents create any subdomain instantly, with no manual DNS-record additions each time and no propagation wait.
- Multiple agents and humans share one workspace and collaborate on projects. I envision a future where many agents collaborate together, and I'm working on a cool demo to share.
- Built-in git hosting. Agents push code and deploy without the human setting up GitHub first; no external account needed. (Of course, if you're a developer you can store code on GitHub; that's the recommended pattern.)

You also get what you'd expect:
- UI with service observability designed for humans (logs, metrics, DNS)
- GitHub integration: a push triggers auto-redeploy
- Per-minute billing for CPU, memory, and egress; no per-seat, no per-agent
- Error responses designed for LLMs: structured reason codes with suggested next actions, not raw stack traces. When a deploy fails, the agent reads the log, fixes it, and redeploys autonomously.

Try it: https://ml.ink. Free $2 trial credits, no credit card. If you want to go further, here's a 20% discount code: "GOODFORTUNE".
Show HN: OpenUI – A code-like rendering spec for Generative UI
Show HN (score: 7)[Other] Show HN: OpenUI – A code-like rendering spec for Generative UI

Thesys just open-sourced their generative UI rendering engine. Interesting timing given where Google's a2ui and Vercel's json-render are headed. The difference worth noting: a2ui and json-render both treat JSONL as the contract between the LLM and the renderer. Thesys is betting that's the wrong primitive. Their engine uses a code-like syntax (OpenUI Lang) instead: the LLM writes it, and the renderer executes it. The argument is that LLMs are fundamentally better at generating code than at generating structured data, so you get cleaner output and ~67% fewer tokens.

The broader vision seems to be a model-agnostic, design-system-agnostic layer that sits between any LLM and your actual UI components. You bring your own components and design tokens, and the engine handles translating LLM output into rendered interfaces: charts, forms, tables, cards.

Generative UI as a category is still figuring out what the right abstraction is. This is a concrete stake in the ground against JSON-as-spec.
Show HN: Open-source browser for AI agents
Show HN (score: 7)[Other] Show HN: Open-source browser for AI agents

Hi HN, I forked Chromium and built agent-browser-protocol (ABP) after noticing that most browser-agent failures aren't really about the model misunderstanding the page. Instead, the problem is that the model is reasoning from stale state.

ABP is designed to keep the acting agent synchronized with the browser at every step. After each action (click, type, etc.), it freezes JavaScript execution and rendering, then captures the resulting state. It also compiles the notable events that occurred during that action loop, such as navigations, file pickers, permission prompts, alerts, and downloads, and sends them along with a screenshot of the frozen page state back to the agent.

The result is that browser interaction starts to feel more like a multimodal chat loop. The agent takes an action, gets back a fresh visual state and a structured summary of what happened, then decides what to do next. That fits much better with how LLMs already work.

A few common browser-use failures ABP helps eliminate:
* A modal appears after the last Playwright screenshot and blocks the input the agent was about to use
* Dynamic filters cause the page to reflow between steps
* An autocomplete dropdown opens and covers the element the agent intended to click
* alert() / confirm() interrupts the flow
* Downloads are triggered, but the agent has no reliable way to know when they've completed

As proof, ABP with Opus 4.6 as the driver scores 90.5% on the Online Mind2Web benchmark. I think modern LLMs already understand websites; they just need a better tool to interact with them. Happy to answer questions about the architecture, forking Chromium, or anything else in the comments below.

Try it out: `claude mcp add browser -- npx -y agent-browser-protocol --mcp` (Codex/OpenCode instructions are in the docs)

Demo video: https://www.loom.com/share/387f6349196f417d8b4b16a5452c3369
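The act-then-freeze-then-report loop the post describes can be modeled abstractly. The sketch below is a toy simulation of the control flow only (every class, field, and fake event here is illustrative, not ABP's actual API); the real system freezes a forked Chromium and returns a screenshot:

```python
from dataclasses import dataclass, field

@dataclass
class ActionResult:
    """What the agent gets back after every action: frozen state plus notable events."""
    screenshot: bytes                                 # capture of the frozen page
    events: list = field(default_factory=list)        # navigations, dialogs, downloads, ...

class FrozenStepBrowser:
    """Toy model of the synchronize-after-every-action loop (illustrative only)."""
    def __init__(self):
        self._pending_events = []

    def _perform(self, action):
        # A real browser would dispatch the input; here we fake one side effect.
        if action == "click #submit":
            self._pending_events.append({"type": "navigation", "url": "/done"})

    def step(self, action) -> ActionResult:
        self._perform(action)                         # 1. run the action
        # 2. freeze JS execution and rendering (a no-op in this sketch)
        events, self._pending_events = self._pending_events, []
        return ActionResult(screenshot=b"<png>", events=events)   # 3. capture + report

browser = FrozenStepBrowser()
result = browser.step("click #submit")
```

The point of the structure is that the agent's next decision is always made against the state in `result`, never against a screenshot that has since gone stale.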
Show HN: I built an ISP infrastructure emulator from scratch with a custom vBNG
Hacker News (score: 32)[DevOps] Show HN: I built an ISP infrastructure emulator from scratch with a custom vBNG

Demo: https://aether.saphal.me
GitHub: https://github.com/saphalpdyl/Aether

Aether is a multi-BNG (Broadband Network Gateway) ISP infrastructure lab, built almost from scratch, that emulates IPoE IPv4 subscriber management end-to-end. It supports IPoE/IPv4 networks and runs a Python-based vBNG with RADIUS AAA, per-subscriber traffic shaping, and traffic simulation, all emulated on Containerlab. It is also my first personal networking project, built roughly over a month.

Motivations behind the project

I'm a CS sophomore. About three years ago, as an intern, I was assigned to build an OSS/BSS platform for a regional ISP by myself, without mentoring. Referencing demo.splynx.com, I developed most of the BSS side (bookkeeping, accounting, inventory management), but in terms of networking I managed to install and set up RADIUS and that was about it. I didn't have anyone to mentor me or ask questions of, so I gave up then.

Three years later, I decided to try cracking it again. This project is meant to serve as a learning reference for anyone who's been in that same position, i.e. staring at closed-source vendor stacks without proper guidance. This is absolutely not production-grade, but I hope it gives someone a place to start.

Architecture overview

The core component, the BNG, runs on an event-driven architecture where state changes are passed around as messages to avoid juggling mutexes and locks. The session manager is the sole owner of the session state. To keep it clean and predictable, the BNG never accepts external input directly. The one exception is the Go RADIUS CoA daemon, which passes CoA messages in via IPC sockets. Everything the BNG produces (events, session snapshots) gets pushed to Redis Streams, where the bng-ingestor picks it up, processes it, and persists it.

Simulation and meta-configs

I generate traffic through a simulator node that mounts the host's Docker socket and runs docker exec commands on selected hosts. The topology.yaml used by Containerlab to define the network topology grows bigger as more BNGs and access nodes are added. So aether.config.yaml, a simpler configuration, is consumed by the configuration pipeline to generate topology.yaml and other files (nginx.conf, kea-dhcp.conf, RADIUS clients.conf, etc.).

Known limitations

- Multiple veth hops through the emulated topology add significant overhead. Profiling with iperf3 (-P 10 -t 10, 9500 MTU, 24 vCPUs) shows BNG→upstream at ~24 Gbit/s, but host→BNG→upstream drops to ~3.5 Gbit/s. The 9500 MTU also isn't representative of real ISP deployments. This gets worse when the actual network is reintroduced, capping my local throughput at 1.6 Gbit/s.
- The circuit ID format (1/0/X) is non-standard. I simplified it for clarity.
- No iBGP or VLAN support.
- No IPv6 support. I wanted to target IPv4 networks from the start to avoid taking on too much breadth without enough depth.

Nearly everything I know about networking (except some sections from AWS) I learned building this. A lot was figured out on the fly, so engineers will likely spot questionable decisions in the codebase. I'd genuinely appreciate that feedback.

Questions

- Currently, the circuit where the user connects is arbitrarily chosen by the demo user. In a real system with thousands of circuits, it'd be very difficult to assess which circuit a customer might connect to. When adding a new customer to a service, how does the operator decide, based on the customer's location, which circuit to provision the service on?
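The "session manager is the sole owner of state, everything arrives as a message" pattern can be sketched with a stdlib queue instead of shared locks. This is a minimal illustration of the design, not Aether's code; the message kinds and subscriber IDs are invented:

```python
import queue
import threading

class SessionManager:
    """Sole owner of session state; all changes arrive as messages (sketch only)."""
    def __init__(self):
        self.sessions = {}            # touched only by the manager thread: no locks needed
        self.inbox = queue.Queue()

    def run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:                          # shutdown sentinel
                break
            kind, subscriber = msg
            if kind == "dhcp_ack":                   # subscriber session comes up
                self.sessions[subscriber] = "active"
            elif kind == "coa_disconnect":           # e.g. forwarded by the RADIUS CoA daemon
                self.sessions.pop(subscriber, None)

mgr = SessionManager()
t = threading.Thread(target=mgr.run)
t.start()
mgr.inbox.put(("dhcp_ack", "sub-001"))
mgr.inbox.put(("coa_disconnect", "sub-001"))
mgr.inbox.put(("dhcp_ack", "sub-002"))
mgr.inbox.put(None)
t.join()
```

Because only the manager thread ever mutates `sessions`, producers on other threads (DHCP handling, the CoA daemon) just enqueue messages, which is what makes the state transitions predictable.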
BitNet: Inference framework for 1-bit LLMs
Hacker News (score: 349)[Other] BitNet: Inference framework for 1-bit LLMs Paper: <a href="https://arxiv.org/pdf/2310.11453" rel="nofollow">https://arxiv.org/pdf/2310.11453</a>
Show HN: Modulus – Cross-repository knowledge orchestration for coding agents
[Other] Show HN: Modulus – Cross-repository knowledge orchestration for coding agents

Hello HN, we're Jeet and Husain from Modulus (https://modulus.so), a desktop app that lets you run multiple coding agents with shared project memory. We built it to solve two problems we kept running into:

- Cross-repo context is broken. When working across multiple repositories, agents don't understand the dependencies between them. Even if we open two repos in separate Cursor windows, we still have to manually explain the backend API schema while making changes in the frontend repo.
- Agents lose context. Switching between coding agents often means losing context and repeating the same instructions again.

Modulus shares memory across agents and repositories so they can understand your entire system.

It's an alternative to tools like Conductor for orchestrating AI coding agents, but we focus specifically on multi-repo workflows (e.g., backend repo + client repo + shared library repo + AI agents repo). We built our own memory and context engine from the ground up specifically for coding agents.

Why build another agent orchestration tool? It came from our own problem. While working on our last startup, Husain and I were working across two different repositories. Working across repos meant manually pasting API schemas between Cursor windows, telling the frontend agent what the backend API looked like again and again. So we built a small context engine to share knowledge across repos and hooked it up to Cursor via MCP. This later became Modulus.

Soon, Modulus will allow teams to share knowledge with each other to improve their workflows with AI coding agents, enabling team collaboration in the era of AI coding. Our API will allow developers to switch between coding agents or IDEs without losing any context.

If you want to see a quick demo before trying it out, here is our launch post: https://x.com/subhajitsh/status/2024202076293841208

We'd greatly appreciate any feedback you have and hope you get the chance to try out Modulus.
Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon
Hacker News (score: 134)[Other] Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon

Hi HN, we're Sanchit and Shubham (YC W26). We built a fast inference engine for Apple Silicon. LLMs, speech-to-text, text-to-speech: MetalRT beats llama.cpp, Apple's MLX, Ollama, and sherpa-onnx on every modality we tested. Custom Metal shaders, no framework overhead.

We've also open-sourced RCLI, the fastest end-to-end voice AI pipeline on Apple Silicon. Mic to spoken response, entirely on-device. No cloud, no API keys.

To get started:

    brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
    brew install rcli
    rcli setup   # downloads ~1 GB of models
    rcli         # interactive mode with push-to-talk

Or:

    curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

The numbers (M4 Max, 64 GB, reproducible via `rcli bench`):

LLM decode: 1.67x faster than llama.cpp, 1.19x faster than Apple MLX (same model files)
- Qwen3-0.6B: 658 tok/s (vs mlx-lm 552, llama.cpp 295)
- Qwen3-4B: 186 tok/s (vs mlx-lm 170, llama.cpp 87)
- LFM2.5-1.2B: 570 tok/s (vs mlx-lm 509, llama.cpp 372)
- Time-to-first-token: 6.6 ms

STT: 70 seconds of audio transcribed in 101 ms. That's 714x real-time and 4.6x faster than mlx-whisper.

TTS: 178 ms synthesis. 2.8x faster than mlx-audio and sherpa-onnx.

We built this because demoing on-device AI is easy but shipping it is brutal. Voice is the hardest test: you're chaining STT, LLM, and TTS sequentially, and if any stage is slow, the user feels it. Most teams fall back to cloud APIs not because local models are bad, but because local inference infrastructure is.

The hard problem is latency compounding. In a voice pipeline, you're stacking three models in sequence. If each adds 200 ms, you're at 600 ms before the user hears a word, and that feels broken. You can't optimize one stage and call it done; every stage needs to be fast, on one device, with no network round-trip to hide behind.

We went straight to Metal. Custom GPU compute shaders, all memory pre-allocated at init (zero allocations during inference), and one unified engine for all three modalities instead of stitching separate runtimes together. MetalRT is the first engine to handle all three modalities natively on Apple Silicon.

Full methodology:
LLM benchmarks: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-engine-apple-silicon
Speech benchmarks: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-tts-apple-silicon

How: most inference engines add layers between you and the GPU: graph schedulers, runtime dispatchers, memory managers. MetalRT skips all of it. Custom Metal compute shaders for quantized matmul, attention, and activations, compiled ahead of time and dispatched directly.

Voice pipeline optimization details: https://www.runanywhere.ai/blog/fastvoice-on-device-voice-ai-pipeline-apple-silicon
RAG optimizations: https://www.runanywhere.ai/blog/fastvoice-rag-on-device-retrieval-augmented-voice-ai

RCLI is the open-source voice pipeline (MIT) built on MetalRT: three concurrent threads with lock-free ring buffers, double-buffered TTS, 38 macOS actions by voice, local RAG (~4 ms over 5K+ chunks), 20 hot-swappable models, and a full-screen TUI with per-op latency readouts. It falls back to llama.cpp when MetalRT isn't installed.

Source: https://github.com/RunanywhereAI/RCLI (MIT)
Demo: https://www.youtube.com/watch?v=eTYwkgNoaKg

What would you build if on-device AI were genuinely as fast as cloud?
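The latency-compounding point in the post is simple arithmetic worth making explicit: in a sequential pipeline, time-to-first-audio is the sum of the stages, so no single-stage optimization fixes it. A tiny sketch (the "fast" figures below are the per-stage numbers quoted in the post, combined here only for illustration):

```python
def pipeline_latency_ms(stages: dict) -> float:
    """Sequential voice pipeline: latency before first audio is the sum of all stages."""
    return sum(stages.values())

# The post's worst case: 200 ms per stage feels broken.
naive = pipeline_latency_ms({"stt": 200, "llm_first_token": 200, "tts_first_chunk": 200})

# Combining the post's quoted per-stage numbers (STT 101 ms, TTFT 6.6 ms, TTS 178 ms):
fast = pipeline_latency_ms({"stt": 101, "llm_first_token": 6.6, "tts_first_chunk": 178})
```

Because the stages add, every stage must be fast for the total to be fast, which is the post's argument for one unified engine across all three modalities.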