🛠️ All DevTools
Showing 1–20 of 4676 tools
Last Updated: May 20, 2026 at 12:01 AM
Gemini 3.5 Flash
Hacker News (score: 192)[API/SDK] Gemini 3.5 Flash https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flash
[CLI Tool] Show HN: LibreOffice-rs – I built a pure-Rust LibreOffice using autoresearch

Hey HN,

I built libreoffice-rs: a pure-Rust, std-only library + CLI for reading, writing, converting, and rendering office documents, with *zero* LibreOffice, Java, or C dependencies.

100x faster... I know, I know.

It supports DOCX, XLSX, PPTX, ODT/ODS/ODP, PDF, Markdown, CSV, HTML, SVG, and more. The CLI is designed to feel familiar:

```bash
cargo install libreoffice-pure

# soffice-style usage
libreoffice-pure --headless --convert-to pdf report.docx
libreoffice-pure --headless --convert-to csv spreadsheet.xlsx

# Markdown extraction
libreoffice-pure docx-to-md report.docx report.md
libreoffice-pure pptx-to-md slides.pptx slides.md

# Render pages as images
libreoffice-pure docx-to-pngs report.docx pages/ --dpi 144
```
Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs
Hacker News (score: 34)[Monitoring/Observability] Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs

Hey HN, we're Nico and Arseniy, co-founders of Superlog (https://superlog.sh). We're building a self-installing, self-healing observability tool meant not to be opened. It has a wizard that sets up proper logging daily and an agent that investigates errors and opens PRs.

Super short demo: https://www.youtube.com/watch?v=xFhU9Mk247M.

In our earlier startups, we tried Sentry, Datadog, Grafana, Dash0, and nothing was good enough. Proper telemetry and alerting still require a ton of manual setup. We struggled with adding good logs, so debugging was tough, especially as codebases grow at a faster pace. Meanwhile, the Datadog/Dash0 bill kept climbing, and we still spent engineering hours to learn, configure, and maintain our observability tooling.

With Sentry, we found ourselves flooded by a stream of alerts into our Slack channel. Most were duplicates or lacked context, so alert fatigue and constant interrupts were a real pain. The #ops notification is consistently the worst feeling on a Saturday morning.

Too many times we've seen servers run out of memory and disk, and three AWS metrics giving us three different values. Half of the graphs on dashboards are normally empty or outdated, and manually clicking through UIs, especially when the team is small, seems like a huge waste of time.

At some point we realized that solving this problem would be more valuable than the things we had been working on, and we had the expertise to do it, since Arseniy had spent years at Datadog, getting paged during the night to debug production incidents. So we decided to build a platform that would just work: agent-first, MCP-native, zero-setup.

Here's how Superlog works: we have a wizard that scans your repo and automatically instruments it with well-structured logs, traces and metrics via OpenTelemetry. We make sure to highlight main failure modes, endpoint performance, usage per tenant, and LLM/upstream cost (by callsite, tenant and model).

Errors get fingerprinted and grouped into incidents, so you see one issue, not a thousand duplicates. When you get a notification from Superlog, you see a clear failure summary, with its inferred severity and impact upfront.

Then the agent investigates and tries to solve the issue. If it has enough context, it produces a concise and tested PR. If it doesn't, it posts its findings for the investigating team, and automatically pulls in the engineers that could contribute more context based on documentation, previous investigations and Slack threads.

Either way the output is one clean PR per incident, posted in Slack, that you can merge, ignore, or open as a Claude Code session and modify.

Three things we think are different from other observability vendors:

(1) We solve the setup pain. The wizard will instrument everything with native OTel SDKs, respecting the semantic conventions, with proper service and environment tagging. We're also working on native automatic dashboards and alerts, so that you can see what's going on at a glance and don't miss subtle failure modes.

(2) Our telemetry doesn't decay. The wizard runs daily, and keeps adding logs, alerts and dashboards where they're needed. You don't have to remember to instrument new features. The next time something breaks, the data you need to debug it is already there.

(3) Our goal is to solve alert fatigue. We use agents to merge similar errors and refine the summaries, giving you relevant information upfront. We have a custom evaluation setup that makes sure that our summaries are dense and correct, and that severity and impact are on point. We also give you confidence scores for every LLM-enhanced metric so that wrong guesses don't get boosted.

Important: Superlog telemetry is vendor-neutral, so you keep all the logs/metrics/traces we install. Pricing is on the site. We're early, so expect rough edges and please tell us when you find them.

You can try it at https://superlog.sh. We'd love to hear what you're using today, what's broken about it, and whether the "one mergeable PR per incident" model sounds useful or terrifying. Especially keen to hear from folks running integration-heavy products, anyone who's rolled their own observability, and anyone who has tried Sentry / Datadog MCPs and given up. Comments and feedback welcome!
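The fingerprint-and-group step mentioned in the Superlog post can be illustrated with a small sketch. This is not Superlog's implementation: the normalization rules (masking hex values and numbers) and the names `fingerprint` and `group_incidents` are assumptions chosen for illustration only.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical normalization: mask values that vary between otherwise
# identical errors, so duplicates collapse onto the same fingerprint.
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_NUM = re.compile(r"\d+")


def fingerprint(error_type: str, message: str) -> str:
    """Return a short, stable hash of the error type and normalized message."""
    normalized = _HEX.sub("<hex>", message)
    normalized = _NUM.sub("<n>", normalized)
    digest = hashlib.sha256(f"{error_type}:{normalized}".encode("utf-8"))
    return digest.hexdigest()[:16]


def group_incidents(errors):
    """Group (error_type, message) pairs into incidents keyed by fingerprint."""
    incidents = defaultdict(list)
    for error_type, message in errors:
        incidents[fingerprint(error_type, message)].append(message)
    return dict(incidents)


errors = [
    ("TimeoutError", "request 4812 to tenant 17 timed out after 30s"),
    ("TimeoutError", "request 4813 to tenant 42 timed out after 30s"),
    ("KeyError", "missing field 'plan' for user 9"),
]
incidents = group_incidents(errors)
print(len(incidents))  # 2: the two timeouts collapse into one incident
```

A production system would likely fingerprint on stack frames as well as messages; the sketch only shows why a thousand duplicates can surface as one issue.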
Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks
Hacker News (score: 18)[Other] Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.

I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.

What it does:

- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware

- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it

- Ships with an eval harness and interactive dashboard so you can reproduce every number

I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's roughly a 41% failure rate (0.9^5 ≈ 0.59). No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier models.

Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)

The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:

- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.

- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.

- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.

I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).

The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - it only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.

One thing I really didn't expect: the serving backend matters. The same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.

Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.

The biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.

How to try it:

- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.

- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest mode and I'd love more eyes on it.

- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers are based on pre-v0.6.0 code.

Background: prior ML publication in unsupervised learning (83 citations). This paper was accepted to ACM CAIS '26 - presenting May 26-29.

Repo: https://github.com/antoinezambelli/forge

Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-reliability/ https://github.com/antoinezambelli/forge/blob/main/docs/forge_ieee_preprint.pdf

Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashboard.html
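The "compounding math problem" in the Forge post is easy to check numerically. The sketch below assumes each step succeeds independently with the same probability. The retry model in `accuracy_with_retries` is an illustrative assumption (independent retries), not Forge's measured behaviour or its API.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step of a workflow succeeds (independent steps)."""
    return per_step_accuracy ** steps


def accuracy_with_retries(per_step_accuracy: float, retries: int) -> float:
    """Per-step accuracy if a failed step may be retried independently."""
    return 1.0 - (1.0 - per_step_accuracy) ** (retries + 1)


baseline = workflow_success_rate(0.90, 5)
print(f"no retries: {baseline:.1%} success, {1 - baseline:.1%} failure")
# no retries: 59.0% success, 41.0% failure

boosted = workflow_success_rate(accuracy_with_retries(0.90, retries=1), 5)
print(f"one retry per step: {boosted:.1%} success")
# one retry per step: 95.1% success
```

Under these assumptions a single retry per step lifts the 5-step success rate from about 59% to about 95%, which is the intuition for why guardrails around the model can matter more than the model itself.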
rtk-ai/rtk
GitHub Trending[CLI Tool] CLI proxy that reduces LLM token consumption by 60-90% on common dev commands. Single Rust binary, zero dependencies
Show HN: Hsrs – Type-Safe Haskell Bindings Generator for Rust
Hacker News (score: 17)[API/SDK] Show HN: Hsrs – Type-Safe Haskell Bindings Generator for Rust

Hey everyone! I've been working on hsrs, a type-safe Haskell bindings generator for Rust.

I couldn't really find any bindings generator that would create type-safe, rich bindings for Haskell from Rust. Both languages have rich type systems, so I was amazed that no awesome bindings generator already existed, hence I decided to write my own. hsrs is very similar to pyo3 and napi-rs, so if you've used those, you will feel right at home.

What's unique about hsrs as opposed to hs-bindgen is that it has type-safe bindings for rich types, like Result, Maybe, etc., while also generating Haskell bindings. The repo contains a minimal example, and more details are available on the Haskell Discourse: https://discourse.haskell.org/t/ann-hsrs-ergonomic-haskell-bindings-for-rust/14129
LLMCap – A proxy that hard-stops LLM API calls when you hit a dollar cap
Hacker News (score: 21)[Other] LLMCap – A proxy that hard-stops LLM API calls when you hit a dollar cap
Sieve – scans Cursor/Claude chat history for leaked API keys
Hacker News (score: 12)[Code Quality] Sieve – scans Cursor/Claude chat history for leaked API keys

Background: I was using Cursor to set up an OpenAI integration. The agent read my .env file, added the key to the config, and everything worked. What I didn't think about: that key was now sitting in a plaintext SQLite database at ~/Library/ApplicationSupport/Cursor/User/workspaceStorage/..

AI coding tools (Cursor, Claude Code, Copilot, Cline) routinely read .env files as part of normal operation. Every secret they touch gets embedded in their local transcript/state files: unencrypted, outside .gitignore, persisted indefinitely.

Standard secret scanners (gitleaks, detect-secrets) scan git repos. Nobody scans AI transcript stores. That's the gap.

Sieve scans those files locally on your Mac. It flags exposed keys by severity, redacts them in place, and stores fingerprints in Keychain, never plaintext. It covers Cursor, Claude Code, Claude Desktop, Copilot, Cline, Roo Cline, Windsurf, Gemini CLI, and .env files.

Happy to answer questions about how the SQLite parsing works or the detection rules.
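To make the idea of scanning a SQLite transcript store concrete, here is a minimal sketch using Python's standard `sqlite3` and `re` modules. It is not Sieve's code: the `messages` table, the `sk-` key pattern, and the `scan_sqlite` and `redact` helpers are hypothetical, and a real scanner would need per-tool schemas and many more detection rules.

```python
import re
import sqlite3

# Hypothetical detection rule: an "sk-" prefix followed by 20+ alphanumerics.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}")


def redact(secret: str) -> str:
    """Keep a short prefix for identification and mask the rest."""
    return secret[:5] + "*" * (len(secret) - 5)


def scan_sqlite(conn: sqlite3.Connection):
    """Scan every text or blob cell in every table for key-like strings."""
    findings = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        columns = [col[0] for col in cursor.description]
        for row in cursor:
            for column, value in zip(columns, row):
                if isinstance(value, bytes):
                    value = value.decode("utf-8", errors="ignore")
                if isinstance(value, str):
                    for match in KEY_PATTERN.finditer(value):
                        findings.append((table, column, redact(match.group())))
    return findings


# Demo against an in-memory database with a made-up schema and a fake key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER, body TEXT)")
fake_key = "sk-" + "A" * 24
conn.execute("INSERT INTO messages VALUES (1, ?)",
             (f"OPENAI_API_KEY={fake_key} loaded",))
conn.execute("INSERT INTO messages VALUES (2, 'nothing sensitive here')")
findings = scan_sqlite(conn)
print(findings)
```

The sketch reports only redacted matches, so the scanner's own output does not become another place where the secret is stored in plaintext.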
[Other] Show HN: Tracecast – open-source generative data apps built on top of Marimo

Hi HN, I'm Malachy, the founder of Tracecast. This project lets you generate interactive data apps on top of your data, using a Cursor-style AI chat. It stitches together Marimo, LangGraph agents, and data warehouse query tools. It has an Apache 2.0 license.

The initial use case that spurred this project was business analytics, specifically generating product usage dashboards.

This project's main inspiration is Marimo, an open-source Python notebook that can be "queried with SQL, run as a script, and deployed as an app" [1]. The recent release of Marimo Pair [2] demonstrated the power of connecting AI agents like Claude Code to Marimo notebooks directly. This project seeks to build on that work by incorporating a LangGraph agent with two key abilities: (1) the ability to execute queries against a connected data warehouse (such as Snowflake); (2) the ability to write Marimo notebooks.

When prompted, the LangGraph agent will run exploratory data analysis using database query tools. Then, it creates a polished Marimo notebook that's presented to the user in read-only mode. This project intentionally hides the Marimo edit mode, so the end user only ever sees a finished, read-only data app. Ease of use and trust in AI output were the main drivers behind this decision.

Four data sources are currently supported: Snowflake, BigQuery, Postgres, and Metabase. The code for the database query tools was derived from Google's open-source MCP Toolbox for Databases.

There is currently no support for MCP. Instead, data query tools are hardcoded. This decision was made to ensure high-quality AI queries and limit tool bloat.

This is an early-stage project, and is configured to only run locally at this time.

[1] https://github.com/marimo-team/marimo
[2] https://news.ycombinator.com/item?id=47678844
Loopmaster – Livecoding Music IDE
Hacker News (score: 80)[IDE/Editor] Loopmaster – Livecoding Music IDE
Show HN: Haystack – Review the PRs that need human attention
Hacker News (score: 19)[Code Quality] Show HN: Haystack – Review the PRs that need human attention

Hey HN! We're building Haystack (https://haystackeditor.com/) to help teams deal with the explosion in the number of pull requests that need to be reviewed due to the rise of coding agents.

Haystack replaces the GitHub PR review system with a queue that triages each PR before a human has to read any diffs. It looks at the diffs, the codebase, and the coding-agent conversation that produced the PR. Haystack then routes it into one of three buckets:

1. Safe to merge. This means the PR has enough evidence behind it that the team can merge it without another human's review.

Some examples:

-- A small UI copy change that includes a screenshot showing the final state

-- A backend change where the author clearly tested the important paths and ran the changes in a real environment

2. Needs fixes. This means that the PR has bugs or violates a rule in your codebase and therefore needs to be fixed by the author.

Some examples:

-- The agent was asked to make loading a large table faster by adding pagination, but the PR still loads every result at once and "implements" pagination in the UI

-- The PR silently catches an error instead of logging, surfacing, or handling it. This violates the team's "no silent error swallowing" rule

3. Needs human review. This means that the PR could not be sufficiently verified by the author or touches a sensitive part of the codebase (determined by user-input guidelines) and thus requires human review.

Some examples:

-- The PR changes a significant amount of logic in billing

-- The PR changes an important user flow like onboarding, but the author only ran unit tests and never opened the app to check the flow end-to-end. That violates the team's rule that high-impact user-facing changes need manual verification.

Instead of starting with line-by-line diffs, Haystack immediately tells the reviewer the goal behind the PR, what design decisions the author made (informed by their coding-agent conversation), and how much the author did to verify that the pull request works (e.g. ran scripts, checked the frontend, etc.).

In this way, review shifts from "what changed?" to "is this the right behavior and is there evidence that it works?".

Here's a quick demo: https://www.tella.tv/video/streamlining-code-reviews-with-haystack-65zj

We previously launched Haystack as a tool for understanding large PRs (https://news.ycombinator.com/item?id=45201703). As many of you can probably relate to, the release of Opus 4.5 completely shattered our conception of how fast an engineer could craft a PR.

And as coding agents got even better after 4.5, we realized that pull request review did not scale along with our coding velocity. With each member of our team being able to pump out more than 20 pull requests a day, code review quickly became cognitively exhausting and less helpful.

After talking with other folks, we learned many feel similarly, and currently face the binary option of either not doing review at all or trying to keep up with a fire hose of pull requests.

Haystack is our attempt at a third path. We still believe in code review, but as coding agents produce more code, human reviewer attention becomes more valuable and more expensive.

Haystack helps teams spend that attention on the PRs where a human can meaningfully change the outcome. And for such PRs, Haystack shows the reviewer what the PR intended to do, whether the author showed that it works, and what design decisions need a second pair of eyes.

We're still quite early and are figuring out whether Haystack truly makes code review better. We would love any and all feedback!
We stopped AI bot spam in our GitHub repo using Git's --author flag
Hacker News (score: 57)[Other] We stopped AI bot spam in our GitHub repo using Git's --author flag
HKUDS/CLI-Anything
GitHub Trending[CLI Tool] "CLI-Anything: Making ALL Software Agent-Native" -- CLI-Hub: https://clianything.cc/
Show HN: Auto-identity-remove – Automated data broker opt-out runner for macOS
Hacker News (score: 311)[Other] Show HN: Auto-identity-remove – Automated data broker opt-out runner for macOS
NASA still maintains some of the Voyager spacecraft code from the 70s era
Hacker News (score: 19)[Other] NASA still maintains some of the Voyager spacecraft code from the 70s era
Show HN: Pdf2md – 10MB Rust PDF-to-Markdown Tool with a Free API
Show HN (score: 7)[Other] Show HN: Pdf2md – 10MB Rust PDF-to-Markdown Tool with a Free API
Show HN: How to Kill the Dead Internet
Show HN (score: 7)[Code Quality] Show HN: How to Kill the Dead Internet

Ok, so maybe "how to revive the internet" would be more accurate, but if you're reading this, I got your attention, right? Here's why I want you to read on: I built a free extension, D-slop, to disincentivize anyone from posting AI writing, and eventually images and video as well, on the internet.

For writing, it checks known vocab and punctuation tells, as well as subtler tells related to cadence, and assigns the text a score that is compared against an adjustable threshold. If the text fails, users have the option to flag the offending text, hide it, or block the page entirely (with the option to see it anyway).

For media, it's admittedly fairly weak, as it relies on C2PA metadata, which is stripped by all of the social media sites where it would be most helpful. (Anyone else have chronically online boomer parents continually gobbling up slop like it's real information?)

I have a D-slop+ version in the works that should be able to handle the media itself, but it's going to have to make API calls to have real teeth, which means I can't offer it for free. If this extension validates the concept, I'm happy to build it for y'all.

Yes, I vibe-coded it, but an ancillary bonus to the project accrued when it inspired me to cook dinner listening to Metallica's "Fight Fire with Fire," which in turn brought my 5 y/o running into the kitchen with every musical instrument in the house for an impromptu karaoke speed metal session.

It's open source under the MIT license; the full brief is at https://github.com/jared-the-automator/d-slop. This forum is full of people smarter than me, so I'm open to suggestions.
Show HN: Semble – Code search for agents that uses 98% fewer tokens than grep
Hacker News (score: 16)[Other] Show HN: Semble – Code search for agents that uses 98% fewer tokens than grep

Hey HN! We (Stephan and Thomas) recently open-sourced Semble. We kept running into the same problem while using Claude Code on large codebases: when the agent can't find something directly, it falls back to grep, reading full files or launching subagents. This uses a lot of tokens, and often still misses the relevant code. There are existing tools for this, but they were either too slow to index on demand, needed API keys, or had poor retrieval quality.

Semble is our solution for this. It combines static Model2Vec embeddings (using our latest static model: potion-code-16M) with BM25, fused via RRF and reranked with code-aware signals. Everything runs on CPU since there are no transformers involved. On our benchmark of ~1250 query/document pairs across 63 repos and 19 languages, it uses 98% fewer tokens than grep+read and reaches 99% of the retrieval quality of a 137M-parameter code-trained transformer, while being ~200x faster.

Main features:

- Token-efficient: 98% fewer tokens than grep+read

- Fast: ~250ms to index a typical repo on our benchmark, ~1.5ms per query on CPU (very large repos may take longer)

- Accurate: 0.854 NDCG@10, 99% of the best transformer setup we tested

- MCP server: drop-in for Claude Code, Cursor, Codex, OpenCode

- Zero config: no API keys, no GPU, no external services

Install in Claude Code with: claude mcp add semble -s user -- uvx --from "semble[mcp]" semble

Or check our README for other installation instructions, benchmarks, and methodology:

Semble: https://github.com/MinishLab/semble

Benchmarks: https://github.com/MinishLab/semble/tree/main/benchmarks

Model: https://huggingface.co/minishlab/potion-code-16M

Let us know if you have any feedback or questions!
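The "fused via RRF" step in the Semble post refers to Reciprocal Rank Fusion, which merges several ranked lists by summing 1 / (k + rank) for each document. The sketch below is a generic illustration rather than Semble's code: the file names are made up, k = 60 is the commonly used default and not a value confirmed by the post, and the code-aware reranking stage is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a lexical (BM25) and an embedding retriever.
bm25_ranking = ["auth.py", "db.py", "utils.py"]
dense_ranking = ["db.py", "cache.py", "auth.py"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['db.py', 'auth.py', 'cache.py', 'utils.py']
```

Because RRF uses only ranks, it needs no score normalization between BM25 and embedding similarity, which is one reason it is a common choice for hybrid retrieval.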
Nim-Presto – REST API Framework for Nim Language
Hacker News (score: 40)[API/SDK] Nim-Presto – REST API Framework for Nim Language
K-Dense-AI/scientific-agent-skills
GitHub Trending[Other] A set of ready to use Agent Skills for research, science, engineering, analysis, finance and writing.