Show HN: Optimizing LiteLLM with Rust – When Expectations Meet Reality

Hacker News (score: 24)
Found: November 18, 2025
ID: 2437

Description

I've been working on Fast LiteLLM, a Rust acceleration layer for the popular LiteLLM library, and I picked up some lessons that might resonate with other developers trying to squeeze performance out of existing systems.

My assumption was that LiteLLM, being a Python library, would have plenty of low-hanging fruit for optimization. I set out to create a Rust layer using PyO3 to accelerate the performance-critical parts: token counting, routing, rate limiting, and connection pooling.

The Approach

- Built Rust implementations for token counting using tiktoken-rs

- Added lock-free data structures with DashMap for concurrent operations

- Implemented async-friendly rate limiting

- Created monkeypatch shims to replace Python functions transparently

- Added comprehensive feature flags for safe, gradual rollouts

- Developed performance monitoring to track improvements in real-time
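The monkeypatch-shim idea above can be sketched in pure Python. This is a generic illustration under stated assumptions, not the project's actual shim code: `slow_lib` stands in for the library being patched, `fast_token_counter` for the Rust-backed function exposed via PyO3, and the `FAST_LITELLM_ENABLED` environment flag is hypothetical.

```python
import os
import types

# Stand-in for the library being accelerated (hypothetical, for illustration).
slow_lib = types.SimpleNamespace()
slow_lib.token_counter = lambda text: len(text.split())  # original Python path

def fast_token_counter(text):
    # Stand-in for the Rust-backed implementation exposed via PyO3.
    return len(text.split())

def install_shim(module, name, fast_impl, flag_env="FAST_LITELLM_ENABLED"):
    """Transparently replace a function, keeping the original as a safe fallback."""
    original = getattr(module, name)

    def shimmed(*args, **kwargs):
        if os.environ.get(flag_env, "1") != "1":
            return original(*args, **kwargs)   # feature flag off: Python path
        try:
            return fast_impl(*args, **kwargs)  # accelerated path
        except Exception:
            return original(*args, **kwargs)   # fall back on any error

    setattr(module, name, shimmed)
    return original

install_shim(slow_lib, "token_counter", fast_token_counter)
print(slow_lib.token_counter("hello world"))  # routed through the shim: 2
```

The point of the pattern is that callers never change: the flag gates the fast path, and any exception in the accelerated code silently falls back to the original implementation.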

After building out all the Rust acceleration, I ran my comprehensive benchmark comparing baseline LiteLLM vs. the shimmed version:

Function            Baseline     Shimmed      Speedup   Improvement

token_counter       0.000035s    0.000036s    0.99x     -0.6%

count_tokens_batch  0.000001s    0.000001s    1.10x     +9.1%

router              0.001309s    0.001299s    1.01x     +0.7%

rate_limiter        0.000000s    0.000000s    1.85x     +45.9%

connection_pool     0.000000s    0.000000s    1.63x     +38.7%

Turns out LiteLLM is already quite well-optimized! The core token counting was essentially unchanged (0.6% slower, likely within measurement noise), and the most significant gains came from the more complex operations like rate limiting and connection pooling where Rust's concurrent primitives made a real difference.
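Rate limiting is exactly the kind of hot path where lock contention dominates, which is why concurrent Rust primitives help. As a rough illustration of the operation being accelerated, here is a minimal thread-safe token-bucket limiter in Python; this is a generic sketch, not LiteLLM's or Fast LiteLLM's actual implementation:

```python
import threading
import time

class TokenBucket:
    """Minimal thread-safe token-bucket rate limiter (illustrative sketch)."""

    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()  # the contention point a lock-free design avoids

    def try_acquire(self, n=1):
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return True
            return False

bucket = TokenBucket(rate=10, capacity=5)
print(all(bucket.try_acquire() for _ in range(5)))  # burst within capacity: True
print(bucket.try_acquire())                         # bucket drained: False
```

Under many concurrent callers, every acquire serializes on that one lock; replacing the shared state with lock-free structures is where a Rust implementation can plausibly earn its 1.85x.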

Key Takeaways

1. Don't assume existing libraries are under-optimized - the maintainers likely know their domain well.

2. Focus on algorithmic improvements over reimplementation - sometimes a better approach beats a faster language.

3. Micro-benchmarks can be misleading - real-world performance impact varies significantly.

4. The most gains often come from the complex parts, not the simple operations.

5. Even "modest" improvements can matter at scale - 45% improvements in rate limiting are meaningful for high-throughput applications.
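The point about micro-benchmarks is easy to see in practice: at sub-microsecond scales, timer resolution and interpreter jitter swamp real differences. A rough sketch of the kind of comparison involved (generic, not the project's actual benchmark harness; both functions here are hypothetical stand-ins):

```python
import timeit

def baseline_count(text):
    # Stand-in for the pure-Python path (hypothetical).
    return len(text.split())

def shimmed_count(text):
    # Stand-in for the accelerated path (hypothetical).
    return len(text.split())

text = "the quick brown fox " * 50

# Take the best of several repeats; the min is less noisy than the mean.
base = min(timeit.repeat(lambda: baseline_count(text), number=10_000, repeat=5))
shim = min(timeit.repeat(lambda: shimmed_count(text), number=10_000, repeat=5))

print(f"speedup: {base / shim:.2f}x")  # differences of a few percent here are noise
```

Since both stand-ins are identical, any "speedup" this prints is pure measurement noise, which is the same effect that makes the -0.6% token_counter result meaningless.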

While the core token counting saw minimal improvement, the rate limiting and connection pooling gains still provide value for high-volume use cases. The infrastructure I built (feature flags, performance monitoring, safe fallbacks) creates a solid foundation for future optimizations.

The project continues as Fast LiteLLM on GitHub for anyone interested in the Rust-Python integration patterns, even if the performance gains were humbling.

Edit: To clarify - the slight regression for token_counter is likely within measurement noise, suggesting that LiteLLM's token counting is already well-optimized. The 45%+ gains in rate limiting and connection pooling still provide value for high-throughput applications.
