Show HN: Serve 100 Large AI models on a single GPU with low impact to TTFT

Show HN (score: 5)
Found: November 08, 2025
ID: 2321

Description

I wanted to build an inference provider for proprietary AI models, but I did not have a huge GPU farm. I started experimenting with serverless AI inference and found that cold starts were huge. I went deep into the research and put together an engine that loads large models from SSD to VRAM up to ten times faster than alternatives. It works with vLLM and transformers, with more integrations coming soon.

With this project you can hot-swap entire large models (32B parameters) on demand.
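
To make the core idea concrete, here is a minimal sketch of SSD-to-VRAM weight streaming in plain PyTorch + safetensors. This is not the project's actual API (the post doesn't show one); load_to_vram and every name in it are illustrative, and a real engine would add overlap, batching, and direct I/O on top of this baseline:

    import torch
    from safetensors import safe_open

    def load_to_vram(path: str, device: str = "cuda:0") -> dict[str, torch.Tensor]:
        """Stream every tensor in a .safetensors file from SSD to GPU memory."""
        weights = {}
        with safe_open(path, framework="pt", device="cpu") as f:
            for name in f.keys():
                # Pinned (page-locked) host memory allows async DMA copies,
                # so the next disk read can overlap the previous transfer.
                staged = f.get_tensor(name).pin_memory()
                weights[name] = staged.to(device, non_blocking=True)
        torch.cuda.synchronize(device)
        return weights

The naive loader blocks on every host-to-device copy; pinned buffers plus non_blocking transfers are the usual first step toward the overlapped, multi-gigabyte-per-second loads the post is describing.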

It's great for:

Serverless AI Inference

Robotics

On-prem deployments

Local Agents

And it's open source.

Let me know if anyone wants to contribute :)

More from Show HN

Show HN: Matrirc – run irssi in 2026, talk to people on Matrix

This solves no real problem — Element works, there's already a Matrix-to-IRC bridge running on half the FOSS networks, and probably nobody under 30 has opened irssi voluntarily this decade.

I wrote it anyway because I miss Esc 4 and clunky window-split commands.

Matrirc is a local IRC server that speaks Matrix on the back. Point irssi at localhost:6667, log in with Matrix creds, rooms show up as channels.

    brew tap pawelb0/tap
    brew install matrirc
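
As a toy illustration of the "point a client at localhost:6667" step, this sketch speaks just enough raw IRC to read the bridge's greeting. The nick is made up, and how Matrirc actually maps IRC registration onto a Matrix login is assumed here, not taken from its docs:

    import socket

    def greet_bridge(host: str = "localhost", port: int = 6667) -> str:
        """Open a TCP connection to the local IRC bridge and read its greeting."""
        with socket.create_connection((host, port), timeout=5) as sock:
            # Standard IRC registration lines; per the post, the bridge
            # handles login with Matrix credentials (details assumed).
            sock.sendall(b"NICK alice\r\nUSER alice 0 * :alice\r\n")
            return sock.recv(4096).decode(errors="replace")

    print(greet_bridge())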
