Show HN: Docuglean – Extract Structured Data from PDFs/Images Using AI

Show HN (score: 5)

Found: November 20, 2025

ID: 2465

Description

API/SDK

Hi HN! I built Docuglean, an open-source SDK for intelligent document processing that works with OpenAI, Mistral, Google Gemini, and Hugging Face models.

The idea came from repeatedly writing boilerplate code to extract structured data from invoices, receipts, and other documents. Instead of wrestling with different API formats, I wanted a unified interface that:

- Extracts structured data using Zod/Pydantic schemas
- Classifies and splits multi-section documents (e.g., medical records)
- Processes documents in batches with automatic error handling
- Works locally without APIs (for PDFs, DOCX, XLSX, etc.)
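To make the schema-driven extraction idea concrete, here is a minimal sketch of validating a model's JSON reply against a typed schema. It is not Docuglean's API: the `Invoice` fields and the `parse_into_schema` helper are hypothetical, and a standard-library dataclass stands in for a Pydantic model so the example runs without extra dependencies.

```python
import json
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class Invoice:
    # Hypothetical schema; the field names are illustrative only.
    vendor: str
    invoice_number: str
    total: float


def parse_into_schema(raw_json: str, schema):
    """Check a model's JSON reply against a dataclass schema and build an instance."""
    data = json.loads(raw_json)
    kwargs = {}
    for name, expected in get_type_hints(schema).items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # Accept whole numbers where a float is expected (but not booleans).
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected) or isinstance(value, bool):
            raise TypeError(f"field {name!r} should be {expected.__name__}")
        kwargs[name] = value
    return schema(**kwargs)


# A reply such as an LLM might return for an invoice (made-up values).
reply = '{"vendor": "Acme", "invoice_number": "INV-1", "total": 120}'
invoice = parse_into_schema(reply, Invoice)
```

With Pydantic, the manual type checks would presumably be replaced by the model's own validation; the point of the sketch is only that the schema, not the prompt, defines the shape of the result.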

Key features:

- Available for both TypeScript and Python
- Batch processing with concurrent requests
- Document classification (splits 100+ page docs by category)
- Local parsers (no API needed for basic extraction)
- Apache 2.0 licensed
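As a rough illustration of batch processing with concurrent requests and per-document error handling, the sketch below uses only Python's `asyncio`. It does not show Docuglean's actual interface: `process_batch`, `fake_extract`, and the `max_concurrency` parameter are assumptions made for the example, and `fake_extract` merely stands in for a real model call.

```python
import asyncio


async def process_batch(paths, extract_one, max_concurrency=4):
    """Run extract_one over all paths concurrently, recording failures per file."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path):
        # Limit how many extractions run at once.
        async with semaphore:
            try:
                result = await extract_one(path)
                return {"path": path, "ok": True, "result": result}
            except Exception as exc:
                # One bad document should not abort the whole batch.
                return {"path": path, "ok": False, "error": str(exc)}

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_extract(path):
    # Hypothetical stand-in for a call to a model provider.
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError("unreadable document")
    return {"pages": 1}


results = asyncio.run(process_batch(["a.pdf", "b.bad", "c.pdf"], fake_extract))
```

Returning a result record per file, rather than raising, lets the caller retry or report the failures after the rest of the batch has finished.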

Currently supports OpenAI, Mistral, Gemini, and Hugging Face. Planning to add Together AI, Anthropic, and more.

Would love feedback on the API design and on which features would be most useful.
