Hacker News with Generative AI: Data Extraction

Show HN: Documind – Open-source AI tool to turn documents into structured data (github.com/DocumindHQ)
Show HN: Get any domain's brand data via API (brand.dev)
Get logos, primary colors, descriptions, classifications, and more from any domain with a single API call.
Scaling Document Data Extraction with LLMs and Vector Databases (timescale.com)
Extracting structured data from unstructured documents is a powerful use case for large language models (LLMs). This kind of extraction from complex documents has always been a challenge: doing it entirely by hand, or with current intelligent document processing (IDP) platforms built on previous-generation machine learning and natural language processing (NLP) techniques, is time-consuming and tedious.
Maxun: Open-Source No-Code Web Data Extraction Platform (github.com/getmaxun)
Maxun lets you train a robot in 2 minutes and scrape the web on auto-pilot. Web data extraction doesn't get easier than this!
Show HN: Tile.run – Extract structured data from any document via API (tile.run)
tile.run is the simplest way to extract data from any document, with best-in-class accuracy.
Launch HN: Midship (YC S24) – Turn PDFs, docs, and images into usable data (ycombinator.com)
Hey HN, we are Max, Kieran, and Aahel from Midship (https://midship.ai). Midship makes it easy to extract data from unstructured documents like PDFs and images.
Video scraping: extracting JSON from a 35s screen capture for 1/10th of a cent (simonwillison.net)
The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.
Extracting financial disclosure and police reports with OpenAI Structured Output (github.com)
Show HN: Pipet – CLI tool for scraping and extracting data online, with pipes (github.com/bjesus)
Pipet is a command-line web scraper. It supports three modes of operation: HTML parsing, JSON parsing, and client-side JavaScript evaluation. It relies heavily on existing tools like curl, and it uses Unix pipes to extend its built-in capabilities.
Table Extraction Using LLMs (nanonets.com)
Picture this - you’re drowning in a sea of PDFs, spreadsheets, and scanned documents, searching for that one piece of data trapped somewhere in a complex table.
Bitten by Unicode (pyatl.dev)
One product of mine takes reports that come in as a table that’s been exported to PDF, which means text extraction. For dollar figures I find a prefixed dollar symbol and convert the number following it into a `float`. If there’s a hyphen in addition to the dollar symbol, it’s negative.
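The Unicode pitfall here is that PDF text extraction often emits a minus-like character (such as U+2212 MINUS SIGN or U+2013 EN DASH) rather than the ASCII hyphen, so a check for `-` alone silently misses negatives. A minimal sketch of the idea, assuming a hypothetical `parse_dollar` helper (not the author's code):

```python
import re

# Characters that PDF extraction may emit where a negative sign is meant:
# ASCII hyphen-minus, Unicode minus sign, hyphen, and en dash.
MINUS_LIKE = {"-", "\u2212", "\u2010", "\u2013"}

def parse_dollar(cell: str) -> float:
    """Parse a cell like '$1,234.56' or '\u2212$10.00' into a float."""
    m = re.search(r"\$\s*([\d,]+(?:\.\d+)?)", cell)
    if m is None:
        raise ValueError(f"no dollar amount in {cell!r}")
    value = float(m.group(1).replace(",", ""))
    # Look for any minus-like character before the dollar sign,
    # not just the ASCII hyphen.
    if any(ch in MINUS_LIKE for ch in cell[: m.start()]):
        value = -value
    return value
```

Checking the whole set of minus-like code points (or normalizing the text with `unicodedata` first) avoids the silent sign error the post describes.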
Show HN: Painless Data Extraction and Web Automation (agentql.com)
DocuPanda – convert any PDF/image to schema-driven, structured JSON (docupanda.com)
How to crawl big websites with no sitemap? (ycombinator.com)
NuExtract: An LLM for Structured Extraction (huggingface.co)
Ask HN: How to OCR a PDF and preserve whitespace? (ycombinator.com)
Show HN: EndType – Extract structured data from images, video and PDFs (endtype.com)
Show HN: Extract Data from Line Chart Image (github.com/tdsone)
TotalRecall: Extracts and displays data from the Windows 11 Recall feature (github.com/xaitax)
FineWeb: Decanting the web for the finest text data at scale (huggingface.co)
Apify Meets AI: Crafting AI with Apify for Robust NL Web Scraping (medium.com)
Special Characters Attack: Toward Scalable Training Data Extraction from LLMs (arxiv.org)