Hacker News with Generative AI: Data Extraction

Show HN: Documind – Open-source AI tool to turn documents into structured data (github.com/DocumindHQ)
Show HN: Get any domain's brand data via API (brand.dev)
Get logos, primary colors, descriptions, classifications, and more from any domain with a single API call.
Scaling Document Data Extraction with LLMs and Vector Databases (timescale.com)
Extracting structured data from unstructured documents is a powerful use case for large language models (LLMs). This kind of extraction from complex documents has always been a challenge: doing it entirely by hand, or with current intelligent document processing (IDP) platforms built on previous-generation machine learning and natural language processing (NLP) techniques, is time-consuming and tedious.
Maxun: Open-Source No-Code Web Data Extraction Platform (github.com/getmaxun)
Maxun lets you train a robot in 2 minutes and scrape the web on auto-pilot. Web data extraction doesn't get easier than this!
Show HN: Tile.run – Extract structured data from any document via API (tile.run)
tile.run is the simplest way to extract data from any document, with best-in-class accuracy.
Launch HN: Midship (YC S24) – Turn PDFs, docs, and images into usable data (ycombinator.com)
Hey HN, we are Max, Kieran, and Aahel from Midship (https://midship.ai). Midship makes it easy to extract data from unstructured documents like PDFs and images.
Video scraping: extracting JSON from a 35s screen capture for 1/10th of a cent (simonwillison.net)
The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.
Extracting financial disclosure and police reports with OpenAI Structured Output (github.com)
Show HN: Pipet – CLI tool for scraping and extracting data online, with pipes (github.com/bjesus)
Pipet is a command-line web scraper. It supports three modes of operation: HTML parsing, JSON parsing, and client-side JavaScript evaluation. It relies heavily on existing tools like curl, and it uses Unix pipes to extend its built-in capabilities.
Table Extraction Using LLMs (nanonets.com)
Picture this - you’re drowning in a sea of PDFs, spreadsheets, and scanned documents, searching for that one piece of data trapped somewhere in a complex table.
Bitten by Unicode (pyatl.dev)
One product of mine takes reports that come in as a table that’s been exported to PDF, which means text extraction. For dollar figures I find a prefixed dollar symbol and convert the number following it into a `float`. If there’s a hyphen in addition to the dollar symbol, it’s negative.
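The Unicode pitfall here is that PDF text extraction often emits a minus-like character (such as U+2212 MINUS SIGN or U+2013 EN DASH) rather than the ASCII hyphen, so a check for `-` alone silently misses negatives. A minimal sketch of the idea, assuming a hypothetical `parse_dollar` helper (not the author's code):

```python
import re

# Characters that PDF extraction may emit where a negative sign is meant:
# ASCII hyphen-minus, Unicode minus sign, hyphen, and en dash.
MINUS_LIKE = {"-", "\u2212", "\u2010", "\u2013"}

def parse_dollar(cell: str) -> float:
    """Parse a cell like '$1,234.56' or '\u2212$10.00' into a float."""
    m = re.search(r"\$\s*([\d,]+(?:\.\d+)?)", cell)
    if m is None:
        raise ValueError(f"no dollar amount in {cell!r}")
    value = float(m.group(1).replace(",", ""))
    # Look for any minus-like character before the dollar sign,
    # not just the ASCII hyphen.
    if any(ch in MINUS_LIKE for ch in cell[: m.start()]):
        value = -value
    return value
```

Checking the whole set of minus-like code points (or normalizing the text with `unicodedata` first) avoids the silent sign error the post describes.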
Show HN: Painless Data Extraction and Web Automation (agentql.com)
DocuPanda – convert any PDF/image to schema-driven, structured JSON (docupanda.com)
How to crawl big websites with no sitemap? (ycombinator.com)
NuExtract: An LLM for Structured Extraction (huggingface.co)
Ask HN: How to OCR a PDF and preserve whitespace? (ycombinator.com)
Show HN: EndType – Extract structured data from images, video and PDFs (endtype.com)
Show HN: Extract Data from Line Chart Image (github.com/tdsone)
TotalRecall: Extracts and displays data from the Windows 11 Recall feature (github.com/xaitax)
FineWeb: Decanting the web for the finest text data at scale (huggingface.co)
Apify Meets AI: Crafting AI with Apify for Robust NL Web Scraping (medium.com)
Special Characters Attack: Toward Scalable Training Data Extraction from LLMs (arxiv.org)