From PDFs to Insights: Structured Outputs from PDFs with Gemini 2.0 (philschmid.de) This week Google DeepMind released Gemini 2.0, including Gemini 2.0 Flash (Generally Available), Gemini 2.0 Flash-Lite (new, cost-efficient) and Gemini 2.0 Pro (Experimental). All models accept at least 1 million input tokens across text, images and audio, and support function calling and structured outputs.
16 points by BOOSTERHIDROGEN 105 days ago | 0 comments
PDF Hell: Why Is Extracting Data Still a Nightmare? (2024) (unstract.com) If you have tried to extract text from PDFs you would have come across a myriad of complications related to it. It is relatively easy to do a POC or experiment, but when it comes to handling PDFs from the real world on a consistent basis, it is a tremendously difficult problem to solve.
Fun building better table extraction models (aryn.ai) Aryn’s goal is to provide analytics over unstructured documents, typically PDFs. We believe that integral to that goal is the ability to break down these documents into a more structured form that we can use to extract more information than you might be able to by simply stripping out all of the text and embedding it with a language model.
Nvidia-Ingest: Multi-modal data extraction (github.com/NVIDIA) NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. Including support for parsing PDFs, Word and PowerPoint documents, it uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.
32 points by ICodeSometimes 189 days ago | 33 comments
Scaling Document Data Extraction with LLMs and Vector Databases (timescale.com) Extracting structured data from unstructured documents is a powerful use case for large language models (LLMs). This sort of data extraction from complex documents has long been a challenge. Doing it either completely manually or with current intelligent document processing (IDP) platforms that rely on previous-generation machine learning or natural language processing (NLP) techniques is very time-consuming and tedious.
Table Extraction Using LLMs (nanonets.com) Picture this - you’re drowning in a sea of PDFs, spreadsheets, and scanned documents, searching for that one piece of data trapped somewhere in a complex table.
18 points by StarrySkies11 242 days ago | 1 comment
Bitten by Unicode (pyatl.dev) One product of mine takes reports that come in as a table that’s been exported to PDF, which means text extraction. For dollar figures I find a prefixed dollar symbol and convert the number following it into a `float`. If there’s a hyphen in addition to the dollar symbol, it’s negative.
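
The parsing rule in that excerpt (a dollar symbol, a number, and a hyphen meaning negative) could be sketched roughly as below. This is not the linked author's code: the function name `parse_dollars`, the regular expression, and the `DASHES` set are hypothetical. The excerpt also does not say what the Unicode problem was; the sketch assumes, based only on the title, that PDF text extraction may yield dash look-alikes such as U+2212 (minus sign) where an ASCII hyphen was expected.

```python
import re

# Assumed, non-exhaustive set of characters that can stand in for a hyphen
# in extracted PDF text: ASCII hyphen-minus plus several Unicode dashes.
DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2212"
_DASH_CLASS = f"[{re.escape(DASHES)}]"

# Optional dash before or after the dollar symbol, then digits with
# optional thousands separators and an optional decimal part.
_AMOUNT = re.compile(
    rf"({_DASH_CLASS}?)\s*\$\s*({_DASH_CLASS}?)\s*([\d,]+(?:\.\d+)?)"
)


def parse_dollars(text: str) -> float:
    """Return the first dollar amount in text, negative if a dash accompanies it."""
    match = _AMOUNT.search(text)
    if match is None:
        raise ValueError(f"no dollar amount found in {text!r}")
    dash_before, dash_after, digits = match.groups()
    value = float(digits.replace(",", ""))
    return -value if (dash_before or dash_after) else value


print(parse_dollars("$1,234.50"))
print(parse_dollars("-$5.00"))
print(parse_dollars("\u2212$5.00"))
```

Matching only the ASCII hyphen would silently read the third input as a positive amount, which is the kind of failure the assumed scenario is about. For real accounting data, `decimal.Decimal` may be a safer target type than `float`.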