Hacker News with Generative AI: Information Retrieval

VideoRAG: Retrieval-Augmented Generation over Video Corpus (arxiv.org)
Retrieval-Augmented Generation (RAG) is a powerful strategy for addressing factually incorrect outputs from foundation models: it retrieves external knowledge relevant to the query and incorporates it into the generation process.
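A minimal sketch of that loop, with hypothetical `embed` and `generate` helpers standing in for a real embedding model and LLM:

```python
# Minimal RAG loop: retrieve relevant documents, then condition generation on them.
# `embed` and `generate` are hypothetical stand-ins for an embedding model and an LLM.
import numpy as np

def retrieve(query: str, docs: list[str], embed, k: int = 3) -> list[str]:
    """Rank documents by cosine similarity to the query and return the top k."""
    q = np.asarray(embed(query))
    doc_embs = [np.asarray(embed(d)) for d in docs]
    scores = [q @ e / (np.linalg.norm(q) * np.linalg.norm(e)) for e in doc_embs]
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def rag_answer(query: str, docs: list[str], embed, generate) -> str:
    """Fold the retrieved knowledge into the prompt before generating."""
    context = "\n\n".join(retrieve(query, docs, embed))
    return generate(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```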
How outdated information hides in LLM token generation probabilities (anj.ai)
The internet usually has the correct answer somewhere, but it’s also full of conflicting and outdated information. How do large language models (LLMs) such as ChatGPT, trained on internet-scale data, handle cases where there’s conflicting or outdated information? (Hint: it’s not always the most recent answer as of the knowledge cutoff date; think about what LLMs are trained to do.)
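The hint resolves this way: a model trained to predict likely tokens weights an answer by how often it appears in the training data, not by how recent it is. A toy illustration with invented counts:

```python
# Toy illustration (invented counts): a next-token distribution mirrors how often
# each answer appears in the training data, not which answer is most recent.
counts = {"stale 2019 answer": 900, "current 2023 answer": 100}  # hypothetical frequencies
total = sum(counts.values())
for answer, n in counts.items():
    print(answer, n / total)  # the common-but-outdated answer gets probability 0.9
```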
Wikipedia searches reveal differing styles of curiosity (scientificamerican.com)
Mapping explorers of Wikipedia rabbit holes revealed three different styles of human inquisitiveness: the “busybody,” the “hunter” and the “dancer”
Embedding Models for Information Retrieval in 2025 (datastax.com)
The just-released Voyage-3-large is the surprise leader in embedding relevance
RAG a 40GB Outlook inbox – long-term staff member leaving, keeping knowledge (reddit.com)
I've been fascinated by this concept since the early days of AI, and using ChatGPT has made it feel incredibly achievable; I've only just understood the concept of RAG. The idea is to pair a local LLM with an open web UI and build vector (or other) databases from the inbox.
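One way to read that plan, sketched with an assumed mbox export and a local Chroma vector store (the poster's exact stack isn't specified):

```python
# Sketch of indexing an exported mailbox into a local vector store.
# The .mbox export and the chromadb collection are assumptions, not the poster's setup.
import mailbox
import chromadb

client = chromadb.Client()
emails = client.create_collection("inbox")

mbox = mailbox.mbox("inbox_export.mbox")  # hypothetical export of the Outlook inbox
for i, msg in enumerate(mbox):
    body = msg.get_payload(decode=True)   # None for multipart messages; skipped here
    if body:
        emails.add(ids=[str(i)],
                   documents=[body.decode("utf-8", errors="ignore")[:2000]],
                   metadatas=[{"subject": msg["subject"] or "", "from": msg["from"] or ""}])

hits = emails.query(query_texts=["Who handles vendor X renewals?"], n_results=5)
```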
Unifying Generative and Dense Retrieval for Sequential Recommendation (arxiv.org)
Sequential dense retrieval models use advanced sequence-learning techniques to compute item and user representations, then rank relevant items for a user via the inner product between the user representation and all item representations.
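That ranking step is a single matrix-vector product; a minimal sketch with random stand-in embeddings:

```python
# Dense-retrieval ranking as described: score every item by the inner product
# between the user representation and each item representation.
import numpy as np

rng = np.random.default_rng(0)
item_embs = rng.normal(size=(10_000, 64))  # one row per catalog item
user_emb = rng.normal(size=64)             # output of the sequence encoder

scores = item_embs @ user_emb              # inner products with all items at once
top_k = np.argsort(scores)[::-1][:10]      # indices of the 10 best-matching items
```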
Prism: Manipulating Concepts in Latent Space (thesephist.com)
Foundation models gesture at a way of interacting with information that’s at once more natural and powerful than “classic” knowledge tools. But to build the kind of rich, directly interactive information interfaces I imagine, current foundation models and embeddings are far too opaque to humans.
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) (infolab.stanford.edu)
The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research.
A new product that solves your tab hoarding problem and forgotten saved items (getstasher.com)
Stasher saves what you browse and brings up related links just when you need them
Information Batteries (2021) [pdf] (raghavan.usc.edu)
The Tao of Topic Maps (2000) (ontopia.net)
Someone once said that “a book without an index is like a country without a map”.
Understanding the BM25 full text search algorithm (emschwartz.me)
BM25, or Best Match 25, is a widely used algorithm for full text search. It is the default in Lucene/Elasticsearch and SQLite, among others. Recently, it has become common to combine full text search and vector similarity search into "hybrid search". I wanted to understand how full text search works, and specifically BM25, so here is my attempt at understanding by re-explaining.
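The scoring function itself is compact. A minimal sketch in the standard k1/b parameterization (the textbook form, not the post's code):

```python
# Minimal BM25 scorer over a toy corpus (standard k1/b form).
import math
from collections import Counter

docs = [d.lower().split() for d in [
    "the quick brown fox", "jumped over the lazy dog", "the dog barked"]]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))  # document frequency per term

def bm25(query: str, doc: list[str], k1: float = 1.5, b: float = 0.75) -> float:
    tf = Counter(doc)
    score = 0.0
    for term in query.lower().split():
        if term not in df:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)  # Lucene-style IDF
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

ranked = sorted(docs, key=lambda d: bm25("lazy dog", d), reverse=True)
```

Here k1 caps how much repeated term occurrences can help, and b controls how strongly long documents are penalized.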
Ask HN: Niche technical knowledge not found on the internet? (ycombinator.com)
What niche subjects are you interested in for which the knowledge is hard to come by on the internet?
Ask HN: The Web Post ChatGPT? (ycombinator.com)
The humble chat thread is rapidly becoming the de facto interface to information for so many right now.
Ask HN: Local RAG with private knowledge base (ycombinator.com)
Looking for a free, local, open-source RAG solution for running a reference library with thousands of technical PDFs and Word docs.
The Knowledge Graph: things, not strings (2012) (google)
Search is a lot about discovery—the basic human need to learn and broaden your horizons. But searching still requires a lot of hard work by you, the user. So today I’m really excited to launch the Knowledge Graph, which will help you discover new information quickly and easily.
Bridging Search and Recommendation in Generative Retrieval (dl.acm.org)
Generative retrieval for search and recommendation is a promising paradigm for retrieving items, offering an alternative to traditional methods that depend on external indexes and nearest-neighbor searches.
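The contrast with index-based retrieval: rather than running a nearest-neighbor search over embeddings, the model decodes an item identifier token by token, constrained to identifiers that actually exist. A toy of that constrained decoding, with a stand-in scorer in place of a trained decoder:

```python
# Toy of the generative side: decode an item ID one token at a time,
# constrained to valid identifiers (no ANN index involved).
# `score` is a hypothetical stand-in for a trained decoder's token scores.
item_ids = {("sports", "shoes", "042"), ("sports", "bikes", "007"), ("books", "scifi", "113")}

def decode(score) -> tuple:
    prefix = ()
    while prefix not in item_ids:
        # only tokens that extend some real item ID are allowed at each step
        allowed = {iid[len(prefix)] for iid in item_ids if iid[:len(prefix)] == prefix}
        prefix += (max(allowed, key=lambda tok: score(prefix, tok)),)
    return prefix

print(decode(lambda prefix, tok: len(tok)))  # dummy scorer; a real model ranks tokens
```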
NotebookLM launches feature to customize and guide audio overviews (google)
NotebookLM is a tool for understanding, built with Gemini 1.5. When you upload your sources, it instantly becomes an expert, grounding its responses in your material and giving you powerful ways to transform information. And since it’s your notebook, your personal data is never used to train NotebookLM.
Bloomberg Terminal – The Keys (libguides.nyit.edu)
The Bloomberg Terminal is Color Coded:
Phrase matching in Marginalia Search (marginalia.nu)
Marginalia Search now properly supports phrase matching. This not only permits a more robust implementation of quoted search queries, but also helps promote results where the search terms occur in the document exactly in the same order as they do in the query.
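With a positional index, a phrase match reduces to checking that the terms' positions line up consecutively. An illustrative sketch (not Marginalia's actual implementation):

```python
# Minimal positional-index phrase check (illustrative only).
def phrase_match(positions: dict[str, list[int]], phrase: list[str]) -> bool:
    """True if the phrase's terms occur in the document consecutively, in order."""
    if any(term not in positions for term in phrase):
        return False
    return any(all(p + i in positions[term] for i, term in enumerate(phrase))
               for p in positions[phrase[0]])

doc_positions = {"retrieval": [4, 17], "augmented": [5], "generation": [6]}
print(phrase_match(doc_positions, ["retrieval", "augmented", "generation"]))  # True
```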
A new semantic chunking approach for RAG (gpt3experiments.substack.com)
As we saw in my last blog post, there is a shape for stories.
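The excerpt doesn't spell out the approach, but a common semantic-chunking baseline splits wherever the similarity between neighboring sentence embeddings drops; a sketch of that baseline (not necessarily the post's method; `embed` is a hypothetical sentence-embedding model):

```python
# Baseline semantic chunking: start a new chunk at similarity drops between
# adjacent sentence embeddings (topic-shift boundaries).
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[list[str]]:
    embs = [np.asarray(embed(s)) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(embs, embs[1:], sentences[1:]):
        sim = prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur))
        if sim < threshold:       # topic shift: close the current chunk
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```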
Two kinds of LLM responses: Informational vs. Instructional (shabie.github.io)
When thinking about LLM evals, especially in the context of RAG, it occurred to me that there are two distinct kinds of responses people get from LLMs: informational and instructional.
"As We May Think" by Vannevar Bush (1945) (theatlantic.com)
As Director of the Office of Scientific Research and Development, Dr. Vannevar Bush has coordinated the activities of some six thousand leading American scientists in the application of science to warfare. In this significant article he holds up an incentive for scientists when the fighting has ceased. He urges that men of science should then turn to the massive task of making more accessible our bewildering store of knowledge.
Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models (arxiv.org)
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be "over-compressed" in the embeddings.
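Late chunking inverts the usual order: embed the whole document with a long-context model first, then pool token embeddings per chunk afterwards, so each chunk vector retains document-wide context. A sketch assuming a hypothetical `token_embeddings` helper that returns one vector per token:

```python
# Late chunking sketch: token embeddings are computed over the whole document
# (so they carry document-wide context), then mean-pooled per chunk afterwards.
import numpy as np

def late_chunk(tokens: list[str], boundaries: list[int], token_embeddings):
    embs = np.asarray(token_embeddings(tokens))        # shape: (num_tokens, dim)
    spans = zip([0] + boundaries, boundaries + [len(tokens)])
    return [embs[a:b].mean(axis=0) for a, b in spans]  # one contextual vector per chunk
```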
Contextual Retrieval (anthropic.com)
For an AI model to be useful in specific contexts, it often needs access to background knowledge. For example, customer support chatbots need knowledge about the specific business they're being used for, and legal analyst bots need to know about a vast array of past cases.
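The technique the post goes on to describe prepends a short, LLM-generated context that situates each chunk within its source document before the chunk is embedded and indexed. Roughly, with `generate` as a hypothetical LLM call:

```python
# Sketch of the idea: before indexing, prepend a short LLM-written context
# that situates each chunk within its source document.
def contextualize(document: str, chunk: str, generate) -> str:
    prompt = (f"<document>\n{document}\n</document>\n"
              f"Here is a chunk from the document:\n{chunk}\n"
              "Write a short context situating this chunk within the document.")
    return generate(prompt) + "\n" + chunk  # embed/index this instead of the bare chunk
```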
AI as an Information Interface (zeynepevecen.dev)
STORM: Get a Wikipedia-like report on your topic (genie.stanford.edu)
Data Engineering Vault: A 1000 Node Second Brain for DE Knowledge (ssp.sh)
Welcome to the Data Engineering Vault, an integral part of my Second Brain. It’s a curated network of data engineering knowledge, designed to facilitate exploration and discovery. Here, you’ll find more than 100 interconnected terms, each serving as a gateway to deeper insights.
Integrating Vision into RAG Applications (pamelafox.org)
Long Context vs. RAG (jonathanadly.com)
One of the projects I have built is a long-standing retrieval-augmented generation (RAG) application. Documents are saved in a database, chunked into pieces of text small enough for a large language model (LLM) to handle, and turned into numerical representations (vectors).
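The chunking step described here can be as simple as a fixed word budget; a minimal sketch with a hypothetical `embed` model:

```python
# Sketch of the pipeline step described: split a document into chunks small
# enough for the model, then turn each chunk into a vector.
def chunk(text: str, max_words: int = 200) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def index(documents: list[str], embed) -> list[tuple[str, list[float]]]:
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]
```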