Advancements in embedding-based retrieval at Pinterest Homefeed
(medium.com)
At Pinterest Homefeed, embedding-based retrieval (a.k.a. Learned Retrieval) is a key candidate generator for retrieving highly personalized, engaging, and diverse content that fulfills various user intents and enables multiple forms of actionability, such as Pin saving and shopping.
DeepSeek's cutoff date is July 2024: We extracted DeepSeek's system prompt
(knostic.ai)
We extracted DeepSeek’s system prompt; below, we show how we did it and what we found. It isn't inherently hidden by design, but it's certainly interesting.
Add "fucking" to your Google searches to neutralize AI summaries
(gizmodo.com)
If you are tired of Google’s AI-powered search results leading you astray with poor information from bad sources, there is some good news. It turns out that if you include any expletives in your search query, Google will not return an AI Overview, as they are called, at the top of the results page.
Supercharge vector search with ColBERT rerank in PostgreSQL
(vectorchord.ai)
Traditional vector search methods typically employ sentence embeddings to locate similar content. However, generating sentence embeddings through pooling token embeddings can potentially sacrifice fine-grained details present at the token level. ColBERT overcomes this by representing text as token-level multi-vectors rather than a single, aggregated vector. This approach, leveraging contextual late interaction at the token level, allows ColBERT to retain more nuanced information and improve search accuracy compared to methods relying solely on sentence embeddings.
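As a rough illustration of late interaction, here is a minimal NumPy sketch of ColBERT's MaxSim scoring: each query token embedding takes its maximum similarity over all document token embeddings, and those maxima are summed. The embeddings below are toy values, not output from a real ColBERT model.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token embedding,
    take its max similarity over all document token embeddings, then sum."""
    # Normalize rows so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: 2 query token vectors, two candidate documents.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])  # matches both tokens
doc_b = np.array([[0.5, 0.5], [0.4, 0.6]])              # matches neither well
```

Because each query token is matched against its best document token individually, a document covering all query tokens (like `doc_a`) outscores one that only partially matches, detail a single pooled vector can blur.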
Anthropic – Citations
(anthropic.com)
Claude is capable of providing detailed citations when answering questions about documents, helping you track and verify information sources in responses.
VideoRAG: Retrieval-Augmented Generation over Video Corpus
(arxiv.org)
Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into their generation process.
How outdated information hides in LLM token generation probabilities
(anj.ai)
The internet usually has the correct answer somewhere, but it’s also full of conflicting and outdated information. How do large language models (LLMs) such as ChatGPT, trained on internet scale data, handle cases where there’s conflicting or outdated information? (Hint: it’s not always the most recent answer as of the knowledge cutoff date; think about what LLMs are trained to do)
Wikipedia searches reveal differing styles of curiosity
(scientificamerican.com)
Mapping explorers of Wikipedia rabbit holes revealed three different styles of human inquisitiveness: the “busybody,” the “hunter” and the “dancer”
Embedding Models for Information Retrieval in 2025
(datastax.com)
The just-released Voyage-3-large is the surprise leader in embedding relevance
RAG a 40GB Outlook inbox – Long term Staff member leaving, keeping knowledge
(reddit.com)
I've been fascinated by this concept since the early days of AI, and using ChatGPT has made it feel incredibly achievable, and I've only just understood the concept of RAG. The idea is to leverage a local LLM paired with an open web UI to create vector or other databases of the inbox
Unifying Generative and Dense Retrieval for Sequential Recommendation
(arxiv.org)
Sequential dense retrieval models utilize advanced sequence learning techniques to compute item and user representations, which are then used to rank relevant items for a user through inner product computation between the user and all item representations.
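The ranking step described above can be sketched in a few lines: score every item by the inner product of its embedding with the user embedding, then sort. The vectors here are toy values standing in for learned representations.

```python
import numpy as np

def rank_items(user_vec: np.ndarray, item_vecs: np.ndarray, k: int = 3):
    """Dense-retrieval ranking: score each item by inner product with the
    user representation and return top-k item indices, best first."""
    scores = item_vecs @ user_vec      # one score per item
    return np.argsort(-scores)[:k].tolist()

# Toy catalog: 4 items in a 3-dim embedding space.
items = np.array([
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.6],
    [0.0, 0.0, 1.0],
])
user = np.array([0.9, 0.1, 0.2])
```

At production scale this exact inner-product search is typically served with an approximate nearest-neighbor index rather than scoring every item, which is precisely the external dependency the generative-retrieval paradigm tries to remove.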
Prism: Manipulating Concepts in Latent Space
(thesephist.com)
Foundation models gesture at a way of interacting with information that’s at once more natural and powerful than “classic” knowledge tools. But to build the kind of rich, directly interactive information interfaces I imagine, current foundation models and embeddings are far too opaque to humans.
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
(infolab.stanford.edu)
The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research.
A new product that solves your tab hoarding problem and forgotten saved items
(getstasher.com)
Stasher saves what you browse and brings up related links just when you need them
The Tao of Topic Maps (2000)
(ontopia.net)
Someone once said that “a book without an index is like a country without a map”.
Understanding the BM25 full text search algorithm
(emschwartz.me)
BM25, or Best Match 25, is a widely used algorithm for full text search. It is the default in Lucene/Elasticsearch and SQLite, among others. Recently, it has become common to combine full text search and vector similarity search into "hybrid search". I wanted to understand how full text search works, and specifically BM25, so here is my attempt at understanding by re-explaining.
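In the same re-explaining spirit, here is a compact, self-contained sketch of the classic BM25 scoring formula (term frequency saturated by `k1`, document length normalized by `b`); real engines like Lucene tune and cache these statistics rather than recomputing them per query.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document against a query with classic BM25.
    `doc` and each corpus entry are pre-tokenized lists of terms."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        # Document frequency: how many documents contain the term.
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]  # term frequency in this document
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]
```

A document containing a query term always outscores one that lacks it, and rarer terms (higher IDF) contribute more, which is the intuition hybrid search combines with vector similarity.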
Ask HN: Niche technical knowledge not found on the internet?
(ycombinator.com)
What niche subjects are you interested in for which the knowledge is hard to come by on the internet?
Ask HN: The Web Post ChatGPT?
(ycombinator.com)
The humble chat thread is rapidly becoming the de facto interface to information for so many right now.
Ask HN: Local RAG with private knowledge base
(ycombinator.com)
Looking for a free, local, open source RAG solution for running a reference library with 1000s of technical PDFs and word docs.
The Knowledge Graph: things, not strings (2012)
(google)
Search is a lot about discovery—the basic human need to learn and broaden your horizons. But searching still requires a lot of hard work by you, the user. So today I’m really excited to launch the Knowledge Graph, which will help you discover new information quickly and easily.
Bridging Search and Recommendation in Generative Retrieval
(dl.acm.org)
Generative retrieval for search and recommendation is a promising paradigm for retrieving items, offering an alternative to traditional methods that depend on external indexes and nearest-neighbor searches.
NotebookLM launches feature to customize and guide audio overviews
(google)
NotebookLM is a tool for understanding, built with Gemini 1.5. When you upload your sources, it instantly becomes an expert, grounding its responses in your material and giving you powerful ways to transform information. And since it’s your notebook, your personal data is never used to train NotebookLM.
Phrase matching in Marginalia Search
(marginalia.nu)
Marginalia Search now properly supports phrase matching. This not only permits a more robust implementation of quoted search queries, but also helps promote results where the search terms occur in the document exactly in the same order as they do in the query.
A new semantic chunking approach for RAG
(gpt3experiments.substack.com)
As we saw in my last blog post, there is a shape for stories.
Two kinds of LLM responses: Informational vs. Instructional
(shabie.github.io)
When thinking of LLM evals especially in the context of RAGs, it occurred to me that there are two kinds of distinct responses people get from LLMs: informational and instructional.
"As We May Think" by Vannevar Bush (1945)
(theatlantic.com)
As Director of the Office of Scientific Research and Development, Dr. Vannevar Bush has coordinated the activities of some six thousand leading American scientists in the application of science to warfare. In this significant article he holds up an incentive for scientists when the fighting has ceased. He urges that men of science should then turn to the massive task of making more accessible our bewildering store of knowledge.