Hacker News with Generative AI: Data Structures

Flattening ASTs with Arena Allocation (cs.cornell.edu)
Arenas, a.k.a. regions, are everywhere in modern language implementations.
Representing filesystems in databases efficiently with Hierarchical Ordering (danthegoodman.substack.com)
The way the we interface with files in a file system and rows in a database are fundamentally not the same.
Query Your Python Lists (github.com/mkalioby)
Leopards is a way to query list of dictionaries or objects as if you are filtering in DBMS.
SQLite Index Visualization (mrsuh.com)
After learning about indexes, I understood their basic structure, but I wanted to dig deeper — to explore the data structure, understand the algorithm, and learn how the index data is stored on disk.
Histogramming Bytes with Positional Popcount (blogspot.com)
A while ago, after some back and forth on twitter/X with @corsix, I dropped some implementation of byte histogramming without explaining anything. This post aims to rectify that lack of explanation.
I'm not mutable, I'm partially instantiated (dnmfarrell.com)
Incomplete data structures challenge our notion of mutability
How to Flatten nested JSON arrays (datazip.io)
Flattening nested JSON or MongoDB’s BSON or normalizing semi-structured data and writing queries on it for analytics or regular queries, is a common challenge in data processing.
Notes on Binary Soup (marginalia.nu)
I recently put together a small library called Slop, for intermediate on-disk data representation for the search engine, replacing a few ad-hoc formats I had in place before.
Sortledton: A Universal, Transactional Graph Data Structure [pdf] (vldb.org)
How to Correctly Sum Up Numbers (cedardb.com)
One of the first examples for loops in probably every programming language course is taking a list of numbers and calculating the sum of them.
Universal optimality of Dijkstra via beyond-worst-case heaps (arxiv.org)
This paper proves that Dijkstra's shortest-path algorithm is universally optimal in both its running time and number of comparisons when combined with a sufficiently efficient heap data structure.
Inserting a 0 bit in the middle of a value (wordpress.com)
Inserting a 0 bit in the middle of a value is not hard to do: we can split the original value into the bits below and at or above the target bit position, shift the top bits left by 1 more unit to make space, then reassemble everything together:
Zero or Sign Extend (wordpress.com)
A while back I had to deal with a bit-packed format that contained a list of integer values encoded in one of a pre-defined sets of bit widths, where both the allowed bit widths and the signed-ness were denoted in a header elsewhere.
RRR: A Succinct Rank/Select Index for Bit Vectors (2011) (alexbowe.com)
This blog post will give an overview of a static bitsequence data structure known as RRR, which answers arbitrary length rank queries in $\mathcal{O}(1)$ time, and provides implicit compression.
Skip Hash: A fast ordered map via software transactional memory (arxiv.org)
Scalable ordered maps must ensure that range queries, which operate over many consecutive keys, provide intuitive semantics (e.g., linearizability) without degrading the performance of concurrent insertions and removals.
Optimize the Data Layout (cedardb.com)
When you want to speed up your program, the obvious step is to recall the learnings of your data structure class and optimize the algorithmic complexity.
TypedDicts are better than you think (changs.co.uk)
TypedDict was introduced in PEP-589 which landed in Python 3.8.
Designing a Fast Concurrent Hash Table (ibraheem.ca)
I recently released papaya, a fast and feature-complete concurrent hash table for Rust. In this post I want to dive into the design and research that went into creating it, as well as why you might consider using it over existing solutions. If you're looking for an overview of papaya, you might find it more useful to consult the documentation.
Fast B-Trees (scattered-thoughts.net)
Many 'scripting' languages use a hashmap for their default associative data-structure (javascript objects, python dicts, etc). Hashtables have a lot of annoying properties:
Conflict-Free Replicated Data Types (crdt.tech)
A Conflict-free Replicated Data Type (CRDT) is a data structure that simplifies distributed data storage systems and multi-user applications.
CPython Runtime Internals: Key Data Structures and Runtime Bootstrapping (codingconfessions.com)
While this article is freely available to read online, I am also making a PDF of this article available. If you enjoy reading in that format, you can purchase it at the below link. If you are a paid subscriber you can find a 100% discount code in the header of the email, or just reach out to me via email or DM and I will give you the PDF.
The Simple Magic of Consistent Hashing (2011) (paperplanes.de)
The simplicity of consistent hashing is pretty mind-blowing. Here you have a number of nodes in a cluster of databases, or in a cluster of web caches. How do you figure out where the data for a particular key goes in that cluster?
Geometric Search Trees (g-trees.github.io)
We describe G-trees, a family of randomized, history-independent search tree data structures.
Dissecting the gzip format (2011) (infinitepartitions.com)
In this article I describe the DEFLATE algorithm that GZIP implements and depends on. The DEFLATE algorithm uses a combination of LZ77, Huffman codes and run-length-encoding; this article describes each in detail by walking through an example and developing source code to implement the algorithm. My aim is to implement readable rather than efficient or extensible code. I'll focus here on unzipping, rather than zipping, but by the end of the article, the zipping process should be clear.
Generic Freelist Allocator for Go (pkg.go.dev)
This package is not in the latest version of its module.
Packed structs in Zig make bit/flag sets trivial (hexops.com)
As we’ve been building Mach engine, we’ve been using a neat little pattern in Zig that enables writing flag sets more nicely in Zig than in other languages.
Know your Python container types (b-list.org)
There are a lot of container types available in the Python standard library, and it can be confusing sometimes to keep track of them all. So since it’s Christmas Eve and time to wrap any last-minute gifts, I’d like to wrap up this “Advent calendar” series with a guide to the most common types of data containers and what kinds of things you might want to wrap in them.
Bplustree.app (bplustree.app)
When Bloom filters don't bloom (2020) (cloudflare.com)
I've known about Bloom filters (named after Burton Bloom) since university, but I haven't had an opportunity to use them in anger. Last month this changed - I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.
B-Trees and Database Indexes (planetscale.com)
B-trees are used by many modern DBMSs. Learn how they work, how databases use them, and how your choice of primary key can affect index performance.