SQLite Index Visualization(mrsuh.com) After learning about indexes, I understood their basic structure, but I wanted to dig deeper — to explore the data structure, understand the algorithm, and learn how the index data is stored on disk.
Histogramming Bytes with Positional Popcount(blogspot.com) A while ago, after some back and forth on twitter/X with @corsix, I dropped some implementation of byte histogramming without explaining anything. This post aims to rectify that lack of explanation.
How to Flatten nested JSON arrays(datazip.io) Flattening nested JSON or MongoDB’s BSON or normalizing semi-structured data and writing queries on it for analytics or regular queries, is a common challenge in data processing.
Notes on Binary Soup(marginalia.nu) I recently put together a small library called Slop, for intermediate on-disk data representation for the search engine, replacing a few ad-hoc formats I had in place before.
96 points by todsacerdoti 19 days ago | 14 comments
How to Correctly Sum Up Numbers(cedardb.com) One of the first examples for loops in probably every programming language course is taking a list of numbers and calculating the sum of them.
203 points by foweltschmerz 27 days ago | 47 comments
Inserting a 0 bit in the middle of a value(wordpress.com) Inserting a 0 bit in the middle of a value is not hard to do: we can split the original value into the bits below and at or above the target bit position, shift the top bits left by 1 more unit to make space, then reassemble everything together:
Zero or Sign Extend(wordpress.com) A while back I had to deal with a bit-packed format that contained a list of integer values encoded in one of a pre-defined sets of bit widths, where both the allowed bit widths and the signed-ness were denoted in a header elsewhere.
Optimize the Data Layout(cedardb.com) When you want to speed up your program, the obvious step is to recall the learnings of your data structure class and optimize the algorithmic complexity.
Designing a Fast Concurrent Hash Table(ibraheem.ca) I recently released papaya, a fast and feature-complete concurrent hash table for Rust. In this post I want to dive into the design and research that went into creating it, as well as why you might consider using it over existing solutions. If you're looking for an overview of papaya, you might find it more useful to consult the documentation.
189 points by burntsushi 42 days ago | 26 comments
Fast B-Trees(scattered-thoughts.net) Many 'scripting' languages use a hashmap for their default associative data-structure (javascript objects, python dicts, etc). Hashtables have a lot of annoying properties:
271 points by surprisetalk 45 days ago | 64 comments
Conflict-Free Replicated Data Types(crdt.tech) A Conflict-free Replicated Data Type (CRDT) is a data structure that simplifies distributed data storage systems and multi-user applications.
CPython Runtime Internals: Key Data Structures and Runtime Bootstrapping(codingconfessions.com) While this article is freely available to read online, I am also making a PDF of this article available. If you enjoy reading in that format, you can purchase it at the below link. If you are a paid subscriber you can find a 100% discount code in the header of the email, or just reach out to me via email or DM and I will give you the PDF.
The Simple Magic of Consistent Hashing (2011)(paperplanes.de) The simplicity of consistent hashing is pretty mind-blowing. Here you have a number of nodes in a cluster of databases, or in a cluster of web caches. How do you figure out where the data for a particular key goes in that cluster?
Dissecting the gzip format (2011)(infinitepartitions.com) In this article I describe the DEFLATE algorithm that GZIP implements and depends on. The DEFLATE algorithm uses a combination of LZ77, Huffman codes and run-length-encoding; this article describes each in detail by walking through an example and developing source code to implement the algorithm. My aim is to implement readable rather than efficient or extensible code. I'll focus here on unzipping, rather than zipping, but by the end of the article, the zipping process should be clear.
Know your Python container types(b-list.org) There are a lot of container types available in the Python standard library, and it can be confusing sometimes to keep track of them all. So since it’s Christmas Eve and time to wrap any last-minute gifts, I’d like to wrap up this “Advent calendar” series with a guide to the most common types of data containers and what kinds of things you might want to wrap in them.
When Bloom filters don't bloom (2020)(cloudflare.com) I've known about Bloom filters (named after Burton Bloom) since university, but I haven't had an opportunity to use them in anger. Last month this changed - I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.
B-Trees and Database Indexes(planetscale.com) B-trees are used by many modern DBMSs. Learn how they work, how databases use them, and how your choice of primary key can affect index performance.