Hacker News with Generative AI: Data Structures

Ropey – A UTF8 text rope for manipulating and editing large text (github.com/cessen)
Ropey is a utf8 text rope for Rust, designed to be the backing text-buffer for applications such as text editors. Ropey is fast, robust, and can handle huge texts and memory-incoherent edits with ease.
Data evolution with set-theoretic types (dashbit.co)
Recently I have been working on projects that integrate Elixir with native code in C and Rust. One of the Rust libraries defines the following struct (with fields removed for simplicity):
Using black magic to make a fast circular buffer (2017) (calho.st)
Yesterday, I took a glance at the Wikipedia page for the circular buffer and was intrigued by an alleged optimization technique that I was not familiar with:
Flattening ASTs and other compiler data structures (2023) (cs.cornell.edu)
Arenas, a.k.a. regions, are everywhere in modern language implementations. One form of arenas is both super simple and surprisingly effective for compilers and compiler-like things. Maybe because of its simplicity, I haven’t seen the basic technique in many compiler courses—or anywhere else in a CS curriculum for that matter. This post is an introduction to the idea and its many virtues.
Today I learned that bash has hashmaps (2024) (xeiaso.net)
Hashmaps (associative arrays) are a great way to store a bag of key-value data. At work I was writing something that needed me to spawn a bunch of GPU instances, GPU availability is spread out by region and GPU type. I wanted to store a mapping of GPU kind to region name and for some reason I thought it would be a good idea to do it in bash.
Why is hash(-1) == hash(-2) in Python? (omairmajid.com)
While browsing Reddit the other day waiting for my code to compile, I ran across this question on r/Python:
Show HN: SPath is a Rust lib for query JSONPath over any semi-structured data (github.com/cratesland)
SPath: Query expressions for semi-structured data
Parallel-hashmap: drop-in replacement for unordered_map, unordered_set (github.com/greg7mdp)
This repository aims to provide a set of excellent hash map implementations, as well as a btree alternative to std::map and std::set, with the following characteristics:
B-Trees: More Than I Thought I'd Want to Know (benjamincongdon.me)
Recently, I’ve been reading through the excellent Database Internals (Alex Petrov, 2019). The first half of the book is dedicated to the implementation of database storage engines – the subsystem(s) of a DBMS that handles long-term persistence of data. A surprising amount of this section discusses the implementation and optimization of various B-Tree data structures.
Static search trees: faster than binary search (curiouscoding.nl)
In this post, we will implement a static search tree (S+ tree) for high-throughput searching of sorted data, as introduced on Algorithmica.
I made a tiny library for switches and sum types in Lua (github.com/alurm)
Minimalistic sum types and switches for Lua
Visualizing memory layout of Rust's data types (youtube.com)
Exploring Alternatives to UUIDv4; Enter ULIDs (jirevwe.github.io)
UUIDv4 is a commonly used unique identifier format. UUIDv4 is a standardized format for generating unique identifiers that are widely used in distributed systems. Recently there have attempts to introduce new identifier formats that are shorter, url-friendly, lexographically sortable, collision-safe during generation.
Why HNSW is not the answer and disk-based alternatives might be more practical (pgvecto.rs)
HNSW (Hierarchical Navigable Small World) has become the go-to algorithm for many vector databases. Its multi-layered graph structure and ability to efficiently navigate vector embeddings make it particularly appealing. However, despite its apparent advantages, HNSW may not be the optimal solution for large-scale and dynamic vector similarity search. In this blog post, we challenge the dominance of HNSW and explore why disk-based alternatives, such as IVF (Inverted File Index), might be more practical for massive datasets.
Spreadsheets 1/3 – Rye Language (ryelang.org)
You can not have a higher level (closer to human) langauge, without also higher level (closer to human) value types. These then enable higher level operations on information you are working with.
Enum of Arrays (tigerbeetle.com)
A popular data-oriented design pattern is a Struct of Arrays. Is an Array of Enums also useful? Yes! But let’s do a refresher on struct of arrays (SoA) first:
Enum of Arrays (tigerbeetle.com)
A popular data-oriented design pattern is a Struct of Arrays. Is an Array of Enums also useful? Yes! But let’s do a refresher on struct of arrays (SoA) first:
Binfuse: C++ library for binary fuse filters, including a sharded filter (github.com/oschonrock)
Binary fuse filters are a recent (2022) development in the group of Approximate Membership Query filters
Show HN: A portable hash map in C (github.com/e-dant)
A small, portable, linear-probing hash map in C.
B+ Tree Visualization (cs.usfca.edu)
Algorithm Visualizations
How to pack ternary numbers in 8-bit bytes (compilade.net)
Good union types in Go would probably need types without a zero value (utoronto.ca)
One of the classical big reason to want union types in Go is so that one can implement the general pattern of an option type, in order to force people to deal explicitly with null values.
List of Lists of Lists (wikipedia.org)
This is a dynamic list and may never be able to satisfy particular standards for completeness. You can help by adding missing items with reliable sources.
Show HN: Vicinity – Fast, Lightweight Nearest Neighbors with Flexible Back Ends (github.com/MinishLab)
Vicinity is a light-weight, low-dependency vector store. It provides a simple and intuitive interface for nearest neighbor search, with support for different backends and evaluation.
Kth: High-Performance Selection Algorithms for Go (github.com/tsenart)
kth provides high-performance selection algorithms that find the k-th smallest element without sorting the entire dataset. It's especially useful for operations like finding top-N elements or medians in large datasets.
A simple way to understand CRDTs (interjectedfuture.com)
There's a simple way to understand CRDTs: It leverages algebra to unmix the inevitable mixing of data when syncing over an unreliable network.
Fine, I'll Play With Skiplists (buttondown.com)
A Log-Structured Merge tree, or LSM, is a popular data structure for storage engines. It’s what is used by RocksDB, which is sort of the poster child LSM. At a high level, the way an LSM works is that new writes get added to a memory-only index called a memtable and a durable file called a write-ahead log, or WAL. The memtable is used to serve queries and the WAL is used to provide durability.
FreeLRU: GC-less, fast and generic LRU hashmap library for Go (github.com/elastic)
FreeLRU allows you to cache objects without introducing GC overhead.
From string to AST: parsing (2019) (kubuszok.com)
Whether you have to do with data in form of CSV, JSON or a full-blooded programming language like C, JavaScript, Scala, or maybe a query language like SQL, you always transform some sequence of characters (or binary values) into a structured representation.
A short introduction to Interval Tree Clocks (2017) (separateconcerns.com)
One of the things I work on at Lima is master-master filesystem replication. In this kind of system, we need to track causality. In a nutshell, given two events modifying a given piece of data and originating from different nodes in the system, we want to know if one of those events could have influenced the other one, or in other words if one of those events “happened before” the other one.