Hacker News with Generative AI: Computer Architecture

Explaining my fast 6502 code generator (2023) (pubby.games)
To learn how optimizing compilers are made, I built one targeting the 6502 architecture. In a bizarre twist, my compiler generates faster code than GCC, LLVM, and every other compiler I compared it to.
MESI Cache Coherency Protocol Visualization (scss.tcd.ie)
Torrent-1: a RISC-V vector implementation inspired by the Cray X1 vector machine (hackaday.com)
The crux of most supercomputers is the ability to operate on many pieces of data at once, something video cards are good at, too. Enter T1 (short for Torrent-1), a RISC-V vector implementation inspired by the Cray X1 vector machine.
I believe 6502 instruction set is a good first assembly language (nemanjatrifunovic.substack.com)
Deciding where to start is one of the hardest things about learning assembly programming. Unlike high-level languages, assembly is tightly connected to the hardware and deciding which CPU to use is an important first step.
Minor 387 Documentation Mystery (os2museum.com)
So here I am, writing a bit of test code to figure out the behavior of x87 FPUs with regard to saving and loading the FPU state (FSTENV/FLDENV and FSAVE/FRSTOR instructions in different modes and formats).
Nvidia RTX Blackwell GPU Architecture [pdf] (nvidia.com)
A RISC-V Progress Check: Benchmarking P550 and C910 (chipsandcheese.com)
RISC-V has seen a flurry of activity over the past few years. Most RISC-V implementations have been small in-order cores. Western Digital’s SweRV and Nvidia’s NV-RISCV are good examples. But cores like those are meant for small microcontrollers, and the average consumer won’t care which core a company selects for a GPU or SSD’s microcontrollers. Flagship cores from AMD, Arm, Intel, and Qualcomm are more visible in our daily lives, and use large out-of-order execution engines to deliver high performance.
Show HN: Vole Machine ISA Simulator (faresbakhit.github.io)
Disabling Zen 5's Op Cache and Exploring Its Clustered Decoder (chipsandcheese.com)
Zen 5 has an interesting frontend setup with a pair of fetch and decode clusters. Each cluster serves one of the core’s two SMT threads. That creates parallels to AMD’s Steamroller architecture from the pre-Zen days. Zen 5 and Steamroller can both decode up to eight instructions per cycle with two threads active, or up to four per cycle for a single thread.
Using the most unhinged AVX-512 instruction to make fastest phrase search algo (gab-menezes.github.io)
Do you know when you go to your favorite search engine and search for something using double quotes, like a passage of a book/article or something very specific? That’s called phrase search (sometimes exact search). What we are telling the search engine is that we want these exact words in this exact order (this varies from search engine to search engine, but that’s the main idea).
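As a rough illustration of what "these exact words in this exact order" means in practice, here is a minimal sketch of phrase search over a positional index. It is not the AVX-512 approach from the linked article; the function names `build_positional_index` and `phrase_search` and the whitespace tokenization are assumptions made for this example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split is a good enough tokenizer here.
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    # Every occurrence of the first word is a candidate start of the phrase.
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # The i-th following word must sit exactly i positions after the start.
            if all(
                start + i in index.get(word, {}).get(doc_id, ())
                for i, word in enumerate(words[1:], 1)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "the quick brown fox",
    2: "brown the quick fox",
    3: "a quick brown dog",
}
index = build_positional_index(docs)
print(phrase_search(index, "quick brown"))
```

Document 2 contains both words but not adjacent and in order, so it is the kind of match a phrase query is meant to exclude.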
Simple CPU Design (simplecpudesign.com)
Welcome to the world of tomorrow, the past and the present of computer architectures, with a small sprinkling of flashing LEDs. All good computers have to have banks of flashing lights. This web site was inspired by an article by Alan Clements, in which he discusses the pressures faced in teaching computer architectures.
LoongArch64 Subjective Higlights (0x80.pl)
I got back to work on simdutf recently, and noticed that the library gained support for LoongArch64. This is a custom design and custom ISA by Loongson from China. They provide documentation for the scalar ISA, but not for the vector extension. Despite that, GCC, binutils, QEMU and other tools already support the ISA. Luckily for us, Jiajie Chen did impressive work reverse engineering the vector extension and published the results online as The Unofficial LoongArch Intrinsics Guide.
Impact of Low Temperatures on the 5nm SRAM Array Size and Performance (semiengineering.com)
A new technical paper titled “Novel Trade-offs in 5 nm FinFET SRAM Arrays at Extremely Low Temperatures” was published by researchers at University of Stuttgart, IIT Kanpur, National Yang Ming Chiao Tung University, Khalifa University, and TU Munich.
Checking whether an ARM NEON register is zero (lemire.me)
Your phone probably runs on 64-bit ARM processors. These processors are ubiquitous: they power the Nintendo Switch, they power cloud servers at both Amazon AWS and Microsoft Azure, they power fast laptops, and so forth.
Reverse-engineering a carry-lookahead adder in the Pentium (righto.com)
Addition is harder than you'd expect, at least for a computer.
What Every Hacker Should Know About TLB Invalidation [pdf] (grsecurity.net)
Finite Field Assembly: A Language for Emulating GPUs on CPU (leetarxiv.substack.com)
FF-asm is a programming language founded on the thesis: Math is mostly invented, rarely discovered.
The missing tier for query compilers (scattered-thoughts.net)
Database query engines used to be able to assume that disk latency was so high that the overhead of interpreting the query plan didn't matter. Unfortunately these days a cheap nvme ssd can supply data much faster than a query interpreter can process it.
Customasm – An assembler for custom, user-defined instruction sets (github.com/hlorenzi)
customasm is an assembler that allows you to provide your own custom instruction sets to assemble your source files!
Reverse Engineering the Constants in the Pentium FPU (righto.com)
Intel released the powerful Pentium processor in 1993, establishing a long-running brand of high-performance processors. The Pentium includes a floating-point unit that can rapidly compute functions such as sines, cosines, logarithms, and exponentials. But how does the Pentium compute these functions? Earlier Intel chips used binary algorithms called CORDIC, but the Pentium switched to polynomials to approximate these transcendental functions much faster. The polynomials have carefully-optimized coefficients that are stored in a special ROM inside the chip's floating-point unit.
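To make the idea of approximating a transcendental function with a polynomial concrete, here is a minimal sketch that evaluates sine with Horner's rule. The coefficients are plain Taylor-series terms chosen for this illustration, not the optimized constants stored in the Pentium's ROM, and the name `sin_poly` is hypothetical.

```python
import math

# Assumption: Taylor coefficients of sin(x)/x as a polynomial in z = x*x,
# highest degree first. These are NOT the Pentium's ROM constants.
SIN_COEFFS = [
    -1.0 / 39916800.0,  # -1/11!
    1.0 / 362880.0,     #  1/9!
    -1.0 / 5040.0,      # -1/7!
    1.0 / 120.0,        #  1/5!
    -1.0 / 6.0,         # -1/3!
    1.0,
]


def sin_poly(x):
    """Approximate sin(x) for small |x| (roughly |x| <= pi/4) with Horner's rule."""
    z = x * x
    acc = 0.0
    for coeff in SIN_COEFFS:
        # One multiply and one add per coefficient.
        acc = acc * z + coeff
    return x * acc


for x in (0.1, 0.5, math.pi / 4):
    print(x, sin_poly(x), math.sin(x))
```

A fixed, short chain of multiplies and adds like this is what makes the polynomial approach attractive compared with an iterative bit-by-bit algorithm; a production design would also reduce the argument into a small range first.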
RISC-V is making moves, but it has work to do if it wants to hit the mainstream (theregister.com)
RISC-V has been talked up as a challenger to Arm and x86, offering an open royalty-free architecture that promises flexibility and innovation without licensing costs. But for all the noise, you're more likely to find it buried inside IoT gadgets and obscure embedded systems than powering anything that'll typically grab a headline.
Emulating the FMAdd Instruction, Part 1: 32-bit Floats (drilian.com)
A thing that I had to do at work is write an emulation of the FMAdd (fused multiply-add) instruction for hardware where it wasn't natively supported (specifically I was writing a SIMD implementation, but the idea is the same), and so I thought I'd share a little bit about how FMAdd works, since I've already been posting about how float rounding works.
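The defining property of a fused multiply-add is that `a * b + c` is rounded once, at the end, instead of once after the multiply and again after the add. The sketch below illustrates that property with exact rational arithmetic. It uses Python's 64-bit floats rather than the 32-bit floats the article covers, it is not the article's bit-level emulation, and `fma_exact` is a hypothetical helper that handles only finite inputs and does not preserve the sign of a zero result.

```python
from fractions import Fraction


def fma_exact(a, b, c):
    """Compute a*b + c with a single rounding (finite inputs only)."""
    # Fractions represent each float exactly, so the product and sum are exact.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    # Converting the exact result back to float rounds once.
    return float(exact)


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 before c is added.
unfused = a * b + c
fused = fma_exact(a, b, c)

print("unfused:", unfused)
print("fused:  ", fused)
```

The unfused version loses the tiny residual entirely, while the single-rounding version keeps it, which is why code that depends on FMA results cannot simply substitute a separate multiply and add.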
Rambus DRAM (RDRAM) (wikipedia.org)
Rambus DRAM (RDRAM), and its successors Concurrent Rambus DRAM (CRDRAM) and Direct Rambus DRAM (DRDRAM), are types of synchronous dynamic random-access memory (SDRAM) developed by Rambus from the 1990s through to the early 2000s.
Execution units are often pipelined (xoria.org)
In the context of out-of-order microarchitectures, I was under the impression that execution units remain occupied until the µop they’re processing is complete. This is often not the case.
Dividing unsigned 8-bit numbers (0x80.pl)
Division is quite an expensive operation. For instance, latency of the 32-bit division varies between 10 and 15 cycles on the Cannon Lake CPU, and for Zen4 this range is from 9 to 14 cycles. The latency of 32-bit multiplication is 3 or 4 cycles on both CPU models.
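Because multiplication is so much cheaper than division, one common trick is to replace the divide with a multiply by a precomputed reciprocal followed by a shift. The sketch below shows this for unsigned 8-bit operands. It is one possible approach and not necessarily one the article uses; the names `RECIP` and `div_u8` and the choice of a 16-bit fixed-point scale are assumptions for this example.

```python
# Assumption: 16 fractional bits; RECIP[d] = ceil(2**16 / d) for d in 1..255.
# Index 0 is a placeholder because division by zero is rejected below.
RECIP = [0] + [(65536 + d - 1) // d for d in range(1, 256)]


def div_u8(x, d):
    """Return x // d for 0 <= x <= 255 and 1 <= d <= 255 without dividing."""
    if not (0 <= x <= 255 and 1 <= d <= 255):
        raise ValueError("operands must be unsigned 8-bit, divisor non-zero")
    # Multiply by the scaled reciprocal, then drop the 16 fractional bits.
    return (x * RECIP[d]) >> 16


# Exhaustive check over all 8-bit operand pairs.
mismatches = [
    (x, d)
    for x in range(256)
    for d in range(1, 256)
    if div_u8(x, d) != x // d
]
print("mismatches:", len(mismatches))
```

The rounding-up of the reciprocal introduces an error smaller than `d / 2**16` per unit of `x`, which for 8-bit inputs is never large enough to push the result past the next integer, so the table lookup is exact over the whole input range.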
SVC16: Simplest Virtual Computer (github.com/JanNeuendorf)
This is the specification for an extremely simple "virtual computer" that can be emulated.
Bit-permuting 16 u32s at once with AVX-512 (blogspot.com)
The basic trick to apply the same bit-permutation to each of the u32s is to view them as a matrix of 16 rows by 32 columns, transpose it into 32 u16s, permute those u16s in the same way that we wanted to permute the bits of the u32s, then transpose back to 16 u32s. Easy:
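The transpose, permute, transpose-back idea can be sketched in scalar Python. This is an illustration of the data movement only, not the article's AVX-512 code; the function names and the convention that output bit `i` is taken from input bit `perm[i]` are assumptions for this example.

```python
def transpose_16x32(words):
    """View 16 u32s as a 16x32 bit matrix; return its 32 columns as u16s."""
    cols = []
    for bit in range(32):
        col = 0
        for row, word in enumerate(words):
            col |= ((word >> bit) & 1) << row
        cols.append(col)
    return cols


def transpose_32x16(cols):
    """Inverse of transpose_16x32: rebuild 16 u32s from 32 u16 columns."""
    words = []
    for row in range(16):
        word = 0
        for bit, col in enumerate(cols):
            word |= ((col >> row) & 1) << bit
        words.append(word)
    return words


def permute_bits_16(words, perm):
    """Apply one bit permutation to all 16 words by permuting whole columns."""
    cols = transpose_16x32(words)
    # Moving a u16 column moves the same bit position in all 16 words at once.
    permuted = [cols[perm[i]] for i in range(32)]
    return transpose_32x16(permuted)


def permute_bits_one(word, perm):
    """Reference: permute the bits of a single u32 one bit at a time."""
    return sum(((word >> perm[i]) & 1) << i for i in range(32))


words = [(0x9E3779B9 * (k + 1)) & 0xFFFFFFFF for k in range(16)]
perm = list(reversed(range(32)))  # example permutation: bit reversal
print([hex(w) for w in permute_bits_16(words, perm)[:2]])
```

After the transpose, each bit position across all 16 words lives in a single u16, so a bit permutation becomes an ordinary element shuffle, which is the kind of operation wide SIMD units handle well.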
The Chiplet Revolution – Communications of the ACM (cacm.acm.org)
Reducing demands on a single chip by using smaller chips dedicated to specific functions.
Computer Architecture, Fifth Edition: A Quantitative Approach (2011) (dl.acm.org)
The computing world today is in the middle of a revolution: mobile clients and cloud computing have emerged as the dominant paradigms driving programming and hardware innovation today.