Hacker News with Generative AI: Computer Architecture

Undocumented 8086 instructions, explained by the microcode (righto.com)
What happens if you give the Intel 8086 processor an instruction that doesn't exist?
Notes on the Pentium's microcode circuitry (righto.com)
Most people think of machine instructions as the fundamental steps that a computer performs. However, many processors have another layer of software underneath: microcode. With microcode, instead of building the processor's control circuitry from complex logic gates, the control logic is implemented with code known as microcode, stored in the microcode ROM. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.
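To make the excerpt above concrete, here is a toy sketch of the idea: each machine instruction is looked up in a table (standing in for the microcode ROM) and run as a short sequence of simpler steps. The micro-operation names, the two-instruction set, and the register names are all hypothetical and are not taken from the 8086 or the Pentium.

```python
# Toy model of microcode. Every name below (LOAD_A, ALU_ADD, ...) is a
# hypothetical micro-operation invented for illustration only.
MICROCODE_ROM = {
    # ADD dst, src: read both registers, add in the ALU, write back.
    "ADD": [("LOAD_A", "dst"), ("LOAD_B", "src"), ("ALU_ADD", None), ("STORE", "dst")],
    # MOV dst, src: pass the source through the ALU unchanged, write back.
    "MOV": [("LOAD_A", "src"), ("ALU_PASS", None), ("STORE", "dst")],
}


def execute(instruction, regs):
    """Run one machine instruction by stepping through its micro-instructions.

    Returns the number of micro-instructions executed.
    """
    opcode, dst, src = instruction
    operands = {"dst": dst, "src": src}
    a = b = result = 0  # internal latches, invisible to the programmer
    steps = 0
    for micro_op, arg in MICROCODE_ROM[opcode]:
        if micro_op == "LOAD_A":
            a = regs[operands[arg]]
        elif micro_op == "LOAD_B":
            b = regs[operands[arg]]
        elif micro_op == "ALU_ADD":
            result = (a + b) & 0xFFFF  # 16-bit wraparound
        elif micro_op == "ALU_PASS":
            result = a
        elif micro_op == "STORE":
            regs[operands[arg]] = result
        steps += 1
    return steps


regs = {"AX": 5, "BX": 7}
add_steps = execute(("ADD", "AX", "BX"), regs)
```

After the call, regs["AX"] holds the sum and add_steps counts the four micro-instructions that one ADD expanded into.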
RISC architecture did change everything (wired.com)
“RISC architecture is gonna change everything.” Those absurdly geeky, incredibly prophetic words were spoken 30 years ago. Today, they’re somehow truer than ever.
An Interview with Zen Chief Architect Mike Clark (computerenhance.com)
Zen is one of the most important microarchitectures in the history of the x86 ecosystem. Not only is it the reigning champion in many x64 benchmarks, but it is also the architecture that enabled AMD’s dramatic rise in CPU market share over the past eight years: from 10% when the first Zen processor was launched, to 25% at the introduction of Zen 5.
RISC-V Processor Design – Lec 6 – EXU and Co-Simulation (ycombinator.com)
In this lecture, we stitch together a custom Instruction Set Simulator I created with the RISC-V CPU (now with the execution stage) and see the first instructions flowing in the pipeline.
Bypassing the Branch Predictor (nicula.xyz)
A couple of days ago I was thinking about what you can do when the branch predictor is effectively working against you, and thus pessimizing your program instead of optimizing it.
An Active Message Inspired Reconfigurable Architecture for Irregular Workloads (arxiv.org)
Modern reconfigurable architectures are increasingly favored for resource-constrained edge devices as they balance high performance, energy efficiency, and programmability well.
C Is Not a Low-level Language: Your computer is not a fast PDP-11 (2018) (dl.acm.org)
In the wake of the recent Meltdown and Spectre vulnerabilities, it’s worth spending some time looking at root causes.
The Pentium contains a complicated circuit to multiply by three (righto.com)
In 1993, Intel released the high-performance Pentium processor, the start of the long-running Pentium line.
Comparing Two Verilog CPU Implementations Using EBMC (philipzucker.com)
About a year ago my friends and I built a 4-bit CPU out of a kit from AliExpress. https://www.philipzucker.com/td4-4bit-cpu/ It’s a lot of fun. I also think the system is so simple that it is kind of a nice target for tinkering around with formal methods.
SVDQuant+NVFP4: 4× Smaller, 3× Faster FLUX with 16-bit Quality on Blackwell GPUs (hanlab.mit.edu)
With Moore's law slowing down, hardware vendors are shifting toward low-precision inference. NVIDIA's latest Blackwell architecture introduces a new 4-bit floating point format (NVFP4), improving upon the previous MXFP4 format. NVFP4 features more precise scaling factors and a smaller microscaling group size (16 vs. 32), enabling it to maintain 16-bit model accuracy even at 4-bit precision while delivering 4× higher peak performance.
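The effect of the microscaling group size can be sketched in a few lines. The example below is a simplified illustration, not the SVDQuant or NVFP4 implementation: it assumes the usual E2M1 magnitude grid for 4-bit floats and keeps each group's scale as an ordinary Python float, whereas real formats also store the scale in a narrow encoding. All function names are hypothetical.

```python
# Simplified block-scaled 4-bit quantization.
# Assumption: values snap to the E2M1 magnitude grid; the per-group scale is
# kept as a full-precision float, which real hardware formats do not do.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_group(values):
    """Pick one shared scale for the group, then snap each value to the grid."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / FP4_GRID[-1] if max_abs > 0 else 1.0
    quantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_GRID, key=lambda g: abs(g - target))
        quantized.append(nearest if v >= 0 else -nearest)
    return scale, quantized


def quantize(values, group_size):
    """Split the input into consecutive groups and quantize each one."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [quantize_group(g) for g in groups]


def dequantize(blocks):
    """Reconstruct approximate values from (scale, quantized values) pairs."""
    return [scale * q for scale, qs in blocks for q in qs]


def total_error(values, group_size):
    """Sum of absolute reconstruction errors for a given group size."""
    restored = dequantize(quantize(values, group_size))
    return sum(abs(a - b) for a, b in zip(values, restored))


# 31 small values and one outlier: the outlier dictates the scale of its group.
values = [0.1] * 31 + [60.0]
error_32 = total_error(values, 32)
error_16 = total_error(values, 16)
```

With a group of 32, the single outlier forces a coarse scale on every small value. With groups of 16, only the group containing the outlier pays that cost, so error_16 comes out lower than error_32.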
Explaining my fast 6502 code generator (2023) (pubby.games)
To learn how optimizing compilers are made, I built one targeting the 6502 architecture. In a bizarre twist, my compiler generates faster code than GCC, LLVM, and every other compiler I compared it to.
MESI Cache Coherency Protocol Visualization (scss.tcd.ie)
Torrent-1: a RISC-V vector implementation inspired by the Cray X1 vector machine (hackaday.com)
The crux of most supercomputers is the ability to operate on many pieces of data at once, something video cards are good at, too. Enter T1 (short for Torrent-1), a RISC-V vector implementation inspired by the Cray X1 vector machine.
I believe 6502 instruction set is a good first assembly language (nemanjatrifunovic.substack.com)
Deciding where to start is one of the hardest things about learning assembly programming. Unlike high-level languages, assembly is tightly connected to the hardware and deciding which CPU to use is an important first step.
Minor 387 Documentation Mystery (os2museum.com)
So here I am, writing a bit of test code to figure out the behavior of x87 FPUs with regard to saving and loading the FPU state (FSTENV/FLDENV and FSAVE/FRSTOR instructions in different modes and formats).
Nvidia RTX Blackwell GPU Architecture [pdf] (nvidia.com)
A RISC-V Progress Check: Benchmarking P550 and C910 (chipsandcheese.com)
RISC-V has seen a flurry of activity over the past few years. Most RISC-V implementations have been small in-order cores. Western Digital’s SweRV and Nvidia’s NV-RISCV are good examples. But cores like those are meant for small microcontrollers, and the average consumer won’t care which core a company selects for a GPU or SSD’s microcontrollers. Flagship cores from AMD, Arm, Intel, and Qualcomm are more visible in our daily lives, and use large out-of-order execution engines to deliver high performance.
Show HN: Vole Machine ISA Simulator (faresbakhit.github.io)
Disabling Zen 5's Op Cache and Exploring Its Clustered Decoder (chipsandcheese.com)
Zen 5 has an interesting frontend setup with a pair of fetch and decode clusters. Each cluster serves one of the core’s two SMT threads. That creates parallels to AMD’s Steamroller architecture from the pre-Zen days. Zen 5 and Steamroller can both decode up to eight instructions per cycle with two threads active, or up to four per cycle for a single thread.
Using the most unhinged AVX-512 instruction to make fastest phrase search algo (gab-menezes.github.io)
Do you know when you go to your favorite search engine and search for something using double quotes, like a passage of a book/article or something very specific? That’s called phrase search (sometimes exact search). What we are telling the search engine is that we want these exact words in this exact order (this varies from search engine to search engine, but that’s the main idea).
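As a baseline for what "these exact words in this exact order" means, phrase search can be sketched with a positional inverted index. This is a plain textbook approach, not the AVX-512 technique from the linked article; the function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents that contain the phrase's words consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents must contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[w][doc_id]) for w in words[1:]]
        # Match if, for some start position of the first word, the word at
        # offset i + 1 in the phrase appears exactly i + 1 positions later.
        if any(all(start + i + 1 in positions for i, positions in enumerate(later))
               for start in index[words[0]][doc_id]):
            hits.append(doc_id)
    return hits


docs = ["the quick brown fox", "brown the quick fox", "quick the brown fox"]
index = build_index(docs)
```

Here phrase_search(index, "the quick") matches the first two documents but not the third, which contains both words in the wrong order.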
Simple CPU Design (simplecpudesign.com)
Welcome to the world of tomorrow, the past and the present of computer architectures, with a small sprinkling of flashing LEDs. All good computers have to have banks of flashing lights. This web site was inspired by an article by Alan Clements, in which he discusses the pressures faced in teaching computer architectures.
LoongArch64 Subjective Highlights (0x80.pl)
I got back to work on simdutf recently, and noticed that the library gained support for LoongArch64. This is a custom design and custom ISA by Loongson from China. They provide documentation for the scalar ISA, but not for the vector extension. Despite that, GCC, binutils, QEMU and other tools already support the ISA. Luckily for us, Jiajie Chen did impressive work reverse engineering the vector extension and published the results online as The Unofficial LoongArch Intrinsics Guide.
Impact of Low Temperatures on the 5nm SRAM Array Size and Performance (semiengineering.com)
A new technical paper titled “Novel Trade-offs in 5 nm FinFET SRAM Arrays at Extremely Low Temperatures” was published by researchers at University of Stuttgart, IIT Kanpur, National Yang Ming Chiao Tung University, Khalifa University, and TU Munich.
Checking whether an ARM NEON register is zero (lemire.me)
Your phone probably runs on 64-bit ARM processors. These processors are ubiquitous: they power the Nintendo Switch, they power cloud servers at both Amazon AWS and Microsoft Azure, they power fast laptops, and so forth.
Reverse-engineering a carry-lookahead adder in the Pentium (righto.com)
Addition is harder than you'd expect, at least for a computer.
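One reason addition is hard is that each bit's result depends on the carry from all lower bits. A carry-lookahead adder computes every carry directly from per-bit "generate" and "propagate" signals instead of waiting for a ripple. The sketch below shows the general textbook technique in software; it is not a model of the Pentium's actual circuit, and the function names are hypothetical.

```python
def _and_all(bits):
    """AND together a sequence of 0/1 values (1 for an empty sequence)."""
    result = 1
    for bit in bits:
        result &= bit
    return result


def carry_lookahead_add(a, b, width=8, carry_in=0):
    """Add two unsigned integers; return (sum modulo 2**width, carry out)."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    generate = [x & y for x, y in zip(a_bits, b_bits)]   # bit creates a carry
    propagate = [x ^ y for x, y in zip(a_bits, b_bits)]  # bit passes a carry on

    # carries[i] is the carry into bit i. Each one is computed only from the
    # generate/propagate signals and carry_in, never from another carry,
    # which is what lets hardware evaluate them all in parallel.
    carries = [carry_in]
    for i in range(width):
        carry = carry_in & _and_all(propagate[:i + 1])
        for j in range(i + 1):
            # Bit j generated a carry and bits j+1..i all propagate it.
            carry |= generate[j] & _and_all(propagate[j + 1:i + 1])
        carries.append(carry)

    total = 0
    for i in range(width):
        total |= (propagate[i] ^ carries[i]) << i
    return total, carries[width]


result, carry_out = carry_lookahead_add(0b1011, 0b0110, width=4)
```

In hardware the nested loops become wide AND/OR gates, so practical designs group bits into blocks to keep gate fan-in manageable.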
What Every Hacker Should Know About TLB Invalidation [pdf] (grsecurity.net)
Finite Field Assembly: A Language for Emulating GPUs on CPU (leetarxiv.substack.com)
FF-asm is a programming language founded on the thesis: Math is mostly invented, rarely discovered.
The missing tier for query compilers (scattered-thoughts.net)
Database query engines used to be able to assume that disk latency was so high that the overhead of interpreting the query plan didn't matter. Unfortunately, these days a cheap NVMe SSD can supply data much faster than a query interpreter can process it.
Customasm – An assembler for custom, user-defined instruction sets (github.com/hlorenzi)
customasm is an assembler that allows you to provide your own custom instruction sets to assemble your source files!