Hacker News with Generative AI: Computer Architecture

Fundamental flaws of SIMD ISAs (2021) (bitsnbites.eu)
According to Flynn’s taxonomy SIMD refers to a computer architecture that can process multiple data streams with a single instruction (i.e. “Single Instruction stream, Multiple Data streams”).
The Dauug House - Dauug|36 minicomputer documentation (cs.wright.edu)
Dauug|36 is a 36-bit architecture for owner-built CPUs, controllers, and minicomputers. Only maker-scale assembly tools are necessary, so this architecture can be implemented anywhere on the planet without a semiconductor foundry. All you need is a bare circuit board, about 300 components, and some soldering practice.
Long-term L1 execution layer proposal: replace the EVM with RISC-V (ethereum-magicians.org)
How MOS 6502 Illegal Opcodes Work (2008) (pagetable.com)
The original NMOS version of the MOS 6502, used in computers like the Commodore 64, the Apple II and the Nintendo Entertainment System (NES), is well-known for its illegal opcodes: Out of 256 possible opcodes, 151 are defined by the architecture, but many of the remaining 105 undefined opcodes do useful things.
Efficient Architecture for RISC-V Vector Memory Access (arxiv.org)
Vector processors frequently suffer from inefficient memory accesses, particularly for strided and segment patterns.
INT 10h (int10h.org)
Learning Assembly for Fun, Performance and Profit (thechipletter.substack.com)
Low-level languages have been in the news recently. Use of Nvidia’s PTX has been revealed as part of DeepSeek’s ‘secret sauce’. And there is still plenty of interest in learning assembly language. A recent Substack post advocating learning assembly language for the venerable, but well-loved, 6502 as a first step garnered over 240 ‘upvotes’ and more than 290 comments on Hacker News.
PDP-11/Hack de luxe (vcfed.org)
Not really satisfied with my breadboard attempt to build another PDP-11/Hack, and also because I wanted to investigate a little more how the DCJ11 really works, I decided to make another PDP-11/Hack. So I designed a Eurocard DCJ11-based single-board computer. The PCBs arrived today, so nothing better than to put them to use. But this time the PDP-11/Hack comes with an expansion slot.
Banked Memories for Soft SIMT Processors (arxiv.org)
Recent advances in soft GPGPU architectures have shown that a small (<10K LUT), high performance (770 MHz) processor is possible in modern FPGAs.
Undocumented 8086 instructions, explained by the microcode (righto.com)
What happens if you give the Intel 8086 processor an instruction that doesn't exist?
Notes on the Pentium's microcode circuitry (righto.com)
Most people think of machine instructions as the fundamental steps that a computer performs. However, many processors have another layer of software underneath: microcode. With microcode, instead of building the processor's control circuitry from complex logic gates, the control logic is implemented with code known as microcode, stored in the microcode ROM. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.
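The idea in the excerpt above, one machine instruction expanding into several micro-instructions read from a ROM, can be sketched with a toy interpreter. Everything here is hypothetical and for illustration only: the three-instruction ISA, the micro-operation names, and the `MICROCODE_ROM` table are assumptions, not the Pentium's actual microcode.

```python
# Hypothetical microcode ROM: each machine instruction maps to a fixed
# sequence of simpler micro-instructions (names invented for this sketch).
MICROCODE_ROM = {
    "LOAD":  ["mem_to_tmp", "tmp_to_acc"],
    "ADD":   ["mem_to_tmp", "alu_add"],
    "STORE": ["acc_to_mem"],
}


def run(program, memory):
    """Execute (opcode, address) pairs by stepping through their micro-ops."""
    state = {"acc": 0, "tmp": 0}
    trace = []  # records which micro-op ran for which machine instruction
    for opcode, addr in program:
        for micro_op in MICROCODE_ROM[opcode]:
            if micro_op == "mem_to_tmp":
                state["tmp"] = memory[addr]
            elif micro_op == "tmp_to_acc":
                state["acc"] = state["tmp"]
            elif micro_op == "alu_add":
                state["acc"] += state["tmp"]
            elif micro_op == "acc_to_mem":
                memory[addr] = state["acc"]
            trace.append((opcode, micro_op))
    return state["acc"], trace


memory = [2, 3, 0]
program = [("LOAD", 0), ("ADD", 1), ("STORE", 2)]
acc, trace = run(program, memory)
```

Three machine instructions produce five micro-steps in `trace`. In this toy model, changing an instruction's behavior means editing the table rather than the dispatch logic, which loosely mirrors the appeal of microcode over hard-wired control.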
RISC architecture did change everything (wired.com)
“RISC architecture is gonna change everything.” Those absurdly geeky, incredibly prophetic words were spoken 30 years ago. Today, they’re somehow truer than ever.
An Interview with Zen Chief Architect Mike Clark (computerenhance.com)
Zen is one of the most important microarchitectures in the history of the x86 ecosystem. Not only is it the reigning champion in many x64 benchmarks, but it is also the architecture that enabled AMD’s dramatic rise in CPU marketshare over the past eight years: from 10% when the first Zen processor was launched, to 25% at the introduction of Zen 5.
RISC-V Processor Design – Lec 6 – EXU and Co-Simulation (ycombinator.com)
In this lecture, we stitch together a custom Instruction Set Simulator I created with the RISC-V CPU (now with the execution stage) and see the first instructions flowing in the pipeline.
Bypassing the Branch Predictor (nicula.xyz)
A couple of days ago I was thinking about what you can do when the branch predictor is effectively working against you, and thus pessimizing your program instead of optimizing it.
An Active Message Inspired Reconfigurable Architecture for Irregular Workloads (arxiv.org)
Modern reconfigurable architectures are increasingly favored for resource-constrained edge devices as they balance high performance, energy efficiency, and programmability well.
C Is Not a Low-level Language: Your computer is not a fast PDP-11 (2018) (dl.acm.org)
In the wake of the recent Meltdown and Spectre vulnerabilities, it’s worth spending some time looking at root causes.
The Pentium contains a complicated circuit to multiply by three (righto.com)
In 1993, Intel released the high-performance Pentium processor, the start of the long-running Pentium line.
Comparing Two Verilog CPU Implementations Using EBMC (philipzucker.com)
About a year ago my friends and I built a 4-bit CPU out of a kit from AliExpress. https://www.philipzucker.com/td4-4bit-cpu/ It’s a lot of fun. I also think the system is so simple that it is kind of a nice target for tinkering around with formal methods.
SVDQuant+NVFP4: 4× Smaller, 3× Faster FLUX with 16-bit Quality on Blackwell GPUs (hanlab.mit.edu)
With Moore's law slowing down, hardware vendors are shifting toward low-precision inference. NVIDIA's latest Blackwell architecture introduces a new 4-bit floating point format (NVFP4), improving upon the previous MXFP4 format. NVFP4 features more precise scaling factors and a smaller microscaling group size (16 vs. 32), enabling it to maintain 16-bit model accuracy even at 4-bit precision while delivering 4× higher peak performance.
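Why a smaller scaling group can help is easiest to see with an outlier. The sketch below is a simplified stand-in, not the NVFP4 or MXFP4 format: it assumes a symmetric integer grid of [-7, 7] and a full-precision scale per group, whereas the real formats use 4-bit floating point values and low-precision scales. The function names and the sample data are made up for illustration.

```python
def quantize_groups(values, group_size, qmax=7):
    """Quantize with one scale per group, then dequantize back to floats.

    Assumption: a signed integer grid [-qmax, qmax] stands in for a real
    4-bit floating point encoding.
    """
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        peak = max(abs(v) for v in group)
        scale = peak / qmax if peak else 1.0  # avoid dividing by zero
        for v in group:
            q = max(-qmax, min(qmax, round(v / scale)))
            out.append(q * scale)
    return out


def total_error(values, group_size):
    """Sum of absolute differences between originals and dequantized values."""
    approx = quantize_groups(values, group_size)
    return sum(abs(a - b) for a, b in zip(values, approx))


# 32 values with a single large outlier in the second half.
data = [1.0] * 16 + [100.0] + [1.0] * 15
error_32 = total_error(data, 32)  # one scale shared by all 32 values
error_16 = total_error(data, 16)  # outlier only affects its own group of 16
```

With one group of 32, the outlier sets the scale for every value and all the small values round to zero. With groups of 16, only the group that contains the outlier loses its small values, so `error_16` comes out lower than `error_32`.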
Explaining my fast 6502 code generator (2023) (pubby.games)
To learn how optimizing compilers are made, I built one targeting the 6502 architecture. In a bizarre twist, my compiler generates faster code than GCC, LLVM, and every other compiler I compared it to.
MESI Cache Coherency Protocol Visualization (scss.tcd.ie)
Torrent-1: a RISC-V vector implementation inspired by the Cray X1 vector machine (hackaday.com)
The crux of most supercomputers is the ability to operate on many pieces of data at once, something video cards are good at, too. Enter T1 (short for Torrent-1), a RISC-V vector implementation inspired by the Cray X1 vector machine.
I believe 6502 instruction set is a good first assembly language (nemanjatrifunovic.substack.com)
Deciding where to start is one of the hardest things about learning assembly programming. Unlike high-level languages, assembly is tightly connected to the hardware and deciding which CPU to use is an important first step.
Minor 387 Documentation Mystery (os2museum.com)
So here I am, writing a bit of test code to figure out the behavior of x87 FPUs with regard to saving and loading the FPU state (FSTENV/FLDENV and FSAVE/FRSTOR instructions in different modes and formats).
Nvidia RTX Blackwell GPU Architecture [pdf] (nvidia.com)
A RISC-V Progress Check: Benchmarking P550 and C910 (chipsandcheese.com)
RISC-V has seen a flurry of activity over the past few years. Most RISC-V implementations have been small in-order cores. Western Digital’s SweRV and Nvidia’s NV-RISCV are good examples. But cores like those are meant for small microcontrollers, and the average consumer won’t care which core a company selects for a GPU or SSD’s microcontrollers. Flagship cores from AMD, Arm, Intel, and Qualcomm are more visible in our daily lives, and use large out-of-order execution engines to deliver high performance.
Show HN: Vole Machine ISA Simulator (faresbakhit.github.io)
Disabling Zen 5's Op Cache and Exploring Its Clustered Decoder (chipsandcheese.com)
Zen 5 has an interesting frontend setup with a pair of fetch and decode clusters. Each cluster serves one of the core’s two SMT threads. That creates parallels to AMD’s Steamroller architecture from the pre-Zen days. Zen 5 and Steamroller can both decode up to eight instructions per cycle with two threads active, or up to four per cycle for a single thread.
Using the most unhinged AVX-512 instruction to make fastest phrase search algo (gab-menezes.github.io)
Do you know when you go to your favorite search engine and search for something using double quotes, like a passage of a book/article or something very specific? That’s called phrase search (sometimes exact search). What we are telling the search engine is that we want these exact words in this exact order (this varies from search engine to search engine, but that’s the main idea).
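The "exact words in this exact order" requirement can be sketched with a positional inverted index. This is a plain baseline for illustration, not the AVX-512 technique the linked article describes; the function names, the whitespace tokenization, and the sample documents are all assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents containing the words consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    # Candidate start positions come from the first word's postings.
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word i of the phrase must sit exactly i positions after start.
            if all(
                start + i in index.get(word, {}).get(doc_id, ())
                for i, word in enumerate(words[1:], 1)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "the quick brown fox",
    2: "brown quick the fox",
    3: "a quick brown dog",
}
index = build_index(docs)
matches = phrase_search(index, "quick brown")
```

Document 2 contains both words but in the wrong order, so it is excluded from `matches`. Checking positions, rather than mere presence, is what separates phrase search from an ordinary keyword query.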