Hacker News with Generative AI: Parallel Computing

FFN Fusion: Rethinking Sequential Computation in Large Language Models (arxiv.org)
We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization.
I want a good parallel computer (raphlinus.github.io)
The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?
Parallel Histogram Computation with CUDA (khushi-411.github.io)
The aim of this blog post is to introduce the parallel histogram pattern, where each output element can be updated by any thread. Therefore, threads must coordinate as they update the output values. In this post, we first look at using atomic operations to serialize the updates to each element. Then, we study an optimization technique: privatization. Let’s dig in!
Sorting algorithms with CUDA (ashwanirathee.com)
Building on my previous post on sorting algorithms, I implemented the same algorithms using CUDA to explore performance improvements through parallel computing.
Speeding up computational lithography with the power and parallelism of GPUs (semiengineering.com)
A new lithography library brings mask optimization operations to GPUs.
Taichi: High-Performance Parallel Programming in Python (taichi-lang.org)
Taichi is a domain-specific language embedded in Python that helps you easily write portable, high-performance parallel programs.
3FS – a parallel file system from DeepSeek (twitter.com)
DualPipe: Bidirectional pipeline parallelism algorithm (github.com/deepseek-ai)
DualPipe is an innovative bidirectional pipeline parallelism algorithm introduced in the DeepSeek-V3 Technical Report. It achieves full overlap of forward and backward computation-communication phases, also reducing pipeline bubbles. For detailed information on computation-communication overlap, please refer to the profile data.
DeepSeek Open Source Optimized Parallelism Strategies, 3 repos (github.com/deepseek-ai)
Here, we publicly share profiling data from our training and inference framework to help the community better understand the communication-computation overlap strategies and low-level implementation details.
TabulaROSA: Tabular OS Massively Parallel Heterogeneous Compute Engines (2018) (arxiv.org)
The rise in computing hardware choices is driving a reevaluation of operating systems.
Visualizing 6D Mesh Parallelism (main-horse.github.io)
This is a companion longpost for a fun project I’ve yet to finish. In here, I show the reader how I personally visualize the collective communications involved in a simple 2⁶ 6D parallel mesh:
Programming Language Memory Models (2021) (swtch.com)
Programming language memory models answer the question of what behaviors parallel programs can rely on to share memory between their threads.
Amdahl's Law (wikipedia.org)
In computer architecture, Amdahl's law (or Amdahl's argument) is a formula that shows how much faster a task can be completed when you add more resources to the system.
Show HN: Chili. Rust port of Spice, a low-overhead parallelization library (github.com/dragostis)
Richard Feynman and the Connection Machine (1989) (longnow.org)
One day when I was having lunch with Richard Feynman, I mentioned to him that I was planning to start a company to build a parallel computer with a million processors. His reaction was unequivocal, "That is positively the dopiest idea I ever heard."
I want a good parallel computer [video] (youtube.com)
Parallel Nix Evaluation (determinate.systems)
Welcome to the Parallel Future of Computation (higherorderco.com)
Bend: A Parallel Language (pages.dev)
Bend: A Python-Like Parallel Language for GPUs and Multicore CPUs (higherorderco.com)
Consistency LLM: converting LLMs to parallel decoders accelerates inference 3.5x (hao-ai-lab.github.io)
Helios-NG: massively-parallel OS for manycore CPUs (geekdot.com)