Hacker News with Generative AI: Deep Learning

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch (arxiv.org)
CUDA Graphs -- a recent hardware feature introduced for NVIDIA GPUs -- aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs today faces several challenges due to the static structure of a graph, and they incur performance overhead from extra data copies. In fact, we show a counter-intuitive result -- deploying CUDA Graphs hurts performance in many cases.
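For context, a minimal sketch of the mechanism the paper builds on, using PyTorch's public `torch.cuda.CUDAGraph` API (the model, shapes, and warmup count here are illustrative). The static input buffer also makes visible where the data-copy overhead mentioned above comes from.

```python
import torch

model = torch.nn.Linear(64, 64).to("cuda")
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: every kernel launched inside this block is recorded into a DAG.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: fresh data must first be copied into the captured buffer --
# this copy is the kind of overhead the abstract refers to.
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
print(static_output.shape)
```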
Three things everyone should know about Vision Transformers (arxiv.org)
After their initial success in natural language processing, transformer architectures have rapidly gained traction in computer vision, providing state-of-the-art results for tasks such as image classification, detection, segmentation, and video analysis.
Double Descent Demystified: size of smallest non-zero singular value of X (arxiv.org)
Double descent is a surprising phenomenon in machine learning in which, as the number of model parameters grows relative to the number of data points, test error drops as models grow ever larger into the highly overparameterized (data-undersampled) regime.
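A small NumPy illustration of the quantity in the title (the sizes and noise level are arbitrary): near the interpolation threshold n ≈ d, the data matrix X tends to have a tiny smallest non-zero singular value, and the minimum-norm least-squares fit amplifies noise by roughly its inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 49  # near the interpolation threshold n ≈ d
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Smallest non-zero singular value of X.
sigma = np.linalg.svd(X, compute_uv=False)
sigma_min = sigma[sigma > 1e-10].min()

# Minimum-norm interpolating solution w = X^+ y; its norm blows up
# as sigma_min shrinks, which is the peak of the double-descent curve.
w = np.linalg.pinv(X) @ y
print(f"smallest non-zero singular value: {sigma_min:.4f}")
print(f"norm of fitted weights: {np.linalg.norm(w):.2f}")
```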
Improving Deep Learning with a Little Help from Physics (quantamagazine.org)
Rose Yu has a plan for how to make AI better, faster and smarter — and it’s already yielding results.
Show HN: Keep your PyTorch model in VRAM by hot swapping code (github.com/valine)
This is an example of how to hotswap PyTorch training code without unloading your model weights from VRAM.
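The general trick, sketched with a hypothetical `train_step` module (this is the generic `importlib.reload` pattern, not necessarily the repo's exact mechanism): keep the model object alive in the driver process so its CUDA allocations persist, and reload only the module holding the training-step code.

```python
import importlib
import torch
import train_step  # hypothetical module defining step(model, optimizer)

model = torch.nn.Linear(512, 512).to("cuda")   # weights land in VRAM once
optimizer = torch.optim.AdamW(model.parameters())

for i in range(1000):
    importlib.reload(train_step)               # pick up code edits from disk
    loss = train_step.step(model, optimizer)   # model tensors never reloaded
    if i % 100 == 0:
        print(i, loss)
```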
Sparsely-Gated Mixture of Experts (MoE) (thegreenplace.net)
In transformer models, the attention block is typically followed by a feed-forward layer (FF): a simple fully connected network with one hidden layer and a nonlinearity.
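As a reference point, a minimal sketch of the sparsely-gated MoE idea the post builds toward (the dimensions and top-k choice are illustrative): the single FF block is replaced by several expert FF blocks, and a learned gate routes each token to its top-k experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        # Each expert is the usual FF block: linear -> nonlinearity -> linear.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.gate(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1) # keep top-k experts/token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(16, 64)          # 16 tokens
print(SparseMoE()(x).shape)      # torch.Size([16, 64])
```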
A curated blog for learning LLM internals: tokenize, attention, PE, and more (ycombinator.com)
I've been diving deep into the internals of Large Language Models (LLMs) and started documenting my findings.
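As a taste of the "attention" item on that list, a minimal single-head scaled dot-product attention sketch (shapes are illustrative; no masking or learned projections):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V -- the core op behind transformer attention
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 10, 64)   # (batch, tokens, dim), illustrative
print(attention(q, k, v).shape)      # torch.Size([1, 10, 64])
```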
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation (lllyasviel.github.io)
Diffuse thousands of frames at a full 30 fps with 13B models using 6GB of laptop GPU memory. Finetune a 13B video model at batch size 64 on a single 8xA100/H100 node for personal/lab experiments. A personal RTX 4090 generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (with TeaCache). No timestep distillation. Video diffusion, but it feels like image diffusion.
Show HN: I built a deep learning engine from scratch in Python (github.com/whitegra)
It implements deep learning architectures and training logic without relying on NumPy, PyTorch, or any external libraries. Every operation -- tensor arithmetic, backpropagation, attention, and optimization -- is executed through hand-written, minimal Python logic.
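For flavor, a generic pure-Python scalar autograd sketch in the same spirit (this is not the repo's code; the `Scalar` class and single `__mul__` op are illustrative):

```python
class Scalar:
    # Each value remembers its parents and a rule for pushing gradients back.
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents = parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Scalar(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def backprop(self):
        # Topological order, then apply each local rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x, w = Scalar(3.0), Scalar(2.0)
y = x * w
y.backprop()
print(x.grad, w.grad)  # 2.0 3.0
```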
Understanding Some Limits of DeepSeek (ycombinator.com)
Recently I asked DeepSeek how to use JavaScript to extract data from Moodle and make computations with it in an educational setting. I noted that the program did not consider two crucial points: 1) it is of utmost importance that the answer and the grade of the answer be in the same row; 2) it must not modify the student's answer. In this case the answers are ten letters, given in response to a test with ten questions.
The path to open-sourcing the DeepSeek inference engine (github.com/deepseek-ai)
A few weeks ago, during Open Source Week, we open-sourced several libraries. The response from the community has been incredibly positive - sparking inspiring collaborations, productive discussions, and valuable bug fixes. Encouraged by this, we’ve decided to take another step forward: contributing our internal inference engine back to the open-source community.
NoProp: Training neural networks without back-propagation or forward-propagation (arxiv.org)
The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter.
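Concretely, the canonical recursion the paper sets out to avoid: with pre-activations $z^{(l)} = W^{(l)} h^{(l-1)}$ and activations $h^{(l)} = \sigma(z^{(l)})$, the error signal $\delta^{(l)} := \partial \ell / \partial z^{(l)}$ is propagated backward layer by layer.

```latex
\delta^{(L)} = \sigma'\!\big(z^{(L)}\big) \odot \nabla_{h^{(L)}} \ell,
\qquad
\delta^{(l)} = \sigma'\!\big(z^{(l)}\big) \odot \big(W^{(l+1)}\big)^{\!\top} \delta^{(l+1)},
\qquad
\frac{\partial \ell}{\partial W^{(l)}} = \delta^{(l)} \, \big(h^{(l-1)}\big)^{\!\top}.
```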
Universal photonic artificial intelligence acceleration (nature.com)
Over the past decade, photonics research has explored accelerated tensor operations, foundational to artificial intelligence (AI) and deep learning, as a path towards enhanced energy efficiency and performance.
Tom and Jerry One-Minute Video Generation with Test-Time Training (test-time-training.github.io)
Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos with strong temporal consistency and motion smoothness.
Deep Learning, Deep Scandal (garymarcus.substack.com)
Deep learning is indeed finally hitting a wall, in the sense of reaching a point of diminishing returns. That’s been clear for months. One of the clearest signs of this is the saga of the just-released Llama 4, the latest failed billion (?) dollar attempt by one of the majors to create what we might call GPT-5 level AI.
AI masters Minecraft: DeepMind program finds diamonds without being taught (nature.com)
An artificial intelligence (AI) system has for the first time figured out how to collect diamonds in the hugely popular video game Minecraft — a difficult task requiring multiple steps — without being shown how to play.
AI image recognition detects bubble-like structures in the universe (phys.org)
To learn more about the deepest reaches of our own galaxy and the mysteries of star formation, Japanese researchers have created a deep learning model.
The Matrix Calculus You Need for Deep Learning (explained.ai)
Most of us last saw calculus in school, but derivatives are a critical part of machine learning, particularly deep neural networks, which are trained by optimizing a loss function.
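A representative worked example of the kind of derivative the guide builds up to: the gradient of a least-squares loss with respect to a weight vector $\mathbf{w}$.

```latex
L(\mathbf{w}) = \lVert X\mathbf{w} - \mathbf{y} \rVert_2^2
             = (X\mathbf{w}-\mathbf{y})^{\top}(X\mathbf{w}-\mathbf{y}),
\qquad
\nabla_{\mathbf{w}} L = 2\, X^{\top} (X\mathbf{w} - \mathbf{y}).
```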
Self-Supervised Learning from Images with JEPA (2023) (arxiv.org)
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations.
How DeepSeek Rewrote the Transformer [video] (youtube.com)
Physics-Based Deep Learning v4 (arxiv.org)
This document is a hands-on, comprehensive guide to deep learning in the realm of physical simulations.
Optimizing ML training with metagradient descent (arxiv.org)
A major challenge in training large-scale machine learning models is configuring the training process to maximize model performance, i.e., finding the best training setup from a vast design space.
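For intuition, a toy sketch of what a metagradient is (this is generic unrolled differentiation, not the paper's algorithm -- their contribution is making such gradients scale; the data sizes and the choice of learning rate as the meta-parameter are illustrative): run a few differentiable inner SGD steps, then backpropagate a validation loss through the whole run to the training-setup parameter.

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(32, 4), torch.randn(32, 1)     # train split
Xv, yv = torch.randn(16, 4), torch.randn(16, 1)   # validation split

lr = torch.tensor(0.05, requires_grad=True)       # meta-parameter
wt = torch.zeros(4, 1, requires_grad=True)        # model weights

for _ in range(10):                               # differentiable inner loop
    loss = ((X @ wt - y) ** 2).mean()
    (g,) = torch.autograd.grad(loss, wt, create_graph=True)
    wt = wt - lr * g                              # SGD step stays in the graph

val_loss = ((Xv @ wt - yv) ** 2).mean()
val_loss.backward()                               # backprop through training
print(lr.grad)                                    # d(val loss) / d(learning rate)
```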
VGGT: Visual Geometry Grounded Transformer (github.com/facebookresearch)
DeepSeek V3 is now the highest scoring non-reasoning model (twitter.com)
The Original 2012 AlexNet Is Open Source Now (github.com/computerhistory)
This package contains the original 2012 AlexNet code.
Attention is NOT all you need (twitter.com)
MIT 6.S191: Deep Generative Modeling [video] (youtube.com)
Deepseek V3-0324 (huggingface.co)
Mac Studio M3 Ultra can run Deepseek R1 671B in memory using <200W (techradar.com)
PyTorch Internals: Ezyang's Blog (ezyang.com)
This post is a long form essay version of a talk about PyTorch internals, that I gave at the PyTorch NYC meetup on May 14, 2019.