Hacker News with Generative AI: Deep Learning

Show HN: Free mammogram analysis tool combining deep learning and vision LLM (neuralrad.com:5300)
Scaling RNNs to Billions of Parameters with Zero Order (arxiv.org)
During inference, Recurrent Neural Networks (RNNs) require constant FLOPs and GPU memory per token regardless of context length, as they compress all prior tokens into a fixed-size memory.
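A minimal sketch of that property, using a stock PyTorch GRU cell rather than the paper's architecture: every new token only updates the same fixed-size hidden state, so per-token cost does not grow with context length.

    import torch
    import torch.nn as nn

    # A minimal sketch (a stock GRU cell, not the paper's architecture): each new
    # token only updates a fixed-size hidden state, so per-token FLOPs and memory
    # stay constant no matter how long the context gets.
    vocab, d_model = 1000, 256
    embed = nn.Embedding(vocab, d_model)
    cell = nn.GRUCell(d_model, d_model)

    h = torch.zeros(1, d_model)                  # the fixed-size memory
    tokens = torch.randint(0, vocab, (10_000,))
    with torch.no_grad():                        # inference only
        for t in tokens:                         # 10k tokens, identical per-step cost
            h = cell(embed(t.view(1)), h)        # O(d_model^2) FLOPs per token
    print(h.shape)                               # torch.Size([1, 256])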
You could have invented Transformers (gwern.net)
‘You Could Have Invented Transformers’ tutorial proposal
Ask HN: AI Reading List (ycombinator.com)
In the thread about John Carmack's presentation, somebody mentioned the reading list he got from Ilya, which was crucial for understanding what matters and the state of the knowledge at the time.

After some googling, it seems like this list is plausible, although not confirmed: https://github.com/dzyim/ilya-sutskever-recommended-reading?tab=readme-ov-file

What would an updated list look like today?
Attention Wasn't All We Needed (stephendiehl.com)
There are many modern transformer techniques that have been developed since the original Attention Is All You Need paper. Let's look at some of the most important ones and try to implement the basic ideas as succinctly as possible. We'll use the PyTorch framework for most of the examples.
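One example of the kind of technique surveyed there is RMSNorm, the mean-free LayerNorm variant used in LLaMA-style models (whether the post covers this exact one is an assumption); a short PyTorch sketch in the same "basic idea, succinctly" spirit:

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        """RMSNorm (Zhang & Sennrich, 2019): LayerNorm without mean-centering,
        one of the post-'Attention Is All You Need' refinements."""
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # normalize by the root-mean-square of the last dimension
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return x * rms * self.weight

    x = torch.randn(2, 16, 512)
    print(RMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])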
Deep Learning is no Intelligence (cullmann.dev)
Here we are in the year 2025 and every company that wants to grab your money now peddles AI.
The Annotated Kolmogorov-Arnold Network (Kan) (alexzhang13.github.io)
Deep neural networks have been the driving force of developments in AI in the last decade. However, they currently suffer from several known issues such as a lack of interpretability, scaling issues, and data inefficiency – in other words, while they are powerful, they are not a perfect solution.
SUS backprop: linear backpropagation algorithm for long inputs in transformers (arxiv.org)
It is straightforward to design an unbiased gradient estimator that stochastically cuts the backpropagation flow through any part of a computational graph.
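An illustrative version of such an estimator in PyTorch (not the paper's SUS scheme): the forward pass is the identity, while the backward pass keeps the gradient with probability q and rescales it by 1/q, so the estimate is unbiased.

    import torch

    class StochasticGrad(torch.autograd.Function):
        """Illustrative unbiased gradient cut (not the paper's exact estimator):
        forward is the identity; backward passes the gradient with probability q,
        rescaled by 1/q, so E[grad] equals the true gradient."""
        @staticmethod
        def forward(ctx, x, q):
            ctx.q = q
            return x

        @staticmethod
        def backward(ctx, grad_out):
            if torch.rand(()) < ctx.q:
                return grad_out / ctx.q, None   # keep and rescale
            return torch.zeros_like(grad_out), None  # cut the flow

    x = torch.randn(4, requires_grad=True)
    y = StochasticGrad.apply(x, 0.5).sum()
    y.backward()
    print(x.grad)  # entries are 2.0 or 0.0; they average to 1.0 over many runs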
µPC: Scaling Predictive Coding to 100 Layer Networks (arxiv.org)
The biological implausibility of backpropagation (BP) has motivated many alternative, brain-inspired algorithms that attempt to rely only on local information, such as predictive coding (PC) and equilibrium propagation. However, these algorithms have notoriously struggled to train very deep networks, preventing them from competing with BP in large-scale settings. Indeed, scaling PC networks (PCNs) has recently been posed as a challenge for the community (Pinchetti et al., 2024).
Deep Learning Is Applied Topology (theahura.substack.com)
When I think about AI, I think about topology.
Questioning Representational Optimism in Deep Learning (github.com/akarshkumar0101)
Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance.
You could have designed state of the art positional encoding (huggingface.co)
This post walks you through the step-by-step discovery of state-of-the-art positional encoding in transformer models.
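As a reference point for that walkthrough, here is the classic sinusoidal encoding from Attention Is All You Need (the assumption being that the post starts from something like this before moving to relative and rotary schemes):

    import torch

    def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
        """Classic sinusoidal positional encoding:
        PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
        PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (L, 1)
        i = torch.arange(0, d_model, 2, dtype=torch.float32)            # (d/2,)
        angles = pos / torch.pow(10000.0, i / d_model)                  # (L, d/2)
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe

    print(sinusoidal_positions(128, 512).shape)  # torch.Size([128, 512])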
Show HN: A highly extensible framework for building OCR systems (github.com/robbyzhaox)
MyOCR is a highly extensible and customizable framework for building OCR systems. Engineers can easily train and integrate deep learning models into custom OCR pipelines for real-world applications.
Wav2Lip: Accurately Lip-Syncing Videos and OpenVINO (github.com/openvinotoolkit)
Byte latent transformer: Patches scale better than tokens (2024) (arxiv.org)
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness.
The Speed of ViTs and CNNs (eyer.be)
It is often stated that because of the quadratic self-attention, ViTs aren't practical at higher resolution.
Mixture of Tunable Experts: DeepSeek R1 Behavior Modification at Inference Time (huggingface.co)
CosAE: Learnable Fourier Series for Image Restoration (sifeiliu.net)
In this paper, we introduce CosAE (Cosine Autoencoder), a novel, generic Autoencoder that seamlessly leverages the classic Fourier series with a feed-forward neural network.
PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch (arxiv.org)
CUDA Graphs -- a recent hardware feature introduced for NVIDIA GPUs -- aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data copy. In fact, we show a counter-intuitive result -- deploying CUDA Graphs hurts performance in many cases.
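For context, this is the manual CUDA Graph workflow stock PyTorch exposes today, a sketch of the baseline rather than of PyGraph; it needs a CUDA device, static shapes, and the copy of new data into captured buffers that the abstract calls out.

    import torch

    # Minimal sketch of stock PyTorch CUDA Graph capture (the baseline, not PyGraph).
    # Requires a CUDA GPU; inputs must keep fixed shapes and live in static buffers.
    assert torch.cuda.is_available()
    model = torch.nn.Linear(1024, 1024).cuda().eval()
    static_in = torch.randn(64, 1024, device="cuda")

    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g), torch.no_grad():   # capture the kernel launches as a DAG
        static_out = model(static_in)

    # Replay relaunches the whole DAG with a single CPU-side call; new inputs must
    # first be copied into the captured buffer (the data-copy overhead noted above).
    static_in.copy_(torch.randn(64, 1024, device="cuda"))
    g.replay()
    print(static_out.sum().item())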
Three things everyone should know about Vision Transformers (arxiv.org)
After their initial success in natural language processing, transformer architectures have rapidly gained traction in computer vision, providing state-of-the-art results for tasks such as image classification, detection, segmentation, and video analysis.
Double Descent Demystified: size of smallest non-zero singular value of X (arxiv.org)
Double descent is a surprising phenomenon in machine learning in which, as the number of model parameters grows relative to the number of data points, test error drops as models grow ever larger into the highly overparameterized (data-undersampled) regime.
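A small NumPy sketch of the connection in the title (illustrative, not the paper's experiments): near the interpolation threshold n ≈ p, the smallest nonzero singular value of the design matrix X tends to be tiny, which inflates the minimum-norm least-squares solution.

    import numpy as np

    # Illustrative sketch: near the interpolation threshold (n ~ p) the smallest
    # nonzero singular value of X shrinks, blowing up the minimum-norm solution.
    rng = np.random.default_rng(0)
    p = 100
    for n in (50, 95, 100, 105, 200):           # samples vs. p = 100 parameters
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)
        sigma_min = np.linalg.svd(X, compute_uv=False).min()
        w = np.linalg.pinv(X) @ y               # minimum-norm least-squares solution
        print(f"n={n:4d}  sigma_min={sigma_min:8.3f}  ||w||={np.linalg.norm(w):8.3f}")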
Improving Deep Learning with a Little Help from Physics (quantamagazine.org)
Rose Yu has a plan for how to make AI better, faster and smarter — and it’s already yielding results.
Show HN: Keep your PyTorch model in VRAM by hot swapping code (github.com/valine)
This is an example of how to hotswap PyTorch training code without unloading your model weights from VRAM.
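The general pattern behind this (a sketch of the idea, not the repo's code; train_step is a hypothetical module) is to keep the model in a long-running process and importlib.reload only the module that defines the training step:

    import importlib
    import torch
    import train_step   # hypothetical module exposing step(model, batch) -> loss

    # Sketch of the idea, not the repo's exact code: the model is created once and
    # stays resident in VRAM in this long-running process; only the training logic
    # in train_step.py is reloaded from disk between iterations.
    model = torch.nn.Linear(4096, 4096).cuda()

    for _ in range(100):
        batch = torch.randn(8, 4096, device="cuda")
        importlib.reload(train_step)          # pick up any edits to train_step.py
        loss = train_step.step(model, batch)  # run the freshly reloaded code
        print(float(loss))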
Sparsely-Gated Mixture of Experts (MoE) (thegreenplace.net)
In transformer models, the attention block is typically followed by a feed forward layer (FF), which is a simple fully-connected NN with a hidden layer and nonlinearity.
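In a sparsely-gated MoE, that single FF block is replaced by several expert FF networks plus a router that sends each token to its top-k experts. A compact PyTorch sketch, simplified to omit load-balancing losses and capacity limits:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        """Simplified sparsely-gated MoE layer: a router picks the top-k expert
        FF networks per token (no load-balancing loss or capacity limits)."""
        def __init__(self, dim: int, hidden: int, n_experts: int = 8, k: int = 2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(dim, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(n_experts)
            ])

        def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, dim)
            scores = self.router(x)                             # (tokens, n_experts)
            topv, topi = scores.topk(self.k, dim=-1)            # routes per token
            gates = F.softmax(topv, dim=-1)                     # renormalized weights
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = topi[:, slot] == e                   # tokens routed to expert e
                    if mask.any():
                        out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    x = torch.randn(16, 512)
    print(TopKMoE(512, 2048)(x).shape)  # torch.Size([16, 512])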
A curated blog for learning LLM internals: tokenize, attention, PE, and more (ycombinator.com)
I've been diving deep into the internals of Large Language Models (LLMs) and started documenting my findings.
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation (lllyasviel.github.io)
Diffuse thousands of frames at a full 30 fps with 13B models using 6 GB of laptop GPU memory. Finetune a 13B video model at batch size 64 on a single 8xA100/H100 node for personal/lab experiments. A personal RTX 4090 generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (with TeaCache). No timestep distillation. Video diffusion, but it feels like image diffusion.
Show HN: I built a deep learning engine from scratch in Python (github.com/whitegra)
It implements deep learning architecture and training logic without relying on NumPy, PyTorch, or any external libraries. Every operation—tensor arithmetic, backpropagation, attention, and optimization—is executed through hand-written, minimal Python logic.
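The core piece that makes a dependency-free engine like this possible is reverse-mode autograd; here is a tiny scalar version in plain Python (illustrative, not the repo's implementation):

    # Tiny illustrative scalar autograd in plain Python (not the repo's code):
    # each Value records its parents and a closure that accumulates gradients.
    class Value:
        def __init__(self, data, parents=(), backward=lambda: None):
            self.data, self.grad = data, 0.0
            self._parents, self._backward = parents, backward

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def backward():
                self.grad += out.grad
                other.grad += out.grad
            out._backward = backward
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def backward():
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._backward = backward
            return out

        def backprop(self):
            # topological order, then apply the chain rule from the output back
            order, seen = [], set()
            def visit(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        visit(p)
                    order.append(v)
            visit(self)
            self.grad = 1.0
            for v in reversed(order):
                v._backward()

    x, w = Value(3.0), Value(-2.0)
    loss = x * w + x
    loss.backprop()
    print(x.grad, w.grad)  # -1.0 3.0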
Understanding Some Limits of DeepSeek (ycombinator.com)
Recently I asked DeepSeek how to use JavaScript to extract and make computations with Moodle in education. I noticed that the program did not consider two crucial points: 1) It is of utmost importance that the answer and the grade of the answer be in the same row. 2) Don't modify the student's answer. In this case the answers are ten letters, in response to a test with ten questions.
The path to open-sourcing the DeepSeek inference engine (github.com/deepseek-ai)
A few weeks ago, during Open Source Week, we open-sourced several libraries. The response from the community has been incredibly positive - sparking inspiring collaborations, productive discussions, and valuable bug fixes. Encouraged by this, we’ve decided to take another step forward: contributing our internal inference engine back to the open-source community.
NoProp: Training neural networks without back-propagation or forward-propagation (arxiv.org)
The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter.