Hacker News with Generative AI: Computer Vision

Three things everyone should know about Vision Transformers (arxiv.org)
After their initial success in natural language processing, transformer architectures have rapidly gained traction in computer vision, providing state-of-the-art results for tasks such as image classification, detection, segmentation, and video analysis.
π0.5: A VLA with open-world generalization (pi.website)
Robots have come a long way over the past few years—they can perform impressive acrobatic feats, dance on stage, follow language commands and, in some of our own results, perform complex tasks like folding laundry or cleaning off a table. But the biggest challenge in robotics is not in performing feats of agility or dexterity, but generalization: the ability to figure out how to correctly perform even a simple task in a new setting or with new objects.
Gemma 3 QAT Models: Bringing AI to Consumer GPUs (googleblog.com)
Last month, we launched Gemma 3, our latest generation of open models. Delivering state-of-the-art performance, Gemma 3 quickly established itself as a leading model capable of running on a single high-end GPU like the NVIDIA H100 using its native BFloat16 (BF16) precision.
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation (lllyasviel.github.io)
Diffuse thousands of frames at full 30 fps with 13B models using 6 GB of laptop GPU memory. Finetune a 13B video model at batch size 64 on a single 8xA100/H100 node for personal/lab experiments. A personal RTX 4090 generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (with TeaCache). No timestep distillation. Video diffusion, but it feels like image diffusion.
SDFs from Unoriented Point Clouds Using Neural Variational Heat Distances (arxiv.org)
We propose a novel variational approach for computing neural Signed Distance Fields (SDF) from unoriented point clouds.
UniK3D: Universal Camera Monocular 3D Estimation (lpiccinelli-eth.github.io)
To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera.
Liquid: Language models are scalable and unified multi-modal generators (foundationvision.github.io)
We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language.
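The core idea, one vocabulary covering both text tokens and discrete image codes so a single autoregressive model handles both modalities, can be sketched in a few lines. This is a toy illustration with made-up sizes, not Liquid's actual tokenizer or vocabulary layout:

```python
import numpy as np

# Hypothetical sizes for illustration only; Liquid's real vocabulary,
# tokenizer, and embedding dimensions differ.
TEXT_VOCAB = 1000          # ids [0, 1000) are text tokens
IMAGE_CODES = 256          # ids [1000, 1256) are discrete image codes
VOCAB = TEXT_VOCAB + IMAGE_CODES
DIM = 32

rng = np.random.default_rng(0)
embedding = rng.normal(size=(VOCAB, DIM))  # one table shared by both modalities

def image_code_to_id(code: int) -> int:
    """Map a discrete image code into the shared vocabulary."""
    return TEXT_VOCAB + code

# A mixed sequence: text ids followed by image-code ids, embedded jointly,
# so one next-token predictor can emit either kind of token.
sequence = [5, 42, image_code_to_id(7), image_code_to_id(200)]
embedded = embedding[sequence]
print(embedded.shape)  # (4, 32)
```

Because both modalities live in the same feature space, generation and comprehension reduce to the same next-token objective over a mixed sequence.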
LightlyTrain: Better Vision Models, Faster – No Labels Needed (github.com/lightly-ai)
LightlyTrain brings self-supervised pretraining to real-world computer vision pipelines, using your unlabeled data to reduce labeling costs and speed up model deployment.
Watermark segmentation (github.com/Diffusion-Dynamics)
This repository by Diffusion Dynamics showcases the core technology behind the watermark segmentation capabilities of our first product, clear.photo. The work leverages insights from research on diffusion models for image restoration tasks.
Tom and Jerry One-Minute Video Generation with Test-Time Training (test-time-training.github.io)
Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos with strong temporal consistency and motion smoothness.
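A test-time-training (TTT) layer treats its hidden state as the weights of a small model that is updated by a gradient step on a self-supervised loss for each incoming token. The sketch below is a minimal illustration of that inner loop with an assumed reconstruction loss; the paper's actual parameterization and objective differ:

```python
import numpy as np

def ttt_linear(tokens, lr=0.1):
    """Toy TTT layer: the hidden state is a weight matrix W, updated at
    test time by one gradient step per token on ||W x - x||^2."""
    d = tokens.shape[1]
    rng = np.random.default_rng(0)
    W = 0.1 * rng.normal(size=(d, d))        # hidden state: a small linear model
    outputs = []
    for x in tokens:
        grad = 2.0 * np.outer(W @ x - x, x)  # gradient of ||W x - x||^2 w.r.t. W
        W = W - lr * grad                    # hidden-state update at test time
        outputs.append(W @ x)                # layer output after the update
    return np.stack(outputs)

seq = np.random.default_rng(1).normal(size=(8, 4))
out = ttt_linear(seq)
print(out.shape)  # (8, 4)
```

Because the state is itself a learned model rather than a fixed-size vector, such layers can in principle accumulate context over very long sequences, which is what makes minute-long video generation tractable.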
TripoSG – Text to 3D Model (github.com/VAST-AI-Research)
TripoSG is an advanced, high-fidelity, high-quality, and highly generalizable image-to-3D generation foundation model.
No elephants: Breakthroughs in image generation (oneusefulthing.org)
Over the past two weeks, first Google and then OpenAI rolled out their multimodal image generation abilities. This is a big deal.
Tenstorrent Launches Blackhole Developer Products at Tenstorrent Dev Day (tenstorrent.com)
Tenstorrent launched the next generation Blackhole™ chip family today at their DevDay event in San Francisco.
QVQ-Max: Think with Evidence (qwenlm.github.io)
Last December, we launched QVQ-72B-Preview as an exploratory model, but it had many issues. Today, we are officially releasing the first version of QVQ-Max, our visual reasoning model.
We Built an AI Tool to Create High-Quality 3D Models from Regular Videos (ycombinator.com)
We're a team of AI researchers passionate about simplifying 3D modeling. We've built an easy-to-use tool that generates detailed, high-quality 3D models directly from regular videos.
Waymo's Foundation Model for Autonomous Driving with Drago Anguelov [video] (youtube.com)
Mirrors: The Blind Spot of Image and Video Generation Models (medium.com)
Recent advances in image generation models have demonstrated remarkable capabilities in creating photorealistic and imaginative visuals. However, a persistent challenge remains: accurately rendering reflections in mirrors.
Apple's Cubify Anything: Scaling Indoor 3D Object Detection (github.com/apple)
This repository includes the public implementation of Cubify Transformer and the associated CA-1M dataset.
Self-Supervised Learning from Images with JEPA (2023) (arxiv.org)
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations.
How DeepSeek Rewrote the Transformer [video] (youtube.com)
Estimating Camera Motion from a Single Motion-Blurred Image (jerredchen.github.io)
Given a single motion-blurred image, we exploit the motion blur cues to predict the camera velocity at that instant without performing any deblurring.
VGGT: Visual Geometry Grounded Transformer (github.com/facebookresearch)
Open source AI agent helper to let it SEE what it's doing (github.com/monteslu)
An MCP server that enables LLMs to "see" what's happening in browser-based games and applications through vectorized canvas visualization and debug information.
The Original 2012 AlexNet Is Open Source Now (github.com/computerhistory)
This package contains the original 2012 AlexNet code.
Can Large Vision Language Models Read Maps Like a Human? (arxiv.org)
In this paper, we introduce MapBench, the first dataset specifically designed for outdoor navigation from human-readable, pixel-based maps, curated from complex pathfinding scenarios.
Intel RealSense Stereo Depth Cameras (intelrealsense.com)
Stereo depth, LiDAR, coded-light, and tracking cameras from Intel RealSense.
Map Features in OpenStreetMap with Computer Vision (mozilla.ai)
Mozilla.ai developed and released the OpenStreetMap AI Helper Blueprint. If you love maps and are interested in training your own computer vision model, you’ll enjoy diving into this Blueprint.
Show HN: Torch Lens Maker – Differentiable Geometric Optics in PyTorch (victorpoughon.github.io)
Welcome to Torch Lens Maker, an open-source Python library for differentiable geometric optics based on PyTorch. Still a very experimental project, its goal is to enable the design of complex real-world optical systems (lenses, mirrors, etc.) using modern computer code and state-of-the-art numerical optimization.
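The underlying idea, treating optical design as gradient-based optimization of a differentiable forward model, can be shown with a deliberately tiny example. This sketch uses plain Python with finite-difference gradients instead of PyTorch autograd, and the thin-lens equation stands in for a real ray-traced system; it is not Torch Lens Maker's API:

```python
# Fit a thin lens's focal length f so the thin-lens equation
# 1/f = 1/d_o + 1/d_i yields a desired image distance d_i.

def image_distance(f, d_o):
    """Image distance from the thin-lens equation."""
    return 1.0 / (1.0 / f - 1.0 / d_o)

def loss(f, d_o=100.0, d_i_target=60.0):
    """Squared error between achieved and desired image distance."""
    return (image_distance(f, d_o) - d_i_target) ** 2

f, lr, eps = 30.0, 1e-3, 1e-6
for _ in range(2000):
    grad = (loss(f + eps) - loss(f - eps)) / (2 * eps)  # finite difference
    f -= lr * grad

# Closed form: 1/f = 1/100 + 1/60, so f = 37.5
print(round(f, 2))  # 37.5
```

A library like Torch Lens Maker replaces the toy forward model with differentiable ray tracing through lens surfaces, and the finite differences with autograd, but the optimization loop is conceptually the same.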
Nvidia GTC 2025 – Built for Reasoning, Vera Rubin, Kyber, Jensen Math, Feynman (semianalysis.com)
AI model progress has accelerated tremendously, and in the last six months, models have improved more than in the previous six months. This trend will continue because three scaling laws are stacked together and working in tandem: pre-training scaling, post-training scaling, and inference time scaling.