Hacker News with Generative AI: Computer Vision

Depth Anything V2 (depth-anything-v2.github.io)
Depth Anything V2 is trained on 595K synthetic labeled images and 62M+ real unlabeled images, providing the most capable monocular depth estimation (MDE) model. It offers more fine-grained details than Depth Anything V1, is more robust than Depth Anything V1 and SD-based models (e.g., Marigold, Geowizard), is more efficient (10x faster) and more lightweight than SD-based models, and shows impressive fine-tuned performance with our pre-trained models. We also release six metric depth models of three scales for indoor and outdoor scenes, respectively.
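As a rough usage sketch, the model can be run through the standard Hugging Face depth-estimation pipeline; the checkpoint name and image path below are assumptions, not the project's own loader:

```python
# Minimal sketch: Depth Anything V2 via the transformers "depth-estimation"
# pipeline. The checkpoint id is an assumption; check the project page for the
# officially published weights. Requires: pip install transformers torch pillow
from PIL import Image
from transformers import pipeline

pipe = pipeline(task="depth-estimation",
                model="depth-anything/Depth-Anything-V2-Small-hf")  # assumed checkpoint name

image = Image.open("example.jpg")          # any local RGB photo
result = pipe(image)                       # dict with the predicted depth
result["depth"].save("example_depth.png")  # depth map rendered as a PIL image
```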
PlainsightAI Releases OpenFilter: Framework For Universal Vision Workloads (github.com/PlainsightAI)
OpenFilter is a universal abstraction for building and running vision workloads in modular image/video processing pipelines.
Satellites Spotting Depth (marksblogg.com)
Depth Anything V2 is a depth estimation model that was released last year. It was developed by a team from TikTok and the University of Hong Kong (HKU). Almost 600K synthetic, labelled images and over 62M real, unlabelled images were used in its training.
ByteDance/Dolphin on HuggingFace (huggingface.co)
Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a novel multimodal document image parsing model that follows an analyze-then-parse paradigm. It addresses the challenges of complex document understanding through a two-stage approach designed to handle intertwined elements such as text paragraphs, figures, formulas, and tables.
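A schematic sketch of how an analyze-then-parse pipeline composes; `analyze_layout` and `parse_element` are hypothetical placeholders standing in for Dolphin's two stages, not the model's actual API:

```python
# Schematic sketch of an analyze-then-parse document flow. The two functions
# below are hypothetical stand-ins for the model's stages; they only show how
# the stages compose, not Dolphin's real interface.
from dataclasses import dataclass

@dataclass
class Element:
    kind: str    # e.g. "paragraph", "table", "formula", "figure"
    bbox: tuple  # (x0, y0, x1, y1) region in the page image

def analyze_layout(page_image) -> list[Element]:
    """Stage 1: scan the whole page and emit its elements in reading order."""
    return [Element("paragraph", (0, 0, 100, 40))]        # dummy output

def parse_element(page_image, element: Element) -> str:
    """Stage 2: parse one element, prompted by its type/region anchor."""
    return f"<{element.kind}>...</{element.kind}>"         # dummy output

def parse_document(page_image) -> str:
    return "\n\n".join(parse_element(page_image, el) for el in analyze_layout(page_image))

print(parse_document(page_image=None))
```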
Single RGB camera turns your palm into a keyboard for mixed reality interaction (arduino.cc)
Interactions in mixed reality are a challenge. Nobody wants to hold bulky controllers and type by clicking on big virtual keys one at a time. But people also don’t want to carry around dedicated physical keyboard devices just to type every now and then. That’s why a team of computer scientists from China’s Tsinghua University developed interaction technology called Palmpad that enables typing with just a single RGB camera and an Arduino.
Show HN: A highly extensible framework for building OCR systems (github.com/robbyzhaox)
MyOCR is a highly extensible and customizable framework for building OCR systems. Engineers can easily train and integrate deep learning models into custom OCR pipelines for real-world applications.
Steepest Descent Density Control for Compact 3D Gaussian Splatting (arxiv.org)
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time, high-resolution novel view synthesis.
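For readers new to the technique, here is a minimal sketch of the per-primitive parameters a 3DGS scene typically stores; the field shapes follow common implementations and are assumptions, not this paper's contribution:

```python
# Sketch of one splat's parameters in a typical 3D Gaussian Splatting scene:
# position, anisotropic scale + rotation (the covariance), opacity, and
# view-dependent color via spherical harmonics. Shapes are the usual ones.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian:
    mean: np.ndarray       # (3,)    world-space center
    scale: np.ndarray      # (3,)    per-axis extent (often stored in log-scale)
    rotation: np.ndarray   # (4,)    unit quaternion
    opacity: float         # scalar in [0, 1] after a sigmoid
    sh_coeffs: np.ndarray  # (16, 3) spherical-harmonic color coefficients
```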
Ollama's new engine for multimodal models (ollama.com)
Ollama now supports multimodal models via its new engine, starting with new vision multimodal models.
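A minimal sketch of calling a vision model through the official `ollama` Python client; the model name and image path are assumptions, and any vision-capable model pulled locally should behave the same way:

```python
# Minimal sketch using the official `ollama` Python client (pip install ollama).
# Assumes a vision-capable model has already been pulled, e.g. `ollama pull llama3.2-vision`.
import ollama

response = ollama.chat(
    model="llama3.2-vision",                  # assumed: any local vision model
    messages=[{
        "role": "user",
        "content": "Describe this image in one sentence.",
        "images": ["photo.jpg"],              # local path; the client handles encoding
    }],
)
print(response["message"]["content"])
```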
Show HN: Real-Time Gaussian Splatting (github.com/axbycc)
LiveSplat is an algorithm for realtime Gaussian splatting using RGBD camera streams.
Wav2Lip: Accurately Lip-Syncing Videos and OpenVINO (github.com/openvinotoolkit)
Bringing 3D shoppable products online with generative AI (research.google)
Discover how our latest AI models transform 2D product images into immersive 3D experiences for online shoppers.
FastVLM: Efficient vision encoding for vision language models (github.com/apple)
This is the official repository of FastVLM: Efficient Vision Encoding for Vision Language Models. (CVPR 2025)
Vision Now Available in Llama.cpp (github.com/ggml-org)
System lets robots identify an object's properties through handling (news.mit.edu)
With a novel simulation method, robots can guess the weight, softness, and other physical properties of an object just by picking it up.
AI focused on brain regions recreates what you're looking at (2024) (newscientist.com)
Artificial intelligence systems can now create remarkably accurate reconstructions of what someone is looking at based on recordings of their brain activity. These reconstructed images are greatly improved when the AI learns which parts of the brain to pay attention to.
Your ViT Is Secretly an Image Segmentation Model (arxiv.org)
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks.
The Speed of ViTs and CNNs (eyer.be)
It is often stated that because of the quadratic self-attention, ViTs aren't practical at higher resolution.
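A back-of-the-envelope sketch of that quadratic growth, using typical ViT-B/16 numbers (patch size 16, width 768, 12 layers) as assumptions and counting only the attention matmuls:

```python
# Rough sketch of how self-attention cost scales with input resolution for a
# ViT. Patch size, width, and depth are assumed ViT-B/16 defaults; projections
# and MLP blocks are ignored, so these are attention-only numbers.
def attention_flops(image_size, patch=16, dim=768, layers=12):
    tokens = (image_size // patch) ** 2
    # per layer: Q@K^T and attn@V each cost ~tokens^2 * dim multiply-adds
    return layers * 2 * tokens**2 * dim

for size in (224, 448, 896):
    print(f"{size:4d}px: {(size // 16) ** 2:5d} tokens, "
          f"~{attention_flops(size) / 1e9:8.1f} GFLOPs in attention")
```

Doubling the resolution quadruples the token count and multiplies the attention-only cost by roughly 16x, which is where the "impractical at high resolution" claim the article examines comes from.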
O3 beats a master-level GeoGuessr player, even with fake EXIF data (sampatt.com)
Gaussian Splatting Meets ROS2 (github.com/shadygm)
ROSplat is the first online ROS2-based visualizer that leverages Gaussian splatting to render complex 3D scenes.
Vision Transformers Need Registers (arxiv.org)
Transformers have recently emerged as a powerful tool for learning visual representations.
CosAE: Learnable Fourier Series for Image Restoration (sifeiliu.net)
In this paper, we introduce CosAE (Cosine Autoencoder), a novel, generic Autoencoder that seamlessly leverages the classic Fourier series with a feed-forward neural network.
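A toy 1-D sketch of the cosine-decoding idea, where the amplitudes, frequencies, and phases stand in for what an encoder network would predict; pure NumPy, and the 1-D setting is an assumption for illustration, not the paper's architecture:

```python
# Toy sketch: reconstruct a 1-D signal as a sum of cosines whose amplitude,
# frequency, and phase would come from an encoder's bottleneck. Illustrative
# only; not CosAE's actual layers or shapes.
import numpy as np

def cosine_decode(amplitude, frequency, phase, length=256):
    """Evaluate sum_k a_k * cos(2*pi*f_k*t + p_k) on a unit interval."""
    t = np.linspace(0.0, 1.0, length)                      # (length,)
    waves = amplitude[:, None] * np.cos(
        2 * np.pi * frequency[:, None] * t[None, :] + phase[:, None]
    )                                                      # (K, length)
    return waves.sum(axis=0)

# three hand-picked components stand in for a learned bottleneck
signal = cosine_decode(np.array([1.0, 0.5, 0.25]),
                       np.array([1.0, 3.0, 7.0]),
                       np.array([0.0, 0.3, 1.2]))
print(signal.shape)  # (256,)
```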
LLMs can see and hear without any training (github.com/facebookresearch)
Watching o3 guess a photo's location is surreal, dystopian and entertaining (simonwillison.net)
Watching OpenAI’s new o3 model guess where a photo was taken is one of those moments where decades of science fiction suddenly come to life. It’s a cross between the Enhance Button and Omniscient Database TV Tropes.
Three things everyone should know about Vision Transformers (arxiv.org)
After their initial success in natural language processing, transformer architectures have rapidly gained traction in computer vision, providing state-of-the-art results for tasks such as image classification, detection, segmentation, and video analysis.
π0.5: A VLA with open-world generalization (pi.website)
Robots have come a long way over the past few years—they can perform impressive acrobatic feats, dance on stage, follow language commands and, in some of our own results, perform complex tasks like folding laundry or cleaning off a table. But the biggest challenge in robotics is not in performing feats of agility or dexterity, but generalization: the ability to figure out how to correctly perform even a simple task in a new setting or with new objects.
Gemma 3 QAT Models: Bringing AI to Consumer GPUs (googleblog.com)
Last month, we launched Gemma 3, our latest generation of open models. Delivering state-of-the-art performance, Gemma 3 quickly established itself as a leading model capable of running on a single high-end GPU like the NVIDIA H100 using its native BFloat16 (BF16) precision.
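Rough weight-memory arithmetic behind the consumer-GPU angle, assuming the 27B-parameter size and ignoring KV cache, activations, and runtime overhead:

```python
# Back-of-the-envelope weight memory at different precisions. 27B parameters
# is assumed (Gemma 3's largest size); real usage adds KV cache and overhead.
params = 27e9
for name, bytes_per_param in (("BF16", 2.0), ("INT8", 1.0), ("INT4 (QAT)", 0.5)):
    print(f"{name:11s}: ~{params * bytes_per_param / 2**30:6.1f} GiB of weights")
```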
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation (lllyasviel.github.io)
Diffuse thousands of frames at a full 30 fps with 13B models using only 6GB of laptop GPU memory. Finetune a 13B video model at batch size 64 on a single 8xA100/H100 node for personal/lab experiments. A personal RTX 4090 generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (with TeaCache). No timestep distillation. Video diffusion, but it feels like image diffusion.
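Quick arithmetic on those throughput figures; the one-minute, 30 fps clip length is an assumption for illustration:

```python
# How long a clip takes at the quoted single-GPU rates (seconds per frame).
frames = 60 * 30  # one minute at 30 fps (assumed clip length)
for label, sec_per_frame in (("unoptimized", 2.5), ("TeaCache", 1.5)):
    total = frames * sec_per_frame
    print(f"RTX 4090, {label:11s}: {total / 60:5.1f} minutes for {frames} frames")
```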
SDFs from Unoriented Point Clouds Using Neural Variational Heat Distances (arxiv.org)
We propose a novel variational approach for computing neural Signed Distance Fields (SDF) from unoriented point clouds.
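A small illustration of what a signed distance field is (negative inside the surface, zero on it, positive outside), using an analytic sphere as a stand-in for the neural SDF the paper fits to a point cloud:

```python
# Analytic SDF of a sphere: the sign encodes inside/outside, the magnitude is
# the distance to the surface. A neural SDF approximates such a function.
import numpy as np

def sphere_sdf(points, center=np.zeros(3), radius=1.0):
    """Signed distance from `points` (N, 3) to a sphere."""
    return np.linalg.norm(points - center, axis=-1) - radius

pts = np.array([[0.0, 0.0, 0.0],   # center  -> -1.0 (inside)
                [1.0, 0.0, 0.0],   # surface ->  0.0
                [2.0, 0.0, 0.0]])  # outside -> +1.0
print(sphere_sdf(pts))
```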
UniK3D: Universal Camera Monocular 3D Estimation (lpiccinelli-eth.github.io)
We present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera.
Liquid: Language models are scalable and unified multi-modal generators (foundationvision.github.io)
We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language.
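A toy sketch of the shared-vocabulary idea: discrete image codes from a visual tokenizer are offset into the text vocabulary so a single autoregressive model predicts both modalities. The sizes are illustrative assumptions, not Liquid's actual configuration:

```python
# Toy sketch: extend a text vocabulary with image codebook entries so one
# next-token objective covers both modalities. Sizes are assumptions.
TEXT_VOCAB_SIZE = 32_000
IMAGE_CODEBOOK_SIZE = 8_192

def image_code_to_token_id(code: int) -> int:
    """Map a discrete image code into the extended, shared vocabulary."""
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

# a mixed sequence: text token ids followed by tokenized image patches
sequence = [17, 942, 5003] + [image_code_to_token_id(c) for c in (11, 4096, 77)]
print(sequence)  # one id space, one autoregressive model for text and vision
```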