Hacker News with Generative AI: Computer Vision

Self-Supervised Learning from Images with JEPA (2023) (arxiv.org)
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations.
How DeepSeek Rewrote the Transformer [video] (youtube.com)
Estimating Camera Motion from a Single Motion-Blurred Image (jerredchen.github.io)
Given a single motion-blurred image, we exploit the motion blur cues to predict the camera velocity at that instant without performing any deblurring.
VGGT: Visual Geometry Grounded Transformer (github.com/facebookresearch)
Open source AI agent helper to let it SEE what it's doing (github.com/monteslu)
An MCP server that enables LLMs to "see" what's happening in browser-based games and applications through vectorized canvas visualization and debug information.
The Original 2012 AlexNet Is Open Source Now (github.com/computerhistory)
This package contains the original 2012 AlexNet code.
Can Large Vision Language Models Read Maps Like a Human? (arxiv.org)
In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based outdoor map navigation, curated from complex pathfinding scenarios.
Intel RealSense Stereo Depth Cameras (intelrealsense.com)
Stereo Depth cameras, Lidar cameras, Coded light and Tracking cameras from Intel RealSense
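Stereo depth cameras like these triangulate distance from the disparity between matched pixels in the left and right imagers. A minimal sketch of the underlying pinhole-camera relation (the focal length and baseline below are illustrative values, not RealSense specifications):

```python
# Stereo depth from disparity: not the RealSense API, just the
# pinhole-camera relation  depth = focal_length * baseline / disparity.
def depth_from_disparity(disparity_px: float,
                         focal_length_px: float,
                         baseline_m: float) -> float:
    """Return depth in meters for a matched pixel pair.

    disparity_px:     horizontal pixel offset between left/right views
    focal_length_px:  focal length expressed in pixels
    baseline_m:       distance between the two camera centers in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Example: f = 640 px, baseline = 0.05 m (typical for a small stereo
# module); a 16 px disparity corresponds to 2.0 m.
print(depth_from_disparity(16, 640, 0.05))  # 2.0
```

Note how depth resolution degrades with distance: halving the disparity doubles the estimated depth, so far-away points are quantized much more coarsely than near ones.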
Map Features in OpenStreetMap with Computer Vision (mozilla.ai)
Mozilla.ai developed and released the OpenStreetMap AI Helper Blueprint. If you love maps and are interested in training your own computer vision model, you’ll enjoy diving into this Blueprint.
Show HN: Torch Lens Maker – Differentiable Geometric Optics in PyTorch (victorpoughon.github.io)
Welcome to Torch Lens Maker, an open-source Python library for differentiable geometric optics based on PyTorch. Currently a very experimental project, its goal is to enable the design of complex real-world optical systems (lenses, mirrors, etc.) using modern computer code and state-of-the-art numerical optimization.
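The core idea of differentiable lens design can be sketched even without PyTorch: express an optical quantity as a differentiable function of a design parameter, then descend its gradient. The stdlib-only toy below optimizes a thin-lens focal length via finite differences; Torch Lens Maker itself uses autograd and full ray tracing, so this is only an illustration of the principle:

```python
# Toy "differentiable optics": optimize a focal length so the thin-lens
# equation places the image at a target distance. Real libraries use
# autograd; here a forward finite difference stands in for the gradient.

def image_distance(f: float, s_o: float) -> float:
    """Thin-lens equation 1/f = 1/s_o + 1/s_i, solved for s_i."""
    return 1.0 / (1.0 / f - 1.0 / s_o)

def optimize_focal_length(s_o: float, target_s_i: float,
                          f0: float = 40.0, lr: float = 0.05,
                          steps: int = 2000) -> float:
    f, eps = f0, 1e-6
    for _ in range(steps):
        loss = (image_distance(f, s_o) - target_s_i) ** 2
        loss_eps = (image_distance(f + eps, s_o) - target_s_i) ** 2
        f -= lr * (loss_eps - loss) / eps  # gradient descent step
    return f

# Place the image 100 mm behind the lens for an object 300 mm away.
f = optimize_focal_length(s_o=300.0, target_s_i=100.0)
print(round(f, 2))  # 75.0, since 1/75 = 1/300 + 1/100
```

The same loop generalizes: swap the thin-lens formula for a full ray tracer and the scalar parameter for lens curvatures and spacings, and autograd supplies the gradients that finite differences approximate here.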
Nvidia GTC 2025 – Built for Reasoning, Vera Rubin, Kyber, Jensen Math, Feynman (semianalysis.com)
AI model progress has accelerated tremendously, and in the last six months, models have improved more than in the previous six months. This trend will continue because three scaling laws are stacked together and working in tandem: pre-training scaling, post-training scaling, and inference time scaling.
SmolDocling: An ultra-compact VLM for end-to-end multi-modal document conversion (arxiv.org)
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion.
Feature maps from CNNs have a weird similarity to ancient Egyptian hieroglyphs (twitter.com)
Tesla Autopilot Car Drove Through a Giant Photo of a Road (petapixel.com)
Famous YouTube creator and former NASA engineer Mark Rober, who provides his 65 million subscribers with scientific entertainment videos, put Tesla’s vision-based safety features up against a LiDAR system, and the results, although arguably flawed, highlight the limitations of vision-based autonomous vehicle systems.
Arbitrary-Scale Super-Resolution with Neural Heat Fields (therasr.github.io)
Thera is the first arbitrary-scale super-resolution method with a built-in physical observation model.
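For context, the classical baseline any arbitrary-scale method competes with is plain bilinear resampling, which can already be queried at any non-integer coordinate and hence any scale factor. A stdlib-only sketch of that baseline (not Thera's neural heat fields):

```python
# Arbitrary-scale upsampling baseline: bilinear interpolation samples an
# image at fractional coordinates, so any scale factor works. Thera
# instead learns a neural field with a physical observation model.

def bilinear(img, y: float, x: float) -> float:
    """Sample a 2D grayscale image (list of lists) at fractional (y, x)."""
    h, w = len(img), len(img[0])
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def resize(img, scale: float):
    """Upsample by an arbitrary (non-integer) scale factor."""
    h, w = len(img), len(img[0])
    return [[bilinear(img, i / scale, j / scale)
             for j in range(int(w * scale))]
            for i in range(int(h * scale))]

img = [[0.0, 1.0],
       [1.0, 0.0]]
out = resize(img, 1.5)  # 2x2 input -> 3x3 output
```

Bilinear interpolation ignores how the low-resolution image was formed; Thera's contribution is baking that observation model (the anti-aliasing filter as a function of scale) into the network itself.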
Gemma3 – The current strongest model that fits on a single GPU (ollama.com)
Gemma is a lightweight family of models from Google, built on Gemini technology. The Gemma 3 models are multimodal—processing text and images—and feature a 128K context window with support for over 140 languages. Available in 1B, 4B, 12B, and 27B parameter sizes, they excel in tasks like question answering, summarization, and reasoning, while their compact design allows deployment on resource-limited devices.
Show HN: 6DoF Object detection and tracking in web browser – WebAR.rocks.train (github.com/WebAR-rocks)
Gaussian Splats with OpenScene (ashwanirathee.com)
Exploring 3D Gaussian Splatting for Novel View Synthesis
Smaller but Better: Unifying Layout Generation with Smaller LLMs (arxiv.org)
We propose LGGPT, an LLM-based model tailored for unified layout generation.
Nexar Dashcam Crash Prediction Challenge (kaggle.com)
AMD YOLO (geohot.github.io)
AMD is sending us the two MI300X boxes we asked for. They are in the mail.
Nvidia OCR (nvidia.com)
A cutting-edge vision-language model excelling at retrieving text and metadata from images.
InstantStyle: Free Lunch Towards Style-Preserving in Text-to-Image Generation (github.com/instantX-research)
InstantStyle is a general framework that employs two straightforward yet potent techniques for achieving an effective disentanglement of style and content from reference images.
GEN3C: 3D-Informed World-Consistent Video (huggingface.co)
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency.
16-Bit to 1-Bit: Visual KV Cache Quantization for Efficient Multimodal LLMs (arxiv.org)
Multimodal Large Language Models (MLLMs) have achieved remarkable success across various applications, yet their computational overhead during deployment remains a critical challenge.
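The headline technique, compressing cache entries from 16-bit floats to a single bit, can be illustrated with standard sign-plus-scale binarization; the paper's actual scheme for visual KV caches is more elaborate, so treat the sketch below as a toy:

```python
# Toy 1-bit quantization of a KV-cache-like row: keep only the sign of
# each value plus one per-row scale (the mean absolute value). This is
# classic BinaryConnect-style binarization, not the paper's exact method.

def quantize_1bit(row):
    """Return (signs, scale): signs in {+1, -1}, scale = mean |x|."""
    scale = sum(abs(v) for v in row) / len(row)
    signs = [1 if v >= 0 else -1 for v in row]
    return signs, scale

def dequantize_1bit(signs, scale):
    """Reconstruct an approximation of the original row."""
    return [s * scale for s in signs]

row = [0.4, -0.2, 0.1, -0.5]
signs, scale = quantize_1bit(row)  # signs [1, -1, 1, -1], scale ~0.3
approx = dequantize_1bit(signs, scale)
```

Storage drops 16x (one sign bit per element plus a shared scale), at the cost of collapsing all magnitudes in a row to a single value; the research question is where in the visual KV cache that loss is tolerable.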
Show HN: Vidformer – Drop-In Acceleration for cv2 Video Annotation Scripts (github.com/ixlab)
A research project providing infrastructure for video-native interfaces and accelerating computer vision visualization. Developed by the OSU Interactive Data Systems Lab.
SpeciesNet: AI models to classify species from motion-triggered wildlife cameras (github.com/google)
Effective wildlife monitoring relies heavily on motion-triggered wildlife cameras, or “camera traps”, which generate vast quantities of image data. Manual processing of these images is a significant bottleneck. AI can accelerate that processing, helping conservation practitioners spend more time on conservation, and less time reviewing images.
Putting Andrew Ng's OCR models to the test (runpulse.com)
Today, Andrew Ng, one of the legends of the AI world, released a new document extraction service that went viral on X.
Replace OCR with Vision Language Models (github.com/vlm-run)