Hacker News with Generative AI: Computer Vision

Hand Tracking for Mouse Input (2023) (chernando.com)
The other day I watched the launch of the Apple Vision Pro. The whole thing was very interesting, but the thing that caught my attention was the finger input: using a finger pinch as a sort of cursor or mouse click seems very intuitive. I wanted to try it out, so I took it upon myself to build it.
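For a rough sense of the idea, here is a minimal sketch using MediaPipe Hands and PyAutoGUI; the author built his own tracker, so this is an illustration of the concept, not his code.

```python
# Minimal pinch-as-mouse sketch (illustration only, not the post's code).
# Assumes: pip install mediapipe opencv-python pyautogui
import cv2
import mediapipe as mp
import pyautogui

hands = mp.solutions.hands.Hands(max_num_hands=1)
screen_w, screen_h = pyautogui.size()
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        thumb, index = lm[4], lm[8]  # thumb tip, index fingertip
        # Index fingertip drives the cursor (x mirrored for webcam view).
        pyautogui.moveTo((1 - index.x) * screen_w, index.y * screen_h)
        # A small thumb-index distance counts as a pinch -> click.
        if abs(thumb.x - index.x) + abs(thumb.y - index.y) < 0.05:
            pyautogui.click()
```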
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation (nirvanalan.github.io)
GaussianAnything generates high-quality, editable surfel Gaussians through a cascaded 3D diffusion pipeline, conditioned on a single-view image or text.
LLaVA-o1: Let Vision Language Models Reason Step-by-Step (arxiv.org)
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks.
All-in-one embedding model for interleaved text, images, and screenshots (voyageai.com)
TL;DR — We are excited to announce voyage-multimodal-3, a new state-of-the-art for multimodal embeddings and a big step toward seamless RAG and semantic search for documents rich with both visuals and text. Unlike existing multimodal embedding models, voyage-multimodal-3 is capable of vectorizing interleaved texts + images and capturing key visual features from screenshots of PDFs, slides, tables, figures, and more, thereby eliminating the need for complex document parsing.
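A hedged sketch of calling it from the voyageai Python client; the `multimodal_embed` method and its argument shapes are my reading of Voyage's docs, so treat them as assumptions:

```python
# Sketch of embedding interleaved text + image with voyage-multimodal-3.
# Method name and input format per Voyage's docs; verify before use.
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# One input = an interleaved list of strings and PIL images.
inputs = [["A slide about Q3 revenue:", Image.open("slide_17.png")]]
result = vo.multimodal_embed(inputs, model="voyage-multimodal-3",
                             input_type="document")
print(len(result.embeddings[0]))  # one vector per interleaved input
```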
Show HN: ColiVara – State of the Art RAG API with Vision Models (github.com/tjmlabs)
ColiVara = COntextualized Late Interaction Vision Augmented Retrieval API
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization (rccchoudhury.github.io)
We present Run-Length Tokenization (RLT), a simple and efficient approach to speed up video transformers by removing redundant tokens from the input.
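The idea is run-length encoding applied to patch tokens: a patch that is (nearly) identical to the same patch in the previous frame adds no information and can be dropped. A generic sketch of that comparison step, not the paper's implementation:

```python
# Generic illustration of run-length-style token pruning for video:
# keep a patch token only when it differs enough from the same patch
# in the previous frame. Not the paper's code.
import torch

def prune_static_tokens(patches: torch.Tensor, tau: float = 0.1):
    """patches: (T, N, D) patch embeddings for T frames, N patches each."""
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)   # (T-1, N)
    keep = torch.cat([torch.ones(1, patches.shape[1], dtype=torch.bool),
                      diff > tau])                           # frame 0 kept
    return patches[keep], keep  # flattened kept tokens + boolean mask

tokens, mask = prune_static_tokens(torch.randn(8, 196, 768))
print(tokens.shape, mask.float().mean())  # fewer tokens than 8 * 196
```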
Omnivision-968M: Vision Language Model with 9x Tokens Reduction for Edge Devices (nexa.ai)
Image-Text Curation for 1B+ Data: Faster, Better, Smaller CLIP Models (datologyai.com)
Benchmarking Vision, Language, and Action Models on Robotic Learning Tasks (multinet.ai)
Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems, demonstrating the ability to combine visual understanding, language comprehension, and action generation.
Watermark Anything (github.com/facebookresearch)
Implementation and pretrained models for the paper Watermark Anything. Our approach allows for embedding (possibly multiple) localized watermarks into images.
FLUX1.1 with a Prompt Like "IMG_1018.CR2" (twitter.com)
Drone Relative Positioning (matthew-bird.com)
My plan for this project was to determine the relative position and orientation of two or more objects using cameras, with minimal use of external libraries. I was inspired to do this by the drone shows in HK.
SVDQuant: 4-Bit Quantization Powers 12B Flux on a 16GB 4090 GPU with 3x Speedup (hanlab.mit.edu)
A new post-training quantization paradigm for diffusion models that quantizes both the weights and activations of FLUX.1 to 4 bits, achieving a 3.5× memory and 8.7× latency reduction on a 16GB laptop 4090 GPU.
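SVDQuant's contribution is a low-rank branch that absorbs outliers before quantization; the baseline it builds on is plain 4-bit affine quantization, which looks roughly like this (a generic illustration, not the paper's kernels):

```python
# Generic 4-bit affine quantization round trip (illustration only;
# SVDQuant additionally absorbs outliers into a low-rank branch).
import torch

def quantize_4bit(w: torch.Tensor):
    scale = (w.max() - w.min()) / 15           # 4 bits -> 16 levels
    zero = torch.round(-w.min() / scale)       # zero point
    q = torch.clamp(torch.round(w / scale + zero), 0, 15)
    return q.to(torch.uint8), scale, zero

def dequantize(q, scale, zero):
    return (q.float() - zero) * scale

w = torch.randn(256, 256)
q, s, z = quantize_4bit(w)
print((w - dequantize(q, s, z)).abs().max())   # worst-case error
```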
Neural Optical Flow for PIV in Fluids (synthical.com)
Five Learnings from 15 Years in Perception (tangramvision.com)
In the fall of 2008, I was working on my third startup, ReTel Technologies. Our goal was to analyze shopper behavior in grocery stores, and use that data to help stores and brands improve the customer experience and store profitability. But we had a challenge: how do you anonymously track hundreds of shoppers per day in a store? We thought we had the answer: active RFID tags on every shopping cart.
Peng quadrotor autonomy framework visualized in the browser (rerun.io)
Harnessing Vision for Computation (2008) [pdf] (changizi.com)
Ollama 0.4 is released with support for Meta's Llama 3.2 Vision models locally (ollama.com)
Llama 3.2 Vision is now available to run in Ollama, in both 11B and 90B sizes.
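Locally it's `ollama run llama3.2-vision` from the CLI; through the Python client, the call looks roughly like this (assuming the `ollama` package is installed and the model has been pulled):

```python
# Query a local Llama 3.2 Vision model through the Ollama Python client.
# Assumes `ollama pull llama3.2-vision` has been run and the server is up.
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "What is in this image?",
        "images": ["image.jpg"],  # path to a local image file
    }],
)
print(response["message"]["content"])
```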
Iterative α-(de)blending and Stochastic Interpolants (nicktasios.nl)
In this post I look into Iterative α-(de)blending, a paper whose authors promise to make diffusion models simple to understand and implement, and find that this promise is only partially fulfilled, at least for me. I reproduce the algorithm from the paper and apply it to generating MNIST digits, as in my previous series of posts, and find that something is missing.
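The algorithm itself is short: blend a noise sample x0 with a data sample x1 as x_α = (1−α)x0 + αx1, train a network D to predict x1 − x0, then integrate that prediction at sampling time. A condensed PyTorch sketch, paraphrasing the paper rather than the post's code:

```python
# Condensed sketch of iterative alpha-(de)blending (paraphrased from
# the paper, not the post's code). D is any network taking (x, alpha).
import torch

def training_step(D, x1, opt):
    x0 = torch.randn_like(x1)                   # source: Gaussian noise
    alpha = torch.rand(x1.shape[0], 1, 1, 1)    # blend factor per sample
    x_alpha = (1 - alpha) * x0 + alpha * x1     # alpha-blended sample
    loss = ((D(x_alpha, alpha) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

@torch.no_grad()
def sample(D, shape, steps=128):
    x = torch.randn(shape)                      # start from pure noise
    for t in range(steps):                      # deblend toward the data
        alpha = torch.full((shape[0], 1, 1, 1), t / steps)
        x = x + D(x, alpha) / steps
    return x
```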
GenXD: Generating Any 3D and 4D Scenes (arxiv.org)
Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design.
TextLap: Customizing Language Models for Text-to-Layout Planning (arxiv.org)
Automatic generation of graphical layouts is crucial for many real-world applications, including designing posters, flyers, advertisements, and graphical user interfaces.
Self-Occluded Avatar Recovery from a Single Video in the Wild (soar-avatar.github.io)
Self-occlusion is common when capturing people in the wild, where performers do not follow predefined motion scripts. This challenges existing monocular human reconstruction systems, which assume full-body visibility. We introduce Self-Occluded Avatar Recovery (SOAR), a method for complete human reconstruction from partial observations in which parts of the body are entirely unobserved. SOAR leverages a structural normal prior and a generative diffusion prior to address this ill-posed reconstruction problem.
ThunderKittens: Simple, fast, and adorable AI kernels (hazyresearch.stanford.edu)
Fiveish months ago, we put out our posts on ThunderKittens and GPUs, and were pleasantly surprised by their warm reception on The Platform Formerly Known as Twitter.
A return to hand-written notes by learning to read and write (research.google)
We present a model to convert photos of handwriting into a digital format that reproduces component pen strokes, without the need for specialized equipment.
RebrickNet – Lego Part Detector (rebrickable.com)
RebrickNet learns by looking at images of LEGO parts and discovering the features that make each part unique.
Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks (arxiv.org)
Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots.
OmniParser for Pure Vision Based GUI Agent (microsoft.github.io)
The recent success of large vision-language models shows great potential for driving agent systems that operate on user interfaces.
Claude Computer Use – Is Vision the Ultimate API? (thariq.io)
I’ve spent the last 2 days basically non-stop hacking with Anthropic’s Computer Use API.
Seeing faces in things: A model and dataset for pareidolia (mhamilton.net)
The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. "Face pareidolia" describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective.
Transformers Utilization in Chart Understanding: A Review of Advances and Future (arxiv.org)
In recent years, interest in vision-language tasks has grown, especially those involving chart interactions.