Hacker News with Generative AI: Computer Vision

Hand Tracking for Mouse Input (2023) (chernando.com)
The other day I watched the launch of the Apple Vision Pro. The whole thing was very interesting, but the thing that caught my attention was the finger input: using a finger pinch as a sort of cursor or mouse click seems very intuitive. I wanted to try it out, so I took it upon myself to build it.
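For a rough sense of the idea, here is a minimal sketch using MediaPipe Hands and PyAutoGUI; the author built his own tracker, so this is an illustration of the concept, not his code.

```python
# Minimal pinch-as-mouse sketch (illustration only, not the post's code).
# Assumes: pip install mediapipe opencv-python pyautogui
import cv2
import mediapipe as mp
import pyautogui

hands = mp.solutions.hands.Hands(max_num_hands=1)
screen_w, screen_h = pyautogui.size()
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        thumb, index = lm[4], lm[8]  # thumb tip, index fingertip
        # Index fingertip drives the cursor (x mirrored for webcam view).
        pyautogui.moveTo((1 - index.x) * screen_w, index.y * screen_h)
        # A small thumb-index distance counts as a pinch -> click.
        if abs(thumb.x - index.x) + abs(thumb.y - index.y) < 0.05:
            pyautogui.click()
```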
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation (nirvanalan.github.io)
GaussianAnything generates high-quality, editable surfel Gaussians through a cascaded 3D diffusion pipeline, conditioned on a single-view image or text.
LLaVA-o1: Let Vision Language Models Reason Step-by-Step (arxiv.org)
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks.
All-in-one embedding model for interleaved text, images, and screenshots (voyageai.com)
TL;DR — We are excited to announce voyage-multimodal-3, a new state-of-the-art for multimodal embeddings and a big step toward seamless RAG and semantic search for documents rich with both visuals and text. Unlike existing multimodal embedding models, voyage-multimodal-3 is capable of vectorizing interleaved texts + images and capturing key visual features from screenshots of PDFs, slides, tables, figures, and more, thereby eliminating the need for complex document parsing.
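A hedged sketch of calling it from the voyageai Python client; the `multimodal_embed` method and its argument shapes are my reading of Voyage's docs, so treat them as assumptions:

```python
# Sketch of embedding interleaved text + image with voyage-multimodal-3.
# Method name and input format per Voyage's docs; verify before use.
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# One input = an interleaved list of strings and PIL images.
inputs = [["A slide about Q3 revenue:", Image.open("slide_17.png")]]
result = vo.multimodal_embed(inputs, model="voyage-multimodal-3",
                             input_type="document")
print(len(result.embeddings[0]))  # one vector per interleaved input
```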
Show HN: ColiVara – State of the Art RAG API with Vision Models (github.com/tjmlabs)
ColiVara = COntextualized Late Interaction Vision Augmented Retrieval API
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization (rccchoudhury.github.io)
We present Run-Length Tokenization (RLT), a simple and efficient approach to speed up video transformers by removing redundant tokens from the input.
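The idea is run-length encoding applied to patch tokens: a patch that is (nearly) identical to the same patch in the previous frame adds no information and can be dropped. A generic sketch of that comparison step, not the paper's implementation:

```python
# Generic illustration of run-length-style token pruning for video:
# keep a patch token only when it differs enough from the same patch
# in the previous frame. Not the paper's code.
import torch

def prune_static_tokens(patches: torch.Tensor, tau: float = 0.1):
    """patches: (T, N, D) patch embeddings for T frames, N patches each."""
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)   # (T-1, N)
    keep = torch.cat([torch.ones(1, patches.shape[1], dtype=torch.bool),
                      diff > tau])                           # frame 0 kept
    return patches[keep], keep  # flattened kept tokens + boolean mask

tokens, mask = prune_static_tokens(torch.randn(8, 196, 768))
print(tokens.shape, mask.float().mean())  # fewer tokens than 8 * 196
```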
Omnivision-968M: Vision Language Model with 9x Tokens Reduction for Edge Devices (nexa.ai)
Image-Text Curation for 1B+ Data: Faster, Better, Smaller CLIP Models (datologyai.com)
Benchmarking Vision, Language, and Action Models on Robotic Learning Tasks (multinet.ai)
Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems, demonstrating the ability to combine visual understanding, language comprehension, and action generation.
Watermark Anything (github.com/facebookresearch)
Implementation and pretrained models for the paper Watermark Anything. Our approach allows for embedding (possibly multiple) localized watermarks into images.
FLUX1.1 with a Prompt Like "IMG_1018.CR2" (twitter.com)
Drone Relative Positioning (matthew-bird.com)
My plan for this project was to determine the relative position and orientation of two or more objects using cameras, with minimal use of external libraries. I was inspired to do this by the drone shows in HK.
SVDQuant: 4-Bit Quantization Powers 12B Flux on a 16GB 4090 GPU with 3x Speedup (hanlab.mit.edu)
A new post-training quantization paradigm for diffusion models that quantizes both the weights and activations of FLUX.1 to 4 bits, achieving a 3.5× memory and 8.7× latency reduction on a 16GB laptop 4090 GPU.
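SVDQuant's contribution is a low-rank branch that absorbs outliers before quantization; the baseline it builds on is plain 4-bit affine quantization, which looks roughly like this (a generic illustration, not the paper's kernels):

```python
# Generic 4-bit affine quantization round trip (illustration only;
# SVDQuant additionally absorbs outliers into a low-rank branch).
import torch

def quantize_4bit(w: torch.Tensor):
    scale = (w.max() - w.min()) / 15           # 4 bits -> 16 levels
    zero = torch.round(-w.min() / scale)       # zero point
    q = torch.clamp(torch.round(w / scale + zero), 0, 15)
    return q.to(torch.uint8), scale, zero

def dequantize(q, scale, zero):
    return (q.float() - zero) * scale

w = torch.randn(256, 256)
q, s, z = quantize_4bit(w)
print((w - dequantize(q, s, z)).abs().max())   # worst-case error
```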
Neural Optical Flow for PIV in Fluids (synthical.com)
Five Learnings from 15 Years in Perception (tangramvision.com)
In the fall of 2008, I was working on my third startup, ReTel Technologies. Our goal was to analyze shopper behavior in grocery stores, and use that data to help stores and brands improve the customer experience and store profitability. But we had a challenge: how do you anonymously track hundreds of shoppers per day in a store? We thought we had the answer: active RFID tags on every shopping cart.
Peng quadrotor autonomy framework visualized in the browser (rerun.io)
Harnessing Vision for Computation (2008) [pdf] (changizi.com)
Ollama 0.4 is released with support for Meta's Llama 3.2 Vision models locally (ollama.com)
Llama 3.2 Vision is now available to run in Ollama, in both 11B and 90B sizes.
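Locally it's `ollama run llama3.2-vision` from the CLI; through the Python client, the call looks roughly like this (assuming the `ollama` package is installed and the model has been pulled):

```python
# Query a local Llama 3.2 Vision model through the Ollama Python client.
# Assumes `ollama pull llama3.2-vision` has been run and the server is up.
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "What is in this image?",
        "images": ["image.jpg"],  # path to a local image file
    }],
)
print(response["message"]["content"])
```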
Iterative α-(de)blending and Stochastic Interpolants (nicktasios.nl)
In this post I look into Iterative α-(de)blending, a paper whose authors promise to make diffusion models simple to understand and implement, and find that this promise is only partially fulfilled, at least for me. I reproduce the algorithm from the paper and apply it to generating MNIST digits, as in my previous series of posts, and find that something is missing.
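The algorithm itself is short: blend a noise sample x0 with a data sample x1 as x_α = (1−α)x0 + αx1, train a network D to predict x1 − x0, then integrate that prediction at sampling time. A condensed PyTorch sketch, paraphrasing the paper rather than the post's code:

```python
# Condensed sketch of iterative alpha-(de)blending (paraphrased from
# the paper, not the post's code). D is any network taking (x, alpha).
import torch

def training_step(D, x1, opt):
    x0 = torch.randn_like(x1)                   # source: Gaussian noise
    alpha = torch.rand(x1.shape[0], 1, 1, 1)    # blend factor per sample
    x_alpha = (1 - alpha) * x0 + alpha * x1     # alpha-blended sample
    loss = ((D(x_alpha, alpha) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

@torch.no_grad()
def sample(D, shape, steps=128):
    x = torch.randn(shape)                      # start from pure noise
    for t in range(steps):                      # deblend toward the data
        alpha = torch.full((shape[0], 1, 1, 1), t / steps)
        x = x + D(x, alpha) / steps
    return x
```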
GenXD: Generating Any 3D and 4D Scenes (arxiv.org)
Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design.
TextLap: Customizing Language Models for Text-to-Layout Planning (arxiv.org)
Automatic generation of graphical layouts is crucial for many real-world applications, including designing posters, flyers, advertisements, and graphical user interfaces.
Self-Occluded Avatar Recovery from a Single Video in the Wild (soar-avatar.github.io)
Self-occlusion is common when capturing people in the wild, where performers do not follow predefined motion scripts. This challenges existing monocular human reconstruction systems, which assume full-body visibility. We introduce Self-Occluded Avatar Recovery (SOAR), a method for complete human reconstruction from partial observations in which parts of the body are entirely unobserved. SOAR leverages a structural normal prior and a generative diffusion prior to address this ill-posed reconstruction problem.
ThunderKittens: Simple, fast, and adorable AI kernels (hazyresearch.stanford.edu)
Fiveish months ago, we put out our posts on ThunderKittens and GPUs, and were pleasantly surprised by their warm reception on The Platform Formerly Known as Twitter.
A return to hand-written notes by learning to read and write (research.google)
We present a model to convert photos of handwriting into a digital format that reproduces component pen strokes, without the need for specialized equipment.
RebrickNet – Lego Part Detector (rebrickable.com)
RebrickNet learns by looking at images of LEGO parts and discovering the features that make each part unique.
Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks (arxiv.org)
Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots.
OmniParser for Pure Vision Based GUI Agent (microsoft.github.io)
The recent success of large vision-language models shows great potential for driving agent systems that operate on user interfaces.
Claude Computer Use – Is Vision the Ultimate API? (thariq.io)
I’ve spent the last 2 days basically non-stop hacking with Anthropic’s Computer Use API.
Seeing faces in things: A model and dataset for pareidolia (mhamilton.net)
The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. "Face pareidolia" describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective.
Transformers Utilization in Chart Understanding: A Review of Advances and Future (arxiv.org)
In recent years, interest in vision-language tasks has grown, especially those involving chart interactions.