Hacker News with Generative AI: Computer Vision

Magma: A foundation model for multimodal AI agents (microsoft.github.io)
Magma is the first foundation model capable of interpreting and grounding multimodal inputs within its environment. Given a described goal, Magma can formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial, and temporal intelligence to navigate complex tasks and settings.
Run structured extraction on documents/images locally with Ollama and Pydantic (github.com/vlm-run)
Welcome to VLM Run Hub, a comprehensive repository of pre-defined Pydantic schemas for extracting structured data from unstructured visual domains such as images, videos, and documents.
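For a sense of how the pieces fit together, here is a minimal sketch of local structured extraction with Ollama's Python client and a Pydantic schema. The Invoice schema and the model name are illustrative placeholders, not schemas or models taken from the hub.

```python
# Minimal sketch: local structured extraction with Ollama + Pydantic.
# The Invoice schema and model name are illustrative, not from VLM Run Hub.
from pydantic import BaseModel
import ollama

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

response = ollama.chat(
    model="llama3.2-vision",  # any locally pulled vision-capable model
    messages=[{
        "role": "user",
        "content": "Extract the invoice fields from this image.",
        "images": ["invoice.png"],  # assumed path to a local document image
    }],
    format=Invoice.model_json_schema(),  # constrain output to the schema
)

invoice = Invoice.model_validate_json(response.message.content)
print(invoice)
```

Passing the schema's JSON Schema as `format` constrains decoding so the reply validates cleanly into the Pydantic model rather than free-form text.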
WonderHuman: 3D avatars from single-view video (arxiv.org)
In this paper, we present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis.
Experiment: Can 3D improve AI video consistency? (backdroptech.github.io)
Fast Video Generation with Sliding Tile Attention (hao-ai-lab.github.io)
TL;DR: Video generation with DiTs is painfully slow – HunyuanVideo takes 16 minutes to generate just a 5-second video on an H100 with FlashAttention-3. Our sliding tile attention (STA) slashes this to 5 minutes with zero quality loss, no extra training required. Specifically, STA accelerates attention alone by 2.8–17x over FlashAttention-2 and 1.6–10x over FlashAttention-3.
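The blurb doesn't spell out the mechanism, but the general idea behind sliding, tile-based attention can be sketched in 1D: each query tile attends only to nearby key/value tiles, so cost grows linearly in sequence length rather than quadratically. This is a simplified analogue of the idea, not STA's actual 3D tiling or fused kernel.

```python
# Illustrative 1D sketch of tile-local attention (STA operates on 3D video
# latents with fused kernels; tile size and window here are assumptions).
import torch

def tile_local_attention(q, k, v, tile=64, window=1):
    # q, k, v: (batch, seq, dim); assumes seq is divisible by tile.
    # Each query tile attends only to key/value tiles within `window`
    # tiles of itself, so cost scales with seq * tile * window, not seq^2.
    b, n, d = q.shape
    out = torch.zeros_like(q)
    for t in range(n // tile):
        qs = q[:, t * tile:(t + 1) * tile]          # queries in tile t
        lo = max(0, (t - window) * tile)
        hi = min(n, (t + window + 1) * tile)
        ks, vs = k[:, lo:hi], v[:, lo:hi]           # neighbouring tiles only
        attn = torch.softmax(qs @ ks.transpose(1, 2) / d ** 0.5, dim=-1)
        out[:, t * tile:(t + 1) * tile] = attn @ vs
    return out
```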
Biases in Apple's Image Playground (giete.ma)
Image Playground is heavily restricted and we do not have direct access to the underlying model, but can we still use the prompting interface, together with an image input, to influence the skin tone of the resulting image? It turns out we can, and in precisely the biased way most image models behave 🤦‍♂️.
Step-Video-T2V: The Practice, Challenges, and Future of Video Foundation Model (arxiv.org)
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length.
ZeroBench: An Impossible Visual Benchmark for Contemporary LMMs (arxiv.org)
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals.
OmniParser V2 – A simple screen parsing tool towards pure vision based GUI agent (github.com/microsoft)
OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.
Diffusion Without Tears (notion.site)
Show HN: Live webcam metal pin toy simulation powered by WebGPU depth estimation (vncntt.github.io)
Using Depth Anything V2 + metal pin simulation
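The demo runs in-browser on WebGPU, but the depth-estimation half is easy to reproduce offline. A sketch using the Hugging Face transformers pipeline with the small Depth Anything V2 checkpoint (the file paths are assumptions):

```python
# Sketch of the depth-estimation half of the demo via transformers;
# the actual demo runs the model in-browser with WebGPU.
from transformers import pipeline
from PIL import Image

depth = pipeline("depth-estimation",
                 model="depth-anything/Depth-Anything-V2-Small-hf")

frame = Image.open("webcam_frame.png")   # assumed: a captured webcam frame
result = depth(frame)
depth_map = result["depth"]              # PIL image of per-pixel relative depth
depth_map.save("depth.png")              # this map would drive the pin heights
```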
Segment Anything for Microscopy (nature.com)
Accurate segmentation of objects in microscopy images remains a bottleneck for many researchers despite the number of tools developed for this purpose.
Benchmarking vision-language models on OCR in dynamic video environments (arxiv.org)
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments.
Why LLMs still have problems with OCR (runpulse.com)
LLMs suck at complex OCR, and probably will for a while. LLMs are excellent for many text-generation or summarization tasks, but they falter at the precise, detail-oriented job of OCR—especially when dealing with complicated layouts, unusual fonts, or tables. These models get lazy, often not following prompt instructions across hundreds of pages, failing to parse information, and “thinking” too much.
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation (hila-chefer.github.io)
Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics.
Show HN: Automated Sorting of group photos by user defined N people in each pic (github.com/Karvy-Singh)
Sorts a collection of group and candid photos by the criterion "me with my favorite people (x, y, z...)".
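Not the author's code, but the underlying idea can be sketched with the face_recognition library, assuming a favorites/ folder holding one reference photo per person and a photos/ folder to filter:

```python
# Hypothetical sketch of face-based photo filtering, not the repo's code.
import os
import face_recognition

def load_reference_encodings(folder="favorites"):
    # One reference photo per favorite person; keep the first face found.
    encodings = []
    for name in os.listdir(folder):
        image = face_recognition.load_image_file(os.path.join(folder, name))
        faces = face_recognition.face_encodings(image)
        if faces:
            encodings.append(faces[0])
    return encodings

def photo_matches(path, references, required=2):
    # Keep a photo if at least `required` of the reference people appear in it.
    image = face_recognition.load_image_file(path)
    found = face_recognition.face_encodings(image)
    hits = sum(
        any(face_recognition.compare_faces(found, ref)) for ref in references
    )
    return hits >= required

refs = load_reference_encodings()
keepers = [p for p in os.listdir("photos")
           if photo_matches(os.path.join("photos", p), refs)]
```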
OmniHuman-1: Human Animation Models (omnihuman-lab.github.io)
We propose an end-to-end multimodality-conditioned human video generation framework named OmniHuman, which can generate human videos based on a single human image and motion signals (e.g., audio only, video only, or a combination of audio and video).
S1: Simple Test-Time Scaling (github.com/simplescaling)
This repository provides an overview of all resources for the paper "s1: Simple test-time scaling".
First place in Tetris 99 using computer vision, classical AI, a lot of free time (bpinzone.github.io)
We created a program to play Tetris 99, an online multiplayer game for the Nintendo Switch. The program used computer vision to determine the state of the board, a depth-first search with a hand-crafted utility function to find a good next block placement, and a microcontroller connected to the Switch over USB to send the button presses required to perform that placement.
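A self-contained sketch of the search component on a deliberately simplified board model (columns as heights, pieces as per-column height increments); the heuristic weights are made up, not the authors' hand-tuned values:

```python
# Simplified sketch of DFS over placements with a hand-crafted utility.
# Board = list of column heights; piece = tuple of per-column increments.

def utility(heights, holes):
    # Illustrative heuristic: penalize stack height and buried holes.
    return -0.5 * sum(heights) - 2.0 * holes

def placements(heights, piece):
    # Yield (new_heights, new_holes) for each column the piece can drop into.
    for col in range(len(heights) - len(piece) + 1):
        new = list(heights)
        rest = max(new[col + i] for i in range(len(piece)))  # landing level
        holes = 0
        for i, h in enumerate(piece):
            holes += rest - new[col + i]      # gaps buried under the piece
            new[col + i] = rest + h
        yield new, holes

def best_placement(heights, pieces, holes=0):
    # Depth-first search over the queue of upcoming pieces.
    if not pieces:
        return utility(heights, holes), heights
    best = (float("-inf"), None)
    for new, extra in placements(heights, pieces[0]):
        score, _ = best_placement(new, pieces[1:], holes + extra)
        if score > best[0]:
            best = (score, new)
    return best

# Example: empty 10-wide board, O piece then a flat I piece.
score, board = best_placement([0] * 10, pieces=[(2, 2), (1, 1, 1, 1)])
```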
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels (github.com/mpc001)
This repository is an open-sourced framework for speech recognition, with a primary focus on visual speech (lip-reading). It is designed for end-to-end training, aiming to deliver state-of-the-art models and enable reproducibility on audio-visual speech benchmarks.
Ten Takes on DeepSeek (peterwildeford.substack.com)
Yes, DeepSeek is impressive.
How to run 1.58bit DeepSeek R1 with Open WebUI (openwebui.com)
A huge shoutout to UnslothAI for their incredible efforts! Thanks to their hard work, we can now run the full DeepSeek-R1 671B parameter model in its dynamic 1.58-bit quantized form (compressed to just 131GB) on Llama.cpp! And the best part? You no longer have to despair about needing massive enterprise-class GPUs or servers — it’s possible to run this model on your personal machine (albeit slowly for most consumer hardware).
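For Python users, roughly the same setup can be driven through llama-cpp-python. The GGUF filename below is an assumption (the 1.58-bit build actually ships as several split GGUF files, and llama.cpp loads the remaining shards when pointed at the first):

```python
# Sketch using llama-cpp-python; the model filename is an assumed example.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # assumed shard name
    n_ctx=4096,
    n_gpu_layers=-1,   # offload as many layers as fit; lower this if VRAM runs out
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what 1.58-bit quantization trades off."}]
)
print(out["choices"][0]["message"]["content"])
```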
High-Speed Face-Tracking for Dynamic Facial Projection Mapping (titech.ac.jp)
Dynamic Facial Projection Mapping (DFPM) overlays computer-generated images onto human faces to create immersive experiences that have been used in the makeup and entertainment industries. In this study, we propose two concepts to reduce the misalignment artifacts between projected images and target faces, which is a persistent challenge for DFPM.
TopoNets: High performing vision and language models with brain-like topography (arxiv.org)
Neurons in the brain are organized such that nearby cells tend to share similar functions.
TopoNets: High-Performing Vision and Language Models with Brain-Like Topography (toponets.github.io)
The organization of neurons in the brain is highly structured: neurons performing similar functions are located near one another. This "topographic organization" is a fundamental principle of primate brains and plays an important role in shaping the brain's representations.
3D scene reconstruction in adverse weather conditions via Gaussian splatting (arxiv.org)
3D Gaussian Splatting (3DGS) has gained significant attention for 3D scene reconstruction, but it still struggles in complex outdoor environments, especially under adverse weather.
Stable Flow: Vital Layers for Training-Free Image Editing (omriavrahami.com)
Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features.
Hugging Face claims its new AI models are the smallest of their kind (techcrunch.com)
A team at AI dev platform Hugging Face has released what they’re claiming are the smallest AI models that can analyze images, short videos, and text.
A QR code that sends you to a different destination – lenticular and adversarial (mstdn.social)
Surface-Stable Fractal Dithering (github.com/runevision)
Surface-Stable Fractal Dithering is a novel form of dithering invented by Rune Skovbo Johansen for use on surfaces in 3D scenes.