Hacker News with Generative AI: Computer Vision

Dr. TVAM – Inverse Rendering for Tomographic Volumetric Additive Manufacturing (github.com/rgl-epfl)
Dr.TVAM is an inverse rendering framework for tomographic volumetric additive manufacturing.
Issues with color spaces and perceptual brightness (johnaustin.io)
Unlike RGB, the CIELab color space (and more modern variants like CIECAM02 and Oklab) is designed to be perceptually uniform.
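For intuition, here is a small Python sketch (mine, not the post's) contrasting raw sRGB values with CIELAB lightness L*, using the standard sRGB linearization and luminance formulas:

```python
import numpy as np

def srgb_to_linear(c):
    """Undo the sRGB transfer curve (components in [0, 1])."""
    c = np.asarray(c, dtype=float)
    return np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)

def relative_luminance(rgb):
    """Relative luminance Y from linear sRGB (Rec. 709 primaries)."""
    r, g, b = srgb_to_linear(rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def cielab_lightness(rgb):
    """CIELAB L* in [0, 100], an approximately uniform lightness scale."""
    y = relative_luminance(rgb)
    f = y ** (1 / 3) if y > (6 / 29) ** 3 else y / (3 * (6 / 29) ** 2) + 4 / 29
    return 116 * f - 16

# sRGB mid-gray (0.5, 0.5, 0.5) lands near L* = 53, not 50: equal steps
# in RGB are not equal steps in perceived brightness.
print(round(cielab_lightness((0.5, 0.5, 0.5)), 1))  # ~53.4
```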
Show HN: I created a PoC for live descriptions of the surroundings for the blind (github.com/o40)
I wanted to see if I could create a low-cost tool for the blind to get a live description of the scene in front of a camera.
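The repo's internals aside, the general shape of such a tool is a capture-describe-speak loop. In this sketch, describe_frame is a hypothetical stand-in for whatever vision-language model or API the project actually calls:

```python
import time
import cv2
import pyttsx3

def describe_frame(frame) -> str:
    """Hypothetical hook: send the frame to a VLM and return a caption."""
    raise NotImplementedError

tts = pyttsx3.init()  # offline text-to-speech
cap = cv2.VideoCapture(0)
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        caption = describe_frame(frame)
        tts.say(caption)
        tts.runAndWait()
        time.sleep(5)  # re-describe the scene every few seconds
except KeyboardInterrupt:
    pass
cap.release()
```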
Auto Smiley – Computer vision smile generator (2010) (fffff.at)
Auto Smiley is a computer vision application that runs in the background while you work. The software analyzes your face while you are working, and if it detects a smile it sends the ASCII smiley face characters “: )” as keystrokes to the frontmost application. Auto Smiley has many uses, from straight-up convenience to enforcing honesty in your online communication :)
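The original is a 2010 openFrameworks app; as a rough modern approximation (my reconstruction, not the original code), the same idea can be sketched in Python with OpenCV's stock Haar cascades and pyautogui:

```python
import time
import cv2
import pyautogui

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
smile_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_smile.xml")

cap = cv2.VideoCapture(0)  # default webcam
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
            roi = gray[y:y + h, x:x + w]
            # A high minNeighbors keeps stray smileys to a minimum.
            if len(smile_cascade.detectMultiScale(roi, 1.7, 22)) > 0:
                pyautogui.typewrite(": )")  # keystrokes land in the frontmost app
                time.sleep(2)  # crude debounce: at most one smiley per 2 s
        time.sleep(0.03)
except KeyboardInterrupt:
    pass
cap.release()
```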
Show HN: DeepFace – A lightweight deep face recognition library for Python (github.com/serengil)
DeepFace is a lightweight face recognition and facial attribute analysis (age, gender, emotion and race) framework for Python.
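Its API is compact; a minimal taste (return shapes can vary between DeepFace versions):

```python
# pip install deepface
from deepface import DeepFace

# Facial attribute analysis: age, gender, emotion, race.
result = DeepFace.analyze(
    img_path="person.jpg",
    actions=["age", "gender", "emotion", "race"],
)
print(result)

# Face verification: are these two photos the same person?
match = DeepFace.verify(img1_path="img1.jpg", img2_path="img2.jpg")
print(match["verified"], match["distance"])
```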
DeepSeek-VL2: MoE Vision-Language Models for Advanced Multimodal Understanding (github.com/deepseek-ai)
Introducing DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL.
Tenstorrent Wormhole Series (corsix.org)
A company called Tenstorrent designs and sells PCIe cards for AI acceleration. At the time of writing, they've recently started shipping their Wormhole n150s and Wormhole n300s cards.
Reflecting on o3 "beating ARC": are we reliving the ImageNet 2012 moment again? (ycombinator.com)
AlexNet came and blew everything out of the water. Then you can reflect on how much progress (a lot) there has been from 2012 until now, just on this little dataset.

o3 beating ARC: that's such a harder dataset, I don't even want to compare them. So how much progress will there be from just this?

The next 10 years are gonna be bonkers.
All You Need Is 4x 4090 GPUs to Train Your Own Model (sabareesh.com)
My journey into Large Language Models (LLMs) began with the excitement of seeing ChatGPT in action. I started by exploring diffusion models, drawn to their ability to create beautiful visuals. However, working on an M1 chip had its limitations, which motivated me to build a custom rig with an NVIDIA 4090 GPU. As I continued to explore LLMs and experimented with multi-agent systems, I came to realize the importance of mastering the fundamentals.
Armada: Augmented Reality for Robot Manipulation and Robot-Free Data Acquisition (arxiv.org)
Teleoperation for robot imitation learning is bottlenecked by hardware availability. Can high-quality robot data be collected without a physical robot?
Reverse Video Search (mixpeek.com)
Reverse video search allows us to use a video clip as the input for a query against videos that have been indexed in a vector store.
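One generic way to build this (not necessarily Mixpeek's pipeline): embed sampled frames with CLIP, average them into one clip vector, and query a vector index such as FAISS:

```python
import cv2
import faiss
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # CLIP image/text encoder

def embed_video(path, every_n=30):
    """Mean CLIP embedding over every Nth frame, L2-normalized."""
    cap, embs, i = cv2.VideoCapture(path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            embs.append(model.encode(img))
        i += 1
    cap.release()
    v = np.mean(embs, axis=0)
    return v / np.linalg.norm(v)

# Index a library of videos, then search with a query clip.
library = ["a.mp4", "b.mp4", "c.mp4"]
index = faiss.IndexFlatIP(512)  # cosine similarity via normalized inner product
index.add(np.stack([embed_video(p) for p in library]).astype("float32"))

query = embed_video("query_clip.mp4").astype("float32")
scores, ids = index.search(query[None, :], k=3)
print([(library[j], float(s)) for j, s in zip(ids[0], scores[0])])
```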
Language Model as Visual Explainer (arxiv.org)
In this paper, we present Language Model as Visual Explainer (LVX), a systematic approach for interpreting the internal workings of vision models using a tree-structured linguistic explanation, without the need for model training.
DeepSeek-v3 Technical Report [pdf] (github.com/deepseek-ai)
Trying out QvQ – Qwen's new visual reasoning model (simonwillison.net)
I thought we were done for major model releases in 2024, but apparently not: Alibaba’s Qwen team just dropped the Apache 2.0 licensed QvQ-72B-Preview, “an experimental research model focusing on enhancing visual reasoning capabilities”.
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations (grisoon.github.io)
We present INFP, an audio-driven interactive head generation framework for dyadic conversations. Given the dual-track audio of a dyadic conversation and a single portrait image of an arbitrary agent, our framework can dynamically synthesize verbal, non-verbal and interactive agent videos with lifelike facial expressions and rhythmic head pose movements. Additionally, our framework is lightweight yet powerful, making it practical in instant communication scenarios such as video conferencing. INFP denotes that our method is Interactive, Natural, Flash and Person-generic.
FastVideo: a lightweight framework for accelerating large video diffusion models (github.com/hao-ai-lab)
FastVideo is a lightweight framework for accelerating large video diffusion models.
Fine-tuning a vision model to recognize break dance power moves (bawolf.com)
I’ve been looking for a way to combine software engineering with my break practice and stumbled upon the idea of fine-tuning a vision model on power moves.
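One lightweight recipe for this kind of task (the post's actual setup may differ) is a linear probe: freeze a pretrained CLIP image encoder and train a small head on labeled frames. The label set and paths below are made up:

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

MOVES = ["windmill", "flare", "headspin"]  # hypothetical label set
head = nn.Linear(512, len(MOVES)).to(device)  # 512 = CLIP projection dim
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def features(paths):
    """Frozen CLIP image features for a batch of frame paths."""
    imgs = [Image.open(p).convert("RGB") for p in paths]
    inputs = proc(images=imgs, return_tensors="pt").to(device)
    with torch.no_grad():  # encoder stays frozen; only the head trains
        return clip.get_image_features(**inputs)

def step(paths, labels):
    """One training step on a (frame paths, label indices) minibatch."""
    logits = head(features(paths))
    loss = loss_fn(logits, torch.tensor(labels, device=device))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```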
Nvidia Jetson Orin Nano Super [video] (youtube.com)
Meta's new Video Understanding Multimodal Model used a Qwen model for training (arxiv.org)
Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood.
Veo 2: Our video generation model (deepmind.google)
Veo creates videos with realistic motion and high-quality output, up to 4K. Explore different styles and find your own with extensive camera controls.
Representing Long Volumetric Video with Temporal Gaussian Hierarchy (zju3dv.github.io)
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos.
Llama.cpp Now Supports Qwen2-VL (Vision Language Model) (github.com/ggerganov)
This PR implements the Qwen2VL model as requested in #9246. The main changes include:
OpenAI announces Advanced Voice with Vision [video] (youtube.com)
AI pioneer Fei-Fei Li has a vision for computer vision (ieee.org)
AI pioneer Fei-Fei Li says to unlock visual intelligence, we need to respect the fact that "the world is 3D."
Attribute Extraction from Images Using DSPy (langtrace.ai)
DSPy recently added support for VLMs in beta. A quick thread on attribute extraction from images using DSPy. For this example, we will see how to extract useful attributes from screenshots of websites.
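A sketch of what that looks like; DSPy's VLM support was in beta at the time, so the exact multimodal API may have shifted, and the model name is just an example:

```python
import dspy

# Any DSPy-supported vision-capable model works here.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SiteAttributes(dspy.Signature):
    """Extract structured attributes from a website screenshot."""
    screenshot: dspy.Image = dspy.InputField()
    site_name: str = dspy.OutputField()
    primary_color: str = dspy.OutputField()
    has_login_form: bool = dspy.OutputField()

extract = dspy.Predict(SiteAttributes)
result = extract(screenshot=dspy.Image.from_url("https://example.com/shot.png"))
print(result.site_name, result.primary_color, result.has_login_form)
```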
Long Convolutions via Polynomial Multiplication (hazyresearch.stanford.edu)
We’ve been writing a series of papers (1, 2, 3) that have at their core so-called long convolutions, with an aim towards enabling longer-context models. These are different from the 3x3 convolutions people grew up with in vision because, well, they are longer: in some cases, the filters are as long as the whole sequence. A frequent line of questions we get is about what these long convolutions are and how we compute them efficiently, so we put together a short tutorial.
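The core trick, in a few lines of NumPy: a convolution is a polynomial product, and polynomial products are pointwise products in the Fourier domain, so a length-n convolution drops from O(n^2) to O(n log n):

```python
import numpy as np

def long_conv(u, k):
    """Causal convolution of signal u with a filter k as long as u itself."""
    n = len(u)
    # Zero-pad to 2n so the FFT's circular convolution doesn't wrap around,
    # multiply the spectra pointwise, then invert.
    fft_size = 2 * n
    y = np.fft.irfft(np.fft.rfft(u, fft_size) * np.fft.rfft(k, fft_size), fft_size)
    return y[:n]  # keep only the causal part

rng = np.random.default_rng(0)
u, k = rng.standard_normal(4096), rng.standard_normal(4096)
# Matches the direct O(n^2) computation:
assert np.allclose(long_conv(u, k), np.convolve(u, k)[:4096])
```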
Convolutional Neural Network Visualization [video] (youtube.com)
Show HN: Real-Time YOLO Object Detection in Elixir: Fast, Simple, Extensible (github.com/poeticoding)
YOLO is an Elixir library designed to simplify object detection by providing seamless integration of YOLO models. With this library, you can efficiently utilize the power of YOLO for real-time object detection.
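The library itself is Elixir; to keep this digest's examples in one language, here is the same real-time detection loop sketched in Python with the separate Ultralytics package, a different implementation of the same idea:

```python
# pip install ultralytics opencv-python
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # small pretrained model, auto-downloaded
cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, verbose=False)[0]
    for box in results.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        label = results.names[int(box.cls[0])]
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("yolo", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc quits
        break
cap.release()
cv2.destroyAllWindows()
```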
Moondream 0.5B: The Smallest Vision-Language Model (moondream.ai)
Today, we’re thrilled to announce the launch of Moondream 0.5B, the latest addition to our family of open-source AI models. With only 0.5 billion parameters, Moondream 0.5B is the world's smallest Vision-Language Model (VLM), designed to unlock AI's potential on edge devices and mobile platforms. It builds on the success of its predecessor, Moondream 2B, and is also released under the flexible Apache License, ensuring accessibility for everyone.
PaliGemma 2: Powerful Vision-Language Models, Simple Fine-Tuning (googleblog.com)
Building custom, advanced AI that can "see" used to be a complex and resource-intensive endeavor. Not anymore. This past May, we launched PaliGemma, the first vision-language model in the Gemma family, taking a significant step toward making class-leading visual AI more accessible. Now, we're thrilled to introduce PaliGemma 2, the next evolution in tunable vision-language models.