Hacker News with Generative AI: Vision-Language Models

OmniSVG (github.com/OmniSVG)
OmniSVG is the first family of end-to-end multimodal SVG generators that leverage pre-trained Vision-Language Models (VLMs), capable of generating complex and detailed SVGs, from simple icons to intricate anime characters.

Generative AI, SVG, Vision-Language Models, Computer Graphics, Open Source

44 points by handfuloflight 97 days ago | 2 comments

SmolDocling: An ultra-compact VLM for end-to-end multi-modal document conversion (arxiv.org)
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion.

Vision-Language Models, Document Conversion, Machine Learning, Computer Vision

66 points by prats226 120 days ago | 12 comments

RT-2: Vision-Language-Action Models (2023) (robotics-transformer2.github.io)
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning.

Robotics, Artificial Intelligence, Vision-Language Models, Machine Learning

76 points by elsewhen 199 days ago | 13 comments

Multimodal Interpretability in 2024 (soniajoseph.ai)
I'm writing this post to clarify my thoughts and update my collaborators on multimodal interpretability in 2024. Having spent part of the summer in the AI safety sphere in Berkeley, and then joining the video understanding team at FAIR as a visiting researcher, I'm bridging two communities: the language mechanistic interpretability efforts in AI safety, and the efficiency-focused Vision-Language Model (VLM) community in industry. Some content may be more familiar to one community than the other.

Multimodal Interpretability, AI Safety, Vision-Language Models, Artificial Intelligence

30 points by apsec112 239 days ago | 0 comments