Hacker News with Generative AI: Multimodal AI

Gemma3 – The current strongest model that fits on a single GPU (ollama.com)
Gemma is a lightweight family of models from Google built on Gemini technology. The Gemma 3 models are multimodal—processing text and images—and feature a 128K context window with support for over 140 languages. Available in 1B, 4B, 12B, and 27B parameter sizes, they excel at tasks like question answering, summarization, and reasoning, while their compact design allows deployment on resource-limited devices.
Magma: A foundation model for multimodal AI agents (microsoft.github.io)
Magma is the first foundation model capable of interpreting and grounding multimodal inputs within its environment. Given a described goal, Magma can formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial, and temporal intelligence to navigate complex tasks and settings.
Meta Spirit LM: Open multimodal language model that freely mixes text and speech (twitter.com)
MM1.5: Methods, Analysis and Insights from Multimodal LLM Fine-Tuning (arxiv.org)
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning.