Hacker News with Generative AI: Multimodal AI

Magma: A foundation model for multimodal AI agents (microsoft.github.io)
Magma is the first foundation model capable of interpreting and grounding multimodal inputs within its environment. Given a described goal, Magma can formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial, and temporal intelligence to navigate complex tasks and settings.
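The abstract describes a perceive-plan-act loop: the model grounds a stated goal in multimodal observations and emits actions. A minimal sketch of that loop, with a stub in place of the actual model call (the `Observation`/`Action` types and `propose_action` are illustrative assumptions, not Magma's real API):

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the agent's interfaces; Magma's actual
# input/output schema is not given in the abstract.
@dataclass
class Observation:
    image: bytes       # e.g. a screenshot or camera frame
    description: str   # textual context accompanying the frame

@dataclass
class Action:
    name: str          # e.g. "click", "move", "type"
    argument: str

def propose_action(goal: str, obs: Observation) -> Action:
    """Placeholder for the model call: a real agent would feed the goal
    plus the multimodal observation to the model and decode an action."""
    return Action(name="click", argument=f"element relevant to: {goal}")

def run_agent(goal: str, observations: list[Observation]) -> list[Action]:
    """Perceive -> plan -> act, grounding the goal in each observation."""
    plan: list[Action] = []
    for obs in observations:
        plan.append(propose_action(goal, obs))
    return plan

if __name__ == "__main__":
    frames = [Observation(image=b"", description="login page")]
    for act in run_agent("sign in to the dashboard", frames):
        print(act)
```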
Meta Spirit LM: Open multimodal language model that freely mixes text and speech (twitter.com)
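The headline claim is that text and speech share a single token stream. A minimal sketch of what such interleaving can look like, assuming modality-marker tokens and discrete integer speech units (the marker names and unit encoding here are illustrative, not Spirit LM's exact vocabulary):

```python
# Illustrative interleaving of text and speech into one token stream.
# Assumption: modality switches are marked with special tokens, and speech
# is represented as discrete unit IDs from a speech tokenizer.
TEXT_MARKER = "[TEXT]"
SPEECH_MARKER = "[SPEECH]"

def interleave(segments: list[tuple[str, object]]) -> list[str]:
    """Flatten (modality, payload) segments into one marked token stream."""
    stream: list[str] = []
    for modality, payload in segments:
        if modality == "text":
            stream.append(TEXT_MARKER)
            stream.extend(payload.split())              # word-level, for illustration
        elif modality == "speech":
            stream.append(SPEECH_MARKER)
            stream.extend(f"[unit{u}]" for u in payload)  # discrete speech units
        else:
            raise ValueError(f"unknown modality: {modality}")
    return stream

print(interleave([
    ("text", "the quick brown"),
    ("speech", [412, 7, 88, 301]),
    ("text", "fox"),
]))
```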
MM1.5: Methods, Analysis and Insights from Multimodal LLM Fine-Tuning (arxiv.org)
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning.
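"Visual referring and grounding" means the model consumes and emits region references alongside images. A short sketch of one common MLLM convention for this, serializing a normalized bounding box into the text prompt (the `<box>` tag and `<imageN>` placeholders are assumptions for illustration; MM1.5's actual serialization may differ):

```python
# Illustrative encoding of a grounding target as normalized coordinates
# embedded in a multi-image prompt. This mirrors a common MLLM convention,
# not a format confirmed by the MM1.5 abstract.
def box_to_token(x0: float, y0: float, x1: float, y1: float) -> str:
    """Serialize a bounding box (coordinates normalized to [0, 1]) as text."""
    assert 0.0 <= x0 <= x1 <= 1.0 and 0.0 <= y0 <= y1 <= 1.0
    return f"<box>{x0:.3f},{y0:.3f},{x1:.3f},{y1:.3f}</box>"

prompt = (
    "<image1><image2> "                      # placeholders for two images
    "Which image contains the red sign located at "
    + box_to_token(0.10, 0.20, 0.45, 0.60)
    + " in image 1?"
)
print(prompt)
```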