Hacker News with Generative AI: Multimodal AI

Meta Spirit LM: Open multimodal language model that freely mixes text and speech (twitter.com)
MM1.5: Methods, Analysis and Insights from Multimodal LLM Fine-Tuning (arxiv.org)
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning.