Hacker News with Generative AI: Multimodal

Qwen2.5-Omni Technical Report (huggingface.co)
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.

Alibaba Qwen2.5-Omni-7B: Open Source End-to-End Multimodal AI Model (alizila.com)
Alibaba Cloud has launched Qwen2.5-Omni-7B, a unified end-to-end multimodal model in the Qwen series.

Meta Llama 3 vision multimodal models – how to use them and what they can do (theregister.com)
Meta has been influential in driving the development of open language models with its Llama family, but until now the only way to interact with them has been through text.

Molmo: a family of open multimodal AI models (allenai.org)

A Specialized UI Multimodal Model (motiff.com)

How it's Made: Interacting with Gemini through multimodal prompting (googleblog.com)
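Several of the stories above center on multimodal prompting, where a single request interleaves text with images, audio, or video. As a rough sketch of the common pattern (a hypothetical, vendor-neutral message structure, not the API of Gemini or any specific model), each part of the prompt is tagged with its modality and assembled in order:

```python
# Illustrative sketch of interleaved multimodal prompting.
# The message schema here is hypothetical, modeled loosely on common
# chat-style APIs; real providers each define their own payload format.
def build_prompt(parts):
    """Assemble one user message from ordered (modality, value) pairs."""
    return {
        "role": "user",
        "content": [{"type": modality, modality: value} for modality, value in parts],
    }

msg = build_prompt([
    ("text", "What is happening in this picture?"),
    ("image_url", "https://example.com/photo.png"),
    ("text", "Answer in one sentence."),
])
```

The key design point is that ordering is preserved: the model sees the image in context between the two text spans, which is what makes interleaved prompting more expressive than attaching all media up front.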