Hacker News with Generative AI: Speech Recognition

Play Dialog: A contextual turn-taking TTS model like NotebookLM Playground (play.ai)
Evaluating OpenAI Whisper's Hallucinations on Different Silences (sabrina.dev)
AI hallucinations in healthcare have made recent headlines, as OpenAI’s speech-to-text model (Whisper) has been shown to hallucinate during silences.
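Since the failure mode is hallucination on silent input, one common mitigation is to gate out silent chunks before they ever reach the transcriber. A minimal sketch of energy-based silence detection follows; the `is_silent` helper and the -40 dBFS cutoff are illustrative assumptions, not anything from the article:

```python
import numpy as np

def is_silent(samples: np.ndarray, threshold_db: float = -40.0) -> bool:
    """Return True if the chunk's RMS level falls below threshold_db (dBFS).

    Assumes float samples normalized to [-1.0, 1.0]; a chunk flagged as
    silent would be skipped instead of being sent to the ASR model.
    """
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    if rms == 0.0:
        return True  # digital silence: no log of zero
    return bool(20.0 * np.log10(rms) < threshold_db)

# Example: one second at 16 kHz — a faint noise floor vs. a 440 Hz tone.
sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
silence = np.random.normal(0.0, 1e-4, sr)   # near-silent noise floor (~-80 dBFS)
tone = 0.5 * np.sin(2 * np.pi * 440 * t)    # clearly audible signal (~-9 dBFS)
print(is_silent(silence), is_silent(tone))  # → True False
```

A production system would use a proper voice-activity detector rather than a single RMS threshold, but the idea is the same: never hand the model audio it has nothing to transcribe.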
Moonshine, the new state of the art for speech to text (petewarden.com)
Can you imagine using a keyboard where a key press took two seconds to show up on screen? That’s the typical latency for most voice interfaces, so it’s no wonder they’ve failed to catch on for most people. Today we’re open sourcing Moonshine, a new speech-to-text model that returns results faster and more efficiently than the current state of the art, OpenAI’s Whisper, while matching or exceeding its accuracy.
Ask HN: Real-time speech-to-speech translation (ycombinator.com)
Has anyone had any luck with a free, offline, open-source, real-time speech-to-speech translation app on under-powered devices (i.e., older smart phones)?
Meta Spirit LM: Open multimodal language model that freely mixes text and speech (twitter.com)
Show HN: SpeakMyVoice – App for people with vocal or speech difficulties (speakmyvoice.com)
With SpeakMyVoice, you're always part of the conversation.
Omni SenseVoice: High-Speed Speech Recognition with Words Timestamps (github.com/lifeiteng)
Built on SenseVoice, Omni SenseVoice is optimized for lightning-fast inference and precise timestamps—giving you a smarter, faster way to handle audio transcription!
Improving Whisper Transcriptions with GPT-4o (github.com/orcaman)
I was watching the latest news episode from Whisky.com (where fine spirits meet ™) the other day on YouTube, and noticed that the transcription was really off.
Whisper-Large-v3-Turbo (huggingface.co)
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.
Show HN: Reverb ASR+Diarization, the Best Open Source ASR for Long-Form Audio (ycombinator.com)
Today, we are launching and open sourcing our current generation ASR models named "Reverb."
VoiceRAG: A pattern for RAG and voice with the GPT-4o Realtime API for audio (microsoft.com)
The new Azure OpenAI gpt-4o-realtime-preview model opens the door for even more natural application user interfaces with its speech-to-speech capability.
Show HN: Speech-to-speech playground for OpenAI's new Realtime API (livekit.io)
Try OpenAI's new Realtime API right from your browser.
Llama 3.1 Omni Model (github.com/ictnlp)
LLaMA-Omni is a speech-language model built upon Llama-3.1-8B-Instruct. It supports low-latency, high-quality speech interaction, simultaneously generating both text and speech responses from speech instructions.
Moshi: A speech-text foundation model for real time dialogue (github.com/kyutai-labs)
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework.
Speech Dictation Mode for Emacs (lepisma.xyz)
There is a wide range of input mechanisms for computers, starting with keyboards (which are relatively mature) and extending to various types of neural interfaces (currently under research). Speech lies somewhere on this spectrum, with a lot of promise but still not much to show for it. Keeping accessibility aspects aside, I think speech is mature enough to be used for drafting ideas and taking notes. Maybe not so much for structured writing like programming or final versions of most prose.
Hugging Face tackles speech-to-speech (github.com/huggingface)
Recording and Processing Spoken Word (tratt.net)
A powerful tool for converting speech into text (trintai.com)
Language model can listen while speaking (huggingface.co)
AiOla open-sources ultra-fast ‘multi-head’ speech recognition model (aiola.com)
Transcribro: On-device Accurate Speech-to-text (github.com/soupslurpr)
AI speech generator 'reaches human parity' – but it's too dangerous to release (livescience.com)
Show HN: SpeakStruct – Turn voice into consistent structured data (speakstruct.co)
“You Are My Friend”: Early Androids and Artificial Speech (publicdomainreview.org)
Sonic: A Low-Latency Voice Model for Lifelike Speech (cartesia.ai)
Simple Speech-to-Text on the '10 Cents' CH32V003 Microcontroller (github.com/brian-smith-github)
AI Device Template Featuring Whisper, TTS, Groq, Llama3, OpenAI (github.com/developersdigest)
Self-hosted offline transcription and diarization service with LLM summary (github.com/transcriptionstream)
Ask HN: Fast/cheap epaper badge for real time speech to text with deaf friends? (ycombinator.com)