Hacker News with Generative AI: Speech Recognition

Play Dialog: A contextual turn-taking TTS model like NotebookLM Playground (play.ai)
Evaluating OpenAI Whisper's Hallucinations on Different Silences (sabrina.dev)
AI hallucinations in healthcare have made recent headlines, as OpenAI’s speech-to-text model (Whisper) has been shown to hallucinate during silences.
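Since the failure mode is hallucination on silent input, one common mitigation is to gate out silent chunks before they ever reach the transcriber. A minimal sketch of energy-based silence detection follows; the `is_silent` helper and the -40 dBFS cutoff are illustrative assumptions, not anything from the article:

```python
import numpy as np

def is_silent(samples: np.ndarray, threshold_db: float = -40.0) -> bool:
    """Return True if the chunk's RMS level falls below threshold_db (dBFS).

    Assumes float samples normalized to [-1.0, 1.0]; a chunk flagged as
    silent would be skipped instead of being sent to the ASR model.
    """
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    if rms == 0.0:
        return True  # digital silence: no log of zero
    return bool(20.0 * np.log10(rms) < threshold_db)

# Example: one second at 16 kHz — a faint noise floor vs. a 440 Hz tone.
sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
silence = np.random.normal(0.0, 1e-4, sr)   # near-silent noise floor (~-80 dBFS)
tone = 0.5 * np.sin(2 * np.pi * 440 * t)    # clearly audible signal (~-9 dBFS)
print(is_silent(silence), is_silent(tone))  # → True False
```

A production system would use a proper voice-activity detector rather than a single RMS threshold, but the idea is the same: never hand the model audio it has nothing to transcribe.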
Moonshine, the new state of the art for speech to text (petewarden.com)
Can you imagine using a keyboard where a key press took two seconds to show up on screen? That’s the typical latency for most voice interfaces, so it’s no wonder they’ve failed to catch on for most people. Today we’re open sourcing Moonshine, a new speech-to-text model that returns results faster and more efficiently than the current state of the art, OpenAI’s Whisper, while matching or exceeding its accuracy.
Ask HN: Real-time speech-to-speech translation (ycombinator.com)
Has anyone had any luck with a free, offline, open-source, real-time speech-to-speech translation app on under-powered devices (i.e., older smart phones)?
Meta Spirit LM: Open multimodal language model that freely mixes text and speech (twitter.com)
Show HN: SpeakMyVoice – App for people with vocal or speech difficulties (speakmyvoice.com)
With SpeakMyVoice, you're always part of the conversation.
Omni SenseVoice: High-Speed Speech Recognition with Words Timestamps (github.com/lifeiteng)
Built on SenseVoice, Omni SenseVoice is optimized for lightning-fast inference and precise timestamps—giving you a smarter, faster way to handle audio transcription!
Improving Whisper Transcriptions with GPT-4o (github.com/orcaman)
I was watching the latest news episode from Whisky.com (where fine spirits meet ™) the other day on YouTube, and noticed that the transcription was really off.
Whisper-Large-v3-Turbo (huggingface.co)
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.
Show HN: Reverb ASR+Diarization, the Best Open Source ASR for Long-Form Audio (ycombinator.com)
Today, we are launching and open sourcing our current generation ASR models named "Reverb."
VoiceRAG: A pattern for RAG and voice with the GPT-4o Realtime API for audio (microsoft.com)
The new Azure OpenAI gpt-4o-realtime-preview model opens the door for even more natural application user interfaces with its speech-to-speech capability.
Show HN: Speech-to-speech playground for OpenAI's new Realtime API (livekit.io)
Try OpenAI's new Realtime API right from your browser.
Llama 3.1 Omni Model (github.com/ictnlp)
LLaMA-Omni is a speech-language model built upon Llama-3.1-8B-Instruct. It supports low-latency, high-quality speech interaction, simultaneously generating both text and speech responses from speech instructions.
Moshi: A speech-text foundation model for real time dialogue (github.com/kyutai-labs)
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework.
Speech Dictation Mode for Emacs (lepisma.xyz)
There is a wide range of input mechanisms for computers, starting with keyboards (which are relatively mature) and extending to various types of neural interfaces (currently under research). Speech lies somewhere on this spectrum, with a lot of promise but still not much to show for it. Keeping accessibility aspects aside, I think speech is mature enough to be used for drafting ideas and taking notes. Maybe not so much for structured writing like programming or final versions of most prose.
Hugging Face tackles speech-to-speech (github.com/huggingface)
Recording and Processing Spoken Word (tratt.net)
A powerful tool for converting speech into text (trintai.com)
Language model can listen while speaking (huggingface.co)
AiOla open-sources ultra-fast ‘multi-head’ speech recognition model (aiola.com)
Transcribro: On-device Accurate Speech-to-text (github.com/soupslurpr)
AI speech generator 'reaches human parity' – but it's too dangerous to release (livescience.com)
Show HN: SpeakStruct – Turn voice into consistent structured data (speakstruct.co)
“You Are My Friend”: Early Androids and Artificial Speech (publicdomainreview.org)
Sonic: A Low-Latency Voice Model for Lifelike Speech (cartesia.ai)
Simple Speech-to-Text on the '10 Cents' CH32V003 Microcontroller (github.com/brian-smith-github)
AI Device Template Featuring Whisper, TTS, Groq, Llama3, OpenAI (github.com/developersdigest)
Self-hosted offline transcription and diarization service with LLM summary (github.com/transcriptionstream)
Ask HN: Fast/cheap epaper badge for real time speech to text with deaf friends? (ycombinator.com)