Hacker News with Generative AI: Audio Processing

INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations (grisoon.github.io)
We present INFP, an audio-driven interactive head generation framework for dyadic conversations. Given the dual-track audio of a dyadic conversation and a single portrait image of an arbitrary agent, our framework can dynamically synthesize verbal, non-verbal and interactive agent videos with lifelike facial expressions and rhythmic head pose movements. Additionally, our framework is lightweight yet powerful, making it practical in instant communication scenarios such as video conferencing. The name INFP denotes that our method is Interactive, Natural, Flash and Person-generic.
Fish Speech 1.5 (github.com/fishaudio)
This codebase and all models are released under the CC-BY-NC-SA-4.0 License. Please refer to LICENSE for more details.
Nvidia Fugatto: "World's Most Flexible Sound Machine" (nvidia.com)
A team of generative AI researchers created a Swiss Army knife for sound, one that allows users to control the audio output simply using text.
Show HN: Open-Source Tool to Remove Background Music from Videos (github.com/omeryusufyagci)
Fast Music Remover is a lightweight tool designed to remove music, sound effects and noise from internet media. Processing takes about 8% of the original source length; that's under 5 seconds for a minute-long video!
Audio Decomposition – open-source separation of music to constituent instruments (matthew-bird.com)
My plan for this project was to create a program to turn music into sheet music. It was mainly incentivised by my own desire to turn music into sheet music and the lack (from what I could tell) of simple, open-source algorithms to perform audio source separation.
Hertz-dev, the first open-source base model for conversational audio (si.inc)
For the last few months, we at Standard Intelligence have focused on fundamental research on the frontier of audio-only speech generation. We're excited to announce that we're open-sourcing current checkpoints of our full-duplex, audio-only transformer base model, hertz-dev, with a total of 8.5 billion parameters.
A Golang pipeline abomination (poxate.com)
In this project, we need to overlay a looping short music track over a long voice soundtrack.
NotebookLlama: An open source version of NotebookLM (github.com/meta-llama)
This is a guided series of tutorials/notebooks that can be taken as a reference or course to build a PDF to Podcast workflow.
Debugging audio artifacts caused by... a serial port? (recall.ai)
At Recall.ai we run enormous infrastructure to process millions of meetings per month, in real time.
Omnio: First AI model that can natively reason over audio (soniox.com)
Omnio is the first multimodal AI model to comprehensively understand both conversations and human behavior through audio.
Show HN: Detect if an audio file was generated by NotebookLM (github.com/ListenNotes)
A simple tool to detect whether an audio file was generated by NotebookLM.
Show HN: Reverb ASR+Diarization, the Best Open Source ASR for Long-Form Audio (ycombinator.com)
Today, we are launching and open sourcing our current generation ASR models named "Reverb."
Lessons learnt building a real-time audio application in Python (vangemert.dev)
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency (loopyavatar.github.io)
Recording and Processing Spoken Word (tratt.net)
AudioFlux: A C/C++ library for audio and music analysis (github.com/libAudioFlux)
Real-time ML audio noise suppression on Raspberry Pi Pico 2 (raspberrypi.com)
StreamVC: Real-Time Low-Latency Voice Conversion (research.google)
Ask HN: Where are the good resources for learning audio processing? (ycombinator.com)
Generating audio for video (deepmind.google)
Groqnotes: Generate structured notes from audio using Groq, Whisper, and Llama3 (github.com/Bklieger)