Hacker News with Generative AI: Training Data

Ask HN: Is politeness towards LLMs good training data, or just expensive noise? (ycombinator.com)
Sam Altman recently said user politeness towards ChatGPT costs OpenAI "tens of millions" of dollars but is "money well spent."
The Unbelievable Scale of AI's Pirated-Books Problem (theatlantic.com)
Meta pirated millions of books to train its AI. Search through them here.
There's No Longer Any Doubt That Hollywood Writing Is Powering AI (theatlantic.com)
Dialogue from these movies and TV shows has been used by companies such as Apple and Anthropic to train AI systems.
SwiGLU activation function causes instability in FP8 LLM training (arxiv.org)
We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens, a 20-fold increase over previous limits.
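For context on the activation the headline refers to: SwiGLU is a gated unit that multiplies a Swish-activated projection by a second linear projection. A minimal NumPy sketch (the weight names `W` and `V` are illustrative, not from the paper):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish activation: x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    # SwiGLU gated unit: Swish(x @ W) elementwise-multiplied by (x @ V).
    # The elementwise product of two projections is what can produce
    # large-magnitude outliers, which is problematic in low-precision
    # formats such as FP8.
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))   # batch of 2, hidden size 4
W = rng.standard_normal((4, 8))   # gate projection
V = rng.standard_normal((4, 8))   # value projection
out = swiglu(x, W, V)
print(out.shape)  # (2, 8)
```

The multiplicative gating means activation magnitudes can grow roughly as the product of two terms, which narrows the usable dynamic range under FP8.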
Leaked Docs Show Nvidia Scraping a Human Lifetime of Videos per Day to Train AI (404media.co)
Apple, Nvidia, Anthropic Used Swiped YouTube Videos to Train AI (proofnews.org)
YouTube creators surprised to find Apple and others trained AI on their videos (arstechnica.com)
Figma will use your content to train its AI (stackdiary.com)
OpenAI destroyed a trove of books used to train AI models (businessinsider.com)