Hacker News with Generative AI: Data Scraping

Retrieve the comments history of any YouTube user across 1.4B users (lolarchiver.com)
Retrieve the comments history of any YouTube user across 1.4 billion users & 20 billion comments recorded

YouTube, Data Scraping, Social Media, Web Archives

20 points by gonzalezj 429 days ago | 7 comments

Job Hunting Scripts (github.com/CajuM)
These scripts are used to scrape GitHub for info on organizations. In the end you will be left with a TSV file that contains the name of the organization, its URL, its declared location and the number of stars of its select repositories.

Software, Job Hunting, Data Scraping, GitHub

15 points by CajuM 430 days ago | 15 comments

Anubis Works (xeiaso.net)
You are seeing this because the administrator of this website has set up Anubis to protect the server against the scourge of AI companies aggressively scraping websites.

Web Security, Artificial Intelligence, Data Scraping

319 points by evacchi 472 days ago | 208 comments

Wikipedia is struggling with voracious AI bot crawlers (engadget.com)
Wikimedia has seen a 50 percent increase in bandwidth used for downloading multimedia content since January 2024, the foundation said in an update. But it's not because human readers have suddenly developed a voracious appetite for consuming Wikipedia articles and for watching videos or downloading files from Wikimedia Commons. No, the spike in usage came from AI crawlers, or automated programs scraping Wikimedia's openly licensed images, videos, articles and other files to train generative artificial intelligence models.

Wikipedia, AI, Generative AI, Data Scraping, Open Source

91 points by bretpiatt 482 days ago | 100 comments

The Great Scrape (bearblog.dev)
LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the authors' permission, with all content treated as opted in by default.

Data Scraping, Web Content, Ethics, AI

13 points by Tomte 489 days ago | 1 comment

Court Overturns a Bad Jury Verdict Against Scraping–Ryanair vs. Booking (ericgoldman.org)
This summer, I wrote that the jury trial between Ryanair and Booking Holdings ended in the strangest way possible. The jury returned a verdict that Booking Holdings had caused exactly $5,000 in legally cognizable “loss” to Ryanair under the CFAA—the statutory minimum to establish a CFAA claim.

Law, Software, Travel, Data Scraping

4 points by hn_acker 512 days ago | 2 comments

Show HN: I Built a FAANG Job Board – Only Jobs Scraped in the Last 24h (topjobstoday.com)
🌟 The #1 platform for tech job seekers - join our growing community today

Jobs, Tech, Software, Web Development, Data Scraping

11 points by stasman 525 days ago | 3 comments

League of Legends data scraping the hard and tedious way for fun (maknee.github.io)
League of Legends is one of the world’s most popular competitive games, with millions of players generating vast amounts of gameplay data daily. Basic match statistics are available, but accessing moment-by-moment gameplay data is nearly impossible. This article demonstrates how to create a high-fidelity dataset by reverse engineering the game engine, capturing information ranging from precise player positions to ability usage timings and damage calculations.

Game Development, Data Scraping

158 points by maknee 531 days ago | 38 comments

Microsoft Word and Excel AI data scraping switched to opt-in by default (tomshardware.com)

Microsoft Word, Excel, AI, Data Scraping, Privacy

103 points by oldnetguy 609 days ago | 50 comments

ByteDance is abusing the free video downloading service Cobalt for mass scraping (twitter.com)

ByteDance, Social Media, Data Scraping, Privacy, Ethics

138 points by jsheard 660 days ago | 56 comments

Cloudflare's new marketplace lets websites charge AI bots for scraping (techcrunch.com)
Cloudflare announced plans on Monday to launch a marketplace in the next year where website owners can sell AI model providers access to scrape their site’s content. The marketplace is the final step of Cloudflare CEO Matthew Prince’s larger plan to give publishers greater control over how and when AI bots scrape their websites.

Web Development, AI, Business, Data Scraping

412 points by boristsr 673 days ago | 270 comments

Some Suggestions to Improve Robots.txt (ietf.org)
The BBC does not believe the current scraping of its content and data without permission in order to train generative AI models is in the public interest, and wants to agree a more structured and sustainable approach with technology companies.

Web Development, Artificial Intelligence, Data Scraping

14 points by 0xFF0123 677 days ago | 2 comments

LinkedIn silently opts users into generative AI data scraping by default (bsky.app)

Privacy, Generative AI, LinkedIn, Data Scraping, Social Media

6 points by diggan 678 days ago | 2 comments

LinkedIn scraped user data for training before updating its terms of service (techcrunch.com)
LinkedIn may have trained AI models on user data without updating its terms.

Privacy, AI, LinkedIn, Data Scraping, Legal Issues

36 points by mfiguiere 678 days ago | 14 comments

Game UI Database slowdown caused by relentless OpenAI scraping (gamedeveloper.com)

Game Development, Artificial Intelligence, Data Scraping

13 points by raytopia 687 days ago | 3 comments

Is it legal and possible to scrape the social media platforms? (ycombinator.com)
Given links to posts, is it legal & possible to scrape from social media such as YT, FB, Insta, TikTok & Snap?

Legal Issues, Social Media, Data Scraping

13 points by iamnnk 690 days ago | 10 comments

Leaked Docs Show Nvidia Scraping a Human Lifetime of Videos per Day to Train AI (404media.co)

Artificial Intelligence, Data Scraping, Privacy, Nvidia, Training Data

50 points by depingus 722 days ago | 11 comments

AI startup Anthropic accused of 'egregious' data scraping (ft.com)

AI, Data Scraping, Ethics, Startups

7 points by marban 730 days ago | 0 comments

Show HN: I scraped 3.2B TikTok profiles and 9B posts to build this search engine (seeksocial.io)

Data Scraping, Social Media, Search Engines, TikTok

22 points by IWantAllTheData 743 days ago | 20 comments

Storing Scraped Data in an SQLite Database on GitHub (jerrynsh.com)