Hacker News with Generative AI: Web Crawling

Crawlers impact the operations of the Wikimedia projects (wikimedia.org)
Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grown significantly. In this post, we’ll discuss the reasons for this trend and its impact.

Wikimedia, Web Crawling, Content Consumption, Open Source

115 points by edward 441 days ago | 66 comments

Crawl Order and Disorder (marginalia.nu)
A problem the search engine’s crawler has struggled with for some time is that it takes a fairly long time to finish up, usually spending several days wrapping up the final few domains.

Search Engines, Web Crawling, Technical

65 points by ingve 477 days ago | 9 comments

AI crawlers haven't learned to play nice with websites (theregister.com)
SourceHut, an open source git-hosting service, says web crawlers for AI companies are slowing down services through their excessive demands for data.

AI, Web Crawling, Open Source, Technology, Data

77 points by belter 486 days ago | 43 comments

Common Crawl maintains a free, open repository of web crawl data (commoncrawl.org)
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

Web Crawling, Open Data, Big Data

27 points by doener 499 days ago | 1 comment

Show HN: Crawlspace – A centralized web crawling platform built on Cloudflare (crawlspace.dev)
Crawlspace is a centralized platform for developers to build and deploy web crawlers. Gather fresh data for your apps and agents while contributing to a platform-wide cache for crawler traffic.

Web Crawling, Cloud Computing, Developer Tools, Software

9 points by andrethegiant 541 days ago | 1 comment

Amazon's AI crawler is making my Git server unstable (xeiaso.net)
Please, just stop.

Amazon, Git, AI, Web Crawling, Server Stability

607 points by OptionOfT 544 days ago | 246 comments

25% of the top websites are blocking OpenAI from crawling (originality.ai)
Bots such as OpenAI’s GPTBot, the Applebot, CCBot, Google-Extended, and Bytespider analyze, store, or scrape your website’s data in order to provide data to train more advanced LLMs.

Web Crawling, Privacy, Artificial Intelligence, Data

4 points by behnamoh 557 days ago | 0 comments

Crawling More Politely Than Big Tech (cameronboehmer.com)
Dennis Schubert, engineer at Mozilla and noteworthy contributor to diaspora, a distributed, open-source social network, recently observed that 70% of the load on diaspora's servers was coming from poorly-behaved bots that feed the LLMs of a few big outfits.

Open Source, Artificial Intelligence, Web Crawling, Social Networks

43 points by pkghost 561 days ago | 17 comments

Web Crawler and Scraper for AI (spider.cloud)
Spider offers the finest data collecting solution. Engineered for speed and scalability, it allows you to elevate your AI projects.

Web Crawling, AI, Data Collection, Software, Tools

8 points by catskindleyou 574 days ago | 4 comments

Crawl4AI: Open-Source Web Crawler for Seamless AI Data Scraping (github.com/unclecode)
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

Web Crawling, AI, Open Source

6 points by ProbeCraft 655 days ago | 0 comments

AI Has Created a Battle over Web Crawling (ieee.org)

AI, Web Crawling, Legal Issues

60 points by pseudolus 684 days ago | 52 comments

AI crawlers need to be more respectful (readthedocs.com)

Artificial Intelligence, Ethics, Web Crawling

226 points by pneff 721 days ago | 118 comments

Reddit has updated its robots.txt to block all web crawlers (stackdiary.com)

Web Crawling, Robots.txt, Reddit, Social Media

19 points by skilled 743 days ago | 15 comments

OpenAI and Anthropic are ignoring robots.txt (businessinsider.com)

Robots.txt, OpenAI, Anthropic, Web Crawling

18 points by Handy-Man 755 days ago | 6 comments

Show HN: Yomuco – A simple web crawling library for Node.js (github.com/andraindrops)

Web Crawling, Node.js, JavaScript, Software, Libraries

23 points by jtakahashi64 796 days ago | 3 comments