The Web Is Broken – Botnet Part 2(wildeboer.net) I guess you have all heard about the growing problem of AI companies trying to aggressively collect whatever data they can get their hands on to train their models. This has caused an explosive surge in web crawlers relentlessly hitting servers big and small. But who runs these crawlers? Turns out — it could be you!
Improved ways to operate a rude crawler(marginalia.nu) Tech news is abuzz with rude AI crawlers that forge their user-agent and ignore robots.txt. In my opinion, if this is all the AI startups can muster, they’re losing their touch. wget can do this. You need to up your game, get that crawler really rolling coal. Flagrant disregard for externalities is an important signal to the investors that your AI startup is the one.
326 points by alexdanilowicz 173 days ago | 62 comments
Nearly 90% of our AI crawler traffic is from ByteDance(haproxy.com) This month, Fortune.com reported that TikTok’s web scraper — known as Bytespider — is aggressively sucking up content to fuel generative AI models. We noticed the same thing when looking at bot management analytics produced by HAProxy Edge — our global network that we ourselves use to serve traffic for haproxy.com. Some of the numbers we are seeing are fairly shocking, so let’s review the traffic sources and where they originate.
Web scraping with your web browser: Why not?(8chananon.github.io) You can find plenty of tutorials on the Internet about the art of web scraping (for example, here and here) and the first things you will learn about are Python and Beautiful Soup. There is no tutorial on web scraping with JavaScript in a web browser, though you will find browser extensions that claim to do it all without any need for coding (this only works for simplistic and unprotected websites).
The most accurate and cheapest AI for scraping(ortutay.substack.com) AI models have the potential to make web scraping 10x easier. Instead of writing complicated code like XPath and CSS selectors, you can scrape a website with plain English. That’s the idea behind FetchFox, a Chrome extension that scrapes any website using AI.
Show HN: Finic – Open source platform for building browser automations(github.com/finic-ai) Finic is a cloud platform designed to simplify the deployment and management of browser-based automation agents, with a focus on fault-tolerant execution. It’s designed for quickly launching bots, scrapers, RPA integrations, and other jobs that depend on multiple authenticated web services.