Hacker News with Generative AI: Web Scraping

Maxun: Open-Source No-Code Web Data Extraction Platform (github.com/getmaxun)
Maxun lets you train a robot in 2 minutes and scrape the web on auto-pilot. Web data extraction doesn't get easier than this!
Show HN: Convert any website into a React component (chromewebstore.google.com)
Nearly 90% of our AI crawler traffic is from ByteDance (haproxy.com)
This month, Fortune.com reported that TikTok’s web scraper — known as Bytespider — is aggressively sucking up content to fuel generative AI models. We noticed the same thing when looking at bot management analytics produced by HAProxy Edge — our global network that we ourselves use to serve traffic for haproxy.com. Some of the numbers we are seeing are fairly shocking, so let’s review the traffic sources and where they originate.
Annoyed Redditors tanking Google Search results illustrates peril of AI scrapers (arstechnica.com)
A trend on Reddit that sees Londoners giving false restaurant recommendations in order to keep their favorites clear of tourists and social media influencers highlights the inherent flaws of Google Search’s reliance on Reddit and Google's AI Overview.
You-get: Dumb downloader that scrapes the web (github.com/soimort)
You-Get is a tiny command-line utility to download media contents (videos, audios, images) from the Web, in case there is no other handy way to do it.
Show HN: Epublifier – scrape pages (books, manuals) for offline reading (github.com/maoserr)
Converts some webnovels to epub format
Video scraping: extracting JSON from a 35s screen capture for 1/10th of a cent (simonwillison.net)
The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.
Show HN: Web Scraping with AI (hystruct.com)
Hystruct uses AI to help you scrape the web with ease. Get started with 4000 free credits today.
ByteDance’s Bytespider is scraping at much higher rates than other platforms (fortune.com)
ByteDance looks like it’s eager to make up for lost time when it comes to scraping the web for data needed to train its generative AI models.
Web scraping with your web browser: Why not? (8chananon.github.io)
You can find plenty of tutorials on the Internet about the art of web scraping, and the first things you will learn about are Python and Beautiful Soup. There is no tutorial on web scraping with JavaScript in a web browser, though you will find browser extensions that claim to do it all without any need for coding (this only works for simplistic and unprotected websites).
Show HN: A tool to import and manage "Who Is Hiring" posts (github.com/gabfl)
This application allows you to scrape, store, and interactively explore job postings from Hacker News’s “Who is Hiring?” threads.
Show HN: Pipet – CLI tool for scraping and extracting data online, with pipes (github.com/bjesus)
Pipet is a command-line web scraper. It supports three modes of operation: HTML parsing, JSON parsing, and client-side JavaScript evaluation. It relies heavily on existing tools like curl, and it uses Unix pipes to extend its built-in capabilities.
The most accurate and cheapest AI for scraping (ortutay.substack.com)
AI models have the potential to make web scraping 10x easier. Instead of writing complicated code like XPath and CSS selectors, you can scrape a website with plain English. That’s the idea behind FetchFox, a Chrome extension that scrapes any website using AI.
Show HN: Finic – Open source platform for building browser automations (github.com/finic-ai)
Finic is a cloud platform designed to simplify the deployment and management of browser-based automation agents, with a focus on fault-tolerant execution. It’s designed for quickly launching bots, scrapers, RPA integrations, and other jobs that depend on multiple authenticated web services.
Show HN: I'm making an AI scraper called FetchFox (fetchfoxai.com)
Unlock Articles with Paywallskip (paywallskip.com)
Major Sites Are Saying No to Apple's AI Scraping (wired.com)
A web scraping CLI made for AI that is idempotent (github.com/clemlesne)
Learn Python 3 Spiders: Comprehensive Guide from Basics to Advanced Techniques (github.com/wistbean)
Tracking supermarket prices with Playwright (sakisv.net)
Websites Are Blocking the Wrong AI Scrapers (404media.co)
Anthropic is scraping websites so fast it's causing problems (pivot-to-ai.com)
Show HN: G-Scraper, a GUI Web Scraper, Written in Python (github.com/thegigacoder123)
Show HN: Crawlee for Python – a web scraping and browser automation library (crawlee.dev)
How to Detect Puppeteer Extra Stealth (datadome.co)
Cloudflare rolls out feature for blocking AI companies' web scrapers (siliconangle.com)
Cloudflare debuts one-click nuke of web-scraping AI (theregister.com)
How to crawl big websites with no sitemap? (ycombinator.com)
Block AI bots, scrapers and crawlers with a single click (cloudflare.com)
Show HN: Linkgrabs.com the Simple and Fast API to Fetch JavaScript Web Pages (linkgrabs.com)