Hacker News with Generative AI: Web Scraping

A thought on JavaScript "proof of work" anti-scraper systems (utoronto.ca)
One of the things that people are increasingly using these days to deal with the issue of aggressive LLM and other web scrapers is JavaScript-based "proof of work" systems, where your web server requires visiting clients to run some JavaScript to solve a challenge; one such system (increasingly widely used) is Xe Iaso's Anubis.

Web Scraping, JavaScript, Security, Artificial Intelligence, LLMs

194 points by zdw 54 days ago | 207 comments

Show HN: Defuddle, an HTML-to-Markdown alternative to Readability (github.com/kepano)
Defuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.

Web Development, HTML, Markdown, Open Source, Web Scraping

418 points by kepano 57 days ago | 68 comments

Scraperr – A Self Hosted Webscraper (github.com/jaypyles)
Scraperr enables you to extract data from websites with precision using XPath selectors.

Web Scraping, Software, Open Source, Tools

263 points by jpyles 68 days ago | 94 comments
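
For readers unfamiliar with XPath-based extraction, the following is a small illustrative sketch using only Python's standard library. It is not Scraperr's code or API: the sample markup, the `extract_products` helper, and the class names are made up for the example, and `xml.etree.ElementTree` supports only a limited subset of XPath and requires well-formed markup, so real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this example.
PAGE = """<html><body>
<div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
<div class="product"><h2>Gadget</h2><span class="price">19.50</span></div>
<div class="ad"><h2>Buy now</h2></div>
</body></html>"""


def extract_products(markup: str) -> list[tuple[str, float]]:
    """Return (name, price) pairs for every div with class 'product'."""
    root = ET.fromstring(markup)
    items = []
    # ".//div[@class='product']" selects matching divs at any depth.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("h2")
        price = node.findtext("span[@class='price']")
        items.append((name, float(price)))
    return items


if __name__ == "__main__":
    for name, price in extract_products(PAGE):
        print(name, price)
```

The selector targets elements by structure and attribute, so the unrelated `ad` block is skipped without any string matching on the page text.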

Bot countermeasures impact on the quality of life on the web (volution.ro)
I think enough has already been written on the subject of fighting against rogue bots (today mostly for LLM scraping) that are ruining the web, not only by strip-mining human creativity and turning it into average slop, but especially by taking down hosting infrastructure through uncoordinated crawling that turns into DDoS.

Web Scraping, Bots, Artificial Intelligence, Internet

20 points by ciprian_craciun 70 days ago | 9 comments

Show HN: POC to scrape and structure HTML into JSON for RAG (pages.dev)
Enter a URL to extract structured JSON content using Gemini AI. This is ideal for RAG workflows, AI assistants, or any app needing clean, machine-readable data from web pages.

Web Scraping, JSON, RAG, AI, Gemini AI

9 points by nirvanist 81 days ago | 6 comments

Scrapy needs to have sane defaults that do no harm (github.com/scrapy)
Scrapy needs to have sane defaults that do no harm #6597

Software, Programming, Web Scraping, Open Source

12 points by sagacity 87 days ago | 3 comments

The Web Is Broken – Botnet Part 2 (wildeboer.net)
I guess you have all heard about the growing problem of AI companies trying to aggressively collect whatever data they can get their hands on to train their models. This has caused an explosive surge in web crawlers relentlessly hitting servers big and small. But who runs these crawlers? Turns out — it could be you!

AI, Web Scraping, Data Collection, Privacy

411 points by todsacerdoti 90 days ago | 274 comments

Crawler operators, please stop destroying the commons (lunnova.dev)
Goodwill for crawling/scraping has rapidly been depleted

Web Scraping, Ethics, Internet

7 points by xena 91 days ago | 0 comments

Abusive AI Web Crawlers: Get Off My Lawn (mythic-beasts.com)
As many other folks have reported in the last few weeks, we have also been seeing a huge increase in the amount of traffic from abusive web crawlers.

Web Scraping, Artificial Intelligence, Abuse, Security

23 points by bluehatbrit 108 days ago | 22 comments

Cloudflare is luring web-scraping bots into an 'AI Labyrinth' (theverge.com)
Cloudflare, one of the biggest internet infrastructure companies in the world, has announced AI Labyrinth, a new tool to fight web-crawling bots that scrape sites for AI training data without permission.

Artificial Intelligence, Security, Web Scraping, Cloud Computing

6 points by adam__smith 118 days ago | 0 comments

Improved ways to operate a rude crawler (marginalia.nu)
Tech news is abuzz with rude AI crawlers that forge their user-agent and ignore robots.txt. In my opinion, if this is all the AI startups can muster, they’re losing their touch. wget can do this. You need to up your game, get that crawler really rolling coal. Flagrant disregard for externalities is an important signal to the investors that your AI startup is the one.

Artificial Intelligence, Web Scraping, Ethics, Startups

75 points by doruk101 119 days ago | 12 comments

Show HN: Hyperbrowser MCP Server – Connect AI agents to the web through browsers (github.com/hyperbrowserai)
This is Hyperbrowser's Model Context Protocol (MCP) Server. It provides various tools to scrape, extract structured data, and crawl webpages. It also provides easy access to general purpose browser agents like OpenAI's CUA, Anthropic's Claude Computer Use, and Browser Use.

Web Scraping, AI Agents, Browsers, Open Source

63 points by shrisukhani 120 days ago | 26 comments

Fetch-MCP: Playwright-Based MCP Server with Batch URL Fetching Support (github.com/jae-jae)
MCP server for fetching web page content using the Playwright headless browser.

Web Scraping, Web Development, Playwright, Headless Browsers, Automation

64 points by Sulfide6416 121 days ago | 14 comments

Show HN: I scrape Steam data every month and it's yours to download for free (gginsights.io)
Leverage the power of AI to help answer your questions about the Steam market and become a data expert, transforming data into actionable insights.

Data Analysis, Gaming, Steam, AI, Web Scraping

161 points by csmets 145 days ago | 60 comments

Show HN: I Scraped 2,200 Software Engineering Jobs from Career Pages Using LLMs (grepjob.com)

Software Engineering, Job Search, Web Scraping

8 points by kylem866 150 days ago | 12 comments

Show HN: Automatic Python scraper for xenforo, phpbb, invision, smf, vbulletin (github.com/TUVIMEN)
forumscraper aims to be a universal, automatic and extensive scraper for forums.

Web Scraping, Python, Forums, Software, Open Source

3 points by TUVIMEN 193 days ago | 0 comments

Show HN: API Parrot – Automatically Reverse Engineer HTTP APIs (apiparrot.com)
API Parrot is a tool designed to reverse engineer the HTTP APIs of any website, making life easier for developers looking to automate, integrate, or scrape websites without public APIs.

API Development, Web Development, Reverse Engineering, Automation, Web Scraping

456 points by pvarghav 199 days ago | 117 comments

Show HN: DataFuel.dev – Turn websites into LLM-ready data (datafuel.dev)
DataFuel API scrapes entire websites and knowledge bases in a single query. Get clean, markdown-structured web data instantly for your RAG systems and AI models. No complex scraping code needed.

Web Scraping, AI, Data

43 points by sachou 218 days ago | 34 comments

Show HN: App to discover job listings directly from company websites (unlistedjobs.com)

Job Hunting, Web Scraping, Software, Startups

67 points by Jabbs 224 days ago | 57 comments

What Are the Latest Scraping APIs or Services/Websites? (ycombinator.com)
I'm working on a large web scraping project and looking to avoid building custom scrapers for every site using something like BeautifulSoup.

Web Scraping, APIs, Services, Websites

6 points by jdcampolargo 226 days ago | 7 comments

Maxun: Open-Source No-Code Web Data Extraction Platform (github.com/getmaxun)
Maxun lets you train a robot in 2 minutes and scrape the web on auto-pilot. Web data extraction doesn't get easier than this!

Open Source, Web Scraping, Data Extraction, Automation

58 points by thunderbong 252 days ago | 8 comments

Show HN: Convert any website into a React component (chromewebstore.google.com)

Web Development, React, Chrome Extensions, Web Scraping

326 points by alexdanilowicz 256 days ago | 62 comments

Nearly 90% of our AI crawler traffic is from ByteDance (haproxy.com)
This month, Fortune.com reported that TikTok’s web scraper — known as Bytespider — is aggressively sucking up content to fuel generative AI models. We noticed the same thing when looking at bot management analytics produced by HAProxy Edge — our global network that we ourselves use to serve traffic for haproxy.com. Some of the numbers we are seeing are fairly shocking, so let’s review the traffic sources and where they originate.

AI, Web Scraping, Traffic Analysis, Generative AI, TikTok

95 points by jcat123 260 days ago | 43 comments

Annoyed Redditors tanking Google Search results illustrates peril of AI scrapers (arstechnica.com)
A trend on Reddit that sees Londoners giving false restaurant recommendations in order to keep their favorites clear of tourists and social media influencers highlights the inherent flaws of Google Search’s reliance on Reddit and Google's AI Overview.

AI, Search Engines, Social Media, Web Scraping, Consumer Behavior

5 points by isaacfrond 263 days ago | 0 comments

You-get: Dumb downloader that scrapes the web (github.com/soimort)
You-Get is a tiny command-line utility to download media contents (videos, audios, images) from the Web, in case there is no other handy way to do it.

Command Line, Web Scraping, Downloading, Media

397 points by Anon84 265 days ago | 146 comments

Show HN: Epublifier – scrape pages (books, manuals) for offline reading (github.com/maoserr)
Converts some webnovels to epub format

Web Scraping, Ebooks, Software, Offline Reading, Webnovels

290 points by maoserr 271 days ago | 43 comments

Video scraping: extracting JSON from a 35s screen capture for 1/10th of a cent (simonwillison.net)
The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.

Web Scraping, Data Extraction, Automation, Cost Optimization, Efficiency

309 points by simonw 275 days ago | 46 comments

Show HN: Web Scraping with AI (hystruct.com)
Hystruct uses AI to help you scrape the web with ease. Get started with 4000 free credits today.

Web Scraping, AI, Tools, Software

4 points by alexpate 284 days ago | 1 comment

ByteDance’s Bytespider is scraping at much higher rates than other platforms (fortune.com)
ByteDance looks like it’s eager to make up for lost time when it comes to scraping the web for data needed to train its generative AI models.

AI, Web Scraping, Data, ByteDance, Generative AI

141 points by wmstack 285 days ago | 93 comments

Web scraping with your web browser: Why not? (8chananon.github.io)
You can find plenty of tutorials on the Internet about the art of web scraping (for example, here and here) and the first things you will learn about are Python and Beautiful Soup. There is no tutorial on web scraping with Javascript in a web browser though you will find browser extensions that claim to do it all without any need for coding (this only works for simplistic and unprotected websites).

Web Scraping, JavaScript, Browser Extensions

150 points by 8chanAnon 290 days ago | 73 comments