Weekly Log — Week of May 19, 2026
I Tested ScrapeGraphAI and Built From It
AI-assisted scraping that replaced selector rules with a prompt. Here is what I installed, what I built on top of it, what broke first, and what actually produced usable output.
The weekly log is where I document tools I test before writing about them. Not polished reviews. Just what happened when I actually tried something.
This week: ScrapeGraphAI.
What ScrapeGraphAI actually is
Most scraping tools require you to write selectors, map extraction rules, and fix the code every time a page changes layout.
ScrapeGraphAI works differently. You describe what you want. It builds the extraction pipeline using an LLM and a graph-based scraping system.
The core class is SmartScraperGraph. You give it a prompt and a URL. It handles the rest.
    from scrapegraphai.graphs import SmartScraperGraph

    graph_config = {
        "llm": {
            "model": "openai/gpt-4o-mini",
            "api_key": "YOUR_OPENAI_API_KEY",
        },
        "verbose": True,
        "headless": False,
    }

    graph = SmartScraperGraph(
        prompt="Extract useful information from the webpage",
        source="https://scrapegraphai.com/",
        config=graph_config,
    )

    result = graph.run()
No selector mapping. No parsing rules. No brittle XPath chains.
The full source is here: github.com/ScrapeGraphAI/Scrapegraph-ai
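The config above puts the API key directly in the script. A minimal sketch of one alternative, assuming the key lives in an environment variable: the helper name build_graph_config and the OPENAI_API_KEY variable name are illustrative choices, not part of ScrapeGraphAI. The returned dict simply mirrors the keys used in the example above.

```python
import os


def build_graph_config(model="openai/gpt-4o-mini", env_var="OPENAI_API_KEY"):
    """Return a config dict shaped like the example above.

    Hypothetical helper: reads the key from the environment so it
    never has to be written into the script itself.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before running the scraper")
    return {
        "llm": {"model": model, "api_key": api_key},
        "verbose": True,
        "headless": False,
    }
```

The result can be passed as config=build_graph_config() in place of the literal dict.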
How to install it
Two commands:
    pip install scrapegraphai
    playwright install

The playwright install step matters. It downloads the browser binaries that ScrapeGraphAI uses to fetch and render pages before extraction. Without it, JavaScript-heavy pages come back incomplete or empty.
For a local dev setup, clone the source repo and run these from its parent directory (the activation line is PowerShell on Windows):

    cd Scrapegraph-ai
    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
    pip install -e .
    playwright install
There is also a Docker path in the repo if you prefer containers. Both work.
The first test worked
SmartScraperGraph pulled structured data from a test page without me writing a single selector.
That proved the concept. The next question was whether it could do something useful, not just something impressive.
I decided to push it toward a real task: find public contact signals for AI creators, newsletters, directories, and sponsorship pages. No LinkedIn. No private data. Public web pages only.
What I built on top of it
The custom script is called prospect_ai_creators.py. Its job is to build a CSV of business-facing contact signals for AI creators and advertising operators.
The pipeline runs in five steps:
- Start with targeted search queries — “AI creator” “sponsor” “contact”, “AI newsletter” “advertise”
- Search with DuckDuckGo, fall back to Bing if needed
- Skip unwanted domains — LinkedIn, Facebook, Wikipedia, large tech companies
- Fetch each page, parse with BeautifulSoup, extract public emails, contact pages, sponsor links, and social handles
- Score each prospect 0–100 based on AI and advertising keyword fit, then write to CSV
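Steps 2, 4 and 5 above can be sketched roughly as follows. This is not the code from prospect_ai_creators.py: it swaps BeautifulSoup for the standard library's html.parser so the example runs without extra installs, the search engines are passed in as plain callables rather than real DuckDuckGo or Bing clients, and every name, keyword and weight here is an assumption made for illustration.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LINK_HINTS = ("contact", "sponsor", "advertise")  # assumed link hints
KEYWORD_WEIGHTS = {  # assumed keywords and weights
    "ai": 20, "newsletter": 20, "sponsor": 20, "advertise": 20, "contact": 20,
}


def search_with_fallback(query, engines):
    """Try each search callable in order; return the first non-empty result."""
    for engine in engines:
        try:
            results = engine(query)
        except Exception:
            continue  # this engine failed, move on to the next one
        if results:
            return results
    return []


class PageCollector(HTMLParser):
    """Collect link targets and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_signals(html):
    """Pull public emails and contact/sponsor/advertise links from a page."""
    collector = PageCollector()
    collector.feed(html)
    text = " ".join(collector.text_parts)
    emails = set(EMAIL_RE.findall(text))
    pages = []
    for link in collector.links:
        lowered = link.lower()
        if lowered.startswith("mailto:"):
            # Drop the scheme and any ?subject=... suffix
            emails.add(link[len("mailto:"):].split("?")[0])
        elif any(hint in lowered for hint in LINK_HINTS):
            pages.append(link)
    return {"emails": sorted(emails), "pages": pages, "text": text}


def score_prospect(text, weights=KEYWORD_WEIGHTS):
    """Sum the weights of keywords found as whole words, clamped to 0-100."""
    lowered = text.lower()
    total = sum(
        weight for keyword, weight in weights.items()
        if re.search(rf"\b{re.escape(keyword)}\b", lowered)
    )
    return max(0, min(100, total))
```

Matching keywords on word boundaries keeps "ai" from firing on words like "email", which would otherwise inflate every score.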
Running it:
    .\.venv\Scripts\Activate.ps1
    python tools\prospect_ai_creators.py `
      --output ai_creator_prospects.csv `
      --limit 25
For targeted searches:
    python tools\prospect_ai_creators.py `
      --query '"AI ads" "sponsor" "contact"' `
      --query '"AI marketing newsletter" "advertise"' `
      --output ai_ad_prospects.csv `
      --limit 50 `
      --insecure

The --insecure flag works around local HTTPS interception, such as a proxy or antivirus that re-signs TLS traffic and causes certificate errors. I needed it on some networks.
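For illustration, here is one way a script could accept those four flags with argparse. The flag names come from the commands above. The defaults, the help text, and the idea that --insecure turns off certificate verification are assumptions, not something taken from prospect_ai_creators.py.

```python
import argparse
import ssl


def parse_args(argv=None):
    """Parse the four flags shown above (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Collect public contact signals")
    parser.add_argument("--query", action="append",
                        help="search query; repeat the flag to add more")
    parser.add_argument("--output", default="ai_creator_prospects.csv",
                        help="path of the CSV file to write")
    parser.add_argument("--limit", type=int, default=25,
                        help="maximum number of prospects to keep")
    parser.add_argument("--insecure", action="store_true",
                        help="skip TLS certificate verification")
    return parser.parse_args(argv)


def build_ssl_context(insecure):
    """Return a default TLS context, relaxed only when insecure is set."""
    context = ssl.create_default_context()
    if insecure:
        # Assumed behaviour: check_hostname must be disabled first,
        # otherwise setting CERT_NONE raises ValueError.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

Using action="append" is what lets --query appear more than once on the command line and arrive as a list.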
What went wrong first
The first output was bad.
The initial query set was too broad. The results included ISO standards pages, Google Cloud documentation, and Stanford HAI explainers. Technically scraped correctly. Completely useless as leads.
This is not a ScrapeGraphAI problem. It is a query problem. The tool extracts what the search finds. If the search is vague, the extraction is vague.
The fix: tighter query terms and stricter domain filters. Once the queries focused specifically on advertise pages, sponsor pages, and AI newsletter operators, the results changed.
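A domain filter of this kind can be sketched in a few lines. The blocklist below is a short illustrative sample and the function names are hypothetical; the real script's list and matching rules may differ.

```python
from urllib.parse import urlparse

# Illustrative sample only, not the script's actual blocklist
BLOCKED_DOMAINS = {"linkedin.com", "facebook.com", "wikipedia.org"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


def filter_urls(urls, blocked=BLOCKED_DOMAINS):
    """Keep only URLs whose host is not on the blocklist."""
    return [url for url in urls if not is_blocked(url, blocked)]
```

Comparing on the parsed hostname, rather than searching the whole URL string, avoids dropping a page just because a blocked name appears in its path or inside a longer, unrelated domain.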
What the refined output produced
The second and third CSV files were useful. Prospects included:
- AI Indigo
- AiTuts
- AI-Ready CMO
- PoweredbyAI
- WhatTheAI
- AIViewer
- Aitoonic
These had sponsor pages, advertise links, public emails, and newsletter audiences. Actual prospects, not documentation pages.
The larger run produced 130 rows focused on individual AI creators and studios. Each row includes names, role signals, creative areas, source links, outreach paths, and a confidence score.
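Writing rows like these to CSV could look like the sketch below. The column names are guesses based on the fields listed above, not the headers of the actual files, and sorting by confidence is an added convenience rather than something the script is known to do.

```python
import csv
import io

# Hypothetical column names inferred from the fields described above
FIELDNAMES = [
    "name", "role_signals", "creative_areas",
    "source_link", "outreach_path", "confidence",
]


def write_prospects(rows, handle):
    """Write prospect dicts to an open text handle, highest confidence first."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r["confidence"], reverse=True):
        writer.writerow(row)  # missing fields are written as empty cells


# Placeholder rows, not real prospects
buffer = io.StringIO()
write_prospects(
    [{"name": "Example A", "confidence": 40},
     {"name": "Example B", "confidence": 90}],
    buffer,
)
```

For a real file, open it with newline="" as the csv module documentation recommends, then pass the handle in the same way.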
Is it worth trying?
Yes, if you need to extract structured information from public web pages without writing and maintaining custom scrapers.
The useful part is not the basic demo. The useful part is that once setup is working, you can push it toward a real task quickly. The library handles the extraction logic. You define the problem.
The constraint that made this build work: keep the target narrow. Public contacts only. Specific query terms. Filtered domains. Without that constraint, the output volume goes up but the usefulness goes down.
Start with one clear question. Build the smallest query set that could answer it. Then refine from there.