Pattern: Web Scraper

Quick facts

  • Category: Scripts & Automation
  • Maturity: Adopt
  • Typical team size: 1-2 engineers
  • Typical timeline to MVP: 1-2 weeks
  • Last reviewed: 2026-05-02 by Architecture Team

1. Context

Use this pattern when:

  • You need structured data from a public website and no official API or data export exists
  • The data changes on a schedule (prices, listings, public records) and you need to track it over time
  • Volume is thousands to millions of pages per day — not a one-off manual copy-paste

Do NOT use this pattern when:

  • An official API exists — use it; scraping is always more brittle than an API
  • The site's Terms of Service explicitly prohibit automated access to the data you need — review ToS and robots.txt first
  • The data is behind authentication and belongs to a third party — this pattern does not cover authenticated scraping of private user data
  • You need sub-second freshness — scraping is not a streaming data source

2. Problem it solves

Valuable information is published on the web in HTML meant for human reading, not machine consumption. Converting that into queryable, structured data requires reliably fetching pages, extracting the right fields, handling rendering inconsistencies and anti-bot measures, storing raw copies for reprocessing, and scheduling repeated runs as the source changes. Doing this correctly is more work than it looks — this pattern captures the recurring decisions so you do not start from scratch each time.

3. Solution overview

System context (C4 Level 1)

```mermaid
flowchart LR
    Sched[Scheduler\nGitHub Actions / Prefect] -->|triggers| Scraper[Web Scraper]
    Scraper -->|HTTP requests| Web[Target Websites]
    Web -->|HTML / JSON responses| Scraper
    Scraper --> Raw[(Raw Store\nS3 HTML archive)]
    Scraper --> Structured[(Structured Store\nPostgreSQL)]
    Scraper --> Alerts[Alerting\nSlack / PagerDuty]
```

Container view (C4 Level 2)

```mermaid
flowchart TB
    subgraph Scheduling
        Cron[GitHub Actions Cron\nor Prefect Flow]
    end
    subgraph Scraper Process
        URLQueue[URL Queue / Seed List]
        Fetcher[HTTP Fetcher\nhttpx async or Playwright]
        RateLimiter[Rate Limiter\npoliteness delay + robots.txt]
        Parser[HTML Parser\nBeautifulSoup4 + lxml]
        Dedup[Deduplication\nhash of raw content]
    end
    subgraph Storage
        RawS3[(S3 raw zone\nHTML per URL + timestamp)]
        StructuredDB[(PostgreSQL\nstructured records)]
    end
    subgraph Ops
        DLQ[Failed URLs log\nS3 or DB table]
        Alert[Slack alert\non error rate threshold]
    end

    Cron --> URLQueue
    URLQueue --> RateLimiter
    RateLimiter --> Fetcher
    Fetcher -->|raw HTML| RawS3
    Fetcher -->|raw HTML| Parser
    Fetcher -->|4xx / 5xx| DLQ
    Parser --> Dedup
    Dedup --> StructuredDB
    DLQ --> Alert
```
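A minimal end-to-end sketch of the container view above (fetch with httpx, archive raw HTML to S3, parse with BeautifulSoup, dedupe on a content hash, collect failures for the failed-URLs log). The bucket name, seed URL, CSS selectors, and record fields are illustrative assumptions, not part of the pattern.

```python
# Minimal sketch of the container view above: fetch, archive raw HTML to S3,
# parse, dedupe on a content hash, and collect failures for the DLQ.
# Bucket name, seed URL, selectors, and record fields are illustrative assumptions.
import asyncio
import hashlib
from datetime import datetime, timezone

import boto3
import httpx
from bs4 import BeautifulSoup

RAW_BUCKET = "example-raw-zone"                      # assumed bucket name
SEED_URLS = ["https://example.com/listings?page=1"]  # assumed seed list

s3 = boto3.client("s3")


def archive_raw(url: str, html: str) -> str:
    """Write raw HTML to S3 keyed by date + content hash; return the hash."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    key = f"raw/{datetime.now(timezone.utc):%Y-%m-%d}/{digest}.html"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=html.encode("utf-8"),
                  Metadata={"source-url": url})
    return digest


def parse(html: str) -> list[dict]:
    """Extract structured records; selectors are placeholders for the real target."""
    soup = BeautifulSoup(html, "lxml")               # lxml backend, per the stack table
    return [
        {"title": item.select_one("h2").get_text(strip=True),
         "price": item.select_one(".price").get_text(strip=True)}
        for item in soup.select(".listing")
    ]


async def run(seen_hashes: set[str]) -> tuple[list[dict], list[dict]]:
    records, failed = [], []
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as client:
        for url in SEED_URLS:
            try:
                resp = await client.get(url)
                resp.raise_for_status()
            except httpx.HTTPError as exc:
                failed.append({"url": url, "error": str(exc)})  # stand-in for the DLQ
                continue
            digest = archive_raw(url, resp.text)     # always archive before parsing
            if digest in seen_hashes:                # skip unchanged content
                continue
            seen_hashes.add(digest)
            records.extend(parse(resp.text))
    return records, failed


if __name__ == "__main__":
    print(asyncio.run(run(seen_hashes=set())))
```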

4. Technology stack

| Layer | Primary choice | Alternatives | Notes |
|---|---|---|---|
| Language | Python 3.12+ with uv | Node.js (Puppeteer/Crawlee), Go (Colly) | Python has the richest scraping ecosystem; see ADR-0002 |
| HTTP client | httpx (async) | requests (sync), aiohttp | httpx supports async, HTTP/2, and connection pooling natively; use for static HTML sites |
| JavaScript rendering | Playwright | Selenium, Puppeteer, Splash | Playwright is faster and more reliable than Selenium; required when the target page uses client-side rendering |
| HTML parsing | BeautifulSoup4 + lxml backend | lxml direct, cssselect, parsel (Scrapy's parser) | lxml backend makes BS4 ~10× faster than the html.parser default; parsel if you prefer XPath |
| Crawl framework | Scrapy | crawlee (Node.js) | Scrapy for large-scale multi-page crawls with link following; skip for simple single-page jobs |
| Scheduling | GitHub Actions cron | Prefect, APScheduler | GitHub Actions is simplest for infrequent scrapes (hourly or slower); Prefect for complex dependency chains |
| Proxy rotation | Bright Data | Oxylabs, Smartproxy | Required for large-scale scraping where IP bans are likely; overkill for occasional small jobs |
| Raw storage | AWS S3 | Cloudflare R2, local filesystem | Always store the raw HTML before parsing — enables schema evolution and re-parsing without re-fetching |
| Structured storage | PostgreSQL | DuckDB, SQLite, MongoDB | Postgres for shared queryable output; DuckDB for local analytical workloads; SQLite for single-machine jobs |
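When the JavaScript-rendering row applies, a rendered fetch with Playwright looks roughly like the sketch below. The URL and the selector waited on are assumptions; the point is to wait for the XHR-populated element before reading the DOM.

```python
# Rough sketch of fetching a client-side-rendered page with Playwright,
# per the "JavaScript rendering" row above. URL and selector are assumptions.
import asyncio

from playwright.async_api import async_playwright


async def fetch_rendered(url: str, wait_selector: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        # Wait for the element that client-side code populates before extracting.
        await page.wait_for_selector(wait_selector, timeout=15_000)
        html = await page.content()
        await browser.close()
        return html


if __name__ == "__main__":
    html = asyncio.run(fetch_rendered("https://example.com/listings", ".listing"))
    print(len(html))
```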

5. Non-functional characteristics

| Concern | Profile |
|---|---|
| Scalability | Single-process async (httpx) handles ~100 concurrent requests comfortably. Scrapy scales to millions of pages/day with its built-in downloader middleware. Beyond that, parallelise across multiple worker containers partitioned by domain or URL range. |
| Availability target | Not a long-running service. Availability = "scheduled run completes and new records appear in the structured store within the SLA window". A 10% transient HTTP error rate from the target is normal and must not fail the whole run. |
| Latency target | Not latency-sensitive. Optimise for throughput (pages/hour) and politeness (requests/second per domain), not p95 response time. |
| Security posture | Outbound HTTP only — the scraper has no inbound surface. Risks: credentials for storage in environment (use Secrets Manager), accidental exfiltration of authenticated sessions, and legal risk from scraping prohibited data. Rotate proxy credentials on a schedule. |
| Data residency | Raw HTML is stored in S3; choose a region that matches your data classification requirements. Do not store PII from scraped pages without a legal basis. |
| Compliance fit | Legal basis for scraping varies by jurisdiction and site ToS. GDPR: do not store scraped personal data without a lawful basis. CFAA (US): do not scrape after receiving a cease-and-desist. Always check robots.txt and respect Crawl-delay directives. |
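The compliance row above requires checking robots.txt and honouring Crawl-delay. A minimal standard-library check might look like this; the user-agent string and example URL are assumptions.

```python
# Minimal robots.txt politeness check using the standard library, per the
# compliance row above. The user-agent string and URL are assumptions.
import random
import time
from urllib import robotparser
from urllib.parse import urlsplit

USER_AGENT = "example-scraper/1.0 (+https://example.com/contact)"  # assumed UA


def load_robots(page_url: str) -> robotparser.RobotFileParser:
    parts = urlsplit(page_url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp


def polite_sleep(rp: robotparser.RobotFileParser) -> None:
    # Honour Crawl-delay when declared; otherwise use random jitter
    # (the 0.5-3 s range echoed in the gotchas section below).
    time.sleep(rp.crawl_delay(USER_AGENT) or random.uniform(0.5, 3.0))


if __name__ == "__main__":
    url = "https://example.com/listings"
    rp = load_robots(url)
    if not rp.can_fetch(USER_AGENT, url):
        raise SystemExit("robots.txt disallows this URL for our user agent")
    polite_sleep(rp)
```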

6. Cost ballpark

Indicative monthly USD cost. Proxy costs dominate at scale; compute is cheap.

| Scale | Pages / day | Monthly cost | Cost drivers |
|---|---|---|---|
| Small | < 10,000 | $5 - $50 | GitHub Actions free tier, S3 storage, no proxies needed |
| Medium | 10k - 1M | $100 - $600 | Proxy bandwidth, ECS compute, S3 + Postgres storage and transfer |
| Large | 1M+ | $800 - $5,000 | Dedicated proxy plan (Bright Data ~$500/TB), large ECS fleet, significant S3 storage |
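A back-of-envelope check of why proxy bandwidth dominates the Large tier. The ~$500/TB figure comes from the table above; the average raw page size is an assumption.

```python
# Back-of-envelope check of the "Large" row: proxy bandwidth dominates.
# Average page size is an assumption; ~$500/TB is from the table above.
pages_per_day = 1_000_000
avg_page_kb = 200                      # assumed average raw HTML size
proxy_usd_per_tb = 500

tb_per_month = pages_per_day * 30 * avg_page_kb / 1e9   # KB -> TB (decimal)
proxy_cost = tb_per_month * proxy_usd_per_tb
print(f"{tb_per_month:.1f} TB/month -> ~${proxy_cost:,.0f} in proxy bandwidth")
# ~6.0 TB/month -> ~$3,000/month, which is why the Large tier tops out near $5,000.
```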

7. LLM-assisted development fit

| Aspect | Rating | Notes |
|---|---|---|
| Initial scraper scaffolding | ★★★★★ | Excellent — CSS/XPath selectors, pagination loops, and retry logic are generated well. |
| CSS and XPath selector writing | ★★★★ | Good; verify against the actual rendered DOM with DevTools. Selectors break on site redesigns regardless of how they were written. |
| Playwright async scripts | ★★★★ | Solid for standard navigation and form interaction; struggles with non-standard UI patterns (canvas, shadow DOM). |
| Anti-bot evasion logic | ★★ | Suggests plausible but brittle tricks (user-agent rotation, fake mouse moves). These require constant maintenance and are an arms race. Invest in legitimate proxy infrastructure instead. |
| Legal / ethical review | | Never outsource ToS interpretation or data privacy decisions to an LLM. |

Recommended workflow: Generate the fetcher and parser skeleton, then write the deduplication and re-parse-from-raw logic by hand. Store raw HTML from day one — you will need to re-parse.
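A rough sketch of the two hand-written pieces recommended above: a content hash for deduplication and a re-parse pass over the raw S3 archive so a fixed parser never triggers a re-fetch. The bucket name, prefix layout, and injected parse function are assumptions carried over from the earlier sketch.

```python
# Hand-written pieces recommended above: content-hash dedupe and re-parsing
# from the raw S3 archive without re-fetching. Bucket, prefix layout, and the
# injected parse() callable are assumptions carried over from earlier sketches.
import hashlib
from typing import Callable

import boto3

RAW_BUCKET = "example-raw-zone"   # assumed bucket name
s3 = boto3.client("s3")


def content_hash(html: str) -> str:
    """Stable key for 'have we already stored this exact page content?'"""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def reparse_raw_zone(prefix: str, parse: Callable[[str], list[dict]]) -> list[dict]:
    """Re-run a (fixed) parser over archived HTML instead of re-fetching."""
    records: list[dict] = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=RAW_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=RAW_BUCKET, Key=obj["Key"])["Body"].read()
            records.extend(parse(body.decode("utf-8")))
    return records
```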

8. Reference implementations

  • Public reference: scrapy/quotesbot — canonical Scrapy tutorial spider, good starting structure
  • Public reference: scrapy/scrapy — the Scrapy framework itself; examples/ and tests/ show real spider architectures including pagination, item pipelines, and middleware
  • Public reference: microsoft/playwright-python — examples — official Playwright Python examples including async patterns
  • Public reference: lorien/awesome-web-scraping — curated list of Python scraping libraries with descriptions
  • Internal case study: Add your anonymised internal example here

9. Known risks & gotchas

  • Selectors break silently on site redesigns — The target website changes its HTML structure; your parser returns empty results or wrong data with no error. Mitigation: assert that key fields are non-null after each parse; alert on a sudden drop in extracted-record count; store raw HTML so you can re-parse when you fix the selector (see the validation sketch after this list).
  • Politeness violations trigger IP bans — Sending requests too fast gets your IP blocked and may constitute a DoS. Mitigation: respect robots.txt crawl-delay, add a random jitter between requests (0.5–3 s per domain), never parallelise more than 2–3 concurrent requests to the same host.
  • JavaScript-rendered content silently missing — You scrape the HTML but the data you need is loaded by a subsequent XHR. Mitigation: run the first request without Playwright; if the target field is missing, switch to Playwright and add a wait_for_selector on the element before extracting.
  • Memory leak in long-running async loops — An async httpx/Playwright process that never exits accumulates sessions and DOM objects. Mitigation: process URLs in bounded batches (e.g., 1,000 URLs per process invocation), then exit cleanly; do not run as a single infinite loop.
  • Re-scraping rate after schema change is expensive — If you did not store raw HTML, a parser bug or schema change means re-fetching everything. Mitigation: always write raw HTML to S3 before parsing. Storage is cheap; bandwidth and proxy costs are not.
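A minimal sketch of the parse-validation mitigation from the first gotcha above: assert that key fields are non-null and flag a sudden drop in extracted-record count. The required field names and the 50% drop threshold are assumptions.

```python
# Sketch of the mitigation in the first gotcha above: check key fields are
# non-null after each parse and flag a sudden drop in extracted-record count.
# The field names and the 50% drop threshold are assumptions.
REQUIRED_FIELDS = ("title", "price")


def validate_records(records: list[dict], previous_count: int) -> list[str]:
    problems = []
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if not rec.get(f)]
        if missing:
            problems.append(f"record {i} missing {missing}")
    if previous_count and len(records) < 0.5 * previous_count:
        problems.append(
            f"record count dropped from {previous_count} to {len(records)}"
            " - possible silent selector breakage"
        )
    return problems   # non-empty -> raise the Slack/PagerDuty alert from the Ops view
```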