# Intelligent Fetch Strategy Router

## Problem Statement

When extracting content from arbitrary web URLs, you face a dilemma:

- **Simple HTTP GET** (wget, requests): Fast (500ms), cheap, but fails on JavaScript-heavy sites
- **Browser automation** (Playwright, Selenium): Reliable, handles JS, but slow (5-10s) and expensive

**Challenge**: Determine which approach to use WITHOUT trying both.

**Requirements**:

- ✅ Minimize code complexity (no trying 10 different strategies)
- ✅ Minimize time (make one smart decision)
- ✅ Minimize cost (avoid unnecessary browser launches)
- ✅ Be reliable across diverse websites (news, SPAs, videos, paywalls)
- ✅ Handle edge cases intelligently (distinguish "article about CAPTCHAs" from "actual CAPTCHA page")

---

## The Solution: Single-Pass, Multi-Signal Reconnaissance

### Core Concept

Perform **one cheap HTTP GET request**, extract multiple signals from it, and route through a **tiered decision engine**:

1. **Fast path**: Simple heuristics handle obvious cases (70-80% of URLs)
2. **Smart path**: LLM handles ambiguous cases (10-20% of URLs)
3. **Lookup path**: Known domains bypass both (10% of URLs)

```
URL Input
    ↓
[Known Domain Lookup] ───→ (cache hit) ───→ Use cached strategy
    ↓ (cache miss)
[Single Cheap HTTP GET]
    ↓
[Extract Multi-Signal Payload]
    ↓
[Heuristic Gatekeeper] ──→ (clear signal) ──→ wget or browser
    ↓ (ambiguous)
[LLM Arbiter] ──→ (judgment) ──→ wget or browser
    ↓
Execute chosen strategy
    ↓
[Self-Correction Layer]
```

---

## Step 1: The Universal Cheap Fetch

For any new URL, always start here:

### Action

Execute a simple HTTP GET request (like `wget` or Python `requests`)

- **Timeout**: 3 seconds max
- **Follow redirects**: Yes (up to 3)
- **User-Agent**: Generic browser UA (avoid immediate blocks)

### Collect Two Things

1. **Full response headers**
2. **Raw HTML content** (entire response body)

**Cost**: ~500ms, negligible compute, no browser overhead

---

## Step 2: Extract Multi-Signal Payload

From the cheap fetch response, extract these signals:

### Signal Set A: HTTP Metadata

- **Status code**: 200, 403, 401, 503, etc.
- **Content-Type header**: text/html, application/json, etc.
- **Content-Length header**: Expected size
- **Server header**: nginx, cloudflare, etc.
- **Actual body size**: Compare to Content-Length

### Signal Set B: HTML Structure

Parse the raw HTML (simple regex/string matching, no full DOM parsing):

- **Text content volume**: Strip `