# Intelligent Fetch Strategy Router

## Problem Statement

When extracting content from arbitrary web URLs, you face a dilemma:

- **Simple HTTP GET** (wget, requests): Fast (500ms), cheap, but fails on JavaScript-heavy sites
- **Browser automation** (Playwright, Selenium): Reliable, handles JS, but slow (5-10s) and expensive

**Challenge**: Determine which approach to use WITHOUT trying both.

**Requirements**:
- ✅ Minimize code complexity (no trying 10 different strategies)
- ✅ Minimize time (make one smart decision)
- ✅ Minimize cost (avoid unnecessary browser launches)
- ✅ Be reliable across diverse websites (news, SPAs, videos, paywalls)
- ✅ Handle edge cases intelligently (distinguish "article about CAPTCHAs" from "actual CAPTCHA page")

---

## The Solution: Single-Pass, Multi-Signal Reconnaissance

### Core Concept

Perform **one cheap HTTP GET request**, extract multiple signals from it, and route through a **tiered decision engine**:

1. **Lookup path**: Known domains skip the probe entirely (~10% of URLs)
2. **Fast path**: Simple heuristics handle obvious cases (70-80% of URLs)
3. **Smart path**: LLM handles ambiguous cases (10-20% of URLs)

```
URL Input
    ↓
[Known Domain Lookup] ───→ (cache hit) ───→ Use cached strategy
    ↓ (cache miss)
[Single Cheap HTTP GET]
    ↓
[Extract Multi-Signal Payload]
    ↓
[Heuristic Gatekeeper] ──→ (clear signal) ──→ wget or browser
    ↓ (ambiguous)
[LLM Arbiter] ──→ (judgment) ──→ wget or browser
    ↓
Execute chosen strategy
    ↓
[Self-Correction Layer]
```

---

## Step 1: The Universal Cheap Fetch

For any new URL, always start here:

### Action
Execute a simple HTTP GET request (for example with `wget` or Python `requests`):
- **Timeout**: 3 seconds max
- **Follow redirects**: Yes (up to 3)
- **User-Agent**: Generic browser UA (avoid immediate blocks)

### Collect Two Things
1. **Full response headers**
2. **Raw HTML content** (entire response body)

**Cost**: ~500ms, negligible compute, no browser overhead
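
A minimal sketch of this step, using only the Python standard library rather than `requests` so it has no dependencies. The names `cheap_fetch`, `CheapFetchResult`, `LimitedRedirects` and the exact User-Agent string are illustrative assumptions, not part of the design above.

```python
import urllib.error
import urllib.request
from typing import NamedTuple

# Assumed generic browser User-Agent; any current browser string would do.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


class CheapFetchResult(NamedTuple):
    status: int
    headers: dict  # header names lower-cased; duplicates keep the last value
    html: str


class LimitedRedirects(urllib.request.HTTPRedirectHandler):
    max_redirections = 3  # the stdlib default is 10


def build_request(url: str) -> urllib.request.Request:
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def cheap_fetch(url: str, timeout: float = 3.0) -> CheapFetchResult:
    """One GET. 4xx/5xx responses are returned, not raised: they are signals.

    Note: the timeout applies to each blocking socket operation, not to the
    whole transfer. Network errors (DNS, timeouts) still raise URLError.
    """
    opener = urllib.request.build_opener(LimitedRedirects)
    try:
        resp = opener.open(build_request(url), timeout=timeout)
        status = resp.status
    except urllib.error.HTTPError as err:
        resp, status = err, err.code  # HTTPError carries headers and body too
    try:
        body = resp.read()
        headers = {k.lower(): v for k, v in resp.headers.items()}
    finally:
        resp.close()
    return CheapFetchResult(status, headers, body.decode("utf-8", errors="replace"))
```

Returning error responses instead of raising keeps a 403 challenge page available to the later signal-extraction step.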

---

## Step 2: Extract Multi-Signal Payload

From the cheap fetch response, extract these signals:

### Signal Set A: HTTP Metadata
- **Status code**: 200, 403, 401, 503, etc.
- **Content-Type header**: text/html, application/json, etc.
- **Content-Length header**: Expected size
- **Server header**: nginx, cloudflare, etc.
- **Actual body size**: Compare to Content-Length

### Signal Set B: HTML Structure
Parse the raw HTML (simple regex/string matching, no full DOM parsing):

- **Text content volume**: Strip `<script>`, `<style>`, measure remaining text
- **SPA framework signatures**: 
  - `<div id="root">`, `<div id="app">`, `<div id="__next">`
  - `__NEXT_DATA__`, `__NUXT__`, `window.__INITIAL_STATE__`
- **Script-to-content ratio**: Count `<script>` tags vs text paragraphs
- **Structural elements**: Count of `<p>`, `<h1>-<h6>`, `<article>` tags
- **Empty body**: `<body>` contains fewer than 200 characters of text after stripping scripts and styles

### Signal Set C: Anti-Bot Indicators
String matching in HTML + headers:

- **Blocking keywords in title/headers**:
  - "Cloudflare", "Checking your browser", "Just a moment"
  - "hCaptcha", "reCAPTCHA", "Are you a robot"
  - "Access Denied", "Enable JavaScript"
- **Challenge page patterns**:
  - `<noscript>` warnings
  - `cf-browser-verification` class names
  - `grecaptcha` or `hcaptcha` script sources
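
The three signal sets can be gathered with plain regex and substring checks, roughly as follows. This is a sketch: the function name `extract_signals`, the dictionary keys, and the exact marker lists are assumptions chosen for illustration, and `body_bytes` is only an approximation of the transferred size because it is measured on the decoded text.

```python
import re

SPA_MARKERS = ('id="root"', 'id="app"', 'id="__next"',
               "__next_data__", "__nuxt__", "window.__initial_state__")
BLOCK_PHRASES = ("checking your browser", "just a moment", "are you a robot",
                 "access denied", "enable javascript")
CHALLENGE_MARKERS = ("cf-browser-verification", "grecaptcha", "hcaptcha")

_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
_TAG = re.compile(r"<[^>]+>")
_TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.S)
_CAPTCHA_ATTR = re.compile(r"\b(?:class|id)\s*=\s*[\"'][^\"']*captcha")


def visible_text(html: str) -> str:
    """Rough text content: drop <script>/<style> blocks, then all tags."""
    without_code = _SCRIPT_STYLE.sub(" ", html)
    return re.sub(r"\s+", " ", _TAG.sub(" ", without_code)).strip()


def extract_signals(status: int, headers: dict, html: str) -> dict:
    """headers is expected to use lower-cased names."""
    lower = html.lower()
    title_match = _TITLE.search(lower)
    title = title_match.group(1).strip() if title_match else ""
    declared = headers.get("content-length", "")
    title_blocked = any(p in title for p in BLOCK_PHRASES)
    captcha_attr = bool(_CAPTCHA_ATTR.search(lower))
    challenge = any(m in lower for m in CHALLENGE_MARKERS)
    return {
        # Signal Set A: HTTP metadata
        "status": status,
        "content_type": headers.get("content-type", ""),
        "server": headers.get("server", ""),
        "declared_length": int(declared) if declared.isdigit() else None,
        "body_bytes": len(html.encode("utf-8")),
        # Signal Set B: HTML structure
        "text_length": len(visible_text(html)),
        "script_count": len(re.findall(r"<script\b", lower)),
        "paragraph_count": len(re.findall(r"<p\b", lower)),
        "has_spa_signature": any(m in lower for m in SPA_MARKERS),
        # Signal Set C: anti-bot indicators
        "title_has_block_phrase": title_blocked,
        "captcha_in_attributes": captcha_attr,
        "has_noscript": "<noscript" in lower,
        "has_antibot_indicator": title_blocked or captcha_attr or challenge,
    }
```

`has_noscript` is kept separate from `has_antibot_indicator` here because `<noscript>` blocks also appear on many ordinary pages.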

---

## Step 3: Heuristic Gatekeeper (Fast Path)

Apply simple rules to the signal payload. These handle **obvious cases instantly**:

### Rule 1: Empty Body Detection (SPA Signature)
```
IF:
  - Text content after stripping <script>/<style> < 200 chars
  AND
  - Contains SPA framework signature (id="root", __NEXT_DATA__, etc.)
THEN:
  → Decision: REQUIRES BROWSER
  → Reason: "Client-side rendered application"
```

### Rule 2: Rich Static Content
```
IF:
  - Text content after stripping scripts > 2KB
  AND
  - Contains 5+ paragraph tags with content
  AND
  - NO anti-bot indicators
THEN:
  → Decision: WGET SUFFICIENT
  → Reason: "Server-rendered content present"
```

### Rule 3: Obvious Block Page
```
IF:
  - Title contains blocking keywords (the Server header alone is not
    decisive: many unblocked sites are also served via Cloudflare)
  OR
  - Body < 1KB AND contains "captcha" in class/id attributes
THEN:
  → Decision: REQUIRES BROWSER
  → Reason: "Detected challenge page"
```

### Rule 4: HTTP Status Blocks
```
IF:
  - Status code in [403, 401, 429]
  AND
  - Body size < 5KB
THEN:
  → Decision: REQUIRES BROWSER (with anti-detection)
  → Reason: "Anti-bot HTTP response"
```

**Exit Condition**: If ANY rule triggers with high confidence → execute decision, skip LLM

**Coverage**: ~70-80% of URLs will match these rules clearly
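
The four rules could be expressed as one small function. This is a sketch under assumptions: the signal keys, the `browser_stealth` label for "browser with anti-detection", and the order in which the rules are checked (blocks first, static content last) are illustrative choices, not something the rules above prescribe.

```python
from typing import Optional, Tuple

Decision = Tuple[str, str]  # (strategy, reason)


def heuristic_decision(s: dict) -> Optional[Decision]:
    """Return (strategy, reason) for clear cases, or None when ambiguous."""
    # Rule 4: anti-bot HTTP status with a small body.
    if s["status"] in (401, 403, 429) and s["body_bytes"] < 5 * 1024:
        return ("browser_stealth", "Anti-bot HTTP response")
    # Rule 3: obvious block page.
    if s["title_has_block_phrase"] or (
        s["body_bytes"] < 1024 and s["captcha_in_attributes"]
    ):
        return ("browser", "Detected challenge page")
    # Rule 1: near-empty body plus an SPA framework signature.
    if s["text_length"] < 200 and s["has_spa_signature"]:
        return ("browser", "Client-side rendered application")
    # Rule 2: rich static content with no anti-bot indicators.
    if (s["text_length"] > 2048 and s["paragraph_count"] >= 5
            and not s["has_antibot_indicator"]):
        return ("wget", "Server-rendered content present")
    return None  # ambiguous: hand off to the LLM arbiter


# Usage with a hand-written signal payload:
signals = {
    "status": 200, "body_bytes": 50_000, "text_length": 40,
    "paragraph_count": 0, "has_spa_signature": True,
    "title_has_block_phrase": False, "captcha_in_attributes": False,
    "has_antibot_indicator": False,
}
print(heuristic_decision(signals))
```

Checking the blocking rules before the static-content rule means a challenge page with a lot of boilerplate text is not mistaken for an article.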

---

## Step 4: LLM Arbiter (Smart Path)

When heuristics are **ambiguous** (e.g., a moderate amount of text alongside several scripts, or the word "captcha" appearing in an unclear context), escalate to the LLM.

### The LLM Dossier

Don't send the entire HTML. Curate a compact, information-dense payload:

```json
{
  "url": "https://example.com/article",
  "http_status": 200,
  "headers": {
    "server": "nginx",
    "content-type": "text/html; charset=utf-8"
  },
  "html_head": "<title>Understanding CAPTCHAs</title><meta name='description'...>",
  "html_body_start": "<header><nav>...</nav></header><article><h1>...",
  "heuristic_signals": {
    "text_content_length": 1200,
    "script_count": 8,
    "has_spa_signature": false,
    "found_blocker_keywords": true,
    "blocker_context": "word 'captcha' found in <h1> tag"
  }
}
```

**Size**: Typically 2-4KB (HTML head + first 2KB of body + signals)
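
One way to assemble such a dossier from the cheap-fetch result is sketched below. The function names and the `body_limit` parameter are assumptions. The limit is counted in characters rather than bytes, and the `<head>` is passed through uncapped, so pages with large inline styles may need an additional cap.

```python
import json
import re

_HEAD = re.compile(r"<head\b[^>]*>(.*?)</head\s*>", re.I | re.S)
_BODY_OPEN = re.compile(r"<body\b[^>]*>", re.I)
_KEPT_HEADERS = ("server", "content-type")


def build_dossier(url, status, headers, html, signals, body_limit=2048):
    """Keep the <head>, the start of the <body>, and the precomputed signals."""
    head_match = _HEAD.search(html)
    body_match = _BODY_OPEN.search(html)
    if body_match:
        body_start = body_match.end()
    elif head_match:
        body_start = head_match.end()
    else:
        body_start = 0  # no recognisable structure: take the document start
    return {
        "url": url,
        "http_status": status,
        "headers": {k: headers[k] for k in _KEPT_HEADERS if k in headers},
        "html_head": head_match.group(1).strip() if head_match else "",
        "html_body_start": html[body_start:body_start + body_limit],
        "heuristic_signals": signals,
    }


def serialize_dossier(dossier: dict) -> str:
    return json.dumps(dossier, ensure_ascii=False)
```

Dropping headers such as `set-cookie` keeps the payload small and avoids sending session data to the LLM.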

### The Prompt

```
You are an expert web extraction engine. Analyze this HTTP response and classify the webpage.

URL: {url}
Status: {status}
Headers: {key headers}

HTML Head: {full <head> section}
HTML Body (first 2KB): {truncated body}

Heuristic Analysis: {signal summary}

Classify into ONE of these categories:

1. STATIC: Content is present in raw HTML. Simple HTTP fetch is sufficient.
2. DYNAMIC: HTML is a shell requiring JavaScript execution to render content.
3. BLOCKER: Interstitial page (CAPTCHA, paywall, consent wall) blocking content access.

Respond ONLY in JSON format:
{
  "classification": "STATIC" | "DYNAMIC" | "BLOCKER",
  "confidence": 0.0-1.0,
  "reasoning": "Brief explanation (1-2 sentences)"
}

Examples:
- Article with word "CAPTCHA" in headline → STATIC (content is there)
- Cloudflare challenge page mentioning captcha → BLOCKER (must bypass)
- React app with empty body → DYNAMIC (needs JS execution)
```

### LLM Decision Mapping

```
LLM Response: STATIC → Use wget result (it's already fetched)
LLM Response: DYNAMIC → Retry with browser (Playwright)
LLM Response: BLOCKER → Retry with browser + anti-detection
```
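
A sketch of applying this mapping to the arbiter's JSON reply. Two things here are assumptions rather than part of the design: the `min_confidence` cutoff of 0.6, and the choice to fall back to the plain browser whenever the reply is malformed or low-confidence. The `browser_stealth` label stands for "browser + anti-detection".

```python
import json

STRATEGY_BY_CLASS = {
    "STATIC": "wget",
    "DYNAMIC": "browser",
    "BLOCKER": "browser_stealth",
}


def strategy_from_llm(raw_reply: str, min_confidence: float = 0.6) -> str:
    """Map the arbiter's JSON reply to a strategy name."""
    try:
        reply = json.loads(raw_reply)
        label = reply["classification"]
        confidence = float(reply["confidence"])
    except (ValueError, KeyError, TypeError):
        return "browser"  # unparseable reply: take the slower but safer path
    if label not in STRATEGY_BY_CLASS or confidence < min_confidence:
        return "browser"
    return STRATEGY_BY_CLASS[label]
```

Defaulting to the browser costs time on a bad reply but avoids silently returning an empty wget result.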

**Why This Works**:
- **Contextual understanding**: LLM sees `<h1>Understanding CAPTCHAs</h1>` inside `<article>` tags → recognizes it's content, not a block
- **Structural awareness**: LLM recognizes SPA patterns even from unfamiliar frameworks
- **Flexible**: Adapts to new blocking techniques without code changes

---

## Step 5: Domain Lookup Table (Optional Fast Path)

Maintain a small cache of **known domain patterns**:

```json
{
  "youtube.com": {
    "strategy": "browser",
    "reason": "Always requires JS player",
    "confidence": 1.0,
    "last_verified": "2025-11-15"
  },
  "wikipedia.org": {
    "strategy": "wget",
    "reason": "Server-side rendered",
    "confidence": 1.0
  }
}
```

**Usage**:
- Check domain before cheap fetch (saves 500ms)
- Treat as **hint, not law**: If strategy fails, fall through to next step
- Populate from successful extractions: after 10+ successful wget fetches from `example.com`, cache it
- Auto-expire: Remove entries older than 30 days (sites change)

**Coverage**: ~10% of URLs (common domains visited repeatedly)
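
The usage rules above (promote after 10 successes, expire after 30 days, treat as a hint) might look like this in memory. The class and method names are hypothetical, and this sketch matches domains exactly, so `www.youtube.com` and `youtube.com` would be separate entries unless the caller normalizes them first.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)
PROMOTE_AFTER = 10


class DomainCache:
    def __init__(self):
        self._entries = {}  # domain -> (strategy, last_verified)
        self._streaks = {}  # (domain, strategy) -> success count

    def lookup(self, domain: str, today: date):
        """Return the cached strategy, or None on a miss or an expired entry."""
        entry = self._entries.get(domain)
        if entry is None:
            return None
        strategy, last_verified = entry
        if today - last_verified > MAX_AGE:
            del self._entries[domain]  # auto-expire: sites change
            return None
        return strategy

    def record_success(self, domain: str, strategy: str, today: date) -> None:
        key = (domain, strategy)
        self._streaks[key] = self._streaks.get(key, 0) + 1
        if self._streaks[key] >= PROMOTE_AFTER:
            self._entries[domain] = (strategy, today)

    def record_failure(self, domain: str, strategy: str) -> None:
        """A hint that failed is dropped so the next request re-probes."""
        self._streaks.pop((domain, strategy), None)
        entry = self._entries.get(domain)
        if entry is not None and entry[0] == strategy:
            del self._entries[domain]
```

Passing `today` explicitly keeps the expiry logic easy to test without patching the clock.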

---

## Step 6: Self-Correction Layer

The system must **learn from failures**:

### Failure Detection

After executing the chosen strategy (wget or browser), check if extraction succeeds:

```
IF:
  - Chosen strategy was "wget"
  AND
  - Content extraction fails (empty text, no entities, no useful data)
THEN:
  → Automatic escalation: Retry with browser
  → Log mismatch: "Heuristics predicted wget OK, but failed"
```
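
A sketch of this escalation, with the fetchers and the usefulness check injected as callables so the logic stays independent of any particular HTTP or browser library. The function name, the `fetchers` mapping, and the `is_useful` predicate are hypothetical.

```python
def fetch_with_escalation(url, strategy, fetchers, is_useful, mismatch_log):
    """Run the chosen strategy; if wget yields nothing useful, retry with the browser.

    fetchers:  dict mapping a strategy name to a callable(url) -> content
    is_useful: callable(content) -> bool, the extraction validation check
    Returns (strategy_that_worked, content) or raises RuntimeError.
    """
    content = fetchers[strategy](url)
    if is_useful(content):
        return strategy, content
    if strategy == "wget":
        mismatch_log.append({
            "url": url,
            "predicted": "wget",
            "note": "Heuristics predicted wget OK, but failed",
        })
        content = fetchers["browser"](url)
        if is_useful(content):
            return "browser", content
    raise RuntimeError(f"extraction failed for {url}")


# Usage with stand-in fetchers:
log = []
fetchers = {"wget": lambda u: "", "browser": lambda u: "rendered text " * 20}
used, content = fetch_with_escalation(
    "https://example.com", "wget", fetchers,
    is_useful=lambda text: len(text.strip()) >= 100,
    mismatch_log=log,
)
print(used, len(log))
```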

### Learning Loop

```
Store tuple: (signal_payload, predicted_strategy, actual_strategy_that_worked)

After N mismatches for similar signal patterns:
  → Update heuristic thresholds
  → Add to domain lookup table
  → Flag for manual review (potential new pattern)
```
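
"Similar signal patterns" needs some concrete definition before mismatches can be counted. One simple option, sketched here, is to bucket a few signals into a coarse key. The bucketing scheme, the signal keys, and `REVIEW_AFTER = 5` as the value of N are all assumptions.

```python
from collections import Counter

REVIEW_AFTER = 5  # assumed value of N


def pattern_key(signals: dict) -> tuple:
    """Coarse bucket so that similar signal payloads share one key."""
    return (
        signals["has_spa_signature"],
        min(signals["text_length"] // 1000, 5),  # thousand-character buckets, capped
        min(signals["script_count"] // 5, 4),    # groups of five scripts, capped
    )


class MismatchLog:
    def __init__(self):
        self.records = []        # (signals, predicted, worked) tuples
        self.counts = Counter()  # pattern key -> number of mismatches

    def record(self, signals: dict, predicted: str, worked: str) -> bool:
        """Store the tuple; return True exactly when a pattern reaches N mismatches."""
        self.records.append((signals, predicted, worked))
        if predicted == worked:
            return False
        key = pattern_key(signals)
        self.counts[key] += 1
        return self.counts[key] == REVIEW_AFTER
```

A `True` return is the point at which thresholds would be revisited or the pattern flagged for manual review.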

### Circuit Breaker

If a domain consistently fails both strategies:
```
After 3 consecutive failures on example.com:
  → Mark domain as "problematic"
  → Return error immediately on future requests
  → Suggest manual investigation
```
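
A minimal in-memory sketch of this breaker. The class name and methods are hypothetical, and it deliberately has no cool-down: once open, a domain stays blocked until a success is recorded for it, which in practice would happen after a manual retry.

```python
class CircuitBreaker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._failures = {}  # domain -> consecutive failure count

    def is_open(self, domain: str) -> bool:
        """True when the domain is marked problematic and should fail fast."""
        return self._failures.get(domain, 0) >= self.threshold

    def record(self, domain: str, success: bool) -> None:
        if success:
            self._failures.pop(domain, None)  # any success resets the streak
        else:
            self._failures[domain] = self._failures.get(domain, 0) + 1
```

A failure here means that both wget and the browser retry failed for the request, not that a single strategy failed.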

---

## Complete Decision Flow

```
┌─────────────────────────────────────────┐
│         URL Input                       │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│  Check Domain Lookup Table              │
│  - Cached strategy for known domains    │
└────┬───────────────────────────┬────────┘
     │ (miss)                    │ (hit)
     ▼                           ▼
┌─────────────────────┐    ┌──────────────┐
│  Single HTTP GET    │    │ Use Cached   │
│  - 3s timeout       │    │ Strategy     │
│  - Collect headers  │    └──────┬───────┘
│  - Collect HTML     │           │
└────────┬────────────┘           │
         │                        │
         ▼                        │
┌─────────────────────────────────┐      │
│  Extract Multi-Signal Payload   │      │
│  - HTTP metadata                │      │
│  - HTML structure               │      │
│  - Anti-bot indicators          │      │
└────────┬────────────────────────┘      │
         │                               │
         ▼                               │
┌─────────────────────────────────────┐  │
│  Heuristic Gatekeeper               │  │
│  - Empty body check                 │  │
│  - Rich content check               │  │
│  - Obvious blocker check            │  │
│  - HTTP status check                │  │
└────┬──────────────────────┬─────────┘  │
     │ (clear)              │ (ambiguous)│
     ▼                      ▼             │
┌─────────┐         ┌──────────────────┐ │
│ Execute │         │  LLM Arbiter     │ │
│ Strategy│◄────────│  - Contextual    │ │
└────┬────┘         │    analysis      │ │
     │              │  - Classification│ │
     │              └──────────────────┘ │
     │                                   │
     ▼                                   │
┌─────────────────────────────────────┐  │
│  Extraction Layer                   │◄─┘
│  - Parse with chosen strategy       │
└────────┬────────────────────────────┘
         │ (success)    │ (failure)
         ▼              ▼
   [Complete]   ┌──────────────────┐
                │ Self-Correction  │
                │ - Retry browser  │
                │ - Log mismatch   │
                │ - Update cache   │
                └──────────────────┘
```

---

## Performance Characteristics

### Timing Breakdown

**Fast path** (70-80% of URLs):
- Domain lookup hit: **~1ms**
- Cheap fetch + heuristics: **~500ms**

**Smart path** (10-20% of URLs):
- Cheap fetch: **500ms**
- LLM call: **1-2 seconds**
- Total: **1.5-2.5 seconds** (still no browser launched)

**Browser escalation** (20-30% of URLs):
- Cheap fetch: **500ms**
- Heuristics/LLM: **0-2 seconds**
- Browser launch + page load: **5-10 seconds**
- Total: **5.5-12.5 seconds**

**Worst case** (ambiguous + failed prediction):
- Initial attempt: **2.5 seconds**
- Browser retry: **10 seconds**
- Total: **12.5 seconds** (but learns and caches for next time)

### Cost Breakdown

**Per URL (cold cache)**:
- HTTP request: $0.000001 (negligible)
- LLM call (20% of URLs): $0.001-0.002
- Browser (30% of URLs): CPU time only
- **Average**: ~$0.0002-0.0004/URL (dominated by the LLM share: 20% × $0.001-0.002)

**Per URL (warm cache)**:
- Domain lookup: $0 (DB query)
- HTTP request (if needed): $0.000001
- **Average**: ~$0.000001/URL
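
The cold-cache average follows from weighting the LLM price by the share of URLs that reach it. The short calculation below uses only the figures listed above and treats browser CPU time as having no direct per-URL price, which is an assumption.

```python
HTTP_COST = 0.000001            # every cold-cache URL pays for one GET
LLM_SHARE = 0.20                # fraction of URLs escalated to the LLM
LLM_COST_RANGE = (0.001, 0.002)


def average_cost(llm_cost: float) -> float:
    """Expected cost per URL: HTTP always, LLM only for the escalated share."""
    return HTTP_COST + LLM_SHARE * llm_cost


low, high = (average_cost(c) for c in LLM_COST_RANGE)
print(f"${low:.6f} to ${high:.6f} per URL")
# prints "$0.000201 to $0.000401 per URL"
```

The average therefore lands between about $0.0002 and $0.0004 per URL, and almost all of it comes from the LLM calls.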

---

## Code Minimalism Principles

This entire system requires **minimal code**:

### Core Components (4 files)

1. **`cheap_fetcher.py`** (~50 lines)
   - Single function: `fetch_with_timeout(url) -> (headers, html)`

2. **`signal_extractor.py`** (~100 lines)
   - Simple regex/string matching on HTML
   - Returns structured signal dict

3. **`heuristic_engine.py`** (~80 lines)
   - 4 if/else rules
   - Returns: `("wget"|"browser"|"ambiguous", confidence)`

4. **`llm_arbiter.py`** (~60 lines)
   - Format prompt
   - Call LLM API
   - Parse JSON response

**Total**: ~300 lines of decision logic

The rest is standard infrastructure (caching, logging, retry handling) that you'd need anyway.

---

## Key Properties

✅ **One cheap probe**: Single HTTP GET gathers all signals  
✅ **Fast common path**: 70-80% skip LLM entirely  
✅ **Smart edge cases**: LLM handles nuance automatically  
✅ **Self-improving**: Learns from failures, builds cache  
✅ **Minimal code**: ~300 lines of decision logic  
✅ **Flexible**: No hard rules, everything can escalate  
✅ **Reliable**: Two chances (cheap + browser), validation layer

---

## Extension Points

### Advanced Heuristics (Optional)

Add more sophisticated checks:
- **Paywall detection**: Look for subscription CTAs, metered content warnings
- **Video platform detection**: Check for `<video>` tags, player divs, streaming URLs
- **PDF detection**: `Content-Type: application/pdf` → use PDF fetcher instead
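
The PDF case is the simplest of these to add, since it only needs the `Content-Type` header from the cheap fetch. A sketch, where the function name and the returned labels (`pdf`, `html_router`, `unsupported`) are hypothetical:

```python
def fetcher_for_content_type(content_type: str) -> str:
    """Pick a fetcher family from the Content-Type header of the cheap fetch."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return "pdf"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html_router"  # continue with the wget-vs-browser decision
    return "unsupported"
```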

### Multi-Modal LLM (Future)

Instead of text-only analysis, include:
- A screenshot of the page (the cheap fetch returns only HTML, so this requires a separate rendering step)
- Visual analysis: "Page is mostly white space with spinner" → DYNAMIC

### Confidence Thresholds

Tune heuristic aggressiveness:
- **Conservative** (low false positives): Only trigger on very clear signals
- **Aggressive** (fast): Trigger on weaker signals, accept some retries
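
One way to make this tunable is to pull the cutoffs into a small profile object. The sketch below is illustrative only: the `Thresholds` fields, the `quick_verdict` function, and every numeric value in the two profiles are invented for the example and would need to be calibrated against real mismatch data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    spa_max_text: int           # below this, an SPA signature means "browser"
    static_min_text: int        # above this (with enough paragraphs), "wget"
    static_min_paragraphs: int


# Illustrative values, not measured ones.
CONSERVATIVE = Thresholds(spa_max_text=100, static_min_text=4096, static_min_paragraphs=8)
AGGRESSIVE = Thresholds(spa_max_text=400, static_min_text=1024, static_min_paragraphs=3)


def quick_verdict(text_length, paragraph_count, has_spa_signature, t: Thresholds) -> str:
    if has_spa_signature and text_length < t.spa_max_text:
        return "browser"
    if text_length > t.static_min_text and paragraph_count >= t.static_min_paragraphs:
        return "wget"
    return "ambiguous"


# The same page gets a verdict under one profile and is escalated under the other.
print(quick_verdict(1500, 4, False, CONSERVATIVE))  # ambiguous
print(quick_verdict(1500, 4, False, AGGRESSIVE))    # wget
```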

### A/B Testing

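Occasionally run both strategies on the same URL, as a sampled measurement rather than the default request path:
- Measure: Did the predicted strategy (heuristic or LLM) match what the page actually needed?
- Improve: Retune heuristic thresholds based on the data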

---

## Anti-Patterns to Avoid

❌ **Don't try multiple strategies speculatively**: Pick one, validate, retry if wrong  
❌ **Don't send full HTML to LLM**: Curate small, relevant payload  
❌ **Don't trust domain cache absolutely**: Always have fallback path  
❌ **Don't skip validation layer**: Always check if extraction actually succeeded  
❌ **Don't ignore mismatch logs**: They're signals for improving heuristics

---

## Summary

The intelligent fetch router solves the core dilemma: **"Should I use wget or browser?"** by:

1. **Probing once cheaply** (one HTTP GET)
2. **Applying fast heuristics** for obvious cases
3. **Escalating to LLM** only when ambiguous
4. **Caching successful patterns** for future speed
5. **Self-correcting** when predictions fail

**Result**: A system that is fast, cheap, and reliable, requires minimal code, and adapts to new websites with little manual rule maintenance.