# LLM-Driven Web Extraction Architecture

## Problem Statement

Extract structured information from arbitrary web URLs where:
- Different sites require different approaches (simple HTTP vs browser automation)
- Content types vary (articles, videos, audio, images, PDFs)
- Sites change frequently (patterns break, paywalls appear)
- We want to minimize expensive operations (browser automation, LLM calls)

**Solution**: A three-layer extraction pipeline (Layers 1-3) fronted by an LLM-based routing layer (Layer 0) that caches its decisions and promotes proven ones to fast patterns.

---

## Architecture Overview

```
URL Input
    ↓
[Layer 0: Intelligent Routing]
    LLM analyzes page → determines fetch strategy
    ↓
[Layer 1: Content Fetching]
    Generic fetchers retrieve raw artifacts (HTML, video, audio, images)
    ↓
[Layer 2: Entity Extraction]
    Type-specific extractors parse structured data from artifacts
    ↓
[Layer 3: Meta Analysis]
    Inferential extractors derive implicit knowledge
    ↓
Structured Output
```

### Key Principles

1. **Intelligence at the edge**: LLM decides what to fetch before expensive operations
2. **Progressive extraction**: Raw → Entities → Inferences (each layer builds on previous)
3. **Learning system**: Cache successful strategies, promote patterns over time
4. **Always retry**: A failure advances to the next strategy; a job fails only when every strategy is exhausted
5. **Full observability**: Every decision and attempt logged for debugging

---

## Layer 0: Intelligent Routing

### Purpose
Determine the optimal fetch strategy for a given URL without manual rule maintenance.

### Flow

**Step 1: Quick Reconnaissance**

Lightweight fetch to gather intelligence:
- Load page with minimal wait (2-3 seconds)
- Capture viewport screenshot
- Extract HTML structure
- Read meta tags and JSON-LD
- Count media elements (videos, audio, iframes, images)
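
The HTML-side signals in this list can be gathered with the standard library alone. The sketch below is illustrative rather than the pipeline's actual recon code: `ReconParser` and `summarize_html` are hypothetical names, and the screenshot capture (which needs a browser) is left out.

```python
import json
from collections import Counter
from html.parser import HTMLParser

MEDIA_TAGS = {"video", "audio", "iframe", "img"}


class ReconParser(HTMLParser):
    """Collects meta tags, JSON-LD blocks, and media element counts."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.json_ld = []
        self.media_counts = Counter()
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in MEDIA_TAGS:
            self.media_counts[tag] += 1
        elif tag == "meta":
            # Open Graph uses "property"; classic meta tags use "name".
            name = attrs.get("property") or attrs.get("name")
            if name and attrs.get("content") is not None:
                self.meta[name] = attrs["content"]
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True

    def handle_data(self, data):
        if self._in_json_ld and data.strip():
            try:
                self.json_ld.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is skipped rather than failing recon

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def summarize_html(html: str) -> dict:
    """Return the recon signals that would be sent to the LLM alongside the screenshot."""
    parser = ReconParser()
    parser.feed(html)
    parser.close()
    return {
        "meta": parser.meta,
        "json_ld": parser.json_ld,
        "media_counts": dict(parser.media_counts),
    }
```

The summary is deliberately small: the LLM only needs counts and metadata to choose fetchers, not the full DOM.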

**Step 2: LLM Analysis**

Send reconnaissance data to LLM with prompt:
```
"Based on this screenshot, HTML, and metadata, what should I fetch?
Available fetchers: html, video, audio, images, pdf
Respond with structured fetch plan including content type and methods."
```

LLM responds with fetch plan:
```json
{
  "content_type": "video",
  "fetchers": [
    {"type": "html", "method": "playwright", "config": {"wait_for": "networkidle"}},
    {"type": "video", "method": "yt-dlp"}
  ],
  "expected_entity_types": ["text", "video"]
}
```
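
Because the plan comes from an LLM, it is worth validating before any fetcher runs. The following is a minimal sketch of such a check, assuming the JSON shape shown above; `FetchStep`, `FetchPlan`, and `parse_fetch_plan` are hypothetical names, not part of an existing API.

```python
import json
from dataclasses import dataclass, field

# Fetcher types listed in the routing prompt.
AVAILABLE_FETCHERS = {"html", "video", "audio", "images", "pdf"}


@dataclass
class FetchStep:
    type: str
    method: str
    config: dict = field(default_factory=dict)


@dataclass
class FetchPlan:
    content_type: str
    fetchers: list
    expected_entity_types: list = field(default_factory=list)


def parse_fetch_plan(raw: str) -> FetchPlan:
    """Parse an LLM response and reject plans that name unknown fetchers."""
    data = json.loads(raw)
    steps = []
    for item in data["fetchers"]:
        if item["type"] not in AVAILABLE_FETCHERS:
            raise ValueError(f"unknown fetcher type: {item['type']!r}")
        steps.append(FetchStep(item["type"], item["method"], item.get("config", {})))
    if not steps:
        raise ValueError("fetch plan contains no fetchers")
    return FetchPlan(data["content_type"], steps, data.get("expected_entity_types", []))
```

A plan that fails validation can be treated like any other routing failure, for example by re-prompting or falling back to a default HTML fetch.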

**Step 3: Caching Strategy**

- First visit to `example.com` → LLM call (slow, ~2-5 seconds)
- Cache decision by domain
- Future visits to `example.com` → use cached plan (instant)
- After 10+ successful uses → promote to fast pattern match (no cache lookup)
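
The cache-then-promote lifecycle can be sketched in memory as follows. This is an illustration under assumptions: `RoutingTable` and its methods are hypothetical, and a real deployment would persist the counters in the database rather than in a dict.

```python
from dataclasses import dataclass

PROMOTION_THRESHOLD = 10  # "10+ successful uses" from the list above


@dataclass
class CachedPlan:
    plan: dict
    successes: int = 0


class RoutingTable:
    def __init__(self):
        self.patterns = {}  # promoted plans, checked first
        self.cache = {}     # LLM decisions keyed by domain

    def lookup(self, domain):
        """Return a known plan for the domain, or None if the LLM must be called."""
        if domain in self.patterns:
            return self.patterns[domain]
        entry = self.cache.get(domain)
        return entry.plan if entry else None

    def store(self, domain, plan):
        """Cache a fresh LLM decision."""
        self.cache[domain] = CachedPlan(plan)

    def record_success(self, domain):
        """Count a successful use and promote the plan once it reaches the threshold."""
        entry = self.cache.get(domain)
        if entry is None:
            return
        entry.successes += 1
        if entry.successes >= PROMOTION_THRESHOLD:
            self.patterns[domain] = self.cache.pop(domain).plan
```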

### Hybrid Optimization

Common domains (YouTube, SoundCloud, NY Times) use pre-defined patterns:
```python
def route(url, domain):
    if domain == 'youtube.com':
        return YOUTUBE_PATTERN   # pre-defined pattern: instant
    if domain in cache:
        return cache[domain]     # cached LLM decision: fast
    return llm_route(url)        # LLM routing: smart but slow
```

Results:
- 80% of URLs use patterns (< 1ms)
- 15% use cache (< 10ms)
- 5% require LLM (2-5 seconds, then cached)
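
These shares imply a mean routing latency that is dominated by the LLM path. The arithmetic below is a rough estimate, assuming the stated upper bounds for patterns and cache and a 3.5-second midpoint for LLM calls (the midpoint is an assumption, not a measurement).

```python
# (share of URLs, routing latency in seconds)
tiers = {
    "pattern": (0.80, 0.001),  # upper bound: < 1 ms
    "cache": (0.15, 0.010),    # upper bound: < 10 ms
    "llm": (0.05, 3.5),        # assumed midpoint of 2-5 s
}

# Expected latency is the share-weighted sum across tiers.
expected = sum(share * latency for share, latency in tiers.values())
llm_share = tiers["llm"][0] * tiers["llm"][1] / expected

print(f"expected routing latency: {expected * 1000:.1f} ms")
print(f"share of routing time spent in the LLM path: {llm_share:.1%}")
```

Under these assumptions the mean is roughly 177 ms per URL, almost all of it from the 5% of URLs that reach the LLM, which is why shrinking that share matters more than speeding up the other two tiers.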

---

## Layer 1: Content Fetching

### Purpose
Retrieve raw artifacts based on fetch plan. Fetchers are generic, site-agnostic tools.

### Generic Fetchers

**HTMLFetcher**
- Methods: `wget` (fast, cheap) or `playwright` (slow, handles JS)
- Config: wait conditions, scroll behavior, timeouts
- Output: `page.html`

**VideoFetcher**
- Methods: `yt-dlp` (works for YouTube, Vimeo, Twitter, etc.) or `direct` (MP4 URLs)
- Output: `video.mp4` + metadata

**AudioFetcher**
- Methods: `yt-dlp` (SoundCloud and other supported sites; DRM-protected services such as Spotify are generally not downloadable) or `direct` (MP3 URLs)
- Output: `audio.mp3` + metadata

**ImageFetcher**
- Methods: CSS selector extraction or direct download
- Output: Multiple image files

### Execution Model

Fetch plan specifies multiple fetchers to run in parallel/sequence:
```
YouTube video page:
  1. HTMLFetcher (playwright) → page.html
  2. VideoFetcher (yt-dlp) → video.mp4
  3. ImageFetcher (selectors) → thumbnail.jpg
```

All artifacts stored in job directory with manifest:
```
artifacts/job_123/
  ├── page.html
  ├── video.mp4
  ├── thumbnail.jpg
  └── manifest.json
```

Manifest lists what was successfully fetched:
```json
{
  "artifacts": [
    {"type": "html", "path": "page.html", "mime": "text/html", "size": 50000},
    {"type": "video", "path": "video.mp4", "mime": "video/mp4", "size": 5000000}
  ]
}
```
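
A manifest like this can be generated by scanning the job directory once fetching finishes. The sketch below assumes the field names shown above; `write_manifest` and the MIME-to-type mapping are hypothetical.

```python
import json
import mimetypes
from pathlib import Path

# Assumed special cases; other MIME types fall back to their top-level type.
TYPE_OVERRIDES = {"text/html": "html", "application/pdf": "pdf"}


def artifact_type(mime: str) -> str:
    """Map a MIME type to the manifest's "type" field, e.g. "video/mp4" -> "video"."""
    return TYPE_OVERRIDES.get(mime, mime.split("/")[0])


def write_manifest(job_dir: Path) -> dict:
    """List every file in the job directory and write manifest.json next to them."""
    artifacts = []
    for path in sorted(job_dir.iterdir()):
        if path.name == "manifest.json" or not path.is_file():
            continue
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        artifacts.append({
            "type": artifact_type(mime),
            "path": path.name,
            "mime": mime,
            "size": path.stat().st_size,
        })
    manifest = {"artifacts": artifacts}
    (job_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Only files that actually exist on disk are listed, so the manifest reflects what was fetched rather than what the plan requested.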

### Retry Strategy

If fetcher fails, advance to next in plan:
```
Plan: [wget/html, playwright/html, playwright/generic]
  ↓
wget fails (403) → try playwright
  ↓
playwright succeeds → continue to Layer 2
```
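
The fallback behaviour above amounts to a loop over the plan that records every attempt. A minimal in-process sketch follows, with stub fetchers standing in for `wget` and `playwright`; `FetchError` and `run_plan` are hypothetical names.

```python
from typing import Callable


class FetchError(Exception):
    """Raised by a strategy that could not retrieve its artifact."""


def run_plan(url: str, strategies: list[tuple[str, Callable[[str], str]]]):
    """Try strategies in order; return (artifact, attempts) or (None, attempts)."""
    attempts = []  # mirrors what an audit log would record
    for position, (name, fetch) in enumerate(strategies):
        try:
            artifact = fetch(url)
        except FetchError as exc:
            attempts.append({"position": position, "strategy": name,
                             "success": False, "error": str(exc)})
            continue  # advance to the next strategy
        attempts.append({"position": position, "strategy": name,
                         "success": True, "error": None})
        return artifact, attempts
    return None, attempts  # all strategies exhausted


# Stub fetchers for illustration only.
def wget_stub(url: str) -> str:
    raise FetchError("403")


def playwright_stub(url: str) -> str:
    return "page.html"


artifact, attempts = run_plan(
    "https://example.com/post",
    [("wget/html", wget_stub), ("playwright/html", playwright_stub)],
)
```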

---

## Layer 2: Entity Extraction

### Purpose
Parse structured entities from raw artifacts. Extractors are content-type specific, not site-specific.

### Generic Extractors

**TextEntityExtractor** (processes HTML, plain text, PDFs)
- Input: HTML, plain-text, or PDF artifact
- Extracts: People, organizations, locations (via NER)
- Extracts: Topics, keywords, main content
- Output: Structured entities

**VideoEntityExtractor** (processes video files)
- Input: Video artifact
- Extracts: Visual objects from keyframes (via vision model)
- Extracts: Transcript from audio (via speech-to-text)
- Extracts: Spoken entities from transcript (via NER)
- Output: Visual + auditory entities

**AudioEntityExtractor** (processes audio files)
- Input: Audio artifact
- Extracts: Transcript (via speech-to-text)
- Extracts: Spoken entities (via NER)
- Extracts: Audio features (language, speakers, duration)
- Output: Auditory entities

**ImageEntityExtractor** (processes images)
- Input: Image artifacts
- Extracts: Visual objects (via vision model)
- Extracts: OCR text if present
- Extracts: Entities from OCR text (via NER)
- Output: Visual entities

### Execution Model

For each artifact in manifest, run appropriate extractor:
```python
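EXTRACTORS = {
    'html': TextEntityExtractor,
    'video': VideoEntityExtractor,
    # audio and image extractors register the same way
}

entities = {}  # keyed by artifact type, matching the result below
for artifact in manifest['artifacts']:
    extractor_cls = EXTRACTORS.get(artifact['type'])
    if extractor_cls is not None:
        entities[artifact['type']] = extractor_cls().extract(artifact)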
```

Result:
```json
{
  "html": {
    "people": ["Dr. Jane Smith", "Prof. John Doe"],
    "organizations": ["MIT", "Stanford"],
    "topics": ["quantum computing", "cryptography"],
    "main_content": "..."
  },
  "video": {
    "visual_objects": ["person", "whiteboard", "equations"],
    "transcript": "Today we'll discuss quantum entanglement...",
    "spoken_entities": ["Einstein", "Schrödinger"]
  }
}
```

---

## Layer 3: Meta Analysis

### Purpose
Infer implicit knowledge from explicit entities. Meta extractors are hypothesis-driven, not deterministic.

### Meta Extractors

**AuthorExpertiseMetaExtractor**
- Input: Entities from all artifacts
- Hypothesis: "Author who writes about topic X likely knows topic X"
- Process: Aggregate topics across all content, frequency analysis
- Output: Inferred expertise areas with confidence scores

**CredibilityMetaExtractor**
- Input: Entities + HTML metadata
- Hypothesis: "Credible sources have authors, dates, citations, reputable domains"
- Process: Signal scoring (author presence, citation count, domain reputation)
- Output: Credibility score with evidence

**ContentStyleMetaExtractor**
- Input: Entities from text + video
- Hypothesis: "Formal language + citations = educational; casual + no citations = entertainment"
- Process: Linguistic analysis, visual pacing
- Output: Style classification (educational, promotional, entertainment)
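
As an illustration of the signal-scoring process, the sketch below combines the credibility signals into a weighted score. The weights and the citation cap are assumptions chosen for the example; the design does not specify how signals are combined.

```python
# Hypothetical weights (they sum to 1.0); not taken from the design.
WEIGHTS = {"has_author": 0.3, "has_date": 0.2, "citations": 0.3, "reputable_domain": 0.2}
CITATION_CAP = 10  # assumed: citations beyond this add nothing further


def credibility(signals: dict) -> dict:
    """Score a page from 0.0 to 1.0 and list the signals that contributed."""
    parts = {
        "has_author": 1.0 if signals.get("has_author") else 0.0,
        "has_date": 1.0 if signals.get("has_date") else 0.0,
        "citations": min(signals.get("citation_count", 0), CITATION_CAP) / CITATION_CAP,
        "reputable_domain": 1.0 if signals.get("reputable_domain") else 0.0,
    }
    score = sum(WEIGHTS[name] * value for name, value in parts.items())
    evidence = [name for name, value in parts.items() if value > 0]
    return {"credibility_score": round(score, 2), "evidence": evidence}
```

Returning the contributing signals alongside the score keeps the output inspectable, in line with the "score with evidence" output described above.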

### Execution Model

Meta extractors run independently after Layer 2 completes:
```python
for extractor_cls in [
    AuthorExpertiseMetaExtractor,
    CredibilityMetaExtractor,
    ContentStyleMetaExtractor,
]:
    analysis = extractor_cls().analyze(entities, manifest)
    store_with_confidence(analysis)
```

### Confidence Levels

All meta analysis includes confidence because it's inferential:
```json
{
  "analysis_type": "author_expertise",
  "inferred_expertise": ["quantum computing", "cryptography"],
  "confidence": 0.85,
  "reasoning": "Author published 3 peer-reviewed papers on quantum cryptography",
  "evidence": ["paper1.pdf", "paper2.pdf"]
}
```
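
One simple way to derive such a confidence value is frequency analysis over the topics of an author's documents. In the sketch below, confidence is the share of documents that mention a topic; that definition, the `min_share` cutoff, and the function name are all assumptions made for illustration.

```python
from collections import Counter


def infer_expertise(topics_per_document: list[list[str]], min_share: float = 0.5) -> list[dict]:
    """Return topics that appear in at least `min_share` of the documents."""
    doc_count = len(topics_per_document)
    # Count each topic once per document so repetition inside one page does not inflate it.
    counts = Counter(topic for topics in topics_per_document for topic in set(topics))
    results = []
    for topic, n in counts.most_common():
        share = n / doc_count
        if share >= min_share:
            results.append({"topic": topic, "confidence": round(share, 2), "evidence_count": n})
    return results


docs = [
    ["quantum computing", "cryptography"],
    ["quantum computing"],
    ["quantum computing", "gardening"],
    ["cryptography"],
]
expertise = infer_expertise(docs)
```

With these four documents, "quantum computing" (3 of 4) and "cryptography" (2 of 4) pass the cutoff, while "gardening" (1 of 4) does not.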

---

## State Machine

Jobs progress through states with automatic retry on failure.

### States
- `pending`: Job created, waiting for worker
- `processing`: Currently executing a strategy
- `retrying`: Strategy failed, moving to next in plan
- `completed`: Extraction successful
- `failed`: All strategies exhausted

### Transitions

```
pending
  ↓ [worker picks up]
processing (strategy 0)
  ↓ [execute fetcher]
  ├─ success → completed
  └─ failure → retrying (move to strategy 1)
            ↓ [worker picks up]
          processing (strategy 1)
            ↓ [execute fetcher]
            ├─ success → completed
            └─ failure → retrying (move to strategy 2)
                      ↓
                    ... (continue until strategies exhausted)
                      ↓
                    failed
```
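
The transitions in the diagram can be expressed as two small functions, one for a worker picking up a job and one for the outcome of an attempt. This is a sketch of the state logic only, with hypothetical names; persistence and locking are out of scope here.

```python
from dataclasses import dataclass


@dataclass
class Job:
    status: str = "pending"
    strategy_position: int = 0


def pick_up(job: Job) -> None:
    """pending/retrying -> processing."""
    if job.status not in ("pending", "retrying"):
        raise ValueError(f"cannot pick up job in state {job.status!r}")
    job.status = "processing"


def finish_attempt(job: Job, success: bool, strategy_count: int) -> None:
    """processing -> completed, retrying (next strategy), or failed (none left)."""
    if job.status != "processing":
        raise ValueError(f"cannot finish attempt in state {job.status!r}")
    if success:
        job.status = "completed"
    elif job.strategy_position + 1 < strategy_count:
        job.strategy_position += 1
        job.status = "retrying"
    else:
        job.status = "failed"


def simulate(outcomes: list[bool]) -> Job:
    """Drive a job through one attempt per strategy until it completes or fails."""
    job = Job()
    while job.status in ("pending", "retrying"):
        pick_up(job)
        finish_attempt(job, outcomes[job.strategy_position], len(outcomes))
    return job
```

For example, `simulate([False, True])` ends in `completed` at strategy 1, and `simulate([False, False, False])` ends in `failed` at strategy 2.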

### Worker Model

Stateless workers poll database for pending/retrying jobs:
```sql
SELECT id FROM extraction_jobs
WHERE status IN ('pending', 'retrying')
ORDER BY queue_priority DESC, created_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED
```

Workers:
- Can crash and restart without losing state
- Scale horizontally (add more workers)
- Process jobs in batches (cron, queue consumer, etc.)
- Don't hold any state in memory

---

## Database Schema

### Layer 0: Routing

```sql
-- Cache LLM routing decisions
CREATE TABLE fetch_plan_cache (
  domain VARCHAR(255) PRIMARY KEY,
  plan JSONB NOT NULL,
  times_used INT DEFAULT 0,
  success_rate FLOAT DEFAULT 1.0,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Store recon snapshots for debugging
-- (references extraction_jobs from the Layer 1 schema; create that table first)
CREATE TABLE recon_snapshots (
  id SERIAL PRIMARY KEY,
  job_id INT REFERENCES extraction_jobs(id),
  recon_data JSONB,
  llm_prompt TEXT,
  llm_response TEXT,
  created_at TIMESTAMP DEFAULT NOW()
);
```

### Layer 1: Fetching

```sql
-- Job state
CREATE TABLE extraction_jobs (
  id SERIAL PRIMARY KEY,
  url TEXT NOT NULL,
  status VARCHAR(20) DEFAULT 'pending',
  domain VARCHAR(255),
  current_strategy_position INT DEFAULT 0,
  artifacts_dir TEXT,
  manifest JSONB,
  queue_priority INT DEFAULT 0,
  created_at TIMESTAMP DEFAULT NOW(),
  completed_at TIMESTAMP
);

-- Audit log
CREATE TABLE strategy_attempts (
  id SERIAL PRIMARY KEY,
  job_id INT REFERENCES extraction_jobs(id),
  strategy_position INT NOT NULL,
  fetcher VARCHAR(50) NOT NULL,
  method VARCHAR(50) NOT NULL,
  success BOOLEAN NOT NULL,
  error_type VARCHAR(100),
  execution_time_ms INT,
  attempted_at TIMESTAMP DEFAULT NOW()
);
```

### Layer 2: Entity Extraction

```sql
CREATE TABLE entity_extractions (
  id SERIAL PRIMARY KEY,
  job_id INT REFERENCES extraction_jobs(id),
  artifact_type VARCHAR(50),
  entities JSONB,
  extracted_at TIMESTAMP DEFAULT NOW()
);
```

### Layer 3: Meta Analysis

```sql
CREATE TABLE meta_analysis (
  id SERIAL PRIMARY KEY,
  job_id INT REFERENCES extraction_jobs(id),
  analysis_type VARCHAR(50),
  result JSONB,
  confidence FLOAT,
  analyzed_at TIMESTAMP DEFAULT NOW()
);
```

### System Configuration

```sql
CREATE TABLE system_config (
  key VARCHAR(100) PRIMARY KEY,
  value JSONB
);

-- Concurrency limits
INSERT INTO system_config (key, value) VALUES
  ('fetcher_concurrency', '{
    "html": {"max_parallel": 10, "current": 0},
    "video": {"max_parallel": 2, "current": 0}
  }');
```

---

## Cost & Performance Optimization

### LLM Call Minimization

**Problem**: LLM calls are expensive (~$0.01 per URL) and slow (~2-5 seconds).

**Solutions**:
1. **Domain-level caching**: First URL from domain uses LLM, rest use cache
2. **Pattern promotion**: After 10 successful uses, promote to fast pattern
3. **Batch reconnaissance**: Process 100 URLs, group by domain, run LLM once per unique domain

**Result**: 95% of URLs avoid LLM calls after system runs for a week.
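
Batch reconnaissance reduces to grouping URLs by domain and routing one representative per group. A minimal sketch, assuming `llm_route` is supplied by the caller and that a leading `www.` should be ignored when comparing domains (both are assumptions):

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Extract the hostname, dropping a leading "www." (an assumed normalization)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def route_batch(urls, llm_route):
    """Return {url: plan}, calling llm_route once per unique domain."""
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[domain_of(url)].append(url)

    plans = {}
    for domain, group in by_domain.items():
        plans[domain] = llm_route(group[0])  # first URL represents the domain

    return {url: plans[domain_of(url)] for url in urls}
```

Using the first URL as the representative assumes pages on one domain share a structure; domains that mix articles and video pages may need a finer cache key than the domain alone.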

### Progressive Loading

Don't block the entire pipeline:
- Layer 1 completes → show raw artifacts to user immediately
- Layer 2 runs async → update with entities when ready
- Layer 3 runs async → update with meta analysis when ready

User sees:
```
t=0s:   "Fetching..."
t=5s:   "✓ Fetched HTML + video" [show download links]
t=10s:  "✓ Extracted entities" [show structured data]
t=15s:  "✓ Inferred expertise" [show meta analysis]
```
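
One way to get this behaviour is to publish each layer's result as soon as it exists instead of returning everything at the end. The `asyncio` sketch below uses stub coroutines in place of the real layers; all function names and the `publish` callback are hypothetical.

```python
import asyncio


# Stand-ins for the three layers; real versions would do I/O and model calls.
async def fetch_artifacts(url):
    await asyncio.sleep(0)
    return ["page.html"]


async def extract_entities(artifacts):
    await asyncio.sleep(0)
    return {"html": {"topics": ["quantum computing"]}}


async def analyze_meta(entities):
    await asyncio.sleep(0)
    return {"analysis_type": "author_expertise"}


async def run_job(url, publish):
    """Run the layers in order, publishing each result as soon as it is ready."""
    artifacts = await fetch_artifacts(url)
    publish("fetched", artifacts)

    entities = await extract_entities(artifacts)
    publish("entities", entities)

    meta = await analyze_meta(entities)
    publish("meta", meta)


updates = []
asyncio.run(run_job("https://example.com/post",
                    lambda stage, payload: updates.append((stage, payload))))
for stage, _ in updates:
    print(stage)
```

In a real system `publish` might write a row or push to a websocket; the point is only that each stage becomes visible without waiting for the later ones.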

### Concurrency Management

Limit expensive operations. HTML fetches are cheap; video fetches are expensive; LLM routing is expensive and rate-limited:
```json
{
  "html": {"max_parallel": 10},
  "video": {"max_parallel": 2},
  "llm_routing": {"max_parallel": 5}
}
```

Workers check limits before starting:
```sql
UPDATE system_config
SET value = jsonb_set(value, '{video,current}',
                      to_jsonb((value->'video'->>'current')::int + 1))
WHERE key = 'fetcher_concurrency'
  AND (value->'video'->>'current')::int < (value->'video'->>'max_parallel')::int;
```

---

## Observability

### What to Track

- **Routing decisions**: Which domains use patterns vs cache vs LLM
- **Fetch success rates**: Which fetchers fail most often, which methods work
- **Extraction quality**: Which extractors produce best results per content type
- **Meta confidence**: How often high-confidence predictions are verified
- **Performance**: Time spent in each layer, bottlenecks

### Key Queries

Find domains that should be promoted to patterns:
```sql
SELECT domain, times_used, success_rate
FROM fetch_plan_cache
WHERE times_used >= 10 AND success_rate > 0.9
ORDER BY times_used DESC;
```

See what happened to a job:
```sql
SELECT 
  sa.strategy_position,
  sa.fetcher,
  sa.method,
  sa.success,
  sa.error_type,
  sa.execution_time_ms
FROM strategy_attempts sa
WHERE sa.job_id = 123
ORDER BY sa.attempted_at;
```

Find unreliable fetchers:
```sql
SELECT 
  fetcher,
  method,
  COUNT(*) as attempts,
  AVG(CASE WHEN success THEN 1.0 ELSE 0.0 END) as success_rate,
  AVG(execution_time_ms) as avg_time_ms
FROM strategy_attempts
GROUP BY fetcher, method
ORDER BY success_rate ASC;
```

---

## Example: Complete Flow

**URL**: `https://unknown-blog.com/post/quantum-computing`

### Layer 0: Routing (8 seconds)

1. Check pattern match → none (unknown domain)
2. Check cache → none (first visit)
3. Run reconnaissance:
   - Screenshot shows article with code snippets
   - HTML has `<article>` tag, no `<video>` tags
   - Meta tags: `og:type=article`, author present
4. LLM analyzes → responds:
   ```json
   {
     "content_type": "article",
     "fetchers": [
       {"type": "html", "method": "playwright", "config": {"wait_for": "networkidle"}}
     ]
   }
   ```
5. Cache decision for `unknown-blog.com`

### Layer 1: Fetching (5 seconds)

1. Execute fetch plan
2. HTMLFetcher (playwright) retrieves page
3. Store artifacts:
   ```
   artifacts/job_456/
     ├── page.html (150KB)
     └── manifest.json
   ```

### Layer 2: Entity Extraction (3 seconds)

1. TextEntityExtractor processes `page.html`
2. Output:
   ```json
   {
     "people": ["Dr. Alice Chen"],
     "organizations": ["MIT", "IBM"],
     "topics": ["quantum computing", "quantum entanglement", "qubits"],
     "main_content": "Quantum computing represents a paradigm shift..."
   }
   ```

### Layer 3: Meta Analysis (2 seconds)

1. AuthorExpertiseMetaExtractor:
   ```json
   {
     "inferred_expertise": ["quantum computing", "quantum physics"],
     "confidence": 0.75,
     "reasoning": "Author wrote detailed technical article with citations"
   }
   ```

2. CredibilityMetaExtractor:
   ```json
   {
     "credibility_score": 0.82,
     "signals": {
       "has_author": true,
       "has_date": true,
       "citation_count": 8,
       "domain_age": "5 years"
     }
   }
   ```

- **Total time**: 18 seconds
- **Next URL from `unknown-blog.com`**: 10 seconds (skips recon + LLM)

---

## Design Benefits

✅ **Near-zero configuration**: No manual rules beyond optional pre-defined patterns for common domains; the system learns the rest  
✅ **Adapts to change**: Sites evolve, LLM sees new page structure  
✅ **Cost efficient**: LLM used sparingly, cached aggressively  
✅ **Broad coverage**: Works for any content type (video, audio, text, images)  
✅ **Full observability**: Every decision logged and queryable  
✅ **Scales horizontally**: Stateless workers, add more to go faster  
✅ **Progressive enhancement**: Basic extraction fast, advanced analysis async  
✅ **Self-improving**: Successful patterns promoted, system gets faster over time

---

## Future Extensions

### Feedback Loop
Track user corrections to meta analysis → retrain models → improve confidence

### Multi-Modal LLM
Current: Vision model for images + text LLM for HTML  
Future: Single multi-modal LLM processes screenshots + HTML + video frames together

### Distributed Caching
Current: Database cache per instance  
Future: Shared Redis cache across all workers

### Adaptive Retry
Current: Always retry with next strategy  
Future: Skip strategies based on error type (404 = don't retry, 403 = try next browser)
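
A possible shape for this is a lookup from error type to retry action. Only the 404 and 403 rows below come from the text; the enum, the function name, and the default of falling through to the next strategy (the current behaviour) are assumptions.

```python
from enum import Enum


class Action(Enum):
    STOP = "stop"                    # do not retry at all
    NEXT_BROWSER = "next_browser"    # retry with a different browser
    NEXT_STRATEGY = "next_strategy"  # current behaviour: advance in the plan


# HTTP status -> action. Anything unlisted keeps today's behaviour.
ERROR_ACTIONS = {
    404: Action.STOP,
    403: Action.NEXT_BROWSER,
}


def next_action(http_status: int) -> Action:
    """Decide what to do after a failed fetch with the given HTTP status."""
    return ERROR_ACTIONS.get(http_status, Action.NEXT_STRATEGY)
```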

### Cost Prediction
Before running job, estimate cost based on fetch plan and historical data  
Let user approve expensive operations (video transcription, long LLM context)