# LLM-Driven Web Extraction Architecture

## Problem Statement

Extract structured information from arbitrary web URLs where:

- Different sites require different approaches (simple HTTP vs browser automation)
- Content types vary (articles, videos, audio, images, PDFs)
- Sites change frequently (patterns break, paywalls appear)
- We want to minimize expensive operations (browser automation, LLM calls)

**Solution**: Three-layer extraction pipeline with intelligent LLM-based routing that learns and caches decisions.

---

## Architecture Overview

```
URL Input
    ↓
[Layer 0: Intelligent Routing]
    LLM analyzes page → determines fetch strategy
    ↓
[Layer 1: Content Fetching]
    Generic fetchers retrieve raw artifacts (HTML, video, audio, images)
    ↓
[Layer 2: Entity Extraction]
    Type-specific extractors parse structured data from artifacts
    ↓
[Layer 3: Meta Analysis]
    Inferential extractors derive implicit knowledge
    ↓
Structured Output
```

### Key Principles

1. **Intelligence at the edge**: the LLM decides what to fetch before expensive operations run
2. **Progressive extraction**: raw → entities → inferences (each layer builds on the previous one)
3. **Learning system**: cache successful strategies, promote patterns over time
4. **Always retry**: failures advance to the next strategy, never fail immediately
5. **Full observability**: every decision and attempt is logged for debugging

---

## Layer 0: Intelligent Routing

### Purpose

Determine the optimal fetch strategy for a given URL without manual rule maintenance.

### Flow

**Step 1: Quick Reconnaissance**

A lightweight fetch gathers intelligence:

- Load the page with a minimal wait (2-3 seconds)
- Capture a viewport screenshot
- Extract the HTML structure
- Read meta tags and JSON-LD
- Count media elements (videos, audio, iframes, images)

**Step 2: LLM Analysis**

Send the reconnaissance data to the LLM with a prompt:

```
"Based on this screenshot, HTML, and metadata, what should I fetch?
Available fetchers: html, video, audio, images, pdf
Respond with a structured fetch plan including content type and methods."
```

The LLM responds with a fetch plan:

```json
{
  "content_type": "video",
  "fetchers": [
    {"type": "html", "method": "playwright", "config": {"wait_for": "networkidle"}},
    {"type": "video", "method": "yt-dlp"}
  ],
  "expected_entity_types": ["text", "video"]
}
```

**Step 3: Caching Strategy**

- First visit to `example.com` → LLM call (slow, ~2-5 seconds)
- Cache the decision by domain
- Future visits to `example.com` → use the cached plan (instant)
- After 10+ successful uses → promote to a fast pattern match (no cache lookup)

### Hybrid Optimization

Common domains (YouTube, SoundCloud, NY Times) use pre-defined patterns:

```python
if domain == 'youtube.com':
    return YOUTUBE_PATTERN   # instant
elif domain in cache:
    return cache[domain]     # fast
else:
    return llm_route(url)    # smart but slow
```

Results:

- 80% of URLs use patterns (< 1 ms)
- 15% use the cache (< 10 ms)
- 5% require the LLM (2-5 seconds, then cached)

---

## Layer 1: Content Fetching

### Purpose

Retrieve raw artifacts based on the fetch plan. Fetchers are generic, site-agnostic tools.

### Generic Fetchers

**HTMLFetcher**

- Methods: `wget` (fast, cheap) or `playwright` (slow, handles JS)
- Config: wait conditions, scroll behavior, timeouts
- Output: `page.html`

**VideoFetcher**

- Methods: `yt-dlp` (works for YouTube, Vimeo, Twitter, etc.) or `direct` (MP4 URLs)
- Output: `video.mp4` + metadata

**AudioFetcher**

- Methods: `yt-dlp` (SoundCloud, Spotify) or `direct` (MP3 URLs)
- Output: `audio.mp3` + metadata

**ImageFetcher**

- Methods: CSS selector extraction or direct download
- Output: multiple image files

### Execution Model

The fetch plan can specify multiple fetchers to run in parallel or in sequence:

```
YouTube video page:
1. HTMLFetcher (playwright) → page.html
2. VideoFetcher (yt-dlp)    → video.mp4
3.
   ImageFetcher (selectors)  → thumbnail.jpg
```

All artifacts are stored in the job directory with a manifest:

```
artifacts/job_123/
├── page.html
├── video.mp4
├── thumbnail.jpg
└── manifest.json
```

The manifest lists what was successfully fetched:

```json
{
  "artifacts": [
    {"type": "html", "path": "page.html", "mime": "text/html", "size": 50000},
    {"type": "video", "path": "video.mp4", "mime": "video/mp4", "size": 5000000}
  ]
}
```

### Retry Strategy

If a fetcher fails, advance to the next one in the plan:

```
Plan: [wget/html, playwright/html, playwright/generic]
    ↓
wget fails (403) → try playwright
    ↓
playwright succeeds → continue to Layer 2
```

---

## Layer 2: Entity Extraction

### Purpose

Parse structured entities from raw artifacts. Extractors are content-type specific, not site-specific.

### Generic Extractors

**TextEntityExtractor** (processes HTML, plain text, PDFs)

- Input: HTML artifact
- Extracts: people, organizations, locations (via NER)
- Extracts: topics, keywords, main content
- Output: structured entities

**VideoEntityExtractor** (processes video files)

- Input: video artifact
- Extracts: visual objects from keyframes (via vision model)
- Extracts: transcript from audio (via speech-to-text)
- Extracts: spoken entities from transcript (via NER)
- Output: visual + auditory entities

**AudioEntityExtractor** (processes audio files)

- Input: audio artifact
- Extracts: transcript (via speech-to-text)
- Extracts: spoken entities (via NER)
- Extracts: audio features (language, speakers, duration)
- Output: auditory entities

**ImageEntityExtractor** (processes images)

- Input: image artifacts
- Extracts: visual objects (via vision model)
- Extracts: OCR text if present
- Extracts: entities from OCR text (via NER)
- Output: visual entities

### Execution Model

For each artifact in the manifest, run the appropriate extractor:

```python
for artifact in manifest['artifacts']:
    if artifact['type'] == 'html':
        entities = TextEntityExtractor().extract(artifact)
    elif artifact['type'] == 'video':
        entities = VideoEntityExtractor().extract(artifact)
    # etc.
```

Result:

```json
{
  "html": {
    "people": ["Dr. Jane Smith", "Prof. John Doe"],
    "organizations": ["MIT", "Stanford"],
    "topics": ["quantum computing", "cryptography"],
    "main_content": "..."
  },
  "video": {
    "visual_objects": ["person", "whiteboard", "equations"],
    "transcript": "Today we'll discuss quantum entanglement...",
    "spoken_entities": ["Einstein", "Schrödinger"]
  }
}
```

---

## Layer 3: Meta Analysis

### Purpose

Infer implicit knowledge from explicit entities. Meta extractors are hypothesis-driven, not deterministic.

### Meta Extractors

**AuthorExpertiseMetaExtractor**

- Input: entities from all artifacts
- Hypothesis: "An author who writes about topic X likely knows topic X"
- Process: aggregate topics across all content, frequency analysis
- Output: inferred expertise areas with confidence scores

**CredibilityMetaExtractor**

- Input: entities + HTML metadata
- Hypothesis: "Credible sources have authors, dates, citations, reputable domains"
- Process: signal scoring (author presence, citation count, domain reputation)
- Output: credibility score with evidence

**ContentStyleMetaExtractor**

- Input: entities from text + video
- Hypothesis: "Formal language + citations = educational; casual + no citations = entertainment"
- Process: linguistic analysis, visual pacing
- Output: style classification (educational, promotional, entertainment)

### Execution Model

Meta extractors run independently after Layer 2 completes:

```python
for meta_extractor in [AuthorExpertise, Credibility, ContentStyle]:
    analysis = meta_extractor.analyze(entities, manifest)
    store_with_confidence(analysis)
```

### Confidence Levels

All meta analysis includes a confidence score because it is inferential:

```json
{
  "analysis_type": "author_expertise",
  "inferred_expertise": ["quantum computing", "cryptography"],
  "confidence": 0.85,
  "reasoning": "Author published 3 peer-reviewed papers on quantum cryptography",
  "evidence": ["paper1.pdf", "paper2.pdf"]
}
```

---
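The frequency-analysis hypothesis behind `AuthorExpertiseMetaExtractor` can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name, the input shape (a list of per-artifact entity dicts), and the confidence heuristic are all assumptions introduced here.

```python
from collections import Counter

def infer_author_expertise(entity_sets):
    """Aggregate topics across all artifacts; topics that recur are treated
    as (weak) evidence of author expertise. The confidence formula below is
    a placeholder heuristic, not from the spec."""
    counts = Counter(topic
                     for entities in entity_sets
                     for topic in entities.get("topics", []))
    expertise = [topic for topic, n in counts.most_common() if n > 1]
    confidence = min(0.95, 0.5 + 0.1 * len(expertise)) if expertise else 0.0
    return {
        "analysis_type": "author_expertise",
        "inferred_expertise": expertise,
        "confidence": round(confidence, 2),
        "reasoning": f"{len(expertise)} topic(s) recur across "
                     f"{len(entity_sets)} artifacts",
    }
```

Because the output carries its own `confidence` and `reasoning`, it slots directly into the `meta_analysis` result shape shown above.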
## State Machine

Jobs progress through states with automatic retry on failure.

### States

- `pending`: job created, waiting for a worker
- `processing`: currently executing a strategy
- `retrying`: strategy failed, moving to the next one in the plan
- `completed`: extraction successful
- `failed`: all strategies exhausted

### Transitions

```
pending
  ↓ [worker picks up]
processing (strategy 0)
  ↓ [execute fetcher]
  ├─ success → completed
  └─ failure → retrying (move to strategy 1)
        ↓ [worker picks up]
      processing (strategy 1)
        ↓ [execute fetcher]
        ├─ success → completed
        └─ failure → retrying (move to strategy 2)
              ↓
             ... (continue until strategies exhausted)
              ↓
            failed
```

### Worker Model

Stateless workers poll the database for pending/retrying jobs:

```sql
SELECT id FROM extraction_jobs
WHERE status IN ('pending', 'retrying')
ORDER BY queue_priority DESC, created_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED
```

Workers:

- Can crash and restart without losing state
- Scale horizontally (add more workers)
- Process jobs in batches (cron, queue consumer, etc.)
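The transitions above can be sketched as a pure function over the job row. `next_state`, the job-dict shape, and `STRATEGY_COUNT` are illustrative assumptions; in the real system this update happens against the database, not in memory.

```python
STRATEGY_COUNT = 3  # e.g. [wget/html, playwright/html, playwright/generic]

def next_state(job, success):
    """Advance a job after one strategy attempt: completed on success,
    retrying while strategies remain, failed once the plan is exhausted."""
    if success:
        return {**job, "status": "completed"}
    position = job["current_strategy_position"] + 1
    status = "failed" if position >= STRATEGY_COUNT else "retrying"
    return {**job, "status": status, "current_strategy_position": position}
```

Keeping the transition pure is what lets workers stay stateless: the only state is the job row itself.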
- Don't hold any state in memory

---

## Database Schema

### Layer 0: Routing

```sql
-- Cache LLM routing decisions
CREATE TABLE fetch_plan_cache (
    domain VARCHAR(255) PRIMARY KEY,
    plan JSONB NOT NULL,
    times_used INT DEFAULT 0,
    success_rate FLOAT DEFAULT 1.0,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Store recon snapshots for debugging
CREATE TABLE recon_snapshots (
    id SERIAL PRIMARY KEY,
    job_id INT REFERENCES extraction_jobs(id),
    recon_data JSONB,
    llm_prompt TEXT,
    llm_response TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
```

### Layer 1: Fetching

```sql
-- Job state
CREATE TABLE extraction_jobs (
    id SERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    status VARCHAR(20) DEFAULT 'pending',
    domain VARCHAR(255),
    current_strategy_position INT DEFAULT 0,
    artifacts_dir TEXT,
    manifest JSONB,
    queue_priority INT DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW(),
    completed_at TIMESTAMP
);

-- Audit log
CREATE TABLE strategy_attempts (
    id SERIAL PRIMARY KEY,
    job_id INT REFERENCES extraction_jobs(id),
    strategy_position INT NOT NULL,
    fetcher VARCHAR(50) NOT NULL,
    method VARCHAR(50) NOT NULL,
    success BOOLEAN NOT NULL,
    error_type VARCHAR(100),
    execution_time_ms INT,
    attempted_at TIMESTAMP DEFAULT NOW()
);
```

### Layer 2: Entity Extraction

```sql
CREATE TABLE entity_extractions (
    id SERIAL PRIMARY KEY,
    job_id INT REFERENCES extraction_jobs(id),
    artifact_type VARCHAR(50),
    entities JSONB,
    extracted_at TIMESTAMP DEFAULT NOW()
);
```

### Layer 3: Meta Analysis

```sql
CREATE TABLE meta_analysis (
    id SERIAL PRIMARY KEY,
    job_id INT REFERENCES extraction_jobs(id),
    analysis_type VARCHAR(50),
    result JSONB,
    confidence FLOAT,
    analyzed_at TIMESTAMP DEFAULT NOW()
);
```

### System Configuration

```sql
CREATE TABLE system_config (
    key VARCHAR(100) PRIMARY KEY,
    value JSONB
);

-- Concurrency limits
INSERT INTO system_config (key, value) VALUES ('fetcher_concurrency', '{
    "html": {"max_parallel": 5, "current": 0},
    "video": {"max_parallel": 2, "current": 0}
}');
```

---

## Cost & Performance Optimization

### LLM Call Minimization

**Problem**: LLM calls are expensive (~$0.01 per URL) and slow (~2-5 seconds).

**Solutions**:

1. **Domain-level caching**: the first URL from a domain uses the LLM, the rest use the cache
2. **Pattern promotion**: after 10 successful uses, promote to a fast pattern
3. **Batch reconnaissance**: process 100 URLs, group by domain, run the LLM once per unique domain

**Result**: 95% of URLs avoid LLM calls after the system has been running for a week.

### Progressive Loading

Don't block the entire pipeline:

- Layer 1 completes → show raw artifacts to the user immediately
- Layer 2 runs async → update with entities when ready
- Layer 3 runs async → update with meta analysis when ready

The user sees:

```
t=0s:  "Fetching..."
t=5s:  "✓ Fetched HTML + video"   [show download links]
t=10s: "✓ Extracted entities"     [show structured data]
t=15s: "✓ Inferred expertise"     [show meta analysis]
```

### Concurrency Management

Limit expensive operations:

```json
{
  "html": {"max_parallel": 10},       // cheap
  "video": {"max_parallel": 2},       // expensive
  "llm_routing": {"max_parallel": 5}  // expensive + rate-limited
}
```

Workers check the limits before starting:

```sql
UPDATE system_config
SET value = jsonb_set(value, '{video,current}', ...)
WHERE (value->'video'->>'current')::int < (value->'video'->>'max_parallel')::int
```

(The `->>` operator returns text, so the counters must be cast to `int` for a numeric comparison.)

---

## Observability

### What to Track

**Routing decisions**: which domains use patterns vs cache vs LLM

**Fetch success rates**: which fetchers fail most often, which methods work

**Extraction quality**: which extractors produce the best results per content type

**Meta confidence**: how often high-confidence predictions are verified

**Performance**: time spent in each layer, bottlenecks

### Key Queries

Find domains that should be promoted to patterns:

```sql
SELECT domain, times_used, success_rate
FROM fetch_plan_cache
WHERE times_used > 10 AND success_rate > 0.9
ORDER BY times_used DESC;
```

See what happened to a job:

```sql
SELECT sa.strategy_position, sa.fetcher, sa.method,
       sa.success, sa.error_type, sa.execution_time_ms
FROM strategy_attempts sa
WHERE sa.job_id = 123
ORDER BY sa.attempted_at;
```

Find unreliable fetchers:

```sql
SELECT fetcher, method,
       COUNT(*) AS attempts,
       AVG(CASE WHEN success THEN 1.0 ELSE 0.0 END) AS success_rate,
       AVG(execution_time_ms) AS avg_time_ms
FROM strategy_attempts
GROUP BY fetcher, method
ORDER BY success_rate ASC;
```

---

## Example: Complete Flow

**URL**: `https://unknown-blog.com/post/quantum-computing`

### Layer 0: Routing (8 seconds)

1. Check pattern match → none (unknown domain)
2. Check cache → none (first visit)
3. Run reconnaissance:
   - Screenshot shows an article with code snippets
   - HTML has ` ` tag, no `
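The pattern → cache → LLM fallback exercised in steps 1-2 of this example can be sketched as a single routing function. This is a minimal in-memory sketch of the Hybrid Optimization mechanism described earlier; the names `route`, `PATTERNS`, `plan_cache`, and the promotion threshold are illustrative assumptions, not from the source.

```python
PATTERNS = {"youtube.com": {"content_type": "video"}}  # pre-defined patterns
plan_cache = {}      # domain → cached LLM decision
use_counts = {}      # successful cache hits per domain
PROMOTE_AFTER = 10   # hits before a cached plan becomes a pattern

def route(domain, llm_route):
    """Pattern match → cache lookup → LLM call, caching the result
    so each domain pays the LLM cost at most once."""
    if domain in PATTERNS:
        return PATTERNS[domain]                    # instant
    if domain in plan_cache:
        use_counts[domain] = use_counts.get(domain, 0) + 1
        if use_counts[domain] >= PROMOTE_AFTER:
            PATTERNS[domain] = plan_cache[domain]  # promote hot domain
        return plan_cache[domain]                  # fast
    plan = llm_route(domain)                       # slow, first visit only
    plan_cache[domain] = plan
    return plan
```

For `unknown-blog.com` above, the first call falls through to the LLM; every later URL from that domain is served from the cache.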