# Unified Fetcher Learning System

## Why This System?

The existing `fetcher_router` table requires manual configuration for each domain. This doesn't scale:

- New domains require human intervention
- Site behavior changes over time (captcha appears, SPA migration)
- No feedback loop to improve routing decisions

**Solution**: Learn from experience. Every fetch attempt teaches the system what works.

## Core Innovation

Replace pre-computed patterns with **conditional probability queries** over heuristic-tagged attempts.

Instead of:

```
patterns[domain=wikipedia.org] → playwright
patterns[suffix=.pdf] → simple_http
patterns[domain=wikipedia.org AND suffix=.pdf] → ???  # combinatorial explosion
```

We query:

```
P(success | heuristics that match this URL) → best fetcher
```

This enables flexible queries like "P(success | domain=X AND suffix=Y)" without pattern explosion.

## Documentation Index

| Doc | Description |
|-----|-------------|
| [learning_migrations.md](learning_migrations.md) | Database schema & migrations (HUMAN) |
| [learning_query.md](learning_query.md) | Query algorithms, time decay, confidence |
| [learning_validation.md](learning_validation.md) | Block/SPA detection, heuristic extraction |
| [learning_integration.md](learning_integration.md) | Phase code modifications |

## Task Split

### HUMAN TASKS

All PocketBase migrations must be done manually:

1. Create `fetcher_attempts` collection
2. Create `attempt_heuristics` collection
3. Update `links` collection (+4 fields)
4. Seed data (optional)

See [learning_migrations.md](learning_migrations.md)

### CODE TASKS (After migrations)

1. Update Go models (`internal/link/model.go`)
2. Create packages: `attempt/`, `heuristic/`, `validator/`, `selection/`
3. Modify integration points (`phase_route_fetcher.go`, `phase_fetch.go`)

## Architecture

### System Flow

```
┌─────────────────────────────────────────────────┐
│               route_fetcher phase               │
└─────────────────────────────────────────────────┘
                         │
   ┌─────────────────────┴─────────────────────┐
   │                                           │
LEGACY MODE                              LEARNING MODE
(LEARNING_MODE=false)                (LEARNING_MODE=true)
   │                                           │
   ▼                                           ▼
┌───────────────────┐           ┌────────────────────────────┐
│ fetcher_router    │           │ 1. Query attempt_heuristics│
│ table lookup      │           │    with time decay         │
└───────────────────┘           └────────────────────────────┘
   │                                           │
   │                                   confidence > 0.6?
   │                                           │
   │                                  yes──────┴──────no
   │                                   │               │
   │                                   ▼               ▼
   │                             use learned  ┌─────────────────┐
   │                             fetcher      │ 2. Cheap HTTP   │
   │                                   │      │    probe (3s)   │
   │                                   │      └─────────────────┘
   │                                   │               │
   │                                   │           classify:
   │                                   │       captcha → stealth
   │                                   │       SPA → playwright
   │                                   │       static → reuse body
   │                                   │               │
   └─────────────────────┬─────────────┴───────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│              fetch phase (execute)              │
│  - Extract heuristics (pre + post)              │
│  - Validate response (block/SPA/empty)          │
│  - Log attempt + heuristics                     │
│  - Handle retry/pause on failure                │
└─────────────────────────────────────────────────┘
```

### Parallel Implementation Strategy

The learning system runs **in parallel** with the existing hard-coded router:

| Mode | Behavior | Use Case |
|------|----------|----------|
| Legacy (`false`) | Uses `fetcher_router` table only | Current production |
| Learning (`true`) | Ignores `fetcher_router`, uses heuristics | New system |

**Why parallel?**

- Zero risk deployment (toggle off if issues)
- A/B testing between approaches
- Gradual migration
- Rollback capability

### Learning Mode Lookup Priority

1. **Heuristic-based learning** - query `attempt_heuristics` with time decay
2. **Cheap HTTP classification** - 3s GET + content analysis (Probe & Reuse)
3. **Execute & Learn** - validate, log attempt with heuristics, handle failures

**Note**: The hard-coded router is NOT part of the learning flow (legacy mode only).

## Code Structure

### New Packages

```
internal/fetcher/
  validator/
    validator.go             # ValidationResult struct, Validate()
    block_detector.go        # DetectCaptcha(), Detect403()
    spa_detector.go          # DetectSPA(), DetectReact()
    heuristic_extractor.go   # ExtractPostFetchHeuristics()
  heuristic/
    extractor.go             # ExtractFromURL() - domain, suffix, path
    types.go                 # Heuristic struct
  attempt/
    store.go                 # AttemptStore interface
    logger.go                # LogAttempt() with heuristic linking
    pocketbase/
      store.go               # PocketBase CRUD
      query.go               # FindBestFetcher() - conditional probability
  selection/
    selector.go              # FindBestFetcher() - waterfall logic
    confidence.go            # CalculateConfidence() with time decay
    fallback.go              # CheapHTTPClassification() + Probe & Reuse
```

### Modified Files

- `internal/link/model.go` - Add PausedUntil, RiskLevel, AttemptCount, LastAttemptAt
- `internal/link/pocketbase/store.go` - Handle new fields in Get/Save
- `internal/link/pocketbase/phase_route_fetcher.go` - Learning mode branch
- `internal/fetcher/direct_access/phase_fetch.go` - Validation, retry, logging
- `internal/fetcher/direct_access/fetcher.go` - Return headers + status code

## Key Design Decisions

### 1. Heuristic-Based vs Pattern-Based

**Choice**: Heuristic-based (no pre-computed `learned_patterns` table)

**Why?**

- No combinatorial explosion (domain × suffix × path × ...)
- Natural frequency weighting via query
- Flexible conditional queries
- Full transparency (audit any decision)
- Add new heuristics without schema changes

**Trade-off**: Slower queries (acceptable) vs simpler schema + zero maintenance

### 2. Time Decay (30-Day Half-Life)

**Choice**: `0.5^(days/30)` exponential decay

**Why?**

- Web changes fast (sites add captcha, migrate to SPA)
- Recent data is more relevant than old data
- Balances recency vs sample size
- Smooth influence reduction (not a cliff)

### 3. Explicit `is_banned` Field

**Choice**: Boolean flag instead of inferring from `error_type`

**Why?**

- Clear intent: ban = IP risk requiring backoff
- Fast ban rate queries per domain
- Decouples detection logic from categorization

### 4. TEXT Fields (not SELECT)

**Choice**: Use `text` for `fetcher`, `error_type`, `heuristic_type`

**Why?**

- Add new fetchers without migration
- Add new error types without migration
- Add new heuristic types without migration
- Current fetchers: `direct_access`, `youtube_downloader`, `nytime_searcher`, `archive_proxy`

### 5. Error-Specific Retry/Backoff

**Choice**: Different handling per error type

| Error | Retry? | Backoff Base |
|-------|--------|--------------|
| timeout | Yes (once) | 5 min |
| network_error | Yes (once) | 5 min |
| blocked_captcha | No | 10 min |
| blocked_403 | No | 10 min |
| wrong_tool | Yes (playwright) | - |

**Why?**

- Transient errors (timeout) = safe to retry immediately
- IP ban risk (captcha) = longer pause to protect
- Wrong tool = try a different approach, not retry the same one

### 6. Probe & Reuse Pattern

**Problem**: Cheap classification + full fetch = 2 HTTP requests

**Solution**: If probe succeeds and fetcher is `direct_access`, reuse probe body directly.

**Why?**

- Zero extra requests for static pages
- Still correctly classifies SPA/captcha (probe body discarded)
- Reduces load on target servers
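
The Probe & Reuse pattern could be sketched in Go roughly as follows. This is a minimal illustration, not the project's `fallback.go`: the `ProbeResult` type, the `classify` and `probe` functions, and the string markers used to spot a captcha or an empty SPA shell are all assumptions made for the example. Only the 3s timeout and the captcha/SPA/static mapping come from the design above. The demo runs against a local `httptest` server so it needs no network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// ProbeResult is a hypothetical container for the outcome of the cheap probe.
type ProbeResult struct {
	Fetcher string // fetcher suggested by the classification
	Body    []byte // non-nil only when the probe body can be reused as the fetch result
}

// classify maps a probe body to a fetcher name.
// The markers are deliberately naive placeholders (assumptions), not real detection rules.
func classify(body string) string {
	lower := strings.ToLower(body)
	switch {
	case strings.Contains(lower, "captcha"):
		return "stealth"
	case strings.Contains(lower, `<div id="root"></div>`):
		return "playwright"
	default:
		return "direct_access"
	}
}

// probe performs one GET and keeps the body only for static pages,
// so a static page costs a single HTTP request in total.
func probe(client *http.Client, url string) (ProbeResult, error) {
	resp, err := client.Get(url)
	if err != nil {
		return ProbeResult{}, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return ProbeResult{}, err
	}

	res := ProbeResult{Fetcher: classify(string(body))}
	if res.Fetcher == "direct_access" && resp.StatusCode == http.StatusOK {
		res.Body = body // reuse: no second request needed
	}
	return res, nil
}

func main() {
	// Local stand-in for a static site.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body><p>static page</p></body></html>")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 3 * time.Second} // 3s probe budget from the flow diagram
	res, err := probe(client, srv.URL)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Printf("fetcher=%s reuse=%t\n", res.Fetcher, res.Body != nil)
}
```

When the classification is `stealth` or `playwright`, `Body` stays nil, which mirrors the "probe body discarded" case: the selected fetcher performs its own request.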