# Unified Fetcher Learning System

## Why This System?

The existing `fetcher_router` table requires manual configuration for each domain. This doesn't scale:
- New domains require human intervention
- Site behavior changes over time (captcha appears, SPA migration)
- No feedback loop to improve routing decisions

**Solution**: Learn from experience. Every fetch attempt teaches the system what works.

## Core Innovation

Replace pre-computed patterns with **conditional probability queries** over heuristic-tagged attempts.

Instead of:
```
patterns[domain=wikipedia.org] → playwright
patterns[suffix=.pdf] → simple_http
patterns[domain=wikipedia.org AND suffix=.pdf] → ???  # combinatorial explosion
```

We query:
```
P(success | heuristics that match this URL) → best fetcher
```

This enables flexible queries like "P(success | domain=X AND suffix=Y)" without pattern explosion.
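
As a rough illustration of the idea, the sketch below evaluates such a query over an in-memory slice. The `Attempt` type, `SuccessRates`, and `BestFetcher` are hypothetical names introduced here; the real system would query the `attempt_heuristics` collection and apply time decay, which this sketch omits.

```go
package main

import "fmt"

// Attempt is a hypothetical in-memory record of one fetch attempt.
type Attempt struct {
	Fetcher    string
	Success    bool
	Heuristics map[string]string // e.g. "domain" -> "wikipedia.org"
}

// matches reports whether the attempt carries every heuristic in cond.
func matches(a Attempt, cond map[string]string) bool {
	for k, v := range cond {
		if a.Heuristics[k] != v {
			return false
		}
	}
	return true
}

// SuccessRates returns P(success | cond) per fetcher, counting only
// attempts whose heuristics satisfy every condition in cond.
func SuccessRates(attempts []Attempt, cond map[string]string) map[string]float64 {
	total := map[string]int{}
	ok := map[string]int{}
	for _, a := range attempts {
		if !matches(a, cond) {
			continue
		}
		total[a.Fetcher]++
		if a.Success {
			ok[a.Fetcher]++
		}
	}
	rates := make(map[string]float64, len(total))
	for f, n := range total {
		rates[f] = float64(ok[f]) / float64(n)
	}
	return rates
}

// BestFetcher picks the fetcher with the highest rate; ties go to the
// lexicographically smaller name so the result is deterministic.
func BestFetcher(rates map[string]float64) string {
	best, bestRate := "", -1.0
	for f, r := range rates {
		if r > bestRate || (r == bestRate && f < best) {
			best, bestRate = f, r
		}
	}
	return best
}

func main() {
	attempts := []Attempt{
		{Fetcher: "simple_http", Success: true, Heuristics: map[string]string{"domain": "wikipedia.org", "suffix": ".pdf"}},
		{Fetcher: "playwright", Success: false, Heuristics: map[string]string{"domain": "wikipedia.org", "suffix": ".pdf"}},
		{Fetcher: "playwright", Success: true, Heuristics: map[string]string{"domain": "wikipedia.org", "suffix": ".html"}},
	}
	cond := map[string]string{"domain": "wikipedia.org", "suffix": ".pdf"}
	fmt.Println(BestFetcher(SuccessRates(attempts, cond)))
}
```

Adding or dropping a condition only changes the `cond` map, so no per-combination pattern needs to be stored.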

## Documentation Index

| Doc | Description |
|-----|-------------|
| [learning_migrations.md](learning_migrations.md) | Database schema & migrations (HUMAN) |
| [learning_query.md](learning_query.md) | Query algorithms, time decay, confidence |
| [learning_validation.md](learning_validation.md) | Block/SPA detection, heuristic extraction |
| [learning_integration.md](learning_integration.md) | Phase code modifications |

## Task Split

### HUMAN TASKS
All PocketBase migrations must be done manually:
1. Create `fetcher_attempts` collection
2. Create `attempt_heuristics` collection
3. Update `links` collection (+4 fields)
4. Seed data (optional)

See [learning_migrations.md](learning_migrations.md) for details.

### CODE TASKS (After migrations)
1. Update Go models (`internal/link/model.go`)
2. Create packages: `attempt/`, `heuristic/`, `validator/`, `selection/`
3. Modify integration points (`phase_route_fetcher.go`, `phase_fetch.go`)

## Architecture

### System Flow

```
                    ┌─────────────────────────────────────────────────┐
                    │              route_fetcher phase                │
                    └─────────────────────────────────────────────────┘
                                          │
                    ┌─────────────────────┴─────────────────────┐
                    │                                           │
              LEGACY MODE                               LEARNING MODE
         (LEARNING_MODE=false)                      (LEARNING_MODE=true)
                    │                                           │
                    ▼                                           ▼
        ┌───────────────────┐                   ┌─────────────────────────────┐
        │  fetcher_router   │                   │  1. Query attempt_heuristics│
        │  table lookup     │                   │     with time decay         │
        └───────────────────┘                   └─────────────────────────────┘
                    │                                           │
                    │                              confidence > 0.6?
                    │                                    │
                    │                           yes──────┴──────no
                    │                            │              │
                    │                            ▼              ▼
                    │                      use learned    ┌─────────────────┐
                    │                      fetcher        │ 2. Cheap HTTP   │
                    │                            │        │    probe (3s)   │
                    │                            │        └─────────────────┘
                    │                            │              │
                    │                            │         classify:
                    │                            │         captcha → stealth
                    │                            │         SPA → playwright
                    │                            │         static → reuse body
                    │                            │              │
                    └────────────┬───────────────┴──────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────────────────────────────┐
                    │            fetch phase (execute)                │
                    │  - Extract heuristics (pre + post)              │
                    │  - Validate response (block/SPA/empty)          │
                    │  - Log attempt + heuristics                     │
                    │  - Handle retry/pause on failure                │
                    └─────────────────────────────────────────────────┘
```

### Parallel Implementation Strategy

The learning system is deployed **in parallel** with the existing hard-coded router. Both code paths stay in place, and `LEARNING_MODE` selects which one handles routing:

| Mode | Behavior | Use Case |
|------|----------|----------|
| Legacy (`false`) | Uses `fetcher_router` table only | Current production |
| Learning (`true`) | Ignores `fetcher_router`, uses heuristics | New system |

**Why parallel?**
- Low-risk deployment (toggle off if issues arise)
- A/B testing between approaches
- Gradual migration
- Rollback capability
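
A minimal sketch of the toggle, assuming `LEARNING_MODE` is read from an environment variable (the document only shows the `true`/`false` values, so the source of the setting is an assumption). The helper name `learningModeEnabled` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// learningModeEnabled parses a LEARNING_MODE value. An unset or
// unparsable value falls back to legacy mode, the safe default.
func learningModeEnabled(raw string) bool {
	enabled, err := strconv.ParseBool(raw)
	return err == nil && enabled
}

func main() {
	if learningModeEnabled(os.Getenv("LEARNING_MODE")) {
		fmt.Println("routing via learned heuristics")
	} else {
		fmt.Println("routing via fetcher_router table")
	}
}
```

Defaulting to legacy mode on any parse failure keeps rollback a matter of unsetting one value.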

### Learning Mode Lookup Priority

1. **Heuristic-based learning** - query `attempt_heuristics` with time decay
2. **Cheap HTTP classification** - 3s GET + content analysis (Probe & Reuse)
3. **Execute & Learn** - validate, log attempt with heuristics, handle failures

**Note**: The hard-coded router is NOT part of the learning flow; it is used in legacy mode only.
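
The first two steps of this waterfall could be sketched as follows. The 0.6 threshold and the `stealth` / `playwright` / static mapping come from the flow diagram; the `Learned` and `ProbeClass` types and the `SelectFetcher` signature are assumptions made for illustration.

```go
package main

import "fmt"

// confidenceThreshold is the cut-off shown in the flow diagram.
const confidenceThreshold = 0.6

// Learned is a hypothetical result of the heuristic query (step 1).
type Learned struct {
	Fetcher    string
	Confidence float64
}

// ProbeClass is a hypothetical classification of the cheap HTTP probe.
type ProbeClass int

const (
	Static ProbeClass = iota
	SPA
	Captcha
)

// SelectFetcher uses the learned fetcher when confidence is high enough
// and otherwise falls back to the probe. The probe is passed as a
// function so it only runs when step 1 is inconclusive.
func SelectFetcher(learned *Learned, probe func() ProbeClass) (fetcher, source string) {
	if learned != nil && learned.Confidence > confidenceThreshold {
		return learned.Fetcher, "learned"
	}
	switch probe() {
	case Captcha:
		return "stealth", "probe"
	case SPA:
		return "playwright", "probe"
	default:
		return "direct_access", "probe"
	}
}

func main() {
	probe := func() ProbeClass { return SPA }
	fmt.Println(SelectFetcher(&Learned{Fetcher: "direct_access", Confidence: 0.9}, probe))
	fmt.Println(SelectFetcher(nil, probe))
}
```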

## Code Structure

### New Packages

```
internal/fetcher/
  validator/
    validator.go              # ValidationResult struct, Validate()
    block_detector.go         # DetectCaptcha(), Detect403()
    spa_detector.go           # DetectSPA(), DetectReact()
    heuristic_extractor.go    # ExtractPostFetchHeuristics()

  heuristic/
    extractor.go              # ExtractFromURL() - domain, suffix, path
    types.go                  # Heuristic struct

  attempt/
    store.go                  # AttemptStore interface
    logger.go                 # LogAttempt() with heuristic linking
    pocketbase/
      store.go                # PocketBase CRUD
      query.go                # FindBestFetcher() - conditional probability

  selection/
    selector.go               # FindBestFetcher() - waterfall logic
    confidence.go             # CalculateConfidence() with time decay
    fallback.go               # CheapHTTPClassification() + Probe & Reuse
```
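
To make the `heuristic/` package more concrete, here is one way `ExtractFromURL()` might look. The `Heuristic` fields, the `domain` / `suffix` / `path` type strings, and the choice to use only the first path segment are all assumptions; the actual definitions belong in [learning_validation.md](learning_validation.md).

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// Heuristic is a hypothetical type/value tag attached to an attempt.
type Heuristic struct {
	Type  string
	Value string
}

// ExtractFromURL derives pre-fetch heuristics from the URL alone.
func ExtractFromURL(raw string) ([]Heuristic, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	// Normalize the host so www.example.com and example.com share data.
	host := strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.")
	hs := []Heuristic{{Type: "domain", Value: host}}

	// File suffix, if the path has one (e.g. ".pdf").
	if ext := strings.ToLower(path.Ext(u.Path)); ext != "" {
		hs = append(hs, Heuristic{Type: "suffix", Value: ext})
	}
	// First path segment only, to keep the number of distinct values small.
	if seg := strings.SplitN(strings.Trim(u.Path, "/"), "/", 2)[0]; seg != "" {
		hs = append(hs, Heuristic{Type: "path", Value: "/" + seg})
	}
	return hs, nil
}

func main() {
	hs, err := ExtractFromURL("https://www.wikipedia.org/docs/report.pdf")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, h := range hs {
		fmt.Printf("%s=%s\n", h.Type, h.Value)
	}
}
```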

### Modified Files

- `internal/link/model.go` - Add PausedUntil, RiskLevel, AttemptCount, LastAttemptAt
- `internal/link/pocketbase/store.go` - Handle new fields in Get/Save
- `internal/link/pocketbase/phase_route_fetcher.go` - Learning mode branch
- `internal/fetcher/direct_access/phase_fetch.go` - Validation, retry, logging
- `internal/fetcher/direct_access/fetcher.go` - Return headers + status code

## Key Design Decisions

### 1. Heuristic-Based vs Pattern-Based

**Choice**: Heuristic-based (no pre-computed `learned_patterns` table)

**Why?**
- No combinatorial explosion (domain × suffix × path × ...)
- Natural frequency weighting via query
- Flexible conditional queries
- Full transparency (audit any decision)
- Add new heuristics without schema changes

**Trade-off**: Slower queries (acceptable) in exchange for a simpler schema and no pattern-table maintenance

### 2. Time Decay (30-Day Half-Life)

**Choice**: `0.5^(days/30)` exponential decay

**Why?**
- Web changes fast (sites add captcha, migrate to SPA)
- Recent data more relevant than old data
- Balances recency vs sample size
- Smooth influence reduction (not cliff)
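
The decay formula can be sketched as a weight applied to each attempt when computing a success rate. `DecayWeight`, `Outcome`, and `WeightedSuccessRate` are hypothetical names; how the real `CalculateConfidence()` combines these weights is described in [learning_query.md](learning_query.md) and is not reproduced here.

```go
package main

import (
	"fmt"
	"math"
)

// halfLifeDays is the 30-day half-life chosen above.
const halfLifeDays = 30.0

// DecayWeight returns 0.5^(days/30): 1.0 for a fresh attempt,
// 0.5 after 30 days, 0.25 after 60 days, and so on.
func DecayWeight(ageDays float64) float64 {
	return math.Pow(0.5, ageDays/halfLifeDays)
}

// Outcome is a hypothetical attempt summary: its age and result.
type Outcome struct {
	AgeDays float64
	Success bool
}

// WeightedSuccessRate is the decay-weighted share of successes.
// It returns 0 when there are no outcomes.
func WeightedSuccessRate(outcomes []Outcome) float64 {
	var num, den float64
	for _, o := range outcomes {
		w := DecayWeight(o.AgeDays)
		den += w
		if o.Success {
			num += w
		}
	}
	if den == 0 {
		return 0
	}
	return num / den
}

func main() {
	outcomes := []Outcome{
		{AgeDays: 0, Success: true},   // weight 1.0
		{AgeDays: 30, Success: false}, // weight 0.5
	}
	fmt.Printf("%.3f\n", WeightedSuccessRate(outcomes))
}
```

With one fresh success and one 30-day-old failure, the weighted rate is 1.0 / 1.5, about 0.667, rather than the unweighted 0.5.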

### 3. Explicit `is_banned` Field

**Choice**: Boolean flag instead of inferring from `error_type`

**Why?**
- Clear intent: ban = IP risk requiring backoff
- Fast ban rate queries per domain
- Decouples detection logic from categorization
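
With an explicit flag, a per-domain ban rate reduces to a simple count. The sketch below works on an in-memory slice with a hypothetical `Attempt` type; in the real system this would be a filter on `is_banned` in the `fetcher_attempts` collection.

```go
package main

import "fmt"

// Attempt is a hypothetical subset of a fetcher_attempts record.
type Attempt struct {
	Domain   string
	IsBanned bool
}

// BanRate returns the fraction of attempts against domain that were
// flagged as bans, or 0 if the domain has no recorded attempts.
func BanRate(attempts []Attempt, domain string) float64 {
	var total, banned int
	for _, a := range attempts {
		if a.Domain != domain {
			continue
		}
		total++
		if a.IsBanned {
			banned++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(banned) / float64(total)
}

func main() {
	attempts := []Attempt{
		{Domain: "example.com", IsBanned: false},
		{Domain: "example.com", IsBanned: true},
		{Domain: "other.org", IsBanned: false},
	}
	fmt.Println(BanRate(attempts, "example.com"))
}
```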

### 4. TEXT Fields (not SELECT)

**Choice**: Use `text` for `fetcher`, `error_type`, `heuristic_type`

**Why?**
- Add new fetchers without migration
- Add new error types without migration
- Add new heuristic types without migration

Current fetchers: `direct_access`, `youtube_downloader`, `nytime_searcher`, `archive_proxy`

### 5. Error-Specific Retry/Backoff

**Choice**: Different handling per error type

| Error | Retry? | Backoff Base |
|-------|--------|--------------|
| timeout | Yes (once) | 5 min |
| network_error | Yes (once) | 5 min |
| blocked_captcha | No | 10 min |
| blocked_403 | No | 10 min |
| wrong_tool | Yes (playwright) | - |

**Why?**
- Transient errors (timeout, network_error) = safe to retry once before backing off
- IP ban risk (captcha) = longer pause to protect
- Wrong tool = try different approach, not retry same
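
The table above maps naturally onto a lookup keyed by `error_type`. This is a sketch only: the `Policy` struct and `PolicyFor` are hypothetical, and how the backoff base grows across repeated failures is not specified in this document, so it is left out.

```go
package main

import (
	"fmt"
	"time"
)

// Policy is a hypothetical encoding of one row of the table above.
type Policy struct {
	Retry       bool          // retry at all?
	RetryWith   string        // fetcher to switch to; empty means same fetcher
	BackoffBase time.Duration // zero means no pause
}

// policies mirrors the error table; keys are error_type values.
var policies = map[string]Policy{
	"timeout":         {Retry: true, BackoffBase: 5 * time.Minute},
	"network_error":   {Retry: true, BackoffBase: 5 * time.Minute},
	"blocked_captcha": {Retry: false, BackoffBase: 10 * time.Minute},
	"blocked_403":     {Retry: false, BackoffBase: 10 * time.Minute},
	"wrong_tool":      {Retry: true, RetryWith: "playwright"},
}

// PolicyFor looks up the policy for an error type. The boolean is
// false for error types the table does not cover.
func PolicyFor(errorType string) (Policy, bool) {
	p, ok := policies[errorType]
	return p, ok
}

func main() {
	for _, e := range []string{"timeout", "blocked_captcha", "wrong_tool"} {
		p, _ := PolicyFor(e)
		fmt.Printf("%s: retry=%v with=%q backoff=%s\n", e, p.Retry, p.RetryWith, p.BackoffBase)
	}
}
```

Because `error_type` is a plain text field, a new error type only needs a new map entry, not a migration.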

### 6. Probe & Reuse Pattern

**Problem**: Cheap classification + full fetch = 2 HTTP requests

**Solution**: If the probe succeeds and the selected fetcher is `direct_access`, reuse the probe body instead of fetching the page again.

**Why?**
- Zero extra requests for static pages
- SPA and captcha pages are still classified correctly (their probe body is discarded)
- Reduces load on target servers
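
A minimal sketch of the pattern, with the network calls replaced by plain values so the control flow is visible. The `ProbeResult` type, the `classify` markers (a `captcha` substring and an empty React-style root `div`), and `FetchWithProbe` are illustrative assumptions; the real detection rules live in the validator package.

```go
package main

import (
	"fmt"
	"strings"
)

// ProbeResult is a hypothetical summary of the cheap HTTP probe.
type ProbeResult struct {
	Body string
}

// classify maps a probe body to a fetcher name using deliberately
// crude markers; real detection would be more thorough.
func classify(p ProbeResult) string {
	lower := strings.ToLower(p.Body)
	switch {
	case strings.Contains(lower, "captcha"):
		return "stealth"
	case strings.Contains(lower, `<div id="root"></div>`):
		return "playwright"
	default:
		return "direct_access"
	}
}

// FetchWithProbe returns the page body and how many requests were made
// in addition to the probe. Static pages reuse the probe body; anything
// else discards it and delegates to fullFetch.
func FetchWithProbe(p ProbeResult, fullFetch func(fetcher string) string) (body string, extraRequests int) {
	fetcher := classify(p)
	if fetcher == "direct_access" {
		return p.Body, 0
	}
	return fullFetch(fetcher), 1
}

func main() {
	fullFetch := func(fetcher string) string { return "rendered by " + fetcher }

	body, extra := FetchWithProbe(ProbeResult{Body: "<html><p>Hello</p></html>"}, fullFetch)
	fmt.Println(body, extra)

	body, extra = FetchWithProbe(ProbeResult{Body: `<html><div id="root"></div></html>`}, fullFetch)
	fmt.Println(body, extra)
}
```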
