# Learning System - Database Migrations **Owner**: Human (manual PocketBase migrations) ## Schema Overview ``` ┌─────────────────┐ ┌──────────────────────┐ ┌────────────────────┐ │ links │ │ fetcher_attempts │ │ attempt_heuristics │ ├─────────────────┤ ├──────────────────────┤ ├────────────────────┤ │ id │◄──────│ link (nullable) │ │ id │ │ initial_url │ │ url │ │ attempt │──────►│ │ ...existing... │ │ fetcher │◄──────│ heuristic_type │ │ + paused_until │ │ success │ │ heuristic_value │ │ + risk_level │ │ is_banned │ │ importance_score │ │ + attempt_count │ │ error_type │ └────────────────────┘ │ + last_attempt │ │ http_status │ └─────────────────┘ │ response_headers │ │ duration_ms │ │ attempted_at │ └──────────────────────┘ ``` **Why this structure?** - `fetcher_attempts`: Raw audit log (what happened, when, outcome) - `attempt_heuristics`: Tags for conditional probability queries (many per attempt) - Junction table enables: "find all attempts where domain=X OR suffix=Y" --- ## Migration 1: `fetcher_attempts` Raw audit log of fetch attempts. Recommend 90-day retention policy. **Fields**: | Field | Type | Notes | |-------|------|-------| | `id` | text | PK, auto 15-char | | `link` | relation → links | **See cascade decision below** | | `url` | text | required (redundant but useful for orphaned attempts) | | `fetcher` | text | **USE TEXT** (see why below) | | `success` | bool | Did fetch return usable content? | | `is_banned` | bool | Explicit IP ban detection | | `error_type` | text | **USE TEXT** (see why below) | | `http_status` | number | 200, 403, 429, etc. | | `response_headers` | json | Server, Content-Type, CF-Ray for debugging | | `duration_ms` | number | For performance tracking | | `attempted_at` | autodate | For time decay calculation | ### Why TEXT not SELECT? PocketBase `select` fields require predefined options. Adding a new fetcher or error type would require a migration. **TEXT allows**: - Add `playwright` fetcher without migration - Add `blocked_waf` error type without migration - Future-proof against unforeseen categories **Current fetcher values**: `direct_access`, `youtube_downloader`, `nytime_searcher`, `archive_proxy` **Error type values**: `blocked_captcha`, `blocked_403`, `empty_content`, `timeout`, `network_error`, `wrong_tool` ### Indexes - `idx_attempts_link` on `link` - fast lookup by link - `idx_attempts_url` on `url` - fast lookup by URL pattern - `idx_attempts_time` on `attempted_at DESC` - time decay queries --- ## Migration 2: `attempt_heuristics` Many-to-many junction table for conditional probability queries. **Design Philosophy**: This is the core innovation. Each attempt gets tagged with multiple heuristics (domain, suffix, has_captcha, etc.). Queries find attempts matching ANY of the input heuristics, then calculate success rates. **Fields**: | Field | Type | Notes | |-------|------|-------| | `id` | text | PK, auto 15-char | | `attempt` | relation → fetcher_attempts | cascade delete | | `heuristic_type` | text | **MUST be TEXT** for dynamic types | | `heuristic_value` | text | e.g. "wikipedia.org", ".pdf", "true" | | `importance_score` | number | 0-1, default 0.5 (for future analytics) | | `created_at` | autodate | | ### Why heuristic_type is TEXT This is intentional and critical: - Add `video_platform` heuristic type without migration - Add `deep_path` heuristic type without migration - Any code can invent new heuristics, DB accepts them - Robust via attempt relation (query any type, even just invented) **Initial heuristic types**: - Pre-fetch (from URL): `domain`, `suffix`, `contains_cdn`, `contains_static` - Post-fetch (from response): `status_200`, `has_captcha`, `has_spa`, `empty_body` ### Indexes - `idx_heur_type_value` on `(heuristic_type, heuristic_value)` - the main query path - `idx_heur_attempt` on `attempt` - cleanup, analytics --- ## Migration 3: Update `links` Add pause/retry fields to existing collection. **New fields**: | Field | Type | Notes | |-------|------|-------| | `paused_until` | date | nullable, exponential backoff timestamp | | `risk_level` | number | min 0, default 0, ban risk counter | | `attempt_count` | number | min 0, default 0 | | `last_attempt_at` | date | nullable | ### Why these fields? - **paused_until**: Exponential backoff after failures. Link won't be processed until this timestamp passes. - **risk_level**: Tracks consecutive failures. Higher = longer pause. Reset on success. - **attempt_count**: Total attempts ever. For monitoring, not logic. - **last_attempt_at**: When last attempted. For debugging, rate limiting. --- ## Migration 4: Seed Data (Optional) Bootstrap learning system with synthetic priors. Without this, system starts cold and must probe every URL. ### Why Seed Data? Common patterns are well-known: - `.pdf`, `.mp4`, `.jpg` → direct HTTP works - `/cdn/`, `/static/` paths → direct HTTP works Seeding gives confidence > 0.6 immediately for these patterns. ### Insert into `fetcher_attempts`: ``` id: seed_pdf url: "seed://suffix.pdf" fetcher: "direct_access" success: true link: null (no real link) id: seed_mp4 url: "seed://suffix.mp4" fetcher: "direct_access" success: true id: seed_cdn url: "seed://cdn-path" fetcher: "direct_access" success: true ``` ### Insert into `attempt_heuristics`: ``` attempt: seed_pdf, heuristic_type: "suffix", heuristic_value: ".pdf", importance_score: 0.95 attempt: seed_mp4, heuristic_type: "suffix", heuristic_value: ".mp4", importance_score: 0.95 attempt: seed_cdn, heuristic_type: "contains_cdn", heuristic_value: "true", importance_score: 0.85 ``` --- ## Critical Decision: Cascade Delete The `link` relation in `fetcher_attempts` has a cascade delete decision: | Option | Behavior | Pros | Cons | |--------|----------|------|------| | Cascade delete | Attempts deleted when link deleted | Clean, no orphans | Lose learning data | | Nullable | Attempts preserved when link deleted | Preserve domain-level learning | Orphaned records | **Recommendation**: Make `link` **nullable** to preserve historical learning data. **Why?** The learning system benefits from ALL historical attempts for a domain, even if the specific link record is gone. Domain-level patterns persist. --- ## Migration Order Must run in order due to foreign key dependencies: ``` 1. fetcher_attempts (standalone) 2. attempt_heuristics (depends on fetcher_attempts) 3. links update (standalone, existing collection) 4. seed data (depends on 1 + 2) ``` --- ## Migration Files (when created) ``` migrations/{timestamp}_created_fetcher_attempts.go migrations/{timestamp}_created_attempt_heuristics.go migrations/{timestamp}_updated_links.go migrations/{timestamp}_seed_heuristic_priors.go ```