# Unified Fetcher Learning System - Implementation Plan

## Overview

Build a unified system that learns optimal fetcher strategies from experience while protecting against IP bans.

**Core Innovation**: Replace pre-computed patterns with conditional probability queries over heuristic-tagged attempts. This enables flexible "P(success | domain=X AND suffix=Y)" queries without pattern explosion.

## Architecture

### Parallel Implementation Strategy

The new learning system runs **in parallel** with the existing hard-coded router, with a **feature flag** to switch between:

- **Legacy mode**: Uses the existing `fetcher_router` table (no learning)
- **Learning mode**: Uses the new heuristic-based system (ignores `fetcher_router`)

This allows:

- Zero-risk deployment (toggle off if issues arise)
- A/B testing between approaches
- Gradual migration
- Rollback capability

### Learning Mode Lookup Priority

1. **Heuristic-based learning** - query `attempt_heuristics` with time decay
2. **Cheap HTTP classification** - 3s GET + content analysis
3. **Execute & Learn** - validate, log the attempt with heuristics, handle failures

**Note**: The hard-coded router is NOT part of the new flow (legacy mode only).

### Key Enhancements

- **Parallel deployment**: Feature flag to switch between legacy and learning modes
- **Explicit ban tracking**: `is_banned` boolean for IP risk management
- **Time decay**: Weight recent attempts with `0.5^(days/30)` exponential decay
- **Heuristic importance**: Natural frequency weighting + explicit scoring for dynamic heuristics
- **Smart retry**: Retry on timeout/network errors; LONGER pause after captcha (10 min base)
- **Exponential backoff**: Different schedules for captcha vs other errors
- **Dynamic heuristics**: Add new heuristic types without code changes

## Database Schema

### New Collection: `fetcher_attempts`

Raw audit log (90-day retention).

**Fields**:

- `id` (text, PK)
- `link` (relation → links, cascade delete)
- `url` (text, required)
- `fetcher` (select: simple_http, playwright, playwright_stealth)
- `success` (bool)
- `is_banned` (bool) - explicit IP ban detection
- `error_type` (select: blocked_captcha, blocked_403, empty_content, timeout, wrong_tool)
- `http_status` (number)
- `response_headers` (json) - Server, Content-Type, CF-Ray
- `duration_ms` (number)
- `attempted_at` (autodate)

**Indexes**:

- `idx_attempts_link` on `link`
- `idx_attempts_url` on `url`
- `idx_attempts_time` on `attempted_at DESC`

### New Collection: `attempt_heuristics`

Many-to-many junction table for conditional probability.

**Fields**:

- `id` (text, PK)
- `attempt` (relation → fetcher_attempts, cascade delete)
- `heuristic_type` (text) - OPEN-ENDED: domain, suffix, contains_cdn, status_200, has_captcha, custom_xyz, etc.
- `heuristic_value` (text) - "wikipedia.org", ".jpg", "true"
- `importance_score` (number, 0-1) - calculated via predictive power analytics
- `created_at` (autodate)

**Design**: `heuristic_type` is TEXT (not select) to allow adding new heuristic types dynamically, without a schema migration. The attempt relation guarantees robustness: you can query any heuristic type even if it was just invented.

**Indexes**:

- `idx_heur_type_value` on `(heuristic_type, heuristic_value)`
- `idx_heur_attempt` on `attempt`

**Design**: No pre-computed patterns. Dynamic GROUP BY queries compute conditional probability.

### Extended Collection: `links`

Add pause/retry fields.

**New fields**:

- `paused_until` (date, optional) - exponential backoff timestamp
- `risk_level` (number, 0-5) - ban risk counter
- `attempt_count` (number, min 0) - total fetch attempts
- `last_attempt_at` (date) - most recent fetch
- `learning_mode_enabled` (bool, default: false) - per-link feature flag (optional: can be a global config instead)

### Seed Data

Bootstrap with synthetic priors:

```
Attempt 1: seed:.pdf → simple_http (success)
  Heuristics: (suffix, .pdf, importance: 0.95)
Attempt 2: seed:.mp4 → simple_http (success)
  Heuristics: (suffix, .mp4, importance: 0.95)
Attempt 3: seed:/cdn/ → simple_http (success)
  Heuristics: (contains_cdn, true, importance: 0.85)
```

## Code Structure

### New Packages

```
internal/fetcher/
  validator/
    validator.go            # ValidationResult struct, Validate() interface
    block_detector.go       # DetectCaptcha(), Detect403()
    spa_detector.go         # DetectSPA(), DetectReact()
    heuristic_extractor.go  # ExtractPostFetchHeuristics()
  heuristic/
    extractor.go            # ExtractFromURL() - domain, suffix, path
    types.go                # Heuristic struct, HeuristicType enum
  attempt/
    store.go                # AttemptStore interface
    logger.go               # LogAttempt() with heuristic linking
    pocketbase/
      store.go              # PocketBase CRUD
      query.go              # FindBestFetcher() - conditional probability
  selection/
    selector.go             # FindBestFetcher() - waterfall logic
    confidence.go           # CalculateConfidence() with time decay
    fallback.go             # CheapHTTPClassification()
```

### Modified Files

- `internal/link/model.go` - Add PausedUntil, RiskLevel, AttemptCount, LastAttemptAt
- `internal/link/pocketbase/store.go` - Handle new fields in Get/Save
- `internal/link/pocketbase/phase_route_fetcher.go` - Integrate waterfall (hard-coded → learned → heuristic)
- `internal/fetcher/direct_access/phase_fetch.go` - Add validation, retry logic, attempt logging
- `internal/fetcher/direct_access/fetcher.go` - Return headers + status code

### Migration Files

- `migrations/{timestamp}_created_fetcher_attempts.go`
- `migrations/{timestamp}_created_attempt_heuristics.go`
- `migrations/{timestamp}_updated_links.go`
- `migrations/{timestamp}_seed_heuristic_priors.go`

## Core Algorithms

### Conditional Probability Query (Fully Dynamic)

Find the best fetcher given **any combination** of heuristics, with time decay:

```go
func (s *Store) FindBestFetcher(heuristics []Heuristic, minSampleSize int) (*FetcherScore, error) {
	// 1. Build OR filter matching ANY heuristic (fully dynamic - no hardcoded types)
	filter := ""
	params := make(map[string]any)
	for i, h := range heuristics {
		if i > 0 {
			filter += " || "
		}
		filter += fmt.Sprintf("(heuristic_type = {:type%d} && heuristic_value = {:value%d})", i, i)
		params[fmt.Sprintf("type%d", i)] = h.Type // Could be "video_platform", "deep_path", anything
		params[fmt.Sprintf("value%d", i)] = h.Value
	}

	// 2. Find matching heuristic records (discovers ANY heuristic type in the DB)
	heuristicRecords, err := s.app.FindRecordsByFilter("attempt_heuristics", filter, "", 0, 0, params)
	if err != nil || len(heuristicRecords) == 0 {
		return nil, nil // No learned data
	}

	// 3. Extract attempt IDs (deduplicated via map)
	attemptIDs := make(map[string]bool)
	for _, hr := range heuristicRecords {
		attemptIDs[hr.GetString("attempt")] = true
	}

	// 4. Fetch all attempts, apply time decay
	scores := make(map[string]*FetcherScore)
	now := time.Now()
	for attemptID := range attemptIDs {
		attempt, err := s.app.FindRecordById("fetcher_attempts", attemptID)
		if err != nil {
			continue // Skip dangling references instead of dereferencing nil
		}
		fetcher := attempt.GetString("fetcher")
		success := attempt.GetBool("success")
		attemptedAt := attempt.GetDateTime("attempted_at").Time()

		// Time decay: 0.5 decay per 30 days
		daysSince := now.Sub(attemptedAt).Hours() / 24
		weight := math.Pow(0.5, daysSince/30)

		if scores[fetcher] == nil {
			scores[fetcher] = &FetcherScore{Fetcher: fetcher}
		}
		scores[fetcher].SampleSize++
		if success {
			scores[fetcher].WeightedSuccesses += weight
		}
	}

	// 5. Calculate confidence (weighted_success_rate × min(1, sample_size/10))
	var best *FetcherScore
	for _, score := range scores {
		if score.SampleSize < minSampleSize {
			continue // Skip if not enough data
		}
		score.SuccessRate = score.WeightedSuccesses / float64(score.SampleSize)
		// Confidence penalty for low sample size
		sampleFactor := math.Min(1.0, float64(score.SampleSize)/10.0)
		score.Confidence = score.SuccessRate * sampleFactor
		if best == nil || score.Confidence > best.Confidence {
			best = score
		}
	}
	return best, nil
}
```

**Key insight**: This query works with ANY heuristic type that exists in the database. No enumeration needed.
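The decay and sample-size arithmetic above can be checked in isolation. A minimal, runnable sketch (the helper names `decayWeight` and `confidence` are illustrative, not part of the planned packages):

```go
package main

import (
	"fmt"
	"math"
)

// decayWeight implements the plan's 30-day half-life: 0.5^(days/30).
func decayWeight(daysSince float64) float64 {
	return math.Pow(0.5, daysSince/30)
}

// confidence applies the sample-size penalty: success_rate × min(1, n/10).
func confidence(successRate float64, sampleSize int) float64 {
	return successRate * math.Min(1.0, float64(sampleSize)/10.0)
}

func main() {
	// An attempt from 30 days ago counts half as much as one from today.
	fmt.Printf("weight(0d)=%.2f weight(30d)=%.2f weight(60d)=%.2f\n",
		decayWeight(0), decayWeight(30), decayWeight(60))

	// 5 samples at 100% success only reach 0.5 confidence, which stays
	// below the 0.6 selection threshold; 20 samples at 80% pass it.
	fmt.Printf("confidence(1.0, 5)=%.2f confidence(0.8, 20)=%.2f\n",
		confidence(1.0, 5), confidence(0.8, 20))
}
```

This also shows why the single 0.6 threshold is enough of a gate: a perfect success rate on few samples cannot clear it until the sample size grows.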
### Fetcher Selection (Single Dynamic Query)

**No hardcoded fallback cascade** - natural specificity emerges from the data:

```go
func FindBestFetcher(url string) (string, float64, error) {
	// Extract ALL heuristics dynamically (no pre-declaration needed)
	heuristics := ExtractFromURL(url)

	// Single query with ALL heuristics the URL has (minSampleSize = 5)
	result, err := queryWithHeuristics(heuristics, 5)
	if err != nil {
		return "", 0.0, err
	}
	if result != nil && result.Confidence > 0.6 {
		return result.Fetcher, result.Confidence, nil
	}

	// No learned data → return empty; the caller falls back to cheap HTTP
	return "", 0.0, nil
}
```

**How natural specificity works:**

- URL matches 5 heuristics + 20 past attempts = high sample size, high confidence (very specific match)
- URL matches 2 heuristics + 3 past attempts = low sample size, low confidence (not enough data)
- The confidence formula already penalizes low samples: `confidence = success_rate × min(1, sample_size/10)`
- A single threshold (0.6) is the ONLY gate - no need to enumerate heuristic types or fallback levels

### Ban Risk Calculation

```go
func (s *Store) GetDomainBanRate(domain string, days int) float64 {
	// Find all attempts for the domain in the last N days
	heuristics, _ := s.app.FindRecordsByFilter("attempt_heuristics",
		"heuristic_type = 'domain' && heuristic_value = {:domain}",
		"", 0, 0, map[string]any{"domain": domain})

	cutoff := time.Now().AddDate(0, 0, -days)
	bannedCount, totalCount := 0, 0
	for _, h := range heuristics {
		attempt, err := s.app.FindRecordById("fetcher_attempts", h.GetString("attempt"))
		if err != nil {
			continue
		}
		if attempt.GetDateTime("attempted_at").Time().After(cutoff) {
			totalCount++
			if attempt.GetBool("is_banned") {
				bannedCount++
			}
		}
	}
	if totalCount == 0 {
		return 0.0 // No data: avoid division by zero
	}
	return float64(bannedCount) / float64(totalCount)
}
```

### Pause Duration (Error-Specific)

```go
func CalculatePauseDuration(riskLevel int, errorType string) time.Duration {
	var baseDelay time.Duration
	// Longer pause for captcha/403 (higher IP ban risk)
	if errorType == "blocked_captcha" || errorType == "blocked_403" {
		baseDelay = 10 * time.Minute
	} else {
		baseDelay = 5 * time.Minute
	}
	return time.Duration(float64(baseDelay) * math.Pow(2, float64(riskLevel)))
}
```

**Sequences**:

- Captcha/403: 10min → 20min → 40min → 80min → 160min → 320min
- Other errors: 5min → 10min → 20min → 40min → 80min → 160min

## Integration Points

### Modified `phase_route_fetcher.go` (Parallel Implementation)

```go
func RegisterRouteFetcherPhase(orch *Orchestrator, execs *ExecutorRegistry, app core.App) {
	fetcherStore := fetcherpb.New(app)
	attemptStore := attemptpb.New(app)
	selector := selection.New(attemptStore)

	// Global config (or read from env/settings)
	learningModeEnabled := os.Getenv("LEARNING_MODE_ENABLED") == "true"

	execs.Register("route_fetcher", func(record Record) *ExecutorResult {
		l := record.(*link.Link)

		// Check if paused
		if l.PausedUntil != nil && time.Now().Before(*l.PausedUntil) {
			return &ExecutorResult{Error: fmt.Errorf("paused until %v", l.PausedUntil)}
		}

		domain := extractDomain(l.InitialURL)

		// FEATURE FLAG: choose between legacy and learning mode
		if !learningModeEnabled {
			// LEGACY MODE: use the hard-coded router only
			if route := fetcherStore.GetByDomain(domain); route != nil {
				return &ExecutorResult{Data: map[string]interface{}{
					"fetcher": route.Fetcher,
					"source":  "hard_coded",
				}}
			}
			return &ExecutorResult{Error: fmt.Errorf("no route for domain: %s", domain)}
		}

		// LEARNING MODE: use heuristic-based learning

		// PRIORITY 1: heuristic-based learning
		fetcher, confidence, err := selector.FindBestFetcher(l.InitialURL)
		if err == nil && confidence > 0.6 {
			return &ExecutorResult{Data: map[string]interface{}{
				"fetcher":    fetcher,
				"source":     "learned_heuristic",
				"confidence": confidence,
			}}
		}

		// PRIORITY 2: cheap HTTP classification
		fetcher, _ = selector.CheapHTTPClassification(l.InitialURL)
		return &ExecutorResult{Data: map[string]interface{}{
			"fetcher": fetcher,
			"source":  "heuristic_classification",
		}}
	})
}
```

### Modified `phase_fetch.go`

```go
func makeFetchExecutor(store *Store, attemptStore *AttemptStore, validator *Validator) ExecutorFunc {
	return func(record Record) *ExecutorResult {
		l := record.(*link.Link)
		fetcher := l.PhaseResults["fetcher"].(string)

		// Extract pre-fetch heuristics
		preHeuristics := heuristic.ExtractFromURL(l.InitialURL)

		startTime := time.Now()

		// Execute fetch
		html, headers, statusCode, err := doFetch(l.InitialURL, fetcher)
		duration := time.Since(startTime)

		// Validate response
		validationResult := validator.Validate(html, headers, statusCode)

		// Extract post-fetch heuristics
		postHeuristics := validator.ExtractPostFetchHeuristics(html, headers, statusCode)
		allHeuristics := append(preHeuristics, postHeuristics...)

		var success bool
		var errorType string
		var isBanned bool

		if err != nil {
			// CASE 1: Network/timeout error (RETRY)
			success = false
			errorType = classifyError(err) // "timeout", "network_error"

			// RETRY ONCE on transient errors
			if (errorType == "timeout" || errorType == "network_error") && l.AttemptCount == 0 {
				html2, headers2, statusCode2, err2 := doFetch(l.InitialURL, fetcher)
				attemptStore.Log(l.ID, l.InitialURL, fetcher, err2 == nil, errorType,
					statusCode2, headers2, time.Since(startTime), false, allHeuristics)
				if err2 == nil {
					success = true
					html = html2
					headers = headers2
					statusCode = statusCode2
				}
			}
		} else if validationResult.IsBlocked {
			// CASE 2: Bot block detected (PAUSE with LONGER backoff)
			success = false
			errorType = validationResult.BlockType
			isBanned = true

			// PAUSE LINK (longer backoff for captcha)
			l.RiskLevel++
			pauseDuration := CalculatePauseDuration(l.RiskLevel, errorType)
			pauseUntil := time.Now().Add(pauseDuration)
			l.PausedUntil = &pauseUntil
		} else if validationResult.IsEmpty && !validationResult.HasSPA {
			// CASE 3: Empty content (wrong tool)
			success = false
			errorType = "wrong_tool"

			// RETRY ONCE with playwright
			if l.AttemptCount == 0 {
				html2, headers2, statusCode2, err2 := doFetch(l.InitialURL, "playwright")
				attemptStore.Log(l.ID, l.InitialURL, "playwright", err2 == nil, "",
					statusCode2, headers2, time.Since(startTime), false, allHeuristics)
				if err2 == nil {
					success = true
					html = html2
					headers = headers2
					statusCode = statusCode2
				}
			}
		} else {
			// CASE 4: Success
			success = true
		}

		// Log attempt
		l.AttemptCount++
		l.LastAttemptAt = &startTime
		attemptStore.Log(l.ID, l.InitialURL, fetcher, success, errorType,
			statusCode, headers, duration, isBanned, allHeuristics)

		if !success {
			return &ExecutorResult{Error: fmt.Errorf("fetch failed: %s", errorType)}
		}

		// Save file
		fileID := store.Create(l.ID, []byte(html), "page.html")
		return &ExecutorResult{Data: map[string]interface{}{"file_ids": []string{fileID}}}
	}
}
```

## Validation Layer

### Block Detection

```go
func DetectCaptcha(html string) bool {
	keywords := []string{
		"Cloudflare", "hCaptcha", "reCAPTCHA", "g-recaptcha",
		"Just a moment", "cf-browser-verification", "grecaptcha",
	}
	for _, kw := range keywords {
		if strings.Contains(html, kw) {
			return true
		}
	}
	return false
}

func Detect403(statusCode int) bool {
	return statusCode == 403 || statusCode == 429
}
```

### SPA Detection

```go
func DetectSPA(html string) (bool, string) {
	// Each framework is identified by its typical empty mount point
	// plus a runtime global it injects
	frameworks := map[string][]string{
		"react": {`<div id="root">`, `__REACT_DEVTOOLS_`},
		"vue":   {`<div id="app">`, `__VUE__`},
		"next":  {`<div id="__next">`, `__NEXT_DATA__`},
	}
	for framework, patterns := range frameworks {
		for _, pattern := range patterns {
			if strings.Contains(html, pattern) {
				stripped := stripScriptsAndStyles(html)
				if len(stripped) < 200 {
					return true, framework
				}
			}
		}
	}
	return false, ""
}
```

### Cheap HTTP Classification

```go
func CheapHTTPClassification(url string) (string, *ClassificationResult) {
	html, headers, statusCode, err := httpClient.Get(url, 3*time.Second)
	if err != nil {
		return "simple_http", &ClassificationResult{Reason: "default"}
	}

	validationResult := validator.Validate(html, headers, statusCode)
	if validationResult.IsBlocked {
		return "playwright_stealth", &ClassificationResult{Reason: "block_detected"}
	}
	if validationResult.HasSPA || validationResult.IsEmpty {
		return "playwright", &ClassificationResult{Reason: "spa_or_empty"}
	}
	return "simple_http", &ClassificationResult{Reason: "static_content"}
}
```

## Heuristic Types (Extensible)

**Design Philosophy**: Heuristic types are stored as TEXT, not an enum, allowing dynamic addition without schema changes.
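As a sketch of how this open-ended extraction might look, here is a possible `ExtractFromURL` using only the standard library (the exact `Heuristic` fields and extractor behavior are assumptions consistent with the schema above):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// Heuristic mirrors the open-ended (type, value) pairs
// stored in the attempt_heuristics collection.
type Heuristic struct {
	Type  string
	Value string
}

// ExtractFromURL derives pre-fetch heuristics from the URL alone.
// New extractors can be appended here without any schema change.
func ExtractFromURL(raw string) []Heuristic {
	u, err := url.Parse(raw)
	if err != nil {
		return nil
	}
	hs := []Heuristic{{Type: "domain", Value: u.Hostname()}}
	if ext := path.Ext(u.Path); ext != "" {
		hs = append(hs, Heuristic{Type: "suffix", Value: ext})
	}
	// contains_cdn / contains_static / contains_assets markers
	for _, marker := range []string{"cdn", "static", "assets"} {
		if strings.Contains(u.Path, "/"+marker+"/") {
			hs = append(hs, Heuristic{Type: "contains_" + marker, Value: "true"})
		}
	}
	return hs
}

func main() {
	for _, h := range ExtractFromURL("https://wikipedia.org/cdn/logo.jpg") {
		fmt.Printf("%s=%s\n", h.Type, h.Value)
	}
}
```

Because every heuristic is just a (type, value) pair, whatever this function emits is immediately usable in the conditional probability query with no enumeration step.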
### Initial Pre-Fetch Heuristics (Extracted from URL)

- `domain`: wikipedia.org, nytimes.com
- `suffix`: .pdf, .jpg, .mp4
- `contains_cdn`: true (if the URL contains `/cdn/`)
- `contains_static`: true (if the URL contains `/static/`)
- `contains_assets`: true (if the URL contains `/assets/`)

### Initial Post-Fetch Heuristics (Extracted from Response)

- `status_200`, `status_403`, `status_429`: true
- `has_captcha`: true/false (Cloudflare, hCaptcha detection)
- `has_spa`: true/false (React/Vue/Next signatures)
- `has_paywall`: true/false (subscription keywords)
- `server_nginx`, `server_cloudflare`: true (from the Server header)
- `empty_body`: true/false (< 200 chars after stripping scripts)
- `high_script_ratio`: true/false (scripts > 50% of HTML)

### Adding New Heuristics (Examples)

You can add new heuristic extractors without touching the database:

```go
// Example: video platform detection
if strings.Contains(url, "youtube.com") || strings.Contains(url, "vimeo.com") {
	heuristics = append(heuristics, Heuristic{
		Type:  "video_platform",
		Value: "true",
	})
}

// Example: path depth analysis
pathDepth := strings.Count(parsedURL.Path, "/")
if pathDepth > 5 {
	heuristics = append(heuristics, Heuristic{
		Type:  "deep_path",
		Value: "true",
	})
}
```

The attempt relation ensures these new heuristics are immediately queryable for conditional probability.

## Importance Score Calculation (Optional Analytics)

### Natural Frequency Weighting (Default)

By default, heuristics gain importance through query frequency:

- A domain appears in EVERY attempt for that domain → naturally high weight
- A suffix appears in fewer attempts → naturally lower weight
- This happens automatically via conditional probability queries (no explicit scoring)

### Explicit Importance Scoring (Advanced)

For deeper insight, run periodic analytics to calculate predictive power.

**Metric: Information Gain**

```sql
-- How much does knowing "has_captcha=true" improve our prediction?
-- Compare P(success) vs P(success | has_captcha=true)

-- Overall success rate (baseline)
SELECT AVG(CASE WHEN success THEN 1.0 ELSE 0.0 END) AS baseline_success_rate
FROM fetcher_attempts;

-- Success rate given a heuristic
SELECT
  ah.heuristic_type,
  ah.heuristic_value,
  AVG(CASE WHEN fa.success THEN 1.0 ELSE 0.0 END) AS conditional_success_rate,
  COUNT(*) AS sample_size,
  ABS(AVG(CASE WHEN fa.success THEN 1.0 ELSE 0.0 END) - :baseline) AS information_gain
FROM fetcher_attempts fa
JOIN attempt_heuristics ah ON fa.id = ah.attempt
GROUP BY ah.heuristic_type, ah.heuristic_value
HAVING COUNT(*) > 10
ORDER BY information_gain DESC;
```

**Update importance_score:**

```sql
-- Store information gain as importance_score
UPDATE attempt_heuristics
SET importance_score = (
  SELECT ABS(AVG(CASE WHEN fa.success THEN 1.0 ELSE 0.0 END) - :baseline)
  FROM fetcher_attempts fa
  WHERE fa.id = attempt_heuristics.attempt
)
WHERE heuristic_type = :type AND heuristic_value = :value;
```

**Use cases:**

- Identify which heuristics are most predictive (e.g., "has_captcha" likely has high importance)
- Debug why certain URLs are misclassified (check which heuristics dominated)
- Prune low-importance heuristics from queries to speed them up

**When to run**: weekly analytics job (not on the hot path).

## Implementation Steps

### Phase 1: Foundation

1. Create migration for the `fetcher_attempts` collection
2. Create migration for the `attempt_heuristics` collection
3. Create migration for the `links` schema updates
4. Update `internal/link/model.go` with new fields
5. Update `internal/link/pocketbase/store.go` to handle new fields
6. Run migrations, verify schema

### Phase 2: Heuristic Extraction

1. Create `internal/fetcher/heuristic/types.go` (Heuristic struct)
2. Implement `internal/fetcher/heuristic/extractor.go` (ExtractFromURL)
3. Create the `internal/fetcher/validator/` package (block/SPA detection)
4. Manual test: extract heuristics from sample URLs

### Phase 3: Attempt Logging

1. Create `internal/fetcher/attempt/pocketbase/store.go` (CRUD)
2. Implement `internal/fetcher/attempt/logger.go` (LogAttempt with heuristics)
3. Modify `phase_fetch.go` to log attempts (read-only)
4. Manual test: verify attempts are logged with heuristics

### Phase 4: Query Layer

1. Implement `internal/fetcher/attempt/pocketbase/query.go` (FindBestFetcher)
2. Create the seed data migration
3. Manual test: query with known heuristics

### Phase 5: Selection Layer

1. Create `internal/fetcher/selection/selector.go` (waterfall logic)
2. Implement `internal/fetcher/selection/fallback.go` (cheap HTTP)
3. Manual test: selection for various URLs

### Phase 6: Integration

1. Modify `phase_route_fetcher.go` (integrate waterfall)
2. Manual test: hard-coded routes work, learned routes kick in

### Phase 7: Retry and Pause

1. Add retry logic to `phase_fetch.go` (wrong-tool detection)
2. Add pause logic to `phase_fetch.go` (block detection)
3. Add pause check to `phase_route_fetcher.go`
4. Manual test: pause on block, retry on empty

### Phase 8: Monitoring

1. Create analytics queries (success rates, ban rates)
2. Monitor learning velocity
3. Adjust confidence thresholds

## Manual Testing Strategy

### Test 1: Hard-Coded Priority

1. Add `example.com` → `simple_http` to `fetcher_router`
2. Submit a link, verify `source: "hard_coded"`

### Test 2: Learned Pattern

1. Insert 5 successful `wikipedia.org` + `playwright` attempts
2. Submit a new wikipedia.org link
3. Verify `source: "learned_heuristic"`, `fetcher: "playwright"`

### Test 3: Suffix Pattern

1. Use the seed data for `.pdf`
2. Submit `unknown-domain.com/document.pdf`
3. Verify `simple_http` is selected

### Test 4: Block Detection

1. Mock a captcha response
2. Submit a link, verify `is_banned: true` and `paused_until` set
3. Resubmit immediately, verify the pause error

### Test 5: Wrong Tool Retry

1. Mock an empty body (no block)
2. Submit a link, verify `simple_http` is tried first
3. Verify the automatic retry with `playwright`
4. Verify 2 attempts are logged

### Test 6: Time Decay

1. Insert 5 old successful attempts (60 days ago)
2. Insert 2 recent failed attempts (today)
3. Verify recent failures outweigh old successes

## Key Design Decisions

### 1. Heuristic-Based vs Pattern-Based

**Choice**: Heuristic-based (no pre-computed `learned_patterns` table)

**Rationale**:

- No combinatorial explosion
- Natural frequency weighting (via query, not a fallback cascade)
- Flexible conditional queries
- Full transparency
- **User confirmed**: OK with query overhead
- **User confirmed**: no hardcoded heuristic enumeration - fully dynamic

**Trade-off**: Slower queries (acceptable) vs a simpler schema and zero maintenance

### 2. Explicit `is_banned` Field

**Choice**: Boolean flag instead of inferring from `error_type`

**Rationale**:

- Clear intent (ban = IP risk)
- Fast ban-rate queries
- Decouples detection from categorization

### 3. Time Decay (30-Day Half-Life)

**Choice**: `0.5^(days/30)` exponential decay

**Rationale**:

- The web changes fast
- Balances recency vs sample size
- Smooth influence reduction

### 4. Parallel Implementation (Feature Flag)

**Choice**: Run the learning system in parallel with legacy, switchable via env var

**Rationale**:

- Zero-risk deployment (toggle off if broken)
- A/B testing capability
- Clean separation of concerns
- The hard-coded router stays in legacy mode only

### 5. Retry on Multiple Error Types

**Choice**: Retry on timeout, network errors, and wrong tool

**Rationale**:

- **User confirmed**: can retry in more cases
- Timeout/network = transient (safe to retry)
- Wrong tool = empty content (safe to retry)
- Bounded cost (max 2 attempts per error type)
- Different pause durations per error type

### 6. Error-Specific Exponential Backoff

**Choice**: Different base delays for captcha (10 min) vs other errors (5 min)

**Rationale**:

- Captcha = high IP ban risk → longer pause
- Timeout/network = transient → shorter pause
- Protects against permanent bans
- Retry logic for transient errors

## Critical Files

### Migrations

- `migrations/{timestamp}_created_fetcher_attempts.go`
- `migrations/{timestamp}_created_attempt_heuristics.go`
- `migrations/{timestamp}_updated_links.go`
- `migrations/{timestamp}_seed_heuristic_priors.go`

### Integration Points

- `internal/link/pocketbase/phase_route_fetcher.go` - Waterfall selection
- `internal/fetcher/direct_access/phase_fetch.go` - Validation, retry, logging
- `internal/link/model.go` - Add pause/risk fields
- `internal/link/pocketbase/store.go` - Handle new fields

### New Packages

- `internal/fetcher/validator/` - Block/SPA detection
- `internal/fetcher/heuristic/` - URL heuristic extraction
- `internal/fetcher/attempt/` - Logging with heuristics
- `internal/fetcher/selection/` - Waterfall logic