# Learning System - Query & Selection

## Core Concept: Natural Specificity

Traditional pattern matching requires explicit priority:

```
if domain matches → use domain rule
else if suffix matches → use suffix rule
else if path matches → use path rule
else → default
```

**Problem**: What if both domain AND suffix match? Which wins? Combinatorial explosion.

**Our approach**: Query ALL matching heuristics, let sample size determine confidence.

```
URL: wikipedia.org/document.pdf

Heuristics extracted:
- domain = wikipedia.org
- suffix = .pdf

Query finds attempts matching EITHER:
- 50 attempts for wikipedia.org (40 success with playwright)
- 200 attempts for .pdf (180 success with direct_access)

Result: .pdf has higher sample size AND confidence → use direct_access
```

**Why this works**:
- More data = higher confidence (sample factor penalty)
- No explicit priority needed
- Specific patterns naturally emerge from frequency
- Single threshold (0.6) is the only gate

---

## FindBestFetcher Algorithm

Find the best fetcher given any combination of heuristics, with time decay:

```go
func (s *Store) FindBestFetcher(heuristics []Heuristic, minSampleSize int) (*FetcherScore, error) {
	// 1. Build OR filter matching ANY heuristic (fully dynamic).
	// No hardcoded heuristic types - works with any type in the DB.
	filter := ""
	params := make(map[string]any)
	for i, h := range heuristics {
		if i > 0 {
			filter += " || "
		}
		filter += fmt.Sprintf("(heuristic_type = {:type%d} && heuristic_value = {:value%d})", i, i)
		params[fmt.Sprintf("type%d", i)] = h.Type
		params[fmt.Sprintf("value%d", i)] = h.Value
	}

	// 2. Find matching heuristic records
	heuristicRecords, err := s.app.FindRecordsByFilter("attempt_heuristics", filter, "", 0, 0, params)
	if err != nil || len(heuristicRecords) == 0 {
		return nil, nil // No learned data
	}

	// 3. Extract unique attempt IDs
	attemptIDs := make(map[string]bool)
	for _, hr := range heuristicRecords {
		attemptIDs[hr.GetString("attempt")] = true
	}

	// 4. Fetch all attempts, apply time decay
	scores := make(map[string]*FetcherScore)
	now := time.Now()
	for attemptID := range attemptIDs {
		attempt, err := s.app.FindRecordById("fetcher_attempts", attemptID)
		if err != nil {
			continue // Skip dangling references
		}
		fetcher := attempt.GetString("fetcher")
		success := attempt.GetBool("success")
		attemptedAt := attempt.GetDateTime("attempted_at").Time()

		// Time decay: 0.5^(days/30) - recent data matters more
		daysSince := now.Sub(attemptedAt).Hours() / 24
		weight := math.Pow(0.5, daysSince/30)

		if scores[fetcher] == nil {
			scores[fetcher] = &FetcherScore{Fetcher: fetcher}
		}
		scores[fetcher].SampleSize++
		if success {
			scores[fetcher].WeightedSuccesses += weight
		}
	}

	// 5. Calculate confidence with sample size penalty
	var best *FetcherScore
	for _, score := range scores {
		if score.SampleSize < minSampleSize {
			continue // Skip if not enough data
		}
		score.SuccessRate = score.WeightedSuccesses / float64(score.SampleSize)

		// Confidence penalty for low sample size:
		// 5 samples → 0.5 factor, 10+ samples → 1.0 factor
		sampleFactor := math.Min(1.0, float64(score.SampleSize)/10.0)
		score.Confidence = score.SuccessRate * sampleFactor

		if best == nil || score.Confidence > best.Confidence {
			best = score
		}
	}
	return best, nil
}
```

**Key insight**: This query works with ANY heuristic type that exists in the database. No enumeration needed. Add a `video_platform` heuristic tomorrow, and the query finds it automatically.
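The heuristic-extraction step that feeds `FindBestFetcher` can be sketched in a few lines. This is a hypothetical standalone version (the function name `extractFromURL` and the `Heuristic` field names are illustrative, not taken from the codebase): it emits only the dimensions a URL actually has, which is what keeps the OR filter dynamic.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// Heuristic is a type/value pair, mirroring the store's
// heuristic_type / heuristic_value columns (names assumed).
type Heuristic struct {
	Type  string
	Value string
}

// extractFromURL pulls out whatever heuristics the URL offers:
// always a domain, and a suffix only when the path has an extension.
func extractFromURL(raw string) []Heuristic {
	u, err := url.Parse(raw)
	if err != nil {
		return nil
	}
	hs := []Heuristic{{Type: "domain", Value: u.Hostname()}}
	if ext := strings.ToLower(path.Ext(u.Path)); ext != "" {
		hs = append(hs, Heuristic{Type: "suffix", Value: ext})
	}
	return hs
}

func main() {
	for _, h := range extractFromURL("https://wikipedia.org/document.pdf") {
		fmt.Printf("%s = %s\n", h.Type, h.Value)
	}
	// → domain = wikipedia.org
	// → suffix = .pdf
}
```

A plain HTML page with no extension would yield only the domain heuristic, so its query matches one dimension instead of two.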
---

## Time Decay

**Why decay?** Web changes fast:
- Sites add Cloudflare protection
- SPAs migrate to SSR
- Rate limits change
- Yesterday's success may be today's failure

**Formula**: `weight = 0.5^(days/30)`

| Age | Weight | Interpretation |
|-----|--------|----------------|
| Today | 1.0 | Full influence |
| 15 days | 0.7 | Still very relevant |
| 30 days | 0.5 | Half influence |
| 60 days | 0.25 | Minor influence |
| 90 days | 0.125 | Mostly ignored |

**Why a 30-day half-life?**
- Fast enough to adapt to changes
- Slow enough to accumulate useful data
- Smooth decay (no sudden cliff)

---

## Confidence Calculation

```
confidence = success_rate × min(1, sample_size/10)
```

**Two factors**:
1. **Success rate**: How often does this fetcher work?
2. **Sample factor**: Do we have enough data to trust this?

**Examples**:

| Samples | Successes | Success Rate | Sample Factor | Confidence |
|---------|-----------|--------------|---------------|------------|
| 5 | 5 | 100% | 0.5 | **0.50** |
| 10 | 8 | 80% | 1.0 | **0.80** |
| 20 | 14 | 70% | 1.0 | **0.70** |
| 3 | 3 | 100% | 0.3 | **0.30** (rejected) |

**Threshold**: Only use learned data if confidence > 0.6

**Why this formula?**
- Prevents overconfidence from small samples
- 100% success with 3 samples ≠ reliable
- 70% success with 50 samples = reliable
- Rewards both accuracy AND evidence

---

## Fetcher Selection Flow

```go
func FindBestFetcher(url string) (string, float64, error) {
	// Extract ALL heuristics dynamically
	heuristics := ExtractFromURL(url) // domain, suffix, path patterns

	// Single query with ALL heuristics (minimum 5 samples)
	result := queryWithHeuristics(heuristics, 5)
	if result != nil && result.Confidence > 0.6 {
		return result.Fetcher, result.Confidence, nil
	}

	// No learned data → caller uses cheap HTTP classification
	return "", 0.0, nil
}
```

**No hardcoded fallback cascade** - natural specificity emerges from data.
---

## Probe & Reuse Pattern

### The Problem

Without optimization, learning mode makes 2 HTTP requests:
1. **Cheap probe** (3s GET) → analyze HTML → decide fetcher
2. **Full fetch** → execute with selected fetcher → save

If fetcher = `direct_access`, we made the **same request twice**.

### The Solution

Capture the probe response body and reuse it when applicable:

```go
type ClassificationResult struct {
	Fetcher      string
	Reason       string
	ProbeBody    []byte // Captured for reuse
	ProbeStatus  int
	ProbeHeaders http.Header
}

func CheapHTTPClassification(url string) (*ClassificationResult, error) {
	resp, err := httpClient.Get(url, 3*time.Second)
	if err != nil {
		return &ClassificationResult{Fetcher: "direct_access", Reason: "probe_failed"}, nil
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	result := &ClassificationResult{
		ProbeBody:    body,
		ProbeStatus:  resp.StatusCode,
		ProbeHeaders: resp.Header,
	}

	validationResult := validator.Validate(string(body), resp.Header, resp.StatusCode)
	if validationResult.IsBlocked {
		result.Fetcher = "playwright_stealth"
		result.ProbeBody = nil // Don't reuse - blocked response useless
	} else if validationResult.HasSPA || validationResult.IsEmpty {
		result.Fetcher = "playwright"
		result.ProbeBody = nil // Don't reuse - needs JS execution
	} else if resp.StatusCode == 200 && len(body) > 500 {
		result.Fetcher = "direct_access"
		// ProbeBody STAYS POPULATED - reuse in fetch phase
	}
	return result, nil
}
```

### Reuse in phase_fetch.go

```go
func fetch(...) {
	// Check if we already have valid content from the probe
	if probeBody, ok := l.GetPhaseResults()["probe_body"].([]byte); ok && len(probeBody) > 0 {
		// Skip fetch - use the probe result directly!
		return saveToStore(probeBody, l.GetPhaseResults()["probe_status"].(int))
	}

	// Otherwise do a full fetch
	resp, err := http.Get(l.InitialURL)
	// ...
}
```

### When Reuse Happens

| Classification | Fetcher | Reuse Probe? | Why |
|----------------|---------|--------------|-----|
| Static HTML | direct_access | **Yes** | Same result, skip request |
| Captcha detected | playwright_stealth | No | Blocked response useless |
| SPA detected | playwright | No | Needs JS execution |
| Probe failed | direct_access | No | No body captured |

---

## Ban Risk Calculation

Track IP ban risk per domain:

```go
func GetDomainBanRate(domain string, days int) float64 {
	heuristics := FindByFilter(
		"heuristic_type = 'domain' && heuristic_value = {:domain}",
		map[string]any{"domain": domain},
	)
	bannedCount, totalCount := 0, 0
	cutoff := time.Now().AddDate(0, 0, -days)
	for _, h := range heuristics {
		attempt := FindRecordById("fetcher_attempts", h.Attempt)
		if attempt.AttemptedAt.After(cutoff) {
			totalCount++
			if attempt.IsBanned {
				bannedCount++
			}
		}
	}
	if totalCount == 0 {
		return 0 // No recent data → no measurable risk
	}
	return float64(bannedCount) / float64(totalCount)
}
```

**Why an explicit `is_banned` field?**
- Clear intent: ban = IP risk requiring backoff
- Fast queries (boolean filter vs error_type parsing)
- Decouples detection from categorization

---

## Pause Duration (Error-Specific)

```go
func CalculatePauseDuration(riskLevel int, errorType string) time.Duration {
	var baseDelay time.Duration

	// Longer pause for captcha (higher IP ban risk)
	if errorType == "blocked_captcha" || errorType == "blocked_403" {
		baseDelay = 10 * time.Minute // IP ban risk = cautious
	} else {
		baseDelay = 5 * time.Minute // Transient errors = shorter
	}
	return time.Duration(float64(baseDelay) * math.Pow(2, float64(riskLevel)))
}
```

**Backoff sequences**:

| Error Type | Base | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 |
|------------|------|---------|---------|---------|---------|---------|
| Captcha/403 | 10min | 10min | 20min | 40min | 80min | 160min |
| Other | 5min | 5min | 10min | 20min | 40min | 80min |

**Why different bases?**
- Captcha = site actively blocking us → high IP ban risk → longer pause
- Timeout = transient network issue → low risk → shorter pause
- Protects against permanent IP bans