# Learning System - Validation & Heuristics

## Philosophy: Dynamic Heuristics

Traditional fetcher routing uses static rules:

```
youtube.com → youtube_downloader
nytimes.com → nytime_searcher
*.pdf → direct_access
```

**Problems**:
- Must anticipate every case
- Can't adapt to changes
- No learning from failures

**Our approach**: Extract observable features (heuristics) from URLs and responses, and let the system learn which ones matter.

**Key principle**: `heuristic_type` is TEXT, not an enum. This means:
- New heuristics can be added in code without touching the database
- Queries find ANY heuristic type that exists
- The system evolves without schema changes

---

## Heuristic Types

### Pre-Fetch (Extracted from URL)

These are known BEFORE making any HTTP request:

| Type | Example Values | What it detects |
|------|---------------|-----------------|
| `domain` | wikipedia.org, nytimes.com | Site-specific behavior |
| `suffix` | .pdf, .jpg, .mp4 | File type (often static) |
| `contains_cdn` | true | CDN paths (usually static) |
| `contains_static` | true | Static asset paths |
| `contains_assets` | true | Asset directories |
| `contains_api` | true | API endpoints |
| `path_depth` | 1, 5, 10 | Deep paths (may indicate dynamic content) |

### Post-Fetch (Extracted from Response)

These are discovered AFTER making an HTTP request:

| Type | Example Values | What it detects |
|------|---------------|-----------------|
| `status_200` | true | Successful response |
| `status_403` | true | Forbidden (possible block) |
| `status_429` | true | Rate limited |
| `has_captcha` | true | Cloudflare/hCaptcha challenge |
| `has_spa` | true | React/Vue/Next SPA |
| `empty_body` | true | <200 chars after stripping |
| `server_cloudflare` | true | Cloudflare-protected |
| `server_nginx` | true | Nginx server |
| `high_script_ratio` | true | Scripts >50% of HTML |
| `has_paywall` | true | Subscription keywords |

---

## Block Detection

### Captcha Detection

```go
import "strings"

// DetectCaptcha reports whether the HTML contains any known captcha or
// challenge-page marker. Matching is case-sensitive.
func DetectCaptcha(html string) bool {
	keywords := []string{
		"Cloudflare",
		"hCaptcha",
		"reCAPTCHA",
		"g-recaptcha",
		"Just a moment",
		"cf-browser-verification",
		"grecaptcha",
	}
	for _, kw := range keywords {
		if strings.Contains(html, kw) {
			return true
		}
	}
	return false
}
```

**Why these keywords?**
- Cloudflare's "Just a moment" challenge page
- hCaptcha and reCAPTCHA integration markers
- Browser verification divs

**When a captcha is detected**:
- Mark `is_banned = true`
- Set `error_type = "blocked_captcha"`
- Trigger exponential backoff (10min base)

### HTTP Status Detection

```go
// Detect403 reports whether the status code signals a block.
// Despite its name, it also returns true for 429.
func Detect403(statusCode int) bool {
	return statusCode == 403 || statusCode == 429
}
```

- **403 Forbidden**: The site is explicitly blocking the request
- **429 Too Many Requests**: Rate limited

Both indicate we should back off.

---

## SPA Detection

Single Page Applications render content via JavaScript. A simple HTTP GET returns an empty shell.

```go
func DetectSPA(html string) (bool, string) {
	frameworks := map[string][]string{
		"react": {`