# Architecture

Layered architecture for intelligent web content extraction with PocketBase.

## Layer Overview

```
API → Routing → Fetching → Extraction → Storage
```

---

## Layer 0: Routing & Decision Engine

**Purpose**: Decide whether to use wget or browser for a given URL

**Components**:
- **Domain Cache** → lookup table for known URL patterns
- **Cheap Fetcher** → single HTTP GET probe (3s timeout)
- **Signal Extractor** → parse headers + HTML for indicators
- **Heuristic Gatekeeper** → rule-based fast path (70-80% coverage)
- **LLM Arbiter** → contextual classification for ambiguous cases
- **Self-Correction** → validation + retry logic

**Location**: `pkg/routing/`

**Details**: See [layer0.md](layer0.md) for complete decision flow

---

## Layer 1: Fetching Strategies

**Purpose**: Execute chosen fetch strategy

**Components**:
- **Wget Fetcher** → simple HTTP client (`dbf_wget.go`)
- **Browser Fetcher** → Playwright/headless (`dbn_playwright.go`)
- **Archive Fetcher** → local cache (`dbf_localarchive.go`)
- **Provider Abstraction** → browserless, browserbase (`provider/`)

**Location**: `pkg/drivers/browsers/fetcher/` ✅

---

## Layer 2: Content Processing

**Purpose**: Extract and clean content from raw HTML

**Components**:
- **HTML Parser** → extract text, metadata, entities
- **Structure Analyzer** → detect article vs SPA vs paywall
- **Content Cleaner** → remove scripts, ads, boilerplate
- **Entity Extractor** → URLs, dates, authors

**Location**: `pkg/extraction/`

---

## Layer 3: API & Orchestration

**Purpose**: HTTP interface and job management

**Components**:
- **HTTP Handlers** → REST endpoints (`navigator_handlers.go`)
- **Session Management** → track extraction jobs (`navigator_sessions.go`)
- **PocketBase Integration** → hooks, middlewares, realtime
- **Job Queue** → async processing for bulk URLs

**Location**: `pkg/api/` ✅

---

## Layer 4: Storage & Persistence

**Purpose**: Data persistence via PocketBase

**Collections**:
- `sources` → URL metadata + fetch strategy
- `extractions` → parsed content results
- `domain_cache` → strategy lookup table
- `fetch_logs` → self-correction data

**Location**: PocketBase DB + `migrations/`

---

## Directory Structure

```
clio/
├── main.go                 # PocketBase bootstrap
├── migrations/             # DB schema ✅
├── pkg/
│   ├── routing/            # Layer 0: Decision engine
│   ├── drivers/browsers/   # Layer 1: Fetching ✅
│   ├── extraction/         # Layer 2: Content processing
│   ├── api/                # Layer 3: HTTP interface ✅
│   └── models/             # Shared types/interfaces
└── doc/                    # Documentation ✅
```

---

## Design Principles

1. **Separation of Concerns** → each layer has ONE job
2. **Dependency Flow** → unidirectional top-to-bottom
3. **Interface-Driven** → layers communicate via Go interfaces
4. **Self-Contained** → each package independently testable
5. **PocketBase Native** → leverage hooks, middlewares, subscriptions
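
To make the Layer 0 decision order concrete (domain cache first, then the heuristic gatekeeper, with ambiguous cases left for the LLM arbiter), here is a minimal sketch in Go. Everything in it is an assumption made for illustration: the `Strategy`, `Signals`, and `Router` types, the `Decide` method, and the numeric thresholds are hypothetical and are not taken from `pkg/routing/`. See [layer0.md](layer0.md) for the actual flow.

```go
package main

import (
	"fmt"
	"strings"
)

// Strategy names the fetch path chosen by Layer 0 (hypothetical type).
type Strategy string

const (
	StrategyWget    Strategy = "wget"
	StrategyBrowser Strategy = "browser"
	StrategyUnknown Strategy = "unknown" // ambiguous: defer to the LLM arbiter
)

// Signals is a hypothetical summary of what the cheap probe observed.
type Signals struct {
	ContentType string // Content-Type header of the probe response
	TextBytes   int    // length of visible text in the returned HTML
	ScriptTags  int    // number of <script> tags in the returned HTML
}

// Router is a hypothetical Layer 0 router. The map stands in for the
// domain_cache collection.
type Router struct {
	cache map[string]Strategy
}

func NewRouter() *Router { return &Router{cache: map[string]Strategy{}} }

// heuristic applies illustrative rules with arbitrary thresholds; the
// project's real rules may look quite different.
func heuristic(s Signals) Strategy {
	switch {
	case !strings.HasPrefix(s.ContentType, "text/html"):
		return StrategyWget // non-HTML payloads need no rendering
	case s.TextBytes >= 2000 && s.ScriptTags < 20:
		return StrategyWget // plenty of server-rendered text
	case s.TextBytes < 200 && s.ScriptTags > 0:
		return StrategyBrowser // near-empty shell with scripts, likely an SPA
	default:
		return StrategyUnknown
	}
}

// Decide returns the chosen strategy and the component that decided it.
func (r *Router) Decide(domain string, s Signals) (Strategy, string) {
	if st, ok := r.cache[domain]; ok {
		return st, "cache"
	}
	if st := heuristic(s); st != StrategyUnknown {
		r.cache[domain] = st // remember confident decisions per domain
		return st, "heuristic"
	}
	return StrategyUnknown, "arbiter"
}

func main() {
	r := NewRouter()
	article := Signals{ContentType: "text/html; charset=utf-8", TextBytes: 5000, ScriptTags: 3}
	fmt.Println(r.Decide("example.com", article))   // decided by the heuristic
	fmt.Println(r.Decide("example.com", Signals{})) // second call hits the cache
}
```

Only confident heuristic results are cached in this sketch, so an ambiguous domain is re-evaluated on its next visit rather than being pinned to a guess. Whether arbiter decisions should also be cached is a design choice the sketch leaves open.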