# Architecture

Layered architecture for intelligent web content extraction with PocketBase.

## Layer Overview

```
API → Routing → Fetching → Extraction → Storage
```

---

## Layer 0: Routing & Decision Engine

**Purpose**: Decide whether to fetch a given URL with a plain HTTP client (wget) or a headless browser

**Components**:
- **Domain Cache** → lookup table for known URL patterns
- **Cheap Fetcher** → single HTTP GET probe (3s timeout)
- **Signal Extractor** → parse headers + HTML for indicators
- **Heuristic Gatekeeper** → rule-based fast path (settles 70-80% of URLs without the LLM)
- **LLM Arbiter** → contextual classification for ambiguous cases
- **Self-Correction** → validation + retry logic

**Location**: `pkg/routing/`

**Details**: See [layer0.md](layer0.md) for complete decision flow
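A minimal sketch of how the components above might chain together: cache lookup first, then heuristics, then the arbiter for anything still undecided. All type names, fields, and heuristic rules here are illustrative assumptions rather than the contents of `pkg/routing/`; the cheap probe and self-correction steps are left out.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Strategy is the fetch method chosen for a URL (hypothetical type).
type Strategy string

const (
	StrategyWget    Strategy = "wget"
	StrategyBrowser Strategy = "browser"
	StrategyUnknown Strategy = "" // no confident decision yet
)

// Signals holds indicators taken from the cheap probe (assumed fields).
type Signals struct {
	ContentType string
	Body        string
}

// Arbiter stands in for the LLM Arbiter; any classifier can satisfy it.
type Arbiter interface {
	Classify(rawURL string, s Signals) Strategy
}

// Router chains domain cache -> heuristics -> arbiter.
type Router struct {
	cache   map[string]Strategy // keyed by hostname
	arbiter Arbiter             // may be nil
}

func NewRouter(a Arbiter) *Router {
	return &Router{cache: map[string]Strategy{}, arbiter: a}
}

// heuristic applies two toy rules and otherwise declines to decide.
func heuristic(s Signals) Strategy {
	if !strings.Contains(s.ContentType, "text/html") {
		return StrategyWget // non-HTML payloads need no rendering
	}
	if strings.Contains(strings.ToLower(s.Body), "enable javascript") {
		return StrategyBrowser // page announces that it needs JS
	}
	return StrategyUnknown
}

// Decide returns a strategy for rawURL and remembers it per hostname.
func (r *Router) Decide(rawURL string, s Signals) (Strategy, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return StrategyUnknown, fmt.Errorf("parse url: %w", err)
	}
	host := u.Hostname()
	if st, ok := r.cache[host]; ok {
		return st, nil // fast path: known domain
	}
	st := heuristic(s)
	if st == StrategyUnknown && r.arbiter != nil {
		st = r.arbiter.Classify(rawURL, s)
	}
	if st == StrategyUnknown {
		st = StrategyBrowser // assumption: fall back to the more capable fetcher
	}
	r.cache[host] = st
	return st, nil
}

func main() {
	r := NewRouter(nil)
	st, err := r.Decide("https://example.com/report.pdf", Signals{ContentType: "application/pdf"})
	fmt.Println(st, err)
}
```

Keying the cache by hostname is the simplest choice for a sketch; the `domain_cache` collection may well key on finer-grained URL patterns.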

---

## Layer 1: Fetching Strategies

**Purpose**: Execute chosen fetch strategy

**Components**:
- **Wget Fetcher** → simple HTTP client (`dbf_wget.go`)
- **Browser Fetcher** → Playwright/headless (`dbn_playwright.go`)
- **Archive Fetcher** → local cache (`dbf_localarchive.go`)
- **Provider Abstraction** → browserless, browserbase (`provider/`)

**Location**: `pkg/drivers/browsers/fetcher/` ✅

---

## Layer 2: Content Processing

**Purpose**: Extract and clean content from raw HTML

**Components**:
- **HTML Parser** → extract text, metadata, entities
- **Structure Analyzer** → detect article vs SPA vs paywall
- **Content Cleaner** → remove scripts, ads, boilerplate
- **Entity Extractor** → URLs, dates, authors

**Location**: `pkg/extraction/`
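As a rough illustration of the Content Cleaner step, the sketch below drops `<script>` and `<style>` blocks, strips the remaining tags, and collapses whitespace. It uses only the standard library and regular expressions, which is a deliberate simplification: a real cleaner would more likely walk a parsed HTML tree. `CleanText` is a hypothetical name.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	// (?is): case-insensitive, and "." also matches newlines.
	scriptOrStyle = regexp.MustCompile(`(?is)<(script|style)\b.*?</(script|style)\s*>`)
	anyTag        = regexp.MustCompile(`<[^>]*>`)
	whitespace    = regexp.MustCompile(`\s+`)
)

// CleanText reduces raw HTML to plain text.
// Regex-based stripping is a crude stand-in for a proper HTML parser.
func CleanText(rawHTML string) string {
	s := scriptOrStyle.ReplaceAllString(rawHTML, " ") // drop code and CSS with their contents
	s = anyTag.ReplaceAllString(s, " ")               // drop remaining tags, keep their text
	s = html.UnescapeString(s)                        // decode entities such as &amp;
	s = whitespace.ReplaceAllString(s, " ")           // collapse runs of whitespace
	return strings.TrimSpace(s)
}

func main() {
	page := `<html><head><style>p{color:red}</style></head>
<body><script>track();</script><p>Hello &amp; welcome</p></body></html>`
	fmt.Println(CleanText(page))
}
```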

---

## Layer 3: API & Orchestration

**Purpose**: HTTP interface and job management

**Components**:
- **HTTP Handlers** → REST endpoints (`navigator_handlers.go`)
- **Session Management** → track extraction jobs (`navigator_sessions.go`)
- **PocketBase Integration** → hooks, middlewares, realtime
- **Job Queue** → async processing for bulk URLs

**Location**: `pkg/api/` ✅

---

## Layer 4: Storage & Persistence

**Purpose**: Data persistence via PocketBase

**Collections**:
- `sources` → URL metadata + fetch strategy
- `extractions` → parsed content results
- `domain_cache` → strategy lookup table
- `fetch_logs` → self-correction data

**Location**: PocketBase DB + `migrations/`
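To make the link between `fetch_logs` and `domain_cache` concrete, here is one possible self-correction rule: if the cheap strategy keeps failing for a domain, mark that domain for promotion to the browser. The record fields, function name, and thresholds are all assumptions made for illustration, not the actual collection schema.

```go
package main

import "fmt"

// FetchLog mirrors a hypothetical row in the fetch_logs collection.
type FetchLog struct {
	Domain   string
	Strategy string // "wget" or "browser"
	OK       bool
}

// DomainsToPromote returns the domains whose wget failure rate is at
// least threshold, counting only domains with minSamples or more attempts.
func DomainsToPromote(logs []FetchLog, minSamples int, threshold float64) map[string]bool {
	type tally struct{ total, failed int }
	counts := map[string]*tally{}
	for _, l := range logs {
		if l.Strategy != "wget" {
			continue // only the cheap strategy can be promoted
		}
		t := counts[l.Domain]
		if t == nil {
			t = &tally{}
			counts[l.Domain] = t
		}
		t.total++
		if !l.OK {
			t.failed++
		}
	}
	promote := map[string]bool{}
	for domain, t := range counts {
		if t.total >= minSamples && float64(t.failed)/float64(t.total) >= threshold {
			promote[domain] = true
		}
	}
	return promote
}

func main() {
	logs := []FetchLog{
		{Domain: "spa.example", Strategy: "wget", OK: false},
		{Domain: "spa.example", Strategy: "wget", OK: false},
		{Domain: "spa.example", Strategy: "wget", OK: true},
		{Domain: "blog.example", Strategy: "wget", OK: true},
	}
	fmt.Println(DomainsToPromote(logs, 3, 0.5))
}
```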

---

## Directory Structure

```
clio/
├── main.go                          # PocketBase bootstrap
├── migrations/                      # DB schema ✅
├── pkg/
│   ├── routing/                     # Layer 0: Decision engine
│   ├── drivers/browsers/            # Layer 1: Fetching ✅
│   ├── extraction/                  # Layer 2: Content processing
│   ├── api/                         # Layer 3: HTTP interface ✅
│   └── models/                      # Shared types/interfaces
└── doc/                             # Documentation ✅
```

---

## Design Principles

1. **Separation of Concerns** → each layer has ONE job
2. **Dependency Flow** → unidirectional, following the Layer Overview from API down to Storage
3. **Interface-Driven** → layers communicate via Go interfaces
4. **Self-Contained** → each package independently testable
5. **PocketBase Native** → leverage hooks, middlewares, subscriptions
