- Wikipedia / MediaWiki `onlyMainContent` fix: `onlyMainContent: true` now correctly extracts article text from Wikipedia pages (~49% size reduction)
- 3-tier noise pattern matching: substring, exact-token, and prefix matching to avoid false positives
- Structural element guard: noise handler never removes `<html>`, `<head>`, `<body>`, or `<main>` elements
- Re-clean after readability: readability output is re-cleaned to strip residual noise
- Wikipedia-aware readability: added `.mw-parser-output`, `#mw-content-text`, `#bodyContent` to scored selectors
- BYOK LLM extraction: per-request `llmApiKey`, `llmProvider`, `llmModel` fields
- JSON format validation: `formats: ["json"]` without `jsonSchema` now returns a 400 error
- Block detection skip: pages >50 KB skip interstitial/block detection
- Null byte URL rejection: URLs with `%00` or null bytes are rejected at validation
- Request timeout: default timeout bumped from 60s to 120s
- Dockerfile fix: corrected `cargo build` flags, added `config.docker.toml`
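
To illustrate how 3-tier noise matching combined with a structural element guard might behave, here is a minimal sketch. It is not the project's actual implementation: the pattern lists, constant names, and the functions `is_noise_attr` and `should_remove` are all hypothetical, chosen only to show why exact-token and prefix tiers avoid false positives that a plain substring check would produce.

```rust
// Hypothetical pattern tiers; the real pattern lists are not given in the changelog.
const SUBSTRING_PATTERNS: &[&str] = &["advert", "cookie-banner"];
const EXACT_TOKENS: &[&str] = &["ad", "nav", "footer"];
const PREFIX_PATTERNS: &[&str] = &["social-", "share-"];

// Structural elements the guard never removes (taken from the changelog entry).
const PROTECTED_TAGS: &[&str] = &["html", "head", "body", "main"];

/// Returns true when a class or id attribute value looks like noise.
fn is_noise_attr(attr: &str) -> bool {
    let lower = attr.to_ascii_lowercase();

    // Tier 1: substring match, reserved for long, unambiguous patterns.
    if SUBSTRING_PATTERNS.iter().any(|p| lower.contains(*p)) {
        return true;
    }

    // Tiers 2 and 3 work on whitespace-separated tokens, so a short pattern
    // such as "ad" does not fire on "header" or "download".
    let token_hit = lower.split_whitespace().any(|token| {
        EXACT_TOKENS.iter().any(|t| *t == token)
            || PREFIX_PATTERNS.iter().any(|p| token.starts_with(*p))
    });
    token_hit
}

/// Applies the structural guard before the noise check.
fn should_remove(tag: &str, class_or_id: &str) -> bool {
    let is_protected = PROTECTED_TAGS.iter().any(|t| t.eq_ignore_ascii_case(tag));
    !is_protected && is_noise_attr(class_or_id)
}
```

Under these assumptions, `should_remove("div", "ad sidebar")` is true, while `should_remove("body", "ad")` and `should_remove("div", "header")` are both false.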
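
The two validation entries above (JSON format validation and null byte URL rejection) can be sketched as a single request check. This is an illustrative sketch only: the `ScrapeRequest` struct, the `ValidationError` enum, and the `validate` function are hypothetical names, and the real request type is not shown in the changelog. Only the JSON case is documented as returning a 400; the response for a null byte URL is not specified here.

```rust
/// Hypothetical error type for the two checks described in the changelog.
#[derive(Debug, PartialEq)]
enum ValidationError {
    /// The URL contains a raw null byte or the percent-encoded form `%00`.
    NullByteInUrl,
    /// `formats` includes "json" but no `jsonSchema` was supplied
    /// (the changelog says this case surfaces as a 400 error).
    JsonFormatWithoutSchema,
}

/// Hypothetical, simplified view of a scrape request.
struct ScrapeRequest<'a> {
    url: &'a str,
    formats: Vec<&'a str>,
    json_schema: Option<&'a str>,
}

/// Rejects requests that would fail later in the pipeline.
fn validate(req: &ScrapeRequest<'_>) -> Result<(), ValidationError> {
    // Reject both the raw byte and its percent-encoded form.
    if req.url.contains('\0') || req.url.contains("%00") {
        return Err(ValidationError::NullByteInUrl);
    }

    // The JSON format is only meaningful when a schema is provided.
    let wants_json = req.formats.iter().any(|f| *f == "json");
    if wants_json && req.json_schema.is_none() {
        return Err(ValidationError::JsonFormatWithoutSchema);
    }

    Ok(())
}
```

Running both checks before any fetch keeps malformed requests from consuming a browser or network slot, which is presumably why the changelog places them "at validation".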