GitHub us/crw v0.0.8

3 months ago
  • Wikipedia / MediaWiki onlyMainContent fix — onlyMainContent: true now correctly extracts article text from Wikipedia pages (~49% size reduction)
  • 3-tier noise pattern matching — substring, exact-token, and prefix matching to avoid false positives
  • Structural element guard — noise handler never removes <html>, <head>, <body>, or <main> elements
  • Re-clean after readability — readability output is re-cleaned to strip residual noise
  • Wikipedia-aware readability — added .mw-parser-output, #mw-content-text, #bodyContent to scored selectors
  • BYOK LLM extraction — per-request llmApiKey, llmProvider, llmModel fields
  • JSON format validation — formats: ["json"] without jsonSchema now returns a 400 error
  • Block detection skip — pages >50 KB skip interstitial/block detection
  • Null byte URL rejection — URLs with %00 or null bytes rejected at validation
  • Request timeout — default timeout bumped from 60s to 120s
  • Dockerfile fix — corrected cargo build flags, added config.docker.toml
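
The 3-tier noise matching and the structural element guard listed above could look roughly like the following. This is a minimal sketch, not crw's actual code: the names MatchTier, NoisePattern, is_noise, may_remove, and the example patterns are all hypothetical, and it assumes patterns are compared against an element's lowercased class/id string.

```rust
/// How a noise pattern is compared against an element's class/id string.
/// (Hypothetical type; the release notes only name the three tiers.)
#[derive(Debug, Clone, Copy)]
enum MatchTier {
    /// Pattern may appear anywhere in the attribute string.
    Substring,
    /// Pattern must equal one whole whitespace-separated token.
    ExactToken,
    /// Some token must start with the pattern.
    Prefix,
}

struct NoisePattern {
    text: &'static str,
    tier: MatchTier,
}

/// Elements the noise handler must never remove, per the release notes.
const PROTECTED_TAGS: [&str; 4] = ["html", "head", "body", "main"];

/// Illustrative patterns only; the real pattern list is not in the notes.
fn example_patterns() -> Vec<NoisePattern> {
    vec![
        NoisePattern { text: "cookie-banner", tier: MatchTier::Substring },
        // Short patterns such as "ad" are exact-token only, so that
        // "header" or "breadcrumb" are not false positives.
        NoisePattern { text: "ad", tier: MatchTier::ExactToken },
        NoisePattern { text: "share-", tier: MatchTier::Prefix },
    ]
}

/// Returns true if the class/id string matches any pattern under its tier.
fn is_noise(attr: &str, patterns: &[NoisePattern]) -> bool {
    let attr = attr.to_ascii_lowercase();
    patterns.iter().any(|p| match p.tier {
        MatchTier::Substring => attr.contains(p.text),
        MatchTier::ExactToken => attr.split_whitespace().any(|t| t == p.text),
        MatchTier::Prefix => attr.split_whitespace().any(|t| t.starts_with(p.text)),
    })
}

/// Structural guard: protected tags are kept even if their attributes match.
fn may_remove(tag: &str, attr: &str, patterns: &[NoisePattern]) -> bool {
    let tag = tag.to_ascii_lowercase();
    !PROTECTED_TAGS.contains(&tag.as_str()) && is_noise(attr, patterns)
}
```

With these example patterns, a class of "header" is kept because "ad" only matches as a whole token, while "share-buttons" is flagged by the prefix tier; a <body> element is never removed regardless of its classes.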
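
The two request-validation changes above (JSON format without a schema, null bytes in URLs) can be illustrated with a small sketch. The ScrapeRequest struct, the ValidationError enum, and the validate and request helpers are hypothetical and are not taken from the crw codebase; only the formats and jsonSchema field names and the 400 response for the JSON case come from the release notes.

```rust
/// Hypothetical, simplified request shape. `json_schema` stands in for the
/// `jsonSchema` field; a real schema would be structured JSON, not a string.
struct ScrapeRequest {
    url: String,
    formats: Vec<String>,
    json_schema: Option<String>,
}

#[derive(Debug, PartialEq)]
enum ValidationError {
    /// URL contains a raw null byte or the percent-encoded form %00.
    NullByteInUrl,
    /// "json" was requested without a schema; the notes say this is a 400.
    JsonFormatWithoutSchema,
}

/// Checks a request before any fetching happens.
fn validate(req: &ScrapeRequest) -> Result<(), ValidationError> {
    if req.url.contains('\0') || req.url.contains("%00") {
        return Err(ValidationError::NullByteInUrl);
    }
    let wants_json = req.formats.iter().any(|f| f == "json");
    if wants_json && req.json_schema.is_none() {
        return Err(ValidationError::JsonFormatWithoutSchema);
    }
    Ok(())
}

/// Small constructor used only to keep the examples short (hypothetical).
fn request(url: &str, formats: &[&str], schema: Option<&str>) -> ScrapeRequest {
    ScrapeRequest {
        url: url.to_string(),
        formats: formats.iter().map(|f| f.to_string()).collect(),
        json_schema: schema.map(|s| s.to_string()),
    }
}
```

In this sketch a caller would map JsonFormatWithoutSchema to an HTTP 400 response; the status code used for null-byte rejection is not stated in the notes, so it is left unspecified here.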
