[3.3.0] - 2026-03-16
Theme: 10 new source types (17 total), EPUB unified integration, sync-config command, performance optimizations, 12 README translations, and 19 bug fixes. 117 files changed, +41,588 lines since v3.2.0.
Supported Source Types (17)
| # | Type | CLI Command | Config Type | Auto-Detection |
|---|---|---|---|---|
| 1 | Documentation (web) | `scrape` / `create <url>` | `documentation` | HTTP/HTTPS URLs |
| 2 | GitHub repository | `github` / `create owner/repo` | `github` | `owner/repo` or github.com URLs |
| 3 | PDF document | `pdf` / `create file.pdf` | `pdf` | `.pdf` extension |
| 4 | Word document | `word` / `create file.docx` | `word` | `.docx` extension |
| 5 | EPUB e-book | `epub` / `create file.epub` | `epub` | `.epub` extension |
| 6 | Video | `video` / `create <url/file>` | `video` | YouTube/Vimeo URLs, video extensions |
| 7 | Local codebase | `analyze` / `create ./path` | `local` | Directory paths |
| 8 | Jupyter Notebook | `jupyter` / `create file.ipynb` | `jupyter` | `.ipynb` extension |
| 9 | Local HTML | `html` / `create file.html` | `html` | `.html`/`.htm` extensions |
| 10 | OpenAPI/Swagger | `openapi` / `create spec.yaml` | `openapi` | `.yaml`/`.yml` with OpenAPI content |
| 11 | AsciiDoc | `asciidoc` / `create file.adoc` | `asciidoc` | `.adoc`/`.asciidoc` extensions |
| 12 | PowerPoint | `pptx` / `create file.pptx` | `pptx` | `.pptx` extension |
| 13 | RSS/Atom feed | `rss` / `create feed.rss` | `rss` | `.rss`/`.atom` extensions |
| 14 | Man pages | `manpage` / `create cmd.1` | `manpage` | `.1`–`.8`/`.man` extensions |
| 15 | Confluence wiki | `confluence` | `confluence` | API or export directory |
| 16 | Notion pages | `notion` | `notion` | API or export directory |
| 17 | Slack/Discord chat | `chat` | `chat` | Export directory or API |
Added
10 New Skill Source Types (17 total)
Skill Seekers now supports 17 source types — up from 7. Every new type is fully integrated into the CLI (skill-seekers <type>), create command auto-detection, unified multi-source configs, config validation, the MCP server, and the skill builder.
- Jupyter Notebook: `skill-seekers jupyter --notebook file.ipynb` or `skill-seekers create file.ipynb`
  - Extracts markdown cells, code cells with outputs, kernel metadata, imports, and language detection
  - Handles single files and directories of notebooks; filters `.ipynb_checkpoints`
  - Optional dependency: `pip install "skill-seekers[jupyter]"` (nbformat)
  - Entry point: `skill-seekers-jupyter`
- Local HTML: `skill-seekers html --html-path file.html` or `skill-seekers create file.html`
  - Parses HTML using BeautifulSoup with smart main content detection (`<article>`, `<main>`, `.content`, largest div)
  - Extracts headings, code blocks, tables (to markdown), images, links; converts inline HTML to markdown
  - Handles single files and directories; supports `.html`, `.htm`, `.xhtml` extensions
  - No extra dependencies (BeautifulSoup is a core dep)
- OpenAPI/Swagger: `skill-seekers openapi --spec spec.yaml` or `skill-seekers create spec.yaml`
  - Parses OpenAPI 3.0/3.1 and Swagger 2.0 specs from YAML or JSON (local files or URLs via `--spec-url`)
  - Extracts endpoints, parameters, request/response schemas, security schemes, tags
  - Resolves `$ref` references with circular reference protection; handles `allOf`/`oneOf`/`anyOf`
  - Groups endpoints by tags; generates comprehensive API reference markdown
  - Source detection sniffs YAML file content for `openapi:` or `swagger:` keys (avoids false positives on non-API YAML files)
  - Optional dependency: `pip install "skill-seekers[openapi]"` (pyyaml, already a core dep; guard added for safety)
- AsciiDoc: `skill-seekers asciidoc --asciidoc-path file.adoc` or `skill-seekers create file.adoc`
  - Regex-based parser (no external library required) with optional `asciidoc` library support
  - Extracts headings (= through =====), `[source,lang]` code blocks, `|===` tables, admonitions (NOTE/TIP/WARNING/IMPORTANT/CAUTION), and `include::` directives
  - Converts AsciiDoc formatting to markdown; handles single files and directories
  - Optional dependency: `pip install "skill-seekers[asciidoc]"` (asciidoc library for advanced rendering)
- PowerPoint (.pptx): `skill-seekers pptx --pptx file.pptx` or `skill-seekers create file.pptx`
  - Extracts slide text, speaker notes, tables, images (with alt text), and grouped shapes
  - Detects code blocks by monospace font analysis (30+ font families)
  - Groups slides into sections by layout type; handles single files and directories
  - Optional dependency: `pip install "skill-seekers[pptx]"` (python-pptx)
- RSS/Atom Feeds: `skill-seekers rss --feed-url <url>` / `--feed-path file.rss` or `skill-seekers create feed.rss`
  - Parses RSS 2.0, RSS 1.0, and Atom feeds via feedparser
  - Optionally follows article links (`--follow-links`, default on) to scrape full page content using BeautifulSoup
  - Extracts article titles, summaries, authors, dates, categories; configurable `--max-articles` (default 50)
  - Source detection matches `.rss` and `.atom` extensions (`.xml` excluded to avoid false positives)
  - Optional dependency: `pip install "skill-seekers[rss]"` (feedparser)
- Man Pages: `skill-seekers manpage --man-names git,curl` / `--man-path dir/` or `skill-seekers create git.1`
  - Extracts man pages by running the `man` command via subprocess or reading `.1`–`.8`/`.man` files directly
  - Handles gzip/bzip2/xz compressed man files; strips troff/groff formatting (backspace overstriking, macros, font escapes)
  - Parses structured sections (NAME, SYNOPSIS, DESCRIPTION, OPTIONS, EXAMPLES, SEE ALSO)
  - Source detection uses basename heuristic to avoid false positives on log rotation files (e.g., `access.log.1`)
  - No external dependencies (stdlib only)
- Confluence: `skill-seekers confluence --base-url <url> --space-key <key>` or `--export-path dir/`
  - API mode: fetches pages from Confluence REST API with pagination (`atlassian-python-api`)
  - Export mode: parses Confluence HTML/XML export directories
  - Extracts page content, code/panel/info/warning macros, page hierarchy, tables
  - Optional dependency: `pip install "skill-seekers[confluence]"` (atlassian-python-api)
- Notion: `skill-seekers notion --database-id <id>` / `--page-id <id>` or `--export-path dir/`
  - API mode: fetches pages via Notion API with support for 20+ block types (paragraph, heading, code, callout, toggle, table, etc.)
  - Export mode: parses Notion Markdown/CSV export directories
  - Extracts rich text with annotations (bold, italic, code, links), 16+ property types for database entries
  - Optional dependency: `pip install "skill-seekers[notion]"` (notion-client)
- Slack/Discord Chat: `skill-seekers chat --export-path dir/` or `--token <token> --channel <channel>`
  - Slack: parses workspace JSON exports or fetches via Slack Web API (`slack_sdk`)
  - Discord: parses DiscordChatExporter JSON or fetches via Discord HTTP API
  - Extracts messages, code snippets (fenced blocks), shared URLs, threads, reactions, attachments
  - Generates per-channel summaries and topic categorization
  - Optional dependency: `pip install "skill-seekers[chat]"` (slack-sdk)
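The `$ref` resolution with circular reference protection noted for the OpenAPI scraper can be illustrated with a small sketch. This is not the scraper's code: `resolve_refs` and the `$circular_ref` marker are hypothetical names, and the sketch only handles local `#/...` pointers, on the assumption that a cycle should be replaced by a marker rather than expanded.

```python
from typing import Any


def resolve_refs(node: Any, root: dict, _active: frozenset = frozenset()) -> Any:
    """Return a copy of `node` with local '#/...' $ref pointers inlined.

    Hypothetical helper: `_active` holds the pointers currently being
    expanded, so a schema that refers back to itself is cut off.
    """
    if isinstance(node, list):
        return [resolve_refs(item, root, _active) for item in node]
    if not isinstance(node, dict):
        return node

    ref = node.get("$ref")
    if isinstance(ref, str) and ref.startswith("#/"):
        if ref in _active:
            # Already expanding this pointer further up the call stack.
            return {"$circular_ref": ref}
        target: Any = root
        for part in ref[2:].split("/"):
            # JSON Pointer escaping: "~1" means "/", "~0" means "~".
            target = target[part.replace("~1", "/").replace("~0", "~")]
        return resolve_refs(target, root, _active | {ref})

    return {key: resolve_refs(value, root, _active) for key, value in node.items()}


spec = {
    "components": {
        "schemas": {
            "User": {
                "type": "object",
                "properties": {"friend": {"$ref": "#/components/schemas/User"}},
            },
            "Team": {
                "type": "object",
                "properties": {"owner": {"$ref": "#/components/schemas/User"}},
            },
        }
    }
}
team = resolve_refs(spec["components"]["schemas"]["Team"], spec)
```

Here `Team.owner` is expanded into the `User` schema once, and the self-reference inside `User.friend` is replaced by the marker instead of recursing forever.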
EPUB Unified Pipeline Integration
- EPUB (.epub) input support via `skill-seekers create book.epub` or `skill-seekers epub --epub book.epub`
  - Extracts chapters, metadata (Dublin Core), code blocks, images, and tables from EPUB 2 and EPUB 3 files
  - DRM detection with clear error messages (Adobe ADEPT, Apple FairPlay, Readium LCP)
  - Font obfuscation correctly identified as non-DRM
  - EPUB 3 TOC bug workaround (`ignore_ncx` option)
  - `--help-epub` flag for EPUB-specific help
  - Optional dependency: `pip install "skill-seekers[epub]"` (ebooklib)
  - 107 tests across 14 test classes
- EPUB added to unified scraper: `_scrape_epub()` method, `scraped_data["epub"]`, config validation (`_validate_epub_source`), and dry-run display. Previously EPUB worked standalone but was missing from multi-source configs.
Unified Skill Builder — Generic Merge System
- `_generic_merge()`: Priority-based section merge for any combination of source types not covered by existing pairwise synthesis (docs+github, docs+pdf, etc.). Produces YAML frontmatter + source-attributed sections.
- `_append_extra_sources()`: Appends additional source type content (e.g., Jupyter + PPTX) to pairwise-synthesized SKILL.md.
- `_generate_generic_references()`: Generates `references/<type>/index.md` for any source type, with ID resolution fallback chain.
- `_SOURCE_LABELS` dict: Human-readable labels for all 17 source types used in merge attribution.
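To make "priority-based section merge" concrete, here is a rough sketch of one way such a merge could work. Everything in it is an assumption for illustration: the `PRIORITY` order, the `SOURCE_LABELS` entries, the `generic_merge` signature, and the rule that the highest-priority source wins when two sources provide a section with the same title. The real `_generic_merge()` may behave differently.

```python
# Hypothetical labels and priority order, for illustration only.
SOURCE_LABELS = {
    "documentation": "Documentation",
    "jupyter": "Jupyter Notebook",
    "pptx": "PowerPoint",
}
PRIORITY = ["documentation", "jupyter", "pptx"]


def generic_merge(name: str, scraped: dict[str, dict[str, str]]) -> str:
    """Merge {source_type: {section_title: body}} into one markdown document."""
    # Unknown source types sort after every known one.
    ordered = sorted(
        scraped,
        key=lambda t: PRIORITY.index(t) if t in PRIORITY else len(PRIORITY),
    )

    # The first (highest-priority) source to provide a title keeps it.
    chosen: dict[str, tuple[str, str]] = {}
    for source_type in ordered:
        for title, body in scraped[source_type].items():
            chosen.setdefault(title, (source_type, body))

    lines = ["---", f"name: {name}", f"sources: [{', '.join(ordered)}]", "---", ""]
    for title, (source_type, body) in chosen.items():
        label = SOURCE_LABELS.get(source_type, source_type)
        lines += [f"## {title}", f"*Source: {label}*", "", body, ""]
    return "\n".join(lines)


merged = generic_merge(
    "demo",
    {
        "pptx": {"Overview": "From the slides.", "Demo": "Slide walkthrough."},
        "documentation": {"Overview": "From the docs."},
    },
)
```

In this example the `Overview` section comes from the documentation source, while `Demo` is kept from the slides because no higher-priority source defines it.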
Config Validator Expansion
- 17 source types in `VALID_SOURCE_TYPES`: All new types plus `word` and `video` now have per-type validation methods.
- `_validate_word_source()`: Validates `path` field for Word documents (was previously missing).
- `_validate_video_source()`: Validates `url`, `path`, or `playlist` field for video sources (was previously missing).
- 11 new `_validate_*_source()` methods: One for each new type with appropriate required-field checks.
Source Detection Improvements
- 7 new file extension detections in `SourceDetector.detect()`: `.ipynb`, `.html`/`.htm`, `.pptx`, `.adoc`/`.asciidoc`, `.rss`/`.atom`, `.1`–`.8`/`.man`, `.yaml`/`.yml` (with content sniffing)
- `_looks_like_openapi()`: Content sniffing for YAML files. A file is classified as OpenAPI only if it contains an `openapi:` or `swagger:` key in its first 20 lines (prevents false positives on docker-compose, Ansible, Kubernetes manifests, etc.)
- Man page basename heuristic: `.1`–`.8` extensions are detected as man pages only if the basename has no dots (e.g., `git.1` matches but `access.log.1` does not)
- `.xml` excluded from RSS detection: Too generic; only `.rss` and `.atom` trigger RSS detection
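The two detection heuristics above can be sketched as follows. The function names and regular expressions are illustrative, not copied from `SourceDetector`. The sketch also makes two assumptions beyond the changelog text: the key must start at column 0, and the no-dots rule is applied to `.man` files as well.

```python
import re
from pathlib import Path

# Assumption: the key appears at the start of a line, optionally quoted.
_OPENAPI_KEY = re.compile(r"""^["']?(openapi|swagger)["']?\s*:""")
_MAN_SUFFIX = re.compile(r"^\.([1-8]|man)$")


def looks_like_openapi(text: str, max_lines: int = 20) -> bool:
    """True if one of the first `max_lines` lines starts an openapi/swagger key."""
    return any(_OPENAPI_KEY.match(line) for line in text.splitlines()[:max_lines])


def looks_like_manpage(path: str) -> bool:
    """True for names like git.1, False for rotated logs like access.log.1."""
    p = Path(path)
    if not _MAN_SUFFIX.match(p.suffix):
        return False
    # A dot left in the stem (access.log) suggests a rotated log, not a man page.
    return "." not in p.stem


compose = "version: '3'\nservices:\n  web:\n    image: nginx\n"
print(looks_like_openapi("openapi: 3.0.3\ninfo:\n  title: Demo\n"))  # True
print(looks_like_openapi(compose))                                    # False
print(looks_like_manpage("git.1"), looks_like_manpage("access.log.1"))  # True False
```

Limiting the scan to the first lines keeps detection cheap on large YAML files. The trade-off is that a spec whose version key appears later in the file would not be detected.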
MCP Server Integration
- `scrape_generic` tool: New MCP tool handles all 10 new source types via subprocess with per-type flag mapping
- `_PATH_FLAGS`/`_URL_FLAGS` dicts: Correct flag routing for each source type (e.g., jupyter→`--notebook`, html→`--html-path`, rss→`--feed-url`)
- `GENERIC_SOURCE_TYPES` tuple: Lists all 10 new types for validation
- Config validation display: `validate_config` tool now shows source details for all new types
- Tool count updated: 33 → 34 tools (scraping tools 10 → 11)
CLI Wiring
- 10 new CLI subcommands: `jupyter`, `html`, `openapi`, `asciidoc`, `pptx`, `rss`, `manpage`, `confluence`, `notion`, `chat` in `COMMAND_MODULES`
- 10 new argument modules: `arguments/{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}.py` with per-type `*_ARGUMENTS` dicts
- 10 new parser modules: `parsers/{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}_parser.py` with `SubcommandParser` implementations
- `create` command routing: `_route_generic()` method for all new types with correct module names and CLI flags
- 10 new entry points in pyproject.toml: `skill-seekers-{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}`
- 7 new optional dependency groups in pyproject.toml: `[jupyter]`, `[asciidoc]`, `[pptx]`, `[confluence]`, `[notion]`, `[rss]`, `[chat]`
- `[all]` group updated: Includes all 7 new optional dependencies
Sync Config Command
- `skill-seekers sync-config`: New subcommand that crawls a docs site's navigation, diffs discovered URLs against a config's `start_urls`, and optionally writes the updated list back with `--apply` (#306)
  - BFS link discovery with configurable depth (default 2), max-pages, rate-limit
  - Respects `url_patterns.include`/`exclude` from config
  - Supports optional `nav_seed_urls` config field
  - Handles both unified (sources array) and legacy flat config formats
  - MCP `sync_config` tool included
  - 57 tests (39 unit + 18 E2E with local HTTP server)
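The crawl-then-diff idea behind `sync-config` can be sketched without any network access. This is a minimal illustration under stated assumptions: `discover_urls`, `diff_start_urls`, and the `get_links` callback are hypothetical names. A real crawler would fetch and parse pages, apply the include/exclude patterns, and rate-limit requests.

```python
from collections import deque
from typing import Callable, Iterable


def discover_urls(seeds: Iterable[str], get_links: Callable[[str], list[str]],
                  max_depth: int = 2, max_pages: int = 100) -> list[str]:
    """Breadth-first discovery up to `max_depth` link hops from the seeds."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    discovered: list[str] = []
    while queue and len(discovered) < max_pages:
        url, depth = queue.popleft()
        discovered.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return discovered


def diff_start_urls(configured: Iterable[str], discovered: Iterable[str]) -> dict:
    """URLs to add to, or remove from, the configured list."""
    configured_set, discovered_set = set(configured), set(discovered)
    return {
        "added": sorted(discovered_set - configured_set),
        "removed": sorted(configured_set - discovered_set),
    }


# In-memory stand-in for a docs site's link structure.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"], "/b": []}
found = discover_urls(["/"], lambda url: site.get(url, []), max_depth=2)
changes = diff_start_urls(["/", "/old"], found)
```

With a depth of 2, `/a/1/x` sits three hops from the seed and is never visited. The diff then reports the newly found pages as additions and `/old` as a removal.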
Workflow & Documentation
- `complex-merge.yaml`: New 7-stage AI-powered workflow for complex multi-source merging (source inventory → cross-reference → conflict detection → priority merge → gap analysis → synthesis → quality check)
- AGENTS.md rewritten: Updated with all 17 source types, scraper pattern docs, project layout, and key pattern documentation
- 77 new integration tests in `test_new_source_types.py`: Source detection, config validation, generic merge, CLI wiring, validation, and create command routing
- `docs/BEST_PRACTICES.md`: Comprehensive guide for creating high-quality skills: SKILL.md structure, code examples, prerequisites, troubleshooting, quality targets, and real-world Grade F to Grade A example (#206)
- Documentation updated for 17 source types: 32 files updated across README, CLI reference, feature matrix, MCP reference, config format, API reference, unified scraping, multi-source guide, installation, quick-start, core concepts, user guide, FAQ, troubleshooting, architecture, and all Chinese (zh-CN) translations
- README translations for 10 languages (12 total): Added Japanese (日本語), Korean (한국어), Spanish (Español), French (Français), German (Deutsch), Portuguese (Português), Turkish (Türkçe), Arabic (العربية), Hindi (हिन्दी), and Russian (Русский) README translations with language selector bar across all versions
Performance
- Pre-compiled regex and O(1) URL dedup in doc_scraper: Module-level compiled patterns, `_enqueued_urls` set for O(1) dedup, cached URL patterns, async error logging fix (#309)
- Bisect-based line indexing in code_analyzer and dependency_analyzer: O(log n) `offset_to_line()` via bisect replaces O(n) `count("\n")` across all 10 language analyzers and all import extractors
- O(n) parent class map for Python method detection: Replaces O(n²) repeated AST walks in code_analyzer
- O(1) queue pops during tree traversal in github_scraper: `deque.popleft()` replaces list `pop(0)`
- Shared `build_line_index()`/`offset_to_line()` utilities in `cli/utils.py`: DRY extraction from code_analyzer and dependency_analyzer
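The bisect-based line lookup can be sketched as below. The function names match the utilities listed above, but the signatures and bodies are assumptions, since the changelog does not show the actual implementation in `cli/utils.py`.

```python
import bisect


def build_line_index(text: str) -> list[int]:
    """Offsets at which each line starts; entry 0 belongs to line 1."""
    starts = [0]
    for pos, ch in enumerate(text):
        if ch == "\n":
            starts.append(pos + 1)
    return starts


def offset_to_line(line_starts: list[int], offset: int) -> int:
    """1-based line number containing `offset`, found by binary search."""
    # Number of line starts <= offset equals the 1-based line number.
    return bisect.bisect_right(line_starts, offset)


source = "import os\nimport sys\n\ndef f():\n    pass\n"
index = build_line_index(source)           # built once, O(n)
line = offset_to_line(index, source.index("def"))  # each lookup O(log n)
print(line)  # 4
```

The index costs one pass over the text. After that, every lookup is a binary search instead of a fresh `count("\n")` over the prefix, which matters when an analyzer resolves many match offsets in the same file.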
Fixed
- Config validator missing `word` and `video` dispatch: `_validate_source()` had no `elif` branches for `word` or `video` types, silently skipping validation. Added dispatch entries and `_validate_word_source()`/`_validate_video_source()` methods.
- `openapi_scraper.py` unconditional `import yaml`: Would crash at import time if pyyaml not installed. Added `try/except ImportError` guard with `YAML_AVAILABLE` flag and `_check_yaml_deps()` helper.
- `asciidoc_scraper.py` missing standard arguments: `main()` manually defined args instead of using `add_asciidoc_arguments()`. Refactored to use shared argument definitions + added enhancement workflow integration.
- `pptx_scraper.py` missing standard arguments: Same issue. Refactored to use `add_pptx_arguments()`.
- `chat_scraper.py` missing standard arguments: Same issue. Refactored to use `add_chat_arguments()`.
- `notion_scraper.py` missing `run_workflows` call: `--enhance-workflow` flags were silently ignored. Added workflow runner integration.
- `openapi_scraper.py` return type `None`: `main()` returned `None` instead of `int`. Fixed to `return 0` on success, matching all other scrapers.
- MCP `scrape_generic_tool` flag mismatch: Was passing `--path`/`--url` as generic flags, but every scraper expects its own flag name (e.g., `--notebook`, `--html-path`, `--spec`). All 10 source types would have failed at runtime. Fixed with per-type `_PATH_FLAGS` and `_URL_FLAGS` mappings.
- Word scraper `docx_id` key mismatch: Unified scraper data dict used `docx_id` but generic reference generation looked for `word_id`. Added `word_id` alias.
- `main.py` docstring stale: Missing all 10 new commands. Updated to list all 27 commands.
- `source_detector.py` module docstring stale: Described only 5 source types. Updated to describe 14+ detected types.
- `manpage_parser.py` docstring referenced wrong file: Said `manpage_scraper.py` but actual file is `man_scraper.py`. Fixed.
- Parser registry test count: Updated expected count from 25 to 35 for 10 new parsers.
- 'Invalid IPv6 URL' error on bracket-containing URLs (#284): URLs with square brackets (e.g., `/api/[v1]/users`) discovered via BFS crawl or HTML extraction bypassed the original fix in `_clean_url()`. Added shared `sanitize_url()` utility applied at every URL ingestion point. 16 new tests.
- GitHub scraper 'list index out of range' on issue extraction (#269): PyGithub's `PaginatedList` slicing could fail on some versions or empty repos. Replaced with `itertools.islice()`.
- Release workflow version mismatch: GitHub release showed wrong version (v3.1.3 instead of v3.2.0) because no explicit release name was set and sed regex had unescaped dots. Added explicit `name`/`tag_name`, version consistency check (tag vs pyproject.toml vs package), and empty release notes fallback.
- Release workflow Python 3.10 compatibility: Version consistency check used `tomllib` (Python 3.11+). Replaced with grep/sed for 3.10 compatibility.
- `infer_categories()` "tutorial" vs "tutorials" key mismatch: Guard checked `'tutorial'` but wrote to `'tutorials'` key, risking silent overwrites in category inference.
- Flaky `test_benchmark_metadata_overhead`: Stabilized with 20 iterations, a warm-up run, the median instead of the mean, and a 200% threshold (was failing on CI with 5 iterations and mean).
- CI branch protection check permanently pending: Summary job was named 'All Checks Complete' but branch protection required 'Tests'. PRs were stuck as 'Expected — Waiting for status to be reported'. Renamed job to match.
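The `itertools.islice()` fix for #269 relies on a general property: `islice` only needs iteration, so it works even when indexing or slicing the underlying object is unreliable. The sketch below uses a hypothetical `LazyIssues` class as a stand-in for a paginated result. It is not PyGithub's `PaginatedList`, and the failure it simulates is only an approximation of the reported error.

```python
from itertools import islice


class LazyIssues:
    """Stand-in for a paginated API result: iterable, but slicing fails."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)

    def __getitem__(self, index):
        # Simulates the failure mode described in the fix above.
        raise IndexError("list index out of range")


def first_n(results, limit: int) -> list:
    """Take at most `limit` items using iteration only."""
    return list(islice(results, limit))


print(first_n(LazyIssues(["#1", "#2", "#3"]), 2))  # ['#1', '#2']
print(first_n(LazyIssues([]), 2))                  # []
```

An empty result simply yields an empty list, so no special case for empty repositories is needed.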