github yusufkaraaslan/Skill_Seekers v3.3.0


[3.3.0] - 2026-03-16

Theme: 10 new source types (17 total), EPUB unified integration, sync-config command, performance optimizations, 12 README translations, and 19 bug fixes. 117 files changed, +41,588 lines since v3.2.0.

Supported Source Types (17)

| #  | Type                | CLI Command                 | Config Type   | Auto-Detection                        |
|----|---------------------|-----------------------------|---------------|---------------------------------------|
| 1  | Documentation (web) | scrape / create <url>       | documentation | HTTP/HTTPS URLs                       |
| 2  | GitHub repository   | github / create owner/repo  | github        | owner/repo or github.com URLs         |
| 3  | PDF document        | pdf / create file.pdf       | pdf           | .pdf extension                        |
| 4  | Word document       | word / create file.docx     | word          | .docx extension                       |
| 5  | EPUB e-book         | epub / create file.epub     | epub          | .epub extension                       |
| 6  | Video               | video / create <url/file>   | video         | YouTube/Vimeo URLs, video extensions  |
| 7  | Local codebase      | analyze / create ./path     | local         | Directory paths                       |
| 8  | Jupyter Notebook    | jupyter / create file.ipynb | jupyter       | .ipynb extension                      |
| 9  | Local HTML          | html / create file.html     | html          | .html/.htm extensions                 |
| 10 | OpenAPI/Swagger     | openapi / create spec.yaml  | openapi       | .yaml/.yml with OpenAPI content       |
| 11 | AsciiDoc            | asciidoc / create file.adoc | asciidoc      | .adoc/.asciidoc extensions            |
| 12 | PowerPoint          | pptx / create file.pptx     | pptx          | .pptx extension                       |
| 13 | RSS/Atom feed       | rss / create feed.rss       | rss           | .rss/.atom extensions                 |
| 14 | Man pages           | manpage / create cmd.1      | manpage       | .1–.8/.man extensions                 |
| 15 | Confluence wiki     | confluence                  | confluence    | API or export directory               |
| 16 | Notion pages        | notion                      | notion        | API or export directory               |
| 17 | Slack/Discord chat  | chat                        | chat          | Export directory or API               |

Added

10 New Skill Source Types (17 total)

Skill Seekers now supports 17 source types — up from 7. Every new type is fully integrated into the CLI (skill-seekers <type>), create command auto-detection, unified multi-source configs, config validation, the MCP server, and the skill builder.

  • Jupyter Notebook: skill-seekers jupyter --notebook file.ipynb or skill-seekers create file.ipynb

    • Extracts markdown cells, code cells with outputs, kernel metadata, imports, and language detection
    • Handles single files and directories of notebooks; filters .ipynb_checkpoints
    • Optional dependency: pip install "skill-seekers[jupyter]" (nbformat)
    • Entry point: skill-seekers-jupyter
  • Local HTML: skill-seekers html --html-path file.html or skill-seekers create file.html

    • Parses HTML using BeautifulSoup with smart main content detection (<article>, <main>, .content, largest div)
    • Extracts headings, code blocks, tables (to markdown), images, links; converts inline HTML to markdown
    • Handles single files and directories; supports .html, .htm, .xhtml extensions
    • No extra dependencies (BeautifulSoup is a core dep)
  • OpenAPI/Swagger: skill-seekers openapi --spec spec.yaml or skill-seekers create spec.yaml

    • Parses OpenAPI 3.0/3.1 and Swagger 2.0 specs from YAML or JSON (local files or URLs via --spec-url)
    • Extracts endpoints, parameters, request/response schemas, security schemes, tags
    • Resolves $ref references with circular reference protection; handles allOf/oneOf/anyOf
    • Groups endpoints by tags; generates comprehensive API reference markdown
    • Source detection sniffs YAML file content for openapi: or swagger: keys (avoids false positives on non-API YAML files)
    • Optional dependency: pip install "skill-seekers[openapi]" (pyyaml — already a core dep, guard added for safety)
  • AsciiDoc: skill-seekers asciidoc --asciidoc-path file.adoc or skill-seekers create file.adoc

    • Regex-based parser (no external library required) with optional asciidoc library support
    • Extracts headings (= through =====), [source,lang] code blocks, |=== tables, admonitions (NOTE/TIP/WARNING/IMPORTANT/CAUTION), and include:: directives
    • Converts AsciiDoc formatting to markdown; handles single files and directories
    • Optional dependency: pip install "skill-seekers[asciidoc]" (asciidoc library for advanced rendering)
  • PowerPoint (.pptx): skill-seekers pptx --pptx file.pptx or skill-seekers create file.pptx

    • Extracts slide text, speaker notes, tables, images (with alt text), and grouped shapes
    • Detects code blocks by monospace font analysis (30+ font families)
    • Groups slides into sections by layout type; handles single files and directories
    • Optional dependency: pip install "skill-seekers[pptx]" (python-pptx)
  • RSS/Atom Feeds: skill-seekers rss --feed-url <url> / --feed-path file.rss or skill-seekers create feed.rss

    • Parses RSS 2.0, RSS 1.0, and Atom feeds via feedparser
    • Optionally follows article links (--follow-links, default on) to scrape full page content using BeautifulSoup
    • Extracts article titles, summaries, authors, dates, categories; configurable --max-articles (default 50)
    • Source detection matches .rss and .atom extensions (.xml excluded to avoid false positives)
    • Optional dependency: pip install "skill-seekers[rss]" (feedparser)
  • Man Pages: skill-seekers manpage --man-names git,curl / --man-path dir/ or skill-seekers create git.1

    • Extracts man pages by running the man command via subprocess or by reading .1–.8/.man files directly
    • Handles gzip/bzip2/xz compressed man files; strips troff/groff formatting (backspace overstriking, macros, font escapes)
    • Parses structured sections (NAME, SYNOPSIS, DESCRIPTION, OPTIONS, EXAMPLES, SEE ALSO)
    • Source detection uses basename heuristic to avoid false positives on log rotation files (e.g., access.log.1)
    • No external dependencies (stdlib only)
  • Confluence: skill-seekers confluence --base-url <url> --space-key <key> or --export-path dir/

    • API mode: fetches pages from Confluence REST API with pagination (atlassian-python-api)
    • Export mode: parses Confluence HTML/XML export directories
    • Extracts page content, code/panel/info/warning macros, page hierarchy, tables
    • Optional dependency: pip install "skill-seekers[confluence]" (atlassian-python-api)
  • Notion: skill-seekers notion --database-id <id> / --page-id <id> or --export-path dir/

    • API mode: fetches pages via Notion API with support for 20+ block types (paragraph, heading, code, callout, toggle, table, etc.)
    • Export mode: parses Notion Markdown/CSV export directories
    • Extracts rich text with annotations (bold, italic, code, links), 16+ property types for database entries
    • Optional dependency: pip install "skill-seekers[notion]" (notion-client)
  • Slack/Discord Chat: skill-seekers chat --export-path dir/ or --token <token> --channel <channel>

    • Slack: parses workspace JSON exports or fetches via Slack Web API (slack_sdk)
    • Discord: parses DiscordChatExporter JSON or fetches via Discord HTTP API
    • Extracts messages, code snippets (fenced blocks), shared URLs, threads, reactions, attachments
    • Generates per-channel summaries and topic categorization
    • Optional dependency: pip install "skill-seekers[chat]" (slack-sdk)

EPUB Unified Pipeline Integration

  • EPUB (.epub) input support via skill-seekers create book.epub or skill-seekers epub --epub book.epub
    • Extracts chapters, metadata (Dublin Core), code blocks, images, and tables from EPUB 2 and EPUB 3 files
    • DRM detection with clear error messages (Adobe ADEPT, Apple FairPlay, Readium LCP)
    • Font obfuscation correctly identified as non-DRM
    • EPUB 3 TOC bug workaround (ignore_ncx option)
    • --help-epub flag for EPUB-specific help
    • Optional dependency: pip install "skill-seekers[epub]" (ebooklib)
    • 107 tests across 14 test classes
  • EPUB added to unified scraper: _scrape_epub() method, scraped_data["epub"], config validation (_validate_epub_source), and dry-run display. Previously EPUB worked standalone but was missing from multi-source configs.

Unified Skill Builder — Generic Merge System

  • _generic_merge() — Priority-based section merge for any combination of source types not covered by existing pairwise synthesis (docs+github, docs+pdf, etc.). Produces YAML frontmatter + source-attributed sections.
  • _append_extra_sources() — Appends additional source type content (e.g., Jupyter + PPTX) to pairwise-synthesized SKILL.md.
  • _generate_generic_references() — Generates references/<type>/index.md for any source type, with ID resolution fallback chain.
  • _SOURCE_LABELS dict — Human-readable labels for all 17 source types used in merge attribution.

Config Validator Expansion

  • 17 source types in VALID_SOURCE_TYPES — All new types plus word and video now have per-type validation methods.
  • _validate_word_source() — Validates path field for Word documents (was previously missing).
  • _validate_video_source() — Validates url, path, or playlist field for video sources (was previously missing).
  • 11 new _validate_*_source() methods — One for each new type with appropriate required-field checks.

Source Detection Improvements

  • 7 new file extension detections in SourceDetector.detect(): .ipynb, .html/.htm, .pptx, .adoc/.asciidoc, .rss/.atom, .1–.8/.man, .yaml/.yml (with content sniffing)
  • _looks_like_openapi() — Content sniffing for YAML files: only classifies as OpenAPI if the file contains openapi: or swagger: key in first 20 lines (prevents false positives on docker-compose, Ansible, Kubernetes manifests, etc.)
  • Man page basename heuristic: .1–.8 extensions are only detected as man pages if the basename has no dots (e.g., git.1 matches but access.log.1 does not)
  • .xml excluded from RSS detection — Too generic; only .rss and .atom trigger RSS detection
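
The two detection heuristics described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual implementation: the function names, the regular expressions, and the choice to match only top-level keys are assumptions. Only the 20-line sniffing window and the "basename has no dots" rule come from the notes.

```python
import re
from pathlib import Path

# Hypothetical patterns; the release notes describe the behavior, not the code.
_OPENAPI_KEY = re.compile(r"^[\"']?(openapi|swagger)[\"']?\s*:")
_MAN_SECTION = re.compile(r"^\.[1-8]$")


def looks_like_openapi(text: str, max_lines: int = 20) -> bool:
    """Return True if a top-level openapi:/swagger: key appears early in the file."""
    for line in text.splitlines()[:max_lines]:
        if _OPENAPI_KEY.match(line):
            return True
    return False


def looks_like_man_page(filename: str) -> bool:
    """Accept git.1 but reject access.log.1 (the basename must contain no dots)."""
    path = Path(filename)
    if path.suffix == ".man":
        return True
    return bool(_MAN_SECTION.match(path.suffix)) and "." not in path.stem
```

Under these assumptions, a docker-compose file that starts with `version:` and `services:` is not classified as OpenAPI, and a rotated log such as access.log.1 is not treated as a man page.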

MCP Server Integration

  • scrape_generic tool — New MCP tool handles all 10 new source types via subprocess with per-type flag mapping
  • _PATH_FLAGS / _URL_FLAGS dicts — Correct flag routing for each source type (e.g., jupyter→--notebook, html→--html-path, rss→--feed-url)
  • GENERIC_SOURCE_TYPES tuple — Lists all 10 new types for validation
  • Config validation display: validate_config tool now shows source details for all new types
  • Tool count updated — 33 → 34 tools (scraping tools 10 → 11)
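
The per-type flag routing could look something like the sketch below. The flag names are taken from the CLI examples in these notes; the dict names, the build_command helper, and the exact shape of the command are illustrative assumptions and may differ from what the MCP server actually does.

```python
from typing import Optional

# Flags as listed in the release notes; only a subset of types is shown.
PATH_FLAGS = {
    "jupyter": "--notebook",
    "html": "--html-path",
    "openapi": "--spec",
    "asciidoc": "--asciidoc-path",
    "pptx": "--pptx",
    "rss": "--feed-path",
}
URL_FLAGS = {
    "openapi": "--spec-url",
    "rss": "--feed-url",
}


def build_command(source_type: str,
                  path: Optional[str] = None,
                  url: Optional[str] = None) -> list:
    """Build an argv list using the flag that this source type expects."""
    if url is not None:
        flag, value = URL_FLAGS.get(source_type), url
    elif path is not None:
        flag, value = PATH_FLAGS.get(source_type), path
    else:
        raise ValueError("either path or url is required")
    if flag is None:
        raise ValueError(f"{source_type!r} does not accept this kind of input")
    # Hypothetical invocation shape: skill-seekers <type> <flag> <value>
    return ["skill-seekers", source_type, flag, value]
```

For example, build_command("jupyter", path="nb.ipynb") yields an argv ending in --notebook nb.ipynb, rather than a generic --path that the scraper would not recognize.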

CLI Wiring

  • 10 new CLI subcommands: jupyter, html, openapi, asciidoc, pptx, rss, manpage, confluence, notion, chat in COMMAND_MODULES
  • 10 new argument modules: arguments/{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}.py with per-type *_ARGUMENTS dicts
  • 10 new parser modules: parsers/{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}_parser.py with SubcommandParser implementations
  • create command routing: _route_generic() method for all new types with correct module names and CLI flags
  • 10 new entry points in pyproject.toml — skill-seekers-{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}
  • 7 new optional dependency groups in pyproject.toml — [jupyter], [asciidoc], [pptx], [confluence], [notion], [rss], [chat]
  • [all] group updated — Includes all 7 new optional dependencies

Sync Config Command

  • skill-seekers sync-config — New subcommand that crawls a docs site's navigation, diffs discovered URLs against a config's start_urls, and optionally writes the updated list back with --apply (#306)
    • BFS link discovery with configurable depth (default 2), max-pages, rate-limit
    • Respects url_patterns.include/exclude from config
    • Supports optional nav_seed_urls config field
    • Handles both unified (sources array) and legacy flat config formats
    • MCP sync_config tool included
    • 57 tests (39 unit + 18 E2E with local HTTP server)
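
The crawl-and-diff idea behind sync-config can be illustrated with a small depth-limited breadth-first search. This is a sketch under stated assumptions: discover_urls, diff_urls, and the get_links callback are hypothetical names, and the real command additionally applies rate limiting and the config's url_patterns, which are omitted here.

```python
from collections import deque


def discover_urls(seeds, get_links, max_depth=2, max_pages=100):
    """Breadth-first discovery; get_links(url) returns the URLs linked from url."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    discovered = []
    while queue and len(discovered) < max_pages:
        url, depth = queue.popleft()
        discovered.append(url)
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return discovered


def diff_urls(discovered, start_urls):
    """Return (added, removed) relative to the config's current start_urls."""
    found, current = set(discovered), set(start_urls)
    return sorted(found - current), sorted(current - found)


# An in-memory link graph stands in for real HTTP fetching.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/deep"], "/b": []}
found = discover_urls(["/"], lambda u: site.get(u, []), max_depth=2)
added, removed = diff_urls(found, ["/", "/a", "/old"])
```

With the default depth of 2, "/deep" is never reached, "/a/1" and "/b" are reported as added, and "/old" is reported as removed.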

Workflow & Documentation

  • complex-merge.yaml — New 7-stage AI-powered workflow for complex multi-source merging (source inventory → cross-reference → conflict detection → priority merge → gap analysis → synthesis → quality check)
  • AGENTS.md rewritten — Updated with all 17 source types, scraper pattern docs, project layout, and key pattern documentation
  • 77 new integration tests in test_new_source_types.py — Source detection, config validation, generic merge, CLI wiring, validation, and create command routing
  • docs/BEST_PRACTICES.md — Comprehensive guide for creating high-quality skills: SKILL.md structure, code examples, prerequisites, troubleshooting, quality targets, and real-world Grade F to Grade A example (#206)
  • Documentation updated for 17 source types — 32 files updated across README, CLI reference, feature matrix, MCP reference, config format, API reference, unified scraping, multi-source guide, installation, quick-start, core concepts, user guide, FAQ, troubleshooting, architecture, and all Chinese (zh-CN) translations
  • README translations for 10 languages (12 total) — Added Japanese (日本語), Korean (한국어), Spanish (Español), French (Français), German (Deutsch), Portuguese (Português), Turkish (Türkçe), Arabic (العربية), Hindi (हिन्दी), and Russian (Русский) README translations with language selector bar across all versions

Performance

  • Pre-compiled regex and O(1) URL dedup in doc_scraper — Module-level compiled patterns, _enqueued_urls set for O(1) dedup, cached URL patterns, async error logging fix (#309)
  • Bisect-based line indexing in code_analyzer and dependency_analyzer — O(log n) offset_to_line() via bisect replaces O(n) count("\n") across all 10 language analyzers and all import extractors
  • O(n) parent class map for Python method detection — Replaces O(n²) repeated AST walks in code_analyzer
  • O(1) tree traversal in github_scraper: deque.popleft() replaces list pop(0)
  • Shared build_line_index() / offset_to_line() utilities in cli/utils.py — DRY extraction from code_analyzer and dependency_analyzer
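
The bisect-based line lookup can be sketched as below. The function names match the utilities mentioned above, but the signatures and bodies are assumptions made for illustration and are not copied from cli/utils.py.

```python
from bisect import bisect_right


def build_line_index(text: str) -> list:
    """Return the character offset at which each line starts (built once, O(n))."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def offset_to_line(line_starts: list, offset: int) -> int:
    """Map a character offset to a 1-based line number in O(log n)."""
    return bisect_right(line_starts, offset)


source = "import os\n\ndef main():\n    pass\n"
index = build_line_index(source)
line_of_def = offset_to_line(index, source.index("def"))
```

The index is built once per file, so each lookup costs a binary search instead of rescanning the text with count("\n") for every match.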

Fixed

  • Config validator missing word and video dispatch: _validate_source() had no elif branches for word or video types, silently skipping validation. Added dispatch entries and _validate_word_source() / _validate_video_source() methods.
  • openapi_scraper.py unconditional import yaml — Would crash at import time if pyyaml not installed. Added try/except ImportError guard with YAML_AVAILABLE flag and _check_yaml_deps() helper.
  • asciidoc_scraper.py missing standard arguments: main() manually defined args instead of using add_asciidoc_arguments(). Refactored to use shared argument definitions + added enhancement workflow integration.
  • pptx_scraper.py missing standard arguments — Same issue. Refactored to use add_pptx_arguments().
  • chat_scraper.py missing standard arguments — Same issue. Refactored to use add_chat_arguments().
  • notion_scraper.py missing run_workflows call: --enhance-workflow flags were silently ignored. Added workflow runner integration.
  • openapi_scraper.py return type None: main() returned None instead of int. Fixed to return 0 on success, matching all other scrapers.
  • MCP scrape_generic_tool flag mismatch — Was passing --path/--url as generic flags, but every scraper expects its own flag name (e.g., --notebook, --html-path, --spec). All 10 source types would have failed at runtime. Fixed with per-type _PATH_FLAGS and _URL_FLAGS mappings.
  • Word scraper docx_id key mismatch — Unified scraper data dict used docx_id but generic reference generation looked for word_id. Added word_id alias.
  • main.py docstring stale — Missing all 10 new commands. Updated to list all 27 commands.
  • source_detector.py module docstring stale — Described only 5 source types. Updated to describe 14+ detected types.
  • manpage_parser.py docstring referenced wrong file — Said manpage_scraper.py but actual file is man_scraper.py. Fixed.
  • Parser registry test count — Updated expected count from 25 to 35 for 10 new parsers.
  • 'Invalid IPv6 URL' error on bracket-containing URLs (#284) — URLs with square brackets (e.g., /api/[v1]/users) discovered via BFS crawl or HTML extraction bypassed the original fix in _clean_url(). Added shared sanitize_url() utility applied at every URL ingestion point. 16 new tests.
  • GitHub scraper 'list index out of range' on issue extraction (#269) — PyGithub's PaginatedList slicing could fail on some versions or empty repos. Replaced with itertools.islice().
  • Release workflow version mismatch — GitHub release showed wrong version (v3.1.3 instead of v3.2.0) because no explicit release name was set and sed regex had unescaped dots. Added explicit name/tag_name, version consistency check (tag vs pyproject.toml vs package), and empty release notes fallback.
  • Release workflow Python 3.10 compatibility — Version consistency check used tomllib (Python 3.11+). Replaced with grep/sed for 3.10 compatibility.
  • infer_categories() "tutorial" vs "tutorials" key mismatch — Guard checked 'tutorial' but wrote to 'tutorials' key, risking silent overwrites in category inference.
  • Flaky test_benchmark_metadata_overhead — Stabilized with 20 iterations, warm-up run, median averaging, and 200% threshold (was failing on CI with 5 iterations and mean).
  • CI branch protection check permanently pending — Summary job was named 'All Checks Complete' but branch protection required 'Tests'. PRs were stuck as 'Expected — Waiting for status to be reported'. Renamed job to match.
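
As an illustration of the itertools.islice() approach mentioned in the GitHub scraper fix, the sketch below takes the first n items from any iterable without indexing or slicing it. The first_n helper and the generator standing in for a paginated result are hypothetical; PyGithub itself is not used here.

```python
from itertools import islice


def first_n(items, n):
    """Return up to n items from any iterable, including an empty one."""
    return list(islice(items, n))


def fake_paginated_issues(count):
    """Stand-in for a lazily paginated API result (illustrative only)."""
    for number in range(1, count + 1):
        yield {"number": number}


some = first_n(fake_paginated_issues(10), 3)
none = first_n(fake_paginated_issues(0), 3)
```

Because islice only consumes the iterator, it returns an empty list for an empty repository instead of raising an index error.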
