github rhel-lightspeed/docs2db v0.3.0
v0.3.0 - Subdirectory-based document storage

5 months ago

Added

  • document_needs_update() function for external API callers to check if documents need updating without knowing internal storage details
  • Enhanced audit reporting with separate tracking for stale chunks and stale embeddings
  • Zero-chunk document detection in audit (documents that legitimately have no chunks)
  • Orphan directory detection in audit (directories without source.json)
  • Audit command now accepts directory patterns for targeted auditing (e.g., --pattern "external/**" or --pattern "allowed_kcs/*")
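
As an illustration of the directory-pattern and orphan-directory ideas above, here is a minimal sketch. It is not docs2db's implementation: the function name find_orphan_dirs is hypothetical, and it assumes an orphan is a matched directory that holds at least one file but no source.json, so that pure container directories such as external/ are not reported.

```python
from pathlib import Path


def find_orphan_dirs(content_dir: Path, pattern: str = "**") -> list[Path]:
    """Return matched directories that hold files but no source.json.

    Hypothetical helper for illustration; docs2db's audit may differ.
    """
    orphans = []
    for path in sorted(content_dir.glob(pattern)):
        if not path.is_dir():
            continue  # the pattern is meant to select directories only
        has_files = any(child.is_file() for child in path.iterdir())
        if has_files and not (path / "source.json").exists():
            orphans.append(path)
    return orphans
```

Calling find_orphan_dirs(Path("content"), "external/**") would restrict the check to directories under content/external, mirroring the --pattern "external/**" example.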

Changed

  • BREAKING: File storage now uses subdirectory-based structure where each document gets its own directory:
    • Old: doc_name.json, doc_name.chunks.json, doc_name.gran.json
    • New: doc_name/source.json, doc_name/chunks.json, doc_name/gran.json
  • BREAKING: ingest_from_content() now requires stream_name parameter (e.g., "document.html") for proper format detection
  • BREAKING: ingest_file() and ingest_from_content() now accept directory paths instead of file paths (e.g., content_dir/doc_name not content_dir/doc_name/source.json)
  • BREAKING: Default glob patterns changed to **/source.json (for chunking) and **/chunks.json (for embedding)
  • Audit command default pattern changed from **/*.json to ** (directory-based pattern)
  • Audit now validates that patterns don't include file extensions (must match directories only)
  • Audit results now show separate counts for chunks, embeddings, and zero-chunk documents
  • Moved routine document analysis logging to DEBUG level; summarization events still logged at INFO level
  • Performance improvement: LLM API clients (WatsonX, OpenAI) are now reused across documents in the same batch instead of creating new clients for each document
  • LLMSession.__init__() no longer takes doc_text parameter; call set_document(doc_text) after initialization
  • Added LLM_PROVIDER setting (defaults to "openai") to explicitly choose between OpenAI-compatible and WatsonX providers; provider selection now respects Pydantic Settings precedence (CLI/env > .env file > defaults)
  • Added --llm-provider, --openai-url, --watsonx-url, --context-model, and --context-limit CLI flags to chunk and pipeline commands to explicitly control LLM provider settings
  • Provider is inferred from URL flags if not explicitly specified (e.g., --watsonx-url → watsonx, --openai-url → openai)
  • Removed validation that prevented both --openai-url and --watsonx-url from being set; provider selection now explicit via --llm-provider or inferred from flags
  • Settings access is now centralized in public API functions only (generate_chunks(), generate_embeddings(), load_documents(), perform_audit()); all internal functions require explicit parameters and validate inputs strictly (no global settings access)
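
For the breaking storage change, content written by earlier versions has to be moved into the per-document layout. The following is a rough sketch of such a move based only on the Old/New file names listed above. The names migrate_flat_layout, new_path_for, and SUFFIX_TO_NAME are hypothetical and not part of docs2db, and the sketch assumes document names do not themselves end in ".chunks" or ".gran".

```python
from pathlib import Path

# Old flat suffix -> file name inside the new per-document directory.
# Longer suffixes come first so "doc.chunks.json" is not treated as "doc.json".
SUFFIX_TO_NAME = [
    (".chunks.json", "chunks.json"),
    (".gran.json", "gran.json"),
    (".json", "source.json"),
]
NEW_LAYOUT_NAMES = {"source.json", "chunks.json", "gran.json"}


def new_path_for(old_file: Path) -> Path:
    """Map doc_name<suffix> to doc_name/<new name> in the same parent."""
    for suffix, new_name in SUFFIX_TO_NAME:
        if old_file.name.endswith(suffix):
            doc_name = old_file.name[: -len(suffix)]
            return old_file.parent / doc_name / new_name
    raise ValueError(f"not a recognised flat-layout file: {old_file}")


def migrate_flat_layout(content_dir: Path) -> int:
    """Move flat-layout files into per-document directories; return the count."""
    moved = 0
    # sorted() materialises the listing before any file is moved.
    for old_file in sorted(content_dir.rglob("*.json")):
        if old_file.name in NEW_LAYOUT_NAMES:
            continue  # already in the new layout
        target = new_path_for(old_file)
        target.parent.mkdir(parents=True, exist_ok=True)
        old_file.rename(target)
        moved += 1
    return moved
```

Running a sketch like this on a copy of the content directory first is advisable, since it renames files in place.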

Fixed

  • Addressed WatsonX rate limiting issues by reusing API clients across documents in worker batches
  • Fixed race condition causing duplicate key errors when multiple workers insert the same embedding model concurrently (now uses INSERT ... ON CONFLICT DO NOTHING)
  • Fixed db-destroy command to correctly parse project name from postgres-compose.yml instead of hardcoding "docs2db" (was failing to remove volumes for projects with different names)
  • Fixed ingest_file() to enforce .json extension regardless of caller-provided path, ensuring downstream chunking/embedding tools always find correct files
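
The INSERT ... ON CONFLICT DO NOTHING pattern mentioned in the race-condition fix can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (both support this clause), and the table name models, its columns, and the helper get_or_create_model_id are assumptions rather than docs2db's actual schema or code.

```python
import sqlite3


def get_or_create_model_id(conn: sqlite3.Connection, name: str) -> int:
    """Return the id for a model name, inserting the row if it is missing."""
    # If a row with this name already exists, the insert is silently skipped
    # instead of raising a duplicate key error.
    conn.execute(
        "INSERT INTO models (name) VALUES (?) ON CONFLICT (name) DO NOTHING",
        (name,),
    )
    row = conn.execute(
        "SELECT id FROM models WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)"
)
first = get_or_create_model_id(conn, "example-model")
second = get_or_create_model_id(conn, "example-model")  # no error, same row
```

Because the insert is a no-op on conflict, the id is read back with a separate SELECT rather than taken from the insert itself.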
