rhel-lightspeed/docs2db v0.3.0 - Subdirectory-based document storage on GitHub

Added

document_needs_update() function for external API callers to check if documents need updating without knowing internal storage details
Enhanced audit reporting with separate tracking for stale chunks and stale embeddings
Zero-chunk document detection in audit (documents that legitimately have no chunks)
Orphan directory detection in audit (directories without source.json)
Audit command now accepts directory patterns for targeted auditing (e.g., --pattern "external/**" or --pattern "allowed_kcs/*")
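The orphan-directory check can be pictured with a minimal sketch. This is not the docs2db implementation: find_orphan_dirs is a hypothetical helper, and it assumes that a document directory is a leaf directory (one with no subdirectories) and that the pattern is applied with pathlib globbing relative to the content directory.

```python
from pathlib import Path


def find_orphan_dirs(content_dir: Path, pattern: str = "**") -> list[Path]:
    """Return leaf directories matching `pattern` that lack a source.json.

    Hypothetical helper for illustration only; the real audit may define
    "orphan" differently.
    """
    orphans = []
    for path in sorted(content_dir.glob(pattern)):
        # The pattern is directory-based, so skip any files it happens to match.
        if not path.is_dir():
            continue
        # Assumption: parent directories that only group documents are not
        # documents themselves, so only leaf directories are checked.
        has_subdirs = any(child.is_dir() for child in path.iterdir())
        if not has_subdirs and not (path / "source.json").exists():
            orphans.append(path)
    return orphans
```

Called as find_orphan_dirs(Path("content"), "external/**"), this would limit the check to directories under content/external, mirroring the targeted patterns described above.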

Changed

BREAKING: File storage now uses subdirectory-based structure where each document gets its own directory:
- Old: doc_name.json, doc_name.chunks.json, doc_name.gran.json
- New: doc_name/source.json, doc_name/chunks.json, doc_name/gran.json
BREAKING: ingest_from_content() now requires stream_name parameter (e.g., "document.html") for proper format detection
BREAKING: ingest_file() and ingest_from_content() now accept directory paths instead of file paths (e.g., content_dir/doc_name, not content_dir/doc_name/source.json)
BREAKING: Default glob patterns changed to **/source.json (for chunking) and **/chunks.json (for embedding)
Audit command default pattern changed from **/*.json to ** (directory-based pattern)
Audit now validates that patterns don't include file extensions (must match directories only)
Audit results now show separate counts for chunks, embeddings, and zero-chunk documents
Moved routine document analysis logging to DEBUG level; summarization events still logged at INFO level
Performance improvement: LLM API clients (WatsonX, OpenAI) are now reused across documents in the same batch instead of being created anew for each document
LLMSession.__init__() no longer takes doc_text parameter; call set_document(doc_text) after initialization
Added LLM_PROVIDER setting (defaults to "openai") to explicitly choose between OpenAI-compatible and WatsonX providers; provider selection now respects Pydantic Settings precedence (CLI/env > .env file > defaults)
Added --llm-provider, --openai-url, --watsonx-url, --context-model, and --context-limit CLI flags to chunk and pipeline commands to explicitly control LLM provider settings
Provider is inferred from URL flags if not explicitly specified (e.g., --watsonx-url → watsonx, --openai-url → openai)
Removed validation that prevented both --openai-url and --watsonx-url from being set; provider selection now explicit via --llm-provider or inferred from flags
Settings access is now centralized in public API functions only (generate_chunks(), generate_embeddings(), load_documents(), perform_audit()); all internal functions require explicit parameters and validate inputs strictly (no global settings access)
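For existing content directories, the move from the old flat layout to the new per-document subdirectories could be scripted roughly as follows. This is a sketch under stated assumptions, not a migration tool shipped with docs2db: migrate_flat_layout is a hypothetical helper, it assumes old files follow exactly the doc_name.json / doc_name.chunks.json / doc_name.gran.json naming listed above, and it treats files already named source.json, chunks.json, or gran.json as migrated.

```python
import shutil
from pathlib import Path

# Old-layout suffix -> new file name. Longer suffixes come first so that
# "doc.chunks.json" is not mistaken for a source file named "doc.chunks".
SUFFIX_TO_NAME = {
    ".chunks.json": "chunks.json",
    ".gran.json": "gran.json",
    ".json": "source.json",
}
NEW_NAMES = set(SUFFIX_TO_NAME.values())


def migrate_flat_layout(content_dir: Path) -> list[Path]:
    """Move doc_name*.json files into doc_name/ subdirectories (hypothetical helper)."""
    moved = []
    # sorted() materializes the listing before any files are moved.
    for old_file in sorted(content_dir.rglob("*.json")):
        if old_file.name in NEW_NAMES:
            continue  # assumed to be in the new layout already
        for suffix, new_name in SUFFIX_TO_NAME.items():
            if old_file.name.endswith(suffix):
                doc_name = old_file.name[: -len(suffix)]
                break
        if not doc_name:
            continue  # a bare ".json" file has no document name
        target = old_file.parent / doc_name / new_name
        target.parent.mkdir(exist_ok=True)
        shutil.move(str(old_file), str(target))
        moved.append(target)
    return moved
```

Running it on a copy of the data first is advisable, since a document whose name itself ends in ".chunks" or ".gran" would be misclassified by this suffix-based approach.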

Fixed

Addressed WatsonX rate limiting issues by reusing API clients across documents in worker batches
Fixed race condition causing duplicate key errors when multiple workers insert the same embedding model concurrently (now uses INSERT ... ON CONFLICT DO NOTHING)
Fixed db-destroy command to correctly parse project name from postgres-compose.yml instead of hardcoding "docs2db" (was failing to remove volumes for projects with different names)
Fixed ingest_file() to enforce the .json extension regardless of the caller-provided path, ensuring downstream chunking/embedding tools always find the correct files
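The INSERT ... ON CONFLICT DO NOTHING pattern behind the duplicate-key fix can be illustrated in isolation. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL (both accept this clause), and the models table, its columns, and get_or_create_model_id are hypothetical names, not the docs2db schema.

```python
import sqlite3


def get_or_create_model_id(conn: sqlite3.Connection, name: str) -> int:
    """Insert the model row if it is missing, then return its id.

    If another writer inserted the same name first, the INSERT becomes a
    no-op instead of raising a unique-constraint error.
    """
    conn.execute(
        "INSERT INTO models (name) VALUES (?) ON CONFLICT (name) DO NOTHING",
        (name,),
    )
    row = conn.execute("SELECT id FROM models WHERE name = ?", (name,)).fetchone()
    return row[0]


# Hypothetical schema for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)

first_id = get_or_create_model_id(conn, "example-model")
second_id = get_or_create_model_id(conn, "example-model")  # no error on repeat
```

The insert-then-select shape is one common way to use the clause; a PostgreSQL implementation might instead use RETURNING where a row was actually inserted.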
