### Added

- `document_needs_update()` function for external API callers to check whether documents need updating without knowing internal storage details (see the sketch after this list)
- Enhanced audit reporting with separate tracking for stale chunks and stale embeddings
- Zero-chunk document detection in audit (documents that legitimately have no chunks)
- Orphan directory detection in audit (directories without `source.json`)
- Audit command now accepts directory patterns for targeted auditing (e.g., `--pattern "external/**"` or `--pattern "allowed_kcs/*"`)
### Changed

- BREAKING: File storage now uses a subdirectory-based structure where each document gets its own directory:
  - Old: `doc_name.json`, `doc_name.chunks.json`, `doc_name.gran.json`
  - New: `doc_name/source.json`, `doc_name/chunks.json`, `doc_name/gran.json`
- BREAKING: `ingest_from_content()` now requires a `stream_name` parameter (e.g., `"document.html"`) for proper format detection
- BREAKING: `ingest_file()` and `ingest_from_content()` now accept directory paths instead of file paths (e.g., `content_dir/doc_name`, not `content_dir/doc_name/source.json`)
- BREAKING: Default glob patterns changed to `**/source.json` (for chunking) and `**/chunks.json` (for embedding)
- Audit command default pattern changed from `**/*.json` to `**` (directory-based pattern)
- Audit now validates that patterns don't include file extensions (patterns must match directories only)
- Audit results now show separate counts for chunks, embeddings, and zero-chunk documents
- Moved routine document analysis logging to DEBUG level; summarization events are still logged at INFO level
- Performance improvement: LLM API clients (WatsonX, OpenAI) are now reused across documents in the same batch instead of creating a new client for each document
- `LLMSession.__init__()` no longer takes a `doc_text` parameter; call `set_document(doc_text)` after initialization (see the migration sketch after this list)
- Added `LLM_PROVIDER` setting (defaults to `"openai"`) to explicitly choose between OpenAI-compatible and WatsonX providers; provider selection now respects Pydantic Settings precedence (CLI/env > .env file > defaults)
- Added `--llm-provider`, `--openai-url`, `--watsonx-url`, `--context-model`, and `--context-limit` CLI flags to the `chunk` and `pipeline` commands to explicitly control LLM provider settings
- Provider is inferred from URL flags if not explicitly specified (e.g., `--watsonx-url` → `watsonx`, `--openai-url` → `openai`)
- Removed validation that prevented both `--openai-url` and `--watsonx-url` from being set; provider selection is now explicit via `--llm-provider` or inferred from flags
- Settings access is now centralized in public API functions only (`generate_chunks()`, `generate_embeddings()`, `load_documents()`, `perform_audit()`); all internal functions require explicit parameters and validate inputs strictly (no global settings access)
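A before/after migration sketch for the `LLMSession` change. Only the removal of `doc_text` from `__init__()` and the new `set_document()` call are documented; the import path is an assumption, and any other constructor arguments are elided. Separating construction from document binding is presumably what lets a worker reuse one client across a batch of documents:

```python
from docs2db.llm import LLMSession  # import path assumed

doc_text = "...full document text..."

# Before (no longer supported):
#   session = LLMSession(doc_text=doc_text)

# After: construct the session first, then attach the document.
session = LLMSession()
session.set_document(doc_text)
```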
### Fixed

- Addressed WatsonX rate limiting issues by reusing API clients across documents in worker batches
- Fixed a race condition causing duplicate key errors when multiple workers insert the same embedding model concurrently (now uses `INSERT ... ON CONFLICT DO NOTHING`; see the sketch after this list)
- Fixed the `db-destroy` command to correctly parse the project name from `postgres-compose.yml` instead of hardcoding "docs2db" (it was failing to remove volumes for projects with different names)
- Fixed `ingest_file()` to enforce the `.json` extension regardless of the caller-provided path, ensuring downstream chunking/embedding tools always find the correct files
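A sketch of the duplicate-key fix, assuming psycopg 3 and an illustrative `embedding_models` table with a unique `name` column; only the `INSERT ... ON CONFLICT DO NOTHING` strategy itself comes from this changelog:

```python
import psycopg


def register_embedding_model(conn: psycopg.Connection, model_name: str) -> None:
    # Concurrent workers may race to insert the same model row.
    # ON CONFLICT DO NOTHING turns the losing insert into a no-op
    # instead of raising a duplicate-key error.
    # Table and column names here are illustrative, not the real schema.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO embedding_models (name) VALUES (%s) "
            "ON CONFLICT DO NOTHING",
            (model_name,),
        )
    conn.commit()
```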