Major Improvements
- LightRAG now supports multimodal document processing and can fully leverage images, tables, and formulas within documents to answer queries. All of RagAnything's multimodal processing capabilities have been merged into LightRAG (RagAnything will no longer receive core feature updates or maintenance going forward).
- Fully upgraded the document processing pipeline, with support for using MinerU and Docling to parse and extract file content, seamlessly integrating with multimodal content analysis and entity-relationship extraction.
- LightRAG now introduces Native Parsing, enabling high-quality content extraction from DOCX documents. It supports accurate reconstruction of Word auto-numbering, as well as extraction of images, tables, and formulas, providing seamless integration with multimodal content analysis and entity-relationship extraction. Expanded format support for the Native Parser is coming soon.
- Introduced four selectable text chunking strategies: Fix, Recursive, Vector, and Paragraph. The parameters for each chunking strategy can be configured through environment variables.
- The file processing pipeline supports selecting the content parsing engine and text chunking strategy either based on file extensions or on a per-file basis. For detailed usage instructions, refer to FileProcessingPipeline.md.
- Enabled task-aware embedding support for asymmetric models, including voyage-3, text-embedding-004, embed-multilingual-v3.0, and jina-embeddings-v3.
- Added optional JSON-structured LLM output to improve the reliability of the entity and relation extraction pipeline; set ENTITY_EXTRACTION_USE_JSON=true to enable it.
- Introduced ENTITY_TYPE_PROMPT_FILE, which lets users provide their own guidance for LLM-driven entity type recognition and extraction.
- Added full support for Amazon and Anthropic models through the AWS Bedrock API.
- Implemented role-specific LLM configuration support, introducing four distinct roles: EXTRACT, QUERY, KEYWORDS, and VLM, each with independent LLM settings. It is recommended to configure the KEYWORDS role with a small-parameter, non-reasoning, high-speed model to optimize query latency; the EXTRACT role with a medium-parameter, non-reasoning model to balance accuracy and throughput; and the QUERY role with a large-parameter reasoning model to enhance query quality. For detailed usage instructions, refer to RoleSpecificLLMConfiguration.md.
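To give a feel for how two of the chunking strategies listed above differ, here is a minimal, generic sketch of fixed-size and recursive splitting. It is an illustration only, not LightRAG's implementation: the function names `fixed_chunks` and `recursive_chunks` are hypothetical, sizes are counted in characters rather than tokens, and the separator order is an assumption.

```python
from __future__ import annotations


def fixed_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop early so the final window is never fully contained in the previous one.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def recursive_chunks(
    text: str, size: int, separators: tuple[str, ...] = ("\n\n", "\n", " ")
) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones when a piece is too long."""
    if len(text) <= size:
        return [text] if text.strip() else []
    for idx, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: list[str] = []
        current = ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= size:
                current = candidate  # keep packing pieces into the current chunk
                continue
            if current:
                chunks.append(current)
            if len(piece) <= size:
                current = piece
            else:
                # The piece alone is too long: retry with the finer separators.
                chunks.extend(recursive_chunks(piece, size, separators[idx + 1:]))
                current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator left to try: fall back to fixed-size windows.
    return fixed_chunks(text, size)


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
print(fixed_chunks("abcdefghij", 4, overlap=2))
print(recursive_chunks(sample, 12))
```

Fixed-size splitting ignores document structure, while the recursive variant tries to keep paragraphs and lines intact and only falls back to finer cuts when a piece exceeds the size limit. For the actual strategies and their parameters, FileProcessingPipeline.md remains the reference.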
What's Broken
- The ENTITY_TYPES environment variable has been deprecated; replace it with ENTITY_TYPE_PROMPT_FILE before launching this new version.
- For OpenSearch versions prior to 3.3.0, upgrade OpenSearch before upgrading LightRAG (see #2991).
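The two items above can be checked before upgrading with a small pre-flight script. The following is a sketch under stated assumptions, not a tool shipped with LightRAG: the helper names `parse_version` and `upgrade_blockers` are hypothetical, and the OpenSearch version string is assumed to be supplied by you (for example, copied from your cluster's info response).

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

# Minimum OpenSearch version named in the release notes.
MIN_OPENSEARCH = (3, 3, 0)


def parse_version(version: str) -> tuple[int, int, int]:
    """Turn '3.2.0' or '3.3.0-SNAPSHOT' into a comparable (major, minor, patch) tuple."""
    core = version.strip().split("-")[0].split("+")[0]
    parts = [int(p) for p in core.split(".")]
    parts += [0] * (3 - len(parts))  # pad '3.3' to (3, 3, 0)
    return (parts[0], parts[1], parts[2])


def upgrade_blockers(
    env: Mapping[str, str], opensearch_version: Optional[str] = None
) -> list[str]:
    """Return human-readable problems that should be fixed before upgrading."""
    problems: list[str] = []
    if env.get("ENTITY_TYPES"):
        problems.append(
            "ENTITY_TYPES is deprecated; replace it with ENTITY_TYPE_PROMPT_FILE"
        )
    if opensearch_version is not None and parse_version(opensearch_version) < MIN_OPENSEARCH:
        problems.append(
            f"OpenSearch {opensearch_version} is older than 3.3.0; upgrade it first"
        )
    return problems


if __name__ == "__main__":
    # Pass your cluster's version as the second argument if you use OpenSearch.
    for problem in upgrade_blockers(os.environ):
        print("BLOCKER:", problem)
```

Passing the environment as a mapping keeps the check easy to run against a parsed .env file as well as the live process environment.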
What's Changed
- feat: integrate structured extraction and multimodal role-based pipeline (Feat/multimodal pipeline) by @danielaskdd in #2830
- ♻️ refactor(documentManager): reorganize document status filtering by @danielaskdd in #2851
- Remove config.ini from compose samples by @danielaskdd in #2906
- fix: handle OpenAI length finish reason fallback by @danielaskdd in #2913
- feat: apply entity extraction best practice and add full-service comparisons by @MrGidea in #2914
- Feature/enhance entity extraction stability dev by @yunzhongxiaxi in #2864
- feat(extraction): configurable per-response entity/relation limits by @danielaskdd in #2950
- ♻️ refactor(llm): unify keyword extraction across providers by @danielaskdd in #2953
- ♻️ refactor(llm): unify structured output control via response_format by @danielaskdd in #2956
- refactor(gemini): improve default endpoint handling and sdk integration by @danielaskdd in #2957
- refactor(bedrock): support default and custom endpoints by @danielaskdd in #2958
- refactor(setup): use sentinel endpoints for Gemini and Bedrock defaults by @danielaskdd in #2959
- perf(postgres): use binary parameter for vector similarity queries by @wkpark in #2949
- feat(prompt): externalize entity type extraction profiles by @danielaskdd in #2964
- feat(bedrock): rename aws_bedrock to bedrock and add BindingOptions support by @danielaskdd in #2966
- chore(deps): bump react-router-dom from 7.14.0 to 7.14.1 in /lightrag_webui in the react group by @dependabot[bot] in #2967
- chore(deps-dev): bump the build-tools group in /lightrag_webui with 3 updates by @dependabot[bot] in #2968
- chore(deps): bump the frontend-minor-patch group across 1 directory with 3 updates by @dependabot[bot] in #2969
- Fix role LLM max async fallback by @danielaskdd in #2973
- fix(llm): tighten client and stream cleanup across LLM bindings by @danielaskdd in #2974
- chore(deps): bump lucide-react from 0.577.0 to 1.6.0 in /lightrag_webui by @dependabot[bot] in #2970
- docs: add role-specific LLM configuration guide by @danielaskdd in #2976
- refactor: unify role LLM config via ROLES registry + queue observability by @danielaskdd in #2978
- feat(status): role-based LLM observability and storage workspace info by @danielaskdd in #2980
- feat(rerank): add independent concurrency and timeout configuration by @danielaskdd in #2981
- Fix LLM cache role identity isolation by @danielaskdd in #2982
- Add role LLM provider options logging and change role provider options to start from empty not default by @danielaskdd in #2984
- Add Podman-compatible compose file by @tears710 in #2983
- Fix bedrock/gemini host leak from env.example on make server/storage by @danielaskdd in #2985
- fix: remove stream parameter from .parse() call when response_format is present by @PaulTitto in #2965
- Explicit voyageai embed support by @laszukdawid in #2484
- feat: Add task-aware embedding support by @StoreksFeed in #2560
- Improve DOCX parsing idempotency by @danielaskdd in #2987
- OpenSearch: Use version-aware sort tiebreaker for PIT search by @LantaoJin in #2991
- Refactor DOCX archive handling by @danielaskdd in #2994
- refactor: remove embedded docling and raganything fallbacks by @danielaskdd in #2997
- Refactor file parser routing by @danielaskdd in #2998
- Deduplicate documents by filename by @danielaskdd in #3000
- Fix parsed document artifact isolation by @danielaskdd in #3005
- Fix parser-hinted filename dedup by @danielaskdd in #3006
- feat(native): align docx parser with mineru LIGHTRAG output (.blocks.jsonl) by @danielaskdd in #3008
- ♻️ refactor(lightrag): split monolithic lightrag.py into focused modules by @danielaskdd in #3012
- feat(parser_routing): per-file process options and canonical basename by @danielaskdd in #3013
- feat(api): pipeline reentrancy guards and idempotent multimodal analyze by @danielaskdd in #3014
- fix(pipeline): preserve process_options in doc_status metadata across transitions by @danielaskdd in #3017
- feat(pipeline): resume already-extracted documents under current process_options by @danielaskdd in #3015
- refactor(pipeline): normalize parser / extraction metadata field names by @danielaskdd in #3021
- ♻️ refactor(pipeline): unify F chunking for raw and lightrag formats by @danielaskdd in #3023
- ♻️ refactor(docx): move native parser to lightrag/native_parser/docx by @danielaskdd in #3030
- chore(deps): bump react-router-dom from 7.14.1 to 7.14.2 in /lightrag_webui in the react group by @dependabot[bot] in #3001
- chore(deps): bump the ui-components group in /lightrag_webui with 3 updates by @dependabot[bot] in #3002
- chore(deps): bump axios from 1.15.1 to 1.15.2 in /lightrag_webui in the frontend-minor-patch group by @dependabot[bot] in #3004
- chore(deps-dev): bump the build-tools group across 1 directory with 5 updates by @dependabot[bot] in #3003
- Replace status tooltip with details modal by @g2424303264-code in #3025
- fix(webui): resolve react-hooks lint errors after eslint-plugin upgrade by @danielaskdd in #3036
- chore(deps): bump lucide-react from 1.9.0 to 1.14.0 in /lightrag_webui in the ui-components group by @dependabot[bot] in #3032
- chore(deps): bump sigma from 3.0.2 to 3.0.3 in /lightrag_webui in the graph-viz group by @dependabot[bot] in #3033
- chore(deps-dev): bump typescript-eslint from 8.59.1 to 8.59.2 in /lightrag_webui in the build-tools group by @dependabot[bot] in #3034
- chore(deps): bump i18next from 25.10.10 to 26.0.3 in /lightrag_webui by @dependabot[bot] in #3035
- Add prefix path to API and WebUI by @shahrin014 in #3007
- feat(webui): inject path prefix at runtime — one build serves all sites by @danielaskdd in #3039
- refactor(pipeline): split apipeline_process_enqueue_documents into helpers by @danielaskdd in #3041
- feat(openai): inject X-DashScope-Workspace header from DASHSCOPE_WORKSPACE_ID env by @getsov75-maker in #3042
- Update README by @danielaskdd in #3047
- Add paragraph semantic chunking strategy by @danielaskdd in #3044
- feat(chunker): add R/V chunkers and chunk_options snapshot mechanism by @danielaskdd in #3046
- Add offline sample retrieval check for RAGAS evaluation by @FU-max-boop in #3038
- fix(chunker): CJK punctuation support and chunk_token_size enforcement by @danielaskdd in #3050
- feat(chunker): split oversized tables on row boundaries before char fallback by @danielaskdd in #3051
- chore(deps): bump sigstore/cosign-installer from 4.1.1 to 4.1.2 in the github-actions group by @dependabot[bot] in #3053
- fix(chunker): preserve HTML table row group wrappers by @danielaskdd in #3055
- fix(PGGraphStorage): edge properties lost on upsert with Apache AGE by @oleksandr-kushnir in #3052
- feat(multimodal): backfill surrounding context on native sidecars by @danielaskdd in #3057
- docs: fix typo 'Papper' to 'Paper' in README by @viraj1995 in #3058
- feat(llm/vlm): unify image_inputs across bindings + VLM cache + master switch by @danielaskdd in #3063
- chore(deps): update google-genai requirement from <2.0.0,>=1.0.0 to >=1.0.0,<3.0.0 by @dependabot[bot] in #3062
- chore(deps): update pymilvus requirement from <3.0.0,>=2.6.2 to >=2.6.2,<4.0.0 by @dependabot[bot] in #3061
- Refactor multimodal: status semantics, nested chunk schema, per-chunk entity injection by @danielaskdd in #3064
- style: update EmptyCard and DocumentManager for improved layout by @JackLuguibin in #3060
- fix(opensearch): escape wildcard metacharacters in search_labels to prevent DoS (CWE-89) by @sebastiondev in #3026
- feat(docx): doc_title from first heading, table_header in sidecar by @danielaskdd in #3065
- feat(multimodal): defer sidecar surrounding to analyze entry; env-tunable budgets by @danielaskdd in #3066
- ✨ feat(sidecar): shorten item IDs by stripping doc- prefix by @danielaskdd in #3067
- feat(multimodal): strip parser-internal markup from sidecar surrounding by @danielaskdd in #3068
- ✨ feat(extract): enforce MAX_EXTRACT_INPUT_TOKENS for analyze & gleaning by @danielaskdd in #3073
- chore(deps): bump the react group in /lightrag_webui with 3 updates by @dependabot[bot] in #3069
- chore(deps-dev): bump the build-tools group in /lightrag_webui with 3 updates by @dependabot[bot] in #3070
- chore(deps): bump the frontend-minor-patch group across 1 directory with 5 updates by @dependabot[bot] in #3071
- chore(deps): bump react-i18next from 16.6.6 to 17.0.3 in /lightrag_webui by @dependabot[bot] in #3072
- refactor(multimodal): inline bracket-label format for mm chunks by @danielaskdd in #3074
- fix: extract Docling async markdown result by @he-yufeng in #3031
- feat(parse_mineru): unified sidecar writer + MinerU raw bundle cache by @danielaskdd in #3075
- feat(native_parser/docx): route through unified SidecarWriter by @danielaskdd in #3077
- feat: dedupe cross-filename uploads via merged_text normalization by @danielaskdd in #3078
- feat(mineru): split-by-heading block merging with markdown titles by @danielaskdd in #3079
- test: harden hermetic env fixture and fix MinerU put() fake by @danielaskdd in #3080
- feat(mineru): emit page-level positions from page_idx by @danielaskdd in #3081
- feat(docling): route parse_docling through sidecar bundle pipeline by @danielaskdd in #3085
- fix: guard against IndexError on empty LLM choices list by @qizwiz in #3086
- refactor(parser): rename adapters to ir_builder and consolidate parser packages by @danielaskdd in #3087
- feat(parser_cli): unified debug CLI for native / mineru / docling by @danielaskdd in #3088
- Tokenizer.encode: gracefully handle disallowed special tokens in content by @RooseveltAdvisors in #3082
- Improve MinerU parsing provenance and upload handling by @danielaskdd in #3089
- fix(parsers): drop empty-bodied tables to prevent analyze worker hard-failure by @danielaskdd in #3090
- feat(doc-status): add per-stage start time and parse-skipped flag by @danielaskdd in #3091
- feat(mineru): invalidate raw bundle cache on parser option changes by @danielaskdd in #3092
- refactor(full-docs): unify path handling and slim chunk_options snapshot by @danielaskdd in #3093
- refactor(pg): align PG storage fields with JSON storage for parity by @danielaskdd in #3094
- feat(redis): add basename and content_hash lookups for doc status by @danielaskdd in #3098
- refactor(mongo): align doc-status storage with JSON storage for parity by @danielaskdd in #3099
- feat(opensearch): add basename and content_hash lookups for doc status by @danielaskdd in #3100
- feat(pipeline-status): probe + throttled refresh for prompt scan/upload feedback by @danielaskdd in #3101
- feat(chunker): give P strategy a dedicated default chunk_token_size by @danielaskdd in #3102
New Contributors
- @yunzhongxiaxi made their first contribution in #2864
- @tears710 made their first contribution in #2983
- @PaulTitto made their first contribution in #2965
- @laszukdawid made their first contribution in #2484
- @g2424303264-code made their first contribution in #3025
- @shahrin014 made their first contribution in #3007
- @getsov75-maker made their first contribution in #3042
- @FU-max-boop made their first contribution in #3038
- @oleksandr-kushnir made their first contribution in #3052
- @viraj1995 made their first contribution in #3058
- @JackLuguibin made their first contribution in #3060
- @qizwiz made their first contribution in #3086
Full Changelog: v1.4.15...v1.5.0rc2