HKUDS/LightRAG v1.5.0 on GitHub

Major Improvements

🎉✨Feature: LightRAG now supports multimodal document processing and can fully leverage images, tables, and formulas within documents to answer queries. All of RagAnything’s multimodal processing capabilities have been merged into LightRAG (RagAnything will no longer receive core feature updates or maintenance going forward).
✨🎉Feature: Fully upgraded the document processing pipeline, with support for using MinerU and Docling to parse and extract file content, seamlessly integrating with multimodal content analysis and entity-relationship extraction.
💡✨Feature: LightRAG now introduces Native Parsing, enabling high-quality content extraction from DOCX documents. It supports accurate reconstruction of Word auto-numbering, as well as extraction of images, tables, and formulas, and integrates with multimodal content analysis and entity-relationship extraction. Expanded format support for the Native Parser is coming soon.
💡🧠Feature: Introduced four selectable text chunking strategies: Fix, Recursive, Vector, and Paragraph. The parameters for each chunking strategy can be configured through environment variables.
💡🎯Feature: The file processing pipeline supports selecting the file parsing engine and text chunking strategy either by file extension or on a per-file basis. For detailed usage instructions, refer to FileProcessingPipeline.md.
🚀⚡Performance: Optimized the vector storage persistence logic by deferring vector computations until the end of each file processing batch, enabling centralized bulk computation. This significantly reduces the number of embedding model invocations and speeds up upserts for every vector database that LightRAG supports.
Enabled task-aware embedding support for asymmetric models, including voyage-3, text-embedding-004, embed-multilingual-v3.0, and jina-embeddings-v3.
Improved entity/relation extraction reliability by introducing LLM JSON-structured output; set ENTITY_EXTRACTION_USE_JSON=true to enable it.
Introduced ENTITY_TYPE_PROMPT_FILE, which lets users supply their own guidance for LLM-driven entity type recognition and extraction.
Added full support for Amazon and Anthropic models through the AWS Bedrock API.
🎯✨🎉Feature: Implemented role-specific LLM configuration support, introducing four distinct roles: EXTRACT, QUERY, KEYWORDS, and VLM, each with independent LLM settings. It is recommended to configure the KEYWORDS role with a small-parameter, non-reasoning, high-speed model to optimize query latency; the EXTRACT role with a medium-parameter, non-reasoning model to balance accuracy and throughput; and the QUERY role with a large-parameter reasoning model to enhance query quality. For detailed usage instructions, refer to RoleSpecificLLMConfiguration.md.
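
The deferred-embedding optimization in the Performance item above can be pictured with a small sketch. This is not LightRAG's actual storage code: the class name `DeferredVectorBuffer`, its methods, and the in-memory `store` are hypothetical stand-ins used only to show the idea of buffering writes during a batch and making a single bulk embedding call at flush time.

```python
from typing import Callable, Dict, List


class DeferredVectorBuffer:
    """Hypothetical buffer: collects texts during a batch, embeds once at flush."""

    def __init__(self, embed_batch: Callable[[List[str]], List[List[float]]]):
        self._embed_batch = embed_batch
        self._pending: Dict[str, str] = {}       # id -> text awaiting embedding
        self.store: Dict[str, List[float]] = {}  # id -> vector (stand-in for a vector DB)

    def upsert(self, item_id: str, text: str) -> None:
        # No embedding call here; a later write to the same id replaces the earlier one.
        self._pending[item_id] = text

    def flush(self) -> int:
        # One bulk embedding call for everything buffered in this batch.
        if not self._pending:
            return 0
        ids = list(self._pending)
        vectors = self._embed_batch([self._pending[i] for i in ids])
        self.store.update(zip(ids, vectors))
        self._pending.clear()
        return len(ids)


calls: List[int] = []  # records the size of each embedding call


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Toy embedding for the demo: a one-dimensional vector holding the text length.
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


buf = DeferredVectorBuffer(fake_embed)
buf.upsert("e1", "Alice")
buf.upsert("e2", "Bob")
buf.upsert("e1", "Alice Smith")  # overwrites the pending text for e1
flushed = buf.flush()
print(flushed, calls)
```

In this toy run, three upserts lead to one embedding call covering two items, because the repeated write to `e1` is collapsed before anything is embedded. That collapsing plus the single bulk call is where a reduction in model invocations would come from under this design.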

What's Broken

💥Warning: Upgrades require careful planning for users currently utilizing OpenSearch and MongoDB graph storage, as data migration and system restarts may cause extended service outages.
⚠️The ENTITY_TYPES environment variable has been deprecated; please replace it with ENTITY_TYPE_PROMPT_FILE before launching this version.
💔Removed the deprecated QueryParam.model_func field from lightrag/base.py.
For OpenSearch versions prior to 3.3.0, upgrade OpenSearch before upgrading LightRAG (see #2991).
⚠️Moved delete_entity/delete_relation from the document API to the graph API:

Before	After
`DELETE /documents/delete_entity`	`DELETE /graph/entity/delete`
`DELETE /documents/delete_relation`	`DELETE /graph/relation/delete`
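
For client code that still targets the old routes, a small pre-upgrade helper along the following lines may be useful. It is a minimal sketch: the route pairs and the two environment variable names come from the notes above, while the function names `migrate_path` and `env_warnings` are hypothetical and not part of LightRAG. Request bodies and authentication are deliberately left out because they are not described here.

```python
import os
from typing import List, Mapping

# Old -> new DELETE routes, copied from the Before/After table above.
MOVED_ROUTES = {
    "/documents/delete_entity": "/graph/entity/delete",
    "/documents/delete_relation": "/graph/relation/delete",
}


def migrate_path(path: str) -> str:
    """Return the v1.5.0 route for a path; unrelated paths pass through unchanged."""
    return MOVED_ROUTES.get(path, path)


def env_warnings(env: Mapping[str, str]) -> List[str]:
    """Hypothetical pre-upgrade check for the deprecated ENTITY_TYPES variable."""
    warnings = []
    if "ENTITY_TYPES" in env:
        warnings.append(
            "ENTITY_TYPES is deprecated; use ENTITY_TYPE_PROMPT_FILE instead."
        )
    return warnings


if __name__ == "__main__":
    # Report any deprecated settings found in the current process environment.
    for message in env_warnings(os.environ):
        print(message)
    print(migrate_path("/documents/delete_entity"))
```

Passing the environment in as a mapping keeps the check easy to exercise against a parsed `.env` file as well as `os.environ`.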

What's Changed

feat: integrate structured extraction and multimodal role-based pipeline (Feat/multimodal pipeline) by @danielaskdd in #2830
♻️ refactor(documentManager): reorganize document status filtering by @danielaskdd in #2851
Remove config.ini from compose samples by @danielaskdd in #2906
fix: handle OpenAI length finish reason fallback by @danielaskdd in #2913
feat: apply entity extraction best practice and add full-service comparisons by @MrGidea in #2914
Feature/enhance entity extraction stability dev by @yunzhongxiaxi in #2864
feat(extraction): configurable per-response entity/relation limits by @danielaskdd in #2950
♻️ refactor(llm): unify keyword extraction across providers by @danielaskdd in #2953
♻️ refactor(llm): unify structured output control via response_format by @danielaskdd in #2956
refactor(gemini): improve default endpoint handling and sdk integration by @danielaskdd in #2957
refactor(bedrock): support default and custom endpoints by @danielaskdd in #2958
refactor(setup): use sentinel endpoints for Gemini and Bedrock defaults by @danielaskdd in #2959
perf(postgres): use binary parameter for vector similarity queries by @wkpark in #2949
feat(prompt): externalize entity type extraction profiles by @danielaskdd in #2964
feat(bedrock): rename aws_bedrock to bedrock and add BindingOptions support by @danielaskdd in #2966
chore(deps): bump react-router-dom from 7.14.0 to 7.14.1 in /lightrag_webui in the react group by @dependabot[bot] in #2967
chore(deps-dev): bump the build-tools group in /lightrag_webui with 3 updates by @dependabot[bot] in #2968
chore(deps): bump the frontend-minor-patch group across 1 directory with 3 updates by @dependabot[bot] in #2969
Fix role LLM max async fallback by @danielaskdd in #2973
fix(llm): tighten client and stream cleanup across LLM bindings by @danielaskdd in #2974
chore(deps): bump lucide-react from 0.577.0 to 1.6.0 in /lightrag_webui by @dependabot[bot] in #2970
docs: add role-specific LLM configuration guide by @danielaskdd in #2976
refactor: unify role LLM config via ROLES registry + queue observability by @danielaskdd in #2978
feat(status): role-based LLM observability and storage workspace info by @danielaskdd in #2980
feat(rerank): add independent concurrency and timeout configuration by @danielaskdd in #2981
Fix LLM cache role identity isolation by @danielaskdd in #2982
Add role LLM provider options logging and change role provider options to start from empty not default by @danielaskdd in #2984
Add Podman-compatible compose file by @tears710 in #2983
Fix bedrock/gemini host leak from env.example on make server/storage by @danielaskdd in #2985
fix: remove stream parameter from .parse() call when response_format is present by @PaulTitto in #2965
Explicit voyageai embed support by @laszukdawid in #2484
feat: Add task-aware embedding support by @StoreksFeed in #2560
Improve DOCX parsing idempotency by @danielaskdd in #2987
OpenSearch: Use version-aware sort tiebreaker for PIT search by @LantaoJin in #2991
Refactor DOCX archive handling by @danielaskdd in #2994
refactor: remove embedded docling and raganything fallbacks by @danielaskdd in #2997
Refactor file parser routing by @danielaskdd in #2998
Deduplicate documents by filename by @danielaskdd in #3000
Fix parsed document artifact isolation by @danielaskdd in #3005
Fix parser-hinted filename dedup by @danielaskdd in #3006
feat(native): align docx parser with mineru LIGHTRAG output (.blocks.jsonl) by @danielaskdd in #3008
♻️ refactor(lightrag): split monolithic lightrag.py into focused modules by @danielaskdd in #3012
feat(parser_routing): per-file process options and canonical basename by @danielaskdd in #3013
feat(api): pipeline reentrancy guards and idempotent multimodal analyze by @danielaskdd in #3014
fix(pipeline): preserve process_options in doc_status metadata across transitions by @danielaskdd in #3017
feat(pipeline): resume already-extracted documents under current process_options by @danielaskdd in #3015
refactor(pipeline): normalize parser / extraction metadata field names by @danielaskdd in #3021
♻️ refactor(pipeline): unify F chunking for raw and lightrag formats by @danielaskdd in #3023
♻️ refactor(docx): move native parser to lightrag/native_parser/docx by @danielaskdd in #3030
chore(deps): bump react-router-dom from 7.14.1 to 7.14.2 in /lightrag_webui in the react group by @dependabot[bot] in #3001
chore(deps): bump the ui-components group in /lightrag_webui with 3 updates by @dependabot[bot] in #3002
chore(deps): bump axios from 1.15.1 to 1.15.2 in /lightrag_webui in the frontend-minor-patch group by @dependabot[bot] in #3004
chore(deps-dev): bump the build-tools group across 1 directory with 5 updates by @dependabot[bot] in #3003
Replace status tooltip with details modal by @g2424303264-code in #3025
fix(webui): resolve react-hooks lint errors after eslint-plugin upgrade by @danielaskdd in #3036
chore(deps): bump lucide-react from 1.9.0 to 1.14.0 in /lightrag_webui in the ui-components group by @dependabot[bot] in #3032
chore(deps): bump sigma from 3.0.2 to 3.0.3 in /lightrag_webui in the graph-viz group by @dependabot[bot] in #3033
chore(deps-dev): bump typescript-eslint from 8.59.1 to 8.59.2 in /lightrag_webui in the build-tools group by @dependabot[bot] in #3034
chore(deps): bump i18next from 25.10.10 to 26.0.3 in /lightrag_webui by @dependabot[bot] in #3035
Add prefix path to API and WebUI by @shahrin014 in #3007
feat(webui): inject path prefix at runtime — one build serves all sites by @danielaskdd in #3039
refactor(pipeline): split apipeline_process_enqueue_documents into helpers by @danielaskdd in #3041
feat(openai): inject X-DashScope-Workspace header from DASHSCOPE_WORKSPACE_ID env by @getsov75-maker in #3042
Update README by @danielaskdd in #3047
Add paragraph semantic chunking strategy by @danielaskdd in #3044
feat(chunker): add R/V chunkers and chunk_options snapshot mechanism by @danielaskdd in #3046
Add offline sample retrieval check for RAGAS evaluation by @FU-max-boop in #3038
fix(chunker): CJK punctuation support and chunk_token_size enforcement by @danielaskdd in #3050
feat(chunker): split oversized tables on row boundaries before char fallback by @danielaskdd in #3051
chore(deps): bump sigstore/cosign-installer from 4.1.1 to 4.1.2 in the github-actions group by @dependabot[bot] in #3053
fix(chunker): preserve HTML table row group wrappers by @danielaskdd in #3055
fix(PGGraphStorage): edge properties lost on upsert with Apache AGE by @oleksandr-kushnir in #3052
feat(multimodal): backfill surrounding context on native sidecars by @danielaskdd in #3057
docs: fix typo 'Papper' to 'Paper' in README by @viraj1995 in #3058
feat(llm/vlm): unify image_inputs across bindings + VLM cache + master switch by @danielaskdd in #3063
chore(deps): update google-genai requirement from <2.0.0,>=1.0.0 to >=1.0.0,<3.0.0 by @dependabot[bot] in #3062
chore(deps): update pymilvus requirement from <3.0.0,>=2.6.2 to >=2.6.2,<4.0.0 by @dependabot[bot] in #3061
Refactor multimodal: status semantics, nested chunk schema, per-chunk entity injection by @danielaskdd in #3064
style: update EmptyCard and DocumentManager for improved layout by @JackLuguibin in #3060
fix(opensearch): escape wildcard metacharacters in search_labels to prevent DoS (CWE-89) by @sebastiondev in #3026
feat(docx): doc_title from first heading, table_header in sidecar by @danielaskdd in #3065
feat(multimodal): defer sidecar surrounding to analyze entry; env-tunable budgets by @danielaskdd in #3066
✨ feat(sidecar): shorten item IDs by stripping doc- prefix by @danielaskdd in #3067
feat(multimodal): strip parser-internal markup from sidecar surrounding by @danielaskdd in #3068
✨ feat(extract): enforce MAX_EXTRACT_INPUT_TOKENS for analyze & gleaning by @danielaskdd in #3073
chore(deps): bump the react group in /lightrag_webui with 3 updates by @dependabot[bot] in #3069
chore(deps-dev): bump the build-tools group in /lightrag_webui with 3 updates by @dependabot[bot] in #3070
chore(deps): bump the frontend-minor-patch group across 1 directory with 5 updates by @dependabot[bot] in #3071
chore(deps): bump react-i18next from 16.6.6 to 17.0.3 in /lightrag_webui by @dependabot[bot] in #3072
refactor(multimodal): inline bracket-label format for mm chunks by @danielaskdd in #3074
fix: extract Docling async markdown result by @he-yufeng in #3031
feat(parse_mineru): unified sidecar writer + MinerU raw bundle cache by @danielaskdd in #3075
feat(native_parser/docx): route through unified SidecarWriter by @danielaskdd in #3077
feat: dedupe cross-filename uploads via merged_text normalization by @danielaskdd in #3078
feat(mineru): split-by-heading block merging with markdown titles by @danielaskdd in #3079
test: harden hermetic env fixture and fix MinerU put() fake by @danielaskdd in #3080
feat(mineru): emit page-level positions from page_idx by @danielaskdd in #3081
feat(docling): route parse_docling through sidecar bundle pipeline by @danielaskdd in #3085
fix: guard against IndexError on empty LLM choices list by @qizwiz in #3086
refactor(parser): rename adapters to ir_builder and consolidate parser packages by @danielaskdd in #3087
feat(parser_cli): unified debug CLI for native / mineru / docling by @danielaskdd in #3088
Tokenizer.encode: gracefully handle disallowed special tokens in content by @RooseveltAdvisors in #3082
Improve MinerU parsing provenance and upload handling by @danielaskdd in #3089
fix(parsers): drop empty-bodied tables to prevent analyze worker hard-failure by @danielaskdd in #3090
feat(doc-status): add per-stage start time and parse-skipped flag by @danielaskdd in #3091
feat(mineru): invalidate raw bundle cache on parser option changes by @danielaskdd in #3092
refactor(full-docs): unify path handling and slim chunk_options snapshot by @danielaskdd in #3093
refactor(pg): align PG storage fields with JSON storage for parity by @danielaskdd in #3094
feat(redis): add basename and content_hash lookups for doc status by @danielaskdd in #3098
refactor(mongo): align doc-status storage with JSON storage for parity by @danielaskdd in #3099
feat(opensearch): add basename and content_hash lookups for doc status by @danielaskdd in #3100
feat(pipeline-status): probe + throttled refresh for prompt scan/upload feedback by @danielaskdd in #3101
feat(chunker): give P strategy a dedicated default chunk_token_size by @danielaskdd in #3102
Setup wizard: role-specific LLM models, Bedrock auth docs, env tidy by @danielaskdd in #3116
♻️ refactor(parser): consolidate parser modules under lightrag/parser/ by @danielaskdd in #3117
build(deps): bump react-router-dom from 7.15.0 to 7.15.1 in /lightrag_webui in the react group by @dependabot[bot] in #3109
build(deps): bump the ui-components group in /lightrag_webui with 4 updates by @dependabot[bot] in #3110
feat(pipeline): surface parse/analyze progress to pipeline_status by @danielaskdd in #3118
feat(docx-parser): drop revision/comment markers and skip empty tables by @danielaskdd in #3119
fix: make EMBEDDING_TOKEN_LIMIT actually enforce token truncation by @kennedydqz-del in #3105
Revert "fix: make EMBEDDING_TOKEN_LIMIT actually enforce token truncation" by @danielaskdd in #3120
fix: resolve 3 bugs in Anthropic/Claude LLM provider by @kennedydqz-del in #3106
fix(pipeline): emit per-item analyze logs in real time by @danielaskdd in #3122
feat(pipeline): propagate /cancel_pipeline into PARSE and ANALYZE by @danielaskdd in #3124
feat(webui): surface parser routing + MinerU/Docling config in status card by @danielaskdd in #3125
fix: sync API docs dark theme by @he-yufeng in #3123
write_nx_graph: atomic write via per-writer .tmp + os.replace by @RooseveltAdvisors in #3083
fix(PGGraphStorage): serialise concurrent upsert_edge with advisory lock by @oleksandr-kushnir in #3056
fix(server): normalize scope.path so Mount works in proxy-strip mode by @Xaverrrrr in #3128
fix(concurrency): wrap ainsert_custom_kg writes with keyed lock (follow-up to #3056) by @danielaskdd in #3129
Remove quotes from status details by @g2424303264-code in #3127
perf(opensearch): batch vector upserts to reduce HTTP roundtrips by @Saswatsusmoy in #3043
perf(opensearch): batch KV operations under namespace lock by @danielaskdd in #3133
fix(storage): atomic index_done_callback for Faiss/Json/Nano by @danielaskdd in #3134
♻️ refactor(tests): mirror feature folders under tests/ by @danielaskdd in #3135
🐛 fix(setup): treat opensearch-dashboards as opensearch sub-service by @danielaskdd in #3136
Remove deprecated QueryParam.model_func and related code by @danielaskdd in #3137
✨ feat(pipeline): record parse/analyze stage end-time and skipped metadata by @danielaskdd in #3138
feat(graph): block graph mutations while pipeline is busy by @danielaskdd in #3139
Add PostgreSQL 18 AGE pgvector image by @danielaskdd in #3140
feat(setup): allow custom PostgreSQL credentials for Docker deployments by @danielaskdd in #3141
feat(api): add chunking strategy selection to /documents/text(s) by @danielaskdd in #3142
perf(opensearch): defer vector embedding until flush by @danielaskdd in #3143
perf(nano): defer vector embedding to flush time by @danielaskdd in #3145
perf(faiss): defer vector embedding to flush time by @danielaskdd in #3147
🐛 fix(operate): correct final_data key path in debug log by @danielaskdd in #3148
perf(milvus): defer vector embedding until flush by @danielaskdd in #3149
perf(mongo): defer vector embedding until flush by @danielaskdd in #3150
perf(qdrant): defer vector embedding until flush by @danielaskdd in #3151
refact(postgres): defer PGVector embedding to index_done_callback by @danielaskdd in #3152
🐛 fix(kg): post-prune pending buffer in vector delete_entity_relation by @danielaskdd in #3154
build(deps): bump the content-rendering group across 1 directory with 2 updates by @dependabot[bot] in #3112
build(deps-dev): bump @types/react from 19.2.14 to 19.2.15 in /lightrag_webui in the react group by @dependabot[bot] in #3155
fix(llm): retry transient OpenAI 5xx and 'could not parse JSON body' 400s by @cvidaillac in #3144
build(deps-dev): bump the build-tools group across 1 directory with 5 updates by @dependabot[bot] in #3156
build(deps): bump the frontend-minor-patch group across 1 directory with 5 updates by @dependabot[bot] in #3157
refactor(parser): make every recognized heading its own block by @danielaskdd in #3159
feat(parser): render heading lines with a markdown # prefix by @danielaskdd in #3160
🔧 chore(neo4j): drop @Final from Neo4JStorage to allow subclassing by @danielaskdd in #3163
fix(sidecar): use source spans for chunk backfill by @danielaskdd in #3162
🐛 fix(milvus): batch flush upserts/deletes under gRPC message limit by @danielaskdd in #3164
🐛 fix(qdrant): resilient flush — own-batch oversized points, batch deletes by @danielaskdd in #3165
🔧 fix(milvus): shrink upsert batch count to 128 (align with Qdrant) by @danielaskdd in #3166
feat(mongo): batch bulk_write upserts/deletes across all MongoDB storage paths by @danielaskdd in #3167
✨ feat(opensearch): bound bulk write payload/record size by @danielaskdd in #3168
✨ feat(postgres): bound non-graph bulk upsert/delete payload/record size by @danielaskdd in #3169
⚡ feat(postgres): chunk-level transaction fallback for PGGraph batch writes by @danielaskdd in #3170
🔒 fix(opensearch): canonical edge ids to close reciprocal upsert race by @danielaskdd in #3171
⚡️ perf(mongo): exact single-edge reads use canonical (edge_lo, edge_hi) index by @danielaskdd in #3174
🔒 fix(mongo): canonical edge_key + unique index to close upsert duplicate race by @danielaskdd in #3172
🔒 fix(opensearch): merge reverse payload into canonical on duplicate migration by @danielaskdd in #3173
⚡ perf(opensearch): collapse edge point-lookups onto canonical _id by @danielaskdd in #3175
🐛 fix(postgres): rank get_knowledge_graph("*") nodes by undirected degree by @danielaskdd in #3176
🔒 fix(mongo): rebuild FAILED Atlas vector index instead of wedging queries by @danielaskdd in #3177
♻️ refactor(api): move delete_entity/delete_relation to graph API (breaking) by @danielaskdd in #3178
♻️ refactor(pipeline): rename metadata fields & add source_file back-compat by @danielaskdd in #3179
🔧 refactor(constants): centralize LLM call priorities and align analysis stage with processing by @danielaskdd in #3180
Slim queue payloads by re-reading document bodies from storage by @danielaskdd in #3182
chore(config): bump default MAX_PARALLEL_INSERT to 3 by @danielaskdd in #3183
Replace sys.exit with exceptions in DOCX parser validation by @danielaskdd in #3184
♻️ refactor(webui): split document status filter tabs by @danielaskdd in #3185
Split queue size config into parse and analyze stages by @danielaskdd in #3186
🐛 fix(pipeline): byte-aware entity truncation & fail-fast on storage flush errors by @danielaskdd in #3187
🐛 fix(pipeline): clean stale per-attempt metadata earlier for PENDING docs by @danielaskdd in #3188
🐛 fix(parser): split over-long outline headings at soft line break by @danielaskdd in #3190
🐛 fix(pipeline): guarantee worker teardown + isolate enqueue get_by_id on backend outage by @danielaskdd in #3191

New Contributors

@yunzhongxiaxi made their first contribution in #2864
@tears710 made their first contribution in #2983
@PaulTitto made their first contribution in #2965
@laszukdawid made their first contribution in #2484
@g2424303264-code made their first contribution in #3025
@shahrin014 made their first contribution in #3007
@getsov75-maker made their first contribution in #3042
@FU-max-boop made their first contribution in #3038
@oleksandr-kushnir made their first contribution in #3052
@viraj1995 made their first contribution in #3058
@JackLuguibin made their first contribution in #3060
@qizwiz made their first contribution in #3086
@kennedydqz-del made their first contribution in #3105
@Xaverrrrr made their first contribution in #3128
@Saswatsusmoy made their first contribution in #3043
@cvidaillac made their first contribution in #3144

Full Changelog: v1.4.16...v1.5.0