github HKUDS/LightRAG v1.5.0rc2

4 hours ago

Major Improvements

  • LightRAG now supports multimodal document processing and can fully leverage images, tables, and formulas within documents to answer queries. All of RagAnything’s multimodal processing capabilities have been merged into LightRAG (RagAnything will no longer receive core feature updates or maintenance going forward).
  • Fully upgraded the document processing pipeline, with support for using MinerU and Docling to parse and extract file content, seamlessly integrating with multimodal content analysis and entity-relationship extraction.
  • LightRAG now introduces Native Parsing, enabling high-quality content extraction from DOCX documents. It supports accurate reconstruction of Word auto-numbering, as well as extraction of images, tables, and formulas, providing seamless integration with multimodal content analysis and entity-relationship extraction. Expanded format support for the Native Parser is coming soon.
  • Introduced four selectable text chunking strategies: Fix, Recursive, Vector, and Paragraph. The parameters for each chunking strategy can be configured through environment variables.
  • The file processing pipeline supports selecting the content parsing engine and text chunking strategy either based on file extensions or on a per-file basis. For detailed usage instructions, refer to FileProcessingPipeline.md.
  • Enabled task-aware embedding support for asymmetric models, including voyage-3, text-embedding-004, embed-multilingual-v3.0, and jina-embeddings-v3.
  • Added optional JSON-structured LLM output to improve the stability and reliability of the entity and relation extraction pipeline; set ENTITY_EXTRACTION_USE_JSON=true to enable it.
  • Introduced ENTITY_TYPE_PROMPT_FILE, which lets users supply custom guidance for LLM-driven entity type recognition and extraction.
  • Added full support for Amazon and Anthropic models through the AWS Bedrock API.
  • Implemented role-specific LLM configuration support, introducing four distinct roles: EXTRACT, QUERY, KEYWORDS, and VLM, each with independent LLM settings. It is recommended to configure the KEYWORDS role with a small-parameter, non-reasoning, high-speed model to optimize query latency; the EXTRACT role with a medium-parameter, non-reasoning model to balance accuracy and throughput; and the QUERY role with a large-parameter reasoning model to enhance query quality. For detailed usage instructions, refer to RoleSpecificLLMConfiguration.md.
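The role-specific configuration described above can be pictured with a small sketch. Only the four role names (EXTRACT, QUERY, KEYWORDS, VLM) come from these notes; the `<ROLE>_MODEL` variable naming, the `RoleLLMConfig` type, and the fallback behaviour are hypothetical illustrations, not LightRAG's actual settings. The real variable names are documented in RoleSpecificLLMConfiguration.md.

```python
from dataclasses import dataclass

# The four role names are taken from the release notes; everything else in
# this sketch (variable naming scheme, fallback rule) is hypothetical.
ROLES = ("EXTRACT", "QUERY", "KEYWORDS", "VLM")


@dataclass(frozen=True)
class RoleLLMConfig:
    role: str
    model: str


def resolve_role_configs(env, default_model):
    """Return one config per role, falling back to default_model.

    Reads hypothetical variables named <ROLE>_MODEL (e.g. QUERY_MODEL)
    from the given mapping; blank or missing values use the default.
    """
    configs = {}
    for role in ROLES:
        model = env.get(f"{role}_MODEL", "").strip() or default_model
        configs[role] = RoleLLMConfig(role=role, model=model)
    return configs
```

Calling `resolve_role_configs(os.environ, "base-model")` would then yield an independent model choice per role, for instance a small fast model for KEYWORDS and a larger reasoning model for QUERY, as the recommendation above suggests.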
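As a rough illustration of the extension-based and per-file selection mentioned in the list above, the sketch below picks a parsing engine and chunking strategy for a file. The engine and strategy names (Native, MinerU, Docling; Fix, Recursive, Vector, Paragraph) come from these notes, but the `ProcessOptions` type, the mapping, the defaults, and the lower-case identifiers are assumptions made for illustration. The actual configuration is described in FileProcessingPipeline.md.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class ProcessOptions(NamedTuple):
    parser: str    # e.g. "native", "mineru", "docling" (identifiers assumed)
    chunking: str  # e.g. "Fix", "Recursive", "Vector", "Paragraph"


# Hypothetical defaults, chosen only to make the example concrete.
DEFAULT_OPTIONS = ProcessOptions("docling", "Fix")
OPTIONS_BY_EXTENSION = {
    ".docx": ProcessOptions("native", "Paragraph"),
    ".pdf": ProcessOptions("mineru", "Recursive"),
}


def select_options(filename: str,
                   per_file: Optional[ProcessOptions] = None) -> ProcessOptions:
    """Per-file options win; otherwise fall back to the extension mapping."""
    if per_file is not None:
        return per_file
    extension = Path(filename).suffix.lower()
    return OPTIONS_BY_EXTENSION.get(extension, DEFAULT_OPTIONS)
```

The per-file override is checked first so that a single document can deviate from the extension-wide rule without changing the mapping for every other file of that type.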

What's Broken

  • The ENTITY_TYPES environment variable has been deprecated; replace it with ENTITY_TYPE_PROMPT_FILE before launching this version.
  • For OpenSearch versions prior to 3.3.0, upgrade OpenSearch before upgrading LightRAG (see #2991).
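The two breaking changes above lend themselves to a quick pre-upgrade check. The sketch below is not part of LightRAG: the function names are hypothetical, and the OpenSearch version string is assumed to be supplied by the caller (for example, copied from the cluster's info endpoint). Only the variable names ENTITY_TYPES and ENTITY_TYPE_PROMPT_FILE and the 3.3.0 threshold come from these notes.

```python
import re

# Minimum OpenSearch version named in the release notes.
MIN_OPENSEARCH = (3, 3, 0)


def parse_version(text):
    """Parse the leading 'major.minor.patch' of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def upgrade_warnings(env, opensearch_version=None):
    """Return human-readable warnings for the breaking changes listed above."""
    warnings = []
    # ENTITY_TYPES is deprecated in favour of ENTITY_TYPE_PROMPT_FILE.
    if env.get("ENTITY_TYPES"):
        if env.get("ENTITY_TYPE_PROMPT_FILE"):
            warnings.append("ENTITY_TYPES is deprecated and can be removed.")
        else:
            warnings.append(
                "ENTITY_TYPES is deprecated; set ENTITY_TYPE_PROMPT_FILE instead."
            )
    # Tuple comparison orders versions numerically, so 3.10.0 > 3.3.0.
    if opensearch_version is not None:
        if parse_version(opensearch_version) < MIN_OPENSEARCH:
            warnings.append("Upgrade OpenSearch to 3.3.0 or later first.")
    return warnings
```

Running `upgrade_warnings(os.environ, "3.2.0")` before deployment would surface both issues in one pass; an empty list means neither breaking change applies.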

What's Changed

  • feat: integrate structured extraction and multimodal role-based pipeline (Feat/multimodal pipeline) by @danielaskdd in #2830
  • ♻️ refactor(documentManager): reorganize document status filtering by @danielaskdd in #2851
  • Remove config.ini from compose samples by @danielaskdd in #2906
  • fix: handle OpenAI length finish reason fallback by @danielaskdd in #2913
  • feat: apply entity extraction best practice and add full-service comparisons by @MrGidea in #2914
  • Feature/enhance entity extraction stability dev by @yunzhongxiaxi in #2864
  • feat(extraction): configurable per-response entity/relation limits by @danielaskdd in #2950
  • ♻️ refactor(llm): unify keyword extraction across providers by @danielaskdd in #2953
  • ♻️ refactor(llm): unify structured output control via response_format by @danielaskdd in #2956
  • refactor(gemini): improve default endpoint handling and sdk integration by @danielaskdd in #2957
  • refactor(bedrock): support default and custom endpoints by @danielaskdd in #2958
  • refactor(setup): use sentinel endpoints for Gemini and Bedrock defaults by @danielaskdd in #2959
  • perf(postgres): use binary parameter for vector similarity queries by @wkpark in #2949
  • feat(prompt): externalize entity type extraction profiles by @danielaskdd in #2964
  • feat(bedrock): rename aws_bedrock to bedrock and add BindingOptions support by @danielaskdd in #2966
  • chore(deps): bump react-router-dom from 7.14.0 to 7.14.1 in /lightrag_webui in the react group by @dependabot[bot] in #2967
  • chore(deps-dev): bump the build-tools group in /lightrag_webui with 3 updates by @dependabot[bot] in #2968
  • chore(deps): bump the frontend-minor-patch group across 1 directory with 3 updates by @dependabot[bot] in #2969
  • Fix role LLM max async fallback by @danielaskdd in #2973
  • fix(llm): tighten client and stream cleanup across LLM bindings by @danielaskdd in #2974
  • chore(deps): bump lucide-react from 0.577.0 to 1.6.0 in /lightrag_webui by @dependabot[bot] in #2970
  • docs: add role-specific LLM configuration guide by @danielaskdd in #2976
  • refactor: unify role LLM config via ROLES registry + queue observability by @danielaskdd in #2978
  • feat(status): role-based LLM observability and storage workspace info by @danielaskdd in #2980
  • feat(rerank): add independent concurrency and timeout configuration by @danielaskdd in #2981
  • Fix LLM cache role identity isolation by @danielaskdd in #2982
  • Add role LLM provider options logging and change role provider options to start from empty not default by @danielaskdd in #2984
  • Add Podman-compatible compose file by @tears710 in #2983
  • Fix bedrock/gemini host leak from env.example on make server/storage by @danielaskdd in #2985
  • fix: remove stream parameter from .parse() call when response_format is present by @PaulTitto in #2965
  • Explicit voyageai embed support by @laszukdawid in #2484
  • feat: Add task-aware embedding support by @StoreksFeed in #2560
  • Improve DOCX parsing idempotency by @danielaskdd in #2987
  • OpenSearch: Use version-aware sort tiebreaker for PIT search by @LantaoJin in #2991
  • Refactor DOCX archive handling by @danielaskdd in #2994
  • refactor: remove embedded docling and raganything fallbacks by @danielaskdd in #2997
  • Refactor file parser routing by @danielaskdd in #2998
  • Deduplicate documents by filename by @danielaskdd in #3000
  • Fix parsed document artifact isolation by @danielaskdd in #3005
  • Fix parser-hinted filename dedup by @danielaskdd in #3006
  • feat(native): align docx parser with mineru LIGHTRAG output (.blocks.jsonl) by @danielaskdd in #3008
  • ♻️ refactor(lightrag): split monolithic lightrag.py into focused modules by @danielaskdd in #3012
  • feat(parser_routing): per-file process options and canonical basename by @danielaskdd in #3013
  • feat(api): pipeline reentrancy guards and idempotent multimodal analyze by @danielaskdd in #3014
  • fix(pipeline): preserve process_options in doc_status metadata across transitions by @danielaskdd in #3017
  • feat(pipeline): resume already-extracted documents under current process_options by @danielaskdd in #3015
  • refactor(pipeline): normalize parser / extraction metadata field names by @danielaskdd in #3021
  • ♻️ refactor(pipeline): unify F chunking for raw and lightrag formats by @danielaskdd in #3023
  • ♻️ refactor(docx): move native parser to lightrag/native_parser/docx by @danielaskdd in #3030
  • chore(deps): bump react-router-dom from 7.14.1 to 7.14.2 in /lightrag_webui in the react group by @dependabot[bot] in #3001
  • chore(deps): bump the ui-components group in /lightrag_webui with 3 updates by @dependabot[bot] in #3002
  • chore(deps): bump axios from 1.15.1 to 1.15.2 in /lightrag_webui in the frontend-minor-patch group by @dependabot[bot] in #3004
  • chore(deps-dev): bump the build-tools group across 1 directory with 5 updates by @dependabot[bot] in #3003
  • Replace status tooltip with details modal by @g2424303264-code in #3025
  • fix(webui): resolve react-hooks lint errors after eslint-plugin upgrade by @danielaskdd in #3036
  • chore(deps): bump lucide-react from 1.9.0 to 1.14.0 in /lightrag_webui in the ui-components group by @dependabot[bot] in #3032
  • chore(deps): bump sigma from 3.0.2 to 3.0.3 in /lightrag_webui in the graph-viz group by @dependabot[bot] in #3033
  • chore(deps-dev): bump typescript-eslint from 8.59.1 to 8.59.2 in /lightrag_webui in the build-tools group by @dependabot[bot] in #3034
  • chore(deps): bump i18next from 25.10.10 to 26.0.3 in /lightrag_webui by @dependabot[bot] in #3035
  • Add prefix path to API and WebUI by @shahrin014 in #3007
  • feat(webui): inject path prefix at runtime — one build serves all sites by @danielaskdd in #3039
  • refactor(pipeline): split apipeline_process_enqueue_documents into helpers by @danielaskdd in #3041
  • feat(openai): inject X-DashScope-Workspace header from DASHSCOPE_WORKSPACE_ID env by @getsov75-maker in #3042
  • Update README by @danielaskdd in #3047
  • Add paragraph semantic chunking strategy by @danielaskdd in #3044
  • feat(chunker): add R/V chunkers and chunk_options snapshot mechanism by @danielaskdd in #3046
  • Add offline sample retrieval check for RAGAS evaluation by @FU-max-boop in #3038
  • fix(chunker): CJK punctuation support and chunk_token_size enforcement by @danielaskdd in #3050
  • feat(chunker): split oversized tables on row boundaries before char fallback by @danielaskdd in #3051
  • chore(deps): bump sigstore/cosign-installer from 4.1.1 to 4.1.2 in the github-actions group by @dependabot[bot] in #3053
  • fix(chunker): preserve HTML table row group wrappers by @danielaskdd in #3055
  • fix(PGGraphStorage): edge properties lost on upsert with Apache AGE by @oleksandr-kushnir in #3052
  • feat(multimodal): backfill surrounding context on native sidecars by @danielaskdd in #3057
  • docs: fix typo 'Papper' to 'Paper' in README by @viraj1995 in #3058
  • feat(llm/vlm): unify image_inputs across bindings + VLM cache + master switch by @danielaskdd in #3063
  • chore(deps): update google-genai requirement from <2.0.0,>=1.0.0 to >=1.0.0,<3.0.0 by @dependabot[bot] in #3062
  • chore(deps): update pymilvus requirement from <3.0.0,>=2.6.2 to >=2.6.2,<4.0.0 by @dependabot[bot] in #3061
  • Refactor multimodal: status semantics, nested chunk schema, per-chunk entity injection by @danielaskdd in #3064
  • style: update EmptyCard and DocumentManager for improved layout by @JackLuguibin in #3060
  • fix(opensearch): escape wildcard metacharacters in search_labels to prevent DoS (CWE-89) by @sebastiondev in #3026
  • feat(docx): doc_title from first heading, table_header in sidecar by @danielaskdd in #3065
  • feat(multimodal): defer sidecar surrounding to analyze entry; env-tunable budgets by @danielaskdd in #3066
  • ✨ feat(sidecar): shorten item IDs by stripping doc- prefix by @danielaskdd in #3067
  • feat(multimodal): strip parser-internal markup from sidecar surrounding by @danielaskdd in #3068
  • ✨ feat(extract): enforce MAX_EXTRACT_INPUT_TOKENS for analyze & gleaning by @danielaskdd in #3073
  • chore(deps): bump the react group in /lightrag_webui with 3 updates by @dependabot[bot] in #3069
  • chore(deps-dev): bump the build-tools group in /lightrag_webui with 3 updates by @dependabot[bot] in #3070
  • chore(deps): bump the frontend-minor-patch group across 1 directory with 5 updates by @dependabot[bot] in #3071
  • chore(deps): bump react-i18next from 16.6.6 to 17.0.3 in /lightrag_webui by @dependabot[bot] in #3072
  • refactor(multimodal): inline bracket-label format for mm chunks by @danielaskdd in #3074
  • fix: extract Docling async markdown result by @he-yufeng in #3031
  • feat(parse_mineru): unified sidecar writer + MinerU raw bundle cache by @danielaskdd in #3075
  • feat(native_parser/docx): route through unified SidecarWriter by @danielaskdd in #3077
  • feat: dedupe cross-filename uploads via merged_text normalization by @danielaskdd in #3078
  • feat(mineru): split-by-heading block merging with markdown titles by @danielaskdd in #3079
  • test: harden hermetic env fixture and fix MinerU put() fake by @danielaskdd in #3080
  • feat(mineru): emit page-level positions from page_idx by @danielaskdd in #3081
  • feat(docling): route parse_docling through sidecar bundle pipeline by @danielaskdd in #3085
  • fix: guard against IndexError on empty LLM choices list by @qizwiz in #3086
  • refactor(parser): rename adapters to ir_builder and consolidate parser packages by @danielaskdd in #3087
  • feat(parser_cli): unified debug CLI for native / mineru / docling by @danielaskdd in #3088
  • Tokenizer.encode: gracefully handle disallowed special tokens in content by @RooseveltAdvisors in #3082
  • Improve MinerU parsing provenance and upload handling by @danielaskdd in #3089
  • fix(parsers): drop empty-bodied tables to prevent analyze worker hard-failure by @danielaskdd in #3090
  • feat(doc-status): add per-stage start time and parse-skipped flag by @danielaskdd in #3091
  • feat(mineru): invalidate raw bundle cache on parser option changes by @danielaskdd in #3092
  • refactor(full-docs): unify path handling and slim chunk_options snapshot by @danielaskdd in #3093
  • refactor(pg): align PG storage fields with JSON storage for parity by @danielaskdd in #3094
  • feat(redis): add basename and content_hash lookups for doc status by @danielaskdd in #3098
  • refactor(mongo): align doc-status storage with JSON storage for parity by @danielaskdd in #3099
  • feat(opensearch): add basename and content_hash lookups for doc status by @danielaskdd in #3100
  • feat(pipeline-status): probe + throttled refresh for prompt scan/upload feedback by @danielaskdd in #3101
  • feat(chunker): give P strategy a dedicated default chunk_token_size by @danielaskdd in #3102

New Contributors

Full Changelog: v1.4.15...v1.5.0rc2
