This release brings two major improvements to Invoke's memory management: partial model loading (aka Low-VRAM mode) and dynamic memory limits.
Memory Management Improvements
Thanks to @RyanJDick for designing and implementing these improved memory management features!
Partial Model Loading (Low-VRAM mode)
Invoke's previous "all or nothing" model loading strategy required your GPU to have enough VRAM to hold whole models during generation.
As a result, as image generation models grew in size and auxiliary models (e.g. ControlNet) became critical to workflows, Invoke's VRAM requirements grew at the same rate. These increased VRAM requirements have prevented many of our users from running Invoke with the latest and greatest models.
Partial model loading allows Invoke to load only the parts of the model that are actively being used onto the GPU, substantially reducing Invoke's VRAM requirements.
- Applies to systems with a CUDA device.
- Enables large models to run with limited GPU VRAM (e.g. the full 24GB FLUX dev model on an 8GB GPU).
- When a model is too large to fit on the GPU, it is partially offloaded to RAM, and its weights are streamed to the GPU for fast inference. Inference won't be as fast as when the model is fully loaded, but it will be much faster than running on the CPU (see the sketch after this list).
- The recommended minimum CUDA GPU size is 8GB. An 8GB GPU should now be capable of running all models supported by Invoke (even the full 24GB FLUX models with ControlNet).
- If there is sufficient demand, we could probably support 4GB cards in the future by moving the VAE decoding operation fully to the CPU.
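To make the streaming behavior concrete, here is a minimal, hypothetical PyTorch sketch of the general technique. It is not Invoke's actual implementation, and the `streamed_forward` helper is invented for illustration:

```python
# Minimal sketch of weight streaming (hypothetical, not Invoke's code):
# keep model weights in RAM and copy each layer to the GPU just-in-time.
import torch
import torch.nn as nn

def streamed_forward(layers: nn.ModuleList, x: torch.Tensor) -> torch.Tensor:
    """Run layers sequentially, loading each onto the GPU only while it runs."""
    x = x.to("cuda")
    for layer in layers:
        layer.to("cuda")   # stream this layer's weights into VRAM
        x = layer(x)       # compute on the GPU
        layer.to("cpu")    # evict the weights to free VRAM for the next layer
    return x
```

A real implementation would keep as many layers resident as the VRAM budget allows and overlap weight copies with compute; the version above is deliberately simplified to show the core idea.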
Dynamic Memory Limits
Previously, the amount of RAM and VRAM used for model caching were set to hard limits. Now, the amount of RAM and VRAM used is adjusted dynamically based on what's available.
For most users, this will result in more effective use of their RAM/VRAM without having to tune configuration values; a sketch of the idea follows the list below.
Users can expect:
- Faster average model load times on systems with extra memory
- Fewer out-of-memory errors when combined with Partial Model Loading
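As a rough illustration of what "dynamic" means here, the following hypothetical Python sketch derives cache limits from currently available memory rather than from fixed values. The function name and headroom parameters are invented for illustration; this is not Invoke's actual logic:

```python
# Hypothetical sketch of dynamic cache limits (not Invoke's actual code):
# size the caches from what is free right now, minus a safety headroom.
import psutil
import torch

def dynamic_cache_limits(ram_headroom_gb: float = 4.0,
                         vram_headroom_gb: float = 2.0) -> tuple[float, float]:
    """Return (ram_limit_gb, vram_limit_gb) based on available memory."""
    free_vram_bytes, _total = torch.cuda.mem_get_info()
    free_ram_bytes = psutil.virtual_memory().available
    ram_limit = max(0.0, free_ram_bytes / 1e9 - ram_headroom_gb)
    vram_limit = max(0.0, free_vram_bytes / 1e9 - vram_headroom_gb)
    return ram_limit, vram_limit
```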
Enabling Partial Model Loading and Dynamic Memory Limits
Partial Model Loading is disabled by default. To enable it, set `enable_partial_loading: true` in your `invokeai.yaml`:

```yaml
enable_partial_loading: true
```
This is highly recommended for users with limited VRAM. Users with 24GB+ of VRAM may prefer to leave this option disabled to guarantee that models are fully loaded and run at full speed.
Dynamic memory limits are enabled by default, but can be overridden by setting `ram` or `vram` in your `invokeai.yaml`:

```yaml
# Override the dynamic cache limits to ram=6GB and vram=20GB.
ram: 6
vram: 20
```
🚨 Note: Users who previously set `ram` or `vram` in their `invokeai.yaml` will need to delete these overrides in order to benefit from the new dynamic memory limits.
All Changes
- Added support for partial model loading.
- Added support for dynamic memory limits.
- Fixed issue where excessively long board names could cause performance issues.
- Reworked error handling when installing models from a URL.
- Fixed link to the `Scale` setting's support docs.
- Tidied some unused variables. Thanks @rikublock!
- Added typegen check to CI pipeline. Thanks @rikublock!
- Added stereogram nodes to Community Nodes docs. Thanks @simonfuhrmann!
- Updated installation-related docs (quick start, manual install, dev install).
Installing and Updating
The new Invoke Launcher is the recommended way to install, update and run Invoke. It takes care of a lot of details for you - like installing the right version of Python - and runs Invoke as a desktop application.
Follow the Quick Start guide to get started with the launcher.
If you already have the launcher, you can use it to update your existing install.
We've just updated the launcher to v1.2.0 with a handful of fixes. To update the launcher itself, download the latest version from the quick start guide - the download links are kept up to date.
Legacy Scripts (not recommended!)
We recommend using the launcher, as described in the previous section!
To install or update with the outdated legacy scripts 😱, download the latest legacy scripts and follow the legacy scripts instructions.
What's Changed
- Update Readme with new Installer Instructions by @hipsterusername in #7455
- docs: fix installation docs home by @psychedelicious in #7470
- docs: fix installation docs home again by @psychedelicious in #7471
- feat(ci): add typegen check workflow by @rikublock in #7463
- docs: update download links for launcher by @psychedelicious in #7489
- Add Stereogram Nodes to communityNodes.md by @simonfuhrmann in #7493
- Partial Loading PR1: Tidy ModelCache by @RyanJDick in #7492
- Partial Loading PR2: Add utils to support partial loading of models from CPU to GPU by @RyanJDick in #7494
- Partial Loading PR3: Integrate 1) partial loading, 2) quantized models, 3) model patching by @RyanJDick in #7500
- Correct Scale Informational Popover by @hipsterusername in #7499
- docs: install guides by @psychedelicious in #7508
- docs: no need to specify version for dev env setup by @psychedelicious in #7510
- feat(ui): reset canvas layers only resets the layers by @psychedelicious in #7511
- refactor(ui): mm model install error handling by @psychedelicious in #7512
- fix(api): limit board_name length to 300 characters by @maryhipp in #7515
- fix(app): remove obsolete DEFAULT_PRECISION variable by @rikublock in #7473
- Partial Loading PR 3.5: Fix pre-mature model drops from the RAM cache by @RyanJDick in #7522
- Partial Loading PR4: Enable partial loading (behind config flag) by @RyanJDick in #7505
- Partial Loading PR5: Dynamic cache ram/vram limits by @RyanJDick in #7509
- ui: translations update from weblate by @weblate in #7480
- chore: bump version to v5.6.0rc1 by @psychedelicious in #7521
New Contributors
- @simonfuhrmann made their first contribution in #7493
Full Changelog: v5.5.0...v5.6.0rc1