Read the blog post at FEX-Emu's Site!
We were a little bit late with this month's release. It turns out that getting distracted hunting bugs for a week does that. Let's jump into what
has changed!
More memory savings
This month we have had some memory saving changes land, which is vitally important for 8GB and 16GB systems. Primarily, we have now enabled our Dynamic
L1 lookup cache and disabled our L2 lookup caches by default. We talked about this more in the FEX-2511 release post, but changing these defaults can
save hundreds of megabytes of memory.
Additionally we have fixed a pseudo-leak in one of our thread-pool allocators. It wasn't quite a real leak because each thread only ever held a single
allocation, but it is supposed to share allocations between threads which means this ballooned pretty heavily for games that create a lot of threads.
For our test game, ENDER LILIES: Quietus of the Knights, this meant
going from consuming 409MB of memory down to 6MB for this pool.
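To illustrate the difference between sharing allocations and letting every thread hold its own, here is a minimal sketch of a shared pool. It is not FEX's actual allocator: the class `SharedBufferPool` and its methods `Claim`, `Release`, and `TotalAllocated` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical shared pool: buffers are handed back after use so that other
// threads can reuse them, instead of every thread keeping its own forever.
class SharedBufferPool {
public:
  using Buffer = std::unique_ptr<unsigned char[]>;

  explicit SharedBufferPool(std::size_t buffer_size) : buffer_size_(buffer_size) {
    assert(buffer_size_ > 0);
  }

  // Hand out an idle buffer if one exists, otherwise allocate a new one.
  Buffer Claim() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!idle_.empty()) {
      Buffer buffer = std::move(idle_.back());
      idle_.pop_back();
      return buffer;
    }
    ++total_allocated_;
    return Buffer(new unsigned char[buffer_size_]);
  }

  // Return a buffer to the pool so any thread can claim it next.
  void Release(Buffer buffer) {
    std::lock_guard<std::mutex> lock(mutex_);
    idle_.push_back(std::move(buffer));
  }

  // Number of buffers this pool has ever had to allocate.
  std::size_t TotalAllocated() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return total_allocated_;
  }

private:
  mutable std::mutex mutex_;
  std::vector<Buffer> idle_;
  std::size_t buffer_size_;
  std::size_t total_allocated_ = 0;
};
```

With a pool like this, the number of live buffers follows how many are in use at the same time rather than how many threads have ever run, which is the property the pseudo-leak was missing.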
Another change that occurred this month is being more aware of Transparent Huge Pages (THP) potentially causing us to consume more memory than expected.
When the system's THP mode is set to always instead of madvise, we were consuming significantly more RAM than expected. Arch Linux currently
defaults to always, which caught us by surprise in our testing. FEX will now actively ask for THP or non-THP buffers depending on their use-case,
which can dramatically reduce memory usage for our sparse buffers on systems that default to always. As a side-effect, our JIT code buffer
now always asks for a THP buffer, which cut iTLB misses in half in our testing and dramatically reduces pressure on the CPU's L2 TLB.
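On Linux, asking for THP or non-THP memory per buffer is done with `madvise`. The following is a minimal sketch of that idea, not FEX's actual code: the helper name `MapWithTHPHint` is hypothetical, while `mmap`, `madvise`, `MADV_HUGEPAGE`, and `MADV_NOHUGEPAGE` are the standard Linux interfaces.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>

// Hypothetical helper: map an anonymous buffer and tell the kernel whether
// it should be backed by Transparent Huge Pages.
// Returns nullptr if the mapping itself fails.
void* MapWithTHPHint(std::size_t size, bool want_huge_pages) {
  assert(size > 0);
  void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (ptr == MAP_FAILED) {
    return nullptr;
  }

  // MADV_HUGEPAGE opts the range in to THP (useful in "madvise" mode).
  // MADV_NOHUGEPAGE opts it out (useful in "always" mode).
  const int advice = want_huge_pages ? MADV_HUGEPAGE : MADV_NOHUGEPAGE;

  // madvise is only a hint here; a failure (for example on a kernel built
  // without THP) is deliberately ignored and the mapping stays usable.
  (void)madvise(ptr, size, advice);
  return ptr;
}
```

In this sketch a sparse, mostly untouched buffer would pass `false` so that touching one byte does not commit a whole huge page, while a densely used buffer such as a code buffer would pass `true`.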
A smattering of bug fixes and performance improvements
As usual, we have a large number of bug fixes and performance improvements. Each one is small, and there are too many to list them all, but we
do have some highlights.
Inline SIN/COS/TAN for x87 reduced precision
One of the most costly things that our JIT can do is x87 emulation and jumping out of the JIT for a helper. Unfortunately, they tend to come
hand-in-hand. This month we have optimized these three transcendental operations to no longer jump out of the JIT, which has sped up the operations by
an average of 3.7x! This makes games that hit these x87 transcendentals go quite a bit faster, like Bayonetta and Fallout: New Vegas, improving their
playability on a larger set of systems.
Additional changes are as follows:
Performance
- Replace a code invalidation mutex with our hand-rolled implementation that is dramatically faster
- Wire up FEAT_MOPS support. The Samsung Exynos 2600 is one of the first SoCs with support
- Rearrange some Arm64EC dispatcher code for performance
- Optimize a vector broadcast a game was hitting
- Skip ELF parsing when code caching is disabled
Bug fixes
- Fix prefetch encoded nop instructions
- Ensure MXCSR is saved and restored correctly on signal
- Reset relocation data on JIT restart
Workaround a Docker seccomp filter bug
A user has been tinkering with FEX inside of a Docker environment and they uncovered an issue where FEX was crashing for really bizarre reasons. We
eventually tracked down some syscalls that were returning broken results due to a bug in Docker's seccomp filter rules. It turns out that
their filter follows neither the AAPCS64 nor the SystemV x86-64 ABI rules
around zero extending arguments that are smaller than the register size. This causes problems because it is on the callee to do the zero extension
of the argument, so if there is garbage in the upper bits of the source register, it should get ignored.
Because Docker's seccomp filter only ever compares the values passed to system calls at the full width of the register, 8/16/32-bit arguments can have
garbage in the upper bits and the filter incorrectly returns -EPERM for perfectly valid data. We manually worked around the one instance we saw
causing problems locally, but Docker needs to audit their seccomp filters and correctly handle this for a real fix!
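The failure mode can be shown with plain integer comparisons. This is only an illustration and is not taken from Docker's filter rules: the functions `FullRegisterMatch` and `Low32Match` and the example values are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical filter rule that compares the whole 64-bit register against
// the expected value. Garbage in the upper bits makes a valid argument
// look like a mismatch.
bool FullRegisterMatch(std::uint64_t reg, std::uint32_t expected) {
  return reg == static_cast<std::uint64_t>(expected);
}

// Comparison that only looks at the low 32 bits, which is all a callee
// taking a 32-bit argument is allowed to depend on.
bool Low32Match(std::uint64_t reg, std::uint32_t expected) {
  return static_cast<std::uint32_t>(reg) == expected;
}
```

For a register holding `0xDEADBEEF00000002`, a 32-bit argument of `2` is perfectly valid. `Low32Match` accepts it, while `FullRegisterMatch` rejects it, which is the kind of spurious denial described above.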
Add option to FEXGetConfig to show fault granularity
One of the major struggles with emulating x86's TSO is that because our memory accesses fault when unaligned, they have dramatic overhead compared to
x86 basically never faulting due to alignment problems. This is slightly improved on newer ARM CPUs where the FEAT_LSE2 extension removes a
percentage of faults by allowing unaligned accesses inside of a 16-byte granule. With this new run-time test, we can visualize when instructions
are going to fault to showcase how bad it is.
First a system that doesn't support FEAT_LSE2
Then a system that supports FEAT_LSE2
The green pips show the byte alignments at which a memory access neither faults nor causes problems, while the red pips show where a fault will occur
and we end up needing to backpatch the code with a memory fence, or simulate the operation in the signal handler. As seen, there is still quite a bit
of red on the graphs even with the best hardware for this. Meanwhile, if we had a similar test for x86, all the pips would be green except the 128-bit
result, which behaves the same on both architectures (except for vector accesses, which this doesn't test).
We added this test capability so that if any hardware in the future does decide to fix this performance and correctness problem, then we get a
very quick test we can run to detect it.
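The pip pattern can be approximated with a small model. This sketch is not the FEXGetConfig test itself and probes no hardware: it assumes the simplified rule that, without FEAT_LSE2, any access that is not naturally aligned faults, and that with FEAT_LSE2 only accesses crossing a 16-byte granule fault. The function names and the `.`/`X` rendering are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sizes are in bytes and must be a power of two no larger than 16.
static bool ValidSize(std::uint32_t size) {
  return size != 0 && size <= 16 && (size & (size - 1)) == 0;
}

// Assumed model without FEAT_LSE2: anything not naturally aligned faults.
bool FaultsWithoutLSE2(std::uint64_t addr, std::uint32_t size) {
  assert(ValidSize(size));
  return (addr & (size - 1)) != 0;
}

// Assumed model with FEAT_LSE2: fault only when the access crosses a
// 16-byte granule boundary.
bool FaultsWithLSE2(std::uint64_t addr, std::uint32_t size) {
  assert(ValidSize(size));
  return (addr % 16) + size > 16;
}

// Render one row of pips for byte offsets 0..63.
// '.' means no fault (green), 'X' means fault (red).
std::string PipRow(std::uint32_t size, bool has_lse2) {
  std::string row;
  for (std::uint64_t offset = 0; offset < 64; ++offset) {
    const bool faults = has_lse2 ? FaultsWithLSE2(offset, size)
                                 : FaultsWithoutLSE2(offset, size);
    row.push_back(faults ? 'X' : '.');
  }
  return row;
}
```

Under this model a 4-byte access at offset 13 still faults with FEAT_LSE2 because it spills into the next granule, and a 16-byte access is only safe when it is fully aligned, which is why red remains even on the best hardware.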
Raw Changes
FEX Release FEX-2604
- Allocators
  - Remove legacy NOREPLACE handling (b77ddcf)
- Arm64EC
  - Invert suspend doorbell and move out of hot path (0695249)
- Arm64Emitter
  - Disable 32-bit constant optimization when NOP-padding is requested (de11c05)
- CPUBackend
  - Enable Transparent Huge Pages on JIT buffers (afc7248)
- CPUID
  - Add basic stub handling for AVX10 info (6bd476f)
- Cmake
  - Default to release builds with a message (194eb69)
- CodeCache
- Config
- FEX
- FEXCore
- FEXGetConfig
  - Test for showing fault granularity (34b48c4)
- FEXRootFSFetcher
- FEXServer
  - Try both fusermount and fusermount3 (56de0d1)
- FEXpidof
  - Fixes another missing std::filesystem throw (c18fb3c)
- HostFeatures
- IR
  - Adds support for printing strings (ae3fa6a)
- JIT
- Linux
  - GuestFrames
    - Ensure MXCSR is saved and restored on signal (5b4a596)
- LinuxSyscalls
- MemoryOps
- OpcodeDispatcher
- SMCTracking
- Syscalls
  - Fixes crash in ELF parsing code (cc02edb)
- Tests
  - ASM
    - Add FISTTP unit tests for 16, 32, 64-bit as well as negatives (494dd64)
- Win32
  - Enable support for virtual naming and THP control (5c34c57)
- Misc
  - JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path (4229154)
  - Fix most GCC build issues (73c1f4c)
  - Add test for shifts preserving flags (f3e9042)
  - code-format-helper: Another dependabot upgrade (e4ae6ce)
  - External/code-format-helper: Update dependencies (6177ab9)
  - Handle zero length in ChangeProtectionFlags (9d4a71b)
  - gitlab-ci: Update requirements (58d9755)
- github
  - Stop running unittests always on build failure (5c4c468)