github ashvardanian/StringZilla v3.10.0
v3.10: Improved Memory Operations

16 months ago

This update brings many performance optimizations ahead of the next wave of breaking major releases, which will add new functionality and support a wider range of CPUs. Time to get excited 🥳

Faster memcpy and memset

On Intel Sapphire Rapids:

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- memcpy<aligned>                          19.7128 GB/s       3404322.4 ns          0 errors in       7344 iterations                     
- sz_copy_serial<aligned>                  11.7727 GB/s       5700374.0 ns          0 errors in       4388 iterations                     
- sz_copy_avx512<aligned>                  20.0675 GB/s       3344156.1 ns          0 errors in       7476 iterations                     
- sz_copy_avx2<aligned>                    11.4429 GB/s       5864690.5 ns          0 errors in       4264 iterations                     
- memcpy<unaligned>                        19.4694 GB/s       3446883.2 ns          0 errors in       7256 iterations                     
- sz_copy_serial<unaligned>                11.6158 GB/s       5777373.4 ns          0 errors in       4328 iterations                     
- sz_copy_avx512<unaligned>                20.3848 GB/s       3292099.3 ns          0 errors in       7596 iterations                     
- sz_copy_avx2<unaligned>                  11.2894 GB/s       5944407.9 ns          0 errors in       4208 iterations                     
- memset                                   27.9879 GB/s       2397785.1 ns          0 errors in      10428 iterations                     
- sz_fill_serial                           28.0284 GB/s       2394315.1 ns          0 errors in      10444 iterations                     
- sz_fill_avx512                           28.9894 GB/s       2314942.1 ns          0 errors in      10800 iterations                     
- sz_fill_avx2                             27.7442 GB/s       2418845.8 ns          0 errors in      10336 iterations

On AWS Graviton 4 there is still room for improvement: in the run below, memcpy and memset beat every sz_copy and sz_fill variant.
Non-temporal stores on large payloads are one potential source of gains.

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- memcpy<aligned>                          28.4008 GB/s       2362924.1 ns          0 errors in      10584 iterations                     
- sz_copy_serial<aligned>                  23.0014 GB/s       2917600.0 ns          0 errors in       8572 iterations                     
- sz_copy_sve<aligned>                     27.5536 GB/s       2435573.1 ns          0 errors in      10268 iterations                     
- sz_copy_neon<aligned>                    21.1320 GB/s       3175702.1 ns          0 errors in       7876 iterations                     
- memcpy<unaligned>                        26.9551 GB/s       2489652.6 ns          0 errors in      10044 iterations                     
- sz_copy_serial<unaligned>                22.6073 GB/s       2968456.4 ns          0 errors in       8424 iterations                     
- sz_copy_sve<unaligned>                   25.6073 GB/s       2620692.7 ns          0 errors in       9540 iterations                     
- sz_copy_neon<unaligned>                  20.8439 GB/s       3219593.9 ns          0 errors in       7768 iterations                     
- memset                                   66.9055 GB/s       1003039.9 ns          0 errors in      24928 iterations                     
- sz_fill_serial                           44.1775 GB/s       1519072.9 ns          0 errors in      16460 iterations                     
- sz_fill_sve                              34.5010 GB/s       1945126.1 ns          0 errors in      12856 iterations                     
- sz_fill_neon                             44.5696 GB/s       1505708.6 ns          0 errors in      16604 iterations
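
The non-temporal stores mentioned above as a possible improvement for large payloads can be sketched roughly as follows. This is an illustration only, not StringZilla's implementation: the helper name `copy_large_nt` and the 1 MiB cutoff `NT_COPY_THRESHOLD` are hypothetical, and the streaming path uses x86 SSE2 intrinsics because they are widely available. AArch64 has its own non-temporal store instructions, which this sketch does not use; on non-x86 targets it simply falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical cutoff: smaller payloads go through plain memcpy. */
#define NT_COPY_THRESHOLD ((size_t)1 << 20)

/* Hypothetical helper, not part of StringZilla's API. */
static void copy_large_nt(void *dst, void const *src, size_t n) {
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    if (n >= NT_COPY_THRESHOLD) {
        unsigned char *d = (unsigned char *)dst;
        unsigned char const *s = (unsigned char const *)src;

        /* Streaming stores need a 16-byte aligned destination,
           so copy the unaligned head (0..15 bytes) the regular way. */
        size_t head = (size_t)(-(uintptr_t)d & 15u);
        memcpy(d, s, head);
        d += head, s += head, n -= head;

        /* Main loop: unaligned load, aligned store that bypasses the cache. */
        for (; n >= 16; d += 16, s += 16, n -= 16)
            _mm_stream_si128((__m128i *)d, _mm_loadu_si128((__m128i const *)s));

        /* Order the streaming stores before any later stores. */
        _mm_sfence();

        /* Copy the remaining 0..15 tail bytes. */
        memcpy(d, s, n);
        return;
    }
#endif
    memcpy(dst, src, n);
}
```

The size cutoff exists because bypassing the cache tends to help only when the destination is too large to be reused from cache soon after; the right threshold is hardware-dependent and would need to be measured.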

256-byte Look-Up Table Transform

On Intel Sapphire Rapids:

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- str::transform<lookup>                    3.8070 GB/s      17627743.2 ns          0 errors in       1420 iterations                     
- str::transform<increment>                23.9881 GB/s       2797588.7 ns          0 errors in       8940 iterations                     
- sz_look_up_transform_serial               3.6020 GB/s      18630895.7 ns          0 errors in       1344 iterations                     
- sz_look_up_transform_avx512              21.1733 GB/s       3169507.5 ns          0 errors in       7888 iterations                     
- sz_look_up_transform_avx2                 8.3881 GB/s       8000528.7 ns          0 errors in       3128 iterations

On AWS Graviton 4:

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- str::transform<lookup>                    2.6494 GB/s      25329887.2 ns          0 errors in        988 iterations                     
- str::transform<increment>                23.7150 GB/s       2829809.9 ns          0 errors in       8836 iterations                     
- sz_look_up_transform_serial               2.6069 GB/s      25742844.6 ns          0 errors in        972 iterations                     
- sz_look_up_transform_neon                 8.4908 GB/s       7903721.1 ns          0 errors in       3164 iterations
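
For readers unfamiliar with the operation being benchmarked, a 256-byte look-up table transform maps every input byte through a 256-entry table. The sketch below shows a plain serial version; the names `lut_transform` and `build_upper_lut` and their signatures are hypothetical and chosen for illustration, not taken from StringZilla's API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dst[i] = lut[src[i]] for every byte.
   `dst` may alias `src` for an in-place transform. */
static void lut_transform(unsigned char const *src, size_t n,
                          unsigned char const lut[256], unsigned char *dst) {
    assert(lut != NULL);
    for (size_t i = 0; i < n; ++i) dst[i] = lut[src[i]];
}

/* Example table: ASCII lowercase to uppercase, all other bytes unchanged. */
static void build_upper_lut(unsigned char lut[256]) {
    for (int i = 0; i < 256; ++i) lut[i] = (unsigned char)i;
    for (int c = 'a'; c <= 'z'; ++c) lut[c] = (unsigned char)(c - 'a' + 'A');
}
```

A loop like this performs one table load per input byte, which is the kind of baseline the serial rows above measure; the SIMD rows report several times that throughput on the same data.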
