ashvardanian/StringZilla v3.10.0 on GitHub

This update brings many performance optimizations before the next wave of breaking major releases with new functionality and wider range of CPUs supported. Time to get excited 🥳

Faster `memcpy` and `memset`

On Intel Sapphire Rapids:

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- memcpy<aligned>                          19.7128 GB/s       3404322.4 ns          0 errors in       7344 iterations                     
- sz_copy_serial<aligned>                  11.7727 GB/s       5700374.0 ns          0 errors in       4388 iterations                     
- sz_copy_avx512<aligned>                  20.0675 GB/s       3344156.1 ns          0 errors in       7476 iterations                     
- sz_copy_avx2<aligned>                    11.4429 GB/s       5864690.5 ns          0 errors in       4264 iterations                     
- memcpy<unaligned>                        19.4694 GB/s       3446883.2 ns          0 errors in       7256 iterations                     
- sz_copy_serial<unaligned>                11.6158 GB/s       5777373.4 ns          0 errors in       4328 iterations                     
- sz_copy_avx512<unaligned>                20.3848 GB/s       3292099.3 ns          0 errors in       7596 iterations                     
- sz_copy_avx2<unaligned>                  11.2894 GB/s       5944407.9 ns          0 errors in       4208 iterations                     
- memset                                   27.9879 GB/s       2397785.1 ns          0 errors in      10428 iterations                     
- sz_fill_serial                           28.0284 GB/s       2394315.1 ns          0 errors in      10444 iterations                     
- sz_fill_avx512                           28.9894 GB/s       2314942.1 ns          0 errors in      10800 iterations                     
- sz_fill_avx2                             27.7442 GB/s       2418845.8 ns          0 errors in      10336 iterations

On AWS Graviton 4 we still have room for improvement.
A potential improvement can come from non-temporal stores on large payloads.

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- memcpy<aligned>                          28.4008 GB/s       2362924.1 ns          0 errors in      10584 iterations                     
- sz_copy_serial<aligned>                  23.0014 GB/s       2917600.0 ns          0 errors in       8572 iterations                     
- sz_copy_sve<aligned>                     27.5536 GB/s       2435573.1 ns          0 errors in      10268 iterations                     
- sz_copy_neon<aligned>                    21.1320 GB/s       3175702.1 ns          0 errors in       7876 iterations                     
- memcpy<unaligned>                        26.9551 GB/s       2489652.6 ns          0 errors in      10044 iterations                     
- sz_copy_serial<unaligned>                22.6073 GB/s       2968456.4 ns          0 errors in       8424 iterations                     
- sz_copy_sve<unaligned>                   25.6073 GB/s       2620692.7 ns          0 errors in       9540 iterations                     
- sz_copy_neon<unaligned>                  20.8439 GB/s       3219593.9 ns          0 errors in       7768 iterations                     
- memset                                   66.9055 GB/s       1003039.9 ns          0 errors in      24928 iterations                     
- sz_fill_serial                           44.1775 GB/s       1519072.9 ns          0 errors in      16460 iterations                     
- sz_fill_sve                              34.5010 GB/s       1945126.1 ns          0 errors in      12856 iterations                     
- sz_fill_neon                             44.5696 GB/s       1505708.6 ns          0 errors in      16604 iterations

256-byte Look-Up Table Transform

On Intel Sapphire Rapids:

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- str::transform<lookup>                    3.8070 GB/s      17627743.2 ns          0 errors in       1420 iterations                     
- str::transform<increment>                23.9881 GB/s       2797588.7 ns          0 errors in       8940 iterations                     
- sz_look_up_transform_serial               3.6020 GB/s      18630895.7 ns          0 errors in       1344 iterations                     
- sz_look_up_transform_avx512              21.1733 GB/s       3169507.5 ns          0 errors in       7888 iterations                     
- sz_look_up_transform_avx2                 8.3881 GB/s       8000528.7 ns          0 errors in       3128 iterations

On AWS Graviton 4:

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- str::transform<lookup>                    2.6494 GB/s      25329887.2 ns          0 errors in        988 iterations                     
- str::transform<increment>                23.7150 GB/s       2829809.9 ns          0 errors in       8836 iterations                     
- sz_look_up_transform_serial               2.6069 GB/s      25742844.6 ns          0 errors in        972 iterations                     
- sz_look_up_transform_neon                 8.4908 GB/s       7903721.1 ns          0 errors in       3164 iterations

ashvardanian/StringZilla v3.10.0 v3.10: Improved Memory Operations on GitHub

Faster memcpy and memset

256-byte Look-Up Table Transform

ashvardanian/StringZilla v3.10.0
v3.10: Improved Memory Operations

on GitHub

Faster `memcpy` and `memset`