ashvardanian/StringZilla v4.0.0 on GitHub

This PR refactors the entire codebase, splitting the single-header implementation into multiple headers. It also brings faster kernels for:

Sorting of string sequences and pointer-sized integers 🔀
Levenshtein edit distances for fuzzy matching of UTF-8 or DNA 🧬
Needleman-Wunsch and Smith-Waterman scoring, see affine-gaps ☣️
Multi-pattern sketching and fingerprinting on GPUs 🔍
AES-based portable general-purpose hashing functions #️⃣

And more community contributions:

Detecting CPU capabilities 👏 @GoWind - #196
Windows cross-compilation 👏 @ashbob999 - #169
CMake refactor 👏 @friendlyanon - #85
Charset initialization 👏 @alexbarev - #200
Benchmarking sorting algorithms 👏 @ashbob999 - #209
Allocator initialization and unknown pragmas on MSVC 👏 @GerHobbelt - #231
Big-endian SWAR substring-search backends 👏 @SammyVimes - #75
NodeJS and GoLang bindings groundwork 👏 @MarkReedZ - #151
New C++23 APIs 👏 @PleaseJustDont - #225
Repeated help with Rust 👏 @mikayelgr and @grouville - #215

Huge thanks to our partners at Nebius for their continued support and endless stream of GPU installations for the most demanding computational workloads in AI and beyond!

Why Split the Files? Matching SimSIMD Design

Sadly, most modern software development tooling is subpar. VS Code is just as slow and unresponsive as the older Atom and other web-based tools, while LSP implementations for C++ are equally slow and completely mess up code highlighting for files over 5,000 lines of code (LOC). So, I've unbundled the single-header solution into multiple headers, similar to SimSIMD.

Also, similar to SimSIMD, CPU feature detection has been reworked to distinguish between the serial, Haswell, Skylake, Ice Lake, NEON, and SVE backends.
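To illustrate the idea of per-backend dispatch, here is a minimal sketch in C++. All names in it (capability_t, backend_t, pick_backend, bytesum_serial) are hypothetical and are not the StringZilla API. The serial kernel also stands in for the SIMD variants, so the example shows only how a backend could be selected from a capability bitmask, not how capabilities are detected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical capability flags, one bit per backend family named in the text.
enum capability_t : std::uint32_t {
    cap_serial_k = 1u << 0,
    cap_haswell_k = 1u << 1,
    cap_skylake_k = 1u << 2,
    cap_ice_k = 1u << 3,
    cap_neon_k = 1u << 4,
    cap_sve_k = 1u << 5,
};

using bytesum_t = std::uint64_t (*)(char const *, std::size_t);

// Portable reference kernel. In this sketch it also stands in for the SIMD variants.
inline std::uint64_t bytesum_serial(char const *data, std::size_t length) {
    assert(data != nullptr || length == 0);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i != length; ++i) sum += static_cast<unsigned char>(data[i]);
    return sum;
}

struct backend_t {
    std::uint32_t required; // capability bits this backend needs
    char const *name;       // label for logging
    bytesum_t kernel;       // function pointer to call
};

// Walks the table from the most specialized backend to the least specialized one
// and returns the first entry whose requirements are all available.
inline backend_t const &pick_backend(std::uint32_t available) {
    static backend_t const table[] = {
        {cap_ice_k, "ice", &bytesum_serial},
        {cap_skylake_k, "skylake", &bytesum_serial},
        {cap_haswell_k, "haswell", &bytesum_serial},
        {cap_sve_k, "sve", &bytesum_serial},
        {cap_neon_k, "neon", &bytesum_serial},
        {cap_serial_k, "serial", &bytesum_serial},
    };
    constexpr std::size_t count = sizeof(table) / sizeof(table[0]);
    for (std::size_t i = 0; i + 1 != count; ++i)
        if ((available & table[i].required) == table[i].required) return table[i];
    return table[count - 1]; // the serial fallback always works
}
```

Calling pick_backend once at startup and caching the returned function pointer keeps the per-call overhead at a single indirect call.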

Faster Sequence Alignment & Scoring on GPUs

Biology is set to be one of the driving forces of the 21st century, and biological DNA/RNA/protein data is already one of the fastest-growing data modalities, outpacing Moore's Law. Still, most BioInformatics software today is flawed and pretty slow. Last year, I helped several BioTech and Pharma companies scale up their data-processing capacity with specialized optimizations for various niche use cases. I also wanted to add baseline kernels for the most crucial algorithms to StringZilla, covering:

Levenshtein distance computation for binary & UTF-8 data for general-purpose fuzzy string matching on CPUs and GPUs,
Needleman-Wunsch and Smith-Waterman global and local sequence alignment with both linear and affine Gotoh gap extensions, so common in BioInformatics, on both CPUs and GPUs.

These kernels are hardly state-of-the-art at this point, but should provide a good baseline, ensuring correctness and equivalent outputs across different CPU & GPU brands.
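For readers unfamiliar with affine gaps, the following is a minimal serial sketch of global (Needleman-Wunsch style) scoring with Gotoh's recurrences, using two rolling rows instead of a full matrix. It is a textbook baseline and not StringZilla's kernel. The names affine_costs_t and global_affine_score are hypothetical, and the cost convention is an assumption: a gap of length k costs gap_open + (k - 1) * gap_extend.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>
#include <vector>

// Hypothetical cost model: scores are maximized, gaps are non-positive penalties.
struct affine_costs_t {
    std::int64_t match;
    std::int64_t mismatch;
    std::int64_t gap_open;   // cost of the first character of a gap
    std::int64_t gap_extend; // cost of every following character of the same gap
};

inline std::int64_t global_affine_score(std::string_view a, std::string_view b, affine_costs_t costs) {
    assert(costs.gap_open <= 0 && costs.gap_extend <= 0);
    constexpr std::int64_t minus_infinity = std::numeric_limits<std::int64_t>::min() / 4;
    std::size_t const columns = b.size();

    // `best[j]` holds the best score of aligning the processed prefix of `a` with `b[0..j)`.
    // `vertical[j]` holds the best score of such an alignment that ends with a gap in `b`.
    std::vector<std::int64_t> best(columns + 1);
    std::vector<std::int64_t> vertical(columns + 1, minus_infinity);
    best[0] = 0;
    for (std::size_t j = 1; j <= columns; ++j)
        best[j] = costs.gap_open + static_cast<std::int64_t>(j - 1) * costs.gap_extend;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::int64_t diagonal = best[0]; // score of the cell to the upper-left
        best[0] = costs.gap_open + static_cast<std::int64_t>(i - 1) * costs.gap_extend;
        std::int64_t horizontal = minus_infinity; // best alignment ending with a gap in `a`
        for (std::size_t j = 1; j <= columns; ++j) {
            // Either extend an existing gap or open a new one from the best neighbor.
            horizontal = std::max(horizontal + costs.gap_extend, best[j - 1] + costs.gap_open);
            vertical[j] = std::max(vertical[j] + costs.gap_extend, best[j] + costs.gap_open);
            std::int64_t const substituted = diagonal + (a[i - 1] == b[j - 1] ? costs.match : costs.mismatch);
            diagonal = best[j]; // keep the previous row's value for the next column
            best[j] = std::max({substituted, horizontal, vertical[j]});
        }
    }
    return best[columns];
}
```

With match = 0 and all other costs set to -1, the affine model degenerates to linear gaps and the result is the negated Levenshtein distance, which gives a convenient cross-check between the two families of kernels.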

Faster Sorting

Our old algorithm didn't perform any memory allocations and tried to fit too much into the provided buffers. The new breaking change in the API allows passing a memory allocator, making the implementation more flexible. It now works fine on both 32-bit and 64-bit systems.

The new serial algorithm is often 5 times faster than the std::sort function in the C++ Standard Template Library for a vector of strings. It's also typically 10 times faster than the qsort_r function in the GNU C library. There are even quicker versions available for Ice Lake CPUs with AVX-512 and Arm CPUs with SVE.

Faster Hashing Algorithms

Our old algorithm was a variation of the Karp-Rabin hash and was designed more for rolling hashing workloads. Sadly, such hashing schemes don't pass SMHasher and similar hash-testing suites, and a better solution was needed. For years, I have been contemplating designing a general-purpose hash function based on AES instructions, which have been implemented in hardware for several CPU generations now. As discussed with @jandrewrogers, and as can be seen in his AquaHash project, those instructions provide an almost unique amount of mixing logic per CPU cycle of latency.

Many popular hash libraries, such as AHash in the Rust ecosystem, cleverly combine AES instructions with 8-bit shuffles and 64-bit additions. However, they rarely harness the full power of the CPU due to the constraints of Rust tooling and the complexity of using masked x86 AVX-512 and predicated Arm SVE2 instructions. StringZilla does that and ticks a few more boxes:

Outputs 64-bit hashes and passes the SMHasher --extra tests.
Is fast for both short strings (velocity) and long strings (throughput).
Supports incremental (streaming) hashing, when the data arrives in chunks.
Supports custom seeds for hashes, with the seed affecting every bit of the output.
Provides dynamic-dispatch for different architectures to simplify deployment.
Documents its logic and guarantees the same output across different platforms.

These kernels produce fast, high-quality hashes and can often compute four hashes simultaneously. That makes them handy not only for hashing itself, but also for higher-level operations like database-style hash joins and set intersections, as well as advanced sequence alignment algorithms for bioinformatics.

Fingerprinting, Sketching, and Min-Hashing... using 52-bit Floats?!

Despite deprecating Rabin-Karp rolling hashes for general-purpose workloads, it's hard to argue with their usability in "fingerprinting" or "sketching" tasks, where a fixed-size feature vector is needed to compare contents of variable-length strings. The features should be as different as possible, covering various substring lengths, so polynomial rolling hashes fit nicely!
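As a rough illustration of the idea, the sketch below builds a Min-Hash style fingerprint from polynomial rolling hashes using plain 64-bit integers. Every name in it (min_rolling_hash, fingerprint, estimated_similarity) is hypothetical, and the modulus 2^31 - 1 and the per-dimension multipliers 257 + 2 * d are arbitrary illustrative choices, not the parameters StringZilla uses. Each dimension keeps the smallest hash over all windows of its width.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>
#include <vector>

constexpr std::uint64_t modulus_k = 2147483647ull; // 2^31 - 1, keeps products below 2^62

// Smallest polynomial rolling hash over all windows of `width` bytes in `text`.
inline std::uint32_t min_rolling_hash(std::string_view text, std::size_t width, std::uint64_t multiplier) {
    assert(width > 0 && multiplier > 1 && multiplier < modulus_k);
    std::uint32_t smallest = std::numeric_limits<std::uint32_t>::max();
    if (text.size() < width) return smallest; // no complete window, keep the sentinel

    // Hash the first window and compute multiplier^(width - 1) along the way.
    std::uint64_t hash = 0, highest_power = 1;
    for (std::size_t i = 0; i != width; ++i) {
        hash = (hash * multiplier + static_cast<unsigned char>(text[i])) % modulus_k;
        if (i + 1 != width) highest_power = (highest_power * multiplier) % modulus_k;
    }
    smallest = static_cast<std::uint32_t>(hash);

    // Slide the window: drop the outgoing byte, shift, and append the incoming byte.
    for (std::size_t i = width; i != text.size(); ++i) {
        std::uint64_t const outgoing = static_cast<unsigned char>(text[i - width]) * highest_power % modulus_k;
        hash = (hash + modulus_k - outgoing) % modulus_k;
        hash = (hash * multiplier + static_cast<unsigned char>(text[i])) % modulus_k;
        smallest = std::min(smallest, static_cast<std::uint32_t>(hash));
    }
    return smallest;
}

// One feature per dimension, each with its own multiplier and a window width cycled from `widths`.
inline std::vector<std::uint32_t> fingerprint(std::string_view text, std::size_t dimensions,
                                              std::vector<std::size_t> const &widths) {
    assert(!widths.empty());
    std::vector<std::uint32_t> features(dimensions);
    for (std::size_t d = 0; d != dimensions; ++d)
        features[d] = min_rolling_hash(text, widths[d % widths.size()], 257 + 2 * d);
    return features;
}

// Fraction of matching features, a crude estimate of the overlap between the two sets of windows.
inline double estimated_similarity(std::vector<std::uint32_t> const &a, std::vector<std::uint32_t> const &b) {
    assert(a.size() == b.size() && !a.empty());
    std::size_t matches = 0;
    for (std::size_t d = 0; d != a.size(); ++d) matches += a[d] == b[d];
    return static_cast<double>(matches) / static_cast<double>(a.size());
}
```

Two strings made of the same set of windows, such as "abcabc" and "abcabcabc" at width 3, produce identical fingerprints, while unrelated strings agree on few dimensions.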

That said, implementing modulo-arithmetic over 64-bit integers is extremely expensive on Intel CPUs:

VPMULLQ (ZMM, ZMM, ZMM) for _mm512_mullo_epi64:
- on Intel Ice Lake: 15 cycles on port 0.
- on AMD Zen4: 3 cycles on ports 0 or 1.
VPMULLD (ZMM, ZMM, ZMM) for _mm512_mullo_epi32:
- on Intel Ice Lake: 10 cycles on port 0.
- on AMD Zen4: 3 cycles on ports 0 or 1.
VPMULLW (ZMM, ZMM, ZMM) for _mm512_mullo_epi16:
- on Intel Ice Lake: 5 cycles on port 0.
- on AMD Zen4: 3 cycles on ports 0 or 1.
VPMADD52LUQ (ZMM, ZMM, ZMM) for _mm512_madd52lo_epu64 for 52-bit multiplication:
- on Intel Ice Lake: 4 cycles on port 0.
- on AMD Zen4: 4 cycles on ports 0 or 1.

It's even pricier on Nvidia GPUs, but as you can see above, we may have a way out! x86 has a cheap instruction for 52-bit integer multiplication & addition. Moreover, 52 is precisely the number of explicitly stored mantissa bits in a double-precision floating-point number, so any 52-bit integer can be stored in it exactly. That opens the door to a really weird implementation of rolling hash functions that works across both CPUs and GPUs, built on 64-bit floats and Barrett reductions for modular arithmetic, to avoid division!
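The float-based trick can be sketched in a few lines of scalar C++. This is only an illustration under stated assumptions, not StringZilla's kernel. The names barrett_modulus_t, make_barrett, barrett_mod, and polynomial_hash_f64 are hypothetical. The modulus is assumed to be at most 2^26, so that the product of two residues stays below 2^52 and remains exact in a double. The division is replaced by a multiplication with a precomputed reciprocal, followed by at most one correction in each direction.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <string_view>

// Hypothetical helper: a modulus together with its precomputed reciprocal.
struct barrett_modulus_t {
    double modulus;
    double inverse;
};

inline barrett_modulus_t make_barrett(double modulus) {
    // Assumption: an integer modulus in [2, 2^26], so products of residues fit in 52 bits.
    assert(modulus >= 2.0 && modulus <= 67108864.0 && modulus == std::floor(modulus));
    return {modulus, 1.0 / modulus};
}

// Reduces a non-negative integer `x` below 2^52, stored in a double, without dividing.
inline double barrett_mod(double x, barrett_modulus_t m) {
    double const quotient = std::floor(x * m.inverse); // may be off by one due to rounding
    double remainder = x - quotient * m.modulus;       // exact, as both terms are integers below 2^53
    if (remainder < 0.0) remainder += m.modulus;
    if (remainder >= m.modulus) remainder -= m.modulus;
    return remainder;
}

// Polynomial hash computed entirely in 64-bit floats.
inline double polynomial_hash_f64(std::string_view text, double multiplier, barrett_modulus_t m) {
    assert(multiplier >= 0.0 && multiplier < m.modulus);
    double hash = 0.0;
    for (char c : text) hash = barrett_mod(hash * multiplier + static_cast<unsigned char>(c), m);
    return hash;
}

// Integer reference used to cross-check the floating-point variant.
inline std::uint64_t polynomial_hash_u64(std::string_view text, std::uint64_t multiplier, std::uint64_t modulus) {
    std::uint64_t hash = 0;
    for (char c : text) hash = (hash * multiplier + static_cast<unsigned char>(c)) % modulus;
    return hash;
}
```

The same arithmetic maps onto FMA-capable floating-point units on both CPUs and GPUs, which is what makes the approach attractive where 64-bit integer multiplication is slow.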

Next Steps

Ditch using constant memory in CUDA. The original design was to load the substitution tables for NW and SW into shared memory, but a simpler design with constant memory was eventually accepted. That proved to be a mistake. The performance numbers are way lower than expected.
Use Distributed Shared Memory for 10x larger NW and SW inputs. That would require additional variants of linear_score_on_each_cuda_warp_ and affine_score_on_each_cuda_warp_, which scale to 16 blocks with (228 - 1) KB of shared memory each, totaling 3.5 MB of shared memory that can be used for alignments. With u32 scores and 3 DP diagonals (in the case of non-affine gaps), that should significantly accelerate the alignment of ~300K-long strings. But keep in mind that cluster.sync() is still very expensive: at 1300 cycles, it is only 40% cheaper than the 2200 cycles for grid.sync().
Add asynchronous host callbacks for updating the addressable memory region in NW and SW.
Potentially bring back stringzilla_bare builds on MSVC, if needed.
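The Distributed Shared Memory estimate in the list above can be checked with back-of-the-envelope arithmetic. The helper below is a hypothetical sketch that ignores the space needed for the strings themselves and for substitution tables: 16 blocks of 227 KB give 3,632 KB, and dividing that among 3 diagonals of 4-byte scores leaves room for roughly 310 thousand cells per diagonal, in line with the ~300K figure.

```cpp
#include <cassert>
#include <cstddef>

// Number of DP cells that fit into one diagonal, given the total shared memory budget.
constexpr std::size_t max_diagonal_cells(std::size_t blocks, std::size_t kib_per_block,
                                         std::size_t bytes_per_score, std::size_t diagonals) {
    return blocks * kib_per_block * 1024 / (bytes_per_score * diagonals);
}

// 16 blocks x 227 KiB = 3,719,168 bytes, split across 3 diagonals of u32 scores.
static_assert(max_diagonal_cells(16, 227, 4, 3) == 309930, "roughly 300K cells per diagonal");

// Runtime wrapper for the same estimate, with a basic sanity check on the inputs.
inline std::size_t longest_supported_input(std::size_t blocks, std::size_t kib_per_block,
                                           std::size_t bytes_per_score, std::size_t diagonals) {
    assert(bytes_per_score > 0 && diagonals > 0);
    // The longest anti-diagonal has one more cell than the shorter of the two strings.
    std::size_t const cells = max_diagonal_cells(blocks, kib_per_block, bytes_per_score, diagonals);
    return cells > 0 ? cells - 1 : 0;
}
```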

Major

Break: Output error messages via C API (fc94890)
Break: New wording for incremental hashers (66898ad)
Break: Rust namespaces layout (b3338bb)
Break: Refactor Strs ops (009b975)
Break: Rename again (fb56b60)
Break: Drop fingerprinting bench (3e04157)
Break: sz::edit_distance -> Levenshtein (d44beb4)
Break: C++ lookup and fill_random (1ce830b)
Break: charset/generate -> byteset/fill_random (2ce2b49)
Break: New calling convention in similarity.h (095bc2d)
Break: Return error-codes in sort functions (944804e)
Break: look_up_transform to lookup API (e0055d5)
Break: checksum to bytesum, new hash, and PRNG (71f1f4b)
Break: Pointer-sized N-gram Sorting (0c38bff)
Break: sz_sort now takes allocators (ec81663)
Break: Deprecate old fingerprinting (38014ee)
Break: Replace char_set constructor with literals (2c49eae)

Minor

Add: Str.count_byteset for Python (802d699)
Add: GoLang official support (234b758)
Add: HashMap traits for Rust (e555cc3)
Add: try_resize_and_overwrite (e00a98b)
Add: Hashers for Swift (be363c9)
Add: Big-endian SWAR backends (607dd14)
Add: Zero-copy JS wrapper for buffers (4bf2dd6)
Add: sz.fill_random for Python (3565f9b)
Add: Allow resetting the dispatch table (68172a0)
Add: Ephemeral GPU executors if no device is passed (6431900)
Add: Capabilities getters for SW (86c89b7)
Add: StringZillas for Rust draft (48a1120)
Add: Fingerprinting benchmarks (471649e)
Add: On GPU fingerprints in Python (6f26629)
Add: basic_rolling_hashers CUDA port (a25e3f2)
Add: Unrolled fingerprinting backends (9a04c95)
Add: NW and SW scoring classes (958c9e1)
Add: SW scoring (3e8a3a6)
Add: NW scoring (079a8c1)
Add: Capability-constrained Py constructors (d0dfa0e)
Add: LevenshteinDistancesUTF8 in Py (4cd6c7d)
Add: Wrap DeviceScope for Python (30b8fd3)
Add: Make Strs from lists, tuples, generators (05a5434)
Add: StringZillas Python tests (978d13d)
Add: Strs layout conversion tests (4e5df62)
Add: Exportable _sz_py_api capsules (9d5935b)
Add: Levenshtein kernels in shared lib (42a815f)
Add: Strs.from_arrow conversion (0441b32)
Add: Draft fingerprinting C binding (3f5e004)
Add: Fingerprinting baselines (f997dad)
Add: SZ_NOINLINE (4e6e1af)
Add: Draft fingerprinting on GPUs (11af644)
Add: Haswell rolling fingerprints (25d3ee6)
Add: Draft exploration of float fingerprints (3955eee)
Add: lock_guard to avoid STL (e6cdb93)
Add: Parallel fingerprinting (4ccac06)
Add: to_span, to_view helpers (4fb9283)
Add: Min-Hashing basic_rolling_hashers (63447d3)
Add: Fingerprinting benchmarks (00c68d5)
Add: 64-bit double fingerprinting (9ed6006)
Add: Test rolling hashes (a065aa0)
Add: Fingerprinting drafts are back :) (9ab4a2b)
Add: Draft parallel library backend (94147ed)
Add: Long haystack CUDA kernels for find-many (51c9171)
Add: find_many minimal counter & benchmarks (3c615d9)
Add: bench_find_many.cpp (2d594e4)
Add: CUDA Aho Corasick placeholder (d087afa)
Add: Parallel Ice Lake variants (6aaf16c)
Add: NW and SW for Hopper (e53c8d5)
Add: Hopper Levenshtein kernels (cb7c48d)
Add: Affine gaps Levenshtein on Kepler (f09bbf9)
Add: Ice Lake Affine Levenshtein kernels (51afad5)
Add: Affine Levenshtein variants on GPU (3e1df93)
Add: sz_similarity_gaps_t enums (5b410b8)
Add: Concepts & new multithreading scheme (6be0da5)
Add: Draft StringCuZilla C API (1975351)
Add: Draft executors for StringCuZilla (e1a37de)
Add: C++20 concepts (f7b091c)
Add: Unmasked Ice Lake Levenshtein (2b4fa09)
Add: Affine Levenshtein-Gotoh variants (25ab3b6)
Add: Draft global scoring in CUDA (07196d0)
Add: Affine gap extensions baseline (945613c)
Add: Draft device-wide similarity kernels (871d7bc)
Add: Levenshtein on Kepler (938bf6f)
Add: Parallel multi-needle search with OpenMP (c8dc314)
Add: Draft parallel substring search (cf53b9b)
Add: Overflow-risk error codes (783cfa8)
Add: Random access ranges (0a26059)
Add: safe_vector, safe_array (ac6218a)
Add: Immutable iterators for Arrow tapes (b70096b)
Add: indexed_container_iterator (a5ca2ac)
Add: Multi-pattern exact substring search (a83cdb5)
Add: NW benchmarks on GPU (1879aeb)
Add: Multi-flag sz_caps (f76fa40)
Add: Repeat similarity benchmarks on CPU (f64fc56)
Add: gpu_specs and cuda_status_t (c36d2b8)
Add: Ice Lake similarity kernels (3c8d181)
Add: Fetching Nvidia GPU specs (d0bdd17)
Add: New batch similarity benchmarks (aca3301)
Add: Warp-shuffle optimizations (4a62715)
Add: Mem consumption CUDA tests (53d3e0d)
Add: New NW and SW GPU kernels (ca0fece)
Add: Blosum62 and NUC.4.4 matrices (8a99373)
Add: Separate parallel & serial tests (3cb8dd0)
Add: Parallel SW & NW scoring in OpenMP (85b9675)
Add: Thrust-like constant_iterator (6751e79)
Add: arrow_strings_tape::try_append (6d7f221)
Add: CUDA scoring benchmarks (7ac0fdd)
Add: Baseline NW and SW alignment with ~O(NM) space (044c7cc)
Add: Horizontal scoring in OpenMP (02a2481)
Add: Local alignment adaptation (109d8de)
Add: Parallel GPU kernels (2861a85)
Add: OpenMP score_diagonally (427d5b5)
Add: lookup transform in Rust (3090472)
Add: OpenMP C++ draft (6f8cdb9)
Add: Stateful hashing in Rust (763538e)
Add: Expose find_byteset in Rust (667ea91)
Add: Set intersections in Rust (06784fc)
Add: Sorting in Rust (e467649)
Add: New memory benchmarks (dfd6ddf)
Add: Draft sz_find_sve (d7ede5d)
Add: Faster find_byte with SVE (56a49c8)
Add: New string similarity benchmarks (c12e6c4)
Add: Short string hashing in SVE2 (a007c7c)
Add: SVE & SVE2 bytesum (d6f87d7)
Add: SVE2 macros (9fbdc9a)
Add: Intersection benchmarks (e860af0)
Add: SVE backend for sorting (148b615)
Add: All new benchmarking suite (4744406)
Add: Comparisons in SVE (c31020d)
Add: Arm NEON hashing (4b3847d)
Add: Missing SVE placeholder definition (63f0368)
Add: C++ argsort, intersect (5ea0698)
Add: status_t for errors in C++ (63daa5f)
Add: Feature-extraction placeholder (de62723)
Add: Intersections on Ice Lake (ea5dc76)
Add: Serial JOINs (c7b841e)
Add: Dispatched version API (9a32744)
Add: Fetching dynamic library version in C (3538e97)
Add: PRNG for Haswell & serial backend (6659aa0)
Add: Streaming hashing on Ice Lake & Skylake X (2607d45)
Add: Streaming hash benchmarks (8ac3a23)
Add: Hashing on Haswell & Skylake-X (3c345bc)
Add: Missing sz_sequence_t helpers (dc7c109)
Add: Sorting placeholders & dispatch (cc98389)
Add: sz_sequence_argsort_ice (69d4ecb)
Add: AES-based hash placeholders (cb18c78)
Add: Smaller Sorting Networks (cd6859a)
Add: String sorting tests for different lengths (c670ccd)
Add: Separate Skylake-X & Ice Lake checksums (554f50d)
Add: New Levenshtein distance kernels (43471aa)
Add: Missing Rust interfaces (1765f33)

Patch

Docs: Pre-release stats update (c08c6c3)
Make: Upgrade setuptools in CI (492ecc0)
Improve: Allow RuntimeError for engine calls (f9cbb00)
Docs: Section titles (755f583)
Improve: PyTest invalid input arguments (f5e46d5)
Improve: All new similarity-scoring benchmarks (3713e72)
Make: Option to disable sanitizers for masked IO (f880621)
Make: Default to Python 3.12 for better itertools (f2704d7)
Fix: Naming scorers in Python like in Rust (d0bc604)
Fix: Avoid SWAR on big-endian (230e354)
Improve: More big-endian SWAR tests (7a4b78f)
Fix: Drop old similarity APIs in benchmarks (230fc13)
Make: Packaging for Py & Go (4c819d6)
Improve: Byte-set counting PyTests (bd692d2)
Docs: What to know about CUDA (15349c6)
Improve: Str_like_* naming convention (5fdd8ee)
Fix: Merge artifacts & lifetime annotations (c71e67e)
Docs: Wording & AI dashes (4a06f3f)
Make: Packaging for NPM (0d95f13)
Make: Drop long-deprecated .releaserc (8b5af6e)
Make: Enable SIMD in NodeJS builds (c8f6a49)
Improve: Polish parallel string test names (61cc860)
Make: Reuse SIMD compilation flags in build.rs (d129f95)
Make: Missing AES definitions for lib builds (e951c4d)
Docs: Hashing sections for each SDK (cb4fe1b)
Improve: Expose .capabilities to JS (c557a62)
Improve: Test incremental hashers (7c6d37d)
Fix: Self-move construction of basic_string (5fb0e74)
Fix: Strict aliasing violation (a5a2421)
Fix: __cpp_lib_string_resize_and_overwrite test guards (cb2d6a8)
Docs: How to use parallel algorithms (ecad156)
Fix: Stricter following of SZ_AVOID_STL (c3040b7)
Docs: JS Quick Start (c923ed2)
Make: Publish to NPM (b2cedc4)
Make: Drop CodeQL noise (7c9caac)
Docs: Describe dynamic dispatch & linking (e6f2c02)
Improve: NodeJS groundwork & corner-case tests (#151) (00d75f5)
Fix: _MSC_VER to __GNUC__ conditions (83492ac)
Fix: Unknown pragmas in MSVC (#231) (78bbc11)
Make: Parallel algorithms CI/CD for PyPI (9b8c466)
Make: Ignore formatting blames (b69206f)
Make: JSON & YAML uniform formatting (320bddd)
Make: Explicit CodeQL coverage in CI (e9fb38d)
Fix: Match new C-level DeviceScope behavior (f217b86)
Fix: Sorting on big-endian s390x (8d2d9c8)
Fix: Ruff-statically suggested issues (edaff0e)
Improve: Disable E722 import exception warning (3aedfb2)
Fix: Check for immutable Py buffers (e8c437e)
Improve: Test PRNG in Py and boundary Strs sizes (3d76e8f)
Fix: np.random v1 vs v2 compatibility also in szs. (5f8ac03)
Fix: Prevent PyTest from parsing invalid UTF-8 (30c8935)
Fix: Don't repeat seed-ed fuzzy tests (efceb0b)
Fix: np.random v1 vs v2 compatibility (aea096b)
Make: Forwarding SZ_IS_QEMU_ (d433db0)
Fix: Minor logical inconsistencies & unused vars (679f7b9)
Improve: Disable SVE in QEMU runs (072ee2e)
Fix: Avoid sys.getrefcount tests on PyPy (80fa90e)
Improve: Check rich comparisons before sorting (a21c44d)
Improve: Session-scope fixture for PyTest env logs (3d84812)
Improve: Fuzz PyTests and log environment (6545a04)
Make: Respect env-vars for -arch (6a450f6)
Make: Upgrade GitHub actions (dec93d0)
Make: Avoid universal builds defaults for pip install . (c462f42)
Improve: Guard compiler pragmas (42ad14d)
Make: Bump FU to avoid missing +wfxt target (337257b)
Make: Detect CPU AES support on Arm (ae74d44)
Make: Reinstall pre-packaged CMake on macOS-14 (24967c7)
Fix: Fall-back CPU alloc for fingerprints (b582d5d)
Make: Drop macOS Universal builds (e5cfb08)
Fix: Sorting difference on 32/64 bit machines (0e17a3d)
Make: Respect MACOSX_DEPLOYMENT_TARGET (95a95fe)
Make: Avoid fail-fast for Python pre-release wheels (a7c3f04)
Fix: Win32 compilation issues (f792366)
Improve: Test against Affine Gaps (65d323e)
Improve: Avoid many unified memory re-allocs (d0db2d4)
Fix: Intersect scopes HW capabilities (e4245b7)
Make: Drop Python 3.7, require 3.8+ (c65bf5e)
Make: Verbose PyTest logging in CI (c273d52)
Fix: OS-feature-gate AVX checks (cd81b4a)
Fix: Linking to 64-bit symbols (7da4dac)
Make: Log HW caps in CI before tests (2c02d5b)
Fix: Type-casting on MSVC (a347dcd)
Make: Irrelevant links & comments (d6221e6)
Make: Target Hopper 90a in Py & C (c317280)
Make: Outdated 64-bit detection envs (72d0f4b)
Docs: Wording typos (cf1623c)
Make: Missing affine-gaps dep (fa52363)
Docs: No Alpine flow on release CI (f214169)
Fix: Argument order (10dff6c)
Make: Can't read SWIFT_VERSION for container (a4015f1)
Improve: Differentiate capabilities_mode in PyTest (2bee1e4)
Fix: Dispatch serial code for bytes_per_cell <= 2 (48b4406)
Improve: New error codes for CPU/GPU interop (6e597ca)
Fix: Avoid UB assigning i8x256x256 matrix (7ac3b83)
Make: Bump FU due to sign conversions warnings (cbd160a)
Fix: Detect missing GPUs at runtime (0bc0f8a)
Improve: Formatting Swift (3d6669c)
Improve: Test against affine-gaps (48a538c)
Improve: Reduce sign-casting issues (a340131)
Fix: Check CUDA in szs_capabilities (42f043c)
Improve: Build & test reproducibility (446e14e)
Make: Override /std:c++ for MSVC (930b1f0)
Make: Add CUDA to GitHub CI (2a4552c)
Make: Workaround CI issues (46410c6)
Make: No bare builds on Windows & macOS (b42a340)
Fix: MSVC compilation issues (5f60fe4)
Improve: Simplify setting thread-counts (20bbe22)
Make: Install sz before szs in CI (9033649)
Make: Skip x86 intrinsics in universal builds (ce6cab2)
Fix: Inferring Ice Lake similarity kernels (a65dc99)
Fix: Disambiguate szs_ symbols (1fd5db8)
Make: Override VS Code compiler choice on osx (d06d4e4)
Make: Avoid SVE builds on macOS (abb970b)
Fix: Wrong SZ_DYNAMIC_DISPATCH check (acdab3f)
Make: Preinstall wheel in CI (62f8c98)
Make: Bump tapes to 4.0 (72fd6be)
Fix: Avoid forced inlining for HW flags (4cef520)
Fix: Feature-checking STL (b1418b5)
Fix: Missing allocator_traits include (44b8d72)
Fix: Workaround for static_cast (b6a4cc6)
Fix: Avoid unaligned XMM loads (0f840bb)
Fix: static_cast to standard for MSVC (43c953f)
Make: Avoid uv in GitHub CI (89ed74d)
Fix: Unused variable in group_by (3761107)
Fix: Unused qsort on MacOS (cd263ae)
Improve: Unaligned loads in serial hashes (5e9d488)
Make: Parallel backends CI (352b48d)
Make: Referencing old tests (7c4fe32)
Make: Caps introspection flags on Arm (d6c7cf3)
Fix: Converting to string views (6b00ab9)
Fix: Unused symbols (e825ed8)
Fix: Fetching engines ::capability_k (ddc640b)
Fix: uninitialized intersection count (ef3ca96)
Make: Disable NUMA by default (b27552d)
Fix: Guard SVE checks for cross-compilation (dac3941)
Make: Install Git on Alpine (1ff0c15)
Make: Log Alpine version (7ea327e)
Fix: Type-casting seed on Clang (4598f42)
Make: Bump Fork Union to 2.2.2 (e674413)
Fix: Deprecate levenshteinDistance in Swift (70c4add)
Fix: Passing StringZillas doctests (529fb76)
Fix: Allow NULL allocator args (b0c33bd)
Make: NVCC flags for Rust (6b38ea2)
Fix: Report requesting 1 CPU core (4ef0464)
Improve: Passing StringZillas.rs tests (34bb89a)
Improve: Use StringTape for GPU backends (87a0767)
Docs: StringZillas C API (34f4137)
Fix: Memalloc initialization on MSVC (#230) (a6e0a77)
Docs: Drop OpenMP and old name (7f45118)
Improve: Infer capabilities from DeviceScope (a807eba)
Improve: Introspect sz_device_scope_t (d9100a3)
Make: Consistent -O2 optimization (5fcde22)
Improve: Drop unused info1 (f4d4a76)
Fix: Rendering byte strings in Python (cf73d79)
Fix: Forward errors from sz_rune_parse (474dec4)
Improve: PyTest different MinHash dimensions (b5060a8)
Fix: Expose value_type for CUDA fingerprinter (7724886)
Fix: Fingerprinting memory management (f8dea13)
Fix: to_span compilation (e7fdd98)
Improve: Wrap high-dim fingerprints (bbf30d2)
Fix: Handling empty strings in arrays (517757c)
Improve: More readable PyTest (a6d3ed2)
Fix: Checking for Ice Lake caps (4f7649a)
Improve: Simpler & slower Py args parsing (36745f4)
Improve: Propagate error message to Py (711bd63)
Improve: Comparing 2 mem-allocators (7870cd3)
Fix: Skip missing affine_levenshtein_utf8_ice_t (1e112c6)
Make: Custom CudaBuildExtension for Python (4abf63c)
Make: Option to disable CUDA builds (84492e5)
Fix: Refer to prong_t in executor concepts (57d4ec8)
Fix: MSVC & Clang compilation errors (31fcdb3)
Make: Avoid OpenMP in builds (183ea96)
Fix: Announce LevenshteinDistancesUTF8Type (cb81c00)
Improve: Cache hardware capabilities (b9a1109)
Improve: Export capabilities as a tuple (0223b62)
Improve: Printing CUDA caps (94d8c36)
Fix: Match Apache Arrow layout (38c87b4)
Improve: Type-casting seeds in Strs.sample (5ac53c3)
Improve: Constructing Strs from PyArrow (7aebef4)
Fix: Track ownership of Strs offsets (65380ca)
Improve: Expose sz_capabilities in non-dynamic builds (6206eb4)
Fix: Avoid depending SZS -> SZ (c411e37)
Docs: Mark programming languages correctly (d1fc68c)
Make: Bump C++ & CUDA to 20 for libs (320f3da)
Fix: rebind_alloc in C++20 (492d726)
Fix: Tautological compare check (b924d90)
Improve: Random-access similarity outputs (c5e778d)
Make: Building parallel Python packages (0f44c25)
Fix: Clang build warnings (7741272)
Fix: Using braces for Clang builds (325cedb)
Make: Pull submodules in CI (5bb90b3)
Fix: Avoid _mm256_cvtepi64_epi32 on Haswell (014002e)
Make: Separate parallel library sources (ff019b1)
Make: Forward march flags through NVCC (496ae84)
Make: Move CUDA lib into header (d4a66c5)
Make: FMA flag for Haswell (2e1daa4)
Fix: Compilation of all C targets (a2b228c)
Make: Compiling StringZillas shared libs (6ac80e8)
Improve: gpu_specs_fetch & GPU args order (3d7b491)
Docs: Sync description one-liner (ab9b617)
Improve: Draft parallel fingerprinting API (e49f570)
Improve: Runtime variable window widths (031d067)
Make: Rename lib.rs (5d82454)
Docs: Reuse operators state (e5e2702)
Improve: Naming multi-input processors (a3c3510)
Fix: Scramble results between fingerprint benchmarks (fbf7203)
Make: Format Python to 120 columns (640b7c4)
Improve: Naming internal symbols (3b48c93)
Improve: Unroll CUDA fingerprints (37f3d80)
Docs: Refresh Python benchmarking suite (49cf4ea)
Fix: Weird compiler bug related to cuda_status_t (0a0955e)
Fix: Fingerprinting in CUDA (52b1d73)
Fix: Estimating hash counts in fingerprints (058af71)
Improve: Unroll & parallelize fingerprinting (cda36fd)
Fix: Inferring the prong type of executors (531b1e9)
Improve: Align thread-pool within stack-frame (b1077a4)
Improve: Wording inconsistencies (afaf11b)
Make: Launchers for Parallel C++ benchmarks (9f3beac)
Fix: Fingerprinting via Skylake extensions (08c1e86)
Fix: Passing fingerprinting builds (44a058d)
Improve: Include hash counts in fingerprints (0d9ba5b)
Improve: Consistent kernel naming without underscore prefixes (05725b2)
Improve: Expose floating-point SIMD states (4a57789)
Fix: Consistent barrett_mod in C++ & Python (1fc1cab)
Docs: Using uv for tests (f86dfaa)
Fix: Choosing co-primes with std::gcd (a1b3001)
Improve: Using fast calling convention for CPython (97ab23c)
Docs: Show higher recall with better hashes (255d443)
Improve: Ensure seed affects hashes (78d39f9)
Improve: Separate StringZillas Python code (883a3cd)
Fix: Fingerprinting compilation (19f92c4)
Improve: Explore Min-Hashes (46dd7d0)
Improve: Test fingerprint equivalence (e4aa3f7)
Fix: is_same_type usage over std::is_same (0ab5710)
Improve: Ignore previous UB commit in blame (7fc7323)
Fix: Avoid UB with underscore prefixes (74e3b6f)
Fix: sz_bitcast strict aliasing (80b97de)
Fix: Avoid std::swap in device code (c0aea26)
Fix: C++17 compatibility issues (7ea685d)
Fix: Guard C++20 concepts use (60763b3)
Fix: Backport std::remove_cvref to C++17 (ffee12b)
Improve: Move safe_vector (78d8c96)
Fix: Limit constexpr use in C++11 (866e2f2)
Fix: Minor build issues (46a6d63)
Improve: Extend dummy executors API (498d72a)
Fix: Wrong Fork Union class name (5def3af)
Improve: Compile-time-known span extents (8355b6e)
Improve: Move arrays_equality (e96d26f)
Improve: Merge fingerprinting drafts (cf6077e)
Make: Deprecate Find Many kernels (a6f799f)
Improve: Extend find_many tests (c80ce60)
Improve: Upgrade Fork Union (fb5f429)
Docs: Similar wording in "Explore Levenshtein" (766e250)
Fix: Correct namespaces for scripts (aafcbbd)
Fix: Replace +g with +m,r like GB (40bd3ed)
Fix: Wrong boundary conditions for count_many_parallel (0b22dd9)
Improve: Switch to "StringParaZilla" naming (0f3f928)
Improve: Cleaner haystack splitting (3e93b75)
Improve: Prioritize find-many tests (7a433a9)
Improve: SZ_DYNAMIC attributes (2f0334a)
Fix: Forwarding dataset nothrow-copyable views in tests (170b61b)
Improve: Use CUDA Atomics to aggregate globally (cf7d9a5)
Make: Missing bench_search target (1133cce)
Fix: find_many.cuh compilation issues (72affc2)
Fix: AC dictionary try_assign with different alloc (4f9d2db)
Fix: Warning around immutable span conversion (d88842a)
Docs: Target names (503e470)
Fix: Match C++ class names (966dd09)
Docs: Sections & links (f5a94b1)
Make: Bump fork_union (3f5de03)
Fix: Revert to separate find_many algos for different length (0acbeeb)
Improve: __reduce_max_sync in SW on Hopper (c45017d)
Fix: unified_alloc propagation (13e0201)
Docs: Move segmenting & features drafts (3972cb3)
Fix: Compiler over-optimizing bench_find_many (16f5fa4)
Fix: Prefix length in parallel counting of short needles (69c7b6a)
Fix: Including the entire haystack into match (3ee1676)
Fix: Buffer overflow due to wrong thread count (887c16b)
Improve: Skip vocabulary duplicates (5302a4d)
Improve: New multi-pattern search APIs (20c9135)
Fix: Missing std::generate include (853a0fa)
Improve: Multi-byte characters support (85c5bf8)
Improve: Propagate allocators in safe_vector (0c442d1)
Improve: Custom validators for nullary benchmarks (0cf994a)
Improve: Benchmark early exit (aaa6927)
Make: Rename bench_search -> bench_find (6f65624)
Docs: Aho-Corasick CUDA design (496e55a)
Improve: New caching primitives (df1f7ef)
Fix: Checking STRINGWARS_STRESS env-var (ade74f6)
Improve: Support executors in multi-pattern search (35fad3d)
Improve: Divergent branches on i16 SW on Hopper (817b15c)
Fix: OOB impact on SW scoring (cd9ff1e)
Fix: NW & SW on Hopper (4134e44)
Improve: More noticeable signaling in tests (ece416d)
Improve: Scheduling speculative kernels (8a6a185)
Improve: Run multiple warps per block (3b2f263)
Fix: Comparing Affine benchmark results (64d8f4d)
Improve: Fuzzy test Ice Lake kernels (d603f7d)
Improve: Shrink Affine Ice Lake kernels (79e7a2f)
Fix: Affine top row initialization (b9e4160)
Fix: Levenshtein w. Affine costs on GPU for zero-length ins (076e58a)
Improve: Test weird affine gaps (3bbebc8)
Improve: Match fork_union API in executors (1d6f58d)
Improve: Use fork_union pools (81463d4)
Docs: Inconsistent naming (8b62f2c)
Make: Add fork_union dependency (bd3d341)
Improve: Duplicate bytesum assignment (bfcd10e)
Make: Bump CUDA version (c709621)
Fix: Ice Lake calls for empty inputs (f59ba5e)
Fix: Correcting blends on Kepler (563a73d)
Improve: Scheduling Levenshtein in CUDA (77b1087)
Improve: Avoid views::group_by with callback (6a61e1b)
Improve: bytes_per_cell_t enum (c6e907a)
Fix: Inconsistent timing in bench_unary (60b99ab)
Improve: Bounded methods not supported (23a7f58)
Improve: Use requires clause (22c691e)
Improve: Naming "executor" interfaces (7e778d9)
Improve: Generalize Levenshtein in CUDA (29da524)
Fix: Inclusion guard macro names (8e79fb7)
Docs: More datasets on HuggingFace (0b33ac3)
Improve: Aligned ZMM diagonal stores (10b5405)
Improve: Branchless K-mask calculation (3998c1f)
Improve: Measure gap magnitude (b4eb6a4)
Fix: Avoid horizontal walker overflow (9b2b4e5)
Fix: All Gotoh baselines (84397ae)
Fix: Initializing affine DP matrices (640853f)
Improve: Differentiate unary & uniform costs (8c15447)
Fix: Horizontal Affine Walkers (e5d85f3)
Improve: Alloc type-size check in safe_vector (9c5a56c)
Improve: Fetch warp-size dynamically (83dd5fb)
Improve: Warp-shuffle reductions in SW (914e98d)
Fix: Over-estimating number of overlapping matches (b85d3ca)
Improve: Faster multi-needle tests (0895d28)
Improve: Splitting jobs in baseline multi-search (d47d849)
Fix: Slicing corner-cases in OpenMP (7bcc803)
Improve: Use std::execution for baseline tests (f228264)
Improve: Parallel baseline for substring search (6ebc7b0)
Fix: bytes_per_core_optimal estimate (83bc966)
Improve: Pointer-constructible spans (a0fd136)
Fix: Passing multi-needle tests (faad971)
Fix: Calculating find_many_match_t properties (d0ebee8)
Fix: Indexing needle IDs (42fa08d)
Fix: Use smaller types in BFS queue (0c6dd00)
Fix: Aho Corasick construction (e27f86f)
Improve: Consistent shuffling behavior in benchmarks (d4d55fa)
Fix: sz_copy_skylake tail handling on large input (#222) (6da5e1e)
Fix: Propagate substitutions to benchmarks (eabe605)
Fix: Underutilized 99% of the H100 (7a9a243)
Fix: Using OpenMP directives (41e1a6e)
Fix: Forward GPU specs in CUDA tests (9d86d4c)
Improve: Generalize memory requirements estimates (7bd92c5)
Fix: Missing STL includes (f7d365a)
Improve: Use CUDA constant memory (cebd180)
Improve: Forward-declare substituters (0a04960)
Improve: Show signed integers in SIMD types (382c05d)
Fix: i16 Ice Lake NW/SW alignment (8cc9794)
Improve: Catch & log exceptions (cb739e2)
Improve: Better UTF8 tests for similarity (b04e934)
Make: Parallel test launchers (1ece547)
Fix: Build issues (9db0287)
Fix: Included filename (ee38a22)
Fix: Uniform costs for UTF-32 runes (6edcf7f)
Improve: New template SFINAE in similarity.hpp (10279f9)
Fix: Initializing horizontal aligner (91df0ce)
Improve: Use tagged C enums (35ba76c)
Improve: Allow custom validators in benchmarks (bd2a21d)
Fix: Compiling constant_iterator in CUDA (bc59ee3)
Fix: std::allocator::rebind deprecated (4c87404)
Fix: Avoid std::iterator dependency (b3db596)
Make: Move similarity benchmarks (90bfcb6)
Fix: Report error codes in tests (05f17a6)
Make: Separate Parallel C++ and CUDA tests (b175103)
Make: Rename test files (2eeab83)
Fix: Passing CUDA similarity tests (4c75d81)
Fix: NW/SW test correspondence (fea39ed)
Improve: Differentiate min/max-imizers (c96c5ed)
Improve: Annotate throwing exceptions (2ef667c)
Fix: Missing sz_i16_t definition (238c86d)
Fix: Avoid similarity scoring references (e4b1bbd)
Improve: Shorter type aliases (3ef1d26)
Make: Revert to C++ for core tests (0c6ff1f)
Docs: StringCuZilla design choices (f42aa85)
Make: Separate StringCuZilla (c5fd4bc)
Improve: Move capabilities to types.h (1c1582f)
Make: Compile with OpenMP (1671b0f)
Improve: Allow datasets in VRAM (b1c9a74)
Fix: Shuffle datasets with over 4B tokens (247c6ec)
Fix: Overflow mean_token_length calculation (efcadd1)
Make: NVCC kernel debugging symbols (0c0ff42)
Fix: Arrow-like string array (fa4b0f4)
Fix: Overwriting alignment scores (668a386)
Fix: Synchronizing CUDA kernel launch (3b85a00)
Fix: Shared memory requirements (27ad8f5)
Make: NVCC can't handle fsanitize (ea7647f)
Make: Draft CUDA compilation (1b96ef4)
Fix: Hardening malloc(0) behavior (7524882)
Improve: Share C++ macros and typedefs (e82d045)
Fix: Accounting for different gap costs (e4e517f)
Fix: NVCC warning for negative size field (a6c0fa2)
Fix: Track capacity in fixed buffer alloc (64b40a9)
Fix: Sign-cast warning in _mm256_set1_epi64x (aac2e8f)
Fix: Overriding SZ_DEBUG macro (bca734a)
Fix: Calling unused helper struct unit tests (d369170)
Improve: Cleaner API for OpenMP (bc311b3)
Fix: Shifting Levenshtein diagonals in OpenMP (6b5ef98)
Fix: Unaligned loads/stores of hash state (37863c9)
Improve: Expose rune-parsing headers (2727a87)
Fix: Diagonals depth (ef53f75)
Docs: Showcase indexing diagonals (b55c696)
Fix: sz::lookup examples in Rs (778d4f0)
Fix: Compiling SVE on MacOS (42270c8)
Improve: Inline cheap calls (28282d2)
Make: List scripts/ deps for uv (1907d2b)
Docs: Formatting (dd57536)
Docs: uv instructions (9460fd4)
Fix: Unaligned sz_hash_state_t stores (7e65a1e)
Improve: Align inner hash-states (1b3cdd5)
Improve: Use GiB over GB (811fc59)
Improve: Construct Byteset::from_bytes (34660f2)
Fix: Remove missing Ice-Lake benchs (0c08564)
Fix: Compiling Py bindings (7e08180)
Make: CMake formatting (44485fb)
Docs: Formatting and references (928fd79)
Fix: No return (4cb096b)
Docs: Listing bench details (343b858)
Make: Patch passing SZ_USE_SVE definitions (45b15b0)
Fix: Expanding feature-detecting macros (75ef77e)
Improve: Bold benchmark names in CLI (2ab635e)
Improve: fill_random checksums in benchmarks (0bba772)
Improve: Simpler SVE find nested loop (a1604b7)
Improve: Unrolled serial hashing (77e482e)
Fix: find_sve mask update on long needles (a493ab8)
Improve: More simple substring search tests (68449fd)
Fix: bench_sequence CMake target (905749c)
Fix: Drop double negation in logging (3d77ec6)
Improve: Use SQINCP in SVE for increments (efda23b)
Fix: Dispatching SVE kernels (06ea5f7)
Docs: SVE2 intersects TODO (fafb8b0)
Improve: do_not_optimize token-level results (d1a3779)
Make: Rename bench_sort (991a78b)
Improve: Naming benchmark names (0074ad7)
Improve: Better sorting benchmarks (7d534fb)
Fix: Computing improvement percent (a5de795)
Improve: Faster equality checks on NEON/SVE (20f35c7)
Improve: New token-level benchmarks (12e1edd)
Docs: Describe trivial types (af686dd)
Fix: Naming byteset signature (9676cdb)
Improve: Naming "vtable" entries (b9794e5)
Make: Upgrade to C++20 for benchmarks (aa7f275)
Improve: New-style "container" benchmarks (3f1c723)
Fix: Reverse order std::search offsets (244e605)
Docs: Ignore formatting CMake (366816e)
Make: Formatting CMakeLists.txt (467b4b8)
Fix: Extra comma in printf (298d214)
Docs: Outdated function naming & spelling (92b9a56)
Improve: Token benchmarks (3b1897e)
Improve: Logging in container benchmarks (4d955d3)
Fix: No intersect for Skylake (48d70ea)
Fix: Revert to XMM on Haswell (4bec1e5)
Fix: Composing STL collections (f9da4ed)
Fix: std::string::data is mutable only since C++17 (ff23c3d)
Improve: Discard state in streaming hash (828263f)
Improve: Discarding buffer in streaming hashes (c4f7a0e)
Improve: Separate PRNG backends in benchmarks (af54e93)
Fix: Guard Skylake benchmarks (2965502)
Fix: Unused _sz_capabilities symbols (8bb90e5)
Fix: sz_intersect signature (f712de3)
Make: Don't build stringzilla_bare on MacOS (a7b35ba)
Fix: Variable in C++14 constexpr (feb415f)
Fix: Unused Levenshtein tests (90540d3)
Fix: find_1byte signature compatibility (d19e8b8)
Improve: Fix minor inconsistencies (f656577)
Docs: Exploring perfect Unicode hashing (197cd87)
Improve: Test set intersections (1d95601)
Fix: Randomization benchmarks (8dc4a2c)
Docs: Formatting (5c02c4e)
Docs: Details on the Unicode range (b6e4406)
Docs: Ignore C++ docstring updates blame (407dd2d)
Docs: New formatting in C++ (0d982a4)
Fix: Passing sz_sequence_t::handle (75fabf1)
Improve: Remove redundant comments from sz_hash_state functions in Rust (9fe25df)
Improve: Expose sz_lookup (471b002)
Improve: Expose sz_hash_state_init, sz_hash_state_stream, and sz_hash_state_fold to Rust (1757e4e)
Improve: Exposed sz_move, sz_fill, and sz_copy for Rust (b2085cc)
Improve: Inline most common Rust APIs (a30b5b7)
Make: cibuildwheel env variables (fbf256a)
Make: Decremental Rust builds (8877c82)
Fix: Detecting caps in dynamic builds (d52bf63)
Fix: fill_random test condition (8b396c8)
Fix: Compilation of all bindings (2caefac)
Make: Drop unused build.sh (2bbafa1)
Improve: Testing hash functions (80688bb)
Fix: Passing new hashing tests (268af53)
Improve: copy/move on Haswell with interleaving (69dfa10)
Docs: Announce JOINs (2225488)
Improve: Ordering includes (6e71536)
Improve: Vectorize sz_equal_haswell (7aad4bb)
Docs: Explaining compare.h operations (d7bab8d)
Improve: Clean memory.h header (7698392)
Improve: Use default allocator, when not provided (5a12c00)
Docs: Disable sorting includes (8bc161f)
Fix: Ice Lake partitioning logic (1da0e2b)
Improve: Expose Insertion-sort helpers (a38867f)
Fix: Merge-step bug in stable sort (db61d93)
Improve: Introduce typed _sz_swap macro (dcf6c65)
Improve: Rename sz_sort to sz_qsort (6191cc6)
Fix: sz_sort_serial passes tests (8bad799)
Fix: uniform_int_distribution lower bound (bdee111)
Fix: sz_sort_serial passes for same length inputs (0fda5a5)
Improve: Drop hybrid sort code (50d8291)
Fix: Underflow in serial sorting (5970fa4)
Make: Recommend pretty-printing GDB symbols (a818f97)
Fix: uniform_int_distribution upper bound (17f28a3)
Fix: In C++11 constexpr constructor must be empty (13bace2)
Fix: Sorting benchmarks for new API (66f2ac9)
Improve: Separate fingerprinting benchmarks (187e0bd)
Make: Renamed temp-git-split-file -> scripts/bench_token.cpp (031bedf)
Make: Renamed scripts/bench_token.cpp -> temp-git-split-file (07d2239)
Make: Renamed scripts/bench_token.cpp -> scripts/bench_fingerprint.cpp (a0318eb)
Docs: Signatures and typos (982dd4d)
Improve: Wrap std::accumulate for checksums (bce107a)
Improve: Validate checksums in benchmark (abe8d07)
Fix: Tail sum order in checksum_haswell (b20d7cd)
Fix: Infer allocators value_type (5bbd971)
Fix: Tail handling in sz_checksum_haswell (84cb4c8)
Fix: Loops in AVX-512 checksums (4044855)
Fix: Loop in sz_checksum_haswell (509b58b)
Improve: Relax many constexprs from C++20 to C++14 (0a3e363)
Make: Move drafts (1de3166)
Make: Renamed temp-git-split-file -> include/stringzilla/hash.h (5a36cb7)
Make: Renamed include/stringzilla/hash.h -> temp-git-split-file (7052266)
Make: Renamed include/stringzilla/hash.h -> include/stringzilla/fingerprint.h (0ef7cf1)
Docs: Spelling usnigned (d18a159)
Improve: hybrid bench sort performance (9880f26)
Fix: hybrid bench sorts loading initial string bytes incorrectly (455508f)
Fix: stable sort bench tests failing (821d19e)
Fix: Minor dispatch issues (d20e589)
Improve: Faster levenshtein_baseline (d9557d3)
Fix: BMI flags for BZHI (fa47deb)
Fix: Masks back to using BZHI (bd7054e)
Make: Library namespaced aliases (f3811d7)
Fix: sz_u512_vec_t members visibility (2007d49)
Fix: Bounded Levenshtein returns (749b0d8)
Fix: Skylake dispatch (48e0913)
Fix: Linking stderr (084d653)
Docs: Formatting docstring (c99daf3)
Fix: Initializing basic_charset (864ee03)
Fix: Correct basic_charset operator (#203) (e20d207)
Improve: Ignore 40 commits in blame (064829a)
Fix: Overriding LibC in 32-bit Windows (645539b)
Improve: C++ version macros naming (19c2ae9)
Make: Detect Apple Universal builds (6d61c21)
Make: Rename stringzillite to stringzilla_bare (364e2ca)
Fix: Symbols names & visibility (406bf0f)
Fix: Haswell compilation flag (00f27f6)
Fix: Filter compare.h file (6512f1d)
Make: Split ./include/stringzilla/find.h to ./include/stringzilla/compare.h (fc9e5d6)
Make: Split ./include/stringzilla/find.h to ./include/stringzilla/compare.h (49e8d9d)
Make: Split ./include/stringzilla/find.h to ./include/stringzilla/compare.h (fc408fa)
Fix: Partially filter stringzilla.h file (41e5917)
Fix: Minor macro mismatches (5f7ca59)
Fix: Filter types.h file (b835051)
Fix: Filter sort.h file (1ba7982)
Fix: Filter small_string.h file (5b55e19)
Make: Separate builds for Skylake & Ice (4a1f03c)
Improve: Platform-specific equality checks (8b44d6a)
Fix: Filter hash.h file (be4c63d)
Fix: Filter similarity.h file (8b401bd)
Fix: Filter memory.h file (295d49a)
Fix: Filter find.h file (2a1fcd1)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/memory.h (2f76521)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/memory.h (45e57ee)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/memory.h (66778d6)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/sort.h (c357c3e)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/sort.h (cbfe5c7)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/sort.h (085d2d3)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/small_string.h (3464cb4)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/small_string.h (89c4681)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/small_string.h (3f9c248)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/similarity.h (e23c35f)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/similarity.h (10d829e)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/similarity.h (d74e5dc)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/hash.h (1f60e6d)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/hash.h (08d0a20)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/hash.h (9e9f256)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/find.h (974ed78)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/find.h (14ba3bf)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/find.h (9e577be)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/types.h (8cb0742)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/types.h (22e3d1e)
Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzilla/types.h (ecb3775)
Fix: Wrong env. variable names (d0678f8)
Make: Inline ASM for detecting CPU features on ARM (0ee549a)
Fix: Default Levenshtein upper bound (62ca6a0)
Improve: Levenshtein functions for unicode (d3b423a)
Docs: Levenshtein tutorial in Jupyter (715ad10)
Fix: sz_look_up_transform_avx512 declaration (585f7d5)
Improve: #pragma region dashes (fe4449b)
