- Non-assembly (pure Go) code is up to 40% faster.
- AVX512 can use multiple goroutines for lower latency and higher per-operation throughput.
- AVX512 is 5-9% faster.
- All code is faster with user-defined goroutines and high concurrency: up to 8x faster due to fewer cache evictions.
- CPUID detects AMD CPUs with hyperthreading (multiple threads per core).
- CPUID detects the per-CCX L3 cache size on AMD CPUs.
- Use L1 cache size to set minimum split size.
- Tests/benchmarks can disable specific assembly types.
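
To illustrate the idea behind deriving a minimum split size from the L1 cache, here is a minimal sketch. It is not the library's actual code: `minSplitSize`, `splitRanges`, the 1024-byte fallback, and the 64-byte alignment are all hypothetical names and values chosen for illustration. The assumption is that a chunk from every shard should fit in the L1 data cache together, and that work is never split into pieces smaller than that chunk.

```go
package main

import "fmt"

// minSplitSize returns a hypothetical minimum chunk size in bytes.
// Assumption: one chunk per shard should fit in the L1 data cache at once.
func minSplitSize(l1DataBytes, shards int) int {
	const fallback = 1024 // assumed default when the cache size is unknown
	const align = 64      // assumed cache-line size
	if l1DataBytes <= 0 || shards <= 0 {
		return fallback
	}
	size := l1DataBytes / shards
	size -= size % align // round down to a cache-line multiple
	if size < align {
		size = align
	}
	return size
}

// splitRanges divides [0, total) into at most maxGoroutines half-open ranges.
// When total >= minSplit, every range is at least minSplit bytes long;
// the last range absorbs any remainder.
func splitRanges(total, minSplit, maxGoroutines int) [][2]int {
	if total <= 0 {
		return nil
	}
	if minSplit < 1 {
		minSplit = 1
	}
	if maxGoroutines < 1 {
		maxGoroutines = 1
	}
	n := total / minSplit
	if n < 1 {
		n = 1 // input smaller than one split: process it in a single piece
	}
	if n > maxGoroutines {
		n = maxGoroutines
	}
	per := total / n
	out := make([][2]int, 0, n)
	for i := 0; i < n; i++ {
		start := i * per
		end := start + per
		if i == n-1 {
			end = total
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	// Example values: a 32 KiB L1 data cache and 14 shards in total.
	split := minSplitSize(32*1024, 14)
	fmt.Println("min split:", split)
	for _, r := range splitRanges(10000, split, 8) {
		fmt.Printf("goroutine handles bytes [%d, %d)\n", r[0], r[1])
	}
}
```

Under this heuristic, small inputs stay on a single goroutine, while larger inputs are spread across goroutines in cache-sized pieces.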