xxHash 0.8.2
Extremely fast non-cryptographic hash function
Tuning parameters

Macros

#define XXH_NO_LONG_LONG
 Define this to disable 64-bit code.
 
#define XXH_FORCE_MEMORY_ACCESS   0
 Controls how unaligned memory is accessed.
 
#define XXH_SIZE_OPT   0
 Controls how much xxHash optimizes for size.
 
#define XXH_FORCE_ALIGN_CHECK   0
 If defined to non-zero, adds a special path for aligned inputs (XXH32() and XXH64() only).
 
#define XXH_NO_INLINE_HINTS   0
 When non-zero, sets all functions to static.
 
#define XXH3_INLINE_SECRET   0
 Determines whether to inline the XXH3 withSecret code.
 
#define XXH32_ENDJMP   0
 Whether to use a jump for XXH32_finalize.
 
#define XXH_OLD_NAMES
 
#define XXH_NO_STREAM
 Disables the streaming API.
 
#define XXH_DEBUGLEVEL   0
 Sets the debugging level.
 
#define XXH_CPU_LITTLE_ENDIAN   XXH_isLittleEndian()
 Whether the target is little endian.
 
#define XXH_VECTOR   XXH_SCALAR
 Overrides the vectorization implementation chosen for XXH3.
 
#define XXH_ACC_ALIGN   8
 Selects the minimum alignment for XXH3's accumulators.
 
#define XXH3_NEON_LANES   XXH_ACC_NB
 Controls the NEON to scalar ratio for XXH3.
 

Enumerations

enum  XXH_VECTOR_TYPE {
  XXH_SCALAR = 0 , XXH_SSE2 = 1 , XXH_AVX2 = 2 , XXH_AVX512 = 3 ,
  XXH_NEON = 4 , XXH_VSX = 5 , XXH_SVE = 6
}
 Possible values for XXH_VECTOR. More...
 

Detailed Description

Various macros to control xxHash's behavior.

Macro Definition Documentation

◆ XXH_NO_LONG_LONG

#define XXH_NO_LONG_LONG

Define this to disable 64-bit code.

Useful if only using the XXH32 family and you have a strict C90 compiler.

◆ XXH_FORCE_MEMORY_ACCESS

#define XXH_FORCE_MEMORY_ACCESS   0

Controls how unaligned memory is accessed.

By default, unaligned memory is accessed through memcpy(), which is safe and portable.

Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal.

The switch below allows selecting a different access method in search of better performance.

Possible options:
  • XXH_FORCE_MEMORY_ACCESS=0 (default): memcpy
    Use memcpy(). Safe and portable. Note that most modern compilers will eliminate the function call and treat it as an unaligned access.
  • XXH_FORCE_MEMORY_ACCESS=1: __attribute__((aligned(1)))
    Depends on compiler extensions and is therefore not portable. This method is safe if your compiler supports it, and generally as fast or faster than memcpy.
  • XXH_FORCE_MEMORY_ACCESS=2: Direct cast
    Casts directly and dereferences. This method doesn't depend on the compiler, but it violates the C standard as it directly dereferences an unaligned pointer. It can generate buggy code on targets which do not support unaligned memory accesses, but in some circumstances, it's the only known way to get the most performance.
  • XXH_FORCE_MEMORY_ACCESS=3: Byteshift
    Also portable. This can generate the best code on old compilers which don't inline small memcpy() calls, and it might also be faster on big-endian systems which lack a native byteswap instruction. However, some compilers will emit literal byteshifts even if the target supports unaligned access.
Warning
Methods 1 and 2 rely on implementation-defined behavior. Use these with care, as what works on one compiler/platform/optimization level may cause another to read garbage data or even crash.

See https://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html for details.

Prefer these methods in priority order (0 > 3 > 1 > 2).
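
As a minimal sketch, the two portable methods (0 and 3) can be written as follows for a 32-bit read. The demo_ function names are hypothetical and are not xxHash's internal readers; methods 1 and 2 are deliberately left out because they depend on implementation-defined behavior.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reader for method 0: memcpy() into a local.
 * Safe for any alignment; optimizing compilers usually lower it
 * to a single unaligned load. Returns the value in native byte order. */
static uint32_t demo_read32_memcpy(const void *ptr)
{
    uint32_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}

/* Hypothetical reader for method 3: assemble the value byte by byte.
 * Always yields the little-endian interpretation, whatever the host order. */
static uint32_t demo_readLE32_byteshift(const void *ptr)
{
    const unsigned char *p = (const unsigned char *)ptr;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

On a little-endian host both functions return the same value. On a big-endian host the memcpy() read returns the bytes in native order, so a hash that needs little-endian words must still byteswap, while the byteshift version produces the little-endian value directly.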

◆ XXH_SIZE_OPT

#define XXH_SIZE_OPT   0

Controls how much xxHash optimizes for size.

xxHash, when compiled, tends to result in a rather large binary size. This is mostly due to heavy use of forced inlining and constant folding in the XXH3 family to increase performance.

However, some developers prefer size over speed. This option can significantly reduce the size of the generated code. When using the -Os or -Oz options on GCC or Clang, it defaults to 1; otherwise, it defaults to 0.

Most of these size optimizations can be controlled manually.

This is a number from 0 to 2.

  • XXH_SIZE_OPT == 0: Default. xxHash makes no size optimizations. Speed comes first.
  • XXH_SIZE_OPT == 1: Default for -Os and -Oz. xxHash is more conservative and disables hacks that increase code size. It implies the options XXH_NO_INLINE_HINTS == 1, XXH_FORCE_ALIGN_CHECK == 0, and XXH3_NEON_LANES == 8 if they are not already defined.
  • XXH_SIZE_OPT == 2: xxHash tries to make itself as small as possible. Performance may cry. For example, the single-shot functions just use the streaming API.
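
The default selection and the implied options can be modeled with the preprocessor. This is a sketch under assumptions: the DEMO_ macros are hypothetical stand-ins (so they cannot clash with xxhash.h), -Os/-Oz are detected through the __OPTIMIZE_SIZE__ macro that GCC and Clang predefine, and the level-0 fallbacks ignore the platform-specific defaults described in the other sections.

```c
#include <stdio.h>

/* Hypothetical DEMO_ macros mirroring the documented XXH_SIZE_OPT rules. */
#ifndef DEMO_SIZE_OPT
#  if (defined(__GNUC__) || defined(__clang__)) && defined(__OPTIMIZE_SIZE__)
#    define DEMO_SIZE_OPT 1 /* built with -Os or -Oz */
#  else
#    define DEMO_SIZE_OPT 0
#  endif
#endif

#if DEMO_SIZE_OPT < 0 || DEMO_SIZE_OPT > 2
#  error "DEMO_SIZE_OPT must be 0, 1, or 2"
#endif

/* Options implied by level >= 1, applied only if not already defined. */
#if DEMO_SIZE_OPT >= 1
#  ifndef DEMO_NO_INLINE_HINTS
#    define DEMO_NO_INLINE_HINTS 1
#  endif
#  ifndef DEMO_FORCE_ALIGN_CHECK
#    define DEMO_FORCE_ALIGN_CHECK 0
#  endif
#  ifndef DEMO_NEON_LANES
#    define DEMO_NEON_LANES 8
#  endif
#endif

/* Simplified level-0 fallbacks (the real defaults are platform-dependent). */
#ifndef DEMO_NO_INLINE_HINTS
#  define DEMO_NO_INLINE_HINTS 0
#endif
#ifndef DEMO_FORCE_ALIGN_CHECK
#  define DEMO_FORCE_ALIGN_CHECK 1
#endif
#ifndef DEMO_NEON_LANES
#  define DEMO_NEON_LANES 6
#endif

/* Prints the configuration this translation unit ended up with. */
void demo_print_config(void)
{
    printf("size_opt=%d no_inline_hints=%d align_check=%d neon_lanes=%d\n",
           DEMO_SIZE_OPT, DEMO_NO_INLINE_HINTS, DEMO_FORCE_ALIGN_CHECK,
           DEMO_NEON_LANES);
}
```

For example, compiling with -Os would select level 1, while adding -DDEMO_SIZE_OPT=2 overrides the automatic choice.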

◆ XXH_FORCE_ALIGN_CHECK

#define XXH_FORCE_ALIGN_CHECK   0

If defined to non-zero, adds a special path for aligned inputs (XXH32() and XXH64() only).

This is an important performance trick for architectures without decent unaligned memory access performance.

It checks for input alignment, and when conditions are met, uses a "fast path" employing direct 32-bit/64-bit reads, resulting in dramatically faster read speed.

The check costs one initial branch per hash, which is generally negligible, but not zero.

Moreover, it's not useful to generate an additional code path if memory access uses the same instruction for both aligned and unaligned addresses (e.g. x86 and aarch64).

In these cases, the alignment check can be removed by setting this macro to 0. Then the code will always use unaligned memory access. The alignment check is automatically disabled on x86, x64, ARM64, and some ARM chips, which are known to offer good unaligned memory access performance.

It is also disabled by default when XXH_SIZE_OPT >= 1.

This option does not affect XXH3 (only XXH32 and XXH64).
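
A minimal sketch of the alignment fast path, using a hypothetical word-summing loop in place of a real hash round. The branch is evaluated once per call; the aligned path reads 32-bit words directly, and the fallback goes through memcpy(). All names are illustrative, not xxHash's.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true when p is suitably aligned for a uint32_t load. */
static int demo_is_aligned32(const void *p)
{
    return ((uintptr_t)p & (sizeof(uint32_t) - 1)) == 0;
}

/* Unaligned-safe load, used when the alignment check fails. */
static uint32_t demo_read32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Sums nwords native-order 32-bit words (a stand-in for a hash loop).
 * The alignment branch is taken once per call, not once per word. */
static uint32_t demo_sum_words(const void *input, size_t nwords)
{
    uint32_t sum = 0;
    size_t i;
    if (demo_is_aligned32(input)) {
        /* Fast path: direct 32-bit reads. Assumes input points to real
         * uint32_t objects; otherwise this breaks strict aliasing. */
        const uint32_t *w = (const uint32_t *)input;
        for (i = 0; i < nwords; i++)
            sum += w[i];
    } else {
        const unsigned char *p = (const unsigned char *)input;
        for (i = 0; i < nwords; i++)
            sum += demo_read32_unaligned(p + 4 * i);
    }
    return sum;
}
```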

◆ XXH_NO_INLINE_HINTS

#define XXH_NO_INLINE_HINTS   0

When non-zero, sets all functions to static.

By default, xxHash tries to force the compiler to inline almost all internal functions.

This can usually improve performance due to reduced jumping and improved constant folding, but significantly increases the size of the binary which might not be favorable.

Additionally, sometimes the forced inlining can be detrimental to performance, depending on the architecture.

XXH_NO_INLINE_HINTS marks all internal functions as static, giving the compiler full control over whether to inline or not.

This is defined automatically when not optimizing (-O0), when using -fno-inline with GCC or Clang, or when XXH_SIZE_OPT >= 1.
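
One way to express this switch is a single inline-hint macro that every internal function uses. The sketch below is an assumption about the mechanism, not xxHash's source: the DEMO_ names are hypothetical, and it relies on the __NO_INLINE__ macro that GCC documents as predefined when not optimizing or when -fno-inline is given.

```c
#include <stdint.h>

/* Hypothetical DEMO_ macros sketching the documented behavior. */
#ifndef DEMO_SIZE_OPT
#  define DEMO_SIZE_OPT 0
#endif

/* Turn the hints off automatically for size builds or non-inlining builds. */
#ifndef DEMO_NO_INLINE_HINTS
#  if DEMO_SIZE_OPT >= 1 || defined(__NO_INLINE__)
#    define DEMO_NO_INLINE_HINTS 1
#  else
#    define DEMO_NO_INLINE_HINTS 0
#  endif
#endif

#if DEMO_NO_INLINE_HINTS
   /* Plain static: the compiler alone decides whether to inline. */
#  define DEMO_FORCE_INLINE static
#elif defined(__GNUC__)
#  define DEMO_FORCE_INLINE static inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#  define DEMO_FORCE_INLINE static __forceinline
#else
#  define DEMO_FORCE_INLINE static inline
#endif

/* Example internal helper marked with the hint; valid for r in 0..31. */
DEMO_FORCE_INLINE uint32_t demo_rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> ((32u - r) & 31u));
}
```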

◆ XXH3_INLINE_SECRET

#define XXH3_INLINE_SECRET   0

Determines whether to inline the XXH3 withSecret code.

When the secret size is known, the compiler can improve the performance of XXH3_64bits_withSecret() and XXH3_128bits_withSecret().

However, if the secret size is not known, it doesn't have any benefit. This happens when xxHash is compiled into a global symbol. Therefore, if XXH_INLINE_ALL is not defined, this will be defined to 0.

Additionally, this defaults to 0 on GCC 12+, which has an issue with function pointers that are sometimes force-inlined at -Og, and it is impossible to automatically detect this optimization level.

◆ XXH32_ENDJMP

#define XXH32_ENDJMP   0

Whether to use a jump for XXH32_finalize.

For performance, XXH32_finalize uses multiple branches in the finalizer. This is generally preferable, but depending on the exact architecture, a jmp may be preferable.

This setting can only make a difference for very small inputs.
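
The two control-flow shapes can be compared on the tail step of a hash. This sketch uses the XXH32 prime constants and rotation amounts for its 4-byte and 1-byte steps, but the demo_ functions are illustrative, not xxHash's XXH32_finalize. The loop version re-tests the remaining length at every step; the switch version makes one indexed jump into an unrolled, fall-through sequence.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_P32_1 0x9E3779B1u
#define DEMO_P32_3 0xC2B2AE3Du
#define DEMO_P32_4 0x27D4EB2Fu
#define DEMO_P32_5 0x165667B1u

static uint32_t demo_rotl32(uint32_t x, unsigned r) /* r in 1..31 */
{
    return (x << r) | (x >> (32u - r));
}

static uint32_t demo_readLE32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical 4-byte and 1-byte mixing steps. */
static uint32_t demo_step4(uint32_t h, const unsigned char *p)
{
    h += demo_readLE32(p) * DEMO_P32_3;
    return demo_rotl32(h, 17) * DEMO_P32_4;
}

static uint32_t demo_step1(uint32_t h, unsigned char b)
{
    h += (uint32_t)b * DEMO_P32_5;
    return demo_rotl32(h, 11) * DEMO_P32_1;
}

/* Branching version: loops decide at run time how many steps remain. */
static uint32_t demo_tail_loop(uint32_t h, const unsigned char *p, size_t len)
{
    len &= 15;
    while (len >= 4) { h = demo_step4(h, p); p += 4; len -= 4; }
    while (len > 0)  { h = demo_step1(h, *p++); len--; }
    return h;
}

/* Jump version: one switch on len & 15, then fall-through steps. */
static uint32_t demo_tail_switch(uint32_t h, const unsigned char *p, size_t len)
{
#define DEMO_S4 do { h = demo_step4(h, p); p += 4; } while (0)
#define DEMO_S1 do { h = demo_step1(h, *p++); } while (0)
    switch (len & 15) {
    case 12: DEMO_S4; /* fall through */
    case 8:  DEMO_S4; /* fall through */
    case 4:  DEMO_S4; return h;
    case 13: DEMO_S4; /* fall through */
    case 9:  DEMO_S4; /* fall through */
    case 5:  DEMO_S4; DEMO_S1; return h;
    case 14: DEMO_S4; /* fall through */
    case 10: DEMO_S4; /* fall through */
    case 6:  DEMO_S4; DEMO_S1; DEMO_S1; return h;
    case 15: DEMO_S4; /* fall through */
    case 11: DEMO_S4; /* fall through */
    case 7:  DEMO_S4; /* fall through */
    case 3:  DEMO_S1; /* fall through */
    case 2:  DEMO_S1; /* fall through */
    case 1:  DEMO_S1; /* fall through */
    case 0:  break;
    }
#undef DEMO_S4
#undef DEMO_S1
    return h;
}
```

Both functions return the same value for every length from 0 to 15; only the generated branches differ.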

◆ XXH_NO_STREAM

#define XXH_NO_STREAM

Disables the streaming API.

When xxHash is not inlined and the streaming functions are not used, disabling the streaming functions can improve code size significantly, especially with the XXH3 family which tends to make constant folded copies of itself.

◆ XXH_DEBUGLEVEL

#define XXH_DEBUGLEVEL   0

Sets the debugging level.

XXH_DEBUGLEVEL is expected to be defined externally, typically via the compiler's command line options. The value must be a number.
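
A common way to consume a numeric debug level is to gate assertion macros on it. The sketch below is hypothetical (DEMO_DEBUGLEVEL, DEMO_ASSERT, demo_div are illustrative names) and shows why the value has to be a number: it is compared in an #if expression, where a leftover non-numeric identifier silently evaluates to 0.

```c
#include <assert.h>

/* Hypothetical level, set externally, e.g. -DDEMO_DEBUGLEVEL=1. */
#ifndef DEMO_DEBUGLEVEL
#  define DEMO_DEBUGLEVEL 0
#endif

#if DEMO_DEBUGLEVEL >= 1
#  define DEMO_ASSERT(c) assert(c)
#else
#  define DEMO_ASSERT(c) ((void)0) /* compiled out at level 0 */
#endif

/* Example: an internal invariant that is only checked in debug builds. */
static unsigned demo_div(unsigned a, unsigned b)
{
    DEMO_ASSERT(b != 0);
    return a / b;
}
```

Building with -DDEMO_DEBUGLEVEL=1 enables the check; at the default level of 0 it compiles to nothing.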

◆ XXH_CPU_LITTLE_ENDIAN

#define XXH_CPU_LITTLE_ENDIAN   XXH_isLittleEndian()

Whether the target is little endian.

Defined to 1 if the target is little endian, or 0 if it is big endian. It can be defined externally, for example on the compiler command line.

If it is not defined, a runtime check (which is usually constant folded) is used instead.

Note
This is not necessarily defined to an integer constant.
See also
XXH_isLittleEndian() for the runtime check.
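
A minimal sketch of the override-or-runtime-check pattern, with hypothetical DEMO_ names. The runtime check copies the first byte of the integer 1. Because the fallback expansion is a function call, it is not an integer constant and cannot be used in #if, which is the caveat noted above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical runtime check: 1 on little-endian hosts, 0 on big-endian.
 * Optimizing compilers usually fold this to a constant. */
static int demo_is_little_endian(void)
{
    const uint32_t one = 1;
    unsigned char first;
    memcpy(&first, &one, 1);
    return first == 1;
}

/* External override, e.g. -DDEMO_CPU_LITTLE_ENDIAN=1; otherwise
 * fall back to the runtime check. */
#ifndef DEMO_CPU_LITTLE_ENDIAN
#  define DEMO_CPU_LITTLE_ENDIAN demo_is_little_endian()
#endif
```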

◆ XXH_VECTOR

#define XXH_VECTOR   XXH_SCALAR

Overrides the vectorization implementation chosen for XXH3.

Can be defined to 0 to disable SIMD or any of the values mentioned in XXH_VECTOR_TYPE.

If this is not defined, it uses predefined macros to determine the best implementation.
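
A simplified sketch of macro-based detection with an override. The predefined macros tested here are the standard ones GCC, Clang, and MSVC use for these instruction sets, but the DEMO_ names, the order of the checks, and the exact conditions are assumptions, not xxHash's detection logic. The numeric values follow XXH_VECTOR_TYPE.

```c
/* Hypothetical mirror of the XXH_VECTOR_TYPE values. */
#define DEMO_SCALAR 0
#define DEMO_SSE2   1
#define DEMO_AVX2   2
#define DEMO_AVX512 3
#define DEMO_NEON   4
#define DEMO_VSX    5
#define DEMO_SVE    6

/* A user override (e.g. -DDEMO_VECTOR=0) takes precedence. */
#ifndef DEMO_VECTOR
#  if defined(__ARM_FEATURE_SVE)
#    define DEMO_VECTOR DEMO_SVE
#  elif defined(__ARM_NEON) || defined(__ARM_NEON__)
#    define DEMO_VECTOR DEMO_NEON
#  elif defined(__AVX512F__)
#    define DEMO_VECTOR DEMO_AVX512
#  elif defined(__AVX2__)
#    define DEMO_VECTOR DEMO_AVX2
#  elif defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#    define DEMO_VECTOR DEMO_SSE2
#  elif defined(__POWER8_VECTOR__)
#    define DEMO_VECTOR DEMO_VSX
#  else
#    define DEMO_VECTOR DEMO_SCALAR
#  endif
#endif

/* Human-readable name of the selected implementation. */
static const char *demo_vector_name(void)
{
    static const char *const names[] = {
        "scalar", "sse2", "avx2", "avx512", "neon", "vsx", "sve"
    };
    return (DEMO_VECTOR >= 0 && DEMO_VECTOR <= 6) ? names[DEMO_VECTOR] : "unknown";
}
```

With a default x86-64 GCC build this typically selects DEMO_SSE2; adding -mavx2 moves it to DEMO_AVX2, and -DDEMO_VECTOR=0 forces the scalar path.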

◆ XXH_ACC_ALIGN

#define XXH_ACC_ALIGN   8

Selects the minimum alignment for XXH3's accumulators.

When using SIMD, this should match the alignment required for said vector type, so, for example, 32 for AVX2.

Default: Auto detected.
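
A sketch of how an accumulator alignment setting can be applied with C11 _Alignas. The DEMO_ names and the per-ISA defaults (64 for AVX-512, 32 for AVX2, 16 for SSE2 and NEON, 8 otherwise) are assumptions for illustration; only the AVX2 value of 32 comes from the text above. The count of eight 64-bit accumulators is also an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical defaults: match the widest vector the build targets. */
#ifndef DEMO_ACC_ALIGN
#  if defined(__AVX512F__)
#    define DEMO_ACC_ALIGN 64
#  elif defined(__AVX2__)
#    define DEMO_ACC_ALIGN 32
#  elif defined(__SSE2__) || defined(__ARM_NEON)
#    define DEMO_ACC_ALIGN 16
#  else
#    define DEMO_ACC_ALIGN 8
#  endif
#endif

#define DEMO_ACC_NB 8 /* assumed number of 64-bit accumulators */

/* _Alignas makes the array start on a DEMO_ACC_ALIGN boundary,
 * so aligned vector loads and stores on it are valid. */
typedef struct {
    _Alignas(DEMO_ACC_ALIGN) uint64_t acc[DEMO_ACC_NB];
} demo_acc_t;

static int demo_acc_is_aligned(const demo_acc_t *s)
{
    return ((uintptr_t)s->acc % DEMO_ACC_ALIGN) == 0;
}
```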

◆ XXH3_NEON_LANES

#define XXH3_NEON_LANES   XXH_ACC_NB

Controls the NEON to scalar ratio for XXH3.

This can be set to 2, 4, 6, or 8.

ARM Cortex CPUs are very sensitive to how their pipelines are used.

For example, the Cortex-A73 can dispatch 3 micro-ops per cycle, but only 2 of those can be NEON. If you are only using NEON instructions, you are only using 2/3 of the CPU bandwidth.

This is even more noticeable on the more advanced cores like the Cortex-A76 which can dispatch 8 micro-ops per cycle, but still only 2 NEON micro-ops at once.

Therefore, to make the most out of the pipeline, it is beneficial to run 6 NEON lanes and 2 scalar lanes, which is chosen by default.

This does not apply to Apple processors or 32-bit processors, which run better with full NEON. These will default to 8. Additionally, size-optimized builds run 8 lanes.

This change benefits CPUs with large micro-op buffers without negatively affecting most other CPUs:

Chipset               | Dispatch type       | NEON only | 6:2 hybrid | Diff.
Snapdragon 730 (A76)  | 2 NEON/8 micro-ops  | 8.8 GB/s  | 10.1 GB/s  | ~16%
Snapdragon 835 (A73)  | 2 NEON/3 micro-ops  | 5.1 GB/s  | 5.3 GB/s   | ~5%
Marvell PXA1928 (A53) | In-order dual-issue | 1.9 GB/s  | 1.9 GB/s   | 0%
Apple M1              | 4 NEON/8 micro-ops  | 37.3 GB/s | 36.1 GB/s  | ~-3%

It also seems to fix some bad codegen on GCC, making it almost as fast as Clang.

When using WASM SIMD128, if this is 2 or 6, SIMDe will scalarize 2 of the lanes, meaning it effectively becomes a worse 4.

See also
XXH3_accumulate_512_neon()
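
The lane split can be sketched in portable C. Everything here is hypothetical: DEMO_NEON_LANES stands in for XXH3_NEON_LANES, the per-lane math is loosely modeled on an XXH3-style scalar round, and the "vector" path is plain C that handles two 64-bit lanes at a time, the way one 128-bit NEON register would; a real implementation would use NEON intrinsics there.

```c
#include <stdint.h>

#define DEMO_ACC_NB 8 /* assumed number of 64-bit accumulators */

#ifndef DEMO_NEON_LANES
#  define DEMO_NEON_LANES 6 /* 6 "NEON" lanes + 2 scalar lanes */
#endif
#if DEMO_NEON_LANES < 2 || DEMO_NEON_LANES > DEMO_ACC_NB || DEMO_NEON_LANES % 2 != 0
#  error "DEMO_NEON_LANES must be 2, 4, 6, or 8"
#endif

/* 32x32 -> 64-bit product of the low and high halves of x. */
static uint64_t demo_mul32to64(uint64_t x)
{
    return (x & 0xFFFFFFFFu) * (x >> 32);
}

/* Scalar path, one lane at a time: the raw input is added to the
 * neighboring lane (lane ^ 1), the product to this lane. */
static void demo_scalar_lane(uint64_t *acc, const uint64_t *in,
                             const uint64_t *key, int lane)
{
    uint64_t dk = in[lane] ^ key[lane];
    acc[lane ^ 1] += in[lane];
    acc[lane] += demo_mul32to64(dk);
}

/* Stand-in for the NEON path: two lanes per step, same math. */
static void demo_vector_pair(uint64_t *acc, const uint64_t *in,
                             const uint64_t *key, int lane)
{
    uint64_t dk0 = in[lane] ^ key[lane];
    uint64_t dk1 = in[lane + 1] ^ key[lane + 1];
    acc[lane] += in[lane + 1] + demo_mul32to64(dk0);
    acc[lane + 1] += in[lane] + demo_mul32to64(dk1);
}

/* Hybrid stripe: lanes [0, DEMO_NEON_LANES) on the vector path,
 * the remaining lanes on the scalar path. */
static void demo_accumulate_stripe(uint64_t *acc, const uint64_t *in,
                                   const uint64_t *key)
{
    int lane;
    for (lane = 0; lane < DEMO_NEON_LANES; lane += 2)
        demo_vector_pair(acc, in, key, lane);
    for (lane = DEMO_NEON_LANES; lane < DEMO_ACC_NB; lane++)
        demo_scalar_lane(acc, in, key, lane);
}

/* Self-check: the split must match an all-scalar pass on sample data. */
static int demo_split_matches_scalar(void)
{
    uint64_t in[DEMO_ACC_NB], key[DEMO_ACC_NB];
    uint64_t a[DEMO_ACC_NB], b[DEMO_ACC_NB];
    int i;
    for (i = 0; i < DEMO_ACC_NB; i++) {
        in[i] = 0x9E3779B97F4A7C15ull * (uint64_t)(i + 1);
        key[i] = 0xC2B2AE3D27D4EB4Full ^ (uint64_t)i;
        a[i] = b[i] = (uint64_t)i;
    }
    demo_accumulate_stripe(a, in, key);
    for (i = 0; i < DEMO_ACC_NB; i++)
        demo_scalar_lane(b, in, key, i);
    for (i = 0; i < DEMO_ACC_NB; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Because the lanes only interact within adjacent pairs, the result does not depend on the split; only the mix of execution units does, which is what the benchmark table measures.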

Enumeration Type Documentation

◆ XXH_VECTOR_TYPE

Possible values for XXH_VECTOR.

Note that these are actually implemented as macros.

If this is not defined, it is detected automatically. The internal macro XXH_X86DISPATCH overrides this.

Enumerator
XXH_SCALAR 

Portable scalar version

XXH_SSE2 

SSE2 for Pentium 4, Opteron, all x86_64.

Note
SSE2 is also guaranteed on Windows 10, macOS, and Android x86.
XXH_AVX2 

AVX2 for Haswell and Bulldozer

XXH_AVX512 

AVX512 for Skylake and Icelake

XXH_NEON 

NEON for most ARMv7-A, all AArch64, and WASM SIMD128 via the SIMDeverywhere polyfill provided with the Emscripten SDK.

XXH_VSX 

VSX and ZVector for POWER8/z13 (64-bit)

XXH_SVE 

SVE for some ARMv8-A and ARMv9-A