In the beginning the universe was created.
The vast majority of programming languages come equipped with a comprehensive standard library. What these standard libraries offer depends on the language. In the case of Java, for example, the JDK provides everything from networking to GUI toolkits. C++ comes with the Standard Template Library (STL), which offers an incredible variety of generic data structures and algorithms. Regardless of the ecosystem, these standard tools are used almost ubiquitously despite their unwieldy sizes, arcane practices, and inaccessible documentation.
This made a lot of people angry and is widely considered a bad move.
Then there is good ole’ C. A step above Assembly, C remains the most ubiquitous systems programming language in the world, although it may at some point be usurped by Rust. Its standard library is notably lacking in comparison to that of a more popular systems language such as C++. No, seriously: compare every function in C’s standard library to every function in the C++ STL. There’s simply no competition; C comes with no conventional bells and whistles. If you need anything like a growable list such as `std::vector` or an `ArrayList`, you’re writing or finding a library that implements a growing buffer.
So how do C programmers get anything done without any standard containers to hold their bits? They write their own or use a library like GLib. Before even getting to the meat of a solution, you’ve either added a dependency to your project or invested time rolling your own container. This is simply a point of consideration when deciding which language to use. I think investing the time in rolling your own, at least once, is quite the educational endeavor. While researching how to implement a hash table (for example) from scratch, you’ll need to understand hash algorithms. To implement a sorted set, you’ll likely need to immerse yourself in one of many self-balancing binary search trees. Eventually you’ll end up with a trove of snippets and ideas from varying sources, resulting in a toolbox of tricks.
Hashing
Fowler–Noll–Vo
Producing a hash from an arbitrary sequence of bytes is a non-trivial problem. Doing it quickly is even less trivial, especially when trying to produce the fewest collisions possible. Enter the FNV hash algorithms, which offer O(n) complexity with fewer instructions than the majority of rolling hash implementations. The Fowler-Noll-Vo algorithm involves a prime and an integer offset basis, being `0x100000001b3` and `0xcbf29ce484222325` in the following example. These two magic numbers vary depending on which variant of the hash is used, scaling anywhere from 32-bit to 1024-bit versions. FNV does not produce cryptographic hashes, so its results are not suitable for any purpose beyond hash tables or checksums… but it is very fast.
```c
vl_hash vlHashString(const void* data, vl_memsize_t dataSize){
    const vl_uint8_t* bytes = (const vl_uint8_t*)data;
    vl_hash result = 0xcbf29ce484222325ULL; //FNV offset basis

    //FNV-1a: XOR in each byte, then multiply by the FNV prime.
    for(vl_memsize_t i = 0; i < dataSize; i++){
        result ^= bytes[i];
        result *= 0x100000001b3ULL;
    }

    return result;
}
```
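As a quick usage sketch (assuming `vl_hash` is a 64-bit unsigned integer, as the constants above imply), hashing a string key looks like this:

```c
#include <stdio.h>
#include <string.h>

int main(void){
    const char* key = "position";

    //Hash the key's bytes; whether to include the null terminator is up to you.
    vl_hash h = vlHashString(key, (vl_memsize_t)strlen(key));

    printf("%llx\n", (unsigned long long)h);
    return 0;
}
```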
The Value of an Identity Hash
On the topic of hash functions being unsuitable for the purposes of cryptography, consider the idea of an identity hash. An identity hash yields a result that is equivalent to its input. At first glance you might wonder what purpose this could serve; the function just returns what it is given. Look at it this way: there are no two inputs to such a function that could possibly yield the same output. Numbers are inherently unique in their representations, meaning there are no two ways to represent the same exact integer on the same system. It is for this reason that I often choose to do away entirely with hashing specific number types, and instead just hash plain byte sequences of the same length as those types: the idea behind hashing a 32-bit float and a 32-bit integer is the same. A downside to this approach is that the distribution of results is only as good as the distribution of inputs.
```c
vl_hash vlHash8(const void* data, vl_memsize_t s){
    (void)s; //the width is implied by the function; exactly one byte is read
    return (vl_hash)*(const vl_uint8_t*)data;
}
```
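The wider variants follow the same pattern. A sketch of the 32-bit case (the name `vlHash32` is an assumption of mine, to match the naming above) also shows why a float and an integer with the same bit pattern hash identically:

```c
#include <string.h> // memcpy

//Identity hash for any 4-byte value: an integer, a float, whatever.
vl_hash vlHash32(const void* data, vl_memsize_t s){
    vl_uint32_t value;
    (void)s; //the width is implied by the function

    memcpy(&value, data, sizeof(value)); //bitwise copy; no aliasing concerns
    return (vl_hash)value;
}
```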
Combining Hashes
We’ve outlined methods for hashing variable-length and well-defined sequences of bytes. What if we’re in a position where we need to combine them? I’ve experienced a need for this when using multiple byte sequences as a key in a hash table. There are more than a few ways to do it, but I will outline two here, with a combined usage sketch after the second:
Basic XOR
```c
vl_hash vlHashCombine(vl_hash a, vl_hash b){
    return a ^ b;
}
```

- Pros:
  - It’s pretty darn fast.
  - Good for mixing entropy.
- Cons:
  - If both inputs are similar, it could make the result less random.
Multiplicative Combination
```c
vl_hash vlHashCombine(vl_hash a, vl_hash b){
    static const vl_uint64_t prime1 = 0x9E3779B97F4A7C15ULL; // 64-bit golden ratio constant (odd, but not prime)
    static const vl_uint64_t prime2 = 0xC2B2AE3D27D4EB4FULL; // Large 64-bit prime
    return (a * prime1) + (b * prime2);
}
```

- Pros:
  - Good for mixing entropy.
  - Less prone to collisions than a basic XOR.
- Cons:
  - Slightly slower than basic XOR due to higher complexity.
  - Magic numbers!
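As for the hash table key scenario mentioned earlier, here is a sketch that hashes a two-field key by hashing each field and combining the results. The struct and function names are hypothetical; `vlHash32` is the 32-bit identity hash sketched above.

```c
//Hypothetical two-field key for a hash table.
typedef struct {
    vl_uint32_t  entityId;
    vl_float32_t weight;
} vl_example_key;

vl_hash vlHashExampleKey(const vl_example_key* key){
    const vl_hash a = vlHash32(&key->entityId, sizeof(key->entityId));
    const vl_hash b = vlHash32(&key->weight,   sizeof(key->weight));
    return vlHashCombine(a, b); //either combiner above works here
}
```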
IEEE-754 Floats
Floating point numbers have a format that is arguably less straightforward than that of an unsigned integer. They more or less work by representing a number in scientific notation. While not all systems in the world use this specific floating point standard, it is a safe assumption that the ones you run into certainly will. I will not go over the specifics here, but the layout of a 32-bit float under this standard is as follows: 1 sign bit, followed by 8 exponent bits (biased by 127), followed by 23 mantissa bits.
Dicing your Floats
Despite their added complexity, they are just as much a bunch of 1s and 0s as anything else. We can re-interpret a float using bitwise instructions and separate out the individual parts.
```c
#include <string.h> // memcpy

void vlFloat32Trisect(vl_float32_t val, vl_uint8_t* sign, vl_uint8_t* exponent, vl_uint32_t* mantissa){
    vl_uint32_t bits;
    memcpy(&bits, &val, sizeof(bits)); //reinterpret the float as its raw bits

    *sign     = (vl_uint8_t)(bits >> 31);          //top bit
    *exponent = (vl_uint8_t)((bits >> 23) & 0xFF); //next 8 bits
    *mantissa = bits & 0x7FFFFF;                   //lower 23 bits
}
```
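A quick sanity check: 1.5 is 1.1 in binary, so its sign bit is 0, its biased exponent is 127, and only the top mantissa bit is set.

```c
#include <stdio.h>

int main(void){
    vl_uint8_t  sign, exponent;
    vl_uint32_t mantissa;

    vlFloat32Trisect(1.5f, &sign, &exponent, &mantissa);

    //Expected output: sign=0 exponent=0x7F mantissa=0x400000
    printf("sign=%u exponent=0x%X mantissa=0x%X\n", sign, exponent, mantissa);
    return 0;
}
```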
Fast Inverse Square Root
While no longer the fastest in the land as it once was, the fast inverse square root function is worth noting here. It takes advantage of the underlying bits of the floating point format and is historically significant: Quake 3 owes its (for the time) high-quality, performant reflections to this single function.
```c
float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the heck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration (Newton's method)

    return y;
}
```
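For a sense of its accuracy, a single Newton iteration lands within roughly 0.2% of the true value:

```c
#include <math.h>
#include <stdio.h>

int main(void){
    const float x = 42.0f;

    printf("Q_rsqrt(42):  %f\n", Q_rsqrt(x));
    printf("1/sqrtf(42):  %f\n", 1.0f / sqrtf(x));
    return 0;
}
```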
Splitmix64 PRNG
I’m not sure there exists a simpler pseudo-random number generator (PRNG) than Splitmix64. Passing the stringent BigCrush test suite, this algorithm takes some initial state integer (a seed) and pseudo-randomly permutes it. The seed can be any integer; decent choices include the current time or some runtime memory address. It is important to note that this should not be used for any cryptographic purposes.
```c
typedef vl_uint64_t vl_rand;

//Advance the state and return the next 64 pseudo-random bits.
//(Helper name is illustrative; the constants are Splitmix64's reference constants.)
vl_uint64_t vlRandNext(vl_rand* rand){
    vl_uint64_t z = (*rand += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}
```
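Seeding is just assigning the state. Per the suggestions above, the current time works fine for non-cryptographic use:

```c
#include <time.h>

int main(void){
    vl_rand rng = (vl_rand)time(NULL); //seed with the current time

    const vl_uint64_t roll = vlRandNext(&rng);
    (void)roll; //64 fresh pseudo-random bits
    return 0;
}
```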
Consider the resulting integer as a sequence of bits. We can derive any other integer type from the result by reinterpreting those bits. Type-punning between a wide and a narrow integer type through pointer casts (e.g., `vl_uint64_t` and `vl_uint8_t`) is technically undefined behavior. By reinterpreting the results of the Splitmix PRNG with `memcpy` instead, we are asking the compiler to make a bitwise copy between the two types.
```c
vl_int8_t vlRandInt8(vl_rand* rand){
    const vl_uint64_t bits = vlRandNext(rand);
    vl_int8_t result;

    memcpy(&result, &bits, sizeof(result)); //bitwise copy of a single byte of the draw
    return result;
}
```
SIMD-Like Properties Using Primitive Types
By now, you might have noticed a theme in this article: everything is just a bunch of 1s and 0s. If our random number generator produces 64 bits at a time, shouldn’t we be able to produce two 32-bit integers at once? What about eight 8-bit integers? Four 16-bit integers? The idea of using a single instruction to operate on multiple pieces of data is called “single instruction, multiple data” (SIMD).
And we absolutely can, using nothing more than primitive types. We’ll save proper SIMD instructions for another day; they’re a much deeper topic. There shouldn’t be any problem copying all 64 bits of our state into any block of memory that is at least as wide as the random number state.
```c
//Distribute sizeof(Splitmix State)-many bytes to specified memory address
void vlRandFill(vl_rand* rand, void* dest){
    const vl_uint64_t bits = vlRandNext(rand);
    memcpy(dest, &bits, sizeof(bits)); //8 bytes: two 32-bit, four 16-bit, or eight 8-bit integers
}
```
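Filling a small buffer then gives us, say, eight 8-bit integers from a single draw:

```c
int main(void){
    vl_rand rng = 12345;
    vl_uint8_t bytes[8];

    vlRandFill(&rng, bytes); //eight pseudo-random 8-bit integers at once
    return 0;
}
```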
Generating Random Floats
We can also apply this same idea to floating point numbers, since we know the IEEE-754 layout. This allows us not only to avoid a costly division when generating random floating point numbers, but even to produce two at a time in almost the same number of instructions.
```c
//bitmask for lower 23 bits
#define VL_F32_MANTISSA_MASK 0x007FFFFFu

//Generate a float in [0, 1) without any division.
vl_float32_t vlRandF32(vl_rand* rand){
    //Pin the exponent field to 127 and randomize the mantissa, yielding a float in [1, 2).
    const vl_uint32_t bits = 0x3F800000u | ((vl_uint32_t)vlRandNext(rand) & VL_F32_MANTISSA_MASK);
    vl_float32_t result;

    memcpy(&result, &bits, sizeof(result));
    return result - 1.0f; //shift [1, 2) down to [0, 1)
}
```
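Producing two at a time follows from the same trick, since a single 64-bit draw carries enough bits for two 23-bit mantissas. This is a sketch under the same assumptions as above (the function name and out-parameter style are mine):

```c
//Produce two floats in [0, 1) from a single 64-bit draw.
void vlRandF32x2(vl_rand* rand, vl_float32_t* a, vl_float32_t* b){
    const vl_uint64_t draw = vlRandNext(rand);
    const vl_uint32_t lo = 0x3F800000u | ((vl_uint32_t)draw         & VL_F32_MANTISSA_MASK);
    const vl_uint32_t hi = 0x3F800000u | ((vl_uint32_t)(draw >> 32) & VL_F32_MANTISSA_MASK);

    memcpy(a, &lo, sizeof(lo));
    memcpy(b, &hi, sizeof(hi));
    *a -= 1.0f; //shift [1, 2) down to [0, 1)
    *b -= 1.0f;
}
```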
Memory Alignment
Believe it or not, every single memory allocation (probably) produces an address that is a multiple of the width of a memory address itself. For example: if a machine uses 8-byte memory addresses, memory is allocated at addresses that are multiples of 8. Memory accesses are expected to fall on “natural boundaries”, meaning the address of a value should be a multiple of that value’s width, which is always a power of two. Memory alignment has recursive properties for this reason: by defaulting to the width of the largest integer type on the system, which matches the width of a memory address, it is more likely that all number types stored in the memory will also have their addresses aligned to their respective widths. Thankfully, our compilers and systems handle alignment without us having to think too hard about it.
Here’s a visual representation of what I’m talking about. Consider a measuring stick, or ruler.
- Centimeter marks always land on 1/2 centimeter increments
- Centimeter marks always land on 1/4 centimeter increments

… and so on, and so forth. However, consider the opposite:

- 1/2 centimeter marks do NOT always land on 1 centimeter increments
- 1/4 centimeter marks do NOT always land on 1 centimeter increments
Now let’s consider a 64-bit integer as having a width of 1 centimeter, a 32-bit integer as 1/2 centimeter, and so on. We can say with certainty that the following is true:
- 64-bit integers are always aligned to 16-bit increments
- 64-bit integers are always aligned to 32-bit increments
but…
- 16-bit integers are NOT always aligned to 64-bit increments
- 32-bit integers are NOT always aligned to 64-bit increments
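To make the same point in code, an address’s alignment can be tested with a bitmask, since every width involved is a power of two:

```c
#include <stdint.h>
#include <stdio.h>

int main(void){
    uint64_t wide = 0;

    //Power-of-two alignment check: address modulo width, done with a mask.
    printf("8-byte aligned? %d\n", (int)(((uintptr_t)&wide & 7) == 0)); //1 for a uint64_t
    printf("2-byte aligned? %d\n", (int)(((uintptr_t)&wide & 1) == 0)); //also 1, by the recursive property
    return 0;
}
```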
So why is this useful to know if it isn’t my problem?
Well, because at some point it could be your problem. I briefly touched on single instruction, multiple data (SIMD) instructions already, where multiple pieces of data are operated on by a single instruction. These instructions also require alignment, but to a higher multiple than the default system alignment. SIMD truly shines when it operates on aligned data, suffering a performance hit (or, with some instructions, an outright fault) on unaligned operations. If our goal is to operate on 16-byte (128-bit) spans using SIMD instructions, the memory itself is best aligned to 16-byte increments.
Getting it aligned
Memory can be aligned by requesting a slightly larger size when allocating, then computing an offset such that the resulting address is a multiple of the desired alignment. Smart implementations stash the origin pointer next to the aligned block for later use, rather than managing it separately. Something along these lines does the trick:
```c
#include <stdint.h>
#include <stdlib.h>

//Allocate 'size' bytes at an address that is a multiple of 'align' (a power of two).
//The origin pointer is stashed just before the aligned block so it can be freed later.
//(Function names here are illustrative, matching the vl naming scheme.)
void* vlAlignedMalloc(vl_memsize_t size, vl_memsize_t align){
    void* origin = malloc(size + align + sizeof(void*));
    uintptr_t raw, aligned;

    if(origin == NULL)
        return NULL;

    raw     = (uintptr_t)origin + sizeof(void*);
    aligned = (raw + (align - 1)) & ~((uintptr_t)align - 1);

    ((void**)aligned)[-1] = origin; //stash the origin pointer just below the result
    return (void*)aligned;
}

void vlAlignedFree(void* ptr){
    free(((void**)ptr)[-1]); //retrieve the stashed origin pointer
}
```
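Usage then mirrors malloc and free. A quick sketch with the 16-byte alignment our hypothetical SIMD spans would want:

```c
#include <assert.h>
#include <stdint.h>

int main(void){
    float* lanes = (float*)vlAlignedMalloc(4 * sizeof(float), 16);
    assert(lanes != NULL);
    assert(((uintptr_t)lanes & 15) == 0); //the address is a multiple of 16

    vlAlignedFree(lanes);
    return 0;
}
```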
Summary
If it looks like a bunch of 1s and 0s, and it talks like a bunch of 1s and 0s, it’s probably just a bunch of 1s and 0s. I think it is freeing to think about memory in this polymorphic fashion, and it opens the door to some interesting implementations when the language allows. Combining “less commonly known” aspects of the “most common denominator” computing environment can make for some quite performant code, as seen in the implementation of fast inverse square root. It’s interesting just how many places implementing containers can take you.
TL;DR I think this kind of thing is pretty freakin’ neat. I hope some of you out there do too!
Best,
J