Home > Fortran, Genomic Selection, QTL-detection > Hash function race !

## Hash function race !

I spent lately a lot of time optimizing my evaluation software. By doing so, I note that with increasing amount of data, time spent in hashing is becoming very long.

Thus I make a turn around internet to see whether or not I could optimize my hash function, and finally end up with several good resources among which this website : http://burtleburtle.net/bob/hash/index.html

My problem was that the hash function proposed were in c (and in a marvelous world, I ‘d prefer a Fortran version….that I didn’t find anywhere on the web….so I decided to write it myself.)

I took a list of 235930 pairs of integer (some that are hashed in real evaluations). For each pair, I compute a hash value both with a prime number hash function and with Jenkins’ lookup3.c function.

Technical details :

The prime number hashing function taken as a reference was the one provided by I. Misztal (et al.) in its blupf90 package. Basically, each coordinates to be hashed is multiplied by a specific prime number. The function keep the modulo of the figure obtained by the size of the storage array, where hash key are  stored. If for a specific hash key, the cell is already used, then we look for the  cell that is 17 cells forward.

In Jenkins’ function, we make several bit operations on the coordinates that we want  to hash. In case of collision, we hash again our coordinates until we found an available cell. One important thing is that the storage array size should be a power of two.

Let’s look at the results :

Following are the total number of collisions and the maximum number of collisions for one pair of integer.

Basically Jenkins hash reduces by a factor 4 the total number of collisions. Furthermore, for a matrix that should be filled at 90% Jenkins’ function won’t have more than 413 successive collisions for one single pair of coordinates, whereas this value could be as high as 8213 with the prime function.

If we observe the evolution of this figure with an increasing proportion of filled cell in our vectorWe clearly see that until 40% of the cell are still available, prime hash function tend to generate a lot of collisions, while Jenkins hash remain pretty efficient up to 70% of filled cells.

Due to the lower level of collisions, Jenkins Hash should be more efficient than Prime function. To test this hypothesis, I then compare the cpu time needed for each algorithm, with or without -03 optimization options with both ifort and gfortran.  The test has been repeated 100 times.

Apart when compiled with gfortran and no options,  Jenkins’ hash was significantly faster than the prime function. The difference doesn’t sound huge  (about  0.04 CPU s), but should be worthwhile with bigger problem. We can furthermore assume that in real use, we should also get some gains while “looking up”  data from the hash matrix.

For  interested people, you can find here a module which can replace the hashvr function in Misztal program.

Beware, you should also :

1. add  “use hash” in all the subroutine who need hashvr (These subroutine are  in sparse.f90)
2. Modify the increment size for the hash matrix “hash_incr”(it should be 2)
3. Change the default hash size, it should be a power of two.
4. Comment the old hashvr function in sparse2.f (to avoid name conflict)