## Hash function race!

I have lately spent a lot of time optimizing my evaluation software. In doing so, I noticed that as the amount of data increases, the time spent hashing becomes very long.

So I searched around the internet to see whether or not I could optimize my hash function, and finally ended up with several good resources, among which this website: http://burtleburtle.net/bob/hash/index.html

My problem was that the hash functions proposed were in C (in a marvelous world, I'd prefer a Fortran version... which I didn't find anywhere on the web... so I decided to write it myself).

I took a list of 235930 pairs of integers (some of those hashed in real evaluations). For each pair, I computed a hash value both with a prime number hash function and with Jenkins' lookup3.c function.

**Technical details:**

The prime number hashing function taken as a reference was the one provided by I. Misztal (*et al.*) in the blupf90 package. Basically, each coordinate to be hashed is multiplied by a specific prime number. The function keeps the modulo of the resulting value by the size of the storage array where the hash keys are stored. If the cell for a given hash key is already in use, we look at the cell 17 positions forward.
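A minimal C sketch of this addressing scheme, assuming placeholder values of mine (the multipliers, the table size, and the zero-means-empty convention are illustrations, not the actual blupf90 constants):

```c
#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE 1000003u   /* placeholder prime size, not blupf90's */
#define PROBE_STEP 17u        /* step used on collision, as in blupf90 */

/* Stored (row, col) pairs; 0 marks an empty cell (coordinates are
 * assumed to be 1-based equation numbers, so 0 never occurs as data). */
static int64_t table[TABLE_SIZE][2];

/* Hash a coordinate pair: multiply each coordinate by a prime,
 * then reduce the sum modulo the table size. */
static size_t prime_hash(int64_t row, int64_t col)
{
    return (size_t)(((uint64_t)row * 2654435761u
                   + (uint64_t)col * 2246822519u) % TABLE_SIZE);
}

/* Find the slot for (row, col): on collision, jump PROBE_STEP cells
 * forward (wrapping around) until we hit the same pair or an empty
 * cell.  Since gcd(17, TABLE_SIZE) = 1, the probe visits every cell. */
static size_t find_slot(int64_t row, int64_t col)
{
    size_t h = prime_hash(row, col);
    while (table[h][0] != 0 &&
           !(table[h][0] == row && table[h][1] == col))
        h = (h + PROBE_STEP) % TABLE_SIZE;
    return h;
}
```

A new pair is inserted by writing it into the slot that `find_slot` returns; looking it up later follows exactly the same probe sequence.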

In Jenkins' function, several bit operations are performed on the coordinates we want to hash. In case of collision, we hash our coordinates again until we find an available cell. One important point is that the storage array size must be a **power of two**.
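lookup3.c itself is a fair amount of code, so as an illustration here is Jenkins' much shorter "one-at-a-time" hash from the same website, applied to a coordinate pair. The `seed` argument stands in for lookup3's `initval`; bumping it on collision is my assumption of how the "hash again" step can be seeded:

```c
#include <stdint.h>
#include <stddef.h>

#define TABLE_BITS 20
#define TABLE_SIZE (1u << TABLE_BITS)   /* must be a power of two */

/* Jenkins' one-at-a-time hash: mix the bytes of both coordinates
 * plus a seed through shifts, adds and xors, then reduce with a
 * bit-mask instead of a division -- this mask trick is exactly why
 * the storage array size must be a power of two. */
static uint32_t jenkins_oat(uint32_t row, uint32_t col, uint32_t seed)
{
    uint32_t words[3] = { row, col, seed };
    const uint8_t *key = (const uint8_t *)words;
    uint32_t h = 0;

    for (size_t i = 0; i < sizeof words; i++) {
        h += key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;

    return h & (TABLE_SIZE - 1);        /* modulo, as a single AND */
}
```

On collision, the scheme described above would call `jenkins_oat` again with a different seed until it lands on an available cell.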

**Let's look at the results:**

Following are the total number of collisions and the maximum number of collisions for one pair of integers.

Basically, Jenkins' hash reduces the total number of collisions by a factor of 4. Furthermore, for a matrix that should be filled at 90%, Jenkins' function won't produce more than 413 successive collisions for a single pair of coordinates, whereas this value can be as high as 8213 with the prime function.

If we observe the evolution of this figure with an increasing proportion of filled cells in our vector, we clearly see that once only 40% of the cells remain available, the prime hash function tends to generate a lot of collisions, while Jenkins' hash remains pretty efficient up to 70% of filled cells.

Due to the lower level of collisions, Jenkins' hash should be more efficient than the prime function. To test this hypothesis, I then compared the CPU time needed for each algorithm, with and without the -O3 optimization option, with both ifort and gfortran. The test was repeated 100 times.

Except when compiled with gfortran and no options, Jenkins' hash was significantly faster than the prime function. The difference doesn't sound huge (about 0.04 CPU s), but should be worthwhile for bigger problems. We can furthermore assume that in real use we would also gain some time when looking up data from the hash matrix.

For interested people, you can find here a module which can replace the hashvr function in Misztal's programs.

Beware, you should also:

- add "use hash" in all the subroutines that need hashvr (these subroutines are in sparse.f90)
- modify the increment size for the hash matrix, "hash_incr" (it should be 2)
- change the default hash size; it should be a power of two
- comment out the old hashvr function in sparse2.f (to avoid a name conflict)
- update your makefile!

One last hint: if you have a rough idea of the hash matrix size you need, let's say "H", you can compute the closest power of two with this line:

default_hash_size = 2**(1 + int(log(H) + log10(H)))

You would thus avoid the cost of successive copies of your hash matrix as it grows.

## Looking Backward

Some months ago, while preparing a talk for the genetics department, I did a bit of research on INA (the French National Institute for Audiovisual). I found a very interesting video funded by the Marshall Plan, explaining what the future should look like. Unfortunately, I couldn't find it again, but on the other hand I came across a video made in Jouy-en-Josas (the place where I work) in the sixties.

Several funny things:

- Nowadays, for any video, journalists would ask to see the SNP chip scanner, not a computer; in the '60s this was not the case! (As shown in the first minutes of the video, Fortran programs were THE ATTRACTION!) So yes, programming and computation used to be cool 50 years ago... too bad for me this is not the case anymore!
- Another amazing thing here is that the video talks about what the future of farming (i.e., the eighties at that time) would look like, and some of the predictions are pretty accurate.

The whole video is available here:


http://www.ina.fr/video/I05059617/le-futur-de-l-elevage-l-elevage-industriel.fr.html

A lot of other videos are also available, each of them very interesting. This initiative is very nice and should be acknowledged.

Now the major challenge for me would be to imagine what the future will be like and... wait and see whether I am totally wrong or not!