Frankie's hash, I think, suffers from collisions when the value range is large, because it uses the modulo operator for nHashIndex.
And using a more complex hash function leads to other problems.
I managed to speed up Frankie's method but gave up when I saw problems with collisions.
Frankie may know of a way around the problems though.
John
John, hashing will always generate collisions; that is simply how it works on a finite-size table.
In our case, if you dimension a hash table of 2^32 elements, the hashing function
h = Val % 2^32 will be a perfect hash:
you will have no collisions and direct access to all keys, but at the price of huge memory usage.
With a table of reduced dimensions you still get some of the benefit of direct access, plus some loop iteration in extended buckets. In my sample, using a table that is 1/100th of the array size = 200k entries, the maximum looping is 2^32 / 200k = 21475 elements (extended buckets).
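For concreteness, here is a minimal sketch in C of that standard scheme: a plain modulo hash over a fixed-size table, with colliding values chained into extended buckets. The identifiers, the DWORD typedef, and the demo values are my own illustration, not Frankie's code:

```c
/* Minimal sketch (my own illustration, not code from the thread): a
 * fixed-size table hashed with h = Val % TABLE_SIZE, where colliding
 * values are chained into "extended buckets". */
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int DWORD;

#define TABLE_SIZE 200000u          /* the 200k-slot table from the sample above */

typedef struct Bucket {
    DWORD val;
    struct Bucket *next;            /* chain of values colliding on this index */
} Bucket;

static Bucket *table[TABLE_SIZE];   /* zero-initialized: all chains start empty */

static void insert(DWORD val)
{
    DWORD h = val % TABLE_SIZE;     /* the simple modulo hash */
    Bucket *b = malloc(sizeof *b);
    b->val = val;
    b->next = table[h];             /* prepend to the extended bucket */
    table[h] = b;
}

static int contains(DWORD val)
{
    /* worst case walks the whole chain: up to 2^32/TABLE_SIZE = ~21475 steps */
    for (Bucket *b = table[val % TABLE_SIZE]; b != NULL; b = b->next)
        if (b->val == val)
            return 1;
    return 0;
}

int main(void)
{
    insert(42);
    insert(42 + TABLE_SIZE);        /* collides with 42 by construction */
    printf("%d %d %d\n", contains(42), contains(42 + TABLE_SIZE), contains(7));
    return 0;
}
```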
To speed this up you can use, as I mentioned in my original post, a rehashing method.
Rehashing means that the extended buckets become a hash subtable where a new hash allows direct addressing again. The sub-hash function can be
Sub_hash = (Val / Main_Hash_Table_Size) % Hash_SubTable_Size.
Note that it must hash the quotient: every value landing in the same main slot shares the same remainder Val % Main_Hash_Table_Size, so rehashing the remainder would send them all to a single sub-slot, while the quotient covers all indexes in the subtable. Still, if the subtable is not big enough to hold all its values, you will again have extended buckets.
Proceeding this way you can have as many subtables as you want, and you can also decide to what extent to keep the looping: the residual looping depends on how many extended buckets can still exist after the permitted subhashing levels and their table dimensions. A one-level sketch follows.
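To make the one-level case concrete, here is a hedged sketch in C, assuming per-slot subtables created on demand and the quotient-based sub-hash above; MAIN_SIZE, SUB_SIZE and all names are illustrative choices, not code from any post in this thread:

```c
/* One rehashing level (my own illustration): each main slot creates its
 * subtable only on first use, the quotient is sub-hashed into it, and
 * values that still collide chain as extended buckets. */
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int DWORD;

#define MAIN_SIZE 200000u           /* main table, as in the sample above */
#define SUB_SIZE     256u           /* per-slot subtable; arbitrary choice */

typedef struct Node { DWORD val; struct Node *next; } Node;
typedef struct Slot { Node **sub; } Slot;   /* sub stays NULL until needed */

static Slot table[MAIN_SIZE];

static void insert(DWORD val)
{
    Slot *s = &table[val % MAIN_SIZE];
    if (s->sub == NULL)                           /* create subtable on demand */
        s->sub = calloc(SUB_SIZE, sizeof *s->sub);
    DWORD sh = (val / MAIN_SIZE) % SUB_SIZE;      /* sub-hash on the quotient */
    Node *n = malloc(sizeof *n);
    n->val = val;
    n->next = s->sub[sh];                         /* residual collisions chain */
    s->sub[sh] = n;
}

static int contains(DWORD val)
{
    Slot *s = &table[val % MAIN_SIZE];
    if (s->sub == NULL)
        return 0;
    for (Node *n = s->sub[(val / MAIN_SIZE) % SUB_SIZE]; n != NULL; n = n->next)
        if (n->val == val)
            return 1;
    return 0;
}

int main(void)
{
    insert(42);
    insert(42 + MAIN_SIZE);         /* same main slot, different sub-slot */
    printf("%d %d\n", contains(42 + MAIN_SIZE), contains(7));
    return 0;
}
```

With these numbers the worst-case loop drops from about 21475 to 2^32 / (200000 * 256), i.e. about 84 elements, at the cost of allocating a full subtable for every occupied main slot.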
The advantage is that the subtables, like the extended buckets in standard hashing, are created only when necessary, reducing memory usage. Of course subhashing uses more memory than standard hashing, because it can create a full-dimension subtable even when only one value has to be stored, where standard hashing would have created a single bucket.
Another way to optimize is the so-called 'dynamic hashing': in this case the whole system is dynamic, meaning that tables, subtables and hash functions change dynamically with the table contents. You can find info on dynamic hashing
here and
here.
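As a deliberately simplified illustration of that idea (the schemes in those references are more refined than this), here is a sketch in C where the whole table is rebuilt at double size whenever the load factor passes a threshold, so the hash function changes together with the contents; all sizes and the 0.75 threshold are my own arbitrary choices:

```c
/* Simplified stand-in for dynamic hashing: rebuild the table at double
 * size when it gets too full, so h = val % size changes as data arrives. */
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int DWORD;
typedef struct Node { DWORD val; struct Node *next; } Node;

static Node  **tab;
static size_t  size = 1024, count;

static void grow(void)
{
    Node **old = tab;
    size_t old_size = size;
    size *= 2;                           /* the table dimension changes... */
    tab = calloc(size, sizeof *tab);
    for (size_t i = 0; i < old_size; i++)
        for (Node *n = old[i], *next; n != NULL; n = next) {
            next = n->next;
            size_t h = n->val % size;    /* ...and so does the hash */
            n->next = tab[h];
            tab[h] = n;
        }
    free(old);
}

static void insert(DWORD val)
{
    if (tab == NULL)
        tab = calloc(size, sizeof *tab);
    if (count + 1 > size * 3 / 4)        /* keep load factor below 0.75 */
        grow();
    Node *n = malloc(sizeof *n);
    n->val = val;
    n->next = tab[val % size];
    tab[val % size] = n;
    count++;
}

int main(void)
{
    for (DWORD i = 0; i < 100000; i++)
        insert(i * 2654435761u);         /* spread the test values around */
    printf("table grew to %zu slots for %zu values\n", size, count);
    return 0;
}
```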
Anyway, you also have to consider the cost of dynamic handling in terms of memory allocation: the more you call the OS memory functions, the more time you spend. The point is always the same: the right method has to be a correct balance between speed and memory requirements.
P.S. Be careful with removing collisions; perfect hashing will lead to
2*sizeof(DWORD)*2^32 memory usage!
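With 4-byte DWORDs that works out to 2 * 4 * 2^32 bytes = 32 GiB, far more memory than tolerating a few collisions costs.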