frequency of ints in huge array

czerny · July 07, 2014, 12:59:45 AM

Quote from: frankie on July 06, 2014, 09:38:19 PM
I tryied another system to reduce the time without ordering the array.
It uses hashing.
You can play with hash table dimensions to reduce/extend the time or the memory used (they have opposite behaviour).
The time overheading is due, as clear to anybody, to the huge number of loops to check the array. The indexing technique used with hashing avoid loops and hash handling is widly repayed by faster access.
The project posted show how it works, but it is not optimized yet. You can get much better results with a faster hashing library that reduce memory access, and using rehashing to remove looping in bucket extension.
I hope this is useful or at least shows the right way...

Oh, that looks promising! A little bit slower than qsort, yet. But a lot of memory for randomized data

JohnF · July 07, 2014, 04:04:06 PM

Quote from: frankie on July 06, 2014, 09:38:19 PM
I tryied another system to reduce the time without ordering the array.
It uses hashing.
You can play with hash table dimensions to reduce/extend the time or the memory used (they have opposite behaviour).
The time overheading is due, as clear to anybody, to the huge number of loops to check the array. The indexing technique used with hashing avoid loops and hash handling is widly repayed by faster access.
The project posted show how it works, but it is not optimized yet. You can get much better results with a faster hashing library that reduce memory access, and using rehashing to remove looping in bucket extension.
I hope this is useful or at least shows the right way...

Hey Frankie, does your brain really work this way?

Code Select


HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].key  = key;
HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].data = data;
return (HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].key ? 1: 0);

I've had to simplify those lines.

Anyway, it's a neat demo, thanks.

John

frankie · July 07, 2014, 08:37:20 PM

Quote from: czerny on July 07, 2014, 12:59:45 AM
Oh, that looks promising! A little bit slower than qsort, yet. But a lot of memory for randomized data

If you want a fast algorithm and want to use small or no memory the right solution is ... magic

frankie · July 07, 2014, 08:51:00 PM

Quote from: JohnF on July 07, 2014, 04:04:06 PM
Hey Frankie, does your brain really work this way?

Code Select Expand
HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].key = key; HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].data = data; return (HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].key ? 1: 0);

I've had to simplify those lines.

Anyway, it's a neat demo, thanks.

John

John, sometimes my brain works even worst that this

Seriously the lines you are showing makes sense because in the original library key and Data were pointers. key pointed to a structure for the key holding the key itself, hashing method, lenght of key, and some other info., Data pointed to a buffer holding user data for the key.
I made a quick and dirty simplification assigning the to key the array value and at data the frequency, and fixing the hashing function as <value % size_of_hash_table> (% stay for modulus I want clarify considering the previous post on XOR operator

).
The ripetiono of addressing is not a problem, in an optimized compiler it is a branch of program tree, the compiler should be able to recognize the ripetion of code and execute it only one time. Incidentally PellesC generally don't do that

, but there is time to improve.....
Anyway I think that all the points and possible solutions have been showed, and as I said before remains only the magic...

JohnF · July 08, 2014, 07:26:46 AM

Oh, I thought % was for percent!

John

czerny · July 09, 2014, 07:05:55 PM

Another idea!

JohnF · July 09, 2014, 08:01:02 PM

Quote from: czerny on July 09, 2014, 07:05:55 PM
Another idea!

It seems to work ok, but not better than the original qsort we were playing with.

I've been messing with the various ways to deal with your problem and the old qsort method is probably the best you have. Frankie's hash, I think suffers from collisions when the value range is large, because of using the modus operator for nHashIndex.

And using a more complex hash function leads to other problems.

I managed to speed up Frankie's method but gave up when I saw problems with collisions.

Frankie may know of a way around the problems though.

John

czerny · July 10, 2014, 09:59:52 AM

Quote from: JohnF on July 09, 2014, 08:01:02 PM
Quote from: czerny on July 09, 2014, 07:05:55 PM
Another idea!
It seems to work ok, but not better than the original qsort we were playing with.

It is much quicker, if there are many values with frequency 1.

JohnF · July 10, 2014, 10:22:18 AM

Quote from: czerny on July 10, 2014, 09:59:52 AM
Quote from: JohnF on July 09, 2014, 08:01:02 PM
Quote from: czerny on July 09, 2014, 07:05:55 PM
Another idea!
It seems to work ok, but not better than the original qsort we were playing with.
It is much quicker, if there are many values with frequency 1.

EDIT: For a limited range of values there are very fast alternatives

Using Frankie's method with modus the routine for 20000000 values using rand()%10000 to fill the array only takes 203ms.

This code was inspired by Frankie's example, it uses a 2 dimensional array.

Code Select


void insert(uint32_t value)
{
	uint32_t nHashIndex = value % SIZE;

	if(arry1[nHashIndex][VAL] != value){
		arry1[nHashIndex][VAL] = value;
		arry1[nHashIndex][FEQ]++;
	}else{
		arry1[nHashIndex][FEQ]++;
	}
}

John

frankie · July 11, 2014, 01:54:03 PM

Quote from: JohnF on July 09, 2014, 08:01:02 PM
Frankie's hash, I think suffers from collisions when the value range is large, because of using the modus operator for nHashIndex.

And using a more complex hash function leads to other problems.

I managed to speed up Frankie's method but gave up when I saw problems with collisions.

Frankie may know of a way around the problems though.

John

John, hashing always will generate collisions, this is the standard way it works on a finite size table.
In our case if you dimensions an hash table of 2^32 elements the hashing function h = Val % 2^32 will be a perfect hashing

You will have no collisions and direct access to all keys, but at price of an huge memory usage.
With a reduced dimensions table you can have some benefit of direct access and some loop iteration in extended buckets. In my sample using a table that is 1/100th of array syze = 200k values the maximum looping is 2^32/200k = 21475 elements (extended buckets).
To speed up this you can use, as I mentioned on my original post, a rehashing method.
Rehashing means that the extended buckets will be an hash subtable were a new hash will allow direct addressing. The sub-hash function can be Sub_hash = (Val%Main_Hash_Table_Size) % Hash_SubTable_Size, this will cover all indexes in the subtable. Still if the subtable is not big enough to get all values you will have extended buckets.
Proceeding this way you can have as many subtable as you want, and you can also decide to what extent keep the looping. The looping will be on how many possible extended buckets can exist after the permitted subhashing levels and their tables dimensions.
The advantage is that the subtables, as the extended buckets in standard hashing, will be created when necessary reducing the memory usage. Of course subhashing uses more memory of standard hashing because it can create a full dimension subtable even if only one value has to be stored, where the standard hashing would have created a single bucket.
Another optimizing way is to use the so called 'dynamic hashing', in this case the whole system is dynamic menaning that tables, subtables and hashing changes dynamically with the table contents. You can found info on dynamic hashing here and here.

Anyway you have to consider also the cost of dynamic handling in terms of memory allocation, the more you play with memory OS function the more time you spend. The point is always the same the right method have to be a correct balance between speed and memory requirement.

P.S. Be carefull with removing collisions, perfect hashing will lead to 2*sizeof(DWORD)*2^32 memory usage!

JohnF · July 11, 2014, 03:31:57 PM

Frankie, thanks for the explanation.

John

JohnF · July 11, 2014, 04:19:54 PM

Earlier we were talking about how it's difficult to beat the CRT library functions for speed.

Well, it's worth trying inline and maximum speed, some functions can be faster than the CRT.

I have not bothered with inlining much before as compilers are free not to inline the code, but it seems worth doing now with PellesC.

John

JohnF · July 12, 2014, 04:04:58 PM

Quote from: czerny on July 06, 2014, 02:52:59 PM
Quote from: JohnF on July 06, 2014, 02:33:22 PM
Yes Pelle's qsort is good enough. I've tried other implementations of qsort and Pelle's wins every time. In fact the C run time lib has many fast routines in it.
Must be an asm-implementation. But I do not understand the speed factor 4 or 5 comparing MySort and qsort. What could be optimized?

My pivot element is the very first element in an intervall. This should be no big problem if the data is sort of random. One could choose the median as a better choice, but to find the median costs time.

What else could be optimized?

czerny, here is a good qsort, it's only slightly slower than Pelle's qsort. Compile it maximum speed. It's from Tiny Lib C.

Code Select


static void shortsort(char *lo, char *hi, size_t width, int (*comp)(const void *, const void *));
static void swap(char *p, char *q, size_t width);

#define CUTOFF 8
#define STKSIZ (8*sizeof(void*) - 2)

void q_sort(void *base, size_t num, size_t width, int (*comp) (const void *, const void *))
{
	/**************************************************************
	  Note: the number of stack entries required is no more than
	   1 + log2(num), so 30 is sufficient for any array
	***************************************************************/
	char *lo, *hi;					// ends of sub-array currently sorting 
	char *mid;						// points to middle of subarray 
	char *loguy, *higuy;			// traveling pointers for partition step 
	size_t size;					// size of the sub-array 
	char *lostk[STKSIZ], *histk[STKSIZ];
	int stkptr;						// stack for saving sub-array to be processed 

	if (num < 2)
		return;	/* nothing to do */

	stkptr = 0;	/* initialize stack */

	lo = (char *)base;
	hi = (char *)base + width * (num - 1);	/* initialize limits */

	/* this entry point is for pseudo-recursion calling: setting
	   lo and hi and jumping to here is like recursion, but stkptr is
	   preserved, locals aren't, so we preserve stuff on the stack */
  recurse:

	size = (hi - lo) / width + 1;	/* number of el's to sort */

	/* below a certain size, it is faster to use a O(n^2) sorting method */
	if (size <= CUTOFF)
	{
		shortsort(lo, hi, width, comp);
	}
	else
	{
		/* First we pick a partitioning element.  The efficiency of the
		   algorithm demands that we find one that is approximately the median
		   of the values, but also that we select one fast.  We choose the
		   median of the first, middle, and last elements, to avoid bad
		   performance in the face of already sorted data, or data that is made
		   up of multiple sorted runs appended together.  Testing shows that a
		   median-of-three algorithm provides better performance than simply
		   picking the middle element for the latter case. */

		mid = lo + (size / 2) * width;	/* find middle element */

		/* Sort the first, middle, last elements into order */
		if (comp(lo, mid) > 0)
			swap(lo, mid, width);
		if (comp(lo, hi) > 0)
			swap(lo, hi, width);
		if (comp(mid, hi) > 0)
			swap(mid, hi, width);

		/* We now wish to partition the array into three pieces, one consisting
		   of elements <= partition element, one of elements equal to the
		   partition element, and one of elements > than it.  This is done
		   below; comments indicate conditions established at every step. */

		loguy = lo;
		higuy = hi;

		/* Note that higuy decreases and loguy increases on every iteration,
		   so loop must terminate. */
		for (;;)
		{
			/* lo <= loguy < hi, lo < higuy <= hi,
			   A[i] <= A[mid] for lo <= i <= loguy,
			   A[i] > A[mid] for higuy <= i < hi,
			   A[hi] >= A[mid] */

			/* The doubled loop is to avoid calling comp(mid,mid), since some
			   existing comparison funcs don't work when passed the same
			   value for both pointers. */

			if (mid > loguy)
			{
				do
				{
					loguy += width;
				} while (loguy < mid && comp(loguy, mid) <= 0);
			}
			if (mid <= loguy)
			{
				do
				{
					loguy += width;
				} while (loguy <= hi && comp(loguy, mid) <= 0);
			}

			/* lo < loguy <= hi+1, A[i] <= A[mid] for lo <= i < loguy,
			   either loguy > hi or A[loguy] > A[mid] */

			do
			{
				higuy -= width;
			} while (higuy > mid && comp(higuy, mid) > 0);

			/* lo <= higuy < hi, A[i] > A[mid] for higuy < i < hi,
			   either higuy == lo or A[higuy] <= A[mid] */

			if (higuy < loguy)
				break;

			/* if loguy > hi or higuy == lo, then we would have exited, so
			   A[loguy] > A[mid], A[higuy] <= A[mid],
			   loguy <= hi, higuy > lo */

			swap(loguy, higuy, width);

			/* If the partition element was moved, follow it.  Only need
			   to check for mid == higuy, since before the swap,
			   A[loguy] > A[mid] implies loguy != mid. */

			if (mid == higuy)
				mid = loguy;

			/* A[loguy] <= A[mid], A[higuy] > A[mid]; so condition at top
			   of loop is re-established */
		}

		/* A[i] <= A[mid] for lo <= i < loguy,
		   A[i] > A[mid] for higuy < i < hi,
		   A[hi] >= A[mid]
		   higuy < loguy
		   implying:
		   higuy == loguy-1
		   or higuy == hi - 1, loguy == hi + 1, A[hi] == A[mid] */

		/* Find adjacent elements equal to the partition element.  The
		   doubled loop is to avoid calling comp(mid,mid), since some
		   existing comparison funcs don't work when passed the same value
		   for both pointers. */

		higuy += width;
		if (mid < higuy)
		{
			do
			{
				higuy -= width;
			} while (higuy > mid && comp(higuy, mid) == 0);
		}
		if (mid >= higuy)
		{
			do
			{
				higuy -= width;
			} while (higuy > lo && comp(higuy, mid) == 0);
		}

		/* OK, now we have the following:
		   higuy < loguy
		   lo <= higuy <= hi
		   A[i]  <= A[mid] for lo <= i <= higuy
		   A[i]  == A[mid] for higuy < i < loguy
		   A[i]  >  A[mid] for loguy <= i < hi
		   A[hi] >= A[mid] */

		/* We've finished the partition, now we want to sort the subarrays
		   [lo, higuy] and [loguy, hi].
		   We do the smaller one first to minimize stack usage.
		   We only sort arrays of length 2 or more. */

		if (higuy - lo >= hi - loguy)
		{
			if (lo < higuy)
			{
				lostk[stkptr] = lo;
				histk[stkptr] = higuy;
				++stkptr;
			}	/* save big recursion for later */

			if (loguy < hi)
			{
				lo = loguy;
				goto recurse;	/* do small recursion */
			}
		}
		else
		{
			if (loguy < hi)
			{
				lostk[stkptr] = loguy;
				histk[stkptr] = hi;
				++stkptr;	/* save big recursion for later */
			}

			if (lo < higuy)
			{
				hi = higuy;
				goto recurse;	/* do small recursion */
			}
		}
	}

	/* We have sorted the array, except for any pending sorts on the stack.
	   Check if there are any, and do them. */

	--stkptr;
	if (stkptr >= 0)
	{
		lo = lostk[stkptr];
		hi = histk[stkptr];
		goto recurse;	/* pop subarray from stack */
	}
	else
		return;	/* all subarrays done */
}

static void shortsort(char *lo, char *hi, size_t width, int (*comp) (const void *, const void *))
{
	char *p, *max;

	/* Note: in assertions below, i and j are alway inside original bound of
	   array to sort. */

	while (hi > lo)
	{
		/* A[i] <= A[j] for i <= j, j > hi */
		max = lo;
		for (p = lo + width; p <= hi; p += width)
		{
			/* A[i] <= A[max] for lo <= i < p */
			if (comp(p, max) > 0)
				max = p;
			/* A[i] <= A[max] for lo <= i <= p */
		}

		/* A[i] <= A[max] for lo <= i <= hi */

		swap(max, hi, width);

		/* A[i] <= A[hi] for i <= hi, so A[i] <= A[j] for i <= j, j >= hi */

		hi -= width;

		/* A[i] <= A[j] for i <= j, j > hi, loop top condition established */
	}
	/* A[i] <= A[j] for i <= j, j > lo, which implies A[i] <= A[j] for i < j,
	   so array is sorted */
}

static void swap(char *a, char *b, size_t width)
{
	char tmp = 0;

	if (a != b)
		// Do the swap one character at a time to avoid potential alignment problems.
		while (width--)
		{
			tmp = *a;
			*a++ = *b;
			*b++ = tmp;
		}
}

John

czerny · July 14, 2014, 06:34:05 PM

Thank you John,

on my old machine it is even quicker:

Pelles qsort: 7035 ms, TinyLib q_sort: 6267 ms (5486 ms for same with unsigned long swap)

I will try to rewrite it for a special unsigned long version with inlined swaps.

Edit: I am a little bit confused about the following statement:

   /**************************************************************
     Note: the number of stack entries required is no more than
     1 + log2(num), so 30 is sufficient for any array
   ***************************************************************/

If num = 2^32 --> 1 + log2(num) = 33 > 30

JohnF · July 14, 2014, 08:22:58 PM

Quote from: czerny on July 14, 2014, 06:34:05 PM
Thank you John,

on my old machine it is even quicker:

Pelles qsort: 7035 ms, TinyLib q_sort: 6267 ms (5486 ms for same with unsigned long swap)

I will try to rewrite it for a special unsigned long version with inlined swaps.

Edit: I am a little bit confused about the following statement:

   /**************************************************************
     Note: the number of stack entries required is no more than
     1 + log2(num), so 30 is sufficient for any array
   ***************************************************************/

If num = 2^32 --> 1 + log2(num) = 33 > 30

The TinyLib qsort is based on the Microsoft one but with improvements.

As for this >> 1 + log2(num) = 33

I can't comment on it what the guy says. Whatever, that would be very large array indeed.

Change
#define STKSIZ (8*sizeof(void*) - 2)
to
#define STKSIZ (8*sizeof(void*))

John

News:

frequency of ints in huge array

czerny

JohnF

frankie

frankie

JohnF

czerny

JohnF

czerny

JohnF

frankie

JohnF

JohnF

JohnF

czerny

JohnF