Pelles C forum
General => ChitChat => Topic started by: czerny on July 03, 2014, 01:29:15 pm

I have a little problem. Maybe there is an intelligent solution I don't know about?
I have a huge array (typically 20000000 elements) of unsigned long ints, and I have to find the int with the highest frequency.
My idea so far is to use some sort of binary tree with something like
struct node {
    unsigned int x;
    unsigned int freq;
    struct node *left;
    struct node *right;
};
as the node type. But this way I would need to allocate five times as much memory.
Is there a cheaper way to do this?

Have you tried sorting the array? On my trusty AMD, that takes three seconds. Afterwards, you just go through the array and check how often a value is repeated, and if its repeat count is higher than the previous champion's, you take that value.

Have you tried sorting the array? On my trusty AMD, that takes three seconds. Afterwards, you just go through the array and check how often a value is repeated, and if its repeat count is higher than the previous champion's, you take that value.
Well, that would only work if the original order of the array doesn't matter; czerny unfortunately did not say whether that is a restriction or not.
He also did not mention whether time is an issue here, otherwise I would suggest just counting from the top down, which could be reasonably fast depending on the distribution of values within the array.
Might also want to look into a modification of a binary search over the array...
Ralf

I would suggest just start counting from the top down
That sounds straightforward, can you post an example?

Have you tried sorting the array? On my trusty AMD, that takes three seconds. Afterwards, you just go through the array and check how often a value is repeated, and if its repeat count is higher than the previous champion's, you take that value.
I do not want to change the original array. So I would have to make a copy. :(

I would suggest just start counting from the top down, ...
Do you mean something like that?
maxcount = 0;
for (i=0; i<MAX; i++) {
    count = 0;
    for (k=i; k<MAX; k++)
        if (a[k] == a[i]) count++;
    if (count > maxcount) {
        maxcount = count;
        index = i;
    }
}
This is very slow for huge arrays! O(n²)

This is very slow for huge arrays! O(n²)
Indeed. If you have enough RAM, sorting a copy is a lot faster. Quicksort does it in O(n log n); a radix sort can be even faster, depending on the data. Finding the most frequent element after the sort takes negligible time, see below.
AMD Athlon(tm) Dual Core Processor 4450B (MMX, SSE, SSE2, SSE3)
2596 ms for sorting
77 ms for finding

Indeed. If you have enough RAM, sorting a copy is a lot faster. Quicksort does it in O(n log n); a radix sort can be even faster, depending on the data. Finding the most frequent element after the sort takes negligible time, see below.
AMD Athlon(tm) Dual Core Processor 4450B (MMX, SSE, SSE2, SSE3)
2596 ms for sorting
77 ms for finding
Plus the time to copy the array (which is small) and twice the memory.
But by all means better than the tree.

I would suggest just start counting from the top down, ...
Do you mean something like that?
[...]
This is very slow for huge arrays! O(n²)
Maybe this can be optimized: the search can be terminated early once maxcount > (MAX - i), since no remaining suffix can beat it. But this depends on the data.

czerny, see if this is ok, it takes 92 msecs here.
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

long long __cdecl StartTimer(void)
{
    long long t1;
    QueryPerformanceCounter((LARGE_INTEGER *)&t1);
    return t1;
}

long long __cdecl StopTimer(long long t1)
{
    long long t2;
    long long frequency;
    QueryPerformanceCounter((LARGE_INTEGER *)&t2);
    QueryPerformanceFrequency((LARGE_INTEGER *)&frequency);
    long long elapsedTime = (t2 - t1) * 1000 / frequency;
    return elapsedTime;
}

int __cdecl main(void)
{
#define SIZE 20000000
    unsigned long int i;
    unsigned long int *arry;

    arry = malloc(SIZE * sizeof(unsigned long int));
    if (arry == NULL) {
        printf("Error, no mem!");
        return 0;
    }
    for (i = 0; i < SIZE; i++)
    {
        arry[i] = rand() % 10000;
    }

    long long t = StartTimer();
    unsigned long int maximum;
    int freq = 1;
    maximum = arry[0];
    for (i = 0; i < SIZE; i++)
    {
        if (arry[i] > maximum) {
            maximum = arry[i];
            freq = 1;
        } else if (arry[i] == maximum) {
            freq++;
        }
    }
    printf("msecs %lld\n", StopTimer(t));
    printf("maximum %lu has occurred %d times\n", maximum, freq);
    free(arry);
    return 0;
}
EDIT: freq = 1;
John

czerny, see if this is ok, it takes 92 msecs here.
[...]
Try this with: SIZE=3; arry[] = {5,3,3};
I want the result: maximum == 3; freq == 2;

Plus the time to copy the array (which is small) and 2 * memory.
199 ms for copying
5800 ms for sorting
127 ms for finding
"copying" includes allocating and freeing the copy. The biggest chunk is clearly the sorting. What kind of data do you have? Does it spread over the whole 32bit range?

Plus the time to copy the array (which is small) and 2 * memory.
199 ms for copying
5800 ms for sorting
127 ms for finding
"copying" includes allocating and freeing the copy. The biggest chunk is clearly the sorting. What kind of data do you have? Does it spread over the whole 32bit range?
Yes, there is no known subset.

czerny, sorry about my post, I should have read more carefully.
This code takes just over 4 secs here. I found the main body of the code on the web.
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long long __cdecl StartTimer(void)
{
    long long t1;
    QueryPerformanceCounter((LARGE_INTEGER *)&t1);
    return t1;
}

long long __cdecl StopTimer(long long t1)
{
    long long t2;
    long long frequency;
    QueryPerformanceCounter((LARGE_INTEGER *)&t2);
    QueryPerformanceFrequency((LARGE_INTEGER *)&frequency);
    long long elapsedTime = (t2 - t1) * 1000 / frequency;
    return elapsedTime;
}

int __cdecl CompareInt(const void *param1, const void *param2)
{
    unsigned long int *pInt1 = (unsigned long int *)param1;
    unsigned long int *pInt2 = (unsigned long int *)param2;
    if (*pInt1 < *pInt2)
        return -1;
    else if (*pInt1 > *pInt2)
        return 1;
    return 0;
}

int __cdecl main(void)
{
#define SIZE 20000000
    unsigned long int i;
    unsigned long int *arry, *arry1;

    arry = malloc(SIZE * sizeof(unsigned long int));
    if (arry == NULL) {
        printf("Error, no mem!");
        return 0;
    }
    arry1 = malloc(SIZE * sizeof(unsigned long int));
    if (arry1 == NULL) {
        printf("Error, no mem!");
        return 0;
    }
    for (i = 0; i < SIZE; i++)
    {
        arry[i] = rand() % 10000;
    }

    long long t = StartTimer();
    memcpy(arry1, arry, SIZE * sizeof(unsigned long int));
    qsort(arry1, SIZE, sizeof(unsigned long int), CompareInt); /* sort the copy, not the original */

    unsigned long int previous = arry1[0];
    unsigned long int popular = arry1[0];
    int count = 1;
    int maxCount = 1;
    for (i = 1; i < SIZE; i++) {
        if (arry1[i] == previous)
            count++;
        else {
            if (count > maxCount) {
                popular = arry1[i - 1];
                maxCount = count;
            }
            previous = arry1[i];
            count = 1;
        }
    }
    if (count > maxCount) {   /* don't forget the final run */
        popular = previous;
        maxCount = count;
    }
    printf("msecs %lld\n", StopTimer(t));
    printf("maximum %lu has occurred %d times\n", popular, maxCount);
    free(arry);
    free(arry1);
    return 0;
}
If it's no good I'll stop. :)
John

I would propose a different approach that is very efficient in memory usage and time if the distribution is narrow (we have many repetitions), because we scan the main array only one time and the 'reduced' array many times.
If the values are expected to be spread like random noise, this algorithm is a nightmare like the others ;D
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define _DEBUG
#ifdef _DEBUG
#define PDEBUG(...) printf(__VA_ARGS__)
#else
#define PDEBUG(...)
#endif

#define ARRAY_SIZE 20000000
#define STEP (sizeof(array)/(100*sizeof(DWORD))*sizeof(EVAL))

typedef struct
{
    DWORD val;
    int Count;
} EVAL, *LPEVAL;

DWORD array[ARRAY_SIZE];

BOOL ArrayMode(DWORD a[], int arraySize, DWORD *Val, int *freq)
{
    DWORD CurrSize = STEP;          // Start with a small percentage
    DWORD Idx = 0;                  // Current free element
    LPEVAL pE = malloc(CurrSize);
    if (!pE)
        return FALSE;               // Memory fail
    for (int i = 0; i < sizeof(array)/sizeof(DWORD); i++)
    {
        DWORD j;
        for (j = 0; j < Idx; j++)
            if (pE[j].val == a[i])
            {
                pE[j].Count++;
                break;
            }
        if (j == Idx)               // Not found, we have to add the new element
        {
            // First check if we have room in the array
            if (Idx >= CurrSize/sizeof(EVAL))
            {                       // We have to expand the array
                PDEBUG("Expanding array, Idx=%u\n", Idx);
                getchar();
                void *p = realloc(pE, CurrSize + STEP);
                if (!p)
                {
                    free(pE);
                    return FALSE;
                }
                pE = p;
                CurrSize += STEP;
            }
            // Add new element
            pE[Idx].val = a[i];
            pE[Idx].Count = 1;
            PDEBUG("Adding element, val[%u]=%u\n", Idx, pE[Idx].val);
            Idx++;
        }
    }
    // Now search the mode of the reduced array
    DWORD IdxMax = 0;
    for (DWORD i = 0; i < Idx; i++)
        if (pE[i].Count > pE[IdxMax].Count)
            IdxMax = i;
    // Copy results
    *Val = pE[IdxMax].val;
    *freq = pE[IdxMax].Count;
    // Free memory
    free(pE);
    return TRUE;
}

void test(void)
{
    DWORD val;
    int freq;
    // Fill a test array
    for (int i = 0; i < sizeof(array)/sizeof(DWORD); i++)
#if 1
        if (i < 1000)
            array[i] = 1;
        else if (i < 100000)
            array[i] = 2;
        else
            array[i] = 3;
#else
        array[i] = rand();
#endif
    ArrayMode(array, ARRAY_SIZE, &val, &freq);
    printf("Array mode: value=%u, freq=%d\n", val, freq);
}

This code takes just over 4 secs here.
Time the qsort separately. It will be just under 4 secs ;)
If it's no good I'll stop. :)
It seems you are doing exactly what I did in my tests with assembler. So don't stop! Unless we see something faster, it seems the way to go ;)

I would suggest just start counting from the top down, ...
Do you mean something like that?
[...]
This is very slow for huge arrays! O(n²)
Maybe this can be optimized: the search can be terminated early once maxcount > (MAX - i). But this depends on the data.
That's what I mentioned in my original reply. It highly depends on the spread of the values within the array.
I have done something similar years back; I just need to dig it out. But I am not frequently at a computer: there's a long weekend here in the US of A, and as I have no vacation time being self-employed, it's as close as any "time off" will be for a while...
Ralf

This code takes just over 4 secs here.
Time the qsort separately. It will be just under 4 secs ;)
If it's no good I'll stop. :)
It seems you are doing exactly what I did in my tests with assembler. So don't stop! Unless we see something faster, it seems the way to go ;)
I've tried two radix sorts and they were slower; maybe they were not great implementations, I don't know.
EDIT:
Changing the compare function slightly
int __cdecl CompareInt(const void *param1, const void *param2)
{
    if (*(unsigned long int *)param1 < *(unsigned long int *)param2)
        return -1;
    else if (*(unsigned long int *)param1 > *(unsigned long int *)param2)
        return 1;
    else
        return 0;
}
and setting maximize speed, the time comes down to 3.1 secs.
John

... setting maximize speed, the time comes down to 3.1 secs.
Sounds good. Which CPU? Mine is a Celeron M.

... setting maximize speed, the time comes down to 3.1 secs.
Sounds good. Which CPU? Mine is a Celeron M.
Pentium(R) Dual-Core 2.7 GHz
John

If it's no good I'll stop. :)
What do you want to stop? Stop breathing? Stop coding? Stop posting in the Pelles C forum?
Don't do it, please! ;)

I would propose a different approach that is very efficient if the distribution is narrow (we have many repetitions)
Try this (there are many repeats: 10000000 x 17):
for (int i=0; i<sizeof(array)/sizeof(DWORD); i++)
    if (i%2) array[i] = 17; else array[i] = i;

377 Celeron M (http://www.cpubenchmark.net/cpu_lookup.php?cpu=Intel+Celeron+M+1.60GHz&id=710)
1632 Pentium(R) DualCore 2.7 GHz (http://www.cpubenchmark.net/cpu.php?cpu=Intel+Pentium+E5400+%40+2.70GHz)

377 Celeron M (http://www.cpubenchmark.net/cpu_lookup.php?cpu=Intel+Celeron+M+1.60GHz&id=710)
1632 Pentium(R) DualCore 2.7 GHz (http://www.cpubenchmark.net/cpu.php?cpu=Intel+Pentium+E5400+%40+2.70GHz)
One could go higher! ;)
John

Czerny, if you can get Pelle's OpenMP stuff working with his quick_sort I believe it would speed up the sorting quite a bit.
However, I've not been able to get it working with unsigned ints.
http://www.smorgasbordet.com/pellesc/zip/qsort.zip (http://www.smorgasbordet.com/pellesc/zip/qsort.zip)
The fastest is SORT_METHOD 3
In his example you can sort 20000000 floats in 1.6 secs.
John

If you are not too disgusted by assembler, try the attachment. It runs either on the command-line argument or (if there are no args) on a file called FatArray.dat.

That's certainly fast, 1.6 secs in total here.
Could you make a library that can be called from C?
All your best stuff in 32- and 64-bit libs. :)
John

If you are not too disgusted by assembler, try the attachment. It runs either on the command-line argument or (if there are no args) on a file called FatArray.dat.
Hmm!
On my old machine I got for copying + sorting + finding:
jj2700 : 2843 ms
john : 7780 ms (with qsort)
MySort : 37326 ms
I wonder why MySort is soooo slow!
void MySort(unsigned long arr[], unsigned long l, unsigned long r)
{
    if (l < r) {
        unsigned long h, k, last, first = l+1, p = arr[l];
        last = l;
        for (k = l+1; k <= r; k++) {
            if (arr[k] < p) {
                h = arr[k]; arr[k] = arr[first]; arr[first] = h;
                last = first++;
            }
        }
        if (last > l) {
            arr[l] = arr[last]; arr[last] = p;
        }
        if (last > l) MySort(arr, l, last-1);
        if (last < r) MySort(arr, last+1, r);
    }
}
jj2700: Your code for sorting is not included. I have downloaded SetupMasmBasic and executed SetupMasmBasic.exe, but I got only an empty dialog with 'last chance ...' as the title and 5 files in a temp directory.
I have found a non-recursive asm version on the net: http://www.cs.hofstra.edu/~cscccl/csc111/qsort.asm (http://www.cs.hofstra.edu/~cscccl/csc111/qsort.asm)
I would like to learn how to make this Pelles C compatible. This should be easy in this case because there is a wrapper function for C.
Edit: Ok, I have changed .386p to .386 and deleted the .stack directive (don't know if this was right?)!
But it runs: 4587 ms

Czerny, if you can get Pelle's OpenMP stuff working with his quick_sort I believe it would speed up the sorting quite a bit.
I neither own a machine with a multi-core processor, nor do I have an operating system at home that runs the newest Pelles C. :'(

I would propose a different approach that is very efficient if the distribution is narrow (we have many repetitions)
Try this (there are many repeats: 10000000 x 17):
for (int i=0; i<sizeof(array)/sizeof(DWORD); i++)
    if (i%2) array[i] = 17; else array[i] = i;
Yes, but this is not a narrow distribution anyway: you have 10M values equal and 10M values different... :\
To speed up the calculation you have to sort the array; one of the fastest ways (if you don't hate the math) is to sample at powers-of-2 steps as described here (http://www.quora.com/Algorithms/What-is-the-most-efficient-algorithm-for-calculating-the-mode-of-an-array-of-integers)... :P
P.S. If time is relevant, and with arrays of data this large it is, JJ's suggestion to use assembler is appropriate. ;)
EDIT: if you reduce to 20K different values it takes 350 ms on my machine:
for (int i=0; i<sizeof(array)/sizeof(DWORD); i++)
    if (i%1000) array[i] = 17; else array[i] = i;

I got only an empty dialog with 'last chance ...'
Sorry... the setup is a bit mean in that it requires that Masm32 is installed (http://masm32.com/installation.htm), the reason being that an assembler library can do harm in the wrong hands. That's why there is this little obstacle.
Anyway, I'll see if I can provide you a *.lib file with the integer sort.

Here is Pelle's OpenMP quick_sort (SORT_METHOD 3, fastest), now working with unsigned int.
Here it completes a 20000000-element sort in 1.35 secs.
For anyone but Czerny, that is.
John

That is pretty fast! I managed to add the sort to a little library (attached), here is a test:
#include <stdio.h>
#include <windows.h> // MessageBox
#pragma warn(disable:2216) // retval never used
#pragma warn(disable:2118) // para not referenced
#pragma comment(lib, "\\Masm32\\MasmBasic\\Res\\xTimer.lib") // adjust to your path
extern void __cdecl xTimerStart(void); // uses QPC and returns
extern int __stdcall xTimerEnd(void); // microseconds
extern int __stdcall MbArrSort(int *pStart, int elements, int pKey, int mode); // MasmBasic ArraySort

int main(int argc, char *argv[]) {
    int i, MyArr[6] = {123, 789, 456, 111, 333, 222};
    xTimerStart();
    // mode = size (4, 8) or (+32 if real) or (+64 if ascending)
    MbArrSort(&MyArr[0], 6, 0, 4+64); // size 4, integer, ascending
    printf("Sorting took %i microseconds\n", xTimerEnd());
    for (i = 0; i < 6; i++) printf("%i\t%i\n", i, MyArr[i]);
    return MessageBox(0, "Sorted?", "Confirm:", MB_YESNO | MB_ICONINFORMATION);
}
What is rather odd is that the linker can't find MbArrSort unless xTimerStart is also being used. I often call assembler routines from C, and it always works; sort and timer are not connected in any way, so for the time being it's a mystery...
MbArrSort also handles QWORD, float and double arrays; see the mode comment.

POLINK: fatal error: File not found: '\masm32\lib\masm32.lib'.
masm32.lib
gdi32.lib
user32.lib
Comctl32.lib
Comdlg32.lib
shell32.lib
oleaut32.lib
ole32.lib
msvcrt.lib
masmbasic.lib
You asked if I were disgusted by asm, well I'm getting close!
John

POLINK: fatal error: File not found: '\masm32\lib\masm32.lib'.
John,
It seems there are some dependencies that require a Masm32 installation (http://masm32.com/masmdl.htm), sorry for that.
In any case, Pelle's quicksort is probably fast enough. My sort is a radix sort; it is faster than the Masm32 library quicksort if the integer range is not fully used, but somewhat (10%?) slower if the full 32-bit range is present in the array.

Yes Pelle's qsort is good enough. I've tried other implementations of qsort and Pelle's wins every time. In fact the C run time lib has many fast routines in it.
John

Yes Pelle's qsort is good enough. I've tried other implementations of qsort and Pelle's wins every time. In fact the C run time lib has many fast routines in it.
Must be an asm implementation. But I do not understand the speed factor of 4 or 5 between MySort and qsort. What could be optimized?
My pivot element is the very first element in an interval. This should be no big problem if the data is sort of random. One could choose the median as a better pivot, but finding the median costs time.
What else could be optimized?

Yes Pelle's qsort is good enough. I've tried other implementations of qsort and Pelle's wins every time. In fact the C run time lib has many fast routines in it.
Must be an asm implementation. But I do not understand the speed factor of 4 or 5 between MySort and qsort. What could be optimized?
My pivot element is the very first element in an interval. This should be no big problem if the data is sort of random. One could choose the median as a better pivot, but finding the median costs time.
What else could be optimized?
You could try a different sort method; maybe it would beat qsort on this occasion.
However, if I were you I'd pick the fastest option available, which is probably Pelle's qsort, unless you can find a fast asm routine, that is.
I've tried to better Pelle's CRT routines with asm but have always been beaten by the CRT lib.
EDIT: attached is M.S. qsort.c. I've not tried it.
John

I've tried to better Pelle's CRT routines with asm but have always been beaten by the CRT lib.
The CRT routines are in general fast, and some cannot be improved with asm. However, many can be beaten easily by a factor of 2-3. One reason is that most of them are still pre-SSE2 for backward compatibility.

However, if I were you I'd pick the fastest option available, which is probably Pelle's qsort, unless you can find a fast asm routine, that is.
For my actual needs I will use qsort. That's ok.
But there is still a lack of understanding which bothers me. I do not expect to beat an asm routine with C; I would expect a factor of, say, 2. A factor of 5 says to me: there must be some algorithmic improvement possible!

... still pre-SSE2 for backward compatibility.
fortunately!

I tried another system to reduce the time, without ordering the array.
It uses hashing.
You can play with the hash table dimensions to trade time against memory (they have opposite behaviour).
The time overhead is due, as is clear to anybody, to the huge number of loops needed to check the array. The indexing technique used with hashing avoids those loops, and the cost of hash handling is widely repaid by faster access.
The project posted shows how it works, but it is not optimized yet. You can get much better results with a faster hashing library that reduces memory accesses, and by using rehashing to remove looping in bucket extension.
I hope this is useful or at least shows the right way...

... still pre-SSE2 for backward compatibility.
fortunately!
Hmmm....: Pelles C Release Candidate 5 (for 32-bit Windows Vista/7/8) (http://www.smorgasbordet.com/pellesc/download.htm)
That's a much tougher obstacle IMHO ;)

... still pre-SSE2 for backward compatibility.
fortunately!
Hmmm....: Pelles C Release Candidate 5 (for 32-bit Windows Vista/7/8) (http://www.smorgasbordet.com/pellesc/download.htm)
What do you mean? 32-bit?
That's a much tougher obstacle IMHO ;)

No, the restriction to Vista and upwards:
- about 25% of all PCs run with XP
- about 2% run with CPUs that don't have SSE2...

I tried another system to reduce the time, without ordering the array.
[...]
Oh, that looks promising! Still a little bit slower than qsort, though. But a lot of memory for randomized data.

I tried another system to reduce the time, without ordering the array.
[...]
Hey Frankie, does your brain really work this way? :o
HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].key = key;
HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].data = data;
return (HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].key ? 1 : 0);
I've had to simplify those lines.
Anyway, it's a neat demo, thanks.
John

Oh, that looks promising! A little bit slower than qsort, yet. But a lot of memory for randomized data
If you want a fast algorithm and little or no extra memory, the right solution is ... magic ;D

Hey Frankie, does your brain really work this way? :o
HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].key = key;
HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].data = data;
return (HashTbl->buckets[nHashIndex].entry[HashTbl->buckets[nHashIndex].cnt-1].key ? 1 : 0);
I've had to simplify those lines.
Anyway, it's a neat demo, thanks.
John
John, sometimes my brain works even worse than this ;D
Seriously, the lines you are showing make sense because in the original library key and data were pointers: key pointed to a structure holding the key itself, the hashing method, the length of the key, and some other info; data pointed to a buffer holding user data for the key.
I made a quick and dirty simplification, assigning to key the array value and to data the frequency, and fixing the hashing function as <value % size_of_hash_table> (% stands for modulus, I want to clarify, considering the previous post on the XOR operator ;D ;D).
The repetition of addressing is not a problem: in an optimizing compiler it is a branch of the program tree, and the compiler should be able to recognize the repeated code and execute it only one time. Incidentally Pelles C generally doesn't do that :'(, but there is time to improve...
Anyway I think that all the points and possible solutions have been shown, and as I said before only the magic remains... ;D

Oh, I thought % was for percent! :D
John

Another idea!

Another idea!
It seems to work ok, but no better than the original qsort we were playing with.
I've been messing with the various ways to deal with your problem, and the old qsort method is probably the best you have. Frankie's hash, I think, suffers from collisions when the value range is large, because of using the modulo operator for nHashIndex.
And using a more complex hash function leads to other problems.
I managed to speed up Frankie's method but gave up when I saw the problems with collisions.
Frankie may know of a way around the problems though.
John

Another idea!
It seems to work ok, but no better than the original qsort we were playing with.
It is much quicker if there are many values with frequency 1.

Another idea!
It seems to work ok, but no better than the original qsort we were playing with.
It is much quicker if there are many values with frequency 1.
:)
EDIT: For a limited range of values there are very fast alternatives.
Using Frankie's method with modulo, the routine for 20000000 values using rand()%10000 to fill the array takes only 203 ms.
This code was inspired by Frankie's example; it uses a 2-dimensional array.
void insert(uint32_t value)
{
    uint32_t nHashIndex = value % SIZE;
    /* With values limited to rand()%10000 and SIZE >= 10000, every value
       maps to its own slot, so there are no collisions to resolve. */
    arry1[nHashIndex][VAL] = value;
    arry1[nHashIndex][FEQ]++;
}
John

Frankie's hash, I think, suffers from collisions when the value range is large, because of using the modulo operator for nHashIndex.
And using a more complex hash function leads to other problems.
I managed to speed up Frankie's method but gave up when I saw the problems with collisions.
Frankie may know of a way around the problems though.
John
John, hashing will always generate collisions; this is the standard way it works on a finite-size table.
In our case, if you dimension a hash table of 2^32 elements, the hashing function h = Val % 2^32 will be a perfect hash ;D
You will have no collisions and direct access to all keys, but at the price of huge memory usage.
With a reduced-dimension table you get some of the benefit of direct access plus some loop iteration in extended buckets. In my sample, using a table that is 1/100th of the array size = 200k values, the maximum looping is 2^32/200k = 21475 elements (extended buckets).
To speed this up you can use, as I mentioned in my original post, a rehashing method.
Rehashing means that the extended buckets become a hash subtable where a new hash allows direct addressing. The sub-hash function can be Sub_hash = (Val % Main_Hash_Table_Size) % Hash_SubTable_Size; this will cover all indexes in the subtable. Still, if the subtable is not big enough to hold all values, you will again have extended buckets.
Proceeding this way you can have as many subtables as you want, and you can also decide to what extent to keep the looping. The remaining looping depends on how many extended buckets can still exist after the permitted subhashing levels and their table dimensions.
The advantage is that the subtables, like the extended buckets in standard hashing, are created only when necessary, reducing memory usage. Of course subhashing uses more memory than standard hashing, because it can create a full-dimension subtable even if only one value has to be stored, where standard hashing would have created a single bucket.
Another way to optimize is to use so-called 'dynamic hashing', where the whole system is dynamic, meaning that tables, subtables and hashing change dynamically with the table contents. You can find info on dynamic hashing here (http://en.wikipedia.org/wiki/Dynamic_perfect_hashing) and here (http://www.cs.sfu.ca/CourseCentral/354/zaiane/material/notes/Chapter11/node20.html).
Anyway, you also have to consider the cost of dynamic handling in terms of memory allocation: the more you play with the OS memory functions, the more time you spend. The point is always the same: the right method has to be a correct balance between speed and memory requirements. 8)
P.S. Be careful with removing collisions: perfect hashing will lead to 2*sizeof(DWORD)*2^32 memory usage!

Frankie, thanks for the explanation.
John

Earlier we were talking about how it's difficult to beat the CRT library functions for speed.
Well, it's worth trying inline and maximum speed; some functions can be faster than the CRT.
I have not bothered much with inlining before, as compilers are free not to inline the code, but it seems worth doing now with Pelles C.
John

Yes Pelle's qsort is good enough. I've tried other implementations of qsort and Pelle's wins every time. In fact the C run time lib has many fast routines in it.
Must be an asm implementation. But I do not understand the speed factor of 4 or 5 between MySort and qsort. What could be optimized?
My pivot element is the very first element in an interval. This should be no big problem if the data is sort of random. One could choose the median as a better pivot, but finding the median costs time.
What else could be optimized?
czerny, here is a good qsort; it's only slightly slower than Pelle's qsort. Compile it with maximum speed. It's from Tiny Lib C.
static void shortsort(char *lo, char *hi, size_t width, int (*comp)(const void *, const void *));
static void swap(char *p, char *q, size_t width);
#define CUTOFF 8
#define STKSIZ (8*sizeof(void*)  2)
void q_sort(void *base, size_t num, size_t width, int (*comp) (const void *, const void *))
{
/**************************************************************
Note: the number of stack entries required is no more than
1 + log2(num), so 30 is sufficient for any array
***************************************************************/
char *lo, *hi; // ends of subarray currently sorting
char *mid; // points to middle of subarray
char *loguy, *higuy; // traveling pointers for partition step
size_t size; // size of the subarray
char *lostk[STKSIZ], *histk[STKSIZ];
int stkptr; // stack for saving subarray to be processed
if (num < 2)
return; /* nothing to do */
stkptr = 0; /* initialize stack */
lo = (char *)base;
hi = (char *)base + width * (num  1); /* initialize limits */
/* this entry point is for pseudorecursion calling: setting
lo and hi and jumping to here is like recursion, but stkptr is
preserved, locals aren't, so we preserve stuff on the stack */
recurse:
size = (hi - lo) / width + 1; /* number of el's to sort */
/* below a certain size, it is faster to use a O(n^2) sorting method */
if (size <= CUTOFF)
{
shortsort(lo, hi, width, comp);
}
else
{
/* First we pick a partitioning element. The efficiency of the
algorithm demands that we find one that is approximately the median
of the values, but also that we select one fast. We choose the
median of the first, middle, and last elements, to avoid bad
performance in the face of already sorted data, or data that is made
up of multiple sorted runs appended together. Testing shows that a
median-of-three algorithm provides better performance than simply
picking the middle element for the latter case. */
mid = lo + (size / 2) * width; /* find middle element */
/* Sort the first, middle, last elements into order */
if (comp(lo, mid) > 0)
swap(lo, mid, width);
if (comp(lo, hi) > 0)
swap(lo, hi, width);
if (comp(mid, hi) > 0)
swap(mid, hi, width);
/* We now wish to partition the array into three pieces, one consisting
of elements <= partition element, one of elements equal to the
partition element, and one of elements > than it. This is done
below; comments indicate conditions established at every step. */
loguy = lo;
higuy = hi;
/* Note that higuy decreases and loguy increases on every iteration,
so loop must terminate. */
for (;;)
{
/* lo <= loguy < hi, lo < higuy <= hi,
A[i] <= A[mid] for lo <= i <= loguy,
A[i] > A[mid] for higuy <= i < hi,
A[hi] >= A[mid] */
/* The doubled loop is to avoid calling comp(mid,mid), since some
existing comparison funcs don't work when passed the same
value for both pointers. */
if (mid > loguy)
{
do
{
loguy += width;
} while (loguy < mid && comp(loguy, mid) <= 0);
}
if (mid <= loguy)
{
do
{
loguy += width;
} while (loguy <= hi && comp(loguy, mid) <= 0);
}
/* lo < loguy <= hi+1, A[i] <= A[mid] for lo <= i < loguy,
either loguy > hi or A[loguy] > A[mid] */
do
{
higuy -= width;
} while (higuy > mid && comp(higuy, mid) > 0);
/* lo <= higuy < hi, A[i] > A[mid] for higuy < i < hi,
either higuy == lo or A[higuy] <= A[mid] */
if (higuy < loguy)
break;
/* if loguy > hi or higuy == lo, then we would have exited, so
A[loguy] > A[mid], A[higuy] <= A[mid],
loguy <= hi, higuy > lo */
swap(loguy, higuy, width);
/* If the partition element was moved, follow it. Only need
to check for mid == higuy, since before the swap,
A[loguy] > A[mid] implies loguy != mid. */
if (mid == higuy)
mid = loguy;
/* A[loguy] <= A[mid], A[higuy] > A[mid]; so condition at top
of loop is reestablished */
}
/* A[i] <= A[mid] for lo <= i < loguy,
A[i] > A[mid] for higuy < i < hi,
A[hi] >= A[mid]
higuy < loguy
implying:
higuy == loguy - 1
or higuy == hi - 1, loguy == hi + 1, A[hi] == A[mid] */
/* Find adjacent elements equal to the partition element. The
doubled loop is to avoid calling comp(mid,mid), since some
existing comparison funcs don't work when passed the same value
for both pointers. */
higuy += width;
if (mid < higuy)
{
do
{
higuy -= width;
} while (higuy > mid && comp(higuy, mid) == 0);
}
if (mid >= higuy)
{
do
{
higuy -= width;
} while (higuy > lo && comp(higuy, mid) == 0);
}
/* OK, now we have the following:
higuy < loguy
lo <= higuy <= hi
A[i] <= A[mid] for lo <= i <= higuy
A[i] == A[mid] for higuy < i < loguy
A[i] > A[mid] for loguy <= i < hi
A[hi] >= A[mid] */
/* We've finished the partition, now we want to sort the subarrays
[lo, higuy] and [loguy, hi].
We do the smaller one first to minimize stack usage.
We only sort arrays of length 2 or more. */
if (higuy - lo >= hi - loguy)
{
if (lo < higuy)
{
lostk[stkptr] = lo;
histk[stkptr] = higuy;
++stkptr;
} /* save big recursion for later */
if (loguy < hi)
{
lo = loguy;
goto recurse; /* do small recursion */
}
}
else
{
if (loguy < hi)
{
lostk[stkptr] = loguy;
histk[stkptr] = hi;
++stkptr; /* save big recursion for later */
}
if (lo < higuy)
{
hi = higuy;
goto recurse; /* do small recursion */
}
}
}
/* We have sorted the array, except for any pending sorts on the stack.
Check if there are any, and do them. */
--stkptr;
if (stkptr >= 0)
{
lo = lostk[stkptr];
hi = histk[stkptr];
goto recurse; /* pop subarray from stack */
}
else
return; /* all subarrays done */
}
static void shortsort(char *lo, char *hi, size_t width, int (*comp) (const void *, const void *))
{
char *p, *max;
/* Note: in assertions below, i and j are always inside original bound of
array to sort. */
while (hi > lo)
{
/* A[i] <= A[j] for i <= j, j > hi */
max = lo;
for (p = lo + width; p <= hi; p += width)
{
/* A[i] <= A[max] for lo <= i < p */
if (comp(p, max) > 0)
max = p;
/* A[i] <= A[max] for lo <= i <= p */
}
/* A[i] <= A[max] for lo <= i <= hi */
swap(max, hi, width);
/* A[i] <= A[hi] for i <= hi, so A[i] <= A[j] for i <= j, j >= hi */
hi -= width;
/* A[i] <= A[j] for i <= j, j > hi, loop top condition established */
}
/* A[i] <= A[j] for i <= j, j > lo, which implies A[i] <= A[j] for i < j,
so array is sorted */
}
static void swap(char *a, char *b, size_t width)
{
char tmp = 0;
if (a != b)
// Do the swap one character at a time to avoid potential alignment problems.
while (width--)
{
tmp = *a;
*a++ = *b;
*b++ = tmp;
}
}
John

Thank you John,
on my old machine it is even quicker:
Pelle's qsort: 7035 ms, TinyLib q_sort: 6267 ms (5486 ms for the same with unsigned long swap)
I will try to rewrite it for a special unsigned long version with inlined swaps.
Edit: I am a little bit confused about the following statement:
/**************************************************************
Note: the number of stack entries required is no more than
1 + log2(num), so 30 is sufficient for any array
***************************************************************/
If num = 2^32, then 1 + log2(num) = 33 > 30

Thank you John,
on my old machine it is even quicker:
Pelle's qsort: 7035 ms, TinyLib q_sort: 6267 ms (5486 ms for the same with unsigned long swap)
I will try to rewrite it for a special unsigned long version with inlined swaps.
Edit: I am a little bit confused about the following statement:
/**************************************************************
Note: the number of stack entries required is no more than
1 + log2(num), so 30 is sufficient for any array
***************************************************************/
If num = 2^32, then 1 + log2(num) = 33 > 30
The TinyLib qsort is based on the Microsoft one but with improvements.
As for the "1 + log2(num) = 33" point, I can't really say what the author meant there. Whatever the case, that would be a very large array indeed.
Change
#define STKSIZ (8*sizeof(void*) - 2)
to
#define STKSIZ (8*sizeof(void*))
John

For anyone interested in OpenMP stuff.
Attached is a small project that shows how to use OpenMP with the CRT qsort.
When OPENMPTEST is defined as 0 no OpenMP stuff is used, when 1, it is used.
Timings are: 0 -> 3285 ms, 1 -> 1969 ms.
John

For anyone interested in OpenMP stuff.
Attached is a small project that shows how to use OpenMP with the CRT qsort.
When OPENMPTEST is defined as 0 no OpenMP stuff is used, when 1, it is used.
Timings are: 0 -> 3285 ms, 1 -> 1969 ms.
John
I get this error: (90): error #2918: [omp] Unrecognized or unsupported form of 'for' construct.
I'm using RC5

For anyone interested in OpenMP stuff.
Attached is a small project that shows how to use OpenMP with the CRT qsort.
When OPENMPTEST is defined as 0 no OpenMP stuff is used, when 1, it is used.
Timings are: 0 -> 3285 ms, 1 -> 1969 ms.
John
I get this error: (90): error #2918: [omp] Unrecognized or unsupported form of 'for' construct.
That's strange. I can't say what's wrong.
EDIT: Are you using the latest version of PellesC?
John