If you want more speed, lower c to something like 30; it is currently 255.
Yes, that speeds it up a lot. For an equal number of iterations, your version is only 33% slower than hand-crafted assembler - very good, actually.
By the way, I've done some testing of SetDIBitsToDevice vs BitBlt, and it turns out that the latter is over 5% faster.
EDIT: Bad news - #6 doesn't exit properly on Win7-32. It does process the WM_CLOSE message correctly, it passes the WM_DESTROY message, and HeapFree succeeds. With PostQuitMessage, it loses 6 handles in Task Manager. Afterwards, Olly reports that the thread ended, but Task Manager still sees the process, and it must be killed manually...
Attached a version that asks if it should be killed brutally, with ExitProcess. That works but I guess it's not "by design" ;-)
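For comparison, here is a minimal C sketch of the usual shutdown path mentioned above (WM_CLOSE, then WM_DESTROY, then PostQuitMessage, then the message loop ending). It is not the code from #6 and does not diagnose the handle loss; the class name, window title and window size are placeholders I made up. It only shows the sequence the standard Win32 calls are expected to follow, with ExitProcess as the explicit last step.

```c
#include <windows.h>

/* Window procedure: WM_CLOSE -> DestroyWindow -> WM_DESTROY -> PostQuitMessage */
static LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg) {
    case WM_CLOSE:
        DestroyWindow(hwnd);        /* causes WM_DESTROY to be sent */
        return 0;
    case WM_DESTROY:
        PostQuitMessage(0);         /* asks the message loop to stop */
        return 0;
    }
    return DefWindowProc(hwnd, msg, wParam, lParam);
}

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE hPrev, LPSTR cmdLine, int nShow)
{
    WNDCLASS wc;
    HWND hwnd;
    MSG msg;
    BOOL ret;
    UINT exitCode = 0;

    (void)hPrev; (void)cmdLine;

    ZeroMemory(&wc, sizeof(wc));
    wc.lpfnWndProc   = WndProc;
    wc.hInstance     = hInst;
    wc.hCursor       = LoadCursor(NULL, IDC_ARROW);
    wc.hbrBackground = (HBRUSH)(COLOR_WINDOW + 1);
    wc.lpszClassName = TEXT("ExitDemoClass");   /* placeholder name */
    if (!RegisterClass(&wc))
        return 1;

    hwnd = CreateWindow(TEXT("ExitDemoClass"), TEXT("Exit demo"),
                        WS_OVERLAPPEDWINDOW,
                        CW_USEDEFAULT, CW_USEDEFAULT, 400, 300,
                        NULL, NULL, hInst, NULL);
    if (!hwnd)
        return 1;
    ShowWindow(hwnd, nShow);

    /* Second argument is NULL: take messages for any window of this thread.
       GetMessage returns 0 on WM_QUIT and -1 on error, so test both. */
    while ((ret = GetMessage(&msg, NULL, 0, 0)) != 0) {
        if (ret == -1) {
            exitCode = 1;           /* error: leave the loop instead of spinning */
            break;
        }
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    if (ret == 0)
        exitCode = (UINT)msg.wParam;   /* value passed to PostQuitMessage */

    ExitProcess(exitCode);          /* ends the process and all its threads */
    return 0;                       /* not reached */
}
```

If a build following this pattern still lingers in Task Manager, the loop's exit condition and any extra threads would be the first things I'd check, but that is a guess, not a finding.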