Although there are many better ways to do this, from Vulkan to OpenGL to DirectX, I have always derived personal satisfaction from the simple act of software rendering. When it's just me, the instruction set, a pixel pointer and the framebuffer, there is something direct and amusing about it. Even the lowest-level APIs don't seem to capture the magic for me quite as well.
Now... Of course, with Windows, there is no way I can directly go and muck with the graphics device, so my traditional approach has been to create a window and a DIB section. I draw whatever I want into the section, call either SetDIBitsToDevice or BitBlt, and all is fine with the world.
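For reference, here is a minimal sketch of the kind of setup I mean, assuming a 32-bit top-down DIB. Window creation, the message loop and error handling are omitted, and the function and variable names are just mine:

```c
#include <windows.h>
#include <stdint.h>

static HBITMAP   g_dib;
static uint32_t *g_pixels;               // pointer handed back by CreateDIBSection
static int       g_w = 640, g_h = 480;

void CreateBackbuffer(HDC hdc)
{
    BITMAPINFO bmi = {0};
    bmi.bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
    bmi.bmiHeader.biWidth       = g_w;
    bmi.bmiHeader.biHeight      = -g_h;  // negative height = top-down rows
    bmi.bmiHeader.biPlanes      = 1;
    bmi.bmiHeader.biBitCount    = 32;    // one 0x00RRGGBB DWORD per pixel
    bmi.bmiHeader.biCompression = BI_RGB;

    g_dib = CreateDIBSection(hdc, &bmi, DIB_RGB_COLORS,
                             (void **)&g_pixels, NULL, 0);
}

void Present(HDC windowDC)
{
    // Select the DIB into a memory DC and blit it to the window.
    HDC memDC = CreateCompatibleDC(windowDC);
    HGDIOBJ old = SelectObject(memDC, g_dib);
    BitBlt(windowDC, 0, 0, g_w, g_h, memDC, 0, 0, SRCCOPY);
    SelectObject(memDC, old);
    DeleteDC(memDC);
}
```

Between CreateBackbuffer and Present, the rasterizer just scribbles into g_pixels directly.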
However, I was wondering if anyone here is more experienced with these things, and whether there would be an even better way of doing this than my current method.
Also, does anyone know what exactly happens when the DIB section is blitted on screen? I have pretty much deduced that memory latency/bandwidth is the decisive bottleneck, especially when trying anything even remotely fancy, such as taking mip-mapped, trilinear samples from multiple different textures (raw data buffers in RAM), which results in tons and tons of guaranteed cache and even TLB misses the way things are.
My recent thought has been to create the DIB sections with 16 bits per pixel, effectively halving the memory footprint per fragment, and likewise using 5:5:5, 16-bit encoded texels in memory as my texel sources.
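A sketch of what I have in mind is below. As far as I know, with biBitCount = 16 and BI_RGB, GDI interprets each WORD as X1R5G5B5, which matches the 5:5:5 texel layout; getting 5:6:5 instead would mean BI_BITFIELDS plus explicit channel masks. The helper names here are just mine:

```c
#include <windows.h>
#include <stdint.h>

static uint16_t *g_pixels16;             // 16 bpp backbuffer memory

HBITMAP CreateBackbuffer16(HDC hdc, int w, int h)
{
    BITMAPINFO bmi = {0};
    bmi.bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
    bmi.bmiHeader.biWidth       = w;
    bmi.bmiHeader.biHeight      = -h;    // top-down
    bmi.bmiHeader.biPlanes      = 1;
    bmi.bmiHeader.biBitCount    = 16;    // X1R5G5B5 under BI_RGB
    bmi.bmiHeader.biCompression = BI_RGB;

    return CreateDIBSection(hdc, &bmi, DIB_RGB_COLORS,
                            (void **)&g_pixels16, NULL, 0);
}

// Packing a 5:5:5 texel from 8-bit channels, for completeness.
static inline uint16_t Pack555(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}
```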
But is there even anything to gain from this? Sure, my sampling and rasterizing may be boosted some, but what will the blitting function make of all this? Will it just convert the DIB into 32-bit ARGB / 24-bit RGB behind my back and shuffle it over to VRAM? Will it shuffle it over as-is and convert it with the GPU in hardware?
I have mostly been pondering using multiple threads to pipeline the whole rendering task in two or three stages, with blitting, DIB section filling and whatever else happening pretty much in parallel (double buffering on the software side). Of course, this whole scheme amounts to nothing if the blitting routine ends up eating all the memory bandwidth, and all stages sit in wait states, fighting over cache lines and waiting for their memory transactions (which may happen regardless...).
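Something along these lines is what I am picturing, sketched with plain Win32 semaphores: a render thread fills one DIB section while the main thread blits the other, ping-ponging between the two. Everything here is hypothetical glue (the names, the stub bodies, the absence of a message pump), not a real implementation:

```c
#include <windows.h>

#define NUM_BUFFERS 2

static HANDLE g_semFree;                 // counts buffers the renderer may fill
static HANDLE g_semReady;                // counts buffers ready to be blitted
static void  *g_buffers[NUM_BUFFERS];    // pixel pointers from CreateDIBSection

static void RenderScene(void *pixels) { (void)pixels; /* rasterizer goes here */ }
static void PresentBuffer(int index)   { (void)index;  /* BitBlt of g_buffers[index] */ }

static DWORD WINAPI RenderThread(LPVOID unused)
{
    int write = 0;
    (void)unused;
    for (;;) {
        WaitForSingleObject(g_semFree, INFINITE);   // wait for a free backbuffer
        RenderScene(g_buffers[write]);              // rasterize into it
        ReleaseSemaphore(g_semReady, 1, NULL);      // hand it to the blit stage
        write = (write + 1) % NUM_BUFFERS;
    }
    return 0;
}

static void BlitLoop(void)                          // runs on the main thread
{
    int read = 0;
    for (;;) {
        WaitForSingleObject(g_semReady, INFINITE);  // wait for a finished frame
        PresentBuffer(read);                        // SetDIBitsToDevice / BitBlt
        ReleaseSemaphore(g_semFree, 1, NULL);       // recycle the buffer
        read = (read + 1) % NUM_BUFFERS;
    }
}

void StartPipeline(void)
{
    g_semFree  = CreateSemaphore(NULL, NUM_BUFFERS, NUM_BUFFERS, NULL);
    g_semReady = CreateSemaphore(NULL, 0, NUM_BUFFERS, NULL);
    CreateThread(NULL, 0, RenderThread, NULL, 0, NULL);
    BlitLoop();
}
```

Whether the blit stage actually overlaps usefully with rasterization, or just contends for the same bandwidth, is exactly the part I am unsure about.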
Again, I know this is not a serious pursuit these days, but it's just too much fun to give up.
And if anyone has any experience or ideas on how to go about this more efficiently, I would be grateful.