I don't see the trickiness; IMO this is a compiler mistake when optimizing, because the result is definitely not correct. In particular, the value of an uchar expression comes out as 255 (or any other value where the highest bit is set) if it's assigned to a bigger type, but as 127 if it's assigned to an uchar (and only in -O2!). The highest bit ends up unset because it was the 9th bit of the intermediate sum and got discarded before the shift.
An expression's value shouldn't change depending on the type it's assigned to (and only in -O2). Both uchar and bigger types can hold the 255 that this expression produces, and its type (via the cast) is already uchar anyway.
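To make the expected behaviour concrete, here is a minimal sketch of what integer promotion should give for this expression. The helper names `avg_to_uchar` and `avg_to_int` are made up for illustration, and it assumes int is wider than 8 bits (true on any conforming implementation).

```c
/* Hypothetical helpers for illustration; not taken from stb_image. */

/* a and b are promoted to int before the add, so the intermediate sum
   keeps its 9th bit. The shift then brings it back into the low 8 bits
   and the cast to unsigned char loses nothing. */
static unsigned char avg_to_uchar(unsigned char a, unsigned char b)
{
    unsigned char r = (unsigned char)((a + b) >> 1);
    return r;
}

/* Same expression, assigned to a bigger type. The value must match. */
static int avg_to_int(unsigned char a, unsigned char b)
{
    int r = (unsigned char)((a + b) >> 1);
    return r;
}
```

With a = b = 255 the sum is 510, the shift gives 255, and both helpers should return 255 at any optimization level.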
It also doesn't happen with MSVC, GCC and Clang, and stb_image is sort of popular, so if it were routinely corrupting some pngs (the ones that happen to use the average filter, whose decoder has this '(byte + otherbyte) >> 1' expression) it would have been noticed and fixed by now.
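For context, a rough sketch of where that expression sits when undoing the PNG average filter for one scanline. This is written from the PNG filter definition, not copied from stb_image; the function name `unfilter_average` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: reconstruct one scanline that was encoded with the
   PNG "Average" filter. raw holds the filtered bytes, prior holds the
   previous reconstructed scanline, bpp is bytes per pixel. */
static void unfilter_average(unsigned char *cur, const unsigned char *raw,
                             const unsigned char *prior, size_t len, size_t bpp)
{
    for (size_t i = 0; i < len; ++i) {
        /* Bytes to the left of the first pixel count as 0. */
        unsigned char left = (i >= bpp) ? cur[i - bpp] : 0;
        /* left + prior[i] is computed in int, so the 9th bit survives
           until the shift. Only the final sum wraps modulo 256. */
        cur[i] = (unsigned char)(raw[i] + ((left + prior[i]) >> 1));
    }
}
```

If the carry of `left + prior[i]` were dropped before the shift, every byte whose two neighbours sum to 256 or more would decode wrong, which is the kind of corruption that would be hard to miss.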
Edit: the bug here happens whether or not the values are known at compile time and precomputed, e.g. it also happens when doing 'unsigned char b = (atoi(argv[1]) & 255);' and passing 255 as the arg. I'm adding this because I looked into it more and noticed that the -O2 asm output of the original example has a precomputed 255 and 127 in it, but it happens with values only known at runtime too (I first ran into it while loading certain png files).
Edit 2: a similar bug is triggered with two unsigned shorts that do '(a + b) >> 1' assigned to an unsigned short. If you look at the optimized ASM output, it's as if, when the destination variable of '(a + b) >> 1' has the same type as a and b, the intermediate result of the add is kept in a register of that width too. For high values of a and b (sum above 255 for uchar, above 65535 for ushort) that discards the top bit, which is a 1 and which would be kept if integer promotion to int was done. Shifting right once then shifts a zero into the top spot instead of the 1 that should be there, causing this bug.
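The difference can be reproduced by hand on a correct compiler. This is only a sketch: `avg_promoted` and `avg_truncated` are hypothetical names, and it assumes unsigned short is 16 bits and int is 32 bits. The second function deliberately emulates what the miscompiled code appears to do.

```c
/* Hypothetical helpers for illustration only. */

/* What the language rules require: both operands are promoted to int,
   so the carry out of bit 15 is still there when the shift happens. */
static unsigned short avg_promoted(unsigned short a, unsigned short b)
{
    return (unsigned short)((a + b) >> 1);
}

/* Emulation of the observed wrong result: the sum is truncated to
   16 bits first, so the shift brings in a 0 instead of the carry. */
static unsigned short avg_truncated(unsigned short a, unsigned short b)
{
    unsigned short sum = (unsigned short)(a + b); /* carry bit dropped here */
    return (unsigned short)(sum >> 1);
}
```

For a = b = 65535 the first version gives 65535 and the second gives 32767, the 16-bit counterpart of the 255 vs 127 mismatch.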