Some really well-optimized compilers (e.g. Intel C++) are known to be 1:1 with hand-written assembler, and sometimes even better.
Interesting ... Was that Pelles CRT or Microsoft's?
There is this concept called "Premature Optimization" that can be quite the trap if you aren't careful. This is the case when you spend 6 months developing and debugging something that shaves 10ms off a program's execution time. Unless the program itself is essentially time-critical, what you've really done is spend an enormous amount of time on what amounts to a microscopic improvement.
Using the ASM versions can be a real advantage if you have large arrays of long strings, say, a list of 100,000 1 MB buffers from some hyper-fast disk array. But what is the gain if you are comparing a user's name and password against a list of 50 people? It's over in a heartbeat either way; the user won't even notice the time difference, and how much work did you do for that?
This is not to trash ASM; in fact, I'm guessing that the best-performing CRTs are largely ASM. But it does raise the question of the time and effort put in versus the time and effort saved. If you're going to save your user 5ms while logging in, forget it, not worth doing. If you're going to do real-time logging from a 1 Gbps hard disk array, well, maybe it's worth it.
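If you want a number before deciding, something like the rough sketch below would do. It times a plain `strcmp()` lookup over a small user list with the standard `clock()` call. The names `find_user` and `seconds_per_lookup` are made up for illustration and don't come from any CRT. `clock()` is also fairly coarse, so you need a lot of repetitions to get a usable average.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: linear search with plain strcmp().
   Returns the index of `name` in `users`, or -1 if it is not there. */
int find_user(const char *name, const char *const users[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(name, users[i]) == 0)
            return (int)i;
    }
    return -1;
}

/* Hypothetical helper: repeats the lookup `reps` times and returns the
   average CPU time per lookup in seconds (0.0 if reps is not positive). */
double seconds_per_lookup(const char *name, const char *const users[],
                          size_t count, long reps)
{
    if (reps <= 0)
        return 0.0;

    volatile int sink = 0;   /* stops the compiler from dropping the loop */
    clock_t start = clock();
    for (long r = 0; r < reps; r++)
        sink += find_user(name, users, count);
    clock_t stop = clock();
    (void)sink;

    return ((double)(stop - start) / CLOCKS_PER_SEC) / (double)reps;
}
```

Call `seconds_per_lookup()` with your real list and a few million reps, then compare the result with how often the lookup actually runs. If the total is far below anything a user could notice, the hand-tuned ASM version probably isn't worth the months.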