Stack Alignment
I am (sort-of) proud to have had a hand in the creation of this code:
/*
 * Workaround GCC ABI-compliance issue with SSE on x86 by
 * forcibly realigning the stack to a 16-byte boundary.
 */
volatile register unsigned long sp __asm("esp");
if (__builtin_expect(sp & 15UL, 0))
    (void)alloca(16 - (sp & 15UL));
The idea is to let alloca do the actual work of modifying the stack pointer, so that the inline assembly is confined to fetching the pointer's initial value.
The worst part: apparently the code above had to be inserted into a deeply nested inner loop for the stack to still be aligned by the time the SSE instructions ran.
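The `sp & 15UL` test in the snippet works because 16 is a power of two: masking with 15 yields the number of bytes by which an address sits above the nearest lower 16-byte boundary. Since the x86 stack grows downward, bringing the stack pointer onto a boundary means rounding it down. The following is a minimal sketch of that arithmetic in portable C, operating on plain integers rather than the real stack pointer; the helper names `misalignment16` and `align_down` are hypothetical and do not come from the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of the original hack. */

/* Bytes by which addr sits above the nearest lower 16-byte boundary.
 * Zero means addr is already 16-byte aligned. */
static uintptr_t misalignment16(uintptr_t addr)
{
    return addr & (uintptr_t)15;
}

/* Round addr down to a multiple of align.
 * align must be a nonzero power of two for the mask trick to be valid. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}
```

For example, an address of 0x100c has a misalignment of 12, and rounding it down to a 16-byte boundary gives 0x1000. This only models the address arithmetic; how far alloca actually moves the stack pointer for a given request is up to the compiler.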