Use static inline functions in a header for CPU feature detection.
The C files are compiled/linked with SIMD support, so detection code
placed in them might already use instructions from that feature set.
This way, certain compiler flags (like -msse3) are used only for files
containing optimized code. This avoids problems that occurred when
generic code was compiled with these flags and then run on platforms
that don't support these optimizations (e.g. NEON optimizations on
ARM platforms).
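
A minimal sketch of the approach described above, not the code from
this change. The header name, the include guard and cpu_has_sse3() are
hypothetical. It assumes GCC or Clang on x86, where the
__builtin_cpu_supports() builtin is available; on every other target
the sketch simply reports "not supported".

```c
/* cpu_features.h: hypothetical header, all names are illustrative. */
#ifndef CPU_FEATURES_H
#define CPU_FEATURES_H

/*
 * Run-time check for SSE3. Because this is a static inline function in
 * a header, it is compiled as part of the generic caller, which is
 * built WITHOUT -msse3, so the check itself cannot contain SSE3
 * instructions.
 */
static inline int cpu_has_sse3(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* GCC/Clang builtin that queries CPUID at run time. */
    return __builtin_cpu_supports("sse3") ? 1 : 0;
#else
    /* Assumption: on other targets fall back to the generic code path. */
    return 0;
#endif
}

#endif /* CPU_FEATURES_H */
```

Generic code would call cpu_has_sse3() first and only then jump into
functions from the file built with -msse3. A NEON check on ARM would
follow the same pattern with a platform-specific query.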