Packing / unpacking FP32s as BFloat16s with AVX2 (truncating)

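FP32 can be converted to BFloat16 simply by dropping the lower 16 bits of its bit pattern: the sign bit, the 8 exponent bits, and the top 7 mantissa bits are kept as they are. The conversion therefore takes only a few bit operations.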
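As a minimal scalar sketch of that truncation (the function names here are made up for illustration and are not taken from the original code), the conversion in both directions is just a 16-bit shift on the raw bit pattern:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: truncating FP32 -> BFloat16.
   Keeps the upper 16 bits (sign, exponent, top 7 mantissa bits). */
static uint16_t bf16_from_fp32_trunc(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret without aliasing issues */
    return (uint16_t)(bits >> 16);
}

/* Hypothetical helper: BFloat16 -> FP32.
   The dropped mantissa bits are filled with zeros. */
static float fp32_from_bf16(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Because the low mantissa bits are simply discarded, the result is rounded toward zero rather than to nearest: 3.14159f, for example, comes back as 3.140625f after a round trip.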

Internally, two sets of FP32s (i.e., the elements of two __m256s) are stored interleaved as BFloat16s, so that both sets fit in the space of one.
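One way such an interleaved layout could be implemented is sketched below. This is not the original code: it assumes that each 32-bit lane holds the BFloat16 of the first vector in its low half and the BFloat16 of the second vector in its high half. The `_x8` function names and the array-based interface are hypothetical, and a scalar fallback is included so the sketch still compiles when AVX2 is not enabled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Assumed layout per 32-bit lane:
   bits 0..15  = BFloat16 of a[i], bits 16..31 = BFloat16 of b[i]. */
static void bf16_interleaved_from_fp32_x8(const float *a, const float *b,
                                          uint32_t *out) {
#if defined(__AVX2__)
    __m256i va = _mm256_castps_si256(_mm256_loadu_ps(a));
    __m256i vb = _mm256_castps_si256(_mm256_loadu_ps(b));
    /* Move the upper half of each a[i] into the lower half of its lane. */
    __m256i lo = _mm256_srli_epi32(va, 16);
    /* 0xAA selects the odd 16-bit elements (upper halves) from vb. */
    __m256i packed = _mm256_blend_epi16(lo, vb, 0xAA);
    _mm256_storeu_si256((__m256i *)out, packed);
#else
    for (int i = 0; i < 8; ++i) {
        uint32_t ua, ub;
        memcpy(&ua, &a[i], sizeof ua);
        memcpy(&ub, &b[i], sizeof ub);
        out[i] = (ua >> 16) | (ub & 0xFFFF0000u);
    }
#endif
}

static void fp32_from_bf16_interleaved_x8(const uint32_t *in,
                                          float *a, float *b) {
#if defined(__AVX2__)
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    /* a: shift the low halves up; b: mask off the low halves. */
    __m256i va = _mm256_slli_epi32(v, 16);
    __m256i vb = _mm256_and_si256(v, _mm256_set1_epi32((int)0xFFFF0000u));
    _mm256_storeu_ps(a, _mm256_castsi256_ps(va));
    _mm256_storeu_ps(b, _mm256_castsi256_ps(vb));
#else
    for (int i = 0; i < 8; ++i) {
        uint32_t ua = in[i] << 16;
        uint32_t ub = in[i] & 0xFFFF0000u;
        memcpy(&a[i], &ua, sizeof ua);
        memcpy(&b[i], &ub, sizeof ub);
    }
#endif
}
```

With this layout, each unpacked vector needs only a single shift or mask, and packing needs a shift followed by a blend. Interleaving also avoids any cross-lane shuffles, which is the main reason to prefer it over storing the two sets back to back.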

If your code is memory bound, packing FP32s this way halves the amount of data to move and may improve performance.

Besides, I expect this code to run faster than _mm256_cvtph_ps (the FP16-to-FP32 conversion intrinsic), which has a latency of 6 or 7 cycles depending on the platform. In contrast, fp32_from_bf16_interleaved should have a latency of 1 cycle.

Converting FP32 to BFloat16 is usually done as a preprocessing step, so its latency shouldn’t be critical. But in case you care, it’s 2 cycles for bf16_interleaved_from_fp32.

Code:

Execution result:
