AVX instructions generated when -xSSE4.1 is specified

I compiled a piece of code with the Intel compiler using the option -xSSE4.1. When I looked at the generated assembly file, I saw that AVX instructions such as 'vpmovzxbw' had been emitted. Yet the executable still seems to run on machines that don't support the AVX instruction set. What explains this?

Here's the particular code snippet -

C -> src0_8x16b  = _mm_cvtepu8_epi16 (src0_8x16b);

Assembly -> vpmovzxbw xmm4, QWORD PTR [rcx]

Binary -> 00066 c4 62 79 30 29

Here's another snippet where the assembly instruction uses 3 operands -

C -> src0_8x16b = _mm_sub_epi16 (src0_8x16b, src1_8x16b);

Assembly -> vpsubw xmm1, xmm13, xmm11              

Binary -> 000bc c4 c1 11 f9 cb

For comparison, here's the disassembly generated by icc for the function 'foo' (The only difference between the function foo and the code snippet above is that the code snippet was coded using intrinsics) -

Compiler commands used - 
icc -S -xSSE4.1 -axavx -O3 foo.c

Function foo -
void foo(float *x, int n) 
{
    int i;

    for(i=0; i<n; i++) x[i] *= 2.0;
}

Autodispatch code - 
testl     $-131072, __intel_cpu_indicator(%rip)         #1.27
jne       foo.R                                         #1.27
testl     $-1, __intel_cpu_indicator(%rip)              #1.27
jne       foo.A

Loop in foo.R (AVX variant) - 
vmulps    (%rdi,%rcx,4), %ymm0, %ymm1                   #3.24
vmulps    32(%rdi,%rcx,4), %ymm0, %ymm2                 #3.24
vmovups   %ymm1, (%rdi,%rcx,4)                          #3.24
vmovups   %ymm2, 32(%rdi,%rcx,4)                        #3.24
addq      $16, %rcx                                     #3.5
cmpq      %rdx, %rcx                                    #3.5
jb        ..B2.12       # Prob 82%                      #3.5

Loop in foo.A (SSE variant) - 
movaps    (%rdi,%r8,4), %xmm1                           #3.24
movaps    16(%rdi,%r8,4), %xmm2                         #3.24
mulps     %xmm0, %xmm1                                  #3.24
mulps     %xmm0, %xmm2                                  #3.24
movaps    %xmm1, (%rdi,%r8,4)                           #3.24
movaps    %xmm2, 16(%rdi,%r8,4)                         #3.24
addq      $8, %r8                                       #3.5
cmpq      %rsi, %r8                                     #3.5
jb        ..B3.12       # Prob 82%                      #3.5

2 Answers

#1

The Intel compiler can generate a single executable with multiple levels of vectorization using the -ax flag. For example, to generate code that is compatible with AVX, SSE4.1 and SSE2, use -axAVX,SSE4.1 -xSSE2.

Since you compiled with -axAVX -xSSE4.1, the compiler generated an AVX branch and an SSE4.1 branch; at run time it determines which instruction set is available and chooses the matching branch.

Agner Fog has a good description of Intel's CPU dispatcher in his Optimizing C++ manual. See section "13.7 CPU dispatching in Intel compiler". Intel's CPU dispatcher is not ideal for several reasons, one of which is that it behaves poorly on AMD processors, which Agner describes in detail. Personally, I would write my own dispatcher.
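
As a rough idea of what a hand-written dispatcher could look like, here is a minimal sketch. It assumes GCC or Clang on x86-64 and relies on their __builtin_cpu_supports builtin and target function attribute; it is not how the Intel compiler implements its own dispatch. The helper names (scale_baseline, scale_avx, pick_scale) are made up for illustration.

```c
#include <stddef.h>

/* Baseline variant: compiled for the default target, so it runs on any x86-64 CPU. */
static void scale_baseline(float *x, int n) {
    for (int i = 0; i < n; i++) x[i] *= 2.0f;
}

/* AVX variant: the target attribute (a GCC/Clang extension) allows the
   compiler to use AVX instructions inside this one function only. */
__attribute__((target("avx")))
static void scale_avx(float *x, int n) {
    for (int i = 0; i < n; i++) x[i] *= 2.0f;
}

typedef void (*scale_fn)(float *, int);

/* Query the CPU once and return the best variant it supports. */
static scale_fn pick_scale(void) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? scale_avx : scale_baseline;
}

/* Public entry point: resolves the variant on first use, then calls it.
   (The lazy initialisation is not synchronised; in threaded code it would
   need to be resolved up front or guarded.) */
void foo(float *x, int n) {
    static scale_fn impl = NULL;
    if (impl == NULL) impl = pick_scale();
    impl(x, n);
}
```

Because the check tests the feature bit itself and not the CPU vendor, the same path is taken on any processor that reports AVX support.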

I compiled the following code with ICC 13.0 with options -O3 -axavx -xsse2

void foo(float *x, int n) {
    for(int i=0; i<n; i++) x[i] *= 2.0;
}

and the start of the assembly is

    test      DWORD PTR __intel_cpu_indicator[rip], -131072 #1.27
    jne       _Z3fooPfi.R                                   #1.27
    test      DWORD PTR __intel_cpu_indicator[rip], -1      #1.27
    jne       _Z3fooPfi.A

Following the _Z3fooPfi.R branch, we find the main AVX loop

..B2.12:                        # Preds ..B2.12 ..B2.11
vmulps    ymm1, ymm0, YMMWORD PTR [rdi+rcx*4]           #2.25
vmulps    ymm2, ymm0, YMMWORD PTR [32+rdi+rcx*4]        #2.25
vmovups   YMMWORD PTR [rdi+rcx*4], ymm1                 #2.25
vmovups   YMMWORD PTR [32+rdi+rcx*4], ymm2              #2.25
add       rcx, 16                                       #2.2
cmp       rcx, rdx                                      #2.2
jb        ..B2.12       # Prob 82%                      #2.2

and following the _Z3fooPfi.A branch, we find the main SSE loop

movaps    xmm1, XMMWORD PTR [rdi+r8*4]                  #2.25
movaps    xmm2, XMMWORD PTR [16+rdi+r8*4]               #2.25
mulps     xmm1, xmm0                                    #2.25
mulps     xmm2, xmm0                                    #2.25
movaps    XMMWORD PTR [rdi+r8*4], xmm1                  #2.25
movaps    XMMWORD PTR [16+rdi+r8*4], xmm2               #2.25
add       r8, 8                                         #2.2
cmp       r8, rsi                                       #2.2
jb        ..B3.12       # Prob 82%                      #2.2
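
The two test instructions at the start of the assembly can be read as bit-mask checks on __intel_cpu_indicator. The sketch below only mirrors what the listing shows: -131072 is 0xFFFE0000 as a 32-bit mask, and -1 is all bits set. The enum and function names are hypothetical, and what the real code does when neither test succeeds is not shown in the excerpt.

```c
#include <stdint.h>

enum path { PATH_AVX, PATH_BASELINE, PATH_FALLTHROUGH };

/* Mirrors the two test/jne pairs: the first checks bits 17..31,
   the second checks whether any bit at all is set. */
enum path choose_path(uint32_t cpu_indicator) {
    if (cpu_indicator & (uint32_t)-131072)  /* mask 0xFFFE0000 -> .R variant */
        return PATH_AVX;
    if (cpu_indicator & (uint32_t)-1)       /* mask 0xFFFFFFFF -> .A variant */
        return PATH_BASELINE;
    return PATH_FALLTHROUGH;                /* neither jne taken */
}
```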

#2

I tried to replicate the results with two other compilers, namely gcc and Microsoft Visual Studio's v100 toolset, and could not: both appear to generate the expected assembly. As a further step, I looked closely at the differences between the compiler arguments I had specified in each case. It turns out that, when using icc, I had enabled the option to inherit the project defaults for compiling this particular file. The project settings were configured such that this option was included -

-xavx

As a result, when this file was compiled, the settings I had provided -

-xSSE4.1 -axavx

were overridden by the former. This was the cause of the behavior I have detailed in my question.
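
To illustrate why the effective target matters, here is a small sketch built around the same two intrinsics as in the question. The intrinsics themselves do not fix the encoding: a compiler targeting SSE4.1 emits the legacy forms (pmovzxbw, psubw), while one targeting AVX emits the VEX-encoded forms (vpmovzxbw, vpsubw), which begin with a prefix byte such as the C4 seen in the listings above. The function name widen_and_sub is made up, and the target attribute assumes GCC or Clang.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Widen eight unsigned bytes from each input to 16 bits, then subtract.
   The target attribute (GCC/Clang extension) pins this function to SSE4.1,
   so no VEX-encoded instructions are required to run it. */
__attribute__((target("sse4.1")))
void widen_and_sub(const uint8_t *a, const uint8_t *b, int16_t *out) {
    __m128i va = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)a));
    __m128i vb = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)b));
    _mm_storeu_si128((__m128i *)out, _mm_sub_epi16(va, vb));
}
```

Inspecting the assembly of the same source built once for SSE4.1 and once for AVX is a quick way to confirm which flags actually took effect.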

I am sorry for this error, but I shall not delete this question since @Zboson 's answer is exceptional.

PS - I had mentioned in one of my comments that I was able to run this code on an SSE4.2 machine. That was because the exe I ran on that machine was in fact SSE4.1 compliant: I had apparently used an exe generated by gcc. When I ran the icc-generated exe on the SSE4.2 machine, it did crash with an illegal instruction error.
