I know there are similar questions to this, but compiling different files with different flags is not an acceptable solution here, since it would complicate the codebase very quickly. An answer of "No, it is not possible" will do.
Is it possible, in any version of Clang or GCC, to compile intrinsic functions for SSE2/SSE3/SSSE3/SSE4.1 while only allowing the compiler to use the SSE instruction set for its own optimizations?
EDIT: For example, I want the compiler to turn _mm_load_si128() into movdqa, but the compiler must not emit this instruction anywhere other than at this intrinsic call, similar to how the MSVC compiler works.
EDIT2: I have a dynamic dispatcher in place and several versions of each function, written with intrinsics for different instruction sets. Using multiple files would make this much harder to maintain, because the versions of one function would be spread across several files, and there are a lot of functions of this type.
EDIT3: Example source code, as requested: https://github.com/AviSynth/AviSynthPlus/blob/master/avs_core/filters/resample.cpp (or most files in that folder, really).
2 answers
#1
9
Here is an approach using GCC that might be acceptable. All source code goes into a single source file, which is divided into sections. One section generates code according to the command-line options used; functions like main() and processor feature detection go in this section. Another section generates code according to a target override pragma, and intrinsic functions supported by the target override value can be used there. Functions in that section should be called only after processor feature detection has confirmed that the needed processor features are present. This example has a single override section for AVX2 code. Multiple override sections can be used when writing functions optimized for multiple targets.
// temporarily switch target so that all x64 intrinsic functions will be available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
// <immintrin.h> is the header GCC ships on every platform; the original
// used <intrin.h>, which is only provided by MinGW-style toolchains
#include <immintrin.h>
// restore the target selection
#pragma GCC pop_options

//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
//----------------------------------------------------------------------------

int dummy1 (int a) {return a;}

//----------------------------------------------------------------------------
// the following functions will be compiled using core-avx2 code generation
// all x64 intrinsic functions are available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
//----------------------------------------------------------------------------

// shift the 256-bit value at *data left by count bits (0 < count < 64 assumed);
// the bits shifted out of the top are returned in the lowest qword
static __m256i bitShiftLeft256ymm (__m256i *data, int count)
{
    __m256i innerCarry, carryOut, rotate;

    innerCarry = _mm256_srli_epi64 (*data, 64 - count);                        // carry outs in the low bits of each qword
    rotate     = _mm256_permute4x64_epi64 (innerCarry, 0x93);                  // rotate ymm left 64 bits
    innerCarry = _mm256_blend_epi32 (_mm256_setzero_si256 (), rotate, 0xFC);   // clear lower qword
    *data      = _mm256_slli_epi64 (*data, count);                             // shift all qwords left
    *data      = _mm256_or_si256 (*data, innerCarry);                          // propagate carries from low qwords
    carryOut   = _mm256_xor_si256 (innerCarry, rotate);                        // clear all except lower qword
    return carryOut;
}

//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
#pragma GCC pop_options
//----------------------------------------------------------------------------

int main (void)
{
    return 0;
}

//----------------------------------------------------------------------------
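The "call only after feature detection" rule above can be sketched as a small dispatcher. This is a minimal sketch, not the answerer's code: the function names (sum_i32, sum_i32_avx2, sum_i32_scalar) are made up for illustration, and it uses the per-function target attribute rather than the pragma sections shown above. It assumes a reasonably recent GCC or Clang, where (as far as I know, GCC 4.9 and later) <immintrin.h> can be included without any -m flag and __builtin_cpu_supports() is available.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX2 version (hypothetical name). Only this function is compiled with
   AVX2 enabled; the rest of the file keeps the command-line target. */
static __attribute__((target("avx2")))
int32_t sum_i32_avx2(const int32_t *v, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)   /* 8 x int32 per iteration */
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i *)(v + i)));

    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);

    int32_t s = 0;
    for (int k = 0; k < 8; ++k)  /* horizontal sum of the 8 lanes */
        s += lanes[k];
    for (; i < n; ++i)           /* leftover elements */
        s += v[i];
    return s;
}
#endif

/* Fallback compiled with the default (command-line) target. */
static int32_t sum_i32_scalar(const int32_t *v, size_t n)
{
    int32_t s = 0;
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

/* Dispatcher: the AVX2 version is called only after the CPU check passed. */
int32_t sum_i32(const int32_t *v, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return sum_i32_avx2(v, n);
#endif
    return sum_i32_scalar(v, n);
}
```

Callers only ever see sum_i32(); both versions live in the same source file, which is the maintenance property the question asks for.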
#2
-1
There is no way to control the instruction set the compiler uses, other than the switches on the compiler itself. In other words, there are no pragmas or other features for this, just the overall compiler flags.
This means that the only viable way to achieve what you want is to use -msseX and split your source into multiple files. (Of course, you can always use various clever #include tricks to keep a single text file as the main source and just include that same file in multiple places.)
Of course, the source code of the compilers is available, and I'm sure the maintainers of GCC and Clang/LLVM will happily take patches that improve on this. But bear in mind that the path from "parsing the source" to "emitting instructions" is quite long and complicated. What should happen if we do this:
#pragma use_sse=1
void func()
{
... some code goes here ...
}
#pragma use_sse=3
void func2()
{
...
func();
...
}
Now, func is short enough to be inlined. Should the compiler inline it? If so, should it use SSE1 or SSE3 instructions for the inlined body of func()?
I understand that you may not care about that sort of difficulty, but the maintainers of Clang and GCC will indeed have to deal with it in some way.
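For comparison, here is roughly how the same situation looks with the per-function target attribute that GCC and Clang actually provide, instead of the hypothetical use_sse pragma. This is only a sketch: the TARGET_SSE3 macro and the function bodies are made up for illustration. As far as I understand, both compilers only inline a callee into a caller whose target options include the callee's, so here func (default target) may be inlined into func2 (SSE3) and is then compiled under func2's options, while the reverse direction would be refused.

```c
/* The attribute only makes sense on x86; elsewhere the macro expands to nothing. */
#if defined(__x86_64__) || defined(__i386__)
#define TARGET_SSE3 __attribute__((target("sse3")))
#else
#define TARGET_SSE3
#endif

/* Compiled with the default (command-line) instruction set. */
static inline int func(int x)
{
    return x * 2 + 1;
}

/* Compiled with SSE3 enabled. If func() is inlined here, its body may use
   SSE3 as well, because it becomes part of func2. */
TARGET_SSE3
int func2(int x)
{
    return func(x) + 10;
}
```

The result is the same whether or not the call is inlined; only the instructions the compiler is allowed to pick for the inlined copy change.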
Edit: In the header files declaring the SSE intrinsics (and many other intrinsics), a typical function looks something like this:
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
}
__builtin_ia32_addss is only available in the compiler when you have enabled the -msse option. So if you convince the compiler to still let you use _mm_add_ss() with -mno-sse, it gives you an error along the lines of "__builtin_ia32_addss is not declared in this scope" (I just tried).
It would probably not be very hard to change this particular behaviour: there are probably only a few places where the compiler code "introduces builtin functions". However, I'm not convinced that there aren't further issues later on, when it comes to actually emitting instructions in the compiler.
I have done some work with "builtin functions" in a Clang-based compiler, and unfortunately a builtin function is involved in several steps on the way from the "parser" to "code generation".
Edit2:
Compared to GCC, solving this for Clang is even more complex, because the compiler itself understands SSE vector types, so the header file simply contains this:
static __inline__ __m128 __attribute__((__always_inline__, __nodebug__))
_mm_add_ps(__m128 __a, __m128 __b)
{
return __a + __b;
}
The compiler then knows that to add a couple of __m128 values, it needs to produce the correct SSE instruction. I have just downloaded Clang to check. (I'm at home; my work on Clang is at the office and is not related to SSE at all, just builtin functions in general. I haven't made many changes to Clang as such, but it was enough to understand roughly how builtin functions work.)
However, from your perspective, the fact that it's not a builtin function makes it worse, because the operator+ translation is much more complicated. I'm pretty sure the compiler just turns it into an "add these two things" operation and then passes it to LLVM for further work; LLVM is the part that understands SSE instructions and so on. For your purposes this is worse, because the fact that this was an "intrinsic function" is now pretty much lost, and the compiler handles it just as if you'd written a + b, with the side effect that a and b are types that are 128 bits long. That makes it even more complicated to generate "the right instructions" while keeping "all other" instructions at a different SSE level.
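To make the "it is just a + b" point concrete, here is a small sketch using the GCC/Clang vector extension directly. The type name v4f and the function add4 are made up for illustration; v4f stands in for __m128, which in these headers is itself declared with a vector_size attribute. Nothing in the function names an SSE instruction; the back end decides how to lower the addition for whatever target is in effect.

```c
/* Four packed floats, 16 bytes wide (GCC/Clang vector extension). */
typedef float v4f __attribute__((vector_size(16)));

/* Element-wise add written with the ordinary + operator. On an SSE target
   the compiler is free to emit addps for this; on a target without SSE it
   has to fall back to other instructions. */
v4f add4(v4f a, v4f b)
{
    return a + b;
}
```

Because the source carries no trace of an intrinsic here, there is nothing for the compiler to treat as "this one operation may use a higher SSE level than the rest of the function".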