I am trying to create an ldm (resp. stm) instruction with inline assembly but have problems expressing the operands (especially their order).
A trivial attempt such as
void *ptr;
unsigned int a;
unsigned int b;
__asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));
does not work because the compiler might put a into r1 and b into r0:
ldm ip!, {r1, r0}
ldm expects registers in ascending order (as they are encoded in a bitfield), so I need a way to say that the register used for a is lower than the one used for b.
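To illustrate the bitfield point, here is a minimal sketch in C of how the register list ends up in the instruction word. The helper name reglist_mask is made up for illustration; the only fact relied on is that ldm/stm store the list as one bit per register (bit n for rn).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build the 16-bit register list field of an ARM ldm/stm encoding:
   bit n is set when rn appears in the list (hypothetical helper). */
static uint16_t reglist_mask(const int *regs, size_t n)
{
    uint16_t mask = 0;
    for (size_t i = 0; i < n; i++) {
        assert(regs[i] >= 0 && regs[i] <= 15);   /* only r0..r15 exist */
        mask |= (uint16_t)(1u << regs[i]);
    }
    return mask;
}
```

Because {r1,r0} and {r0,r1} produce the same mask, the written order cannot be encoded at all: the lowest-numbered register always corresponds to the lowest address, whatever order the operands appear in.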
A trivial way is the fixed assignment of registers:
register unsigned int a asm("r0");
register unsigned int b asm("r1");
__asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));
But this removes a lot of flexibility and might make the generated code not optimal.
Does gcc (4.8) support special constraints for ldm/stm? Or are there better ways to solve this (e.g. some __builtin function)?
EDIT:
Because there are recommendations to use "higher level" constructs... The problem I want to solve is packing 20 bits out of each 32-bit word (e.g. input is 8 words, output is 5 words). Pseudocode is
asm("ldm %[in]!,{ %[a],%[b],%[c],%[d] }" ...)
asm("ldm %[in]!,{ %[e],%[f],%[g],%[h] }" ...) /* splitting the ldm generates better code;
                                                 gcc runs out of registers otherwise */
/* do some arithmetic on a - h */
asm volatile("stm %[out]!,{ %[a],%[b],%[c],%[d],%[e] }" ...)
Speed matters here, and ldm is 50% faster than ldr. The arithmetic is tricky, and because gcc generates much better code than I do ;) I would like to solve it with inline assembly that only gives some hints about optimized memory access.
1 solution
#1
I have recommended the same solution in ARM memtest, i.e., explicitly assign the registers. The analysis on gcc-help is wrong. There is no need to rewrite GCC's register allocation. The only thing that is needed is a way to specify the ordering of registers in an assembler specification.
That said, consider the following,
int main(void)
{
    void *ptr;
    register unsigned int a __asm__("r1");
    register unsigned int b __asm__("r0");
    __asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));
    return 0;
}
This will not build with my gcc because the output contains an illegal ARM instruction, ldm r3!,{r1,r0}. A solution is to use the -S flag to stop after generating assembly and then run a script that puts the ldm/stm operands in order before assembling. Perl can easily do this with,
$reglist = join(',', sort(split(',', $reglist)));
Or any other way. Unfortunately, there doesn't appear to be any way to do this using assembler constraints. If we had access to the assigned register number, inline alternatives or conditional compilation could be used.
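The same post-processing step could also be sketched in C. This is only an illustration under stated assumptions: the helper name sort_reglist is hypothetical, it understands plain rN names only (aliases such as ip or lr are rejected), and it sorts numerically so that r10 comes after r9.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Compare two register numbers for qsort(). */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: rewrite a register list such as "r1, r0" in
   ascending numeric order ("r0,r1"). Returns 0 on success, -1 on an
   unknown register name or when the output buffer is too small. */
static int sort_reglist(const char *in, char *out, size_t outsz)
{
    int regs[16];
    size_t n = 0, used = 0;
    const char *p = in;

    assert(in != NULL && out != NULL && outsz > 0);
    while (*p) {
        int num, len;
        while (*p == ' ' || *p == ',')
            p++;                              /* skip separators */
        if (!*p)
            break;
        if (sscanf(p, "r%d%n", &num, &len) != 1 ||
            num < 0 || num > 15 || n == 16)
            return -1;                        /* not a plain r0..r15 name */
        regs[n++] = num;
        p += len;
    }
    qsort(regs, n, sizeof regs[0], cmp_int);

    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(out + used, outsz - used, "%sr%d",
                         i ? "," : "", regs[i]);
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;                        /* output buffer too small */
        used += (size_t)w;
    }
    return 0;
}
```

A real script would still have to find the ldm/stm lines in the -S output and splice the sorted list back in; that part is left out here.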
Probably the easiest solution is to use explicit register assignment, unless you are writing a vector library that needs to load/store multiple values and you want to give the compiler some freedom to generate better code. In that case, it is probably better to use structures, as the higher-level gcc optimizations will be able to detect unneeded operations (such as multiplying by one or adding zero).
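As a rough sketch of the structure approach (the type vec4 and the function vec4_add are invented names, not from any library), the values travel as one aggregate and the compiler is free to choose the load/store instructions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-word vector type; name and layout are assumptions. */
typedef struct { uint32_t v[4]; } vec4;

/* Element-wise addition. The whole struct is read and written in plain C,
   so the compiler decides how to load, store and schedule the words. */
static vec4 vec4_add(const vec4 *a, const vec4 *b)
{
    vec4 r;
    assert(a != NULL && b != NULL);
    for (int i = 0; i < 4; i++)
        r.v[i] = a->v[i] + b->v[i];
    return r;
}
```

Whether gcc actually emits ldm/stm for this depends on the compiler version, the target and the optimization level, so it is worth checking the -S output rather than assuming it.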
Edit:
Because there are recommendations to use "higher level" constructs... The problem I want to solve is packing 20 bits out of each 32-bit word (e.g. input is 8 words, output is 5 words).
This will probably give better results,
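u32 *ip, *op;                /* input and output pointers, set up by the caller */
u32 in, out, mask;
int shift = 0;
const u32 *op_end = op + 5;  /* 8 input words pack into 5 output words */
while (op != op_end) {
    in = *ip++;
    /* mask and accumulate... */
    if (shift >= 32) {
        *op++ = out;
        shift -= 32;
    }
}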
The reasoning is that the ARM pipeline generally has several stages, with a separate load/store unit. The ALU (arithmetic) may proceed in parallel with loads and stores, so you can be working on the first word while later words are loading. In this case, you may also replace the values in place, which will give a cache benefit, unless you need to reuse the 20-bit values. Once the code is in the cache, ldm/stm has little benefit if you stall on data. That will be your case.
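For reference, one way the "mask and accumulate" step in the loop above might be filled in is sketched below. The bit order (least significant bits first), the 64-bit accumulator and the name pack20 are all assumptions made for the sake of a complete example; the question does not specify the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack the low 20 bits of each of 8 input words
   into 5 output words, least significant bits first (assumed layout). */
static void pack20(const uint32_t *ip, uint32_t *op)
{
    const uint32_t mask = 0xFFFFFu;    /* keep the low 20 bits */
    const uint32_t *ip_end = ip + 8;
    uint64_t acc = 0;                  /* bits waiting to be stored */
    int shift = 0;                     /* number of valid bits in acc */

    assert(ip != NULL && op != NULL);
    while (ip != ip_end) {
        acc |= (uint64_t)(*ip++ & mask) << shift;
        shift += 20;
        if (shift >= 32) {             /* a full output word is ready */
            *op++ = (uint32_t)acc;
            acc >>= 32;
            shift -= 32;
        }
    }
}
```

The 64-bit accumulator keeps the sketch short; on a 32-bit ARM core a hand-unrolled version with two 32-bit temporaries may schedule better, which is the kind of decision that can be left to the compiler or measured.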
2nd Edit: The main job of a compiler is to avoid loading values from memory; i.e., register assignment is crucial. Generally, ldm/stm are most useful in memory transfer functions, e.g. a memory test, a memcpy() implementation, etc. If you are doing computation with the data, then the compiler may have better knowledge about pipeline scheduling. You probably need to either accept plain C code or move to complete assembler. Remember, ldm has its first operands available for use immediately; using the ALU with subsequent registers can cause a stall while the data loads. Similarly, stm needs the calculations for its first registers to be complete when it executes, but this is less critical.