Some clang-generated assembly not working in real mode (.COM, tiny memory model)

First, this is kind of a follow-up to Custom memory allocator for real-mode DOS .COM (freestanding) — how to debug?. But to have it self-contained, here's the background:

clang (and gcc, too) has an -m16 switch, so that the long (32-bit) instructions of the i386 instruction set are prefixed for execution in "16bit" real mode. This can be exploited to create DOS .COM 32bit-realmode-executables using the GNU linker, as described in this blog post. (Of course, this is still limited to the tiny memory model, meaning everything lives in one 64KB segment.) Wanting to play with this, I created a minimal runtime that seems to work quite nicely.

Then I tried to build my recently-created curses-based game with this runtime, and well, it crashed. The first thing I encountered was a classical heisenbug: printing the offending wrong value made it correct. I found a workaround, only to face the next crash. So the first thing to blame I had in mind was my custom malloc() implementation, see the other question. But as nobody spotted something really wrong with it so far, I decided to give my heisenbug a second look. It manifests in the following code snippet (note this worked flawlessly when compiling for other platforms):

typedef struct
{
    Item it;    /* this is an enum value ... */
    Food *f;    /* and this is an opaque pointer */
} Slot;

typedef struct board
{
    Screen *screen;
    int w, h;
    Slot slots[1];    /* 1 element for C89 compatibility */
} Board;

[... *snip* ...]

    size = sizeof(Board) + (size_t)(w*h-1) * sizeof(Slot);
    self = malloc(size);
    memset(self, 0, size);

sizeof(Slot) is 8 (with clang and i386 architecture), sizeof(Board) is 20, and w and h are the dimensions of the game board; when running in DOS, these are 80 and 24 (because one line is reserved for the title/status bar). To debug what's going on here, I made my malloc() output its parameter, and it was called with the value 12 (sizeof(Board) + (-1) * sizeof(Slot)?)
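
For comparison, here is a small sketch of the value malloc() should have received. It assumes the sizes quoted above (8 and 20) and models the real-mode size_t as uint16_t; the names SLOT_SIZE, BOARD_SIZE and board_alloc_size are made up for illustration and do not appear in the project.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes quoted in the question for clang -m16 (assumed, not measured here). */
#define SLOT_SIZE  8u
#define BOARD_SIZE 20u

/* Hypothetical helper: the argument malloc() is expected to see, with the
 * real-mode size_t ("unsigned short") modelled as uint16_t. */
uint16_t board_alloc_size(int w, int h)
{
    assert(w > 0 && h > 0);
    uint32_t full = BOARD_SIZE + (uint32_t)(w * h - 1) * SLOT_SIZE;
    return (uint16_t)full;   /* keep the low 16 bits, like movzwl %ax, %ebp */
}
```

For the 80x24 board this gives 20 + 1919 * 8 = 15372, which still fits in 16 bits, so the observed 12 cannot be explained by truncation alone.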

Printing out w and h showed the correct values, still malloc() got 12. Printing out size showed the correctly calculated size and this time, malloc() got the correct value, too. So, classical heisenbug.

The workaround I found looks like this:

    size = sizeof(Board);
    for (int i = 0; i < w*h-1; ++i) size += sizeof(Slot);

Weirdly enough, this worked. Next logical step: compare the generated assembly. Here I have to admit I'm totally new to x86; my only assembly experience was with the good old 6502. So, in the following snippets, I'll add my assumptions and thoughts as comments. Please correct me where I'm wrong.

First the "broken" original version (w, h are in %esi, %edi):

    movl    %esi, %eax
    imull   %edi, %eax           # ok, calculate the product w*h
    leal    12(,%eax,8), %eax    # multiply by 8 (sizeof(Slot)) and add
                                 # 12 as an offset. Looks good because
                                 # 12 = sizeof(Board) - sizeof(Slot)...
    movzwl  %ax, %ebp            # just use 16bit because my size_t for
                                 # realmode is "unsigned short"
    movl    %ebp, (%esp)
    calll   malloc

Now, to me, this looks good, but my malloc() sees 12, as mentioned. The workaround with the loop compiles to the following assembly:

    movl    %edi, %ecx
    imull   %esi, %ecx             # ok, w*h again.
    leal    -1(%ecx), %edx         # edx = ecx-1? loop-end condition?
    movw    $20, %ax               # sizeof(Board)
    testl   %edx, %edx             # I guess that sets just some flags in
                                   # order to check whether (w*h-1) is <= 0?
    jle .LBB0_5
    leal    65548(,%ecx,8), %eax   # This seems to be the loop body
                                   # condensed to a single instruction.
                                   # 65548 = 65536 (0x10000) + 12. So
                                   # there is our offset of 12 again (for 
                                   # 16bit). The rest is the same ...
.LBB0_5:
    movzwl  %ax, %ebp              # use bottom 16 bits
    movl    %ebp, (%esp)
    calll   malloc

As described before, this second variant works as expected. My question after all this long text is as simple as ... WHY? Is there something special about realmode I'm missing here?
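
As a sanity check on the two listings, the sketch below models both lea constants followed by the 16-bit truncation. It only covers the arithmetic the source-level assembly asks for, not what the assembler actually emitted; the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Models: leal 12(,%eax,8), %eax ; movzwl %ax, %ebp */
uint16_t size_direct(uint32_t cells)
{
    return (uint16_t)(12u + 8u * cells);
}

/* Models: leal 65548(,%ecx,8), %eax ; movzwl %ax, %ebp */
uint16_t size_loop(uint32_t cells)
{
    return (uint16_t)(65548u + 8u * cells);
}

/* 65548 = 0x10000 + 12, so the extra bit vanishes after truncation. */
void check_equivalence(uint32_t max_cells)
{
    for (uint32_t c = 0; c <= max_cells; ++c)
        assert(size_direct(c) == size_loop(c));
}
```

Both variants describe the same 16-bit result, so if they behave differently at run time, the difference has to come from how the instructions were encoded rather than from the arithmetic itself.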

For reference: this commit contains both code versions. Just type make -f libdos.mk for a version with the workaround (crashing later). To compile the code leading to the bug, remove the -DDOSREAL from the CFLAGS in libdos.mk first.

Update: given the comments, I tried to debug this myself a bit deeper. Using dosbox' debugger is somewhat cumbersome, but I finally got it to break at the position of this bug. So, the following assembly code intended by clang:

    movl    %esi, %eax
    imull   %edi, %eax
    leal    12(,%eax,8), %eax
    movzwl  %ax, %ebp
    movl    %ebp, (%esp)
    calll   malloc

ends up as this (note intel syntax used by dosbox' disassembler):

0193:2839  6689F0              mov  eax,esi
0193:283C  660FAFC7            imul eax,edi
0193:2840  668D060C00          lea  eax,[000C]             ds:[000C]=0000F000
0193:2845  660FB7E8            movzx ebp,ax                                    
0193:2849  6766892C24          mov  [esp],ebp              ss:[FFB2]=00007B5C
0193:284E  66E8401D0000        call 4594 ($+1d40)

I think this lea instruction looks suspicious, and indeed, after it, the wrong value is in ax. So, I tried to feed the same assembly source to the GNU assembler, using .code16 with the following result (disassembly by objdump, I think it is not entirely correct because it might misinterpret the size prefix bytes):

00000000 <.text>:
   0:   66 89 f0                mov    %si,%ax
   3:   66 0f af c7             imul   %di,%ax
   7:   67 66 8d 04             lea    (%si),%ax
   b:   c5 0c 00                lds    (%eax,%eax,1),%ecx
   e:   00 00                   add    %al,(%eax)
  10:   66 0f b7 e8             movzww %ax,%bp
  14:   67 66 89 2c             mov    %bp,(%si)

The only difference is this lea instruction. Here it starts with 67, meaning "address is 32bit" in 16bit real mode. My guess is that this is actually needed because lea is meant to operate on addresses and is just "abused" by the optimizer to do data calculation here. Are my assumptions correct? If so, could this be a bug in clang's internal assembler for -m16? Maybe someone can explain where this 668D060C00 emitted by clang comes from and what it means? 66 means "data is 32bit" and 8D is probably the opcode itself, but what about the rest?

1 Answer

#1

Your objdump output is bogus. It looks like it's disassembling with the assumption of 32bit address and operand sizes, rather than 16. So it thinks lea ends sooner than it does, and disassembles some of the address bytes into lds / add. And then miraculously gets back into sync, and sees a movzww that zero extends from 16b to 16b... Pretty funny.

I'm inclined to trust your DOSBOX disassembly output. It perfectly explains your observed behaviour (malloc always called with an arg of 12). You are correct that the culprit is

lea   eax,[000C]   ; eax = 0x0C = 12.  Intel/MASM/NASM syntax
leal  12, %eax     # the same instruction in AT&T syntax

It looks like a bug in whatever assembled your DOSBOX binary (clang -m16 I think you said), since it assembled leal 12(,%eax,8), %eax into that.

leal  12(,%eax,8), %eax  # AT&T
lea   eax, [12 + eax*8]  ; Intel/MASM/NASM syntax

I could probably dig through some instruction encoding tables / docs and figure out exactly how that lea should have been assembled into machine code. It should be the same as the 32bit-mode encoding, but with 67 66 prefixes (address size and operand size, respectively). (And no, the order of those prefixes doesn't matter, 66 67 would work, too.)

Your DOSBOX and objdump outputs don't even have the same binary, so yes, they did come out differently. (objdump is misinterpreting the operand-size prefix in previous instructions, but that didn't affect the insn length until LEA.)

Your GNU as .code16 binary has 67 66 8D 04 C5, then the 32bit 0x0000000C displacement (little-endian). This is LEA with both prefixes. I assume that's the correct encoding of leal 12(,%eax,8), %eax for 16bit mode.

Your DOSBOX disassembly has just 66 8D 06, with a 16bit 0x0C absolute address. (Missing the 32bit address size prefix, and using a different addressing mode.) I'm not an x86 binary expert; I haven't had problems with disassemblers / instruction encoding before. (And I usually only look at 64bit asm.) So I'd have to look up the encodings for the different addressing modes.
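
As a rough aid for reading those bytes, here is a sketch that splits a ModRM or SIB byte into its three bit fields. The struct and function names are made up for this example; the 2+3+3 bit layout is the standard x86 one, but the interpretation in the note below should still be checked against the manual's addressing-mode tables.

```c
#include <assert.h>
#include <stdint.h>

/* ModRM and SIB bytes share the same 2 + 3 + 3 bit layout. */
typedef struct {
    unsigned top;  /* ModRM.mod or SIB.scale (bits 7..6) */
    unsigned mid;  /* ModRM.reg or SIB.index (bits 5..3) */
    unsigned low;  /* ModRM.rm  or SIB.base  (bits 2..0) */
} Fields;

Fields split_233(uint8_t byte)
{
    Fields f;
    f.top = (byte >> 6) & 3u;
    f.mid = (byte >> 3) & 7u;
    f.low = byte & 7u;
    return f;
}

/* The SIB scale field stores log2 of the multiplier (1, 2, 4 or 8). */
unsigned sib_scale_factor(Fields sib)
{
    assert(sib.top <= 3u);
    return 1u << sib.top;
}
```

Applied to the two encodings: without a 67 prefix, ModRM 06 is mod=00, reg=000 (eax), rm=110, which in 16-bit addressing means "16-bit displacement only, no registers", hence lea eax,[000C]. With the 67 prefix, ModRM 04 is mod=00, rm=100, which in 32-bit addressing means a SIB byte follows, and SIB C5 is scale=8, index=eax, base=101 (no base, 32-bit displacement with mod=00), i.e. 12(,%eax,8).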

My go-to source for x86 instructions is Intel's Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z. (linked from https://*.com/tags/x86/info, BTW.)

It says: (section 2.1.1)

The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes. Either size can be the default; use of the prefix selects the non-default size.

So that's easy: everything is pretty much the same as normal 32bit protected mode, except that 16bit operand-size is the default.

The LEA insn description has a table describing exactly what happens with various combinations of 16, 32, and 64bit address sizes (67H prefix) and operand sizes (66H prefix). In all cases, it truncates or zero-extends the result when there's a size mismatch, but it's an Intel insn ref manual, so it has to lay out every case separately. (This is helpful for more complex instruction behaviour.)
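
A simplified model of that table for the 16/32-bit cases might look like the sketch below. It is an illustration only: lea_result is a made-up name, the effective address is passed in as an already-computed sum, and with a 16-bit operand size the function returns just the 16 bits that get stored (the upper half of the destination register is not modelled).

```c
#include <assert.h>
#include <stdint.h>

/* ea:        untruncated base + index*scale + disp
 * addr_bits: address size in effect (16, or 32 when 67H flips the default)
 * op_bits:   operand size in effect (16, or 32 when 66H flips the default) */
uint32_t lea_result(uint32_t ea, int addr_bits, int op_bits)
{
    assert(addr_bits == 16 || addr_bits == 32);
    assert(op_bits == 16 || op_bits == 32);
    if (addr_bits == 16)
        ea &= 0xFFFFu;   /* 16-bit address, zero-extended if operand is 32 */
    if (op_bits == 16)
        ea &= 0xFFFFu;   /* 32-bit address truncated to a 16-bit operand */
    return ea;
}
```

In the question's case both sizes should be 32 (both prefixes present in 16-bit mode), so the full value 12 + 8 * w * h is expected in eax.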

And yes, "abusing" lea by using it on non-address data is a common and useful optimization. You can do a non-destructive add of 2 registers, placing the result in a 3rd. And at the same time add a constant, and scale one of the inputs by 2, 4, or 8. So it can do things that would take up to 4 other instructions. (mov / shl / add r,r / add r,i). Also, it doesn't affect flags, which is a bonus if you want to preserve flags for another jump or especially cmov.
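
To illustrate the arithmetic, here is a sketch comparing what a single 32-bit lea computes with the four-instruction sequence it can replace. The function names are invented for this example, and flags are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* What lea computes: base + index*scale + disp, wrapping modulo 2^32. */
uint32_t lea32(uint32_t base, uint32_t index, unsigned scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint32_t)disp;
}

/* The same result built from the four separate instructions. */
uint32_t mov_shl_add_add(uint32_t base, uint32_t index, unsigned shift,
                         int32_t disp)
{
    assert(shift <= 3);
    uint32_t r = index;    /* mov */
    r <<= shift;           /* shl   (scale = 1 << shift) */
    r += base;             /* add r,r */
    r += (uint32_t)disp;   /* add r,i */
    return r;
}
```

The instruction from the question, leal 12(,%eax,8), %eax, corresponds to lea32(0, eax, 8, 12) in this model.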
