你见过的最核心的优化是什么?

时间:2022-09-20 19:40:44

I'm not talking about algorithmic stuff (eg use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.

我不是在谈论算法的东西(例如使用quicksort而不是bubblesort),而我不是在讨论像循环展开之类的简单事情。

I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.

我在谈论硬核的东西。像Tiny Teensy ELF,Mel的故事;实际上是demoscene中的所有东西,等等。

12 个解决方案

#1


28  

I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.

我曾经写过一个暴力RC5密钥搜索,一次处理两个密钥,第一个密钥使用整数流水线,第二个密钥使用SSE流水线,两个密钥在指令级交错。然后将其与管理程序结合使用,该程序在系统中的每个核心上运行代码实例。总的来说,代码运行速度比原始C版本快25倍。

#2


15  

In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.

在我使用过的一个(这里未命名的)视频游戏引擎中,他们重写了模型导出工具(将Maya网格转换为游戏加载的东西),这样它实际上不会发出数据,而是实际发出精确的流渲染特定模型所必需的微指令。它使用遗传算法找到将在最小循环次数内运行的遗传算法。也就是说,给定模型的数据格式实际上是一个完美优化的子程序,用于渲染该模型。因此,将网格绘制到屏幕意味着将其加载到内存中并分支到其中。

(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)

(这不是用于PC,而是用于具有与CPU分离且并行的向量单元的控制台。)

#3


10  

In the early days of DOS when we used floppy discs for all data transport there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the bootsector of an inserted floppydisc. When the user inserted the floppydisc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the harddrive bootsector, thus permanently infecting the host PC. A particulary annoying virus I was infected by was called "Form", to battle this I wrote a custom floppy bootsector that had the following features:

在DOS的早期,当我们使用软盘进行所有数据传输时,也存在病毒。病毒感染不同计算机的一种常见方法是将病毒引导程序复制到插入的floppydisc的引导程序中。当用户将floppydisc插入另一台计算机并重新启动而不记得移除软盘时,病毒就会运行并感染硬盘启动器,从而永久感染主机PC。我被感染的一种特殊的令人讨厌的病毒被称为“Form”,为了对抗这一点,我编写了一个具有以下功能的自定义软盘bootsector:

  • Validate the bootsector of the host harddrive and make sure it was not infected.
  • 验证主机硬盘的bootsector并确保它没有被感染。

  • Validate the floppy bootsector and make sure that it was not infected.
  • 验证软盘引导扇区并确保它没有被感染。

  • Code to remove the virus from the harddrive if it was infected.
  • 如果病毒被感染,请从硬盘中删除病毒的代码。

  • Code to duplicate the antivirus bootsector to another floppy if a special key was pressed.
  • 如果按下特殊键,则将防病毒引导程序复制到另一张软盘的代码。

  • Code to boot the harddrive if all was well, and no infections was found.
  • 如果一切正常,则启动硬盘的代码,并且没有发现感染。

This was done in the program space of a bootsector, about 440 bytes :)

这是在bootsector的程序空间中完成的,大约440字节:)

The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"

我的伙伴最大的问题是显示非常神秘的消息,因为我需要所有代码空间。它就像“FFVD RM?”,意思是“检测到FindForm病毒,删除?”

I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.

我很满意这段代码。优化是程序大小,而不是速度。装配中有两种完全不同的优化。

#4


9  

My favorite is the floating point inverse square root via integer operations. This is a cool little hack on how floating point values are stored and can execute faster (even doing a 1/result is faster than the stock-standard square root function) or produce more accurate results than the standard methods.

我最喜欢的是通过整数运算的浮点反平方根。对于如何存储浮点值并且可以更快地执行(即使执行1 /结果比库存标准平方根函数更快)或者产生比标准方法更准确的结果,这是一个很酷的小黑客。

In c/c++ the code is: (sourced from Wikipedia)

在c / c ++中,代码是:(来自*)

float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);   // Now this is what you call a real magic number
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}

#5


8  

I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.

几年前,我用65816汇编语言为Apple IIgs写了一个基于磁贴的游戏引擎。这是一台相当慢的机器,“在金属上”编程是哄骗可接受性能的虚拟要求。

In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.

为了快速更新图形屏幕,必须将堆栈映射到屏幕,以便使用一些特殊指令,允许用户仅在5个机器周期内更新4个屏幕像素。这并不是特别奇妙,在IIgs Tech Note#70中有详细描述。硬核位是我必须组织代码以使其足够灵活以成为通用库,同时仍保持最大速度。

I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.

我将图形屏幕分解为扫描线,并创建了一个246字节的代码缓冲区来插入专门的65816操作码。需要246个字节,因为图形屏幕的每条扫描线宽度为80字,每端需要1个附加字以便平滑滚动。推送有效地址(PEA)指令占用3个字节,因此3 *(80 + 1 + 1)= 246个字节。

The graphics screen is rendered by jumping to an address within the 246 byte code buffer that corresponds to the right edge of the screen and patching in a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.

通过跳转到对应于屏幕右边缘的246字节代码缓冲区内的地址并将BRanch Always(BRA)指令修补到紧随最左边单词后面的单词的代码中来呈现图形屏幕。 BRA指令将带符号的8位偏移量作为其参数,因此它几乎没有跳出代码缓冲区的范围。

Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:

即便这样也不是太困难,但真正的核心优化就在这里。我的图形引擎实际上支持两个独立的背景图层和动画图块,使用不同的3字节代码序列,具体取决于模式:

  1. Background 1 uses a Push Effective Address (PEA) instruction
  2. 背景1使用推送有效地址(PEA)指令

  3. Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
  4. 背景2使用加载间接索引(LDA($ 00),y)指令,然后是推送(PHA)

  5. Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
  6. 动画磁贴使用加载直接页索引(LDA $ 00,x)指令后跟推送(PHA)

The critical restriction is that both of the 65816 registers (X and Y) are used to reference data and cannot be modified. Further the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; the stack pointer (S) is mapped to graphics screen, so there is no possibility of jumping to a subroutine and returning.

关键限制是65816寄存器(X和Y)都用于引用数据,不能修改。此外,直接页面寄存器(D)是基于第二背景的原点设置的,并且不能改变;数据库寄存器设置为保存第二背景的像素数据的数据库,不能更改;堆栈指针(S)映射到图形屏幕,因此不可能跳转到子程序并返回。

Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:

鉴于这些限制,我需要快速处理即将被推入堆栈的单词混合的情况,即一半来自背景1,一半来自后台2.我的解决方案是交换内存以获得速度。因为所有正常寄存器都在使用中,所以我只能使用程序计数器(PC)寄存器。我的解决方案如下:

  1. Define a code fragment to do the blend in the same 64K program bank as the code buffer
  2. 定义一个代码片段,在与代码缓冲区相同的64K程序库中进行混合

  3. Create a copy of this code for each of the 82 words
  4. 为82个单词中的每个单词创建此代码的副本

  5. There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
  6. 存在1-1对应关系,因此代码片段的返回可以是硬编码的地址

  7. Done! We have a hard-coded subroutine that does not affect the CPU registers.
  8. 完成!我们有一个硬编码的子程序,不会影响CPU寄存器。

Here is the actual code fragments

这是实际的代码片段

code_buff:   PEA $0000            ; rightmost word (16-bits = 4 pixels)
             PEA $0000            ; background 1
             PEA $0000            ; background 1
             PEA $0000            ; background 1
             LDA (72),y           ; background 2
             PHA
             LDA (70),y           ; background 2
             PHA
             JMP word_68          ; mix the data
word_68_rtn: PEA $0000            ; more background 1
             ...
             PEA $0000
             BRA *+40             ; patched exit code

             ...                  
word_68:     LDA (68),y           ; load data for background 2
             AND #$00FF           ; mask
             ORA #$AB00           ; blend with data from background 1
             PHA
             JMP word_68_rtn      ; jump back
word_66:     LDA (66),y
             ...

The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.

最终的结果是接近最佳的阻击器,具有最小的开销,并且在具有1MB / s存储器总线的2.5MHz CPU上以320x200的速度输出超过15帧/秒。

#6


7  

A Very Biological Optimisation

Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.

快速背景:DNA核苷酸(A,C,G和T)的三重态编码氨基酸,这些氨基酸与蛋白质结合,蛋白质是大多数生物的组成部分。

Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely starting from a position that is divisible by 3; from a position that divides 3 with remainder 1; or from a position that divides 3 with remainder 2; and the same again, but reading the sequence in reverse.)

通常,每种不同的蛋白质需要单独的DNA三联体序列(其“基因”)来编码其氨基酸 - 例如,长度为30,40和50的3种蛋白质总共需要90 + 120 + 150 = 360个核苷酸。然而,在病毒中,空间是非常宝贵的 - 因此一些病毒与不同基因的DNA序列重叠,使用6个可能的“阅读框架”用于DNA到蛋白质翻译(即从一个位置开始)可被3整除;从3除以余数1的位置;或从3除以余数2的位置;再次相同,但反过来读取序列。)

For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)

为了比较:尝试编写一个x86汇编语言程序,其中300字节函数doFoo()从偏移量0x1000开始...另一个200字节函数doBar()从偏移量0x1001开始! (我为这次比赛提出一个名字:你比乙肝更聪明吗?)

That's hardcore space optimisation!

这是硬核空间优化!

UPDATE: Links to further info:

更新:链接到更多信息:

  • Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
  • *上的阅读框架表明,乙型肝炎和“大麦黄矮星”病毒(一种植物病毒)都与阅读框架重叠。

  • Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
  • *上的乙型肝炎基因组信息。似乎不同的阅读框亚基产生表面蛋白的不同变异。

  • Or you could google for "overlapping reading frames"
  • 或者你可以谷歌搜索“重叠阅读框”

  • Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
  • 似乎这甚至可能发生在哺乳动物身上!在第二个哺乳动物基因中广泛重叠的阅读框是由Marilyn Kozak撰写的2001年科学论文,该论文讨论了“广泛重叠阅读框”的大鼠“第二”基因。 (这是非常令人惊讶的,因为哺乳动物的基因组结构为不同蛋白质的单独基因提供了充足的空间。)我自己并没有超越摘要。

#7


5  

Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.

迈克尔·阿布拉什(Michael Abrash)的“汇编语言之禅”(Zen of Assembly Language)有一些漂亮的东西,不过我承认我不记得我头脑中的细节。

Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.

实际上,似乎Abrash所写的一切都有一些漂亮的优化内容。

#8


3  

The Stalin Scheme compiler is pretty crazy in that aspect.

斯大林计划编译器在这方面非常疯狂。

#9


3  

I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:

我曾经看过一个带有很多空案例的switch语句,交换机头部的评论说了一句话:

Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases

添加了从未被命中的case语句,因为如果有多于N个的情况,编译器只会将开关转换为跳转表

I forget what N was. This was in the source code for Windows that was leaked in 2004.

我忘了N是什么。这是在2004年泄露的Windows源代码中。

#10


2  

I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx - move with sign extension is awesome for moving little signed values into big spaces, for example, in one instruction.

我已经去了英特尔(或AMD)架构参考,看看有什么说明。 movsx - 带符号扩展名的移动非常棒,可以将小符号值移动到大空间中,例如,在一条指令中。

Likewise, if you know you only use 16-bit values, but you can access all of EAX, EBX, ECX, EDX , etc- then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other values.

同样,如果你知道你只使用16位值,但你可以访问所有EAX,EBX,ECX,EDX等 - 那么你有8个非常快速的值位置 - 只需将寄存器旋转16位即可访问另一个值。

#11


1  

The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key isn't the solution, but could not prove a key was the solution) which were then tested with a more conventional code.

EFF DES破解者使用定制硬件生成候选密钥(他们制造的硬件可能证明是关键不是解决方案,但无法证明关键是解决方案),然后使用更传统的代码进行测试。

#12


1  

The FSG 2.0 packer made by a Polish team, specifically made for packing executables made with assembly. If packing assembly isn't impressive enough (what's supposed to be almost as low as possible) the loader it comes with is 158 bytes and fully functional. If you try packing any assembly made .exe with something like UPX, it will throw a NotCompressableException at you ;)

由波兰团队制造的FSG 2.0包装机,专门用于包装由装配制成的可执行文件。如果包装组件不够令人印象深刻(假设几乎尽可能低),它配备的装载机是158字节并且功能齐全。如果你尝试用UPX之类的东西打包任何程序集.exe,它会抛出一个NotCompressableException;)

#1


28  

I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.

我曾经写过一个暴力RC5密钥搜索,一次处理两个密钥,第一个密钥使用整数流水线,第二个密钥使用SSE流水线,两个密钥在指令级交错。然后将其与管理程序结合使用,该程序在系统中的每个核心上运行代码实例。总的来说,代码运行速度比原始C版本快25倍。

#2


15  

In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.

在我使用过的一个(这里未命名的)视频游戏引擎中,他们重写了模型导出工具(将Maya网格转换为游戏加载的东西),这样它实际上不会发出数据,而是实际发出精确的流渲染特定模型所必需的微指令。它使用遗传算法找到将在最小循环次数内运行的遗传算法。也就是说,给定模型的数据格式实际上是一个完美优化的子程序,用于渲染该模型。因此,将网格绘制到屏幕意味着将其加载到内存中并分支到其中。

(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)

(这不是用于PC,而是用于具有与CPU分离且并行的向量单元的控制台。)

#3


10  

In the early days of DOS when we used floppy discs for all data transport there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the bootsector of an inserted floppydisc. When the user inserted the floppydisc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the harddrive bootsector, thus permanently infecting the host PC. A particulary annoying virus I was infected by was called "Form", to battle this I wrote a custom floppy bootsector that had the following features:

在DOS的早期,当我们使用软盘进行所有数据传输时,也存在病毒。病毒感染不同计算机的一种常见方法是将病毒引导程序复制到插入的floppydisc的引导程序中。当用户将floppydisc插入另一台计算机并重新启动而不记得移除软盘时,病毒就会运行并感染硬盘启动器,从而永久感染主机PC。我被感染的一种特殊的令人讨厌的病毒被称为“Form”,为了对抗这一点,我编写了一个具有以下功能的自定义软盘bootsector:

  • Validate the bootsector of the host harddrive and make sure it was not infected.
  • 验证主机硬盘的bootsector并确保它没有被感染。

  • Validate the floppy bootsector and make sure that it was not infected.
  • 验证软盘引导扇区并确保它没有被感染。

  • Code to remove the virus from the harddrive if it was infected.
  • 如果病毒被感染,请从硬盘中删除病毒的代码。

  • Code to duplicate the antivirus bootsector to another floppy if a special key was pressed.
  • 如果按下特殊键,则将防病毒引导程序复制到另一张软盘的代码。

  • Code to boot the harddrive if all was well, and no infections was found.
  • 如果一切正常,则启动硬盘的代码,并且没有发现感染。

This was done in the program space of a bootsector, about 440 bytes :)

这是在bootsector的程序空间中完成的,大约440字节:)

The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"

我的伙伴最大的问题是显示非常神秘的消息,因为我需要所有代码空间。它就像“FFVD RM?”,意思是“检测到FindForm病毒,删除?”

I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.

我很满意这段代码。优化是程序大小,而不是速度。装配中有两种完全不同的优化。

#4


9  

My favorite is the floating point inverse square root via integer operations. This is a cool little hack on how floating point values are stored and can execute faster (even doing a 1/result is faster than the stock-standard square root function) or produce more accurate results than the standard methods.

我最喜欢的是通过整数运算的浮点反平方根。对于如何存储浮点值并且可以更快地执行(即使执行1 /结果比库存标准平方根函数更快)或者产生比标准方法更准确的结果,这是一个很酷的小黑客。

In c/c++ the code is: (sourced from Wikipedia)

在c / c ++中,代码是:(来自*)

float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);   // Now this is what you call a real magic number
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}

#5


8  

I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.

几年前,我用65816汇编语言为Apple IIgs写了一个基于磁贴的游戏引擎。这是一台相当慢的机器,“在金属上”编程是哄骗可接受性能的虚拟要求。

In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.

为了快速更新图形屏幕,必须将堆栈映射到屏幕,以便使用一些特殊指令,允许用户仅在5个机器周期内更新4个屏幕像素。这并不是特别奇妙,在IIgs Tech Note#70中有详细描述。硬核位是我必须组织代码以使其足够灵活以成为通用库,同时仍保持最大速度。

I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.

我将图形屏幕分解为扫描线,并创建了一个246字节的代码缓冲区来插入专门的65816操作码。需要246个字节,因为图形屏幕的每条扫描线宽度为80字,每端需要1个附加字以便平滑滚动。推送有效地址(PEA)指令占用3个字节,因此3 *(80 + 1 + 1)= 246个字节。

The graphics screen is rendered by jumping to an address within the 246 byte code buffer that corresponds to the right edge of the screen and patching in a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.

通过跳转到对应于屏幕右边缘的246字节代码缓冲区内的地址并将BRanch Always(BRA)指令修补到紧随最左边单词后面的单词的代码中来呈现图形屏幕。 BRA指令将带符号的8位偏移量作为其参数,因此它几乎没有跳出代码缓冲区的范围。

Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:

即便这样也不是太困难,但真正的核心优化就在这里。我的图形引擎实际上支持两个独立的背景图层和动画图块,使用不同的3字节代码序列,具体取决于模式:

  1. Background 1 uses a Push Effective Address (PEA) instruction
  2. 背景1使用推送有效地址(PEA)指令

  3. Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
  4. 背景2使用加载间接索引(LDA($ 00),y)指令,然后是推送(PHA)

  5. Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
  6. 动画磁贴使用加载直接页索引(LDA $ 00,x)指令后跟推送(PHA)

The critical restriction is that both of the 65816 registers (X and Y) are used to reference data and cannot be modified. Further the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; the stack pointer (S) is mapped to graphics screen, so there is no possibility of jumping to a subroutine and returning.

关键限制是65816寄存器(X和Y)都用于引用数据,不能修改。此外,直接页面寄存器(D)是基于第二背景的原点设置的,并且不能改变;数据库寄存器设置为保存第二背景的像素数据的数据库,不能更改;堆栈指针(S)映射到图形屏幕,因此不可能跳转到子程序并返回。

Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:

鉴于这些限制,我需要快速处理即将被推入堆栈的单词混合的情况,即一半来自背景1,一半来自后台2.我的解决方案是交换内存以获得速度。因为所有正常寄存器都在使用中,所以我只能使用程序计数器(PC)寄存器。我的解决方案如下:

  1. Define a code fragment to do the blend in the same 64K program bank as the code buffer
  2. 定义一个代码片段,在与代码缓冲区相同的64K程序库中进行混合

  3. Create a copy of this code for each of the 82 words
  4. 为82个单词中的每个单词创建此代码的副本

  5. There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
  6. 存在1-1对应关系,因此代码片段的返回可以是硬编码的地址

  7. Done! We have a hard-coded subroutine that does not affect the CPU registers.
  8. 完成!我们有一个硬编码的子程序,不会影响CPU寄存器。

Here is the actual code fragments

这是实际的代码片段

code_buff:   PEA $0000            ; rightmost word (16-bits = 4 pixels)
             PEA $0000            ; background 1
             PEA $0000            ; background 1
             PEA $0000            ; background 1
             LDA (72),y           ; background 2
             PHA
             LDA (70),y           ; background 2
             PHA
             JMP word_68          ; mix the data
word_68_rtn: PEA $0000            ; more background 1
             ...
             PEA $0000
             BRA *+40             ; patched exit code

             ...                  
word_68:     LDA (68),y           ; load data for background 2
             AND #$00FF           ; mask
             ORA #$AB00           ; blend with data from background 1
             PHA
             JMP word_68_rtn      ; jump back
word_66:     LDA (66),y
             ...

The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.

最终的结果是接近最佳的阻击器,具有最小的开销,并且在具有1MB / s存储器总线的2.5MHz CPU上以320x200的速度输出超过15帧/秒。

#6


7  

A Very Biological Optimisation

Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.

快速背景:DNA核苷酸(A,C,G和T)的三重态编码氨基酸,这些氨基酸与蛋白质结合,蛋白质是大多数生物的组成部分。

Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely starting from a position that is divisible by 3; from a position that divides 3 with remainder 1; or from a position that divides 3 with remainder 2; and the same again, but reading the sequence in reverse.)

通常,每种不同的蛋白质需要单独的DNA三联体序列(其“基因”)来编码其氨基酸 - 例如,长度为30,40和50的3种蛋白质总共需要90 + 120 + 150 = 360个核苷酸。然而,在病毒中,空间是非常宝贵的 - 因此一些病毒与不同基因的DNA序列重叠,使用6个可能的“阅读框架”用于DNA到蛋白质翻译(即从一个位置开始)可被3整除;从3除以余数1的位置;或从3除以余数2的位置;再次相同,但反过来读取序列。)

For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)

为了比较:尝试编写一个x86汇编语言程序,其中300字节函数doFoo()从偏移量0x1000开始...另一个200字节函数doBar()从偏移量0x1001开始! (我为这次比赛提出一个名字:你比乙肝更聪明吗?)

That's hardcore space optimisation!

这是硬核空间优化!

UPDATE: Links to further info:

更新:链接到更多信息:

  • Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
  • *上的阅读框架表明,乙型肝炎和“大麦黄矮星”病毒(一种植物病毒)都与阅读框架重叠。

  • Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
  • *上的乙型肝炎基因组信息。似乎不同的阅读框亚基产生表面蛋白的不同变异。

  • Or you could google for "overlapping reading frames"
  • 或者你可以谷歌搜索“重叠阅读框”

  • Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
  • 似乎这甚至可能发生在哺乳动物身上!在第二个哺乳动物基因中广泛重叠的阅读框是由Marilyn Kozak撰写的2001年科学论文,该论文讨论了“广泛重叠阅读框”的大鼠“第二”基因。 (这是非常令人惊讶的,因为哺乳动物的基因组结构为不同蛋白质的单独基因提供了充足的空间。)我自己并没有超越摘要。

#7


5  

Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.

迈克尔·阿布拉什(Michael Abrash)的“汇编语言之禅”(Zen of Assembly Language)有一些漂亮的东西,不过我承认我不记得我头脑中的细节。

Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.

实际上,似乎Abrash所写的一切都有一些漂亮的优化内容。

#8


3  

The Stalin Scheme compiler is pretty crazy in that aspect.

斯大林计划编译器在这方面非常疯狂。

#9


3  

I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:

我曾经看过一个带有很多空案例的switch语句,交换机头部的评论说了一句话:

Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases

添加了从未被命中的case语句,因为如果有多于N个的情况,编译器只会将开关转换为跳转表

I forget what N was. This was in the source code for Windows that was leaked in 2004.

我忘了N是什么。这是在2004年泄露的Windows源代码中。

#10


2  

I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx - move with sign extension is awesome for moving little signed values into big spaces, for example, in one instruction.

我已经去了英特尔(或AMD)架构参考,看看有什么说明。 movsx - 带符号扩展名的移动非常棒,可以将小符号值移动到大空间中,例如,在一条指令中。

Likewise, if you know you only use 16-bit values, but you can access all of EAX, EBX, ECX, EDX , etc- then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other values.

同样,如果你知道你只使用16位值,但你可以访问所有EAX,EBX,ECX,EDX等 - 那么你有8个非常快速的值位置 - 只需将寄存器旋转16位即可访问另一个值。

#11


1  

The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key isn't the solution, but could not prove a key was the solution) which were then tested with a more conventional code.

EFF DES破解者使用定制硬件生成候选密钥(他们制造的硬件可能证明是关键不是解决方案,但无法证明关键是解决方案),然后使用更传统的代码进行测试。

#12


1  

The FSG 2.0 packer made by a Polish team, specifically made for packing executables made with assembly. If packing assembly isn't impressive enough (what's supposed to be almost as low as possible) the loader it comes with is 158 bytes and fully functional. If you try packing any assembly made .exe with something like UPX, it will throw a NotCompressableException at you ;)

由波兰团队制造的FSG 2.0包装机,专门用于包装由装配制成的可执行文件。如果包装组件不够令人印象深刻(假设几乎尽可能低),它配备的装载机是158字节并且功能齐全。如果你尝试用UPX之类的东西打包任何程序集.exe,它会抛出一个NotCompressableException;)