How can I improve the performance of my custom OpenGL ES 2.0 depth texture generation?

Date: 2021-02-09 03:27:48

I have an open source iOS application that uses custom OpenGL ES 2.0 shaders to display 3-D representations of molecular structures. It does this by using procedurally generated sphere and cylinder impostors drawn over rectangles, instead of these same shapes built using lots of vertices. The downside to this approach is that the depth values for each fragment of these impostor objects need to be calculated in a fragment shader, to be used when objects overlap.

Unfortunately, OpenGL ES 2.0 does not let you write to gl_FragDepth, so I've needed to output these values to a custom depth texture. I do a pass over my scene using a framebuffer object (FBO), only rendering out a color that corresponds to a depth value, with the results being stored into a texture. This texture is then loaded into the second half of my rendering process, where the actual screen image is generated. If a fragment at that stage is at the depth level stored in the depth texture for that point on the screen, it is displayed. If not, it is tossed. More about the process, including diagrams, can be found in my post here.

The generation of this depth texture is a bottleneck in my rendering process and I'm looking for a way to make it faster. It seems slower than it should be, but I can't figure out why. In order to achieve the proper generation of this depth texture, GL_DEPTH_TEST is disabled, GL_BLEND is enabled with glBlendFunc(GL_ONE, GL_ONE), and glBlendEquation() is set to GL_MIN_EXT. I know that a scene output in this manner isn't the fastest on a tile-based deferred renderer like the PowerVR series in iOS devices, but I can't think of a better way to do this.

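For reference, the state setup for this pass looks roughly like the following. This is a minimal sketch: depthFramebuffer is a placeholder for the FBO described above, clearing to white is assumed so that the MIN blend keeps the nearest encoded depth, and GL_MIN_EXT comes from the EXT_blend_minmax extension:

glBindFramebuffer(GL_FRAMEBUFFER, depthFramebuffer);  // FBO backed by the depth texture
glClearColor(1.0, 1.0, 1.0, 1.0);                     // assumed: clear to "far" (white) so MIN blending works
glClear(GL_COLOR_BUFFER_BIT);

glDisable(GL_DEPTH_TEST);                             // no hardware depth buffer in this pass
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);
glBlendEquation(GL_MIN_EXT);                          // keep the smallest (nearest) encoded depth per pixel

// ... draw the sphere/cylinder impostors with the depth fragment shader shown below ...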

My depth fragment shader for spheres (the most common display element) looks to be at the heart of this bottleneck (Renderer Utilization in Instruments is pegged at 99%, indicating that I'm limited by fragment processing). It currently looks like the following:

precision mediump float;

varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;

const vec3 stepValues = vec3(2.0, 1.0, 0.0);
const float scaleDownFactor = 1.0 / 255.0;

void main()
{
    float distanceFromCenter = length(impostorSpaceCoordinate);
    if (distanceFromCenter > 1.0)
    {
        gl_FragColor = vec4(1.0);
    }
    else
    {
        float calculatedDepth = sqrt(1.0 - distanceFromCenter * distanceFromCenter);
        mediump float currentDepthValue = normalizedDepth - adjustedSphereRadius * calculatedDepth;

        // Inlined color encoding for the depth values
        float ceiledValue = ceil(currentDepthValue * 765.0);

        vec3 intDepthValue = (vec3(ceiledValue) * scaleDownFactor) - stepValues;

        gl_FragColor = vec4(intDepthValue, 1.0);
    }
}

On an iPad 1, this takes 35 - 68 ms to render a frame of a DNA spacefilling model using a passthrough shader for display (18 to 35 ms on iPhone 4). According to the PowerVR PVRUniSCo compiler (part of their SDK), this shader uses 11 GPU cycles at best, 16 cycles at worst. I'm aware that you're advised not to use branching in a shader, but in this case that led to better performance than otherwise.

When I simplify it to

precision mediump float;

varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;

void main()
{
    gl_FragColor = vec4(adjustedSphereRadius * normalizedDepth * (impostorSpaceCoordinate + 1.0) / 2.0, normalizedDepth, 1.0);
}

it takes 18 - 35 ms on iPad 1, but only 1.7 - 2.4 ms on iPhone 4. The estimated GPU cycle count for this shader is 8 cycles. The change in render time based on cycle count doesn't seem linear.

Finally, if I just output a constant color:

precision mediump float;

void main()
{
    gl_FragColor = vec4(0.5, 0.5, 0.5, 1.0);
}

the rendering time drops to 1.1 - 2.3 ms on iPad 1 (1.3 ms on iPhone 4).

The nonlinear scaling in rendering time and sudden change between iPad and iPhone 4 for the second shader makes me think that there's something I'm missing here. A full source project containing these three shader variants (look in the SphereDepth.fsh file and comment out the appropriate sections) and a test model can be downloaded from here, if you wish to try this out yourself.

If you've read this far, my question is: based on this profiling information, how can I improve the rendering performance of my custom depth shader on iOS devices?

4 Answers

#1


19  

Based on the recommendations by Tommy, Pivot, and rotoglup, I've implemented some optimizations which have led to a doubling of the rendering speed for both the depth texture generation and the overall rendering pipeline in the application.

First, I re-enabled the precalculated sphere depth and lighting texture that I'd used before with little effect, only now I use proper lowp precision values when handling the colors and other values from that texture. This combination, along with proper mipmapping for the texture, seems to yield a ~10% performance boost.

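As a rough sketch of the texture state involved (sphereDepthTexture is a placeholder name for the lookup texture's handle):

glBindTexture(GL_TEXTURE_2D, sphereDepthTexture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
// generate the mip chain after the base level has been uploaded
// (without extensions, ES 2.0 mipmapping requires power-of-two texture dimensions)
glGenerateMipmap(GL_TEXTURE_2D);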

More importantly, I now do a pass before rendering both my depth texture and the final raytraced impostors where I lay down some opaque geometry to block pixels that would never be rendered. To do this, I enable depth testing and then draw out the squares that make up the objects in my scene, shrunken by sqrt(2) / 2, with a simple opaque shader. This creates inset squares covering an area known to be opaque within the represented sphere.

I then disable depth writes using glDepthMask(GL_FALSE) and render the square sphere impostor at a location closer to the user by one radius. This allows the tile-based deferred rendering hardware in the iOS devices to efficiently strip out fragments that would never appear onscreen under any conditions, yet still give smooth intersections between the visible sphere impostors based on per-pixel depth values. This is depicted in my crude illustration below:

[Illustration: inset opaque blocking squares drawn behind the square sphere impostors]

In this example, the opaque blocking squares for the top two impostors do not prevent any of the fragments from those visible objects from being rendered, yet they block a chunk of the fragments from the lowest impostor. The frontmost impostors can then use per-pixel tests to generate a smooth intersection, while many of the pixels from the rear impostor don't waste GPU cycles by being rendered.

I hadn't thought to disable depth writes, yet leave on depth testing when doing the last rendering stage. This is the key to preventing the impostors from simply stacking on one another, yet still using some of the hardware optimizations within the PowerVR GPUs.

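In terms of GL state, the ordering of these two passes is roughly the following (a sketch only; drawInsetBlockingSquares() and drawSphereImpostors() are placeholder names for the actual draw code):

glEnable(GL_DEPTH_TEST);
glDepthMask(GL_TRUE);
// 1. opaque blockers: the inset squares, drawn with the simple opaque shader
drawInsetBlockingSquares();

glDepthMask(GL_FALSE);   // keep testing against the blockers, but stop writing depth
// 2. the square sphere impostors, offset one radius toward the viewer,
//    shaded with the per-pixel depth / raytracing shaders
drawSphereImpostors();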

In my benchmarks, rendering the test model I used above yields times of 18 - 35 ms per frame, as compared to the 35 - 68 ms I was getting previously, a near doubling in rendering speed. Applying this same opaque geometry pre-rendering to the raytracing pass yields a doubling in overall rendering performance.

Oddly, when I tried to refine this further by using inscribed and circumscribed octagons, which should cover ~17% fewer pixels when drawn (a circumscribed regular octagon covers about 83% of the area of its bounding square) and be more efficient at blocking fragments, performance was actually worse than when using simple squares for this. Tiler utilization was still less than 60% in the worst case, so maybe the larger geometry was resulting in more cache misses.

EDIT (5/31/2011):

Based on Pivot's suggestion, I created inscribed and circumscribed octagons to use instead of my rectangles, only I followed the recommendations here for optimizing triangles for rasterization. In previous testing, octagons yielded worse performance than squares, despite removing many unnecessary fragments and letting you block covered fragments more efficiently. By adjusting the triangle drawing as follows:

[Illustration: adjusted triangle layout for the inscribed and circumscribed octagons]

I was able to reduce overall rendering time by an average of 14% on top of the above-described optimizations by switching to octagons from squares. The depth texture is now generated in 19 ms, with occasional dips to 2 ms and spikes to 35 ms.

EDIT 2 (5/31/2011):

I've revisited Tommy's idea of using the step function, now that I have fewer fragments to discard due to the octagons. This, combined with a depth lookup texture for the sphere, now leads to a 2 ms average rendering time on the iPad 1 for the depth texture generation for my test model. I consider that to be about as good as I could hope for in this rendering case, and a giant improvement from where I started. For posterity, here is the depth shader I'm now using:

precision mediump float;

varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;
varying mediump vec2 depthLookupCoordinate;

uniform lowp sampler2D sphereDepthMap;

const lowp vec3 stepValues = vec3(2.0, 1.0, 0.0);

void main()
{
    lowp vec2 precalculatedDepthAndAlpha = texture2D(sphereDepthMap, depthLookupCoordinate).ra;

    float inCircleMultiplier = step(0.5, precalculatedDepthAndAlpha.g);

    float currentDepthValue = normalizedDepth + adjustedSphereRadius - adjustedSphereRadius * precalculatedDepthAndAlpha.r;

    // Inlined color encoding for the depth values
    currentDepthValue = currentDepthValue * 3.0;

    lowp vec3 intDepthValue = vec3(currentDepthValue) - stepValues;

    gl_FragColor = vec4(1.0 - inCircleMultiplier) + vec4(intDepthValue, inCircleMultiplier);
}

I've updated the testing sample here, if you wish to see this new approach in action as compared to what I was doing initially.

I'm still open to other suggestions, but this is a huge step forward for this application.

#2


9  

On the desktop, it was the case on many early programmable devices that while they could process 8 or 16 or whatever fragments simultaneously, they effectively had only one program counter for the lot of them (since that also implies only one fetch/decode unit and one of everything else, as long as they work in units of 8 or 16 pixels). Hence the initial prohibition on conditionals and, for a while after that, the situation where if the conditional evaluations for pixels that would be processed together returned different values, those pixels would be processed in smaller groups in some arrangement.

Although PowerVR aren't explicit, their application development recommendations have a section on flow control and make a lot of recommendations about dynamic branches usually being a good idea only where the result is reasonably predictable, which makes me think they're getting at the same sort of thing. I'd therefore suggest that the speed disparity may be because you've included a conditional.

As a first test, what happens if you try the following?

precision mediump float;

// same declarations and constants as in the original shader
varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;

const vec3 stepValues = vec3(2.0, 1.0, 0.0);
const float scaleDownFactor = 1.0 / 255.0;

void main()
{
    float distanceFromCenter = length(impostorSpaceCoordinate);

    // the step function doesn't count as a conditional
    float inCircleMultiplier = step(distanceFromCenter, 1.0);

    float calculatedDepth = sqrt(1.0 - distanceFromCenter * distanceFromCenter * inCircleMultiplier);
    mediump float currentDepthValue = normalizedDepth - adjustedSphereRadius * calculatedDepth;

    // Inlined color encoding for the depth values
    float ceiledValue = ceil(currentDepthValue * 765.0) * inCircleMultiplier;

    vec3 intDepthValue = (vec3(ceiledValue) * scaleDownFactor) - (stepValues * inCircleMultiplier);

    // use the result of the step to combine results
    gl_FragColor = vec4(1.0 - inCircleMultiplier) + vec4(intDepthValue, inCircleMultiplier);
}

#3


8  

Many of these points have been covered by others who have posted answers, but the overarching theme here is that your rendering does a lot of work that will be thrown away:

  1. The shader itself does some potentially redundant work. The length of a vector is likely to be calculated as sqrt(dot(vector, vector)). You don’t need the sqrt to reject fragments outside of the circle, and you’re squaring the length to calculate the depth, anyway. Additionally, have you looked at whether or not explicit quantization of the depth values is actually necessary, or can you get away with just using the hardware’s conversion from floating-point to integer for the framebuffer (potentially with an additional bias to make sure your quasi-depth tests come out right later)?

  2. Many fragments are trivially outside the circle. Only π/4 of the area of the quads you’re drawing produce useful depth values. At this point, I imagine your app is heavily skewed towards fragment processing, so you may want to consider increasing the number of vertices you draw in exchange for a reduction in the area that you have to shade. Since you’re drawing spheres through an orthographic projection, any circumscribing regular polygon will do, although you may need a little extra size depending on zoom level to make sure you rasterize enough pixels.

  3. Many fragments are trivially occluded by other fragments. As others have pointed out, you’re not using hardware depth test, and therefore not taking full advantage of a TBDR’s ability to kill shading work early. If you’ve already implemented something for 2), all you need to do is draw an inscribed regular polygon at the maximum depth that you can generate (a plane through the middle of the sphere), and draw your real polygon at the minimum depth (the front of the sphere). Both Tommy’s and rotoglup’s posts already contain the state vector specifics.

Note that 2) and 3) apply to your raytracing shaders as well.

#4


2  

I'm no mobile platform expert at all, but I think that what bites you is that:

  • your depth shader is quite expensive
  • you experience massive overdraw in your depth pass because GL_DEPTH_TEST is disabled

Wouldn't an additional pass, drawn before the depth pass, be helpful?

This pass could do a GL_DEPTH prefill, for example by drawing each sphere as a camera-facing quad (or a cube, which may be easier to set up) contained within the associated sphere. This pass could be drawn with color writes masked off and no fragment shader, just with GL_DEPTH_TEST and glDepthMask enabled. On desktop platforms, these kinds of passes get drawn faster than color + depth passes.

Then in your depth computation pass, you could enable GL_DEPTH_TEST and disable glDepthMask; this way your shader would not be executed on pixels that are hidden by nearer geometry.

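A minimal sketch of that suggestion in GL calls (drawSphereProxies() and drawDepthComputationPass() are placeholder names for the actual draw code):

// depth-only prefill: no color writes, no fragment shading needed
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_TRUE);
drawSphereProxies();            // camera-facing quads (or cubes) contained in each sphere

// depth computation pass: test against the prefilled depth, but don't modify it
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);
drawDepthComputationPass();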

This solution would involve issuing another set of draw calls, so this may not be beneficial.
