OpenGL perf. problems caused by DirectX 9 oriented design

devsh · Post by **devsh** » Tue Feb 24, 2015 6:51 pm

discuss... mainly discuss the solutions

I will give more examples, but the most prominent is the complete lack of respect for the concept of asynchronous resource upload:
A) The upload of VBO meshbuffers to the GPU, DIRECTLY BEFORE THE DRAW CALL WHERE THEY ARE REQUIRED TO BE USED
B) The lock() and unlock() interfaces on ITexture (complete GPU-CPU stall)
C) FBO attachment rebinding every single time setRenderTarget is used

Most of this, I guess, was caused by first implementing the engine features in DirectX 9 with an almost 1-1 mapping onto DirectX functions (lock(), unlock()), then coding the OpenGL driver later and having to fit it into the existing API.

I've read and modified most areas of the video:: and scene:: namespaces, and I can see a reflection of DirectX 9 in the API.

I get that I am complaining about OpenGL, but this affects DirectX 10, 11 and 12 performance too, because you have to use resources differently if you want an efficient implementation (Direct State Access etc.).

Nadro · Post by **Nadro** » Tue Feb 24, 2015 7:48 pm

You're right, Irrlicht was designed around D3D9, so we might see perf problems in other drivers; however, some problems are fixed.
A) In the shader-pipeline branch there is already an initial solution for this issue, and we'll add further improvements for this case in the future (the user will have full control over hardware buffer methods and will be able to use double/triple buffering).
B) There is no solution for this issue yet.
C) This case is already solved in my local Irrlicht repo; the patch will be merged into trunk in the upcoming days.

Granyte · Post by **Granyte** » Wed Feb 25, 2015 2:49 am

This makes me cringe every time you write about it.

I doubt Irrlicht was built around DX9.

Most DX9 features barely had any equivalent or exposure in the Irrlicht interface, and many still don't.
As for your specific points:
A) That is harmful even for DX9.

B) I don't really see what can be done. It does indeed work like that in DX9, but locking a texture does result in a pipeline stall; even in DX11 the subresource mapping still results in pipeline stalls. If OpenGL has a way to work around this I'd like to hear it.

C) I have nothing to add. Rebinding is harmful for every driver, DX9 included; DX9 only seems to cope with it better than the others.

And now I have to add my own point:
D) The GUI. Something needs to be done about it: every element uses multiple uploads and draws back to back. The performance is already bad on DX9, but on DX11 it's just impossible to deal with, and if OGL is similar to DX11 it must be a pain there too.

devsh · Post by **devsh** » Wed Feb 25, 2015 6:59 pm

All of the problems can be solved, but the reason why they are not going to be is because of backward compatibility.

I solved (B) and I'm coding around (C), but unlike the Irrlicht devs I don't have inertia put upon me, so I can do quite drastic solutions (like breaking the DX9 and DX8 drivers).
This is how these problems can be solved:

A) Deferred VBO updates: the mesh is asynchronously uploaded during its creation or has a method "updateVBO()", and then it's not going to stall if you wait at least 4 frames before using it.

A much better implementation would split the IMeshBuffer into more objects.
I.e. CMeshBuffer or IMeshBuffer would hold pointers to VertexBuffer and IndexBuffer objects along with offsets, element counts, strides and attribute divisors (+ attribute slot allocation in the shader).

I would also get full control over which buffer lives in a GPU driver-side buffer and over when I update it. There should also be an explicit parameter saying whether Irrlicht keeps a copy of the buffer data even after the buffer is uploaded to the driver (this is the same problem as the large memory requirement created by textures keeping copies of their IImages).

This way I can share vertex buffer between multiple meshes where each only draws a part of it (very useful for LoD, where you can basically swap index lists while keeping vertex buffer the same).
Me and Soren were planning this for a long time as a method of precomputed triangle culling (index list swapping).
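
To make the split in (A) a bit more concrete, here is a minimal sketch of how I picture the data layout. All the names (VertexBuffer, IndexBuffer, MeshBufferView, makeLodViews, keepCpuCopy) are hypothetical and nothing here is existing Irrlicht API; it only shows several mesh buffers sharing one vertex buffer while each draws its own index range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not part of Irrlicht.
struct VertexBuffer {
    std::vector<float> data;      // interleaved vertex attributes
    std::size_t strideBytes = 0;  // size of one vertex in bytes
    bool keepCpuCopy = true;      // explicit choice: keep data after upload?
};

struct IndexBuffer {
    std::vector<std::uint32_t> indices;
};

// A mesh buffer becomes a view: shared buffers plus the range to draw.
struct MeshBufferView {
    std::shared_ptr<VertexBuffer> vertices;
    std::shared_ptr<IndexBuffer> indices;
    std::size_t firstIndex = 0;
    std::size_t indexCount = 0;
};

// Build one view per LoD level. Each range is (firstIndex, indexCount).
// Every view references the same vertex buffer, so switching LoD only
// switches the index range that gets drawn.
inline std::vector<MeshBufferView> makeLodViews(
    const std::shared_ptr<VertexBuffer>& vb,
    const std::shared_ptr<IndexBuffer>& ib,
    const std::vector<std::pair<std::size_t, std::size_t>>& ranges)
{
    std::vector<MeshBufferView> views;
    views.reserve(ranges.size());
    for (const auto& r : ranges) {
        assert(r.first + r.second <= ib->indices.size());
        MeshBufferView v;
        v.vertices = vb;
        v.indices = ib;
        v.firstIndex = r.first;
        v.indexCount = r.second;
        views.push_back(v);
    }
    return views;
}
```

With something like this, precomputed triangle culling is just a matter of picking a different view at draw time; the vertex data is never touched.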

B) We solved this already and I published the source code of my version of Irrlicht. You basically grab the OpenGL handle and call glTexSubImage2D/3D, which is asynchronous (glTexImage2D is not!) as long as you copy from a correct OpenGL buffer to which you uploaded the data beforehand. You do that by binding the buffer before calling glTexSubImage, and you need the right parameters when filling it, otherwise the thing stalls because OpenGL has to wait for the data to be uploaded to the buffer before it can use it. A good way to see how it works is to google "Asynchronous OpenGL texture transfers". We use a circular upload buffer (quadruple buffering) and share the upload buffers between all of the "uploadee textures" (although the benefit gets destroyed if you fill the same buffer twice in one frame).
If you use the correct hints then the driver-side OpenGL buffer from which you upload will reside in system RAM, not video RAM, and will be uploaded to the GPU using DMA, which runs in parallel to rendering. The only stall you're going to have is when the DMA engine has not finished copying before the GPU actually wants to use the result of the copy. That shouldn't occur even if you use the texture straight after the async upload, because OpenGL commands get executed by the GPU waaay after they've been issued and the DMA texture transfer starts ASAP.
The cool thing about exposed OpenGL buffers is that you can "cast" textures into vertex buffers, by copying from a texture to a GPU side buffer and then copying from the buffer into a vertex buffer (or from texture subimage directly to vertex buffer). The only problem you may run into is if you create an OpenGL buffer with a usage hint (Texture Copy etc.) and then use it for a different purpose (vertex buffer), then slowdowns can happen.
For more information about how many buffers there are:
https://www.opengl.org/wiki/Buffer_Object
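
A rough sketch of the bookkeeping behind the circular upload buffer from (B), assuming a fixed number of staging buffers handed out round-robin. UploadRing and its members are names made up for this post, not the code I published, and the GL calls are only mentioned in comments so the snippet compiles without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: hands out staging-buffer slots in round-robin order
// and reports when a slot is being refilled within the same frame, which is
// the case where the benefit of the ring is lost.
class UploadRing {
public:
    struct Slot {
        std::size_t index;     // which staging buffer to fill
        bool reusedThisFrame;  // true if it was already filled this frame
    };

    explicit UploadRing(std::size_t slotCount)
        : lastFrame_(slotCount, std::numeric_limits<std::uint64_t>::max()),
          next_(0)
    {
        assert(slotCount > 0);
    }

    Slot acquire(std::uint64_t frame) {
        const std::size_t i = next_;
        next_ = (next_ + 1) % lastFrame_.size();
        const bool reused = (lastFrame_[i] == frame);
        lastFrame_[i] = frame;
        // In a real driver the caller would now, roughly:
        //   glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[i]);
        //   fill the buffer (glBufferSubData or a mapped pointer);
        //   glTexSubImage2D(..., /* offset into the bound buffer */ 0);
        //   glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
        return Slot{i, reused};
    }

private:
    std::vector<std::uint64_t> lastFrame_;  // frame in which each slot was last filled
    std::size_t next_;                      // next slot to hand out
};
```

With four slots you get the quadruple buffering described above; if reusedThisFrame comes back true, you are uploading more textures per frame than the ring has slots and should either grow it or spread the uploads over more frames.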

One example of cool buffer usage: cast a transform feedback buffer into an indirect rendering buffer (OpenGL 4.3/DX11), performing per-mesh occlusion culling in the geometry shader and using the vertices which "pass" the geometry shader as inputs to the indirect rendering function (where the GPU-side data basically specifies the drawPrimitiveArray arguments).

or in DX10 you can copy into occlusion query buffers.

you can also copy to and between uniform buffers (more sophisticated way of setting shader constants) etc.

C) For that, again, I will introduce a new kind of object to be created by the driver: a dynamic object (so attachments can be changed and swapped out if extreme need be) called something like IMultipleRenderTarget, which will support rendering to specific mip maps of render target textures! So you create all of your FBOs before you start drawing your frames and avoid costly FBO revalidation.

the only problem that I see here is the removeTexture() function, gotta find a way around organising that.
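
A minimal sketch of the idea in (C), assuming FBOs are created once per distinct attachment set and then reused. RenderTargetCache and Attachment are hypothetical names, the integer handle stands in for a real framebuffer object, and the removeTexture() handling shown is just one possible way to deal with it, not a settled design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical description of one attachment: texture id plus mip level.
struct Attachment {
    std::uint32_t texture;
    std::uint32_t mipLevel;
};

inline bool operator<(const Attachment& a, const Attachment& b) {
    return std::tie(a.texture, a.mipLevel) < std::tie(b.texture, b.mipLevel);
}

// Creates one framebuffer per distinct attachment list and reuses it later,
// so switching render targets becomes a plain bind.
class RenderTargetCache {
public:
    typedef std::uint32_t Handle;

    Handle get(const std::vector<Attachment>& attachments) {
        assert(!attachments.empty());
        auto it = cache_.find(attachments);
        if (it != cache_.end()) return it->second;
        // Stand-in for creating and validating a real FBO once.
        const Handle h = ++created_;
        cache_.emplace(attachments, h);
        return h;
    }

    // One possible answer to removeTexture(): drop every cached
    // framebuffer that references the texture being removed.
    void removeTexture(std::uint32_t texture) {
        for (auto it = cache_.begin(); it != cache_.end();) {
            bool uses = false;
            for (const Attachment& a : it->first) {
                if (a.texture == texture) uses = true;
            }
            if (uses) it = cache_.erase(it);
            else ++it;
        }
    }

    std::size_t createdCount() const { return created_; }
    std::size_t size() const { return cache_.size(); }

private:
    std::map<std::vector<Attachment>, Handle> cache_;
    Handle created_ = 0;
};
```

Asking twice for the same attachment list returns the same handle, so the expensive part only happens the first time, ideally before the first frame is drawn.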

D) The only way to cope with that is to introduce a "layered" system with an integer Z coordinate. This way you can defer all of the draw2D functions and draw layer by layer while concatenating meshbuffers for GUI elements sharing the same texture. Complex static GUI elements could even be put into a VBO (for example, text).
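
The layered approach in (D) could be sketched like this. Deferred2D, Quad and Batch are made-up names, and the sketch assumes that elements within one layer do not overlap, so reordering them by texture inside a layer is safe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical deferred 2D draw command.
struct Quad {
    int layer;              // integer Z coordinate; lower layers draw first
    std::uint32_t texture;  // texture id used by this element
    float x, y, w, h;
};

// One batch maps to one upload and one draw call.
struct Batch {
    int layer;
    std::uint32_t texture;
    std::vector<Quad> quads;
};

class Deferred2D {
public:
    // Called instead of drawing immediately.
    void draw(const Quad& q) { pending_.push_back(q); }

    // Sort by layer, then by texture, and merge neighbours that share both.
    std::vector<Batch> flush() {
        std::stable_sort(pending_.begin(), pending_.end(),
                         [](const Quad& a, const Quad& b) {
                             if (a.layer != b.layer) return a.layer < b.layer;
                             return a.texture < b.texture;
                         });
        std::vector<Batch> batches;
        for (const Quad& q : pending_) {
            if (batches.empty() || batches.back().layer != q.layer ||
                batches.back().texture != q.texture) {
                batches.push_back(Batch{q.layer, q.texture, {}});
            }
            assert(!batches.empty());
            batches.back().quads.push_back(q);
        }
        pending_.clear();
        return batches;
    }

private:
    std::vector<Quad> pending_;
};
```

Five GUI elements spread over two layers and three textures would come out as three batches instead of five back-to-back upload+draw pairs.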

There is another problem:
E) I don't know if this has been solved, but I had to remove the Irrlicht-side copies of texture data. The reason is that you already have a version on the GPU, and the driver possibly keeps one for GPU VRAM paging as well... so on AMD and NVidia you end up with 3 texture copies (1 on the GPU, 1 held by the driver in RAM, 1 held by Irrlicht in RAM).

AAH AND ONE FINAL REMARK,

We NEED every ISceneNode to have a pointer to an OccluderMesh made out of a new vertex format that only stores positions + indices (no texture coordinates, normals or colors). We also need a new pass in the ISceneManager (like shadows, transparent etc.) called Z-PrePass, which gives you a conservative approximation of the depth buffer. Against that you can do transform feedback quadtree culling or just simple occlusion queries, without having to extrapolate stuff from the previous frame.

CuteAlien · Post by **CuteAlien** » Wed Feb 25, 2015 7:11 pm

About E) We (mostly Nadro) worked on that in the ogl-es branch. Others will hopefully follow one day (it was one of the first things I reported to Irrlicht before I was in the team, so it's been open a while...). It's useful so far for faster texture-locking, but mainly needed in Irrlicht for font-printing.

hendu · Post by **hendu** » Wed Feb 25, 2015 9:51 pm

For things only used once, GL 4.5 has the pinned memory extension, which causes the GPU to read directly from system RAM, with no transfer to VRAM. The future is APUs with a single address space and no special memory anyway.
