65% FPS Increase from MultiDrawIndirect


Post by devsh

I just made a vertex-pushing benchmark that simulates BaW-like loads.

It draws lots of unique meshbuffers (2k to 4k) with non-trivial polygon counts (from 32 to 8000 each). They cannot be instanced, and each has its own draw attributes such as transformations, etc.

I packed all vertex and index buffer data in one buffer, and used one VAO (mesh format descriptor) for all meshbuffers.
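A minimal sketch of what that packing could look like on the CPU side, not the benchmark's actual code. `MeshData`, `PackedScene` and `packMeshes` are hypothetical names, the vertex layout (plain floats) is an assumption, and so is using `baseInstance` as a per-draw ID. Only the `DrawElementsIndirectCommand` field layout comes from OpenGL itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same field order and sizes as the DrawElementsIndirectCommand structure
// that glMultiDrawElementsIndirect reads from the indirect buffer.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices for this draw
    std::uint32_t instanceCount; // 1 here, since nothing is instanced
    std::uint32_t firstIndex;    // offset into the shared index buffer, in indices
    std::int32_t  baseVertex;    // value added to every index of this draw
    std::uint32_t baseInstance;  // assumption: reused as a per-draw ID
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "commands must be tightly packed");

// Hypothetical CPU-side mesh: raw vertex floats, indices local to the mesh.
struct MeshData {
    std::vector<float> vertices;
    std::vector<std::uint32_t> indices;
};

// Hypothetical result: one vertex array, one index array, one command per mesh.
struct PackedScene {
    std::vector<float> vertexData;
    std::vector<std::uint32_t> indexData;
    std::vector<DrawElementsIndirectCommand> commands;
};

PackedScene packMeshes(const std::vector<MeshData>& meshes,
                       std::size_t floatsPerVertex)
{
    PackedScene out;
    for (std::size_t i = 0; i < meshes.size(); ++i) {
        const MeshData& mesh = meshes[i];

        DrawElementsIndirectCommand cmd;
        cmd.count = static_cast<std::uint32_t>(mesh.indices.size());
        cmd.instanceCount = 1;
        // Offsets are taken before appending, so they point at this mesh's data.
        cmd.firstIndex = static_cast<std::uint32_t>(out.indexData.size());
        cmd.baseVertex =
            static_cast<std::int32_t>(out.vertexData.size() / floatsPerVertex);
        cmd.baseInstance = static_cast<std::uint32_t>(i);
        out.commands.push_back(cmd);

        // Indices stay mesh-local because baseVertex is added by the GPU.
        out.vertexData.insert(out.vertexData.end(),
                              mesh.vertices.begin(), mesh.vertices.end());
        out.indexData.insert(out.indexData.end(),
                             mesh.indices.begin(), mesh.indices.end());
    }
    return out;
}
```

The three arrays would then go into the vertex buffer, the index buffer and a buffer bound to GL_DRAW_INDIRECT_BUFFER, so that one glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, drawcount, 0) call can walk all the commands.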

Couldn't be bothered to implement frustum culling at first, so in both modes I just do the viewProj*world transformation and normal calculation for all objects on the CPU and then upload the results with glBufferSubData.

Over 16 million triangles were submitted for drawing, and it ran at:
130 FPS with one glDrawElements per object
210 FPS with just one glMultiDrawElementsIndirect

You might not get the same results. Once you oversaturate the GPU with polygons, the bottleneck moves from draw command processing/submission to vertex shading and triangle rasterization: when I increased the max polycount per mesh to 64k and the total went up to 50 million, I had almost 0% FPS difference.
The same might happen if the number of meshbuffers drawn is <1000, as the overhead of multiple explicit CPU draw calls might not be big enough to notice.

If you get really low FPS (<100), you might want to recompile and try with a lower max value for the uniform distribution, which will give you a lower overall polycount.
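As a rough illustration of why lowering the max value helps, here is a sketch of drawing per-mesh polygon counts from a uniform distribution. The function name, parameters and seed are hypothetical and not taken from the benchmark source.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: one polygon count per mesh, uniformly distributed
// in the inclusive range [minPolys, maxPolys].
std::vector<std::uint32_t> randomPolyCounts(std::size_t meshCount,
                                            std::uint32_t minPolys,
                                            std::uint32_t maxPolys,
                                            std::uint32_t seed)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uint32_t> dist(minPolys, maxPolys);

    std::vector<std::uint32_t> counts;
    counts.reserve(meshCount);
    for (std::size_t i = 0; i < meshCount; ++i) {
        counts.push_back(dist(rng));
    }
    return counts;
}
```

The expected total is about meshCount * (minPolys + maxPolys) / 2, so 4000 meshes in the 32 to 8000 range land at roughly 16 million, and halving the max roughly halves the load.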

A Linux binary is pre-built; you can press space to dynamically switch between the two modes.
https://github.com/buildaworldnet/Irrli ... tVSCPUCull

This is really amazing, as it's a roughly 65% performance increase (about 62% going by the rounded FPS figures above) on a single core, without any command buffer recording across multiple cores. So imagine the performance gains that Vulkan can bring.

I'll keep you posted on my results once I implement box-frustum culling on both the CPU and GPU, but my feeling is that the performance gap will widen.
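For reference, one common way to do a box-frustum test on the CPU is the conservative "furthest corner" check sketched below. This is only an assumption about what such culling might look like, not the planned implementation; `Plane`, `AABB` and `aabbIntersectsFrustum` are illustrative names, and planes are assumed to point inward.

```cpp
#include <array>

// A point (x, y, z) is on the inner side of the plane when
// nx*x + ny*y + nz*z + d >= 0.
struct Plane {
    float nx, ny, nz, d;
};

// Axis-aligned bounding box given by its minimum and maximum corners.
struct AABB {
    float min[3];
    float max[3];
};

// Returns false only when the box is fully behind at least one plane.
// Conservative: a box near a frustum corner may pass while being outside.
bool aabbIntersectsFrustum(const AABB& box, const std::array<Plane, 6>& planes)
{
    for (const Plane& p : planes) {
        // Pick the box corner that lies furthest along the plane normal.
        const float x = p.nx >= 0.0f ? box.max[0] : box.min[0];
        const float y = p.ny >= 0.0f ? box.max[1] : box.min[1];
        const float z = p.nz >= 0.0f ? box.max[2] : box.min[2];

        // If even that corner is behind the plane, the whole box is outside.
        if (p.nx * x + p.ny * y + p.nz * z + p.d < 0.0f) {
            return false;
        }
    }
    return true;
}
```

Objects that fail the test would simply get no draw command (or a command with instanceCount set to 0), which is where a single indirect buffer is convenient.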