A suggestion to optimize a bit the instanced mesh scene node

Discuss about anything related to the Irrlicht Engine, or read announcements about any significant features or usage changes.
Post Reply
Mel
Competition winner
Posts: 2292
Joined: Wed May 07, 2008 11:40 am
Location: Granada, Spain

A suggestion to optimize a bit the instanced mesh scene node

Post by Mel »

The current implementation isn't the best actually, if you ask me i'd do it other way, i don't know how big can a constants buffer be in DX11 (i don't know either on GL) but it is larger than DX9, that's for sure.

The current implementation adds an extra transformation matrix per vertex. Isn't that overkill? (unless i am missing something that is) Isn't it easier to add an integer that refers to a transformation matrix inside an array instead? And that's using shader instancing, not the "true" instancing DX provides. Since DX9 it is possible to use a secondary buffer to store transformation matrices and indices, i don't know exactly the details, but it gets the one mesh-multiple-copies thing that can be expected of an instancing system. None the less, this is just a small criticism, the current system is really nice.

What i propose though is the usage of a map (and irrlicht has one implemented) instead of an array. The maps provide a much convenient way to add and remove objects from a set which need to be identified, so you can remove an object by its identifier directly, instead of checking through the entire array linearly (which can be, obviously, big) constantly, and the objects can still be accessed iteratively because maps support iterators, thus, if you need to process every item of the map it is also possible. The access time to an item in a map is logaritmic because the map actually "sorts" (i.e. stores the items in a red-black tree) the items when you insert them, so i think it could be an interesting thing to try.
"There is nothing truly useless, it always serves as a bad example". Arthur A. Schmitt
christianclavet
Posts: 1638
Joined: Mon Apr 30, 2007 3:24 am
Location: Montreal, CANADA
Contact:

Re: A suggestion to optimize a bit the instanced mesh scene

Post by christianclavet »

Hi, Mel! On what branch is the instanced node implementation is done?
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: A suggestion to optimize a bit the instanced mesh scene

Post by devsh »

The current implementation adds an extra transformation matrix per vertex. Isn't that overkill? (unless i am missing something that is) Isn't it easier to add an integer that refers to a transformation matrix inside an array instead? And that's using shader instancing, not the "true" instancing DX provides. Since DX9 it is possible to use a secondary buffer to store transformation matrices and indices, i don't know exactly the details, but it gets the one mesh-multiple-copies thing that can be expected of an instancing system. None the less, this is just a small criticism, the current system is really nice.
Thats the exact definition of shooting yourself in the foot.

There are 5 ways of doing HW instancing:
1) 8bit integer ID per vertex referring to transformation matrix in the uniform (constant) shader array
Space Complexity: O(Instances*VerticesPerMesh)
You can usually squeeze 1024 floats in on the shitty intel GPUs, you can actually query the max number of uniform slots on the GPU (a slot is one vec4/float4), but beware you must subtract the memory you're using to store OTHER shader constants (sampler IDs etc.) to get the true value of memory left for instance data.
You also need to be clever on how you store your matrices, cause you can use 1024 floats in many ways

Code: Select all

 
You only need translation = 3 floats => can fit 341 instances in one shader draw call
You only need rotation = Quaternions which are 4 floats => can fit 256 instances
You only need translation+rotation = 7 floats =>146 instances
You need everything => DONT BE AN IDIOT, USE a 3x4 Matrix not a 4x4 => 85 instances
You are super economic and use 3 floats for scale, 3 floats for translation and a quaternion => 102 instances
 
2) an integer (no more than 16bits) per vertex which you then translate into a texture coordinate in a 2D RGBA32F texture containing matrix data in 3 or 4 rows and read that in the vertex shader (which will be slower than method 1), this can enable you to fire off 2^16 instances in one go, but you'd need a 32bit index buffer and a vertex buffer much longer than 2^16 (more than 3 vertices in the mesh). You must also handle updating the texture if the transformations change (asynchronous texture uploads, streaming)
Space Complexity: O(Instances*VerticesPerMesh)
12a) some other alternative that reads supplementary data (matrices), from a source available to the vertex shader.
3) Smart instancing with transform feedback... pre-transform all the vertices of your instances and write that out into a transform feedback buffer. Sure you cant do that every frame, because it defeats the purpose... but if you only update transformations sporadically, this is a win because you dont clog up all your shader uniform/constant buffers when you actually render the mesh and you dont perform transforms every time you render.
4) OpenGL 3.2/DX 10 instancing, you have a draw instanced call and in the vertex shader you now have "gl_InstanceID" which enabled you to duplicate (1) without messing with the vertex format
5) Same as (4) only now instead of shader uniform buffers, you use special instance buffers (just like VBOs, but they store per-instance data)
6) OpenGL 4.0+/DX 11 instancing, you can use the GPU to fill the instance buffer mentioned in (5) as well as provide the instance count, vertex count, primitive count, offset buffers .... etc. (very useful for GPU-decided LoD and culling)



Also, gently caress maps... what do you need them for? Array is good... O(n) to fill the entire thing, not O(n log n)
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: A suggestion to optimize a bit the instanced mesh scene

Post by devsh »

This is a new fun thing that I did not think possible
OpenGL 3.3 arrived with the brand new ARB_instanced_arrays extension now being a core feature. With this addition you could use actual Vertex Buffer Objects (VBOs) to deliver your data. Yay!
There is a restriction to it however, you can only pass 16 vertex attributes to your vertex shader (by specification, or GL_MAX_VERTEX_ATTRIB_BINDINGS), which makes it 16 * vec4 = 64 float values.
http://www.informit.com/articles/articl ... 0&seqNum=5

It appears that with this method you can set different sampling intervals for different vertex attribute streams
Before the instancing API, there was simply the vertex stream frequency divider. His
allowed the user to have some control over how the data in a vertex buffer was iterated over.
It is controlled via the method:
IDirect3dDevice9::SetStreamSourceFreq(
UINT StreamNumber, UINT Setting)
This method was used to specify the stream di
vider for the specified stream. That is, when
iterating over the VB, the vertex processing system would only increment the element
pointer in the stream once for every N “vertices”
processed. This means that with a divider
of 2, each element in the stream would be used twice.
In OpenGL its glVertexAttribDivisor()

in that case you could actually store the matrices in vertex attributes with divisors equal to the mesh's vertex count
hendu
Posts: 2600
Joined: Sat Dec 18, 2010 12:53 pm

Re: A suggestion to optimize a bit the instanced mesh scene

Post by hendu »

Slight correction, the divisor in GL is not per-vertex but per-instance.
Mel
Competition winner
Posts: 2292
Joined: Wed May 07, 2008 11:40 am
Location: Granada, Spain

Re: A suggestion to optimize a bit the instanced mesh scene

Post by Mel »

@christianclavet
In the shader branch, the trunk doesn't contain any implementation about instancing yet.

About the efficiency, it is o(n) to fill the thing yes, but o(n) to look for a single item and delete it. The searches performed over ordered sets have a complexity of log(n), keeping in mind that the set of instances can be modified in run time, while the insertion time is worse (log(n) in each case because it has to look for a suitable place) the deletion is also log(n).

In the worst case, adding n instances (the insertion time is constant) and deleting n instances with an array, is n+ n linear searches -> O(n^2)
with a map (or any other asociative structure like a set or an unordered set), it is n log(n) during the insertions (n searches) and n log(n) (another n searches) also during the deletions -> O(n log(n))

But it is just a small remmark.
6) OpenGL 4.0+/DX 11 instancing, you can use the GPU to fill the instance buffer mentioned in (5) as well as provide the instance count, vertex count, primitive count, offset buffers .... etc. (very useful for GPU-decided LoD and culling)
That is exactly the instancing i would expect. The current instancing implementation doesn't make use of the DX11/DX9 features at all (nor the GL features, i suspect) while the effort is good, and the irrlicht interface looks appropriate to create/remove instances on the fly, the creation of the aditional buffers involves less code and memory movement than the current system to add transformation matrices to the geometry copies. And i'd encourage whoever is in charge of this to try and use the proper instancing in both GL and DX :)
"There is nothing truly useless, it always serves as a bad example". Arthur A. Schmitt
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Re: A suggestion to optimize a bit the instanced mesh scene

Post by devsh »

A map is a wrong idea, the stuff moves around in memory

Imagine you have M instances

You dont want to modify the vertex buffer EVER, you either draw instances from a big buffer (pseudo instancing) or a small one (proper).. either way you draw 1<=n<=MAX_MESHBUFFER_SZ/InstanceMeshSize or 1<=n<=MAX_INSTACES
The only thing you modify is the auxilary array of matrices, that you use to draw, hence you need to populate a buffer or array of size n with n elements from M, now finding an element in M will take O(logM)
Hence to fill the auxilary buffer for each draw in a frame will take O(MlogM)
You really want something that has O(1) search time

We made an assumption here that you cant just traverse the tree and draw the instances in that order (which you often cant).

THATS COST INCURRED EVERY FRAME

Its not like you're going to add and delete every instance every frame, hence a higher modify cost in the global storage is justified
You may move instances around every frame, so maybe something with a O(1) search time would be nicer -> only takes you O(m) to animate the scene instead of O(m log m), and aside from rendering, yet another argument against red-black trees


P.S. Asymptiotic analysis fails in RT-CG
Granyte
Posts: 850
Joined: Tue Jan 25, 2011 11:07 pm
Contact:

Re: A suggestion to optimize a bit the instanced mesh scene

Post by Granyte »

I'm not quite sure what you guys are talking about but the current implementation work the same way as the dx9 instancing example from the dx sdk

it use only 1 matrice per mesh instance that matrice is stored in a vertex buffer thats is steped only once per mesh instance using the stream frequency methode

also i'm looking at a way to expose the fact that in dx you can step a stream buffer every two time you draw a mesh(or w/e amount you wish) but i have no idea what kind of capabilities OGL expose so i don't know how to properly do it for both
The_Glitch
Competition winner
Posts: 523
Joined: Tue Jan 15, 2013 6:36 pm

Re: A suggestion to optimize a bit the instanced mesh scene

Post by The_Glitch »

Damn Irrlicht supports Instancing.....
Post Reply