SSE vector3df and matrix4

Discuss about anything related to the Irrlicht Engine, or read announcements about any significant features or usage changes.
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

SSE vector3df and matrix4

Post by devsh »

I want to implement SSE 3D vectors and matrices to speed up the CPU side of the engine by 50%. I have some results...

simple assignment (constructor and = operator) using vector3df with SSE takes 1400ms +/- 50ms for 50 MILLION assignments (loop)

standard Irrlicht takes 1680ms +/- 50ms

That is JUST assignment

update:
assignment from individual values (x, y, z) is slower with SSE
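For reference, a minimal sketch of what such an SSE-backed vector might look like (the type and member names here are illustrative, not Irrlicht's actual vector3df): the three components are padded to four floats so the whole vector fits one 16-byte-aligned `__m128`, and assignment becomes a single 128-bit move instead of three scalar copies.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Illustrative SSE-backed 3D vector (not Irrlicht's vector3df): x, y, z are
// padded with a fourth unused lane so the vector occupies one aligned __m128.
struct alignas(16) vector3dfSSE
{
    __m128 v; // lanes: (x, y, z, unused)

    vector3dfSSE() : v(_mm_setzero_ps()) {}
    explicit vector3dfSSE(float n) : v(_mm_set1_ps(n)) {} // splat one float
    vector3dfSSE(float x, float y, float z) : v(_mm_set_ps(0.f, z, y, x)) {}

    // one 128-bit register move instead of three scalar copies
    vector3dfSSE& operator=(const vector3dfSSE& o) { v = o.v; return *this; }

    void store(float out[4]) const { _mm_storeu_ps(out, v); }
};
```

Note that `_mm_set_ps` takes its arguments highest lane first, which is why x is the last argument.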
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany
Contact:

Post by hybrid »

Show some code first. And be aware that 300ms saved across 50 million operations won't help a lot. I'd guess that workload is equivalent to several hundred frames, so the savings are well below a few percent; you won't really notice it.
slavik262
Posts: 753
Joined: Sun Nov 22, 2009 9:25 pm
Location: Wisconsin, USA

Post by slavik262 »

Irrlicht's vector types are clean and fast. Any efforts to improve them would be marginal at best. The main determinant of speed when programming is the algorithm you choose. Tweaking the underlying data types to go a few percent faster really isn't worth the effort.
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Post by devsh »

SSE is SSE; it boosts 4D vector ops by 400%.

I have a 13% increase on the += operator,
none on the + operator,
and 30% on just the assignment from ONE float, i.e. vector3df(1.f).
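A standalone sketch of the += case being measured, under the assumption that the three components live in one 16-byte-aligned `__m128` with an unused fourth lane (the wrapper type here is hypothetical, not Irrlicht's): the three scalar adds collapse into a single `_mm_add_ps`.

```cpp
#include <xmmintrin.h>

// Hypothetical padded vector: one _mm_add_ps replaces three scalar additions.
struct alignas(16) vec3sse
{
    __m128 v; // lanes: (x, y, z, unused)

    vec3sse(float x, float y, float z) : v(_mm_set_ps(0.f, z, y, x)) {}

    vec3sse& operator+=(const vec3sse& o)
    {
        v = _mm_add_ps(v, o.v); // single 4-wide add
        return *this;
    }
};
```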
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany
Contact:

Post by hybrid »

Show code! A full regression suite with correct time measurement.
I'll move this to the discussion forum while it's not finished. You can decide later whether it deserves a new post in Code Snippets once you have final(!) code.
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Post by devsh »

I ran a couple of benchmarks... SSE doesn't help with vector3df; I could see that before I even implemented operators like div and sub. However, I will look at matrices. Also, could you please point me to areas where large arrays of floating-point numbers are multiplied or added?
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Post by devsh »

I tried it all... I even implemented SSE in the particle system in hopes of speeding up the billboarding. SSE simply doesn't accelerate non-16-byte-aligned floats; it's just as fast as the scalar code. I conclude we shouldn't bother with SSE, as this would have us rewriting Irrlicht.

SSE starts to lose its 20% advantage after enabling the -O3 flag in GCC.
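The alignment point is worth spelling out: the aligned SSE load/store intrinsics require 16-byte-aligned addresses (misaligned use is undefined behaviour), while the unaligned variants work anywhere but have historically cost extra cycles, which eats the SIMD gain on data Irrlicht doesn't align. A sketch of both, on plain float arrays:

```cpp
#include <xmmintrin.h>

// Scale four floats at once. The aligned version requires p to be 16-byte
// aligned (_mm_load_ps on a misaligned pointer is undefined behaviour);
// the unaligned version works at any address but has traditionally been slower.
void scale4Aligned(float* p, float s)   // p must be 16-byte aligned
{
    _mm_store_ps(p, _mm_mul_ps(_mm_load_ps(p), _mm_set1_ps(s)));
}

void scale4Unaligned(float* p, float s) // any alignment
{
    _mm_storeu_ps(p, _mm_mul_ps(_mm_loadu_ps(p), _mm_set1_ps(s)));
}
```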
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Post by devsh »

I decided to give it one last shot... I have made core::matrix4sse and... it performs at avg. 4700ms, while the standard matrix4 performs at avg. 8800ms.

Here is the test code; of course, in the native Irrlicht version the matrix4sse declarations change to matrix4.

Code: Select all

//We multiply many matrices together so it actually takes some time
#define ARRAY_SIZE 1024*64

void ComputeArrayCPlusPlus(core::matrix4sse* vec, core::matrix4sse* other, u32 i)
{
    vec[i] = other[i]*other[i]*vec[i]*0.00001526f;
}



/* ..... some code, and in the main function .... */


	core::matrix4sse vec[ARRAY_SIZE];
	core::matrix4sse vec2[ARRAY_SIZE+1];
	for (u32 i = 0; i<ARRAY_SIZE; i++)
	{
        vec2[i][0] = sinf(i*55.f-1.f);
        vec2[i][1] = cosf(i+1.f);
        vec2[i][2] = cosf(i*128.f-128.f);
        vec2[i][3] = sinf(i*55.f-1.f);
        vec2[i][4] = 1.f/sinf(i*55.f-1.f);
        vec2[i][5] = 1.f/cosf(i+1.f);
        vec2[i][6] = 1.f/cosf(i*128.f-128.f);
        vec2[i][7] = 1.f/sinf(i*55.f-1.f);
        vec2[i][8] = sinf(i*5.f-1.f);
        vec2[i][9] = cosf(i+2.f);
        vec2[i][10] = cosf(i*18.f-12.f);
        vec2[i][11] = sinf(i*5.f-11.f);
        vec2[i][12] = 1.f/sinf(i*535.f-12.f);
        vec2[i][13] = 1.f/cosf(i*0.25f+14.f);
        vec2[i][14] = 1.f/cosf(i*12.f-18.f);
        vec2[i][15] = 1.f/sinf(i*0.5f-111.f);
	}
	for (u32 i = 0; i<ARRAY_SIZE; i++)
	{
        vec[i][0] = sinf(i*551.f-1.f);
        vec[i][1] = cosf(i+11.f);
        vec[i][2] = cosf(i*1268.f-1238.f);
        vec[i][3] = sinf(i*535.f-13.f);
        vec[i][4] = 15.f/sinf(i*55.f-1.f);
        vec[i][5] = 17.f/cosf(i+1.f);
        vec[i][6] = 13.f/cosf(i*128.f-128.f);
        vec[i][7] = 14.f/sinf(i*55.f-1.f);
        vec[i][8] = sinf(i*56.f-1.f);
        vec[i][9] = cosf(i+261.f);
        vec[i][10] = cosf(i*1813.f-112.f);
        vec[i][11] = sinf(i*56.f-131.f);
        vec[i][12] = 3.f/sinf(i*535.f-12.f);
        vec[i][13] = 31.f/cosf(i*0.25f+14.f);
        vec[i][14] = 3.f/cosf(i*12.f-18.f);
        vec[i][15] = 3.f/sinf(i*0.5f-111.f);
	}
	u32 time = device->getTimer()->getRealTime();
	for (u32 j = 0; j<512; j++)
    {
        for (u32 i = 0; i<ARRAY_SIZE; i++)
            ComputeArrayCPlusPlus(vec,vec2,i);
    }
    printf("Time Taken: %u  Result: %f, %f, %f \n", device->getTimer()->getRealTime()-time, vec[123][10], vec[123][1], vec[123][15]);

EDIT: I do realize I would need the devs' open-mindedness... I can see this is not likely to make it into the engine, and moreover the SSE needs to be written in inline ASM to be truly effective, which I can't do. So here is my matrix4 stub:

Code: Select all

// Copyright (C) 2002-2009 Nikolaus Gebhardt
// This file is part of the "Irrlicht Engine".
// For conditions of distribution and use, see copyright notice in irrlicht.h

#ifndef __IRR_MATRIX_4_SSE_H_INCLUDED__
#define __IRR_MATRIX_4_SSE_H_INCLUDED__

#include "irrMath.h"
#include "vector3d.h"
#include "vector2d.h"
#include "plane3d.h"
#include "aabbox3d.h"
#include "rect.h"
#include "irrString.h"
#include <xmmintrin.h>

// enable this to keep track of changes to the matrix
// and make simpler identity check for seldomly changing matrices
// otherwise identity check will always compare the elements
//#define USE_MATRIX_TEST

// this is only for debugging purposes
//#define USE_MATRIX_TEST_DEBUG

namespace irr
{
namespace core
{

	//! 4x4 matrix. Mostly used as transformation matrix for 3d calculations.
	/** The matrix is a D3D style matrix, row major with translations in the 4th row. */
	class matrix4sse
	{
		public:

			//! Constructor Flags
			enum eConstructor
			{
				EM4CONST_NOTHING = 0,
				EM4CONST_COPY,
				EM4CONST_IDENTITY,
				EM4CONST_TRANSPOSED,
				EM4CONST_INVERSE,
				EM4CONST_INVERSE_TRANSPOSED
			};

			//! Default constructor
			/** \param constructor Choose the initialization style */
			matrix4sse( eConstructor constructor = EM4CONST_IDENTITY );
			//! Copy constructor
			/** \param other Other matrix to copy from
			\param constructor Choose the initialization style */
			matrix4sse(const matrix4sse& other, eConstructor constructor = EM4CONST_COPY);

			//! Simple operator for directly accessing every element of the matrix.
			f32& operator()(const s32 row, const s32 col) {	return ((f32*)(M+row))[col]; }

			//! Simple operator for directly accessing every element of the matrix.
			const f32& operator()(const s32 row, const s32 col) const { return ((f32*)(M+row))[col]; }

			//! Simple operator for linearly accessing every element of the matrix.
			f32& operator[](u32 index) { return ((f32*)M)[index]; }

			//! Simple operator for linearly accessing every element of the matrix.
			const f32& operator[](u32 index) const { return ((f32*)M)[index]; }

			//! Sets this matrix equal to the other matrix.
			inline matrix4sse& operator=(const matrix4sse &other);

			//! Sets all elements of this matrix to the value.
			inline matrix4sse& operator=(const f32& scalar);

			//! Returns pointer to internal array
			const f32* pointer() const { return (f32*)M; }
			f32* pointer()
			{
				return (f32*)M;
			}

			//! Multiply by another matrix.
			/** Calculate other*this */
			matrix4sse operator*(const matrix4sse& other) const;

			//! Multiply by another matrix.
			/** Calculate and return other*this */
			//matrix4sse& operator*=(const matrix4sse& other);

			//! Multiply by scalar.
			matrix4sse operator*(const f32& scalar) const;

			//! Set matrix to identity.
			inline matrix4sse& makeIdentity();

			//! Gets transposed matrix
			matrix4sse getTransposed() const;

			//! Gets transposed matrix
			inline void getTransposed( matrix4sse& dest ) const;

		private:
			//! Matrix data, stored in row-major order
			__m128 M[4];

	};

    // Default constructor
	inline matrix4sse::matrix4sse( eConstructor constructor )
	{
		switch ( constructor )
		{
			case EM4CONST_NOTHING:
			case EM4CONST_COPY:
				break;
			case EM4CONST_IDENTITY:
			case EM4CONST_INVERSE:
			default:
				makeIdentity();
				break;
		}
	}

	// Copy constructor
	inline matrix4sse::matrix4sse( const matrix4sse& other, eConstructor constructor)
	{
		switch ( constructor )
		{
			case EM4CONST_IDENTITY:
				makeIdentity();
				break;
			case EM4CONST_NOTHING:
				break;
			case EM4CONST_COPY:
				*this = other;
				break;
			case EM4CONST_TRANSPOSED:
				other.getTransposed(*this);
				break;/*
			case EM4CONST_INVERSE:
				if (!other.getInverse(*this))
					memset(M, 0, 16*sizeof(T));
				break;
			case EM4CONST_INVERSE_TRANSPOSED:
				if (!other.getInverse(*this))
					memset(M, 0, 16*sizeof(T));
				else
					*this=getTransposed();
				break;*/
		}
	}

    //! Multiply by scalar.
	inline matrix4sse matrix4sse::operator*(const f32& scalar) const
	{
		matrix4sse temp ( EM4CONST_NOTHING );

        __m128 scalarSSE = _mm_load1_ps(&scalar);

		temp.M[0] = _mm_mul_ps(M[0],scalarSSE);
		temp.M[1] = _mm_mul_ps(M[1],scalarSSE);
		temp.M[2] = _mm_mul_ps(M[2],scalarSSE);
		temp.M[3] = _mm_mul_ps(M[3],scalarSSE);

		return temp;
	}

    //! multiply by another matrix
	inline matrix4sse matrix4sse::operator*(const matrix4sse& m2) const
	{
		matrix4sse m3 ( EM4CONST_NOTHING );

/*
        m3.M[0] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_load1_ps(((f32*)m2.M)+0)),_mm_mul_ps(M[1],_mm_load1_ps(((f32*)m2.M)+1))),_mm_add_ps(_mm_mul_ps(M[2],_mm_load1_ps(((f32*)m2.M)+2)),_mm_mul_ps(M[3],_mm_load1_ps(((f32*)m2.M)+3))));
        m3.M[1] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_load1_ps(((f32*)m2.M)+4)),_mm_mul_ps(M[1],_mm_load1_ps(((f32*)m2.M)+5))),_mm_add_ps(_mm_mul_ps(M[2],_mm_load1_ps(((f32*)m2.M)+6)),_mm_mul_ps(M[3],_mm_load1_ps(((f32*)m2.M)+7))));
        m3.M[2] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_load1_ps(((f32*)m2.M)+8)),_mm_mul_ps(M[1],_mm_load1_ps(((f32*)m2.M)+9))),_mm_add_ps(_mm_mul_ps(M[2],_mm_load1_ps(((f32*)m2.M)+10)),_mm_mul_ps(M[3],_mm_load1_ps(((f32*)m2.M)+11))));
        m3.M[3] = _mm_add_ps(_mm_mul_ps(_mm_add_ps(M[0],_mm_load1_ps(((f32*)m2.M)+12)),_mm_mul_ps(M[1],_mm_load1_ps(((f32*)m2.M)+13))),_mm_mul_ps(_mm_add_ps(M[2],_mm_load1_ps(((f32*)m2.M)+14)),_mm_mul_ps(M[3],_mm_load1_ps(((f32*)m2.M)+15))));
*/
        m3.M[0] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_shuffle_ps(m2.M[0],m2.M[0],0x00)),_mm_mul_ps(M[1],_mm_shuffle_ps(m2.M[0],m2.M[0],0x55))),_mm_add_ps(_mm_mul_ps(M[2],_mm_shuffle_ps(m2.M[0],m2.M[0],0xaa)),_mm_mul_ps(M[3],_mm_shuffle_ps(m2.M[0],m2.M[0],0xff))));
        m3.M[1] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_shuffle_ps(m2.M[1],m2.M[1],0x00)),_mm_mul_ps(M[1],_mm_shuffle_ps(m2.M[1],m2.M[1],0x55))),_mm_add_ps(_mm_mul_ps(M[2],_mm_shuffle_ps(m2.M[1],m2.M[1],0xaa)),_mm_mul_ps(M[3],_mm_shuffle_ps(m2.M[1],m2.M[1],0xff))));
        m3.M[2] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_shuffle_ps(m2.M[2],m2.M[2],0x00)),_mm_mul_ps(M[1],_mm_shuffle_ps(m2.M[2],m2.M[2],0x55))),_mm_add_ps(_mm_mul_ps(M[2],_mm_shuffle_ps(m2.M[2],m2.M[2],0xaa)),_mm_mul_ps(M[3],_mm_shuffle_ps(m2.M[2],m2.M[2],0xff))));
        m3.M[3] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(M[0],_mm_shuffle_ps(m2.M[3],m2.M[3],0x00)),_mm_mul_ps(M[1],_mm_shuffle_ps(m2.M[3],m2.M[3],0x55))),_mm_add_ps(_mm_mul_ps(M[2],_mm_shuffle_ps(m2.M[3],m2.M[3],0xaa)),_mm_mul_ps(M[3],_mm_shuffle_ps(m2.M[3],m2.M[3],0xff))));
/*
		const f32 *m1 = (f32*)M;

		m3[0] = m1[0]*m2[0] + m1[4]*m2[1] + m1[8]*m2[2] + m1[12]*m2[3];
		m3[1] = m1[1]*m2[0] + m1[5]*m2[1] + m1[9]*m2[2] + m1[13]*m2[3];
		m3[2] = m1[2]*m2[0] + m1[6]*m2[1] + m1[10]*m2[2] + m1[14]*m2[3];
		m3[3] = m1[3]*m2[0] + m1[7]*m2[1] + m1[11]*m2[2] + m1[15]*m2[3];

		m3[4] = m1[0]*m2[4] + m1[4]*m2[5] + m1[8]*m2[6] + m1[12]*m2[7];
		m3[5] = m1[1]*m2[4] + m1[5]*m2[5] + m1[9]*m2[6] + m1[13]*m2[7];
		m3[6] = m1[2]*m2[4] + m1[6]*m2[5] + m1[10]*m2[6] + m1[14]*m2[7];
		m3[7] = m1[3]*m2[4] + m1[7]*m2[5] + m1[11]*m2[6] + m1[15]*m2[7];

		m3[8] = m1[0]*m2[8] + m1[4]*m2[9] + m1[8]*m2[10] + m1[12]*m2[11];
		m3[9] = m1[1]*m2[8] + m1[5]*m2[9] + m1[9]*m2[10] + m1[13]*m2[11];
		m3[10] = m1[2]*m2[8] + m1[6]*m2[9] + m1[10]*m2[10] + m1[14]*m2[11];
		m3[11] = m1[3]*m2[8] + m1[7]*m2[9] + m1[11]*m2[10] + m1[15]*m2[11];

		m3[12] = m1[0]*m2[12] + m1[4]*m2[13] + m1[8]*m2[14] + m1[12]*m2[15];
		m3[13] = m1[1]*m2[12] + m1[5]*m2[13] + m1[9]*m2[14] + m1[13]*m2[15];
		m3[14] = m1[2]*m2[12] + m1[6]*m2[13] + m1[10]*m2[14] + m1[14]*m2[15];
		m3[15] = m1[3]*m2[12] + m1[7]*m2[13] + m1[11]*m2[14] + m1[15]*m2[15];*/
		return m3;
	}

	//! Set matrix to identity.
	inline matrix4sse& matrix4sse::makeIdentity()
	{
		// set whole rows so the off-diagonal elements are zeroed as well;
		// writing only the diagonal would leave the rest uninitialized
		M[0] = _mm_set_ps(0.f, 0.f, 0.f, 1.f);
		M[1] = _mm_set_ps(0.f, 0.f, 1.f, 0.f);
		M[2] = _mm_set_ps(0.f, 1.f, 0.f, 0.f);
		M[3] = _mm_set_ps(1.f, 0.f, 0.f, 0.f);
		return *this;
	}

	inline matrix4sse& matrix4sse::operator=(const matrix4sse &other)
	{
		if (this==&other)
			return *this;
		M[0] = other.M[0];
		M[1] = other.M[1];
		M[2] = other.M[2];
		M[3] = other.M[3];
		return *this;
	}


	inline matrix4sse& matrix4sse::operator=(const f32& scalar)
	{
		M[0] = M[1] = M[2] = M[3] = _mm_set1_ps(scalar);

		return *this;
	}

	// returns transposed matrix
	inline matrix4sse matrix4sse::getTransposed() const
	{
		matrix4sse t ( EM4CONST_NOTHING );
		getTransposed ( t );
		return t;
	}


	// returns transposed matrix
	inline void matrix4sse::getTransposed( matrix4sse& o ) const
	{
	    o=*this;
		_MM_TRANSPOSE4_PS(o.M[0], o.M[1], o.M[2], o.M[3]);
		/*o[ 0] = ((f32*)M)[ 0];
		o[ 1] = ((f32*)M)[ 4];
		o[ 2] = ((f32*)M)[ 8];
		o[ 3] = ((f32*)M)[12];

		o[ 4] = ((f32*)M)[ 1];
		o[ 5] = ((f32*)M)[ 5];
		o[ 6] = ((f32*)M)[ 9];
		o[ 7] = ((f32*)M)[13];

		o[ 8] = ((f32*)M)[ 2];
		o[ 9] = ((f32*)M)[ 6];
		o[10] = ((f32*)M)[10];
		o[11] = ((f32*)M)[14];

		o[12] = ((f32*)M)[ 3];
		o[13] = ((f32*)M)[ 7];
		o[14] = ((f32*)M)[11];
		o[15] = ((f32*)M)[15];*/
	}

} // end namespace core
} // end namespace irr

#endif


slavik262
Posts: 753
Joined: Sun Nov 22, 2009 9:25 pm
Location: Wisconsin, USA

Post by slavik262 »

devsh wrote:SSE is SSE; it boosts 4D vector ops by 400%.

I have a 13% increase on the += operator,
none on the + operator,
and 30% on just the assignment from ONE float, i.e. vector3df(1.f).
You're missing my point. Even if it is 14% faster, you're just getting a nominal performance bump at the cost of specializing the code to a certain processor architecture. IRRLICHT_FAST_MATH is disabled by default for similar reasons.
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany
Contact:

Post by hybrid »

Did you run the regression suite against your new implementation? It looks like the internal structure has changed so much that safe access through all operators is no longer guaranteed.
There's no problem with specializing for one or the other architecture. We have this all over the place. But the additional maintenance overhead must be justified, either by being very low or by giving much benefit.
slavik262
Posts: 753
Joined: Sun Nov 22, 2009 9:25 pm
Location: Wisconsin, USA

Post by slavik262 »

hybrid wrote:There's no problem with specializing for one or the other architecture. We have this all over the place. But the additional maintenance overhead must be justified, either by being very low or by giving much benefit.
What else in the engine is restricted to x86? I thought a big selling point of Irrlicht was that with a little tweaking, it can be compiled for Xbox, Windows Mobile, iPhone OS, Android, etc.
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Post by devsh »

Actually x86 and x64. Yeah, I get the point... even I don't think it's worth it.
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany
Contact:

Post by hybrid »

slavik262 wrote:
hybrid wrote:There's no problem with specializing for one or the other architecture. We have this all over the place. But the additional maintenance overhead must be justified, either by being very low or by giving much benefit.
What else in the engine is restricted to x86? I thought a big selling point of Irrlicht was that with a little tweaking, it can be compiled for Xbox, Windows Mobile, iPhone OS, Android, etc.
The code would not restrict anything to a certain platform. It would simply give certain extra support (performance, no extra functionality) on some platforms. Pretty much like d3d drivers, which are also windows only.
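A sketch of how such optional, platform-specific support is usually gated (the macro spellings below are the common GCC/Clang and MSVC ones, not anything Irrlicht defines): the SSE path compiles only where the target supports it, with a portable scalar loop as fallback, so no platform loses functionality.

```cpp
// Compile-time dispatch: SSE where available, scalar everywhere else.
// __SSE__ is GCC/Clang; _M_X64 and _M_IX86_FP are the MSVC equivalents.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  inline void add4(float* dst, const float* a, const float* b)
  {
      _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
  }
#else
  inline void add4(float* dst, const float* a, const float* b)
  {
      for (int i = 0; i < 4; ++i) // portable fallback, same result
          dst[i] = a[i] + b[i];
  }
#endif
```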
Simpe
Posts: 13
Joined: Tue Nov 30, 2010 12:54 pm

Post by Simpe »

devsh wrote:Actually x86 and x64. Yeah, I get the point... even I don't think it's worth it.
... together with the PPC architectures on the Xbox 360/PS3 as well. I wouldn't be surprised if SSE-style instructions become available on mobile devices quite soon. If you design your code to be SIMD-friendly, you can get quite huge optimizations in calculation-heavy places.

You might even get it so fast that it's faster/easier to do a brute-force solution (such as culling dynamic objects) than to use some sort of culling tree, due to the number of operations you can push through instead of stalling the CPU with L2 cache misses, branch mispredictions, pipeline stalls on in-order architectures, etc.

I think an implementation as an SSE library would be good, since it's designing the software for the future. Maybe don't replace the existing vectors, but instead add a library that can be used in places where there would be a gain from it.
devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK
Contact:

Post by devsh »

You can see from my tests that SSE vectors and matrices give little or no advantage. You need to put raw SSE into your mesh buffer calculations.
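What "raw SSE in the mesh buffer calculations" might look like: rather than wrapping individual vectors, stream over the whole packed float array four lanes at a time. A sketch, assuming the buffer is 16-byte aligned and its length is a multiple of four (a real version would handle the remainder and unaligned tails):

```cpp
#include <xmmintrin.h>

// Apply a uniform scale across a packed float buffer (e.g. vertex positions),
// four floats per iteration. Assumes 16-byte alignment and count % 4 == 0.
void scaleBuffer(float* data, unsigned count, float s)
{
    const __m128 scale = _mm_set1_ps(s);
    for (unsigned i = 0; i < count; i += 4)
        _mm_store_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), scale));
}
```

Batch loops like this amortize the load/store overhead that dominates when SSE is applied one small vector at a time.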
Post Reply