WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTORS!

Post those lines of code you feel like sharing or find what you require for your project here; or simply use them as tutorials.

WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTORS!

Postby devsh » Fri Sep 05, 2014 3:52 am

just a word of warning, even though 128bit alignment is forced on the vectorSIMDf.... you can't use malloc, free, core::array<> (without patching the allocator) or std::vector<> because they allocate memory dynamically and cant guarantee 16byte alignment. You need to always declare your static arrays of vectorSIMDs as 16byte aligned, dynamically allocate memory with align malloc/free (but you can use new and delete in the coming version) and use custom allcoators in STD/STL and irrlicht dynamic arrays/lists etc.

THIS MEANS THAT ANY STRUCT/CLASS CONTAINING SSE SIMD OBJECTS IS SUBJECT TO THE SAME PROBLEM OF HEAP ALLOCATION


But I did code some fixes for the above problems




For now:
-- the vectorized 4d float class (with exceptions of some functions to do with rotation)
-- the vectorized boolean class implemented only for 4 and 2 component wide vectors.
-- fixed up irrMath.h
-- aligned memory irrAllocator
-- aligned malloc/free wrappers

Please test!!!

If you have some suggestions for functions, such as reflect() feel free to contribute
***ALSO FEEL FREE TO CONTRIBUTE SOURCE CODE***

UPDATE 5/09/2014:

-- added support for adapting to featuresets SSE2,SSE3,SSSE3,SSE4.1,SSE4.2,AVX,AVX2 (only in some places, need to figure out where this happens)
-- added initalization of 2d and 3d vectors as 4d SIMDs with 0s in tail
-- added bitwise operation on 4df SIMD
-- added "makeSafeNd()" functions for zeroing out the last components
-- added GLSL style member names (lower case x,y,z,w and r,g,b,a and s,t,p,q)
-- removed SSSE3 requirement in favour of SSE3
-- implemented one of the rotate functions with SSE2 (no SSE3 needed)


UPDATE 5/09/2014:

--implemented all of the rotate functions (with and without center) - only got 1 function left to implement
--removed inline constructors
--caught a fatal bug (not returning the value of _mm_load_ps() in getAsRegister())
--added overloaded new and delete operators (so you can use new/ new[] and delete/ [] delete without worrying about 16byte alignment) //commented out
--fixed up irrMath.h to not use assembly SSE anymore
--implemented all functions of irrMath for floats for vectorSIMDf (except for floor and fract, look at target B)



UPDATE 30/04/2014:

--implemented the floor and fract functions for vectorSIMDf
--a few more functions added to vectorSIMDf
--implemented a matrixSIMDf class with basic functionality (without matrix building functions)
NOTE: THE MATRIX CLASS IS NOT A DROP IN REPLACEMENT OR IN FACT, ANY SORT OF COMPATIBLE REPLACEMENT FOR IRRLICHT'S CMatrix4<>/matrix4
THE ORDER OF THE ELEMENTS OF THE MATRIX IN MEMORY IS COMPLETELY DIFFERENT AND THE MATRICES MULTIPLY THE PROPER WAY AROUND



Targets:
A) add runtime or compile time (but strict) swizzle support for 8 and 16 component vectors
B) implement all the GLSL functions with 4df vectors
C) implement some spherical coordinate functions in vectorSIMDf
E) implement 32 and 16bit vectors of signed and unsigned integers
F) make conversion functions between all of the types
H) implement all the basic types of irrlicht in SIMD vectors
G) implement more funny functions from GLSL such as pow(), exp2() etc http://gruntthepeon.free.fr/ssemath/

All code provided is on the irrlicht license (and please attribute me and BaW)


as I progress I'll add more code and update the listings


For the fixed up irrMath.h, follow the link:http://irrlicht.sourceforge.net/forum/viewtopic.php?f=9&t=50230&p=289502#p289502

Minor Changes:
cpp Code: Select all
Index: irrlicht/source/Irrlicht/Irrlicht.cpp
===================================================================
 
 namespace core
 {
-   const matrix4 IdentityMatrix(matrix4::EM4CONST_IDENTITY);
+   const matrix4 IdentityMatrix(matrix4::EM4CONST_IDENTITY);
+#ifdef __IRR_COMPILE_WITH_X86_SIMD_
+   //const matrixSIMD4 IdentityMatrix(matrix4::EM4CONST_IDENTITY);
+#endif
    irr::core::stringc LOCALE_DECIMAL_POINTS(".");
 }
 
Index: irrlicht/include/irrlicht.h
===================================================================
@@ -158,7 +158,8 @@
 #include "Keycodes.h"
 #include "line2d.h"
 #include "line3d.h"
-#include "matrix4.h"
+#include "matrix4.h"
+#include "matrixSIMD4.h"
 #include "plane3d.h"
 #include "position2d.h"
 #include "quaternion.h"
@@ -183,7 +184,8 @@
 #include "SViewFrustum.h"
 #include "triangle3d.h"
 #include "vector2d.h"
-#include "vector3d.h"
+#include "vector3d.h"
+#include "vectorSIMD.h"
 
 
Index: irrlicht/include/IrrCompileConfig.h
===================================================================
@@ -13,6 +13,40 @@
 // it undefined
 //#define IRRLICHT_VERSION_SVN -alpha
 #define IRRLICHT_SDK_VERSION "1.8.1-baw"
+
+#define __IRR_COMPILE_WITH_X86_SIMD_
+
+#ifdef __IRR_COMPILE_WITH_X86_SIMD_
+#define __IRR_COMPILE_WITH_SSE2
+#define __IRR_COMPILE_WITH_SSE3
+
+#include <immintrin.h>
+
+#ifdef __SSE2__
+#define __IRR_COMPILE_WITH_SSE2
+#endif
+
+#ifdef __SSE3__
+#define __IRR_COMPILE_WITH_SSE3
+#endif
+
+#ifdef __SSE4_1__
+#define __IRR_COMPILE_WITH_SSE4_1
+#endif
+
+#ifdef __AVX__
+#define __IRR_COMPILE_WITH_AVX
+#endif
+
+
+
+#ifdef __IRR_COMPILE_WITH_AVX
+#define SIMD_ALIGNMENT 32
+#else
+#define SIMD_ALIGNMENT 16
+#endif // __IRR_COMPILE_WITH_AVX
+
+#endif
 
 #include <stdio.h> // TODO: Although included elsewhere this is required at least for mingw
 
@@ -673,7 +707,7 @@
 precision will be lower but speed higher. currently X86 only
 */
 #if !defined(_IRR_OSX_PLATFORM_) && !defined(_IRR_SOLARIS_PLATFORM_)
-   //#define IRRLICHT_FAST_MATH
+// #define IRRLICHT_FAST_MATH
    #ifdef NO_IRRLICHT_FAST_MATH
    #undef IRRLICHT_FAST_MATH
    #endif
 


if you use this allocator then you can use core::array<> with no problem (can result in more memory fragmentation, but will make all of your small memcpy's A LOT faster)
irrAllocator.h
cpp Code: Select all
// Copyright (C) 2002-2012 Nikolaus Gebhardt
// This file is part of the "Irrlicht Engine" and the "irrXML" project.
// For conditions of distribution and use, see copyright notice in irrlicht.h and irrXML.h
 
#ifndef __IRR_ALLOCATOR_H_INCLUDED__
#define __IRR_ALLOCATOR_H_INCLUDED__
 
#include "irrTypes.h"
#include <new>
// necessary for older compilers
#include <memory.h>
 
namespace irr
{
namespace core
{
 
#ifdef DEBUG_CLIENTBLOCK
#undef DEBUG_CLIENTBLOCK
#define DEBUG_CLIENTBLOCK new
#endif
 
//! Very simple allocator implementation, containers using it can be used across dll boundaries
 #ifdef __AVX__
template <typename T, std::size_t Alignment=32>
#else
template <typename T, std::size_t Alignment=16>
#endif
class irrAllocator
{
public:
 
    //! Destructor
    virtual ~irrAllocator() {}
 
    //! Allocate memory for an array of objects
    T* allocate(size_t cnt)
    {
        return (T*)internal_new(cnt* sizeof(T));
    }
 
    //! Deallocate memory for an array of objects
    void deallocate(T* ptr)
    {
        internal_delete(ptr);
    }
 
    //! Construct an element
    void construct(T* ptr, const T&e)
    {
        new ((void*)ptr) T(e);
    }
 
    //! Destruct an element
    void destruct(T* ptr)
    {
        ptr->~T();
    }
 
protected:
 
#ifdef __IRR_COMPILE_WITH_X86_SIMD_
    virtual void* internal_new(size_t cnt)
    {
        void *memoryallocatedaligned = 0;
#ifdef _IRR_WINDOWS_
        memoryallocatedaligned = _aligned_malloc(cnt,Alignment);
#else
        posix_memalign((void**)&memoryallocatedaligned,Alignment,cnt);
#endif
        return memoryallocatedaligned;
    }
 
    virtual void internal_delete(void* ptr)
    {
#ifdef _IRR_WINDOWS_
        _aligned_free(ptr);
#else
        free(ptr);
#endif
    }
#else
    virtual void* internal_new(size_t cnt)
    {
        return operator new(cnt);
    }
 
    virtual void internal_delete(void* ptr)
    {
        operator delete(ptr);
    }
#endif
};
 
 
//! Fast allocator, only to be used in containers inside the same memory heap.
/** Containers using it are NOT able to be used it across dll boundaries. Use this
when using in an internal class or function or when compiled into a static lib */

 #ifdef __AVX__
template <typename T, std::size_t Alignment=32>
#else
template <typename T, std::size_t Alignment=16>
#endif
class irrAllocatorFast
{
public:
 
#ifdef __IRR_COMPILE_WITH_X86_SIMD_
    //! Allocate memory for an array of objects
    T* allocate(size_t cnt)
    {
        cnt *= sizeof(T);
        T *memoryallocatedaligned = 0;
#ifdef _IRR_WINDOWS_
        memoryallocatedaligned = (T*)_aligned_malloc(cnt,Alignment);
#else
        posix_memalign((void**)&memoryallocatedaligned,Alignment,cnt);
#endif
        return memoryallocatedaligned;
    }
 
    //! Deallocate memory for an array of objects
    void deallocate(T* ptr)
    {
#ifdef _IRR_WINDOWS_
        _aligned_free(ptr);
#else
        free(ptr);
#endif
    }
#else
    //! Allocate memory for an array of objects
    T* allocate(size_t cnt)
    {
        return (T*)operator new(cnt* sizeof(T));
    }
 
    //! Deallocate memory for an array of objects
    void deallocate(T* ptr)
    {
        operator delete(ptr);
    }
#endif // __IRR_COMPILE_WITH_X86_SIMD_
    //! Construct an element
    void construct(T* ptr, const T&e)
    {
        new ((void*)ptr) T(e);
    }
 
    //! Destruct an element
    void destruct(T* ptr)
    {
        ptr->~T();
    }
};
 
 
 
#ifdef DEBUG_CLIENTBLOCK
#undef DEBUG_CLIENTBLOCK
#define DEBUG_CLIENTBLOCK new( _CLIENT_BLOCK, __FILE__, __LINE__)
#endif
 
//! defines an allocation strategy
enum eAllocStrategy
{
    ALLOC_STRATEGY_SAFE    = 0,
    ALLOC_STRATEGY_DOUBLE  = 1,
    ALLOC_STRATEGY_SQRT    = 2
};
 
 
} // end namespace core
} // end namespace irr
 
#endif
 



alligned malloc wrappers
cpp Code: Select all
    // aligned crossplatform malloc
    inline void* FW_malloc_align(size_t inNumBytes, size_t alignment)
    {
        void *memoryallocatedaligned = 0;
#ifdef _IRR_WINDOWS_
        memoryallocatedaligned = _aligned_malloc(inNumBytes,alignment);
#else
        posix_memalign((void**)&memoryallocatedaligned,alignment,inNumBytes);
#endif
        return memoryallocatedaligned;
    }
 
    // aligned crossplatform free
    inline void FW_free_align(void *alignedMemoryBlock)
    {
#ifdef _IRR_WINDOWS_
        _aligned_free(alignedMemoryBlock);
#else
        free(alignedMemoryBlock);
#endif
    }


Changes to get irrMath.h working
cpp Code: Select all
 
Index: irrlicht/include/irrMath.h
===================================================================
@@ -120,7 +120,7 @@
 
    //! returns minimum of two values. Own implementation to get rid of the STL (VS6 problems)
    template<class T>
-   inline const T& min_(const T& a, const T& b)
+   inline T min_(const T& a, const T& b)
    {
        return a < b ? a : b;
    }
@@ -134,7 +134,7 @@
 
    //! returns maximum of two values. Own implementation to get rid of the STL (VS6 problems)
    template<class T>
-   inline const T& max_(const T& a, const T& b)
+   inline T max_(const T& a, const T& b)
    {
        return a < b ? b : a;
    }
@@ -453,18 +453,6 @@
 
    REALINLINE void clearFPUException ()
    {
-#ifdef IRRLICHT_FAST_MATH
-       return;
-#ifdef feclearexcept
-       feclearexcept(FE_ALL_EXCEPT);
-#elif defined(_MSC_VER)
-       __asm fnclex;
-#elif defined(__GNUC__) && defined(__x86__)
-       __asm__ __volatile__ ("fclex \n\t");
-#else
-#  warn clearFPUException not supported.
-#endif
-#endif
    }
 
    // calculate: sqrt ( x )
@@ -496,30 +484,23 @@
    // calculate: 1 / sqrt ( x )
    REALINLINE f64 reciprocal_squareroot(const f64 x)
    {
+#if defined ( IRRLICHT_FAST_MATH )
+        double result = 1.0 / sqrt(x);
+        //! pending perf test
+        //_mm_store_sd(&result,_mm_div_sd(_mm_set_pd(0.0,1.0),_mm_sqrt_sd(_mm_load_sd(&x))));
+        return result;
+#else // no fast math
        return 1.0 / sqrt(x);
+#endif
    }
 
    // calculate: 1 / sqrtf ( x )
    REALINLINE f32 reciprocal_squareroot(const f32 f)
    {
-#if defined ( IRRLICHT_FAST_MATH )
-   #if defined(_MSC_VER)
-       // SSE reciprocal square root estimate, accurate to 12 significant
-       // bits of the mantissa
-       f32 recsqrt;
-       __asm rsqrtss xmm0, f           // xmm0 = rsqrtss(f)
-       __asm movss recsqrt, xmm0       // return xmm0
-       return recsqrt;
-
-/*
-       // comes from Nvidia
-       u32 tmp = (u32(IEEE_1_0 << 1) + IEEE_1_0 - *(u32*)&x) >> 1;
-       f32 y = *(f32*)&tmp;
-       return y * (1.47f - 0.47f * x * y * y);
-*/

-   #else
-       return 1.f / sqrtf(f);
-   #endif
+#if defined ( IRRLICHT_FAST_MATH ) && defined ( __IRR_COMPILE_WITH_SSE2 )
+        float result;
+        _mm_store_ss(&result,_mm_rsqrt_ps(_mm_load_ss(&f)));
+        return result;
 #else // no fast math
        return 1.f / sqrtf(f);
 #endif
@@ -534,31 +515,10 @@
    // calculate: 1 / x
    REALINLINE f32 reciprocal( const f32 f )
    {
-#if defined (IRRLICHT_FAST_MATH)
-
-       // SSE Newton-Raphson reciprocal estimate, accurate to 23 significant
-       // bi ts of the mantissa
-       // One Newtown-Raphson Iteration:
-       // f(i+1) = 2 * rcpss(f) - f * rcpss(f) * rcpss(f)
-       f32 rec;
-       __asm rcpss xmm0, f               // xmm0 = rcpss(f)
-       __asm movss xmm1, f               // xmm1 = f
-       __asm mulss xmm1, xmm0            // xmm1 = f * rcpss(f)
-       __asm mulss xmm1, xmm0            // xmm2 = f * rcpss(f) * rcpss(f)
-       __asm addss xmm0, xmm0            // xmm0 = 2 * rcpss(f)
-       __asm subss xmm0, xmm1            // xmm0 = 2 * rcpss(f)
-                                         //        - f * rcpss(f) * rcpss(f)
-       __asm movss rec, xmm0             // return xmm0
-       return rec;
-
-
-       //! i do not divide through 0.. (fpu expection)
-       // instead set f to a high value to get a return value near zero..
-       // -1000000000000.f.. is use minus to stay negative..
-       // must test's here (plane.normal dot anything ) checks on <= 0.f
-       //u32 x = (-(AIR(f) != 0 ) >> 31 ) & ( IR(f) ^ 0xd368d4a5 ) ^ 0xd368d4a5;
-       //return 1.f / FR ( x );
-
+#if defined (IRRLICHT_FAST_MATH) && defined ( __IRR_COMPILE_WITH_SSE2 )
+        float result;
+        _mm_store_ss(&result,_mm_rcp_ps(_mm_load_ss(&f)));
+        return result;
 #else // no fast math
        return 1.f / f;
 #endif
@@ -573,106 +533,21 @@
 
    // calculate: 1 / x, low precision allowed
    REALINLINE f32 reciprocal_approxim ( const f32 f )
-   {
-#if defined( IRRLICHT_FAST_MATH)
+   {
+        //what was here before was not faster
+        return reciprocal(f);
+    }
 
-       // SSE Newton-Raphson reciprocal estimate, accurate to 23 significant
-       // bi ts of the mantissa
-       // One Newtown-Raphson Iteration:
-       // f(i+1) = 2 * rcpss(f) - f * rcpss(f) * rcpss(f)
-       f32 rec;
-       __asm rcpss xmm0, f               // xmm0 = rcpss(f)
-       __asm movss xmm1, f               // xmm1 = f
-       __asm mulss xmm1, xmm0            // xmm1 = f * rcpss(f)
-       __asm mulss xmm1, xmm0            // xmm2 = f * rcpss(f) * rcpss(f)
-       __asm addss xmm0, xmm0            // xmm0 = 2 * rcpss(f)
-       __asm subss xmm0, xmm1            // xmm0 = 2 * rcpss(f)
-                                         //        - f * rcpss(f) * rcpss(f)
-       __asm movss rec, xmm0             // return xmm0
-       return rec;
 
-
-/*
-       // SSE reciprocal estimate, accurate to 12 significant bits of
-       f32 rec;
-       __asm rcpss xmm0, f             // xmm0 = rcpss(f)
-       __asm movss rec , xmm0          // return xmm0
-       return rec;
-*/

-/*
-       register u32 x = 0x7F000000 - IR ( p );
-       const f32 r = FR ( x );
-       return r * (2.0f - p * r);
-*/

-#else // no fast math
-       return 1.f / f;
-#endif
-   }
-
-
    REALINLINE s32 floor32(f32 x)
    {
-#ifdef IRRLICHT_FAST_MATH
-       const f32 h = 0.5f;
-
-       s32 t;
-
-#if defined(_MSC_VER)
-       __asm
-       {
-           fld x
-           fsub    h
-           fistp   t
-       }
-#elif defined(__GNUC__)
-       __asm__ __volatile__ (
-           "fsub %2 \n\t"
-           "fistpl %0"
-           : "=m" (t)
-           : "t" (x), "f" (h)
-           : "st"
-           );
-#else
-#  warn IRRLICHT_FAST_MATH not supported.
        return (s32) floorf ( x );
-#endif
-       return t;
-#else // no fast math
-       return (s32) floorf ( x );
-#endif
    }
 
 
    REALINLINE s32 ceil32 ( f32 x )
    {
-#ifdef IRRLICHT_FAST_MATH
-       const f32 h = 0.5f;
-
-       s32 t;
-
-#if defined(_MSC_VER)
-       __asm
-       {
-           fld x
-           fadd    h
-           fistp   t
-       }
-#elif defined(__GNUC__)
-       __asm__ __volatile__ (
-           "fadd %2 \n\t"
-           "fistpl %0 \n\t"
-           : "=m"(t)
-           : "t"(x), "f"(h)
-           : "st"
-           );
-#else
-#  warn IRRLICHT_FAST_MATH not supported.
        return (s32) ceilf ( x );
-#endif
-       return t;
-#else // not fast math
-       return (s32) ceilf ( x );
-#endif
    }
 
 
@@ -679,30 +554,7 @@
 
    REALINLINE s32 round32(f32 x)
    {
-#if defined(IRRLICHT_FAST_MATH)
-       s32 t;
-
-#if defined(_MSC_VER)
-       __asm
-       {
-           fld   x
-           fistp t
-       }
-#elif defined(__GNUC__)
-       __asm__ __volatile__ (
-           "fistpl %0 \n\t"
-           : "=m"(t)
-           : "t"(x)
-           : "st"
-           );
-#else
-#  warn IRRLICHT_FAST_MATH not supported.
        return (s32) round_(x);
-#endif
-       return t;
-#else // no fast math
-       return (s32) round_(x);
-#endif
    }



LIST OF NEW FILES:
include/SIMDswizzle.h : http://irrlicht.sourceforge.net/forum/viewtopic.php?f=9&t=50230&p=293603#p293603
include/matrixSIMD4.h : http://irrlicht.sourceforge.net/forum/viewtopic.php?f=9&t=50230&p=293604#p293604
include/vectorSIMD.h : http://irrlicht.sourceforge.net/forum/viewtopic.php?f=9&t=50230&p=293600#p293600
Last edited by devsh on Fri May 01, 2015 12:41 pm, edited 26 times in total.
We chose to stream mesh data from Multiple OpenGL Contexts in many threads and do the other things, not because they are easy, but because they are hard! - JFK
User avatar
devsh
Competition winner
 
Posts: 1769
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby Brkopac » Fri Sep 05, 2014 6:17 am

This looks awesome. Any unit test and/or example on how to use this?
Image - The glory days.
User avatar
Brkopac
 
Posts: 88
Joined: Fri Sep 19, 2008 2:36 am

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby Akabane87 » Fri Sep 05, 2014 8:09 am

Nice use of intrinsics ;). This only works with x64 procs isn't it ?
User avatar
Akabane87
 
Posts: 50
Joined: Sat May 05, 2012 6:11 pm

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby hendu » Fri Sep 05, 2014 8:24 am

SSSE3, which means Intel Core / AMD Bulldozer minimum. No 32-bit cpu has that AFAIK (and the very recent AMD requirement makes it unusable IMHO, Phenoms and pre-bulldozer Athlons are in quite wide use).
hendu
 
Posts: 2587
Joined: Sat Dec 18, 2010 12:53 pm

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Fri Sep 05, 2014 2:24 pm

SSSE3, which means Intel Core / AMD Bulldozer minimum. No 32-bit cpu has that AFAIK (and the very recent AMD requirement makes it unusable IMHO, Phenoms and pre-bulldozer Athlons are in quite wide use).


funking hell, not the **this is too new** attitude... the code is very usable:
A) SSSE3 runs on post 2007 CPU's (AMD k10 arch supports this)
B) maybe the CPUs that can run it are 64bit, but it compiles and works fine in 32bit binaries and OSes on the same CPUs.... so 64bit system/binary is not a requirement
C) even if your CPU doesn't support SSSE3, this is very easy to switch because the code is very nice and compliant with existing vector classes (well basically if you add a few duplicated functions to vector3df)
cpp Code: Select all
 
#ifdef _IRR_COMPILE_WITH_SSSE3_
vectorSIMDf a,b,c;
#else
vector3df a,b,c;
#endif
 
/***
NOW DO YOUR MATH
***/

a.set(1,2,3); //okay I need to add a constructor from 2 and 3 numbers
b.set(-3,-2,1); //okay I need to add a constructor from 2 and 3 numbers
c = a.crossProduct(b);
 
printf("Cross %f,%f,%f\n",c.X,c.Y,c.Z);
printf("Dot %f\n",a.dotProductAsFloat(b)); //if you add a typedef to vector3df so that "float dotProductAsFloat(const vector3df &b) return dotProduct(b);"
 


and here is the example Brkopac

While yes, its true that certain comparison operators are different and I've swapped .normalize() to return a normalized vector instead of normalizing itself and returning a reference... but I have done that to be more intuitive (the normalize was funking around with me 5 years ago). But this is NOT designed as a DROP in replacement for vector3df
We chose to stream mesh data from Multiple OpenGL Contexts in many threads and do the other things, not because they are easy, but because they are hard! - JFK
User avatar
devsh
Competition winner
 
Posts: 1769
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby kklouzal » Fri Sep 05, 2014 3:14 pm

I cant use these :( I can only use up to SSE2 here..
Dream Big Or Go Home.
Help Me Help You.
User avatar
kklouzal
 
Posts: 318
Joined: Sun Mar 28, 2010 8:14 pm
Location: USA - Arizona

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Fri Sep 05, 2014 3:56 pm

what CPU have you got?

UPDATE:
added macro tests for compiletime feature set and adapts code to whatever flag you set (-msse2, -msse3 etc.)
ok so this is how its going to work... people who have below SSSE3, you can code the emulation in SSE3 or SSE2... just remember to never access the __m128 member of the union and always use the load/store ops (aligned variants are faster, always use on vectorSIMD pointer arrays of the union) and beware that shuffling costs 1 cycle (about the throughput of an add)

AND ONE LAST WORD OF CAUTION: sometimes emulating a SSSE3 instruction in SSE2 is impossible or slower than a C++ implementation, so when C++ is faster, use that or mix between the two (store/load aligned to local vectorSIMDs shouldnt cost must)

Right now I will not implement SSE4 and greater because I wont use it, but feel free to code these implementations (SSE4 makes dot product in 1 instruction instead of 3)
We chose to stream mesh data from Multiple OpenGL Contexts in many threads and do the other things, not because they are easy, but because they are hard! - JFK
User avatar
devsh
Competition winner
 
Posts: 1769
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby zognotadog » Fri Sep 05, 2014 5:31 pm

Shrek says yes
zognotadog
 
Posts: 1
Joined: Fri Sep 05, 2014 5:29 pm

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby hendu » Fri Sep 05, 2014 6:25 pm

A) SSSE3 runs on post 2007 CPU's (AMD k10 arch supports this)


Negative. K10 supports SSE3, note the two S's. Bulldozer is required for SSSE3, with three S's.
hendu
 
Posts: 2587
Joined: Sat Dec 18, 2010 12:53 pm

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Fri Sep 05, 2014 9:11 pm

I am pleased... kudos to whoever wrote the matrix4 class

you sir are a SIMD hero (he picked the right matrix format out of row-major and column-major)
I was fearing that I will have to change matrix format, breaking all code in the engine if matrix4 was to be replaced with matrixSIMD

matrix mul with vector in just 4 muls, 3 shuffles and 3 adds (as opposed to 12 muls and 12 adds)


I CHANGED MY MIND

With SSE3 the other way is faster
Last edited by devsh on Fri May 01, 2015 11:20 am, edited 1 time in total.
We chose to stream mesh data from Multiple OpenGL Contexts in many threads and do the other things, not because they are easy, but because they are hard! - JFK
User avatar
devsh
Competition winner
 
Posts: 1769
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby gerdb » Fri Sep 05, 2014 9:55 pm

super nice work, thx, my E6600 ( Conroe 65nm Q3'06 ) does have SSSE3 :arrow: :wink:
gerdb
 
Posts: 194
Joined: Wed Dec 02, 2009 8:21 pm
Location: Dresden, Germany

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Fri Sep 05, 2014 10:28 pm

okay, we did our research at Build A World.. current state of affairs

64bit binaries only:
SSSE3 and above (Bulldozer and above for AMD)
Covers only post 2011 AMD CPUs and post 2007 Intel CPUs
As the AMD is behind intel, you most probably want to use SSE4.1 instead of SSSE3
You may be better off using AVX while you're at it (includes Q4 2011 AMD CPUs)

32bit binaries:
Highest intrinsics available are SSE3 (which has horizontal add so you still get fast dot product)

YOU CAN STILL USE SSE3 and below on 64bit builds!

I willl only code the SSE3 implementation of the SIMD irrlicht from now on, higher and lower featuresets will have to be contributed by someone else
We chose to stream mesh data from Multiple OpenGL Contexts in many threads and do the other things, not because they are easy, but because they are hard! - JFK
User avatar
devsh
Competition winner
 
Posts: 1769
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby kklouzal » Fri Sep 05, 2014 10:43 pm

Visual Studio Express 2013 only allows me to set compiler flags for SSE, SSE2, AVX, AVX2 even though I have an Intel Core I7 processor.

I really appreciate your contribution here as the performance increase is very noteworthy, however I wish I could take advantage of it.
Dream Big Or Go Home.
Help Me Help You.
User avatar
kklouzal
 
Posts: 318
Joined: Sun Mar 28, 2010 8:14 pm
Location: USA - Arizona

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Fri Sep 05, 2014 10:47 pm

does
cpp Code: Select all
/arch:SSE3
work?
We chose to stream mesh data from Multiple OpenGL Contexts in many threads and do the other things, not because they are easy, but because they are hard! - JFK
User avatar
devsh
Competition winner
 
Posts: 1769
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby kklouzal » Fri Sep 05, 2014 11:12 pm

1>cl : Command line warning D9002: ignoring unknown option '/arch:SSE3'
Dream Big Or Go Home.
Help Me Help You.
User avatar
kklouzal
 
Posts: 318
Joined: Sun Mar 28, 2010 8:14 pm
Location: USA - Arizona

Next

Return to Code Snippets

Who is online

Users browsing this forum: No registered users and 1 guest