[OBSOLETE] SIMD IRRLICHT VECTORS!

devsh
Competition winner
Posts: 2057
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Post by devsh »

I'm sure there is a way, because we use MSVC and the particles in BaW are all SSE3.

Maybe compile with the Intel compiler or MinGW (GCC for Windows).

Or, if you want to test it really quickly, you can just switch on the AVX flag (/arch:AVX in MSVC) if you have a post-2011 CPU.

https://software.intel.com/en-us/articl ... piler-2003
http://eigen.tuxfamily.org/bz/show_bug.cgi?id=136
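To illustrate the compiler side of this, here is a minimal sketch of an SSE3 dot product using _mm_hadd_ps from <pmmintrin.h>. The names dot4_sse3 and SSE3_FN are hypothetical. With GCC/Clang (including MinGW) you either compile with -msse3 or, as here, enable SSE3 for a single function via the target attribute; as far as I know, MSVC accepts SSE3 intrinsics without any /arch switch, because /arch mainly controls the compiler's own code generation.

```cpp
#include <cassert>
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps (also pulls in the SSE/SSE2 headers)

// GCC/Clang refuse to inline SSE3 intrinsics into code not compiled for SSE3.
// Enable SSE3 for just this function instead of the whole translation unit.
// MSVC does not need the attribute, so the macro expands to nothing there.
#if defined(__GNUC__) || defined(__clang__)
#define SSE3_FN __attribute__((target("sse3")))
#else
#define SSE3_FN
#endif

// Hypothetical helper: 4-wide dot product using two SSE3 horizontal adds.
SSE3_FN float dot4_sse3(const float* a, const float* b) {
    __m128 m = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));  // lane-wise products
    m = _mm_hadd_ps(m, m);  // [m0+m1, m2+m3, m0+m1, m2+m3]
    m = _mm_hadd_ps(m, m);  // every lane now holds m0+m1+m2+m3
    return _mm_cvtss_f32(m);  // low lane as a float
}
```

Calling this on a CPU without SSE3 would stop with an illegal-instruction fault, so in shipping code it belongs behind a runtime CPU check.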
kklouzal
Posts: 343
Joined: Sun Mar 28, 2010 8:14 pm
Location: USA - Arizona

Post by kklouzal »

Visual Studio 2013 Express is not going to let me compile with SSE3; it flat out doesn't support it, and none of the Express versions support compiling to 64-bit either. This is what I was getting at earlier when I said it doesn't seem like a smart decision to require SSE3: you should use SSE2 instead, because there are far too many devices still in existence that support SSE2 but not SSE3. The target audience is just too limited for the effort it takes to code the intrinsics.
Dream Big Or Go Home.
Help Me Help You.
devsh

Post by devsh »

Oh, you have a free Visual Studio :D

Switch to a more powerful compiler like GCC (it will actually work better for everything than any non-Ultimate Visual Studio).

It's not SSE3's fault, it's Microsoft's.

SSE3 is a feature set that works on 32-bit machines too; it's just that VS Express won't let you compile with AVX (which requires 64-bit).

Unless you are willing to upgrade your compiler (to GCC, the free Intel Student/Personal license, or Visual Studio for students, if that works for 64-bit), you're stuck producing SSE2 executables.

And I'd like to see these devices that support SSE2 but not SSE3!
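On the SSE2-versus-SSE3 question, one common way to ship a single executable is to check the CPU at startup and pick a code path. A minimal sketch, assuming an x86 target; the function and enum names are hypothetical. It reads CPUID leaf 1 (SSE2 is EDX bit 26, SSE3 is ECX bit 0) through __cpuid on MSVC and uses __builtin_cpu_supports on GCC/Clang.

```cpp
#include <cassert>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>  // __cpuid
#define HAVE_MSVC_CPUID 1
#endif

// Hypothetical helper: does the CPU we are running on report SSE2?
inline bool cpuHasSSE2() {
#if defined(HAVE_MSVC_CPUID)
    int info[4];
    __cpuid(info, 1);  // info = {EAX, EBX, ECX, EDX}
    return (info[3] & (1 << 26)) != 0;  // EDX bit 26
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // only strictly required before main() runs
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;  // not x86, or an unknown compiler
#endif
}

// Hypothetical helper: does the CPU we are running on report SSE3?
inline bool cpuHasSSE3() {
#if defined(HAVE_MSVC_CPUID)
    int info[4];
    __cpuid(info, 1);
    return (info[2] & 1) != 0;  // ECX bit 0
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse3") != 0;
#else
    return false;
#endif
}

enum class SimdPath { SSE2, SSE3 };

// Pick the best available path once, e.g. at engine start-up.
inline SimdPath pickSimdPath() {
    if (cpuHasSSE3()) return SimdPath::SSE3;
    assert(cpuHasSSE2() && "an SSE2 build cannot run on this CPU");
    return SimdPath::SSE2;
}
```

The chosen path would typically be stored once (for example as a function pointer), so the CPUID query is not repeated in hot loops.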
Granyte
Posts: 850
Joined: Tue Jan 25, 2011 11:07 pm

Post by Granyte »

Most of my work has been on the matrix code, so it does not really overlap with yours. The IRR_SSE flag enables a special SSE path for matrix4 (CMatrix4<f32>).

The experimental flag stuff does not work currently; I was waiting on the FVF to change the vector format.

The file is 68862 characters, so I cannot post it directly. Here is my matrix4.h. It is not as fast as it could be, since it still uses _mm_set_ps when operating on vectors, but when you have 10 operations it's more than worth it.
https://www.dropbox.com/s/0qcnz0ye9004t ... ix4.h?dl=0
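A minimal sketch of transforming a point by a 4x4 matrix without building each vector lane by lane with _mm_set_ps: each matrix row is loaded once and the point's components are broadcast. It assumes 16 floats in the layout that, as I understand it, Irrlicht's CMatrix4<f32>::transformVect uses (X' = X*M[0] + Y*M[4] + Z*M[8] + M[12], and so on, with the translation in M[12..14]). Vec3 and transformVectSSE are hypothetical stand-ins, not Granyte's code.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Hypothetical stand-in for Irrlicht's vector3df.
struct Vec3 { float X, Y, Z; };

// out = in.X*row0 + in.Y*row1 + in.Z*row2 + row3 (point transform, w = 1).
inline Vec3 transformVectSSE(const float M[16], const Vec3& in) {
    __m128 r = _mm_mul_ps(_mm_loadu_ps(M + 0), _mm_set1_ps(in.X));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(M + 4), _mm_set1_ps(in.Y)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(M + 8), _mm_set1_ps(in.Z)));
    r = _mm_add_ps(r, _mm_loadu_ps(M + 12));  // translation row

    float tmp[4];
    _mm_storeu_ps(tmp, r);  // portable store; no MSVC-only m128_f32 access
    return Vec3{tmp[0], tmp[1], tmp[2]};
}
```

With a 16-byte-aligned matrix, the four _mm_loadu_ps calls could become _mm_load_ps.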
devsh

Post by devsh »

Good effort, thanks for sharing.


A few remarks, as your code could be 200% faster:

set_ps() is REALLY slow. An unaligned load (_mm_loadu_ps()) beats it by far, and padding and aligning the vector class lets you use the aligned load (_mm_load_ps()).
The main thing about it is that I think it's some kind of bit-shifting macro (which most probably takes 11 one-cycle ops).

The set_ps is killing quite a lot of performance here, and you have a BIG bug:

Code:

vect.X = result.m128_f32[0];
vect.Y = result.m128_f32[1];
vect.Z = result.m128_f32[2];
A) Only MSVC supports this.
B) This way of access is only intended for debugging!
C) Because of (B), your compiler drops to a failsafe implementation and introduces big penalties when storing the results.

Use _mm_storeu_ps() or movemask (so you don't overrun and write to out-of-bounds memory).

I'd advise borrowing my vector class and amending your matrix code to get full 16-byte-aligned performance.
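A minimal sketch of both fixes: an aligned load/store for a padded, 16-byte-aligned vector (in the spirit of vectorSIMDf, though the Vec4A type here is hypothetical), and, for a 12-byte vector3df-like type (the hypothetical Vec3), loads and stores through an aligned 16-byte temporary so the full-width access never touches memory past the Z component. Only standard <xmmintrin.h> intrinsics are used in place of the MSVC-only m128_f32 member.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical stand-ins: a 12-byte vector3df-like type and a padded,
// 16-byte aligned SIMD vector.
struct Vec3 { float X, Y, Z; };
struct alignas(16) Vec4A { float X, Y, Z, W; };

inline bool isAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Padded and aligned: one aligned load/store covers the whole object.
inline __m128 loadVec(const Vec4A& v) {
    assert(isAligned16(&v));  // _mm_load_ps can fault on a misaligned address
    return _mm_load_ps(&v.X);
}
inline void storeVec(Vec4A& v, __m128 r) {
    assert(isAligned16(&v));
    _mm_store_ps(&v.X, r);
}

// 12-byte vector: a 16-byte load or store would touch 4 bytes past Z,
// so go through an aligned 16-byte temporary instead.
inline __m128 loadVec(const Vec3& v) {
    alignas(16) float t[4] = {v.X, v.Y, v.Z, 0.0f};
    return _mm_load_ps(t);
}
inline void storeVec(Vec3& v, __m128 r) {
    alignas(16) float t[4];
    _mm_store_ps(t, r);  // portable replacement for result.m128_f32[i]
    v.X = t[0];
    v.Y = t[1];
    v.Z = t[2];
}
```

The padded type is the one that avoids the temporary entirely, which is the point of padding and aligning the vector class.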
Granyte

Post by Granyte »

Yeah, my code is not ready for release yet; it was only meant as a proof of concept for a drop-in replacement SSE matrix class.

And even in this form it's faster than the non-SSE version.

I did not know about loadu and storeu. I'll get to work on those, and after that I'll likely try to integrate your vector class.
devsh

Post by devsh »

Update: I just used +, -, *, / and their compound-assignment equivalents in a function that computes the average position of a list.

SSEems to work
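A minimal sketch of such an average-position function with SSE, assuming the positions are stored as a padded, 16-byte-aligned four-float type (the hypothetical Vec4A, not devsh's actual vectorSIMDf interface). All four lanes are accumulated at once and divided once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical padded, 16-byte aligned position type.
struct alignas(16) Vec4A { float X, Y, Z, W; };

// Average of n positions: sum all four lanes at once, divide once at the end.
inline Vec4A averagePosition(const Vec4A* pts, std::size_t n) {
    assert(n > 0);
    // A plain dynamically allocated buffer may not guarantee this alignment.
    assert((reinterpret_cast<std::uintptr_t>(pts) & 15u) == 0);

    __m128 sum = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; ++i)
        sum = _mm_add_ps(sum, _mm_load_ps(&pts[i].X));

    Vec4A out;
    _mm_store_ps(&out.X, _mm_div_ps(sum, _mm_set1_ps(static_cast<float>(n))));
    return out;
}
```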
devsh

Post by devsh »

Just a word of warning: even though 128-bit alignment is forced on vectorSIMDf, you can't use core::array<> or std::vector<>, because they allocate memory dynamically and can't guarantee 16-byte alignment.

Instead, you need to write (or find) a custom allocator that does 16-byte alignment:
https://stackoverflow.com/questions/116 ... stl-vector
https://gist.github.com/donny-dont/1471329

P.S. I think core::array supports custom allocators.
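A minimal C++17 sketch of such an allocator, built on the aligned forms of ::operator new and ::operator delete. The AlignedAllocator and AlignedVector names and interface are hypothetical, not the ones from the linked pages. Note that since C++17, plain std::allocator should already honour alignas for over-aligned types, so this mainly matters for older compilers or when you want a larger alignment than the element type declares.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical minimal allocator returning Alignment-byte aligned storage (C++17).
template <class T, std::size_t Alignment = 16>
struct AlignedAllocator {
    static_assert((Alignment & (Alignment - 1)) == 0, "Alignment must be a power of two");
    static_assert(Alignment >= alignof(T), "Alignment must not be weaker than T's own");

    using value_type = T;
    // Needed because allocator_traits cannot rebind a non-type template parameter.
    template <class U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_array_new_length();
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};

// Stateless, so any two instances can free each other's memory.
template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Hypothetical alias: a std::vector whose buffer is always 16-byte aligned.
template <class T>
using AlignedVector = std::vector<T, AlignedAllocator<T, 16>>;
```

Usage would be, for example, AlignedVector<float> positions; every reallocation of its buffer then goes through the aligned operator new.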
Granyte

Post by Granyte »

I just pushed an update

There is only one use of the set_ps method left.
It requires SSE 4.1 because I make use of the extract method.

The last file was missing my work on the transform-vector methods; this one has them.

https://www.dropbox.com/s/0qcnz0ye9004t ... ix4.h?dl=0
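The extract method mentioned above is presumably SSE4.1's _mm_extract_ps, which returns the lane's bit pattern as an int. If the SSE4.1 requirement is the only reason for it, a float lane can also be read with SSE-only instructions: shuffle the lane into position 0 and take the low float. A minimal sketch; the lane and laneDyn names are hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Read lane I (0..3) of v as a float with SSE-only instructions:
// broadcast lane I with a shuffle, then take the low float.
template <int I>
inline float lane(__m128 v) {
    static_assert(I >= 0 && I < 4, "lane index must be 0..3");
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(I, I, I, I)));
}

// Runtime index: spill to an aligned temporary and index it.
inline float laneDyn(__m128 v, int i) {
    assert(i >= 0 && i < 4);
    alignas(16) float t[4];
    _mm_store_ps(t, v);
    return t[i];
}
```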
devsh

Post by devsh »

Oh, by the way, your code is also affected by silent bugs in aligned load/store, as the matrix itself may not be on an aligned boundary (see the top of the first post).
Granyte

Post by Granyte »

In case the matrix is inside an array, yes, unless we use an aligned allocator.

And MSVC prior to 2013 will sometimes even fail to align matrices outside of an array; Bullet had some serious issues with this before.
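Since an aligned load from an address that isn't 16-byte aligned can fault, a defensive sketch for matrix rows is to check the address and fall back to the unaligned load. On recent CPUs an unaligned load of data that happens to be aligned is typically about as fast as the aligned one, so always using _mm_loadu_ps is also a reasonable choice. loadRow is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper: load four floats (e.g. one matrix row), using the
// aligned instruction only when the address really is 16-byte aligned.
inline __m128 loadRow(const float* p) {
    assert(p != nullptr);
    if ((reinterpret_cast<std::uintptr_t>(p) & 15u) == 0)
        return _mm_load_ps(p);  // safe: address is 16-byte aligned
    return _mm_loadu_ps(p);     // e.g. a matrix inside a plain array
}
```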
robmar
Posts: 1125
Joined: Sun Aug 14, 2011 11:30 pm

Post by robmar »

Is it correct to say that Irrlicht will load and draw meshes much faster with this code? Or are there other frequently used functions that will benefit?
hendu
Posts: 2600
Joined: Sat Dec 18, 2010 12:53 pm

Post by hendu »

It will not affect mesh loading at all. It speeds up vector calculations.
devsh

Post by devsh »

It will speed up whatever you use the vectors in (it doesn't magically replace vector3df), so if you modify culling, animation, world transforms, etc. to use SIMD vectors, they will be faster.