WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTORS!


Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Fri Sep 05, 2014 11:28 pm

I swear there is a way, because we use MSVC and the particles are all SSE3 in BaW.

Maybe compile with Intel or MinGW (GCC for Windows).

Or, if you want to test it really quickly, you can slap on the AVX flag if you have a post-2011 CPU.
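A quick way to see what a given set of flags actually enabled is to look at the macros the compiler defines. This is only a sketch: the function name enabledSimdLevel is made up for illustration, and the macro behaviour described in the comments is what GCC, Clang and MSVC document, not something taken from the code in this thread.

```cpp
#include <cassert>

// Hypothetical helper: reports the highest SIMD level this translation unit
// was compiled for. GCC and Clang define __SSE2__, __SSE3__ and __AVX__
// according to -msse2 / -msse3 / -mavx. MSVC defines __AVX__ under /arch:AVX
// but has no __SSE3__ macro, so on MSVC this can only report SSE2 or AVX.
inline const char* enabledSimdLevel()
{
#if defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";
#else
    return "none";
#endif
}
```

Printing the result of enabledSimdLevel() once at startup is enough to confirm that the flag you added really reached the compiler.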

https://software.intel.com/en-us/articles/streaming-simd-extensions-3-enabling-for-the-microsoft-net-compiler-2003
http://eigen.tuxfamily.org/bz/show_bug.cgi?id=136
We chose to stream mesh data from Multiple OpenGL Contexts in many threads and do the other things, not because they are easy, but because they are hard! - JFK
devsh
Competition winner
 
Posts: 1769
Joined: Tue Dec 09, 2008 6:00 pm
Location: UK

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby kklouzal » Fri Sep 05, 2014 11:57 pm

Visual Studio 2013 Express is not going to let me compile with SSE3; it flat out doesn't support it, nor do any of the Express versions support compiling to 64-bit. This is what I was getting at earlier when I said it doesn't seem like a smart decision to include SSE3, and that you should use SSE2 instead, because there are way too many devices still in existence that can use SSE2 but not SSE3. The target audience is just too limited for the effort it takes to code the intrinsics.
Dream Big Or Go Home.
Help Me Help You.
kklouzal
 
Posts: 318
Joined: Sun Mar 28, 2010 8:14 pm
Location: USA - Arizona

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Sat Sep 06, 2014 12:57 am

Oh, you have the free Visual Studio :D

Change to a more powerful compiler like GCC (it will actually work better for everything than any non-Ultimate Visual Studio).

It's not SSE3's fault, it's Microsoft's.

SSE3 is a feature set for 32-bit machines; it's just that VSE won't let you compile with AVX (which requires 64-bit).

Unless you are willing to change compiler (to GCC, the free Intel Student/Personal license, or Visual Studio for students, if that works for 64-bit), you're stuck with producing SSE2 executables.

And I'd like to see these devices that support SSE2 but not SSE3!

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Sat Sep 06, 2014 1:14 am

CODE UPDATED!

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby Granyte » Sat Sep 06, 2014 2:10 am

Most of my work was on the matrix code, so it does not really overlap with yours. The IRR_SSE flag enables a special path for matrix4 (CMatrix4<f32>) that uses SSE.

The experimental flag stuff does not work currently; I was waiting on the FVF to change the vector format.

The file is 68862 characters, so I cannot post it directly. Here is my matrix4.h. It is not as fast as it could be, as it still uses _mm_set_ps when operating on vectors, but when you have 10 operations it's way more than worth it.
https://www.dropbox.com/s/0qcnz0ye9004t ... ix4.h?dl=0
Granyte
 
Posts: 846
Joined: Tue Jan 25, 2011 11:07 pm

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Sat Sep 06, 2014 2:42 am

good effort, thanks for sharing


A few remarks, as your code could be 200% faster.

_mm_set_ps() is REALLY slow. An unaligned load (_mm_loadu_ps()) beats it by far, and padding and aligning the vector class lets you use the aligned load (_mm_load_ps()).
The main thing is that I think it's some kind of bit-shifting macro (that most probably takes 11 one-cycle ops).
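To make the difference concrete, here is a minimal sketch of the three ways of getting a vector into a register. PaddedVec3 and the function names are hypothetical and stand in for whatever padded vector class you use; only the intrinsics themselves are real.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h> // SSE: _mm_set_ps, _mm_loadu_ps, _mm_load_ps

// Hypothetical padded vector: the fourth float makes it 16 bytes, and
// alignas(16) puts it on a 16-byte boundary on the stack or in static storage.
struct alignas(16) PaddedVec3
{
    float X, Y, Z, W;
};

// Builds the register element by element (the pattern criticised above).
// _mm_set_ps takes its arguments from the highest lane down to the lowest.
inline __m128 loadWithSet(const PaddedVec3& v)
{
    return _mm_set_ps(v.W, v.Z, v.Y, v.X);
}

// One unaligned load: valid for any address.
inline __m128 loadUnaligned(const PaddedVec3& v)
{
    return _mm_loadu_ps(&v.X);
}

// One aligned load: only valid when the address is a multiple of 16.
inline __m128 loadAligned(const PaddedVec3& v)
{
    assert(reinterpret_cast<std::uintptr_t>(&v) % 16 == 0);
    return _mm_load_ps(&v.X);
}
```

All three produce the same register contents; which one is fastest depends on the compiler and CPU, so measure before relying on any particular figure.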

The set_ps is killing quite a lot of performance here, and you have a BIG bug:
Code:
vect.X=result.m128_f32[0];
vect.Y=result.m128_f32[1];
vect.Z=result.m128_f32[2];


A) only MSVC supports this
B) this way of access is only intended for debug!
C) because of (B) your compiler drops to a failsafe implementation and introduces big penalties in storing the results

Use _mm_storeu_ps() or a masked store (maskmove), so you don't write out of bounds.
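A portable replacement for the m128_f32 access could look like the sketch below. Vec3 and storeXYZ are hypothetical names used only for illustration; the point is the store through a 4-float temporary, so that an unpadded 12-byte vector is never overrun.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: _mm_storeu_ps

// Hypothetical unpadded vector, 12 bytes.
struct Vec3
{
    float X, Y, Z;
};

// Spills the register to a 4-float temporary, then copies only the three
// lanes that belong to the vector. Storing straight to &out.X would write
// 16 bytes into a 12-byte object.
inline void storeXYZ(Vec3& out, __m128 result)
{
    float lanes[4];
    _mm_storeu_ps(lanes, result);
    out.X = lanes[0];
    out.Y = lanes[1];
    out.Z = lanes[2];
}
```

With a vector class padded to four floats you can skip the temporary and store directly into the object.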

I'd advise borrowing my vector class and amending your matrix code to get full 16-byte-aligned performance.

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby Granyte » Sat Sep 06, 2014 3:47 am

Yeah, my code is not yet ready for release; it was only meant as a proof of concept for a drop-in replacement SSE matrix class.

Even in this form it's faster than the non-SSE version.

I did not know about loadu and storeu. I'll get to work on these; after that I'll likely try to integrate your vector class.

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Sat Sep 06, 2014 3:49 am

Update: I just used +, -, *, / and their compound-assignment equivalents in a function that computes the average position from a list.

SSEems to work
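For anyone wondering what such a test looks like, here is a rough sketch. Vec4f is a hypothetical stand-in written for this example, not the vector class from this thread, and averagePosition is likewise an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE arithmetic intrinsics

// Hypothetical minimal SIMD vector with just enough operators for the demo.
struct alignas(16) Vec4f
{
    __m128 v;

    Vec4f() : v(_mm_setzero_ps()) {}
    explicit Vec4f(__m128 r) : v(r) {}
    Vec4f(float x, float y, float z, float w = 0.0f) : v(_mm_set_ps(w, z, y, x)) {}

    Vec4f operator+(const Vec4f& o) const { return Vec4f(_mm_add_ps(v, o.v)); }
    Vec4f& operator+=(const Vec4f& o) { v = _mm_add_ps(v, o.v); return *this; }
    Vec4f operator/(float s) const { return Vec4f(_mm_div_ps(v, _mm_set1_ps(s))); }

    // Reads one lane (0..3) through a temporary; meant for inspection only.
    float lane(int i) const
    {
        float t[4];
        _mm_storeu_ps(t, v);
        return t[i];
    }
};

// Sums all positions with one SIMD add per element, then divides once.
inline Vec4f averagePosition(const Vec4f* positions, std::size_t count)
{
    assert(count > 0);
    Vec4f sum;
    for (std::size_t i = 0; i < count; ++i)
        sum += positions[i];
    return sum / static_cast<float>(count);
}
```

The positions here live in a plain array, which keeps the alignas(16) guarantee; dynamically allocated containers need more care.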

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Sat Sep 06, 2014 6:58 pm

Just a word of warning: even though 128-bit alignment is forced on vectorSIMDf, you can't use core::array<> or std::vector<> with their default allocators, because they allocate memory dynamically and can't guarantee 16-byte alignment.

Instead you need to make or get a custom allocator that does 16-byte alignment:
https://stackoverflow.com/questions/11600943/aligned-allocation-with-stl-vector
https://gist.github.com/donny-dont/1471329

P.S. I think core::array has support for custom allocators
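As a starting point, a minimal aligned allocator for std::vector might look like this. It is a sketch under the assumption of a C++11 compiler with _mm_malloc/_mm_free available; AlignedAllocator is a name chosen for this example and is not taken from the linked pages or from Irrlicht.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>
#include <xmmintrin.h> // also provides _mm_malloc / _mm_free

// Hypothetical C++11 allocator that returns storage aligned to Alignment bytes.
template <typename T, std::size_t Alignment = 16>
struct AlignedAllocator
{
    using value_type = T;

    AlignedAllocator() noexcept = default;

    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // An explicit rebind is required because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n)
    {
        void* p = _mm_malloc(n * sizeof(T), Alignment);
        if (!p)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { _mm_free(p); }
};

// The allocator is stateless, so any two instances are interchangeable.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }

template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }
```

Declaring a container as std::vector<float, AlignedAllocator<float>> then makes data() safe to pass to _mm_load_ps, as long as you only load from offsets that are multiples of four floats.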

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby Granyte » Sat Sep 06, 2014 10:22 pm

I just pushed an update.

There is only one use of the set_ps method left.
It requires SSE 4.1 because I make use of the extract method.
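If the SSE 4.1 requirement ever becomes a problem, a single lane can also be read with plain SSE. The following is only a sketch of that alternative, with extractLane being a hypothetical helper name; it is not what the posted matrix4.h does.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: _mm_shuffle_ps, _mm_cvtss_f32

// Reads lane Lane (0..3) using only SSE1 instructions: broadcast the wanted
// lane to every position with a shuffle, then take the lowest float.
template <int Lane>
inline float extractLane(__m128 v)
{
    static_assert(Lane >= 0 && Lane < 4, "lane index out of range");
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(Lane, Lane, Lane, Lane)));
}
```

The lane index is a template parameter because the shuffle needs a compile-time constant.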


The last file was missing my work on the transform-vector methods; this one has them.

https://www.dropbox.com/s/0qcnz0ye9004t ... ix4.h?dl=0

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Sun Sep 07, 2014 2:57 am

Oh, by the way, your code is also affected by silent bugs in aligned load/store, as the matrix itself may not be on an aligned boundary (see top of first post).

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby Granyte » Sun Sep 07, 2014 4:04 am

If the matrix is inside an array, yes, unless we use an aligned allocator.

And MSVC prior to 2013 will sometimes fail to align matrices even outside of an array; Bullet had some serious issues with this before.

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby robmar » Sun Sep 07, 2014 1:09 pm

Is it correct to say that Irrlicht will load and draw meshes much faster with this code? Or are there other frequently used functions that will benefit?
robmar
 
Posts: 1002
Joined: Sun Aug 14, 2011 11:30 pm

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby hendu » Sun Sep 07, 2014 7:54 pm

It will not affect mesh loading at all. It speeds up vector calculations.
hendu
 
Posts: 2587
Joined: Sat Dec 18, 2010 12:53 pm

Re: WANT 4x SPEEDUPS on CPU-side CODE??? SIMD IRRLICHT VECTO

Postby devsh » Sun Sep 07, 2014 11:54 pm

It will speed up whatever you use the vectors in (it doesn't magically replace vector3df), so if you modify culling, animation, world transforms etc. to use SIMD vectors, it will be faster.
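As an illustration of the world-transform case, here is a rough sketch of a point transform done with four row loads and three multiply-adds. The function name transformPoint and the matrix layout (16 consecutive floats, rows of four, translation in elements 12 to 14) are assumptions made for this example; check them against your own matrix class before borrowing anything.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: loads, stores, mul, add

// Transforms the point (x, y, z, 1) by a 4x4 matrix stored as 16 floats.
// Assumed layout: M[0..3] is the first row and M[12..14] holds the translation.
// out receives four floats (x', y', z', w').
inline void transformPoint(const float* M, const float* xyz, float* out)
{
    assert(M && xyz && out);

    // Unaligned loads, so M may live anywhere in memory.
    const __m128 row0 = _mm_loadu_ps(M + 0);
    const __m128 row1 = _mm_loadu_ps(M + 4);
    const __m128 row2 = _mm_loadu_ps(M + 8);
    const __m128 row3 = _mm_loadu_ps(M + 12);

    // result = x*row0 + y*row1 + z*row2 + row3
    __m128 r = _mm_mul_ps(_mm_set1_ps(xyz[0]), row0);
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(xyz[1]), row1));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(xyz[2]), row2));
    r = _mm_add_ps(r, row3);

    _mm_storeu_ps(out, r);
}
```

When many points share one matrix, load the four rows once outside the loop rather than once per point.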
