Debate over CPU architecture (cache coerency) and Peterson

Discussion about everything. New games, 3d math, development tips...
Post Reply
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Debate over CPU architecture (cache coerency) and Peterson

Post by REDDemon »

Today we had a small debate about CPU architecture in the forum of our classroom :D

Basically the teacher proposed a problem:

The following code implements Peterson algorithm, it is prooven to be correct but on a modern x86 processor does not work (code given by the teacher):

Code: Select all

 
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
 
static volatile int flag1 = 0;
static volatile int flag2 = 0;
static volatile int turn = 1;
static volatile int gSharedCounter = 0;
int gLoopCount;
 
void proc1() {
    flag1 = 1;
    turn = 2;
    while((flag2 == 1) && (turn == 2)) ;
    // Critical section
    gSharedCounter++;
    // Let the other task run
    flag1 = 0;
}
 
void proc2() {
    flag2 = 1;
    turn = 1;
    while((flag1 == 1) && (turn == 1)) ;
    // critical section
    gSharedCounter++;
    // leave critical section
    flag2 = 0;
}
 
//
// Tasks, as a level of indirection
//
void *task1(void *arg) {
    int i;
    printf("Starting task1n");
    for(i=0;i<gLoopCount;i++) proc1();
}
 
void *task2(void *arg) {
    int i;
    printf("Starting task1n");
    for(i=0;i<gLoopCount;i++) proc2();
}
int
main(int argc, char ** argv)
{
    int loopCount = 0;
    pthread_t dekker_thread_1;
    pthread_t dekker_thread_2;
    void * returnCode;
    int result;
    int expected_sum;
    
    /* Check arguments to program*/
    if(argc != 2)
    {
        fprintf(stderr, "USAGE: %s <loopcount>n", argv[0]);
        exit(1);
    }
    
    /* Parse argument */
    loopCount = atoi(argv[1]);
    gloopCount = loopCount;
    expected_sum = 2*gloopCount;
    
    /* Start the threads */
    result = pthread_create(&dekker_thread_1, NULL, task1, NULL);
    result = pthread_create(&dekker_thread_2, NULL, task2, NULL);
    
    /* Wait for the threads to end */
    result = pthread_join(dekker_thread_1,&returnCode);
    result = pthread_join(dekker_thread_2,&returnCode);
    printf("Both threads terminatedn");
    
    /* Check result */
    if( gSharedCounter != expected_sum ) {
        printf("[-] Sum %d rather than %d.n", gSharedCounter, expected_sum);
        return 1;
        } else {
        printf("[+] Okn");
        return 0;
    }
}
 
the above algorithm infact does not work, and prints "Sum RANDOM NUMBER rather than 1000000000" (when input of program for gloopCount is 500000000).

However he asked to investigate the problem and find possible solutions.

Here's mine (c++11):

Code: Select all

 
#include <thread>
#include <limits>
#include <iostream>
#undef NDEBUG
#include <cassert>
 
// make globals local to this file (do no interfere with other tests namespace)
namespace {
       
        alignas(4) static volatile int flag1 = 0;
        alignas(4) static volatile int flag2 = 0;
        alignas(4) static volatile int turn = 1;
        alignas(4) static volatile int gSharedCounter = 0;
        alignas(4) int gLoopCount = 500000000;
       
        void proc1() {
                flag1 = 1;
                turn = 2;
                while((flag2 == 1) && (turn == 2)) ;
               
                // ENTER
                ++gSharedCounter;
                // LEAVE
               
                flag1 = 0;
        }
 
        void proc2() {
                flag2 = 1;
                turn = 1;
                while((flag1 == 1) && (turn == 1)) ;
               
                // ENTER
                ++gSharedCounter;
                // LEAVE
               
                flag2 = 0;
        }
       
        void task1(){
                for(alignas(4) int i = 0; i<gLoopCount; i++)
                        proc1();
               
        }
       
        void task2(){
                for(alignas(4) int i = 0; i<gLoopCount; i++)
                        proc2();
        }
       
}
 
 
// Runned by CMake Test runner. See CMakeLists.txt
int main(int argc, char ** argv){
        std::cout<<"running peterson with iterations: "<<gLoopCount<<std::endl;
        std::thread t1(task1);
        std::thread t2(task2);
        t2.join();
        t1.join();
        assert(gLoopCount*2==gSharedCounter);
        return 0;
}
the real change in the code is that I use "alignas(4)". But in theory the compiler should already align integers to 4 bytes and hence to cache lines

The debate is that according to teacher's assistant the problem is not alignment of variables and infact applying "alignas(4)" to the original C code does not make the algorithm working.(however I read alignas of C is different from C++.. or that is false?)

So here we have 2 different versions:
1) not working C version (indipendently of the fact someone used alignas macro)
2) working C++ version (however people here a bit lazy, no one tested running my code, so it could be a peculiarity of my CPU)

Wanna join the debate?

Could be possible that code of teacher does not work because we are running it inside a VM? (well I don't know in wich environment the teacher's assistent runned the code, but at least me I runned the first algorithm on a Linux VM (because using pthreads), while I runned the 2nd algoirthm under windows Vista without using any Virtual env. How I love no one tried to run my code. ^^)

Little warning, the test requires ~2 minutes to end

EDIT: another intersting thing.

I was compiling using "-O3" with that flag my code works nice, without it doesn't work (maybe it depends on MMX registers too?)
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
mongoose7
Posts: 1227
Joined: Wed Apr 06, 2011 12:13 pm

Re: Debate over CPU architecture (cache coerency) and Peters

Post by mongoose7 »

You can't do this:

Code: Select all

 
    flag1 = 1;
    turn = 2;
    while((flag2 == 1) && (turn == 2)) ;
BTW Intel CPUs are always cache coherent I believe. They use the I2 bus I think. Other architectures may only force cache coherency in synchronisation primitives.
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: Debate over CPU architecture (cache coerency) and Peters

Post by REDDemon »

Apparently with -O3 optimization I can do that and it works, but why?

As far as i know all modern (x86) cpus are cache coerent. probably I'm wrong, but I do not understand why -O3 makes a difference here (the algorithm should never work, regardless of the architecture)
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
hendu
Posts: 2600
Joined: Sat Dec 18, 2010 12:53 pm

Re: Debate over CPU architecture (cache coerency) and Peters

Post by hendu »

(Am I doing your homework for you?)

It would only be correct in a non-optimizing compiler. The C standard allows the compiler to optimize the sets out, reorder them, or any other such transformation. Volatile alone is not what you think it is, the compiler is not thread aware.
mongoose7
Posts: 1227
Joined: Wed Apr 06, 2011 12:13 pm

Re: Debate over CPU architecture (cache coerency) and Peters

Post by mongoose7 »

I had a loooong look into this. It appears that coherence is a subjective matter. It means that you can interleave operations from the two threads in such a way that the observation by each thread is consistent. The sample code is:

Code: Select all

void proc1() {
    flag1 = 1;
    turn = 2;
    while((flag2 == 1) && (turn == 2)) ;
    // Critical section
    gSharedCounter++;
    // Let the other task run
    flag1 = 0;
}
But flag1 is never read, so the following is actually cache consistent.

Code: Select all

void proc1() {
    turn = 2;
    while((flag2 == 1) && (turn == 2)) ;
    // Critical section
    gSharedCounter++;
    // Let the other task run
    flag1 = 1;
    flag1 = 0;
}
What is needed here is a fence.
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: Debate over CPU architecture (cache coerency) and Peters

Post by REDDemon »

yup worked with fence. The reason why It won't worked without it was really complex
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
mongoose7
Posts: 1227
Joined: Wed Apr 06, 2011 12:13 pm

Re: Debate over CPU architecture (cache coerency) and Peters

Post by mongoose7 »

It did appear to defy logic. However, you probably cannot ask the cache consistency code to solve multithreading issues for you. I thought that each processor could update each others cache, but in case of a conflict you cannot signal the other processor that the update has been ignored (and the data at that location is not what you wrote). It starts to look like multithreaded coding in hardware. Your example was interesting and instructive.
REDDemon
Developer
Posts: 1044
Joined: Tue Aug 31, 2010 8:06 pm
Location: Genova (Italy)

Re: Debate over CPU architecture (cache coerency) and Peters

Post by REDDemon »

Thanks, but the example came from my teacher :D. I just wrote my first spinlock and workstealing queue :O__
Junior Irrlicht Developer.
Real value in social networks is not about "increasing" number of followers, but about getting in touch with Amazing people.
- by Me
Post Reply