Optimizing Performance Through Parallelism: A Primer
<<< Previous		Next >>>

Multi-threaded Implementation

Background

In order to improve the speed of our example on symmetric multi-processing (SMP) machines, we need to make use of threads. We would like to stick to our design in the previous example, which means we need to find a way for each object to have its own thread of execution. Unfortunately, mixing C++ and pthreads is non-trivial, as pthread_create() expects a function which has been linked with C-style linking, not C++. We have solved this problem by creating an accessory class and a static member function:

class Threaded {
public:
  Threaded() {};
  virtual ~Threaded() {};
  static void *entry (void *ptr);
  virtual void *run ()=0;
};

void *Threaded::entry(void *ptr){
  return reinterpret_cast<Threaded*>(ptr)->run();
}

class CountPrimes : public Threaded {
public:
  CountPrimes(long start_, long stop_);
  long total();
  void *run();
private:
  long start;
  long stop;
  long count;
  bool counted;
  bool is_prime (long candidate);
};

void *CountPrimes::run(){
  long *return_val=new long(total());
  return reinterpret_cast<void*>(return_val);
}

The remainder of the CountPrimes object implementation is the same. The points to note here are:

The Threaded class is an abstract class.
The entry() function is a static member function which means that it has knowledge of the details of the Threaded class, but is not tied to a specific instance. It therefore does not go through name-mangling and can be used as a C style function. This is the function pointer we will pass to pthread_create() along with a pointer to the object to be threaded.
The run() member function is pure virtual, and as such must be implemented by any class derived from Threaded. This function will be the main execution point of the derived class, and its return value will represent the result of the computation. In the case of our CountPrimes class, this function simply calculates the total for the range and returns it.

Using Our New Object

We would like to retain the usage form of our serial implementation, and therefore add only one extra parameter which controls the number of threads which will be spawned to complete the task. Because we do not know beforehand how many objects (threads) will be needed, we will allocate them dynamically:

int main (int argc, char *argv[]){
  int num_threads(2);
  // were we called properly?
  if (argc<3)
    usage(argv[0]);
  if (argc==4)
    num_threads=atol(argv[3]);

  try {
    pthread_t *t_id=new pthread_t[num_threads];
    CountPrimes **counter=new CountPrimes*[num_threads];
    // start up the threads
    long start(atol(argv[1]));
    long stop(atol(argv[2]));
    long incr((stop-start)/num_threads);
    for (int i=0; i<num_threads; i++){
      counter[i]=new CountPrimes(start,(start+incr)>stop?stop:start+incr);
      start+=incr+1;
      pthread_create(&t_id[i],NULL,counter[i]->entry,counter[i]);
    }
    // now wait for the results
    long count=0;
    for (int i=0; i<num_threads; i++){
      void *return_val;
      pthread_join(t_id[i],&return_val);
      count+=*(reinterpret_cast<long*>(return_val));
      delete reinterpret_cast<long*>(return_val);
    }
    for (int i=0; i<num_threads; i++)
      delete counter[i];

    delete[] counter;
    delete[] t_id;

    if (count>1)
      cout << "There were " << count << " primes." << endl;
    else
      cout << "There was " << count << " prime." << endl;
  }
  catch (range_error e){
    cout << "Exception: " << e.what() << endl;
    exit(EXIT_FAILURE);
  }
  
  return EXIT_SUCCESS;
}

There are a few more subtleties in this example, so we will go through the code in a little more detail:

We set the default number of threads to 2, and then check to see if the user specified another value.
We create a dynamic array of pthread_t which will store the thread ids.
We create a dynamic array of pointers to CountPrimes objects. This is important since we need to instantiate each one with a different range.
In fact, we could not create a static array of CountPrimes objects since there is no default constructor. This design forces us to use the object correctly.
Next is a loop which will spawn the individual threads with their respective ranges of numbers to check. Note that we have made no attempt at thread balancing here. We will return to this concept later. The important point is that each CountPrimes object is instantiated dynamically, and its pointer is stored in the array created above. The thread is actually spawned through pthread_create(). We pass a pointer to the entry member function and a pointer to the object itself. The id of the spawned thread is stored in the thread id array.
Next we wait for the threads to finish computing their totals by using pthread_join() on the thread ids. Because we dynamically allocated space for the return value, we must deallocate it here. Each thread's total is added to count.
Finally, the CountPrimes objects are dealloacted, and the both the thread id array, and counter object pointer array are deallocated.

The complete program can be found at ftp://ftp.cim.mcgill.ca/pub/people/ericb/primes.tar.bz2. Here is an example of compiling and using the program:

bash$ g++ -o primes_threaded primes_threaded.cpp -lpthread
bash$ ./primes_threaded 0 10000 4
There were 1229 primes.

<<< Previous	Home	Next >>>
Serial Implementation		A Distributed Implementation