Foster's Methodology: A Parallel Algorithm Design Methodology
Last time I talked about the several kinds of parallelism, now I'd like to talk about the methodology we can use in designing a parallel algorithms, The Foster's Methodology.
First, let's define first a Unit of Computation and call it a Task. A Parallel Computation is merely a set of Tasks. A Task can consist of:
-
Program
-
Local Memory
-
Collection of IO Ports
The Tasks in a Parallel Computation interact one another by sending messages thru channel.
Foster's Methodology is basically a sequence of steps to partitioning computation/data into pieces (tasks), determining communication between these pieces (tasks), grouping the pieces (tasks) that have intense communication to one another, then finally mapping these groups into several processors we have. We will look at these steps in more detail one by one.
Foster's Methodology:
-
Partitioning
-
Communication
-
Agglomeration
-
Mapping
1. Partitioning
Here we divide computation and data into pieces. There are two kind of partitioning:
-
Domain Decomposition: Here we divide data into pieces, then determine how to associate computations with the data. In other words, here we break Data Structure into smaller structure that build the original Data Structure. Let say, break a Matrix into smaller Matrices (subsets of the original Matrix), or maybe break a Matrix into its scalar elements if necessary. Then we determine how to do computation to these Matrices (subsets) or scalar elements. From there, we can see whether there are computations to the pieces that can be performed at the same time (read: in parallel).
-
Functional Decomposition: Here we divide computation into pieces, then determine how to associate data with the computations. In other words, we're recognizing the functions needed in a computation and how these functions related one another (the relation structure between these functions, which function has to be computed first, which functions need results from which functions, etc). Then we determine how data (in a particular Data Structure) flows thru these functions. From there, we can see whether there are computations that can be performed at the same time (read: in parallel).
Here's some Partitioning Guidenes:
-
Minimize redundant computations and redundant data storage.
-
Primitive Tasks (Pieces) are roughly the same size.
-
Number of Tasks (Pieces) is an increasing function of problem size.
2. Communication
Here we're looking at the communication structure between Tasks (Pieces) we have. We determine the Data Structure and values passed among tasks. There are two kinds of comunication:
-
Local Communication: Task needs value from a small number of other Tasks. We need to create channels ilustrating data flow.
-
Global Communication: Significant number of Tasks contribute data to perform a communication. It is not recommended to create channels for them early in the design. We will look into this later.
Here's some Communication guidelines:
-
Communication operations are balanced among Tasks.
-
Each Task communicate with only small group of neighbours.
-
Tasks can perform communications concurrently.
-
Tasks can perform computations concurrently.
3. Agglomeration
Here we group several Tasks (Pieces) into larger parts, larger Tasks. You might be wondering; "We've broken the computations into pieces then now we need to group the pieces again, what's the point?" Well, in this step, we're not putting the pieces back into the original size. We're just grouping "close" Tasks so that we can increase performance, maintain sclability of the program and simplify programming. Let say, we group two or more tasks that are intensely communicate one another into a larger group of Tasks. In MPI (Message Passing Interface) Programming, the goal is often to create one agglomerated Tasks per processor.
How Agglomeration can increase performance? As I said before, we increase performance by eliminating communication between primitive Tasks into consolidated Tasks. We can also combine Tasks into groups of "Sending Tasks" and "Receiving Tasks".
Here's some Agglomeration guidelines:
-
Locality of Parallel Algorithm has increased.
-
Replicated computations take less time than communications they replace..
-
Data replications doesn't affect scalability.
-
Agglomerated task has similiar computational (remember the Big-Oh? [:]) and communication costs.
-
Number of Tasks increases with problem size.
-
Number of Tasks suitable for likely target systems.
-
Trade-Off between agglomeration and code modification is reasonable.
4. Mapping
This is the process of assigning Tasks (Agglomerated Tasks) into processors that we have. There are situations where mapping is done by Operating System (centralized multiprocessor), and there are situations where we manually do the mapping (distributed memory system).
There is a conflicting goals in mapping activities:
Finding an optimal mapping of tasks itself is an NP-Hard problem. Like it or not, we need to rely on heuristics here.
Here's Mapping Decision Tree that can be used as a strategy:
Here's some Mapping guidelines:
-
Consider designs based on one task per processor and multiple task per processor
-
Evaluate Static and Dynamic task allocation
-
If Dynamic Allocation is chosen, the task allocator should not be a bottleneck to the system
-
If Static Allocation is chosen, ratio of tasks to processor at least 10:1.
That's about it. I will talk more on this on future posts, along with examples. Again, my message is that using a parallel API (or even just a multithreading API) is "easy". Designing Parallel Algorithms to solve a problem is where the "real challange" is.
Reference:
-
Quinn, Michael J., "Parallel Programming in C with MPI and OpenMP - Chapter 3 Parallel Algorithm Design", McGrawHill