Dynamic Sparse Data Exchange with MPI_IBARRIER
Video Summary
The video below is a short summary of this article in presentation format. Read on for more details, along with links to related work and example code.
Background
For this guide, we are going to lean heavily on the work of Torsten
Hoefler1 in designing scalable communication protocols for dynamic
sparse data exchange using standard MPI-3 functionality. Over the years,
when discussing MPI scalability issues with users whose applications
have dynamic communication patterns, I often point to this paper as a
way to potentially improve their performance. The reason is that they
typically use MPI collectives (usually MPI_ALLTOALL or MPI_ALLTOALLV)
to perform their data exchanges. However, MPI “all” collectives can be
expensive at scale. Every process contributes something to the eventual
output seen at every other process. In practice they act as a sort of
heavyweight global synchronization.
Users with sparse data might look to MPI_ALLTOALLV to lighten the
communication load and memory consumption during communication.
MPI_ALLTOALLV allows each process to contribute different amounts of
data to the collective. In a sparse exchange, many processes may
contribute no data at all. This can be effective when the communication
pattern is regular over time, but not so much when it is irregular. The
issue is that MPI_ALLTOALLV requires each process to specify the amount
of data it will receive from every other process in the collective. If
this information is not known a priori, then typically a user will
first have to perform another MPI_ALLTOALL with the necessary
sendcounts for each process before starting the MPI_ALLTOALLV. Remember
that we are trying to avoid “all” collectives. Performing a second one
is not likely to help performance!
Other MPI Features
What other options does MPI provide? When neighborhood collectives were originally proposed to the MPI Forum, they were called “sparse” collectives. However, these operations assume a fixed set of neighbors. Any change to the process communication topology requires a new, expensive-to-create communicator object with the updated topology attached. Any potential performance gains from the sparser communication operations will likely be wiped out by the overhead of creating new topology-aware communicators, as the sketch below illustrates.
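To make that overhead concrete, here is a minimal sketch of a neighborhood-collective exchange. It borrows the neighbors, sendbuf, recvbuf, and NUM_DOUBLES names used in the examples later in this article, and it assumes a symmetric pattern (the same ranks are sources and destinations); treat it as an illustration rather than a complete program.

/* Sketch: neighborhood collective exchange over a fixed set of neighbors.
 * Assumes `neighbors`/`num_neighbors` describe both sources and destinations
 * and that sendbuf/recvbuf each hold num_neighbors * NUM_DOUBLES doubles. */
MPI_Comm graph_comm;
MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                               num_neighbors, neighbors, MPI_UNWEIGHTED,
                               num_neighbors, neighbors, MPI_UNWEIGHTED,
                               MPI_INFO_NULL, 0 /* no reorder */, &graph_comm);

/* The exchange itself is sparse and relatively cheap... */
MPI_Neighbor_alltoall(sendbuf, NUM_DOUBLES, MPI_DOUBLE,
                      recvbuf, NUM_DOUBLES, MPI_DOUBLE, graph_comm);

/* ...but if the neighbor list changes next iteration, the topology-aware
 * communicator must be freed and rebuilt, a heavyweight collective step. */
MPI_Comm_free(&graph_comm);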
MPI one-sided communication is well-suited to irregular communication patterns, but real-world adoption in applications is low. The reason is that the one-sided model is more complex to program, both from an application perspective and from an MPI implementation perspective. This leads to unpredictable performance from system to system that often requires deep understanding of the architecture to diagnose and address.
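For completeness, here is a minimal sketch of what a fence-synchronized one-sided version of the exchange might look like. The window layout (one NUM_DOUBLES slot per possible source rank) and the wsize/wrank variables are assumptions for illustration, reusing the buffer names from the examples below.

/* Sketch: one-sided exchange using MPI_Put with fence synchronization.
 * Each process exposes recvbuf as a window with one slot per rank;
 * senders put their data directly into their own slot at the target. */
MPI_Win win;
MPI_Win_create(recvbuf, (MPI_Aint)wsize * NUM_DOUBLES * sizeof(double),
               sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

MPI_Win_fence(0, win);
for (int j = 0; j < num_neighbors; j++) {
    int dest = neighbors[j];
    MPI_Put(sendbuf + (dest * NUM_DOUBLES), NUM_DOUBLES, MPI_DOUBLE,
            dest, (MPI_Aint)wrank * NUM_DOUBLES, NUM_DOUBLES, MPI_DOUBLE, win);
}
MPI_Win_fence(0, win);   /* recvbuf now holds data from any rank that targeted us */

MPI_Win_free(&win);

Note that the fences are themselves collective synchronizations, so a production code would likely move to passive-target locking instead, which adds exactly the kind of complexity described above.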
A Use for MPI_IBARRIER
If we limit ourselves to traditional point-to-point messaging and
collectives, but want to avoid using an “all” collective due to
scalability issues, what is left? The key is another collective
operation added in MPI-3: MPI_IBARRIER. It may not be immediately
obvious how a nonblocking barrier is useful to applications, but the
key is in the way that participants notify that they have reached a
certain point in the execution without blocking on that
synchronization. Rather, they are free to continue communicating with
other processes. The use of MPI_IBARRIER and nonblocking synchronous
send operations allows each process to dynamically modify its
communication neighbors in a more lightweight manner than a traditional
collective approach. The literature refers to this as the “nonblocking
census” algorithm.
Example
Let’s take a look at an example using both MPI_ALLTOALLV and the
nonblocking census algorithm to see how they perform in practice. To
simulate the dynamic nature of the communication, we will randomly
select communication partners each iteration. Each process communicates
with a “sparse” 10% of the other processes. The size of each message is
constant for simplicity.
MPI_ALLTOALLV data exchange
for (int i = 0; i < NUM_ITERS; i++) {
    /* determine neighbors for this iteration */
    int num_neighbors;
    int *neighbors = get_neighbors(&num_neighbors);

    start = MPI_Wtime();

    /* build sendcounts: NUM_DOUBLES for each neighbor, 0 for everyone else */
    memset(sendcounts, 0, sizeof(int) * wsize);
    for (int j = 0; j < num_neighbors; j++) {
        sendcounts[neighbors[j]] = NUM_DOUBLES;
    }

    /* exchange counts so every process knows how much it will receive */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

    /* perform the actual sparse data exchange */
    MPI_Alltoallv(sendbuf, sendcounts, displs, MPI_DOUBLE,
                  recvbuf, recvcounts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    end = MPI_Wtime();
    total_time += end - start;

    free(neighbors);
}
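The snippet above assumes the buffers and displacement arrays have already been set up. A minimal sketch of that setup might look like the following; the flat layout, with one NUM_DOUBLES slot per rank and the same displacements used on the send and receive sides, is an assumption for illustration.

/* Assumed setup (requires <stdlib.h> and <mpi.h>): one NUM_DOUBLES-sized
 * slot per rank in each buffer, same displacements for send and receive. */
int wsize, wrank;
MPI_Comm_size(MPI_COMM_WORLD, &wsize);
MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

int *sendcounts = malloc(sizeof(int) * wsize);
int *recvcounts = malloc(sizeof(int) * wsize);
int *displs     = malloc(sizeof(int) * wsize);
double *sendbuf = malloc(sizeof(double) * wsize * NUM_DOUBLES);
double *recvbuf = malloc(sizeof(double) * wsize * NUM_DOUBLES);

for (int r = 0; r < wsize; r++) {
    displs[r] = r * NUM_DOUBLES;   /* element offsets, not bytes */
}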
MPI collectives are attractive for their brevity. A lot of complexity is abstracted away by the collective interfaces, resulting in relatively short code. Compare that to the nonblocking census next.
Nonblocking Census (MPI_IBARRIER + MPI_ISSEND)
for (int i = 0; i < NUM_ITERS; i++) {
    /* determine neighbors for this iteration */
    int num_neighbors;
    int *neighbors = get_neighbors(&num_neighbors);

    start = MPI_Wtime();

    /* post a nonblocking synchronous send to each neighbor; these only
     * complete once the receiver has matched them */
    for (int j = 0; j < num_neighbors; j++) {
        int dest = neighbors[j];
        double *buf = sendbuf + (dest * NUM_DOUBLES);
        MPI_Issend(buf, NUM_DOUBLES, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &sreqs[j]);
    }

    MPI_Request barrier_req = MPI_REQUEST_NULL;
    while (1) {
        MPI_Status status;
        int flag;

        /* check for incoming messages */
        MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &status);
        if (flag) {
            int src = status.MPI_SOURCE;
            double *buf = recvbuf + (src * NUM_DOUBLES);
            MPI_Recv(buf, NUM_DOUBLES, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        if (barrier_req != MPI_REQUEST_NULL) {
            MPI_Test(&barrier_req, &flag, MPI_STATUS_IGNORE);
            if (flag) {
                /* barrier completed, this iteration is done */
                free(neighbors);
                end = MPI_Wtime();
                total_time += end - start;
                break;
            }
        } else {
            MPI_Testall(num_neighbors, sreqs, &flag, MPI_STATUSES_IGNORE);
            if (flag) {
                /* all sends have been matched, announce it via the barrier */
                MPI_Ibarrier(MPI_COMM_WORLD, &barrier_req);
            }
        }
    }
}
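Both versions rely on a get_neighbors() helper that is not shown above. A possible implementation, consistent with the description of randomly choosing roughly 10% of the other ranks each iteration, might look like this; the sampling scheme is an assumption for illustration.

/* Hypothetical helper: randomly pick ~10% of the other ranks as this
 * iteration's communication partners. Requires <stdlib.h> and <mpi.h>,
 * and assumes srand() was seeded differently per rank at startup
 * (e.g., srand(wrank + 1)) so ranks pick different partners. */
static int *get_neighbors(int *num_neighbors)
{
    int wsize, wrank;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    int *neighbors = malloc(sizeof(int) * wsize);
    int count = 0;
    for (int r = 0; r < wsize; r++) {
        if (r != wrank && (rand() % 10) == 0) {   /* ~10% of other ranks */
            neighbors[count++] = r;
        }
    }
    *num_neighbors = count;
    return neighbors;
}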
As you can see, the nonblocking census is longer and involves more complex logic. For a user, there must be a clear benefit to the longer version if it is to be developed and maintained in application codes. For our experiment, we will run 1000 iterations of each algorithm and compare the average communication time. For this run, we use 8 JLSE Skylake nodes with 56 processes per node to fully subscribe the CPU cores. The MPI implementation is MPICH 4.2.3.
Experiment
# mpiexec -bind-to core -ppn 56 ./dsde
avg alltoallv time = 27.182003
avg nonblocking census time = 18.780934
Takeaway
The result above shows roughly a 30% reduction in communication time using the nonblocking census algorithm. But this is only for our synthetic example. Users should consider this approach for their application if it fits the criteria we discussed: that is, if each process has relatively few communication partners, and those partners change from iteration to iteration. If so, it may be worth trying out the alternative and running some representative experiments for comparison. If the benefits are significant, they could justify the added complexity in your code!