While GM automatically handles transient network errors such as dropped, corrupted, or misrouted packets, and while the GM mapper automatically reconfigures the network if links or nodes appear or disappear, GM cannot automatically handle catastrophic errors such as crashed hosts or loss of network connectivity without the cooperation of the client program.
When GM detects a catastrophic error, it temporarily disables the
delivery of all messages with the same sender port, target port, and
priority as the message that experienced the error, and GM informs the
client of catastrophic network errors by passing a status other than
GM_SUCCESS to the client's send completion callback routine. The
client program is then expected to call either gm_resume_sending() or
gm_drop_sends(), either of which reenables the
delivery of messages with the same sender port, target port, and
priority. This mechanism preserves the message order over the
prioritized connection between the sending and receiving ports, while
allowing the client to decide if the other packets that it has already
enqueued over the same connection should be transmitted or dropped.

Simpler GM programs, such as MPI programs, will typically consider GM
send errors to be fatal and will typically exit when they see a send
error. This is reasonable for applications running on small or
physically robust clusters where errors are rare and when users can
tolerate restarting jobs in the rare event of a network error. Poorly
written GM programs may simply ignore the error codes, which will cause
the program to eventually hang with no error indication when
catastrophic errors are encountered. This poor programming practice is
strongly discouraged: developers should always check the send completion
status. More sophisticated applications, such as high availability
database applications, will respond to the network faults, which appear
to the client as send completion status codes other than
GM_SUCCESS.

The send completion status codes are as follows:
A status reporting that the receiver indicated (in a call to gm_set_acceptable_sizes()) that the size of the message was unacceptable. This error indicates a programming error in the client software.
GM_SEND_DROPPED: The send was dropped at the client's request (the client called gm_drop_sends()). This status code does not indicate an error.
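
The following is a minimal sketch of how a send completion callback might sort the status it receives into three cases: success, a send dropped at the client's own request, and everything else. It does not use the GM headers. The enum values and the function sketch_classify() are hypothetical stand-ins invented for illustration, and only the two status names that appear in the text above are modeled individually.

```c
/* Hypothetical stand-ins for GM's status values; these are NOT the GM
   definitions. Only the two statuses named in the text get their own
   entry; every other status is lumped together. */
typedef enum {
    SKETCH_SUCCESS,        /* stands in for GM_SUCCESS */
    SKETCH_SEND_DROPPED,   /* stands in for GM_SEND_DROPPED */
    SKETCH_OTHER_FAILURE   /* any other non-success status */
} sketch_status_t;

/* What the (hypothetical) client decides to do with a completed send. */
typedef enum {
    ACTION_NONE,           /* send completed normally; nothing to do */
    ACTION_RECLAIM,        /* send was dropped on request; not an error */
    ACTION_RECOVER         /* connection is disabled; resume or drop */
} sketch_action_t;

/* Map a send completion status to the action the client should take. */
sketch_action_t sketch_classify(sketch_status_t status)
{
    switch (status) {
    case SKETCH_SUCCESS:
        return ACTION_NONE;
    case SKETCH_SEND_DROPPED:
        return ACTION_RECLAIM;
    default:
        return ACTION_RECOVER;
    }
}
```

A real callback would switch on the actual GM status type instead. The point of the sketch is only that the status is always inspected and that the dropped status is not treated as a fault.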

When the send completion status code indicates an error, a sophisticated
client program may respond by calling gm_resume_sending() or
gm_drop_sends(). gm_resume_sending() causes GM to
simply reenable delivery of subsequent messages over the connection, including
those that have already been enqueued. This would be the typical
response of a distributed database that assumes the underlying network
is unreliable and layers its own reliability protocol over GM.
gm_drop_sends() causes GM to drop all enqueued sends
over the disabled connection, return them to the client with
GM_SEND_DROPPED, and reenable the connection. This would
be the typical response of a program that wishes to reorder subsequent
communication over the connection in response to the error.
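
The difference between the two responses can be illustrated with a toy model of a single disabled connection. This is only a sketch under stated assumptions: sim_conn, sim_resume(), and sim_drop() are hypothetical names that model the behavior described above, and none of them is part of the GM API.

```c
#include <stddef.h>

#define SIM_MAX_QUEUED 8

/* Hypothetical toy model of one prioritized connection; not a GM type. */
typedef struct {
    int disabled;                 /* nonzero after a catastrophic error */
    int queued[SIM_MAX_QUEUED];   /* message ids enqueued by the client */
    size_t count;                 /* number of valid ids in queued[] */
} sim_conn;

/* Models the resume response: reenable delivery and keep every message
   that was already enqueued. Returns how many messages remain queued. */
size_t sim_resume(sim_conn *c)
{
    c->disabled = 0;
    return c->count;
}

/* Models the drop response: hand every enqueued message back to the
   client through dropped[] (room for SIM_MAX_QUEUED ids), empty the
   queue, then reenable delivery. Returns how many were handed back. */
size_t sim_drop(sim_conn *c, int *dropped)
{
    size_t n = c->count;
    for (size_t i = 0; i < n; i++)
        dropped[i] = c->queued[i];
    c->count = 0;
    c->disabled = 0;
    return n;
}
```

In the model, both calls leave the connection enabled. They differ only in whether the previously enqueued messages stay in the queue or come back to the client, which is what lets the client reorder its traffic after a drop.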
Note that each of the fault response functions (gm_drop_sends() and
gm_resume_sending()) requires a send token. This send token
is implicitly returned to the caller when the callback function passed
to gm_drop_sends() or gm_resume_sending() is called by GM.
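
A client that tracks its own send tokens might account for this rule roughly as follows. The sketch assumes the client keeps a simple counter. token_pool, token_take(), and token_give_back() are hypothetical helpers written for illustration and are not provided by GM.

```c
/* Hypothetical send-token counter kept by the client; not part of GM. */
typedef struct {
    int free_tokens;   /* send tokens the client may currently spend */
} token_pool;

/* Spend one token before calling a fault response function.
   Returns 1 if a token was available, 0 if the pool was empty. */
int token_take(token_pool *p)
{
    if (p->free_tokens <= 0)
        return 0;
    p->free_tokens--;
    return 1;
}

/* Call from the callback that was passed to the fault response
   function: that is the point at which the token comes back. */
void token_give_back(token_pool *p)
{
    p->free_tokens++;
}
```

With this bookkeeping, a client that has spent all of its tokens on ordinary sends knows it must wait for a completion callback before it can respond to a fault.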