MIT LCS TR 205: naming and synchronization in a decentralized computer system

Kragen Sitaker kragen at pobox.com
Mon May 31 20:26:34 EDT 2004


David Patrick Reed's thesis, entitled "Naming and Synchronization in a
Decentralized Computer System", and also known as MIT LCS TR 205 (MIT
Laboratory for Computer Science technical report 205), is a very
interesting read; he describes a design for a distributed system that
provides ACID properties in the face of node autonomy, node failures,
and network failures, by means of optimistic synchronization.

I haven't been able to find a text version of
it yet, although there's a PDF of page scans at
http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TR-205.pdf.
I transcribed 20 pages in passing, mostly HTMLified except for <p> tags;
they are here:

<h3>1.1.4 Need for a Robust Interface</h3>

An important class of failures in a decentralized system result in
either temporary or permanent loss of availability of some set of
resources.  Examples include communications failure that might cut off
access to some set of nodes and data objects from some using program,
a crash of a computer that might have similar effects in addition to
destroying the state of any computations in the middle of execution,
and detected software errors indicating that an object is in an
impossible state.

<pageno>15</pageno>

These failures can occur at any time an attempt is made to use some
resource.  A program that executes in the distributed environment must
always be prepared to discover that some resource it is using is
suddenly unavailable for some reason.

Autonomy can also lead to reasons for resources to become suddenly
unavailable.  The owner of a node may suddenly turn off his machine,
revoke access rights for a particular set of objects granted to some
program while that program is using them, etc.  Such examples are not
limited to distributed systems.  In Multics, for example, a
computation may at any microsecond of its execution discover that its
right to use a segment it has been reading for the last ten minutes
have been revoked.

Since abstract operations and objects are constructed out of simpler
ones, such that the execution of an abstract operation may involve
many steps dealing with many different resources, there are many
different points at which loss of availability can strike the
implementation.  Nonetheless, a desirable feature of abstractions is
that their behavior should not be strongly dependent on the
implementation.  Thus, an attempt to perform an operation on some
abstract object ought to have a well defined effect if it cannot
complete due to some loss of availability encountered.during its
execution.  This effect should be specified in terms of the abstract
view of the operation, not in terms of the program and data it uses in
its implementation.

Abstract operations that modify the (abstract) state of some abstract
object or objects become quite difficult to support in an environment
where sudden loss of availability must be expected.  Since the
implementation of the abstract operation makes the modification by a
number of steps, there may be points during the execution of these
steps where loss of availability of some resource leaves the
implementation at a point that has no meaning in terms of the
abstractions being implemented.  Some recovery from this is necessary.
The simplest approach to recovery is to undo the steps already taken
in the operation, so that the operation can be thought of as having no
effect if any resource it uses is unavailable during the operation.

<pageno>16</pageno>

Unfortunately, in the case of failures undoing what has already been
done is not straightforward.  Further loss of availability may make it
impossible to undo what has been done by simply reversing the changes
made.  If the resource that becomes unavailable is the processor that
is executing the operation, we have an extreme case of being unable to
undo what has been done.  We may not even know what has been done so
far.

Correction of failures by undoing results also interacts strongly with
synchronization.  In order to properly undo a computation, one must
also ensure that independent computations do not observe the transient
state during which the abstract operation was attempted but not yet
undone.

Although the user of an operation may not be in a position to know how
to recover from a failure, the system as a whole (all nodes involved
in the operation) can maintain this knowledge.  In order to do this,
the system must be aware of interfaces, and must be able to decide
that a computation has failed and effect the undoing of operations
when a failure occurs.  The system, in order to decide that a
computation has failed, cannot depend on the program or the user of
the program, since one or both of these may have also failed.  Nor can
the system depend on being able to access all of the nodes containing
objects that have to be corrected at the time failure is detected.
Consequently, the algorithms used by the system for recovery must be
very carefully designed to work correctly in the face of the same loss
of availability that caused the original failure.

An alternative approach that might be taken to handle failures that
result in loss of availability is to build the system so that such
failures never show through to the programs executing on the system.
Essentially this approach amounts to guaranteeing availability.  It is
usually possible to guarantee that resources are available to
computations that use them given either that the computation can
afford to wait, or that enough money is allocated to buy sufficient
redundancy within the system to reduce the probability of failure.
The tradeoff is not always possible, however.  Money is often in short
supply.  Computations may often be executing on behalf of an
interactive user at a terminal who cannot afford to wait until some
remote node his program had started to use is repaired.  Consequently,
the approach of having the system provide a mechanism that allows
undoing of computations that fail in the middle is often the best.

<pageno>17</pageno>

The case where the owner of a node decides to make it unavailable
differs slightly from the failure case in that by reducing the owner's
autonomy it is possible to reduce the expectation that loss of
availability of this sort interferes in a bad way with users of shared
objects.  Nonetheless, the name of the game is to permit as much
autonomy as possible.  Were the owner to discover that a bug was
allowing remote users to access too much of his data, it would be nice
if he could shut off access to his shared objects immediately, even in
the middle of "atomic" actions, should that action not cause his own
data to come to harm.  Analogously, even in a central system,
protection mechanisms that allow for immediate access revocation can
introduce severe malfunctions into operations on shared data if
invoked at the wrong time.  Consequently, the strong degrees of
autonomy allow actions by owners that look a lot like unpredictable
failures from the user's point of view.

As a result of these arguments, object interfaces that can handle
sudden loss of availability are absolutely essential in the autonomous
distributed system environment.

The model of failure recovery we have specified is closely allied with
the <i>termination model</i> of exception handling espoused in the
exception handling mechanisms of the CLU language
[<endnote>Liskov77b</endnote>].  Upon encountering a failure that
prevents execution of the module, the execution of the module is
terminated.  In the CLU termination model, the effect
<error>of</error> the module for each type of failure is specified as
part of the interface; in contrast, we have taken the "stronger" view
that a failure is to be made equivalent to never executing the module,
unless that module explicitly chooses otherwise.  Because the
programmer of a module cannot be expected to know about all possible
failures that may result during the steps of execution of the module,
the "stronger" view is safer, handling unexpected failures more
effectively.

An alternative exception handling mechanism is the <i>resumption
model</i> described by Goodenough[<endnote>Goodenough75</endnote>] and
Levin[<endnote>Levin77</endnote>], in which the module encountering an
error is suspended and possibly resumed after recovering from the
error.  Such a model did not seem appropriate for handling failures
resulting in loss of availability because a) the executing module may
lose its state, b) after an availability loss, it is difficult to tell
a handler what to do to recover from the <error>the</error> error, and
c) it is hard to design the handler of such errors without
<pageno>18</pageno> it having to include detailed knowledge of the
internal workings of a module, so that the level of abstraction of the
interface is compromised.

<h3>1.1.5 Conversational Interactions with Multiple Machines</h3>

Another aspect of the distributed system organization is that it ought
to allow on-line construction of unplanned-for actions determining the
state of shared objects at several nodes, and possibly even modifying
other nodes as the result of some decision made by a person sitting at
a terminal.  An important special case occurs when the several
machines are independently designed databases.  Here the problem is
that the program and its actions are being created as the program is
executed, and outside the control of the system itself.
Synchronization and failure management techniques that work when the
program is executing completely under the control of the
<error>machine</error> may not work.  In addition, even if programs
can be prevented from making mistakes by some kind of verification
method built into the system, the suer will make mistakes, and should
have at hand means to recover from his mistakes.  If the failure
management techniques built into the system can generalize to the case
of recovering from user mistakes, this would tremendously aid in
conversational use of systems.

<h3>1.2 Naming Mechanism as a Solution</h3>

This dissertation describes a system called NAMOS (Naming Applied to
Modular Object Synchronization).  NAMOS consists of a unified approach
to the problems of synchronization and reliability just described.  Of
particular concern are the problems of constructing modular systems
and the problems of unplanned concurrency and unexpected failure that
we expect to arise in the construction of distributed systems.

The central idea of the thesis is an unusual view of synchronization
of accesses to shared data.  Traditionally synchronization has been
achieved by mutual exclusion.  The approach to synchronization used in
NAMOS is based on a mechanism for <u>naming</u> states of the system
and objects (hence the title of the dissertation).  To understand the
difference between the two approaches it is helpful to use a
non-computer analogy in which both techniques are well developed.

<pageno>19</pageno>

Consider a set of files (say personnel files) kept in a file storage
room of some organization.  We may think of each file folder, labeled
by a person's name, as an object of the database.  Occasionally, files
must be inspected or updated.  For example, one might wish to compute
the average salary of women in the company.  Or as a sample update,
one might wish to modify the salaries of the women in the company by
an appropriate percentage so that the average woman's salary is equal
to the average man's.  We impose a rather strong constraint on these
operations that is not usually required in such a set of files.  The
constraint is that these "transactions" on the files must be atomic
operations.  That is, during the computation of the average woman's
salary, no woman's salary is changed by some other clerk.  Similarly,
in updating the women's salaries, no other clerk is to change any of
the men's or the women's salaries, to ensure that equality is
achieved.  The strong constraint is not normally required in
human-managed systems because people are good at dealing with
inconsistency.  Computer programs using a database, on the other hand,
have few checks built in to deal with inconsistency, so avoiding
inconsistency is much more important.

One simple way to solve the problem of synchronizing the accesses made
by clerks to the database is to allow only one clerk into the room
containing the files at a time, and requiring that he remain there
until completing the transaction.  This is the basic idea of mutual
exclusion.  The clerk gains exclusive access to the entire state of
the database, and can then make the modifications needed to construct
the next state of the database.  A refinement can be made to this
approach, because normally clerks will need to access only a subset of
the database.  The refinement consists of having the clerk go into the
room and collect all of the files he needs to read and update,
replacing each file folder with a note that indicates that the clerk
has taken the file folder to his desk.  The clerk can then work with
the set of file folders at his desk in private, and other clerks can
work on different sets of files independently.  A clerk needing to
access a file folder that has been removed must wait for the folder to
be returned.  This approach to synchronization is analogous to locking
in the use of a computerized database.  Each clerk performs
transactions by gaining exclusive access to the group of files he
needs to access for some period of time.

<pageno>20</pageno>

The approach in NAMOS is quite different.  Assume that each file
folder contains only one item, say the salary of the person whose file
it is.  Instead of erasing the current salary and replacing it with a
new one when the salary is changed, the salary of an individual is
changed by adding to the file a new sheet of paper with the new
salary.  Each sheet of paper with a salary is stamped with the data
and time when the salary becomes effective.  How does this help?
First of all, consider a transaction that only reads the database,
such as the one to compute the average salary of all of the women.
Instead of having to seize all of the folders to prevent changes from
happening, the clerk can simply take his time in processing the
folders; he simply must choose a date and time for which the average
is to be effective, then go to each woman's folder and read out the
salary corresponding to that date and time.  Concurrently, other
clerks may update the women's salaries.  However, a consistent
computation of the average salary does not interfere with the clerks
that are updating salaries.

This strategy is equally applicable in a distributed database.  If the
personnel files of a company are distributed to the company's many
locations, it may be unacceptable to gather up all of the files of
women employees in order to send them to company headquarters to
compute the average.  Instead, company headquarters can send a memo to
clerks at each site with instructions to sum up the women's salaries
effective as of a common date and time.  The key idea here is that the
central headquarters can construct a name (consisting of the effective
date and time itself) for the state of the database at a particular
time, and then use that name to gain access to that state of the
database even though the database is under constant change.  NAMOS
provides a similar mechanism to computer programs --- the ability to
name particular states of the data stored in the system along with the
ability to use those names to gain access to the values of data
objects in the state.

Synchronizing updates in the salary database is somewhat tricky,
however.  The salary adjustment for the women is performed by having
the clerk decide upon a time when the adjustment is to be effective.
He then must compute the average salary of the men and the average
salary of the women as of just before the effective time of the
adjustment, to preclude any other changes to individual salaries
happening between the computation of the adjustment percentage and the
actual application of the adjustment to the women's salaries.  After
computing the <pageno>21</pageno> adjustment percentage, the clerk can
then go to each file and add a new salary sheet with the adjusted
salary.

The tricky part is in the interaction between the adjustment
transaction and any other clerk's attempt to read salaries.  Let T be
the time the adjustment is to be effective, and T-1 be the time at
which the averages are computed in order to compute the adjustment
percentage.  Although the adjustment is effective at T, it may be that
the performance of the adjustment is not completed until sometime
after T because the job is so difficult.  Then it is possible that
another clerk will want to know the salary of Jane Jones at time T
even though it has not been computed yet.  He may not even be aware
that an update is in progress.  The most recent salary of Jane Jones
recorded in her folder is that of an earlier time.  If he takes that
salary as the value effective at T, and then the adjustment is
completed, then he will be wrong.  There are two solutions to this
problem incorporated in NAMOS.  First of all, reading a value out of
the folder always includes making a notation on the sheet containing
the salary read that indicates the effective time of the read.  Thus,
if Jane Jones's salary is read as of time T, the sheet containing the
salary believed to be effective then is noted to have been read at
time T.  When the adjustment is applied, it will be discovered that
someone has already read a different salary than the one that has been
computed to be effective at time T.  The clerk doing the adjustment
would then have no choice but to undo all of the adjustments to
salaries he has made thus far, choosing a new time for his adjustment
to be effective.  This is NAMOS's primary solution, guaranteeing that
each transaction is atomic.

In the case of the salary file, however, aborting the adjustment of
all women's salaries because someone at random read Jane Jones's
salary is a bit impractical.  The solution does, however, work well in
many applications.  For cases like the salary adjustment, though,
NAMOS provides a second mechanism as a refinement (the refinement is
not described until chapter six).  Essentially the refinement amounts
to allowing the clerk to mark all of the women's folders in advance
that a change may be made to the salaries as of time T.  Thus a clerk
that queries Jane Jones's file will observe that the most recent
salary on file for Jane Jones is likely to change as of time T.  He
can then wait until the change is made, or he can ignore the notation
and read the most recent salary (deciding that it is effective as of
time T, and eventually aborting the adjustment as before).

<pageno>22</pageno>

A rather surprising property of the naming approach is that it is not
necessary to predict in advance what records might be accessed.
Instead the naming mechanism ensures that whenever a particular file
is accessed, the proper version is obtained.  In the locking or mutual
exclusion approach, a consistent state is observed by assembling all
of the relevant folders on one's desk at the same time.  In the naming
approach, one can get a folder at a time, read the correct version,
and return it to the files, and still obtain a consistent state of the
system.  This property is the essence of NAMOS's solution for the
problem of constructing new systems from existing ones.  In a locking
system, composition of a new function from several pre-existing ones
usually requires doing all seizing for the composite operation before
any component operation is executed.

NAMOS includes, as well as the naming approach to synchronization, an
approach to recovering from failures.  Essentially<error></error> the
problem can be modeled in the personnel file case as what happens when
a clerk has a heart attack while in the middle of carrying out a
transaction (or in a distributed system, if one of the planes carrying
a message to a clerk at a remote site crashes).  Part of the update
may have already been made, and no other clerk may know how much the
clerk had actually done.  In the mutual exclusion case, to prepare for
such an eventuality, each clerk will have to diligently record the old
value of any salaries he updates in some sort of update log.  He
cannot return any files to the file room until he has completed the
update, because it might happen that that file would be picked up by
another clerk, used,<error></error> returned, and then the original
clerk may have died.  The problem is that the original clerk then must
have his work erased, but the other clerk has read the output of that
work<error></error> and there is no record that he has read it.

NAMOS takes a different approach<error>,</error> that has a similar
effect.  Since an update generates a new version, it is only that
version that is affected if the clerk dies after performing part of an
update.  The solution is to add to the sheet of paper containing the
new version a note to the effect that<error></error> "this sheet is
part of a coordinated update being performed by clerk John Jones.  To
find out if the update is completed, call John and ask if update
0987654327 has been completed."  This has the effect that if John
dies, someone will answer the phone and say that John has died.  John
then merely has to record somewhere what the numbers of the updates he
has performed are, so that when he dies, the person taking over his
<pageno>23</pageno> job can answer the question.  The update in
progress when he dies will not be performed, which is what is desired.

In some databases, recovery from failure is achieved by recording in a
central place a log, called an update log, of all changes made by
transactions.  An entry in the log consists of a "transaction
identifier" to identify the transaction that changed the value, the
name of the object changed, and the old value before the change.  A
partly completed transaction can be undone by searching backwards
through the update log and undoing all of the changes made.  In a
sense, the multiple versions of objects kept by NAMOS encode the same
information as the update log, but the old versions of objects are
also directly accessible for transactions to read after changing the
objects, simplifying the synchronization of concurrent transactions.

With this example<error></error> the basic elements of NAMOS have been
characterized.  In the remainder of the thesis, we explore the actual
mechanisms needed to make the analogy work in real computer systems.
This involves some careful definition of the exact behavior of the
synchronization and reliability mechanisms, and some engineering
tradeoffs in making the system perform well.

What has not been captured in the analogy is the notion of
constructing transactions as modules that can be used in the building
of other transactions.  It was noted above that<error></error> because
synchronization is achieved by naming states of objects rather than
seizing control of the objects, the NAMOS approach does not require
all resources used by a module to be known to its caller.  Exploiting
this property requires designing in additional structure to the names
used for states of the system that is not present in the "date and
time effective" name used for states of the personnel files.  The
structure needed will be described in chapter three.

<h3>1.3 Related Work</h3>

The fields of synchronization and reliability are old, so any attempt
to list exhaustively the related literature would be unfortunately
long.  However, a number of relatively recent developments in these
fields have particular bearing on the problems and approaches
described in the thesis.

<pageno>24</pageno>

Distributed systems are a more recent phenomenon, and the related
literature on the kinds of approaches referred to here is very small.
The idea of distributed systems that are composed of autonomous nodes
integrated only loosely through agreements to use a common mechanism
for sharing information and services through the network is best
described by d'Oliveira[<endnote>DOliveira77</endnote>] and
Saltzer[<endnote>Saltzer78</endnote>].  Related work to develop a
system for integration of existing programs on a set of relatively
autonomous nodes is to be found in the National Software Works Project
described by Crocker[<endnote>Crocker75</endnote>].  Our work differs
from the National Software Works project in that it does not take as a
requirement the fact that existing programs and data need to be
integrated, and can thus define a much more coherent interface to be
used by programs to facilitate easy sharing.

The work in developing languages to support design of objects and
operations in which information about the details of the
implementation is largely hidden from the user is basic to our
approach to defining a system to support decentralized development of
software that is later shared.  The languages
CLU[<endnote>Liskov76</endnote>,<endnote>Liskov77a</endnote>] and
ALPHARD[<endnote>Wulf74</endnote>], along with the operating system
kernel Hydra[<endnote>Wulf75</endnote>], are essential precursors of
the present thesis.

Our approach to synchronization is derived from two distinct but
closely related ideas.  First, the notion of version numbering to
achieve synchronization is closely related to the synchronization
mechanism developed by the author and
Kanodia[<endnote>Reed78</endnote>].  Maintaining a sequence of
versions of an object was inspired by an idea present in both the
TENEX file system[<endnote>Bobrow72</endnote>] and the ITS file
system[<endnote>Eastlake69</endnote>] where multiple versions of a
file can be catalogued with successive version numbers while accessing
a file gets the most recently created version by default.  This
provides a primitive synchronization mechanism among the accesses to a
file, allowing a new version of the file to be written while the older
version is still being read, giving a sort of "read-locking" for free.

The second related group of ideas involves the use of timestamps for
synchronization.  Johnson and Thomas[<endnote>Johnson75</endnote>]
suggested the first such mechanism, which used timestamps plus an
underlying property of the network that messages are delivered in
order to assure synchronization of a simple distributed data base.
Thomas [<endnote>Thomas76</endnote>] elaborated this approach to allow
somewhat more general operations, while still requiring that the
database be completely replicated at each node of the system.
Lamport[<endnote>Lamport78</endnote>] describes <pageno>25</pageno>
the use of timestamps to define an ordering among requests that can be
used for synchronization, and a simple algorithm to achieve mutually
exclusive execution of sequences of <error>program</error> in time in
a distributed system based on timestamps.  The SDD-1 distributed
database system developed by Computer Corporation of America uses
timestamps internally to enforce a locking
strategy[<endnote>Bernstein77</endnote>,<endnote>Rothnie77</endnote>].

The major difference between the use of timestamps to achieve
synchronization by these projects and our use of pseudo-time in NAMOS
is that pseudo-time is a part of the "programmer's box of tools" in
NAMOS, whereas timestamps are hidden mechanisms in the above
approaches.

The notion of designing a program that accesses shared objects by
building it out of pieces that execute as if they are the sole agents
of change to shared objects has its roots in the concept of a database
transaction.  The essence of the database transaction concept is
described quite well by Eswaran, <i>et
al.</i>[<endnote>Eswaran76</endnote>].

Two nice overviews of the field of designing reliable systems are
Gray[<endnote>Gray77</endnote>] and
Randell[<endnote>Randell78</endnote>].  They define the basic strategy
of backward error recovery used to handle loss of availability within
NAMOS.

The implementation mechanism used to support possibilities, the commit
record, is closely related to the intentions list of the algorithm
used to achieve coordinated reliable updates in Lampson and
Sturgis[<endnote>Lampson76</endnote>].  Also related is the log
mechanism described by Gray[<endnote>Gray77</endnote>] for handling
the backwards recovery of failed transactions in a central data<error>
</error>base system.  The two-phase commit protocol described by Gray
is essentially the same as the notion of making all changes
conditional on the eventual state of a commit record in NAMOS.
However, NAMOS supports modular construction of operations and
objects, while these other systems do not necessarily do so.

Lampson and Sturgis's approach to recovery also shares our basic
assumptions about the properties of memory described in chapter two,
categorizing memory into two classes --- stable and volatile.  

<pageno>26</pageno>

In essence, the NAMOS system developed here differs from other work in
synchronization and reliability because it ties together four related
concepts in one framework --- synchronization, reliability, modular
construction of programs, and decentralized, distributed systems.  Not
only does this ensure that the solutions harmonize with one another,
but in fact each problem is simpler when solved in conjunction with
the others.  The simplification occurs because it is hard to separate
the four areas of synchronization, reliability, modular programming,
and distributed systems cleanly from one another --- one has to think
about certain parts of the other areas in dealing with any one area.

<h3>1.4 Thesis Plan</h3>

The remainder of the thesis is divided into two parts.  The first part consists of an explanation of the object<error> </error>level
interfaces, and the semantics of operations at this level.  The second part discusses the issues of implementing the interface on a
distributed system, handling low-level failures of communications and nodes, managing storage, and so forth.  The description is done
this way to adhere to standard top-down design.

In fact, I think it is interesting ton toe that the actual thinking
process was not top<error> </error>down at all --- I was much more
concerned with what was implementable than with what to implement.
Especially when dealing with fault-tolerant mechanisms, one has to be
careful not to ask for too much from a mechanism --- it may not be
achievable.  No matter how much one may try to sweep the consequences
of failures under the cover of the "system blanket," it keeps burning
its way through.  Consequently, some notion of the kinds of
implementations that are possible shows through in the interface
semantics, and I will allude to various notions in the description of
the interface.

Chapter two, then, provides some background in the problems of
implementation.  Failures of communications and nodes are described,
and an argument is made that low-level error correction (such as
link-by-link error correction, or reliable stream communication) is
insufficient to solve the problem of failure recovery.  Similarly,
some of the interaction between communications technology of the
packet network, reliability mechanisms, and proper synchronization
will be discussed.

<pageno>27</pageno>

Chapter three begins the discussion of the object interface.  Three
basic notions<error>,</error> the system history, creation of
hypothetical states, and the frozen system states in which
computations can be executed (called <i>pseudo-temporal
environments</i>), are described, and linguistic constructs that
reflect these notions are developed.  This chapter is long and
involved because it presents all of the aspects of the interface,
which is quite different in some respects from traditional
synchronization mechanisms.  Nonetheless, it is essential to the
understanding of the rest of the thesis.

Chapter four discusses several ways in which the object interface can be used.  A very important case, the creation of multi-node
<i>transactions</i> (as defined by Eswaran <i>et al.</i>[<endnote>Eswaran76</endnote>]), will be shown to be easily handled with the
object interface.  More importantly, the definition of unplanned transactions, and the creation of conversational transactions will be
shown to be easily accomplished.  Other uses, such as consistent recovery from permanent mistakes or errors (backup), will be
described.  Finally, some "unstructured" ways to use the mechanisms will be shown, for two reasons.  First, I want to show that even
with my scheme, all is not perfect, and second, if there is no way for a mechanism to be misused, one should suspect that there are
probably some perfectly reasonable uses that it cannot support.

Chapter five begins the discussion of implementation, talking about a mechanism for implementing the hypothetical-ness of hypothetical
states of the system.  The basic idea is to build at a low<error>-</error>level in the implementation data representations that allow
operations to be <i>recoverable</i> --- a term used by Gray [<endnote>Gray77</endnote>] to mean that there is a single instant during
their execution when they "happen".  If a failure occurs before this instant, it is as if nothing had happened, and if failure occurs
afterwards, the operation is guaranteed to appear to have completed correctly.  A mechanism called a <i>commit record</i> that I have
developed is described, and various implementations are discussed.

Chapter six continues the discussion of the implementation, talking about how the system state is built of states of individual objects
related through the concept of pseudo-time, and the representation of object histories.  Issues that are key to the implementation,
such as the necessary synchronization of clocks, management of storage for objects, and the details of representation of object
histories needed to ensure correct operation in the face of failure, are discussed.

<pageno>28</pageno>

Optimization of performance of the mechanisms, and ways to eliminate deadlock<error></error> are also discussed.

Finally, chapter seven summarizes the thesis, giving goals and directions for future work.  A particular issue, that of the
compatibility of the mechanisms of this thesis with specially designed hardware that makes the synchronization of objects on the same
system quite inexpensive, is elaborated in some depth.

<pageno>29</pageno>

<chapter>
<h1>Chapter Two</h1>
<h2>The communications system and storage system</h2>

In this chapter<error></error> I want to discuss the interactions of the low-level components of the distributed system with
synchronization and failure management.  The important components are the long-term memories of each node and the message passing
network that connects the nodes, allowing them to call upon each other for services and access to data.  In a sense, the
characteristics of the components constitute my assumptions about where the technology is and where it is likely to go.

The major problems with each component consist of reliability problems and synchronization problems.  In turn, reliability breaks down
into availability --- whether the component is available to have data placed into it and taken out of it --- and integrity --- whether
the data entrusted to the component is damaged or not.  Synchronization problems involve the ability to use the basic properties of
both the message system and the storage system to control the order of actions taken by computations in the system.  Since multiple
computations will be proceeding in parallel, with only loose coupling between the computations at best, the order of actions taken in
the system is relatively unconstrained.  As shall be discussed, the message system and the <error>storages</error> system each add the
opportunity for more unconstrained ordering of actions.

Both the message communications system and the data storage system are quite similar in function.  In each system, one places data into
the system with some tagging of the intended destination, and then later the data is taken out, selecting the data by means of its
destination tag.  The differences between the two are basically either technological or in their intended use.  Typically the data sent
in a message is intended to be transient, used only once or not at all, and in any case, used fairly promptly.  In contrast, the data
stored in a data record is to be saved for many potential later uses, that can be separated by quite a long time from the initial
transmission.

<pageno>31</pageno>

It is, in fact, the case that a communications system can be built on what seems to be a "memory" technology; for example, networks
have been built by connecting multiple processors to a shared bulk memory, such that messages to the other processors are stored in
queues polled by the other processors.  The distinction between such a system and a shared memory multiprocessor system is slight.  The
construction of a memory system using communications hardware is conceivable, but seems not to be a viable way to go (although once
upon a time, the cost per bit of delay lines was relatively quite cheap).  Thus we cannot make the simple argument that communications
and memory are basically different, and that we must therefore distinguish the two.

I do, however, assume that there are two components, the communications system and the storage system.  The communications system is an
abstraction designed to capture the notion of data transport.  The storage system is an abstraction designed to capture the notion of
long term memory of information.  These abstractions can be thought of as extreme points on a continuum that contains all real storage
and communications systems.

It is useful to distinguish the two components, given their basic similarity, for two reasons.  First, the autonomy property of the
distributed system argues against treating the shared network as a long-term repository of shared information.  Because the network is
shared, it should do as little as possible for its users in order to reduce user interdependence.  Second, in the message transmission
mechanism, the tradeoff between reliability, cost<error></error> and delay becomes very important because of the large physical
distances involved, whereas in a local node, reliability can be achieved with relatively little cost and delay.  Consequently, the
reliability strategies for data storage systems generally achieve a high degree of reliability in the transmission of information from
source to user, while a significantly lesser degree is generally provided by message communication systems.

Since<error></error> in a distributed system, messages are used to request remote actions, the properties of the message system both in
terms of reliability and synchronization have a serious effect on the ability to create actions that are composed of several subactions
initiated at several nodes by messages.  Taking no particular care to ensure reliability and synchronization of the delivery of
messages, the behavior of such an action (its semantics) in terms of its effect at <pageno>32</pageno>the multiple receiving nodes, and
its interactions with other such actions that may be initiated concurrently, is extremely complex to describe.

To reduce the complexity of describing the behavior of such actions, a method based on numbering messages can be used to transform the
problems of lack of integrity, variable delay, and duplication into a common problem, lost messages.  The problem of coping with the
unusual behavior of the message system is thus reduced to coping with the problem of coping with lost messages.

<h3>2.1 Reliability of Message Communications</h3>

Achieving integrity of the messages sent through the communications system is not usually a difficult problem.  One can get quite a
large amount of integrity by associating a checksum of an appropriate size with a message, checking upon receipt of all messages that
the checksum correctly matches the data in the message.  I am assuming that errors within messages are random.  The result of the use
of checksums is the transformation of all message content errors into lost message errors (thus transforming a question of the
integrity of the data into a question of the instantaneous availability of the communications path between source and destination).  By
the use of encryption of messages, one can also treat attempts to modify messages in transit as random corruption of data in the
unenciphered form of the message [<endnote>Kent76</endnote>].

Messages are used either to communicate information to, or to cause actions by, a computation at some other node.  In essence, then, it
is unimportant where unavailability or lack of integrity occurs --- the important thing is that the system as a whole provide
reliability from the source computation to the destination computation's use of the message.  Any guarantee of reliability of the
message system alone cannot ensure the reliable functioning of the system as a whole, unless we make the rather unreasonable assumption
that the only unreliable component of the distributed system is the message transport mechanism.  The reliability of the message system
itself is much less important than the function the message system provides for coordinating responses to failures both inside and
outside the message system.

<pageno>33</pageno>

To detect failure of a requested action, the standard mechanism is positive acknowledgment, i.e. when the action is performed, a message
is used to inform the requester that the action has been performed.  Of course, the need to wait for a positive response can lead to
some rather serious problems.  The basic problem lies in the knowledge that the requester has of the state of his action after a
requesting message has been sent.  If the requester receives a proper response, then it is sure that the action has been performed.
However, if it has received no response, then the requester only knows that the request may not have been processed, not that it has
not been wholly or partially processed.  Achieving reliable control of remote actions requires some tricky design of the remote
actions, so that a request may be repeated if no response has been gotten in an appropriate time, without causing errors due to running
the request more than once (such requests are <i>idempotent</i>).  Handling repeated requests will be discussed shortly.

As a basic assumption, I conjecture that the problem of unexpected loss of availability can be characterized by a request uncertainty
principle, stated as follows:

<blockquote><i>Once a remote action has been requested, the requester cannot always determine, in a bounded time, whether or not it has
occurred.</i></blockquote>

A program that requests remote actions must thus always be prepared to somehow handle the case that it has initiated a remote
operation, but cannot determine the status of its request.  In non-distributed systems, this case is usually so rare that it is not
explicitly considered in the design of software.

Unfortunately, if the requesting node fails, or chooses to give up after a while, it may be the case that it still does not know
whether the request has been processed.  It is important that the system give the requester the option of giving up without causing the
possibly partially completed action to leave the system in an irrecoverable state.  The option to give up on a request that has not yet
been completed adds no difficulties that are not already present due to the possibility of a failure of the requester, and adds to the
autonomy of the requesting node.

<pageno>34</pageno>

It is important to note that a certain part of the unreliability of the message system cannot be reduced by using more reliable
components.  The portion I refer to is that caused by autonomy.  The likelihood that a node owner will disconnect or shut off his
machine is independent of the innate hardware reliability.  Also, it is often not economically feasible to provide compete reliability
of the message system, especially where long-distance communication, with hazards of natural disasters, wars, etc., is involved in the
system.  For this reason, the unreliability of the message system must be taken for granted, and reflected in the application
programming interface.

<h3>2.2 Synchronization of Message Communications</h3>

The primary problems with message communications from the point of view of synchronization of remote actions are duplication and delay.
In most communications networks both of these problems arise normally, as a result of the internal structure of the networks.  Even
were the network design specialized to prevent duplication and varying delay on messages, however, protocols that attempt to ensure the
reliability of message communications will introduce these factors anyway.

Duplication and loss of messages can be characterized quite simply.  For every message sent in the system, that message will arrive at
its intended receiver any number of times, from none on up.  Delay can also be simply characterized for the purposes of the thesis ---
the individual arrivals of the copies of the message may be at any times later than the sending of the message.

Duplication is a problem in the use of communications systems because messages are usually used to cause actions at the receiver.
Depending on the kind of action, the repeated performance of the action requested by a message may be an error --- for example, a
message the requests the receiver to subtract one from some integer cell will, if no attempt is made to prevent repeated execution due
to duplicated messages, cause the cell to be decremented some number of times.  One way to avoid problems resulting from duplication is
to remember all messages ever received at the receiver, assuming that they are distinguishable.  If the receivers of the system all
ignore duplicated messages based on this information, then the behavior of the message system is simplified to the statement that for
every message sent, it is received at the intended receiver either once or never.

<pageno>35</pageno>

<!--
 LocalWords:  pageno CLU Liskov Goodenough Levin NAMOS pre d'Oliveira DOliveira
 LocalWords:  Saltzer Wulf Kanodia Eastlake SDD Rothnie Eswaran Lampson Sturgis
 LocalWords:  ness subactions
-->


More information about the Kragen-fw mailing list