Darrell Long

Accessing Replicated Data in a Large-Scale Distributed System

Richard Golding, Darrell D. E. Long

July 2003

Abstract

Replicating a data object improves the availability of the data, and can improve access latency by locating copies of the object near to their use. When accessing replicated objects across an internetwork, the time to access different replicas is non-uniform. Further, the probability that a particular replica is inaccessible is much higher in an internetwork than in a local-area network (LAN) because of partitions and the many intermediate hosts and networks that can fail. We report three replica-accessing algorithms which can be tuned to minimize either access latency or the number of messages sent. These algorithms assume only an unreliable datagram mechanism for communicating with replicas. Our work extends previous investigations into the performance of replication algorithms by assuming unreliable communication.

We have investigated the performance of these algorithms by measuring the communication behavior of the Internet, and by building discrete-event simulations based on our measurements. We find that almost all message failures are either transient or due to long-term host failure, so that retrying messages a few times adds only a small amount to the overall message traffic while improving both access latency as long as the probability of message failure is small. Moreover, the algorithms which retry messages on failure provide significantly improved availability over those which do not.

BibTeX

@article{golding-ijcs91,
  author       = {Richard Golding and Darrell D. E. Long},
  title        = {Accessing Replicated Data in a Large-Scale Distributed System},
  journal      = {International Journal in Computer Simulation},
  pages        = {347-372},
  volume       = {1},
  number       = {4},
  month        = jul,
  year         = {1991},
}

Read full paper here.