A Mechanism for Scalable Redundancy in Parallel File Systems

Bradley W. Settlemyer.
“A Mechanism for Scalable Redundancy in Parallel File Systems.”
Clemson University Master’s Thesis, May 2006.

Abstract: As parallel file systems span larger and larger numbers of nodes to provide the performance and scalability that modern cluster applications require, the need for fault-tolerant, highly available file systems has arisen. Modern parallel file systems spanning tens, hundreds, or even thousands of servers will require fault tolerance to avoid job failure and catastrophic data loss due to a single disk failure or server loss. Effective fault tolerance in parallel file systems must provide a high degree of data resiliency, consistency, and scalable performance.
In this thesis we provide an in-depth description of the resiliency and consistency requirements of parallel file systems. We then describe a data replication mechanism that meets these requirements and provides scalable performance. We also provide an in-depth description of how the file system responds during a system fault and how the system may be recovered to its original, fully redundant state after a failure. Finally, we measure the performance of our proposed mechanism by implementing it in a popular parallel file system, PVFS2. We primarily focus on measuring the performance costs and scalability impacts associated with consistency and resiliency.
