TY - GEN
T1 - AutoReplica
T2 - 35th IEEE International Performance Computing and Communications Conference, IPCCC 2016
AU - Yang, Zhengyu
AU - Wang, Jiayin
AU - Evans, David
AU - Mi, Ningfang
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2017/1/17
Y1 - 2017/1/17
N2 - Replication is widely used in data center storage systems for large-scale Cyber-physical Systems (CPS) to prevent data loss. However, its main side effect is extra network and I/O traffic, which inevitably degrades the overall I/O performance of the cluster. To balance the trade-off between I/O performance and fault tolerance, we propose a complete solution called "AutoReplica", a replica manager for distributed caching and data processing systems with tiered SSD-HDD storage. AutoReplica uses remote SSDs (connected by high-speed fiber) to replicate local SSD caches and thereby protect data. To balance load among nodes and reduce network overhead, we propose three approaches (ring, network, and multiple-SLA network) that automatically set up the cross-node replica structure while taking network traffic, I/O speed, and SLAs into account. To improve performance during migrations triggered by load balancing and failure recovery, we propose a migrate-on-write technique called "fusion cache" that seamlessly migrates and prefetches among local and remote replicas without pausing the subsystem. Moreover, AutoReplica can recover from different failure scenarios while limiting performance degradation. Lastly, AutoReplica supports parallel prefetching from multiple nodes with a new dynamic optimizing streaming technique to improve I/O performance. We are currently implementing AutoReplica so that it can be easily plugged into commonly used distributed caching systems, and are solidifying our design and implementation details.
KW - Atomicity
KW - Backup
KW - Cache and Replacement Policy
KW - Cluster Migration
KW - Consistency
KW - Device Failure Recovery
KW - Distributed Storage System
KW - Fault Tolerance
KW - Parallel I/O
KW - Replica
KW - SLA
KW - VM Crash
UR - http://www.scopus.com/inward/record.url?scp=85013408949&partnerID=8YFLogxK
U2 - 10.1109/PCCC.2016.7820664
DO - 10.1109/PCCC.2016.7820664
M3 - Conference contribution
AN - SCOPUS:85013408949
T3 - 2016 IEEE 35th International Performance Computing and Communications Conference, IPCCC 2016
BT - 2016 IEEE 35th International Performance Computing and Communications Conference, IPCCC 2016
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 9 December 2016 through 11 December 2016
ER -