AutoReplica

Automatic data replica manager in distributed caching and data processing systems

Zhengyu Yang, Jiayin Wang, David Evans, Ningfang Mi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

17 Citations (Scopus)

Abstract

Replication is widely used in data center storage systems for large-scale cyber-physical systems (CPS) to prevent data loss. Its main side effect, however, is the overhead of extra network and I/O traffic, which inevitably degrades the overall I/O performance of the cluster. To balance the trade-off between I/O performance and fault tolerance, this paper proposes a complete solution called "AutoReplica", a replica manager for distributed caching and data processing systems with tiered SSD-HDD storage. AutoReplica uses remote SSDs (connected by high-speed fiber) to replicate local SSD caches and so protect data. To balance load among nodes and reduce network overhead, we propose three approaches (ring, network, and multiple-SLA network) that automatically set up the cross-node replica structure while taking network traffic, I/O speed, and SLAs into account. To improve performance during migrations triggered by load balancing and failure recovery, we propose a migrate-on-write technique called "fusion cache" that seamlessly migrates and prefetches among local and remote replicas without pausing the subsystem. AutoReplica can also recover from different failure scenarios while limiting the degree of performance degradation. Lastly, AutoReplica supports parallel prefetching from multiple nodes, using a new dynamic optimizing streaming technique to improve I/O performance. We are currently implementing AutoReplica so that it can be easily plugged into commonly used distributed caching systems, and are solidifying our design and implementation details.
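
The abstract names a ring approach for the cross-node replica structure but does not spell out the algorithm. As a rough illustration only, the sketch below assumes the simplest reading: each node replicates its local SSD cache to the next node in a fixed ring order, and a failed node's cache is recovered from that successor. The function names `ring_replica_map` and `recovery_source` and the node names are hypothetical and do not come from the paper; the paper's actual placement also weighs network traffic, I/O speed, and SLAs, which this sketch ignores.

```python
from typing import Dict, List


def ring_replica_map(nodes: List[str]) -> Dict[str, str]:
    """Map each node to the peer that stores the replica of its SSD cache.

    Assumption (not from the paper): node i replicates to node (i + 1) mod n.
    """
    if len(nodes) < 2:
        raise ValueError("a ring needs at least two nodes")
    if len(set(nodes)) != len(nodes):
        raise ValueError("node names must be unique")
    n = len(nodes)
    return {nodes[i]: nodes[(i + 1) % n] for i in range(n)}


def recovery_source(replica_map: Dict[str, str], failed: str) -> str:
    """Return the node whose remote SSD holds the failed node's cache replica."""
    return replica_map[failed]


if __name__ == "__main__":
    # Hypothetical three-node cluster.
    ring = ring_replica_map(["node-a", "node-b", "node-c"])
    print(ring)
    print(recovery_source(ring, "node-c"))
```

Under this assumption every node stores exactly one remote replica, so the replication load is spread evenly, at the cost of ignoring link speed and per-node SLAs.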

Original language: English
Title of host publication: 2016 IEEE 35th International Performance Computing and Communications Conference, IPCCC 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509052523
DOIs: https://doi.org/10.1109/PCCC.2016.7820664
State: Published - 17 Jan 2017
Event: 35th IEEE International Performance Computing and Communications Conference, IPCCC 2016 - Las Vegas, United States
Duration: 9 Dec 2016 to 11 Dec 2016

Publication series

Name: 2016 IEEE 35th International Performance Computing and Communications Conference, IPCCC 2016

Other

Other: 35th IEEE International Performance Computing and Communications Conference, IPCCC 2016
Country: United States
City: Las Vegas
Period: 9/12/16 to 11/12/16

Keywords

  • Atomicity
  • Backup
  • Cache and Replacement Policy
  • Cluster Migration
  • Consistency
  • Device Failure Recovery
  • Distributed Storage System
  • Fault Tolerance
  • Parallel I/O
  • Replica
  • SLA
  • VM Crash

Cite this

Yang, Z., Wang, J., Evans, D., & Mi, N. (2017). AutoReplica: Automatic data replica manager in distributed caching and data processing systems. In 2016 IEEE 35th International Performance Computing and Communications Conference, IPCCC 2016 [7820664] (2016 IEEE 35th International Performance Computing and Communications Conference, IPCCC 2016). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/PCCC.2016.7820664