Fault-tolerance schemes for clusterheads in clustered mesh networks

Jason Zurawski, Dajin Wang

Research output: Contribution to journal › Article › Research › peer-review

Abstract

To improve the overall system performance for distributed systems using mesh as their underlying structure, a hierarchical approach was proposed in [7]. The hierarchical configuration divides the mesh into clusters, thus allowing for processing to occur in small local groups at the lower levels. After local operations, the results are passed to higher logical levels. This method has been shown to be able to significantly reduce the total communication cost for the entire system. This paper is concerned with the hierarchical system's ability to handle node failure. When using a hierarchical configuration, certain nodes in the mesh become more important to the overall system than others. It is important that the hierarchical system have a reorganising mechanism in case of node failure, in such a way that the performance gain from hierarchical configuration is salvaged as much as possible. The work presented in this paper focuses on minimising the loss of performance in the system hierarchy due to the presence of failing nodes. We will propose fault-tolerance schemes for that purpose. The performance results will be compared to that of an ideal, fault-free system. We will present strategies to reconstruct the hierarchy, accommodating to the situation that some nodes in the original hierarchy are not functioning anymore. To that end, new local heads may be selected and local nodes regrouped. We will also present experiment results that examine the effectiveness of the proposed schemes. Examples of both faulty and fault-free hierarchical mesh systems will be tested to quantify how good the proposed schemes are.

Original language: English
Pages (from-to): 271-287
Number of pages: 17
Journal: International Journal of Parallel, Emergent and Distributed Systems
Volume: 23
Issue number: 3
DOIs: 10.1080/17445760701640332
State: Published - 1 Jun 2008

Fingerprint

Hierarchical systems
Fault tolerance
Communication
Processing
Costs
Experiments

Keywords

  • Distributed processing
  • Fault tolerance
  • Hierarchical control
  • Interconnection networks
  • Mesh

Cite this

@article{e1e80ebfd1b0424ab74a5b87411af50b,
title = "Fault-tolerance schemes for clusterheads in clustered mesh networks",
abstract = "To improve the overall system performance for distributed systems using mesh as their underlying structure, a hierarchical approach was proposed in [7]. The hierarchical configuration divides the mesh into clusters, thus allowing for processing to occur in small local groups at the lower levels. After local operations, the results are passed to higher logical levels. This method has been shown to be able to significantly reduce the total communication cost for the entire system. This paper is concerned with the hierarchical system's ability to handle node failure. When using a hierarchical configuration, certain nodes in the mesh become more important to the overall system than others. It is important that the hierarchical system have a reorganising mechanism in case of node failure, in such a way that the performance gain from hierarchical configuration is salvaged as much as possible. The work presented in this paper focuses on minimising the loss of performance in the system hierarchy due to the presence of failing nodes. We will propose fault-tolerance schemes for that purpose. The performance results will be compared to that of an ideal, fault-free system. We will present strategies to reconstruct the hierarchy, accommodating to the situation that some nodes in the original hierarchy are not functioning anymore. To that end, new local heads may be selected and local nodes regrouped. We will also present experiment results that examine the effectiveness of the proposed schemes. Examples of both faulty and fault-free hierarchical mesh systems will be tested to quantify how good the proposed schemes are.",
keywords = "Distributed processing, Fault tolerance, Hierarchical control, Interconnection networks, Mesh",
author = "Jason Zurawski and Dajin Wang",
year = "2008",
month = "6",
day = "1",
doi = "10.1080/17445760701640332",
language = "English",
volume = "23",
pages = "271--287",
journal = "International Journal of Parallel, Emergent and Distributed Systems",
issn = "1744-5760",
publisher = "Taylor and Francis Ltd.",
number = "3",
}

Fault-tolerance schemes for clusterheads in clustered mesh networks. / Zurawski, Jason; Wang, Dajin.

In: International Journal of Parallel, Emergent and Distributed Systems, Vol. 23, No. 3, 01.06.2008, p. 271-287.

TY - JOUR
T1 - Fault-tolerance schemes for clusterheads in clustered mesh networks
AU - Zurawski, Jason
AU - Wang, Dajin
PY - 2008/6/1
Y1 - 2008/6/1
AB - To improve the overall system performance for distributed systems using mesh as their underlying structure, a hierarchical approach was proposed in [7]. The hierarchical configuration divides the mesh into clusters, thus allowing for processing to occur in small local groups at the lower levels. After local operations, the results are passed to higher logical levels. This method has been shown to be able to significantly reduce the total communication cost for the entire system. This paper is concerned with the hierarchical system's ability to handle node failure. When using a hierarchical configuration, certain nodes in the mesh become more important to the overall system than others. It is important that the hierarchical system have a reorganising mechanism in case of node failure, in such a way that the performance gain from hierarchical configuration is salvaged as much as possible. The work presented in this paper focuses on minimising the loss of performance in the system hierarchy due to the presence of failing nodes. We will propose fault-tolerance schemes for that purpose. The performance results will be compared to that of an ideal, fault-free system. We will present strategies to reconstruct the hierarchy, accommodating to the situation that some nodes in the original hierarchy are not functioning anymore. To that end, new local heads may be selected and local nodes regrouped. We will also present experiment results that examine the effectiveness of the proposed schemes. Examples of both faulty and fault-free hierarchical mesh systems will be tested to quantify how good the proposed schemes are.

KW - Distributed processing
KW - Fault tolerance
KW - Hierarchical control
KW - Interconnection networks
KW - Mesh
UR - http://www.scopus.com/inward/record.url?scp=42649133829&partnerID=8YFLogxK
U2 - 10.1080/17445760701640332
DO - 10.1080/17445760701640332
M3 - Article
VL - 23
SP - 271
EP - 287
JO - International Journal of Parallel, Emergent and Distributed Systems
JF - International Journal of Parallel, Emergent and Distributed Systems
SN - 1744-5760
IS - 3
ER -