An efficient fault-tolerant algorithm for distributed cloud services

Jameela Al-Jaroodi, Nader Mohamed, Klaithem Al Nuaimi

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    8 Citations (Scopus)

    Abstract

    Several approaches for fault-tolerance in distributed systems have been introduced; however, they require prior knowledge of the environment's operating conditions and/or constant monitoring of these conditions at run time. That allows the applications to adjust the load and redistribute the tasks when failures occur. These techniques work well when there is no high communication delay. Yet, this is not true in the Cloud, where data and computation servers are connected over the Internet and distributed across large geographic areas. Thus, they usually exhibit high and dynamic communication delays that make discovering and recovering from failures take a long time. This paper proposes a delay-tolerant fault-tolerance algorithm that effectively reduces execution time and adapts to failures while minimizing the fault discovery and recovery overhead in the Cloud. Distributed tasks that can use this algorithm include downloading data from replicated servers and executing parallel applications on multiple independent distributed servers in the Cloud. The experimental results show the efficiency of the algorithm and its fault tolerance feature.

    Original language: English
    Title of host publication: Proceedings - IEEE 2nd Symposium on Network Cloud Computing and Applications, NCCA 2012
    Pages: 1-8
    Number of pages: 8
    Publication status: Published - 2012
    Event: 2012 IEEE 2nd Symposium on Network Cloud Computing and Applications, NCCA 2012 - London, United Kingdom
    Duration: Dec 3 2012 to Dec 4 2012

    Publication series

    Name: Proceedings - IEEE 2nd Symposium on Network Cloud Computing and Applications, NCCA 2012

    Other

    Other: 2012 IEEE 2nd Symposium on Network Cloud Computing and Applications, NCCA 2012
    Country/Territory: United Kingdom
    City: London
    Period: 12/3/12 to 12/4/12

    Keywords

    • Cloud computing
    • Fault-tolerance
    • Heterogeneous distributed systems
    • Load balancing

    ASJC Scopus subject areas

    • Computer Networks and Communications
