SciELO - Scientific Electronic Library Online

 

CLEI Electronic Journal

On-line version ISSN 0717-5000

Abstract

MONTEZANTI, Diego et al. SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters. CLEIej [online]. 2012, vol.15, n.3, pp.5-5. ISSN 0717-5000.

The challenge of improving the performance of current processors is met by increasing the integration scale. This brings a growing vulnerability to transient faults, whose impact increases on multicore clusters running large scientific parallel applications. The need to enhance the reliability of these systems, coupled with the high cost of rerunning an application from the beginning, motivates specific software strategies for the target systems. This paper introduces SMCV, a fully distributed technique that provides fault detection for message-passing parallel applications. It validates the contents of messages before they are sent, which prevents errors from propagating to other processes, and it leverages the intrinsic hardware redundancy of the multicore. SMCV achieves wide robustness against transient faults with reduced overhead, and strikes a trade-off between moderate detection latency and low additional workload.

Keywords: transient fault; silent data corruption; multicore cluster; parallel scientific application; soft error detection; message content validation; reliability.
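
The idea of validating message contents before sending can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes that the same computation is run by two replicas and that a message is released when both agree. All names here (`validated_send`, `TransientFaultDetected`, `make_flaky`) are hypothetical, the `send` argument is a plain callable standing in for a message-passing send, and Python threads stand in for replicas placed on separate cores.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TransientFaultDetected(Exception):
    """Raised when the two replicas produce different message contents."""


def validated_send(compute, send, pool):
    """Run `compute` twice in parallel; call `send` only if both results match."""
    first = pool.submit(compute)
    second = pool.submit(compute)
    a, b = first.result(), second.result()
    if a != b:
        # Mismatch: withhold the message so the error does not reach other processes.
        raise TransientFaultDetected("replica outputs differ; message not sent")
    send(a)
    return a


def make_flaky(good, bad):
    """Hypothetical fault injector: first call returns `good`, later calls return `bad`."""
    lock = threading.Lock()
    calls = [0]

    def compute():
        with lock:
            calls[0] += 1
            n = calls[0]
        return good if n == 1 else bad

    return compute


if __name__ == "__main__":
    outbox = []  # stands in for the network
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both replicas agree, so the message is sent.
        validated_send(lambda: b"partial sum: 42", outbox.append, pool)
        try:
            # The replicas disagree, so the message is withheld.
            validated_send(make_flaky(b"\x00\x01", b"\x00\x03"), outbox.append, pool)
        except TransientFaultDetected as err:
            print("detected:", err)
    print(outbox)
```

In this sketch a fault is noticed only when a message is about to leave the process, which is one way to read the abstract's trade-off between moderate detection latency and low additional workload: comparisons happen per message rather than per instruction.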


 

All the contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.