SciELO - Scientific Electronic Library Online

 



CLEI Electronic Journal

On-line version ISSN 0717-5000

Abstract

MONTEZANTI, Diego et al. SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters. CLEIej [online]. 2012, vol.15, n.3, pp.5-5. ISSN 0717-5000.

The performance of current processors keeps improving largely through a higher integration scale. This brings a growing vulnerability to transient faults, whose impact increases on multicore clusters running large scientific parallel applications. The need to enhance the reliability of these systems, coupled with the high cost of rerunning an application from the beginning, motivates specific software strategies for the target systems. This paper introduces SMCV, a fully distributed technique that provides fault detection for message-passing parallel applications. It validates the contents of each message before it is sent, which prevents errors from being transmitted to other processes, and it leverages the intrinsic hardware redundancy of the multicore. SMCV achieves wide robustness against transient faults with reduced overhead, striking a trade-off between moderate detection latency and low additional workload.
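
The validate-before-send idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the redundancy takes the form of running the same computation twice and comparing the two results, and it uses Python threads as stand-ins for replicas placed on different cores. The names validated_send, MessageMismatch, compute and send are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class MessageMismatch(Exception):
    """Raised when the two replicas disagree on the outgoing message."""


def validated_send(compute, send, executor):
    # Assumption: redundancy means running the same computation twice.
    # On a multicore node the two replicas could be placed on different cores.
    first = executor.submit(compute)
    second = executor.submit(compute)
    a, b = first.result(), second.result()
    if a != b:
        # Stop here so that corrupted data never reaches another process.
        raise MessageMismatch("replicas disagree; message not sent")
    # Contents agree: hand the message to the (hypothetical) send callback.
    send(a)
    return a
```

In this sketch a fault is noticed only when a message is about to leave the process, which is one way to read the trade-off between detection latency and additional workload mentioned in the abstract: nothing is compared between sends, so the extra work stays small, but a corrupted value may go unnoticed until the next outgoing message.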

Keywords: transient fault; silent data corruption; multicore cluster; parallel scientific application; soft error detection; message content validation; reliability.


 

All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.