SciELO - Scientific Electronic Library Online

 
vol.15 número3Parallel Computing Applied to Satellite Images Processing for Solar Resource EstimatesFacial Recognition Using Neural Networks over GPGPU índice de autoresíndice de assuntospesquisa de artigos
Home Pagelista alfabética de periódicos  

Serviços Personalizados

Journal

Artigo

Links relacionados

Compartilhar


CLEI Electronic Journal

versão On-line ISSN 0717-5000

Resumo

MONTEZANTI, Diego et al. SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters. CLEIej [online]. 2012, vol.15, n.3, pp.5-5. ISSN 0717-5000.

The challenge of improving the performance of current processors is achieved by increasing the integration scale. This carries a growing vulnerability to transient faults, which increase their impact on multicore clusters running large scientific parallel applications. The requirement for enhancing the reliability of these systems, coupled with the high cost of rerunning the application from the beginning, create the motivation for having specific software strategies for the target systems. This paper introduces SMCV, which is a fully distributed technique that provides fault detection for message-passing parallel applications, by validating the contents of the messages to be sent, preventing the transmission of errors to other processes and leveraging the intrinsic hardware redundancy of the multicore. SMCV achieves a wide robustness against transient faults with a reduced overhead, and accomplishes a trade-off between moderate detection latency and low additional workload.

Palavras-chave : transient fault; silent data corruption; multicore cluster; parallel scientific application; soft error detection; message content validation; reliability.

        · resumo em Espanhol     · texto em Inglês     · Inglês ( pdf )

 

Creative Commons License Todo o conteúdo deste periódico, exceto onde está identificado, está licenciado sob uma Licença Creative Commons