<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0717-5000</journal-id>
<journal-title><![CDATA[CLEI Electronic Journal]]></journal-title>
<abbrev-journal-title><![CDATA[CLEIej]]></abbrev-journal-title>
<issn>0717-5000</issn>
<publisher>
<publisher-name><![CDATA[Centro Latinoamericano de Estudios en Informática]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0717-50002016000100002</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Balancing Energy and Performance in Dense Linear System Solvers for Hybrid ARM+GPU platforms]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Silva]]></surname>
<given-names><![CDATA[Juan P.]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Dufrechou]]></surname>
<given-names><![CDATA[Ernesto]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Ezzatti]]></surname>
<given-names><![CDATA[Pablo]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[Enrique S.]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Remón]]></surname>
<given-names><![CDATA[Alfredo]]></given-names>
</name>
<xref ref-type="aff" rid="A03"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Benner]]></surname>
<given-names><![CDATA[Peter]]></given-names>
</name>
<xref ref-type="aff" rid="A03"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Universidad de la República, Facultad de Ingeniería]]></institution>
<addr-line><![CDATA[Montevideo ]]></addr-line>
<country>Uruguay</country>
</aff>
<aff id="A02">
<institution><![CDATA[Universidad Jaume I]]></institution>
<addr-line><![CDATA[Castellón ]]></addr-line>
<country>Spain</country>
</aff>
<aff id="A03">
<institution><![CDATA[Max Planck Institute for Dynamics of Complex Technical Systems]]></institution>
<addr-line><![CDATA[Magdeburg ]]></addr-line>
<country>Germany</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>04</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>04</month>
<year>2016</year>
</pub-date>
<volume>19</volume>
<numero>1</numero>
<fpage>2</fpage>
<lpage>2</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_arttext&amp;pid=S0717-50002016000100002&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_abstract&amp;pid=S0717-50002016000100002&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_pdf&amp;pid=S0717-50002016000100002&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[The high performance computing community has traditionally focused uniquely on the reduction of execution time, though in the last years, the optimization of energy consumption has become a main issue. A reduction of energy usage without a degradation of performance requires the adoption of energy-efficient hardware platforms accompanied by the development of energy-aware algorithms and computational kernels. The solution of linear systems is a key operation for many scientific and engineering problems. Its relevance has motivated an important amount of work, and consequently, it is possible to find high performance solvers for a wide variety of hardware platforms. In this work, we aim to develop a high performance and energy-efficient linear system solver. In particular, we develop two solvers for a low-power CPU-GPU platform, the NVIDIA Jetson TK1. These solvers implement the Gauss-Huard algorithm yielding an efficient usage of the target hardware as well as an efficient memory access. The experimental evaluation shows that the novel proposal reports important savings in both time and energy-consumption when compared with the state-of-the-art solvers of the platform.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[La comunidad de computación de alto desempeño (HPC del inglés) tradicionalmente se ha focalizado únicamente en la reducción de los tiempos de ejecución, sin embargo, en los últimos años, la optimización del consumo energético se ha convertido en un tema importante. La reducción en el uso de energía sin una pérdida de desempeño requiere la adopción de hardware eficiente desde el punto de vista energético, acompañado también del desarrollo de algoritmos que consideren este aspecto. La resolución de sistemas lineales es una operación clave en diversos problemas relacionados con la computación científica. Su relevancia ha motivado un volumen importante de trabajo y, en consecuencia, es posible encontrar resolutores que incluyen técnicas de HPC para una amplia gama de plataformas de hardware. En este trabajo, se avanza en el desarrollo de variantes para la resolución de sistemas lineales que incluyen técnicas de computación de alto desempeño y contemplan aspectos energéticos. En particular, se desarrollan dos resolutores para una plataforma híbrida CPU+GPU de bajo consumo, la Jetson TK1 de NVIDIA. Estos resolutores implementan el método de Gauss-Huard, buscando hacer un uso eficiente de la plataforma de hardware objetivo, considerando las características de su jerarquía de memoria. La evaluación experimental muestra que la novel propuesta reporta importantes reducciones tanto en el tiempo de ejecución como en el consumo energético cuando se compara con el estado del arte en resolutores para dicha plataforma.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Dense Linear Systems]]></kwd>
<kwd lng="en"><![CDATA[Gauss-Huard]]></kwd>
<kwd lng="en"><![CDATA[NVIDIA Jetson K1]]></kwd>
<kwd lng="en"><![CDATA[Energy-aware computing]]></kwd>
<kwd lng="es"><![CDATA[Sistemas lineales densos]]></kwd>
<kwd lng="es"><![CDATA[Gauss-Huard]]></kwd>
<kwd lng="es"><![CDATA[Jetson K1 de NVIDIA]]></kwd>
<kwd lng="es"><![CDATA[Computación considerando asepctos energéticos]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <div class="maketitle"> <h2 class="titleHead" style="font-size: 14pt;">Balancing Energy and Performance    <br>  in Dense Linear System Solvers    <br>  for Hybrid ARM+GPU platforms    <br>  </h2>      <div class="author"> <span class="cmbx-12">Juan P. Silva, Ernesto Dufrechou, Pablo Ezzatti</span>     <br>  <span class="cmr-12">Facultad de Ingenier</span><span class="cmr-12">&iacute;a, Universidad de la Rep</span><span class="cmr-12">&uacute;blica,</span>     <br>  <span class="cmr-12">11300, Montevideo, Uruguay,</span>     <br>  <span class="cmsy-10x-x-120">{</span><span class="cmti-12"><a href="mailto:jpsilva@fing.edu.uy"> jpsilva</a>,<a href="mailto:edufrechou@fing.edu.uy">edufrechou</a>,<a href="mailto:pezzatti@fing.edu.uy">pezzatti</a></span><span class="cmsy-10x-x-120">}</span><span class="cmti-12">@fing.edu.uy, </span><br class="and">  <span class="cmbx-12">Enrique</span><span class="cmbx-12">&nbsp;S.</span><span class="cmbx-12">&nbsp;Quintana-Ort</span><span class="cmbx-12">&iacute;</span>     <br>  <span class="cmr-12">Departamento de Ingenier</span><span class="cmr-12">&iacute;a y Ciencia de Computadores,</span>     <br>  <span class="cmr-12">Universidad Jaume I,</span>     ]]></body>
<body><![CDATA[<br>  <span class="cmr-12">12.071, Castell</span><span class="cmr-12">&oacute;n, Spain,</span>     <br>  <span class="cmti-12"><a href="mailto:quintana@icc.uji.es">quintana@icc.uji.es</a> </span><br class="and">  <span class="cmbx-12">Alfredo</span><span class="cmbx-12">&nbsp;Rem</span><span class="cmbx-12">&oacute;n and Peter</span><span class="cmbx-12">&nbsp;Benner</span>     <br>  <span class="cmr-12">Max Planck Institute for Dynamics of Complex Technical Systems,</span>     <br>  <span class="cmr-12">D-39106, Magdeburg, Germany,</span>     <br>  <span class="cmsy-10x-x-120">{</span><span class="cmti-12"><a href="mailto:remon@mpi-magdeburg.mpg.de">remon</a>,<a href="mailto:benner@mpi-magdeburg.mpg.de">benner</a></span><span class="cmsy-10x-x-120">}</span><span class="cmti-12">@mpi-magdeburg.mpg.de </span></div>      <br>      <div class="date"></div>  </div>      <div class="abstract">     <div class="center"><!--l. 100-->     <p class="noindent"></p>      ]]></body>
<body><![CDATA[<div class="minipage">     <div class="center"><!--l. 100-->     <p class="noindent"><!--l. 100--></p>      <p class="noindent"><span class="cmbx-10">Abstract</span></p>  </div>  <!--l. 102-->     <p class="noindent">The high performance computing community has traditionally focused uniquely on the reduction of execution time, though in the last years, the optimization of energy consumption has become a main issue. A reduction of energy usage without a degradation of performance requires the adoption of energy-efficient hardware platforms accompanied by the development of energy-aware algorithms and computational kernels. The solution of linear systems is a key operation for many scientific and engineering problems. Its relevance has motivated an important amount of work, and consequently, it is possible to find high performance solvers for a wide variety of hardware platforms. In this work, we aim to develop a high performance and energy-efficient linear system solver. In particular, we develop two solvers for a low-power CPU-GPU platform, the NVIDIA Jetson TK1. These solvers implement the Gauss-Huard algorithm yielding an efficient usage of the target hardware as well as an efficient memory access. The experimental evaluation shows that the novel proposal reports important savings in both time and energy-consumption when compared with the state-of-the-art solvers of the platform. <!--l. 119--></p>      <p class="noindent">Abstract in Spanish <!--l. 122--></p>      <p class="noindent">La comunidad de computaci&oacute;n de alto desempe&ntilde;o (HPC del ingl&eacute;s) tradicionalmente se ha focalizado &uacute;nicamente en la reducci&oacute;n de los tiempos de ejecuci&oacute;n, sin embargo, en los &uacute;ltimos a&ntilde;os, la optimizaci&oacute;n del consumo energ&eacute;tico se ha convertido en un tema importante. La reducci&oacute;n en el uso de energ&iacute;a sin una p&eacute;rdida de desempe&ntilde;o requiere la adopci&oacute;n de hardware eficiente desde el punto de vista energ&eacute;tico, acompa&ntilde;ado tambi&eacute;n del desarrollo de algoritmos que consideren este aspecto. La resoluci&oacute;n de sistemas lineales es una operaci&oacute;n clave en diversos problemas relacionados con la computaci&oacute;n cient&iacute;fica. Su relevancia ha motivado un volumen importante de trabajo y, en consecuencia, es posible encontrar resolutores que incluyen t&eacute;cnicas de HPC para una amplia gama de plataformas de hardware. En este trabajo, se avanza en el desarrollo de variantes para la resoluci&oacute;n de sistemas lineales que incluyen t&eacute;cnicas de computaci&oacute;n de alto desempe&ntilde;o y contemplan aspectos energ&eacute;ticos. En particular, se desarrollan dos resolutores para una plataforma h&iacute;brida CPU+GPU de bajo consumo, la Jetson TK1 de NVIDIA. Estos resolutores implementan el m&eacute;todo de Gauss-Huard, buscando hacer un uso eficiente de la plataforma de hardware objetivo, considerando las caracter&iacute;sticas de su jerarqu&iacute;a de memoria. La evaluaci&oacute;n experimental muestra que la novel propuesta reporta importantes reducciones tanto en el tiempo de ejecuci&oacute;n como en el consumo energ&eacute;tico cuando se compara con el estado del arte en resolutores para dicha plataforma. </p>  </div>  </div>  </div>  <!--l. 
146-->     <p class="noindent"><span class="cmbx-10">Keywords: </span>Dense Linear Systems, Gauss-Huard, NVIDIA Jetson TK1, Energy-aware computing<br class="newline">  Keywords in Spanish: Sistemas lineales densos, Gauss-Huard, Jetson TK1 de NVIDIA, Computaci&oacute;n considerando aspectos energ&eacute;ticos<br class="newline">  Received: 2015-12-13 Revised: 2016-02-17 Accepted: 2016-02-17 <br class="newline">  DOI: http://dx.doi.org/10.19153/cleiej.19.1.2 </p>  <h3 class="sectionHead"><span class="titlemark">1 </span> Introduction</h3>  <!--l. 3-->     <p class="noindent">The solution of a system of linear equations of the form </p>  <table class="equation">    <tbody>      <tr>        <td>       <center class="math-display"><img src="/img/revistas/cleiej/v19n1/1a020x.png" alt="Ax = b," class="math-display"></center>        </td>        <td class="equation-label">(1)</td>      </tr>    </tbody> </table>  <!--l. 6-->     <p class="nopar">where <img src="/img/revistas/cleiej/v19n1/1a021x.png" alt="A &isin; &#8477;^{n&times;n}" class="math"> is a dense matrix, and <img src="/img/revistas/cleiej/v19n1/1a022x.png" alt="b,x &isin; &#8477;^{n&times;m}" class="math"> represent, respectively, the right-hand side vector of independent terms (RHS) and the sought-after solution, is a key computational problem in many scientific and engineering applications&nbsp;<span class="cite">[<a href="#XDemmel:1997:ANL:264989">1</a><a id="br1">]</a></span>. The traditional method to solve a general linear system is based on the LU factorization (i.e., Gaussian elimination), which must be complemented with (partial) row pivoting to ensure numerical stability&nbsp;<span class="cite">[<a href="#XGVL3">2</a><a id="br2">]</a></span>. The operation involves the factorization of the matrix <img src="/img/revistas/cleiej/v19n1/1a023x.png" alt="A" class="math"> into the lower-triangular matrix <img src="/img/revistas/cleiej/v19n1/1a024x.png" alt="L &isin; &#8477;^{n&times;n}" class="math"> and the upper-triangular matrix <img src="/img/revistas/cleiej/v19n1/1a025x.png" alt="U &isin; &#8477;^{n&times;n}" class="math"> such that <img src="/img/revistas/cleiej/v19n1/1a026x.png" alt="A = LU" class="math">. Then, <img src="/img/revistas/cleiej/v19n1/1a027x.png" alt="x" class="math"> is obtained via the solution of two triangular linear systems: <img src="/img/revistas/cleiej/v19n1/1a028x.png" alt="Ly = b" class="math"> and <img src="/img/revistas/cleiej/v19n1/1a029x.png" alt="Ux = y" class="math">. The factorization stage is the most time-consuming operation of the process, and involves about <img src="/img/revistas/cleiej/v19n1/1a0210x.png" alt="2n^3/3" class="math"> floating-point arithmetic operations (flops), in contrast with the <img src="/img/revistas/cleiej/v19n1/1a0211x.png" alt="n^2" class="math"> flops of each subsequent triangular solve. The factorization phase, and therefore the most time-consuming part of the method, can be expressed in terms of BLAS-3 operations <span class="cite">[<a href="#XBLAS3">3</a><a id="br3">]</a></span>. Thus, an efficient implementation of this method on a parallel platform can potentially deliver remarkable performance. 
As a consequence, this is the method implemented in the LAPACK library (routine <span class="cmcsc-10"><span class="small-caps">g</span><span class="small-caps">e</span><span class="small-caps">s</span><span class="small-caps">v</span></span>), as well as in MATLAB<sup class="textsuperscript"><span class="cmr-9">&reg;</span></sup>&nbsp;(command <span class="cmcsc-10"><span class="small-caps">l</span><span class="small-caps">i</span><span class="small-caps">n</span><span class="small-caps">s</span><span class="small-caps">o</span><span class="small-caps">l</span><span class="small-caps">v</span><span class="small-caps">e</span></span>&nbsp;or backslash). <!--l. 24--></p>      ]]></body>
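<body><![CDATA[<p class="indent"> As a brief illustration, the <span class="cmcsc-10">gesv</span>&nbsp;driver can be invoked from C through the LAPACKE interface. The following minimal example (single precision, one right-hand side, with an arbitrarily chosen 2x2 system) performs the LU factorization with partial row pivoting and the two triangular solves in a single call: </p>
<pre class="verbatim">
#include &lt;stdio.h&gt;
#include &lt;lapacke.h&gt;

int main(void)
{
    /* column-major 2x2 coefficient matrix A and right-hand side b */
    float A[] = { 4.0f, 2.0f,     /* first column  */
                  1.0f, 3.0f };   /* second column */
    float b[] = { 6.0f, 8.0f };   /* overwritten with the solution x */
    lapack_int ipiv[2];           /* row pivots from the factorization */
    lapack_int info = LAPACKE_sgesv(LAPACK_COL_MAJOR, 2, 1, A, 2, ipiv, b, 2);
    if (info == 0)
        printf("x = (%f, %f)\n", b[0], b[1]);
    return (int) info;
}
</pre>]]></body>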
<body><![CDATA[<p class="indent"> The Gauss-Jordan elimination (GJE) is an efficient algorithm to compute the matrix inverse, and can be easily adapted to obtain the solution of linear systems of equations. The numerical stability can also be improved via partial pivoting. The main interest in this algorithm lies in its high degree of parallelism and its reduced number of memory operations. Consequently, this method usually delivers remarkable efficiency in modern hardware architectures that present a large number of computational units&nbsp;<span class="cite">[<a href="#XJOS11">4</a><a id="br4">,</a>&nbsp;<a href="#XBenner2013">5</a><a id="br5">]</a></span>. However, when applied to the solution of linear systems, GJE incurs a larger computational cost than the traditional approach based on the LU factorization. In particular, it requires <img src="/img/revistas/cleiej/v19n1/1a0212x.png" alt=" 3 n " class="math"> flops, compared with the <img src="/img/revistas/cleiej/v19n1/1a0213x.png" alt=" 3 2n &#8725;3 " class="math"> flops of the LU-based method. This limitation makes it useless as soon as the dimension of the system (<img src="/img/revistas/cleiej/v19n1/1a0214x.png" alt="n " class="math">) becomes moderate. In the late 70s, P.&nbsp;Huard presented the Gauss-Huard algorithm (GH)&nbsp;<span class="cite">[<a href="#XHua79">6</a><a id="br6">]</a></span>, which can be considered as an extension of the GJE algorithm for the solution of linear systems, and represents an alternative and reliable method when complemented with column pivoting&nbsp;<span class="cite">[<a href="#XDek97">7</a><a id="br7">]</a></span>. Interestingly, this method presents the same computational cost as the LU-based solver, i.e., <img src="/img/revistas/cleiej/v19n1/1a0215x.png" alt=" 3 2n &#8725;3 " class="math"> flops. <!--l. 40--></p>      <p class="indent"> In recent years, hardware architectures have experienced remarkable changes. Mainly motivated by the limitations on power consumption, the strategy of increasing the processor frequency has been replaced by an increment in the number of computational units. GPUs and the Intel Xeon Phi processors are successful exponents of this trend. These new hardware platforms offer from tens to thousands of computational units, and are more energy-efficient than traditional multi-core processors. In this domain, the solutions presented by NVIDIA that integrate in a single board a low power ARM processor and a GPU occupy a relevant position. In particular, the most important example of this type of platforms is the NVIDIA Jetson TK1, composed of a quad-core ARM Cortex A15 processor and a Kepler GPU with 192 CUDA cores. However, to fully exploit the capabilities of these new platforms, the methods and their implementations must be reformulated. <!--l. 58--></p>      <p class="indent"> This work aims to obtain a high-performance and energy-efficient linear system solver based on the Gauss-Huard method (with partial column pivoting) optimized for the NVIDIA Jetson TK1. We extend the work in <span class="cite">[<a href="#Xclei15">8</a><a id="br8">]</a></span> to further improve the new solvers and evaluate their performances in terms of runtime, but also energy. We include in the evaluation the state-of-the-art solvers in ARM and hybrid CPU-GPU platforms, specifically OpenBlas&nbsp;<span class="cite">[<a href="#Xopenblas">9</a><a id="br9">]</a></span> and MAGMA&nbsp;<span class="cite">[<a href="#XMAGMA2">10</a><a id="br10">]</a></span>. 
Our experimental results show that one of the new implementations outperforms, in both runtime and energy consumption, the solvers included in those libraries, for problems of moderate to large dimension (<img src="/img/revistas/cleiej/v19n1/1a0216x.png" alt="n &gt; 3,000" class="math">). <!--l. 68--></p>      <p class="indent"> The rest of the paper is structured as follows: In Section&nbsp;<a href="#x1-20002">2<!--tex4ht:ref: sec:sistlin --></a>, we describe three methods for the solution of linear systems: the one based on the LU factorization, the GJE-based strategy, and the GH&nbsp;algorithm. Then, in Section&nbsp;<a href="#x1-60003">3<!--tex4ht:ref: sec:impl --></a>, we present two new GPU-based versions that implement the GH&nbsp;algorithm; this is followed by an experimental evaluation of their performance and energy efficiency in Section&nbsp;<a href="#x1-90004">4<!--tex4ht:ref: sec:evalexp --></a>. Finally, we discuss some conclusions and future work in Section&nbsp;<a href="#x1-130005">5<!--tex4ht:ref: sec:CR&FW --></a>. <!--l. 2--></p>      <p class="noindent"> </p>  <h3 class="sectionHead"><span class="titlemark">2 </span> Solution of dense linear systems</h3>  <!--l. 5-->     <p class="noindent">In this section, we revisit different approaches to compute the solution of a <span class="cmti-10">dense </span>linear system <img src="/img/revistas/cleiej/v19n1/1a0217x.png" alt="Ax = b" class="math"> with dense <img src="/img/revistas/cleiej/v19n1/1a0218x.png" alt="A" class="math">. In numerical linear algebra (NLA), there are two main families of methods to solve this kind of problem: direct and iterative methods; see <span class="cite">[<a href="#XGVL3">2</a><a id="br2">]</a></span>. The first family consists of deterministic methods that compute the exact solution of the problem (neglecting unavoidable floating-point rounding errors). On the other hand, iterative methods start with an initial solution and, after successive approximations, converge to a solution according to certain criteria. This usually means that a threshold, <span class="cmti-10">tol</span>, is set such that when the difference between the computed solution and the exact solution is lower than <span class="cmti-10">tol</span>, the method stops. From the computational point of view, direct methods generally present a computational cost of <img src="/img/revistas/cleiej/v19n1/1a0219x.png" alt="O(n^3)" class="math"> flops and are mostly based on BLAS-3 operations, allowing an efficient use of the memory hierarchy in modern hardware platforms. In contrast, iterative methods present a computational cost of <img src="/img/revistas/cleiej/v19n1/1a0220x.png" alt="O(n^2)" class="math"> flops per iteration and do not take advantage of BLAS-3 operations, which limits their performance. However, an iterative method may be faster than a direct method if it presents a lower computational cost, i.e., when the number of iterations is considerably smaller than <img src="/img/revistas/cleiej/v19n1/1a0221x.png" alt="n" class="math"> (because the convergence is fast or the problem does not require a high accuracy in the solution). On the other hand, direct methods allow reusing most of the computations in the successive solution of linear systems with the same coefficient matrix. In this work, we focus on direct methods for the solution of dense linear systems. In more detail, in this section we describe three direct approaches based on the traditional LU, the GJE and the GH&nbsp;algorithms. 
All methods are in-place and, therefore, present a similar cost in terms of memory requirements. <!--l. 35--></p>      <p class="noindent"> </p>  <h4 class="subsectionHead"><span class="titlemark">2.1 </span> Solution of linear systems via the LU factorization</h4>  <!--l. 38-->     <p class="noindent">The LU-based algorithm for the solution of linear systems is analogous to the traditional strategy based on Gaussian elimination. Specifically, the method consists of three steps, the first of which computes the <img src="/img/revistas/cleiej/v19n1/1a0222x.png" alt="LU" class="math"> factorization of <img src="/img/revistas/cleiej/v19n1/1a0223x.png" alt="A" class="math">. This operation requires a row pivoting strategy to ensure numerical stability&nbsp;<span class="cite">[<a href="#XGVL3">2</a><a id="br2">]</a></span>. The decomposition takes the form <img src="/img/revistas/cleiej/v19n1/1a0224x.png" alt="PA = LU" class="math">, where <img src="/img/revistas/cleiej/v19n1/1a0225x.png" alt="P &isin; &#8477;^{n&times;n}" class="math"> is a permutation matrix, and the factors <img src="/img/revistas/cleiej/v19n1/1a0226x.png" alt="L,U &isin; &#8477;^{n&times;n}" class="math"> are, respectively, unit lower triangular and upper triangular. The <img src="/img/revistas/cleiej/v19n1/1a0227x.png" alt="L" class="math"> and <img src="/img/revistas/cleiej/v19n1/1a0228x.png" alt="U" class="math"> factors are stored in the memory space of <img src="/img/revistas/cleiej/v19n1/1a0229x.png" alt="A" class="math"> (i.e., in-place) to reduce the memory requirements. The main diagonal of <img src="/img/revistas/cleiej/v19n1/1a0230x.png" alt="L" class="math"> does not need to be stored since <img src="/img/revistas/cleiej/v19n1/1a0231x.png" alt="L" class="math"> is unit lower triangular. To further reduce memory use, <img src="/img/revistas/cleiej/v19n1/1a0232x.png" alt="P" class="math"> is implicitly stored as a vector. Then, the system is rewritten as <img src="/img/revistas/cleiej/v19n1/1a0233x.png" alt="LUx = (Pb) = &circ;b" class="math"> and, therefore, <img src="/img/revistas/cleiej/v19n1/1a0234x.png" alt="x" class="math"> can be obtained by solving the triangular linear systems <img src="/img/revistas/cleiej/v19n1/1a0235x.png" alt="Ly = &circ;b" class="math"> followed by <img src="/img/revistas/cleiej/v19n1/1a0236x.png" alt="Ux = y" class="math">. <!--l. 51--></p>      <p class="indent"> </p>  <hr class="float">     <div class="float"><img src="/img/revistas/cleiej/v19n1/1a02f1.jpg" alt="PIC">     ]]></body>
<body><![CDATA[<br>      <div class="caption"><span class="id">Figure&nbsp;1: </span><span class="content">Scalar <span class="cmti-10">in-place </span>algorithm to compute the LU factorization of matrix <img src="/img/revistas/cleiej/v19n1/1a0237x.png" alt="A " class="math">. Upon completion, <img src="/img/revistas/cleiej/v19n1/1a0238x.png" alt="A " class="math"> is overwriten by the triangular factors <img src="/img/revistas/cleiej/v19n1/1a0239x.png" alt="L " class="math"> and <img src="/img/revistas/cleiej/v19n1/1a0240x.png" alt="U " class="math">.</span></div>  <!--tex4ht:label?: x1-30011 --> </div>  <hr class="endfloat"><!--l. 60-->     <p class="indent"> </p>  <hr class="float">     <div class="float"><img src="/img/revistas/cleiej/v19n1/1a02f2.jpg" alt="PIC">     <br>      <div class="caption"><span class="id">Figure&nbsp;2: </span><span class="content">Blocked <span class="cmti-10">in-place </span>algorithm to compute the LU factorization of <img src="/img/revistas/cleiej/v19n1/1a0241x.png" alt="A " class="math">. Upon completion, <img src="/img/revistas/cleiej/v19n1/1a0242x.png" alt="A " class="math"> is overwriten by the triangular factors <img src="/img/revistas/cleiej/v19n1/1a0243x.png" alt="L " class="math"> and <img src="/img/revistas/cleiej/v19n1/1a0244x.png" alt="U " class="math">. </span></div>  <!--tex4ht:label?: x1-30022 --> </div>  <hr class="endfloat"><!--l. 70-->     <p class="indent"> Most of the computational cost is due to the factorization of <img src="/img/revistas/cleiej/v19n1/1a0245x.png" alt="A " class="math">. This can be exploited when several systems involving the same coefficient matrix must be solved. In this case, the factors <img src="/img/revistas/cleiej/v19n1/1a0246x.png" alt="L " class="math"> and <img src="/img/revistas/cleiej/v19n1/1a0247x.png" alt="U " class="math"> can be reused while the matrix decomposition needs to be computed only once. Consequently, the solution of the successive systems presents a computational cost of <img src="/img/revistas/cleiej/v19n1/1a0248x.png" alt="2n2 " class="math"> flops. Figure&nbsp;<a href="#x1-30011">1<!--tex4ht:ref: fig:lu_unb --></a> shows an in-place unblocked variant of the Gaussian elimination algorithm to compute the LU factorization using the FLAME notation&nbsp;<span class="cite">[<a href="#XGunnels:2001:FFL">11</a><a id="br11">,</a>&nbsp;<a href="#XRecipe">12</a><a id="br12">]</a></span>. In the figure, <img src="/img/revistas/cleiej/v19n1/1a0249x.png" alt="n(A) " class="math"> stands for the number of columns of <img src="/img/revistas/cleiej/v19n1/1a0250x.png" alt="A " class="math">. <!--l. 78--></p>      <p class="indent"> Figure&nbsp;<a href="#x1-30022">2<!--tex4ht:ref: fig:lu_blk --></a> shows the blocked version of the method, where <img src="/img/revistas/cleiej/v19n1/1a0251x.png" alt="nb " class="math"> is the algorithmic block size. For simplicity, pivoting is not included in both algorithms. In general, blocked variants tend to offer better performance in modern hardware architectures, since they improve the computational intensity&nbsp;<span class="cite">[<a href="#XGVL3">2</a><a id="br2">]</a></span> and reduce the memory operations. <!--l. 84--></p>      <p class="indent"> The LU-based method presents some drawbacks that limit its performance on massively parallel architectures. 
Concretely: </p>  <ul class="itemize1">    <li class="itemize">The method consists of three sequential stages (LU factorization, lower triangular solve and upper triangular solve), implicitly defining two synchronization points. </li>    <li class="itemize">In addition, it requires the solution of two triangular linear systems, an operation that presents a rich collection of data dependencies and limited concurrency. Furthermore, small triangular linear systems also appear during the factorization stage.</li>      </ul>  <!--l. 96-->     ]]></body>
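<body><![CDATA[<p class="indent"> For concreteness, the following is a minimal C sketch of the scalar in-place LU factorization of Figure&nbsp;1. As in the figure, pivoting is omitted; column-major storage with leading dimension lda is assumed. This is a didactic sketch, not a tuned implementation: </p>
<pre class="verbatim">
/* In-place unblocked LU factorization (no pivoting), column-major.
   Upon completion, the strictly lower part of A holds L (unit diagonal
   implicit) and the upper part holds U, as in Figure 1. */
void lu_unb(int n, float *A, int lda)
{
    for (int k = 0; k &lt; n; ++k) {
        /* a21 := a21 / alpha11 */
        for (int i = k + 1; i &lt; n; ++i)
            A[i + k * lda] /= A[k + k * lda];
        /* A22 := A22 - a21 * a12^T (rank-1 update) */
        for (int j = k + 1; j &lt; n; ++j)
            for (int i = k + 1; i &lt; n; ++i)
                A[i + j * lda] -= A[i + k * lda] * A[k + j * lda];
    }
}
</pre>
<p class="indent"> The blocked algorithm of Figure&nbsp;2 follows the same pattern, but factorizes nb columns at a time and casts the trailing update as a matrix-matrix (BLAS-3) product. </p>]]></body>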
<body><![CDATA[<p class="indent"> These problems present little relevance when the dimension of <img src="/img/revistas/cleiej/v19n1/1a0252x.png" alt="A " class="math"> grows, since the factorization concentrates most of the computational effort. In this situation, the possibility to use BLAS-3 kernels during the factorization turns the method extremely suitable for parallel platforms. <!--l. 102--></p>      <p class="indent"> <span class="cmcsc-10">LAPACK</span>&nbsp;<span class="cite">[<a href="#Xlapack">13</a><a id="br13">]</a></span> is a specification that provides many important NLA operations, and offers high performance and numerically reliable implementations for many types of platforms. Certain implementations may include a number of high performance computing (HPC) techniques and also particular customizations for different processors and architectures. The <span class="cmcsc-10">LAPACK</span>&nbsp;specification includes support for the different stages of the LU-based solver. In particular, the routine <span class="cmcsc-10"><span class="small-caps">g</span><span class="small-caps">e</span><span class="small-caps">t</span><span class="small-caps">r</span><span class="small-caps">f</span></span>&nbsp;computes the LU factorization (applying partial row pivoting) of a non-singular matrix, while the routine <span class="cmcsc-10"><span class="small-caps">g</span><span class="small-caps">e</span><span class="small-caps">t</span><span class="small-caps">r</span><span class="small-caps">s</span></span>&nbsp;allows the solution of the subsequent triangular systems from the factors computed by <span class="cmcsc-10"><span class="small-caps">g</span><span class="small-caps">e</span><span class="small-caps">t</span><span class="small-caps">r</span><span class="small-caps">f</span></span>. Additionally, <span class="cmcsc-10">LAPACK</span>&nbsp;includes the <span class="cmcsc-10"><span class="small-caps">g</span><span class="small-caps">e</span><span class="small-caps">s</span><span class="small-caps">v</span></span>&nbsp;driver routine for the solution of linear systems, which simply performs the appropriate calls to the routines <span class="cmcsc-10"><span class="small-caps">g</span><span class="small-caps">e</span><span class="small-caps">t</span><span class="small-caps">r</span><span class="small-caps">f</span></span>&nbsp;and <span class="cmcsc-10"><span class="small-caps">g</span><span class="small-caps">e</span><span class="small-caps">t</span><span class="small-caps">r</span><span class="small-caps">s</span></span>. </p>  <h4 class="subsectionHead"><span class="titlemark">2.2 </span> The Gauss-Jordan method</h4>  <!--l. 120-->     <p class="noindent">An alternative method for the solution of dense linear systems is the so-called Gauss-Jordan Elimination method (GJE). This algorithm calculates the inverse of a matrix, but can be easily adapted to obtain the solution of a linear system. Additionally, if row pivoting is employed, the algorithm exhibits stability properties similar to those of the Gaussian Elimination method <span class="cite">[<a href="#XHigham:2002:ASN">14</a><a id="br14">]</a></span>. <!--l. 127--></p>      <p class="indent"> The solution of linear systems with this method implies the application of GJE&nbsp;to the extended matrix <img src="/img/revistas/cleiej/v19n1/1a0253x.png" alt="A&circ;= [A,b] " class="math">. 
The method processes the extended matrix from left to right, updating, at each iteration, all the entries of <img src="/img/revistas/cleiej/v19n1/1a0254x.png" alt="&circ;A" class="math">. For the solution of linear systems, only the elements to the right of the active column need be updated, reducing the computational cost to <img src="/img/revistas/cleiej/v19n1/1a0255x.png" alt="n^3" class="math"> flops. Once the process is completed, the solution of the linear system (i.e., the <img src="/img/revistas/cleiej/v19n1/1a0256x.png" alt="x" class="math"> vector) is stored in the last column of the extended matrix <img src="/img/revistas/cleiej/v19n1/1a0257x.png" alt="&circ;A" class="math">. <!--l. 135--></p>      <p class="indent"> </p>  <hr class="float">     <div class="float"><img src="/img/revistas/cleiej/v19n1/1a02f3.jpg" alt="PIC">     <br>      <div class="caption"><span class="id">Figure&nbsp;3: </span><span class="content">Blocked GJE&nbsp;algorithm for the solution of linear systems of the form <img src="/img/revistas/cleiej/v19n1/1a0258x.png" alt="Ax = b" class="math">. Initially, <img src="/img/revistas/cleiej/v19n1/1a0259x.png" alt="&circ;A = [A,b]" class="math">. Upon completion, the solution <img src="/img/revistas/cleiej/v19n1/1a0260x.png" alt="x" class="math"> overwrites the last column of <img src="/img/revistas/cleiej/v19n1/1a0261x.png" alt="&circ;A" class="math">. </span></div>  <!--tex4ht:label?: x1-40013 --> </div>  <hr class="endfloat"><!--l. 144-->     <p class="indent"> Figure&nbsp;<a href="#x1-40013">3<!--tex4ht:ref: fig:alg_gje_blk --></a> shows the blocked version of GJE&nbsp;to solve the linear system of equations associated with <img src="/img/revistas/cleiej/v19n1/1a0262x.png" alt="A" class="math">. For simplicity, we do not include pivoting in the description. A description of a scalar version of the GJE&nbsp;method, which is invoked in the blocked version, can be found in&nbsp;<span class="cite">[<a href="#XQuiQSG01">15</a><a id="br15">]</a></span>. It should be noted that this is also an <span class="cmti-10">in-place </span>method, which means that, when the process is completed, the matrix <img src="/img/revistas/cleiej/v19n1/1a0263x.png" alt="A" class="math"> is overwritten with the transformed matrix and <img src="/img/revistas/cleiej/v19n1/1a0264x.png" alt="x" class="math"> overwrites <img src="/img/revistas/cleiej/v19n1/1a0265x.png" alt="b" class="math"> (in the last column of <img src="/img/revistas/cleiej/v19n1/1a0266x.png" alt="&circ;A" class="math">). <!--l. 153--></p>      <p class="indent"> This method is rich in BLAS-3 operations. In particular, the updates of <img src="/img/revistas/cleiej/v19n1/1a0267x.png" alt="&circ;A" class="math"> are performed via matrix-matrix products. This usually yields a high throughput when executed on massively parallel platforms. On the other hand, the use of the GJE&nbsp;method to solve a linear system requires <img src="/img/revistas/cleiej/v19n1/1a0268x.png" alt="n^3" class="math"> flops, in contrast to the <img src="/img/revistas/cleiej/v19n1/1a0269x.png" alt="2n^3/3" class="math"> flops of the LU-based method. Consequently, this method cannot compete with the LU-based algorithm for the solution of systems of large dimension. </p>  <h4 class="subsectionHead"><span class="titlemark">2.3 </span> The Gauss-Huard method</h4>  <!--l. 165-->     ]]></body>
<body><![CDATA[<p class="noindent">As previously stated, the GJE&nbsp;method presents some features that turn it especially appealing for its implementation on massively parallel platforms. Unfortunately the algorithm also presents a high computational cost for the solution of linear systems of equations compared with the LU-based method. This drawback was addressed by P.&nbsp;Huard&nbsp;<span class="cite">[<a href="#XHua79">6</a><a id="br6">]</a></span> in the late 1970s. In his work, he presents a smart variant of the GJE&nbsp;algorithm to solve linear systems that incurs a computational cost analogous to that of the LU-based method. This method, known as the Gauss-Huard (GH) algorithm, presents a convenient memory access pattern. But this feature is destroyed if row pivoting is introduced to ensure numerical stability. This problem was overcome by T. J. Dekker et al. in&nbsp;<span class="cite">[<a href="#XDekker1994221">16</a><a id="br16">,</a>&nbsp;<a href="#XDek97">7</a><a id="br7">]</a></span>. They proposed the inclusion of column pivoting and proved that the resulting method offers numerical characteristics similar to those of the LU factorization with row pivoting. <!--l. 179--></p>      <p class="indent"> </p>  <hr class="float">     <div class="float"><img src="/img/revistas/cleiej/v19n1/1a02f4.jpg" alt="PIC">     <br>      <div class="caption"><span class="id">Figure&nbsp;4: </span><span class="content">Scalar (unblocked) Gauss-Huard algorithm for the solution of linear systems of the form <img src="/img/revistas/cleiej/v19n1/1a0270x.png" alt="Ax = b " class="math">. Initially, <img src="/img/revistas/cleiej/v19n1/1a0271x.png" alt="&circ;A = [A, b] " class="math">. Upon completion, the solution <img src="/img/revistas/cleiej/v19n1/1a0272x.png" alt="x " class="math"> overwrites the last column of <img src="/img/revistas/cleiej/v19n1/1a0273x.png" alt="A&circ; " class="math">. </span></div>  <!--tex4ht:label?: x1-50014 --> </div>  <hr class="endfloat"><!--l. 191-->     <p class="indent"> Figure&nbsp;<a href="#x1-50014">4<!--tex4ht:ref: fig:gausshuard_unb --></a> describes the GH&nbsp;for the solution of a linear system of equations. Pivoting is not included in the algorithm for simplicity, although all our variants integrate partial column pivoting. In practice, pivoting can be easily added as follows: before the diagonalization of <img src="/img/revistas/cleiej/v19n1/1a0274x.png" alt="[ ] &circ;&alpha;11, &circ;aT12 " class="math">, this vector is searched for its maximum entry in magnitude (excluding its last element, which corresponds to an entry of <img src="/img/revistas/cleiej/v19n1/1a0275x.png" alt="b " class="math">) and the column of <img src="/img/revistas/cleiej/v19n1/1a0276x.png" alt="&circ;A " class="math"> corresponding to this entry is then swapped with the column of <img src="/img/revistas/cleiej/v19n1/1a0277x.png" alt="&circ;A " class="math"> containing the diagonal entry <img src="/img/revistas/cleiej/v19n1/1a0278x.png" alt="&circ;&alpha;11 " class="math">. <!--l. 201--></p>      <p class="indent"> </p>  <hr class="float">     <div class="float"><img src="/img/revistas/cleiej/v19n1/1a02f5.jpg" alt="PIC">     <br>      <div class="caption"><span class="id">Figure&nbsp;5: </span><span class="content">Blocked Gauss-Huard algorithm for the solution of linear systems of the form <img src="/img/revistas/cleiej/v19n1/1a0279x.png" alt="Ax = b " class="math">. Initially, <img src="/img/revistas/cleiej/v19n1/1a0280x.png" alt="A&circ;= [A,b] " class="math">. 
Upon completion, the solution <img src="/img/revistas/cleiej/v19n1/1a0281x.png" alt="x " class="math"> overwrites the last column of <img src="/img/revistas/cleiej/v19n1/1a0282x.png" alt="&circ;A " class="math">. </span></div>  <!--tex4ht:label?: x1-50025 --> </div>  <hr class="endfloat"><!--l. 210-->     ]]></body>
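<body><![CDATA[<p class="indent"> To make the procedure of Figure&nbsp;4 concrete, the following is a minimal C sketch of the scalar Gauss-Huard solver operating on the extended matrix [A,b], stored row-major as an n x (n+1) array; it includes the column pivoting described above. This is a didactic sketch, not the tuned implementation evaluated later: </p>
<pre class="verbatim">
#include &lt;math.h&gt;

/* Scalar Gauss-Huard solve of Ax=b on Ah=[A|b] (row-major, n x (n+1)).
   On exit, the last column of Ah holds x. piv is an n-entry workspace. */
void gauss_huard(int n, float *Ah, int *piv)
{
    int ld = n + 1;
    for (int k = 0; k &lt; n; ++k) {
        /* (1) update row k, columns k..n, with the rows above
               (its left part is never referenced again)          */
        for (int j = k; j &lt;= n; ++j)
            for (int i = 0; i &lt; k; ++i)
                Ah[k * ld + j] -= Ah[k * ld + i] * Ah[i * ld + j];
        /* (2) column pivoting: largest |entry| of row k in cols k..n-1
               (the last column, an entry of b, is excluded)      */
        int p = k;
        for (int j = k + 1; j &lt; n; ++j)
            if (fabsf(Ah[k * ld + j]) > fabsf(Ah[k * ld + p])) p = j;
        piv[k] = p;
        for (int i = 0; i &lt; n &amp;&amp; p != k; ++i) {  /* swap columns k, p */
            float t = Ah[i * ld + k];
            Ah[i * ld + k] = Ah[i * ld + p];
            Ah[i * ld + p] = t;
        }
        /* (3) scale row k and eliminate column k above the diagonal */
        float d = Ah[k * ld + k];
        for (int j = k; j &lt;= n; ++j) Ah[k * ld + j] /= d;
        for (int i = 0; i &lt; k; ++i) {
            float m = Ah[i * ld + k];
            for (int j = k; j &lt;= n; ++j)
                Ah[i * ld + j] -= m * Ah[k * ld + j];
        }
    }
    /* undo the column swaps, in reverse order, to recover x */
    for (int k = n - 1; k >= 0; --k) {
        float t = Ah[k * ld + n];
        Ah[k * ld + n] = Ah[piv[k] * ld + n];
        Ah[piv[k] * ld + n] = t;
    }
}
</pre>]]></body>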
<body><![CDATA[<p class="indent"> A blocked variant of GH&nbsp;was introduced for distributed-memory (message-passing) systems in&nbsp;<span class="cite">[<a href="#XHoffmann1994321">17</a><a id="br17">]</a></span>. Unfortunately, the authors did not perform any experimental evaluation of the implementation, and simply stated that its performance could be expected to be close to that of an LU-based solver. <!--l. 215--></p>      <p class="indent"> Figure&nbsp;<a href="#x1-50025">5<!--tex4ht:ref: fig:gausshuard_blk --></a> describes a blocked version of the Gauss-Huard algorithm which processes <img src="/img/revistas/cleiej/v19n1/1a0283x.png" alt="nb " class="math"> columns of the matrix per iteration. </p>  <h3 class="sectionHead"><span class="titlemark">3 </span> Gauss-Huard implementations for hybrid CPU-GPU platforms</h3>  <!--l. 9-->     <p class="noindent">CPU-GPU platforms have currently reached a prominent position in HPC, as revealed by the top positions of the Top500 list <span class="cite">[<a href="#Xtop500">18</a><a id="br18">]</a></span>. They are also widely used in the domain of numerical methods and linear algebra. This is reflected by the efforts integrated in modern scientific numerical libraries to exploit this kind of platforms; see e.g., MAGMA&nbsp;<span class="cite">[<a href="#Xmagma">19</a><a id="br19">]</a></span>, PETSc&nbsp;<span class="cite">[<a href="#Xpetsc">20</a><a id="br20">]</a></span> or ViennaCL&nbsp;<span class="cite">[<a href="#Xviennacl">21</a><a id="br21">]</a></span>. All those libraries include routines to solve (dense and sparse) linear systems taking advantage of the capabilities provided by this kind of hybrid architectures. Current GPUs offer not only an ample parallelism, but also remarkable flops-per-Watt ratios and competitive economical cost. This has motivated the development of boards that embed one or more general-purpose low-power processors (like those developed for mobile devices by ARM) with low-power GPUs. The performance and moderate power consumption of such hybrid boards has motivated the assembly of powerful clusters with hundreds of these boards. This is the case of the Tibidabo supercomputer at the Barcelona Supercomputer Center&nbsp;<span class="cite">[<a href="#Xbsc">22</a><a id="br22">]</a></span>. <!--l. 23--></p>      <p class="indent"> The GH&nbsp;algorithm combines highly parallel operations suitable for GPUs and fine-grain parallelism computations (mainly due to pivoting) that suits general-purpose processors. Thus, this algorithm represents a good candidate for hybrid CPU-GPU platforms. <!--l. 28--></p>      <p class="indent"> In this context, we present two implementations of the GH&nbsp;algorithm that target hybrid CPU-GPU low-power platforms. The first variant performs all computations in the GPU. The second variant distributes the computations between the CPU and the GPU, but incurs in a overhead due to data transfers between the CPU and the GPU memories. <!--l. 38--></p>      <p class="noindent"> </p>  <h4 class="subsectionHead"><span class="titlemark">3.1 </span> GPU version (<span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a0284x.png" alt="gpu " class="math">) </h4>  <!--l. 40-->     <p class="noindent">Due to the moderate computational power offered by the general-purpose processor when compared with that offered by the GPU, this implementation performs all the computations in the latter. 
On the other hand, the main drawback of <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a0285x.png" alt="gpu" class="math">&nbsp;is that not all the computational units in the platform are employed. This variant is an implementation of algorithm <span class="cmcsc-10">gausshuard_blk</span>&nbsp;using the kernels offered by the CUBLAS library, i.e., the implementation of the BLAS provided by NVIDIA. <!--l. 48--></p>      <p class="indent"> In an initial phase, the whole <img src="/img/revistas/cleiej/v19n1/1a0286x.png" alt="&circ;A" class="math"> matrix is transferred from the CPU to the GPU. Then, the algorithm proceeds in the GPU and, upon completion, the last column of <img src="/img/revistas/cleiej/v19n1/1a0287x.png" alt="&circ;A" class="math">, containing <img src="/img/revistas/cleiej/v19n1/1a0288x.png" alt="x" class="math">, is transferred back to the CPU memory. <!--l. 51--></p>      <p class="indent"> As previously stated, most of the computational effort in algorithm <span class="cmcsc-10">gausshuard_blk</span>&nbsp;lies in the matrix-matrix products. To fully exploit the capabilities of the GPU during the execution of these products, the matrices involved must be of moderate to large size. The dimension of these matrices is determined by the system dimension, <img src="/img/revistas/cleiej/v19n1/1a0289x.png" alt="n" class="math">, and the algorithmic block size, <img src="/img/revistas/cleiej/v19n1/1a0290x.png" alt="nb" class="math">. Consequently, a small value of <img src="/img/revistas/cleiej/v19n1/1a0291x.png" alt="nb" class="math"> limits the performance of this variant. On the other hand, a large value of <img src="/img/revistas/cleiej/v19n1/1a0292x.png" alt="nb" class="math"> increases the relative computational effort of the <span class="cmti-10">block diagonalization</span>, which is a BLAS-2 operation that delivers limited performance. To partially overcome this problem, we adopted a strategy based on multiple nested algorithmic block sizes in <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a0293x.png" alt="gpu" class="math">. Specifically, the block diagonalization is performed via a blocked variant of the GH&nbsp;algorithm, i.e., <span class="cmcsc-10">gausshuard_blk</span>. As a result, BLAS-3 kernels can be employed during most of this computation. To maximize the performance of <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a0295x.png" alt="gpu" class="math">, both algorithmic block sizes must be carefully chosen; a CUBLAS-based sketch of this variant is given below. <!--l. 
64--></p>      ]]></body>
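<body><![CDATA[<p class="indent"> The overall structure of <span class="cmcsc-10">GH</span>gpu can be sketched as follows. This is a simplified illustration, assuming column-major storage; gh_diag_gpu() is a hypothetical name standing for the device-side nested blocked diagonalization, and the column pivoting plus the inner block size are omitted here. The two matrix-matrix updates of each iteration are standard CUBLAS calls: </p>
<pre class="verbatim">
#include &lt;cuda_runtime.h&gt;
#include &lt;cublas_v2.h&gt;

/* hypothetical device-side panel diagonalization (see text) */
void gh_diag_gpu(cublasHandle_t h, float *dA, int ldd, int n, int k, int b);

/* GH_gpu sketch: Ah=[A|b] (column-major, n x (n+1), leading dim n)
   is moved to the GPU once; only x returns at the end. */
void gh_gpu(cublasHandle_t h, int n, int nb, const float *Ah, float *x)
{
    int ldd = n, cols = n + 1;
    const float one = 1.0f, mone = -1.0f;
    float *dA;
    cudaMalloc((void **) &amp;dA, (size_t) ldd * cols * sizeof(float));
    /* single bulk CPU -> GPU transfer of the extended matrix */
    cublasSetMatrix(n, cols, sizeof(float), Ah, n, dA, ldd);
    for (int k = 0; k &lt; n; k += nb) {
        int b = (k + nb &lt;= n) ? nb : n - k;
        /* block-row update: rows k..k+b-1, columns k..n */
        if (k > 0)
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, b, cols - k, k,
                        &amp;mone, dA + k, ldd, dA + (size_t) k * ldd, ldd,
                        &amp;one, dA + k + (size_t) k * ldd, ldd);
        gh_diag_gpu(h, dA, ldd, n, k, b);   /* block diagonalization */
        /* block-column elimination above the panel: rows 0..k-1
           (the multiplier block is left stale, as it is not read again) */
        if (k > 0)
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, k, cols - (k + b), b,
                        &amp;mone, dA + (size_t) k * ldd, ldd,
                        dA + k + (size_t) (k + b) * ldd, ldd,
                        &amp;one, dA + (size_t) (k + b) * ldd, ldd);
    }
    /* GPU -> CPU transfer of the solution (last column of Ah) */
    cublasGetVector(n, sizeof(float), dA + (size_t) n * ldd, 1, x, 1);
    cudaFree(dA);
}
</pre>]]></body>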
<body><![CDATA[<p class="noindent">This variant aims to exploit both available computational resources in the platform, i.e., the CPU and GPU. To achieve this, it requires data transfers between the CPU and the GPU and, consequently, incurs a certain overhead due to communications. To alleviate this, as an initial phase, the matrix <img src="/img/revistas/cleiej/v19n1/1a0297x.png" alt="&circ;A " class="math"> is transferred from the CPU memory to the GPU memory. With this we avoid to transfer the panel <img src="/img/revistas/cleiej/v19n1/1a0298x.png" alt="[ &circ;A11,A &circ;12] " class="math"> from CPU to GPU at each iteration of the algorithm. The total amount of data transferred remains the same but, given the platform specifications, a single larger data transfer is more efficient than several smaller transfers. Furthermore, this data transfer can be partly overlapped with the diagonalization of the first panel. This way, we save time as well as energy. <!--l. 76--></p>      <p class="indent"> Most of the computations in the algorithm are cast in terms of matrix-matrix products. This operation exhibits a high level of concurrency and suits the GPU architecture. In contrast, the block diagonalization presents a moderate cost and its concurrency is constrained by the pivoting scheme. Thus, the natural way to distribute the operations is to perform the matrix-matrix products in the GPU and the block diagonalization in the CPU. Therefore, the bulk of the computations are performed in the most powerful processor, the GPU. This partitioning of the workload (diagonalization in the CPU and the remaining block eliminations in the GPU) requires that, at each step of the <span class="cmcsc-10"><span class="small-caps">g</span><span class="small-caps">a</span><span class="small-caps">u</span><span class="small-caps">s</span><span class="small-caps">s</span><span class="small-caps">h</span><span class="small-caps">u</span><span class="small-caps">a</span><span class="small-caps">r</span><span class="small-caps">d</span></span><span class="cmcsc-10">_<span class="small-caps">b</span><span class="small-caps">l</span><span class="small-caps">k</span></span>&nbsp;algorithm, the block <img src="/img/revistas/cleiej/v19n1/1a0299x.png" alt="[ &circ;A11,A&circ;12] " class="math"> is sent from the GPU (after completing the <span class="cmti-10">block-row elimination</span>) to the CPU and retrieved from there back into the GPU (after the computation of the block diagonalization). Consequently, the volume of data transferred at each iteration of the algorithm is proportional to <img src="/img/revistas/cleiej/v19n1/1a02100x.png" alt="n " class="math"> and <img src="/img/revistas/cleiej/v19n1/1a02101x.png" alt="nb " class="math">. After the procedure is completed, the solution vector (stored in the last column of <img src="/img/revistas/cleiej/v19n1/1a02102x.png" alt="&circ;A " class="math">) must be sent to the CPU. <!--l. 
<p class="indent"> Similarly to the fully GPU implementation, <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02103x.png" alt="gpu" class="math">, the efficiency of <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02104x.png" alt="hyb" class="math">&nbsp;strongly depends on the value of <img src="/img/revistas/cleiej/v19n1/1a02105x.png" alt="nb" class="math">: small values of <img src="/img/revistas/cleiej/v19n1/1a02106x.png" alt="nb" class="math"> (close to <img src="/img/revistas/cleiej/v19n1/1a02107x.png" alt="32, 64" class="math">) limit the performance of the matrix-matrix products performed in the GPU, though this ensures that the execution of the algorithm is driven by computation instead of data transfers. However, a large value of <img src="/img/revistas/cleiej/v19n1/1a02108x.png" alt="nb" class="math"> may hurt the performance of <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02109x.png" alt="hyb" class="math">&nbsp;because a significant part of the computations is mapped to the less powerful processor, the CPU. Again, we propose to perform the block diagonalization via a blocked variant of the algorithm and, therefore, we leverage a double algorithmic block size, namely <img src="/img/revistas/cleiej/v19n1/1a02110x.png" alt="nbd" class="math"> and <img src="/img/revistas/cleiej/v19n1/1a02111x.png" alt="nbg" class="math">. The former is employed only for the computations in the ARM processor, i.e., during the block diagonalization. This way, BLAS-3 kernels can be employed during this stage, accelerating its execution. At a higher level, in <span class="cmcsc-10">gausshuard_blk</span>&nbsp;we set <img src="/img/revistas/cleiej/v19n1/1a02112x.png" alt="nbg = nb" class="math">. A more efficient execution of the block diagonalization phase permits choosing larger values of <img src="/img/revistas/cleiej/v19n1/1a02113x.png" alt="nbg" class="math">, which also yield higher performance during the execution of the matrix-matrix products in the GPU. Additionally, larger values of <img src="/img/revistas/cleiej/v19n1/1a02114x.png" alt="nbg" class="math"> reduce the number of iterations and, thus, the number of data transfers (although this may increase the amount of data transferred, the transfers are performed in a more efficient manner). <!--l. 110--></p>      <p class="indent"> Additionally, <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02115x.png" alt="hyb" class="math">&nbsp;makes heavy use of tuned computational kernels in OpenBLAS and NVIDIA CUBLAS, respectively, for the ARM and the GPU processors. Moreover, a few key scalar and Level-1 BLAS operations are performed via <span class="cmti-10">ad-hoc </span>kernels, parallelized with OpenMP, which yield higher performance than their counterparts in OpenBLAS. A sketch of one iteration of this hybrid scheme is given below. <!--l. 2--></p>      ]]></body>
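<body><![CDATA[<p class="indent"> The per-iteration traffic of <span class="cmcsc-10">GH</span>hyb can be summarized with the following sketch of a single step, under the same assumptions as before; gh_diag_cpu() is a hypothetical name for the OpenBLAS-based blocked diagonalization executed on the ARM cores with the inner block size nbd, and pivoting is again omitted: </p>
<pre class="verbatim">
#include &lt;cublas_v2.h&gt;

/* hypothetical host-side blocked diagonalization (OpenBLAS inside) */
void gh_diag_cpu(int b, int cols, float *panel, int ldp, int nbd);

/* One GH_hyb iteration: the panel travels GPU -> CPU -> GPU around the
   diagonalization, while both GEMM updates stay on the GPU.
   panel is a b x (n+1-k) host workspace. */
void gh_hyb_step(cublasHandle_t h, float *dA, int ldd, int n, int k,
                 int b, int nbd, float *panel)
{
    int cols = n + 1;
    const float one = 1.0f, mone = -1.0f;
    /* (1) block-row update on the GPU */
    if (k > 0)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, b, cols - k, k,
                    &amp;mone, dA + k, ldd, dA + (size_t) k * ldd, ldd,
                    &amp;one, dA + k + (size_t) k * ldd, ldd);
    /* (2) panel to the CPU, diagonalize there, and send it back;
           the volume moved is proportional to n and b */
    cublasGetMatrix(b, cols - k, sizeof(float),
                    dA + k + (size_t) k * ldd, ldd, panel, b);
    gh_diag_cpu(b, cols - k, panel, b, nbd);
    cublasSetMatrix(b, cols - k, sizeof(float), panel, b,
                    dA + k + (size_t) k * ldd, ldd);
    /* (3) block-column elimination above the panel on the GPU */
    if (k > 0)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, k, cols - (k + b), b,
                    &amp;mone, dA + (size_t) k * ldd, ldd,
                    dA + k + (size_t) (k + b) * ldd, ldd,
                    &amp;one, dA + (size_t) (k + b) * ldd, ldd);
}
</pre>]]></body>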
<body><![CDATA[<h3 class="sectionHead"><span class="titlemark">4 </span> Experimental evaluation</h3>  <!--l. 5-->     <p class="noindent">In this section we show the experimental evaluation of the GH-based solvers introduced in the previous section. The evaluation focuses on two aspects: performance (measured in execution time and GFLOPS) and energy consumption. All experiments are performed employing single precision arithmetic. We compare the solvers on the solution of randomly generated systems of dimension <img src="/img/revistas/cleiej/v19n1/1a02116x.png" alt="n = 1,024,...,7,168" class="math">. <!--l. 10--></p>      <p class="indent"> The novel codes have been tuned for the hardware platform employed, and compared with two LU-based solvers considered the state-of-the-art for ARM processors and hybrid CPU-GPU platforms: the solver in OpenBLAS v.0.2.14, which exploits the ARM processor, and its counterpart in MAGMA v.1.6.2, which leverages both the CPU and the GPU. Note that the routine in MAGMA internally uses kernels in OpenBLAS to perform computations in the ARM. <!--l. 21--></p>      <p class="noindent"> </p>  <h4 class="subsectionHead"><span class="titlemark">4.1 </span> Experimental platform</h4>  <!--l. 23-->     <p class="noindent">The experimental hardware platform, a JETSON TK1&nbsp;<span class="cite">[<a href="#XJETSON">23</a><a id="br23">]</a></span> board, is composed of a Tegra K1 SoC (System on a Chip) with 2GB of DDR3L RAM. In more detail, the Tegra K1 includes an NVIDIA GPU with 192 CUDA Kepler cores and an ARM quad-core Cortex-A15 processor at 2.32 GHz. <!--l. 28--></p>      <p class="indent"> The board runs an extended version of Linux Ubuntu 14.04 adapted for the ARM architecture. The codes for the ARM are compiled using <span class="cmti-10">gcc </span>v.4.8.2, while the codes developed for the NVIDIA GPU are compiled with <span class="cmti-10">nvcc</span> v.6.0.1. <!--l. 33--></p>      ]]></body>
<body><![CDATA[<p class="indent"> The platform configuration was adapted such that the maximum performance level was reported with a negligible loss of energy efficiency. This involved to: (1) set the frequency governor of the CPU to <span class="cmtt-10">performance </span>(instead of its default value, <span class="cmtt-10">ondemand</span>), (2) turn on the ARM-cores as required, (3) and set the GPU clock to the highest available frequency, i.e., <img src="/img/revistas/cleiej/v19n1/1a02117x.png" alt="852,000,000 " class="math"> <span class="cmtt-10">Hz </span>(by default set to <img src="/img/revistas/cleiej/v19n1/1a02118x.png" alt="12,000,000 " class="math"> <span class="cmtt-10">Hz</span>). <!--l. 41--></p>      <p class="noindent"> </p>  <h4 class="subsectionHead"><span class="titlemark">4.2 </span> Performance evaluation</h4>  <!--l. 43-->     <p class="noindent">The runtimes showed for the different GPU-variants include the overhead associated with data transfers between CPU and GPU memory spaces. <!--l. 53--></p>      <p class="indent"> </p>  <hr class="float">     <div class="float">     <div class="center"><!--l. 54-->     <p class="noindent">    <br>  </p>      <div class="caption"><span class="id">Table&nbsp;1: </span><span class="content">Execution time (in seconds) and performance (in GFLOPS) attained with the solvers. </span></div>  <!--tex4ht:label?: x1-110011 --><!--l. 59-->     <p class="noindent"><img src="/img/revistas/cleiej/v19n1/1a02t1.jpg" alt="PIC"></p>  </div>  </div>  <hr class="endfloat"><!--l. 67-->     ]]></body>
<body><![CDATA[<p class="indent"> All solvers deliver results of equivalent numerical quality. In other words, the residual errors associated with all the computed solutions are of the same order. <!--l. 70--></p>      <p class="indent"> Table&nbsp;<a href="#x1-110011">1<!--tex4ht:ref: tab:times --></a> summarizes the runtime (in seconds) and performance (in GFLOPS, i.e., billions of flops per second) obtained by the four solvers, i.e., OpenBLAS, MAGMA, <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02119x.png" alt="gpu " class="math"> and <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02120x.png" alt="hyb " class="math">. Additionally, for each variant (except <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02121x.png" alt="gpu " class="math">), we include the number of threads employed on the CPU. In particular, we present two options for each variant, corresponding to the use of a single thread and of 4 threads (one per core of the Cortex-A15 processor). For <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02122x.png" alt="gpu " class="math">&nbsp;and <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02123x.png" alt="hyb " class="math">, we evaluate several block sizes (both internal and external) but, for simplicity, only the best performance for each system dimension is shown. A graphical comparison of the performance attained by all variants is offered in Figure&nbsp;<a href="#x1-110026">6<!--tex4ht:ref: fig:graf --></a>. <!--l. 81--></p>      <p class="indent"> First of all, we focus on the results obtained with the solvers in OpenBLAS and MAGMA. The solver in the MAGMA library reports a moderate performance on the solution of small to moderate-scale systems, which might be caused by the CPU-GPU communication overhead; the performance rapidly improves with the problem dimension. Additionally, the use of 4 threads (note that MAGMA offers a hybrid CPU-GPU method) delivers marginal gains, about a 10<img src="/img/revistas/cleiej/v19n1/1a02124x.png" alt="% " class="math"> time reduction. OpenBLAS offers a remarkable performance despite using only the ARM processor (which presents a lower computational power than the GPU), though its performance is far from that of the GPU-based solvers. Its parallel version is able to beat the MAGMA solver for all problem dimensions. <!--l. 91--></p>      <p class="indent"> Regarding the GH-based solvers, the hybrid variant (<span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02125x.png" alt="hyb " class="math">) offers execution times comparable with those obtained with the MAGMA library for the larger problems, but it is clearly superior on the solution of small systems, especially when a single thread is employed in the CPU. In contrast, the <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02126x.png" alt="gpu " class="math">&nbsp;variant, which performs all the computations in the GPU, achieves a remarkable performance, especially for larger problems. This is because the capabilities of the massively parallel architecture in the GPU can be exploited more efficiently in the solution of large systems. 
It should be noted that <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02127x.png" alt="gpu " class="math">&nbsp;is the best option for problems of dimension <img src="/img/revistas/cleiej/v19n1/1a02128x.png" alt="n " class="math"> larger than <img src="/img/revistas/cleiej/v19n1/1a02129x.png" alt="3,072 " class="math">. <!--l. 99--></p>      <p class="indent"> In conclusion, the hybrid implementations do not attain high performance because of the high overhead introduced by the data transfers. For the solution of small problems, the use of the ARM processor alone is convenient, since it does not require any data transfers. For larger problems, the higher computational power of the GPU promotes the GPU-only variant as the best option. <!--l. 108--></p>      <p class="indent"> </p>  <hr class="figure">     <div class="figure"> <!--l. 111-->     <p class="noindent"><img src="/img/revistas/cleiej/v19n1/1a02f6.jpg" alt="PIC">    <br>  </p>      <div class="caption"><span class="id">Figure&nbsp;6: </span><span class="content">Performance (measured in GFLOPS) attained by the solvers. </span></div>  <!--tex4ht:label?: x1-110026 --><!--l. 113-->     <p class="indent"> </p>  </div>  <hr class="endfigure"> ]]></body>
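<body><![CDATA[<p class="indent"> The behavior described above is a direct consequence of the communication pattern of the GPU-only variant. The following C sketch of its driver (illustrative only: storage is assumed to be column-major, the elimination kernels are elided, and the function names are hypothetical) shows the only two data transfers that this variant performs. </p> <pre class="verbatim">
#include &lt;cuda_runtime.h&gt;

/* Illustrative skeleton of the GPU-only driver with its two transfers.
   The augmented matrix [A|b] is assumed column-major, so its last
   column (which holds the solution on exit) is contiguous in memory. */
void gh_gpu_solve(int n, const float *Ab_host, float *x_host)
{
    float *Ab_dev;
    size_t bytes = (size_t)n * (n + 1) * sizeof(float);

    cudaMalloc((void **)&amp;Ab_dev, bytes);
    /* Transfer 1: send the whole augmented matrix to the GPU. */
    cudaMemcpy(Ab_dev, Ab_host, bytes, cudaMemcpyHostToDevice);

    for (int k = 0; k &lt; n; k++) {
        /* ... Gauss-Huard elimination step k, entirely on the GPU ...
           e.g., a hypothetical gh_eliminate_step(Ab_dev, n, k); */
    }

    /* Transfer 2: retrieve the solution from the last column. */
    cudaMemcpy(x_host, Ab_dev + (size_t)n * n, n * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(Ab_dev);
}
</pre> ]]></body>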
<body><![CDATA[<h4 class="subsectionHead"><span class="titlemark">4.3 </span> Power consumption evaluation</h4>  <!--l. 9-->     <p class="noindent">In this section we address the power and energy evaluation of the two most efficient solvers, i.e., the solver in the OpenBLAS library and <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02131x.png" alt="gpu " class="math">. We exclude from this comparison the hybrid solvers (<span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02132x.png" alt="hyb " class="math">&nbsp;and the solver in the MAGMA library) because the power they require is expected to be higher than that of the solvers that are fully executed in one of the processors of the platform. Additionally, they exhibited higher runtimes and, thus, larger energy requirements are to be expected from the hybrid solvers. For the OpenBLAS solver we evaluate its behavior with 1 and 4 threads. <!--l. 18--></p>      <p class="indent"> In order to measure the power consumption, we connected a <span class="cmcsc-10">Watts up? Pro</span> wattmeter (accuracy of <img src="/img/revistas/cleiej/v19n1/1a02133x.png" alt="&plusmn; 1.5 " class="math">% and a rate of 1 sample/s) to the power line from the electric socket to the power supply unit (PSU) of the Tegra TK1 device. Thus, we measure the power consumption of the whole board. The power measurements are collected, stored, and processed on a different server, so that the measurement does not disturb the solvers&rsquo; execution. All tests were executed for a minimum of 1 minute, after a warm-up period of 2 minutes. <!--l. 26--></p>      <p class="indent"> Table&nbsp;<a href="#x1-120012">2<!--tex4ht:ref: tab:ene --></a> collects the average power (<img src="/img/revistas/cleiej/v19n1/1a02134x.png" alt="Pavg " class="math">), total energy (<img src="/img/revistas/cleiej/v19n1/1a02135x.png" alt="Etot " class="math">), and GFLOPS/Watt reported by the solvers. Figure&nbsp;<a href="#x1-120027">7<!--tex4ht:ref: fig:gfxw --></a> shows the GFLOPS/Watt ratio of the solvers. The results show that, when the system dimension <img src="/img/revistas/cleiej/v19n1/1a02136x.png" alt="n &le; 2,000 " class="math">, the most energy-efficient solver is that provided by the OpenBLAS library (using 4 threads). This is explained by the absence of expensive CPU-GPU data transfers. Although the GPU architecture is more energy-efficient than the ARM processors, it cannot be fully exploited for the solution of small problems. Thus, the time and energy spent in data transfers exceed the gains reported by the GPU usage. This is exposed by the <img src="/img/revistas/cleiej/v19n1/1a02137x.png" alt="Pavg " class="math">&nbsp;values: <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02138x.png" alt="gpu " class="math">&nbsp;delivers a lower <img src="/img/revistas/cleiej/v19n1/1a02139x.png" alt="Pavg " class="math">&nbsp;but a longer execution time (see Table&nbsp;<a href="#x1-110011">1<!--tex4ht:ref: tab:times --></a>), resulting in higher energy consumption. 
For larger systems, <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02140x.png" alt="gpu " class="math">&nbsp;reported the best results in terms of energy. For these problems, the full exploitation of the GPU capabilities compensates the time and energy spent in data transfers (note that only two data transfers are required in <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02141x.png" alt="gpu " class="math">: initially, the whole <img src="/img/revistas/cleiej/v19n1/1a02142x.png" alt="A&circ; " class="math"> matrix is sent to the GPU and, upon completion, the <img src="/img/revistas/cleiej/v19n1/1a02143x.png" alt="x " class="math"> vector is retrieved by the CPU). A special case occurs when <img src="/img/revistas/cleiej/v19n1/1a02144x.png" alt="n = 3,000 " class="math">. In this case, the solver in OpenBLAS (with 4 threads) reports the shortest execution time but, due to the lower <img src="/img/revistas/cleiej/v19n1/1a02145x.png" alt="Pavg " class="math">&nbsp;of <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02146x.png" alt="gpu " class="math">, the latter consumes less energy. If we focus on the <img src="/img/revistas/cleiej/v19n1/1a02147x.png" alt="Pavg " class="math">&nbsp;results, the OpenBLAS solver with a single thread presents the lowest power consumption. This is explained by the fact that the solver uses a single core of the ARM processor. On the other hand, when the same solver operates with 4 threads, it delivers a speed-up of about 3.5, while the power is only nearly doubled. Consequently, the use of 4 threads in OpenBLAS is more energy-efficient. The average power consumed by <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02148x.png" alt="gpu " class="math">&nbsp;(remember that this solver is completely executed in the GPU) is lower than that of the parallel OpenBLAS kernel (using the 4 cores of the ARM processor), reporting up to a 50% power reduction depending on the system dimension. This implies that the GPU presents a higher peak performance as well as a lower energy consumption, which turns this architecture especially appealing for our purposes. <!--l. 56--></p>      <p class="indent"> </p>  <hr class="float">     <div class="float">     <div class="center"><!--l. 57-->     <p class="noindent">    <br>  </p>      <div class="caption"><span class="id">Table&nbsp;2: </span><span class="content">Average power consumption (in Watts), total energy (in Joules) and performance per Watt (in GFLOPS/Watt) obtained for the three solver configurations on the solution of linear systems of equations of dimension <img src="/img/revistas/cleiej/v19n1/1a02149x.png" alt="n " class="math">. </span></div>  <!--tex4ht:label?: x1-120012 --><!--l. 62-->     ]]></body>
<body><![CDATA[<p class="noindent"><img src="/img/revistas/cleiej/v19n1/1a02t2.jpg" alt="PIC"></p>  </div>  </div>  <hr class="endfloat"><!--l. 69-->     <p class="indent"> </p>  <hr class="figure">     <div class="figure"> <!--l. 71-->     <p class="noindent"><img src="/img/revistas/cleiej/v19n1/1a02f7.jpg" alt="PIC">    <br>  </p>      <div class="caption"><span class="id">Figure&nbsp;7: </span><span class="content">Energy-efficiency (measured in GFLOPS/Watt) attained with the OpenBLAS and <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02150x.png" alt="gpu " class="math">&nbsp;solvers. </span></div>  <!--tex4ht:label?: x1-120027 --><!--l. 73-->     <p class="indent"> </p>  </div>  <hr class="endfigure"><!--l. 75-->     <p class="indent"> Since the platform contains several devices &mdash;e.g., network interface cards&mdash; we measured the average power while idle for 1 minute, <img src="/img/revistas/cleiej/v19n1/1a02151x.png" alt="PI " class="math">, obtaining a value of 2.6 Watts. This corresponds to the power consumption of the board when no operation is executed. We use this value to compute the <span class="cmti-10">net energy</span>, i.e., the energy consumption obtained after subtracting <img src="/img/revistas/cleiej/v19n1/1a02152x.png" alt="PI " class="math"> from the power samples. The <span class="cmti-10">net energy </span>approximates the actual energy spent in computations and allows a fair comparison between the computational kernels. Figure&nbsp;<a href="#x1-120038">8<!--tex4ht:ref: fig:gfxew --></a> shows the GFLOPS/Watt computed with the average net power (left) and the <span class="cmti-10">net energy </span>of the solvers (right). Both plots show the superiority of <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02153x.png" alt="gpu " class="math">, because it performs more arithmetic operations per Watt and also requires less energy. <!--l. 89--></p>      ]]></body>
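<body><![CDATA[<p class="indent"> For concreteness, the quantities discussed in this section can be derived from the 1 sample/s wattmeter trace as in the brief C sketch below. It is a minimal illustration of the arithmetic described above; the sample values in the trace are hypothetical, while the 2.6 Watt idle power is the measured <img src="/img/revistas/cleiej/v19n1/1a02151x.png" alt="PI " class="math"> reported in the text. </p> <pre class="verbatim">
#include &lt;stdio.h&gt;

#define IDLE_POWER 2.6 /* Watts: idle power P_I measured on the board */

int main(void)
{
    /* Hypothetical 1 sample/s wattmeter trace (Watts) of a solver run. */
    double p[] = { 6.1, 6.4, 6.3, 6.5, 6.2, 6.4 };
    int ns = (int)(sizeof p / sizeof p[0]);
    double dt = 1.0; /* seconds between consecutive samples */

    double e_tot = 0.0, e_net = 0.0;
    for (int i = 0; i &lt; ns; i++) {
        e_tot += p[i] * dt;                /* total energy E_tot (Joules) */
        e_net += (p[i] - IDLE_POWER) * dt; /* net energy: subtract P_I    */
    }
    double p_avg = e_tot / (ns * dt);      /* average power P_avg (Watts) */

    printf("P_avg = %.2f W  E_tot = %.1f J  E_net = %.1f J\n",
           p_avg, e_tot, e_net);
    /* GFLOPS/Watt follows as (flops / runtime) / P_avg; the net ratio
       uses the average net power instead. */
    return 0;
}
</pre> ]]></body>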
<body><![CDATA[<p class="indent"> </p>  <hr class="figure">     <div class="figure"> <!--l. 90-->     <p class="noindent"><img src="/img/revistas/cleiej/v19n1/1a02f8.jpg" alt="PIC">    <br>  </p>      <div class="caption"><span class="id">Figure&nbsp;8: </span><span class="content">Net energy-efficiency measured in GFLOPS/Watt (left) and total net energy in Joules (right), attained with the OpenBLAS and <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02154x.png" alt="gpu " class="math">&nbsp;solvers. </span></div>  <!--tex4ht:label?: x1-120038 --><!--l. 95-->     <p class="indent"> </p>  </div>  <hr class="endfigure"> <h3 class="sectionHead"><span class="titlemark">5 </span> Concluding Remarks</h3>  <!--l. 4-->     <p class="noindent">During the last years, energy consumption has also become a major issue in high performance computing. In this work we present two efficient solvers for linear systems of equations that exhibit a remarkably low energy consumption. The new solvers exploit the capabilities of a low-power hardware platform, an NVIDIA Jetson TK1 (formed by a quad-core ARM processor and an NVIDIA GPU with 192 CUDA cores), to efficiently run a Gauss-Huard-based solver. The Gauss-Huard algorithm presents some features which make it more suitable for the target hardware platform than the traditional LU-based solver. The <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02155x.png" alt="hyb " class="math">&nbsp;solver aims to exploit all the computational units in the platform concurrently. Although it is able to perform computations concurrently in both processors, it also requires data transfers between their memory spaces, incurring a considerable overhead. The <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02156x.png" alt="gpu " class="math">&nbsp;solver performs all the computations in the most powerful processor, the GPU. Thus, it incurs only a minor communication overhead, which is compensated by the higher performance and energy efficiency of the GPU. The performance of both solvers is compared with that of the solvers in the OpenBLAS and MAGMA libraries. The experimental results show the superior behavior of the <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02157x.png" alt="gpu " class="math">&nbsp;solver on the solution of linear systems of dimension <img src="/img/revistas/cleiej/v19n1/1a02158x.png" alt="n &ge; 4,000 " class="math">. For smaller systems, the OpenBLAS solver reports the best execution times because it runs entirely in the ARM processor and does not require any CPU-GPU communication. The power and energy evaluation shows the advantage of the GPU over the quad-core ARM processor: the GPU consumes less power and delivers a higher GFLOPS rate. Consequently, the <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02159x.png" alt="gpu " class="math">&nbsp;variant attains the best energy-efficiency on the solution of moderate to large-scale systems. On the solution of small systems, again, the absence of CPU-GPU communications in the OpenBLAS kernel turns it into the most energy-efficient variant. The communication overhead negatively affects the performance (and therefore the energy efficiency) of the hybrid solvers (i.e., the solver in the MAGMA library and <span class="cmcsc-10">GH</span><img src="/img/revistas/cleiej/v19n1/1a02160x.png" alt="hyb " class="math">) and, thus, both deliver limited performance. <!--l. 34-->
</p>      <p class="indent"> There are several aspects that were not addressed in the current work but deserve further investigation. First, it is important to study new techniques to improve the results of the hybrid solvers; for example, techniques like <span class="cmti-10">look-ahead </span><span class="cite">[<a href="#XLookahead">24</a><a id="br24">]</a></span> have demonstrated important benefits in similar applications. It should also be evaluated how the hybrid CPU-GPU implementations can benefit from the improvements in Unified Memory Access announced by NVIDIA for their future devices. Additionally, we plan to extend this study to other low-power hardware platforms, e.g., the Samsung Exynos-based ODROID-XU4. Finally, although this work targets single precision arithmetic (due to the low performance of the architecture on double precision), we plan to investigate how GH can be complemented with iterative refinement to attain a full double precision solution. <!--l. 163--></p>      <p class="noindent"> </p>  <h3 class="likesectionHead">Acknowledgment</h3>  <!--l. 164-->     <p class="noindent">Juan Silva, Ernesto Dufrechou and Pablo Ezzatti acknowledge the support from Programa de Desarrollo de las Ciencias B&aacute;sicas, and Agencia Nacional de Investigaci&oacute;n e Innovaci&oacute;n, Uruguay. Enrique&nbsp;S.&nbsp;Quintana-Ort&iacute; was supported by project TIN-2014-53495-R of the <span class="cmti-10">Ministerio de Econom</span><span class="cmti-10">&iacute;a y Competitividad </span>and FEDER. All researchers acknowledge the support from the EHFARS project funded by the German Ministry of Education and Research BMBF. <!--l. 2--></p>      <p class="noindent"> </p>  <h3 class="likesectionHead">References</h3>  <!--l. 2-->     <p class="noindent"> </p>      ]]></body>
<body><![CDATA[<div class="thebibliography">     <!-- ref --><p class="bibitem"><span class="biblabel"> [<a href="#br1">1</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;W. Demmel, <span class="cmti-10">Applied Numerical Linear Algebra</span>. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1997.     </p>      <!-- ref --><p class="bibitem"><span class="biblabel"> [<a href="#br2">2</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>G.&nbsp;Golub and C.&nbsp;F. Van&nbsp;Loan, <span class="cmti-10">Matrix Computations</span>, 3rd&nbsp;ed. Baltimore: The Johns Hopkins University Press, 1996.     </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br3">3</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;J. Dongarra, J.&nbsp;Du&nbsp;Croz, S.&nbsp;Hammarling, and I.&nbsp;Duff, &ldquo;A set of level 3 basic linear algebra subprograms,&rdquo; <span class="cmti-10">ACM Trans. Math. Soft.</span>, vol.&nbsp;16, no.&nbsp;1, pp. 1&ndash;17, March 1990. </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br4">4</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>P.&nbsp;Ezzatti, E.&nbsp;Quintana-Ort&iacute;, and A.&nbsp;Rem&oacute;n, &ldquo;Using graphics processors to accelerate the computation of the matrix inverse,&rdquo; <span class="cmti-10">The Journal of Supercomputing</span>, vol.&nbsp;58, pp. 429&ndash;437, 2011. [Online]. Available: <a href="http://dx.doi.org/10.1007/s11227-011-0606-4" class="url">http://dx.doi.org/10.1007/s11227-011-0606-4</a> </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br5">5</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>P.&nbsp;Benner, P.&nbsp;Ezzatti, E.&nbsp;S. Quintana-Ort&iacute;, and A.&nbsp;Rem&oacute;n, &ldquo;Matrix inversion on CPU-GPU platforms with applications in control theory,&rdquo; <span class="cmti-10">Concurrency and Computation: Practice and Experience</span>, vol.&nbsp;25, no.&nbsp;8, pp. 1170&ndash;1182, 2013. [Online]. Available: <a href="http://dx.doi.org/10.1002/cpe.2933" class="url">http://dx.doi.org/10.1002/cpe.2933</a> </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br6">6</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>P.&nbsp;Huard, &ldquo;La m&eacute;thode simplex sans inverse explicite,&rdquo; <span class="cmti-10">EDF Bull. Direction </span><span class="cmti-10">&Eacute;tudes Rech. S</span><span class="cmti-10">&eacute;r. C</span> <span class="cmti-10">Math. Inform. 2</span>, pp. 79&ndash;98, 1979. </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br7">7</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>T.&nbsp;J. Dekker, W.&nbsp;Hoffmann, and K.&nbsp;Potma, &ldquo;Stability of the Gauss-Huard algorithm with partial pivoting,&rdquo; <span class="cmti-10">Computing</span>, vol.&nbsp;58, pp. 225&ndash;244, 1997. </p>      ]]></body>
<body><![CDATA[<p class="bibitem"><span class="biblabel"> [<a href="#br8">8</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;P. Silva, E.&nbsp;Dufrechou, E.&nbsp;Quintana-Ort&iacute;, A.&nbsp;Rem&oacute;n, and P.&nbsp;Benner, &ldquo;Solving dense linear systems with hybrid CPU-GPU platforms,&rdquo; in <span class="cmti-10">Proceedings of the XLI Latin American Computing</span> <span class="cmti-10">Conference, CLEI 2015, Arequipa, Peru, 2015</span>. IEEE, to appear. </p>      <!-- ref --><p class="bibitem"><span class="biblabel"> [<a href="#br9">9</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>OpenBLAS website, Z. Xianyi; [accessed 2015 Dec], <a href="http://www.openblas.net/" class="url">http://www.openblas.net/</a>.     </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br10">10</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>B.&nbsp;J. Smith, &ldquo;R package MAGMA: Matrix algebra on GPU and multicore architectures, version 0.2.2,&rdquo; September 3, 2010, [On-line] <a href="http://cran.r-project.org/package=magma">http://cran.r-project.org/package=magma</a>. </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br11">11</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;A. Gunnels, F.&nbsp;G. Gustavson, G.&nbsp;M. Henry, and R.&nbsp;A. van&nbsp;de Geijn, &ldquo;FLAME: Formal linear algebra methods environment,&rdquo; <span class="cmti-10">ACM Trans. Math. Soft.</span>, vol.&nbsp;27, no.&nbsp;4, pp. 422&ndash;455, 2001. </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br12">12</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>P.&nbsp;Bientinesi, J.&nbsp;A. Gunnels, M.&nbsp;E. Myers, E.&nbsp;S. Quintana-Ort&iacute;, and R.&nbsp;A. van&nbsp;de Geijn, &ldquo;The science of deriving dense linear algebra algorithms,&rdquo; <span class="cmti-10">ACM Trans. Math. Soft.</span>, vol.&nbsp;31, no.&nbsp;1, pp. 1&ndash;26, 2005. </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br13">13</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>E.&nbsp;Anderson, Z.&nbsp;Bai, J.&nbsp;Demmel, J.&nbsp;E. Dongarra, J.&nbsp;DuCroz, A.&nbsp;Greenbaum, S.&nbsp;Hammarling, A.&nbsp;E. McKenney, S.&nbsp;Ostrouchov, and D.&nbsp;Sorensen, <span class="cmti-10">LAPACK Users&rsquo; Guide</span>. Philadelphia: SIAM, 1992. </p>      <!-- ref --><p class="bibitem"><span class="biblabel"> [<a href="#br14">14</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>N.&nbsp;J. Higham, <span class="cmti-10">Accuracy and Stability of Numerical Algorithms</span>, 2nd&nbsp;ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2002.     </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br15">15</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>E.&nbsp;Quintana-Ort&iacute;, G.&nbsp;Quintana-Ort&iacute;, X.&nbsp;Sun, and R.&nbsp;van&nbsp;de&nbsp;Geijn, &ldquo;A note on parallel matrix inversion,&rdquo; <span class="cmti-10">SIAM J. Sci. Comput.</span>, vol.&nbsp;22, pp. 1762&ndash;1771, 2001. </p>      ]]></body>
<body><![CDATA[<p class="bibitem"><span class="biblabel"> [<a href="#br16">16</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>T.&nbsp;J. Dekker, W.&nbsp;Hoffmann, and K.&nbsp;Potma, &ldquo;Parallel algorithms for solving large linear systems,&rdquo; <span class="cmti-10">Journal of Computational and Applied Mathematics</span>, vol.&nbsp;50, no. 1&ndash;3, pp. 221&ndash;232, 1994. </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br17">17</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>W.&nbsp;Hoffmann, K.&nbsp;Potma, and G.&nbsp;Pronk, &ldquo;Solving dense linear systems by Gauss-Huard&rsquo;s method on a distributed memory system,&rdquo; <span class="cmti-10">Future Generation Computer Systems</span>, vol.&nbsp;10, no. 2&ndash;3, pp. 321&ndash;325, 1994. </p>      <!-- ref --><p class="bibitem"><span class="biblabel"> [<a href="#br18">18</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>TOP500.org; [accessed 2015 Dec], <a href="http://www.top500.org/" class="url">http://www.top500.org/</a>.     </p>      <!-- ref --><p class="bibitem"><span class="biblabel"> [<a href="#br19">19</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>MAGMA, Univ. of Tennessee; [accessed 2015 Dec], <a href="http://icl.cs.utk.edu/magma/" class="url">http://icl.cs.utk.edu/magma/</a>.     </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br20">20</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>S.&nbsp;Balay, W.&nbsp;Gropp, L.&nbsp;C. McInnes, and B.&nbsp;Smith, &ldquo;PETSc 2.0 users manual,&rdquo; Argonne National Laboratory, Tech. Rep. ANL-95/11, October 1996. </p>      <!-- ref --><p class="bibitem"><span class="biblabel"> [<a href="#br21">21</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>TU Wien and FASTMath SciDAC Institute, <a href="http://viennacl.sourceforge.net/" class="url">http://viennacl.sourceforge.net/</a>.     </p>      <!-- ref --><p class="bibitem"><span class="biblabel"> [<a href="#br22">22</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>Barcelona Supercomputing Center, <a href="http://www.bsc.es" class="url">http://www.bsc.es</a>.     </p>      <!-- ref --><p class="bibitem"><span class="biblabel"> [<a href="#br23">23</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>NVIDIA, <a href="http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html" class="url">http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html</a>.     </p>      <p class="bibitem"><span class="biblabel"> [<a href="#br24">24</a>]<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>P.&nbsp;Strazdins, &ldquo;A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization,&rdquo; Department of Computer Science, The Australian National University, Canberra 0200 ACT, Australia, Tech. Rep. TR-CS-98-07, 1998. </p>  </div>       ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Demmel]]></surname>
<given-names><![CDATA[J. W.]]></given-names>
</name>
</person-group>
<collab>Society for Industrial and Applied Mathematics</collab>
<source><![CDATA[Applied Numerical Linear Algebra]]></source>
<year>1997</year>
<publisher-loc><![CDATA[Philadelphia, PA]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Golub]]></surname>
<given-names><![CDATA[G]]></given-names>
</name>
<name>
<surname><![CDATA[Van Loan]]></surname>
<given-names><![CDATA[C. F.]]></given-names>
</name>
</person-group>
<source><![CDATA[Matrix Computations]]></source>
<year>1996</year>
<edition>3</edition>
<publisher-loc><![CDATA[Baltimore ]]></publisher-loc>
<publisher-name><![CDATA[The Johns Hopkins University Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Dongarra]]></surname>
<given-names><![CDATA[J. J.]]></given-names>
</name>
<name>
<surname><![CDATA[Du Croz]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Hammarling]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Duff]]></surname>
<given-names><![CDATA[I]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A set of level 3 basic linear algebra subprograms]]></article-title>
<source><![CDATA[ACM Trans. Math. Soft.]]></source>
<year>1990</year>
<month>March</month>
<volume>16</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>1-17</page-range></nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ezzatti]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Remón]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Using graphics processors to accelerate the computation of the matrix inverse]]></article-title>
<source><![CDATA[The Journal of Supercomputing]]></source>
<year>2011</year>
<volume>58</volume>
<page-range>429-437</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Benner]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Ezzatti]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Remón]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Matrix inversion on CPU-GPU platforms with applications in control theory]]></article-title>
<source><![CDATA[Concurrency and Computation: Practice and Experience]]></source>
<year>2013</year>
<volume>25</volume>
<numero>8</numero>
<issue>8</issue>
<page-range>1170-1182</page-range></nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Huard]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
</person-group>
<article-title xml:lang="fr"><![CDATA[La méthode simplex sans inverse explicite]]></article-title>
<source><![CDATA[EDF Bull. Direction Études Rech. Sér. C Math. Inform.]]></source>
<year>1979</year>
<page-range>79-98</page-range></nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Dekker]]></surname>
<given-names><![CDATA[T. J.]]></given-names>
</name>
<name>
<surname><![CDATA[Hoffmann]]></surname>
<given-names><![CDATA[W]]></given-names>
</name>
<name>
<surname><![CDATA[Potma]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Stability of the Gauss-Huard algorithm with partial pivoting]]></article-title>
<source><![CDATA[Computing]]></source>
<year>1997</year>
<volume>58</volume>
<page-range>225-244</page-range></nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Silva]]></surname>
<given-names><![CDATA[J. P.]]></given-names>
</name>
<name>
<surname><![CDATA[Dufrechou]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Remón]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Benner]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Solving dense linear systems with hybrid CPU-GPU platforms]]></article-title>
<source><![CDATA[]]></source>
<year>2015</year>
<conf-name><![CDATA[Proceedings of the XLI Latin American Computing Conference, CLEI 2015]]></conf-name>
<conf-date>2015</conf-date>
<conf-loc>Arequipa </conf-loc>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Xianyi]]></surname>
<given-names><![CDATA[Z.]]></given-names>
</name>
</person-group>
<source><![CDATA[OpenBLAS website]]></source>
<year>2015</year>
<month>Dec</month>
</nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Smith]]></surname>
<given-names><![CDATA[B. J.]]></given-names>
</name>
</person-group>
<source><![CDATA[R package MAGMA: Matrix algebra on GPU and multicore architectures]]></source>
<year>2010</year>
<month>September</month>
<day>3</day>
</nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gunnels]]></surname>
<given-names><![CDATA[J. A.]]></given-names>
</name>
<name>
<surname><![CDATA[Gustavson]]></surname>
<given-names><![CDATA[F. G.]]></given-names>
</name>
<name>
<surname><![CDATA[Henry]]></surname>
<given-names><![CDATA[G. M.]]></given-names>
</name>
<name>
<surname><![CDATA[Geijn]]></surname>
<given-names><![CDATA[R. A. van de]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[FLAME: Formal linear algebra methods environment]]></article-title>
<source><![CDATA[ACM Trans. Math. Soft.]]></source>
<year>2001</year>
<volume>27</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>422-455</page-range></nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bientinesi]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Gunnels]]></surname>
<given-names><![CDATA[J. A.]]></given-names>
</name>
<name>
<surname><![CDATA[Myers]]></surname>
<given-names><![CDATA[M. E.]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Geijn]]></surname>
<given-names><![CDATA[R. A. van de]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The science of deriving dense linear algebra algorithms]]></article-title>
<source><![CDATA[ACM Trans. Math. Soft.]]></source>
<year>2005</year>
<volume>31</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>1-26</page-range></nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Anderson]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Bai]]></surname>
<given-names><![CDATA[Z]]></given-names>
</name>
<name>
<surname><![CDATA[Demmel]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Dongarra]]></surname>
<given-names><![CDATA[J. E.]]></given-names>
</name>
<name>
<surname><![CDATA[DuCroz]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Greenbaum]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Hammarling]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[McKenney]]></surname>
<given-names><![CDATA[A. E.]]></given-names>
</name>
<name>
<surname><![CDATA[Ostrouchov]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Sorensen]]></surname>
<given-names><![CDATA[D]]></given-names>
</name>
</person-group>
<source><![CDATA[LAPACK Users’ Guide]]></source>
<year>1992</year>
<publisher-loc><![CDATA[Philadelphia]]></publisher-loc>
<publisher-name><![CDATA[SIAM]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Higham]]></surname>
<given-names><![CDATA[N. J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Accuracy and Stability of Numerical Algorithms]]></source>
<year>2002</year>
<edition>2</edition>
<publisher-loc><![CDATA[Philadelphia, PA]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[G]]></given-names>
</name>
<name>
<surname><![CDATA[Sun]]></surname>
<given-names><![CDATA[X]]></given-names>
</name>
<name>
<surname><![CDATA[van de Geijn]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A note on parallel matrix inversion]]></article-title>
<source><![CDATA[SIAM J. Sci. Comput.]]></source>
<year>2001</year>
<volume>22</volume>
<page-range>1762-1771</page-range></nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Dekker]]></surname>
<given-names><![CDATA[T. J.]]></given-names>
</name>
<name>
<surname><![CDATA[Hoffmann]]></surname>
<given-names><![CDATA[W]]></given-names>
</name>
<name>
<surname><![CDATA[Potma]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Parallel algorithms for solving large linear systems]]></article-title>
<source><![CDATA[Journal of Computational and Applied Mathematics]]></source>
<year>1994</year>
<volume>50</volume>
<numero>1-3</numero>
<issue>1-3</issue>
<page-range>221-232</page-range></nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Hoffmann]]></surname>
<given-names><![CDATA[W]]></given-names>
</name>
<name>
<surname><![CDATA[Potma]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
<name>
<surname><![CDATA[Pronk]]></surname>
<given-names><![CDATA[G]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Solving dense linear systems by Gauss-Huard’s method on a distributed memory system]]></article-title>
<source><![CDATA[Future Generation Computer Systems]]></source>
<year>1994</year>
<volume>10</volume>
<numero>2-3</numero>
<issue>2-3</issue>
<page-range>321-325</page-range></nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="">
<source><![CDATA[TOP500.org]]></source>
<year>2015</year>
<month>Dec</month>
</nlm-citation>
</ref>
<ref id="B19">
<label>19</label><nlm-citation citation-type="">
<collab>Univ. of Tennessee</collab>
<source><![CDATA[MAGMA]]></source>
<year>2015</year>
<month>Dec</month>
</nlm-citation>
</ref>
<ref id="B20">
<label>20</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Balay]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Gropp]]></surname>
<given-names><![CDATA[W]]></given-names>
</name>
<name>
<surname><![CDATA[McInnes]]></surname>
<given-names><![CDATA[L. C.]]></given-names>
</name>
<name>
<surname><![CDATA[Smith]]></surname>
<given-names><![CDATA[B]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[PETSc 2.0 users manual]]></article-title>
<source><![CDATA[Argonne National Laboratory, Tech. Rep. ANL-95/11]]></source>
<year>1996</year>
<month>October</month>
</nlm-citation>
</ref>
<ref id="B21">
<label>21</label><nlm-citation citation-type="">
<source><![CDATA[TU Wien and FASTMathSciDAC Institute]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B22">
<label>22</label><nlm-citation citation-type="">
<source><![CDATA[Barcelona Supercomputing Center]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B23">
<label>23</label><nlm-citation citation-type="">
<source><![CDATA[NVIDIA]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B24">
<label>24</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Strazdins]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization]]></article-title>
<collab>Department of Computer Science, The Australian National University</collab>
<source><![CDATA[Tech. Rep. TR-CS-98-07, Canberra 0200 ACT, Australia]]></source>
<year>1998</year>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
