<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0717-5000</journal-id>
<journal-title><![CDATA[CLEI Electronic Journal]]></journal-title>
<abbrev-journal-title><![CDATA[CLEIej]]></abbrev-journal-title>
<issn>0717-5000</issn>
<publisher>
<publisher-name><![CDATA[Centro Latinoamericano de Estudios en Informática]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0717-50002014000100004</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Trading Off Performance for Energy in Linear Algebra Operations with Applications in Control Theory]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Benner]]></surname>
<given-names><![CDATA[Peter]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Ezzatti]]></surname>
<given-names><![CDATA[Pablo]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[Enrique S]]></given-names>
</name>
<xref ref-type="aff" rid="A03"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Remón]]></surname>
<given-names><![CDATA[Alfredo]]></given-names>
</name>
<xref ref-type="aff" rid="A04"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Max Planck Institute for Dynamics of Complex Technical Systems]]></institution>
<addr-line><![CDATA[Magdeburg ]]></addr-line>
<country>Germany</country>
</aff>
<aff id="A02">
<institution><![CDATA[Universidad de la República, Facultad de Ingeniería]]></institution>
<addr-line><![CDATA[Montevideo ]]></addr-line>
<country>Uruguay</country>
</aff>
<aff id="A03">
<institution><![CDATA[Universidad Jaume I, Departamento de Ingeniería y Ciencia de Computadores]]></institution>
<addr-line><![CDATA[Castellón ]]></addr-line>
<country>Spain</country>
</aff>
<aff id="A04">
<institution><![CDATA[Max Planck Institute for Dynamics of Complex Technical Systems]]></institution>
<addr-line><![CDATA[Magdeburg ]]></addr-line>
<country>Germany</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>04</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>04</month>
<year>2014</year>
</pub-date>
<volume>17</volume>
<numero>1</numero>
<fpage>4</fpage>
<lpage>4</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_arttext&amp;pid=S0717-50002014000100004&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_abstract&amp;pid=S0717-50002014000100004&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_pdf&amp;pid=S0717-50002014000100004&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[We analyze the performance-power-energy balance of a conventional Intel Xeon multicore processor and two low-power architectures -an Intel Atom processor and a system with a quad-core ARM Cortex A9+NVIDIA Quadro 1000M- using a high performance implementation of Gauss-Jordan elimination (GJE) for matrix inversion. The blocked version of this algorithm employed in the experimental evaluation mostly comprises matrix-matrix products, so that the results from the evaluation carry beyond the simple matrix inversion and are representative for a wide variety of dense linear algebra operations/codes.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[En este trabajo se estudia el desempeño, potencia y consumo energético necesario al utilizar un procesador convencional Intel Xeon multi-core y dos arquitecturas de bajo consumo -como son el procesador Intel Atom y un sistema con un procesador ARM quad-core Cortex A9 conectado a una GPU NVIDIA Quadro 1000M- para computar la inversión de matrices mediante una implementación optimizada del algoritmo de eliminación de Gauss-Jordan (GJE). El algoritmo a bloques utilizado para realizar la evaluación experimental se basa fuertemente en la operación producto matriz-matriz, por lo tanto los resultados obtenidos no solo son aplicables a la inversión de matrices, sino que, son representativos para una gama amplia de operaciones de álgebra lineal densa.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Dense Linear Algebra]]></kwd>
<kwd lng="en"><![CDATA[Gauss-Jordan]]></kwd>
<kwd lng="en"><![CDATA[Power]]></kwd>
<kwd lng="en"><![CDATA[Energy]]></kwd>
<kwd lng="es"><![CDATA[Álgebra lineal densa]]></kwd>
<kwd lng="es"><![CDATA[Gauss-Jordan]]></kwd>
<kwd lng="es"><![CDATA[Energía]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <div class="maketitle">    <b><font face="Verdana" size="4">Trading Off Performance for Energy in Linear Algebra Operations with Applications in Control Theory</font></b>    <div class="author" >    <font face="Verdana" size="2"> <span  class="ecbx-1200">Peter&#x00A0;Benner</span> <br /> <span  class="ecrm-1200">Max Planck Institute for Dynamics of Complex Technical Systems,</span> <br />                  <span  class="ecrm-1200">Magdeburg, Germany, D-39106,</span> <br />           <span  class="ecti-1200"><a href="mailto:benner@mpi-magdeburg.mpg.de">benner@mpi-magdeburg.mpg.de</a> </span><br class="and" /><span  class="ecbx-1200">Pablo Ezzatti</span> <br />        <span  class="ecrm-1200">Facultad de Ingeniería, Universidad de la República,</span> <br />                    <span  class="ecrm-1200">Montevideo, Uruguay, 11300,</span> <br />         <span  class="ecti-1200"><a href="mailto:pezzatti@fing.edu.uy ">pezzatti@fing.edu.uy </a></span><br class="and" /><span  class="ecbx-1200">Enrique&#x00A0;S.&#x00A0;Quintana-Ortí</span> <br />      <span  class="ecrm-1200">Departamento de Ingeniería y Ciencia de Computadores,</span> <br />                        <span  class="ecrm-1200">Universidad Jaume I,</span> <br />                      <span  class="ecrm-1200">Castellón, Spain, 12.071,</span> <br />               <span  class="ecti-1200"><a href="mailto:quintana@icc.uji.es ">quintana@icc.uji.es </a></span><br class="and" /><span  class="ecbx-1200">Alfredo&#x00A0;Remón</span> <br /> <span  class="ecrm-1200">Max Planck Institute for Dynamics of Complex Technical Systems,</span> <br />                  <span  class="ecrm-1200">Magdeburg, Germany, D-39106,</span> <br />                   <span  class="ecti-1200"><a href="mailto:remon@mpi-magdeburg.mpg.de">remon@mpi-magdeburg.mpg.de</a> </span> </font></div><font face="Verdana" size="2"><br /> </font>     <div class="date" ></div>    </div> <!--l. 
148-->    <p >    <div  class="abstract"  >     <div class="center"  > <!--l. 149-->    <p >     <div class="minipage"><!--l. 149-->    <p ><font face="Verdana" size="2"><span  class="ecbx-1000">Abstract </span><br  class="newline" /><!--l. 151--></font>    <p ><font face="Verdana" size="2">We analyze the performance-power-energy balance of a conventional Intel Xeon multicore processor  and  two  low-power  architectures  &#8211;an  Intel  Atom  processor  and  a  system with a quad-core ARM Cortex A9+NVIDIA Quadro 1000M&#8211; using a high performance implementation of Gauss-Jordan elimination (GJE) for matrix inversion. The blocked version  of  this  algorithm  employed  in  the  experimental  evaluation  mostly  comprises matrix-matrix products, so that the results from the evaluation carry beyond the simple matrix  inversion  and  are  representative  for  a  wide  variety  of  dense  linear  algebra operations/codes. <!--l. 159--></font>    ]]></body>
<body><![CDATA[<p ><font face="Verdana" size="2">Resumen <!--l. 161--></font>    <p ><font face="Verdana" size="2">En este trabajo se estudia el desempeño, potencia y consumo energético necesario al utilizar un procesador convencional Intel Xeon multi-core y dos arquitecturas de bajo consumo -como son el procesador Intel Atom y un sistema con un procesador ARM quad-core Cortex A9 conectado a una GPU NVIDIA Quadro 1000M- para computar la  inversión  de  matrices  mediante  una  implementación  optimizada  del  algoritmo  de eliminación de Gauss-Jordan (GJE). El algoritmo a bloques utilizado para realizar la evaluación experimental se basa fuertemente en la operación producto matriz-matriz, por lo tanto los resultados obtenidos no solo son aplicables a la inversión de matrices, sino que, son representativos para una gama amplia de operaciones de álgebra lineal densa. <!--l. 169--></font>    <p ><font face="Verdana" size="2"><span  class="ecbx-1000">Keywords: </span>Dense Linear Algebra, Gauss&#8211;Jordan, Power, Energy <!--l. 171--></font>    <p ><font face="Verdana" size="2">Spanish keywords: Álgebra lineal densa, Gauss-Jordan, Energía <!--l. 173--></font>    <p ><font face="Verdana" size="2">Received 2013-09-08, Revised 2014-03-10, Accepted 2014-03-10   </font> </div></div> </div>         <p><font face="Verdana" size="2"><span class="titlemark">1   </span> <a   id="x1-10001"></a>Introduction</font></p> <!--l. 180-->    <p ><font face="Verdana" size="2">General-purpose multicore architectures and graphics processor units (GPUs) dominate today&#8217;s landscape of high performance computing (HPC), offering unprecedented levels of raw performance when aggregated to build the systems of the Top500 list&#x00A0;<span class="cite">[<a  name="bXtop500"> </a><a  href="#Xtop500">1</a>]</span>. 
While the performance-power trade-off of HPC platforms has also enjoyed considerable advances in the past few years&#x00A0;<span class="cite">[<a  name="bXgreen500"> </a><a  href="#Xgreen500">2</a>]</span> &#8212;mostly due to the deployment of heterogeneous platforms equipped with hardware accelerators (e.g., NVIDIA and AMD graphics processors, Intel Xeon Phi) or the adoption of low-power multicore processors (IBM PowerPC A2, ARM chips, etc.)&#8212; much remains to be done from the perspective of energy efficiency. In particular, power consumption has been identified as a key challenge that will have to be confronted to render Exascale systems feasible by 2020&#x00A0;<span class="cite">[<a  name="bXexascalechallenge"> </a><a  href="#Xexascalechallenge">3</a>,&#x00A0;<a  name="bXDongarraEA11"> </a><a  href="#XDongarraEA11">4</a>,&#x00A0;<a  name="bXDuranton13"> </a><a  href="#XDuranton13">5</a>]</span>. Even if the current pace of improvement of the performance-power ratio can be maintained (a factor of about <img  src="/img/revistas/cleiej/v17n1/1a040x.png" alt="5&#x00D7; "  class="math" > in the last 5 years&#x00A0;<span class="cite">[<a  name="bXgreen500"> </a><a  href="#Xgreen500">2</a>]</span>), the ambitious power budget of 20&#8211;40&#x00A0;MWatts set for yielding a sustained ExaFLOPS (i.e., <img  src="/img/revistas/cleiej/v17n1/1a041x.png" alt="1018  "  class="math" > floating-point arithmetic operations, or flops, per second) by the end of this decade will clearly be exceeded. <!--l. 197--></font>    <p >   <font face="Verdana" size="2">In recent years, a number of HPC prototypes have proposed the use of low-power technology, initially designed for mobile appliances like smart phones and tablets, to deliver high MFLOPS/Watt rates&#x00A0;<span class="cite">[<a  name="bXcrestaweb"> </a><a  href="#Xcrestaweb">6</a>,&#x00A0;<a  name="bXmontblancweb"> </a><a  href="#Xmontblancweb">7</a>]</span>. 
Following this trend, in this paper we investigate the performance, power and energy consumption of two low-power architectures, concretely an Intel Atom and a hybrid system composed of a multicore ARM processor and an NVIDIA 96-core GPU, and a general-purpose multicore processor, using as a workhorse matrix inversion via Gauss-Jordan elimination (GJE)&#x00A0;<span class="cite">[<a  name="bXHigham:2002:ASN"> </a><a  href="#XHigham:2002:ASN">8</a>]</span>. While this operation is key for the solution of important matrix equations arising in control theory via the matrix sign function&#x00A0;<span class="cite">[<a  name="bXCPE:CPE2933"> </a><a  href="#XCPE:CPE2933">9</a>,&#x00A0;<a  name="bXRob80"> </a><a  href="#XRob80">10</a>]</span>, the relevance of this study carries beyond the inversion operation/method or these specific applications. In particular, a blocked implementation of matrix inversion via GJE casts the bulk of the computations in terms of the matrix-matrix product, so that its performance as well as power dissipation and energy consumption are representative for many other dense linear algebra operations such as, e.g., the solution of linear systems, linear-least squares problems, eigenvalue computations, etc. <!--l. 213--></font>    <p >   <font face="Verdana" size="2">The rest of the paper is structured as follows. In Section&#x00A0;<a  href="#x1-20002">2<!--tex4ht:ref: sec:inversion --></a> we briefly review matrix inversion via the GJE method and an application of this particular operation. Specifically, we introduce the sign function, which plays an important role in the solution of several scientific problems arising in control theory. 
Next, in Section&#x00A0;<a  href="#x1-40003">3<!--tex4ht:ref: sec:parallel --></a>, we describe the specific implementation of the GJE method on the two low-power architectures selected for our study: <span  class="ecti-1000">i) </span>an Intel Atom processor not much different, from the programming point of view, from a mainstream multicore processor like the Intel Xeon or the AMD Opteron; and <span  class="ecti-1000">ii) </span>a hybrid board with ARM+NVIDIA technology that can be viewed as a low-power version of the heterogeneous platforms equipped with hardware accelerators that populate the first positions of the Top500 list. Finally, Sections&#x00A0;<a  href="#x1-50004">4<!--tex4ht:ref: sec:experiments --></a> and&#x00A0;<a  href="#x1-110005">5<!--tex4ht:ref: sec:remarks --></a> contain, respectively, the experimental evaluation and a few concluding remarks resulting from this investigation. <!--l. 227--></font>    <p >        ]]></body>
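<p><font face="Verdana" size="2">To make the role of fast inversion concrete before Section 2, the Newton iteration for the matrix sign function mentioned above can be sketched in a few lines of NumPy. This is our own illustration (the function name is ours), with a library inverse standing in for the GJE kernel studied in this paper:</font></p>

```python
import numpy as np

def sign_newton(A, maxit=50, tol=1e-12):
    """Newton iteration for the matrix sign function:
    A_{k+1} := (A_k + inverse(A_k)) / 2.
    Each sweep inverts a dense matrix, which is exactly the
    operation this paper accelerates with Gauss-Jordan elimination."""
    X = np.array(A, dtype=float)
    for _ in range(maxit):
        Xnew = 0.5 * (X + np.linalg.inv(X))
        err = np.linalg.norm(Xnew - X, 1)
        X = Xnew
        if tol * max(np.linalg.norm(X, 1), 1.0) > err:
            break
    return X
```

<p><font face="Verdana" size="2">For a matrix with no eigenvalues on the imaginary axis, the iterates converge quadratically to a matrix whose square is the identity.</font></p>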
<body><![CDATA[<p><font face="Verdana" size="2"><span class="titlemark">2   </span> <a   id="x1-20002"></a>Matrix Inversion via GJE and its Applications to Control Theory</font></p> <!--l. 231-->    <p ><font face="Verdana" size="2">The traditional approach to compute a matrix inverse is based on the LU factorization and consists of the following four steps: </font>      <ol  class="enumerate1" >      <li    class="enumerate" id="x1-2002x1"><font face="Verdana" size="2">Compute the LU factorization <img  src="/img/revistas/cleiej/v17n1/1a042x.png" alt="P A = LU  "  class="math" >, where <img  src="/img/revistas/cleiej/v17n1/1a043x.png" alt="P &#x2208; &#x211D;n&#x00D7;n  "  class="math" > is a permutation matrix, and <img  src="/img/revistas/cleiej/v17n1/1a044x.png" alt="L &#x2208; &#x211D;n&#x00D7;n  "  class="math" >      and <img  src="/img/revistas/cleiej/v17n1/1a045x.png" alt="U &#x2208; &#x211D;n&#x00D7;n  "  class="math" > are, respectively, unit lower and upper triangular factors&#x00A0;<span class="cite">[<a  name="bXGVL3"> </a><a  href="#XGVL3">11</a>]</span>. </font>      </li>      <li    class="enumerate" id="x1-2004x2"><font face="Verdana" size="2">Invert the triangular factor <img  src="/img/revistas/cleiej/v17n1/1a046x.png" alt="      -1 U &#x2192;  U  "  class="math" >. </font>      </li>      <li    class="enumerate" id="x1-2006x3"><font face="Verdana" size="2">Solve the system <img  src="/img/revistas/cleiej/v17n1/1a047x.png" alt="XL = U -1  "  class="math" > for <img  src="/img/revistas/cleiej/v17n1/1a048x.png" alt="X  "  class="math" >. </font>      </li>      <li    class="enumerate" id="x1-2008x4"><font face="Verdana" size="2">Undo the permutations <img  src="/img/revistas/cleiej/v17n1/1a049x.png" alt="A -1 := XP  "  class="math" >.</font></li>    </ol> <!--l.
244-->    <p >   <font face="Verdana" size="2">An alternative approach to compute a matrix inverse is the <span  class="eccc-1000">GJE</span>, an appealing method for matrix inversion on current architectures, because it presents a computational cost and numerical properties analogous to those of traditional approaches&#x00A0;<span class="cite">[<a  name="bXHigham:2002:ASN"> </a><a  href="#XHigham:2002:ASN">8</a>]</span> but superior performance on a variety of parallel architectures, including clusters&#x00A0;<span class="cite">[<a  name="bXQuiQSG01"> </a><a  href="#XQuiQSG01">12</a>]</span>, general-purpose multicore processors and GPUs&#x00A0;<span class="cite">[<a  name="bXCPE:CPE2933"> </a><a  href="#XCPE:CPE2933">9</a>]</span>.  <!--l. 251--></font>    <p >   <font face="Verdana" size="2">Figure&#x00A0;<a  href="#x1-20091">1<!--tex4ht:ref: fig:alg_gje_blk --></a> shows a blocked version of the GJE algorithm for matrix inversion using the FLAME notation. There <img  src="/img/revistas/cleiej/v17n1/1a0410x.png" alt="m (A)  "  class="math" > stands for the number of rows of matrix&#x00A0;<img  src="/img/revistas/cleiej/v17n1/1a0411x.png" alt="A  "  class="math" > while, for details on the notation, we refer the reader to&#x00A0;<span class="cite">[<a  name="bXRecipe"> </a><a  href="#XRecipe">13</a>,&#x00A0;<a  name="bXGunnels:2001:FFL"> </a><a  href="#XGunnels:2001:FFL">14</a>]</span>. A description of the unblocked version of GJE, called from inside the blocked routine, can be found in&#x00A0;<span class="cite">[<a  name="bXQuiQSG01"> </a><a  href="#XQuiQSG01">12</a>]</span>; for simplicity, we do not include the application of pivoting during the factorization, but details can be found there as well. 
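<p><font face="Verdana" size="2">As a companion to this description, the unblocked, unpivoted sweep can be sketched in NumPy as follows. This is our own illustrative code (the routine name is ours), not the FLAME routine of Figure 1, and a production version would add the partial pivoting discussed above:</font></p>

```python
import numpy as np

def gje_invert(A):
    """Unblocked, in-place Gauss-Jordan elimination without pivoting:
    overwrites a copy of A with its inverse in roughly 2n^3 flops."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n):
        d = A[k, k]                  # pivot, assumed nonzero (no pivoting)
        col, row = A[:, k].copy(), A[k, :].copy()
        A[:, k] = -col / d           # new k-th column of the transform
        A[k, :] = row / d            # scale the k-th row
        m = np.arange(n) != k        # rank-1 update of the rest
        A[np.ix_(m, m)] += np.outer(A[m, k], row[m])
        A[k, k] = 1.0 / d
    return A
```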
Given a square (nonsingular) matrix of size <img  src="/img/revistas/cleiej/v17n1/1a0412x.png" alt="n = m (A)  "  class="math" >, the cost of matrix inversion using this algorithm is <img  src="/img/revistas/cleiej/v17n1/1a0413x.png" alt="2n3  "  class="math" > flops, performing the inversion in-place so that, upon completion, the entries of <img  src="/img/revistas/cleiej/v17n1/1a0414x.png" alt="A  "  class="math" > are overwritten with those of its inverse.  <!--l. 107--></font>    <p >   <font face="Verdana" size="2">   <a   id="x1-20091"></a></font><hr class="float">    <div class="float"  >      <div class="center"  > <!--l. 110-->    <p >     <div class="pic-tabular"> <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a04f1.jpg"  ></font></div></div> <font face="Verdana" size="2"> <br /> </font>     ]]></body>
<body><![CDATA[<div class="caption"  ><font face="Verdana" size="2"><span class="id">Figure&#x00A0;1: </span><span   class="content">Blocked algorithm for matrix inversion via GJE without pivoting.</span></font></div><!--tex4ht:label?: x1-20091 -->     </div><hr class="endfloat" /> <!--l. 263-->    <p >   <font face="Verdana" size="2">Our primary interest in the GJE matrix inversion method is twofold. First, most of the computations of the blocked algorithm are matrix-matrix products (see Figure&#x00A0;<a  href="#x1-20091">1<!--tex4ht:ref: fig:alg_gje_blk --></a>). Therefore, the conclusions from our power&#8211;energy&#8211;performance evaluation can be extended to many other dense linear algebra kernels such as the solution of linear systems via the LU and Cholesky factorizations, and least-squares computations using the QR factorization&#x00A0;<span class="cite">[<a  name="bXGVL3"> </a><a  href="#XGVL3">11</a>]</span>, among others. <!--l. 269--></font>    <p >   <font face="Verdana" size="2">Additionally, explicit matrix inversion is required during the computation of the sign function of a matrix <img  src="/img/revistas/cleiej/v17n1/1a0416x.png" alt="A  "  class="math" > using the Newton iteration method&#x00A0;<span class="cite">[<a  name="bXRob80"> </a><a  href="#XRob80">10</a>]</span>, which we describe briefly in the next sub-section. </font>        <p><font face="Verdana" size="2"><span class="titlemark">2.1   </span> <a   id="x1-30002.1"></a>Matrix sign function</font></p> <!--l.
276-->    <p ><font face="Verdana" size="2">Consider a matrix <img  src="/img/revistas/cleiej/v17n1/1a0417x.png" alt="A &#x2208; &#x211D;n&#x00D7;n  "  class="math" > with no eigenvalues on the imaginary axis, and let </font>    <table  class="equation"><tr><td><font face="Verdana" size="2"><a   id="x1-3001r1"></a>        </font>    <center class="math-display" > <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a0418x.png" alt="        (         ) A = T-1   J-   0   T,            0  J+ " class="math-display" ></font></center></td><td class="equation-label">        <font face="Verdana" size="2">(1)</font></td></tr></table> <!--l. 289-->    <p > <font face="Verdana" size="2">be its Jordan decomposition, where the eigenvalues of <img  src="/img/revistas/cleiej/v17n1/1a0419x.png" alt="J- &#x2208; &#x211D;j&#x00D7;j  "  class="math" > and <img  src="/img/revistas/cleiej/v17n1/1a0420x.png" alt="J+ &#x2208; "  class="math" ><img  src="/img/revistas/cleiej/v17n1/1a0421x.png" alt="&#x211D;(n-j)&#x00D7; (n-j)  "  class="math" > have negative and positive real parts&#x00A0;<span class="cite">[<a  name="bXGVL3"> </a><a  href="#XGVL3">11</a>]</span> respectively. <!--l. 296--></font>    <p >   <font face="Verdana" size="2">The <span  class="ecti-1000">matrix sign function </span>of <img  src="/img/revistas/cleiej/v17n1/1a0422x.png" alt="A  "  class="math" > is then defined as </font>    <table  class="equation"><tr><td><font face="Verdana" size="2"><a   id="x1-3002r2"></a>        </font>    <center class="math-display" > <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a0423x.png" alt="             (             ) sign(A) = T -1 - Ij    0     T,                  0    In- j " class="math-display" ></font></center></td><td class="equation-label">        <font face="Verdana" size="2">(2)</font></td></tr></table> <!--l. 
306-->    <p > <font face="Verdana" size="2">where <img  src="/img/revistas/cleiej/v17n1/1a0424x.png" alt="I  "  class="math" > denotes the identity matrix of the order indicated by the subscript. The matrix sign function is a useful numerical tool for the solution of control theory problems (model reduction, optimal control)&#x00A0;<span class="cite">[<a  name="bXPetCK91"> </a><a  href="#XPetCK91">15</a>]</span>, and the bottleneck computation in many lattice quantum chromodynamics computations <span class="cite">[<a  name="bXFro_et_al00"> </a><a  href="#XFro_et_al00">16</a>]</span> and dense linear algebra computations (block diagonalization, eigenspectrum separation)&#x00A0;<span class="cite">[<a  name="bXGVL3"> </a><a  href="#XGVL3">11</a>,&#x00A0;<a  name="bXBye87"> </a><a  href="#XBye87">17</a>]</span>. Large-scale problems such as those arising, e.g., in control theory often involve matrices of dimension <img  src="/img/revistas/cleiej/v17n1/1a0425x.png" alt="n &#x2192; O(10,000- 100,000)  "  class="math" >&#x00A0;<span class="cite">[<a  name="bXimtek"> </a><a  href="#Ximtek">18</a>]</span>. <!--l. 319--></font>    <p >   <font face="Verdana" size="2">There are simple iterative schemes for the computation of the sign function. Among these, the Newton iteration, given by </font>    <table  class="equation"><tr><td><font face="Verdana" size="2"><a   id="x1-3003r3"></a>        </font>    <center class="math-display" > <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a0426x.png" alt="  A0  :=  A,1      - 1 Ak+1  :=   2(Ak + Ak ), k = 0,1,2,..., " class="math-display" ></font></center></td><td class="equation-label">        <font face="Verdana" size="2">(3)</font></td></tr></table> <!--l. 328-->    <p > <font face="Verdana" size="2"> <!--l. 330--></font>    ]]></body>
<body><![CDATA[<p ><font face="Verdana" size="2">is especially appealing for its simplicity, efficiency, parallel performance, and asymptotic quadratic convergence&#x00A0;<span class="cite">[<a  name="bXBye87"> </a><a  href="#XBye87">17</a>,&#x00A0;<a  name="bXBenQ99"> </a><a  href="#XBenQ99">19</a>,&#x00A0;<a  name="bXBenEQR09"> </a><a  href="#XBenEQR09">20</a>]</span>. However, even if <img  src="/img/revistas/cleiej/v17n1/1a0427x.png" alt="A  "  class="math" > is sparse, <img  src="/img/revistas/cleiej/v17n1/1a0428x.png" alt="{Ak}k=1,2,...  "  class="math" > are in general full dense matrices and, thus, the scheme in&#x00A0;(<a  href="#x1-3003r3">3<!--tex4ht:ref: eqn:newton --></a>) roughly requires <img  src="/img/revistas/cleiej/v17n1/1a0429x.png" alt="2n3  "  class="math" > floating-point arithmetic operations (flops) per iteration.  <!--l. 340--></font>    <p >        <p><font face="Verdana" size="2"><span class="titlemark">3   </span> <a   id="x1-40003"></a>High Performance Implementation of GJE on Multicore and Manycore Architectures</font></p> <!--l. 344-->    <p ><font face="Verdana" size="2">As previously stated, the GJE algorithm for matrix inversion casts the bulk of the computations in terms of matrix-matrix products; see Figure&#x00A0;<a  href="#x1-20091">1<!--tex4ht:ref: fig:alg_gje_blk --></a>. In particular, provided that the block size <img  src="/img/revistas/cleiej/v17n1/1a0430x.png" alt="b  "  class="math" > is chosen there as <img  src="/img/revistas/cleiej/v17n1/1a0431x.png" alt="64 &#x2264; b &#x226A; n  "  class="math" >, the computational cost of the factorization of the &#8220;current&#8221; panel <img  src="/img/revistas/cleiej/v17n1/1a0432x.png" alt=" &#x02C6;  [ T   T   T ]T A =  A01;A11;A21  "  class="math" >, performed inside the routine GJE_<span  class="eccc-1000"><span  class="small-caps">UNB</span></span>, is negligible compared with that of the update of the remaining matrix blocks following that operation. <!--l.
352--></font>    <p >   <font face="Verdana" size="2">Therefore, the key to attaining high performance with the <span  class="eccc-1000">GJE</span>&#x00A0;algorithm primarily relies on using a highly tuned implementation of the matrix-matrix product and, under certain conditions on parallel architectures, the reduction of the serial bottleneck that the factorization of <img  src="/img/revistas/cleiej/v17n1/1a0433x.png" alt=" &#x02C6; A  "  class="math" > represents by applying, e.g., a look-ahead strategy&#x00A0;<span class="cite">[<a  name="bXLookahead"> </a><a  href="#XLookahead">21</a>]</span>. <!--l. 357--></font>    <p >   <font face="Verdana" size="2">Fortunately, there exist nowadays highly efficient routines for the matrix-matrix multiplication, embedded into mathematical libraries such as Intel MKL, AMD ACML, IBM ESSL, or NVIDIA CUBLAS; but also as part of independent development efforts like GotoBLAS2&#x00A0;<span class="cite">[<a  name="bXblasgoto"> </a><a  href="#Xblasgoto">22</a>]</span> or OpenBLAS&#x00A0;<span class="cite">[<a  name="bXopenblas"> </a><a  href="#Xopenblas">23</a>]</span>. Therefore, for the implementation of GJE on the Atom processor (as well as for the Intel Xeon processor that will be used for reference in the experimental evaluation), we simply leverage the matrix-matrix product kernel <span  class="ectt-1000">sgemm </span>in a recent version of Intel MKL. <!--l. 364--></font>    <p >   <font face="Verdana" size="2">Let us consider next the hybrid SECO development kit&#x00A0;<span class="cite">[<a  name="bXseco"> </a><a  href="#Xseco">24</a>]</span>, which combines a quad-core NVIDIA Tegra3/ARM Cortex A9 processor and an NVIDIA Quadro 1000M GPU with 96 cores, both processors on a single board (see Figure <a  href="#x1-40012">2<!--tex4ht:ref: fig:carma --></a>).<br  class="newline" /> <!--l.
368--></font>    <p >   <hr class="figure">    <div class="figure"  >  <font face="Verdana" size="2">  <a   id="x1-40012"></a>  </font>      <div class="center"  > <!--l. 369-->    ]]></body>
<body><![CDATA[<p >  <font face="Verdana" size="2">  <!--l. 370--></font>    <p ><font face="Verdana" size="2"><img  src="/img/revistas/cleiej/v17n1/1a04f2.jpg" alt="PIC"   ></font></div> <font face="Verdana" size="2"> <br /> </font>     <div class="caption"  ><font face="Verdana" size="2"><span class="id">Figure&#x00A0;2: </span><span   class="content">SECO board, which includes a quad-core ARM processor and a Quadro GPU.</span></font></div><!--tex4ht:label?: x1-40012 -->  <!--l. 374-->    <p >   </div><hr class="endfigure"> <!--l. 377-->    <p >   <font face="Verdana" size="2">The properties of the GJE algorithm and the hybrid nature of the target platform ask for an implementation that harnesses the concurrency of the operation while paying special attention to reducing the negative impact of communications between the memory address spaces of the Cortex A9 processor and the Quadro GPU. In previous work we introduced a CPU-GPU implementation of the GJE algorithm for matrix inversion&#x00A0;<span class="cite">[<a  name="bXCPE:CPE2933"> </a><a  href="#XCPE:CPE2933">9</a>]</span>, and demonstrated the benefits of mapping each operation to the most convenient device: multicore processor or manycore accelerator. In this work we apply a similar approach to obtain a tuned implementation of the GJE algorithm for the SECO platform. The highly parallel matrix-matrix products are computed on the GPU. On the other hand, the panel factorizations performed with the unblocked algorithm GJE_<span  class="eccc-1000"><span  class="small-caps">UNB</span></span>, which consist of fine-grained operations, are computed on the multicore CPU. This algorithm is summarized in Figure <a  href="#x1-40023">3<!--tex4ht:ref: fig:alg_gje_blk_con --></a> (note that, for simplicity, pivoting is omitted in the figure, though partial column pivoting is performed in all our implementations). 
The block size <img  src="/img/revistas/cleiej/v17n1/1a0434x.png" alt="b  "  class="math" > is tuned for the architecture and also for each problem dimension. <!--l. 395--></font>    <p >   <font face="Verdana" size="2">Additionally, we include a look-ahead technique that allows overlapping the factorization of the <img  src="/img/revistas/cleiej/v17n1/1a0435x.png" alt="(k+ 1)-th  "  class="math" > step with part of the updates performed during iteration <img  src="/img/revistas/cleiej/v17n1/1a0436x.png" alt="k  "  class="math" >, while also keeping a low communication overhead. <!--l. 400--></font>    <p >   <font face="Verdana" size="2">Finally, the parallelism intrinsic to the linear algebra operations that appear in algorithms GJE_<span  class="eccc-1000"><span  class="small-caps">BLK</span> </span>and GJE_<span  class="eccc-1000"><span  class="small-caps">UNB</span> </span>is exploited using parallel implementations of the BLAS. In particular, we employ kernels from the CUBLAS and reference BLAS libraries (parallelized with OpenMP directives), for the Quadro GPU and the ARM processor, respectively.  <!--l. 121--></font>    <p >   <font face="Verdana" size="2">   <a   id="x1-40023"></a></font><hr class="float">    <div class="float"  >      <div class="center"  > <!--l. 124-->    ]]></body>
<body><![CDATA[<p >     <div class="pic-tabular"> <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a04f3.jpg"  ></font></div></div> <font face="Verdana" size="2"> <br /> </font>     <div class="caption"  ><font face="Verdana" size="2"><span class="id">Figure&#x00A0;3: </span><span   class="content">Blocked algorithm for matrix inversion via GJE without pivoting in the SECO platform.</span></font></div><!--tex4ht:label?: x1-40023 -->     </div><hr class="endfloat" />        <p><font face="Verdana" size="2"><span class="titlemark">4   </span> <a   id="x1-50004"></a>Evaluation</font></p> <!--l. 413-->    <p ><font face="Verdana" size="2">This section is divided into three parts. First, we introduce the target platforms. This is followed by a description of the power measurement system. Finally, the results obtained are presented and analyzed. <!--l. 418--></font>    <p >        <p><font face="Verdana" size="2"><span class="titlemark">4.1   </span> <a   id="x1-60004.1"></a>Hardware platforms</font></p> <!--l. 420-->    <p ><font face="Verdana" size="2">We evaluate the matrix inversion routines on three target hardware platforms: a state-of-the-art server equipped with two multicore Intel Xeon (&#8220;Nehalem&#8221;) processors, an Intel Atom-based laptop, and a hybrid ARM+NVIDIA board from SECO. Details about the hardware and the compilers employed in each platform can be found in Table&#x00A0;<a  href="#x1-60011">1<!--tex4ht:ref: tab:hw --></a>. <!--l. 425--></font>    <p >   <font face="Verdana" size="2">The inversion routines for the Xeon and Atom processors heavily rely on the matrix-matrix product kernel in Intel MKL (versions 10.3 and 11.0, respectively). The hybrid implementation for the SECO platform makes intensive use of the kernels in CUBLAS (version 5.0) and the legacy implementation of BLAS (available at <a  href="http://www.netlib.org/" class="url" ><span  class="ectt-1000">http://www.netlib.org/</span></a>), parallelized with OpenMP. (We note, however, that the amount of computation that is performed in the cores of the Cortex A9 processor is small, and mostly based on BLAS-1 and BLAS-2 operations, so that we do not expect significant differences if a tuned version of BLAS were used for this architecture.) <!--l. 436--></font>    <p >   <font face="Verdana" size="2">The codes were compiled with the <span  class="ectt-1000">-O3 </span>optimization flag and all the computations were performed using single precision arithmetic. <br  class="newline" />    </font>        ]]></body>
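For reference, the eliminations carried out by the unblocked variant GJE_UNB correspond to the classical in-place Gauss-Jordan inversion without pivoting. The following NumPy sketch illustrates one possible formulation of that scheme; it is an illustration under our reading of the algorithm, not the authors' BLAS-based implementation:

```python
import numpy as np

def gje_inverse(A):
    """In-place Gauss-Jordan elimination for matrix inversion (no pivoting).

    Illustrative sketch of an unblocked GJE variant; assumes the pivots
    A[k, k] never vanish, since no pivoting is applied.
    """
    A = np.array(A, dtype=float)      # work on a copy
    n = A.shape[0]
    for k in range(n):
        p = A[k, k]
        A[k, k] = 1.0                 # the pivot column is overwritten
        A[k, :] /= p                  # scale the pivot row
        for i in range(n):
            if i == k:
                continue
            f = A[i, k]
            A[i, k] = 0.0
            A[i, :] -= f * A[k, :]    # eliminate row i
    return A                          # A now holds its own inverse
```

A blocked variant (as in GJE_BLK) applies the same eliminations to panels of b columns at a time, so that the bulk of the arithmetic is cast in terms of matrix-matrix products and can be mapped onto BLAS-3 kernels.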
<body><![CDATA[<div class="table">  <!--l. 439-->    <p >   <font face="Verdana" size="2">   <a   id="x1-60011"></a></font><hr class="float">    <div class="float"  >  <font face="Verdana" size="2">  <br /> </font>     <div class="caption"  ><font face="Verdana" size="2"><span class="id">Table&#x00A0;1: </span><span   class="content">Architectures employed in the experimental evaluation and the average power dissipation while idle (<img  src="/img/revistas/cleiej/v17n1/1a0438x.png" alt="PI  "  class="math" >)</span></font></div><!--tex4ht:label?: x1-60011 -->     <div class="center"  > <!--l. 441-->    <p >     <div class="pic-tabular"> <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a04t1.jpg" ></font></div></div>     </div><hr class="endfloat" />    </div>         <p><font face="Verdana" size="2"><span class="titlemark">4.2   </span> <a   id="x1-70004.2"></a>Power measurement</font></p> <!--l. 461-->    <p ><font face="Verdana" size="2">In order to measure power, we connected a <span  class="eccc-1000">W<span  class="small-caps">ATTS</span>U<span  class="small-caps">P</span>? P<span  class="small-caps">RO</span> </span>wattmeter (accuracy of <img  src="/img/revistas/cleiej/v17n1/1a0440x.png" alt="± 1.5  "  class="math" >% and a rate of 1 sample/sec.) to the power line from the electric socket to the power supply unit (PSU), collecting the results on a separate server. All tests were executed for a minimum of 1 minute, after a warm-up period of 2 minutes. <!--l. 
468--></font>    <p >   <font face="Verdana" size="2">Since some of the platforms where the processors are embedded contain other devices &#8212;e.g., disks, network interface cards, and on the Atom laptop even the LCD display&#8212; on each platform we calculated the average power while idle for 1 minute, <img  src="/img/revistas/cleiej/v17n1/1a0441x.png" alt="PI  "  class="math" >, and then used this value to calculate the <span  class="ecti-1000">net energy</span>, corresponding to the consumption after subtracting <img  src="/img/revistas/cleiej/v17n1/1a0442x.png" alt="PI  "  class="math" > from the power samples. We expect this measure to allow a fair comparison between the architectures, since in this manner we only account for the energy that is necessary to do the actual work. <!--l. 476--></font>    ]]></body>
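The net-energy computation described above amounts to subtracting the idle power from each 1 Hz sample before integrating over time. A minimal sketch follows; the function name and the explicit sampling interval are illustrative, not taken from the paper's measurement scripts:

```python
def energy_from_samples(power_samples, p_idle, dt=1.0):
    """Total and net energy (in Joules) from wattmeter readings.

    power_samples: power readings in Watts, one every dt seconds.
    p_idle: average idle power PI of the platform, in Watts.
    """
    total = sum(p * dt for p in power_samples)         # total energy (J)
    net = total - p_idle * dt * len(power_samples)     # subtract idle draw
    return total, net
```

For instance, three 1-second samples of 100, 110 and 105 W on a platform with PI = 90 W yield a total energy of 315 J but a net energy of only 45 J, which is the portion attributable to the actual work.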
<body><![CDATA[<p >        <p><font face="Verdana" size="2"><span class="titlemark">4.3   </span> <a   id="x1-80004.3"></a>Experimental evaluation</font></p> <!--l. 479-->    <p ><font face="Verdana" size="2">The experimental evaluation is performed in two stages. First, we analyze the performance and power-energy consumption of the GJE algorithm for matrix inversion. Next, we study the impact of the Gauss-Jordan inversion method on the computation of the matrix sign function. <!--l. 483--></font>    <p >        <p><font face="Verdana" size="2"><span class="titlemark">4.3.1   </span> <a   id="x1-90004.3.1"></a>Matrix inversion</font></p> <!--l. 485-->    <p ><font face="Verdana" size="2">Tables&#x00A0;<a  href="#x1-90012">2<!--tex4ht:ref: tab:mc1 --></a>,&#x00A0;<a  href="#x1-90023">3<!--tex4ht:ref: tab:mc2 --></a> and&#x00A0;<a  href="#x1-90034">4<!--tex4ht:ref: tab:mc3 --></a> collect the results obtained from the execution of the different implementations of the GJE matrix inversion algorithm on the three target platforms, for problems of dimension <img  src="/img/revistas/cleiej/v17n1/1a0443x.png" alt="n  "  class="math" > varying from 256 to 8,192. The same information is refined and presented graphically, in terms of GFLOPS and GFLOPS/Watt, in Figures&#x00A0;<a  href="#x1-90044">4<!--tex4ht:ref: fig:mc1 --></a>,&#x00A0;<a  href="#x1-90055">5<!--tex4ht:ref: fig:mc2 --></a> and&#x00A0;<a  href="#x1-90066">6<!--tex4ht:ref: fig:mc3 --></a>. <!--l. 
489--></font>    <p >   <font face="Verdana" size="2">The results characterize the different performance-power-energy balance of the platforms: The Intel Xeon is considerably faster than the Intel Atom, by factors that range from more than 255<img  src="/img/revistas/cleiej/v17n1/1a0444x.png" alt="&#x00D7; "  class="math" > for the smaller problem dimensions, to about 50.8<img  src="/img/revistas/cleiej/v17n1/1a0445x.png" alt="&#x00D7; "  class="math" > for the larger ones; but the power dissipated by the Atom architecture is, depending on the problem size, 9.8 to 12.4<img  src="/img/revistas/cleiej/v17n1/1a0446x.png" alt="&#x00D7; "  class="math" > lower than that of the Intel Xeon architecture. The outcome of the combination of these two factors is that, from the perspective of total energy, the Intel Atom spends between 4.25 and 22.0<img  src="/img/revistas/cleiej/v17n1/1a0447x.png" alt="&#x00D7; "  class="math" > more energy than the Intel Xeon to compute the inverse; but the excess is only between 1.77 and 8.46<img  src="/img/revistas/cleiej/v17n1/1a0448x.png" alt="&#x00D7; "  class="math" > if we consider net energy. On the other hand, the SECO board presents quite an interesting balance. While being clearly slower than the Intel Xeon (especially for the smaller problems), this platform also shows a remarkable advantage from the point of view of energy efficiency. Thus, when the problem size is larger than 2,048, the ratios for the total and net energy of these two platforms are, respectively, up to 2.04 and 1.94<img  src="/img/revistas/cleiej/v17n1/1a0449x.png" alt="&#x00D7; "  class="math" > in favor of the SECO system. </font>         <div class="table">  <!--l. 511-->    <p >   <font face="Verdana" size="2">   <a   id="x1-90012"></a></font><hr class="float">    <div class="float"  >  <font face="Verdana" size="2">  <br /> </font>     ]]></body>
<body><![CDATA[<div class="caption"  ><font face="Verdana" size="2"><span class="id">Table&#x00A0;2:  </span><span   class="content">Time  (in  sec.);  GFLOPS;  average  and  maximum  power  consumption  (Pavg   and Pmax, respectively, in Watts); and total and net energy (Etot  and Enet, respectively, in Joules) in Xeon</span></font></div><!--tex4ht:label?: x1-90012 -->     <div class="center"  > <!--l. 514-->    <p >     <div class="pic-tabular"> <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a04t2.jpg" ></font></div></div>     </div><hr class="endfloat" />    </div>        <div class="table">  <!--l. 536-->    <p >   <font face="Verdana" size="2">   <a   id="x1-90023"></a></font><hr class="float">    <div class="float"  >  <font face="Verdana" size="2">  <br /> </font>     <div class="caption"  ><font face="Verdana" size="2"><span class="id">Table&#x00A0;3:  </span><span   class="content">Time  (in  sec.);  GFLOPS;  average  and  maximum  power  consumption  (Pavg  and  Pmax, respectively, in Watts); and total and net energy (Etot and Enet, respectively, in Joules) in Atom</span></font></div><!--tex4ht:label?: x1-90023 -->     <div class="center"  > <!--l. 539-->    <p >     ]]></body>
<body><![CDATA[<div class="pic-tabular"> <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a04t3.jpg"  ></font></div></div>     </div><hr class="endfloat" />    </div>        <div class="table">  <!--l. 560-->    <p >   <font face="Verdana" size="2">   <a   id="x1-90034"></a></font><hr class="float">    <div class="float"  >  <font face="Verdana" size="2">  <br /> </font>     <div class="caption"  ><font face="Verdana" size="2"><span class="id">Table&#x00A0;4:  </span><span   class="content">Time  (in  sec.);  GFLOPS;  average  and  maximum  power  consumption  (Pavg   and Pmax, respectively, in Watts); and total and net energy (Etot  and Enet, respectively, in Joules)  in SECO</span></font></div><!--tex4ht:label?: x1-90034 -->     <div class="center"  > <!--l. 563-->    <p >     <div class="pic-tabular"> <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a04t4.jpg"  ></font></div></div>     </div><hr class="endfloat" />    </div>  <!--l. 587-->    <p >   <hr class="figure">    <div class="figure"  >  <font face="Verdana" size="2">  <a   id="x1-90044"></a>  </font>      ]]></body>
<body><![CDATA[<div class="center"  > <!--l. 588-->    <p >  <font face="Verdana" size="2">  <!--l. 589--></font>    <p ><font face="Verdana" size="2"><img  src="/img/revistas/cleiej/v17n1/1a04f4.jpg" alt="PIC"   ></font></div> <font face="Verdana" size="2"> <br /> </font>     <div class="caption"  ><font face="Verdana" size="2"><span class="id">Figure&#x00A0;4: </span><span   class="content">Performance in the target platforms</span></font></div><!--tex4ht:label?: x1-90044 -->  <!--l. 593-->    <p >   </div><hr class="endfigure"> <!--l. 595-->    <p >   <hr class="figure">    <div class="figure"  >  <font face="Verdana" size="2">  <a   id="x1-90055"></a>  </font>      <div class="center"  > <!--l. 596-->    <p >  <font face="Verdana" size="2">  <!--l. 597--></font>    <p ><font face="Verdana" size="2"><img  src="/img/revistas/cleiej/v17n1/1a04f5.jpg" alt="PIC"   ></font></div> <font face="Verdana" size="2"> <br /> </font>     ]]></body>
<body><![CDATA[<div class="caption"  ><font face="Verdana" size="2"><span class="id">Figure&#x00A0;5: </span><span   class="content">Total performance-per-watt in the target platforms</span></font></div><!--tex4ht:label?: x1-90055 -->  <!--l. 601-->    <p >   </div><hr class="endfigure">  <!--l. 605-->    <p >   <hr class="figure">    <div class="figure"  >  <font face="Verdana" size="2">  <a   id="x1-90066"></a>  </font>      <div class="center"  > <!--l. 606-->    <p >  <font face="Verdana" size="2">  <!--l. 607--></font>    <p ><font face="Verdana" size="2"><img  src="/img/revistas/cleiej/v17n1/1a04f6.jpg" alt="PIC"   ><br /> </font> </div> <font face="Verdana" size="2"> <br /> </font>     <div class="caption"  ><font face="Verdana" size="2"><span class="id">Figure&#x00A0;6: </span><span   class="content">Net performance-per-watt in the target platforms </span></font></div><!--tex4ht:label?: x1-90066 -->  <!--l. 611-->    <p >   </div><hr class="endfigure">        <p><font face="Verdana" size="2"><span class="titlemark">4.3.2   </span> <a   id="x1-100004.3.2"></a>Matrix sign function</font></p> <!--l. 615-->    ]]></body>
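The sign function is computed here via the standard Newton iteration (see Section 2.1), whose cost per step is dominated by one matrix inversion. The following NumPy sketch illustrates the scheme; it is an illustration of the textbook iteration, not the authors' hybrid implementation, and `np.linalg.inv` merely stands in for the GJE inversion kernel:

```python
import numpy as np

def sign_newton(A, iters=20):
    """Matrix sign function via the Newton iteration S <- (S + S^-1) / 2.

    As in the experiments, the number of iterations is fixed (20 here);
    each step requires one matrix inversion, which is why the inversion
    kernel dominates the overall runtime and energy.
    """
    S = np.array(A, dtype=float)
    for _ in range(iters):
        S = 0.5 * (S + np.linalg.inv(S))   # one inversion per iteration
    return S
```

Since sign(A) is involutory, a quick sanity check is that the computed S satisfies S&#x00B2; &#x2248; I (for matrices with no eigenvalues on the imaginary axis).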
<body><![CDATA[<p ><font face="Verdana" size="2">We next evaluate the performance and power-energy consumption to solve the matrix sign function (see Section <a  href="#x1-30002.1">2.1<!--tex4ht:ref: sec:MSF --></a>) using the three target platforms, i.e., Xeon, Atom and SECO. Concretely, Table <a  href="#x1-100015">5<!--tex4ht:ref: tab:fs1 --></a> presents the runtime to obtain the matrix sign function (in seconds) for four different problem dimensions: 256, 2,048, 5,120 and 8,192. Additionally, we report separately the execution time devoted to the matrix inversions, together with its percentage of the overall process time. The time required for other problem dimensions can be easily inferred from the results shown in the previous subsection and the data in Table <a  href="#x1-100015">5<!--tex4ht:ref: tab:fs1 --></a>. As the input matrices were randomly generated, and do not correspond to a real problem, the number of steps of the algorithm to reach the solution was fixed at 20 iterations. </font>        <div class="table">  <!--l. 626-->    <p >   <font face="Verdana" size="2">   <a   id="x1-100015"></a></font><hr class="float">    <div class="float"  >  <font face="Verdana" size="2">  <br /> </font>     <div class="caption"  ><font face="Verdana" size="2"><span class="id">Table&#x00A0;5: </span><span   class="content">Time (in sec.) and percentage of matrix inversion runtime to calculate the matrix sign function</span></font></div><!--tex4ht:label?: x1-100015 -->     <div class="center"  > <!--l. 628-->    <p >     <div class="pic-tabular"> <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a04t5.jpg" ></font></div></div>     </div><hr class="endfloat" />    </div>  <!--l. 
656-->    <p >   <font face="Verdana" size="2">The time evaluation summarized in Table <a  href="#x1-100015">5<!--tex4ht:ref: tab:fs1 --></a> shows the close relation between the computational cost of the matrix sign function algorithm and that of the matrix inversion kernel. Indeed, the computation of the matrix inverses represents at least 95% of the execution time of the overall process. This makes it possible to derive a rough approximation of the energy consumption from the data obtained during the evaluation of the matrix inversion kernels. In particular, Table <a  href="#x1-100026">6<!--tex4ht:ref: tab:fs2 --></a> summarizes the total runtime (in seconds) and the energy consumption (in Joules) for the computation of the matrix sign function on the three platforms and four different dimension cases (256, 2,048, 5,120 and 8,192). <br  class="newline" />    </font>        <div class="table">  <!--l. 669-->    ]]></body>
<body><![CDATA[<p >   <font face="Verdana" size="2">   <a   id="x1-100026"></a></font><hr class="float">    <div class="float"  >  <font face="Verdana" size="2">  <br /> </font>     <div class="caption"  ><font face="Verdana" size="2"><span class="id">Table&#x00A0;6: </span><span   class="content">Time (in sec.) and total energy (Etot, in Joules) to calculate the matrix sign function</span></font></div><!--tex4ht:label?: x1-100026 -->     <div class="center"  > <!--l. 671-->    <p >     <div class="pic-tabular"> <font face="Verdana" size="2"> <img  src="/img/revistas/cleiej/v17n1/1a04t6.jpg" ></font></div></div>     </div><hr class="endfloat" />    </div> <!--l. 696-->    <p >   <font face="Verdana" size="2">The highest energy consumption corresponds to the Atom platform. Despite its low average power consumption, the large computational time leads to the worst results in terms of energy for this platform. Thus, for the largest problem tackled, the energy consumed by the Xeon is 4<img  src="/img/revistas/cleiej/v17n1/1a0467x.png" alt="&#x00D7; "  class="math" > lower. On the other hand, the lowest energy consumption is obtained with SECO, which requires 2<img  src="/img/revistas/cleiej/v17n1/1a0468x.png" alt="&#x00D7; "  class="math" > and 8<img  src="/img/revistas/cleiej/v17n1/1a0469x.png" alt="&#x00D7; "  class="math" > less energy than the Xeon and the Atom, respectively. This is explained by the favorable performance-power ratio of the SECO platform. <br  class="newline" />    </font>        <p><font face="Verdana" size="2"><span class="titlemark">5   </span> <a   id="x1-110005"></a>Concluding Remarks and Future Directions</font></p> <!--l. 708-->    <p ><font face="Verdana" size="2">We have investigated the trade-off between performance and power-energy of three architectures using as a benchmark the matrix sign function, a useful mathematical tool in some numerical methods. 
In particular, the evaluation includes two low-power architectures and a conventional general-purpose multicore processor. <!--l. 712--></font>    <p >   <font face="Verdana" size="2">Our experimental evaluation is divided into two stages. First, the main computational kernel in the sign function algorithm, the general matrix inversion, is evaluated. Then, the complete sign function algorithm is assessed. <!--l. 716--></font>    ]]></body>
<body><![CDATA[<p >   <font face="Verdana" size="2">The use of blocked routines for GJE matrix inversion shows that, for dense linear algebra operations that are rich in matrix-matrix products, the <span  class="ecti-1000">race-to-idle </span>strategy (i.e., execute the task as fast as possible, even at the cost of a high power dissipation) is crucial to attain both high throughput and performance-per-watt rates on general-purpose processor architectures, favoring power-hungry complex designs like the Intel Xeon processor over the Intel Atom counterpart. However, the experimentation also shows that a hybrid architecture that combines a low-power multicore processor and a limited GPU can offer competitive performance compared with that of the Intel Xeon platform, while being clearly superior from the perspective of energy efficiency. <!--l. 725--></font>    <p >   <font face="Verdana" size="2">The results obtained during the evaluation of the matrix sign function reinforce the conclusions extracted from the previous analysis. <!--l. 729--></font>    <p >   <font face="Verdana" size="2">Future research lines resulting from this experience will include: </font>      <ul class="itemize1">      <li class="itemize"><font face="Verdana" size="2">Evaluate in detail the power-energy consumption of each stage of the <span  class="eccc-1000">GJE</span>&#x00A0;method for general matrix inversion. </font>      </li>      <li class="itemize"><font face="Verdana" size="2">Analyze the impact of other optimization techniques on memory-bounded dense linear algebra operations, and the adoption of dynamic frequency-voltage scaling (DVFS) and dynamic concurrency throttling (DCT) for certain stages of the algorithm. </font>      </li>      <li class="itemize"><font face="Verdana" size="2">Extend our study to other high performance platforms, e.g. 
Kepler GPUs connected to high-end multi-core CPUs, and extend the evaluation to other low-power processors with a large number of cores, such as the ARM Cortex A15. </font>      </li>    </ul> <!--l. 747-->    <p >        <p><font face="Verdana" size="2"><a   id="x1-120005"></a>Acknowledgments</font></p> <!--l. 749-->    <p ><font face="Verdana" size="2">The researcher from UJI was supported by the CICYT project TIN2011-23283 of the Ministerio de Economía y Competitividad, FEDER, and the EU Project FP7 318793 &#8220;EXA2GREEN&#8221;. P. Ezzatti acknowledges support from Agencia Nacional de Investigación e Innovación (ANII) and Programa de Desarrollo de las Ciencias Básicas (PEDECIBA), Uruguay. The authors gratefully acknowledge Juan Pablo Silva and Germán León for their technical support with the SECO hardware platform.  <!--l. 2--></font>    <p >        <p><font face="Verdana" size="2"><a   id="x1-130005"></a>References</font></p> <!--l. 2-->    <p >         ]]></body>
<body><![CDATA[<div class="thebibliography">         <p ><font face="Verdana" size="2"><span class="biblabel">   [<a   href="#bXtop500">1</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="Xtop500"></a>&#8220;The top500 list,&#8221; 2013, available at <a  href="http://www.top500.org" class="url" >http://www.top500.org</a>. </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">   [<a   href="#bXgreen500">2</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="Xgreen500"></a>&#8220;The Green500 list,&#8221; 2013, available at <a  href="http://www.green500.org" class="url" >http://www.green500.org</a>. </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">   [<a   href="#bXexascalechallenge">3</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="Xexascalechallenge"></a>S.&#x00A0;Ashby&#x00A0;<span  class="ecti-1000">et al.</span>, &#8220;The opportunities and challenges of Exascale computing,&#8221; Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, November 2010. [Online]. Available: <a  href="http://science.energy.gov/~/media/ascr/ascac/pdf/reports/Exascale_subcommittee_report.pdf" class="url" >http://science.energy.gov/~/media/ascr/ascac/pdf/reports/Exascale_subcommittee_report.pdf</a>     </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">   [<a   href="#bXDongarraEA11">4</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XDongarraEA11"></a>J.&#x00A0;Dongarra <span  class="ecti-1000">et al.</span>, &#8220;The international ExaScale software project roadmap,&#8221; <span  class="ecti-1000">Int. J. of High Performance Computing &amp; Applications</span>, vol.&#x00A0;25, no.&#x00A0;1, pp. 
3&#8211;60, 2011.     </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">   [<a   href="#bXDuranton13">5</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XDuranton13"></a>M.&#x00A0;Duranton <span  class="ecti-1000">et al.</span>, &#8220;The HiPEAC vision for advanced computing in horizon 2020,&#8221; 2013.     </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">   [<a   href="#bXcrestaweb">6</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="Xcrestaweb"></a>&#8220;CRESTA: collaborative research into Exascale systemware, tools and applications,&#8221; <a  href="http://cresta-project.eu" class="url" >http://cresta-project.eu</a>.     </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">   [<a   href="#bXmontblancweb">7</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="Xmontblancweb"></a>&#8220;The Mont Blanc project,&#8221; <a  href="http://montblanc-project.eu" class="url" >http://montblanc-project.eu</a>.     </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">   [<a   href="#bXHigham:2002:ASN">8</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XHigham:2002:ASN"></a>N.&#x00A0;Higham, <span  class="ecti-1000">Accuracy and Stability of Numerical Algorithms</span>, 2nd&#x00A0;ed.    Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2002. </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">   [<a   href="#bXCPE:CPE2933">9</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XCPE:CPE2933"></a>P.&#x00A0;Benner, P.&#x00A0;Ezzatti, E.&#x00A0;S.  
Quintana-Ortí,  and  A.&#x00A0;Remón,  &#8220;Matrix  inversion  on  CPU-GPU     platforms with applications in control theory,&#8221; <span  class="ecti-1000">Concurrency and Computation: Practice &amp; Experience</span>,     vol.&#x00A0;25, no.&#x00A0;8, pp. 1170&#8211;1182, 2013. [Online]. Available: <a  href="http://dx.doi.org/10.1002/cpe.2933" class="url" >http://dx.doi.org/10.1002/cpe.2933</a>     </font>     </p>         ]]></body>
<body><![CDATA[<p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXRob80">10</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XRob80"></a>J.&#x00A0;Roberts, &#8220;Linear model reduction and solution of the algebraic Riccati equation by use of the sign     function,&#8221; <span  class="ecti-1000">Internat. J. Control</span>, vol.&#x00A0;32, pp. 677&#8211;687, 1980, (Reprint of Technical Report No. TR-13,     CUED/B-Control, Cambridge University, Engineering Department, 1971). </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXGVL3">11</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XGVL3"></a>G.&#x00A0;Golub and C.&#x00A0;V. Loan, <span  class="ecti-1000">Matrix Computations</span>, 3rd&#x00A0;ed.  Baltimore: The Johns Hopkins University     Press, 1996. </font>      </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXQuiQSG01">12</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XQuiQSG01"></a>E.&#x00A0;Quintana-Ortí, G.&#x00A0;Quintana-Ortí, X.&#x00A0;Sun, and R.&#x00A0;van&#x00A0;de&#x00A0;Geijn, &#8220;A note on parallel matrix     inversion,&#8221; <span  class="ecti-1000">SIAM J. Sci. Comput.</span>, vol.&#x00A0;22, pp. 1762&#8211;1771, 2001.     </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXRecipe">13</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XRecipe"></a>P.&#x00A0;Bientinesi, J.&#x00A0;A. Gunnels, M.&#x00A0;E. Myers, E.&#x00A0;S. Quintana-Ortí, and R.&#x00A0;A. van&#x00A0;de Geijn, &#8220;The     science of deriving dense linear algebra algorithms,&#8221; <span  class="ecti-1000">ACM Trans. Math. Soft.</span>, vol.&#x00A0;31, no.&#x00A0;1, pp. 1&#8211;26,     March 2005. [Online]. 
Available: <a  href="http://doi.acm.org/10.1145/1055531.1055532" class="url" >http://doi.acm.org/10.1145/1055531.1055532</a>     </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXGunnels:2001:FFL">14</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XGunnels:2001:FFL"></a>J.&#x00A0;A. Gunnels, F.&#x00A0;G. Gustavson, G.&#x00A0;M. Henry, and R.&#x00A0;A. van&#x00A0;de Geijn, &#8220;FLAME: Formal linear     algebra  methods  environment,&#8221;  <span  class="ecti-1000">ACM  Trans.  Math.  Soft.</span>,  vol.&#x00A0;27,  no.&#x00A0;4,  pp.  422&#8211;455,  Dec.  2001.     [Online]. Available: <a  href="http://doi.acm.org/10.1145/504210.504213" class="url" >http://doi.acm.org/10.1145/504210.504213</a>     </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXPetCK91">15</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XPetCK91"></a>P.&#x00A0;Petkov, N.&#x00A0;Christov, and M.&#x00A0;Konstantinov, <span  class="ecti-1000">Computational Methods for Linear Control Systems</span>.     Hertfordshire, UK: Prentice-Hall, 1991. </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXFro_et_al00">16</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XFro_et_al00"></a>A.&#x00A0;Frommer,  T.&#x00A0;Lippert,  B.&#x00A0;Medeke,  and  K.&#x00A0;Schilling,  Eds.,  <span  class="ecti-1000">Numerical  Challenges  in  Lattice</span>     <span  class="ecti-1000">Quantum  Chromodynamics</span>,   ser.   Lecture   Notes   in   Computational   Science   and   Engineering.     Berlin/Heidelberg: Springer-Verlag, 2000, vol.&#x00A0;15. 
</font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXBye87">17</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XBye87"></a>R.&#x00A0;Byers, &#8220;Solving the algebraic Riccati equation with the matrix sign function,&#8221;  <span  class="ecti-1000">Linear Algebra</span>     <span  class="ecti-1000">Appl.</span>, vol.&#x00A0;85, pp. 267&#8211;279, 1987. </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a  href="#bXimtek">18</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="Ximtek"></a><span  class="ecti-1000">Oberwolfach          model          reduction          benchmark          collection</span>,            IMTEK,     <span  class="ectt-1000">http://www.imtek.de/simulation/benchmark/</span>. </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXBenQ99">19</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XBenQ99"></a>P.&#x00A0;Benner and E.&#x00A0;Quintana-Ortí, &#8220;Solving stable generalized Lyapunov equations with the matrix     sign function,&#8221; <span  class="ecti-1000">Numer. Algorithms</span>, vol.&#x00A0;20, no.&#x00A0;1, pp. 75&#8211;100, 1999.     </font>     </p>         ]]></body>
<body><![CDATA[<p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXBenEQR09">20</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XBenEQR09"></a>P.&#x00A0;Benner, P.&#x00A0;Ezzatti, E.&#x00A0;S. Quintana-Ortí, and A.&#x00A0;Remón, &#8220;Using hybrid CPU-GPU platforms     to accelerate the computation of the matrix sign function,&#8221;  in <span  class="ecti-1000">Euro-Par 2009, Parallel Processing -</span>     <span  class="ecti-1000">Workshops</span>, ser. Lecture Notes in Computer Science, H.-X. Lin, M.&#x00A0;Alexander, M.&#x00A0;Forsell, A.&#x00A0;Knupfer,     R.&#x00A0;Prodan, L.&#x00A0;Sousa, and A.&#x00A0;Streit, Eds.   Springer-Verlag, 2009, no. 6043, pp. 132&#8211;139.     </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXLookahead">21</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="XLookahead"></a>P.&#x00A0;Strazdins, &#8220;A comparison of lookahead and algorithmic blocking techniques for parallel matrix     factorization,&#8221; Department of Computer Science, The Australian National University, Canberra 0200     ACT, Australia, Tech. Rep. TR-CS-98-07, 1998. </font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXblasgoto">22</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="Xblasgoto"></a>Texas Advanced Computing Center, <a  href="http://www.tacc.utexas.edu/tacc-software/gotoblas2" class="url" >http://www.tacc.utexas.edu/tacc-software/gotoblas2</a>.     </font>      </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXopenblas">23</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="Xopenblas"></a>&#8220;Open   BLAS,&#8221;   Lab.   of   Parallel   Software   and   Computational   Science,   ISCAS,   2013,     <a  href="http://xianyi.github.io/OpenBLAS/" class="url" >http://xianyi.github.io/OpenBLAS/</a>.     
</font>     </p>         <p ><font face="Verdana" size="2"><span class="biblabel">  [<a   href="#bXseco">24</a>]<span class="bibsp">&#x00A0;&#x00A0;&#x00A0;</span></span><a   id="Xseco"></a>&#8220;The         CUDA         development         kit         from         SECO,&#8221;         NVIDIA,         2013,     <a  href="http://www.nvidia.com/object/seco-dev-kit.html" class="url" >http://www.nvidia.com/object/seco-dev-kit.html</a>.     </font> </p>     </div>           ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="">
<source><![CDATA[The top500 list]]></source>
<year>2013</year>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="">
<source><![CDATA[The Green500 list]]></source>
<year>2013</year>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ashby]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The opportunities and challenges of Exascale computing]]></article-title>
<source><![CDATA[Summary Report of the Advanced Scientific Computing Advisory Committee: ASCAC]]></source>
<year>2010</year>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Dongarra]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The international ExaScale software project roadmap]]></article-title>
<source><![CDATA[Int. J. of High Performance Computing and Applications]]></source>
<year>2011</year>
<volume>25</volume><issue>1</issue>
<page-range>3-60</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Duranton]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
</person-group>
<source><![CDATA[The HiPEAC vision for advanced computing in horizon 2020]]></source>
<year>2013</year>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="">
<source><![CDATA[CRESTA: collaborative research into Exascale systemware, tools and applications]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="">
<source><![CDATA[The Mont Blanc project]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Higham]]></surname>
<given-names><![CDATA[N]]></given-names>
</name>
</person-group>
<source><![CDATA[Accuracy and Stability of Numerical Algorithms]]></source>
<year>2002</year>
<edition>2</edition>
<publisher-loc><![CDATA[Philadelphia ]]></publisher-loc>
<publisher-name><![CDATA[Society for Industrial and Applied Mathematics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Benner]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Ezzatti]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Remón]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Matrix inversion on CPU-GPU platforms with applications in control theory]]></article-title>
<source><![CDATA[Concurrency and Computation: Practice and Experience]]></source>
<year>2013</year>
<volume>25</volume>
<issue>8</issue>
<page-range>1170-1182</page-range></nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Roberts]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Linear model reduction and solution of the algebraic Riccati equation by use of the sign function]]></article-title>
<source><![CDATA[Internat. J. Control]]></source>
<year>1980</year>
<volume>32</volume>
<page-range>677-687</page-range></nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Golub]]></surname>
<given-names><![CDATA[G]]></given-names>
</name>
<name>
<surname><![CDATA[Van Loan]]></surname>
<given-names><![CDATA[C. F.]]></given-names>
</name>
</person-group>
<source><![CDATA[Matrix Computations]]></source>
<year>1996</year>
<edition>3</edition>
<publisher-loc><![CDATA[Baltimore ]]></publisher-loc>
<publisher-name><![CDATA[The Johns Hopkins University Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[G]]></given-names>
</name>
<name>
<surname><![CDATA[Sun]]></surname>
<given-names><![CDATA[X]]></given-names>
</name>
<name>
<surname><![CDATA[van de Geijn]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A note on parallel matrix inversion]]></article-title>
<source><![CDATA[SIAM J. Sci. Comput]]></source>
<year>2001</year>
<volume>22</volume>
<page-range>1762-1771</page-range></nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bientinesi]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Gunnels]]></surname>
<given-names><![CDATA[J. A.]]></given-names>
</name>
<name>
<surname><![CDATA[Myers]]></surname>
<given-names><![CDATA[M. E.]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E. S.]]></given-names>
</name>
<name>
<surname><![CDATA[van de Geijn]]></surname>
<given-names><![CDATA[R. A.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The science of deriving dense linear algebra algorithms]]></article-title>
<source><![CDATA[ACM Trans. Math. Softw.]]></source>
<year>2005</year>
<volume>31</volume>
<issue>1</issue>
<page-range>1-26</page-range></nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gunnels]]></surname>
<given-names><![CDATA[J. A.]]></given-names>
</name>
<name>
<surname><![CDATA[Gustavson]]></surname>
<given-names><![CDATA[F. G.]]></given-names>
</name>
<name>
<surname><![CDATA[Henry]]></surname>
<given-names><![CDATA[G. M.]]></given-names>
</name>
<name>
<surname><![CDATA[van de Geijn]]></surname>
<given-names><![CDATA[R. A.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[FLAME: Formal linear algebra methods environment]]></article-title>
<source><![CDATA[ACM Trans. Math. Softw.]]></source>
<year>2001</year>
<volume>27</volume>
<issue>4</issue>
<page-range>422-455</page-range></nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Petkov]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Christov]]></surname>
<given-names><![CDATA[N]]></given-names>
</name>
<name>
<surname><![CDATA[Konstantinov]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
</person-group>
<source><![CDATA[Computational Methods for Linear Control Systems]]></source>
<year>1991</year>
<publisher-loc><![CDATA[Hertfordshire, UK]]></publisher-loc>
<publisher-name><![CDATA[Prentice-Hall]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Frommer]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Lippert]]></surname>
<given-names><![CDATA[T]]></given-names>
</name>
<name>
<surname><![CDATA[Medeke]]></surname>
<given-names><![CDATA[B]]></given-names>
</name>
<name>
<surname><![CDATA[Schilling]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Numerical Challenges in Lattice Quantum Chromodynamics]]></article-title>
<source><![CDATA[Lecture Notes in Computational Science and Engineering]]></source>
<year>2000</year>
<volume>15</volume>
<publisher-loc><![CDATA[Berlin/Heidelberg]]></publisher-loc>
<publisher-name><![CDATA[Springer-Verlag]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Byers]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Solving the algebraic Riccati equation with the matrix sign function]]></article-title>
<source><![CDATA[Linear Algebra Appl]]></source>
<year>1987</year>
<volume>85</volume>
<page-range>267-279</page-range></nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="">
<source><![CDATA[Oberwolfach model reduction benchmark collection, IMTEK]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B19">
<label>19</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Benner]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Solving stable generalized Lyapunov equations with the matrix sign function]]></article-title>
<source><![CDATA[Numer. Algorithms]]></source>
<year>1999</year>
<volume>20</volume>
<issue>1</issue>
<page-range>75-100</page-range></nlm-citation>
</ref>
<ref id="B20">
<label>20</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Benner]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Ezzatti]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Quintana-Ortí]]></surname>
<given-names><![CDATA[E. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Remón]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Using hybrid CPU-GPU platforms to accelerate the computation of the matrix sign function]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Lin]]></surname>
<given-names><![CDATA[H. X.]]></given-names>
</name>
<name>
<surname><![CDATA[Alexander]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Forsell]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Knupfer]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Prodan]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Sousa]]></surname>
<given-names><![CDATA[L]]></given-names>
</name>
<name>
<surname><![CDATA[Streit]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
</person-group>
<source><![CDATA[Euro-Par 2009: Parallel Processing]]></source>
<year>2009</year>
<page-range>132-139</page-range><publisher-name><![CDATA[Springer-Verlag]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B21">
<label>21</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Strazdins]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization]]></article-title>
<source><![CDATA[Department of Computer Science, The Australian National University]]></source>
<year>1998</year>
</nlm-citation>
</ref>
<ref id="B22">
<label>22</label><nlm-citation citation-type="">
<source><![CDATA[Texas Advanced Computing Center]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B23">
<label>23</label><nlm-citation citation-type="">
<source><![CDATA[OpenBLAS: Lab. of Parallel Software and Computational Science]]></source>
<year>2013</year>
</nlm-citation>
</ref>
<ref id="B24">
<label>24</label><nlm-citation citation-type="">
<source><![CDATA[The CUDA development kit from SECO: NVIDIA]]></source>
<year>2013</year>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
