<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0717-5000</journal-id>
<journal-title><![CDATA[CLEI Electronic Journal]]></journal-title>
<abbrev-journal-title><![CDATA[CLEIej]]></abbrev-journal-title>
<issn>0717-5000</issn>
<publisher>
<publisher-name><![CDATA[Centro Latinoamericano de Estudios en Informática]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0717-50002012000300004</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Optimizing Latency in Beowulf Clusters]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Garabato]]></surname>
<given-names><![CDATA[Rafael]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[More]]></surname>
<given-names><![CDATA[Andrés]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Rosales]]></surname>
<given-names><![CDATA[Victor]]></given-names>
</name>
<xref ref-type="aff" rid="A03"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[,Argentina Software Design Center  ]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
</aff>
<aff id="A02">
<institution><![CDATA[,Argentina Software Design Center  ]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
</aff>
<aff id="A21">
<institution><![CDATA[,Instituto Universitario Aeronáutico  ]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
</aff>
<aff id="A03">
<institution><![CDATA[,Argentina Software Design Center  ]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>12</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>12</month>
<year>2012</year>
</pub-date>
<volume>15</volume>
<numero>3</numero>
<fpage>3</fpage>
<lpage>3</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_arttext&amp;pid=S0717-50002012000300004&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_abstract&amp;pid=S0717-50002012000300004&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_pdf&amp;pid=S0717-50002012000300004&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[This paper discusses how to decrease and stabilize network latency in a Beowulf system. Having low latency is particularly important to reduce execution time of High Performance Computing applications. Optimization opportunities are identified and analyzed over the different system components that are integrated in compute nodes, including device drivers, operating system services and kernel parameters. This work contributes a systematic approach to optimizing communication latency, together with a detailed checklist and procedure. Performance impacts are shown through benchmark figures and through mpiBLAST as a real-world application. We found that after applying different techniques the default Gigabit Ethernet latency can be reduced from about 50 &#956;s to nearly 20 &#956;s.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[Este artículo examina la manera de reducir y estabilizar la latencia de red en un sistema Beowulf. Tener una baja latencia es particularmente importante para reducir el tiempo de ejecución de aplicaciones de alto rendimiento. Diferentes oportunidades de optimización son identificadas y analizadas dentro de cada componente que se integra en un sistema, incluyendo los controladores de dispositivos, servicios del sistema operativo e incluso los parámetros del núcleo del mismo. Este trabajo aporta un enfoque sistemático para optimizar la latencia de la comunicación, a través de un procedimiento y una lista detallada de pasos a seguir. Los impactos en el sistema se muestran a través de valores de referencia en pruebas sintéticas de rendimiento y de mpiBLAST como una aplicación del mundo real. Se encontró que después de aplicar diferentes técnicas la latencia por defecto de Gigabit Ethernet puede reducirse de 50 a casi 20 microsegundos.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Beowulf]]></kwd>
<kwd lng="en"><![CDATA[Cluster]]></kwd>
<kwd lng="en"><![CDATA[Ethernet]]></kwd>
<kwd lng="en"><![CDATA[Latency]]></kwd>
<kwd lng="es"><![CDATA[Beowulf]]></kwd>
<kwd lng="es"><![CDATA[Cluster]]></kwd>
<kwd lng="es"><![CDATA[Ethernet]]></kwd>
<kwd lng="es"><![CDATA[Latencia]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <div class="maketitle">    <b><font face="Verdana" size="4">Optimizing Latency in Beowulf Clusters</font></b>    <div class="author">    <font face="Verdana" size="2"> <span class="cmbx-12">Rafael Garabato</span>     <br>           <span class="cmr-12">Argentina Software Design Center (ASDC - Intel C&oacute;rdoba)</span>     <br>  <span class="cmti-12"><a href="mailto:rafael.f.garabato@intel.com">rafael.f.garabato@intel.com </a></span><br class="and">  <span class="cmbx-12">Andr&eacute;s More</span>     <br>           <span class="cmr-12">Argentina Software Design Center (ASDC - Intel C&oacute;rdoba)</span>     <br>             <span class="cmti-12"><a href="mailto:andres.more@intel.com"> andres.more@intel.com</a></span>     <br>                   <span class="cmr-12">Instituto Universitario Aeron&aacute;utico (IUA)</span>     <br>                       <span class="cmti-12"><a href="mailto:amore@iua.edu.ar">amore@iua.edu.ar</a> </span><br class="and">  <span class="cmbx-12">Victor Rosales</span>     <br>           <span class="cmr-12">Argentina Software Design Center (ASDC - Intel C&oacute;rdoba)</span>     <br>          <span class="cmti-12"><a href="mailto:victor.h.rosales@intel.com"> victor.h.rosales@intel.com</a> </span>   </font></div>  <font face="Verdana" size="2">      ]]></body>
<body><![CDATA[<br>   </font>       <div class="date"></div>      </div>           <div class="abstract">     <div class="center"> <font face="Verdana" size="2">     <br>  </font>      <p> </p>      <div class="minipage">     <div class="center"> <font face="Verdana" size="2">     <br>  </font>      <p> </p>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span class="cmbx-10">Abstract</span></font></p>  </div>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">This paper discusses how to decrease and stabilize network latency in a Beowulf system. Having low latency is particularly important to reduce execution time of High Performance Computing applications. Optimization opportunities are identified and analyzed over the different system components that are integrated in compute nodes, including device drivers, operating system services and kernel parameters.&nbsp;</font></p>      <p><font face="Verdana" size="2">This work contributes a systematic approach to optimizing communication latency, together with a detailed checklist and procedure. Performance impacts are shown through benchmark figures and through mpiBLAST as a real-world application. We found that after applying different techniques the default Gigabit Ethernet latency can be reduced from about 50 <img src="/img/revistas/cleiej/v15n3/3a040x.png" alt="&mu;  " class="math">s to nearly 20 <img src="/img/revistas/cleiej/v15n3/3a041x.png" alt="&mu;  " class="math">s.&nbsp;</font></p>      <p><font face="Verdana" size="2">Spanish abstract&nbsp;</font></p>      <p><font face="Verdana" size="2">Este art&iacute;culo examina la manera de reducir y estabilizar la latencia de red en un sistema Beowulf. Tener una baja latencia es particularmente importante para reducir el tiempo de ejecuci&oacute;n de aplicaciones de alto rendimiento. Diferentes oportunidades de optimizaci&oacute;n son identificadas y analizadas dentro de cada componente que se integra en un sistema, incluyendo los controladores de dispositivos, servicios del sistema operativo e incluso los par&aacute;metros del n&uacute;cleo del mismo. Este trabajo aporta un enfoque sistem&aacute;tico para optimizar la latencia de la comunicaci&oacute;n, a trav&eacute;s de un procedimiento y una lista detallada de pasos a seguir. 
Los impactos en el sistema se muestran a trav&eacute;s de valores de referencia en pruebas sint&eacute;ticas de rendimiento y de mpiBLAST como una aplicaci&oacute;n del mundo real. Se encontr&oacute; que despu&eacute;s de aplicar diferentes t&eacute;cnicas la latencia por defecto de Gigabit Ethernet puede reducirse de 50 a casi 20 microsegundos. </font> </p>  </div>  </div>   </div>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2"><span class="cmbx-10">Keywords:  </span>Beowulf, Cluster, Ethernet, Latency&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Spanish Keywords: Beowulf, Cluster, Ethernet, Latencia&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Received: 2012-06-10 Revised 2012-10-01 Accepted 2012-10-04 </font>     </p>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span class="titlemark">1   </span> <a id="x1-10001"></a>Introduction</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">1.1   </span> <a id="x1-20001.1"></a>Beowulf Clusters</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Instead of purchasing an expensive, high-end symmetric multiprocessing (SMP) system, most scientists today choose to interconnect multiple regular-size commodity systems as a means to scale computing performance and solve bigger problems without heavy investments <span class="cite">(<a name="1."></a><a href="#1..">1</a>)</span> <span class="cite">(<a name="2."></a><a href="#2..">2</a>)</span> <span class="cite">(<a name="3."></a><a href="#3..">3</a>)</span>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The key driving factor is cost, hence off-the-shelf hardware components are used together with open source software to build those systems. In the specific case of academia, open source software makes it possible to modify the software stack, therefore enabling innovation and broadening the adoption of these systems.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Clusters are nearly ubiquitous in the Top500 ranking of the most powerful computer systems worldwide: clustered systems represent more than 80% of the list (Figure <a href="#x1-2001r1">1</a>).&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p><font face="Verdana" size="2"><a id="x1-2001r1"> <img src="/img/revistas/cleiej/v15n3/3a04f1.jpg"></a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;1: </span><span class="content">Top500 System Share by Architecture (as of June 2012)</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">Because the cheapest network fabrics are those shipped on-board by system manufacturers, Ethernet is the preferred communication network in Beowulf clusters. At the moment Gigabit Ethernet is integrated on most hardware. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">1.2   </span> <a id="x1-30001.2"></a>Latency</font></p>   <font face="Verdana" size="2">       <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2">Latency can be measured at different levels. In particular, communication latency is a performance metric representing the time it takes for information to flow from one compute node to another. It is therefore important not only to understand how to measure the latency of the cluster but also how this latency affects the performance of High Performance Computing applications <span class="cite">(<a name="4."></a><a href="#4..">4</a>)</span>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">In the case of latency-sensitive applications, messaging needs to be highly optimized and may even be executed over special-purpose hardware. For instance, latency directly affects the synchronization speed of concurrent jobs in distributed applications, impacting their total execution time.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">1.3   </span> <a id="x1-40001.3"></a>Related Work</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">There is extensive work on how to reduce communication latency <span class="cite">(<a name="5."></a><a href="#5..">5</a>)</span> <span class="cite">(<a name="6."></a><a href="#6..">6</a>)</span>. However, this work does not focus on a single component; it contributes a system-wide point of view.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The top supercomputers in the world report latencies that commodity systems cannot achieve (Table <a href="#x1-4001r1">1</a>). They use specially built network hardware, trading higher cost for lower latency. </font>    </p>      <div class="table">  <font face="Verdana" size="2">      <br>  </font>      <p>   </p>  <hr class="float">     ]]></body>
<body><![CDATA[<div class="float">        <div class="caption"><font face="Verdana" size="2"><span class="id">Table&nbsp;1: </span><span class="content">Communication Latency at the HPCC ranking</span></font></div>  <font face="Verdana" size="2">&nbsp;<a id="x1-4001r1"><img src="/img/revistas/cleiej/v15n3/3a04t1.png"></a> </font>     </div>  <hr class="endfloat">    </div>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">High performance network technology (like InfiniBand <span class="cite">(<a name="7."></a><a href="#7..">7</a>)</span>) is used in cases where state-of-the-art Ethernet cannot meet the required latency (see reference values in Table <a href="#x1-4002r2">2</a>). Some proprietary network fabrics are built together with supercomputers when they are designed from scratch. </font>    </p>      <div class="table">  <font face="Verdana" size="2">      <br>  </font>      <p>   </p>  <hr class="float">     <div class="float">        <div class="caption"><font face="Verdana" size="2"><span class="id">Table&nbsp;2: </span><span class="content">System Level Ethernet Latency</span></font></div>  <font face="Verdana" size="2">&nbsp;<a id="x1-4002r2"><img src="/img/revistas/cleiej/v15n3/3a04t2.png"></a> </font>     </div>  <hr class="endfloat">    </div>           <p><font face="Verdana" size="2"><span class="titlemark">1.4   </span> <a id="x1-50001.4"></a>Problem Statement</font></p>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p><font face="Verdana" size="2">The time it takes to transmit a message over a network can be calculated as the time required to assemble and disassemble the message plus the time needed to transmit its payload. Equation <a href="#x1-5001r1">1</a> shows the relation between these startup and throughput components for the transmission of <img src="/img/revistas/cleiej/v15n3/3a042x.png" alt="n  " class="math"> bytes. </font>    </p>  <table class="equation">    <tbody>      <tr>        <td>                  <center class="math-display">       <font face="Verdana" size="2">       <img src="/img/revistas/cleiej/v15n3/3a043x.png" alt="t(n) = &alpha; + &beta; &times; n" class="math-display"><a id="x1-5001r1"></a></font></center>        </td>        <td class="equation-label"><font face="Verdana" size="2">(1)</font></td>      </tr>       </tbody> </table>   <font face="Verdana" size="2">       <br>  </font>      <p> </p>      <p>   <font face="Verdana" size="2">In the hypothetical case where <span class="cmti-10">zero bytes </span>are transmitted, we get the minimum possible latency of the system (Equation <a href="#x1-5002r2">2</a>). The value of <img src="/img/revistas/cleiej/v15n3/3a044x.png" alt="&alpha;  " class="math"> is also known as the theoretical or zero-bytes latency. 
</font>    </p>  <table class="equation">    <tbody>      <tr>        <td>                  <center class="math-display">       <font face="Verdana" size="2">       <img src="/img/revistas/cleiej/v15n3/3a045x.png" alt="t(0) = &alpha;" class="math-display"><a id="x1-5002r2"></a></font></center>        </td>        <td class="equation-label"><font face="Verdana" size="2">(2)</font></td>      </tr>       </tbody> </table>   <font face="Verdana" size="2">       <br>  </font>      <p> </p>      <p>   <font face="Verdana" size="2">It is worth noticing that <img src="/img/revistas/cleiej/v15n3/3a046x.png" alt="&alpha;  " class="math"> is not the only term in the equation: <img src="/img/revistas/cleiej/v15n3/3a047x.png" alt="1&#8725;&beta;  " class="math"> is the network bandwidth, the maximum transfer rate that can be achieved. <img src="/img/revistas/cleiej/v15n3/3a048x.png" alt="&beta;  " class="math"> is therefore the component that makes the overall time grow with the message size.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">2   </span> <a id="x1-60002"></a>Benchmarking Latency</font></p>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p><font face="Verdana" size="2">There are different benchmarks used to measure communication latency.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">2.1   </span> <a id="x1-70002.1"></a>Intel MPI Benchmarks</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">The Intel MPI Benchmarks (IMB) are a set of timing utilities targeting the most important Message Passing Interface (MPI)<a name="8."></a> <span class="cite">(<a href="#8..">8</a>)</span> functions. The suite covers the different versions of the MPI standard, and the most used utility is Ping Pong.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">IMB Ping Pong performs a single message transfer exercise between two active MPI processes (Figure <a href="#x1-7001r2">2</a>). The exchange can be run multiple times using varying message lengths; timings are averaged to reduce measurement errors.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">    <font face="Verdana" size="2">        <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><a id="x1-7001r2"><img src="/img/revistas/cleiej/v15n3/3a04f2.png"></a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;2: </span><span class="content">IMB Ping Pong Communication</span></font></div>  <font face="Verdana" size="2">      <br> &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">Using only basic MPI routines, a message is sent (<span class="cmtt-10">MPI_SEND</span>) from a host system and received (<span class="cmtt-10">MPI_RECV</span>) on a remote one (Figure <a href="#x1-7002r3">3</a>). The reported time, in <img src="/img/revistas/cleiej/v15n3/3a049x.png" alt="&mu;  " class="math">s, is half the time it takes a message of <span class="cmti-10">X </span>bytes (<span class="cmtt-10">MPI_BYTE</span>) to complete a round trip.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">    <font face="Verdana" size="2">        <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><a id="x1-7002r3"><img src="/img/revistas/cleiej/v15n3/3a04f3.jpg"></a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;3: </span><span class="content">IMB Ping Pong Benchmark</span></font></div>  <font face="Verdana" size="2">      <br> &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">As described by the time formula in Equation <a href="#x1-5001r1">1</a>, different transmission times are obtained depending on the message size. To get the minimum latency an empty message is used. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">2.2   </span> <a id="x1-80002.2"></a>Other Benchmarks</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">There are other relevant HPC benchmarks that are usually used to exercise clusters: HPL and HPCC. These exercise the system at the application level, combining the performance of all components toward a common goal.&nbsp;</font></p>      ]]></body>
<body><![CDATA[<p>   <font face="Verdana" size="2">It is worth mentioning that there are other methods that work at a lower level of abstraction, for instance using Netperf <a name="9."></a><span class="cite">(<a href="#9..">9</a>)</span> or following RFC 2544 <a name="10."></a><span class="cite">(<a href="#10..">10</a>)</span> techniques. However, these last two measure latency at the network protocol and device level, respectively.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">2.2.1   </span> <a id="x1-90002.2.1"></a>High Performance Linpack</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">High Performance Linpack is a portable benchmark for distributed-memory systems that solves a dense system of linear equations <span class="cite">(<a name="11."></a><a href="#11..">11</a>)</span>. It provides a testing and timing tool to quantify cluster performance. It requires MPI and BLAS supporting libraries.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">2.2.2   </span> <a id="x1-100002.2.2"></a>High Performance Computing Challenge Benchmarks</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">The HPC Challenge benchmark suite <span class="cite">(<a name="12."></a><a href="#12..">12</a>)</span> packages 7 benchmarks:&nbsp;</font></p>      <p>      </p>  <dl class="description">    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmti-10">HPL: </span>measures the floating point rate of execution for solving a system of linear equations.   </font>      </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmti-10">DGEMM: </span>measures the floating point rate of execution of double precision real matrix-matrix multiplication. 
</font>      </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmti-10">STREAM: </span>measures sustainable memory bandwidth.   </font>      </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmti-10">PTRANS: </span>computes a distributed parallel matrix transpose   </font>      </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmti-10">RandomAccess: </span>measures random updates of shared distributed memory   </font>      </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmti-10">FFT: </span>double precision complex one-dimensional discrete Fourier transform.   </font>      </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmti-10">b_eff: </span>measures both communication latency and bandwidth</font></dd>  </dl>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p>   <font face="Verdana" size="2">HPL, DGEMM, STREAM and FFT run in parallel on all nodes, so they can be used to check whether cluster nodes are performing similarly. PTRANS, RandomAccess and b_eff exercise the system cluster wide. It is expected that latency optimizations impact their results differently.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">3   </span> <a id="x1-110003"></a>Methods</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">In a simplified system view of a cluster, multiple compute nodes together run the application. An application uses software such as libraries that interface with the operating system to reach hardware resources through device drivers. This work analyzes the following components:&nbsp;</font></p>      <p>      </p>  <dl class="description">    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmbx-10">Ethernet Drivers: </span>interrupt moderation capabilities   </font>       </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmbx-10">System Services: </span>interrupt balancing and packet-based firewall   </font>      </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmbx-10">Kernel Settings: </span>low latency extensions on network protocols</font></dd>  </dl>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">Further work to optimize performance is always possible; only the most relevant optimizations were considered, based on more than 5 years of experience engineering volume HPC solutions.&nbsp;</font></p>      <p>    </p>      ]]></body>
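<body><![CDATA[<p>   <font face="Verdana" size="2">Before changing anything, the current state of the components listed above can be reviewed on each node. The following is a minimal sketch, not part of the original procedure, assuming a Red Hat compatible node and a network interface named <span class="cmti-10">eth0</span>; the interface name and the presence of each service are assumptions that vary between systems. </font>     </p>      <div class="verbatim"> <font face="Verdana" size="2">#&nbsp;ethtool&nbsp;-c&nbsp;eth0&nbsp;&nbsp;&nbsp;#&nbsp;eth0&nbsp;is&nbsp;an&nbsp;assumed&nbsp;interface&nbsp;name &nbsp;    <br>  #&nbsp;service&nbsp;irqbalance&nbsp;status &nbsp;    <br>  #&nbsp;service&nbsp;iptables&nbsp;status &nbsp;    <br>  $&nbsp;sysctl&nbsp;net.ipv4.tcp_low_latency </font> </div>      <p>   <font face="Verdana" size="2">The first command prints the interrupt coalescing settings of the driver, the next two report the state of the services, and the last one prints the kernel parameter. Each component is discussed in the following subsections.&nbsp;</font></p>      ]]></body>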
<body><![CDATA[<p><font face="Verdana" size="2"><span class="titlemark">3.1   </span> <a id="x1-120003.1"></a>Drivers</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Like any other piece of software, device drivers implement algorithms which, depending on different factors, may introduce latency. Drivers may also expose hardware functionality or configuration options that change the device latency to better support the Beowulf usage scenario.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">3.1.1   </span> <a id="x1-130003.1.1"></a>Interrupt Moderation</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Interrupt moderation is a technique to reduce CPU interrupts by holding them back and servicing several at once <span class="cite">(<a name="13."></a><a href="#13..">13</a>)</span>. Although it makes sense for general purpose systems, it introduces extra latency, so Ethernet drivers should not moderate interrupts when running in HPC clusters.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">To turn off Interrupt Moderation on Intel network drivers, append the following line to the module configuration file on each node of the cluster and reload the network driver kernel module. Refer to the documentation <span class="cite">(<a name="14."></a><a href="#14..">14</a>)</span> for more details. </font>     </p>      <div class="verbatim" id="verbatim-1"> <font face="Verdana" size="2">#&nbsp;echo&nbsp;"options&nbsp;e1000e&nbsp;InterruptThrottleRate=0"&nbsp;&gt;&gt;&nbsp;/etc/modprobe.conf &nbsp;    <br>  #&nbsp;modprobe&nbsp;-r&nbsp;e1000e&nbsp;&amp;&amp;&nbsp;modprobe&nbsp;e1000e </font> </div>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p> </p>      <p>   <font face="Verdana" size="2">For maintenance reasons some Linux distributions do not include the configuration capability detailed above. In those cases, the following command can be used to get the same result by setting the receive interrupt delay to zero. </font>     </p>      <div class="verbatim" id="verbatim-2"> <font face="Verdana" size="2">#&nbsp;ethtool&nbsp;-C&nbsp;eth0&nbsp;rx-usecs&nbsp;0 </font> </div>   <font face="Verdana" size="2">       <br>  </font>      <p> </p>      <p>   <font face="Verdana" size="2">There is no portable approach to query kernel module configurations across all Linux kernel versions, so configuration files should be used as a reference.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">3.2   </span> <a id="x1-140003.2"></a>Services</font></p>   <font face="Verdana" size="2">       <br>  </font>      ]]></body>
<body><![CDATA[<p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">3.2.1   </span> <a id="x1-150003.2.1"></a>Interrupt Balancing</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Some system services may directly affect network latency. For instance, the job of <span class="cmti-10">irqbalance </span>is to distribute interrupt requests (IRQs) among processors (and even among the cores of each processor) on a <span class="cmti-10">Symmetric Multi-Processing </span>(SMP) system. Migrating IRQs from one CPU to another is a time consuming task that, although it balances the load, may affect overall latency.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The main objective of such a service is to balance power savings against optimal performance. It dynamically distributes the workload evenly across CPUs and their computing cores. The job is done by properly configuring the IO-APIC chipset that maps interrupts to cores.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">An ideal setup will assign all interrupts to the cores of the same CPU, also assigning storage and network interrupts to cores near the same cache domain. However, this implies processing and routing the interrupts before servicing them, which adds a short delay to their processing.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Turning off the <span class="cmti-10">irqbalance </span>service will therefore help to decrease network latency. In a Red Hat compatible system this can be done as follows: </font>     </p>      <div class="verbatim" id="verbatim-3"> <font face="Verdana" size="2">#&nbsp;service&nbsp;irqbalance&nbsp;stop &nbsp;    <br>  #&nbsp;chkconfig&nbsp;irqbalance&nbsp;off &nbsp;    <br>  $&nbsp;service&nbsp;irqbalance&nbsp;status </font> </div>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p> </p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">3.2.2   </span> <a id="x1-160003.2.2"></a>Firewall</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">As compute nodes are generally isolated on a private network reachable only through the head node, the firewall may not even be required. The system firewall needs to review each packet received before continuing with the execution. This overhead increases the latency as incoming and outgoing packet fields are inspected during communication.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Linux-based systems have a firewall in the kernel that can be controlled through a user-space application called <span class="cmti-10">iptables</span>. This application runs in the system as a service, therefore the system&rsquo;s service mechanisms have to be used to stop it. </font>     </p>      <div class="verbatim" id="verbatim-4"> <font face="Verdana" size="2">#&nbsp;service&nbsp;iptables&nbsp;stop &nbsp;    <br>  #&nbsp;chkconfig&nbsp;iptables&nbsp;off &nbsp;    <br>  $&nbsp;lsmod&nbsp;|&nbsp;grep&nbsp;ip_tables </font> </div>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p> </p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">3.3   </span> <a id="x1-170003.3"></a>Kernel Parameters</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">The Linux Transmission Control Protocol (TCP) stack makes default decisions that favor high throughput over low latency. The Linux TCP implementation uses several packet queues to handle incoming data; the PreQueue can be disabled so that network packets go directly into the Receive queue. On Red Hat compatible systems this can be done with the command: </font>     </p>      <div class="verbatim" id="verbatim-5"> <font face="Verdana" size="2">#&nbsp;echo&nbsp;1&nbsp;&gt;&nbsp;/proc/sys/net/ipv4/tcp_low_latency &nbsp;    <br>  $&nbsp;sysctl&nbsp;-a&nbsp;|&nbsp;grep&nbsp;tcp_low_latency </font> </div>   <font face="Verdana" size="2">       <br>  </font>      <p> </p>      ]]></body>
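<body><![CDATA[<p><font face="Verdana" size="2">The two commands above address the same kernel parameter: a <span class="cmtt-10">sysctl</span> key is the path below <span class="cmtt-10">/proc/sys</span> with slashes replaced by dots. The short sketch below makes that mapping explicit; the function name <span class="cmtt-10">sysctl_key_to_path</span> is an illustrative assumption, and the function only builds a string without touching the system.</font></p>

```shell
#!/bin/sh
# Hypothetical helper (not from the paper): translate a dotted sysctl key
# into the corresponding file name below /proc/sys.
sysctl_key_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}
```

<p><font face="Verdana" size="2">For instance, <span class="cmtt-10">cat "$(sysctl_key_to_path net.ipv4.tcp_low_latency)"</span> shows the current value, and <span class="cmtt-10">sysctl -w net.ipv4.tcp_low_latency=0</span> restores the default if the setting turns out not to help.</font></p>]]></body>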
<body><![CDATA[<p>   <font face="Verdana" size="2">Other parameters can also be analyzed <span class="cite">(<a name="15."></a><a href="#15..">15</a>)</span>, but their impact is too application-specific to be included in a general optimization study.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">4   </span> <a id="x1-180004"></a>Optimization Impact</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">4.1   </span> <a id="x1-190004.1"></a>IMB Ping Pong</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Using IMB Ping Pong as the workload, the following results (Figure <a href="#x1-19001r4">4</a>) reflect how the different optimizations impact communication latency. The actual averages and deviations are shown in Table <a href="#x1-19002r3">3</a>.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">      ]]></body>
<body><![CDATA[<br>  </font>      <p> <font face="Verdana" size="2"> <a id="x1-19001r4"> <img src="/img/revistas/cleiej/v15n3/3a04f4.jpg"></a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;4: </span><span class="content">Comparison of Optimizations</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure">         <div class="table">  <font face="Verdana" size="2">      <br>  </font>      <p>   </p>  <hr class="float">     <div class="float">        ]]></body>
<body><![CDATA[<div class="caption"><font face="Verdana" size="2"><span class="id">Table&nbsp;3: </span><span class="content">IMB Ping Pong Optimization Results</span></font></div>  <font face="Verdana" size="2">&nbsp;<a id="x1-19002r3"><img src="/img/revistas/cleiej/v15n3/3a04t3.png"></a> </font>     </div>  <hr class="endfloat">    </div>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">The principal cause of overhead in communication latency is therefore IRQ moderation. Another important contributor is the packet firewall service. We found that the low latency extension for TCP actually increased the latency reported by IMB Ping Pong slightly. The impact of the IRQ balance service is minimal.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The impact of each optimization varies and, not surprisingly, the gains are not cumulative when all of them are combined. At a glance, it is possible to reduce the average latency by nearly 54%, while nearly halving the deviation of the results. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">4.2   </span> <a id="x1-200004.2"></a>High Performance Linpack</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">A cluster-wide HPL run over MPI reported the results shown in Table <a href="#x1-20001r4">4</a>. The problem size was set to <span class="cmtt-10">Ns:37326 NBs:168 Ps:15 Qs:16 </span>for a quick but still representative execution with a controlled deviation. </font>    </p>      <div class="table">  <font face="Verdana" size="2">      <br>  </font>      <p>   </p>  <hr class="float">     ]]></body>
<body><![CDATA[<div class="float">        <div class="caption"><font face="Verdana" size="2"><span class="id">Table&nbsp;4: </span><span class="content">HPL Results</span></font></div>  <font face="Verdana" size="2">  <a id="x1-20001r4">&nbsp;<img src="/img/revistas/cleiej/v15n3/3a04t4.png"></a> </font>     </div>  <hr class="endfloat">    </div>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">As the results show, the synchronization cycle performed by the algorithm relies heavily on low latency. The linear system is partitioned into smaller problem blocks, which are distributed over a grid of processes that may reside on different compute nodes. The matrix pieces are distributed using a binary tree among compute nodes, with several rolling phases between them. The required time was reduced by 56%, and the measured performance increased almost 2.5 times. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">4.3   </span> <a id="x1-210004.3"></a>HPCC</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Figure <a href="#x1-21001r5">5</a> and Table <a href="#x1-21002r5">5</a> show the HPCC results obtained with a default and an optimized Beowulf cluster. As the results show, the overall execution time is directly affected, with a 29% reduction. The performance figures differ across the packaged benchmarks, as they measure system characteristics that are affected by latency in different ways.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">   <font face="Verdana" size="2">       <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><a id="x1-21001r5"> <img src="/img/revistas/cleiej/v15n3/3a04f5.jpg"> </a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;5: </span><span class="content">HPCC Performance Results (higher is better)</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure">         <div class="table">  <font face="Verdana" size="2">      <br>  </font>      <p>   </p>  <hr class="float">     <div class="float">        <div class="caption"><font face="Verdana" size="2"><span class="id">Table&nbsp;5: </span><span class="content">HPCC Timing Results</span></font></div>  <font face="Verdana" size="2">&nbsp;<a id="x1-21002r5"><img src="/img/revistas/cleiej/v15n3/3a04t5.png"></a> </font>     </div>  <hr class="endfloat">    </div>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p>   <font face="Verdana" size="2">Local benchmarks like STREAM, DGEMM and HPL are not greatly affected, as they do not need communication between compute nodes. However, the latency, bandwidth and PTRANS benchmarks are impacted as expected, due to their dependency on communication. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">4.4   </span> <a id="x1-220004.4"></a>mpiBLAST</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">To double-check whether any of the optimizations has hidden side effects, and to measure the real impact on the execution of a full-fledged HPC application, a real-world code was exercised. mpiBLAST <span class="cite">(<a name="16."></a><a href="#16..">16</a>)</span> is an open source tool that implements DNA-related algorithms to find regions of similarity between biological sequences.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Table <a href="#x1-22001r6">6</a> shows the figures averaged over multiple runs, obtained with a default and an optimized system on a fixed mpiBLAST workload. The time required to process the problem was reduced by 11%, compared with the 42% improvement previously measured by IMB Ping Pong. </font>    </p>      <div class="table">  <font face="Verdana" size="2">      <br>  </font>      <p></p>  <hr class="float">     <div class="float">        ]]></body>
<body><![CDATA[<div class="caption"><font face="Verdana" size="2"><span class="id">Table&nbsp;6: </span><span class="content">mpiBLAST Results</span></font></div>  <font face="Verdana" size="2">&nbsp;<a id="x1-22001r6"><img src="/img/revistas/cleiej/v15n3/3a04t6.png"></a> </font>     </div>  <hr class="endfloat">    </div>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">This shows that the results of a synthetic benchmark like IMB Ping Pong cannot be used directly to extrapolate figures; they are virtually the upper limit of what an actual application can achieve. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">4.5   </span> <a id="x1-230004.5"></a>Testbed</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">The experiments in this work were run on 32 nodes with the following bill of materials (Table <a href="#x1-23001r7">7</a>). </font>    </p>      <div class="table">  <font face="Verdana" size="2">      <br>  </font>      <p></p>  <hr class="float">     <div class="float">        ]]></body>
<body><![CDATA[<div class="caption"><font face="Verdana" size="2"><span class="id">Table&nbsp;7: </span><span class="content">Compute Node Hardware and Software</span></font></div>  <font face="Verdana" size="2">&nbsp;<a id="x1-23001r7"><img src="/img/revistas/cleiej/v15n3/3a04t7.png"></a> </font>     </div>  <hr class="endfloat">    </div>           <p><font face="Verdana" size="2"><span class="titlemark">5   </span> <a id="x1-240005"></a>Optimization Procedure</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Figure <a href="#x1-24001r6">6</a> summarizes the complete optimization procedure. It is basically a sequence of steps that check and, if required, reconfigure the Ethernet drivers and system services. Enabling TCP extensions for low latency is not included due to their negative consequences.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2"><a id="x1-24001r6"> <img src="/img/revistas/cleiej/v15n3/3a04f6.jpg"></a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;6: </span><span class="content">Latency Optimization Procedure</span></font></div>  <font face="Verdana" size="2">&nbsp;    ]]></body>
<body><![CDATA[<br>  </font>      <p>   </p>  </div>  <hr class="endfigure">         <p><font face="Verdana" size="2"><span class="titlemark">5.1   </span> <a id="x1-250005.1"></a>Detailed Steps</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">The steps below state their purpose and give an example of the actual command to execute on Red Hat compatible systems. The pdsh (<a href="http://sourceforge.net/projects/pdsh" class="url"><span class="cmtt-10">http://sourceforge.net/projects/pdsh</span></a>) parallel shell is used to reach all compute nodes at once.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The questions under (1) help to size the work required to optimize the driver configuration so that network devices are properly supported. The questions under (2) help to understand what is needed to properly configure system services.&nbsp;</font></p>      <p>      </p>  <ol class="enumerate1">        <li class="enumerate" id="x1-25002x1"><font face="Verdana" size="2">Interrupt Moderation on Ethernet Driver&nbsp;      </font>              <p>          </p>           <ol class="enumerate2">            <li class="enumerate" id="x1-25004x1"><font face="Verdana" size="2">Is the installed driver version the latest available?          </font>                                 <div class="verbatim" id="verbatim-6">          <font face="Verdana" size="2">$&nbsp;/sbin/modinfo&nbsp;-F&nbsp;version&nbsp;e1000e          &nbsp;    <br>  1.2.20-NAPI </font>         </div>            <font face="Verdana" size="2">                ]]></body>
<body><![CDATA[<br>                    </font>                       <p>                    </p>        </li>        <li class="enumerate" id="x1-25006x2"><font face="Verdana" size="2">Is the same version installed across all compute nodes?       </font>                                 <div class="verbatim" id="verbatim-7">          <font face="Verdana" size="2">$&nbsp;pdsh&nbsp;-N&nbsp;-a&nbsp;'/sbin/modinfo&nbsp;-F&nbsp;version&nbsp;e1000e'&nbsp;|&nbsp;uniq          &nbsp;    <br>  1.2.20-NAPI </font>         </div>            <font face="Verdana" size="2">                <br>                   </font>                       <p>                    </p>        </li>        <li class="enumerate" id="x1-25008x3"><font face="Verdana" size="2">Are interrupt moderation settings in HPC mode?       </font>                                 <div class="verbatim" id="verbatim-8">          <font face="Verdana" size="2">#&nbsp;pdsh&nbsp;-N&nbsp;-a&nbsp;'grep&nbsp;"e1000e"&nbsp;/etc/modprobe.conf'&nbsp;|&nbsp;uniq          &nbsp;    <br>  options&nbsp;e1000e&nbsp;InterruptThrottleRate=0 </font>         </div>            <font face="Verdana" size="2">                <br>                   </font>                       <p>          </p>        </li>               ]]></body>
<body><![CDATA[</ol>        </li>        <li class="enumerate" id="x1-25010x2"><font face="Verdana" size="2">System Services&nbsp;      </font>              <p>          </p>           <ol class="enumerate2">            <li class="enumerate" id="x1-25012x1"><font face="Verdana" size="2">Is the firewall disabled?          </font>                                 <div class="verbatim" id="verbatim-9">          <font face="Verdana" size="2">#&nbsp;pdsh&nbsp;-N&nbsp;-a&nbsp;'service&nbsp;iptables&nbsp;status'&nbsp;|&nbsp;uniq          &nbsp;    <br>  Firewall&nbsp;is&nbsp;stopped. </font>         </div>            <font face="Verdana" size="2">                <br>                    </font>                       <p>                    </p>        </li>        <li class="enumerate" id="x1-25014x2"><font face="Verdana" size="2">Is the firewall disabled at startup?       </font>                                 <div class="verbatim" id="verbatim-10">          <font face="Verdana" size="2">#&nbsp;pdsh&nbsp;-N&nbsp;-a&nbsp;'chkconfig&nbsp;iptables&nbsp;--list'&nbsp;|&nbsp;uniq          &nbsp;    <br>  iptables&nbsp;0:off&nbsp;1:off&nbsp;2:off&nbsp;3:off&nbsp;4:off&nbsp;5:off&nbsp;6:off          </font>         </div>            <font face="Verdana" size="2">                <br>                   </font>                       <p>                    </p>        </li>        <li class="enumerate" id="x1-25016x3"><font face="Verdana" size="2">Was the system rebooted after stopping firewall services?       </font>                                 ]]></body>
<body><![CDATA[<div class="verbatim" id="verbatim-11">          <font face="Verdana" size="2">$&nbsp;uptime          &nbsp;    <br>  &nbsp;15:42:29&nbsp;up&nbsp;18:49,&nbsp;4&nbsp;users,&nbsp;load&nbsp;average:&nbsp;0.09,&nbsp;0.08,&nbsp;0.09          </font>         </div>            <font face="Verdana" size="2">                <br>                   </font>                       <p>                    </p>        </li>        <li class="enumerate" id="x1-25018x4"><font face="Verdana" size="2">Is the IRQ balancing service disabled?       </font>                                 <div class="verbatim" id="verbatim-12">          <font face="Verdana" size="2">#&nbsp;pdsh&nbsp;-N&nbsp;-a&nbsp;'service&nbsp;irqbalance&nbsp;status'&nbsp;|&nbsp;uniq          &nbsp;    <br>  irqbalance&nbsp;is&nbsp;stopped </font>         </div>            <font face="Verdana" size="2">                <br>                   </font>                       <p>                    </p>        </li>        <li class="enumerate" id="x1-25020x5"><font face="Verdana" size="2">Is the IRQ balancing daemon disabled at startup?       </font>                                 <div class="verbatim" id="verbatim-13">          <font face="Verdana" size="2">#&nbsp;pdsh&nbsp;-N&nbsp;-a&nbsp;'chkconfig&nbsp;irqbalance&nbsp;--list'&nbsp;|&nbsp;uniq          &nbsp;    <br>  irqbalance&nbsp;0:off&nbsp;1:off&nbsp;2:off&nbsp;3:off&nbsp;4:off&nbsp;5:off&nbsp;6:off          </font>         </div>            <font face="Verdana" size="2">                ]]></body>
<body><![CDATA[<br>                   </font>                       <p>          </p>        </li>               </ol>        </li>      </ol>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">Once all the information required to know whether the optimizations can be applied has been gathered, the following list can be used to apply the configuration changes. A complete measurement cycle should be run between changes. This includes contrasting the old and new average latency and their deviation using at least IMB Ping Pong.&nbsp;</font></p>      <p>      </p>  <dl class="description">    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmbx-10">Disable IRQ Moderation</span>   </font>                     <div class="verbatim" id="verbatim-14">      <font face="Verdana" size="2">#&nbsp;pdsh&nbsp;-a&nbsp;'echo&nbsp;"options&nbsp;e1000e&nbsp;InterruptThrottleRate=0"&nbsp;&gt;&gt;&nbsp;\      &nbsp;    <br>  /etc/modprobe.conf'      &nbsp;    <br>  #&nbsp;modprobe&nbsp;-r&nbsp;e1000e;&nbsp;modprobe&nbsp;e1000e </font>     </div>        <font face="Verdana" size="2">            ]]></body>
<body><![CDATA[<br>           </font>               <p>      </p>    </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmbx-10">Disable IRQ Balancer</span>   </font>                     <div class="verbatim" id="verbatim-15">      <font face="Verdana" size="2">#&nbsp;pdsh&nbsp;-a&nbsp;'service&nbsp;irqbalance&nbsp;stop'      &nbsp;    <br>  #&nbsp;pdsh&nbsp;-a&nbsp;'chkconfig&nbsp;irqbalance&nbsp;off' </font>     </div>        <font face="Verdana" size="2">            <br>           </font>               <p>      </p>    </dd>    <dd>&nbsp;</dd>    <dd class="description"><font face="Verdana" size="2"><span class="cmbx-10">Disable Firewall</span>   </font>                     <div class="verbatim" id="verbatim-16">      <font face="Verdana" size="2">#&nbsp;pdsh&nbsp;-a&nbsp;'service&nbsp;iptables&nbsp;stop'      &nbsp;    <br>  #&nbsp;pdsh&nbsp;-a&nbsp;'chkconfig&nbsp;iptables&nbsp;off' </font>     </div>        <font face="Verdana" size="2">            <br>           </font>               <p>      </p>    </dd>  </dl>   <font face="Verdana" size="2">       ]]></body>
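<body><![CDATA[<p><font face="Verdana" size="2">The measurement cycle recommended between changes boils down to comparing an average and a deviation before and after each step. The following is a minimal sketch of that arithmetic, assuming the latency samples (in microseconds) have already been extracted from the benchmark output, one value per line; the function names <span class="cmtt-10">latency_stats</span> and <span class="cmtt-10">improvement_pct</span> are hypothetical and the parsing of IMB output is deliberately left out.</font></p>

```shell
#!/bin/sh
# Hypothetical helpers (not from the paper) for comparing two measurement runs.

# Read one latency sample per line on standard input and print
# "mean stddev" (population standard deviation), two decimals each.
latency_stats() {
    awk '{ n++; sum += $1; sumsq += $1 * $1 }
         END {
             if (n == 0) exit 1                 # no samples: report failure
             mean = sum / n
             var = sumsq / n - mean * mean
             sd = (var > 0) ? sqrt(var) : 0     # guard against rounding error
             printf "%.2f %.2f\n", mean, sd
         }'
}

# Print the relative latency reduction, in percent, between two averages.
# $1: average before the change, $2: average after the change
improvement_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}
```

<p><font face="Verdana" size="2">As an example, <span class="cmtt-10">improvement_pct 50 20</span> prints <span class="cmtt-10">60.0</span>, which corresponds to going from roughly 50 to roughly 20 microseconds.</font></p>]]></body>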
<body><![CDATA[<br>  </font>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">6   </span> <a id="x1-260006"></a>Conclusion</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">This work shows that, by changing only default configurations, the latency of a Beowulf system can be easily optimized, directly affecting the execution time of High Performance Computing applications. As a quick reference, an out-of-the-box system using Gigabit Ethernet has around 50 <img src="/img/revistas/cleiej/v15n3/3a0410x.png" alt="&mu;  " class="math">s of communication latency. Using different techniques, it is possible to get as low as nearly 20 <img src="/img/revistas/cleiej/v15n3/3a0411x.png" alt="&mu;  " class="math">s.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">After introducing some background theory and supporting tools, this work analyzed and exercised different methods to measure latency (the IMB, HPL and HPCC benchmarks). It also contrasted those methods and provided insights on how they should be executed and how their results should be analyzed.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">We identified which specific items have the highest impact on latency metrics (interrupt moderation and system services), using de facto benchmarks and a real-world application, mpiBLAST.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">6.1   </span> <a id="x1-270006.1"></a>Future Work</font></p>   <font face="Verdana" size="2">       <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2">Running a wider range of real-world computational problems will help to understand the impact on different workloads. A characterization of the impact by application domain, profiling information or computational kernel might be useful as a reference.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">There are virtually endless opportunities to continue the research on latency optimization, among them components such as BIOS, firmware, networking switches and routers. Another interesting opportunity is the set of RX/TX parameters of Ethernet drivers, which control the number of packet descriptors used during communication.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Another option is to implement an MPI trace analysis tool to estimate the impact of having an optimized low latency environment. At the moment there are several tools to depict communication traces (Jumpshot <a href="http://www.mcs.anl.gov/research/projects/perfvis/software/viewers" class="url"><span class="cmtt-10">http://www.mcs.anl.gov/research/projects/perfvis/software/viewers</span></a>, Intel&rsquo;s ITAC <a href="http://software.intel.com/en-us/articles/intel-trace-analyzer" class="url"><span class="cmtt-10">http://software.intel.com/en-us/articles/intel-trace-analyzer</span></a>), but they do not provide a simulation of what would happen when running over a different network environment. 
Having this approximation can be useful to decide whether it is worth purchasing specialized hardware.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Finally, it would also be interesting to understand the impact of this work on research or development processes that use clusters, not only in industry but also in academia.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><a id="x1-280006.1"></a>Acknowledgments</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">The authors would like to thank the Argentina Cluster Engineering team at the Argentina Software Design Center (ASDC Intel) for their contributions.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><a id="x1-290006.1"></a>References</font></p>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p>     </p>      <div class="thebibliography">          <p><font face="Verdana" size="2"><span class="biblabel"><a name="1.."></a>   (<a href="#1.">1</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>T.&nbsp;Sterling,  D.&nbsp;J.  Becker,  D.&nbsp;Savarese,  J.&nbsp;E.  Dorband,  U.&nbsp;A.  Ranawake,  and  C.&nbsp;V.  Packer,     &ldquo;Beowulf: A parallel workstation for scientific computation,&rdquo; in <span class="cmti-10">Proceedings of the 24th International</span>     <span class="cmti-10">Conference on Parallel Processing</span>.   CRC Press, 1995, pp. 11&ndash;14. </font>      </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="2.."></a>   (<a href="#2.">2</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;Salmon,  C.&nbsp;Stein,  and  T.&nbsp;Sterling,  &ldquo;Scaling  of  Beowulf-class  distributed  systems,&rdquo;  in     <span class="cmti-10">Proceedings of SC&rsquo;98</span>, 1998. </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="3.."></a>   (<a href="#3.">3</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>W.&nbsp;Gropp, E.&nbsp;Lusk, and T.&nbsp;Sterling, <span class="cmti-10">Beowulf Cluster Computing with Linux, Second Edition</span>.  The     MIT Press, 2003.     </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="4.."></a>   (<a href="#4.">4</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>QLogic, &ldquo;Introduction to Ethernet latency,&rdquo; Tech. Rep., 2011. 
</font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="5.."></a>   (<a href="#5.">5</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>S.&nbsp;Larsen, P.&nbsp;Sarangam, and R.&nbsp;Huggahalli, &ldquo;Architectural breakdown of end-to-end latency in a     TCP/IP network,&rdquo; in <span class="cmti-10">Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007.</span>     <span class="cmti-10">19th International Symposium on</span>, Oct. 2007, pp. 195&ndash;202. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel">   (<a name="6.."></a><a href="#6.">6</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>A.&nbsp;Foong, T.&nbsp;Huff, H.&nbsp;Hum, J.&nbsp;Patwardhan, and G.&nbsp;Regnier, &ldquo;TCP performance re-visited,&rdquo; in     <span class="cmti-10">Performance Analysis of Systems and Software, 2003. ISPASS. 2003 IEEE International Symposium</span>     <span class="cmti-10">on</span>, Mar. 2003, pp. 70&ndash;79. </font>     </p>            ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span class="biblabel"><a name="7.."></a>   (<a href="#7.">7</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>InfiniBand Trade Association, &ldquo;InfiniBand architecture specification release 1.2.1,&rdquo; Tech. Rep., 2008. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel">   (<a name="8.."></a>8)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>Message Passing Interface Forum, &ldquo;MPI: A message-passing interface standard,&rdquo; Tech. Rep., 2009. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel">   (<a name="9.."></a><a href="#9.">9</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>R.&nbsp;Jones, &ldquo;Netperf,&rdquo; 2007. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel">  (<a name="10.."></a><a href="#10.">10</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>S.&nbsp;Bradner and J.&nbsp;McQuaid, &ldquo;RFC 2544: Benchmarking methodology for network interconnect     devices,&rdquo; United States, 1999. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel">  (<a name="11.."></a><a href="#11.">11</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>A.&nbsp;Petitet, R.&nbsp;C.&nbsp;Whaley, J.&nbsp;Dongarra, and A.&nbsp;Cleary, &ldquo;A portable implementation of the high-performance     Linpack benchmark for distributed-memory computers,&rdquo; Tech. Rep., 2008. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel">  (<a name="12.."></a><a href="#12.">12</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;J. Dongarra, I.&nbsp;High, and P.&nbsp;C. Systems, &ldquo;Overview of the HPC challenge benchmark suite,&rdquo;     2006. 
</font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel">  (<a name="13.."></a><a href="#13.">13</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>Intel Corporation, &ldquo;Interrupt moderation using Intel Gigabit Ethernet controllers application note,&rdquo; Tech. Rep., 2007. </font>      </p>            <p><font face="Verdana" size="2"><span class="biblabel">  (<a name="14.."></a><a href="#14.">14</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>Intel Corporation, &ldquo;Improving measured latency in Linux for Intel(R) 82575/82576&nbsp;or 82598/82599 Ethernet&nbsp;controllers,&rdquo; Tech. Rep., 2009. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel">  (<a name="15.."></a><a href="#15.">15</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>Intel Corporation, &ldquo;Assigning interrupts to processor cores using an Intel(R) 82575/82576&nbsp;or 82598/82599 Ethernet&nbsp;controller,&rdquo; Tech. Rep., 2009. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel">  (<a name="16.."></a><a href="#16.">16</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>H.&nbsp;Lin, P.&nbsp;Balaji, R.&nbsp;Poole, C.&nbsp;Sosa, X.&nbsp;Ma, and W.-c. Feng, &ldquo;Massively parallel genomic sequence search on the Blue Gene/P architecture,&rdquo; in <span class="cmti-10">Proceedings of the 2008 ACM/IEEE conference</span>     <span class="cmti-10">on Supercomputing</span>, ser. SC &rsquo;08.   Piscataway, NJ, USA: IEEE Press, 2008, pp. 33:1&ndash;33:11. (Online).     Available: <a href="http://dl.acm.org/citation.cfm?id=1413370.1413404" class="url">http://dl.acm.org/citation.cfm?id=1413370.1413404</a> </font> </p>       </div>            ]]></body>
<body><![CDATA[ ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sterling]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Becker]]></surname>
<given-names><![CDATA[D. J.]]></given-names>
</name>
<name>
<surname><![CDATA[Savarese]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Dorband]]></surname>
<given-names><![CDATA[J. E.]]></given-names>
</name>
<name>
<surname><![CDATA[Ranawake]]></surname>
<given-names><![CDATA[U. A.]]></given-names>
</name>
<name>
<surname><![CDATA[Packer]]></surname>
<given-names><![CDATA[C. V.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Beowulf: A parallel workstation for scientific computation]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ 24th International Conference on Parallel Processing]]></conf-name>
<conf-date>1995</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Salmon]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Stein]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Sterling]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Scaling of beowulf-class distributed systems]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ Proceedings of SC&#8217;98]]></conf-name>
<conf-date>1998</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gropp]]></surname>
<given-names><![CDATA[E. L. William]]></given-names>
</name>
<name>
<surname><![CDATA[Sterling]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<source><![CDATA[Beowulf Cluster Computing with Linux, Second Edition]]></source>
<year>2003</year>
<publisher-name><![CDATA[The MIT Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="">
<collab>QLogic</collab>
<source><![CDATA[Introduction to Ethernet latency]]></source>
<year>2011</year>
</nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Larsen]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Sarangam]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Huggahalli]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Architectural breakdown of end-to-end latency in a TCP/IP network]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[Computer Architecture and High Performance Computing]]></conf-name>
<conf-date>2007</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Foong]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Huff]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Hum]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Patwardhan]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Regnier]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[TCP performance re-visited]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[Performance Analysis of Systems and Software]]></conf-name>
<conf-date>2003</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="">
<collab>InfiniBand Trade Association</collab>
<source><![CDATA[InfiniBand architecture specification release 1.2.1]]></source>
<year>2008</year>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="">
<collab>Message Passing Interface Forum</collab>
<source><![CDATA[MPI: A message-passing interface standard]]></source>
<year>2009</year>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Jones]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<source><![CDATA[Netperf]]></source>
<year>2007</year>
</nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bradner]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[McQuaid]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[IETF RFC 2544: Benchmarking methodology for network interconnect devices]]></source>
<year>1999</year>
</nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Petitet]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Whaley]]></surname>
<given-names><![CDATA[R. C.]]></given-names>
</name>
<name>
<surname><![CDATA[Dongarra]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Cleary]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[A portable implementation of the High-Performance Linpack benchmark for distributed-memory computers]]></source>
<year>2008</year>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Dongarra]]></surname>
<given-names><![CDATA[J. J.]]></given-names>
</name>
<name>
<surname><![CDATA[High]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
</person-group>
<collab>P. C. Systems</collab>
<source><![CDATA[Overview of the HPC Challenge benchmark suite]]></source>
<year>2006</year>
</nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="">
<collab>Intel Corporation</collab>
<source><![CDATA[Interrupt Moderation Using Intel Gigabit Ethernet Controllers Application Note]]></source>
<year>2007</year>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="">
<collab>Intel Corporation</collab>
<source><![CDATA[Interrupt moderation using Intel Gigabit Ethernet controllers application note]]></source>
<year>2009</year>
</nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="">
<collab>Intel Corporation</collab>
<source><![CDATA[Interrupt moderation using Intel Gigabit Ethernet controllers application note]]></source>
<year>2009</year>
</nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lin]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Balaji]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Poole]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Sosa]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Ma]]></surname>
<given-names><![CDATA[X.]]></given-names>
</name>
<name>
<surname><![CDATA[Feng]]></surname>
<given-names><![CDATA[W.-c.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Massively parallel genomic sequence search on the Blue Gene/P architecture]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[2008 ACM/IEEE conference on Supercomputing]]></conf-name>
<conf-date>2008</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
