<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0717-5000</journal-id>
<journal-title><![CDATA[CLEI Electronic Journal]]></journal-title>
<abbrev-journal-title><![CDATA[CLEIej]]></abbrev-journal-title>
<issn>0717-5000</issn>
<publisher>
<publisher-name><![CDATA[Centro Latinoamericano de Estudios en Informática]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0717-50002012000300009</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Parallel implementations of the MinMin heterogeneous computing scheduler in GPU]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Canabé]]></surname>
<given-names><![CDATA[Mauro]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Nesmachnow]]></surname>
<given-names><![CDATA[Sergio]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Universidad de la República]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
<country>Uruguay</country>
</aff>
<aff id="A02">
<institution><![CDATA[Universidad de la República]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
<country>Uruguay</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>12</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>12</month>
<year>2012</year>
</pub-date>
<volume>15</volume>
<numero>3</numero>
<fpage>8</fpage>
<lpage>8</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_arttext&amp;pid=S0717-50002012000300009&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_abstract&amp;pid=S0717-50002012000300009&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_pdf&amp;pid=S0717-50002012000300009&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[This work presents parallel implementations of the MinMin scheduling heuristic for heterogeneous computing using Graphic Processing Units, in order to improve its computational efficiency. The experimental evaluation of the four proposed MinMin variants demonstrates that a significant reduction in the computing times can be attained, making it possible to tackle large scheduling scenarios in reasonable execution times.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[Este trabajo presenta implementaciones paralelas de la heurística de planificación MinMin para entornos de computación heterogénea usando unidades de procesamiento gráfico, con el fin de mejorar su eficiencia computacional. La evaluación experimental de las cuatro variantes propuestas para la heurística MinMin demuestra que se puede alcanzar una reducción significativa en los tiempos de cálculo, lo que permite hacer frente a grandes escenarios de planificación en tiempos de ejecución razonables.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[GPU computing]]></kwd>
<kwd lng="en"><![CDATA[heterogeneous computing]]></kwd>
<kwd lng="en"><![CDATA[scheduling]]></kwd>
<kwd lng="es"><![CDATA[computación en GPU]]></kwd>
<kwd lng="es"><![CDATA[computación heterogénea]]></kwd>
<kwd lng="es"><![CDATA[planificación]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <div class="maketitle">    <b><font face="Verdana" size="4">Parallel implementations of the MinMin heterogeneous computing scheduler in GPU</font></b>    <div class="author">    <font face="Verdana" size="2"> <span class="cmbx-12">Mauro Canab&eacute;</span>     <br>         <span class="cmr-12">Centro de C&aacute;lculo, Facultad de Ingenier&iacute;a</span>     <br>            <span class="cmr-12">Universidad de la Rep&uacute;blica, Uruguay</span>     <br>  <a href="mailto:mcanabe@fing.edu.uy" class="url"><span class="cmitt-10x-x-120">mcanabe@fing.edu.uy</span></a> <br class="and">  <span class="cmbx-12">Sergio Nesmachnow</span>     <br>         <span class="cmr-12">Centro de C&aacute;lculo, Facultad de Ingenier&iacute;a</span>     <br>            <span class="cmr-12">Universidad de la Rep&uacute;blica, Uruguay</span>     <br>             <a href="mailto:sergion@fing.edu.uy" class="url"><span class="cmitt-10x-x-120">sergion@fing.edu.uy</span></a>   </font> </div>  <font face="Verdana" size="2">      <br>   </font>       <div class="date"></div>      </div>        ]]></body>
<body><![CDATA[<div class="abstract">     <div class="center"> <font face="Verdana" size="2">     <br>  </font>      <p> </p>      <div class="minipage">     <div class="center"> <font face="Verdana" size="2">     <br>  </font>      <p> </p>      <p><font face="Verdana" size="2"><span class="cmbx-10">Abstract</span></font></p>  </div>   <font face="Verdana" size="2">       <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2">This work presents parallel implementations of the MinMin scheduling heuristic for heterogeneous computing using Graphic Processing Units, in order to improve its computational efficiency. The experimental evaluation of the four proposed MinMin variants demonstrates that a significant reduction in the computing times can be attained, making it possible to tackle large scheduling scenarios in reasonable execution times.&nbsp;</font></p>      <p><font face="Verdana" size="2"><span class="cmbx-10">Spanish abstract:</span>&nbsp;</font></p>      <p><font face="Verdana" size="2">Este trabajo presenta implementaciones paralelas de la heur&iacute;stica de planificaci&oacute;n MinMin para entornos de computaci&oacute;n heterog&eacute;nea usando unidades de procesamiento gr&aacute;fico, con el fin de mejorar su eficiencia computacional. La evaluaci&oacute;n experimental de las cuatro variantes propuestas para la heur&iacute;stica MinMin demuestra que se puede alcanzar una reducci&oacute;n significativa en los tiempos de c&aacute;lculo, lo que permite hacer frente a grandes escenarios de planificaci&oacute;n en tiempos de ejecuci&oacute;n razonables. 
</font> </p>  </div>  </div>   </div>   <font face="Verdana" size="2">   <span class="cmbx-10">Keywords: </span>GPU computing, heterogeneous computing, scheduling.&nbsp; </font>     <p>   <font face="Verdana" size="2">   <span class="cmbx-10">Spanish keywords: </span>computaci&oacute;n en GPU, computaci&oacute;n heterog&eacute;nea, planificaci&oacute;n.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Received: 2012-06-10 Revised 2012-10-01 Accepted 2012-10-04 </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">1   </span> <a id="x1-10001"></a>Introduction</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">In the last fifteen years, distributed computing environments have been increasingly used to solve complex problems. Nowadays, a common platform for distributed computing usually comprises a heterogeneous collection of computers. This class of infrastructures includes <span class="cmti-10">grid computing </span>and <span class="cmti-10">cloud computing </span>environments, where a large set of heterogeneous computers with diverse characteristics are combined to provide pervasive on demand and cost-effective processing power, software, and access to data, for solving many kinds of problems&nbsp;<span class="cite">(<a href="#c1">1</a>,&nbsp;<a href="#c2">2</a>)</span><a name="c1."></a><a name="c2."></a>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">A key problem when using such heterogeneous computing (HC) environments consists in finding a scheduling strategy for a set of tasks to be executed. The goal is to assign the computing resources by satisfying some efficiency criteria, usually related to the total execution time or resource utilization&nbsp;<span class="cite">(<a href="#c3">3</a>,&nbsp;<a href="#c4">4</a>)<a name="c3."></a><a name="c4."></a></span>. 
The <span class="cmti-10">heterogeneous computing</span> <span class="cmti-10">scheduling problem </span>(HCSP) became especially important due to the popularization of heterogeneous distributed computing systems&nbsp;<span class="cite">(<a href="#c5">5</a>,&nbsp;<a href="#c6">6</a>)</span>.<a name="c5."></a><a name="c6."></a>&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Traditional scheduling problems are NP-hard&nbsp;<span class="cite">(<a href="#c7">7</a>)</span><a name="c7."></a>, thus classic exact methods are only useful for solving very small problem instances. Heuristic methods are able to compute efficient schedules in reasonable times, but they still require long execution times when solving large instances of the scheduling problem. These execution times (e.g., on the order of an hour) can be prohibitively high for performing on-line scheduling in realistic HC infrastructures.&nbsp;</font></p>      ]]></body>
<body><![CDATA[<p>   <font face="Verdana" size="2">High performance computing techniques can be applied to reduce the execution times required to perform the scheduling. The massively parallel hardware in Graphics Processing Units (GPU) has been successfully applied to speed up the computations required to solve problems in many application areas&nbsp;<span class="cite">(<a href="#c8">8</a>)</span><a name="c8."></a>, showing an excellent trade-off between cost and computing power <span class="cite">(<a href="#c9">9</a>)<a name="c9."></a></span>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The main contribution of this work is the development of four parallel GPU implementations of the classic and effective MinMin scheduling heuristic&nbsp;<span class="cite">(<a href="#c10">10</a>)</span><a name="c10."></a>. The experimental evaluation of the proposed parallel methods demonstrates that a significant reduction in the computing times can be attained when using the parallel GPU hardware. This performance improvement allows solving large scheduling scenarios in reasonable execution times.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The manuscript is structured as follows. The next section introduces the HCSP mathematical formulation, and the heuristics studied in this work. A brief introduction to GPU computing is presented in Section&nbsp;<a href="#x1-50003">3</a>. Section&nbsp;<a href="#x1-70004">4</a> describes the four proposed implementations of the MinMin heuristic on GPU. The experimental evaluation of the proposed methods is reported in Section&nbsp;<a href="#x1-80005">5</a>, where the efficiency results are also analyzed. 
Finally, Section&nbsp;<a href="#x1-120006">6</a> summarizes the conclusions of the research and formulates the main lines for future work.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">2   </span> <a id="x1-20002"></a>Heterogeneous computing scheduling</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">This section presents the HCSP and its mathematical formulation. It also provides a description of the class of list scheduling heuristics, and describes the MinMin method parallelized in this work.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">2.1   </span> <a id="x1-30002.1"></a>HCSP formulation</font></p>   <font face="Verdana" size="2">       <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2">An HC system is composed of many computers, also called <span class="cmti-10">processors </span>or <span class="cmti-10">machines</span>, and a set of tasks to be executed on the system. A task is the atomic unit of workload, so it cannot be divided into smaller chunks, nor interrupted after it is assigned to a machine. The execution times of any individual task vary from one machine to another, so there will be competition among tasks for using those machines able to execute them in the shortest time.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Scheduling problems are mainly concerned with time, aiming to minimize the time spent executing all tasks. The most usual metric to minimize in this model is the <span class="cmti-10">makespan</span>, defined as the time spent from the moment when the first task begins execution to the moment when the last task is completed&nbsp;<span class="cite">(<a href="#c4">4</a>)</span>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The following formulation presents the mathematical model for the HCSP aimed at minimizing the makespan: </font>      </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2">given an HC system composed of a set of machines <img src="/img/revistas/cleiej/v15n3/3a090x.png" alt="P = {m1,...,mM } " class="math"> (dimension <img src="/img/revistas/cleiej/v15n3/3a091x.png" alt="M  " class="math">), and a      collection of tasks <img src="/img/revistas/cleiej/v15n3/3a092x.png" alt="T = {t1,...,tN} " class="math"> (dimension <img src="/img/revistas/cleiej/v15n3/3a093x.png" alt="N  " class="math">) to be executed on the system,      </font>      </li>        <li class="itemize"><font face="Verdana" size="2">let there be an <span class="cmti-10">execution time function</span> <img src="/img/revistas/cleiej/v15n3/3a094x.png" alt="ET : T &times; P &rarr; R+  " class="math">, where <img 
src="/img/revistas/cleiej/v15n3/3a095x.png" alt="ET(ti,mj)" class="math"> is the time required      to execute the task <img src="/img/revistas/cleiej/v15n3/3a096x.png" alt="ti" class="math"> in the machine <img src="/img/revistas/cleiej/v15n3/3a097x.png" alt="mj" class="math">, </font>      </li>        <li class="itemize"><font face="Verdana" size="2">the goal of the HCSP is to find an assignment of tasks to machines (a function <img src="/img/revistas/cleiej/v15n3/3a098x.png" alt="f : T^N &rarr; P^M" class="math">) which      minimizes the <span class="cmti-10">makespan</span>, defined in Equation <a href="#x1-3001r1">1</a>.      </font>                <table class="equation">        <tbody>          <tr>            <td><font face="Verdana" size="2"><a id="x1-3001r1"></a>                            </font>                            <center class="math-display">       <font face="Verdana" size="2">       <img src="/img/revistas/cleiej/v15n3/3a099x.png" alt="max_{mj &isin; P} &sum;_{ti &isin; T : f(ti) = mj} ET(ti,mj)" class="math-display"></font></center>            </td>            <td class="equation-label"><font face="Verdana" size="2">(1)</font></td>          </tr>               </tbody>          </table>        <font face="Verdana" size="2">            <br>            </font>               <p>      </p>    </li>      </ul>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">In the previous HCSP formulation all tasks can be independently executed, disregarding the execution order. These kinds of applications frequently appear in many lines of scientific research, especially in Single-Program Multiple-Data applications used for multimedia processing, data mining, parallel domain decomposition of numerical models for physical phenomena, etc. 
The independent tasks model also arises when different users submit their (obviously independent) tasks to execute in grid computing and volunteer-based computing infrastructures -such as TeraGrid, WLCG, Berkeley&rsquo;s BOINC, Xgrid, etc.&nbsp;<span class="cite">(<a href="#c11">11</a>)</span><a name="c11."></a>-, where non-dependent applications using domain decomposition are very often submitted for execution. Thus, the relevance of the HCSP version faced in this work is justified due to its significance in realistic distributed HC and grid environments.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">2.2   </span> <a id="x1-40002.2"></a>List scheduling heuristics</font></p>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p><font face="Verdana" size="2">The class of <span class="cmti-10">list scheduling </span>heuristics comprises many deterministic scheduling methods that work by assigning priorities to tasks based on a particular criterion. After that, the list of tasks is sorted in decreasing priority and each task is assigned to a processor, regarding the task priority and the processor availability. Algorithm <a href="#x1-4001r1">1</a> presents the generic schema of a list scheduling method. </font>    </p>      <div class="algorithm">  <font face="Verdana" size="2">      <br>  </font>      <p>   <font face="Verdana" size="2">   <a id="x1-4001r1"></a></font></p>  <hr class="float">     <div class="float">        <div class="caption"><font face="Verdana" size="2"><span class="id">Algorithm 1: </span><span class="content">Schema of a list scheduling algorithm.</span></font></div>  <font face="Verdana" size="2">      <br>   </font>       <div class="algorithmic"> <font face="Verdana" size="2"> <a id="x1-4002r1"></a>  <span class="ALCitem"><span class="cmr-8">1:</span></span><span style="width: 5pt;">&nbsp;</span> <span class="cmbx-10">while</span>&nbsp;tasks left to assign&nbsp;<span class="cmbx-10">do</span><span class="while-body"> <a id="x1-4003r2"></a>      <br>  <span class="ALCitem"><span class="cmr-8">2:</span></span><span style="width: 15pt;">&nbsp;</span>   determine the most suitable task according to the chosen criterion <a id="x1-4004r3"></a>      ]]></body>
<body><![CDATA[<br>  <span class="ALCitem"><span class="cmr-8">3:</span></span><span style="width: 15pt;">&nbsp;</span>   <span class="cmbx-10">for</span>&nbsp;each task to assign, each machine&nbsp;<span class="cmbx-10">do</span><span class="for-body"> <a id="x1-4005r4"></a>      <br>  <span class="ALCitem"><span class="cmr-8">4:</span></span><span style="width: 25pt;">&nbsp;</span>     evaluate criterion (task, machine)      </span><a id="x1-4006r5"></a>      <br>  <span class="ALCitem"><span class="cmr-8">5:</span></span><span style="width: 15pt;">&nbsp;</span>   <span class="cmbx-10">end</span>&nbsp;<span class="cmbx-10">for</span><a id="x1-4007r6"></a>      <br>  <span class="ALCitem"><span class="cmr-8">6:</span></span><span style="width: 15pt;">&nbsp;</span>   assign the selected task to the selected machine    </span><a id="x1-4008r7"></a>      <br>  <span class="ALCitem"><span class="cmr-8">7:</span></span><span style="width: 5pt;">&nbsp;</span> <span class="cmbx-10">end</span>&nbsp;<span class="cmbx-10">while</span> </font> </div>       </div>  <hr class="endfloat">    </div>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">Since the pioneering work by Ibarra and Kim&nbsp;<span class="cite">(<a href="#c12">12</a>)</span><a name="c12."></a>, where the first algorithms following the generic schema in Algorithm <a href="#x1-4001r1">1</a> were introduced, many list scheduling techniques have been proposed to provide easy methods for tasks-to-machines scheduling. This class of methods has also often been employed in hybrid algorithms, with the objective of improving the search of metaheuristic approaches for the HCSP and related scheduling problems.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The simplest list scheduling heuristics use a single criterion to perform the tasks-to-machines assignment. 
Among others, this category includes: <span class="cmti-10">Minimum Execution Time </span>(MET), which considers the tasks sorted in an arbitrary order, and assigns them to the machine with the lowest ET for that task, regardless of the machine availability; <span class="cmti-10">Opportunistic Load Balancing </span>(OLB), which considers the tasks sorted in an arbitrary order, and assigns them to the next machine that is expected to be available, regardless of the ET for each task on that machine; and <span class="cmti-10">Minimum</span> <span class="cmti-10">Completion Time </span>(MCT), which tries to combine the benefits of OLB and MET by considering the set of tasks sorted in an arbitrary order and assigning each task to the machine with the minimum CT for that task.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Trying to overcome the inefficacy of these simple heuristics, other methods with higher complexity have been proposed, by taking into account more complex and holistic criteria to perform the task mapping, and thus reduce the makespan values. This work focuses on one of the most effective heuristics in this class: </font>      </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">MinMin</span>, which greedily picks the task that can be completed the soonest. The method starts with a      set <img src="/img/revistas/cleiej/v15n3/3a0910x.png" alt="U" class="math"> of all <span class="cmti-10">unmapped </span>tasks, calculates the MCT for each task in <img src="/img/revistas/cleiej/v15n3/3a0911x.png" alt="U" class="math"> for each machine, and assigns      the task with the minimum overall MCT to the best machine. The mapped task is removed from <img src="/img/revistas/cleiej/v15n3/3a0912x.png" alt="U" class="math">,      and the process is repeated until all tasks are mapped. 
MinMin improves upon the MCT heuristic, since it does not consider a single task at a time but all the unmapped tasks sorted by MCT, and it updates the machine availability after every assignment. This procedure leads to balanced schedules and also allows finding smaller makespan values than other heuristics, since more tasks are expected to be assigned to the machines that can complete them the earliest.</font></li>      </ul>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p>   <font face="Verdana" size="2">The computational complexity of the MinMin heuristic is <img src="/img/revistas/cleiej/v15n3/3a0913x.png" alt="O(N^3)" class="math">, where <img src="/img/revistas/cleiej/v15n3/3a0914x.png" alt="N" class="math"> is the number of tasks to schedule. When solving large instances of the HCSP, large execution times are required to perform the task-to-machine assignment (e.g., several minutes for a problem instance with 10,000 tasks). In this context, parallel computing techniques can be applied to reduce the execution times required to find the schedules.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">GPU computing has been used to parallelize many algorithms in diverse research areas. However, to the best of our knowledge, there have been no previous proposals of applying GPU parallelism to list scheduling heuristics. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">3   </span> <a id="x1-50003"></a>GPU computing</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">GPUs were originally designed to exclusively perform the graphic processing in computers, allowing the Central Processing Unit (CPU) to concentrate on the remaining computations. Nowadays, GPUs have considerable computing power, provided by hundreds of processing units with reasonably fast clock frequencies. 
In the last ten years, GPUs have been used as a powerful parallel hardware architecture to achieve efficiency in the execution of applications.&nbsp;</font></p>      <p><font face="Verdana" size="2"><span class="paragraphHead"><span class="cmbx-10">GPU programming and CUDA.</span></span>    Ten years ago, when GPUs were first used to perform general-purpose computation, they were programmed using low-level mechanisms such as the interrupt services of the BIOS, or by using graphic APIs such as OpenGL and DirectX&nbsp;<span class="cite">(<a href="#c13">13</a>)</span><a name="c13."></a>. Later, the programs for GPU were developed in assembly language for each card model, and they had very limited portability. As a result, high-level languages were developed to fully exploit the capabilities of the GPUs. In 2007, NVIDIA introduced CUDA&nbsp;<span class="cite">(<a href="#c14">14</a>)</span><a name="c14."></a>, a software architecture for managing the GPU as a parallel computing device without requiring the data and the computation to be mapped into a graphic API.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">CUDA is based on an extension of the C language, and it is available for GeForce 8 Series and later graphics cards. Three software layers are used in CUDA to communicate with the GPU (see Fig. <a href="#x1-6001r1">1</a>): a low-level hardware driver that performs the CPU-GPU data communications, a high-level API, and a set of libraries such as CUBLAS for linear algebra and CUFFT for Fourier transforms. </font> </p>  <hr class="figure">     <div class="figure">    <font face="Verdana" size="2">        <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><a id="x1-6001r1"><img src="/img/revistas/cleiej/v15n3/3a09f1.jpg"></a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;1: </span><span class="content">CUDA architecture.</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">For the CUDA programmer, the GPU is a computing device which is able to execute a large number of threads in parallel. A specific procedure to be executed many times over different data can be isolated in a GPU function using many execution threads. The function is compiled using a specific set of instructions and the resulting program (named <span class="cmti-10">kernel</span>) is loaded in the GPU. The GPU has its own DRAM, and the data are copied from the DRAM of the GPU to the RAM of the host (and vice versa) using optimized calls to the CUDA API.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The CUDA architecture is built around a scalable array of multiprocessors, each one of them having eight scalar processors, one multithreading unit, and a shared memory chip. The multiprocessors are able to create, manage, and execute parallel threads, with small overhead. The threads are grouped into <span class="cmti-10">blocks </span>(with up to 512 threads), which are executed in a single multiprocessor, and the blocks are grouped into <span class="cmti-10">grids</span>. When a CUDA program calls a grid to be executed in the GPU, each one of the blocks in the grid is numbered and distributed to an available multiprocessor. When a multiprocessor receives a block to execute, it splits the threads into <span class="cmti-10">warps</span>, a set of 32 consecutive threads. 
Each warp executes a single instruction at a time, so the best efficiency is achieved when the 32 threads in the warp execute the same instruction. Each time a block finishes its execution, a new block is assigned to the available multiprocessor.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The threads access the data using three memory spaces: a <span class="cmti-10">shared memory </span>used by the threads in the block; the <span class="cmti-10">local memory </span>of the thread; and the <span class="cmti-10">global memory </span>of the GPU. Minimizing the access to the slower memory spaces (the local memory of the thread and the global memory of the GPU) is very important for achieving efficiency. On the other hand, the shared memory is placed within the GPU chip, thus it provides a faster way to store the data. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">4   </span> <a id="x1-70004"></a>MinMin implementations on GPU</font></p>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p><font face="Verdana" size="2">The GPU architecture is better suited to the Single Instruction Multiple Data execution model for parallel programs. Thus, GPUs provide an ideal platform for executing parallel programs based on algorithms that use the domain decomposition strategy, especially when the algorithms execute the same set of instructions for each element of the domain.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The generic schema for a list scheduling heuristic presented in Algorithm&nbsp;<a href="#x1-4001r1">1</a> applies the following strategy: for each unassigned task the criteria are evaluated on all machines, and the task that best meets the criteria is selected and assigned to the machine which generates the minimum MCT. Clearly, this schema is an ideal case for applying a task-based or machine-based domain decomposition to generate parallel versions of the heuristics.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The four MinMin implementations on GPU designed in this work are based on the same generic parallel strategy. For each unassigned task, the evaluation of the criteria for all machines is performed in parallel on the GPU, building a vector that stores the identifier of the task, the best value obtained for the criteria, and the corresponding machine achieving that value. The indicators in the vector are then processed in the reduction phase to obtain the best value that meets the criteria, and then the best pair (task, machine) is assigned. It is worth noting that the processing of the indicators to obtain the optimum value in each step is also performed using the GPU. 
A graphical summary of the generic parallel strategy applied in the parallel MinMin algorithms proposed in this article is presented in Fig.&nbsp;<a href="#x1-7001r2">2</a>.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">   <font face="Verdana" size="2">       <br>  </font>      <p> <font face="Verdana" size="2"> <a id="x1-7001r2"><img src="/img/revistas/cleiej/v15n3/3a09f2.jpg"> </a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;2: </span><span class="content">Generic parallel strategy for MinMin on GPU.</span></font></div>  <font face="Verdana" size="2">      ]]></body>
<body><![CDATA[<br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>  </p>      <p>   <font face="Verdana" size="2">Four variants of the proposed MinMin implementation in GPU were designed: </font>      </p>  <ol class="enumerate1">        <li class="enumerate" id="x1-7003x1"><font face="Verdana" size="2"><span class="cmti-10">Parallel MinMin using one GPU  </span>(MinMin-1GPU), which executes on a single GPU, applying the      aforementioned generic procedure; </font>      </li>        <li class="enumerate" id="x1-7005x2"><font face="Verdana" size="2"><span class="cmti-10">Parallel  MinMin  in  four  GPUs  with  domain  decomposition  using  pthreads  </span>(MinMin-4GPU-PT),      which applies a master-slave multithreading programming approach implemented with POSIX threads      (PThreads) that executes the same algorithm on four GPUs independently. The employed domain      partition strategy splits the domain (i.e. the set of tasks) into <img src="/img/revistas/cleiej/v15n3/3a0917x.png" alt="N" class="math"> equally sized parts (<img src="/img/revistas/cleiej/v15n3/3a0918x.png" alt="N" class="math"> being the      number of GPUs used, four in our case), so that each task belongs to only one subset. Thus, each GPU      performs the MinMin algorithm on a subset of the tasks input data on all machines, and a master      process consolidates the results after each GPU finishes its task; </font>      </li>        <li class="enumerate" id="x1-7007x3"><font face="Verdana" size="2"><span class="cmti-10">Parallel MinMin in four GPUs with domain decomposition using OpenMP  </span>(MinMin-4GPU-OMP),      which  applies  the  same  master-slave  strategy  as  the  previous  variant,  but  the  multithreading      programming is implemented using OpenMP. 
The only difference between this implementation and the previous variant lies in how the threads are handled: in this case, they are automatically managed and synchronized using OpenMP directives included in the implementation. The code for loading the input data, dumping the resulting data, performing the domain partition, and implementing the GPU kernel is identical to that used in MinMin-4GPU-PT; </font>      </li>        <li class="enumerate" id="x1-7009x4"><font face="Verdana" size="2"><span class="cmti-10">Parallel synchronous MinMin in four GPUs and CPU  </span>(MinMin-4GPU-sync), which also applies a domain decomposition but follows a hybrid approach. In each iteration, each GPU performs a single step of the MinMin algorithm; then a master process running in CPU assesses the result computed by each GPU and selects the one that best meets the proposed criterion (i.e. MCT minimization); finally, the information of the selected assignment is updated in each GPU. This variant applies a multithreading approach implemented using pthreads to manage and synchronize the threads.</font></li>      </ol>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">Figure&nbsp;<a href="#x1-7010r3">3</a> describes the parallel strategy used in the proposed implementations MinMin-4GPU-PT and MinMin-4GPU-OMP, where the CPU threads are defined and handled using pthreads and OpenMP, respectively. Figure&nbsp;<a href="#x1-7011r4">4</a> describes the parallel strategy used in the synchronous implementation MinMin-4GPU-sync.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">A specific data representation was used to accelerate the execution of the sequential implementation of the MinMin heuristic, in order to perform a fair comparison with the execution times of the GPU implementations. 
The sequential implementation uses a data matrix where each row represents a task and each column represents a machine. Thus, when processing a task (row), its entries are loaded into the cache of the processing core, allowing faster access to the data.&nbsp;</font></p>      <p>  </p>      ]]></body>
<body><![CDATA[<p>   </p>  <hr class="figure">     <div class="figure">   <font face="Verdana" size="2">       <br>  </font>      <p> <font face="Verdana" size="2"> <a id="x1-7010r3"><img src="/img/revistas/cleiej/v15n3/3a09f3.jpg"></a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;3: </span><span class="content">Parallel strategy used in MinMin-4GPU-PT and MinMin-4GPU-OMP.</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   </p>  <hr class="figure">     ]]></body>
<body><![CDATA[<div class="figure">   <font face="Verdana" size="2">       <br>  </font>      <p> <font face="Verdana" size="2"> <a id="x1-7011r4"><img src="/img/revistas/cleiej/v15n3/3a09f4.jpg"></a>     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;4: </span><span class="content">Parallel strategy used in MinMin-4GPU-sync.</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">For the parallel algorithms executing on GPU, loading the data matrix in the same way reduces computational efficiency. Adjacent threads would access data stored in contiguous rows, but these rows are not stored contiguously, so they cannot be loaded into shared memory. When the data matrix is loaded so that each column represents a task and each row represents a machine, two adjacent threads in GPU access data stored in contiguous columns. These data occupy contiguous memory locations, so they can be loaded into shared memory, allowing faster data access for each thread and therefore improving the execution of the parallel algorithm on GPU.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Preliminary experiments were also performed using a domain decomposition strategy that divides the data <span class="cmti-10">by</span> <span class="cmti-10">machines </span>rather than by tasks, but this option was finally discarded due to scalability issues as the problem size increases.&nbsp;</font></p>      ]]></body>
<body><![CDATA[<p>     </p>      <p><font face="Verdana" size="2"><span class="titlemark">5   </span> <a id="x1-80005"></a>Experimental analysis</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">This section presents the experimental evaluation of the proposed MinMin implementations on GPU.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">5.1   </span> <a id="x1-90005.1"></a>HCSP scenarios</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">No standardized benchmarks or test suites for the HCSP have been proposed in the related literature&nbsp;<span class="cite">(<a href="#c15">15</a>)</span><a name="c15."></a>. Researchers have often used the suite of twelve instances proposed by Braun et al.&nbsp;<span class="cite">(<a href="#c16">16</a>)</span><a name="c16."></a>, following the expected time to compute (ETC) performance estimation model by Ali et al.&nbsp;<span class="cite">(<a href="#c17">17</a>)</span>.<a name="c17."></a>&nbsp;</font></p>      <p>   <font face="Verdana" size="2">ETC takes into account three key properties: machine heterogeneity, task heterogeneity, and consistency. <span class="cmti-10">Machine heterogeneity </span>evaluates the variation of execution times for a given task across the HC resources, while <span class="cmti-10">task</span> <span class="cmti-10">heterogeneity </span>represents the variation of the tasks execution times for a given machine. 
Regarding the consistency property, in a <span class="cmti-10">consistent </span>scenario, whenever a given machine <img src="/img/revistas/cleiej/v15n3/3a0921x.png" alt="mj  " class="math"> executes any task <img src="/img/revistas/cleiej/v15n3/3a0922x.png" alt="ti  " class="math"> faster than another machine <img src="/img/revistas/cleiej/v15n3/3a0923x.png" alt="mk  " class="math">, then machine <img src="/img/revistas/cleiej/v15n3/3a0924x.png" alt="mj  " class="math"> executes all tasks faster than machine <img src="/img/revistas/cleiej/v15n3/3a0925x.png" alt="mk  " class="math">. In an <span class="cmti-10">inconsistent </span>scenario, a given machine <img src="/img/revistas/cleiej/v15n3/3a0926x.png" alt="mj  " class="math"> may be faster than machine <img src="/img/revistas/cleiej/v15n3/3a0927x.png" alt="mk  " class="math"> when executing some tasks and slower for others. Finally, a <span class="cmti-10">semi-consistent </span>scenario models those inconsistent systems that include a consistent subsystem.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">For the purpose of studying the efficiency of the GPU implementations as the problem instances grow, the experimental analysis considers a test suite of large-dimension HCSP instances, randomly generated to test the scalability of the proposed methods. This test suite was designed following the methodology by Ali et al.&nbsp;<span class="cite">(<a href="#c17">17</a>)</span>. 
The set includes the 96 medium-sized HCSP instances with dimension (tasks<img src="/img/revistas/cleiej/v15n3/3a0928x.png" alt="&times; " class="math">machines) 1024<img src="/img/revistas/cleiej/v15n3/3a0929x.png" alt="&times; " class="math">32, 2048<img src="/img/revistas/cleiej/v15n3/3a0930x.png" alt="&times; " class="math">64, 4096<img src="/img/revistas/cleiej/v15n3/3a0931x.png" alt="&times; " class="math">128 and 8192<img src="/img/revistas/cleiej/v15n3/3a0932x.png" alt="&times; " class="math">256 previously solved using an evolutionary algorithm&nbsp;<span class="cite">(<a href="#c18">18</a>)</span><a name="c18."></a>, and new large dimension HCSP instances with dimensions 16384<img src="/img/revistas/cleiej/v15n3/3a0933x.png" alt="&times; " class="math">512, 32768<img src="/img/revistas/cleiej/v15n3/3a0934x.png" alt="&times; " class="math">1024, 65536<img src="/img/revistas/cleiej/v15n3/3a0935x.png" alt="&times; " class="math">2048, and 131072<img src="/img/revistas/cleiej/v15n3/3a0936x.png" alt="&times; " class="math">4096, specifically created to evaluate the GPU implementations presented in this work.&nbsp;</font></p>      ]]></body>
<body><![CDATA[<p>   <font face="Verdana" size="2">These dimensions are significantly larger than those of the popular benchmark by Braun et al.&nbsp;<span class="cite">(<a href="#c16">16</a>)</span> and they better model present-day distributed HC and grid systems. The problem instances and the generator program are publicly available to download at <a href="http://www.fing.edu.uy/inco/grupos/cecal/hpc/HCSP" class="url"><span class="cmtt-10">http://www.fing.edu.uy/inco/grupos/cecal/hpc/HCSP</span></a>.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">5.2   </span> <a id="x1-100005.2"></a>Development and execution platform</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">The parallel MinMin heuristics were implemented in C, using the standard <span class="cmtt-10">stdlib </span>library. The experimental analysis was performed on a Dell PowerEdge (QuadCore Xeon E5530 at 2.4 GHz, 48 GB RAM, 8 MB cache), with CentOS Linux 5.4 and four NVidia Tesla C1060 GPUs (240 cores at 1.33 GHz, 4 GB RAM) from the Cluster FING infrastructure, Facultad de Ingenier&iacute;a, Universidad de la Rep&uacute;blica, Uruguay (cluster website <a href="http://www.fing.edu.uy/cluster" class="url"><span class="cmtt-10">http://www.fing.edu.uy/cluster</span></a>).&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">5.3   </span> <a id="x1-110005.3"></a>Experimental results</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">This section reports the results obtained when applying the parallel GPU implementations of the MinMin list scheduling heuristic for the HCSP instances tackled in this article.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">In the experimental evaluation, we study two specific aspects of the proposed parallel MinMin implementations on GPU: </font>      </p>  <ul class="itemize1">   
     <li class="itemize"><font face="Verdana" size="2"><span class="cmti-10">Solution quality</span>: The proposed parallel implementations modify the algorithmic behavior of the MinMin heuristic, so the makespan results obtained with the GPU implementations are not the same as those obtained with the sequential version for the studied HCSP instances. We evaluate the relative gap with respect to the traditional (sequential) MinMin for each method, as defined by Eq.&nbsp;<a href="#x1-11001r2">2</a>, where <img src="/img/revistas/cleiej/v15n3/3a0937x.png" alt="makespanPAR  " class="math"> and <img src="/img/revistas/cleiej/v15n3/3a0938x.png" alt="makespanSEQ  " class="math"> are the makespan values computed using the parallel and the sequential MinMin implementation, respectively. </font>                <table class="equation">        <tbody>          <tr>            <td><font face="Verdana" size="2"><a id="x1-11001r2"></a>                             </font>                             <center class="math-display">      <font face="Verdana" size="2">      <img src="/img/revistas/cleiej/v15n3/3a0939x.png" alt="GAP = (makespanPAR - makespanSEQ) / makespanSEQ" class="math-display"></font></center>            </td>            <td class="equation-label"><font face="Verdana" size="2">(2)</font></td>          </tr>               </tbody>          </table>        <font face="Verdana" size="2">            ]]></body>
<body><![CDATA[<br>            </font>               <p>            </p>    </li>    <li class="itemize"><font face="Verdana" size="2"><span class="cmti-10">Execution times and speedup</span>: We analyze the wall-clock execution times and the speedup for each parallel MinMin implementation with respect to the sequential one. The speedup metric evaluates how much faster a parallel algorithm is than its corresponding sequential version. It is computed as the ratio of the execution times of the sequential algorithm (<img src="/img/revistas/cleiej/v15n3/3a0940x.png" alt="TS  " class="math">) and the parallel version executed on <img src="/img/revistas/cleiej/v15n3/3a0941x.png" alt="m  " class="math"> computing elements (<img src="/img/revistas/cleiej/v15n3/3a0942x.png" alt="Tm  " class="math">) (Equation <a href="#x1-11002r3">3</a>). The ideal case for a parallel algorithm is to achieve linear speedup (<img src="/img/revistas/cleiej/v15n3/3a0943x.png" alt="Sm = m  " class="math">), but the most common situation is to achieve sublinear speedup (<img src="/img/revistas/cleiej/v15n3/3a0944x.png" alt="Sm &lt; m  " class="math">), mainly due to the times required to communicate and synchronize the parallel processes. However, when using GPU infrastructures, very large speedup values have often been reported. 
</font>                <table class="equation">        <tbody>          <tr>            <td><font face="Verdana" size="2"><a id="x1-11002r3"></a>                            </font>                            <center class="math-display">      <font face="Verdana" size="2">      <img src="/img/revistas/cleiej/v15n3/3a0945x.png" alt="      TS- Sm =  Tm      " class="math-display"></font></center>            </td>            <td class="equation-label"><font face="Verdana" size="2">(3)</font></td>          </tr>               </tbody>          </table>        <font face="Verdana" size="2">            <br>           </font>               <p>      </p>    </li>      </ul>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Table&nbsp;<a href="#x1-11003r1">1</a> reports the average execution times (in seconds), the average GAP values and the average speedup for each of the four parallel MinMin implementations on GPU studied, and a comparison with the sequential implementation in CPU. The results in Table&nbsp;<a href="#x1-11003r1">1</a> correspond to the average values for all the HCSP instances solved for each problem dimension studied, and the comparison is performed considering the optimized sequential algorithms using the specialized data representation described in Section&nbsp;<a href="#x1-70004">4</a>. </font>    </p>      <div class="table">  <font face="Verdana" size="2">      <br>  </font>      <p>   </p>  <hr class="float">     ]]></body>
<body><![CDATA[<div class="float"> <font face="Verdana" size="2"> <a id="x1-11003r1"><img src="/img/revistas/cleiej/v15n3/3a09t1.jpg"></a>      <br>    </font>        <div class="caption"><font face="Verdana" size="2"><span class="id">Table&nbsp;1: </span><span class="content">Experimental results for the GPU implementations.</span></font></div>  <font face="Verdana" size="2">      <br>       </font>       </div>  <hr class="endfloat">    </div>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">The results in Table&nbsp;<a href="#x1-11003r1">1</a> show that significant improvements in the execution times of MinMin are obtained when using the GPU implementations for problem instances with more than 8,000 tasks. When solving the low-dimension problem instances, the GPU implementations were unable to outperform the execution times of the sequential MinMin, mainly due to the overhead introduced by thread creation and management, and the CPU-GPU memory movements. However, when solving larger problem instances that model realistic large grid scenarios, significant improvements in the execution times are achieved, especially for the problem instances with dimension 65536<img src="/img/revistas/cleiej/v15n3/3a0947x.png" alt="&times; " class="math">2048 and 131072<img src="/img/revistas/cleiej/v15n3/3a0948x.png" alt="&times; " class="math">4096.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">      <br>  </font>      <p><font face="Verdana" size="2"><a id="x1-11004r5"> <img src="/img/revistas/cleiej/v15n3/3a09f5.jpg"></a>       ]]></body>
<body><![CDATA[<br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;5: </span><span class="content">Speedup for the MinMin GPU implementations.</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">Regarding the computational efficiency, Fig.&nbsp;<a href="#x1-11004r5">5</a> summarizes the speedup values for the GPU implementations for each problem dimension studied.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The evolution of the speedup values in Fig.&nbsp;<a href="#x1-11004r5">5</a> indicates that the four GPU implementations obtained small accelerations for the HCSP instances with dimension smaller than 8192<img src="/img/revistas/cleiej/v15n3/3a0950x.png" alt="&times; " class="math">256. However, as the dimension of the problem instances grows (16384<img src="/img/revistas/cleiej/v15n3/3a0951x.png" alt="&times; " class="math">512, 32768<img src="/img/revistas/cleiej/v15n3/3a0952x.png" alt="&times; " class="math">1024, 65536<img src="/img/revistas/cleiej/v15n3/3a0953x.png" alt="&times; " class="math">2048, and 131072<img src="/img/revistas/cleiej/v15n3/3a0954x.png" alt="&times; " class="math">4096), reasonable speedup values are obtained for the parallel implementations. The best speedup values were computed for the two largest problem dimensions, with a maximum of <span class="cmbx-10">72.05 </span>for the parallel asynchronous MinMin implementation on four GPUs using OpenMP threads.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The four studied MinMin variants in GPU provide different trade-offs between solution quality and required execution time. 
The asynchronous implementations applying domain decomposition using four GPUs (MinMin-4GPU-PT and MinMin-4GPU-OMP) have the largest speedup values, but their solution quality is from 16% to 20% worse than that of the sequential MinMin implementation. Despite the aforementioned reductions in the solution quality, these methods are able to compute the solutions in reduced execution times (i.e. about 10 minutes in the largest scenario studied, when scheduling 131072 tasks on 4096 machines), thus they can be useful to rapidly solve large scheduling scenarios. On the other hand, the parallel synchronous version of MinMin using four GPUs computed exactly the same solution as the sequential MinMin, while improving the execution time by a factor of up to <img src="/img/revistas/cleiej/v15n3/3a0955x.png" alt="22.25&times; " class="math"> for the largest instances tackled in this work.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">These results indicate that the proposed parallel implementations of the MinMin list scheduling heuristic in GPU are accurate and efficient methods for scheduling in large HC and grid infrastructures. All parallel variants provide promising reductions in the execution times when solving large instances of the scheduling problem. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">6   </span> <a id="x1-120006"></a>Conclusions and future work</font></p>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p><font face="Verdana" size="2">This article studied the development of parallel implementations in GPU for a well-known and effective list scheduling heuristic, namely MinMin, for scheduling in heterogeneous computing environments.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The four proposed algorithms were developed using CUDA, following a simple domain decomposition approach that allows scaling up to solve very large dimension problem instances. The experimental evaluation solved HCSP instances with up to 131072 tasks and 4096 machines, a dimension far larger than those previously tackled in the related literature.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The experimental results demonstrated that the parallel implementations of MinMin on GPU provide significant accelerations over the time required by the sequential implementations when solving large instances of the HCSP. On the one hand, the speedup values reached a maximum of <span class="cmbx-10">72.05 </span>for the parallel asynchronous MinMin implementation on four GPUs using OpenMP threads. 
On the other hand, the parallel synchronous version of MinMin using four GPUs computed exactly the same solution as the sequential MinMin, while improving the execution time by a factor of up to <span class="cmbx-10">22.25</span><img src="/img/revistas/cleiej/v15n3/3a0956x.png" alt="&times; " class="math"> for the largest instances tackled in this work.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">These results demonstrate that the parallel MinMin implementations in GPU introduced in this article are accurate and efficient schedulers for HC systems, which allow tackling large scheduling scenarios in reasonable execution times.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The main line for future work relates to improving the proposed GPU implementations, mainly by studying the management of the memory accessed by the threads. In this way, the computational efficiency of the heuristics on GPU can be further improved, allowing the development of even more efficient parallel implementations. Another line for future work is to use these implementations to complement the efficient heuristic local search methods implemented on GPU. We are currently working on these topics.&nbsp;</font></p>      <p>     </p>      <p><font face="Verdana" size="2"><a id="x1-130006"></a>References</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p>     </p>      ]]></body>
<body><![CDATA[<div class="thebibliography">          <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c1"></a>   (<a href="#c1.">1</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>I.&nbsp;Foster and C.&nbsp;Kesselman, <span class="cmti-10">The Grid: Blueprint for a Future Computing Infrastructure</span>.   Morgan     Kaufmann Publishers, 1998.     </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c2"></a>   (<a href="#c2.">2</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>T.&nbsp;Velte, A.&nbsp;Velte, and R.&nbsp;Elsenpeter, <span class="cmti-10">Cloud Computing, A Practical Approach</span>.   New York, NY,     USA: McGraw-Hill, Inc., 2010.     </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c3"></a>   (<a href="#c3.">3</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>H.&nbsp;El-Rewini,  T.&nbsp;Lewis,  and  H.&nbsp;Ali,  <span class="cmti-10">Task  scheduling  in  parallel  and  distributed  systems</span>.     Prentice-Hall, Inc., 1994.     </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c4"></a>   (<a href="#c4.">4</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;Leung,   L.&nbsp;Kelly,   and   J.&nbsp;Anderson,   <span class="cmti-10">Handbook  of  Scheduling:  Algorithms,  Models,  and</span>     <span class="cmti-10">Performance Analysis</span>.   CRC Press, Inc., 2004.     </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c5"></a>   (<a href="#c5.">5</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>M.&nbsp;Eshaghian, <span class="cmti-10">Heterogeneous Computing</span>.   Artech House, 1996.     
</font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c6"></a>   (<a href="#c6.">6)</a><span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>R.&nbsp;Freund, V.&nbsp;Sunderam, A.&nbsp;Gottlieb, K.&nbsp;Hwang, and S.&nbsp;Sahni, &ldquo;Special issue on heterogeneous     processing,&rdquo; <span class="cmti-10">J. Parallel Distrib. Comput.</span>, vol.&nbsp;21, no.&nbsp;3, 1994. </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c7"></a>   (<a href="#c7.">7</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>M.&nbsp;Garey and D.&nbsp;Johnson, <span class="cmti-10">Computers and intractability</span>.   Freeman, 1979.     </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c8"></a>(<a href="#c8.">8</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>D.&nbsp;Kirk and W.&nbsp;Hwu, <span class="cmti-10">Programming Massively Parallel Processors: A Hands-on Approach</span>.  Morgan     Kaufmann, 2010.     </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c9"></a>   (<a href="#c9.">9</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;Owens, M.&nbsp;Houston, D.&nbsp;Luebke, S.&nbsp;Green, J.&nbsp;Stone, and J.&nbsp;Phillips, &ldquo;GPU computing,&rdquo; <span class="cmti-10">Proceedings of the IEEE</span>, vol.&nbsp;96, no.&nbsp;5, pp. 879&ndash;899, May 2008. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c10"></a>  (<a href="#c10.">10</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>Y.&nbsp;Kwok  and  I.&nbsp;Ahmad,  &ldquo;Static  scheduling  algorithms  for  allocating  directed  task  graphs  to     multiprocessors,&rdquo; <span class="cmti-10">ACM Comput. Surv.</span>, vol.&nbsp;31, no.&nbsp;4, pp. 406&ndash;471, 1999. 
</font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c11"></a>  (<a href="#c11.">11</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>F.&nbsp;Berman, G.&nbsp;Fox, and A.&nbsp;Hey, <span class="cmti-10">Grid Computing: Making the Global Infrastructure a Reality</span>.  New     York, NY, USA: John Wiley &amp; Sons, Inc., 2003.     </font>     </p>            ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span class="biblabel"><a name="c12"></a>  (<a href="#c12.">12</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>O.&nbsp;Ibarra  and  C.&nbsp;Kim,  &ldquo;Heuristic  algorithms  for  scheduling  independent  tasks  on  nonidentical     processors,&rdquo; <span class="cmti-10">Journal of the ACM</span>, vol.&nbsp;24, no.&nbsp;2, pp. 280&ndash;289, 1977. </font>      </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c13"></a>  (<a href="#c13.">13</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>R.&nbsp;Fernando, Ed., <span class="cmti-10">GPU gems</span>.   Boston: Addision-Wesley, 2004.     </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c14"></a>  (<a href="#c14.">14)</a><span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>nVidia, &ldquo;CUDA website,&rdquo; Available online <a href="http://www.nvidia.com/object/cuda_home.html" class="url">http://www.nvidia.com/object/cuda_home.html</a>, 2010,     accessed on July 2011. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c15"></a>  (<a href="#c15.">15</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>M.&nbsp;Theys, T.&nbsp;Braun, H.&nbsp;Siegel, A.&nbsp;Maciejewski, and Y.&nbsp;Kwok, &ldquo;Mapping tasks onto distributed     heterogeneous computing systems using a genetic algorithm approach,&rdquo; in <span class="cmti-10">Solutions to parallel and</span>     <span class="cmti-10">distributed computing problems</span>.   New York, USA: Wiley, 2001, pp. 135&ndash;178. 
</font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><span style="font-weight: bold;"><a name="c16"></a></span>  (<a href="#c16.">16</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>T.&nbsp;Braun, H.&nbsp;Siegel, N.&nbsp;Beck, L.&nbsp;B&ouml;l&ouml;ni, M.&nbsp;Maheswaran, A.&nbsp;Reuther, J.&nbsp;Robertson, M.&nbsp;Theys, B.&nbsp;Yao, D.&nbsp;Hensgen, and R.&nbsp;Freund, &ldquo;A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems,&rdquo; <span class="cmti-10">J. Parallel Distrib. Comput.</span>,     vol.&nbsp;61, no.&nbsp;6, pp. 810&ndash;837, 2001. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c17"></a>  (<a href="#c17.">17</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>S.&nbsp;Ali,  H.&nbsp;Siegel,  M.&nbsp;Maheswaran,  S.&nbsp;Ali,  and  D.&nbsp;Hensgen,  &ldquo;Task  execution  time  modeling     for  heterogeneous  computing  systems,&rdquo;  in  <span class="cmti-10">Proc.  of  the  9th  Heterogeneous  Computing  Workshop</span>,     Washington, USA, 2000, p. 185. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c18"></a>  (<a href="#c18.">18)</a><span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>S.&nbsp;Nesmachnow,  &ldquo;A  cellular  multiobjective  evolutionary  algorithm  for  efficient  heterogeneous     computing scheduling,&rdquo; in <span class="cmti-10">EVOLVE 2011, A bridge between Probability, Set Oriented Numerics and</span>     <span class="cmti-10">Evolutionary Computation</span>, 2011. </font> </p>       </div>             ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Foster]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[Kesselman]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<source><![CDATA[The Grid: Blueprint for a Future Computing Infrastructure]]></source>
<year>1998</year>
<publisher-name><![CDATA[Morgan Kaufmann Publishers]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Velte]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Velte]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Elsenpeter]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<source><![CDATA[Cloud Computing: A Practical Approach]]></source>
<year>2010</year>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[McGraw-Hill, Inc.]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[El-Rewini]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Lewis]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Ali]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<source><![CDATA[Task scheduling in parallel and distributed systems]]></source>
<year>1994</year>
<publisher-name><![CDATA[Prentice-Hall, Inc.]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Leung]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Kelly]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Anderson]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Handbook of Scheduling: Algorithms, Models, and Performance Analysis]]></source>
<year>2004</year>
<publisher-name><![CDATA[CRC Press, Inc.]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Eshaghian]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Heterogeneous Computing]]></source>
<year>1996</year>
<publisher-name><![CDATA[Artech House]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Freund]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Sunderam]]></surname>
<given-names><![CDATA[V.]]></given-names>
</name>
<name>
<surname><![CDATA[Gottlieb]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Hwang]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Sahni]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Special issue on heterogeneous processing]]></article-title>
<source><![CDATA[J. Parallel Distrib. Comput.]]></source>
<year>1994</year>
<volume>21</volume>
<numero>3</numero>
<issue>3</issue>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Garey]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Johnson]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Computers and intractability]]></source>
<year>1979</year>
<publisher-name><![CDATA[Freeman]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kirk]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Hwu]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
</person-group>
<source><![CDATA[Programming Massively Parallel Processors: A Hands-on Approach]]></source>
<year>2010</year>
<publisher-name><![CDATA[Morgan Kaufmann]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Owens]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Houston]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Luebke]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Green]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Stone]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Phillips]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[GPU computing]]></article-title>
<source><![CDATA[Proceedings of the IEEE]]></source>
<year>2008</year>
<month>05</month>
<volume>96</volume>
<numero>5</numero>
<issue>5</issue>
<page-range>879-899</page-range></nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kwok]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Ahmad]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Static scheduling algorithms for allocating directed task graphs to multiprocessors]]></article-title>
<source><![CDATA[ACM Comput. Surv.]]></source>
<year>1999</year>
<volume>31</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>406-471</page-range></nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Berman]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Fox]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Hey]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Grid Computing: Making the Global Infrastructure a Reality]]></source>
<year>2003</year>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[John Wiley & Sons, Inc.]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ibarra]]></surname>
<given-names><![CDATA[O.]]></given-names>
</name>
<name>
<surname><![CDATA[Kim]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Heuristic algorithms for scheduling independent tasks on nonidentical processors]]></article-title>
<source><![CDATA[Journal of the ACM]]></source>
<year>1977</year>
<volume>24</volume>
<numero>2</numero>
<issue>2</issue>
<page-range>280-289</page-range></nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Fernando]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<source><![CDATA[GPU gems]]></source>
<year>2004</year>
<publisher-loc><![CDATA[Boston]]></publisher-loc>
<publisher-name><![CDATA[Addison-Wesley]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="">
<collab>nVidia</collab>
<source><![CDATA[CUDA website]]></source>
<year>2010</year>
</nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Theys]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Braun]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Siegel]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Maciejewski]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Kwok]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Mapping tasks onto distributed heterogeneous computing systems using a genetic algorithm approach]]></article-title>
<source><![CDATA[Solutions to parallel and distributed computing problems]]></source>
<year>2001</year>
<publisher-loc><![CDATA[New York ]]></publisher-loc>
<publisher-name><![CDATA[Wiley]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Braun]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Siegel]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Beck]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Bölöni]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Maheswaran]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Reuther]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Robertson]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Theys]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Yao]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Hensgen]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Freund]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems]]></article-title>
<source><![CDATA[J. Parallel Distrib. Comput.]]></source>
<year>2001</year>
<volume>61</volume>
<numero>6</numero>
<issue>6</issue>
<page-range>810-837</page-range></nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ali]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Siegel]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Maheswaran]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Ali]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Hensgen]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Task execution time modeling for heterogeneous computing systems]]></article-title>
<source><![CDATA[Proc. of the 9th Heterogeneous Computing Workshop]]></source>
<year>2000</year>
<conf-name><![CDATA[9th Heterogeneous Computing Workshop]]></conf-name>
<conf-date>2000</conf-date>
<conf-loc>Washington, USA</conf-loc>
</nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Nesmachnow]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A cellular multiobjective evolutionary algorithm for efficient heterogeneous computing scheduling]]></article-title>
<source><![CDATA[]]></source>
<year>2011</year>
<conf-name><![CDATA[EVOLVE 2011, A bridge between Probability, Set Oriented Numerics and Evolutionary Computation]]></conf-name>
<conf-date>2011</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
