<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0717-5000</journal-id>
<journal-title><![CDATA[CLEI Electronic Journal]]></journal-title>
<abbrev-journal-title><![CDATA[CLEIej]]></abbrev-journal-title>
<issn>0717-5000</issn>
<publisher>
<publisher-name><![CDATA[Centro Latinoamericano de Estudios en Informática]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0717-50002012000200004</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Co-design of Compiler and Hardware Techniques to Reduce Program Code Size on a VLIW Processor]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[Eric J.]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Leiss]]></surname>
<given-names><![CDATA[Ernst L]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Texas Instruments]]></institution>
<addr-line><![CDATA[Stafford Texas]]></addr-line>
<country>USA</country>
</aff>
<aff id="A02">
<institution><![CDATA[Department of Computer Science, University of Houston]]></institution>
<addr-line><![CDATA[Texas]]></addr-line>
<country>USA</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>08</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>08</month>
<year>2012</year>
</pub-date>
<volume>15</volume>
<numero>2</numero>
<fpage>2</fpage>
<lpage>2</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_arttext&amp;pid=S0717-50002012000200004&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_abstract&amp;pid=S0717-50002012000200004&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_pdf&amp;pid=S0717-50002012000200004&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[Code size is a primary concern in the embedded computing community. Minimizing physical memory requirements reduces total system cost and improves performance and power efficiency. VLIW processors rely on the compiler to statically encode the ILP in the program before its execution, and because of this, code size is larger relative to other processors. In this paper we describe the co-design of compiler optimizations and processor architecture features that have progressively reduced code size across three generations of a VLIW processor.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[El tamaño del código es la principal preocupación en la comunidad de computación empotrada. Reducir al mínimo los requisitos de memoria física reduce el coste total del sistema y mejora el rendimiento y la eficiencia energética. Los procesadores VLIW confían en que el compilador codifique estáticamente la ILP en el programa antes de su ejecución y, debido a esto, el tamaño del código es más grande en relación a otros procesadores. En este trabajo se describe el co-diseño de las optimizaciones del compilador y las características de la arquitectura del procesador que han reducido progresivamente el tamaño del código a través de tres generaciones de un procesador VLIW.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Instruction level parallelism]]></kwd>
<kwd lng="en"><![CDATA[code compression]]></kwd>
<kwd lng="en"><![CDATA[VLIW]]></kwd>
<kwd lng="en"><![CDATA[ILP]]></kwd>
<kwd lng="es"><![CDATA[paralelismo a nivel de instrucciones]]></kwd>
<kwd lng="es"><![CDATA[compresión de código]]></kwd>
<kwd lng="es"><![CDATA[VLIW]]></kwd>
<kwd lng="es"><![CDATA[ILP]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <div class="maketitle">    <b><font face="Verdana" size="4">Co-design of Compiler and Hardware Techniques to Reduce Program Code Size on a VLIW Processor</font></b>    <div class="author">    <font face="Verdana" size="2"> <span class="cmbx-12">Eric J. Stotzer</span>     <br>          <span class="cmr-12">Texas Instruments,</span>     <br>         <span class="cmr-12">Stafford, Texas, USA</span>     <br>   <span class="cmti-12"><a href="mailto:estotzer@ti.com">estotzer@ti.com</a> </span><br class="and">  <span class="cmbx-12">Ernst L. Leiss</span>     <br>   <span class="cmr-12">Department of Computer Science,</span>     <br>  <span class="cmr-12">University of Houston, Texas, USA</span>     <br>           <span class="cmti-12"><a href="mailto:coscel@cs.uh.edu">coscel@cs.uh.edu</a> </span>   </font></div>  <font face="Verdana" size="2">      <br>   </font>       <div class="date"></div>      </div>           ]]></body>
<body><![CDATA[<div class="abstract">     <div class="center"> <font face="Verdana" size="2">     <br>  </font>      <p> </p>      <div class="minipage">     <div class="center"> <font face="Verdana" size="2">     <br>  </font>      <p> </p>      <p><font face="Verdana" size="2"><span class="cmbx-10">Abstract</span></font></p>  </div>   <font face="Verdana" size="2">       <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2">Code size is a primary concern in the embedded computing community. Minimizing physical memory requirements reduces total system cost and improves performance and power efficiency. VLIW processors rely on the compiler to statically encode the ILP in the program before its execution, and because of this, code size is larger relative to other processors. In this paper we describe the co-design of compiler optimizations and processor architecture features that have progressively reduced code size across three generations of a VLIW processor.&nbsp;</font></p>      <p><font face="Verdana" size="2">Spanish abstract:&nbsp;</font></p>      <p><font face="Verdana" size="2">El tama&ntilde;o del c&oacute;digo es la principal preocupaci&oacute;n en la comunidad de computaci&oacute;n empotrada. Reducir al m&iacute;nimo los requisitos de memoria f&iacute;sica reduce el coste total del sistema y mejora el rendimiento y la eficiencia energ&eacute;tica. Los procesadores VLIW conf&iacute;an en que el compilador codifique est&aacute;ticamente la ILP en el programa antes de su ejecuci&oacute;n y, debido a esto, el tama&ntilde;o del c&oacute;digo es m&aacute;s grande en relaci&oacute;n a otros procesadores. 
En este trabajo se describe el co-dise&ntilde;o de las optimizaciones del compilador y las caracter&iacute;sticas de la arquitectura del procesador que han reducido progresivamente el tama&ntilde;o del c&oacute;digo a trav&eacute;s de tres generaciones de un procesador VLIW.</font></p>  </div>  </div>   </div>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2"><span class="cmbx-10">Keywords: </span>Instruction level parallelism, code compression, VLIW, ILP&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="cmbx-10">Spanish keywords: </span> paralelismo a nivel de instrucciones, compresi&oacute;n de c&oacute;digo, VLIW, ILP&nbsp;</font></p>      <p> <font face="Verdana" size="2">Received 2011-12-15, Revised 2012-05-16, Accepted 2012-05-16 </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">1   </span> <a id="x1-10001"></a>Introduction</font></p>   <font face="Verdana" size="2">       <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2">VLIW processors are well-suited for high performance embedded applications, which are characterized by mathematically oriented loop kernels and abundant ILP. In contrast to superscalar processors, which have dedicated hardware to dynamically find ILP at run-time, VLIW architectures rely completely on the compiler to find ILP before program execution. The compiler can, in many cases, exploit ILP better than hardware, and the saved silicon space can be used to reduce cost, save power, or add more functional units <span class="cite">(<a href="#c1">1</a>)</span><a name="c1."></a>. Therefore, it is critical that a VLIW processor be a good compiler target.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Because ILP must be explicitly expressed in the program code, VLIW compiler optimizations often replicate instructions, increasing code size. While code size is a secondary concern in the computing community overall, it can be significant in the embedded community. Minimizing the amount of physical memory reduces total system cost. Reducing code size improves system performance by allowing space for more code in on-chip memory and program caches. Code size reduction improves power efficiency, because it reduces the energy required to fetch instructions from memory <span class="cite">(<a href="#c2">2</a>,&nbsp;<a href="#c1">1</a>,&nbsp;<a href="#c3">3</a>)</span><a name="c2."></a><a name="c3."></a>.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">1.1   </span> <a id="x1-20001.1"></a>The C6X Processor Family</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Figure&nbsp;<a href="#x1-20011">1</a> is a block diagram of the C6X processor. The first generation C6X (C6X-1) processors are the TMS320C62 (C62) and TMS320C67 (C67). 
The C6X-1 is a fully pipelined VLIW processor, which allows eight new instructions to be issued per cycle. All instructions can be optionally guarded by a static predicate. If an instruction&rsquo;s predicate operand evaluates to false, then the results of the instruction are annulled. The C62 provides a foundation of integer instructions. It has 32 static general-purpose registers, partitioned into two register files. A small subset of the registers may be used as predicates. Load instructions have four delay slots, multiplies have one delay slot, and branches have five delay slots. Other instructions have no delay slots. The C67 adds floating point instructions.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f1.png">     <br>   </font>   </p>      ]]></body>
<body><![CDATA[<div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;1: </span><span class="content">TMS320C6000 architecture block diagram</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">The second generation C6X (C6X-2) processor is the TMS320C64 (C64), which builds on the C62 by removing scheduling restrictions on existing instructions and providing additional instructions for SIMD packed-data processing. The C6X-2 processors increase the size of the register file by providing an additional 32 static general-purpose registers.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The third and latest generation C6X (C6X-3) processors are the TMS320C64+ (C64+) and the TMS320C674 (C674). The C6X-3 doubled the number of multipliers and added new architecture features to improve code size and software-pipelined loop performance. The C674 includes the C67 floating point instructions.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The C6X processors are supported by an optimizing compiler <span class="cite">(<a href="#c4">4</a>)<a name="c4."></a></span>. The structure and operations of a compiler are well documented <span class="cite">(<a href="#c5">5</a>,&nbsp;<a href="#c6">6</a>,&nbsp;<a href="#c7">7</a>,&nbsp;<a href="#c8">8</a>)</span><a name="c5."></a><a name="c6."></a><a name="c7."></a><a name="c8."></a>. The compiler implements important optimization phases such as function inlining, loop nest optimization, data dependence analysis, software pipelining, and many more. The compiler is absolutely critical for exploiting ILP. 
</font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">1.2   </span> <a id="x1-30001.2"></a>Encoding Wide Instructions</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Each instruction on the C6X-1 processors is 32-bit. Instructions are fetched eight at a time from program memory in bundles called <span class="cmti-10">fetch packets</span>. Fetch packets are aligned on 256-bit (8-word) boundaries. The C6X-1 processors can execute from one to eight instructions in parallel. Parallel instructions are bundled together into an <span class="cmti-10">execute packet</span>. As fetch packets are read from program memory, the instruction dispatch logic extracts execute packets from the fetch packets. All of the instructions in an execute packet execute in parallel. Each instruction in an execute packet must use a different functional unit.&nbsp;</font></p>      ]]></body>
<body><![CDATA[<p>   <font face="Verdana" size="2">The execute packet boundary is determined by a bit in each instruction called the <span class="cmti-10">parallel-bit </span>(or <span class="cmti-10">p-bit</span>). The p-bit (bit 0) controls whether the next instruction executes in parallel. The p-bits are scanned from lower to higher addresses. If the p-bit of instruction <img src="/img/revistas/cleiej/v15n2/2a040x.png" alt="i  " class="math"> is <img src="/img/revistas/cleiej/v15n2/2a041x.png" alt="1  " class="math">, then instruction <img src="/img/revistas/cleiej/v15n2/2a042x.png" alt="i+ 1  " class="math"> is part of the same execute packet as instruction <img src="/img/revistas/cleiej/v15n2/2a043x.png" alt="i  " class="math">. If the p-bit of instruction <img src="/img/revistas/cleiej/v15n2/2a044x.png" alt="i  " class="math"> is <img src="/img/revistas/cleiej/v15n2/2a045x.png" alt="0  " class="math">, then instruction <img src="/img/revistas/cleiej/v15n2/2a046x.png" alt="i+ 1  " class="math"> is part of the next execute packet.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Figure&nbsp;<a href="#x1-30012">2</a> shows three p-bit patterns for fetch packets, which result in the following execution sequences for the eight instructions: fully serial, fully parallel, and partially serial. The least significant bits (LSBs) of the program memory address show how the fetch packets are laid out in memory. Each instruction with a p-bit set to <img src="/img/revistas/cleiej/v15n2/2a047x.png" alt="0  " class="math"> marks the end of an execute packet. The fully serial fetch packet has eight execute packets, each made up of one instruction. The fully parallel fetch packet has one execute packet made up of eight instructions, which will execute in parallel. 
The partially serial fetch packet has four execute packets.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f2.png" alt="PIC">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;2: </span><span class="content">Instruction fetch packet layout showing p-bits</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      ]]></body>
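The dispatch rule above can be sketched in a few lines of Python. This is an illustration of the p-bit scanning rule only, not TI's actual dispatch logic; the word values used in the examples are placeholders whose only meaningful bit is bit 0:

```python
# Illustrative sketch of C6X-1 execute-packet extraction: scan the p-bit
# (bit 0) of each of the eight 32-bit words in a fetch packet, from lower
# to higher addresses. A clear p-bit ends the current execute packet.
def extract_execute_packets(fetch_packet):
    assert len(fetch_packet) == 8, "a fetch packet holds eight words"
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:       # p-bit is 0: next word starts a new packet
            packets.append(current)
            current = []
    assert not current          # on C6X-1 the last p-bit is always 0
    return packets

# Fully serial: eight one-instruction execute packets.
print(len(extract_execute_packets([0x0] * 8)))          # 8
# Fully parallel: a single eight-instruction execute packet.
print(len(extract_execute_packets([0x1] * 7 + [0x0])))  # 1
```

A partially serial pattern simply yields an intermediate grouping, mirroring the three layouts of Figure 2.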
<body><![CDATA[<p>   <font face="Verdana" size="2">On the C6X-1 processors, execute packets cannot span a fetch packet boundary. Therefore, the last p-bit in a fetch packet is always set to <img src="/img/revistas/cleiej/v15n2/2a048x.png" alt="0  " class="math">, and each fetch packet starts a new execute packet. Execute packets are <span class="cmti-10">padded </span>with explicit parallel NOP instructions to prevent subsequent execute packets from spanning a fetch packet boundary. On the C6X processor, NOP instructions may execute on any of the eight functional units. Figure&nbsp;<a href="#x1-30023">3</a> shows how parallel NOP instructions are used to align spanning execute packets. Instructions in <span class="obeylines-h"><span class="verb"><span class="cmtt-10">{&nbsp;}</span></span></span> are part of the same execute packet and, therefore, will execute in parallel.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f3.jpg" alt="PIC">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;3: </span><span class="content">Example of NOP padding to prevent a spanning execute packet</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">Except for a few special case instructions such as the NOP, each instruction has a predicate encoded in the first four bits. Figure <a href="#x1-30034">4</a> is a generalization of the C6X 32-bit three operand instruction encoding format. The predicate register is encoded in the condition (creg) field, and the z-bit encodes the true or not-true sense of the predicate. The dst, src2, and src1 fields encode operands. 
The x-bit encodes whether src2 is read from the opposite cluster&rsquo;s register file. The op field encodes the operation and functional unit, and the s-bit specifies the cluster that the instruction executes on.&nbsp;</font></p>      ]]></body>
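As a toy illustration of reading such fields, the bit positions below are assumptions chosen for the example (the text fixes only that the predicate occupies the first four bits: the creg field plus the z-bit); they are not the documented C6X encoding:

```python
# Hypothetical field layout, for illustration only: assume a 3-bit creg in
# bits 31..29 and the z-bit in bit 28 of a 32-bit instruction word.
def decode_predicate(word):
    creg = (word >> 29) & 0x7   # which register guards the instruction (0 = none)
    z = (word >> 28) & 0x1      # predicate sense: 0 = true, 1 = not-true
    return creg, z

def p_bit(word):
    return word & 0x1           # bit 0: does the next instruction issue in parallel?
```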
<body><![CDATA[<p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f4.png" alt="PIC">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;4: </span><span class="content">Typical 32-bit instruction encoding format</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">NOP instructions occur frequently in VLIW programs and as a result increase code size. Often NOP instructions are executed for multiple sequential cycles. The C6X processors include a multi-cycle NOP for encoding a sequence of NOP instructions. Figure&nbsp;<a href="#x1-40015">5</a> shows how four sequential NOP instructions are encoded as one multi-cycle NOP 4. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">2   </span> <a id="x1-40002"></a>NOP Compression</font></p>   <font face="Verdana" size="2">       ]]></body>
<body><![CDATA[<br>  </font>      <p><font face="Verdana" size="2">While the fetch-execute packet encoding scheme and multi-cycle NOP instruction improved the code size of the C6X-1 processors relative to previous VLIWs, embedded applications required further code size reductions. To this end, it was proposed that the C6X-2 processors allow execute packets to span fetch packet boundaries with some minimal restrictions, thus reducing code size by eliminating the need for padding NOP instructions. Further, we proposed new instructions, similar to the multi-cycle NOP, that remove <span class="cmti-10">pipeline </span>NOP instructions <span class="cite">(<a href="#c9">9</a>,&nbsp;<a href="#c10">10</a>,&nbsp;<a href="#c11">11</a>)</span><a name="c9."></a><a name="c10."></a><a name="c11."></a>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Because VLIW processors have exposed pipelines, NOP instructions are inserted to compensate for instruction latencies. A latency is the number of cycles it takes for the effect of an instruction to complete. Instruction scheduling is used to <span class="cmti-10">fill </span>the latency or <span class="cmti-10">delay slots </span>with other useful instructions. Assuming that other instructions are unavailable for execution during the instruction latency, explicit pipeline NOPs are inserted after the instruction issues to maintain correct program execution. On the C6X processor, the load (LD) and branch (B) instructions have five and six cycle latencies, respectively. 
In Figure&nbsp;<a href="#x1-40015">5</a>, the delay slots of load and branch instructions are filled with NOP instructions.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f5.jpg" alt="PIC">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;5:  </span><span class="content">Example  of  pipeline  NOP  instructions  in  the  delay  slots  of  the  load  (LD)  and  branch  (B) instructions</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     ]]></body>
<body><![CDATA[<br>  </font>      <p>   <font face="Verdana" size="2">The breakdown of pipeline and padding NOP instructions occurring in a set of embedded applications is shown in Figure&nbsp;<a href="#x1-40026">6</a>. These data show that 8.8% and 6.1% of all instructions are pipeline and padding NOPs, respectively. The control-oriented applications had more pipeline NOP instructions, and the loop-oriented applications had more padding NOP instructions. Loop-oriented code with high degrees of ILP contains more padding NOP instructions, because execute packets tend to be larger in loop code, thus increasing the likelihood of spanning execute packets. The opposite occurs in control-oriented code with lower degrees of ILP, because execute packets are smaller and, therefore, pack more efficiently into fetch packets. Because it has less ILP and is characterized by short test-and-branch sequences, pipeline NOPs occur more often in control-oriented code.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f6.jpg">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;6: </span><span class="content">Percentage of pipeline NOPs, and padding NOPs in a set of embedded applications compiled for the C6X-1 processors</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      ]]></body>
<body><![CDATA[<p>   <font face="Verdana" size="2">We proposed a new instruction format that implements a variable delay operation <span class="cite">(<a href="#c10">10</a>,&nbsp;<a href="#c12">12</a>)</span><a name="c12."></a>, which in effect encodes subsequent NOP instructions as an operand. For example, because of their long latency, branch instructions are often followed by a multi-cycle NOP instruction. The <span class="cmti-10">Branch with NOP </span>(BNOP) instruction encodes the subsequent NOP instructions as an operand (see Figure&nbsp;<a href="#x1-40037">7</a>). The effect is that the NOP is issued in parallel with the instruction requiring the latency. The NOP operand ranges from zero to the maximum latency of the instruction.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f7.jpg">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;7: </span><span class="content">Example using the branch with parallel NOP instruction</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">We found that this new instruction format reduced average code size by 6%. Control-oriented code, which contains more branches, saw a larger improvement. Loop-oriented code benefited more from eliminating the restrictions on spanning execute packets. </font>    </p>      ]]></body>
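The rewrite can be pictured as a peephole pass over a linear instruction list. The tuple representation and mnemonics below are hypothetical stand-ins for the compiler's internal form, used only to show the fold of a branch and its delay-slot NOPs into one BNOP:

```python
# Hypothetical peephole sketch: fold "B target" followed by "NOP n"
# (n in the branch's 1..5 delay-slot range) into a single "BNOP target, n".
def compress_branch_nops(instructions):
    out, i = [], 0
    while i < len(instructions):
        op, arg = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if op == "B" and nxt and nxt[0] == "NOP" and 1 <= nxt[1] <= 5:
            out.append(("BNOP", (arg, nxt[1])))  # NOP count rides in the branch
            i += 2                               # consume both instructions
        else:
            out.append((op, arg))
            i += 1
    return out

# A branch plus its multi-cycle delay-slot NOP shrink from two words to one.
print(compress_branch_nops([("LD", "*A0"), ("B", "loop"), ("NOP", 5)]))
```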
<body><![CDATA[<p><font face="Verdana" size="2"><span class="titlemark">3   </span> <a id="x1-50003"></a>Software-pipelined Loop Collapsing</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Software pipelining is a powerful loop-based transformation that exploits the ILP across the iterations of a loop. Modulo scheduling, an algorithm for implementing software pipelining, takes the N instructions in a loop and forms an M-stage pipeline as if a vector functional unit were being specifically designed to execute the loop body.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Modulo scheduling is motivated by the development of pipelined hardware functional units. The initiation interval (II) is the rate at which new loop iterations are started <span class="cite">(<a href="#c13">13</a>,&nbsp;<a href="#c14">14</a>)<a name="c13."></a><a name="c14."></a></span>. The total schedule length (TL) is the number of cycles to complete one loop iteration. The schedule for a single iteration is divided into a sequence of stages, each with a length of II. The number of stages is SC = <img src="/img/revistas/cleiej/v15n2/2a049x.png" alt="&lceil;T L&#8725;II&rceil; " class="math">. In the steady state of the execution of the software-pipelined loop, each of the stages will execute in parallel. The instruction schedule for a software-pipelined loop has three components: a <span class="cmti-10">prolog</span>, a <span class="cmti-10">kernel</span>, and an <span class="cmti-10">epilog</span>, as shown in Figure&nbsp;<a href="#x1-50018">8</a>. The kernel is the instruction schedule that will execute the steady state. In the kernel, an instruction scheduled at cycle <img src="/img/revistas/cleiej/v15n2/2a0410x.png" alt="k  " class="math"> will execute in parallel with all instructions scheduled at cycle <img src="/img/revistas/cleiej/v15n2/2a0411x.png" alt="k mod II  " class="math">. 
This is known as the <span class="cmti-10">modulo constraint </span>and is the source of the term <span class="cmti-10">modulo</span> <span class="cmti-10">scheduling</span>.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f8.png">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;8: </span><span class="content">Execution of a modulo scheduled loop</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     ]]></body>
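The bookkeeping in this paragraph reduces to two small formulas; the sketch below just restates them in Python (the names SC, TL, and II follow the text):

```python
import math

# Stage count: SC = ceil(TL / II), where TL is the schedule length of one
# iteration and II is the initiation interval.
def stage_count(TL, II):
    return math.ceil(TL / II)

# Modulo constraint: an instruction scheduled at cycle k occupies kernel
# slot k mod II, so it executes in parallel with every instruction whose
# cycle maps to the same slot.
def kernel_slot(k, II):
    return k % II

# Example used later in the text: TL = 3, II = 1 gives a 3-stage pipeline,
# and all three instructions share the single kernel slot 0.
print(stage_count(3, 1), sorted({kernel_slot(k, 1) for k in range(3)}))
```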
<body><![CDATA[<p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">While software pipelining positively affects performance, it negatively impacts code size. As an example, assume a simple loop that consists of three generalized instructions (<span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins1</span></span></span>, <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins2</span></span></span>, and <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins3</span></span></span>), a decrement, and a conditional branch back to the beginning. Suppose <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins2</span></span></span> depends on the result of <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins1</span></span></span>, and <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins3</span></span></span> on the result of <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins2</span></span></span>. In the absence of software pipelining, a possible schedule for this code on a VLIW processor is shown in Figure&nbsp;<a href="#x1-50029">9</a>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The <span class="obeylines-h"><span class="verb"><span class="cmtt-10">||</span></span></span> operator denotes instructions that execute in parallel. The operand <span class="obeylines-h"><span class="verb"><span class="cmtt-10">[n]</span></span></span> is a predicate that guards the branch. When the value of the register <span class="obeylines-h"><span class="verb"><span class="cmtt-10">n</span></span></span> is non-zero, the branch is taken. When <span class="obeylines-h"><span class="verb"><span class="cmtt-10">n</span></span></span> is zero, the instruction is nullified. 
For simplicity, all instructions are assumed to be single-cycle with no delay slots.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f9.jpg">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;9: </span><span class="content">Example of a generalized loop scheduled without software pipelining</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      ]]></body>
<body><![CDATA[<p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">In the above schedule, very little parallelism has been exploited because <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins1</span></span></span>, <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins2</span></span></span>, and <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins3</span></span></span> must execute in order within the given loop iteration. Thus, each loop iteration takes three cycles to complete. Software pipelining improves performance by overlapping multiple consecutive iterations of the loop. For example, a pipelined version of the loop is shown in Figure&nbsp;<a href="#x1-500310">10</a>. Although the first iteration requires <img src="/img/revistas/cleiej/v15n2/2a0412x.png" alt="T L = 3  " class="math"> cycles to complete, all successive iterations complete at a rate of <img src="/img/revistas/cleiej/v15n2/2a0413x.png" alt="II = 1  " class="math"> iteration per cycle. Thus, software pipelining has transformed the three-cycle loop into a one-cycle loop.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f10.jpg">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;10: </span><span class="content">Generalized software-pipelined loop</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     ]]></body>
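The performance claim can be checked with simple cycle arithmetic under the paragraph's assumptions (single-cycle instructions, no delay slots); n, TL, and II below follow the text:

```python
# Unpipelined: each of the n iterations costs TL = 3 cycles.
def cycles_unpipelined(n, TL=3):
    return n * TL

# Software-pipelined: the first iteration costs TL cycles, and each later
# iteration completes II = 1 cycle after the previous one.
def cycles_pipelined(n, TL=3, II=1):
    return TL + (n - 1) * II

# 100 iterations: 300 cycles unpipelined vs 102 pipelined.
print(cycles_unpipelined(100), cycles_pipelined(100))
```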
<body><![CDATA[<br>  </font>      <p>   <font face="Verdana" size="2">The two most common causes of code growth associated with software pipelining are the basic replication of loop iterations and compensation code. </font>      </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2">Instruction replication: As can be seen in Figure&nbsp;<a href="#x1-500310">10</a>, the final code size is roughly <img src="/img/revistas/cleiej/v15n2/2a0414x.png" alt="SC = 3  " class="math"> times the      original code size. The kernel is roughly the size of the original loop. The prolog and epilog combined      are roughly <img src="/img/revistas/cleiej/v15n2/2a0415x.png" alt="((SC - 1)*T L)  " class="math">.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2">Compensation code: For a loop to be eligible for software pipelining, the loop must execute at least      SC iterations. Recall that SC is the number of iterations that are concurrently executing during the      kernel. For example, in Figure&nbsp;<a href="#x1-500310">10</a> the loop will always execute three or more iterations. Therefore,      the pipelined version is only safe when the original trip count is at least three. When the compiler is      unable to determine that the trip count is large enough, it must either suppress software pipelining      or generate compensation code: two versions of the loop (pipelined and non-pipelined) and a run-time      check to choose between them. However, this clearly increases code size.      </font>      </li>      </ul>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">Software-pipelined loop collapsing is a compile-time technique that <span class="cmti-10">folds </span>the prolog and epilog stages into the kernel. 
In the software-pipelined loop in Figure&nbsp;<a href="#x1-500310">10</a>, observe that the only difference between the kernel and the first stage of the epilog (ignoring loop control instructions) is that <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins1</span></span></span> is executed in the kernel but not in the epilog. Suppose it were safe to speculatively execute (meaning that it would not cause incorrect program results) <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins1</span></span></span> one extra time. Then the kernel could execute one extra time and skip the first stage of the epilog as shown in Figure&nbsp;<a href="#x1-500411">11</a>.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f11.jpg">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;11: </span><span class="content">Example of software-pipelined loop with one epilog stage collapsed</span></font></div>  <font face="Verdana" size="2">      ]]></body>
<body><![CDATA[<br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">The first epilog stage is now collapsed back into the kernel, replacing an epilog stage with an extra iteration of the kernel. Consequently, the trip counter is incremented by one before the loop, so that the pipelined loop executes the same number of iterations (produces the same number of results) as the original loop.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Once an epilog stage has been collapsed, the minimum number of iterations that will be completely executed (the shortest path through the loop) is reduced by one, from three to two. Although a third iteration is started, only <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins1</span></span></span> from this iteration is executed. There is no harm in executing this instruction an extra time. Thus, this loop can be safely executed whenever the original trip count is at least two (as opposed to three).&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The same process is now applied to epilog stage 2 as shown in Figure&nbsp;<a href="#x1-500512">12</a>. In this case, however, ignoring loop control instructions, there are two instructions that do not execute during epilog stage 2: <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins1</span></span></span> and <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins2</span></span></span>. Assume the compiler determined that <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins1</span></span></span> could be safely speculatively executed a second time, but <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins2</span></span></span> could not. 
Therefore, to collapse stage 2, a predicate is placed on <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins2</span></span></span> to guard against over-execution. Observe that, before loop execution begins, the new predicate register is initialized to one less than the trip counter, so that <span class="obeylines-h"><span class="verb"><span class="cmtt-10">ins2</span></span></span> is not executed during the last iteration of the kernel.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f12.jpg">     <br>   </font>   </p>      ]]></body>
<body><![CDATA[<div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;12: </span><span class="content">Example of a software-pipelined loop with all epilog stages collapsed</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">In this example, the cost of eliminating the second epilog stage (the addition of two instructions) outweighs the benefit (eliminating one instruction). In practice, however, epilog stages are usually much larger. The pipelined version of the loop, with the fully collapsed epilog, is now safe for all trip counts greater than zero. The shortest path through the code now computes only one full iteration of the loop. Therefore, if the compiler did not have any information about the trip counter, it would have been worth collapsing the last epilog stage to eliminate the need for compensation code. Prologs are collapsed in the same way, except that it must be safe to over-execute an instruction before the loop rather than afterwards.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Using 72 loop kernels, we showed that collapsing decreased loop code size by over 30% <span class="cite">(<a href="#c15">15</a>)</span><a name="c15."></a>. Greater benefit was derived from epilog collapsing than from prolog collapsing. In most cases, epilogs can be completely collapsed using at most one predicate register. Prologs frequently cannot be completely collapsed and often require a predicate register per collapsed stage.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">In general, collapsing <img src="/img/revistas/cleiej/v15n2/2a0416x.png" alt="SC  - 1  " class="math"> stages through some combination of epilog and prolog collapsing obviates the need for compensation code. 
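The bookkeeping behind full collapsing can be modeled in a few lines. In the following sketch (a toy model with hypothetical counters, not the C6X compiler's implementation), ins1 is speculated freely, ins2 is guarded by a predicate register initialized to one less than the adjusted trip counter, and ins3, which produces the loop's results, is never over-executed:

```python
# A toy model of the fully collapsed three-stage pipeline: ins1 may be
# speculated harmlessly, ins2 carries a predicate register so it cannot
# over-execute, and ins3 runs only for completed iterations.

def run_collapsed(trip):
    """Run `trip` logical iterations; return per-instruction execution counts."""
    assert trip >= 1                  # safe for all trip counts greater than zero
    counts = {"ins1": 0, "ins2": 0, "ins3": 0}
    kernel_iters = trip + 2           # trip counter bumped once per collapsed epilog stage
    pred = kernel_iters - 1           # "one less than the trip counter"
    for t in range(1, kernel_iters + 1):
        counts["ins1"] += 1           # stage 1: harmlessly speculated at the end
        if pred > 0:                  # stage 2: predicated copy
            counts["ins2"] += 1
        pred -= 1
        if t >= 3:                    # stage 3: runs only for completed iterations
            counts["ins3"] += 1
    return counts

# ins3 executes exactly `trip` times, while ins1 over-executes twice and
# ins2 once: the price of eliminating both epilog stages.
```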
Thus, collapsing becomes a very important optimization when loop trip counts are not available at compile-time. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">4   </span> <a id="x1-60004"></a>Modulo Loop Buffer</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">A hardware loop buffer is a program cache specialized to hold a loop body. The motivation for building hardware loop buffers is to reduce power consumption and in some cases improve performance <span class="cite">(<a href="#c16">16</a>)</span><a name="c16."></a>. Typically, a loop buffer is small compared to a full-blown program cache and is placed close to a processor&rsquo;s execution units. Therefore, instructions executed from the loop buffer require less power, since the processor does not need to enable the memory system and fetch logic. In addition, the instructions may be stored in a decoded format, bypassing the processor&rsquo;s decode logic. A zero-overhead loop buffer additionally eliminates the need for an explicit branch instruction in the program code; the loop buffer performs the branch automatically. The loop body is demarcated by special instructions.&nbsp;</font></p>      ]]></body>
<body><![CDATA[<p>   <font face="Verdana" size="2">We proposed a loop buffer specialized for software-pipelined loops that specifically addresses the following areas <span class="cite">(<a href="#c17">17</a>)</span><a name="c17."></a>: </font>      </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2">Code size: Replicated instructions in the software-pipelined loop prolog and epilog instruction schedule      significantly increase code size. </font>      </li>        <li class="itemize"><font face="Verdana" size="2">Compensation code: If the bounds of the loop trip counter are unknown, the compiler must generate      additional compensation code, which increases code size and often decreases performance.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2">Fetch/decode power: Fetching from program memory and decoding the replicated instructions in the      software-pipelined loop prolog and epilog requires additional power. </font>      </li>        <li class="itemize"><font face="Verdana" size="2">Instruction speculation: The compiler speculates instructions as part of the process of collapsing      software-pipelined loops. Because the results of these instructions are never used and many are memory      operations, they waste power and have side effects in the memory system that can degrade performance.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2">Interrupt latency: The latency increases when interrupts are disabled around software-pipelined loops.      </font>      </li>      </ul>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">A modulo loop buffer (MLB)&nbsp;<span class="cite">(<a href="#c18">18)</a><a name="c18."></a></span> is a hardware loop buffer that is specialized to exploit the regular pattern that occurs in software-pipelined loops. 
On the C6X-3 processor, the <span class="cmti-10">SPLOOP </span>instruction marks the beginning of a loop body that is to execute from the MLB. The operands of the <span class="cmti-10">SPLOOP </span>instruction encode the II. The <span class="cmti-10">SPKERNEL </span>instruction marks the end of the loop body instructions. In Figure&nbsp;<a href="#x1-600113">13</a> A, B, C, D, and E are the instructions of a single loop iteration with an II of 1. <span class="cmti-10">SPLOOP </span>and <span class="cmti-10">SPKERNEL </span>demarcate the loop body.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f13.jpg">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;13: </span><span class="content">Generalization of the modulo loop buffer code layout</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     ]]></body>
<body><![CDATA[<p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">The MLB records and executes future loop iterations and eliminates replicated instructions from the software-pipelined loop prolog and epilog. Figure&nbsp;<a href="#x1-600214">14</a> shows how the loop body is executed. <span class="obeylines-h"><span class="verb"><span class="cmtt-10">A:n</span></span></span> through <span class="obeylines-h"><span class="verb"><span class="cmtt-10">E:n</span></span></span> are the stages of the nth iteration in the software-pipelined loop. The <span class="cmti-10">Instr Fetch </span>column shows the loop body instructions fetched from program memory and stored in the MLB. After <span class="cmti-10">SPKERNEL </span>is encountered, instruction fetch from program memory is disabled; instructions are fetched only from the MLB until the loop completes. The <span class="cmti-10">Loop Buffer</span> <span class="cmti-10">Fetch </span>column shows the fetch of instructions from the MLB, and the <span class="cmti-10">Execute </span>column shows the instructions that are actually executed.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f14.jpg">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;14: </span><span class="content">Execution of a software-pipelined loop using the modulo loop buffer</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     ]]></body>
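The fetch behavior of Figure 14 can be approximated with a small model (illustrative Python; the function and its data layout are assumptions, not the C6X-3 hardware):

```python
# A toy model of MLB execution for a loop body A..E with II = 1: each
# instruction is fetched from program memory exactly once while the
# buffer fills, the kernel then replays entirely out of the buffer, and
# the epilog disables instructions in the order they were inserted.

def mlb_trace(body, trip):
    """body: one instruction per II-cycle stage. Returns (program-memory
    fetches, per-cycle lists of instructions actually executed)."""
    sc = len(body)                          # stage count
    mem_fetches = list(body)                # one program-memory fetch per instruction
    trace = []
    for cycle in range(trip + sc - 1):      # prolog + kernel + epilog
        # Stage s executes for iteration (cycle - s) while that iteration is live.
        trace.append([body[s] for s in range(sc) if 0 <= cycle - s < trip])
    return mem_fetches, trace

mem, trace = mlb_trace(["A", "B", "C", "D", "E"], trip=8)
# len(mem) == 5 regardless of trip count; during the kernel a full
# packet executes each cycle (trace[4] == ['A', 'B', 'C', 'D', 'E']),
# and the epilog drains down to trace[-1] == ['E'].
```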
<body><![CDATA[<br>  </font>      <p>   <font face="Verdana" size="2">The loop body is a single-iteration, modulo-scheduled software-pipelined loop. TL (total schedule length) is the length of the loop body in cycles starting with the cycle after the SPLOOP instruction and ending with the cycle at the SPKERNEL instruction. It consists of <img src="/img/revistas/cleiej/v15n2/2a0417x.png" alt="SC = &lceil;TL &#8725;II&rceil; " class="math"> stages of II cycles each. The execution of the prolog, kernel, and epilog is generated from copies of this single iteration, time-shifted by multiples of II cycles and overlapped for simultaneous execution (see Figure&nbsp;<a href="#x1-600214">14</a>).&nbsp;</font></p>      <p>   <font face="Verdana" size="2">As the instructions in the loop body are fetched and executed, they are stored in the loop buffer along with their insertion order. By the time the entire loop body has been inserted into the loop buffer, the loop kernel is present and can execute entirely from there. When the software-pipelined loop enters the epilog, the loop buffer disables the execution of instructions in the order that they were inserted.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Clearly, the MLB reduces code size and improves power efficiency by eliminating the overlapped copies of the instructions in the loop body. Unlike software-pipelined loop collapsing, the MLB reduces code size without requiring instruction speculation. This improves power efficiency by eliminating the fetch, decode, and execution of unused speculated instructions. The MLB on the C6X-3 processors implements other capabilities such as the ability to overlap pre- and post-loop instructions with the execution of the prolog and epilog, support for nested loops, and an early-exit feature that eliminates compensation code. 
In addition, the MLB improves interrupt latency.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">There are two limits in the implementation of the loop buffer: the total size of the loop buffer and the maximum loop body length. The size is the number of execute packets in the kernel; therefore, this limits the maximum II for a software-pipelined loop in the MLB. The maximum loop body length in cycles (maximum TL) sets the sizes of internal bookkeeping information built during the prolog and used during the epilog to pipe down the loop.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Figure&nbsp;<a href="#x1-600315">15</a> illustrates the potential benefits of the MLB using several performance parameters on a small set of DSP and multi-media applications including gsmefr, g.723.1, g.729, JPEG, 95 loop kernels, and Reed-Solomon. These results are an upper bound since they assume that all software-pipelined loops can fit in the loop buffer.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f15.png">     <br>   </font>   </p>      ]]></body>
<body><![CDATA[<div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;15: </span><span class="content">Average improvement factor of several parameters when using a modulo loop buffer</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">The <span class="cmti-10">program size </span>parameter measures relative code-size reduction in the entire program, and the <span class="cmti-10">loop size </span>parameter measures relative code-size reduction in the software-pipelined loops. If every software-pipelined loop in the benchmark applications used the MLB, the average total program and software-pipelined loop code size would be reduced by 17% and 51%, respectively. Because loops are typically executed more frequently than other code, minimizing loop size improves the utilization of on-chip memories and program caches.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The <span class="cmti-10">instructions fetched/decoded </span>parameter measures how many instructions are fetched and decoded from program memory. Instructions executed out of the loop buffer are not included, since they are not fetched from program memory. This is the source of most of the power reduction attributable to the MLB facility. This parameter shows a potential to reduce fetch and decode activity by 83%.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The <span class="cmti-10">instructions executed </span>parameter measures the number of instructions executed. Because the MLB eliminates the speculative over-execution of collapsed software-pipelined loop prolog and epilog instructions, the MLB decreases the number of instructions executed by 8%.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The speculative over-execution of memory access instructions can pollute the data memory cache. 
Because cache activity occurs for data items that are never used, data cache performance and power efficiency are negatively impacted. The <span class="cmti-10">ld/st instructions executed </span>parameter measures the number of load and store instructions executed only, which are reduced by 14%. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">5   </span> <a id="x1-70005"></a>Variable Length Instructions</font></p>   <font face="Verdana" size="2">       <br>  </font>      ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2">This section describes the variable length instruction set extensions we designed in the C6X-3 processors <span class="cite">(<a href="#c3">3</a>,<a href="#c19">&nbsp;19</a>)</span><a name="c19."></a>. The instruction set extensions reduce code size significantly, are binary compatible with older object code, and do not require the processor to switch <span class="cmti-10">modes</span>. The variable length instructions include 16-bit instructions that are compact versions of existing 32-bit instructions. All existing control, data path, and functional unit logic beyond the decode stage remains unchanged with respect to the 16-bit instructions. The 16-bit and 32-bit instructions can be mixed. Consistent with the VLIW processor philosophy, the utilization of the variable length instructions is directed by the compiler.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The 16-bit instructions implement frequently occurring instructions such as addition, subtraction, multiplication, shift, load, and store. By necessity, the 16-bit instructions have reduced functionality. For example, immediate fields are smaller, there is a reduced set of available registers, the instructions may operate only on one functional unit per cluster, and some standard arithmetic and logic instructions may have only two operands instead of three (one source register is the same as the destination register). Due to the design requirements of a high performance VLIW processor, 32-bit instructions must be kept on a 32-bit boundary. 
Therefore, the 16-bit instructions occur in pairs in order to honor the 32-bit instruction alignment.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">On a set of performance critical application benchmarks, the variable length instructions were shown to reduce code size by an average of 11.5% when the compiler was configured to maximize performance and 23.3% when the compiler was configured to minimize code size (at the expense of performance) <span class="cite">(<a href="#c3">3</a>)</span>.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">5.1   </span> <a id="x1-80005.1"></a>The Fetch Packet Header</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">A new type of fetch packet encodes a mixture of 16-bit and 32-bit instructions. Thus, there are two kinds of fetch packets: a standard fetch packet that contains only 32-bit instructions and a header-based fetch packet that contains a mixture of 32- and 16-bit instructions. Figure&nbsp;<a href="#x1-800116">16</a> shows a standard fetch packet and an example of a header-based fetch packet. Fetch packet headers are detected by looking at the first four bits of the last word in a fetch packet. The header-based fetch packet encodes how to interpret the bits in the rest of the fetch packet. On C6X-3 processors, execute packets may span standard and header-based fetch packets.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f16.png">     ]]></body>
<body><![CDATA[<br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;16: </span><span class="content">Fetch packet formats</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">Figure <a href="#x1-800217">17</a> shows the layout of the fetch packet header. The predicate field used to signify a fetch packet header occupies four bits (bits 28-31). There are seven <span class="cmti-10">layout bits </span>(bits 21-27) that designate whether the corresponding word in the fetch packet is a 32-bit instruction or a pair of 16-bit instructions. Bits 0-13 are p-bits for 16-bit instructions. For a 32-bit instruction, the corresponding two p-bits in the header are not used (set to 0). The remaining seven <span class="cmti-10">expansion bits </span>(bits 14-20) are used to specify different variations of the 16-bit instruction set.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f17.png">     <br>   </font>   </p>      ]]></body>
<body><![CDATA[<div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;17: </span><span class="content">Compact instruction header format</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">The expansion bits and p-bits are effectively extra opcode bits that are attached to each instruction in the fetch packet. Certain branch instructions appearing in header-based fetch packets can reach half-word program addresses. The compressor software (discussed below) ensures that branch instructions can reach intended destination addresses and encodes the expansion bits to maximize the number of instructions in a fetch packet.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The protected load instruction bit (bit 20) indicates whether all load instructions in the fetch packet are protected. This eliminates the NOP that often occurs after a load instruction in control-oriented code. The register set bit (bit 19) indicates which set of eight registers is used for three operand 16-bit instructions. The data size field (bits 16-18) encodes the access size (byte, half-word, word, double-word) of all 16-bit load and store instructions. The branch bit (bit 15) controls whether branch instructions or certain S-unit arithmetic and shift instructions are available. Finally, the saturation bit (bit 14) indicates whether basic arithmetic operations saturate on overflow and underflow. If the result of an arithmetic operation overflows, then the result <span class="cmti-10">saturates </span>to a maximum value, and if an operation underflows, it saturates to a minimum value.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The instruction set extensions also include a new 32-bit CALLP instruction. 
Unlike branch instructions, where the five delay slots must be filled with other instructions or NOPs, the CALLP instruction is <span class="cmti-10">protected</span>, meaning other instructions cannot start in the delay slots of the CALLP. The use of this CALLP can reduce code size up to 6% on some applications&nbsp;<span class="cite">(<a href="#c3">3</a>)</span>, with only a small degradation in performance. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">5.2   </span> <a id="x1-90005.2"></a>The Compressor</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">An instruction&rsquo;s size is determined at assembly-time. (This is possible because each 16-bit instruction has a 32-bit counterpart.) The <span class="cmti-10">compressor </span>runs after the assembly phase and is responsible for converting as many 32-bit instructions as possible to equivalent 16-bit instructions. As shown in Figure&nbsp;<a href="#x1-900118">18</a>, the compressor takes a specially instrumented object file (where all instructions are 32-bit), and produces an object file where some instructions have been converted to 16-bit instructions.&nbsp;</font></p>      ]]></body>
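The header fields described in Section 5.1, which the compressor must ultimately emit, can be illustrated with a small decoder. The field names below follow the descriptions of Figure 17; the code is a reading aid, not a verified C6X-3 decoder:

```python
# A sketch of pulling the Figure-17 fields out of a 32-bit fetch packet
# header word. Field names are taken from the prose descriptions above.

def decode_header(word):
    """Split a 32-bit fetch packet header word into named bit fields."""
    def field(lo, hi):
        return (word >> lo) & ((1 << (hi - lo + 1)) - 1)
    return {
        "header_marker": field(28, 31),  # predicate field that signifies a header
        "layout":        field(21, 27),  # 7 bits: 32-bit insn or 16-bit pair per word
        "expansion":     field(14, 20),  # 7 bits selecting 16-bit set variants:
        "protected_ld":  field(20, 20),  #   bit 20: load instructions are protected
        "register_set":  field(19, 19),  #   bit 19: which 8-register subset is used
        "data_size":     field(16, 18),  #   bits 16-18: ld/st access size
        "branch":        field(15, 15),  #   bit 15: branches vs. S-unit arithmetic
        "saturation":    field(14, 14),  #   bit 14: saturating arithmetic
        "p_bits":        field(0, 13),   # p-bits for the 16-bit instructions
    }
```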
<body><![CDATA[<p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f18.png">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;18: </span><span class="content">Back-end compiler and assembler flow depicting the compression of instructions</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">Compression is an iterative process consisting of one or more <span class="cmti-10">compression iterations</span>&nbsp;<span class="cite">(<a href="#c3">3</a>)</span>. In each compression iteration, the compressor starts at the beginning of the section&rsquo;s instruction list and generates new fetch packets until all instructions are consumed. Each new fetch packet may contain eight 32-bit instructions (a regular fetch packet), or contain a mixture of 16- and 32-bit instructions (a header-based fetch packet).&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The compressor must select an <span class="cmti-10">overlay</span>, which is an expansion bit combination used for a fetch packet that contains 16-bit instructions. There are several expansion bits in the fetch packet header that indicate how the 16-bit instructions in the fetch packet are to be interpreted. For each new fetch packet, the compressor selects a window of instructions and records for each overlay which instructions may be converted to 16-bit. It then selects the overlay that packs the most instructions in the new fetch packet.&nbsp;</font></p>      ]]></body>
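The overlay selection step can be sketched as a greedy search. The instruction records and overlay predicates below are hypothetical stand-ins; the real compressor works on object-file instructions:

```python
# For each candidate overlay (expansion-bit setting), count which
# instructions in the current window would fit a 16-bit encoding under
# that overlay, then keep the overlay that packs the most instructions.

def pick_overlay(window, overlays):
    """window: list of instruction records; overlays: name -> predicate.
    Returns (best overlay name, indices of compressible instructions)."""
    best_name, best_hits = None, []
    for name, fits_16bit in overlays.items():
        hits = [i for i, insn in enumerate(window) if fits_16bit(insn)]
        if len(hits) > len(best_hits):
            best_name, best_hits = name, hits
    return best_name, best_hits

# Two illustrative overlays: one exposes saturating arithmetic, the
# other exposes branches (cf. the branch bit in the header).
overlays = {
    "saturating": lambda insn: insn["op"] in {"add", "sub", "shl"} and insn["short"],
    "branchy":    lambda insn: insn["op"] in {"add", "b", "ld"} and insn["short"],
}
window = [{"op": "add", "short": True}, {"op": "b", "short": True},
          {"op": "mpy", "short": False}, {"op": "ld", "short": True}]
name, hits = pick_overlay(window, overlays)   # "branchy" compresses 3 of 4
```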
<body><![CDATA[<p>   <font face="Verdana" size="2">During a compression iteration, there is often a potential 16-bit instruction with no other 16-bit instruction immediately before or after. In this case, the compressor may swap instructions <span class="cmti-10">within </span>an execute packet to create a pair. Because the C6X compiler often produces execute packets with multiple instructions, swapping instructions within an execute packet increases the conversion rate of potential 16-bit instructions. The compressor does not swap or move instructions outside of execute packets, nor change registers of instructions in order to improve compression. The compressor will always converge on a solution, typically after five or fewer compression iterations&nbsp;<span class="cite">(<a href="#c3">3</a>)</span>. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">5.3   </span> <a id="x1-100005.3"></a>Instruction Tailoring</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">The compressor has the responsibility for packing instructions into fetch packets. The compiler does not make the final decision whether an instruction will become a 16-bit instruction. It does, however, specialize instructions so that they are likely to become 16-bit instructions. We call such instruction specialization <span class="cmti-10">tailoring</span>. Because instructions tailored to be 16-bit are restricted to use a subset of the register file and functional units, they can degrade performance. Therefore, the compiler implements a set of command-line options that allow users to control the aggressiveness of the tailoring optimizations. 
The compiler implements instruction tailoring via the following techniques: </font>      </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Instruction selection</span>: The compiler tailors instructions to have a high probability of becoming      16-bit  instructions  <span class="cite">(<a href="#c3">3</a>)</span>.  The  compiler  will  replace  a  single  32-bit  instruction  with  two  instructions      that will likely compress to two 16-bit instructions. Since 16-bit instructions must be paired in the      compressor, replacing a 32-bit instruction with two potential 16-bit instructions reduces the impact of      32-bit alignment restrictions, which improves the compression of the surrounding instructions. When      compiling for minimum code size, the compiler attempts to generate instructions that have 16-bit      formats only. The compiler assumes that any potential 16-bit instruction will ultimately become 16-bit      in the compressor. </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Tiered register allocation</span>: The compiler implements a register allocation scheme that maximizes      the  usage  of  the  16-bit  instructions&rsquo;  register  file  subset  <span class="cite">(<a href="#c20.">20</a>)</span><a name="c20."></a>.  Using  tiered  register  allocation,  the      compiler  limits  the  available  registers  for  operands  in  potential  16-bit  instructions  to  the  16-bit      instructions&rsquo; register file subset. If the register allocation attempt succeeds, the operands in potential      16-bit instructions are allocated registers from the 16-bit instructions&rsquo; register file subset. If the register      allocation attempt fails, the compiler incrementally releases registers for allocation from the rest of      the register file for the operands of potential 16-bit instructions. 
Should register allocation attempts      continue failing, the whole register set is made available for all instruction operands thereby falling      back on the compiler&rsquo;s traditional register allocation mechanism.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Function call customization</span>: To better utilize the 16-bit instructions, the compiler identifies call      sites with potential 16-bit instruction operands that have live ranges across the call. The call is then      rewritten as an indirect call to a run-time support routine, which takes the address of the original      call site function as an operand. The run-time support routine saves the 16-bit instructions&rsquo; register      file subset on the stack. Control is then transferred to the actual function that was being called at      that call site. The called function returns to the run-time support routine, which restores the 16-bit      instructions&rsquo; register file subset and then returns to the original call site. This technique effectively      simulates changing the calling convention to include the 16-bit instructions&rsquo; register file subset in the       set of registers saved by a called function&nbsp;<span class="cite">(<a href="#c21">21</a>)</span><a name="c21."></a>. Calling convention customization is used only when      compiling to aggressively minimize code size. 
</font>      </li>      </ul>   <font face="Verdana" size="2">       <br>  </font>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">6   </span> <a id="x1-110006"></a>Results</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">This section summarizes, for each generation of C6X processors, the progressive code size reduction and performance impact of software-pipelined loop collapsing, NOP compression, variable length instructions, and the modulo loop buffer.&nbsp;</font></p>      ]]></body>
<body><![CDATA[<p>   <font face="Verdana" size="2">The 84 benchmarks used for this analysis are organized into the groups enumerated below. The EEMBC telecom, automotive, and networking groups are taken directly from the EEMBC-v1 embedded benchmark suite <span class="cite">(<a href="#c22">22</a>)</span><a name="c22."></a>. </font>      </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">EEMBC telecom</span>: signal processing loop kernels typically found in telecommunication applications.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">EEMBC automotive</span>: control functions found in automotive engine applications.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">EEMBC networking</span>: packet processing algorithms taken from network and communication infrastructure applications. </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">DSP codecs</span>: voice compression decoders/encoders (codecs) used in wireless communication applications, including: evrc, g723.1, g729, gsmAMR, gsmefr, gsmfr, gsmhr, Reed-Solomon, modem, trau, and wbamr. </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Multimedia codecs</span>: image, video, music, and data compression codecs, including: jpeg, mpeg4, mp3, ac3, aes, and des. </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Control code</span>: tcpip, zlib, and hard-disk drive.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Other applications</span>: miscellaneous benchmarks such as dhrystone, dijkstra, susan, and others.      
</font>      </li>      </ul>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">The benchmarks were compiled with the TI C6X compiler version 6.0.8. The C6X compiler has a speed-or-size option that determines how the compiler makes tradeoffs between optimizing for code size or performance. The three speed-or-size options used in this analysis are: </font>      </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Size</span>: aggressively minimize code size at the expense of performance.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Size and speed</span>: minimize code size with nominal impact on performance.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Speed</span>: aggressively maximize performance at the expense of code size (default).      </font>      </li>      </ul>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">The compiler provides options to select the processor generation and to disable optimization passes that target specific processor features. For the following results, the baseline configuration is the C6X-1 generation processor compiled with software-pipelined loop collapsing <span class="cmti-10">disabled </span>and the speed-or-size option set to speed. All results are normalized to this baseline configuration. Benchmark code size reduction and speedup (performance improvement) are measured on the following four configurations: </font>       </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">C6X-1(SLC)</span>: C6X-1 generation processors with software-pipelined loop collapsing enabled.      
</font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">C6X-2(NC)</span>: C6X-2 generation processors with NOP compression.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">C6X-3(VLI)</span>: C6X-3 generation processors with variable length instructions enabled.      </font>      </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">C6X-3(VLI+MLB)</span>: C6X-3 generation processor with both variable length instructions and the      modulo loop buffer enabled. </font>      </li>      </ul>   <font face="Verdana" size="2">       <br>  </font>      <p>   <font face="Verdana" size="2">The differences in the benchmark results for the C6X-1(SLC), C6X-2(NC), and C6X-3(VLI+MLB) configurations correspond to the improvements that are seen when upgrading to new processor generations. The C6X-3(VLI) configuration is provided to differentiate the impact of variable length instructions and the modulo loop buffer.&nbsp;</font></p>      ]]></body>
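For concreteness, the normalized metrics can be computed as in the following sketch. The formulas and the byte/cycle counts are illustrative assumptions; the paper's exact averaging method is not restated here.

```python
# How a normalized result against the C6X-1 baseline (speed option, loop
# collapsing disabled) might be computed. All numbers below are invented
# for illustration; they are not measurements from the paper.

def code_size_reduction(baseline_bytes, config_bytes):
    """Percent reduction relative to the baseline; smaller code gives a larger value."""
    return 100.0 * (baseline_bytes - config_bytes) / baseline_bytes

def speedup(baseline_cycles, config_cycles):
    """Percent performance improvement; negative values mean degradation."""
    return 100.0 * (baseline_cycles - config_cycles) / config_cycles

# A hypothetical benchmark: 10000 bytes / 5000 cycles on the baseline,
# 6000 bytes / 4525 cycles on a later configuration.
red = code_size_reduction(10000, 6000)
sp = speedup(5000, 4525)
assert red == 40.0            # 40% smaller code than the baseline
assert round(sp, 1) == 10.5   # roughly 10.5% faster than the baseline
```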
<body><![CDATA[<p>   <font face="Verdana" size="2">For the most part, the configurations accumulate (i.e., C6X-3(VLI+MLB) is really C6X-3(NC+SLC+VLI+MLB)). One exception is the overlap between SLC and MLB, since both minimize the code size of software-pipelined loops. Software-pipelined loops that do not use the MLB still benefit from loop collapsing.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">For each configuration, Figures&nbsp;<a href="#x1-1100119">19</a>, <a href="#x1-1100220">20</a>, <a href="#x1-1100321">21</a>, and <a href="#x1-1100422">22</a> present the normalized average <span class="cmti-10">code size reduction </span>and <span class="cmti-10">speed</span> <span class="cmti-10">improvement </span>(the y-axis). For code size reduction, smaller is better, and for speed improvement, larger is better. Each configuration is compiled for all three speed-or-size options (the x-axis).&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Figure&nbsp;<a href="#x1-1100119">19</a> summarizes the results for all benchmarks relative to C6X-1 compiled for speed with no loop collapsing. The results are averages across all benchmarks. Recall that because the compiler is disabling optimizations that increase code size, performance will degrade at the <span class="cmti-10">speed&amp;size </span>and <span class="cmti-10">size </span>options. 
The goal is to minimize the degradation as much as possible.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f19.png"> <img src="/img/revistas/cleiej/v15n2/2a04f20.png">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;19: </span><span class="content">Code-size reduction and performance improvement on all benchmarks</span></font></div>  <font face="Verdana" size="2">      <br>  &nbsp; </font>     <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     ]]></body>
<body><![CDATA[<br>  </font>      <p>   <font face="Verdana" size="2">The following analysis of the results in Figure&nbsp;<a href="#x1-1100119">19</a> is grouped by compiler speed-or-size option. </font>      </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Speed</span>:      </font>                    <ul class="itemize2">            <li class="itemize"><font face="Verdana" size="2">C6X-1(SLC): The 3.7% speedup improvement is from the elimination of compensation code around software-pipelined loops. Software-pipelined loop collapsing improves code size by 9%.          </font>          </li>            <li class="itemize"><font face="Verdana" size="2">C6X-2(NC): The 10.4% speedup is predominantly from increasing the size of the register file. NOP compression has more than doubled the code-size reduction to 24.9%.          </font>          </li>            <li class="itemize"><font face="Verdana" size="2">C6X-3(VLI): Some of the compiler transformations that exploit the variable length instructions degrade the speedup slightly, from 10.4% to 10.0%. The variable length instructions improve the average code-size reduction from 24.9% to 33.3%. </font>          </li>            <li class="itemize"><font face="Verdana" size="2">C6X-3(VLI+MLB): Some of the restrictions on using the MLB degrade performance slightly further, to 9.4%. The code-size reduction reaches an impressive 40%.          </font>          </li>               </ul>        </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Speed and Size</span>:      </font>                    <ul class="itemize2">            <li class="itemize"><font face="Verdana" size="2">C6X-1(SLC): A speedup degradation of 2.9% and a code-size reduction of 17.6%.          
</font>          </li>            <li class="itemize"><font face="Verdana" size="2">C6X-2(NC): The 1.4% speedup improvement is from the larger register file. NOP compression has almost doubled the code-size reduction to 32%. </font>          </li>            <li class="itemize"><font face="Verdana" size="2">C6X-3(VLI): Compiler transformations to exploit variable length instructions have degraded the speedup by 2.6%, but the code-size reduction has improved to 43%.          </font>          </li>            <li class="itemize"><font face="Verdana" size="2">C6X-3(VLI+MLB): The 0.8% speedup occurs because the modulo loop buffer allows the compiler to aggressively software pipeline loops even when compiling for size. The code-size reduction has increased to an impressive 47.3%. </font>          </li>               </ul>        </li>        <li class="itemize"><font face="Verdana" size="2"><span class="cmbx-10">Size</span>:      </font>                    <ul class="itemize2">            <li class="itemize"><font face="Verdana" size="2">C6X-1(SLC): A large code-size reduction of 35.4% with performance degrading by 38.1%.          </font>          </li>            <li class="itemize"><font face="Verdana" size="2">C6X-2(NC): A large code-size reduction of 47.1% with a slight improvement in the performance degradation, to -37.0%. </font>          </li>            <li class="itemize"><font face="Verdana" size="2">C6X-3(VLI): A very large code-size reduction of 58.2% with performance degrading by a substantial 45.2%. Variable length instructions can reduce code size significantly, but the compiler transformations that enable them are costly in terms of performance.          </font>          </li>            <li class="itemize"><font face="Verdana" size="2">C6X-3(VLI+MLB): An impressive code-size reduction of 56.1% with only a 22.5% performance degradation. 
The modulo loop buffer enables the compiler to software pipeline loops even when compiling for code size only. </font>          </li>               </ul>        </li>      </ul>    <font face="Verdana" size="2">        <br>  </font>      <p>   <font face="Verdana" size="2">Figure&nbsp;<a href="#x1-1100220">20</a> shows the results for the EEMBC telecom, automotive, and networking benchmarks. These benchmarks are smaller and representative of code in specific application spaces. The automotive benchmarks are dominated by control code, the telecom code is primarily loop-oriented, and the networking algorithms are a mixture of control- and loop-oriented code. Note that when compiling the C6X-3(VLI+MLB) configuration for speed, there is a 57% code-size reduction in the telecom benchmarks.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The results for the DSP codecs, multimedia codecs, control code, and other applications are shown in Figures&nbsp;<a href="#x1-1100321">21</a> and <a href="#x1-1100422">22</a>. Many of these benchmarks are complete applications. The DSP and multimedia codecs are loop-oriented applications, the control code is obviously control-oriented, and the other applications are a mixture of both. Note the 44.3% code-size reduction in the DSP codecs when compiling the C6X-3(VLI+MLB) configuration for speed.&nbsp;</font></p>      <p>   </p>  <hr class="figure">     ]]></body>
<body><![CDATA[<div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f21.png"> <img src="/img/revistas/cleiej/v15n2/2a04f22.png"> <img src="/img/revistas/cleiej/v15n2/2a04f23.png"> <img src="/img/revistas/cleiej/v15n2/2a04f24.png"> <img src="/img/revistas/cleiej/v15n2/2a04f25.png"> <img src="/img/revistas/cleiej/v15n2/2a04f26.png">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;20: </span><span class="content">Code-size reduction and performance improvement on the EEMBC benchmarks</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f27.png"> <img src="/img/revistas/cleiej/v15n2/2a04f28.png"> <img src="/img/revistas/cleiej/v15n2/2a04f29.png"> <img src="/img/revistas/cleiej/v15n2/2a04f30.png">     ]]></body>
<body><![CDATA[<br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;21:  </span><span class="content">Code-size  reduction  and  performance  improvement  on  DSP  and  multimedia  application benchmarks</span></font></div>  <font face="Verdana" size="2">&nbsp;    <br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   </p>  <hr class="figure">     <div class="figure">  <font face="Verdana" size="2">&nbsp; </font>     <p><font face="Verdana" size="2"><img src="/img/revistas/cleiej/v15n2/2a04f31.png"> <img src="/img/revistas/cleiej/v15n2/2a04f32.png"> <img src="/img/revistas/cleiej/v15n2/2a04f33.png"> <img src="/img/revistas/cleiej/v15n2/2a04f34.png">     <br>   </font>   </p>      <div class="caption"><font face="Verdana" size="2"><span class="id">Figure&nbsp;22: </span><span class="content">Code-size reduction and performance improvement on control code and other miscellaneous application benchmarks</span></font></div>  <font face="Verdana" size="2">&nbsp;    ]]></body>
<body><![CDATA[<br>  </font>      <p>   </p>  </div>  <hr class="endfigure"> <font face="Verdana" size="2">     <br>  </font>      <p>   <font face="Verdana" size="2">There is a distinct difference in the results between control- and loop-oriented benchmarks. The loop-oriented benchmarks demonstrate greater code-size reduction. Clearly, software-pipelined loop collapsing and the modulo loop buffer have no effect on the size of control-oriented code. However, variable length instructions and NOP compression improve both loop- and control-oriented code.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The variable length instructions create a more significant size and performance tradeoff range. When controlling code size is the programmer&rsquo;s primary concern, this additional range is useful for balancing performance against code size in a memory-constrained application, which is common in embedded systems. </font>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">7   </span> <a id="x1-120007"></a>Summary</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">Code size is a primary concern in the embedded computing community. Minimizing physical memory requirements reduces total system cost, improves system performance by allowing more code to fit in on-chip memory and program caches, and improves power efficiency.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">We have presented the co-design of the following four compiler optimizations and architecture features, which reduce code size across three generations of the C6X processor: </font>      </p>  <ul class="itemize1">        <li class="itemize"><font face="Verdana" size="2">Software-pipelined loop collapsing: a compiler technique that reduces the code size of software-pipelined loops. 
</font>      </li>        <li class="itemize"><font face="Verdana" size="2">NOP compression: a hardware technique that compresses the encoding of padding and pipeline NOP instructions. </font>      </li>        <li class="itemize"><font face="Verdana" size="2">Variable length instructions: complementary compiler and hardware techniques that encode commonly occurring 32-bit instructions as 16-bit instructions. </font>      </li>        <li class="itemize"><font face="Verdana" size="2">Modulo loop buffer: a hardware technique that reduces the code size and improves the power efficiency of software-pipelined loops.</font></li>      </ul>   <font face="Verdana" size="2">       ]]></body>
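As a toy illustration of the NOP-compression idea from the list above: runs of single-cycle NOP instructions collapse into one multi-cycle "NOP n". This sketch is a deliberate simplification; the actual hardware scheme encodes NOP cycles far more compactly, for example as instruction operands, as described in reference (9).

```python
# Toy model of NOP compression: runs of consecutive single-cycle NOPs in a
# linear instruction stream are replaced by one multi-cycle "NOP n" entry.
# Each entry stands for one instruction word, so shrinking the list
# shrinks code size proportionally. Opcode names are illustrative.

def compress_nops(stream):
    out = []
    for insn in stream:
        if insn == "NOP" and out and out[-1][0] == "NOP":
            op, n = out[-1]
            out[-1] = ("NOP", n + 1)      # fold into the previous NOP entry
        else:
            out.append(("NOP", 1) if insn == "NOP" else (insn, None))
    return out

# Three delay-slot NOPs after the multiply collapse to a single word.
prog = ["MPY", "NOP", "NOP", "NOP", "ADD", "NOP", "B"]
packed = compress_nops(prog)
assert packed == [("MPY", None), ("NOP", 3), ("ADD", None), ("NOP", 1), ("B", None)]
# 7 instruction words before, 5 after.
```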
<body><![CDATA[<br>  </font>      <p>   <font face="Verdana" size="2">We presented the code-size reduction and performance impact of using these techniques to compile a set of 84 benchmarks. With the compiler&rsquo;s speed-or-size option set to maximize speed, the results showed an impressive cumulative average code-size reduction of 40%.&nbsp;</font></p>      <p>    </p>      <p><font face="Verdana" size="2"><span class="titlemark">8   </span> <a id="x1-130008"></a>Related Work</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p><font face="Verdana" size="2">The FPS-164 attached array processor&nbsp;<span class="cite"><a href="#c23">(23</a>,&nbsp;<a href="#c24">24</a>)<a name="c23."></a><a name="c24."></a></span> was a horizontally microcoded computer used for scientific applications, such as signal processing. In the early 1980s, Fisher and his coworkers in the ELI (Enormously Long Instructions) project at Yale University developed the concepts of VLIW architectures&nbsp;<span class="cite">(<a href="#c25">25</a>)</span><a name="c25."></a>. The ELI project developed into Multiflow Corporation and the Trace family of computers&nbsp;<span class="cite">(<a href="#c26">26</a>)</span><a name="c26."></a>. The significant impact of the ELI project was the simultaneous development of a hardware strategy and a compiler strategy. The architecture relied heavily on a trace scheduling compiler. The Trace compiler did not use software pipelining, but instead used extensive loop unrolling. The Trace family of computers was available in three sizes, where each size replicated a cluster. 
There were one-, two-, and four-cluster machines.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The Cydra 5 computer developed at Cydrome Inc.&nbsp;<span class="cite">(<a href="#c27">27</a>,&nbsp;<a href="#c28">28</a>)</span><a name="c27."></a><a name="c28."></a> evolved from the polycyclic architecture described in&nbsp;<span class="cite">(<a href="#c13">13</a>)</span>. The Cydra 5 architecture was a VLIW system that was designed for optimizing the execution of inner loops using software pipelining. As with the Trace computer, the Cydra 5 relied on the compiler to statically schedule&nbsp;<span class="cite">(<a href="#c29">29</a>)</span><a name="c29."></a> all operations. The Warp architecture is a systolic array consisting of 10 VLIW cell processors&nbsp;<span class="cite">(<a href="#c30">30</a>,&nbsp;<a href="#c31">31</a>)<a name="c30."></a><a name="c31."></a></span>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The Intel Itanium IA-64 processor&nbsp;<span class="cite">(<a href="#c32">32</a>)<a name="c32."></a></span> is a VLIW design, although Intel refers to it as an explicitly parallel instruction computing (EPIC) processor. Today (in 2010), the VLIW philosophy is popular in embedded processors. 
Besides the C6X, other examples of embedded VLIW processors include the Analog Devices SHARC DSP&nbsp;<span class="cite">(<a href="#c33">33</a>)<a name="c33."></a></span>, the STMicroelectronics ST200&nbsp;<span class="cite">(<a href="#c34">34</a>)<a name="c34."></a></span>, the Infineon Carmel&nbsp;<span class="cite">(<a href="#c35">35</a>)</span><a name="c35."></a>, and Tensilica&rsquo;s Xtensa LX2&nbsp;<span class="cite">(<a href="#c36">36</a>)</span><a name="c36."></a>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">Another approach to reduce code size is to store a compressed image of the VLIW program code in external memory and to use run-time software or hardware to decompress the code as it is executed or loaded into a program cache&nbsp;<span class="cite">(<a href="#c37">37</a>,&nbsp;<a href="#c38">38</a>)</span><a name="c37."></a><a name="c38."></a>. Other approaches have used variable length instruction encoding techniques to reduce the size of execute packets&nbsp;<span class="cite">(<a href="#XAditya00">2</a>)</span>. Finally, some embedded processors have modes that implement smaller opcode encodings for a subset of frequently occurring instructions. Examples of mode-based architectures are the ARM architecture&rsquo;s Thumb mode&nbsp;<span class="cite">(<a href="#c39">39</a>,&nbsp;<a href="#c40">40</a>)</span><a name="c39."></a><a name="c40."></a> and the MIPS32 architecture&rsquo;s MIPS16 mode&nbsp;<span class="cite">(<a href="#c41">41</a>)</span><a name="c41."></a>.&nbsp;</font></p>      <p>   <font face="Verdana" size="2">The basic concepts of a modulo loop buffer are described by Merten and Hwu&nbsp;<span class="cite">(<a href="#c18">18</a>)</span>. Other approaches to reduce the code size of software-pipelined loops employ special-purpose hardware. Instructions from different iterations are controlled by distinct rotating predicates&nbsp;<span class="cite">(<a href="#c42">42</a>)<a name="c42."></a></span>. 
Loop-control instructions are used in combination with the rotating predicate register file to conditionally nullify a subset of the instructions during the pipe-fill and pipe-drain phases, eliminating the need for explicit prologs and epilogs. Only the kernel code is explicitly represented. The advantage of kernel-only code is that there is no code growth. The disadvantage is that the prolog and epilog code can neither be customized nor overlapped with surrounding instructions. The effects of software-pipelined loop collapsing are similar to kernel-only code, but software-pipelined loop collapsing does not require hardware support beyond the availability of static predicate registers.&nbsp;</font></p>      ]]></body>
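The kernel-only execution model described above can be sketched with an explicit simulation, where per-stage guards play the role of the rotating predicate registers that nullify instructions during pipe-fill and pipe-drain (Python; the two-stage structure and the names are illustrative assumptions, not the Cydra 5 or C6X mechanism):

```python
# Kernel-only software pipelining, simulated: a single 2-stage kernel body
# iterates N + stages - 1 times. Stage guards disable the load stage once
# all elements are fetched (pipe-drain) and disable the compute stage on
# the first fill cycle (pipe-fill), so no explicit prolog or epilog code
# is ever stored -- only the kernel.

def kernel_only_loop(data):
    stages = 2
    n = len(data)
    loaded = [None] * n
    out = []
    for k in range(n + stages - 1):
        # stage 0: "load" -- guarded off after the last element is fetched
        if n - 1 >= k:
            loaded[k] = data[k]
        # stage 1: "compute" -- guarded off during the first fill cycle
        if k >= 1:
            out.append(loaded[k - 1] * 2)
    return out

assert kernel_only_loop([1, 2, 3]) == [2, 4, 6]
```

Only the kernel is represented, mirroring the no-code-growth advantage described above; the cost, as noted, is that fill and drain code cannot be specialized or overlapped with surrounding instructions.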
<body><![CDATA[<p>    </p>      <p><font face="Verdana" size="2"><a id="x1-140008"></a>References</font></p>   <font face="Verdana" size="2">       <br>  </font>      <p>     </p>      <div class="thebibliography">          <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c1"></a>   (<a href="#c1.">1</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;A.  Fisher,  P.&nbsp;Faraboschi,  and  C.&nbsp;Young,  <span class="cmti-10">Embedded  Computing  :  A  VLIW  Approach  to</span>     <span class="cmti-10">Architecture, Compilers and Tools</span>.   Morgan Kaufmann, December 2004.     </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c2"></a>   (<a href="#c2.">2</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>S.&nbsp;Aditya, S.&nbsp;A. Mahlke, and B.&nbsp;R. Rau, &ldquo;Code size minimization and retargetable assembly for     custom EPIC and VLIW instruction formats,&rdquo; <span class="cmti-10">ACM Transactions on Design Automation of Electronic</span>     <span class="cmti-10">Systems</span>, vol.&nbsp;5, no.&nbsp;4, pp. 752&ndash;773, 2000. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c3"></a>   (<a href="#c3.">3)</a><span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>T.&nbsp;T. Hahn, E.&nbsp;J. Stotzer, D.&nbsp;Sule, and M.&nbsp;Asal, &ldquo;Compilation strategies for reducing code size     on a VLIW processor with variable length instructions,&rdquo; in <span class="cmti-10">HiPEAC</span>, 2008, pp. 147&ndash;160. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c4"></a>   (<a href="#c4.">4</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span><span class="cmti-10">TMS320C6000 Optimizing Compiler User&rsquo;s Guide</span>,  spru187o&nbsp;ed.,  Texas  Instruments,  Inc.,  May     2008. </font>     </p>            ]]></body>
<body><![CDATA[<!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c5"></a>   (<a href="#c5.">5</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>A.&nbsp;V. Aho, R.&nbsp;Sethi, and J.&nbsp;D. Ullman, <span class="cmti-10">Compilers: Principles, Techniques, and Tools</span>.    Boston,     MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1986.     </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c6"></a>   (<a href="#c6.">6</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>R.&nbsp;Allen and K.&nbsp;Kennedy, <span class="cmti-10">Optimizing Compilers for Modern Architectures</span>.    San Francisco, CA,     USA: Morgan Kaufman Publishers Inc., 2002.     </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c7"></a>   (<a href="#c7.">7</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>K.&nbsp;Cooper and L.&nbsp;Torczon, <span class="cmti-10">Engineering a Compiler</span>.   San Francisco, CA, USA: Morgan Kaufman     Publishers Inc., 2003.     </font>      </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c8"></a>   (<a href="#c8.">8</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>S.&nbsp;S. Muchnick, <span class="cmti-10">Advanced Compiler Design and Implementation</span>.  San Francisco, CA, USA: Morgan     Kaufmann Publishers Inc., 1997.     </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c9"></a>   (<a href="#c9.">9</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>E.&nbsp;J. Stotzer, E.&nbsp;D. Granston, and A.&nbsp;S. Ward, &ldquo;Methods and apparatus for reducing the size     of code with an exposed pipeline by encoding NOP operations as instruction operands,&rdquo; U.S. Patent     6,799,266, September 2004. 
</font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c10"></a>  (<a href="#c10.">10</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>A.&nbsp;L. Davis, R.&nbsp;H. Scales, N.&nbsp;Seshan, E.&nbsp;J. Stotzer, and R.&nbsp;E. Tatge, &ldquo;Microprocessor with an     instruction immediately next to a branch instruction for adding a constant to a program counter,&rdquo; U.S.     Patent 6,889,320, May 2005. </font>     </p>            ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span class="biblabel"><a name="c11"></a>  (<a href="#c11.">11</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>E.&nbsp;J. Stotzer and E.&nbsp;L. Leiss, &ldquo;Instruction encoding schemes that reduce code size on a VLIW processor,&rdquo; in <span class="cmti-10">CLEI &rsquo;10: Proceedings of the Conferencia Latinoamericana de Inform&aacute;tica</span>, October 2010. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c12"></a>  (<a href="#c12.">12</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>E.&nbsp;D. Granston, J.&nbsp;Zbiciak, and E.&nbsp;J. Stotzer, &ldquo;Method for software pipelining of irregular conditional control loops,&rdquo; U.S. Patent 6,892,380, May 2005. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c13"></a>  (<a href="#c13.">13</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>B.&nbsp;R. Rau and C.&nbsp;D. Glaeser, &ldquo;Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing,&rdquo; in <span class="cmti-10">MICRO 14: Proceedings of the 14th Annual</span> <span class="cmti-10">Workshop on Microprogramming</span>.   Piscataway, NJ, USA: IEEE Press, 1981, pp. 183&ndash;198. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c14"></a>  (<a href="#c14.">14</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>E.&nbsp;J. Stotzer and E.&nbsp;L. Leiss, &ldquo;Modulo scheduling without overlapped lifetimes,&rdquo; in <span class="cmti-10">LCTES &rsquo;09:</span> <span class="cmti-10">Proceedings of the 2009 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for</span> <span class="cmti-10">Embedded Systems</span>.   
New York, NY, USA: ACM, 2009, pp. 1&ndash;10. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c15"></a>  (<a href="#c15.">15</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>E.&nbsp;Granston, R.&nbsp;Scales, E.&nbsp;Stotzer, A.&nbsp;Ward, and J.&nbsp;Zbiciak, &ldquo;Controlling code size of software-pipelined loops on the TMS320C6000 VLIW DSP architecture,&rdquo; in <span class="cmti-10">MSP-3: Proceedings of the</span> <span class="cmti-10">3rd IEEE/ACM Workshop on Media and Streaming Processors</span>, 2001, pp. 29&ndash;38. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c16"></a>  (<a href="#c16.">16</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>M.&nbsp;Jayapala, F.&nbsp;Barat, T.&nbsp;Vander&nbsp;Aa, F.&nbsp;Catthoor, H.&nbsp;Corporaal, and G.&nbsp;Deconinck, &ldquo;Clustered loop buffer organization for low energy VLIW embedded processors,&rdquo; <span class="cmti-10">IEEE Transactions on Computers</span>, vol.&nbsp;54, no.&nbsp;6, pp. 672&ndash;683, 2005. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c17"></a>  (<a href="#c17.">17</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>E.&nbsp;J. Stotzer and E.&nbsp;L. Leiss, &ldquo;Compiler and hardware support for reducing the code size of software pipelined loops,&rdquo; in <span class="cmti-10">CLEI &rsquo;11: Proceedings of the Conferencia Latinoamericana de Inform&aacute;tica</span>, October 2011. </font>      </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c18"></a>  (<a href="#c18.">18</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>M.&nbsp;C. Merten and W.-m.&nbsp;W. 
Hwu, &ldquo;Modulo schedule buffers,&rdquo; in <span class="cmti-10">MICRO 34: Proceedings of the</span>     <span class="cmti-10">34th Annual ACM/IEEE International Symposium on Microarchitecture</span>.  Washington, DC, USA: IEEE     Computer Society, 2001, pp. 138&ndash;149. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c19"></a>  (<a href="#c19.">19</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>M.&nbsp;Asal, E.&nbsp;Stotzer, and T.&nbsp;Hahn, &ldquo;VLIW optional fetch packet header extends instruction set     space,&rdquo; U.S. Patent 7,673,119, March 2010. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c20"></a>  (<a href="#c20.">20</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>D.&nbsp;Sule, E.&nbsp;Stotzer, and T.&nbsp;Hahn, &ldquo;Tiered register allocation,&rdquo; U.S. Patent App. 20070022413,     January 2007. </font>     </p>            ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span class="biblabel"><a name="c21"></a>  (<a href="#c21.">21</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>D.&nbsp;Sule and E.&nbsp;Stotzer, &ldquo;Technique for the calling of a sub-routine by a function using an intermediate sub-routine,&rdquo; U.S. Patent App. 20070016899, January 2007. </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c22"></a>  (<a href="#c22">22</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>EEMBC, The Embedded Microprocessor Benchmark Consortium. (Online). Available: <a href="http://www.eembc.org">http://www.eembc.org</a> </font>     </p>     <p><font face="Verdana" size="2"><span class="biblabel"><a name="c23"></a>  (<a href="#c23.">23</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>A.&nbsp;Charlesworth, &ldquo;An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family,&rdquo; <span class="cmti-10">IEEE Computer</span>, vol.&nbsp;14, no.&nbsp;3, pp. 18&ndash;27, 1981. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c24"></a>  (<a href="#c24.">24</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>R.&nbsp;F. Touzeau, &ldquo;A Fortran compiler for the FPS-164 scientific computer,&rdquo; in <span class="cmti-10">SIGPLAN &rsquo;84:</span> <span class="cmti-10">Proceedings of the 1984 SIGPLAN Symposium on Compiler Construction</span>.  New York, NY, USA: ACM, 1984, pp. 48&ndash;57. </font>     </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c25"></a>  (<a href="#c25.">25</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;R. Ellis, <span class="cmti-10">Bulldog: A Compiler for VLIW Architectures</span>.  Cambridge, MA, USA: MIT Press, 1986.     
</font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c26"></a>  (<a href="#c26.">26</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>P.&nbsp;G. Lowney, S.&nbsp;M. Freudenberger, T.&nbsp;J. Karzes, W.&nbsp;D. Lichtenstein, R.&nbsp;P. Nix, J.&nbsp;S. O&rsquo;Donnell, and J.&nbsp;Ruttenberg, &ldquo;The multiflow trace scheduling compiler,&rdquo; <span class="cmti-10">Journal of Supercomputing</span>, vol.&nbsp;7, no.     1-2, pp. 51&ndash;142, 1993. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c27"></a>  (<a href="#c27.">27</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>G.&nbsp;R. Beck, D.&nbsp;W.&nbsp;L. Yen, and T.&nbsp;L. Anderson, &ldquo;The Cydra 5 minisupercomputer: Architecture     and implementation,&rdquo; <span class="cmti-10">Journal of Supercomputing</span>, vol.&nbsp;7, no. 1-2, pp. 143&ndash;180, 1993. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c28"></a>  (<a href="#c28.">28</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;C.  Dehnert,  P.&nbsp;Y.-T.  Hsu,  and  J.&nbsp;P.  Bratt,  &ldquo;Overlapped  loop  support  in  the  Cydra  5,&rdquo;     in  <span class="cmti-10">ASPLOS-III:  Proceedings  of  the  Third  International  Conference  on  Architectural  Support  for</span>     <span class="cmti-10">Programming Languages and Operating Systems</span>.   New York, NY, USA: ACM, 1989, pp. 26&ndash;38. </font>      </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c29"></a>  (<a href="#c29">29</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;C. Dehnert and R.&nbsp;A. Towle, &ldquo;Compiling for the Cydra 5,&rdquo; <span class="cmti-10">Journal of Supercomputing</span>, vol.&nbsp;7,     no. 1-2, pp. 181&ndash;227, 1993. </font>     </p>            ]]></body>
<body><![CDATA[<!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c30"></a>  (<a href="#c30.">30</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>M.&nbsp;S.  Lam,  <span class="cmti-10">A  Systolic  Array  Optimizing  Compiler</span>.     Norwell,  MA,  USA:  Kluwer  Academic     Publishers, 1989.     </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c31"></a>  (<a href="#c31.">31</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>R.&nbsp;Cohn,  T.&nbsp;Gross,  and  M.&nbsp;Lam,  &ldquo;Architecture  and  compiler  tradeoffs  for  a  long  instruction     wordprocessor,&rdquo; in <span class="cmti-10">ASPLOS-III: Proceedings of the Third International Conference on Architectural</span>     <span class="cmti-10">Support for Programming Languages and Operating Systems</span>.   New York, NY, USA: ACM, 1989, pp.     2&ndash;14. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c32"></a>  (<a href="#c32.">32</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>J.&nbsp;Huck, D.&nbsp;Morris, J.&nbsp;Ross, A.&nbsp;Knies, H.&nbsp;Mulder, and R.&nbsp;Zahir, &ldquo;Introducing the IA-64 architecture,&rdquo; <span class="cmti-10">IEEE Micro</span>, vol.&nbsp;20, no.&nbsp;5, pp. 12&ndash;23, 2000. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c33"></a>  (<a href="#c33.">33</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>&ldquo;The SHARC Processor,&rdquo; Analog Devices Inc. (Online). Available: <a href="http://www.analog.com">http://www.analog.com</a> </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c34"></a>  (<a href="#c34.">34)</a><span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>&ldquo;The ST200 Processor,&rdquo; STMicroelectronics Inc. (Online). 
Available: <a href="http://www.st.com">http://www.st.com</a> </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c35"></a>  (<a href="#c35.">35</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>Carmel,   &ldquo;The   Carmel   DSP   processor,&rdquo;   Infineon   Technologies   AG.   (Online).   Available:     <a href="http://www.infineon.com">http://www.infineon.com</a> </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c36"></a>  (<a href="#c36.">36</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>Xtensa,     &ldquo;Xtensa     customizable     processors,&rdquo;     Tensilica     Inc.     (Online).     Available:     <a href="http://www.tensilica.com">http://www.tensilica.com</a> </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c37"></a>  (<a href="#c37.">37</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>C.&nbsp;H. Lin, Y.&nbsp;Xie, and W.&nbsp;Wolf, &ldquo;LZW-based code compression for VLIW embedded systems,&rdquo;     in  <span class="cmti-10">DATE  &rsquo;04:  Proceedings  of  the  Conference  on  Design,  Automation  and  Test  in  Europe</span>,  vol.&nbsp;3.     Washington, DC, USA: IEEE Computer Society, 2004, pp. 76&ndash;81. </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c38"></a>  (<a href="#c38.">38</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>M.&nbsp;Ros and P.&nbsp;Sutton, &ldquo;Compiler optimization and ordering effects on VLIW code compression,&rdquo; in     <span class="cmti-10">CASES &rsquo;03: Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis</span>     <span class="cmti-10">for Embedded Systems</span>.   New York, NY, USA: ACM Press, 2003, pp. 95&ndash;103. </font>     </p>            ]]></body>
<body><![CDATA[<!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c39"></a>  (<a href="#c39.">39</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span><span class="cmti-10">ARM7TDMI (Rev. 4) Technical Reference Manual</span>, ARM Limited, 2001.     </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c40"></a>  (<a href="#c40.">40</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>R.&nbsp;Phelan, &ldquo;Improving ARM code density and performance,&rdquo; ARM Limited, Tech. Rep., 2003. </font>      </p>            <!-- ref --><p><font face="Verdana" size="2"><span class="biblabel"><a name="c41"></a>  (<a href="#c41.">41</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span><span class="cmti-10">MIPS32 Architecture for Programmers, Vol. IV-a: The MIPS16 Application Specific Extension to</span>     <span class="cmti-10">the MIPS32 Architecture</span>, MIPS Technologies, 2001.     </font>     </p>            <p><font face="Verdana" size="2"><span class="biblabel"><a name="c42"></a>  (<a href="#c42.">42</a>)<span class="bibsp">&nbsp;&nbsp;&nbsp;</span></span>B.&nbsp;R. Rau, M.&nbsp;S. Schlansker, and P.&nbsp;P. Tirumalai, &ldquo;Code generation schema for modulo scheduled     loops,&rdquo; in <span class="cmti-10">MICRO 25: Proceedings of the 25th Annual International Symposium on Microarchitecture</span>.     Los Alamitos, CA, USA: IEEE Computer Society Press, 1992, pp. 158&ndash;169. </font> </p>       </div>             ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Fisher]]></surname>
<given-names><![CDATA[J. A.]]></given-names>
</name>
<name>
<surname><![CDATA[Faraboschi]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
<name>
<surname><![CDATA[Young]]></surname>
<given-names><![CDATA[C]]></given-names>
</name>
</person-group>
<source><![CDATA[Embedded Computing : A VLIW Approach to Architecture, Compilers and Tools]]></source>
<year>2004</year>
<month>December</month>
<publisher-name><![CDATA[Morgan Kaufmann]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Aditya]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Mahlke]]></surname>
<given-names><![CDATA[S. A]]></given-names>
</name>
<name>
<surname><![CDATA[Rau]]></surname>
<given-names><![CDATA[B. R]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Code size minimization and retargetable assembly for custom EPIC and VLIW instruction formats,&#8221;]]></article-title>
<source><![CDATA[ACM Transactions on Design Automation of Electronic Systems]]></source>
<year>2000</year>
<volume>5</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>752-773</page-range></nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Hahn]]></surname>
<given-names><![CDATA[T. T.]]></given-names>
</name>
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E. J]]></given-names>
</name>
<name>
<surname><![CDATA[Sule]]></surname>
<given-names><![CDATA[D]]></given-names>
</name>
<name>
<surname><![CDATA[Asal]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Compilation strategies for reducing code size on a VLIW processor with variable length instructions,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ HiPEAC]]></conf-name>
<conf-date>2008</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="book">
<source><![CDATA[TMS320C6000 Optimizing Compiler User&#8217;s Guide]]></source>
<year>2008</year>
<publisher-name><![CDATA[Texas Instruments, Inc]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Aho]]></surname>
<given-names><![CDATA[A. V]]></given-names>
</name>
<name>
<surname><![CDATA[Sethi]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Ullman]]></surname>
<given-names><![CDATA[J. D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Compilers: Principles, Techniques, and Tools]]></source>
<year>1986</year>
<publisher-loc><![CDATA[Boston, MA]]></publisher-loc>
<publisher-name><![CDATA[Addison-Wesley Longman Publishing Co.]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Allen]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Kennedy]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
</person-group>
<source><![CDATA[Optimizing Compilers for Modern Architectures]]></source>
<year>2002</year>
<publisher-loc><![CDATA[San Francisco ]]></publisher-loc>
<publisher-name><![CDATA[Morgan Kaufmann Publishers Inc]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cooper]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
<name>
<surname><![CDATA[Torczon]]></surname>
<given-names><![CDATA[L]]></given-names>
</name>
</person-group>
<source><![CDATA[Engineering a Compiler]]></source>
<year>2003</year>
<publisher-loc><![CDATA[San Francisco ]]></publisher-loc>
<publisher-name><![CDATA[Morgan Kaufmann Publishers Inc]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Muchnick]]></surname>
<given-names><![CDATA[S. S.]]></given-names>
</name>
</person-group>
<source><![CDATA[Advanced Compiler Design and Implementation]]></source>
<year>1997</year>
<publisher-loc><![CDATA[San Francisco ]]></publisher-loc>
<publisher-name><![CDATA[Morgan Kaufmann Publishers Inc]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E. J.]]></given-names>
</name>
<name>
<surname><![CDATA[Granston]]></surname>
<given-names><![CDATA[E. D.]]></given-names>
</name>
<name>
<surname><![CDATA[Ward]]></surname>
<given-names><![CDATA[A. S.]]></given-names>
</name>
</person-group>
<source><![CDATA[Methods and apparatus for reducing the size of code with an exposed pipeline by encoding NOP operations as instruction operands]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Davis]]></surname>
<given-names><![CDATA[A. L.]]></given-names>
</name>
<name>
<surname><![CDATA[Scales]]></surname>
<given-names><![CDATA[R. H.]]></given-names>
</name>
<name>
<surname><![CDATA[Seshan]]></surname>
<given-names><![CDATA[N]]></given-names>
</name>
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E. J.]]></given-names>
</name>
<name>
<surname><![CDATA[Tatge]]></surname>
<given-names><![CDATA[R. E]]></given-names>
</name>
</person-group>
<source><![CDATA[Microprocessor with an instruction immediately next to a branch instruction for adding a constant to a program counter]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E. J.]]></given-names>
</name>
<name>
<surname><![CDATA[Leiss]]></surname>
<given-names><![CDATA[E. L]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Instruction encoding schemes that reduce code size on a VLIW processor]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ CLEI &#8217;10: Proceedings of the Conferencia Latinoamericana de Informática]]></conf-name>
<conf-date>October 2010</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Granston]]></surname>
<given-names><![CDATA[E. D]]></given-names>
</name>
<name>
<surname><![CDATA[Zbiciak]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E. J]]></given-names>
</name>
</person-group>
<source><![CDATA[Method for software pipelining of irregular conditional control loops]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rau]]></surname>
<given-names><![CDATA[B. R.]]></given-names>
</name>
<name>
<surname><![CDATA[Glaeser]]></surname>
<given-names><![CDATA[C. D.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ MICRO 14: Proceedings of the 14th Annual Workshop on Microprogramming]]></conf-name>
<conf-date>1981</conf-date>
<conf-loc>Piscataway NJ</conf-loc>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E. J]]></given-names>
</name>
<name>
<surname><![CDATA[Leiss]]></surname>
<given-names><![CDATA[E. L]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Modulo scheduling without overlapped lifetimes,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ LCTES &#8217;09: Proceedings of the 2009 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems]]></conf-name>
<conf-date>2009</conf-date>
<conf-loc>New York NY</conf-loc>
</nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Granston]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Scales]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Ward]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Zbiciak]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Controlling code size of software-pipelined loops on the TMS320C6000 VLIW DSP architecture,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ MSP-3: Proceedings of the 3rd IEEE/ACM Workshop on Media and Streaming Processors]]></conf-name>
<conf-date>2001</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Jayapala]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Barat]]></surname>
<given-names><![CDATA[F]]></given-names>
</name>
<name>
<surname><![CDATA[Vander Aa]]></surname>
<given-names><![CDATA[T]]></given-names>
</name>
<name>
<surname><![CDATA[Catthoor]]></surname>
<given-names><![CDATA[F]]></given-names>
</name>
<name>
<surname><![CDATA[Corporaal]]></surname>
<given-names><![CDATA[H]]></given-names>
</name>
<name>
<surname><![CDATA[Deconinck]]></surname>
<given-names><![CDATA[G]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Clustered loop buffer organization for low energy VLIW embedded processors]]></article-title>
<source><![CDATA[IEEE Transactions on Computing]]></source>
<year>2005</year>
<volume>54</volume>
<numero>6</numero>
<issue>6</issue>
<page-range>672-683</page-range></nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E. J]]></given-names>
</name>
<name>
<surname><![CDATA[Leiss]]></surname>
<given-names><![CDATA[E. L]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Compiler and hardware support for reducing the code size of software pipelined loops,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ CLEI &#8217;11: Proceedings of the Conferencia Latinoamericana de Informática]]></conf-name>
<conf-date>October 2011</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Merten]]></surname>
<given-names><![CDATA[M. C.]]></given-names>
</name>
<name>
<surname><![CDATA[Hwu]]></surname>
<given-names><![CDATA[W.-m. W]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Modulo schedule buffers,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ MICRO 34: Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture]]></conf-name>
<conf-date>2001</conf-date>
<conf-loc>Washington DC</conf-loc>
</nlm-citation>
</ref>
<ref id="B19">
<label>19</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Asal]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Hahn]]></surname>
<given-names><![CDATA[T]]></given-names>
</name>
</person-group>
<source><![CDATA[VLIW optional fetch packet header extends instruction set space]]></source>
<year>2010</year>
</nlm-citation>
</ref>
<ref id="B20">
<label>20</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sule]]></surname>
<given-names><![CDATA[D]]></given-names>
</name>
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Hahn]]></surname>
<given-names><![CDATA[T]]></given-names>
</name>
</person-group>
<source><![CDATA[Tiered register allocation]]></source>
<year>2007</year>
</nlm-citation>
</ref>
<ref id="B21">
<label>21</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sule]]></surname>
<given-names><![CDATA[D]]></given-names>
</name>
<name>
<surname><![CDATA[Stotzer]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
</person-group>
<source><![CDATA[Technique for the calling of a sub-routine by a function using an intermediate sub-routine]]></source>
<year>2007</year>
</nlm-citation>
</ref>
<ref id="B22">
<label>22</label><nlm-citation citation-type="">
<source><![CDATA[EEMBC, The Embedded Microprocessor Benchmark Consortium.]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B23">
<label>23</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Charlesworth]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family,&#8221;]]></article-title>
<source><![CDATA[IEEE Computer]]></source>
<year>1981</year>
<volume>14</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>18-27</page-range></nlm-citation>
</ref>
<ref id="B24">
<label>24</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Touzeau]]></surname>
<given-names><![CDATA[R. F]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;A Fortran compiler for the FPS-164 scientific computer,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ SIGPLAN &#8217;84: Proceedings of the 1984 SIGPLAN Symposium on Compiler Construction]]></conf-name>
<conf-date>1984</conf-date>
<conf-loc>New York, NY</conf-loc>
</nlm-citation>
</ref>
<ref id="B25">
<label>25</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ellis]]></surname>
<given-names><![CDATA[J. R]]></given-names>
</name>
</person-group>
<source><![CDATA[Bulldog: A Compiler for VLIW Architectures]]></source>
<year>1986</year>
<publisher-loc><![CDATA[Cambridge, MA]]></publisher-loc>
<publisher-name><![CDATA[MIT Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B26">
<label>26</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lowney]]></surname>
<given-names><![CDATA[P. G]]></given-names>
</name>
<name>
<surname><![CDATA[Freudenberger]]></surname>
<given-names><![CDATA[S. M]]></given-names>
</name>
<name>
<surname><![CDATA[Karzes]]></surname>
<given-names><![CDATA[T. J]]></given-names>
</name>
<name>
<surname><![CDATA[Lichtenstein]]></surname>
<given-names><![CDATA[W. D]]></given-names>
</name>
<name>
<surname><![CDATA[Nix]]></surname>
<given-names><![CDATA[R. P]]></given-names>
</name>
<name>
<surname><![CDATA[O&#8217;Donnell]]></surname>
<given-names><![CDATA[J. S]]></given-names>
</name>
<name>
<surname><![CDATA[Ruttenberg]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220; The multiflow trace scheduling compiler,&#8221;]]></article-title>
<source><![CDATA[Journal of Supercomputing]]></source>
<year>1993</year>
<volume>7</volume>
<numero>1-2</numero>
<issue>1-2</issue>
<page-range>51-142</page-range></nlm-citation>
</ref>
<ref id="B27">
<label>27</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Beck]]></surname>
<given-names><![CDATA[G. R]]></given-names>
</name>
<name>
<surname><![CDATA[Yen]]></surname>
<given-names><![CDATA[D. W. L]]></given-names>
</name>
<name>
<surname><![CDATA[Anderson]]></surname>
<given-names><![CDATA[T. L]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;The Cydra 5 minisupercomputer: Architecture and implementation,&#8221;]]></article-title>
<source><![CDATA[Journal of Supercomputing]]></source>
<year>1993</year>
<volume>7</volume>
<numero>1-2</numero>
<issue>1-2</issue>
<page-range>143-180</page-range></nlm-citation>
</ref>
<ref id="B28">
<label>28</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Dehnert]]></surname>
<given-names><![CDATA[J. C]]></given-names>
</name>
<name>
<surname><![CDATA[Hsu]]></surname>
<given-names><![CDATA[P. Y.-T]]></given-names>
</name>
<name>
<surname><![CDATA[Bratt]]></surname>
<given-names><![CDATA[J. P]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Overlapped loop support in the Cydra 5,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ ASPLOS-III: Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems]]></conf-name>
<conf-date>1989</conf-date>
<conf-loc>New York NY</conf-loc>
</nlm-citation>
</ref>
<ref id="B29">
<label>29</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Dehnert]]></surname>
<given-names><![CDATA[J. C.]]></given-names>
</name>
<name>
<surname><![CDATA[Towle]]></surname>
<given-names><![CDATA[R. A]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Compiling for the Cydra 5,&#8221;]]></article-title>
<source><![CDATA[Journal of Supercomputing]]></source>
<year>1993</year>
<volume>7</volume>
<numero>1-2</numero>
<issue>1-2</issue>
<page-range>181-227</page-range></nlm-citation>
</ref>
<ref id="B30">
<label>30</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lam]]></surname>
<given-names><![CDATA[M. S]]></given-names>
</name>
</person-group>
<source><![CDATA[A Systolic Array Optimizing Compiler]]></source>
<year>1989</year>
<publisher-loc><![CDATA[Norwell, MA]]></publisher-loc>
<publisher-name><![CDATA[Kluwer Academic Publishers]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B31">
<label>31</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cohn]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Gross]]></surname>
<given-names><![CDATA[T]]></given-names>
</name>
<name>
<surname><![CDATA[Lam]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Architecture and compiler tradeoffs for a long instruction word processor]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ ASPLOS-III: Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems]]></conf-name>
<conf-date>1989</conf-date>
<conf-loc>New York NY</conf-loc>
</nlm-citation>
</ref>
<ref id="B32">
<label>32</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Huck]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Morris]]></surname>
<given-names><![CDATA[D]]></given-names>
</name>
<name>
<surname><![CDATA[Ross]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Knies]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Mulder]]></surname>
<given-names><![CDATA[H]]></given-names>
</name>
<name>
<surname><![CDATA[Zahir]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Introducing the IA-64 architecture,]]></article-title>
<source><![CDATA[IEEE Micro]]></source>
<year>2000</year>
<volume>20</volume>
<numero>5</numero>
<issue>5</issue>
<page-range>12-23</page-range></nlm-citation>
</ref>
<ref id="B33">
<label>33</label><nlm-citation citation-type="">
<collab>Analog Devices Inc</collab>
<source><![CDATA[The SHARC Processor]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B34">
<label>34</label><nlm-citation citation-type="">
<collab>STMicroelectronics Inc</collab>
<source><![CDATA[The ST200 Processor]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B35">
<label>35</label><nlm-citation citation-type="">
<collab>Infineon Technologies AG</collab>
<source><![CDATA[The Carmel DSP processor]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B36">
<label>36</label><nlm-citation citation-type="">
<collab>Tensilica Inc</collab>
<source><![CDATA[Xtensa customizable processors]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B37">
<label>37</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lin]]></surname>
<given-names><![CDATA[C. H]]></given-names>
</name>
<name>
<surname><![CDATA[Xie]]></surname>
<given-names><![CDATA[Y]]></given-names>
</name>
<name>
<surname><![CDATA[Wolf]]></surname>
<given-names><![CDATA[W]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;LZW-based code compression for VLIW embedded systems,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ DATE &#8217;04: Proceedings of the Conference on Design, Automation and Test in Europe]]></conf-name>
<conf-date>2004</conf-date>
<conf-loc>Washington DC</conf-loc>
</nlm-citation>
</ref>
<ref id="B38">
<label>38</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ros]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Sutton]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[&#8220;Compiler optimization and ordering effects on VLIW code compression,&#8221;]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ CASES &#8217;03: Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems]]></conf-name>
<conf-date>2003</conf-date>
<conf-loc>New York NY</conf-loc>
</nlm-citation>
</ref>
<ref id="B39">
<label>39</label><nlm-citation citation-type="">
<collab>ARM Limited</collab>
<source><![CDATA[ARM7TDMI (Rev. 4) Technical Reference Manual]]></source>
<year>2001</year>
</nlm-citation>
</ref>
<ref id="B40">
<label>40</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Phelan]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
</person-group>
<source><![CDATA[Improving ARM code density and performance]]></source>
<year>2003</year>
<publisher-name><![CDATA[ARM Limited]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B41">
<label>41</label><nlm-citation citation-type="">
<collab>MIPS Technologies</collab>
<source><![CDATA[MIPS32 Architecture for Programmers, Vol. IV-a: The MIPS16 Application Specific Extension to the MIPS32 Architecture]]></source>
<year>2001</year>
</nlm-citation>
</ref>
<ref id="B42">
<label>42</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rau]]></surname>
<given-names><![CDATA[B. R]]></given-names>
</name>
<name>
<surname><![CDATA[Schlansker]]></surname>
<given-names><![CDATA[M. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Tirumalai]]></surname>
<given-names><![CDATA[P. P.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Code generation schema for modulo scheduled loops]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ MICRO 25: Proceedings of the 25th Annual International Symposium on Microarchitecture]]></conf-name>
<conf-date>1992</conf-date>
<conf-loc>Los Alamitos CA</conf-loc>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
