<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0717-5000</journal-id>
<journal-title><![CDATA[CLEI Electronic Journal]]></journal-title>
<abbrev-journal-title><![CDATA[CLEIej]]></abbrev-journal-title>
<issn>0717-5000</issn>
<publisher>
<publisher-name><![CDATA[Centro Latinoamericano de Estudios en Informática]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0717-50002012000300007</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Facial Recognition Using Neural Networks over GPGPU]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Balarini]]></surname>
<given-names><![CDATA[Juan Pablo]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Nesmachnow]]></surname>
<given-names><![CDATA[Sergio]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Rodríguez]]></surname>
<given-names><![CDATA[Martín]]></given-names>
</name>
<xref ref-type="aff" rid="A03"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Universidad de la República]]></institution>
<addr-line><![CDATA[Montevideo]]></addr-line>
<country>Uruguay</country>
</aff>
<aff id="A02">
<institution><![CDATA[Universidad de la República]]></institution>
<addr-line><![CDATA[Montevideo]]></addr-line>
<country>Uruguay</country>
</aff>
<aff id="A03">
<institution><![CDATA[Universidad de la República]]></institution>
<addr-line><![CDATA[Montevideo]]></addr-line>
<country>Uruguay</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>12</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>12</month>
<year>2012</year>
</pub-date>
<volume>15</volume>
<numero>3</numero>
<fpage>6</fpage>
<lpage>6</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_arttext&amp;pid=S0717-50002012000300007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_abstract&amp;pid=S0717-50002012000300007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.edu.uy/scielo.php?script=sci_pdf&amp;pid=S0717-50002012000300007&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[This article introduces a parallel neural network approach implemented over Graphic Processing Units (GPU) to solve a facial recognition problem, which consists of deciding where the face of a person in a given image is pointing. The proposed method uses the parallel capabilities of the GPU to train and evaluate a neural network that solves the aforementioned problem. The experimental evaluation demonstrates that a significant reduction in computing times can be obtained, allowing large instances to be solved in reasonable time. A speedup greater than 8 is achieved when compared with a sequential implementation, and a classification rate above 85% is also obtained.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[Este artículo introduce una red neuronal implementada sobre una unidad de procesamiento gráfico (GPU) para resolver un problema de reconocimiento facial, que consiste en decidir hacia dónde apunta la cara de cierta persona en una imagen. El método propuesto utiliza la naturaleza paralela de la GPU para entrenar y evaluar una red neuronal utilizada para resolver el problema antes mencionado. Los resultados experimentales demuestran que se obtiene una reducción significativa en los tiempos de cómputo, permitiendo resolver instancias grandes de imágenes en tiempos razonables. Se alcanzan valores de speedup mayores a 8 al contrastar la implementación propuesta con una secuencial, y se obtienen tasas de clasificación mayores a 85%.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Face recognition]]></kwd>
<kwd lng="en"><![CDATA[Neural Networks]]></kwd>
<kwd lng="en"><![CDATA[Parallel Computing]]></kwd>
<kwd lng="en"><![CDATA[GPGPU]]></kwd>
<kwd lng="es"><![CDATA[Reconocimiento Facial]]></kwd>
<kwd lng="es"><![CDATA[Redes Neuronales]]></kwd>
<kwd lng="es"><![CDATA[Computación Paralela]]></kwd>
<kwd lng="es"><![CDATA[GPGPU]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <div class="Section1">      <p><b><span lang="EN-US"><font face="Verdana" size="4">Facial Recognition Using Neural Networks over GPGPU</font></span></b><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><b style=""><span lang="ES-UY"> <font face="Verdana" size="2">Juan Pablo Balarini</font></span><o:p></o:p></b></p>       <p style="margin-bottom: 0.0001pt;"><span lang="ES-UY"> <font face="Verdana" size="2">Universidad de la Rep&uacute;blica, Facultad de Ingenier&iacute;a,</font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><span lang="EN-US"> <font face="Verdana" size="2">Montevideo, Uruguay, 11300</font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><span lang="EN-US"> <font face="Verdana" size="2"><a href="mailto:jbala87@gmail.com">jbala87@gmail.com</a></font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><span lang="EN-US"> <font face="Verdana" size="2">&nbsp;</font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><b style=""><span lang="EN-US"> <font face="Verdana" size="2">Sergio Nesmachnow</font></span><o:p></o:p></b></p>       <p style="margin-bottom: 0.0001pt;"><span lang="ES-UY"> <font face="Verdana" size="2">Universidad de la Rep&uacute;blica, Facultad de Ingenier&iacute;a,</font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><span lang="ES-UY"> <font face="Verdana" size="2">Montevideo, Uruguay, 11300</font></span><o:p></o:p></p>       ]]></body>
<body><![CDATA[<p style="margin-bottom: 0.0001pt;"><font face="Verdana"><span style="" lang="ES-UY"><a href="mailto:sergion@fing.edu.uy"> <span style="font-size: 10pt;">sergion@fing.edu.uy</span></a></span></font><span style="font-size: 12pt;" lang="ES-UY"><o:p></o:p></span></p>       <p style="margin-bottom: 0.0001pt;"><span lang="ES-UY"> <font face="Verdana" size="2">&nbsp;</font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><span lang="ES-UY"> <font face="Verdana" size="2">and</font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><span lang="ES-UY"> <font face="Verdana" size="2">&nbsp;</font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><b style=""><span lang="ES-UY"> <font face="Verdana" size="2">Mart&iacute;n Rodr&iacute;guez</font></span><o:p></o:p></b></p>       <p style="margin-bottom: 0.0001pt;"><span lang="ES-UY"> <font face="Verdana" size="2">Universidad de la Rep&uacute;blica, Facultad de Ingenier&iacute;a,</font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><span lang="EN-US"> <font face="Verdana" size="2">Montevideo, Uruguay, 11300</font></span><o:p></o:p></p>       <p style="margin-bottom: 0.0001pt;"><font face="Verdana"><span lang="EN-US"><a href="mailto:martinr87@gmail.com"> <span style="font-size: 10pt;">martinr87@gmail.com</span></a></span></font><span style="font-size: 12pt;" lang="EN-US"><o:p></o:p></span></p>       <p style="text-indent: 0cm;"><b style=""><span lang="EN-US"> <font face="Verdana" size="2">Abstract.</font><o:p></o:p></span></b></p>       <p style="text-indent: 0cm;"><span lang="EN-US"><font face="Verdana" size="2">This article introduces a parallel neural network approach implemented over Graphic Processing Units (GPU) to solve a facial recognition problem, which consists in deciding where the face of a person in a certain image is pointing. 
The proposed method uses the parallel capabilities of the GPU to train and evaluate a neural network that solves the aforementioned problem.]]></body>
<body><![CDATA[<br>  The experimental evaluation demonstrates that a significant reduction in computing times can be obtained, allowing large instances to be solved in reasonable time. A speedup greater than 8 is achieved when compared with a sequential implementation, and a classification rate above 85% is also obtained.</font><span style=""><o:p></o:p></span></span></p>       <p style="text-indent: 0cm;"><font face="Verdana" size="2"><span style="" lang="ES-UY"> <b>Spanish abstract</b>     <br>      <br>  Este art&iacute;culo introduce una red neuronal implementada sobre una unidad de procesamiento gr&aacute;fico (GPU) para resolver un problema de reconocimiento facial, que consiste en decidir hacia d&oacute;nde apunta la cara de cierta persona en una imagen. El m&eacute;todo propuesto utiliza la naturaleza paralela de la GPU para entrenar y evaluar una red neuronal utilizada para resolver el problema antes mencionado.     <br>   Los resultados experimentales demuestran que se obtiene una reducci&oacute;n significativa en los tiempos de c&oacute;mputo, permitiendo resolver instancias grandes de im&aacute;genes en tiempos razonables. Se alcanzan valores de speedup mayores a 8 al contrastar la implementaci&oacute;n propuesta con una secuencial, y se obtienen tasas de clasificaci&oacute;n mayores a 85%.</span></font><b style=""><span style="" lang="ES-UY"><o:p></o:p></span></b></p>        <p><font face="Verdana" size="2"><b style=""><span lang="EN-US">Keywords:</span></b><span lang="EN-US"> Face recognition, Neural Networks, Parallel Computing, GPGPU. 
</span> </font></p>       <p><font face="Verdana" size="2"><span lang="EN-US"><b>Spanish keywords:</b> Reconocimiento Facial, Redes Neuronales, Computaci&oacute;n Paralela, GPGPU.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">Received: 2012-06-10 Revised 2012-10-01 Accepted 2012-10-04</span></font></p>       <p><span style="" lang="EN-US"><font face="Verdana" size="2">1</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><font face="Verdana" size="2">Introduction</font><o:p></o:p></span></p>       <p><font face="Verdana" size="2"><span lang="EN-US">Face recognition can be described as the ability to recognize people given some set of facial characteristics. Nowadays, it has become a popular area of research in computer vision and image analysis, mainly because we can find such recognition systems in objects of everyday life such as cellphones, security systems, laptops, PCs, etc. <a name="r21."></a>(<a href="#r21">21</a>,<a name="r22."></a><a href="#r22">22</a>). Another key element is that the high computing power now available makes these image recognition systems possible.</span></font></p>       ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span lang="EN-US">Using an image of a human face, an algorithm is proposed to evaluate and decide where that face is pointing. Each image is classified into one of four classes according to the direction in which it is facing (those classes are: left, right, up, and straight).</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">For certain types of problems, artificial neural networks (ANN) have proven to be one of the most effective learning methods <a name="r2."></a><a href="#r2">(2)</a>. They are built of complex webs of interconnected neurons, where each unit takes a number of real-valued inputs and produces a single real-valued output. Backpropagation is the most commonly used ANN learning technique; it is appropriate for problems where the target function to be learned is defined over instances that can be described by a vector of predefined features (such as pixel values), and where the target function output may be discrete-valued, real-valued, or a vector of several real or discrete-valued attributes. Additionally, the training examples may contain errors, and fast evaluation of the learned function may be required. All this makes ANNs a good option for image recognition problems. A survey of practical applications of ANNs can be found in<a name="r14."></a> <a href="#r14">(14)</a>. </font> <b style=""><span style="color: red;"><o:p></o:p></span></b></span></p>       <p><font face="Verdana" size="2"><span lang="EN-US">One of the main drawbacks of ANNs is the time needed to perform the training phase, which is generally quite high for complex problems. As the number of hidden layers and neurons grows, the time required for the learning process of the ANN and for the evaluation of a new instance grows exponentially. On the other hand, the rate of successful classification of new instances increases as well. 
Note that generally, the more training examples the network is provided, the more effective it will be (and the longer the training will take, too). Therefore, it is of special interest to perform training with a large number of neurons in the hidden layer and with a significant number of training examples, but with a relatively low training time.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">It is interesting to note that every neuron in each layer can make its calculations independently of the others in the same layer. This means that for any given layer, parallel computations can take place, and some parallel architecture could be used to take advantage of this.</span></font></p>       <p><font face="Verdana"><span lang="EN-US"><font size="2">Promising work is being done in the area of general-purpose GPU computing <a name="r20."></a><a href="#r20">(20)</a>, principally on problems of a parallel nature. GPU implementations allow obtaining significant reductions in the execution times of complex problems when compared with traditional sequential implementations on CPU <a name="r9."></a><a href="#r9">(9)</a>. Despite the fact that GPUs were originally designed for the sole purpose of rendering computer graphics, they have evolved into a general-purpose computing platform with enough power and flexibility to make many computationally intensive applications perform better than on a CPU <a name="r12."></a><a name="r13."></a>(<a href="#r12">12</a>,<a href="#r13">13</a>). This can be explained by the significant performance disparity between GPUs and CPUs, which grows every year. 
In this work, we propose an algorithm that takes advantage of this parallel architecture to outperform a </font> </span><span style="color: black;" lang="EN-US"> <font size="2">sequential</font></span><span style="font-size: 10pt;" lang="EN-US"> </span><span lang="EN-US"> <font size="2">implementation.</font></span></font></p>       <p><font face="Verdana"><span lang="EN-US"><font size="2">This work lies in the fields of machine learning, high-performance parallel computing, and the use of graphics processing units for general-purpose computing, as it develops an algorithm that significantly improves the ANN training and classification times when contrasted with a </font> </span><span style="color: black;" lang="EN-US"><font size="2">sequential</font></span><span style="font-size: 10pt;" lang="EN-US"> </span><span lang="EN-US"> <font size="2">algorithm.</font></span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The main contributions of this article are a parallel face recognition algorithm that achieves good classification rates in reasonable execution times and that can be easily modified to recognize other features of a human face without changing those execution times. Furthermore, the article demonstrates that the GPGPU platform is a very good option when it is desired to improve the execution time of a certain problem.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The rest of the paper is organized as follows. Section 2 introduces GPU computing and the CUDA programming model. Section 3 presents a conceptual framework; then, Section 4 presents related work. Section 5 introduces the proposed solution and provides implementation details. 
Section 6 reports the experimental analysis and, finally, Section 7 presents the conclusions and future work.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">2</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">GPU Computing</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">GPUs were originally designed to perform only the graphics processing in computers, allowing the Central Processing Unit (CPU) to concentrate on the remaining computations. Nowadays, GPUs have considerable computing power, provided by hundreds of processing units with reasonably fast clock frequencies. In the last ten years, GPUs have been used as a powerful parallel hardware architecture to achieve efficiency in the execution of applications.</span></font></p>       ]]></body>
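<body><![CDATA[<p><font face="Verdana" size="2"><span lang="EN-US">As a minimal illustration of the data-parallel pattern that this hardware exploits, the following Python sketch (a sequential emulation written for this explanation; the function name is ours and does not belong to any GPU API) splits a workload into independent per-element operations, one slice per emulated processing unit:</span></font></p>

```python
# Sequential emulation of a data-parallel computation: the same operation is
# applied independently to every element, the pattern that maps well onto the
# hundreds of processing units of a GPU.

def data_parallel_map(op, data, n_units=8):
    """Split `data` into one slice per (emulated) processing unit and
    apply `op` to every element of each slice independently."""
    size = (len(data) + n_units - 1) // n_units   # slice length (ceiling division)
    result = [None] * len(data)
    for unit in range(n_units):                   # each unit could run concurrently
        for i in range(unit * size, min((unit + 1) * size, len(data))):
            result[i] = op(data[i])               # no dependency between elements
    return result

squares = data_parallel_map(lambda x: x * x, list(range(10)))
```

<p><font face="Verdana" size="2"><span lang="EN-US">Because no element depends on another, the slices can be processed in any order or fully in parallel, which is precisely what GPU architectures exploit.</span></font></p>]]></body>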
<body><![CDATA[<p><span lang="EN-US"><font face="Verdana" size="2">2.1</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">GPU Programming and CUDA </span> </font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">Ten years ago, when GPUs were first used to perform general-purpose computation, they were programmed using low-level mechanisms such as the interruption services of the BIOS, or by using graphic APIs such as OpenGL and DirectX. Later, the programs for GPU were developed in assembly language for each card model, and they had very limited portability. So, high-level languages were developed to fully exploit the capabilities of the GPUs. In 2007, NVIDIA introduced CUDA<a name="r11."></a> <a href="#r11">(11)</a>, a software architecture for managing the GPU as a parallel computing device without requiring the mapping of data and computation into a graphic API.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">CUDA is based on an extension of the C language, and it is available for GeForce 8 Series and later graphics cards. Three software layers are used in CUDA to communicate with the GPU (see <a href="#f1">Fig. 1</a>): a low-level hardware driver that performs the CPU-GPU data communications, a high-level API, and a set of libraries such as CUBLAS for linear algebra and CUFFT for Fourier transforms.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">&nbsp;</font><o:p></o:p></span></p>        <p><font face="Verdana" size="2"><b style=""> <a name="f1"> <img src="/img/revistas/cleiej/v15n3/3a07f1.jpg"> </a>     <br>  <span lang="EN-US">Fig. 
</span></b><span lang="EN-US"><span style=""><b style="">1</b></span><b style="">.</b> CUDA<i style=""> Architecture</i></span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">For the CUDA programmer, the GPU is a computing device able to execute a large number of threads in parallel. A procedure to be executed many times over different data can be isolated in a GPU-function using many execution threads. The function is compiled using a specific set of instructions and the resulting program (<i style="">kernel</i>) is loaded in the GPU. The GPU has its own DRAM, and the data are copied from it to the RAM of the host (and vice versa) using optimized calls to the CUDA API.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The CUDA architecture is built around a scalable array of multiprocessors, each one with eight scalar processors, one multithreading unit, and a shared memory chip. The multiprocessors are able to create, manage, and execute parallel threads with reduced overhead. The threads are grouped in <i style="">blocks</i> (with up to 512 threads), which are executed in a single multiprocessor, and the blocks are grouped in <i style="">grids</i>. When a CUDA program calls a grid to be executed in the GPU, each one of the blocks in the grid is numbered and distributed to an available multiprocessor. The multiprocessor receives a block and splits its threads in <i style="">warps</i>, sets of 32 consecutive threads. Each warp executes a single instruction at a time, so the best efficiency is achieved when the 32 threads in the warp execute the same instruction. 
Each time a block finishes its execution, a new block is assigned to the available multiprocessor.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The threads access the data using three memory spaces: a <i style="">shared memory</i> used by the threads in the block; the <i style="">local memory</i> of the thread; and the <i style="">global memory</i> of the GPU. Minimizing the accesses to the slower memory spaces (the local memory of the thread and the global memory of the GPU) is very important for achieving efficiency. On the other hand, the shared memory is placed within the GPU chip, thus it provides a faster way to store the data.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">3</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Face Recognition Using Artificial Neural Networks in GPU</span></font></p>       ]]></body>
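<body><![CDATA[<p><font face="Verdana" size="2"><span lang="EN-US">The thread hierarchy of section 2.1 (grids of blocks, blocks of up to 512 threads, warps of 32 consecutive threads) can be sketched in Python before applying it to neural networks. This is an emulation for illustration only; launch_grid and warps_in_block are hypothetical helper names of ours, not CUDA API calls:</span></font></p>

```python
# Emulation of a 1-D CUDA-style launch: the global thread index is computed as
# blockIdx * blockDim + threadIdx, with a guard for out-of-range threads, and
# each block is split into warps of 32 consecutive threads.

BLOCK_SIZE = 512   # maximum threads per block described in the text
WARP_SIZE = 32     # threads per warp

def launch_grid(n_elements, kernel):
    """Emulate launching a grid with one thread per data element."""
    n_blocks = (n_elements + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    for block_idx in range(n_blocks):
        for thread_idx in range(BLOCK_SIZE):
            i = block_idx * BLOCK_SIZE + thread_idx  # global thread index
            if i < n_elements:                       # guard: surplus threads idle
                kernel(i)

def warps_in_block():
    """Group a block's 512 threads into warps of 32 consecutive threads."""
    threads = list(range(BLOCK_SIZE))
    return [threads[w:w + WARP_SIZE] for w in range(0, BLOCK_SIZE, WARP_SIZE)]

# Example kernel: element-wise vector addition, one thread per element.
a = [1.0] * 1000
b = [2.0] * 1000
c = [0.0] * 1000
launch_grid(len(c), lambda i: c.__setitem__(i, a[i] + b[i]))
```

<p><font face="Verdana" size="2"><span lang="EN-US">The guard on the global index mirrors what a real kernel does when the number of data elements is not a multiple of the block size: threads of the last block whose index falls outside the data simply do no work.</span></font></p>]]></body>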
<body><![CDATA[<p><span lang="EN-US"><font face="Verdana" size="2">3.1</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Face Pointing Direction</span></font></p>       <p style="text-indent: 0cm;"><font face="Verdana" size="2"><span lang="EN-US">The face pointing direction problem consists of recognizing where a human face is pointing (up, left, right, or straight) in a certain image. This problem has many practical applications, such as detecting where a driver is looking while driving (raising an alarm if the driver falls asleep), a computer mouse for impaired people that moves according to head movements (i.e., face direction), digital camera software that only takes a picture if all individuals are looking at the camera, etc. Traditional methods to solve this problem include ANNs <a name="r17."></a>(<a href="#r17">17</a>, <a href="#r2">2</a>), evolutionary algorithms <a name="r15."></a><a name="r16."></a>(<a href="#r15">15</a>, <a href="#r16">16</a>), problem-specific heuristics, etc., but in general, sequential implementations are used.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">3.2</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Artificial Neural Networks</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">ANNs provide a general practical method for learning real-valued, discrete-valued, and vector-valued functions from examples. 
Several algorithms (such as <i style="">backpropagation</i>) can be used to tune the network parameters to best fit a training set of input-output pairs. ANNs are robust to errors in the training data and have been successfully applied to problems such as image recognition, speech recognition, and learning robot control strategies <a href="#r2">(2)</a>.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US"><a href="#f2">Fig. 2</a> presents the general schema of an ANN. There is a set of <i style="">neurons</i> connected with each other. Each neuron receives several input data, performs a linear combination (result <i style="">a</i>), and then produces the output of the neuron, which is the evaluation of some function <i style="">f</i>(<i style="">x</i>) at the value <i style="">x = a</i>.</span></font></p>     <p><font face="Verdana" size="2"><a name="f2"><img src="/img/revistas/cleiej/v15n3/3a07f2.jpg"> </a> </font> </p>       <p><font face="Verdana" size="2"><b style=""><span lang="EN-US">Fig. 
</span></b><span lang="EN-US"><span style=""><b style="">2</b></span><b style="">.</b> Schema of an ANN.</span></font></p>       <p style="line-height: normal;"><font face="Verdana" size="2"><span lang="EN-US">The neurons are grouped in several layers:</span></font></p>       <p style="margin: 3pt 0cm 0.0001pt 11.35pt; line-height: normal;"> <font face="Verdana" size="2"><span lang="EN-US">&middot;</span></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal" lang="EN-US">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><font face="Verdana" size="2"><span lang="EN-US">Input layer: receives the problem input</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">&middot;</span></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal" lang="EN-US">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><font face="Verdana" size="2"><span lang="EN-US">Hidden layer/s: receives data from other neurons (typically from input layer or from another hidden layer), and forwards the processed data to the next layer. In an ANN, there may be multiple hidden layers with multiple neurons each.</span></font></p>       ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span lang="EN-US">&middot;</span></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal" lang="EN-US">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><font face="Verdana" size="2"><span lang="EN-US">Output layer: this layer may contain multiple neurons and it determines the output of the processing for a certain problem instance.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US"><a href="#f3">Fig. 3</a> shows a schema for a neuron. First, a linear combination of the neuron input data (<img src="./3a07_archivos/image007.gif" v:shapes="_x0000_i1028" border="0" height="15" width="22"><span style="">&nbsp;</span>weights<img src="./3a07_archivos/image010.gif" v:shapes="_x0000_i1030" border="0" height="18" width="118">, and an independent coefficient&nbsp;<img src="./3a07_archivos/image013.gif" v:shapes="_x0000_i1032" border="0" height="18" width="36"><span style="">&nbsp;</span>is made. Then the output is evaluated at some well-known <i style="">activation function</i>, to produce the neuron output.</span></font></p>       <p style="margin-top: 6pt;"><font face="Verdana" size="2"><a name="f3"><img src="/img/revistas/cleiej/v15n3/3a07f3.jpg"> </a> </font> </p>       <p><font face="Verdana" size="2"><b style=""><span lang="EN-US">Fig. 
</span></b><span lang="EN-US"><span style=""><b style="">3</b></span><b style="">.</b> Schema of a single neuron in an ANN.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">3.3</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Face Recognition Using GPGPU</span></font></p>       <p style="text-indent: 0cm;"><font face="Verdana" size="2"><span lang="EN-US">In this article, an ANN trained with the backpropagation algorithm is used to solve the face recognition problem. Backpropagation learns the weights for a multilayer network with a fixed set of units and interconnections, by applying the gradient descent method to minimize the squared error between the network output values and the target values for these outputs. The learning problem faced by backpropagation involves searching the large space defined by all possible weight values for all neurons. 
The backpropagation method applied in this work (the stochastic gradient descent version for feedforward networks) is described in <a href="#z1">Algorithm 1</a>.</span></font></p>    <font face="Verdana" size="2">        <br>   </font>       <p><font face="Verdana" size="2">&nbsp;<span lang="EN-US"><o:p><a name="z1"><img src="/img/revistas/cleiej/v15n3/3a07z1.jpg"> </a> </o:p></span></font></p>       <p style="margin-top: 3pt;"><font face="Verdana"><b style=""><span style="" lang="EN-US"> <font size="2">Algorithm 1.</font></span></b></font><span style="" lang="EN-US"><font size="2" face="Verdana"> Stochastic gradient descent version of the backpropagation algorithm for feedforward networks.</font><o:p></o:p></span></p>       <p style="text-indent: 0cm;"><font face="Verdana" size="2"><span lang="EN-US"><a href="#z1">Algorithm 1</a> begins by constructing a network with the desired number of hidden and output units and initializing all network weights to small random numbers. Given this fixed network structure, the main loop of the algorithm iterates over the training examples. For each training example, it applies the network to the example, computes the gradient with respect to the error on this example, and then updates all weights in the network. This gradient step is iterated (using the same training examples multiple times) until the network performs acceptably well <a href="#r2">(2)</a>. To evaluate a single instance (without training the network), only the forward propagation of the input data through the network is performed. The presented ANN uses neurons of sigmoid type with activation function:</span></font></p>       ]]></body>
<body><![CDATA[<p style="margin: 3pt 0cm; text-align: center;" align="center"> <font face="Verdana" size="2"><span lang="EN-US"><i>f</i>(<i>x</i>) = 1 / (1 + <i>e</i><sup>-<i>x</i></sup>)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1)</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">To solve the face recognition problem in GPU, a specific version of the backpropagation algorithm was implemented. The generic schema in <a href="#z1">Algorithm 1</a> was adapted to execute on a GPU, mainly taking into account the communication restrictions between the GPU processing units. Before calling a function that runs on the GPU, a function-dependent domain decomposition is applied. In general, certain GPU threads are assigned to execute over certain neurons of the ANN. The domain decompositions are always performed to maximize parallel execution (i.e., each GPU thread can work independently of the others) and to avoid serializations in the memory access. A detailed description of the domain decomposition is presented in section 5.2.</span></font></p>       <p style="margin-bottom: 6pt;"><font face="Verdana" size="2"><span lang="EN-US"><a href="#f4">Fig. 
4</a> presents a schema of the ANN training, showing that the train() function is a concatenation of functions that execute in parallel.</span></font></p>       <p style="margin-top: 0cm;"><font face="Verdana" size="2"><span style=""><a name="f4"><img src="/img/revistas/cleiej/v15n3/3a07f4.jpg"></a> </span></font></p>       <p><font face="Verdana" size="2"><b style=""><span lang="EN-US">Fig. </span></b><span lang="EN-US"><span style=""><b style="">4</b></span><b style="">.</b> ANN training: parallel approach.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">4</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Related Work</span></font></p>       <p><font face="Verdana"><span lang="EN-US"><font size="2">Lopes and Ribeiro <a href="#r9">(9)</a> presented an analysis of an ANN implementation executing on GPU, showing how the training and classification times can be reduced significantly (ranging from 5</font></span></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&acute;</span><font face="Verdana"><span lang="EN-US"> to 150</span></font><span style="font-family: Verdana" lang="EN-US">&acute;</span><font face="Verdana"><span lang="EN-US">, depending on the complexity of the problem). They also conclude that the GPU scales better than the CPU when handling large datasets and complex problems. In this work the authors recognizes two sources of parallelism: the outputs of the neurons can be computed in parallel, and all the samples (patterns) on a dataset can be processed independently. The parallel ANN takes advantage of parallelism in the three training phases (forward, robust learning and backpropagation). 
Several problems were tackled in the experimental analysis, including solving <i style="">f</i>(<i style="">x</i>) = sin(<i style="">x</i>)/<i style="">x</i>, and several classification and detection problems such as: the two-spirals problem, the sonar problem, the covertype problem, the poker hand problem, the ventricular arrhythmias problem and face recognition on the Yale face database <a name="r18."></a><a href="#r18">(18)</a>, containing 165 grayscale face images of 64</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span></font><font face="Verdana"><span lang="EN-US"><font size="2">64 pixels from 15 individuals. </font> </span></font></p>       <p><font face="Verdana"><span style="" lang="FR"><font size="2">Jang et al. <a name="r6."></a> </font></span><font size="2"><span lang="EN-US"><a href="#r6">(6)</a> introduced a parallel ANN implementation using CUDA, applied to text recognition in images. Processing times up to 5 times faster than a CPU implementation were obtained. In this work, parallelism is achieved by computing in parallel all the linear combinations performed when a neuron calculates its output, and also when the sigmoid function is computed on each neuron. In this case, text detection was performed over three image sizes (320</span></font></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">240, 571</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">785 and 1152</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span></font><font face="Verdana"><span lang="EN-US"><font size="2">15466), always using 30 neurons in the hidden layer.</font></span></font></p>       <p><font face="Verdana"><span lang="EN-US"><font size="2">Solving a similar problem, Izotov et al. 
<a name="r7."></a><a href="#r7">(7)</a> used an ANN-based algorithm to recognize handwritten digits, with a CUDA-based implementation. Training was about 6 times faster than the same algorithm executed on CPU, and instance classification was about 9 times faster. This work represented some ANN features as matrices and took advantage of the CUBLAS library (a linear algebra library built on top of the CUDA driver level) for calculating matrix multiplications (thus achieving parallelism), using 8-bit grayscale 28</font></span></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&times;</span></font><font face="Verdana"><span lang="EN-US"><font size="2">28 pixel images of handwritten digits from the public domain MNIST database <a name="r19."></a><a href="#r19">(19)</a>.</font></span></font></p>       <p><font face="Verdana"><span style="" lang="FR"><font size="2">Nasse et al. <a name="r8."></a> </font></span><font size="2"><span lang="EN-US"><a href="#r8">(8)</a> solved the problem of locating the direction of a human face in space, using an ANN on a GPU. Parallelization was achieved by dividing the image into several rectangles and computing each one in parallel. This implementation obtains classification up to 11 times faster than an implementation running on CPU, and was trained using 6000 non-faces and 6000 faces in three different sizes (378</span></font></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">278, 640</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">480 and 800</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span></font><font face="Verdana"><span lang="EN-US"><font size="2">600 pixels).</font></span></font></p>       ]]></body>
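The matrix formulation that Izotov et al. exploit through CUBLAS can be illustrated in a few lines of NumPy (an illustrative sketch of the general technique only; the layer sizes, batch size, and random weights are assumptions made here, not values from the cited works): the forward pass of a whole batch through a layer reduces to a single matrix product, which is exactly the operation a GPU BLAS routine parallelizes.

```python
import numpy as np

def sigmoid(z):
    # Elementwise sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def forward_batch(x, w_ih, w_ho):
    # x: (batch, n_in); w_ih: (n_in, n_hid); w_ho: (n_hid, n_out).
    # Each @ below is the matrix product that a GPU BLAS routine would
    # compute in parallel for the whole batch at once.
    hidden = sigmoid(x @ w_ih)
    return sigmoid(hidden @ w_ho)

# Hypothetical sizes: a batch of 4 flattened 28x28 images, 30 hidden
# neurons, 10 outputs (one per digit class)
rng = np.random.default_rng(0)
x = rng.random((4, 784))
out = forward_batch(x,
                    0.05 * rng.standard_normal((784, 30)),
                    0.05 * rng.standard_normal((30, 10)))
print(out.shape)  # (4, 10)
```

Since every sample of the batch goes through the same two matrix products, the per-sample independence noted by Lopes and Ribeiro maps directly onto this formulation.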
<body><![CDATA[<p><font face="Verdana" size="2"><span lang="EN-US">Shufelt and Mitchell <a name="r4."></a><a href="#r4">(4)</a> solved a problem similar to the one proposed in this work (deciding whether an image is of a certain person or not) by using a sequential method. Their proposal was the starting point for the parallel solution presented here.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The analysis of related work shows that ANN implementations executing on a GPU can obtain very good results when contrasted with CPU-only implementations. Taking this fact into account, our purpose here is to develop an ANN that recognizes certain features of a human face in a short period of time. If training time can be reduced considerably by using parallel GPU infrastructures, the solution will overcome one of the main disadvantages of traditional ANN implementations. </span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">5</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Implementation Details</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The proposed parallel implementation applies the ideas of the sequential algorithm by Shufelt and Mitchell <a href="#r4">(4)</a> for recognizing whether a given picture is of a certain person. Hence, the method had to be slightly modified to obtain the expected solution for the face recognition problem. Moreover, in the parallel implementation, the possibility of working with a second layer of hidden neurons was included.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">5.1</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Software modules</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The proposed solution uses five modules, which implement all functions needed by the algorithm. First, facetrain.cu contains the main method and is the module that performs the calls to the other functions. Then, backprop.cu implements the ANN and all the auxiliary functions needed to work with it. The proper interaction between the ANN and the images is solved in imagenet.cu. The pgmimage.cu library is used to work with images in pgm format. Finally, constants.h contains the entire configuration needed for the correct execution of the algorithm.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">Training is a key element of the algorithm. The diagram in <a href="#f5">Fig. 5</a> shows the steps required to train the ANN on the entire training set.</span></font></p>       <p style="margin-top: 6pt;"><font face="Verdana" size="2"><span style=""><a name="f5"><img src="/img/revistas/cleiej/v15n3/3a07f5.jpg"></a> </span></font></p>       <p><font face="Verdana" size="2"><b style=""><span lang="EN-US">Fig. </span></b><span lang="EN-US"><span style=""><b style="">5</b></span><b style="">.</b> ANN training: functional schema.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The GPU architecture is best exploited when performing training and classification. For example (see <a href="#f6">Fig. 
6</a>), when forwarding from the input layer to the hidden layer, the parallel algorithm creates as many blocks as there are neurons in the hidden layer (each block works with one neuron of the hidden layer), and each thread in a block computes the product of a weight going to that hidden neuron by the corresponding data coming from the input layer (the algorithm uses as many threads per block as there are input neurons). Similar levels of parallelism are achieved in the rest of the functions, changing the role of each block and of each thread in a block accordingly.</span></font></p>       ]]></body>
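The decomposition just described can be emulated sequentially to make the mapping explicit (a Python sketch for illustration only, not the paper's CUDA kernel; the outer loop plays the role of the blocks, the `tid` loop plays the role of the threads, and the final sum stands for the parallel reduction the GPU would perform):

```python
import math

MAX_THREADS = 1024  # threads per block allowed by the GPU used in this work

def forward_input_hidden(inputs, weights):
    """Sequential emulation of the GPU decomposition: one 'block' per hidden
    neuron; within a block, each 'thread' accumulates the products for the
    input indices it owns (striding when inputs outnumber threads)."""
    n_threads = min(MAX_THREADS, len(inputs))
    hidden = []
    for w_row in weights:                    # one block per hidden neuron
        partials = [0.0] * n_threads
        for tid in range(n_threads):         # one iteration per thread
            for i in range(tid, len(inputs), n_threads):  # strided indices
                partials[tid] += w_row[i] * inputs[i]
        s = sum(partials)                    # parallel reduction on the GPU
        hidden.append(1.0 / (1.0 + math.exp(-s)))  # sigmoid activation
    return hidden

# Tiny usage example: 2 input neurons, 2 hidden neurons
h = forward_input_hidden([0.5, 1.0], [[0.2, -0.4], [1.0, 1.0]])
```

When the input layer has more neurons than the thread limit, each emulated thread simply strides over several input indices, mirroring how the real kernels assign several calculations to a single thread.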
<body><![CDATA[<p><font face="Verdana" size="2"><span style=""><a name="f6"><img src="/img/revistas/cleiej/v15n3/3a07f6.jpg"></a></span></font></p>       <p style="margin: 3pt 0cm 6pt;"><font face="Verdana" size="2"><b style=""><span lang="EN-US">Fig. </span></b><span lang="EN-US"><span style=""><b style="">6</b></span><b style="">.</b> Parallelism example: forwarding from input layer to hidden layer.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">5.2</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Tasks Executed on GPU</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">Key parameters, such as the grid and block sizes used for the function invocations performed on GPU, are of special interest for the overall performance of the algorithm.<span style="">&nbsp; </span>The following functions are called in both the training and evaluation tasks. The function <i style="">load_input_with_image()</i><b style=""> </b>is called with as many blocks as the image has rows, and as many threads per block as the image has columns. This function loads each image into the neurons of the input layer for later use (each block loads one row of the image). After that, <i style="">forward_input_hidden()</i> computes the outputs of the neurons in the hidden layer, using the data from the input layer. It is called with as many blocks as the number of hidden neurons in the network and as many threads per block as the GPU allows (1024 in our case). Each block computes the output of one neuron of the hidden layer, i.e., a linear combination computed by several threads (see Section 2). 
The function <i style="">forward_hidden()</i> works like the previous function, computing the outputs of the neurons in the output layer from the hidden-layer data. Next, <i style="">load_output()</i> loads the expected output of the ANN for a certain image. It is invoked with one block of one thread because of its simplicity.</span></font></p>       <p style="text-indent: 7.1pt;"><font face="Verdana" size="2"><span lang="EN-US">The function<b style=""> </b><i style="">evaluate_performance()</i> computes the error for an image and checks whether the output of the ANN matches the expected one. It is called with one block of one thread, due to the small amount of processing performed. The functions <i style="">bpnn_output_error()</i> and <i style="">bpnn_hidden_error()</i> are used to compute the error in the output and hidden layers, respectively. The first is called with one block and as many threads per block as the ANN has output neurons; the second with as many blocks as there are hidden neurons and as many threads per block as there are output neurons. The function <i style="">bpnn_adjust_weights_hidden()</i> adjusts the weights from the hidden layer to the output layer. It is called with as many blocks as there are output neurons, and with as many threads per block as the number of hidden neurons plus one. For each block, it adjusts the weights that go from the hidden neurons (as many as threads) to the output neurons. Finally, <i style="">bpnn_adjust_weights_input()</i> adjusts the weights that go from the input layer to the hidden layer. It works like the previous function, with the difference that the number of threads must be larger, due to the larger number of neurons in the input layer.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">5.3</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Other GPU considerations</span></font></p>       <p style="text-indent: 0cm;"><font face="Verdana"><span lang="EN-US"> <font size="2">Throughout the provided implementation, all constants have an explicitly defined type. This decision was made because a precision loss was detected when performing conversions (e.g., double to float), affecting the numerical accuracy of the proposed algorithm. Since all GPU memory must be contiguous, static structures are used, because when transferring data from CPU to GPU the <i style="">cudaMemcpy</i> function copies only contiguous memory addresses. Moreover, the CPU stack size had to be enlarged to 512 MB in order to allow storing images of 128</font></span></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&times;</span></font><font face="Verdana"><span lang="EN-US"><font size="2">120 pixels or larger.</font></span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">In addition, certain implementation decisions were taken to improve the final performance of the proposed algorithm. First, it was decided to use shared memory in certain GPU kernels, to hide the latency of global memory access. 
Another element to take into account is that most threads running on GPU perform several calculations each, in order to work around the limit on the number of threads per block in the execution platform (1024 for CUDA compute capability 2.0).</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">A weakness of the implementation is that it relies heavily on the underlying hardware (especially on the compute capability of the graphics card). This impacts the number of threads per block that can be created and the availability of certain CUDA functions (e.g., the <i style="">atomicAdd</i> function for data of float type). </span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">6</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Experimental Analysis</span></font></p>       ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span lang="EN-US">This section reports the results obtained when applying the parallel GPU implementation of the face pointing direction algorithm to a set of problem instances. A comparative analysis with a sequential implementation is performed, and the obtained speedups (the quotient between the execution time of the sequential implementation and the execution time of the parallel implementation) are also reported.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">&nbsp;</font><o:p></o:p></span></p>       <p><span lang="EN-US"><font face="Verdana" size="2">6.1</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Development and Execution Platform</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The GPU algorithm was developed on an AMD Athlon II X3 445 processor at 3.10 GHz, with 6 GB of DDR3 RAM at 1333 MHz, an MSI 880GM-E41 motherboard, and a GeForce GTS 450 GPU with 1 GB of RAM. </span> </font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The experimental analysis of the proposed algorithm was carried out on two platforms. Validation experiments using small-sized images were performed on the development platform, using a 500 GB SATA-2 disk with RAID 0 and running Ubuntu Linux 11.10 64-bit Edition.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The limited computing power of the development platform did not allow taking full advantage of the parallel features proposed by the algorithm. 
Thus, a more comprehensive set of experiments, including large images, was performed on a more powerful platform, the <i style="">execution platform</i>, consisting of a Core i7-2600 processor at 3.40 GHz with 16 GB of DDR3 RAM, running Fedora 15 64-bit Edition, and a GeForce GTX 480 GPU with 1536 MB of RAM. </span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">6.2</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Problem Instances</span></font></p>       <p><font face="Verdana"><span lang="EN-US"><font size="2">Both sets of problem instances used for training and classification were obtained from the work by Shufelt and Mitchell <a href="#r4">(4)</a>. These are images of different people in different poses. For each person and pose, the image comes in three sizes: 32</font></span></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">30 pixels, 64</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">60 and 128</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">120 pixels. There are about 620 images, which are divided into three sets, one for network training and the other two to measure its effectiveness (trainlist and testlist, respectively). Also, the images were scaled in order to carry out executions with larger images and contrast these executions with the previous ones. 
The scaling was performed in two sizes: 256</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">240 and 512</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span></font><font face="Verdana"><span lang="EN-US"><font size="2">480 pixels.</font></span></font></p>       <p><font face="Verdana"><span lang="EN-US"><font size="2">This variety of instance types (different image sizes) makes it possible to exploit the GPU platform to the maximum, because for small instances (i.e., 32</font></span></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">30 pixels, 64</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span></font><font face="Verdana"><span lang="EN-US"><font size="2">60 pixels) both platforms perform similarly, while for larger images a clear difference between the two platforms becomes noticeable. This is mainly because the GPU platform has many more processing units than the CPU (albeit at a lower clock speed), so when many calculations are required at once, more computing resources are available on the GPU than on the CPU.</font></span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">6.3</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Results and Discussion</span></font></p>       ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span lang="EN-US">To validate the algorithm, it was required that the rate of correctly classified new instances be greater than 80% and that the speedup over the sequential algorithm be at least 2.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">All values shown below correspond to algorithm executions with a training set consisting of 277 images and two test sets of 139 and 208 images, respectively (test1 set and test2 set). First, the ANN is trained with every image in the training list; then its performance is evaluated using the images from test sets 1 and 2 (this concludes an epoch). This is repeated 100 times (100 epochs) to complete an execution cycle. The presented values correspond to the average of 50 execution cycles with an ANN with 100 neurons in the hidden layer.</span></font></p>       <p><b style=""><span lang="EN-US"><font face="Verdana" size="2">&nbsp;</font><o:p></o:p></span></b></p>       <p><b style=""><span lang="EN-US"><font face="Verdana" size="2">Solution Quality</font><o:p></o:p></span></b></p>       <p><font face="Verdana"><span lang="EN-US"><font size="2">Since the proposed parallel implementation does not modify the algorithmic behavior of the </font> </span><span style="color: black;" lang="EN-US"> <font size="2">sequential</font></span><span style="font-size: 10pt;" lang="EN-US"> </span><span lang="EN-US"> <font size="2">implementation, the results obtained with the GPU implementation are nearly the same as those obtained with the sequential version for all the studied instances. <a href="#t2">Table 2</a> and <a href="#t4">Table 4</a> show the correctly classified instances (in percentage) for both the sequential and parallel algorithms, together with the learning rate and momentum constants that were used. Classification rates close to 80% are achieved for images of 512</font></span></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">480 pixels on both the development and execution platforms. On the development platform, the best results are obtained with images of 32</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">30 pixels, where a classification rate close to 93% is achieved, while on the execution platform classification rates close to 92% are achieved for the 32</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">30, 64</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">60 and 128</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span></font><span lang="EN-US"><font size="2" face="Verdana">120 pixel images.</font><b style=""><o:p></o:p></b></span></p>       <p style="text-indent: 0cm;"><b style=""><span lang="EN-US"> <font face="Verdana" size="2">Execution Times</font><o:p></o:p></span></b></p>       <p style="text-indent: 0cm;"><font face="Verdana" size="2"><span lang="EN-US">This subsection reports and discusses the execution time and performance results for the GPU implementation of the proposed algorithm. All execution times reported are averages with their corresponding standard deviation values, computed over 50 independent executions of the parallel algorithm for each scenario.</span></font></p>       <p style="margin-top: 0cm;"><i style=""><span lang="EN-US"> <font face="Verdana" size="2">&nbsp;</font><o:p></o:p></span></i></p>       <p style="margin-top: 0cm;"><font face="Verdana" size="2"><i style=""><span lang="EN-US">Validation experiments on the development platform</span></i><span lang="EN-US">. 
<a href="#t1">Table 1</a> reports the execution times (in seconds) for the sequential and parallel implementations of the algorithm and the values of the speedup metric, in the development platform. <a href="#t2">Table 2</a> shows the classification rates obtained for both the sequential and parallel implementation, and the corresponding constants used for learning rate and momentum.</span></font></p>       <p><font face="Verdana"><b style=""><span style="" lang="EN-US"><font size="2">Table </font> </span></b></font><span style="" lang="EN-US"> <font size="2" face="Verdana"><span style=""><b style="">1</b></span><b style="">.</b> Execution times (in seconds) in the development platform</font><o:p></o:p></span></p>       ]]></body>
<body><![CDATA[<div align="center">  <font face="Verdana" size="2">  <a name="t1"><img src="/img/revistas/cleiej/v15n3/3a07t1.jpg"></a> </font> </div>       <p><font face="Verdana"><b style=""><span style="" lang="EN-US"><font size="2">Table </font> </span></b></font><span style="" lang="EN-US"> <font size="2" face="Verdana"><span style=""><b style="">2</b></span><b style="">.</b> Correctly classified instances in the development platform</font><o:p></o:p></span></p>       <div align="center">  <font face="Verdana" size="2">  <a name="t2"><img src="/img/revistas/cleiej/v15n3/3a07t2.jpg"></a> </font> </div>       <p style="text-indent: 0cm;"><i style=""><span lang="EN-US"> <font face="Verdana" size="2">&nbsp;</font><o:p></o:p></span></i></p>       <p style="text-indent: 0cm;"><font face="Verdana" size="2"><i style=""><span lang="EN-US">Experimental analysis on the execution platform</span></i><span lang="EN-US">. <a href="#t3">Table 3</a> reports the execution times for both sequential and parallel implementation, as well as the speedup obtained in experiments performed in the execution platform. 
<a href="#t4">Table 4</a> shows the classification rates obtained for each implementation and the corresponding constants used for learning rate and momentum.</span></font></p>       <p><font face="Verdana"><b style=""><span style="" lang="EN-US"><font size="2">Table </font> </span></b></font><span style="" lang="EN-US"> <font size="2" face="Verdana"><span style=""><b style="">3</b></span><b style="">.</b> Execution times (in seconds) in the execution platform</font><o:p></o:p></span></p>       <div align="center"> <font face="Verdana" size="2"> <a name="t3"><img src="/img/revistas/cleiej/v15n3/3a07t3.jpg"></a> </font>  </div>       <p><font face="Verdana"><b style=""><span style="" lang="EN-US"><font size="2">Table </font> </span></b></font><span style="" lang="EN-US"> <font size="2" face="Verdana"><span style=""><b style="">4</b></span><b style="">.</b> Correctly classified instances in the execution platform</font><o:p></o:p></span></p>       <div align="center">  <font face="Verdana" size="2">  <a name="t4"><img src="/img/revistas/cleiej/v15n3/3a07t4.jpg"></a> </font> </div>       <p style="margin-top: 6pt; text-indent: 0cm;"><font face="Verdana" size="2"><i style=""><span lang="EN-US">Speedup comparison</span></i><span lang="EN-US">. <a href="#f7">Fig. 7</a> summarizes the acceleration when using a GPU implementation, contrasted with using a sequential implementation in CPU, for the different image sizes and platforms used in the experimental analysis. The <i style="">speedup</i> evaluates the quotient between the execution time of the sequential implementation and the execution time in the parallel implementation in GPU. </span></font></p>       ]]></body>
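The speedup metric just defined is a plain ratio of execution times; as a minimal sketch (the timing values below are invented for illustration and are not the measurements reported in the tables):

```python
def speedup(t_sequential, t_parallel):
    # Quotient between the sequential (CPU) and parallel (GPU) execution
    # times; values greater than 1 mean the GPU implementation is faster.
    return t_sequential / t_parallel

# Invented example timings, in seconds:
print(speedup(423.5, 50.0))   # 8.47
print(speedup(10.0, 12.0))    # < 1: the GPU version would be slower
```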
<body><![CDATA[<p style="text-align: center; text-indent: 0cm; line-height: normal;" align="center"> <font face="Verdana" size="2"><a name="f7"><img src="/img/revistas/cleiej/v15n3/3a07f7.jpg"></a> </font> </p>       <p style="margin: 0cm 0cm 6pt; line-height: normal;"> <font face="Verdana" size="2"><b style=""><span lang="EN-US">Fig. 7.</span></b><span lang="EN-US"> Speedup comparison.</span></font></p>       <p style="line-height: normal;"><font face="Verdana"><span lang="EN-US"> <font size="2">The speedup values in <a href="#f7">Fig. 7</a> indicate that the best acceleration is obtained for images of 256</font></span></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">240 pixels, where the algorithm makes full use of the computing capabilities of the graphics card. The results in Tables <a href="#t1">1</a> and <a href="#t3">3</a> indicate that significant improvements in the execution times are obtained when using the parallel version of the algorithm with images of size 64</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">60 or larger. When solving images of size 32</span></font><span style="font-family: Verdana" lang="EN-US">&times;</span><font face="Verdana"><span lang="EN-US">30, the GPU implementation was unable to outperform the execution times of the CPU-only implementation, mainly due to the overhead introduced by thread creation and management and to the use of the GPU memory. 
However, when solving larger problem instances, significant improvements in execution times are achieved, especially for images of size 256</span></font><span style="font-family: Verdana" lang="EN-US">&acute;</span></font><font face="Verdana"><span lang="EN-US"><font size="2">240 pixels, where a speedup of 8.47 is obtained.</font></span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2">The previous results indicate that the parallel implementation of the face recognition algorithm executing on GPU provides significant reductions on the execution times over a traditional sequential implementation in CPU, especially when large images are processed. </font> <b style=""><span style="color: red;"><o:p></o:p></span></b></span></p>       <p><span lang="EN-US"><font face="Verdana" size="2">7</font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">Conclusions and Future Work</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">ANNs have proven to be suitable for solving many real world problems. However, the large execution times required in the training phase sometimes exclude ANNs from being an option when using large datasets or when solving complex problems. </span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">Nowadays, parallel computing on GPUs allows achieving important performance improvements over CPU implementations. 
In this article, a parallel GPU algorithm was proposed for solving the face recognition problem using ANNs.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The parallel GPU algorithm was designed and implemented to take advantage of the specific features of GPU infrastructures, in order to provide an accurate and efficient solution to both the training process using the well-known backpropagation algorithm, and the face recognition problem itself.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The overall parallel strategy is based on many threads running on GPU, each one working with several neurons and keeping the threads as independent as possible. Every kernel function was designed to take advantage of the execution platform and optimized to obtain the best performance (e.g., some kernels perform the calculations for more than one neuron to avoid the overhead of thread creation). Also, shared memory was exploited in order to reduce global memory access latency.</span></font></p>       <p><font face="Verdana"><span lang="EN-US"><font size="2">The experimental analysis demonstrates that the parallel algorithm on GPU achieves significant improvements in the execution times when compared with a traditional sequential implementation. Speedup values up to <b style="">8.47</b> were obtained when solving problem instances with images of 256</font></span></font><font size="2"><span style="font-family: Verdana" lang="EN-US">&times;</span></font><font face="Verdana"><span lang="EN-US"><font size="2">240 pixels, and 7.23 for images of 512&times;480 pixels. These results confirm that to take advantage of the GPU computing power, the algorithm should be used to process images of considerable size. </font> </span></font></p>       ]]></body>
<body><![CDATA[<p><font face="Verdana" size="2"><span lang="EN-US">The main contributions of this article include a parallel face recognition algorithm on GPU that is able to obtain accurate classification rates in reasonable execution times. The algorithm can be easily modified to recognize other features of a human face, without significant changes in the expected execution times. </span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The research reported in this article demonstrates that the GPGPU platform is a very good option to speed up the resolution of complex problems. Furthermore, the results indicate how the growing technological evolution of graphics cards helps to tackle more complex classification problems using ANNs, which can be solved accurately and in reduced execution times.</span></font></p>       <p><font face="Verdana" size="2"><span lang="EN-US">The main lines for future work include further improving the computational efficiency of the presented algorithm and tackling other classification/image processing problems using ANNs implemented on GPU. Regarding the first line of work, improved execution time results can be obtained by adjusting the parameters of each kernel invocation to avoid problems such as thread divergence, and by making better use of GPU resources (i.e., shared memory) for larger images. Also, some algorithm constants (such as <i style="">momentum</i> and <i style="">learning rate</i>) could be auto-tuned by the algorithm to obtain the best possible classification rates. Regarding the second line, it will be of special interest to implement GPU algorithms to recognize generic features in images of people, such as skin color or whether the subject is wearing sunglasses, configurable at run time. 
</span></font></p>       <p style="margin-bottom: 6pt;"><font face="Verdana" size="2"><span class="heading3"><span lang="EN-US">References</span></span></font></p>       <!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a name="r1">(1)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">K. Kyong and K. Jung, &ldquo;GPU Implementation of Neural Network&rdquo;, <i style="">Pattern Recognition</i>, vol. 37, no. 6, pp. 1311-1314. Pergamon, 2004.    </span></font></p>       <!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r2." name="r2">(2)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">T. Mitchell, <i style="">Machine Learning</i>. McGraw Hill, 1997.    </span></font></p>       <!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a name="r3">(3)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">C. M. Bishop, <i style="">Pattern Recognition and Machine Learning</i>. Springer, 2006.    </span></font></p>       
<body><![CDATA[<!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r4." name="r4">(4)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana"><span lang="EN-US"><font size="2">T. Mitchell and J. Shufelt, Neural Networks for Face Recognition, </font> <a href="http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/%20ftp/faces.html"> <span style="font-size: 10pt;">http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ ftp/faces.html</span></a><font size="2">. Accessed June 2012.    </font></span></font></p>       <!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a name="r5">(5)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana"><span lang="EN-US"><font size="2">Neural Network on GPU, </font> <a href="http://www.codeproject.com/Articles/24361/A-Neural-Network-on-GPU"> <span style="font-size: 10pt;">http://www.codeproject.com/Articles/24361/A-Neural-Network-on-GPU</span></a><font size="2"> Accessed June 2012</font></span></font><p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r6." name="r6">(6)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">H. Jang, A. Park and K. Jung, &ldquo;Neural Network Implementation Using CUDA and OpenMP&rdquo;, <i style="">Proc. of Computing: Techniques and Applications</i>, pp.155-161. 
IEEE, 2008.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r7." name="r7">(7)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">P. Izotov, N. Kazanskiy, D. Golovashkin and S. Sukhanov, &ldquo;CUDA-enabled implementation of a neural network algorithm for handwritten digit recognition&rdquo;, <i style="">Optical Memory &amp; Neural Networks</i>, vol. 20, no. 2, pp.98-106. Allerton Press, Inc, 2011.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r8." name="r8">(8)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">F. Nasse, C. Thurau and G. Fink, &ldquo;Face Detection Using GPU-Based Convolutional Neural Networks&rdquo;, <i style="">Proc. of Computer Analysis of Images and Patterns</i>, pp. 83-90. Springer Berlin, 2009.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r9." name="r9">(9)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp; </span></span><font face="Verdana" size="2"><span lang="EN-US">N. Lopes and B. Ribeiro, &ldquo;An Evaluation of Multiple Feed-Forward Networks on GPUs&rdquo;, <i style="">International Journal of Neural Systems</i>, vol. 21, no. 1, pp. 31-47. 
World Scientific Publishing Company, 2011.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2"><a name="r10">(10)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana" size="2"><span lang="EN-US">Y. LeCun, L. Bottou, G. Orr and K. Muller, &ldquo;Efficient Backprop in Neural Networks-Tricks of the Trade&rdquo;, <i style="">Springer Lecture Notes in Computer Sciences</i>, vol. 1524, pp. 5-50. Springer, 1998.</span></font></p>       <!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r11." name="r11">(11)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana"><span lang="EN-US"><font size="2">NVIDIA. <i style="">CUDA C Programming Guide Version 4.1</i>, </font> <a href="http://developer.download.nvidia.com/%20compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf"> <span style="font-size: 10pt;">http://developer.download.nvidia.com/ compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf</span></a></span><span style="font-size: 10pt;" lang="EN-US">.</span><span class="MsoHyperlink"><span lang="EN-US"><font size="2"> </font> </span></span><span lang="EN-US"><font size="2">Accessed June 2012</font></span></font><p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r12." name="r12">(12)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana" size="2"><span lang="EN-US">D. Steinkrau, P. Simard and I. 
Buck, &ldquo;Using GPUs for machine learning algorithms&rdquo;, <i style="">Proc. of 8th Int. Conf. on Document Analysis and Recognition</i>, pp. 1115&ndash;1119, 2005.</span></font></p>       ]]></body>
<body><![CDATA[<p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r13." name="r13">(13)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana" size="2"><span lang="EN-US">B. Catanzaro, N. Sundaram and K. Keutzer, &ldquo;Fast support vector machine training and classification on graphics processors&rdquo;, <i style="">Proc. of 25th International Conference on Machine Learning</i>, 2008, pp. 104&ndash;111. ACM, 2008.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r14." name="r14">(14)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana" size="2"><span lang="EN-US">D. Rumelhart, B. Widrow and M. Lehr, &ldquo;The basic ideas in neural networks&rdquo;, <i style="">Communications of the ACM</i>, 37(3) pp. 87-92. ACM, 1994.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r15." name="r15">(15)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana" size="2"><span lang="EN-US">S. Huang, L. Fu, P. Hsiao, &ldquo;A framework for human pose estimation by integrating data-driven Markov chain Monte Carlo with multi-objective evolutionary algorithm&rdquo;, <i style="">Proc. of &nbsp;Int. Conf. on Robotics and Automation</i>, pp. 3748&ndash;3753, 2006.</span></font></p>       <p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r16." 
name="r16">(16)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana" size="2"><span lang="EN-US">E. Murphy-Chutorian, M. Trivedi, &ldquo;Head Pose Estimation in Computer Vision: A Survey&rdquo;, <i style="">IEEE Trans. on Patt. Analysis and Machine Intelligence</i>, 2009, pp. 607&ndash;626. IEEE, 2009.</span></font></p>       <!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a name="r17">(17)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana" size="2"><span lang="EN-US">M. C. Bishop, <i style="">Neural Networks for Pattern Recognition</i>. Clarendon Press, Oxford, 1995.    </span></font></p>       <!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r18." name="r18">(18)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana"><span lang="EN-US"><font size="2">Yale Face Database, </font> <a href="http://cvc.yale.edu/projects/yalefaces/yalefaces.html"> <span style="font-size: 10pt;">cvc.yale.edu/projects/yalefaces/yalefaces.html</span></a><font size="2">. Accessed June 2012</font></span></font><!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r19." 
name="r19">(19)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana"><span lang="EN-US"><font size="2">Y. LeCun and C. Cortes, The MNIST Database of Handwritten Digits, MNIST Handwritten Digit Database, </font> <a href="http://yann.lecun.com/exdb/mnist"> <span style="font-size: 10pt;">http://yann.lecun.com/exdb/mnist</span></a><font size="2">. Accessed June 2012</font></span></font><!-- ref --><p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r20." name="r20">(20)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana"><span lang="EN-US"><font size="2">CUDA Spotlights, </font> <a href="http://developer.nvidia.com/cuda-spotlights"> <span style="font-size: 10pt;">http://developer.nvidia.com/cuda-spotlights</span></a><font size="2">. Accessed June 2012</font></span></font><p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r21." name="r21">(21)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana" size="2"><span lang="EN-US">C. Ng, M. Savvides and P. Khosla, &ldquo;Real-time face verification system on a cell-phone using advanced correlation filters&rdquo;, <i style="">Proc. of 4th IEEE Workshop on Automatic Identification Advanced Technologies</i>, pp. 57&ndash;62. IEEE, 2005.</span></font></p>       ]]></body>
<body><![CDATA[<p><span lang="EN-US"><font face="Verdana" size="2"><a href="#r22." name="r22">(22)</a></font><span style="font-family: &quot;Verdana&quot;; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; line-height: normal; font-size-adjust: none; font-stretch: normal">&nbsp; </span></span> <font face="Verdana" size="2"><span lang="EN-US">K. Venkataramani, S. Qidwai and B. Vijayakumar, &ldquo;Face authentication from cell phone camera images with illumination and temporal variations&rdquo;, <i style="">IEEE Trans. on Systems, Man, and Cybernetics</i>, Part C, vol. 35, pp. 411&ndash;418. IEEE, 2005.</span></font></p>        ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kyong]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Jung]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[GPU Implementation of Neural Network]]></article-title>
<source><![CDATA[Pattern Recognition]]></source>
<year>2004</year>
<volume>37</volume>
<numero>6</numero>
<issue>6</issue>
<page-range>1311-1314</page-range><publisher-name><![CDATA[Pergamon]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Mitchell]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<source><![CDATA[Machine Learning]]></source>
<year>1997</year>
<publisher-name><![CDATA[McGraw Hill]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bishop]]></surname>
<given-names><![CDATA[C. M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Pattern Recognition and Machine Learning]]></source>
<year>2006</year>
<publisher-name><![CDATA[Springer]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Mitchell]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Shufelt]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Neural Networks for Face Recognition]]></source>
<access-date>June 2012</access-date>
</nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="">
<source><![CDATA[Neural Network on GPU]]></source>
<access-date>June 2012</access-date>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Jang]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Park]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Jung]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Neural Network Implementation Using CUDA and OpenMP]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ Proc. of Computing: Techniques and Applications]]></conf-name>
<conf-date>2008</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Izotov]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Kazanskiy]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Golovashkin]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Sukhanov]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[Optical Memory & Neural Networks]]></source>
<year>2011</year>
<volume>20</volume>
<numero>2</numero>
<issue>2</issue>
<page-range>98-106</page-range><publisher-name><![CDATA[Allerton Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Nasse]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Thurau]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Fink]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Face Detection Using GPU-Based Convolutional Neural Networks]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ Computer Analysis of Images and Patterns]]></conf-name>
<conf-date>2009</conf-date>
<conf-loc>Berlin </conf-loc>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lopes]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Ribeiro]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[An Evaluation of Multiple Feed-Forward Networks on GPUs]]></article-title>
<source><![CDATA[International Journal of Neural Systems]]></source>
<year>2011</year>
<volume>21</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>31-47</page-range><publisher-name><![CDATA[World Scientific Publishing Company]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[LeCun]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Bottou]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Orr]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Muller]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Efficient Backprop in Neural Networks-Tricks of the Trade]]></article-title>
<source><![CDATA[Springer Lecture Notes in Computer Sciences]]></source>
<year>1998</year>
<volume>1524</volume>
<page-range>5-50</page-range><publisher-name><![CDATA[Springer]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="">
<collab>NVIDIA</collab>
<source><![CDATA[CUDA C Programming Guide Version 4.1]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Steinkrau]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Simard]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Buck]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Using GPUs for machine learning algorithms]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ 8th Int. Conf. on Document Analysis and Recognition]]></conf-name>
<conf-date>2005</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Catanzaro]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Sundaram]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Keutzer]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Fast support vector machine training and classification on graphics processors]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ 25th International Conference on Machine Learning]]></conf-name>
<conf-date>2008</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rumelhart]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Widrow]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Lehr]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The basic ideas in neural networks]]></article-title>
<source><![CDATA[Communications of the ACM]]></source>
<year>1994</year>
<volume>37</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>87-92</page-range><publisher-name><![CDATA[ACM]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Huang]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Fu]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Hsiao]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A framework for human pose estimation by integrating data-driven Markov chain Monte Carlo with multi-objective evolutionary algorithm]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ Int. Conf. on Robotics and Automation]]></conf-name>
<conf-date>2006</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Murphy-Chutorian]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Trivedi]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Head Pose Estimation in Computer Vision: A Survey]]></article-title>
<source><![CDATA[IEEE Trans. on Patt. Analysis and Machine Intelligence]]></source>
<year>2009</year>
<page-range>607-626</page-range><publisher-name><![CDATA[IEEE]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bishop]]></surname>
<given-names><![CDATA[M. C.]]></given-names>
</name>
</person-group>
<source><![CDATA[Neural Networks for Pattern Recognition]]></source>
<year>1995</year>
<publisher-name><![CDATA[Clarendon Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="">
<source><![CDATA[Yale Face Database]]></source>
<access-date>June 2012</access-date>
</nlm-citation>
</ref>
<ref id="B19">
<label>19</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[LeCun]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Cortes]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<source><![CDATA[The MNIST Database of Handwritten Digits]]></source>
<access-date>June 2012</access-date>
</nlm-citation>
</ref>
<ref id="B20">
<label>20</label><nlm-citation citation-type="">
<source><![CDATA[CUDA Spotlights]]></source>
<access-date>June 2012</access-date>
</nlm-citation>
</ref>
<ref id="B21">
<label>21</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ng]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Savvides]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Khosla]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Real-time face verification system on a cell-phone using advanced correlation filters]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ 4th IEEE Workshop on Automatic Identification Advanced Technologies]]></conf-name>
<conf-date>2005</conf-date>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B22">
<label>22</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Venkataramani]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Qidwai]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Vijayakumar]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Face authentication from cell phone camera images with illumination and temporal variations]]></article-title>
<source><![CDATA[IEEE Trans. on Systems, Man, and Cybernetics]]></source>
<year>2005</year>
<volume>35</volume>
<page-range>411 - 418</page-range><publisher-name><![CDATA[IEEE]]></publisher-name>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
