8:45 AM - 9:00 AM

Welcome Address and Opening Remarks

9:00 AM – 10:00 AM

Keynote by Sudhakar Yalamanchili, Georgia Institute of Technology, USA: Scaling Resource Compositions in a Flatter World

Following the end of Dennard scaling and the transition to multicore, we are now seeing an evolution to communication-centric architectures and systems. Data movement is more expensive in time and energy than compute, and as a consequence systems are undergoing another fundamental transformation to optimize data movement rather than compute. This transformation will percolate up through the software stacks and to clusters and data centers. The trend has been amplified by the emergence of big data as a major challenge for future systems. This talk will make some observations about the impact of technology trends on cluster architectures and offer some opinions on anticipated research problems. It will conclude with a description of our new project, Oncilla, an experimental platform where we explore data movement optimizations for data warehousing applications in the context of clusters that are architected to offer flexible compositions of heterogeneous compute and memory resources.


10:00 AM - 10:30 AM

Coffee break

10:30 AM – 12:00 PM

Technical Session 1: GPU-based computing

Ray Bittner, Erik Ruf: Direct GPU/FPGA Communication Via PCI Express

Parallel processing has hit mainstream computing in the form of CPUs, GPUs and FPGAs. While explorations proceed with all three platforms individually and with the CPU-GPU pair, little exploration has been performed with the synergy of GPU-FPGA. This is due in part to the cumbersome nature of communication between the two. This paper presents a mechanism for direct GPU-FPGA communication and characterizes its performance in a full hardware implementation.

Brad Suchoski, Caleb Severn, Manu Shantharam and Padma Raghavan: Adapting Sparse Triangular Solution to GPUs

High performance computing systems are increasingly incorporating hybrid CPU/GPU nodes to accelerate the rate at which floating point calculations can be performed for scientific applications. A key challenge is adapting scientific applications to such systems when the underlying computations are sparse, such as sparse linear solvers for the simulation of partial differential equation models using semi-implicit methods. Within these, a key bottleneck is sparse triangular solution for solvers such as preconditioned conjugate gradients (PCG). We show that sparse triangular solution can be effectively mapped to GPUs by extracting very large degrees of fine-grained parallelism using graph coloring. We develop simple performance models to predict these effects at the intersection of the data and hardware attributes, and we evaluate our scheme on an NVIDIA Tesla M2090 GPU relative to the level set scheme developed at NVIDIA. Our results indicate that our approach significantly enhances the available fine-grained parallelism and reduces execution time compared to the NVIDIA scheme, with a geometric mean speedup of 5.41 on a single GPU and speedups as high as 63 in some cases.


Pedro Valero-Lara: MRF satellite image classification on GPU

One stage in the analysis of satellite images is classification based on the Markov Random Fields (MRF) method. Several packages in the literature carry out this analysis, including the classification tasks. One of them is the Orfeo ToolBox (OTB). The analysis of satellite images is a computationally expensive task that requires real-time execution or automation. To reduce the execution time spent on the analysis of satellite images, parallelism techniques can be used. Graphics Processing Units (GPUs) are becoming a good choice for reducing the execution time of many applications at a low cost. In this paper, the author presents a GPU-based MRF classification derived from the sequential algorithm that appears in the OTB package. The experimental results show a spectacular reduction of the execution time for the GPU-based algorithm, which is up to 225 times faster than the sequential algorithm included in the OTB package. Moreover, total power consumption is also reduced by a significant amount.


12:00 PM - 1:30 PM

Lunch

1:30 PM - 2:30 PM

Keynote by Joel Emer, Intel, USA: Scaling the Von-Neumann Wall with Reconfigurable Computing

2:30 PM - 3:00 PM

Technical Session 2: FPGA-based computing

Yamuna Rajasekhar and Ron Sass: Architecture and Applications for an All-FPGA Parallel Computer

The Reconfigurable Computing Cluster (RCC) project has been investigating unconventional architectures for high end computing using a cluster of FPGA devices connected by a high-speed, custom network. Most applications use the FPGAs to realize an embedded System-on-a-Chip (SoC) design augmented with application-specific accelerators to form a message-passing parallel computer. Other applications take a single accelerator core and tessellate the core across all of the devices, treating them like a large virtual FPGA. The experimental hardware has also been used for basic computer research by emulating novel architectures. This article discusses the genesis of the over-arching project, summarizes results of individual investigations that have been completed, and considers how this approach may prove useful in the investigation of future Exascale systems.


3:00 PM - 3:30 PM

Coffee break

3:30 PM - 4:00 PM

Technical Session 3: Low-power based computing

Edson L. Padoin, Daniel A. G. de Oliveira, Pedro Velho, Philippe O. A. Navaux: Evaluating Performance and Energy of ARM-based Clusters for High Performance Computing

For many years, the High-Performance Computing (HPC) community aimed at increasing performance regardless of energy consumption. However, energy is limiting the scalability of next-generation supercomputers. Current HPC systems already consume huge amounts of power, on the order of a few megawatts (MW). Future HPC systems are intended to achieve 10 to 100 times more performance, but the accepted power budget for those machines must remain below 20 MW. Therefore, the scientific community is investigating ways to improve energy efficiency. This paper presents a study of the execution time, power consumption, maximum power and energy efficiency of ARM architectures. Our objective is to verify the feasibility of clusters using processors that target low power consumption. As a by-product of our research, we built an unconventional cluster of PandaBoards, each featuring two ARM Cortex A9 cores. We believe that these unconventional solutions offer an alternative basis for building HPC clusters that respect the limits of electric energy.


4:00 PM - 5:00 PM

Closing Panel