8:45 AM - 9:00 AM | Welcome Address and Opening Remarks

9:00 AM - 10:00 AM | Keynote by Sudhakar Yalamanchili, Georgia Institute of Technology, USA: Scaling Resource Compositions in a Flatter World
Following the end of Dennard scaling and the transition to multicore, we are now seeing an evolution toward communication-centric architectures and systems. Data movement is more expensive in time and energy than compute, and as a consequence systems are undergoing another fundamental transformation, optimizing data movement rather than compute. This transformation will percolate up through the software stacks to clusters and data centers. The trend has been amplified by the emergence of big data as a major challenge for future systems. This talk will make some observations about the impact of technology trends on cluster architectures and offer some opinions on anticipated research problems. It will conclude with a description of our new project, Oncilla, an experimental platform on which we explore data movement optimizations for data warehousing applications in the context of clusters architected to offer flexible compositions of heterogeneous compute and memory resources.
Presentation

10:00 AM - 10:30 AM | Coffee break

10:30 AM - 12:00 PM | Technical Session 1: GPU-based Computing

Ray Bittner, Erik Ruf: Direct GPU/FPGA Communication Via PCI Express

Parallel processing has hit mainstream computing in the form of CPUs, GPUs, and FPGAs. While explorations proceed with all three platforms individually and with the CPU-GPU pair, little exploration has been performed on the GPU-FPGA pairing. This is due in part to the cumbersome nature of communication between the two. This paper presents a mechanism for direct GPU-FPGA communication and characterizes its performance in a full hardware implementation.
Brad Suchoski, Caleb Severn, Manu Shantharam and Padma Raghavan: Adapting Sparse Triangular Solution to GPUs

High performance computing systems are increasingly incorporating hybrid CPU/GPU nodes to accelerate the rate at which floating point calculations can be performed for scientific applications. A key challenge is adapting scientific applications to such systems when the underlying computations are sparse, such as sparse linear solvers for the simulation of partial differential equation models using semi-implicit methods. A key bottleneck in such solvers, including preconditioned conjugate gradients (PCG), is sparse triangular solution. We show that sparse triangular solution can be effectively mapped to GPUs by extracting very large degrees of fine-grained parallelism using graph coloring. We develop simple performance models to predict these effects at the intersection of data and hardware attributes, and we evaluate our scheme on an NVIDIA Tesla M2090 GPU relative to the level-set scheme developed at NVIDIA. Our results indicate that our approach significantly enhances the available fine-grained parallelism and speeds up execution relative to the NVIDIA scheme by a geometric-mean factor of 5.41 on a single GPU, with speedups as high as 63 in some cases.
Presentation
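To make the parallelization idea concrete, here is a minimal sketch (not the authors' code) of the simpler level-set variant of the idea: the rows of a sparse lower-triangular system are grouped into dependency levels, and all rows within one level can be solved concurrently, e.g. one GPU kernel launch per level. The paper's graph-coloring scheme partitions the same dependency graph differently to expose more parallelism; all names below are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

def level_sets(L):
    """Group the rows of sparse lower-triangular L into dependency levels.

    Row i depends on every earlier row j with L[i, j] != 0; rows that
    land in the same level are mutually independent.
    """
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        start, end = L.indptr[i], L.indptr[i + 1]
        deps = [level[j] for j in L.indices[start:end] if j != i]
        level[i] = 1 + max(deps) if deps else 0
    return [np.flatnonzero(level == k) for k in range(level.max() + 1)]

def solve_lower_by_levels(L, b):
    """Solve L x = b level by level; each level is one parallel step."""
    x = np.zeros(len(b))
    for rows in level_sets(L):
        for i in rows:  # on a GPU, the rows of one level run concurrently
            start, end = L.indptr[i], L.indptr[i + 1]
            s = sum(L.data[k] * x[L.indices[k]]
                    for k in range(start, end) if L.indices[k] != i)
            x[i] = (b[i] - s) / L[i, i]
    return x

# Tiny usage example: a 3x3 lower-triangular system
L = csr_matrix(np.array([[2., 0., 0.], [1., 3., 0.], [0., 1., 4.]]))
print(solve_lower_by_levels(L, np.array([2., 5., 9.])))
```

On a GPU the inner loop over the rows of a level becomes a single parallel kernel, so the number of sequential steps drops from the matrix dimension to the number of levels.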
Pedro Valero-Lara: MRF satellite image classification on GPU

One of the stages of the analysis of satellite images is a classification based on the Markov Random Fields (MRF) method. Several packages in the literature carry out this analysis, including the classification tasks; one of them is the Orfeo ToolBox (OTB). The analysis of satellite images is a computationally expensive task that calls for real-time execution or automation. To reduce the execution time spent on this analysis, parallelism techniques can be used. Currently, Graphics Processing Units (GPUs) are becoming a good choice for reducing the execution time of many applications at low cost. In this paper, the author presents a GPU-based MRF classification derived from the sequential algorithm in the OTB package. The experimental results show a dramatic reduction of the execution time for the GPU-based algorithm, up to 225 times faster than the sequential algorithm included in the OTB package. Moreover, a corresponding significant reduction is observed in total power consumption.
Presentation
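As background for the abstract above, the sketch below shows the general shape of MRF-based classification using an ICM (iterated conditional modes) style update: each pixel's label is repeatedly set to minimize a data term plus a smoothness penalty over its neighbors. This is a generic illustration, not the OTB or GPU implementation; all names and the energy terms are assumptions. The per-pixel updates are exactly the part that maps naturally onto GPU threads.

```python
import numpy as np

def icm_mrf_classify(image, class_means, beta=1.0, iters=5):
    """Generic ICM relabeling for MRF classification of a grayscale image.

    Per-pixel energy = (intensity - class mean)^2
                     + beta * (number of 4-neighbors with a different label).
    """
    h, w = image.shape
    # Initial labeling: nearest class mean per pixel
    labels = np.argmin(np.abs(image[..., None] - class_means), axis=-1)
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                best_c, best_e = labels[y, x], np.inf
                for c in range(len(class_means)):
                    e = (image[y, x] - class_means[c]) ** 2
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] != c:
                            e += beta
                    if e < best_e:
                        best_c, best_e = c, e
                labels[y, x] = best_c
    return labels

# Tiny usage example: two classes on a noisy two-region image
rng = np.random.default_rng(0)
img = np.hstack([np.full((8, 4), 0.2), np.full((8, 4), 0.8)])
img = img + rng.normal(0.0, 0.05, img.shape)
print(icm_mrf_classify(img, class_means=np.array([0.2, 0.8])))
```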
12:00 PM - 1:30 PM | Lunch

1:30 PM - 2:30 PM | Keynote by Joel Emer, Intel, USA: Scaling the von Neumann Wall with Reconfigurable Computing

2:30 PM - 3:00 PM | Technical Session 2: FPGA-based Computing

Yamuna Rajasekhar and Ron Sass: Architecture and Applications for an All-FPGA Parallel Computer
The Reconfigurable Computing Cluster (RCC) project has been investigating unconventional architectures for high-end computing using a cluster of FPGA devices connected by a high-speed, custom network. Most applications use the FPGAs to realize an embedded System-on-a-Chip (SoC) design augmented with application-specific accelerators to form a message-passing parallel computer. Other applications take a single accelerator core and tessellate it across all of the devices, treating them like one large virtual FPGA. The experimental hardware has also been used for basic computer research by emulating novel architectures. This article discusses the genesis of the overarching project, summarizes the results of the individual investigations completed so far, and describes how this approach may prove useful in the investigation of future Exascale systems.
Presentation

3:00 PM - 3:30 PM | Coffee break

3:30 PM - 4:00 PM | Technical Session 3: Low-power Computing

Edson L. Padoin, Daniel A. G. de Oliveira, Pedro Velho, Philippe O. A. Navaux: Evaluating Performance and Energy of ARM-based Clusters for High Performance Computing
For many years, the High-Performance Computing (HPC) community aimed at increasing performance regardless of energy consumption. However, energy is now limiting the scalability of next-generation supercomputers. Current HPC systems already draw huge amounts of power, on the order of a few megawatts (MW). Future HPC systems aim to achieve 10 to 100 times more performance, but the accepted power budget for those machines must remain below 20 MW. Therefore, the scientific community is investigating ways to improve energy efficiency. This paper presents a study of execution time, power consumption, maximum power, and energy efficiency on ARM architectures. Our objective is to verify the feasibility of clusters built from processors that target low power consumption. As a byproduct of our research we built an unconventional cluster of PandaBoards, each featuring two ARM Cortex-A9 cores. We believe that such unconventional solutions provide an alternative basis for building HPC clusters that respect these electric power limits.

Presentation
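A quick back-of-the-envelope calculation (our illustration, not a result from the paper) makes the 20 MW constraint concrete: for an exascale machine it implies an efficiency target of roughly 50 GFLOPS/W, which is what motivates experimenting with low-power ARM parts.

```python
# Back-of-the-envelope numbers (illustration only, not from the paper)
target_flops = 1e18      # exascale: 10^18 floating-point ops per second
power_budget_w = 20e6    # the 20 MW power envelope cited in the abstract

required_gflops_per_watt = target_flops / power_budget_w / 1e9
print(f"required efficiency: {required_gflops_per_watt:.0f} GFLOPS/W")  # -> 50
```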
4:00 PM - 5:00 PM | Closing Panel