Tutorials

Four half-day tutorials will be hosted on the Monday of the conference:

  1. Parallel I/O: Lessons learnt in the last 20 years (9/20, 8:30AM-12PM)
  2. Building Highly Available HPC Clusters with HA-OSCAR (9/20, 8:30AM-12PM)
  3. State of InfiniBand in Designing HPC Clusters, Storage/File Systems, and Datacenters (9/20, 1PM-4:30PM)
  4. MPI Tuning with Intel® Trace Analyzer and Intel® Trace Collector (CANCELLED)

Parallel I/O: Lessons learnt in the last 20 years
Toni Cortes, Universitat Politècnica de Catalunya (UPC), CEPBA-IBM Research Institute (CIRI)

Around 20 years have passed since the first disk striping experiments, and plenty of work has been done to improve parallel I/O at many different levels. Interesting ideas have been presented, implemented, and evaluated, ranging from the devices to the interface, going through the file system, operating system, and middleware.

After these two decades, it is a good time to go through all the work done and try to learn the important lessons these parallel I/O initiatives have taught us. This tutorial aims to give this global overview. The focus will not be on commercial/academic systems/prototypes, but on the concepts that lie behind them. These concepts have normally been applied at different levels, and thus such an overview can be of interest to many people, ranging from hardware designers to application implementers. Some of the most important concepts that will be discussed are, among others, data placement (RAIDs, 2D and 3D files, ...), network architectures for parallel I/O (network attached devices, SAN, ...), parallel caching and prefetching (cooperative caching, informed caching and prefetching, ...), and interfaces (collective I/O, data distribution interfaces, ...).

Building Highly Available HPC Clusters with HA-OSCAR
Chokchai Leangsuksun, Louisiana Tech University, and Ibrahim Haddad, Ericsson Research

March 2004 was a major milestone for the HA-OSCAR Working Group. It marked the announcement of the first public release of the HA-OSCAR software package. HA-OSCAR is an Open Source project that aims to combine the power of high availability and high performance computing. HA-OSCAR enhances a Beowulf cluster system for mission critical grade applications with various high availability mechanisms, such as component redundancy to eliminate single points of failure, self-healing mechanisms, and failure detection and recovery mechanisms, in addition to supporting automatic failover and fail-back. The first release (version 1.0) supports new high availability capabilities for Linux Beowulf clusters based on the OSCAR 3.0 release from the Open Cluster Group. In this release of HA-OSCAR, we provide an installation wizard graphical user interface and a web-based administration tool, which allow intuitive creation and configuration of a multi-head Beowulf cluster. In addition, we have included a default set of monitoring services to ensure that critical services, hardware components, and important cluster resources are always available at the control node. This release also features new services that can be configured and added via a WebMin-based HA-OSCAR administration tool.

This tutorial will address in detail the design and implementation issues related to building HA Linux Beowulf clusters using Linux and Open Source Software as the base technology. The focus of the tutorial is HA-OSCAR. We will present the architecture of HA-OSCAR, review the new features of the current release, explain how we implemented the HA features, and discuss our experiments covering performance and availability, as well as our test results.

State of InfiniBand in Designing HPC Clusters, Storage/File Systems, and Datacenters
D.K. Panda, Ohio State University

InfiniBand is a new and emerging networking technology standard. It has many novel features that are not available in other contemporary networks. Current 4X InfiniBand Architecture (IBA) products support 10 Gbps bandwidth at the link level and thus provide new ways to design next generation HPC systems. Since the announcement of the IBA standard in Oct. 2000, it has gone through many ups and downs. Over the last year and a half, it has been gaining ground as a novel interconnect for designing next generation HPC clusters, servers, storage, file systems, and data centers. This raises the following open questions for designers, managers, developers, and users of high-end computing, networking, and storage systems: 1) What is IBA? 2) How is it different from other interfaces and interconnects (PCI-X, PCI-Express, iSCSI, 10 GigE, Myrinet, Quadrics, etc.)? 3) What kinds of IBA hardware and software solutions are currently available? and 4) How can one take advantage of IBA features to design next generation high-end systems (HPC clusters, storage systems, file systems, and datacenters) with high performance and scalability?

Based on the current situation of IBA and its growth, the goals of this tutorial are as follows:
  1. To provide answers to the above open questions.
  2. To familiarize attendees with the IBA architecture and its associated benefits.
  3. To provide an overview of the InfiniBand hardware and software solutions currently available.
  4. To outline case studies of designing next generation systems (HPC with MPI, Distributed Shared Memory, File Systems, Storage systems, Database systems, and Multi-tier Datacenters) that take advantage of IBA features.
  5. To outline the issues, challenges, benefits, and limitations of designing the above systems with InfiniBand compared to other interconnects.

In summary, the tutorial aims to familiarize attendees with IBA, its benefits, the available IBA hardware/software solutions, and the latest trends in designing high-end computing, networking, and storage systems with IBA, and to provide a critical assessment of whether IBA is ready for prime time.


Questions -- Questions should be sent electronically to the Tutorial Chair, Jennifer M. Schopf (jms@mcs.anl.gov), and to the Tutorial Deputy Chair, Charles D. Norton (Charles.D.Norton@jpl.nasa.gov).