CLUSTER 2004 POSTER ABSTRACTS


 

APENet: a high speed, low latency 3D interconnect network

 

R. Ammendolaa, M. Guagnellib, G. Mazzaa, F. Palombic,

R. Petronzioa,b, D. Rossettid, A. Salamona and P. Vicinid

 

aIstituto Nazionale di Fisica Nucleare, Sezione Roma II

bDipartimento di Fisica, Universitˆ di Roma Tor Vergata

cE. Fermi Research Center

dIstituto Nazionale di Fisica Nucleare, Sezione Roma I

 

 

In this paper we present APENet, a new high speed, low latency, 3-dimensional interconnect architecture optimized for PC clusters running LQCD-like numerical applications.  The hardware implementation is based on a single PCI-X 133MHz network interface card hosting six independent bi-directional channels with a peak bandwidth of 676 MB/s each direction and measured latency less than 10 ms.  The internal packet switching capabilities of the network card allows up to three couple of links simultaneously active.  The current software environment, based on Linux, is made of a low-level library and an high-level application library. An MPI implementation and a network device driver are being actively developed.


 


Application-Specific Scheduling for the Organic Grid

 

Arjav Chakravarti, Gerald Baumgartner, Mario Lauria

 

Dept. of Computer and Information Science, The Ohio State University

 

 

We propose a biologically inspired and fully-decentralized approach to the organization of computation that is based on the autonomous scheduling of strongly mobile agents on a peer-to-peer network. Our approach achieves the following design objectives: near-zero knowledge of network topology, zero knowledge of system status, autonomous scheduling, distributed computation, lack of specialized nodes. Every node is equally responsible for scheduling and computation, both of which are performed with practically no information about the system. We believe that this model is ideally suited for large-scale unstructured grids such as desktop grids. This model avoids the extensive system knowledge requirements of traditional Grid scheduling approaches. Contrary to the popular master/worker organization of current desktop grids, our approach does not rely on specialized super-servers or on application-specific clients. By encapsulating computation and scheduling behavior into mobile agents, we decouple both application code and scheduling functionality from the underlying infrastructure. The resulting system is one where every node can start a large grid job, and where the computation naturally organizes itself around available resources. Through the careful design of agent behavior, the resulting global organization of the computation can be customized for different classes of applications. In a previous paper, we described a proof-of-concept prototype for an independent task application. In this paper, we generalize the scheduling framework and demonstrate that our approach is applicable to a computation with a highly synchronous communication pattern, namely Cannon's matrix multiplication.


Attaining Higher Performance in Collective Communication

 

E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn

 

University of Texas

 

 

It has long been thought that research into collective communication algorithms on distributed-memory parallel computers has been exhausted.  This project demonstrates that the implementations available as part of widely-used libraries are suboptimal. We demonstrate this through the implementation of the Òreduce-scatterÓ collective communication and comparison with the MPICH implementation of MPI. Performance on a large cluster is reported.


 


GRID-enabled bioinformatics applications for comparative genomic analysis at the CBBC

 

A. Hunter, D. Schibeci, H. L. Hiew, M. Bellgard

 

The Centre for Bioinformatics and Biological Computing
Murdoch University, Murdoch, Western Australia

 

 

Bioinformatics is an important application area for Grid computing. The Grid computing issues required to tackle current bioinformatics challenges include processing power, large-scale data access and management, security, application integration, data integrity and curation, control/automation/tracking of workflows, data format consistency and resource discovery. In this poster, we describe preliminary steps taken to develop a Grid environment to advance Bioinformatics research. We developed a system called Grendel, with the aims of providing Bioinformatics researchers transparent access to basic computational resources used in their research. Grendel is a platform and language independent web-services based system for distributed resource management utilising Sun Grid Engine that provides a single entry point for computational tasks while keeping the actual resources transparent to the user. Grendel is developed in Java and deployed using the Tomcat. Client libraries have been developed in Perl and Java to provide access to computation resource exported via Grendel.


Fast broadcast by the divide-and-conquer algorithm[1]

 

Dongyoung Kim and Dongseung Kim

 

Department of Electrical Engineering

Korea University, Seoul, Korea

 

 

Collective communication functions including the broadcast in cluster computers usually take O(m log P) time in propagating the size-m message to P processors. We have devised a new O(m) broadcast algorithm, independent of the number of processors involved, by using divided-and-conquer algorithm. Details are given below.

 

Suppose the network is homogeneous and the point-to-point message communication take the same time regardless of the processor indexes if the message sizes are same. A group of K processors G(P0:PK-1) are partitioned into two equal-sized groups GL and GR where GL  = G(P0:PK/2-1) and GR  = G(PK/2:PK-1). For simplicity assume that K = 2k for some integer k. In the divided-and-conquer broadcast, a message M is also divided into two equal-sized sub-messages ML and MR, and each is sent to the respective leader processor (for example, the first processor in the group) in GL and GR. The leader then propagates the accepted message to all others in the group (this is also broadcast, called the inner broadcast). Once both groups finish broadcasting internally, there will be simultaneous intergroup message exchanges between pairs of processors, where a processor pair-j (j=0,1, ..., K/2-1) consists of j-th processor in GL and j-th processor in GR. Now, every processor will get the total message M after the exchange, since a half of it has been propagated within its own subgroup, and the remaining half is accepted from the other group by the exchange. The broadcast completes.

 

Note that the inner broadcast is also performed using the same method as the outer one. The algorithm is stated formally below where Broadcast(M, Pq:Pq+K-1) represents the broadcast of message M to all K processors from Pq to Pq+K-1, and send(X, Pq, Pr) is the point-to-point communication that transfers the message X from Pq to Pr.

 

Procedure Broadcast(M, Pq:Pq+K-1)

/* ML and MR are bisections of M */

/* M= ML || MR, |ML| = |MR| = |M|/2 */

/*|| represents concatenation, and |X| represents the size of X. */

{ if (K > 1) { send(MR, Pq, Pq+K/2)

do in parallel

{Broadcast(ML, Pq:Pq+K/2-1); Broadcast(MR, Pq+K/2: Pq+K-1)}

forall (i=0,1, ... K/2-1)

do in parallel

{send(ML, Pq+i, Pq+K/2+i); send(MR, Pq+K/2+i, Pq+i)}

}

else;

}

 

The overall time T(M,K) for the message broadcast of size M on K processors can be computed directly from the algorithm with T(M/2,K/2) and 2t(M/2), where t(M/2) is the point-to-point communication time of the message of size M/2 between arbitrary two processors. Hence,

 

     T(M,K) = T(M/2,K/2) + t(M/2) + t(M/2)   (for K>2)                   (1)

     T(M,K) = 0  if K = 1

 

The recursion is solved in a closed form as below.

 

     T(M,K) = 2{t(M/21)+t(M/22) + É + t(M/2k) }                                              (2)

 

The point-to-point message passing time of t(y) can be modeled as t(y) = S + y/B where S is a setup time and B is the bandwidth of the network. By substituting the equation into (2), the broadcast time becomes

 

     T(M,K) = 2SlogP + 2(M/B)(1 Ð 1/K)                                                             (3)

 

If K >> 1 and the setup time S is not large compared to M/B, the first term can be ignored and the broadcast time is T(M,K) = 2(M/B), which is independent of K.

 

In conclusion, the new broadcast algorithm will be much faster if there are many participating processors for the broadcast and the message is large.

 

VLAN-based Routing: Multi-path L2 Ethernet Network for HPC Clusters

 

Tomohiro Kudoh, Hiroshi Tezuka, Motohiko Matsuda

Yuetsu Kodama, Osamu Tatebe, Satoshi Sekiguchi

 

Grid Technology Research Center

National Institute of of

Advanced Industrial Science and Technology

Tsukuba Ibaraki, Japan

 

 

We propose and evaluate a VLAN-based routing method for intra-cluster networks, which realizes a large bi-section bandwidth using small L2 switches. Gigabit Ethernet is commonly used for intra-cluster networks because of its low cost and tolerable performance. However, costs of large non-blocking switches are high. Therefore, to realize a low-cost cluster, low-cost small L2 switches should be used. On the other hand, intra-cluster networks for high performance computing cluster require a large bi-section bandwidth. Since the topology of L2 Ethernet network can not include any loop, a large bi-section bandwidth can not be realized using small L2 switches when the number of computing nodes is large. To solve this problem, we map multiple VLANs on switches and network interfaces of a network. While no loops can be included in a VLAN, loops can be included in the physical topology by using multiple VLANs on a single physical network, and a large bi-section bandwidth can be realized. In this presentation, we will show some sample topologies which uses VLAN-based routing, and show performance of NPB executed on a cluster which uses VLAN-based routing.

 

Acknowledgement: This work has been supported by an NEDO Grant-in-Aid for private sector fundamental technology ``Research on Large-Scale and Reliable Servers,'' and Cisco Systems University Research Program.

 

A model for Resource-Aware Load Balancing on Heterogeneous Clusters

J. Faik, J. E. Flaherty, L. G. Gervasio

Department of Computer Science

Rensselaer Polytechnic Institute

Troy, NY 12180

 

J. D. Teresco

Department of Computer Science

Williams College

Williamstown, MA 01267

 

K. D. Devine, E. G. Boman

Sandia National Laboratories

Albuquerque, NM 87185-1111

 

We address the problem of partitioning and dynamic load balancing on clusters with heterogeneous hardware resources.We propose DRUM, a model that encapsulates hardware resources and their interconnection topology. DRUM provides monitoring facilities for dynamic evaluation of communication, memory and processing capabilities. Heterogeneity is quantified by merging the information from the monitors to produce a scalar number called "power". This power allows DRUM to be used easily by existing load-balancing procedures such as those in the Zoltan toolkit while placing minimal burden on the application programmers.  We demonstrate the use of DRUM to guide load balancing in the adaptive solution of a Laplace equation on a heterogeneous cluster. We observed significant reduction in execution time compared to traditional methods.

 

 


 



[1] This research was supported by KOSEF grant (R01-2001-0341-0) and KRF grant(2003-041-D0049.

 

Master Slave Scheduling on Heterogeneous Star-shaped Platforms with Limited Memory

 

Arnaud Legrand, Olivier Beaumont*, Loris Marchal* ,Yves Robert*

 

LaBRI, UMR CNRS 5800, Bordeaux, France

*LIP, UMR CNRS-INRIA 5668, ENS Lyon, France

 

 

In this work, we consider the problem of allocating and scheduling a collection of independent, equal-sized tasks on heterogeneous star-shaped platforms. We also address the same problem for divisible tasks. For both cases, we take memory constraints into account. We prove strong NP-completeness results for different objective functions, namely makespan minimization and throughput maximization, on simple star-shaped platforms. We propose an approximation algorithm based on the unconstrained version (with unlimited memory) of the problem. We introduce several heuristics, which are evaluated and compared through extensive simulations.  An unexpected conclusion drawn from these experiments is that classical scheduling heuristics that try to greedily minimize the completion time of each task are outperformed by the simple heuristic that consists in assigning the task to the available processor that has the smallest communication time, regardless of computation power (hence a "bandwidth-centric" distribution).


Reliability Algorithms for Network Swapping Systems with

Page Migration

 

Ben Mitchell, Julian Rosse, Tia Newhall

 

Swarthmore College, Swarthmore, PA, USA

 

 

Network swapping systems allow individual cluster nodes with over-committed memory to use the idle memory of remote nodes as their backing store, and to swap pages over the network. Without reliability support a single node crash can affect programs running on other nodes by losing their remotely swapped page data. RAID-based [3, 1] reliability solutions promise the best alternative in terms of flexibility and performance. However, two important features of our network swapping system, Nswap [2], make direct application of RAID-based schemes impossible. First, Nswap adapts to each node's local memory load, adjusting the amount of RAM space it makes available for remote swapping, which results in a variable capacity Òbacking storeÓ. Second, Nswap supports migration of remotely swapped pages between cluster nodes, which occurs when a node needs to reclaim some of its RAM from Nswap to use for local processing. Page migration complicates reliability if, for example, two pages in the same parity group end up on the same node. We present novel reliability algorithms that solve these problems. Our Parity algorithm uses dynamic parity group membership to match NswapÕs dynamic nature. We show that our algorithms add minimal overhead to remote swapping.

 

References

1.                    Evangelos P. Markatos and George Dramitinos. Implementation of a Reliable Remote Memory Pager.  In USENIX 1996 Annual Technical Conference, 1996.

2.                    Tia Newhall, Sean Finney, Kuzman Ganchev, and Michael Spiegel. Nswap: A network swapping module for linux clusters. In Euro-Par'03 International Conference on Parallel and Distributed Computing. Klagenfurt, Austria, 2003.

3.                    David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In ACM SIGMOD International Conference on Management of Data, pages 109-116, 1988.

 


A Community Faulted-crust Model Using PYRAMID on
Cluster Platforms

 

Jay Parker, Greg Lyzenga, Charles Norton, Edwin Tisdale, Andrea Donnellan

 

California Institute of Technology, Jet Propulsion Laboratory, Pasadena, CA, USA

 

 

Recent development has boosted the GeoFEST system for simulating the faulted crust from a local desktop research application to a community model deployed on advanced cluster platforms, including an Apple G5, Intel P4, SGI Altix 3000, and HP Itaniam 2 clusters. GeoFEST uses unstructured tetrahedral meshes to follow details of stress evolution, fault slip, and plastic/elastic processes in quake-prone inhomogeneous regions, like Los Angeles. This makes it ideal for interpreting GPS and radar measurements of deformation.

 

To remake GeoFEST as a high-performance community code, essential new features are Web accessibility, scalable performance on popular clusters, and parallel adaptive mesh refinement (PAMR). While GeoFEST source is available for free download, a web portal environment is also supported. Users can work entirely within a web browser from problem definition to results animation, using tools like a database of faults, meshing, GeoFEST, and visualization.

 

For scalable deployment, GeoFEST now relies on the PYRAMID library. The direct solver was rewritten as an iterative method, using PYRAMID's support for partitioning. Analysis determined that scaling is most sensitive to solver communication required at the domain boundaries. Direct pairwise exchange proved successful (linear), while a binary tree method involving all domains was not.  On current Intel clusters with Myrinet the application has insignificant communication overhead for problems down to ~1000s of elements per processor. Over one million elements run well on 64 processors.

 

Initial tests using PYRAMID for the PAMR (essential for regional simulations) and a strain-energy metric produce quality meshes.


Flexible and Dynamic Control of Network QoS in Grid environments: the QoSINUS approach

 

Pascale Vicat-Blanc Primet2, Johan Montagnat1, Fabien Chanussot2

 

1 CNRS UMR5515, CREATIS, INSA de Lyon, 69621 Villeurbanne, France

2 INRIA-ENS LIP, RESO, 46, allŽe d'Italie, 69 007 Lyon, France

 

 

Grids rely on a complex interconnection of IP domains that may exhibit changing performance characteristics and may offer different quality of service (QoS) facilities. We examine the case of a biomedical application distributed over a grid and show how it may suffer from uncontrolled communication performance. Then we present the QoSINUS service that dynamically allocates the network resources to Grid flows in order to match their specific QoS requirements under different load conditions. The aim of this approach is to optimize the end to end performances the heterogeneous mix of grid flows gets from the network to enhance the individual application's performance as the overall grid infrastructure performance and utilization level. The QoSINUS service is based on the programmable network approach that offers flexibility, evolutivity and enables dynamic adaptation to network load variations. Finally results of QoSINUS experiments conducted in the context of the eToile french grid testbed based on the high speed and DiffServ capable research network infrastructure, VTHD, are presented.


Improving the Performance of Communication-Intensive Parallel Applications Executing on Clusters

 

Xiao Qin, Hong Jiang

 

Department of Computer Science and Engineering

University of Nebraska-Lincoln

 

 

Clusters have emerged as a primary and cost-effective infrastructure for parallel applications, including communication-intensive applications that transfer a large amount of data among nodes of a cluster via the interconnection network. Conventional load balancers have been proven effective in increasing the utilization of CPU, memory, and disk I/O resources in a cluster. However, most of the existing load balancing schemes ignore network resources, leaving open the opportunity for significant performance bottleneck to form for communication-intensive parallel applications due to unevenly distributed communication load. To remedy this problem, we propose a communication-aware load balancing technique that is capable of improving the performance of communication-intensive applications by increasing the effective utilization of network resources in clusters. To facilitate the proposed load-balancing scheme, we introduce a behavior model for parallel applications with large requirements of CPU, memory, network, and disk I/O resources. The proposed load-balancing scheme can make full use of this model to quickly and accurately determine the load induced by a variety of parallel applications. Simulation results on executing a diverse set of both synthetic bulk synchronous and real parallel applications on a cluster show that the proposed scheme can significantly improve the performance both in slowdown and turn-around time over three existing schemes by up to 206% (with an average of 74%) and 235% (with an average of 82%), respectively.


 

Dynamic Page Migration in Software DSM Systems

 

Thomas Repantis1, Christos D. Antonopoulos2,
Vana Kalogeraki1, Theodore S. Papatheodorou2

 

1Department of Computer Science & Engineering, University of California, Riverside

2Computer Engineering & Informatics Department, University of Patras, Greece

 

 

Dynamic page migration, when employed in Distributed Shared Memory (DSM) systems offers several advantages: (i) reduces the latency of memory accesses, (ii) improves resource utilization by considering the computational and communicational needs of the applications and adapting to the changing resource availability, and (iii) achieves the above with lower overhead than traditional approaches that rely on thread migration.

 

We propose a simple and efficient page migration mechanism [1], that dynamically allocates shared memory pages to home nodes. Each page has a designated home node and nodes that heavily modify the pages can become their new homes. In our protocol, to avoid redundant page transfers, we perform migration only when the number of modifications of a page becomes larger than a threshold. The migration information is piggybacked on the existing synchronization messages to minimize the communication overhead. The migration decision is taken locally, at the home of each page. We have implemented our mechanism in the JIAJIA software DSM [2]. Performance evaluation using real application benchmarks shows that our mechanism significantly reduces remote page modifications, improves memory access latencies, and achieves better performance than its competitors. We observe that the cost of executing the algorithm and of migrating the pages is amortized by the benefits gained.

 

References

1.                    Thomas Repantis. Implementation of Page Forwarding on Clusters. Diploma thesis, University of Patras, Greece. http://www.cs.ucr.edu/~trep/tsrDiplThesis.html.

2.                    Weisong Shi. Improving the Performance of Software DSM Systems. PhD thesis, Institute of Computing Technology, Chinese Academy of Sciences, 1999.

 


A Shared Virtual Memory Network with Fast Remote Direct Memory Access and Message Passing

 

Gang Shi, Mingchang Hu, Hongda Yin, Weiwu Hu, Zhimin Tang

 

Institute of Computing Technology, Chinese Academy of Sciences

 

 

The communication overhead has become one of the bottlenecks of SVM (shared virtual memory). Many methods have been taken to improve the performance of SVM. However, these canÕt obtain the improvement as expected. In order to get farther utility of communication hardware and reduce unnecessary overhead, a prototype with the ability of RDMA is designed and implemented in this paper, which is named FRAMP (virtual memory based Fast Remote direct memory Access and Message Passing network). FRAMP includes the crossbar-based switch, the custom host network interface and the user-level communication protocol. All of these are tightly coupled and deliberately balanced. FRAMP achieves 3.7 s one-way latency and 6.0 s RDMA read latency on system driver level. FRAMP gets 5.6 s one-way latency and 2.0 s ping-ping latency and 125MB/s asymptotic bandwidth on user API level with multi-thread programming method. Remote memory read for 8 bytes and a page of 4096 bytes only takes 8.0 s and 39 s respectively on user level. The obtained bandwidth is close to the hard-ware limit of our experimental environment, which is based on 33MHz 32-bit PCI bus, and the use rate of PCI bus is 94%. The SVM performance on FRAMP network with pure message passing is very good, but the one using RDMA read to fetch fault pages is not so good.


The Los Alamos Crestone Project: Cluster Computing Applications

 

Robert P. Weaver1 and Michael L. Gittings1,2

 

1Applied Physics (X) Division, Los Alamos National Laboratory
 2Science Applications International Corporation
 

 

The Los Alamos Crestone Project is part of the Department of EnergyÕs (DOE) Accelerated Strategic Computing Initiative, or ASCI Program. The main goal of this software development project is to investigate the use of continuous adaptive mesh refinement (CAMR) techniques for application to problems of interest to the Laboratory.  There are many code development efforts in the Crestone Project, both unclassified and classified codes. An overview of the Crestone Project, and the SAGE and RAGE codes, has been published recently in [1]. In this paper, I will give the status of the use of these CAMR codes on commodity cluster machines. A general description of the RAGE code has been published in [2], [3], [4] and [5].

 

One of the most economical methods for achieving supercomputing capability is to use commodity processors connected by commodity interconnects. This was highlighted recently at Virginia Tech when Dr. Varadarajan built the third fastest supercomputer in the world by connecting 1100 dual-processor Macintosh G5 machines together (see http://www.top500.org). Most commodity clusters use a form of LINUX as the operating system.

 

We will give an overview of the current status of using the Crestone Project codes SAGE and RAGE on commodity cluster machines. These codes are intended for general applications without tuning of algorithms or parameters.  We have run a wide variety of physical applications from millimeter-scale laboratory laser experiments, to the multikilometer-scale asteroid impacts into the Pacific Ocean, to parsec-scale galaxy formation. Examples of these simulations will be shown. The goal of our effort is to avoid ad hoc models and attempt to rely on first-principles physics. In addition to the large effort on developing parallel code physics packages, a substantial effort in the project is devoted to improving the computer science and software quality engineering (SQE) of the Project codes as well as a sizable effort on the verification and validation (V&V) of the resulting codes. Examples of these efforts for our project will be discussed. Recent results of the scaling of these codes on commodity clusters will be shown.

 

Acknowledgement. This work was supported by the University of California, operator of the Los Alamos National Laboratory under Contract No. W-7405-ENG-36 with the U.S. Department of Energy.

 

References

1.                    Massively Parallel Simulations with DOE's ASCI Supercomputers: An Overview of the Los Alamos Crestone Project. R. Weaver and M.L. Gittings, in Press Springer-Verlag 2004, proceedings of the Chicago Adaptive Mesh Refinement Workshop, September 2003.

2.                    The SAGE Code Multimaterial Equation of State Methods, M.L. Gittings in Numerical Methods Symposium, 28-30 April 1992; "The RAGE Code", R.N. Byrne, T. Betlach, and M.L. Gittings in Numerical Methods Symposium, 28-30 April 1992. Copies may be ordered from the Defense Nuclear Agency (currently the Defense Threat Reduction Agency), 56801 Telegraph Road, Alexandria, VA 22310-3398.

3.                    2D and 3D simulations of RM instability growth with RAGE: a continuous adaptive mesh refinement code, R.Weaver, M.L. Gittings, R.M. Baltrusaitis, Q. Zhang and S.Sohn, July 1997, proceedings of the 21st International Symposium on Shock Waves (ISSW21), Australia, paper 8271.

4.                    ÒThe parallel implementation of RAGE: a 3D continuous adaptive mesh refinement radiation-hydrodynamics codeÓ R. Weaver, M.L. Gittings, M.L. Clover, July 1999, proceedings of the 22st International Symposium on Shock Waves (ISSW22), London, paper 3560.

5.                    ÒThe Simulation of Shock-Generated InstabilitiesÓ R.M. Baltrusaitis, M.L. Gittings, R. Weaver, R. Benjamin and J. Budzinski 1996, Physics of Fluids, 8 (9), p. 2471 (also as LA-UR-95-4000).

 

BaBarGrid