APENet: a high speed, low latency 3D
interconnect network
R. Ammendolaa, M. Guagnellib, G. Mazzaa,
F. Palombic,
R. Petronzioa,b,
D. Rossettid, A. Salamona
and P. Vicinid
aIstituto Nazionale di
Fisica Nucleare, Sezione Roma II
bDipartimento di Fisica,
Universitˆ di Roma Tor Vergata
cE.
dIstituto Nazionale di
Fisica Nucleare, Sezione Roma I
In this paper we
present APENet, a new high speed, low latency,
3-dimensional interconnect architecture optimized for PC clusters running
LQCD-like numerical applications.
The hardware implementation is based on a single PCI-X 133MHz network
interface card hosting six independent bi-directional channels with a peak bandwidth
of 676 MB/s each direction and measured latency less than 10 ms. The internal packet switching
capabilities of the network card allows up to three couple of links
simultaneously active. The current
software environment, based on Linux, is made of a low-level library and an high-level application library. An MPI implementation and
a network device driver are being actively developed.
Application-Specific Scheduling for the
Organic Grid
Arjav Chakravarti, Gerald Baumgartner,
Mario Lauria
Dept. of Computer and Information Science, The
We propose a
biologically inspired and fully-decentralized approach to the organization of
computation that is based on the autonomous scheduling of strongly mobile
agents on a peer-to-peer network. Our approach achieves the following design
objectives: near-zero knowledge of network topology, zero knowledge of system
status, autonomous scheduling, distributed computation, lack of specialized
nodes. Every node is equally responsible for scheduling and computation, both
of which are performed with practically no information about the system. We
believe that this model is ideally suited for large-scale unstructured grids
such as desktop grids. This model avoids the extensive system knowledge
requirements of traditional Grid scheduling approaches. Contrary to the popular
master/worker organization of current desktop grids, our approach does not rely
on specialized super-servers or on application-specific clients. By
encapsulating computation and scheduling behavior into mobile agents, we decouple
both application code and scheduling functionality from the underlying
infrastructure. The resulting system is one where every node can start a large
grid job, and where the computation naturally organizes itself around available
resources. Through the careful design of agent behavior, the resulting global
organization of the computation can be customized for different classes of
applications. In a previous paper, we described a proof-of-concept prototype
for an independent task application. In this paper, we generalize the
scheduling framework and demonstrate that our approach is applicable to a
computation with a highly synchronous communication pattern, namely Cannon's
matrix multiplication.
Attaining
Higher Performance in Collective Communication
E. Chan, M.
Heimlich, A. Purkayastha, and R. van de Geijn
It has long been thought that research into collective communication
algorithms on distributed-memory parallel computers has been exhausted. This project demonstrates that the
implementations available as part of widely-used libraries are suboptimal. We
demonstrate this through the implementation of the Òreduce-scatterÓ collective
communication and comparison with the MPICH implementation of MPI. Performance
on a large cluster is reported.
GRID-enabled bioinformatics applications for
comparative genomic analysis at the CBBC
A. Hunter,
D. Schibeci, H. L. Hiew, M.
Bellgard
The Centre for Bioinformatics and Biological
Computing
Bioinformatics is an important application area for Grid computing. The
Grid computing issues required to tackle current bioinformatics challenges
include processing power, large-scale data access and management, security,
application integration, data integrity and curation,
control/automation/tracking of workflows, data format consistency and resource
discovery. In this poster, we describe preliminary steps taken to develop a
Grid environment to advance Bioinformatics research. We developed a system
called Grendel, with the aims of providing
Bioinformatics researchers transparent access to basic computational resources
used in their research. Grendel is a platform and
language independent web-services based system for distributed resource management
utilising Sun Grid Engine that provides a single
entry point for computational tasks while keeping the actual resources
transparent to the user. Grendel is developed in Java
and deployed using the Tomcat. Client libraries have been developed in Perl and
Java to provide access to computation resource exported via Grendel.
Fast broadcast by the divide-and-conquer
algorithm[1]
Dongyoung Kim and Dongseung Kim
Department of Electrical Engineering
Collective communication functions including the broadcast in
cluster computers usually take O(m log
P) time in propagating the size-m message to P processors. We
have devised a new O(m) broadcast algorithm,
independent of the number of processors involved, by using divided-and-conquer
algorithm. Details are given below.
Suppose the network is homogeneous and the point-to-point message
communication take the same time regardless of the processor indexes if the
message sizes are same. A group of K processors G(P0:PK-1)
are partitioned into two equal-sized groups GL and GR
where GL = G(P0:PK/2-1)
and GR = G(PK/2:PK-1).
For simplicity assume that K = 2k for some integer k.
In the divided-and-conquer broadcast, a message M is also divided into
two equal-sized sub-messages ML and MR, and
each is sent to the respective leader processor (for example, the first
processor in the group) in GL and GR. The
leader then propagates the accepted message to all others in the group (this is
also broadcast, called the inner broadcast). Once both groups finish
broadcasting internally, there will be simultaneous intergroup
message exchanges between pairs of processors, where a processor pair-j (j=0,1, ..., K/2-1) consists of j-th
processor in GL and j-th
processor in GR. Now, every processor will get the total
message M after the exchange, since a half of it has been propagated
within its own subgroup, and the remaining half is accepted from the other
group by the exchange. The broadcast completes.
Note that the inner broadcast is also performed using the same
method as the outer one. The algorithm is stated formally below where Broadcast(M,
Pq:Pq+K-1) represents the broadcast of message M to
all K processors from Pq to Pq+K-1,
and send(X,
Pq, Pr) is the point-to-point communication that
transfers the message X from Pq
to Pr.
Procedure Broadcast(M, Pq:Pq+K-1)
/* ML and MR are bisections of M */
/* M= ML || MR, |ML| = |MR|
= |M|/2 */
/*|| represents concatenation, and |X| represents the size of X.
*/
{ if (K > 1) { send(MR,
do in parallel
{Broadcast(ML, Pq:Pq+K/2-1);
Broadcast(MR, Pq+K/2: Pq+K-1)}
forall (i=0,1, ... K/2-1)
do in parallel
{send(ML, Pq+i, Pq+K/2+i); send(MR,
Pq+K/2+i, Pq+i)}
}
else;
}
The
overall time T(M,K) for the message broadcast of size M on K
processors can be computed directly from the algorithm with T(M/2,K/2)
and 2t(M/2), where t(M/2) is the point-to-point communication
time of the message of size M/2 between arbitrary two processors. Hence,
T(M,K) =
T(M/2,K/2) + t(M/2) + t(M/2) (for K>2) (1)
T(M,K) =
0 if K = 1
The
recursion is solved in a closed form as below.
T(M,K) = 2{t(M/21)+t(M/22)
+ É + t(M/2k) } (2)
The
point-to-point message passing time of t(y) can be modeled as t(y) =
S + y/B where S is a setup time and B is the bandwidth of the
network. By substituting the equation into (2), the broadcast time becomes
T(M,K) = 2SlogP + 2(M/B)(1 Ð
1/K) (3)
If
K >> 1 and the setup time S is not large compared to M/B,
the first term can be ignored and the broadcast time is T(M,K)
= 2(M/B), which is independent of K.
In
conclusion, the new broadcast algorithm will be much faster if there are many
participating processors for the broadcast and the message is large.
VLAN-based Routing: Multi-path L2 Ethernet Network for HPC
Clusters
Tomohiro Kudoh,
Hiroshi Tezuka, Motohiko
Matsuda
Yuetsu Kodama,
Osamu Tatebe, Satoshi Sekiguchi
National Institute of of
Advanced Industrial Science and
Technology
We propose and evaluate a VLAN-based routing method for
intra-cluster networks, which realizes a large bi-section bandwidth using small
L2 switches. Gigabit Ethernet is commonly used for intra-cluster networks
because of its low cost and tolerable performance. However, costs of large
non-blocking switches are high. Therefore, to realize a low-cost cluster,
low-cost small L2 switches should be used. On the other hand, intra-cluster
networks for high performance computing cluster require a large bi-section
bandwidth. Since the topology of L2 Ethernet network can not include any loop,
a large bi-section bandwidth can not be realized using small L2 switches when the
number of computing nodes is large. To solve this problem, we map multiple VLANs on switches and network interfaces of a network.
While no loops can be included in a VLAN, loops can be included in the physical
topology by using multiple VLANs on a single physical
network, and a large bi-section bandwidth can be realized. In this
presentation, we will show some sample topologies which uses VLAN-based
routing, and show performance of NPB executed on a cluster which uses
VLAN-based routing.
Acknowledgement: This work has been supported by an NEDO
Grant-in-Aid for private sector fundamental technology ``Research on
Large-Scale and Reliable Servers,'' and Cisco Systems University Research
Program.
A model for Resource-Aware Load
Balancing on Heterogeneous Clusters
J. Faik,
J. E. Flaherty, L. G. Gervasio
Department of Computer Science
Rensselaer Polytechnic Institute
J. D. Teresco
Department of Computer Science
K. D. Devine, E. G. Boman
Sandia
National Laboratories
Albuquerque,
NM 87185-1111
We address the problem of partitioning and dynamic load balancing
on clusters with heterogeneous hardware resources.We
propose DRUM, a model that encapsulates hardware resources and their
interconnection topology. DRUM provides monitoring facilities for dynamic
evaluation of communication, memory and processing capabilities. Heterogeneity
is quantified by merging the information from the monitors to produce a scalar
number called "power". This power allows DRUM to be used easily by
existing load-balancing procedures such as those in the Zoltan
toolkit while placing minimal burden on the application programmers. We demonstrate the use of DRUM to guide
load balancing in the adaptive solution of a
[1] This research was supported by KOSEF grant
(R01-2001-0341-0) and KRF grant(2003-041-D0049.
Master Slave
Scheduling on Heterogeneous Star-shaped Platforms with Limited Memory
Arnaud Legrand, Olivier Beaumont*,
Loris Marchal* ,Yves Robert*
LaBRI, UMR CNRS 5800, Bordeaux, France
*LIP, UMR CNRS-INRIA
5668, ENS
In this
work, we consider the problem of allocating and scheduling a collection of
independent, equal-sized tasks on heterogeneous star-shaped platforms. We also
address the same problem for divisible tasks. For both cases, we take memory
constraints into account. We prove strong NP-completeness results for different
objective functions, namely makespan minimization and
throughput maximization, on simple star-shaped platforms. We propose an
approximation algorithm based on the unconstrained version (with unlimited
memory) of the problem. We introduce several heuristics, which are evaluated
and compared through extensive simulations. An unexpected conclusion drawn from
these experiments is that classical scheduling heuristics that try to greedily
minimize the completion time of each task are outperformed by the simple
heuristic that consists in assigning the task to the available processor that
has the smallest communication time, regardless of computation power (hence a
"bandwidth-centric" distribution).
Reliability
Algorithms for Network Swapping Systems with
Page
Migration
Ben
Mitchell, Julian Rosse, Tia
Newhall
Network swapping systems allow individual cluster nodes
with over-committed memory to use the idle memory of remote nodes as their
backing store, and to swap pages over the network. Without reliability support
a single node crash can affect programs running on other nodes by losing their
remotely swapped page data. RAID-based [3, 1] reliability solutions promise the
best alternative in terms of flexibility and performance. However, two
important features of our network swapping system, Nswap
[2], make direct application of RAID-based schemes impossible. First, Nswap adapts to each node's local memory load, adjusting
the amount of RAM space it makes available for remote swapping, which results
in a variable capacity Òbacking storeÓ. Second, Nswap
supports migration of remotely swapped pages between cluster nodes, which
occurs when a node needs to reclaim some of its RAM from Nswap
to use for local processing. Page migration complicates reliability if, for
example, two pages in the same parity group end up on the same node. We present
novel reliability algorithms that solve these problems. Our Parity algorithm
uses dynamic parity group membership to match NswapÕs
dynamic nature. We show that our algorithms add minimal overhead to remote
swapping.
1.
Evangelos P. Markatos and George Dramitinos.
Implementation of a Reliable Remote Memory Pager. In USENIX 1996 Annual Technical
Conference, 1996.
2.
Tia Newhall, Sean
Finney, Kuzman Ganchev, and
Michael Spiegel. Nswap: A network swapping module for
linux clusters. In Euro-Par'03 International
Conference on Parallel and Distributed Computing.
3.
David A.
Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of
inexpensive disks (RAID). In ACM SIGMOD International Conference on
Management of Data, pages 109-116, 1988.
A Community
Faulted-crust Model Using PYRAMID on
Cluster Platforms
Jay Parker, Greg Lyzenga, Charles Norton, Edwin Tisdale, Andrea Donnellan
California
Institute of Technology, Jet Propulsion Laboratory,
Recent development has boosted the GeoFEST system for simulating the faulted crust from a
local desktop research application to a community model deployed on advanced
cluster platforms, including an Apple G5, Intel P4, SGI Altix
3000, and HP Itaniam 2 clusters. GeoFEST
uses unstructured tetrahedral meshes to follow details of stress evolution,
fault slip, and plastic/elastic processes in quake-prone inhomogeneous regions,
like
To remake GeoFEST
as a high-performance community code, essential new features are Web accessibility,
scalable performance on popular clusters, and parallel adaptive mesh refinement
(PAMR). While GeoFEST source is available for free
download, a web portal environment is also supported. Users can work entirely
within a web browser from problem definition to results animation, using tools
like a database of faults, meshing, GeoFEST, and
visualization.
For scalable deployment, GeoFEST now relies on the PYRAMID library. The direct
solver was rewritten as an iterative method, using PYRAMID's
support for partitioning. Analysis determined that scaling is most sensitive to
solver communication required at the domain boundaries. Direct pairwise exchange proved successful (linear), while a
binary tree method involving all domains was not. On current Intel clusters with Myrinet the application has insignificant communication
overhead for problems down to ~1000s of elements per processor. Over one
million elements run well on 64 processors.
Initial tests using PYRAMID
for the PAMR (essential for regional simulations) and a strain-energy metric
produce quality meshes.
Flexible and
Dynamic Control of Network QoS in
Grid environments: the QoSINUS approach
Pascale Vicat-Blanc Primet2,
Johan Montagnat1, Fabien Chanussot2
1 CNRS UMR5515, CREATIS, INSA
de Lyon, 69621 Villeurbanne, France
2 INRIA-ENS LIP, RESO, 46,
allŽe d'Italie, 69 007 Lyon, France
Grids
rely on a complex interconnection of IP domains that may exhibit changing
performance characteristics and may offer different quality of service (QoS) facilities. We examine the case of a biomedical
application distributed over a grid and show how it may suffer from
uncontrolled communication performance. Then we present the QoSINUS
service that dynamically allocates the network resources to Grid flows in order
to match their specific QoS requirements under
different load conditions. The aim of this approach is to optimize the end to
end performances the heterogeneous mix of grid flows gets from the network to
enhance the individual application's performance as the overall grid
infrastructure performance and utilization level. The QoSINUS
service is based on the programmable network approach that offers flexibility, evolutivity and enables dynamic adaptation to network load
variations. Finally results of QoSINUS experiments
conducted in the context of the eToile french grid testbed based on the
high speed and DiffServ capable research network infrastructure, VTHD, are presented.
Improving the Performance of
Communication-Intensive Parallel Applications Executing on Clusters
Xiao Qin, Hong Jiang
Department of Computer Science and
Engineering
University of Nebraska-Lincoln
Clusters have emerged as a primary and
cost-effective infrastructure for parallel applications, including
communication-intensive applications that transfer a large amount of data among
nodes of a cluster via the interconnection network. Conventional load balancers
have been proven effective in increasing the utilization of CPU, memory, and
disk I/O resources in a cluster. However, most of the existing load balancing schemes ignore network resources, leaving
open the opportunity for significant performance bottleneck to form for
communication-intensive parallel applications due to unevenly distributed
communication load. To remedy this problem, we propose a communication-aware
load balancing technique that is capable of improving the performance of
communication-intensive applications by increasing the effective utilization of
network resources in clusters. To facilitate the proposed load-balancing
scheme, we introduce a behavior model for parallel applications with large
requirements of CPU, memory, network, and disk I/O resources. The proposed
load-balancing scheme can make full use of this model to quickly and accurately
determine the load induced by a variety of parallel applications. Simulation
results on executing a diverse set of both synthetic bulk synchronous and real
parallel applications on a cluster show that the proposed scheme can
significantly improve the performance both in slowdown and turn-around time
over three existing schemes by up to 206% (with an average of 74%) and 235%
(with an average of 82%), respectively.
Dynamic Page
Migration in Software DSM Systems
Thomas Repantis1, Christos D. Antonopoulos2,
Vana Kalogeraki1, Theodore S.
Papatheodorou2
1Department
of Computer Science & Engineering,
2Computer
Engineering & Informatics Department,
Dynamic page migration, when employed in
Distributed Shared Memory (DSM) systems offers several advantages: (i) reduces the latency of memory accesses, (ii) improves
resource utilization by considering the computational and communicational needs
of the applications and adapting to the changing resource availability, and
(iii) achieves the above with lower overhead than traditional approaches that
rely on thread migration.
We propose a simple and efficient page
migration mechanism [1], that dynamically allocates
shared memory pages to home nodes. Each page has a designated home node and
nodes that heavily modify the pages can become their new homes. In our
protocol, to avoid redundant page transfers, we perform migration only when the
number of modifications of a page becomes larger than a threshold. The
migration information is piggybacked on the existing synchronization messages
to minimize the communication overhead. The migration decision is taken
locally, at the home of each page. We have implemented our mechanism in the
JIAJIA software DSM [2]. Performance evaluation using real application
benchmarks shows that our mechanism significantly reduces remote page
modifications, improves memory access latencies, and achieves better performance
than its competitors. We observe that the cost of executing the algorithm and
of migrating the pages is amortized by the benefits
gained.
1.
Thomas Repantis. Implementation of
Page Forwarding on Clusters. Diploma thesis,
2.
Weisong Shi. Improving the Performance of Software DSM Systems.
PhD thesis,
A Shared Virtual
Memory Network with Fast Remote Direct Memory Access and Message Passing
Gang Shi, Mingchang Hu, Hongda Yin, Weiwu Hu, Zhimin
Tang
Institute of Computing Technology,
The
communication overhead has become one of the bottlenecks of SVM (shared virtual
memory). Many methods have been taken to improve the performance of SVM.
However, these canÕt obtain the improvement as expected. In order to get
farther utility of communication hardware and reduce unnecessary overhead, a
prototype with the ability of RDMA is designed and implemented in this paper,
which is named FRAMP (virtual memory based Fast Remote direct memory Access and
Message Passing network). FRAMP includes the crossbar-based switch, the custom
host network interface and the user-level communication protocol. All of these
are tightly coupled and deliberately balanced. FRAMP achieves 3.7 s one-way latency and 6.0 s RDMA read latency on system
driver level. FRAMP gets 5.6 s one-way latency and 2.0 s ping-ping latency and
125MB/s asymptotic bandwidth on user API level with multi-thread programming
method. Remote memory read for 8 bytes and a page of 4096 bytes only takes 8.0
s and 39 s respectively on user level. The obtained bandwidth is close to the
hard-ware limit of our experimental environment, which is based on 33MHz 32-bit
PCI bus, and the use rate of PCI bus is 94%. The SVM performance on FRAMP
network with pure message passing is very good, but the one using RDMA read to
fetch fault pages is not so good.
The
Robert P. Weaver1 and
Michael L. Gittings1,2
1Applied
Physics (X) Division,
2Science Applications
International Corporation
The Los Alamos Crestone Project is part of
the Department of EnergyÕs (DOE) Accelerated Strategic Computing Initiative, or
ASCI Program. The main goal of this software development project is to
investigate the use of continuous adaptive mesh refinement (CAMR) techniques
for application to problems of interest to the Laboratory. There are many code development efforts
in the Crestone Project, both unclassified and
classified codes. An overview of the Crestone
Project, and the SAGE and RAGE codes, has been published recently in [1]. In
this paper, I will give the status of the use of these CAMR codes on commodity
cluster machines. A general description of the RAGE code has been published in
[2], [3], [4] and [5].
One
of the most economical methods for achieving supercomputing capability is to use
commodity processors connected by commodity interconnects. This was highlighted
recently at Virginia Tech when Dr. Varadarajan built
the third fastest supercomputer in the world by connecting 1100 dual-processor
Macintosh G5 machines together (see http://www.top500.org). Most commodity
clusters use a form of LINUX as the operating system.
We
will give an overview of the current status of using the Crestone
Project codes SAGE and RAGE on commodity cluster machines. These codes are
intended for general applications without tuning of algorithms or
parameters. We have run a wide
variety of physical applications from millimeter-scale laboratory laser
experiments, to the multikilometer-scale asteroid
impacts into the
Acknowledgement. This work was supported by the
1.
Massively Parallel Simulations with DOE's
ASCI Supercomputers: An Overview of the
2.
The SAGE Code Multimaterial Equation of
State Methods, M.L. Gittings in Numerical Methods
Symposium, 28-30 April 1992; "The RAGE Code", R.N. Byrne, T. Betlach, and M.L. Gittings in
Numerical Methods Symposium, 28-30 April 1992. Copies may be ordered from the
Defense Nuclear Agency (currently the Defense Threat Reduction Agency),
3.
2D and 3D simulations of RM instability growth with RAGE: a
continuous adaptive mesh refinement code, R.Weaver,
M.L. Gittings, R.M. Baltrusaitis,
Q. Zhang and S.Sohn, July 1997, proceedings of the
21st International Symposium on Shock Waves (ISSW21),
4.
ÒThe parallel implementation of RAGE: a 3D continuous adaptive
mesh refinement radiation-hydrodynamics codeÓ R. Weaver, M.L. Gittings, M.L. Clover, July 1999, proceedings of the 22st International Symposium on Shock Waves (ISSW22),
London, paper 3560.
5.
ÒThe Simulation of Shock-Generated InstabilitiesÓ R.M. Baltrusaitis, M.L. Gittings, R.
Weaver, R. Benjamin and J. Budzinski 1996, Physics of
Fluids, 8 (9), p. 2471 (also as LA-UR-95-4000).