In this paper, we discuss issues related to the high-performance implementation of collective communication operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of the communication modes supported by MPI, we develop implementations that are superior to those currently provided by public domain implementations of MPI such as MPICH. Performance results from a large Intel Pentium 4 (R) processor cluster are included.