#LyX 1.3 created this file. For more info see http://www.lyx.org/ \lyxformat 221 \textclass latex8 \begin_preamble \date{} \end_preamble \language english \inputencoding auto \fontscheme times \graphics default \paperfontsize 10 \spacing single \papersize letterpaper \paperpackage a4 \use_geometry 1 \use_amsmath 0 \use_natbib 0 \use_numerical_citations 0 \paperorientation portrait \leftmargin 1in \topmargin 1in \rightmargin 1in \bottommargin 1.125in \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 2 \papersides 1 \paperpagestyle default \layout Title \paragraph_spacing other 0.70 \size larger Sparse Flat Neighborhood Networks (SFNNs): \newline Scalable Guaranteed Pairwise Bandwidth & Unit Latency \layout Author \paragraph_spacing other 0.70 \size normal Timothy I. Mattox, Henry G. Dietz, & William R. Dieter \newline ECE Department, University of Kentucky \newline 453 F. Paul Anderson Tower, Lexington, KY 40506-0046 \newline tmattox@engr.uky.edu, hankd@engr.uky.edu, dieter@engr.uky.edu \layout Abstract Network performance for a particular application is determined by the latency and bisection bandwidth that are achieved for the set of specific communication patterns used by that application. The number of nodes with which each node might potentially communicate grows linearly as nodes are added, thus, network cost for large systems either becomes a large fraction of machine cost or performance suffers. However, performance-critical communication patterns commonly occurring in real parallel programs rarely require that each node directly communicate with every other node. \layout Abstract The number of node pairs actually communicating generally grows far slower than the expected O( \begin_inset Formula $N^{2}$ \end_inset ). Thus, a carefully designed network for a massively parallel system can use relatively narrow switches while still providing single-switch latency and guaranteed pairwise bandwidth for performance-critical communications. This paper introduces Sparse Flat Neighborhood Networks (SFNNs), a variant of Flat Neighborhood Networks (FNNs) which are engineered from first principles to efficiently meet these detailed pairwise communication performance criteria. \layout Section Introduction \layout Standard Parallel supercomputers, clusters, and server farms rely heavily on performance of a system area network to coordinate the actions of individual nodes. High performance computing systems with tens to tens of thousands of Processing Elements (PEs) have been built for many decades using a variety of network designs, and a vastly larger set of network designs can be found in the literature. Most research is focused on networks with endearing mathematical properties or simple routing schemes. In this paper, we suggest that network form should follow function: the best network is the design which, of all feasible networks, yields performance characteristics best matching the latency and bandwidth needs of your code while simultaneously satisfying the relevant cost, scalability, and reliability constraints. The problem in taking this seemingly obvious approach is that the design space is surprisingly large and complex. \layout Standard The work presented in this paper does not consider the complete design space, but does search the relatively large and important space including all designs, symmetric or asymmetric, that could satisfy the constraints with single-switch latency. Even though only symmetric designs are commonly considered, for applications with complex communication patterns, the best topology is likely to be asymmetric. There are two reasons: \layout List \labelwidthstring 2. \size small 1. Even if only symmetric communication patterns are important to an application, the union of multiple symmetric patterns is generally asymmetric; thus, asymmetric communication patterns naturally occur in real scientific and engineering applications \layout List \labelwidthstring 2. \size small 2. The design space for asymmetric designs is far larger than the space for symmetric designs, with the corresponding result that the best asymmetric designs often are better than the best symmetric ones \layout Standard The problem is simply that asymmetry is difficult for humans to manage and there were neither automated design tools nor the technology with which to efficiently implement them. The insights presented here, combined with genetic algorithm (GA) search technology, make Sparse Flat Neighborhood Networks (SFNNs) an extremely cost-effective and highly scalable way to achieve specific performance goals for user-specified communication patterns. \layout Standard Section\SpecialChar ~ \begin_inset LatexCommand \ref{sec:What-Is-An} \end_inset defines and describes the properties of SFNNs, including several examples. The actual design process, as implemented by our tools, is described in Section\SpecialChar ~ \begin_inset LatexCommand \ref{sec:SFNN-Design} \end_inset . Although SFNNs can use ordinary routing procedures implemented within standard network hardware, Section\SpecialChar ~ \begin_inset LatexCommand \ref{sec:SFNN-Runtime-support} \end_inset discusses the SFNN runtime support we developed to more efficiently utilize Ethernet-like networking. A brief conclusion appears in Section\SpecialChar ~ \begin_inset LatexCommand \ref{sec:Conclusion} \end_inset . \layout Section What Is An SFNN? \begin_inset LatexCommand \label{sec:What-Is-An} \end_inset \layout Standard Although we plan to develop tools that will handle optimized design of fully general network topologies, we initially focused on the construction of design tools for an important and immediately useful class of networks that we named \series bold Flat Neighborhood Networks (FNNs) \series default \begin_inset LatexCommand \cite{lcpc2000} \end_inset . The FNN concept won a number of awards, including a 2000 Gordon Bell award honorable mention for price/performance \begin_inset LatexCommand \cite{sc2000} \end_inset , \begin_inset Quotes eld \end_inset Most Innovative Architecture \begin_inset Quotes erd \end_inset in the HPC Games Challenge at SC2000, and a Computerworld Smithsonian award as one of the six information technologies most advancing science in 2001. The FNN concept is based on a re-evaluation of the basic principles of interconnection networks. The ideal switched network should provide: \layout Itemize \paragraph_spacing other 0.00 \size small Single hop latency for all pairwise communications. \layout Itemize \paragraph_spacing other 0.00 \size small Full link bandwidth between any node pair independent of any other node pairs that may be communicating at the same time. \layout Standard These properties are easily achieved using a network consisting of a single switch that has at least as many ports as there are nodes. However, in an FNN, the same properties can be extended to several times as many nodes as there are ports per switch. \layout Standard Because using paths connecting one switch to another would introduce both additional latency and potential bandwidth bottlenecks, we ignore such paths for the purpose of satisfying the design constraints. An FNN is modeled as a non-recirculating single-stage network. The trick is to use multiple \series bold Network Interfaces \series default ( \series bold NIs \series default , or \series bold NICs \series default for NI Cards) in each node; thus, each node can be connected to multiple switches. The design problem is therefore determining to which switches each node should be connected. \layout Standard The complexity of the FNN design problem is perhaps surprisingly very high. The problem is actually a minor variation on the well-known graph theory problem called \series bold (v,k,t)-covering design \series default . The number of nodes corresponds to \series bold v \series default , the number of ports per switch corresponds to \series bold k \series default , and pairwise grouping implies that \series bold t \series default would be equal to 2. The standard covering problem is still an open problem in mathematics. An excellent discussion of the standard covering problem, a summary of recent research on methods for constructing covering designs, and a database of the best known solutions and bounds on solutions are given at the La Jolla Covering Repository \begin_inset LatexCommand \cite{LaJolla} \end_inset . The FNN design problem differs from the standard covering problem primarily in that the number of network interfaces per node is constrained; there is no corresponding constraint on (v,k,t)-covering design. The FNN design problem also differs in that it incorporates optimization of a variety of more complex secondary characteristics, such as making the number of times each pair is covered (i.e., the conflict-free bandwidth between each pair) be tuned to approximate a specification of relative bandwidth requirements. \layout Standard As our FNN design tools have matured, we also have developed better ways to graphically represent both design constraints and their solutions. It is natural to think of both design constraints and solutions in terms of a square connectivity matrix, with node sources listed down the left side and sinks listed across the top, that shows how many links worth of bandwidth are requested/dedicated to that directed pairwise communication. Although such a graph for a design specification can be completely asymmetric, because all commonly used network hardware employs bidirectional cabling, no directly useful information is lost if the matrix is folded along the diagonal; the bandwidth requested for a->b is made equal to that requested for b->a by giving both the maximum value of either. Taking advantage of this property, we can represent the design requirements and solution in a single square matrix: the lower left triangle defines the requirements while the upper right triangle shows the bandwidth delivered by the solution. This matrix is trivially shown in graphical form as a square image in which the color (gray shade) of each point corresponds to the number of links required or dedicated. \layout Standard The network in the world's first FNN-based supercomputer, \series bold KLAT2 (Kentucky Linux Athlon Testbed 2) \series default \begin_inset LatexCommand \cite{als2000} \end_inset , as built in Spring 2000, corresponds to the design/solution map shown in Figure \begin_inset LatexCommand \ref{cap:KLAT2's-FNN-Design/Solution} \end_inset . \begin_inset Float figure wide false collapsed false \layout Standard \align center \begin_inset Graphics filename klat2net.ps width 1.5in keepAspectRatio \end_inset \layout Caption KLAT2's FNN Design/Solution Map \begin_inset LatexCommand \label{cap:KLAT2's-FNN-Design/Solution} \end_inset \end_inset The white center line represents nodes talking to themselves, which neither requires nor uses network bandwidth. The lower left triangle specifies complete connectivity with a single unit bandwidth per pair and, additionally, two or more units bandwidth for the communication patterns shown in a somewhat darker gray. KLAT2's network actually delivers as much as four units of bandwidth per pair (i.e., a black pixel corresponds to four units of bandwidth), entirely covering the single-unit requirement region. Although KLAT2's design does not quite cover the two-unit requirement region with two or more units of bandwidth, it comes very close to covering it with an average of more than two units bandwidth per pair. This is because, at the time KLAT2 was designed, our FNN design tool favored a higher average over complete coverage of the two-or-more region. \layout Standard Several higher-level properties of KLAT2's network are easily visible in this graphic. One is the asymmetric nature of KLAT2's network design; the upper right triangle is a random-looking pattern of one to four units bandwidth per pair. Additionally, viewing KLAT2's network in this way it seems clear that the network is seriously over designed -- there are many low-importance pairs that are given high-bandwidth coverage. \layout Standard Suppose that we remove the constraint that all pairs must have at least one unit of reserved, single-hop latency, bandwidth. Our concern is thus shifted to finding a design which covers all node pairs that we expect will have significant communications between them. This is the basic concept of a \series bold Sparse Flat Neighborhood Network (SFNN) \series default . An SFNN is an FNN, except the FNN property is ensured not for all node pairs, but only for explicitly noted node pairs. For node pairs not explicitly selected, communications would still be possible through intermediary nodes, or through additional switch-to-switch connections. \layout Standard The first SFNN-based supercomputer is \series bold KASY0 (Kentucky ASYmmetric zero) \series default , a 128-node cluster which was built in the Summer of 2003. The design specification and solution for KASY0's SFNN are shown in Figure \begin_inset LatexCommand \ref{cap:KASY0's-SFNN-Design/Solution} \end_inset . \begin_inset Float figure placement th wide false collapsed false \layout Standard \align center \begin_inset Graphics filename kasy0net.ps width 2.75in keepAspectRatio \end_inset \layout Caption KASY0's SFNN Design/Solution Map \begin_inset LatexCommand \label{cap:KASY0's-SFNN-Design/Solution} \end_inset \end_inset \layout Standard Although KASY0's network also is over designed, it is far more effective than KLAT2's network in placing bandwidth where it will be used. The "sparseness" of the design specification and of the solution do not mean that communications between node pairs represented by white pixels are impossible, rather, it implies that those pairs do not communicate enough to justify reserving bandwidth and ensuring single-hop latency. Thus, in the rare event that communications between such pairs do occur, latency may be somewhat more than a single hop and the available bandwidth may be dependent on the other traffic in the network at that time. \layout Standard The result of this new design technology is that KASY0's 128-node SFNN cost only $4,200 -- about half as much as KLAT2's 64-node FNN. One might be tempted to think that the price difference is due to network hardware price drops in the three-year period between construction of KLAT2 and KASY0, but the primary difference actually is the use of 3 NICs per node and 24-port switches in KASY0 instead of 4 NICs per node and more expensive 32-port switches as used in KLAT2. \layout Standard Clearly, the concept of covering all desired communication patterns with single-hop latency reserved bandwidth has limited scalability. We are currently working on design tools that can maximize coverage without requiring that complete coverage of a design specification be achieved, and these new \series bold Fractional Flat Neighborhood Network (FFNN) \series default design tools will remove all scaling issues and even can be used to optimize the design of more conventional networks, such as trees and fat trees. Of course, an FFNN does not have as desirable properties as an SFNN. Thus, the question we consider here is: how large a system can SFNN technology scale to? \layout Subsection Scalability \layout Standard Although FNNs have proven very useful for moderate-sized systems, such as KLAT2, the most important property of SFNN technology is the fact that the design methodology allows scaling to very large systems. Using commodity networking components, we have created SFNN solutions for design problems with as many as 10,000 nodes. However, the primary reason that SFNNs scale is not really a property of SFNNs, but of communication patterns. A communication pattern defines a set of communicating pairs, most often, a \emph on 1:1 \emph default and \emph on onto \emph default mapping ( \emph on permutation \emph default ) of nodes onto nodes. \layout Standard It is a simple fact of arithmetic that the number of other nodes each node might need to talk to grows linearly with the number of nodes. More precisely, in a system with \emph on N \emph default nodes, the number of possible communication pairs involving each node is \begin_inset Formula 2(N-1) \end_inset for unidirectional (ordered) pairs or \emph on N-1 \emph default for bidirectional (unordered) pairs. This also corresponds to the well-known fact that a direct-connection (switch-l ess, single-link latency) network would require \begin_inset Formula N(N-1) \end_inset unidirectional wires or \begin_inset Formula (N(N-1))/2 \end_inset bidirectional links. Thus, network complexity seems to scale as \emph on O( \begin_inset Formula $N^{2}$ \end_inset ) \emph default . However, as we shall demonstrate in this paper by reviewing the various communication patterns commonly discussed in the parallel processing literature , most parallel programs require high performance on only a small fraction of the possible pairs for each node. As discussed in the following sections, the fraction actually used in typical applications sharply decreases as the number of nodes is increased! \layout Standard \series bold O(1) Scaling Patterns. \series default It is ironic that, despite the parallel processing community's concerns about the complexity of large scale networks, the most commonly used communicat ion patterns in parallel programs are all patterns that have a constant number of pairs per node independent of the number of nodes. This observation is confirmed by the fact that many of the largest systems built, such as the Intel Paragon, Cray T3D, and ASCI RED, have successfully used networks with a simple 3D mesh topology. \layout Standard The number of bidirectional communication pairs in which each node is involved when communicating with adjacent nodes within a 1D mesh is either one or two. For a toroidal 1D mesh (i.e., a ring), each node participates in two bidirectiona l communication pairs; a non-toroidal 1D mesh differs only in that the two end nodes participate in just one pair each. The per-node pair count is entirely unaffected by the total number of nodes. \layout Standard The design space becomes significantly larger when 2D meshes are considered, because there may be multiple ways to factor the nodes into a 2D mesh. For example, a 32-node system could be viewed as 2x16, 4x8, 8x4, or 16x2. However, once a factorization is selected, the pair count per node is independe nt of the total number of nodes. Communicating with nodes that are adjacent by row or column yields between two and four pairs per node, with edge nodes in non-toroidal meshes having the lower pair counts. Including diagonally adjacent nodes simply changes the count to three to eight pairs, again with the lower values corresponding to edge nodes in non-toroidal meshes. Eight pairs may be a large fraction of all possible pairs in a small cluster, but it becomes a vanishingly small fraction of all possible pairs as the system design is scaled to thousands of nodes. The same is true of higher-dimensionality mesh adjacency pairs. \layout Standard The conceptually most complex communication patterns commonly used in parallel programs are typically patterns consisting of a single pair per node. For example, the \emph on bit-reversal \emph default communication pattern \begin_inset LatexCommand \cite{gupta94implementing} \end_inset may employ a reasonably complex formula to determine which node each node will communicate with, but there is only a single pair involving each node. The same is true of \emph on shuffle \emph default \begin_inset LatexCommand \cite{kim95shortest} \end_inset ; it is also significant that, given bidirectional links, \emph on inverse-shuffle \emph default is implemented by the exact same pairing that \emph on shuffle \emph default uses. Notice further that all these patterns are permutations, so, with an appropriat e network, each pattern can be implemented in a single message time-step. \layout Standard \series bold O(LogN) Scaling Patterns. \series default In addition to \emph on O(1) \emph default scaling patterns, nearly all programs contain some communication patterns that scale as \emph on O(LogN) \emph default . Fundamentally, \emph on O(LogN) \emph default scaling patterns are most often an artifact of using a network that is incapable of performing computation. For example, \emph on collective communications \emph default including \emph on reductions \emph default , \emph on parallel-prefix scans \emph default , \emph on broadcast \emph default / \emph on multicast \emph default , and \emph on barrier synchronization \emph default are really operations sampling the global state of the parallel system, and Aggregate Function Network (AFN) hardware implements them directly within the network \begin_inset LatexCommand \cite{spdp96} \end_inset , but efficient message-passing implementations typically involve a sequence of tree-structured communications. \layout Standard Binary tree-structured communications follow adjacency in the familiar \emph on hypercube \emph default topology. Thus, each node communicates with the nodes whose numbers differ from the source node number by only a single bit position's value in the binary representation. For example, in a 32-node system, node 5 (binary 00101) would be paired with 1 (00001), 4 (00100), 7 (00111), 13 (01101), and 21 (10101). Clearly, there are no more than \emph on ceiling(Log \begin_inset Formula $_{\textrm{2}}$ \end_inset N) \emph default bit positions, so there are at most that many nodes differing from any given node's number by precisely one bit position, and the number of pairs per node grows as \emph on O(LogN) \emph default . \layout Standard It is important to note that many message-passing systems differentiate between what MPI \begin_inset LatexCommand \cite{mpi-standard} \end_inset calls \emph on all-reduce \emph default and \emph on reduce \emph default . An \emph on all-reduce \emph default would suggest that all nodes should have their complete tree, whereas a \emph on reduce \emph default requires only the tree rooted at a specific point (typically, node 0). Thus, \emph on reduce \emph default can be implemented using an average of half as many pairs per node, although the root node will still require the full set of pairs. If an \emph on all-reduce \emph default is implemented using a reduce followed by a \emph on broadcast \emph default from the root node, rather than by directly performing N \emph on reduces \emph default simultaneously, then the complete tree of pairs is only needed for the root node. This is significant in that \emph on reduce \emph default is far more common than \emph on all-reduce \emph default in parallel programs. \layout Standard \series bold O(N \begin_inset Formula $^{\textrm{1/}D}$ \end_inset ) Scaling Patterns. \series default D-dimensional \emph on scatter \emph default , \emph on gather \emph default , and \emph on personalized all-to-all \emph default communications involve each node interacting with every other node in its dimension. In a 2D space, each node would need to be paired with every node in the same row or column, yielding \emph on O( \begin_inset Formula $sqrt(N)$ \end_inset ) \emph default pairs per node. The 3D case scales pairs per node by the cube root of N. The 1D case is the worst; all nodes are in the same dimension. \layout Standard Superficially, it seems that all N-1 pairs are needed for each node in order to support 1D \emph on personalized all-to-all \emph default . However, such a communication pattern is not able to be accomplished in a single time-step unless N-1 messages can be simultaneously output by each node. With fewer than N-1 NICs, this is literally impossible. Further, the overhead associated with sending a message is significant; thus, unless messages are quite long, there is a significant penalty in sending N-1 messages rather than sending fewer, larger, messages that would then be repackaged and retransmitted until each node had seen the data destined for it. The result is that \emph on personalized all-to-all \emph default is nearly always best implemented as a compound communication, often following a broadcast-like tree pattern. Thus, it does not make sense to specify a design constraint for an abstract operation like \emph on personalized all-to-all \emph default , but rather to specify the design constraint that corresponds to the most efficient implementation that could be used by the specific MPI library that will be used by applications. The pairs in such a pattern can be determined by examining the MPI library documentation or source code, or by accumulating statistics on pairs communicat ing in test runs using your particular MPI. \layout Standard The result is that these compound patterns are usually able to be efficiently implemented using primitive patterns that scale as \emph on O(1) \emph default or \emph on O(LogN) \emph default . Thus, the number of pairs per node scales approximately as \emph on O(LogN) \emph default , not as \emph on O(N) \emph default . \layout Standard \series bold Pair Synergy. \series default A set of parallel programs will generally require not just one communication pattern, but the union of all communication patterns used in any of the programs. Thus, it is natural to expect that the number of pairs for each node is actually the sum of the number of pairs for that node in each communication pattern. Of course, the sum of a fixed number of \emph on O(1) \emph default and \emph on O(LogN) \emph default values grows no faster than \emph on O(LogN) \emph default . However, the sum could still become too large for commodity switches. \layout Standard In practice, the sum of pairs is a rarely-seen upper bound. A pair required by one pattern very often also is required by another pattern. \layout Standard This type of synergy is very common among various mesh patterns and even reductions. For example, nearly all pairs required to support 1D adjacency also are required to support 2D adjacency; only 1D pairs involving nodes in edge positions in the 2D pattern are not covered by the 2D pattern. Similarly, \emph on hypercube \emph default adjacency has many pairs that overlap those of meshes. The result is a significant reduction in the total number of pairs required for the union of multiple communication patterns, although the precise amount of reduction is highly dependent on the set of patterns specified. \layout Standard SFNNs are not the only type of network that can take advantage of this insight. For example, direct connection networks also can be made much simpler. However, the relatively high degree of switches (e.g., 24, 32, 48, 64, or even more ports) makes it feasible to cover design requirements for systems that are much larger than can be accommodated using the relatively low degree (e.g., 5 or 6 NICs) afforded by direct connections between nodes. \layout Standard \begin_inset Float figure placement t wide false collapsed false \layout Standard \align center \begin_inset Graphics filename 1k_all.ps width 3in keepAspectRatio \end_inset \layout Caption A 1,024-node Design/Solution Map \begin_inset LatexCommand \label{cap:A-1,024-node-Design/Solution} \end_inset \end_inset \layout Standard \series bold A Larger Example. \series default The result of all the above is that SFNN networks for very large systems, covering a wide range of communication patterns, can be built using commodity network hardware. The largest such designs we have thus far created contain over 10,000 nodes. The primary limitation in making larger designs appears to be the inefficiency of the design tool itself in dealing with very large designs. The concepts appear to allow much larger systems to be built and improvements to our design tool already have been responsible for the jump from a maximum design size of about 1,000 nodes to over 10,000 using the same network hardware components. It is impractical to use a 10,000x10,000 design map scaled to fit within the margins of this paper, so our large design example is a 1,024-node design, shown in Figure \begin_inset LatexCommand \ref{cap:A-1,024-node-Design/Solution} \end_inset . This design uses 48-port switches to cover all of the following: \layout Itemize \paragraph_spacing other 0.00 \size small 1D torus power of 2 offsets \layout Itemize \paragraph_spacing other 0.00 \size small 2D torus power of 2 offsets in 512x2, 256x4, 128x8, 64x16, and 32x32 topologies \layout Itemize \paragraph_spacing other 0.00 \size small 3D torus power of 2 offsets in 256x2x2, 128x4x2, 64x8x2, 64x4x4, 32x16x2, 32x8x4, 16x16x4, and 16x8x8 topologies \layout Itemize \paragraph_spacing other 0.00 \size small Shuffle & inverse-shuffle \layout Itemize \paragraph_spacing other 0.00 \size small Bit-reversal \layout Itemize \paragraph_spacing other 0.00 \size small Hypercube \layout Itemize \paragraph_spacing other 0.00 \size small 2D matrix transpose (32x32, one element per node) \layout Standard The total number of possible pairs is 523,776. The union of all the specified patterns requests coverage of just 14,566 pairs; approximately 2.78% of all possible pairs! The actual design covers 102,883 pairs, or 19.6%, including all 14,566 requested pairs. \layout Standard This system design requires 4,848 NICs (assuming no NIs are built-into the motherboards) and 101 switches. The total street cost for the network would thus be less than $100,000 using name-brand 100 Mb/s Fast Ethernet hardware (~$500 per switch, ~$10 per NI+cable). Latency would be around 30 microseconds (i.e., a single Fast Ethernet switch delay) for any communication pattern that is covered. Bisection bandwidth measured on any permutation pattern covered would be greater than 102.4 Gb/s, with a total network bandwidth of no more than 484.8 Gb/s. If these performance numbers are not sufficient, \shape italic any \shape default networking technology could be substituted for Fast Ethernet -- Gb/s Ethernet, Infiniband, etc. -- in which case a new design optimized for the constraints imposed by the networking technology should be created. Note that SFNN designs primarily use multiple NIs per node for connectivity reasons, and thus only require a node's motherboard to support full-bandwidth on a single NI at any one time. \layout Standard Two properties are immediately visible from this moderately large example. As predicted by the above discussion, the design constraints are indeed very sparse despite covering a rich set of communication patterns. The second property immediately visible is that the design solution, though quite sparse, is significantly less sparse than the design constraints. It is not clear why the solution has a higher density: is the higher density necessary or is it the result of the design tool settling for a less-than-minim al solution? The answer seems to be a mix of both. As our SFNN design tool has improved, the density of the solutions found has significantly decreased (and the design sizes the tool can handle with similar network components have correspondingly increased). However, because an SFNN must completely cover the design constraint pairs, it is highly likely that some (small) fraction of the pairs only can be covered with relatively loose-fitting solutions. Our ongoing research into FFNNs already has shown that very small switches can be used to cover the vast majority of design constraint pairs without including a significant number of pairs outside the design constraints. \layout Subsection Asymmetry \layout Standard It is significant that the vast majority of SFNN design solutions, including all three shown above, are asymmetric. In fact, by enumeration of all possible designs for very small design problems, it easily can be shown that the best asymmetric design is usually better than the best symmetric design for most sets of design constraints. Symmetric design constraints are usually more efficiently covered with somewhat asymmetric solutions. Just as the union of symmetric patterns generally is asymmetric, a cover constructed symmetrically using switches of a fixed width is unlikely to have precisely the same symmetry as the design constraints; thus, a symmetric cover would needlessly include additional pairs. Interestingly, many designers confuse symmetry with modularity; asymmetric designs easily can be constrained to have a specified modular structure without requiring that the complete design be symmetric. In summary, asymmetric design may require powerful design tools, but the benefits can be huge. \layout Subsection Multiprocessor Nodes \layout Standard The above discussion assumes a single processor per node. However, assuming multiple processors per node simply makes the communication pair requirements for each node be computed using the sum of the processor communication requirements for all processors within that node. In itself, this tends to yield somewhat denser specifications, making the benefit in using multiprocessor nodes unclear: contrary to popular belief, multiprocessor nodes do not necessarily reduce network cost per processor. Because interprocessor communication within a node does not require use of an external network, it would be possible for an SFNN design tool to incorporate renumbering of processors within nodes as a mechanism to help minimize the external network hardware required to cover a design specified by processor pairs. Rather than numbering processors within nodes sequentially, a future SFNN design tool for systems with multiprocessor nodes might number processors in a random-looking order that minimizes the external network hardware needed to cover all specified pairs. \layout Section SFNN Design \begin_inset LatexCommand \label{sec:SFNN-Design} \end_inset \layout Standard A primary motivator for the development of SFNNs was the realization that Universal FNNs supported many more communication patterns than were needed. By designing for the known needed communication patterns, the network could be much less expensive to construct, yet still yield comparable performance for a given set of applications. To design SFNNs, it is necessary to specify the set of communication patterns that we wish the SFNN to support with guaranteed pairwise bandwidth and unit latency. This set can be represented by an NxN weighted connectivity matrix for N PEs, with the value at matrix element (x,y) being the relative importance of the communications between PE x with PE y. If the weights are restricted to zero and one, the matrix is in the same form as an adjacency matrix; two PEs are considered adjacent if they are connected to a common switch. There are three approaches for determining these connectivity matrices. \layout Subsection Specification Of Performance Constraints \layout Standard One can search research publications to identify the complete set of communicati on patterns that are discussed. For example, some frequency-domain transformation algorithms (such as various codings of FFT) communicate using a bit-reversal pattern in which PE x communicates with PE y where y's binary value is equal to the bits of the binary value of x listed in reverse order. Another alternative is to construct the matrix by directly examining the source code for the applications that will be run on the machine or consulting the code's author or documentation. Care must be taken to distinguish between how the author or application code views a communication and how it really is implemented; libraries like MPICH or LAM-MPI do not always implement communications in the way that one might expect. A third approach, which we are developing primarily for FFNN design, involves automated determination of communication patterns using instrumented runs of target applications. \layout Standard The current SFNN design software allows the communication matrix to be specified as the union of any of the patterns we found to be common in a literature search. Each communicating pair yields a 1 entry in the matrix, every other pair yields a 0. The patterns available include: \layout Itemize \paragraph_spacing other 0.00 \size small Hypercube, single bit difference in PE ID number (N must be a power of 2) \layout Itemize \paragraph_spacing other 0.00 \size small Bit-Reversal of the PE ID number (N must be a power of 2) \layout Itemize \paragraph_spacing other 0.00 \size small Perfect-Shuffle (N must be even) \layout Itemize \paragraph_spacing other 0.00 \size small 2D Matrix Transpose of a single element per PE (N must be a square) \layout Itemize \paragraph_spacing other 0.00 \size small 1D, 2D, 3D Grids or Tori with various sub-patterns independently selectable: \begin_deeper \layout Itemize \paragraph_spacing other 0.00 \size small Distance 1 offsets in X, Y, or Z \layout Itemize \paragraph_spacing other 0.00 \size small Distance 1 diagonals in 2D and 3D \layout Itemize \paragraph_spacing other 0.00 \size small Power of 2 offsets in X, Y, or Z \layout Itemize \paragraph_spacing other 0.00 \size small All PEs that differ in only one dimension (e.g. every PE in same row, column, etc.) \end_deeper \layout Standard For 2D and 3D grids/tori, one can select to use just one balanced factorization of N, or all unique factorizations. For example a 512 PE cluster can be specified as having any of the following 3D grids: 128x2x2, 64x4x2, 32x8x2, 32x4x4, 16x16x2, 16x8x4, or 8x8x8; the default would be 8x8x8. \layout Subsection The Design Algorithm \layout Standard Although SFNN design appears to be harder than the unsolved (v,k,t)-covering problem, good solutions can be created using a \series bold Genetic Algorithm (GA) \series default very similar to that we developed for FNN design \begin_inset LatexCommand \cite{lcpc2000} \end_inset . The genome representation used in both the FNN and SFNN GAs consists of a fixed-length bit string representing the set of PEs connected to each switch. The crossover and mutation operations are specially coded to ensure various invariant properties are met by all population members: the number of PEs, the maximum number of switches, the maximum number of NICs per PE, and the maximum number of connections per switch. This saves considerable time because the GA's metric evaluation is merely ranking feasible network designs, not also checking their feasibility. However, the GA would be slow to find FNNs, and even slower to find SFNNs, without a few extra tricks. The FNN design tool attempts to speed its search by first finding approximate solutions to a scaled-down problem and then scaling those solutions up to seed the full problem's initial population; the SFNN design tool also seeds its initial population with good designs, but by using a greedy \begin_inset Quotes eld \end_inset buddy connection \begin_inset Quotes erd \end_inset algorithm rather than scaling the problem. \layout Subsection Related Work In Network Design \layout Standard There is a huge body of literature on interconnection network designs for parallel supercomputers, but very little on the design process. \layout Standard Traditionally, the emphasis has been on selecting a \begin_inset Quotes eld \end_inset universal \begin_inset Quotes erd \end_inset topology with good mathematical properties and then mapping program communicati ons onto that network architecture. Various mesh and hypercube architectures have been very popular in more traditional supercomputers, while cluster network designs most often are either trees or fat trees \begin_inset LatexCommand \cite{Fat-trees1985} \end_inset , perhaps also employing trunking or channel bonding \begin_inset LatexCommand \cite{sterling95beowulf} \end_inset to enable use of wider, but lower-speed, switches. All of these network topologies share the property that they have straightforwa rd scaling procedures, use relatively low-degree nodes, and require multiple routing hops to accomplish most communication patterns. Mesh and hypercube designs offer very simple routing algorithms and relatively easy use of alternative paths for load balancing or fault recovery. Trees are fully compatible with commodity network hardware. Fat trees offer the ability to maintain bisection bandwidth as the design is scaled, but at the cost of requiring routers rather than \begin_inset Quotes eld \end_inset dumb \begin_inset Quotes erd \end_inset switches to load balance across alternative paths through the thick portion of the tree. None of these designs directly addresses issues of latency for user-specified communication patterns. \layout Standard The design process for Circulant graphs and other Caley graphs \begin_inset LatexCommand \cite{XuTopolgyBook} \end_inset is becoming a popular topic in interconnection research due to the low latency these graphs afford by using relatively high-degree nodes. However, this design process is again oblivious to performance requirements for user-specified communication patterns. The work presented in this paper is closely related to, and based on, our earlier work using genetic algorithms to design FNNs \begin_inset LatexCommand \cite{lcpc2000} \end_inset which may be either symmetric or asymmetric. At the same time we were building KLAT2 (Spring 2000), a group at Australia's National University by hand created a symmetric FNN for the Bunyip supercompute r \begin_inset LatexCommand \cite{bunyip2000} \end_inset . The work on filtering random graphs published in Fall 2002 by Lakamraju et al \begin_inset LatexCommand \cite{rand2002} \end_inset is closely related to the FNN approach, but use of a random search is far less efficient than the genetic algorithm approach we have taken, they restricted their consideration to regular graphs of high-degree nodes, and the metrics that they considered do not directly correspond to performance on a user-specified set of communication patterns. \layout Section SFNN Runtime support \begin_inset LatexCommand \label{sec:SFNN-Runtime-support} \end_inset \layout Standard FNN and SFNN routing is closely related to Linux Channel Bonding \begin_inset LatexCommand \cite{sterling95beowulf} \end_inset , however, it is complicated by the fact that the set of NIs used is a function of the destination of each message. Another important difference is that it is often desirable to use uplink connections between switches for an SFNN to route messages between node pairs not covered by the design, which prohibits the duplication of MAC addresses (as used in channel bonding). \layout Standard Our driver assigns each NI a unique local MAC address such that a single table lookup of a 32-bit entry determines all routes to be used between a particular pair of nodes. Our MAC-level mechanism can be used with both standard IP and technology-specif ic libraries; e.g., a GAMMA \begin_inset LatexCommand \cite{gamma} \end_inset interface could further reduce Gb/s Ethernet latency. Our reassignment of MAC addresses somewhat complicates the network boot process, which we have resolved by incorporating support for our drivers in \emph on Warewulf \begin_inset LatexCommand \cite{warewulf} \end_inset . \layout Section Conclusion \begin_inset LatexCommand \label{sec:Conclusion} \end_inset \layout Standard When we first demonstrated FNNs in KLAT2 in Spring 2000, the general reaction to the technology was strongly positive. Within a short time, the technology had won various awards and was being applied by both commercial \begin_inset LatexCommand \cite{LinuxLabs} \end_inset and academic \begin_inset LatexCommand \cite{kage} \end_inset users. However, we knew that universal FNNs were just the \begin_inset Quotes eld \end_inset low-hanging fruit \begin_inset Quotes erd \end_inset of a much larger crop: the ability to automatically engineer an interconnection network to meet arbitrary performance criteria. SFNNs are a giant leap along this path, not just because they scale to much larger designs and are even more cost effective than FNNs, but because they represent the first time high-performance network design has been directly driven by a purely quantitative specification. Perhaps even more significantly, our support software now makes the fact that an SFNN is being used entirely transparent; it can be built using standard network hardware and even message-passing library routines need not know an SFNN is being used. The SFNN concept, and the tools developed to exploit it, allow a supercomputer systems designer more flexibility and control over the cost/performance trade-offs of the network for a suite of target applications. \layout Standard \size small \begin_inset LatexCommand \BibTeX[plain]{sc2004tim} \end_inset \the_end