Although one-million gate Mask-Programmed Gate Arrays (MPGAs) are available today, the physical design of these devices is difficult due to their sheer size and the fact that the delay of the routing is a much larger portion of the critical path delay than has been the case in the past. A one-million gate FPGA will have these problems and more, due to the programmable nature of the interconnect.
As the quantity of silicon available for FPGAs grows, it also makes sense to question if the devices should look the same or if some radical change in architecture is beneficial.
In the following sections we outline challenges that we foresee in the creation of the architecture of a one-million gate FPGA, and in the placement and routing of these large devices. We also suggest possible directions for architecture and challenges involved in those directions.
The FPGA routing architecture is the manner in which metal wires and programmable switches are placed to provide connectivity between the pins of the FPGA logic blocks. The routing architecture must be designed to provide paths with greatest possible speed without sacrificing too much area. Three key architectural parameters that have strong effect on delay and density are buffering, segmentation distribution and segmentation population. We discuss each of these in the next sections.
There are two extremes that describe this spectrum of buffer usage. In one extreme every programmable switch could have a buffer in series with it (with perhaps a programmable direction). This is the case in the Altera Flex 8K and 10K architectures [3][4]. Although this prevents large RC constants from ``building up'' through multiple programmable switches, it makes every programmable switch pay the price of a relatively high constant delay. The alternative, in which some connections have no buffers, would make some short connections much faster. This could have a significant positive effect on the critical path.
The other extreme is to have no buffers at any programmable switch. This is the case in the Xilinx 4000 architecture [5]. While local short connections will be faster than with the buffered case, it is also easy to create large slow RC networks in the absence of buffers. It is clear that some mixture of the buffered and non-buffered switch is most appropriate. The more recent 4000EX architecture provides a set of buffers in the routing [6].
2.2. Segmentation Distribution
A second technique to prevent buildup of large series resistance is to employ wires of differing lengths in what is known as a segmented routing architecture[1] [2]. Figure 1 illustrates segments of three different lengths. The segmentation distribution, which is the number of tracks in a channel with segments of each length, is a key question that has been studied to some degree [7][8][9][10]. If there are too many long segments, then the wire will be wasted when used for short connections. If there are insufficient long segments then networks will be too slow with many series programmable switches.
An associated issue in segmentation design occurs at the point that orthogonal channels intersect: should long segments connect to other long segments, or other short segments or both?
2.3. Internal Population of Segments
An important issue in segmentation design is whether or not there is access to the internal portion of the segment, referred to as the internal population of a segment. Figure 2 illustrates the two extremes of internal population with two different segments of length 4. The top segment is fully internally populated - the segment can connect to both the logic block it passes by and the orthogonal channels.The bottom segment is completely un-populated, and so can only be connected to at its ends.
The population of a segment is important with respect to both speed and density: the more programmable switches connected to a wire, the greater the parasitic capacitance (and hence delay) they add to the wire. From a density perspective, more switches provide greater flexibility and perhaps fewer routing tracks will be needed overall. However, a completely populated long segment requires vastly more area than a un-populated one. As processes with more metal layers become common, it will become more desirable to de-populate segments because the wires themselves will be cheaper.
2.4. Putting These Together - The Challenge
These three design spaces (buffering, segmentation distribution and internal population) interact with each other in very complicated ways. For example, a buffer may be needed to simply drive the capacitance of a long segment, depending on how long the segment is. It may also be unnecessary to use a buffer because a long segment reduces series resistance by removing series programmable switches.
We believe a key challenge for the one-million gate FPGA and beyond is to make the correct choices in this space of architectures.
3. Compile Time
The largest FPGAs available today have 5000 LUT/flip-flop pairs, and the next generation will contain in excess of 12,000. The 1 million-gate FPGA will contain (by today's gate counting standards) approximately 83,000 LUT-flip-flop pairs. There is evidence to suggest that the complexity of most placement and routing algorithms is non-linear. With these large sized FPGAs, the placement and routing time could easily exceed 1 day, extrapolating from current tools. 4. Partitioned Architectures
It is extremely unlikely that the million gate FPGA's will be structured as a flat, homogeneous sea of logic blocks and switch boxes. Rather, they will almost certainly have large scale structure. This will add flexibility to their use, but challenge to the architects and CAD developers.4.1. Higher-Functionality Blocks
Based on industry experience with standard ASICs and embedded arrays, we believe that typical 1Mgate FPGA will almost certainly contain reconfigurable RAM blocks [4] [15]. The design and use of these will be a major challenge because of the performance/flexibility trade-offs involved.
Going beyond RAM, as FPGAs are used for more computation-oriented applications, such as digital signal processing and FPGA-based compute engines, it is tempting to suggest the use of adders, multipliers, dividers as the basic logic block. Are these practical? They give up generality/flexibility to get more speed and density. Can synthesis be made to handle them? We believe the challenge is to architect blocks like these so that they retain some flexibility, so that they can be used in the widest spectrum of applications possible.
4.2. Partitioned FPGAs for Speeding up Compilation
As mentioned above, many designers would like their FPGA compilations to be as fast as their C compilations. In order to make the software processing faster, FPGA compiling will probably have to look more like software compiling. For example, a large C program is usually organized as many smaller files, and it is generally only necessary to recompile only the files that have changed. A fast linking process stitches them all together. To carry this into FPGA's, one approach would be to assume that the design is organized as a set of HDL files, stitched together at the block level with a system level netlist. The content of individual HDL files may change frequently, but the overall system organization should be relatively stable. If the FPGA architecture can support this type of hierarchy efficiently, it will be easier for designers to work incrementally.
Placement within a section would be facilitated by keeping the relationship between distance and speed easy to model (a quadratic relationship, for example). As discussed in Section 2, this in itself may be a significant challenge.
Routing within a section would be simplified by keeping the system regular, supporting bus routing by enabling parallel signals to turn corners and retain their parallel organization (as was done in ORCA [14]) and in general, as discussed in Section 3, providing more than sufficient routing resources.
5. Re-Configuration Time
A convincing case can be made for using an FPGA programmed in several different ways for during one computation. A key bottleneck for making that a practical reality for one-million gate FPGAs is to make the configuration time fast enough. As the number of gates increases, the number of programming bits will increase quickly, potentially increasing the configuration time. Effective partial reconfiguration strategies, and/or methods of increasing the configuration speed are needed.6. Yield Enhancement Through Redundancy
To make very large FPGAs, it seems clear that enhancing silicon yield through redundancy should provide significant cost savings. Altera has demonstrated a workable method with their long segment architectures based on column redundancy [11]. An open challenge for island-style architectures[1][5], with short-length segments is to determine if there is a practical way to do a similar kind of redundancy. Although there has been some work on this subject [12][13], none of has yet to be proven practical because of routing issues. Specifically, short routing segments required significant overhead in area to be made redundant.
7. Conclusions
In this paper we have presented several challenges that need to be met for the one-million gate FPGA to become a usable and practical reality. The greatest challenge, however, comes from the fact that all of these issues must be dealt with and traded off within one device. The FPGA design space is not well understood; very few people have even been exposed to all of the issues and engineering trade-offs that are possible across the spectrum of IC processes, circuit design, architecture and CAD. But in order to succeed, all of these factors must be balanced with each other effectively.References