An SRAM-Programmable Field-Configurable Memory
Tony Ngai*, Jonathan Rose, and Steven J. E. Wilton
This paper describes the design and implementation of an SRAM-Programmable Field-Configurable Memory (FCM), which has the flexibility to form over two hundred different memory configurations, each with up to four individual memories. The prototype Field-Configurable Memory has four 1Kb memory blocks, each of which can be configured into four different aspect ratios. It requires just 40 configuration bits and is only 38% larger and 46% slower than the ASIC memory upon which it is based. User memories implemented on this chip require from 16 to 23 times less area than if they were implemented on a Xilinx 4000 series memory architecture. Although this is a stand-alone FCM implementation, the design can be embedded as part of an FPGA.
On-chip Field-Configurable Memory will also provide significantly higher memory bandwidth than off-chip memory because the pin count limitations prevent wide off-chip memories and I/O pad delay limits off-chip access time.
Several FPGAs vendors already offer limited memory capability [Brit93] [Hsie90] [Plus89] [Marp92] [Smit93]. Although each of these architectures individually have either good performance, memory density or flexibility, none can achieve all three features at the same time [Ngai94]. In this paper we present the design and implementation of an FCM that can implement memories that are fast and dense yet is flexible enough to support a wide range of memory configurations. In the next section we describe our general notion of a Field-Configurable Memory. Section 3 describes the specific architecture of our prototype FCM and Section 4 gives results measured from the implemented chip.
The basic FCM structure consists of several Basic Memory Blocks (BMBs) connected by a programmable interconnect structure as illustrated in Figure 1. The BMBs are the physical memory units which can be used individually or combined together to form logical memories. The interconnect structure provides all the necessary connections between the BMBs and the external world (I/O pads for a stand-alone FCM or other programmable interconnects for a mixed FCM-FPGA design).
There are two ways that BMBs can be combined: either ``vertically'' to form deeper logical memories, or ``horizontally'' to form wider ones. Figure 1 shows an example using two 256x4 BMBs. Figure 1a shows how the two BMBs can be used to form a 256 x 8 logical memory and Figure 1b shows how a deeper 512 x 4 logical memory can be formed. In order to increase the overall flexibility of the FCM, we believe that it is important to make each BMB configurable into different aspect ratios [Wilt95].
3. Architecture of the Prototype Chip
In this section we describe the architecture of a prototype stand-alone FCM, called FiRM (for Field-Reconfigurable Memory). It consists of four 1K bit BMBs and thus can implement up to four logical memories. The memory is based on a synchronous SRAM design for ASIC memory from Bell-Northern Research provided through the Canadian Microelectronics Corporation. The aspect ratio of the base memory array is 128 x 8 bits giving seven address and eight data lines. The following sections describe the details of the BMBs, and the routing architecture that programmably joins the BMBs.
3.1 The Basic Memory Block
Each BMB can be programmed to take on an aspect ratio of 1K x 1, 512 x 2, 256 x 4, or 128 x 8. This flexibility is achieved using a mapping block surrounding the basic memory array as shown in Figure 3. The core of the mapping block is a pass-transistor network that connects the fixed-width 8-bit data bus (M bus) from the memory array to the variable-width data bus (D bus) emanating from the BMB. The two configuration bits (Cbit0 and Cbit1) are used to select one of the four output widths. The address lines A0 to A6 are connected to the memory block to select one of the 128 bytes. The mapping logic then decodes the upper address (A7 to A9) to connect the external D bus to the corresponding portion of the internal M bus.
Figure 4 illustrates the switch settings of the mapping block when the BMB is in the 1Kx1 and the 256x4 mode. The horizontal lines are the output D bus, and the arrows show how these signals are connected to the fixed M bus. The connections are made by turning on the appropriate group of pass transistors indicated by the solid circles in the two diagrams.
To support a write operation into the array when the D bus width is less than 8 requires a way to write into only part of the M data-path without changing the remainder. For this reason the base SRAM array provides separate write enable lines for each bit of the base memory word. For example, as shown in Figure 3, the solid circles, arrows and squares represent the 256x4 configuration. When the address is between 0 to 127, the Wen0-3 signals are activated so that the write buffer updates only the top four SRAM cells but not the bottom four.
3.2 Routing Architecture
The purpose of the routing structure is to connect several BMBs to form one logical memory, making them either deeper or wider than the native capability of the BMB itself. The routing structure also provides connectivity to the I/O.
The routing architecture for an FCM can be far more area-efficient than that of a general-purpose FPGA switching a similar number of wires for two reasons: First, since individual wires in the address and data buses are always connected or disconnected in the same way, only a single configuration bit is needed for programming a connection from one bus to another, whereas in an FPGA the number of configuration bits is equal to the number of wires. This represents a 10-fold saving in the number of address bus routing configuration bits for the 10-line address bus, and similar savings on the data bus. The second saving arises because complete flexibility isn't required in routing structure. Although we chose to implement a routing structure that gives the maximum number of logical memory configurations possible with the four BMBs described above, this does not require all possible connections between all physical memory buses. By restricting certain logical memories to particular I/O pads a much simpler global routing architecture is possible, as illustrated in Figure 5 [Ngai94].
The solid circles in Figure 5 represent those switches that are closed when implementing a 4Kx1 logical memory. In this example, the address buses from all BMBs are connected together, as are all the data buses. Figure 5 also illustrates the Block Enable lines, which are used to select the appropriate BMB when more than one BMB is used to form a logical memory.
The total flexibility of an FCM is determined by the combined flexibility of its BMBs and the routing structure that groups the BMBs together. In total, 133 different configurations are possible, counting only the logical memory configurations which utilize all 4Kb of memory. Table 1 gives a partial listing of the 133 logical memory configurations, organized by the number of distinct logical memories.
FiRM is programmed using a 40-bit shift register chain that stores the configurations of the four BMBs and the routing structure.
4. Chip Status and Measured Performance
The FiRM chip has been fabricated using a 1.2 m double-metal CMOS technology. Figure 6 illustrates its floorplan and dimensions. This design is pad limited, requiring 14.70 mm2 for the inner core and 7.08 mm2 for the active area. The white space inside the pad ring was used to fabricate a separately bondable instance of the base 128x8 memory in order to measure the speed of the memory array, and so estimate the speed penalty due to the Field-Configurability. Figure 7 shows a microphotograph of the chip.
The tested chip is partially functional: the separately bonded memory works correctly, and the programmable address and data bus connections also function. Although some memory locations of the FiRM can be read and written reliably with particular test vectors, other memory accesses do not work correctly. At this point in the testing, we are unable to determine the cause of the problem. We are, however, able to measure the access time for the working test vectors.
The clock read access time (from rising edge of the clock until the data is valid) of the FiRM chip was measured at 40ns (under the tester load). The read access time for the separately bonded 128x8 memory was measured at 28ns using the same load. Therefore, the FiRM chip is 46% slower than the ASIC memory that it is based on.
In this paper, we have described the design of a full-custom field-configurable memory that combines the speed and density of the traditional SRAM design and the flexible structure of the FPGA.