In a typical computer with 32MB of 16Mb DRAM chips and a 100MHz processor, there is 3000 times more bandwidth available inside the memory chips than at the CPU. If you can't bring the memory bandwidth to the processor, then bring the processors to the memory.
PostScript paper, without photos (181KB)
chip micrograph (584KB)
PE detail micrograph (136KB)
These files are also available via anonymous FTP from ftp.eecg.toronto.edu in /pub/tech_reports/dunc/*
From the proceedings of the PetaFLOPS Frontier Workshop held February 1995 in Washington DC during The Fifth Symposium On The Frontiers Of Massively Parallel Computation:
PostScript paper (51KB)
Other information on PetaFLOPS Enabling Technologies and Applications
|Program||C-RAM (ms)||Host (ms)||Speedup (host/C-RAM)||Source code|
|3x3 Convolution 16M||17.6067||112760.0||6404||Parallel, Sequential|
|FIR 128K 40b||0.0991||311.7||3144||Parallel, Sequential|
|FIR 4M 16b||1.0437||5144.4||4929||Parallel, Sequential|
|Vector Quantization||25.746||33780||1312||Parallel, Sequential|
|Masked Blt||0.0182||442.8||24310||Parallel, Sequential|
|LMS Matching||0.2003||250.9||1253||Parallel, Sequential|
|Data Mining||70.66||192450||2724||Parallel, Sequential|
|Fault Simulation||0.0894||2380.0||26626||Parallel, Sequential|
|Memory Clear||0.0016||8.8||5493||Parallel, Sequential|
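The speedup column in the table above is the ratio of host time to C-RAM time. The following is a minimal sketch that recomputes that ratio from the times as printed in the table; the names `TIMINGS_MS` and `speedup` are illustrative and not part of the original source code. Because the printed times are rounded, the recomputed values can differ slightly from the listed speedups.

```python
# Times in milliseconds and listed speedups, copied from the table above.
# Each entry: (program, C-RAM ms, host ms, listed speedup)
TIMINGS_MS = [
    ("3x3 Convolution 16M", 17.6067, 112760.0, 6404),
    ("FIR 128K 40b", 0.0991, 311.7, 3144),
    ("FIR 4M 16b", 1.0437, 5144.4, 4929),
    ("Vector Quantization", 25.746, 33780.0, 1312),
    ("Masked Blt", 0.0182, 442.8, 24310),
    ("LMS Matching", 0.2003, 250.9, 1253),
    ("Data Mining", 70.66, 192450.0, 2724),
    ("Fault Simulation", 0.0894, 2380.0, 26626),
    ("Memory Clear", 0.0016, 8.8, 5493),
]


def speedup(cram_ms, host_ms):
    """Return how many times faster the C-RAM run is than the host run."""
    return host_ms / cram_ms


if __name__ == "__main__":
    for program, cram_ms, host_ms, listed in TIMINGS_MS:
        ratio = speedup(cram_ms, host_ms)
        print(f"{program:22s} recomputed {ratio:9.1f}   listed {listed}")
```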
Computational RAM (C-RAM) is semiconductor random access memory with processors incorporated into the design to build an inexpensive massively parallel computer. If an application contains sufficient parallelism, it will typically run orders of magnitude faster in C-RAM than on the central processing unit. This work includes the architecture, prototype chips, a compiler and applications.
C-RAM integrates SIMD (Single Instruction stream, Multiple Data stream) processors into random access memory at the sense amplifiers, along one edge of a two-dimensional array of memory cells. The novel combination of processors with memory, in which the memory retains its memory interface, allows C-RAM to be used as computer main memory, as a video frame buffer, or for stand-alone signal processing. The use of high-density commodity dynamic memory makes C-RAM economical. The bit-serial, externally programmed processing elements (PEs) add only slightly to the cost of the chip (9-20%), yet a workstation with 32 Mbytes of C-RAM would have an aggregate performance of 13 billion 32-bit operations per second. A working C-RAM chip with 64 processing elements has been fabricated, and the PE for a 2048-PE, 4 Mbit chip has been designed.
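To illustrate what bit-serial SIMD processing at the sense amplifiers means, here is a minimal software sketch of a generic bit-serial adder array. It is not the C-RAM PE design, whose registers and ALU are not described here. It assumes one PE per memory column, operands stored vertically (one bit per row, least significant bit first), and a single 1-bit carry register per PE; the function names and the memory layout are hypothetical.

```python
def store_vertical(memory, base_row, values, width):
    """Store one integer per column, one bit per row, LSB in base_row."""
    for bit in range(width):
        memory[base_row + bit] = [(v >> bit) & 1 for v in values]


def load_vertical(memory, base_row, width, num_pes):
    """Read back one integer per column from `width` consecutive rows."""
    return [
        sum(memory[base_row + bit][pe] << bit for bit in range(width))
        for pe in range(num_pes)
    ]


def simd_bit_serial_add(memory, a_row, b_row, out_row, width):
    """Add two vertically stored operands in every column at once.

    Each loop iteration models one broadcast instruction: every PE reads
    one bit of each operand from its own column, adds them to its carry
    register, and writes one result bit back to its column.
    """
    num_pes = len(memory[a_row])
    carry = [0] * num_pes  # assumed: one 1-bit carry register per PE
    for bit in range(width):
        a_bits = memory[a_row + bit]
        b_bits = memory[b_row + bit]
        totals = [a_bits[pe] + b_bits[pe] + carry[pe] for pe in range(num_pes)]
        memory[out_row + bit] = [t & 1 for t in totals]
        carry = [t >> 1 for t in totals]
    return carry  # final carry-out of each PE


if __name__ == "__main__":
    num_pes, width = 64, 8  # 64 PEs matches the fabricated chip; width is arbitrary
    memory = [[0] * num_pes for _ in range(3 * width)]
    store_vertical(memory, 0, list(range(num_pes)), width)
    store_vertical(memory, width, [2 * i for i in range(num_pes)], width)
    simd_bit_serial_add(memory, 0, width, 2 * width, width)
    print(load_vertical(memory, 2 * width, width, num_pes)[:8])
```

In this model the number of steps grows with the operand width rather than with the number of elements, so every additional column adds throughput at no extra time cost. As a rough check on the figures above, and assuming the 32 Mbyte workstation were built from the 2048-PE, 4 Mbit chips, that would be 64 chips and 131,072 PEs, so 13 billion operations per second works out to roughly 100,000 32-bit operations per second per PE.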
The performance of C-RAM on kernels and real applications was obtained by simulating their execution. A prototype compiler was written for this purpose. Applications are drawn from the fields of signal and image processing, computer graphics, synthetic neural networks, CAD, databases and scientific computing.
Keywords: smart memory, smart DRAM, intelligent memory, intelligent DRAM, processors in memory, processing in memory, computing in memory, pitch-matched logic in memory, application specific memory, application specific DRAM, massively parallel computer, massively parallel computing, massively parallel SIMD, MPP, IRAM, DSP, VLSI, logic enhanced memory, logic enhanced DRAM, merged DRAM-logic, MPP applications, graphics, digital signal processing, image processing, image compression, scientific computing, database
Duncan's home page