An operating system for large-scale shared-memory multiprocessors, such as NUMAchine, must be specifically designed for this class of system if the operating system and the applications that run on it are to perform well. For example, the large disparity between cache and main memory access times dictates that data sharing be minimized in order to reduce cache misses and consistency traffic. The generally low cache hit rates expected when executing operating system code dictate that memory access locality be kept high so that average memory access times remain low. Parallelism in servicing independent operating system requests must increase proportionally with the number of processors if bottlenecks in the operating system are not to appear as the system is scaled. Finally, the operating system must be flexible in allowing applications to customize operating system policies to suit their performance needs, since these needs vary greatly and any single default policy will perform well for only a small percentage of applications.
Tornado is a new operating system we are developing for NUMAchine that addresses these issues using novel approaches, some of which were developed for our previous operating system, Hurricane. For example, we have found that careful design can often eliminate data sharing even when sharing appears natural. This is best demonstrated by our previous work on a Protected Procedure Call (PPC) facility that supports cross-address space client-server interaction. This IPC facility accesses only local data, and hence causes no consistency traffic and requires no synchronization. As a result, PPC performance is comparable to the fastest uniprocessor IPC times and can be sustained regardless of the number of concurrent IPC operations in progress, even if they are all directed to the same server.
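The key property described above can be illustrated with a minimal sketch. The following C fragment is hypothetical (the channel layout, names, and trivial echo server are ours, not Tornado's); it shows only the structural idea that each processor owns a private PPC channel, so a call reads and writes nothing but processor-local state and concurrent calls on different processors never contend.

```c
#include <string.h>

#define NCPU 4

/* Hypothetical sketch: each processor owns a private PPC channel, so a
 * call touches only processor-local state -- no shared locks and no
 * cache-line transfers between processors. */
struct ppc_channel {
    int  in_use;         /* owned exclusively by this processor */
    char request[64];    /* marshalled arguments                */
    char reply[64];      /* server's result                     */
};

static struct ppc_channel channels[NCPU];   /* one per processor */

/* The server-side handler runs on the caller's processor, reading and
 * writing only the local channel (here, a trivial echo service that
 * capitalizes the first byte). */
static void server_handler(struct ppc_channel *ch) {
    strcpy(ch->reply, ch->request);
    if (ch->reply[0] >= 'a' && ch->reply[0] <= 'z')
        ch->reply[0] -= 'a' - 'A';
}

/* A PPC issued from processor `cpu`: nothing outside channels[cpu] is
 * touched.  In Tornado this step crosses address spaces; the plain
 * function call stands in for the protected control transfer. */
static const char *ppc_call(int cpu, const char *req) {
    struct ppc_channel *ch = &channels[cpu];
    ch->in_use = 1;
    strcpy(ch->request, req);
    server_handler(ch);
    ch->in_use = 0;
    return ch->reply;
}
```

Because the channel array is indexed by processor, the fast path generates no remote memory references, which is why throughput can scale with the number of concurrent callers even against a single server.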
As another example, we use Hierarchical Clustering, a modular and flexible structure that can be applied to operating system objects or services. Using this structure, operating system service points are automatically replicated to provide for concurrency that increases linearly with the number of processors in the system, yet at a low constant synchronization overhead [Unrau94, Unrau95]. Moreover, objects are automatically replicated and/or migrated to provide for good memory access locality. The degree of replication is determined by the hardware characteristics and the expected degree of contention on the object or service.
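The replication idea can be sketched as follows. This is a hypothetical illustration (the counter service, cluster size, and names are ours): a service is represented by one replica per cluster, each with its own lock, so the common-case request synchronizes only within its cluster and contention stays bounded as processors are added.

```c
#include <pthread.h>

#define NCPU        8
#define CLUSTER_SZ  4                   /* processors per cluster */
#define NCLUSTERS   (NCPU / CLUSTER_SZ)

/* Hypothetical sketch: a counter service replicated per cluster.  Each
 * replica has its own lock, so lock contention is bounded by the
 * cluster size regardless of the total number of processors. */
struct replica {
    pthread_mutex_t lock;
    long            count;
};

static struct replica reps[NCLUSTERS];

static void service_init(void) {
    for (int i = 0; i < NCLUSTERS; i++) {
        pthread_mutex_init(&reps[i].lock, NULL);
        reps[i].count = 0;
    }
}

/* A request from processor `cpu` is served by its cluster's replica,
 * keeping both the synchronization and the memory accesses local. */
static void service_request(int cpu) {
    struct replica *r = &reps[cpu / CLUSTER_SZ];
    pthread_mutex_lock(&r->lock);
    r->count++;
    pthread_mutex_unlock(&r->lock);
}

/* Aggregating across replicas is the rare, more expensive operation. */
static long service_total(void) {
    long total = 0;
    for (int i = 0; i < NCLUSTERS; i++) {
        pthread_mutex_lock(&reps[i].lock);
        total += reps[i].count;
        pthread_mutex_unlock(&reps[i].lock);
    }
    return total;
}
```

The design choice mirrors the trade-off stated above: the fast, frequent operation is made cluster-local, while the slow global aggregation pays the cost of visiting every replica.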
Lastly, Tornado uses an object-oriented, building-block approach that allows applications to customize policies and adapt them to their performance needs. We have successfully applied this approach to the Hurricane File System (HFS) implementation [Krieger94] and intend to use it in other parts of Tornado. In the file system, each ``storage object'' implements a portion of a file structure or some simple file system policy. Under application direction, multiple storage objects can then be combined in diverse ways to provide a wide variety of file structures and policies. With this approach the application can, for example, create files that distribute data across multiple disks using various distribution policies, create replicated files, create encrypted files, create compressed files that are transparently uncompressed on subsequent use, or create files with multiple concurrent file pointers.
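The composition idea can be made concrete with a small sketch. The interface and object names below are hypothetical, not HFS's actual API; the point is only that every storage object exposes the same operations, so a policy object (here, round-robin striping across two disks) can be layered transparently over simpler ones.

```c
/* Hypothetical sketch of HFS-style composable storage objects: every
 * object exposes the same read interface, and policy objects are
 * layered over simpler ones under application direction. */
struct storage_obj {
    int (*read)(struct storage_obj *self, int block);
    void *state;
};

/* Leaf object: a "disk" whose blocks return a value identifying the
 * disk and block (a stand-in for real data). */
struct disk_state { int id; };

static int disk_read(struct storage_obj *self, int block) {
    struct disk_state *d = self->state;
    return d->id * 1000 + block;
}

/* Composite object: stripes blocks round-robin across two children.
 * Because it implements the same interface, it can itself be wrapped
 * by further policy objects (replication, compression, ...). */
struct stripe_state { struct storage_obj *child[2]; };

static int stripe_read(struct storage_obj *self, int block) {
    struct stripe_state *s = self->state;
    struct storage_obj *c = s->child[block % 2];  /* pick a disk      */
    return c->read(c, block / 2);                 /* block on that disk */
}
```

Since the striping object is itself a storage object, an application could stack, say, a compression object above it without either layer knowing about the other.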
For research purposes, we intend to tune Tornado for applications with very large data sets that typically do not fit in memory and hence have high I/O demands. We intend to equip NUMAchine with twice as many disks as processors, but providing a high-capacity I/O system alone is not sufficient. A more integrated approach is required, in which the file system, the memory management system, the scheduler, and the application programs cooperate to efficiently exploit the potential bandwidth of the parallel I/O subsystem.
We also intend to provide applications with an operating environment that offers predictable performance behavior, to allow performance tuning by the compiler or application programmer and to allow the application to appropriately parameterize its algorithms at run-time. Since Tornado will be a multiprogrammed system in which concurrently running applications could interfere with each other through their use of common hardware resources, resource-constrained scheduling and preloading techniques will be necessary to provide the application with a consistent resource base.