Caam 420 - Understanding language compilation

A compiler such as g++ is really just the manager of many smaller programs that convert source code to executable machine code in stages. Let's review the phases of compilation and the standard compiler flags that interact with them.

Preprocessor

The first step in compilation is preprocessing. The C preprocessor is a simple text transformer; it knows nothing about the syntax of the underlying C program, so it can be used in any situation where text must be translated and the translation is not too complex. C++ compilers invoke the C preprocessor automatically as the first step in compilation.

The C preprocessor language is simple; it has only three or four sets of constructions.

#define X Y causes the phrase Y (words up to end of line) to replace the word X (delimited by whitespace characters) in subsequent lines.
#define X(U) Y causes the phrase Y(V|U) to replace the word X(V) in subsequent lines, where Y(V|U) denotes the phrase Y in which word V replaces all instances of word U.
#undef X causes the word X to no longer be replaced in subsequent lines.
defined(X) is 1 if X is defined, and 0 if it is not.
#if X
#elif Y
#else
#endif enables conditional compilation; if X is nonzero, all lines between #if X and #elif Y are included; otherwise, if Y is nonzero, all lines between #elif Y and #else are included; otherwise, all lines between #else and #endif are included.
#ifdef X is the same as #if defined(X)
#ifndef X is the same as #if !defined(X)
#include "X" includes at this point all the lines of file X, after they have been preprocessed. The file is sought first in the directory that contains the including file.
#include <X> includes at this point all the lines of file X in one of the standard directories, or in one of the directories specified in the compiler command line.
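As a rough illustration of the directives listed above, here is a small sketch that uses an object-like macro, a macro with an argument, conditional compilation, and #undef. All of the names (BUFFER_SIZE, SQUARE, USE_FAST_PATH, USE_SAFE_PATH, and the functions) are made up for this example.

```cpp
// Object-like macro: every later occurrence of BUFFER_SIZE becomes 64.
#define BUFFER_SIZE 64

// Macro with an argument: SQUARE(v) becomes ((v) * (v)).
#define SQUARE(u) ((u) * (u))

// Conditional compilation: exactly one of the three definitions survives.
// Neither hypothetical macro is defined here, so the #else branch is kept.
#if defined(USE_FAST_PATH)
inline const char* path_name() { return "fast"; }
#elif defined(USE_SAFE_PATH)
inline const char* path_name() { return "safe"; }
#else
inline const char* path_name() { return "default"; }
#endif

// After preprocessing, the compiler proper sees "return 64;".
inline int buffer_size() { return BUFFER_SIZE; }

// After preprocessing: "return ((1 + 2) * (1 + 2));".
inline int squared_sum() { return SQUARE(1 + 2); }

// After #undef, BUFFER_SIZE is an ordinary word again.
#undef BUFFER_SIZE
inline int plain_identifier() {
    int BUFFER_SIZE = 7;  // legal: no macro replaces this word any more
    return BUFFER_SIZE;
}
```

Note that buffer_size() still returns 64 even though the macro is later undefined; the replacement happened at the line where the word appeared.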

There are a few other preprocessor directives, but they are obscure and rarely used in source code.

The preprocessor is often used to allow a single source file to represent a program that varies slightly from one computer type to another. It is used to define short functions that are invoked without the usual overhead of function calls. It is also used to control the inclusion of header files, by ensuring that no file is included twice.

A small number of command line arguments are directed to the preprocessor. For example, -Iw adds the directory named by the word w to the list of directories searched for included files; this always applies to the second form of the include directive, and g++ also falls back on these directories for the first form. Another argument, -Dw=x, effectively inserts "#define w x" as the first line of the source file. A third, -Uw, effectively inserts "#undef w" at the beginning of the source file.
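One common way to use -D and -U is to give a macro a default value in the source and let the command line override it. The sketch below assumes a hypothetical file grid.cpp; GRID_SIZE, VERBOSE, and the function names are invented for illustration.

```cpp
// A default that a command-line definition can override:
//   g++ -DGRID_SIZE=200 -c grid.cpp   -> GRID_SIZE is 200
//   g++ -c grid.cpp                   -> GRID_SIZE falls back to 100
#ifndef GRID_SIZE
#define GRID_SIZE 100
#endif

// A flag-style macro: switched on with -DVERBOSE, and switched off
// again by a later -UVERBOSE on the same command line.
#ifdef VERBOSE
inline bool verbose_enabled() { return true; }
#else
inline bool verbose_enabled() { return false; }
#endif

// Uses whichever value of GRID_SIZE the preprocessor ended up with.
inline int grid_cells() { return GRID_SIZE * GRID_SIZE; }
```

Compiled without any -D arguments, grid_cells() returns 10000 and verbose_enabled() returns false.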

To invoke the preprocessor without doing anything else, most compilers offer the -E option, which sends the output of the preprocessing step to standard output. Some compilers also offer a -P option that saves the preprocessed code in a file; g++ gives -P a different meaning (it suppresses line markers in the -E output), so with g++ redirect the output of -E into a file instead. Examining the output of the preprocessor is occasionally useful when you suspect that an included file is having an undesirable effect, as for example when something like "#define sqrt abs" appears in a rogue header file. Files of preprocessed source code use the .i suffix.

Compiler

The actual compiler is itself composed of smaller parts. The lexer, or scanner, recognizes tokens, or words. The parser recognizes phrases and detects errors such as mismatched parentheses. The code generator produces assembly code from these phrases. Assembly code is a machine level code written in a style somewhat more legible than just a stream of digits. By examining the assembly code from a compiler, you can see whether your innermost loops are as fast as they should be, or whether you need to modify them by hand to optimize performance. However, modifying assembly code for better performance is difficult, requires specialized knowledge about the underlying computer, and is inherently unportable. To examine the assembly code produced by the compiler, use the -S compiler option, which causes the compiler to stop after producing assembly code, in a file with a ".s" suffix.
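To try this out, it helps to isolate an inner loop in a small function of its own, so that its assembly is easy to find in the .s file. The following is only a sketch; the file name axpy.cpp and the function axpy are assumptions made for this example.

```cpp
#include <cstddef>

// Assumed to live in a hypothetical file axpy.cpp.
//   g++ -S axpy.cpp       writes the assembly code to axpy.s
//   g++ -S -O2 axpy.cpp   does the same with optimization turned on,
//                         so the two versions of the loop can be compared
//
// Computes y[i] = y[i] + a * x[i] for i = 0, ..., n-1.
void axpy(double a, const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

The exact instructions you see depend on the processor and the compiler version, so treat the output as something to read, not something to rely on.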

Assembler

The assembler is a simple program that takes assembly code and produces an object file. (This use of the word object is unrelated to object-oriented programming.) The object file consists of two parts. One part is machine code, the digits that the processor ultimately interprets as instructions. The other is a symbol table. The symbol table describes the external symbols declared in the source program. Some of those symbols correspond to functions or global variables defined in the source code that other object files may need; with each symbol is stored the address of the function or variable in the machine code. Other symbols describe functions and global variables that the source code uses, but does not define. With each of these symbols is a list of the places the symbol gets used. To preserve an object file, use the -c compiler option, which causes the compiler to produce object code into a file with a .o suffix.
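The two kinds of symbols can be seen in even a tiny source file. In the sketch below (the names call_count and hypotenuse are invented for illustration), the object file defines two symbols and typically refers to one that it does not define.

```cpp
#include <cmath>

// Defined here: a global variable. The symbol table records its name
// together with its address in this object file.
int call_count = 0;

// Defined here: a function, recorded the same way.
double hypotenuse(double a, double b) {
    ++call_count;
    // sqrt is used here but defined elsewhere (in the math library), so
    // the object file typically lists it as a symbol that is used but not
    // defined. An optimizing compiler may replace the call with a single
    // machine instruction, in which case the reference can disappear.
    return std::sqrt(a * a + b * b);
}
```

On many Unix systems the nm command prints the symbol table of an object file. C++ compilers encode argument types into function names, so hypotenuse appears there in a longer, mangled form.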

Libraries

A library is a file with a name ending in .a (and often beginning with lib) that stores a collection of object files and a master symbol table. The ar (for archive) command builds libraries, and modifies them. In particular, to build a library with ar, use the arguments cr, then the archive name, then the list of object files that it will contain. Older systems required a second command, ranlib, to build the master symbol table for the archive, but that is mostly unnecessary now.

Linker

The linker is the program that binds together object files to make an executable file. It checks every symbol table to ensure that each symbol that is used somewhere is defined somewhere, and not more than once. Besides looking at the files you provide, the linker looks in standard libraries, where common symbols like cout are defined, and in libraries you may specify. If all the pieces can be linked together, the linker produces an executable file. It is called a.out, unless you use the -o flag to call it something else.

Driver

The driver program is the one called g++ that directs the components described above. The driver works in two stages. In the first stage, command line options like -E, -P, -S, -c and -o are sought; they tell the driver, among other things, how many of its components will be necessary. In the second stage, the driver considers each file in the order in which it appears in the command line. Each file is processed, with the filename suffix identifying which transformation comes first. The C preprocessor is applied first for .c, .F, and .C files, among others. Compilation comes first for .i and .f files, assembly for .s files, and linkage for .o files. Library files are like .o files, except that only those object files that define symbols that have been seen and not yet defined are extracted from the library. The last component is determined by the flag (-E, -P, -S, -c) read in the first stage, and the default is to link together an executable.

Along with files, some command-line flags are processed during this second phase. One is -L; the word after -L identifies a directory in which libraries will be sought. Another is -l; where the flag is used as -lw (for some word w), it is replaced by the library libw.a, taken from either a standard directory (typically, /usr/lib) or from a directory specified using the -L option.
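Putting the library, linker, and driver pieces together, a typical sequence might look like the sketch below. The file names (stats.h, stats.cpp, report.cpp), the library name libstats.a, and the functions are all hypothetical; the three pieces are shown in one listing for brevity, with comments marking where each would live.

```cpp
// File: stats.h (hypothetical header shared by both source files)
double mean(const double* values, int count);

// File: stats.cpp (hypothetical library source)
//   g++ -c stats.cpp           produces stats.o
//   ar cr libstats.a stats.o   builds the library libstats.a
double mean(const double* values, int count) {
    double sum = 0.0;
    for (int i = 0; i < count; ++i) {
        sum += values[i];
    }
    return count > 0 ? sum / count : 0.0;
}

// File: report.cpp (hypothetical code that uses the library)
//   g++ -c report.cpp                     produces report.o
//   g++ report.o -L. -lstats -o report    links the executable "report"
// Here -L. adds the current directory to the library search path, and
// -lstats stands for libstats.a.
double average_of_three(double a, double b, double c) {
    const double values[3] = {a, b, c};
    return mean(values, 3);
}
```

The order on the last command line matters: report.o comes before -lstats, so that mean has already been seen as an undefined symbol by the time the library is examined.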

Common preprocessor usage

Here are a couple of examples of preprocessor use that come up frequently.

First, header files are often written inside a preprocessor wrapper. The following example shows what that wrapper looks like for a header file "sof.h".

#ifndef _sof_h_
#define _sof_h_
// header file contents
#endif /* _sof_h_ */

The purpose of this wrapper is to keep the contents of the header file from being included twice in one invocation of the compiler. Depending on what the contents of the header file are, multiple copies could be a problem.
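To see the wrapper at work without creating separate files, the sketch below pastes the same guarded text twice, which is what the preprocessor would produce if a header were included twice. The guard name SOF_H_INCLUDED and the struct Sof are assumptions made for this example; any name that is unique to the header works as a guard.

```cpp
// First "inclusion": the guard macro is not yet defined, so the
// contents are kept and the guard macro becomes defined.
#ifndef SOF_H_INCLUDED
#define SOF_H_INCLUDED
struct Sof {
    int value;
};
#endif /* SOF_H_INCLUDED */

// Second "inclusion": the guard macro is already defined, so the
// preprocessor discards everything up to the matching #endif.
// Without the guard, the compiler would see struct Sof defined twice
// and report an error.
#ifndef SOF_H_INCLUDED
#define SOF_H_INCLUDED
struct Sof {
    int value;
};
#endif /* SOF_H_INCLUDED */

inline int sof_value() {
    Sof s;
    s.value = 42;
    return s.value;
}
```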

Another example uses preprocessor macros (#defines with arguments) to simulate inlined functions. For example, this macro

#define dist(x,y) sqrt(x*x+y*y)

could be used as if dist were a function; dist(3,4) would expand to sqrt(3*3+4*4). These inlined functions are subject to certain pitfalls. For example, dist(1+2,4) expands to sqrt(1+2*1+2+4*4), which evaluates to sqrt(21) rather than the intended sqrt(25). A more careful macro definition parenthesizes every use of each argument:

#define dist(x,y) sqrt((x)*(x)+(y)*(y))
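Even the parenthesized macro has a remaining pitfall: each argument is pasted in twice, so an argument with a side effect is evaluated twice. The sketch below shows this next to an inline function, which is one possible alternative in C++. The names dist_fn, next_value, evaluation_count, and macro_evaluations are invented for this illustration.

```cpp
#include <cmath>

using std::sqrt;  // make the unqualified name sqrt visible to the macro

#define dist(x, y) sqrt((x) * (x) + (y) * (y))

// An inline function evaluates each argument exactly once and checks
// argument types, while still avoiding most function-call overhead.
inline double dist_fn(double x, double y) {
    return std::sqrt(x * x + y * y);
}

// A stand-in for any argument expression that has a side effect.
static int evaluation_count = 0;
inline double next_value() {
    ++evaluation_count;
    return 3.0;
}

// Counts how many times the macro evaluates its first argument.
inline int macro_evaluations() {
    evaluation_count = 0;
    // Expands to sqrt((next_value()) * (next_value()) + (4.0) * (4.0)).
    double d = dist(next_value(), 4.0);
    (void)d;  // the distance itself is not needed here
    return evaluation_count;
}
```

Here macro_evaluations() returns 2, whereas dist_fn(next_value(), 4.0) would call next_value() only once.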