home *** CD-ROM | disk | FTP | other *** search
-
-
-
-
-
-
-
-
-
- NNNNooootttteeeessss oooonnnn PPPPrrrrooooggggrrrraaaammmmmmmmiiiinnnngggg iiiinnnn CCCC
-
-
- Rob Pike
-
-
-
-
-
-
- _I_n_t_r_o_d_u_c_t_i_o_n
-
- Kernighan and Plauger's _T_h_e _E_l_e_m_e_n_t_s _o_f _P_r_o_g_r_a_m_m_i_n_g
- _S_t_y_l_e was an important and rightly influential book. But
- sometimes I feel its concise rules were taken as a cookbook
- approach to good style instead of the succinct expression of
- a philosophy they were meant to be. If the book claims that
- variable names should be chosen meaningfully, doesn't it
- then follow that variables whose names are small essays on
- their use are even better? Isn't MaximumValueUntilOverflow
- a better name than maxval? I don't think so.
-
- What follows is a set of short essays that collectively
- encourage a philosophy of clarity in programming rather than
- giving hard rules. I don't expect you to agree with all of
- them, because they are opinion and opinions change with the
- times. But they've been accumulating in my head, if not on
- paper until now, for a long time, and are based on a lot of
- experience, so I hope they help you understand how to plan
- the details of a program. (I've yet to see a good essay on
- how to plan the whole thing, but then that's partly what
- this course is about.) If you find them idiosyncratic,
- fine; if you disagree with them, fine; but if they make you
- think about why you disagree, that's better. Under no cir-
- cumstances should you program the way I say to because I say
- to; program the way you think expresses best what you're
- trying to accomplish in the program. And do so consistently
- and ruthlessly.
-
- Your comments are welcome.
-
- _I_s_s_u_e_s _o_f _t_y_p_o_g_r_a_p_h_y
-
- A program is a sort of publication. It's meant to be
- read by the programmer, another programmer (perhaps yourself
- a few days, weeks or years later), and lastly a machine.
- The machine doesn't care how pretty the program is - if the
- program compiles, the machine's happy - but people do, and
- they should. Sometimes they care too much: pretty printers
- mechanically produce pretty output that accentuates
- irrelevant detail in the program, which is _a_s sensible _a_s
-
-
-
- February 21, 1989
-
-
-
-
-
- - 2 -
-
-
- putting all the prepositions _i_n English text _i_n bold font.
- Although many people think programs should look like the
- Algol-68 report (and some systems even require you to edit
- programs in that style), a clear program is not made any
- clearer by such presentation, and a bad program is only made
- laughable.
-
- Typographic conventions consistently held are important
- to clear presentation, of course - indentation is probably
- the best known and most useful example - but when the ink
- obscures the intent, typography has taken over. So even if
- you stick with plain old typewriter-like output, be cons-
- cious of typographic silliness. Avoid decoration; for
- instance, keep comments brief and banner-free. Say what you
- want to say in the program, neatly and consistently. Then
- move on.
-
- _V_a_r_i_a_b_l_e _n_a_m_e_s
-
- Ah, variable names. Length is not a virtue in a name;
- clarity of expression _i_s. A global variable rarely used may
- deserve a long name, maxphysaddr say. An array index used
- on every line of a loop needn't be named any more ela-
- borately than i. Saying index or elementnumber is more to
- type (or calls upon your text editor) and obscures the
- details of the computation. When the variable names are
- huge, it's harder to see what's going on. This is partly a
- typographic issue; consider
-
- for(i=0 to 100)
- array[i]=0
-
- _v_s.
-
- for(elementnumber=0 to 100)
- array[elementnumber]=0;
-
- The problem gets worse fast with real examples. Indices are
- just notation, so treat them as such.
-
- Pointers also require sensible notation. np is just as
- mnemonic as nodepointer _i_f you consistently use a naming
- convention from which np means ``node pointer'' is easily
- derived. More on this in the next essay.
-
- As in all other aspects of readable programming, con-
- sistency is important in naming. If you call one variable
- maxphysaddr, don't call its cousin lowestaddress.
-
- Finally, I prefer minimum-length but maximum-
- information names, and then let the context fill in the
- rest. Globals, for instance, typically have little context
- when they are used, so their names need to be relatively
- evocative. Thus I say maxphysaddr (not
-
-
-
- February 21, 1989
-
-
-
-
-
- - 3 -
-
-
- MaximumPhysicalAddress) for a global variable, but np not
- NodePointer for a pointer locally defined and used. This is
- largely a matter of taste, but taste is relevant to clarity.
-
- I eschew embedded capital letters in names; to my
- prose-oriented eyes, they are too awkward to read comfort-
- ably. They jangle like bad typography.
-
- _T_h_e _u_s_e _o_f _p_o_i_n_t_e_r_s.
-
- C is unusual in that it allows pointers to point to
- anything. Pointers are sharp tools, and like any such tool,
- used well they can be delightfully productive, but used
- badly they can do great damage (I sunk a wood chisel into my
- thumb a few days before writing this). Pointers have a bad
- reputation in academia, because they are considered too
- dangerous, dirty somehow. But I think they are powerful
- _n_o_t_a_t_i_o_n, which means they can help us express ourselves
- clearly.
-
- Consider: When you have a pointer to an object, it is a
- name for exactly that object and no other. That sounds
- trivial, but look at the following two expressions:
-
- np
- node[i]
-
- The first points to a node, the second evaluates to (say)
- the same node. But the second form is an expression; it is
- not so simple. To interpret it, we must know what node is,
- what i is, and that i and node are related by the (probably
- unspecified) rules of the surrounding program. Nothing
- about the expression in isolation can show that i is a valid
- index of node, let alone the index of the element we want.
- If i and j and k are all indices into the node array, it's
- very easy to slip up, and the compiler cannot help. It's
- particularly easy to make mistakes when passing things to
- subroutines: a pointer is a single thing; an array and an
- index must be believed to belong together in the receiving
- subroutine.
-
- An expression that evaluates to an object is inherently
- more subtle and error-prone than the address of that object.
- Correct use of pointers can simplify code:
-
- parent->link[i].type
-
- _v_s.
-
- lp->type.
-
- If we want the next element's type, it's
-
-
-
-
-
- February 21, 1989
-
-
-
-
-
- - 4 -
-
-
-
- parent->link[++i].type
-
- or
-
- (++lp)->type.
-
- i advances but the rest of the expression must stay con-
- stant; with pointers, there's only one thing to advance.
-
- Typographic considerations enter here, too. Stepping
- through structures using pointers can be much easier to read
- than with expressions: less ink is needed and less effort is
- expended by the compiler and computer. A related issue is
- that the type of the pointer affects how it can be used
- correctly, which allows some helpful compile-time error
- checking that array indices cannot share. Also, if the
- objects are structures, their tag fields are reminders of
- their type, so
-
- np->left
-
- is sufficiently evocative; if an array is being indexed the
- array will have some well-chosen name and the expression
- will end up longer:
-
- node[i].left.
-
- Again, the extra characters become more irritating as the
- examples become larger.
-
- As a rule, if you find code containing many similar,
- complex expressions that evaluate to elements of a data
- structure, judicious use of pointers can clear things up.
- Consider what
-
- if(goleft)
- p->left=p->right->left;
- else
- p->right=p->left->right;
-
- would look like using a compound expression for p. Some-
- times it's worth a temporary variable (here p) or a macro to
- distill the calculation.
-
- _P_r_o_c_e_d_u_r_e _n_a_m_e_s
-
- Procedure names should reflect what they do; function
- names should reflect what they _r_e_t_u_r_n. Functions are used
- in expressions, often in things like if's, so they need to
- read appropriately.
-
- if(checksize(x))
-
-
-
-
- February 21, 1989
-
-
-
-
-
- - 5 -
-
-
- is unhelpful because we can't deduce whether checksize
- returns true on error or non-error; instead
-
- if(validsize(x))
-
- makes the point clear and makes a future mistake in using
- the routine less likely.
-
- _C_o_m_m_e_n_t_s
-
- A delicate matter, requiring taste and judgement. I
- tend to err on the side of eliminating comments, for several
- reasons. First, if the code is clear, and uses good type
- names and variable names, it should explain itself. Second,
- comments aren't checked by the compiler, so there is no
- guarantee they're right, especially after the code is modi-
- fied. A misleading comment can be very confusing. Third,
- the issue of typography: comments clutter code.
-
- But I do comment sometimes. Almost exclusively, I use
- them as an introduction to what follows. Examples: explain-
- ing the use of global variables and types (the one thing I
- always comment in large programs); as an introduction to an
- unusual or critical procedure; or to mark off sections of a
- large computation.
-
- There is a famously bad comment style:
-
- i=i+1; /* Add one to i */
-
- and there are worse ways to do it:
-
- /**********************************
- * *
- * Add one to i *
- * *
- **********************************/
-
- i=i+1;
-
- Don't laugh now, wait until you see it in real life.
-
- Avoid cute typography in comments, avoid big blocks of
- comments except perhaps before vital sections like the
- declaration of the central data structure (comments on data
- are usually much more helpful than on algorithms); basi-
- cally, avoid comments. If your code needs a comment to be
- understood, it would be better to rewrite it so it's easier
- to understand. Which brings us to
-
- _C_o_m_p_l_e_x_i_t_y
-
- Most programs are too complicated - that is, more com-
- plex than they need to be to solve their problems
-
-
-
- February 21, 1989
-
-
-
-
-
- - 6 -
-
-
- efficiently. Why? Mostly it's because of bad design, but I
- will skip that issue here because it's a big one. But pro-
- grams are often complicated at the microscopic level, and
- that is something I can address here.
-
- Rule 1. You can't tell where a program is going to
- spend its time. Bottlenecks occur in surprising places, so
- don't try to second guess and put in a speed hack until
- you've proven that's where the bottleneck is.
-
- Rule 2. Measure. Don't tune for speed until you've
- measured, and even then don't unless one part of the code
- _o_v_e_r_w_h_e_l_m_s the rest.
-
- Rule 3. Fancy algorithms are slow when _n is small, and
- _n is usually small. Fancy algorithms have big constants.
- Until you know that _n is frequently going to be big, don't
- get fancy. (Even if _n does get big, use Rule 2 first.) For
- example, binary trees are always faster than splay trees for
- workaday problems.
-
- Rule 4. Fancy algorithms are buggier than simple ones,
- and they're much harder to implement. Use simple algorithms
- as well as simple data structures.
-
- The following data structures are a complete list for
- almost all practical programs:
-
- array
- linked list
- hash table
- binary tree
-
- Of course, you must also be prepared to collect these into
- compound data structures. For instance, a symbol table
- might be implemented as a hash table containing linked lists
- of arrays of characters.
-
- Rule 5. Data dominates. If you've chosen the right
- data structures and organized things well, the algorithms
- will almost always be self-evident. Data structures, not
- algorithms, are central to programming. (See Brooks p.
- 102.)
-
- Rule 6. There is no Rule 6.
-
- _P_r_o_g_r_a_m_m_i_n_g _w_i_t_h _d_a_t_a.
-
- Algorithms, or details of algorithms, can often be
- encoded compactly, efficiently and expressively as data
- rather than, say, as lots of if statements. The reason is
- that the _c_o_m_p_l_e_x_i_t_y of the job at hand, if it is due to a
- combination of independent details, _c_a_n _b_e _e_n_c_o_d_e_d. A clas-
- sic example of this is parsing tables, which encode the
-
-
-
- February 21, 1989
-
-
-
-
-
- - 7 -
-
-
- grammar of a programming language in a form interpretable by
- a fixed, fairly simple piece of code. Finite state machines
- are particularly amenable to this form of attack, but almost
- any program that involves the `parsing' of some abstract
- sort of input into a sequence of some independent `actions'
- can be constructed profitably as a data-driven algorithm.
-
- Perhaps the most intriguing aspect of this kind of
- design is that the tables can sometimes be generated by
- another program - a parser generator, in the classical case.
- As a more earthy example, if an operating system is driven
- by a set of tables that connect I/O requests to the
- appropriate device drivers, the system may be `configured'
- by a program that reads a description of the particular dev-
- ices connected to the machine in question and prints the
- corresponding tables.
-
- One of the reasons data-driven programs are not common,
- at least among beginners, is the tyranny of Pascal. Pascal,
- like its creator, believes firmly in the separation of code
- and data. It therefore (at least in its original form) has
- no ability to create initialized data. This flies in the
- face of the theories of Turing and von Neumann, which define
- the basic principles of the stored-program computer. Code
- and data _a_r_e the same, or at least they can be. How else
- can you explain how a compiler works? (Functional languages
- have a similar problem with I/O.)
-
- _F_u_n_c_t_i_o_n _p_o_i_n_t_e_r_s
-
- Another result of the tyranny of Pascal is that
- beginners don't use function pointers. (You can't have
- function-valued variables in Pascal.) Using function
- pointers to encode complexity has some interesting proper-
- ties.
-
- Some of the complexity is passed to the routine pointed
- to. The routine must obey some standard protocol - it's one
- of a set of routines invoked identically - but beyond that,
- what it does is its business alone. The complexity is _d_i_s_-
- _t_r_i_b_u_t_e_d.
-
- There is this idea of a protocol, in that all functions
- used similarly must behave similarly. This makes for easy
- documentation, testing, growth and even making the program
- run distributed over a network - the protocol can be encoded
- as remote procedure calls.
-
- I argue that clear use of function pointers is the
- heart of object-oriented programming. Given a set of opera-
- tions you want to perform on data, and a set of data types
- you want to respond to those operations, the easiest way to
- put the program together is with a group of function
- pointers for each type. This, in a nutshell, defines class
-
-
-
- February 21, 1989
-
-
-
-
-
- - 8 -
-
-
- and method. The O-O languages give you more of course -
- prettier syntax, derived types and so on - but conceptually
- they provide little extra.
-
- Combining data-driven programs with function pointers
- leads to an astonishingly expressive way of working, a way
- that, in my experience, has often led to pleasant surprises.
- Even without a special O-O language, you can get 90% of the
- benefit for no extra work and be more in control of the
- result. I cannot recommend an implementation style more
- highly. All the programs I have organized this way have
- survived comfortably after much development - far better
- than with less disciplined approaches. Maybe that's it: the
- discipline it forces pays off handsomely in the long run.
-
- _I_n_c_l_u_d_e _f_i_l_e_s
-
- Simple rule: include files should never include include
- files. If instead they state (in comments or implicitly)
- what files they need to have included first, the problem of
- deciding which files to include is pushed to the user (pro-
- grammer) but in a way that's easy to handle and that, by
- construction, avoids multiple inclusions. Multiple inclu-
- sions are a bane of systems programming. It's not rare to
- have files included five or more times to compile a single C
- source file. The Unix /usr/include/sys stuff is terrible
- this way.
-
- There's a little dance involving #ifdef's that can
- prevent a file being read twice, but it's usually done wrong
- in practice - the #ifdef's are in the file itself, not the
- file that includes it. The result is often thousands of
- needless lines of code passing through the lexical analyzer,
- which is (in good compilers) the most expensive phase.
-
- Just follow the simple rule.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- February 21, 1989
-
-
-