home *** CD-ROM | disk | FTP | other *** search
- .so CW.ms
- .TL
- Notes on Programming in C
- .AU
- Rob Pike
- .SH
- Introduction
- .PP
- Kernighan and Plauger's
- .I
- The Elements of Programming Style
- .R
- was an important and rightly influential book.
- But sometimes I feel its concise rules were taken as a cookbook
- approach to good style instead of the succinct expression of a philosophy
- they were meant to be.
- If the book claims that variable names should be chosen
- meaningfully, doesn't it then follow that variables whose names are
- small essays on their use are even better?
- Isn't
- .CW MaximumValueUntilOverflow
- a better name than
- .CW maxval ?
- I don't think so.
- .PP
- What follows is a set of short essays that collectively
- encourage a philosophy of clarity in programming rather than
- giving hard rules.
- I don't expect you to agree with all of them,
- because they are opinion and opinions change with the times.
- But they've been accumulating in my head, if not on paper
- until now, for a long time, and are based on a lot of experience,
- so I hope they help you understand
- how to plan the details of a program.
- (I've yet to see a good essay on how to plan the whole thing,
- but then that's partly what this course is about.)
- If you find them idiosyncratic, fine;
- if you disagree with them, fine;
- but if they make you think about why you disagree, that's better.
- Under no circumstances should you program the way I say to
- because I say to; program the way you think expresses best
- what you're trying to accomplish in the program.
- And do so consistently and ruthlessly.
- .PP
- Your comments are welcome.
- .SH
- Issues of typography
- .PP
- A program is a sort of publication.
- It's meant to be read by the programmer,
- another programmer (perhaps yourself a few days, weeks or years later),
- and lastly a machine.
- The machine doesn't care how pretty the program is \(em
- if the program compiles, the machine's happy \(em
- but people do, and they should.
- Sometimes they care too much: pretty printers
- mechanically produce pretty output that accentuates irrelevant detail in
- the program, which is
- .B as
- sensible
- .B as
- putting all the prepositions
- .B in
- English text
- .B in
- bold font.
- Although many people think programs should look like the Algol-68 report
- (and some systems even require you to edit programs in that style),
- a clear program is not made any clearer by such presentation,
- and a bad program is only made laughable.
- .PP
- Typographic conventions consistently held
- are important to clear presentation, of course \(em
- indentation is probably the best known and most useful example \(em
- but when the ink obscures the intent, typography has taken over.
- So even if you stick with plain old typewriter-like output, be conscious
- of typographic silliness.
- Avoid decoration; for instance,
- keep comments brief and banner-free.
- Say what you want to say in the program, neatly and consistently.
- Then move on.
- .SH
- Variable names
- .PP
- Ah, variable names.
- Length is not a virtue in a name;
- clarity of expression
- .I is.
- A global variable rarely used may deserve a long name,
- .CW maxphysaddr
- say. An array index used on every line of a loop needn't be named
- any more elaborately than
- .CW i .
- Saying
- .CW index
- or
- .CW elementnumber
- is more to type (or calls upon your text editor) and obscures
- the details of the computation.
- When the variable names are huge, it's harder to see what's going on.
- This is partly a typographic issue; consider
- .P1
- for(i=0 to 100)
- array[i]=0
- .P2
- .I vs.
- .P1
- for(elementnumber=0 to 100)
- array[elementnumber]=0;
- .P2
- The problem gets worse fast with real examples.
- Indices are just notation, so treat them as such.
- .PP
- Pointers also require sensible notation.
- .CW np
- is just as mnemonic as
- .CW nodepointer
- .I if
- you consistently use a naming convention from which
- .CW np
- means ``node pointer'' is easily derived.
- More on this in the next essay.
- .PP
- As in all other aspects of readable programming, consistency is important
- in naming. If you call one variable
- .CW maxphysaddr ,
- don't call its cousin
- .CW lowestaddress .
- .PP
- Finally, I prefer minimum-length but maximum-information names, and
- then let the context fill in the rest.
- Globals, for instance, typically have little context when they are used,
- so their names need to be relatively evocative. Thus I say
- .CW maxphysaddr
- (not
- .CW MaximumPhysicalAddress )
- for a global variable, but
- .CW np
- not
- .CW NodePointer
- for a pointer locally defined and used.
- This is largely a matter of taste, but taste is relevant to clarity.
- .PP
- I eschew embedded capital letters in names; to my prose-oriented eyes,
- they are too awkward to read comfortably. They jangle like bad typography.
- .SH
- The use of pointers.
- .PP
- C is unusual in that it allows pointers to point to anything.
- Pointers are sharp tools, and like any such tool, used well
- they can be delightfully productive, but used badly they can
- do great damage (I sunk a wood chisel into my thumb
- a few days before writing this).
- Pointers have a bad reputation in academia,
- because they are considered too dangerous, dirty somehow.
- But I think they are powerful
- .I notation,
- which means they can help us express ourselves clearly.
- .PP
- Consider: When you have a pointer to an object, it is a name for exactly
- that object and no other. That sounds trivial, but look at the following
- two expressions:
- .P1
- np
- node[i]
- .P2
- The first points to a node, the second evaluates to (say) the same node.
- But the second form is an expression; it is not so simple.
- To interpret it, we must know what
- .CW node
- is, what
- .CW i
- is, and that
- .CW i
- and
- .CW node
- are related by the (probably unspecified) rules
- of the surrounding program.
- Nothing about the expression in isolation can show that
- .CW i
- is a valid index of
- .CW node ,
- let alone the index of the element we want.
- If
- .CW i
- and
- .CW j
- and
- .CW k
- are all indices into the node array, it's very easy to slip up,
- and the compiler cannot help.
- It's particularly easy to make mistakes when passing things to subroutines:
- a pointer is a single thing; an array and an index must be believed
- to belong together in the receiving subroutine.
- .PP
- An expression that evaluates to an object is inherently more subtle
- and error-prone than the address of that object.
- Correct use of pointers can simplify code:
- .P1
- parent->link[i].type
- .P2
- .I vs.
- .P1
- lp->type.
- .P2
- If we want the next element's type, it's
- .P1
- parent->link[++i].type
- .P2
- or
- .P1
- (++lp)->type.
- .P2
- .CW i
- advances but the rest of the expression must stay constant;
- with pointers, there's only one thing to advance.
- .PP
- Typographic considerations enter here, too.
- Stepping through structures using pointers can be much
- easier to read than with expressions: less ink is needed and
- less effort is expended by the compiler and computer.
- A related issue is that the type of the pointer affects how it
- can be used correctly, which allows some helpful compile-time error
- checking that array indices cannot share.
- Also, if the objects are structures, their tag fields are
- reminders of their type, so
- .P1
- np->left
- .P2
- is sufficiently evocative; if an array is being indexed the array
- will have some well-chosen name and the expression will end up
- longer:
- .P1
- node[i].left.
- .P2
- Again, the extra characters become more irritating as the examples
- become larger.
- .PP
- As a rule, if you find code containing many similar, complex expressions
- that evaluate to elements of a data structure, judicious use of
- pointers can clear things up.
- Consider what
- .P1
- if(goleft)
- p->left=p->right->left;
- else
- p->right=p->left->right;
- .P2
- would look like using a compound expression for
- .CW p .
- Sometimes it's worth a temporary variable (here
- .CW p )
- or a macro to distill the calculation.
- .SH
- Procedure names
- .PP
- Procedure names should reflect what they do; function names should
- reflect what they
- .I return.
- Functions are used in expressions, often in things like
- .CW if 's,
- so they need to read appropriately.
- .P1
- if(checksize(x))
- .P2
- is unhelpful because we can't deduce whether checksize returns true on
- error or non-error; instead
- .P1
- if(validsize(x))
- .P2
- makes the point clear and makes a future mistake in using the routine
- less likely.
- .SH
- Comments
- .PP
- A delicate matter, requiring taste and judgement.
- I tend to err on the side of eliminating comments, for several reasons.
- First, if the code is clear, and uses good type names and variable names,
- it should explain itself.
- Second, comments aren't checked by the compiler,
- so there is no guarantee they're right,
- especially after the code is modified.
- A misleading comment can be very confusing.
- Third, the issue of typography: comments clutter code.
- .PP
- But I do comment sometimes.
- Almost exclusively, I use them as an introduction to what follows.
- Examples: explaining the use of global variables and types
- (the one thing I always comment in large programs);
- as an introduction to an unusual or critical procedure;
- or to mark off sections of a large computation.
- .PP
- There is a famously bad comment style:
- .P1
- i=i+1; /* Add one to i */
- .P2
- and there are worse ways to do it:
- .P1
- /**********************************
- * *
- * Add one to i *
- * *
- **********************************/
-
- i=i+1;
- .P2
- Don't laugh now, wait until you see it in real life.
- .PP
- Avoid cute typography in comments,
- avoid big blocks of comments except perhaps before vital
- sections like the declaration of the central data structure
- (comments on data are usually much more helpful than
- on algorithms);
- basically, avoid comments.
- If your code needs a comment to be understood, it would be better
- to rewrite it so it's easier to understand.
- Which brings us to
- .SH
- Complexity
- .PP
- Most programs are too complicated \(em that is,
- more complex than they need to be to solve their problems efficiently.
- Why?
- Mostly it's because of bad design,
- but I will skip that issue here because it's a big one.
- But programs are often complicated
- at the microscopic level,
- and that is something I can address here.
- .PP
- Rule 1.
- You can't tell where a program is going to spend its time.
- Bottlenecks occur in surprising places, so don't
- try to second guess and put in a speed hack until you've
- proven that's where the bottleneck is.
- .PP
- Rule 2.
- Measure.
- Don't tune for speed until you've measured,
- and even then don't unless one part of the code
- .I overwhelms
- the rest.
- .PP
- Rule 3.
- Fancy algorithms are slow when
- .I n
- is small, and
- .I n
- is usually small.
- Fancy algorithms have big constants.
- Until you know that
- .I n
- is frequently going to be big,
- don't get fancy.
- (Even if
- .I n
- does get big, use Rule 2 first.)
- For example, binary trees are always faster than splay trees for
- workaday problems.
- .PP
- Rule 4.
- Fancy algorithms are buggier than simple ones,
- and they're much harder to implement.
- Use simple algorithms as well as simple data structures.
- .PP
- The following data structures are a complete list for almost
- all practical programs:
- .DS
- array
- linked list
- hash table
- binary tree
- .DE
- Of course, you must also be prepared to collect these into compound
- data structures.
- For instance, a symbol table might be implemented as a hash table
- containing linked lists of arrays of characters.
- .PP
- Rule 5.
- Data dominates.
- If you've chosen the right data structures and organized things well,
- the algorithms will almost always be self-evident.
- Data structures, not algorithms, are central to programming.
- (See Brooks p. 102.)
- .PP
- Rule 6.
- There is no Rule 6.
- .SH
- Programming with data.
- .PP
- Algorithms, or details of algorithms,
- can often be encoded compactly, efficiently and expressively as data
- rather than, say, as lots of
- .CW if
- statements.
- The reason is that the
- .I complexity
- of the job at hand, if it is due to a combination of
- independent details,
- .I
- can be encoded.
- .R
- A classic example of this is parsing tables,
- which encode the grammar of a programming language
- in a form interpretable by a fixed, fairly simple
- piece of code.
- Finite state machines are particularly amenable to this
- form of attack, but almost any program that involves
- the `parsing' of some abstract sort of input into a sequence
- of some independent `actions' can be constructed profitably
- as a data-driven algorithm.
- .PP
- Perhaps the most intriguing aspect of this kind of design
- is that the tables can sometimes be generated by another
- program \(em a parser generator, in the classical case.
- As a more earthy example, if an operating system is driven
- by a set of tables that connect I/O requests to the appropriate
- device drivers, the system may be `configured'
- by a program that reads a description of the particular
- devices connected to the machine in question and prints
- the corresponding tables.
- .PP
- One of the reasons data-driven programs are not common, at least
- among beginners, is the tyranny of Pascal.
- Pascal, like its creator, believes firmly in the separation
- of code and data.
- It therefore (at least in its original form) has no ability to
- create initialized data.
- This flies in the face of the theories of Turing and von Neumann,
- which define the basic principles of the stored-program computer.
- Code and data
- .I are
- the same, or at least they can be.
- How else can you explain how a compiler works?
- (Functional languages have a similar problem with I/O.)
- .SH
- Function pointers
- .PP
- Another result of the tyranny of Pascal is that beginners don't use
- function pointers.
- (You can't have function-valued variables in Pascal.)
- Using function pointers to encode complexity has some interesting
- properties.
- .PP
- Some of the complexity is passed to the routine pointed to.
- The routine must obey some standard protocol \(em it's one of a set
- of routines invoked identically \(em but beyond that, what it
- does is its business alone.
- The complexity is
- .I distributed.
- .PP
- There is this idea of a protocol, in that all functions used similarly
- must behave similarly. This makes for easy documentation, testing,
- growth and even making the program run distributed over a network \(em
- the protocol can be encoded as remote procedure calls.
- .PP
- I argue that clear use of function pointers is the heart of object-oriented
- programming. Given a set of operations you want to perform on data,
- and a set of data types you want to respond to those operations,
- the easiest way to put the program together is with a group of function
- pointers for each type.
- This, in a nutshell, defines class and method.
- The O-O languages give you more of course \(em prettier syntax,
- derived types and so on \(em but conceptually they provide little extra.
- .PP
- Combining data-driven programs with function pointers leads to an
- astonishingly expressive way of working, a way that, in my experience,
- has often led to pleasant surprises. Even without a special O-O
- language, you can get 90% of the benefit for no extra work and be
- more in control of the result.
- I cannot recommend an implementation style more highly.
- All the programs I have organized this way have survived comfortably
- after much development \(em far better than with less disciplined
- approaches.
- Maybe that's it: the discipline it forces pays off handsomely in the long run.
- .SH
- Include files
- .PP
- Simple rule: include files should never include include files.
- If instead they state (in comments or implicitly) what files they need to have
- included first, the problem of deciding which files to include
- is pushed to the user (programmer) but in a way that's easy to handle
- and that, by construction, avoids multiple inclusions.
- Multiple inclusions are a bane of systems programming.
- It's not rare to have files included five or more times to
- compile a single C source file.
- The Unix
- .CW /usr/include/sys
- stuff is terrible this way.
- .PP
- There's a little dance involving
- .CW #ifdef 's
- that can prevent a file being read twice, but it's usually done
- wrong in practice \(em the
- .CW #ifdef 's
- are in the file itself, not the file that includes it.
- The result is often thousands of needless lines of code
- passing through the lexical analyzer, which is (in good compilers)
- the most expensive phase.
- .PP
- Just follow the simple rule.
-