home *** CD-ROM | disk | FTP | other *** search
- Spreadsheets As An Intermediate Step In Data Analysis
-
- by Willie Lockeretz
-
- When I first started using a Kaypro, I didn't expect to have much
- use for a spreadsheet. The little I knew about this type of program was
- gleaned from advertisements, which always featured things like financial
- planning, depreciation schedules, and profit and loss projections. As
- someone who knows nothing -- and cares even less -- about Presentation
- Graphics, Bottom Lines, or Corner Offices, I didn't see that there could
- be much in it for me. It seemed that the main shtick was that this or
- that spreadsheet could put me on the Fast Track, in turn letting me get
- Up the Corporate Ladder, at the top of which, presumably, I would find
- the Key to the Executive Bathroom.
-
- Well I was wrong. Nowadays I probably get more use out of
- spreadsheets than any other single category of software. But first I
- needed some consciousness-raising. For starters, this meant
- acknowledging that no matter what the numbers represent, a tabular array
- of numbers is a tabular array of numbers, and a program that lets you
- manipulate such an array quickly and conveniently could be very handy
- even if the numbers didn't happen to have dollar signs in front of them.
- Second, I realized that the manipulations you might perform with such a
- program could be very different from the operations the
- dressed-for-success types in the ads were doing. In fact, spreadsheets
- could be useful even if you didn't do any manipulations with them at
- all.
-
- That was the real breakthrough. Spreadsheets might be great for
- calculating net present value or internal rate of return, but in my work
- I do statistical computations that would be impossible on a spreadsheet,
- things like multiple regressions and factor analysis. For this I need a
- scientific statistics package. But rather than ruling out spreadsheets,
- I started using them to prepare the data for a statistics program that
- would do the real analysis. In other words, the fact that a spreadsheet
- can't do the kinds of calculations I need is not important -- I don't
- depend on it for calculations. But as an intermediary between raw data
- and statistical analysis, a spreadsheet has simplified my life
- immensely. I happen to favor SuperCalc2, which works very nicely on a
- CP/M Kaypro, but the basic idea would hold for any reasonably powerful
- spreadsheet.
-
- I often work with moderately large arrays of demographic, economic,
- and agricultural data -- perhaps 12 numbers for each of 150 counties,
- say. I use SuperCalc2 for three purposes. First, it's a convenient way
- to enter the data. Second, it can help spot large errors (like an extra
- zero after a number). Third, it's a convenient way to compute
- additional variables from the raw data. After it has done all that, the
- statistics program is ready to take over. A statistics package could
- have done the first three tasks, too, but most don't do them as
- conveniently. This added convenience should not entail any new
- problems. Any good statistics program and any good spreadsheet should
- have at least one format in common that permits them to be linked this
- way. That is, the spreadsheet should be able to write a file that the
- statistics program can use as input.
-
- I begin an analysis with a blank spreadsheet that has the right
- column and row labels, but with a dashed line where each number is to
- go. Printed out, this is a very efficient form on which to copy numbers
- from a source that you can't take back to the office, such as the
- Census. (A laptop would eliminate the needed for writing the numbers
- down at all, of course.) When you enter the data into the computer, the
- spreadsheet on the screen corresponds in every visual detail to the
- sheet from which you are reading the numbers -- the rows and columns are
- in the right order, the headings are identical, the column widths are
- the same, and so forth. This helps eliminate a very common error --
- your eye skipping to the wrong line.
-
- Checking for order-of-magnitude errors is easily done by asking
- the spreadsheet to report the maximum and minimum value in a row or
- column. For some kinds of data, values that are way out of line are
- obviously wrong. The average age in a county is not likely to be 272 --
- 27.2 is more like it. But for some variables an extreme value is not
- necessarily an error. In that case, the trick is to construct a new
- variable whose range is more restricted than the original one. For
- example, the population of towns and cities might range from a few
- hundred to several million. But if you are using population data from
- two different census years, the spreadsheet can quickly compute the
- percentage change. If you see a change of +923% in a decade, there is a
- pretty good chance that a digit was repeated or a decimal misplaced --
- exactly the kind of error that is most likely to occur with manual data
- entry. If you are clever in concocting new variables you should be able
- to detect every order-of-magnitude error. It's not a substitute for
- comparing each number to the source, of course. But that can be very
- monotonous and wearying, which means an error can slip past. It's nice
- to have a second line of defense.
-
- Finally, most data analysis will require you to construct new
- variables from the raw data (I'm talking now about quantities that are
- of interest in their own right, not just as a way of detecting errors).
- I prefer doing this on the spreadsheet rather than with the statistics
- package because I can see the results before they get irrevocably stored
- with all the other data. A statistics package is a black box: you put
- in the data at one end, and you get results out the other. But you
- don't see what goes on in between. I like to keep an eye on the data as
- much as possible. With a spreadsheet you can inspect the newly created
- variables, get familiar with them, fondle them, do whatever you want
- with them (in private, and with their consent, of course). I prefer to
- relinquish control to the statistics package only after the spreadsheet
- has done everything it can.
-
- In short, I use a statistics package for the things that only a
- statistics package can do. But a spreadsheet is a much more practical
- and pleasant way to do the tedious but necessary preliminaries: to enter
- the raw numbers, check them, and carry out first-level processing. As
- soon as I started using SuperCalc2 this way, I found that it was a true
- anomaly in the world of commercial personal computer packages -- a
- program suitable for an even wider range of applications than its
- advertising claimed. It also is a true anomaly in having earned the
- highest tribute I have ever bestowed on anything in my overflowing box
- of non-bundled software: I own it legally.
-
- -- from The Boston Kugel, published by The Boston Kaypro Users Group
- (BOSKUG) of the Boston Computer Society.