Program CODEBOOK.BAS 1.1, 21 December 1990, by Jim Groeneveld.
PURPOSE
-------
This program unformats a fixed-format ASCII data file for STATGRAPHICS, using
a codebook file created by the user. STATGRAPHICS cannot read ASCII data files
with records longer than 640 bytes. Moreover, preparing it to read a suitable
ASCII file is time consuming and user-unfriendly: the user has to create a
vector describing the position of each variable in the data file. This means a
lot of initial arithmetic, and errors are difficult to correct. Besides, all
fields are read: the vector only specifies the widths of fields that directly
follow one another, and the sum of all widths cannot exceed 640. Using the
standard STATGRAPHICS options in this case is therefore clumsy, time consuming
and error-prone. CODEBOOK was written to replace them.
DESCRIPTION
-----------
CODEBOOK transforms any fixed-format ASCII file with one record per case, of
any (unlimited) length, into multiple Blank or Comma delimited (or, if
desired, Fixed formatted or Report) data files, each containing a
user-specified number of variables. Optionally the first row of each file
holds variable names acceptable to STATGRAPHICS (or to other programs, e.g.
Lotus). All the necessary preparation can be done in any editor by creating a
codebook file that gives, for each variable to be transformed, among other
things the field width, the starting and ending columns and the variable name.
The resulting data files may then be read one after another into STATGRAPHICS
(or used with any other appropriate program). Completely blank fields (mostly
representing missing values) in the original ASCII file may be replaced
automatically by any user-specified numerical or character value in the
output data files. This makes reading ASCII data files much more efficient,
less error-prone, much quicker, and easier to follow and check.
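The transformation described above can be illustrated with a small Python
sketch. It is not the program's own BASIC code. The function name
extract_fields, the sample record, the missing value "-9" and the stripping of
surrounding blanks are all assumptions made for the example.

```python
def extract_fields(record, specs, missing="-9", delimiter=","):
    """Cut fixed-width fields out of one record and join them with a delimiter.

    specs holds (start, end) column pairs, 1-based and inclusive.
    A field that is entirely blank is replaced by `missing`.
    """
    values = []
    for start, end in specs:
        field = record[start - 1:end]      # convert 1-based columns to a slice
        stripped = field.strip()           # assumption: blanks around a value are dropped
        values.append(stripped if stripped else missing)
    return delimiter.join(values)


# Hypothetical record: columns 1-3, 4-6, 7-8 (blank) and 9.
record = "001 23  M"
specs = [(1, 3), (4, 6), (7, 8), (9, 9)]
print(extract_fields(record, specs))
```

The blank field in columns 7-8 comes out as the chosen missing value, so the
delimited output line never contains an empty position.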
(Unique) limit specifications:
------------------------------
1) maximum number of variables per single output file: 32767, default: 58
2) if too little space was reserved initially when specifying the maximum
   number of variables: optional automatic (but rather slow) adaptation to the
   number of variables (array elements) actually read from the codebook file
3) maximum input record length (= number of columns per line): unlimited
4) maximum column specification (interpreted bytes/record):
   32767*255-1 = 8,355,584 (in practice limited only by the memory available
   in BASIC)
5) number of cases: unlimited (only hardware and software (BASIC) restrictions)
By default the resulting unformatted data files contain at most 58 variables
each, because of the STATGRAPHICS limits of 640 bytes per line and
10-character variable names separated by delimiters. STATGRAPHICS does,
however, allow up to 64 variables to be edited at the same time within its
data editor, and even more (?) within a STATGRAPHICS data file.
The output records (generally unformatted values) are preceded by a first line
with variable names. As many blank or comma delimited (or fixed formatted or
report) output files are generated as are needed to hold all variables, given
the number of variables per output file. Each is named automatically after the
original formatted data file, with its sequence number as the extension. All
output files can be read by STATGRAPHICS.
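The default of 58 appears to follow from the 640-byte line limit and names of
10 characters plus one delimiter each. The Python sketch below reproduces that
arithmetic and the naming of the output files. The function names, the sample
file name SURVEY.DAT and the exact form of the extension (a plain number
without leading zeros) are assumptions, not taken from the program.

```python
import math
import os

MAX_LINE_LENGTH = 640   # STATGRAPHICS line length limit, from the text
NAME_WIDTH = 10         # longest variable name, from the text
DELIMITER_WIDTH = 1     # assumption: one blank or comma after each name


def variables_per_file(max_line=MAX_LINE_LENGTH):
    """Number of names that fit on the first line of an output file."""
    return max_line // (NAME_WIDTH + DELIMITER_WIDTH)


def output_file_names(data_file, n_variables, per_file=None):
    """Name one output file per group of variables, numbered from 1."""
    per_file = per_file or variables_per_file()
    stem = os.path.splitext(data_file)[0]
    n_files = math.ceil(n_variables / per_file)
    return [f"{stem}.{i}" for i in range(1, n_files + 1)]


print(variables_per_file())
print(output_file_names("SURVEY.DAT", 130))
```

With 130 variables and 58 per file, three output files are needed; the last
one holds the remaining 14 variables.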
USE
---
Create a codebook file with any editor. Each line describes one variable as
follows (all data descriptors except the first column MUST be separated by
commas; only widths and columns may instead be ended with one or more spaces):
FIRST column   :─┬─ space: numeric or character variable to be output 'as is';
 (WITHOUT        ├─ "    : character variable to be output within double quotes;
  ending         ├─ '    : character variable to be output within single quotes;
  delimiter)     └─ any other character (or empty line): comment line, no action.
Missing Value  :─┬─ any value to replace originally entirely blank fields, may
 (END with       │  be a character value, optionally enclosed by double quotes;
  a COMMA)       └─ if left empty it will take the value prompted for when run;
Starting Column : integer positive value <=8,355,584 (end with comma or spaces);
Ending Column   : integer positive value <=8,355,584 (end with comma or spaces);
Field Width    :─┬─ integer positive value <=255, for double checking field
 (end with comma │  correspondence with Starting and Ending Column;
  or spaces)     └─ if 0: disable checking with Starting and Ending Column;
Variable Name  :─┬─ any character value up to 255 (!) characters, not quoted;
 (end with       └─ if omitted a default name, consisting of 'VarX' in which X is
  comma/space/      the variable number, will be generated; the Variable Names are
  EOL:CRLF)         inserted as the first line of the output file(s);
Comment         : optional, may be omitted, not interpreted.
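As an illustration of the layout above, here is a Python sketch that parses
one codebook line. It is a simplified model, not the program's BASIC code: it
accepts commas only (not spaces) as separators, does not handle commas inside
a quoted missing value, and the names VariableSpec and parse_codebook_line as
well as the sample lines are invented for the example.

```python
from typing import NamedTuple, Optional


class VariableSpec(NamedTuple):
    quote: str     # "" for output 'as is', else the quote character to use
    missing: str   # replacement for blank fields ("" = value asked for at run time)
    start: int     # starting column, 1-based
    end: int       # ending column, inclusive
    name: str      # variable name for the first output line


def parse_codebook_line(line: str, number: int) -> Optional[VariableSpec]:
    """Parse one codebook line; return None for comment or empty lines."""
    if not line or line[0] not in " \"'":
        return None
    quote = "" if line[0] == " " else line[0]
    # At most six parts: missing, start, end, width, name, comment.
    parts = [p.strip() for p in line[1:].split(",", 5)]
    parts += [""] * (5 - len(parts))       # allow an omitted variable name
    missing, start, end, width, name = parts[:5]
    start, end, width = int(start), int(end), int(width)
    # A non-zero width must agree with the starting and ending columns.
    if width != 0 and end - start + 1 != width:
        raise ValueError(f"width {width} does not match columns {start}-{end}")
    return VariableSpec(quote, missing, start, end, name or f"Var{number}")


print(parse_codebook_line(" -9,1,3,3,ID,respondent number", 1))
print(parse_codebook_line('",10,12,0', 3))
print(parse_codebook_line("* any other first character is a comment", 4))
```

The width check mirrors the double checking described for Field Width: a
mismatch is reported instead of silently reading the wrong columns.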
Column specifications need not be contiguous or sequential. Columns may be
skipped or read multiple times (as part of different variables). This makes it
possible to extract one or more data files containing any chosen subset of
variables from the original database file.
MEMORY REQUIREMENTS
-------------------
Approximate memory needed per variable to be processed:
     1 byte   VAR.TYPE$            (first column of codebook file)
  ≈  2 bytes  MISSING.VALUE$       (average 2 columns)
     4 bytes  BEGIN.COLUMN!        (single precision)
     4 bytes  END.COLUMN!          (single precision)
  ≈ 10 bytes  VARIABLE.NAME$       (common max. variable name length)
  ≈  2 bytes  VALUE$ in DATA.LINE$ (average 2 columns)
  ────────── +
  ≈ 23 bytes  (say generally no more than 25) per variable altogether.
E.g. 1000 variables require about 25 KB of data space in BASIC memory.
So the program's own algorithmic limit of 32767 variables may be out of reach
in practice because of other limitations. (Compiled BASIC may allow more
data space.) The same applies to the maximum column specification: reaching it
would require a data file with at least one line more than 8 MB long, which
would have to fit in memory entirely.
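The estimate can be worked through in a few lines of Python. The byte counts
come from the list above. The data space of 60 KB used in the last step is an
assumed figure for illustration only; the text does not state how much data
space BASIC actually offers.

```python
# Per-variable byte counts taken from the list above.
BYTES_PER_VARIABLE = {
    "VAR.TYPE$": 1,
    "MISSING.VALUE$": 2,
    "BEGIN.COLUMN!": 4,
    "END.COLUMN!": 4,
    "VARIABLE.NAME$": 10,
    "VALUE$": 2,
}
ROUNDED_BYTES_PER_VARIABLE = 25     # the rounded-up figure from the text


def estimated_bytes(n_variables, per_variable=ROUNDED_BYTES_PER_VARIABLE):
    """Data space needed for a given number of variables."""
    return n_variables * per_variable


def max_variables(data_space_bytes, per_variable=ROUNDED_BYTES_PER_VARIABLE):
    """Number of variables that fit in a given data space."""
    return data_space_bytes // per_variable


ASSUMED_DATA_SPACE = 60 * 1024      # hypothetical figure, not from the text

print(sum(BYTES_PER_VARIABLE.values()))    # bytes per variable, unrounded
print(estimated_bytes(1000))               # the 1000-variable example
print(estimated_bytes(32767))              # the algorithmic maximum
print(max_variables(ASSUMED_DATA_SPACE))   # what fits under the assumption
```

Under that assumption the memory limit is reached long before the algorithmic
limit of 32767 variables, which is the point made in the text.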
Suggestion: if limits are hit while processing a CODEBOOK file, break it into
smaller pieces (each of about the number of variables per output file) and run
CODEBOOK several times. This will not take more time in total. Beware of
duplicate file names! Rename the output files between runs if necessary.
Suggestion: if limits are hit while processing a DATABASE file, break it into
smaller pieces using COPYFIX (specify record lengths of ≤80, with CRLFs
included and synchronized) and SEPARATE (specify NO record numbers). Then
rerun CODEBOOK on each of the files generated by SEPARATE, using the matching
part of the original codebook file with adapted variables and column
specifications. Rename the generated files first: give them different file
names and no numeric extensions.
GWBASIC-LINE INPUT
------------------
In GWBASIC a LINE INPUT reads at most 255 characters from ONE line. If 255
characters are read, a following CRLF has not been encountered yet, and the
next LINE INPUT continues from the point where the previous one stopped. If
more than 255 characters still remain, again only 255 are read, leaving the
rest for the next LINE INPUT. If fewer than 255 characters remain on the SAME
line, even if only a CRLF remains, they are ALL read, INCLUDING the CRLF, but
the CRLF is NOT part of the STRING read. The following LINE INPUT then starts
with the NEXT line. If another BASIC (e.g. BASICA, according to its manual)
processes LINE INPUT differently, the behaviour of this program may be
unpredictable and erroneous, so BE AWARE of your BASIC version! The number of
characters read by a LINE INPUT statement may be changed, ONLY IF NECESSARY,
by redefining the BASIC variable MAX.LINE.INPUT.LENGTH in program line 70.
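The reading behaviour described above can be modelled in Python as follows.
This is a sketch of the rules as stated in this section, not the program's
BASIC code and not a claim about any particular BASIC interpreter. The class
LineReader and the function read_record are invented for the example.

```python
MAX_LINE_INPUT_LENGTH = 255


class LineReader:
    """Model of LINE INPUT over a text whose lines end in CRLF."""

    def __init__(self, text, limit=MAX_LINE_INPUT_LENGTH):
        self.text = text
        self.pos = 0
        self.limit = limit

    def at_eof(self):
        return self.pos >= len(self.text)

    def line_input(self):
        """Return (chunk, line_ended) for one simulated LINE INPUT."""
        end_of_line = self.text.find("\r\n", self.pos)
        if end_of_line == -1:
            end_of_line = len(self.text)
        if end_of_line - self.pos >= self.limit:
            # A full chunk is read; the CRLF has not been reached yet.
            chunk = self.text[self.pos:self.pos + self.limit]
            self.pos += self.limit
            return chunk, False
        # Fewer characters remain: read them all and consume the CRLF,
        # which does not become part of the returned string.
        chunk = self.text[self.pos:end_of_line]
        self.pos = min(end_of_line + 2, len(self.text))
        return chunk, True


def read_record(reader):
    """Join successive chunks until the end of the line is reached."""
    parts = []
    while True:
        chunk, line_ended = reader.line_input()
        parts.append(chunk)
        if line_ended:
            return "".join(parts)


reader = LineReader("A" * 600 + "\r\n" + "B" * 255 + "\r\n" + "C\r\n")
while not reader.at_eof():
    print(len(read_record(reader)))
```

Note the second record: it is exactly 255 characters long, so one further
read returning an empty string is needed just to consume its CRLF.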
NOTE: (GW)BASIC performs Garbage Collection (or House Cleaning) regularly, so
the program may appear to pause from time to time while it runs.
Centrum voor Medische Informatica TNO <Email> | | |\/|
TNO Center for Medical Informatics | GROENEVELD@CMI.TNO.NL | \_/ | | |
( CMI-TNO ) | Y. Groeneveld | GROENEVELD@CMIHP1.UUCP | Jim Groeneveld
P.O.Box 124 | Wassenaarseweg 56 | GROENEVELD@TNO.NL | Schoolweg 14
2300 AC Leiden | 2333 AL Leiden | ...@HDETNO51.BITNET | 8071 BC Nunspeet
Nederland. | (+31|0)71-181810 | Fax (+31|0)71-176382 | 03412-60413