<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
<head>
<title>
Kermit and Comma-Separated-Value Files
</title>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<META http-equiv="Content-Style-Type" content="text/css">
<LINK REL=STYLESHEET TYPE="text/css" HREF="kermit.css">
<LINK REL="shortcut icon" href="favicon.ico" >
<style type="text/css">
ul li { padding-bottom:9;padding-right:64;line-height:12.5pt; }
h2, h3 { font-family: sans-serif }
h3 { border-top: 2px solid #999999 }
ul.contents li { line-height:10pt; font-family:sans-serif }
tt,pre { font-size:11pt }
dl.loose dd { padding:0 0 8 0 }
blockquote.example { margin-top:6; margin-bottom:8 }
body { color:black; background:white; margin:0; font-size:12pt }
</style>
</head>
<body>
<table cellpadding=0 cellspacing=0 width="100%"
style="border:1px solid darkmagenta; background-color:white;">
<tr style="background-image:url('lb3.jpg');">
<td style="padding:8 8 8 8">
<a style="text-decoration:none"
href="http://www.columbia.edu"><img
border=0
alt="The Columbia Crown"
title="The Columbia Crown (crown of King George II)"
height=105
src="crownico.gif"></a>
<td align="left" style="padding-top:23">
<tt style="font-size:24pt"><b>The Kermit Project</b></tt> |
<span style="font-family:Arial,times; font-size:18pt"><i>Columbia
University</i></span>
<br><span style="font-family:Arial,times; font-size:14pt">
612 West 115th Street, New York NY 10025 USA •
<a href="mailto:kermit@columbia.edu">kermit@columbia.edu</a>
</span>
<table width="100%">
<tr>
<td style="font-size:12pt; font-style:italic">…since
<small>1981</small></div>
</table>
<tr>
<td colspan=2 style="padding:0">
<table class=menu cellpadding=0 cellspacing=0
width="100%" style="border-top:1px solid darkmagenta">
<tr>
<td onClick="document.location.href='index.html';"
title="Kermit Project Home Page"
style="cursor:pointer"><a href="index.html">Home</a>
<td onClick="document.location.href='k95.html';"
title="Kermit 95 for Windows"
style="cursor:pointer"><a href="k95.html">Kermit 95</a>
<td class=this onClick="document.location.href='ckermit.html';"
title="C-Kermit for Unix and VMS"
style="cursor:pointer"><a href="ckermit.html">C-Kermit</a>
<td onClick="document.location.href='ckscripts.html';"
title="Kermit Script Language and Tutorial"
style="cursor:pointer"><a href="ckscripts.html">Scripts</a>
<td onClick="document.location.href='current.html';"
title="Current Versions of Kermit Software"
style="cursor:pointer"><a href="current.html">Current</a>
<td onClick="document.location.href='new.html';"
title="What's New"
style="cursor:pointer"><a href="new.html">New</a>
<td onClick="document.location.href='faq.html';"
title="Frequently Asked Questions"
style="cursor:pointer"><a href="faq.html">FAQ</a>
<td onClick="document.location.href='support.html';"
title="Kermit Software Support"
style="border-right:0; cursor:pointer"><a href="support.html">Support</a>
</table>
</table>
<div style="margin:8; font-family:calibri,sans-serif,times">
<div style="margin:8">
<div align="right" style="font-weight:bold; font-size:11pt">
<span class=qq><a title="C-Kermit 9.0"
href="ck90.html">C-Kermit 9.0</a></span>
</div>
<h2 style="margin-top:8">C-Kermit 9.0 and Comma-Separated-Value Files</h2>
<blockquote>
Frank da Cruz<br>
The Kermit Project<br>
Columbia University<br>
<i>Last update:</i>
Fri Aug 20 14:02:34 2010
</blockquote>
<p>
<div class=dm style="background:#eeeeee;border:1px solid plum; padding:0">
<ul style="font-size:14pt">
<li><a href="#record"><b>Reading a CSV or TSV Record and Converting it
to an Array</b></a>
<li><a href="#join"><b>Using \fjoin() to create a
Comma- or Tab-Separated Value List from an Array</b></a>
<li><a href="#file"><b>Using CSV or TSV Files</b></a>
</ul>
</div>
<p>
Comma-Separated Value (CSV) format is commonly output by spreadsheets and
databases when exporting data into plain-text files for import into other
applications. It is important to understand the formal definition of a CSV
file.
<ol>
<li>Each record is a series of fields.
<li>Records are in whatever format is used by the underlying file system
for lines of text.
<li>Fields within records are separated by commas, with zero or more
whitespace characters (space or tab) before and/or after the comma.
<li>Fields with embedded commas are enclosed in ASCII doublequote characters.
<li>Fields with leading or trailing spaces are enclosed in ASCII doublequotes.
<li>Fields with embedded doublequotes are enclosed in doublequotes and each
interior doublequote is doubled.
</ol>
Here is an example:
<blockquote>
<pre>
aaa, bbb, has spaces,,"ddd,eee,fff", " has spaces ","Muhammad ""The Greatest"" Ali"
</pre>
</blockquote>
The first two are regular fields. The third is a field that has an
embedded space but in which any leading or trailing spaces are to be
ignored. The fourth is an empty field, but still a field. The fifth is a
field that contains embedded commas. The sixth has leading and trailing
spaces. The last field has embedded quotation marks.
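<p>
The six rules can be checked against the example record with a short
program. What follows is a minimal Python sketch, not Kermit code and not
Kermit's implementation; the function name <tt>split_csv</tt> is
hypothetical, and dropping any stray text that follows a closing doublequote
is an assumption made here for simplicity.

```python
def split_csv(record, sep=","):
    """Split one CSV record into a list of fields (illustrative sketch)."""
    fields = []
    i, n = 0, len(record)
    while True:
        # Whitespace before a field belongs to the separator: skip it.
        while i < n and record[i] in " \t" and record[i] != sep:
            i += 1
        if i < n and record[i] == '"':
            # Quoted field: keep every character, un-double interior quotes.
            i += 1
            chars = []
            while i < n:
                if record[i] == '"':
                    if i + 1 < n and record[i + 1] == '"':
                        chars.append('"')
                        i += 2
                        continue
                    i += 1          # closing doublequote
                    break
                chars.append(record[i])
                i += 1
            field = "".join(chars)
            # Assumption: anything between the closing quote and the
            # next separator (normally just whitespace) is discarded.
            while i < n and record[i] != sep:
                i += 1
        else:
            # Unquoted field: runs to the next separator, trailing
            # whitespace trimmed.
            start = i
            while i < n and record[i] != sep:
                i += 1
            field = record[start:i].rstrip(" \t")
        fields.append(field)
        if i >= n:
            break
        i += 1                      # step over the separator
    return fields


example = 'aaa, bbb, has spaces,,"ddd,eee,fff", " has spaces ","Muhammad ""The Greatest"" Ali"'
for number, field in enumerate(split_csv(example), start=1):
    print(f"{number}. [{field}]")
```

<p>
Run on the example record, the sketch reports seven fields, with the fourth
empty and the sixth keeping its leading and trailing spaces.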
<p>
Prior to C-Kermit 9.0 Alpha.06, C-Kermit did not handle CSV files according
to the specification above. Most seriously, there was no provision for a
separator to be surrounded by whitespace that was to be considered part of
the separator. Also there was no provision for quoting doublequotes inside
of a quoted string.
<h3 id=record>Reading a CSV record</h3>
Now the <tt>\fsplit()</tt> function can handle any CSV-format string if you
specify the symbolic include set "CSV" as the 4th parameter.
To illustrate, this program:
<blockquote>
<pre>
def xx {
    echo [\fcontents(\%1)]
    .\%9 := \fsplit(\fcontents(\%1), &a, \44, CSV)
    for \%i 1 \%9 1 { echo "\flpad(\%i,3). [\&a[\%i]]" }
    echo "-----------"
}
xx {a,b,c}
xx { a , b , c }
xx { aaa,,ccc," with spaces ",zzz }
xx { "1","2","3","","5" }
xx { this is a single field }
xx { this is one field, " and this is another " }
xx { name,"Mohammad ""The Greatest"" Ali", age, 67 }
exit
</pre>
</blockquote>
gives the following results:
<blockquote>
<pre>
[a,b,c]
1. [a]
2. [b]
3. [c]
-----------
[ a , b , c ]
1. [a]
2. [b]
3. [c]
-----------
[ aaa,,ccc," with spaces ",zzz ]
1. [aaa]
2. []
3. [ccc]
4. [ with spaces ]
5. [zzz]
-----------
[ "1","2","3","","5" ]
1. [1]
2. [2]
3. [3]
4. []
5. [5]
6. [] <span style="color:red">← Oops, this was fixed in Alpha.07</span>
-----------
[ this is a single field ]
1. [this is a single field]
-----------
[ this is one field, " and this is another " ]
1. [this is one field]
2. [ and this is another ]
3. [] <span style="color:red">← Ditto</span>
-----------
[ name,"Mohammad ""The Greatest"" Ali", age, 67 ]
1. [name]
2. [Mohammad "The Greatest" Ali]
3. [age]
4. [67]
-----------
</pre>
</blockquote>
The separator <tt>\44</tt> (comma) must still be specified as the break set
(3rd <tt>\fsplit()</tt> parameter). When “CSV” is specified as
the include set:
<ul>
<li>The Grouping Mask is automatically set to 1 (which specifies that the
ASCII doublequote character (<tt>"</tt>) is used for grouping);
<li>The Separator Flag is automatically set to 1 so that adjacent field
separators will not be collapsed;
<li>All bytes (values 0 through 255) other than the break character are
added to the include set;
<li>Any leading whitespace is stripped from the first element unless it is
enclosed in doublequotes;
<li>Any trailing whitespace is trimmed from the end of the last element
unless it is enclosed in doublequotes;
<li>If the separator character has any spaces or tabs preceding it or
following it, they are ignored and discarded;
<li>The separator character is treated as an ordinary data character if
it appears in a quoted field;
<li>A sequence of two doublequote characters (<tt>""</tt>) within a quoted
field is converted to a single doublequote.
</ul>
There is also a new TSV symbolic include set, which is like CSV except
without the quoting rules or the stripping of whitespace around the
separator because, by definition, TSV fields do not contain tabs.
<p>
Of course you can specify any separator(s) you want with any of the CSV,
TSV, or ALL symbolic include sets. For example, if you have a TSV file in
which you want the spaces around each Tab to be discarded, you can use:
<blockquote>
<pre>
\fsplit(<i>variable</i>, &a, \9, CSV)
</pre>
</blockquote>
<tt>\9</tt> is Tab.
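<p>
The difference between the two treatments of a tab-separated record can be
pictured with a small Python sketch (an illustration only, not Kermit code;
the sample record is made up and quoting is ignored here):

```python
record = "alpha \t beta\t gamma"

# TSV-style: split on tabs only; surrounding spaces stay part of the fields.
tsv_fields = record.split("\t")

# CSV-style with Tab as the separator: spaces around each tab are discarded.
csv_style_fields = [field.strip(" ") for field in record.split("\t")]

print(tsv_fields)        # ['alpha ', ' beta', ' gamma']
print(csv_style_fields)  # ['alpha', 'beta', 'gamma']
```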
<p>
The new symbolic include sets can also be used with <tt>\fword()</tt>, which
is just like <tt>\fsplit()</tt> except that it retrieves the
<i>n</i><small>th</small> word from the argument string, rather than an
array of all the words. In C-Kermit you can get information about these or
any other functions with the <small>HELP FUNCTION</small> command, e.g.:
<blockquote>
<pre>
C-Kermit> <u>help func word</u>
\fword(s1,n1,s2,s3,n2,n3) - Extract word from string.
s1 = source string
n1 = word number (1-based) counting from left; if negative, from right.
s2 = optional break set.
s3 = optional include set.
n2 = optional grouping mask.
n3 = optional separator flag:
0 = collapse adjacent separators
1 = don't collapse adjacent separators.
Default break set is all characters except ASCII letters and digits.
ASCII (C0) control characters are treated as break characters by default.
Default include set is null. Three special symbolic include sets are also
allowed: ALL (bytes not in the break set), CSV (special treatment
for Comma-Separated-Value records), and TSV (special treatment for Tab-
Separated-Value records).
If grouping mask given and nonzero, words can be grouped by quotes or
brackets selected by the sum of the following:
1 = doublequotes: "a b c"
2 = braces: {a b c}
4 = apostrophes: 'a b c'
8 = parentheses: (a b c)
16 = square brackets: [a b c]
32 = angle brackets: &lt;a b c&gt;
Nesting is possible with {}()[]&lt;&gt; but not with quotes or apostrophes.
Returns string:
Word number n, if there is one, otherwise an empty string.
C-Kermit>
</pre>
</blockquote>
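<p>
The word number and separator flag described in the help text can be
modeled roughly as follows. This is a simplified Python sketch, not Kermit
code: the function name <tt>word</tt> and its keyword arguments are
hypothetical, the include set and grouping mask are not modeled, and the
handling of leading and trailing separators is an assumption.

```python
import re


def word(source, n, break_set=None, collapse=True):
    """Return word number n (1-based; negative counts from the right)."""
    if break_set is None:
        # Default break set: everything except ASCII letters and digits.
        pattern = "[^A-Za-z0-9]"
    else:
        pattern = "[" + re.escape(break_set) + "]"
    if collapse:
        pattern += "+"              # a run of separators counts as one
    words = re.split(pattern, source)
    if collapse:
        # Drop the empty strings left by leading or trailing separators.
        words = [w for w in words if w]
    if n == 0 or abs(n) > len(words):
        return ""                   # no such word: empty string
    return words[n - 1] if n > 0 else words[n]


print(word("one two  three", 2))              # two
print(word("one two  three", -1))             # three
print(word("a,,c", 2, ","))                   # c
print(repr(word("a,,c", 2, ",", collapse=False)))  # ''
```

<p>
The last two calls show the effect of the separator flag: with adjacent
separators collapsed the second word is <tt>c</tt>; without collapsing it is
the empty field between the two commas.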
<h3 id=join>Using <tt>\fjoin()</tt> to create
Comma- or Tab-Separated Value Lists from Arrays</h3>
In Alpha.08 of C-Kermit 9.0, <tt>\fsplit()</tt>'s inverse function, <a
href="ckermit80.html#fjoin"><tt>\fjoin()</tt></a>, received the capability of
converting an array into a comma-separated or a tab-separated value list.
Thus, given a CSV record, if you split it into an array with
<tt>\fsplit()</tt> and then join the array with <tt>\fjoin()</tt>, giving
each function the new CSV parameter in the appropriate argument position,
the result will be equivalent to the original, according to the CSV
definition. It might not be identical, because any extraneous spaces before
or after the separating commas in the original are discarded, but that does
not affect the elements themselves. The new syntax for <tt>\fjoin()</tt> is:
<p>
<dl>
<dt><b><tt>\fjoin(&a,CSV)</tt></b>
<dd>Given the array <tt>\&a[]</tt> or any other valid array designator,
joins its elements into a comma-separated list according to the
rules listed above.
<p>
<dt><b><tt>\fjoin(&a,TSV)</tt></b>
<dd>Joins the elements of the given array into a tab-separated list, also
described above.
</dl>
<p>
<a href="ckermit80.html#fjoin">Previous calling conventions for
<tt>\fjoin()</tt></a> are undisturbed, including the ability to specify a
portion of an array, rather than the whole array:
<p>
<blockquote>
<pre>
declare \&a[] = 1 2 3 4 5 6 7 8 9
echo \fjoin(&a[3:7],CSV)
3,4,5,6,7
</pre>
</blockquote>
<p>
Using <tt>\fsplit()</tt> and <tt>\fjoin()</tt>, it is now possible to convert
a comma-separated value list into a tab-separated value list, and vice versa
(which is not a simple matter of swapping commas and tabs).
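<p>
To see why the conversion is more than a character swap, here is a minimal
Python sketch of the joining step (not Kermit code). The names
<tt>join_csv</tt> and <tt>join_tsv</tt> are hypothetical; quoting exactly
the fields covered by rules 4 through 6 of the CSV definition, and rejecting
fields that contain a tab, are assumptions rather than a description of what
<tt>\fjoin()</tt> does internally.

```python
def join_csv(fields):
    """Join fields into one CSV record, quoting only where the rules need it."""
    out = []
    for field in fields:
        needs_quotes = (
            "," in field                      # embedded comma
            or '"' in field                   # embedded doublequote
            or field != field.strip(" \t")    # leading or trailing whitespace
        )
        if needs_quotes:
            field = '"' + field.replace('"', '""') + '"'
        out.append(field)
    return ",".join(out)


def join_tsv(fields):
    """Join fields with tabs; by definition a TSV field cannot contain a tab."""
    for field in fields:
        if "\t" in field:
            raise ValueError("TSV fields cannot contain tabs")
    return "\t".join(fields)


fields = ["aaa", "", "ddd,eee,fff", " has spaces ", 'Muhammad "The Greatest" Ali']
print(join_csv(fields))
print(join_tsv(fields))
```

<p>
In the CSV form the last three fields are quoted and the interior
doublequotes are doubled; in the TSV form the same fields appear unquoted
because commas and doublequotes are ordinary data there.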
<h3 id=applications>Applications for CSV Files</h3>
Databases such as MS Access or MySQL can export tables or reports in CSV
format, and then Kermit can read the resulting CSV file and do whatever you
like with it; typically something that could not be done with the database
query language itself (or that you didn't know how to do that way): create
reports or datasets based on complex criteria or procedures, edit or
modify some fields, etc., and then use <tt>\fjoin()</tt> to put each record
back in CSV form so it can be reimported into a spreadsheet or database.
<p>
Here is a simple example in which we purge all records of customers who have
two or more unpaid bills. The file is sorted in serial-number order.
<p>
<blockquote>
<pre>
#!/usr/local/bin/kermit
.filename = somefile.csv                  # Input file in CSV format
fopen /read \%c \m(filename)              # Open it
if fail exit                              # Don't go on if open failed
copy \m(filename) ./new                   # Make a copy of the file
.oldserial = 00000000000                  # Multiple records for each serial number
.zeros = 0                                # Unpaid bill counter
while true {                              # Loop
    fread /line \%c line                  # Get a record
    if fail break                         # End of file
    .n := \fsplit(\m(line),&a,\44,CSV)    # Split the fields into an array
    if not equ "\m(oldserial)" "\&a[6]" { # Have new serial number?
        # Remove all records for previous serial number
        # if two or more bills were not paid...
        if > \m(zeros) 1 {
            grep /nomatch \m(oldserial) /output:./new2 ./new
            rename ./new2 ./new
        }
        .oldserial := \&a[6]              # To detect next time serial number changes
        .zeros = 0                        # Reset unpaid bill counter
    }
    if equ "\&a[5]" "$0.00" {             # Element 5 is amount paid
        increment zeros                   # If it's zero, count it.
    }
}
fclose \%c                                # Close the input file
if > \m(zeros) 1 {                        # Same check for the last serial number
    grep /nomatch \m(oldserial) /output:./new2 ./new
    rename ./new2 ./new
}
</blockquote>
<p>
Rewriting the file multiple times is inelegant, but this is a quick and
dirty use-once-and-discard script, so elegance doesn't count.
The example is interesting in that it purges certain records based on
the contents of other records. Maybe there is a way to do this directly
with SQL, but why use SQL when you can use Kermit?
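<p>
For comparison, the same purge can be done without rewriting the file
repeatedly by counting first and filtering afterwards. The following is a
Python sketch under the same assumptions as the script above (field 6 is the
serial number, field 5 is the amount paid, and an unpaid bill reads
<tt>$0.00</tt>); the function name and its parameters are hypothetical, and
Python's <tt>csv</tt> module is not guaranteed to treat whitespace exactly as
Kermit's CSV include set does.

```python
import csv
import io
from collections import Counter


def purge_delinquent(csv_text, serial_col=5, paid_col=4, threshold=2):
    """Drop every record whose serial number has `threshold` or more unpaid bills.

    Column numbers are 0-based here, so 5 and 4 are fields 6 and 5.
    """
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]          # ignore blank lines
    # Pass 1: count unpaid bills per serial number.
    unpaid = Counter(row[serial_col] for row in rows if row[paid_col] == "$0.00")
    # Pass 2: keep only the records of customers below the threshold.
    kept = [row for row in rows if unpaid[row[serial_col]] < threshold]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(kept)
    return out.getvalue()


sample = (
    "Ann,x,x,x,$0.00,1001\n"
    "Ann,x,x,x,$0.00,1001\n"
    "Bob,x,x,x,$5.00,1002\n"
    "Bob,x,x,x,$0.00,1002\n"
)
print(purge_delinquent(sample), end="")
```

<p>
Because the counts are gathered before any record is dropped, this approach
does not depend on the file being sorted by serial number.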
<h3 id=file>Using CSV Files: Extending Kermit's Data Structures</h3>
Now that we can parse a CSV record, what would we do with a CSV <i>file</i>
– that is, a sequence of records? If we needed all the data available
at once, we would want to load it into a matrix of (row,column) values. But
Kermit doesn't have matrices. Or does it?
<p>
Kermit has several built-in data types, but you can invent your own data
types as needed using Kermit's macro feature:
<blockquote>
<pre>
define <i>variablename value</i>
</pre>
</blockquote>
For example:
<blockquote>
<pre>
define alphabet abcdefghijklmnopqrstuvwxyz
</pre>
</blockquote>
This defines a macro named <i>alphabet</i> and gives it the value
<i>abcdefghijklmnopqrstuvwxyz</i>. A more convenient notation (added in
C-Kermit 7.0) for this is:
<blockquote>
<pre>
.alphabet = abcdefghijklmnopqrstuvwxyz
</pre>
</blockquote>
The two are exactly equivalent: they make a literal copy of the "right-hand side"
as the value of the macro. Then you can refer to the macro anywhere in a
Kermit command as "<tt>\m(</tt><i>macroname</i><tt>)</tt>":
<blockquote>
<pre>
echo "Alphabet = \m(alphabet)"
</pre>
</blockquote>
There is a second way to define a macro, which is like the first except that
the right-hand side is <i>evaluated</i> first; that is, any variable
references or function calls in the right-hand side are replaced by their
values before the result is assigned to the macro. The command for this is
<small>ASSIGN</small> rather than <small>DEFINE</small>:
<blockquote>
<pre>
define alphabet abcdefghijklmnopqrstuvwxyz
assign backwards \freverse(\m(alphabet))
echo "Alphabet backwards = \m(backwards)"
</pre>
</blockquote>
which prints:
<blockquote>
<pre>
Alphabet backwards = zyxwvutsrqponmlkjihgfedcba
</pre>
</blockquote>
This kind of assignment can also be done like this:
<blockquote>
<pre>
.alphabet = abcdefghijklmnopqrstuvwxyz
.backwards := \freverse(\m(alphabet))
</pre>
</blockquote>
<a href="ckermit70.html#x7.9">Any command starting with a period is an
assignment</a>, and the operator (<tt>=</tt> or <tt>:=</tt>) tells what to
do with the right-hand side before making the assignment.
<p>
In both the <small>DEFINE</small> and <small>ASSIGN</small> commands, the
variable name itself is taken literally. It is also possible, however, to
have Kermit <i>compute</i> the variable name. This is done (as described
in <a title="Using C-Kermit, 2nd Edition"
href="http://www.amazon.com/gp/product/1555581641?ie=UTF8&tag=aleidmoreldom-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1555581641"><i>Using
C-Kermit</i></a>, 2nd Ed., p.457), using parallel
commands that start with underscore: <small>_DEFINE</small> and
<small>_ASSIGN</small> (alias <small>_ASG</small>). These are just like
<small>DEFINE</small> and <small>ASSIGN</small> except they evaluate the
variable name before making the assignment. For example:
<blockquote>
<pre>
define \%a one
_define \%a\%a\%a 111
</pre>
</blockquote>
would create a macro named <small>ONEONEONE</small> with a value of 111, and:
<blockquote>
<pre>
define \%a one
define number 111
_assign \%a\%a\%a \m(number)
</pre>
</blockquote>
would create the same macro with the same value, but:
<blockquote>
<pre>
define \%a one
define number 111
_define \%a\%a\%a \m(number)
</pre>
</blockquote>
would give the macro a value of "<tt>\m(number)</tt>".
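<p>
The four commands differ only in which side is evaluated before the
assignment is made. A toy Python model makes the pattern explicit. It is an
illustration only: it handles <tt>\m(name)</tt> references and nothing else
(so it writes <tt>\m(a)</tt> where the examples above use <tt>\%a</tt>), it
expands each reference once, and all of its function names are hypothetical.

```python
import re

macros = {}  # macro name -> definition text


def expand(text):
    # Replace each \m(name) reference with that macro's current value.
    return re.sub(r"\\m\((\w+)\)", lambda m: macros.get(m.group(1), ""), text)


def define(name, value):       # DEFINE: literal name, literal value
    macros[name] = value


def assign(name, value):       # ASSIGN: literal name, evaluated value
    macros[name] = expand(value)


def _define(name, value):      # _DEFINE: evaluated name, literal value
    macros[expand(name)] = value


def _assign(name, value):      # _ASSIGN: evaluated name, evaluated value
    macros[expand(name)] = expand(value)


define("a", "one")
define("number", "111")
_assign(r"\m(a)\m(a)\m(a)", r"\m(number)")
_define(r"\m(a)_literal", r"\m(number)")

print(macros["oneoneone"])     # 111
print(macros["one_literal"])   # \m(number)
```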
<p>
You can use the <small>_ASSIGN</small> command to create any kind of data
structure you want; you can find some examples in the
<a href="ckscripts.html#oops">Object-Oriented Programming</a> section of the
<a href="ckscripts.html">Kermit Script Library</a>. In the following
program we use this capability to create a two-dimensional array, or matrix,
to hold all the elements of the CSV file, and then to display the matrix:
<blockquote>
<pre>
fopen /read \%c data.csv                    # Open CSV file
if fail exit 1
.\%r = 0                                    # Row
.\%m = 0                                    # Maximum columns
while true {
    fread /line \%c line                    # Read a record
    if fail break                           # End of file
    .\%n := \fsplit(\m(line),&a,\44,CSV)    # Split record into items
    incr \%r                                # Count this row
    for \%i 1 \%n 1 {                       # Assign items to this row of matrix
        _asg a[\%r][\%i] \&a[\%i]
    }
    if > \%i \%m { .\%m := \%i }            # Remember width of widest row
}
fclose \%c                                  # Close CSV file
decrement \%m                               # (because of how FOR loop works)
echo MATRIX A ROWS: \%r COLUMNS: \%m        # Show the matrix
for \%i 1 \%r 1 {                           # Loop through rows
    for \%j 1 \%m 1 {                       # Loop through columns of each row
        xecho "\flpad(\m(a[\%i][\%j]),6)"
    }
    echo
}
exit 0
</pre>
</blockquote>
The matrix is called <tt>a</tt> and its elements are <tt>a[1][1]</tt>,
<tt>a[1][2]</tt>, <tt>a[1][3]</tt>, ... <tt>a[2][1]</tt>, etc., and you can
treat this data structure exactly like a two-dimensional array, in which you
can refer to any element by its "X and Y coordinates". For example, if the
CSV file contained numeric data you could compute row and column sums using
simple FOR loops and Kermit's built-in one-dimensional array data type:
<blockquote>
<pre>
declare \&r[\%r]                            # Make an array for the row sums
declare \&c[\%m]                            # Make an array for the column sums
for \%i 1 \%r 1 {                           # Loop through rows
    for \%j 1 \%m 1 {                       # Loop through columns of each row
        increment \&r[\%i] \m(a[\%i][\%j])  # Accumulate row sum
        increment \&c[\%j] \m(a[\%i][\%j])  # Accumulate column sum
    }
}
</pre>
</blockquote>
Note that the sum arrays don't have to be initialized to zero because
Kermit's <small>INCREMENT</small> command treats empty definitions as zero.
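<p>
The idea of building a matrix out of computed names is not specific to
Kermit. As a rough analogy, here is a Python sketch that stores each cell in
a dictionary under a name of the form <tt>a[row][column]</tt> and then forms
the row and column sums, treating missing or empty cells as zero. The
function names and the sample data are hypothetical, and the cells are
assumed to hold integers.

```python
import csv
import io


def load_matrix(csv_text):
    """Store each CSV item under a computed name such as 'a[2][3]'."""
    cells = {}
    rows = 0
    max_cols = 0
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        rows += 1
        for col, item in enumerate(record, start=1):
            cells[f"a[{rows}][{col}]"] = item
        max_cols = max(max_cols, len(record))   # width of the widest row
    return cells, rows, max_cols


def sums(cells, rows, max_cols):
    """Return (row sums, column sums); missing or empty cells count as 0."""
    row_sums = [0] * (rows + 1)                 # index 0 unused (1-based rows)
    col_sums = [0] * (max_cols + 1)             # index 0 unused (1-based columns)
    for r in range(1, rows + 1):
        for c in range(1, max_cols + 1):
            value = int(cells.get(f"a[{r}][{c}]") or 0)
            row_sums[r] += value
            col_sums[c] += value
    return row_sums[1:], col_sums[1:]


cells, rows, max_cols = load_matrix("1,2,3\n4,5\n6,,8\n")
print(rows, max_cols)               # 3 3
print(sums(cells, rows, max_cols))  # ([6, 9, 14], [11, 7, 11])
```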
</div>
<hr>
</body>
</html>