<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
<head>
<title> Kermit and Comma-Separated-Value Files </title>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<META http-equiv="Content-Style-Type" content="text/css">
<LINK REL=STYLESHEET TYPE="text/css" HREF="kermit.css">
<LINK REL="shortcut icon" href="favicon.ico" >
<style type="text/css">
ul li { padding-bottom:9;padding-right:64;line-height:12.5pt; }
h2, h3 { font-family: sans-serif }
h3 { border-top: 2px solid #999999 }
ul.contents li { line-height:10pt; font-family:sans-serif }
tt,pre { font-size:11pt }
dl.loose dd { padding:0 0 8 0 }
blockquote.example { margin-top:6; margin-bottom:8 }
body { color:black; background:white; margin:0; font-size:12pt }
</style>
</head>
<body>
<table cellpadding=0 cellspacing=0 width="100%" style="border:1px solid darkmagenta; background-color:white;">
<tr style="background-image:url('lb3.jpg');">
<td style="padding:8 8 8 8">
<a style="text-decoration:none" href="http://www.columbia.edu"><img border=0 alt="The Columbia Crown" title="The Columbia Crown (crown of King George II)" height=105 src="crownico.gif"></a>
<td align="left" style="padding-top:23">
<tt style="font-size:24pt"><b>The Kermit Project</b></tt> |
<span style="font-family:Ariel,times; font-size:18pt"><i>Columbia University</i></span>
<br><span style="font-family:Ariel,times; font-size:14pt">
612 West 115th Street, New York NY 10025 USA • <a href="mailto:kermit@columbia.edu">kermit@columbia.edu</a>
</span>
<table width="100%">
<tr>
<td style="font-size:12pt; font-style:italic">…since <small>1981</small>
</table>
<tr>
<td colspan=2 style="padding:0">
<table class=menu cellpadding=0 cellspacing=0 width="100%" style="border-top:1px solid darkmagenta">
<tr>
<td onClick="document.location.href='index.html';" title="Kermit Project Home Page" style="cursor:pointer"><a href="index.html">Home</a>
<td onClick="document.location.href='k95.html';" title="Kermit 95 for Windows" style="cursor:pointer"><a href="k95.html">Kermit 95</a>
<td class=this onClick="document.location.href='ckermit.html';" title="C-Kermit for Unix and VMS" style="cursor:pointer"><a href="ckermit.html">C-Kermit</a>
<td onClick="document.location.href='ckscripts.html';" title="Kermit Script Language and Tutorial" style="cursor:pointer"><a href="ckscripts.html">Scripts</a>
<td onClick="document.location.href='current.html';" title="Current Versions of Kermit Software" style="cursor:pointer"><a href="current.html">Current</a>
<td onClick="document.location.href='new.html';" title="What's New" style="cursor:pointer"><a href="new.html">New</a>
<td onClick="document.location.href='faq.html';" title="Frequently Asked Questions" style="cursor:pointer"><a href="faq.html">FAQ</a>
<td onClick="document.location.href='support.html';" style="border-right:0; cursor:pointer" title="Kermit Software Support"><a href="support.html">Support</a>
</table>
</table>
<div style="margin:8; font-family:calibri,sans-serif,times">
<div style="margin:8">
<div align="right" style="font-weight:bold; font-size:11pt">
<span class=qq><a title="C-Kermit 9.0" href="ck90.html">C-Kermit 9.0</a></span>
</div>
<h2 style="margin-top:8">C-Kermit 9.0 and Comma-Separated-Value Files</h2>
<blockquote>
Frank da Cruz<br>
The Kermit Project<br>
Columbia University<br>
<i>Last update:</i> Fri Aug 20 14:02:34 2010
</blockquote>
<p>
<div class=dm style="background:#eeeeee;border:1px solid plum; padding:0">
<ul style="font-size:14pt">
<li><a href="#record"><b>Reading a CSV or TSV Record and Converting it to an Array</b></a>
<li><a href="#join"><b>Using \fjoin() to create a Comma- or Tab-Separated Value List from an Array</b></a>
<li><a href="#file"><b>Using CSV or TSV Files</b></a>
</ul>
</div>
<p>
Comma-Separated Value (CSV) format is commonly output by spreadsheets and
databases when exporting data into plain-text files for import into other
applications.
It is important to understand the formal definition of a CSV file.
<ol>
<li>Each record is a series of fields.
<li>Records are in whatever format is used by the underlying file system for lines of text.
<li>Fields within records are separated by commas, with zero or more whitespace characters (space or tab) before and/or after the comma.
<li>Fields with embedded commas are enclosed in ASCII doublequote characters.
<li>Fields with leading or trailing spaces are enclosed in ASCII doublequotes.
<li>Fields with embedded doublequotes are enclosed in doublequotes and each interior doublequote is doubled.
</ol>
Here is an example:
<blockquote>
<pre>
aaa, bbb, has spaces,,"ddd,eee,fff", " has spaces ","Muhammad ""The Greatest"" Ali"
</pre>
</blockquote>
The first two are regular fields. The third is a field that has an embedded
space but in which any leading or trailing spaces are to be ignored. The
fourth is an empty field, but still a field. The fifth is a field that
contains embedded commas. The sixth has leading and trailing spaces. The
last field has embedded quotation marks.
<p>
Prior to C-Kermit 9.0 Alpha.06, C-Kermit did not handle CSV files according
to the specification above. Most seriously, there was no provision for a
separator to be surrounded by whitespace that was to be considered part of
the separator. Also there was no provision for quoting doublequotes inside
of a quoted string.
<h3 id=record>Reading a CSV record</h3>
Now the <tt>\fsplit()</tt> function can handle any CSV-format string if you
include the symbolic include set "CSV" as the 4th parameter. To illustrate,
this program:
<blockquote>
<pre>
def xx {
    echo [\fcontents(\%1)]
    .\%9 := \fsplit(\fcontents(\%1), &a, \44, CSV)
    for \%i 1 \%9 1 { echo "\flpad(\%i,3). [\&a[\%i]]" }
    echo "-----------"
}
xx {a,b,c}
xx { a , b , c }
xx { aaa,,ccc," with spaces ",zzz }
xx { "1","2","3","","5" }
xx { this is a single field }
xx { this is one field, " and this is another " }
xx { name,"Mohammad ""The Greatest"" Ali", age, 67 }
exit
</pre>
</blockquote>
gives the following results:
<blockquote>
<pre>
[a,b,c]
  1. [a]
  2. [b]
  3. [c]
-----------
[ a , b , c ]
  1. [a]
  2. [b]
  3. [c]
-----------
[ aaa,,ccc," with spaces ",zzz ]
  1. [aaa]
  2. []
  3. [ccc]
  4. [ with spaces ]
  5. [zzz]
-----------
[ "1","2","3","","5" ]
  1. [1]
  2. [2]
  3. [3]
  4. []
  5. [5]
  6. []    <span style="color:red">← Oops, this was fixed in Alpha.07</span>
-----------
[ this is a single field ]
  1. [this is a single field]
-----------
[ this is one field, " and this is another " ]
  1. [this is one field]
  2. [ and this is another ]
  3. []    <span style="color:red">← Ditto</span>
-----------
[ name,"Mohammad ""The Greatest"" Ali", age, 67 ]
  1. [name]
  2. [Mohammad "The Greatest" Ali]
  3. [age]
  4. [67]
-----------
</pre>
</blockquote>
The separator <tt>\44</tt> (comma) must still be specified as the break set
(3rd <tt>\fsplit()</tt> parameter).
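<p>
The results above can be cross-checked outside of Kermit. What follows is a
minimal Python sketch, not Kermit's own implementation: the name
<tt>split_csv_record</tt> and its regular expression are illustrative
assumptions that simply apply the six CSV rules listed at the top of this
page, with comma fixed as the separator.
<blockquote>
<pre>
```python
import re

# One field: optional blanks, then either a quoted field (in which "" stands
# for a single doublequote) or a bare field, then optional blanks, then a
# comma or the end of the record. This pattern is an assumption made for
# illustration; it is not taken from the C-Kermit source code.
_FIELD = re.compile(r'[ \t]*(?:"((?:[^"]|"")*)"|([^,]*?))[ \t]*(,|\Z)')


def split_csv_record(record):
    """Split one CSV record into a list of field values."""
    fields, pos = [], 0
    while True:
        m = _FIELD.match(record, pos)
        quoted, bare, sep = m.groups()
        # A quoted field keeps its blanks and commas; doubled quotes collapse.
        fields.append(bare if quoted is None else quoted.replace('""', '"'))
        if sep == "":          # reached the end of the record
            return fields
        pos = m.end()          # continue just past the comma


print(split_csv_record(' aaa,,ccc," with spaces ",zzz '))
# ['aaa', '', 'ccc', ' with spaces ', 'zzz']
```
</pre>
</blockquote>
Under these assumptions the sketch yields five elements for the record
<tt>"1","2","3","","5"</tt>, with no spurious empty element at the end.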
When “CSV” is specified as the include set:
<ul>
<li>The Grouping Mask is automatically set to 1 (which specifies that the ASCII doublequote character (<tt>"</tt>) is used for grouping);
<li>The Separator Flag is automatically set to 1 so that adjacent field separators will not be collapsed;
<li>All bytes (values 0 through 255) other than the break character are added to the include set;
<li>Any leading whitespace is stripped from the first element unless it is enclosed in doublequotes;
<li>Any trailing whitespace is trimmed from the end of the last element unless it is enclosed in doublequotes;
<li>If the separator character has any spaces or tabs preceding it or following it, they are ignored and discarded;
<li>The separator character is treated as an ordinary data character if it appears in a quoted field;
<li>A sequence of two doublequote characters (<tt>""</tt>) within a quoted field is converted to a single doublequote.
</ul>
There is also a new TSV symbolic include set, which is like CSV except
without the quoting rules or the stripping of whitespace around the
separator because, by definition, TSV fields do not contain tabs.
<p>
Of course you can specify any separator(s) you want with any of the CSV,
TSV, or ALL symbolic include sets. For example, if you have a TSV file in
which you want the spaces around each Tab to be discarded, you can use:
<blockquote>
<pre>
\fsplit(<i>variable</i>, &a, \9, CSV)
</pre>
</blockquote>
<tt>\9</tt> is Tab.
<p>
The new symbolic include sets can also be used with <tt>\fword()</tt>, which
is just like <tt>\fsplit()</tt> except that it retrieves the
<i>n</i><small>th</small> word from the argument string, rather than an
array of all the words. In C-Kermit you can get information about these or
any other functions with the <small>HELP FUNCTION</small> command, e.g.:
<blockquote>
<pre>
C-Kermit> <u>help func word</u>
\fword(s1,n1,s2,s3,n2,n3) - Extract word from string.
  s1 = source string
  n1 = word number (1-based) counting from left; if negative, from right.
  s2 = optional break set.
  s3 = optional include set.
  n2 = optional grouping mask.
  n3 = optional separator flag:
     0 = collapse adjacent separators
     1 = don't collapse adjacent separators.

  Default break set is all characters except ASCII letters and digits.
  ASCII (C0) control characters are treated as break characters by default.
  Default include set is null. Three special symbolic include sets are also
  allowed: ALL (bytes not in the break set), CSV (special treatment for
  Comma-Separated-Value records), and TSV (special treatment for Tab-
  Separated-Value records).

  If grouping mask given and nonzero, words can be grouped by quotes or
  brackets selected by the sum of the following:

     1 = doublequotes:    "a b c"
     2 = braces:          {a b c}
     4 = apostrophes:     'a b c'
     8 = parentheses:     (a b c)
    16 = square brackets: [a b c]
    32 = angle brackets:  &lt;a b c&gt;

  Nesting is possible with {}()[]&lt;&gt; but not with quotes or apostrophes.

Returns string:
  Word number n, if there is one, otherwise an empty string.
C-Kermit>
</pre>
</blockquote>
<h3 id=join>Using <tt>\fjoin()</tt> to create Comma- or Tab-Separated Value Lists from Arrays</h3>
In Alpha.08 of C-Kermit 9.0, <tt>\fsplit()</tt>'s inverse function,
<a href="ckermit80.html#fjoin"><tt>\fjoin()</tt></a>, received the
capability of converting an array into a comma-separated or a tab-separated
value list. Thus, given a CSV, if you split it into an array with
<tt>\fsplit()</tt> and then join the array with <tt>\fjoin()</tt>, giving
each function the new CSV parameter in the appropriate argument position,
the result will be equivalent to the original, according to the CSV
definition. It might not be identical, because if the original had
extraneous spaces before or after the separating commas, these are
discarded, but that does not affect the elements themselves.
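<p>
To make "equivalent but not identical" concrete, here is a minimal Python
sketch of the joining direction. It is an illustration only: the name
<tt>join_csv_record</tt> is hypothetical, the quoting decisions follow rules
4 through 6 of the CSV definition at the top of this page, and the output is
not claimed to match what <tt>\fjoin()</tt> emits byte for byte.
<blockquote>
<pre>
```python
def join_csv_record(fields):
    """Join field values into one CSV record (illustrative sketch only)."""
    out = []
    for field in fields:
        # Quote a field only when a CSV rule requires it: an embedded comma,
        # an embedded doublequote, or leading/trailing blanks.
        needs_quotes = (
            "," in field
            or '"' in field
            or field != field.strip(" \t")
        )
        if needs_quotes:
            # Interior doublequotes are doubled inside the quoted field.
            field = '"' + field.replace('"', '""') + '"'
        out.append(field)
    return ",".join(out)


print(join_csv_record(["name", 'Mohammad "The Greatest" Ali', "age", "67"]))
# name,"Mohammad ""The Greatest"" Ali",age,67
```
</pre>
</blockquote>
Compared with the record <tt>name,"Mohammad ""The Greatest"" Ali", age, 67</tt>
used earlier, the joined form has lost the blanks after the commas, but
every element has the same value.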
The new syntax for <tt>\fjoin()</tt> is:
<p>
<dl>
<dt><b><tt>\fjoin(&a,CSV)</tt></b>
<dd>Given the array <tt>\&a[]</tt> or any other valid array designator,
joins its elements into a comma-separated list according to the rules listed
above.
<p>
<dt><b><tt>\fjoin(&a,TSV)</tt></b>
<dd>Joins the elements of the given array into a tab-separated list, also
described above.
</dl>
<p>
<a href="ckermit80.html#fjoin">Previous calling conventions for <tt>\fjoin()</tt></a>
are undisturbed, including the ability to specify a portion of an array,
rather than the whole array:
<p>
<blockquote>
<pre>
declare \&a[] = 1 2 3 4 5 6 7 8 9
echo \fjoin(&a[3:7],CSV)
3,4,5,6,7
</pre>
</blockquote>
<p>
Using <tt>\fsplit()</tt> and <tt>\fjoin()</tt>, it is now possible to
convert a comma-separated value list into a tab-separated value list, and
vice versa (which is not a simple matter of changing commas to tabs or vice
versa).
<h3 id=join>Applications for CSV Files</h3>
Databases such as MS Access or MySQL can export tables or reports in CSV
format, and then Kermit can read the resulting CSV file and do whatever you
like with it; typically something that could not be done with the database
query language itself (or that you didn't know how to do that way): create
reports or datasets that are based on complex criteria or procedures, edit
or modify some fields, etc, and then use <tt>\fjoin()</tt> to put each
record back in CSV form so it can be reimported into a spreadsheet or
database.
<p>
Here is a simple example in which we purge all records of customers who have
two or more unpaid bills. The file is sorted in serial-number order.
<p>
<blockquote>
<pre>
#!/usr/local/bin/kermit
.filename = somefile.csv             # Input file in CSV format
fopen /read \%c \m(filename)         # Open it
if fail exit                         # Don't go on if open failed
copy \m(filename) ./new              # Make a copy of the file
.oldserial = 00000000000             # Multiple records for each serial number
.zeros = 0                           # Unpaid bill counter
while true {                         # Loop
    fread /line \%c line             # Get a record
    if fail break                    # End of file: leave the loop
    .n := \fsplit(\m(line),&a,\44,CSV) # Split the fields into an array
    if not equ "\m(oldserial)" "\&a[6]" { # Have new serial number?
        # Remove all records for previous serial number
        # if two or more bills were not paid...
        if > \m(zeros) 1 {
            grep /nomatch \m(oldserial) /output:./new2 ./new
            rename ./new2 ./new
        }
        .oldserial := \&a[6]         # To detect next time serial number changes
        .zeros = 0                   # Reset unpaid bill counter
    }
    if equ "\&a[5]" "$0.00" {        # Element 5 is amount paid
        increment zeros              # If it's zero, count it.
    }
}
fclose \%c
if > \m(zeros) 1 {                   # Same check for the final serial number
    grep /nomatch \m(oldserial) /output:./new2 ./new
    rename ./new2 ./new
}
</pre>
</blockquote>
<p>
Rewriting the file multiple times is inelegant, but this is a quick and
dirty use-once-and-discard script, so elegance doesn't count. The example
is interesting in that it purges certain records based on the contents of
other records. Maybe there is a way to do this directly with SQL, but why
use SQL when you can use Kermit?
<h3 id=file>Using CSV Files: Extending Kermit's Data Structures</h3>
Now that we can parse a CSV record, what would we do with a CSV <i>file</i>
– that is, a sequence of records? If we needed all the data available at
once, we would want to load it into a matrix of (row,column) values. But
Kermit doesn't have matrices. Or does it?
<p>
Kermit has several built-in data types, but you can invent your own data
types as needed using Kermit's macro feature:
<blockquote>
<pre>
define <i>variablename value</i>
</pre>
</blockquote>
For example:
<blockquote>
<pre>
define alphabet abcdefghijklmnopqrstuvwxyz
</pre>
</blockquote>
This defines a macro named <i>alphabet</i> and gives it the value
<i>abcdefghijklmnopqrstuvwxyz</i>. A more convenient notation (added in
C-Kermit 7.0) for this is:
<blockquote>
<pre>
.alphabet = abcdefghijklmnopqrstuvwxyz
</pre>
</blockquote>
The two are exactly equivalent: they make a literal copy of the "right-hand
side" as the value of the macro. Then you can refer to the macro anywhere
in a Kermit command as "<tt>\m(</tt><i>macroname</i><tt>)</tt>":
<blockquote>
<pre>
echo "Alphabet = \m(alphabet)"
</pre>
</blockquote>
There is a second way to define a macro, which is like the first except that
the right-hand side is <i>evaluated</i> first; that is, any variable
references or function calls in the right-hand side are replaced by their
values before the result is assigned to the macro. The command for this is
<small>ASSIGN</small> rather than <small>DEFINE</small>:
<blockquote>
<pre>
define alphabet abcdefghijklmnopqrstuvwxyz
assign backwards \freverse(\m(alphabet))
echo "Alphabet backwards = \m(backwards)"
</pre>
</blockquote>
which prints:
<blockquote>
<pre>
Alphabet backwards = zyxwvutsrqponmlkjihgfedcba
</pre>
</blockquote>
This kind of assignment can also be done like this:
<blockquote>
<pre>
.alphabet = abcdefghijklmnopqrstuvwxyz
.backwards := \freverse(\m(alphabet))
</pre>
</blockquote>
<a href="ckermit70.html#x7.9">Any command starting with a period is an assignment</a>,
and the operator (<tt>=</tt> or <tt>:=</tt>) tells what to do with the
right-hand side before making the assignment.
<p>
In both the <small>DEFINE</small> and <small>ASSIGN</small> commands, the
variable name itself is taken literally.
It is also possible, however, to have Kermit <i>compute</i> the variable
name. This is done (as described in
<a title="Using C-Kermit, 2nd Edition" href="http://www.amazon.com/gp/product/1555581641?ie=UTF8&tag=aleidmoreldom-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1555581641"><i>Using C-Kermit</i></a>,
2nd Ed., p.457), using parallel commands that start with underscore:
<small>_DEFINE</small> and <small>_ASSIGN</small> (alias
<small>_ASG</small>). These are just like <small>DEFINE</small> and
<small>ASSIGN</small> except they evaluate the variable name before making
the assignment. For example:
<blockquote>
<pre>
define \%a one
_define \%a\%a\%a 111
</pre>
</blockquote>
would create a macro named <small>ONEONEONE</small> with a value of 111,
and:
<blockquote>
<pre>
define \%a one
define number 111
_assign \%a\%a\%a \m(number)
</pre>
</blockquote>
would create the same macro with the same value, but:
<blockquote>
<pre>
define \%a one
define number 111
_define \%a\%a\%a \m(number)
</pre>
</blockquote>
would give the macro a value of "<tt>\m(number)</tt>".
<p>
You can use the <small>_ASSIGN</small> command to create any kind of data
structure you want; you can find some examples in the
<a href="ckscripts.html#oops">Object-Oriented Programming</a> section of the
<a href="ckscripts.html">Kermit Script Library</a>.
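<p>
As a rough analogy in another language, computed macro names behave much
like string keys in a dictionary. The Python sketch below is an assumption
made purely for illustration (the names <tt>macros</tt> and
<tt>store_matrix</tt> are hypothetical and nothing here is Kermit code); it
shows how names of the form <tt>a[row][column]</tt> can stand in for the
matrix asked about earlier.
<blockquote>
<pre>
```python
macros = {}   # stands in for Kermit's macro name space (an assumption)


def store_matrix(name, rows):
    """Store each item under a computed name such as 'a[2][3]'.

    Returns (number of rows, width of the widest row).
    """
    widest = 0
    for r, row in enumerate(rows, start=1):
        for c, item in enumerate(row, start=1):
            # The key is computed from the indices, much as a computed
            # macro name would be built before the assignment is made.
            macros[f"{name}[{r}][{c}]"] = item
        widest = max(widest, len(row))
    return len(rows), widest


nrows, ncols = store_matrix("a", [["1", "2", "3"], ["4", "5"]])
print(nrows, ncols)                 # 2 3
print(macros["a[1][3]"])            # 3
print(repr(macros.get("a[2][3]", "")))  # '' (no such element in the short row)
```
</pre>
</blockquote>
The last line models a short row: a missing element simply reads as empty
here, which is the behavior this sketch assumes for an undefined name.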
In the following program we use this capability to create a two-dimensional
array, or matrix, to hold all the elements of the CSV file, and then to
display the matrix:
<blockquote>
<pre>
fopen /read \%c data.csv                # Open CSV file
if fail exit 1
.\%r = 0                                # Row
.\%m = 0                                # Maximum columns
while true {
    fread /line \%c line                # Read a record
    if fail break                       # End of file
    .\%n := \fsplit(\m(line),&a,\44,CSV) # Split record into items
    incr \%r                            # Count this row
    for \%i 1 \%n 1 {                   # Assign items to this row of matrix
        _asg a[\%r][\%i] \&a[\%i]
    }
    if > \%i \%m { .\%m := \%i }        # Remember width of widest row
}
fclose \%c                              # Close CSV file
decrement \%m                           # (because of how FOR loop works)
echo MATRIX A ROWS: \%r COLUMNS: \%m
# Show the matrix
for \%i 1 \%r 1 {                       # Loop through rows
    for \%j 1 \%m 1 {                   # Loop through columns of each row
        xecho "\flpad(\m(a[\%i][\%j]),6)"
    }
    echo
}
exit 0
</pre>
</blockquote>
The matrix is called <tt>a</tt> and its elements are <tt>a[1][1]</tt>,
<tt>a[1][2]</tt>, <tt>a[1][3]</tt>, ... <tt>a[2][1]</tt>, etc, and you can
treat this data structure exactly like a two-dimensional array, in which you
can refer to any element by its "X and Y coordinates". For example, if the
CSV file contained numeric data you could compute row and column sums using
simple FOR loops and Kermit's built-in one-dimensional array data type:
<blockquote>
<pre>
declare \&r[\%r]                        # Make an array for the row sums
declare \&c[\%m]                        # Make an array for the column sums
for \%i 1 \%r 1 {                       # Loop through rows
    for \%j 1 \%m 1 {                   # Loop through columns of each row
        increment \&r[\%i] \m(a[\%i][\%j]) # Accumulate row sum
        increment \&c[\%j] \m(a[\%i][\%j]) # Accumulate column sum
    }
}
</pre>
</blockquote>
Note that the sum arrays don't have to be initialized to zero because
Kermit's <small>INCREMENT</small> command treats empty definitions as zero.
</div>
</div>
<hr>
</body>
</html>