NetNews Usenet Archive 1992 #19

Internet Message Format | 1992-09-01 | 2.9 KB

Xref: sparky comp.unix.shell:3741 comp.unix.questions:10617
Path: sparky!uunet!haven.umd.edu!darwin.sura.net!jvnc.net!yale.edu!think.com!barmar
From: barmar@think.com (Barry Margolin)
Newsgroups: comp.unix.shell,comp.unix.questions
Subject: Re: Some Unix qs
Date: 1 Sep 1992 20:06:31 GMT
Organization: Thinking Machines Corporation, Cambridge MA, USA
Lines: 60
Message-ID: <180ig7INNifo@early-bird.think.com>
References: <1992Sep1.145646.1663@cc.gatech.edu>
NNTP-Posting-Host: telecaster.think.com

Please use more descriptive Subject lines.  Nearly everything in
comp.unix.questions is "Some Unix qs", so using that as your subject is not
very useful.  See my revised Subject line for an example.

In article <1992Sep1.145646.1663@cc.gatech.edu> sougata@cc.gatech.edu (Sougata Mukherjea) writes:
>	I have a file say X which contains some records. These records are
>fixed-format, single line ( each record ends with a NEWLINE character) and the
>record length extends upto 1600 characters.
...
>	The problem I run into while applying the same technique to this
>file X is, that when I apply the "comm" command to the sorted data files X1 and
>X2, each record breaks up (WHY ??) into multiple records (each of length 256
>characters).

Many Unix programs have arbitrary limits like this.  The particular
behavior of breaking long lines into multiple lines often comes from the
following programming practice:

    #define BUFSIZ 256
    char line[BUFSIZ];
    ...
    while (1) {
        result = fgets(line, BUFSIZ, stdin);
        if (result == 0)
            break;
        ...
    }

fgets() will read up to BUFSIZ-1 characters, a newline, or EOF, whichever
comes first.  In the case of a line longer than BUFSIZ-1, the next
iteration will begin with the file pointer in the middle of the line.

Be thankful that it simply treats the long line as multiple lines.  Less
careful programs use gets() instead of fgets(), and it doesn't take a
buffer size parameter.  In that case, it will read beyond the end of the
line[] array and overwrite lots of other variables.
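
To make the splitting concrete, here is a minimal sketch of the fixed-buffer
loop described above, written as a function that counts how many "records"
such a loop sees.  The names CHUNK, count_fgets_records() and
make_record_file() are made up for this illustration (CHUNK is used instead
of redefining BUFSIZ, which <stdio.h> already defines), and the code is not
taken from any real comm(1) source.

```c
#include <assert.h>
#include <stdio.h>

#define CHUNK 256   /* fixed buffer size, standing in for the 256 above */

/* Count how many "lines" a fixed-buffer fgets() loop sees on fp.
   Each successful fgets() call is what a naive program would treat
   as one record, whether or not it ended at a real newline. */
int count_fgets_records(FILE *fp)
{
    char line[CHUNK];
    int records = 0;

    assert(fp != NULL);
    while (fgets(line, sizeof line, fp) != NULL)
        records++;
    return records;
}

/* Hypothetical helper: write one record of n 'x' characters followed by
   a newline to a temporary file, then rewind it for reading.
   Returns NULL if the temporary file cannot be created. */
FILE *make_record_file(size_t n)
{
    FILE *fp = tmpfile();
    size_t i;

    if (fp == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        putc('x', fp);
    putc('\n', fp);
    rewind(fp);
    return fp;
}
```

With a single 1600-character record, count_fgets_records() returns 7: six
pieces of 255 characters each, then a final piece holding the remaining 70
characters and the newline.  A record of 254 characters or fewer comes back
as one piece, which is why the problem only shows up on long records.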
>My question is how do I perform the above operations [ (i), (ii) & (iii) ]
>without breaking a record into pieces. The output SHOULD preserve the
>existing record format ( which "comm" does for small record lengths) and
>SHOULD NOT introduce extraneous characters in the output files.

If you must use the standard utilities, you're stuck with their arbitrary
limitations.

Often GNU versions of Unix utilities don't have these limits.  In
particular, they often use the GNU readline() subroutine to read input
instead of fgets().  Readline() allocates the input buffer itself, and
automatically expands it as necessary.  And if the GNU version still
doesn't quite meet your needs you have source so you can fix it if
necessary.

The GNU version of comm(1) is in the textutils package.
-- 
Barry Margolin
System Manager, Thinking Machines Corp.

barmar@think.com          {uunet,harvard}!think!barmar
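
The idea of an input routine that allocates its own buffer and expands it
as needed can be sketched roughly as follows.  This is only an illustration
of the technique, not the GNU code; read_long_line() is a hypothetical name,
and the starting size of 64 and the doubling policy are arbitrary choices.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read one line of any length from fp into a malloc'd buffer.
   Returns NULL at end of file or if memory runs out.  The trailing
   newline is dropped, and the caller must free() the result. */
char *read_long_line(FILE *fp)
{
    size_t cap = 64, len = 0;
    char *buf, *bigger;
    int c;

    assert(fp != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while ((c = getc(fp)) != EOF && c != '\n') {
        if (len + 1 == cap) {       /* keep one byte free for the '\0' */
            cap *= 2;
            bigger = realloc(buf, cap);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
        }
        buf[len++] = (char)c;
    }

    if (c == EOF && len == 0) {     /* nothing left to read */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

Because the buffer grows with the input, a 1600-character record comes back
as a single string rather than in pieces, at the cost of one allocation per
line.  A real utility would more likely reuse one buffer across calls.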