- Xref: sparky comp.unix.shell:3741 comp.unix.questions:10617
- Path: sparky!uunet!haven.umd.edu!darwin.sura.net!jvnc.net!yale.edu!think.com!barmar
- From: barmar@think.com (Barry Margolin)
- Newsgroups: comp.unix.shell,comp.unix.questions
- Subject: Re: Some Unix qs
- Date: 1 Sep 1992 20:06:31 GMT
- Organization: Thinking Machines Corporation, Cambridge MA, USA
- Lines: 60
- Message-ID: <180ig7INNifo@early-bird.think.com>
- References: <1992Sep1.145646.1663@cc.gatech.edu>
- NNTP-Posting-Host: telecaster.think.com
-
- Please use more descriptive Subject lines. Nearly everything in
- comp.unix.questions is "Some Unix qs", so using that as your subject is not
- very useful. See my revised Subject line for an example.
-
- In article <1992Sep1.145646.1663@cc.gatech.edu> sougata@cc.gatech.edu (Sougata Mukherjea) writes:
- > I have a file say X which contains some records. These records are
- >fixed-format, single line ( each record ends with a NEWLINE character) and the
- >record length extends upto 1600 characters.
- ...
- > The problem I run into while applying the same technique to this
- >file X is, that when I apply the "comm" command to the sorted data files X1 and
- >X2, each record breaks up (WHY ??) into multiple records (each of length 256
- >characters).
-
- Many Unix programs have arbitrary limits like this. The particular
- behavior of breaking long lines into multiple lines often comes from the
- following programming practice:
-
- #include <stdio.h>
-
- #define LINESIZE 256    /* arbitrary fixed limit */
-
- char line[LINESIZE];
-
- ...
-
- /* each pass stores at most LINESIZE-1 characters in line[] */
- while (fgets(line, LINESIZE, stdin) != NULL) {
-     ...
- }
-
- fgets() stops after reading LINESIZE-1 characters, a newline, or EOF,
- whichever comes first. When a line is longer than LINESIZE-1 characters,
- the next iteration begins with the file pointer in the middle of that
- line, so the remainder is processed as if it were a separate line.
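-
- To see the splitting in action, here is a minimal sketch that counts how
- many "lines" such a loop sees. The helper name count_fgets_chunks and its
- bufsize parameter are made up for illustration; they are not taken from
- any real utility.
-
```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Counts how many pieces a fixed-buffer fgets() loop sees in fp.
   bufsize is the size of the line buffer, as in the loop above.
   Returns -1 if the buffer cannot be allocated. */
long count_fgets_chunks(FILE *fp, int bufsize)
{
    char *line;
    long chunks = 0;

    assert(fp != NULL && bufsize >= 2);
    line = malloc((size_t)bufsize);
    if (line == NULL)
        return -1;

    /* Each call returns at most bufsize-1 characters, so a longer
       record comes back in several pieces. */
    while (fgets(line, bufsize, fp) != NULL)
        chunks++;

    free(line);
    return chunks;
}
```
-
- With a 256-byte buffer, a single 1600-character record followed by a
- newline comes back as seven pieces: six of 255 characters each, then the
- remaining 70 characters plus the newline.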
-
- Be thankful that it simply treats the long line as multiple lines. Less
- careful programs use gets() instead of fgets(), and gets() doesn't take a
- buffer size parameter. Given a long line, it will write past the end of
- the line[] array and overwrite whatever variables happen to follow it.
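-
- A small sketch of why the size argument matters. The struct layout and the
- function name bounded_read_keeps_neighbor are hypothetical, chosen only to
- put something valuable right after the buffer.
-
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical layout: a small line buffer followed by data we care about. */
struct record {
    char line[16];
    char neighbor[16];
};

/* Reads one piece of input with fgets() and reports whether neighbor
   survived: 1 if unchanged, 0 if overwritten, -1 if nothing was read. */
int bounded_read_keeps_neighbor(FILE *fp)
{
    struct record r;

    assert(fp != NULL);
    strcpy(r.neighbor, "important data");

    if (fgets(r.line, (int)sizeof r.line, fp) == NULL)
        return -1;

    /* fgets() stored at most sizeof r.line - 1 characters plus a NUL,
       so it cannot have reached r.neighbor. gets(r.line) has no such
       bound and would keep storing characters until the newline. */
    return strcmp(r.neighbor, "important data") == 0;
}
```
-
- However long the input line is, the fgets() call stays inside line[]. The
- same call written with gets() would have no way to know where line[] ends.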
-
- >My question is how do I perform the above operations [ (i), (ii) & (iii) ]
- >without breaking a record into pieces. The output SHOULD preserve the
- >existing record format ( which "comm" does for small record lengths) and
- >SHOULD NOT introduce extraneous characters in the output files.
-
- If you must use the standard utilities, you're stuck with their arbitrary
- limitations.
-
- The GNU versions of Unix utilities often don't have these limits. In
- particular, they often read input with the GNU readline() subroutine
- instead of fgets(). readline() allocates the input buffer itself and
- automatically expands it as necessary. And if the GNU version still
- doesn't quite meet your needs, you have the source, so you can fix it.
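-
- The general idea of a self-expanding input buffer can be sketched as
- follows. This is not the GNU routine; read_long_line is a hypothetical
- helper, and the starting capacity and doubling policy are assumptions made
- for the example.
-
```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one line of any length from fp into a
   malloc'd, NUL-terminated buffer with the newline stripped.
   Returns NULL at end of input or on allocation failure.
   The caller frees the result. */
char *read_long_line(FILE *fp, size_t *len_out)
{
    size_t cap = 128, len = 0;
    char *buf;
    int c;

    assert(fp != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while ((c = getc(fp)) != EOF && c != '\n') {
        if (len + 1 >= cap) {           /* keep room for the final NUL */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
        buf[len++] = (char)c;
    }

    if (c == EOF && len == 0) {         /* nothing left to read */
        free(buf);
        return NULL;
    }

    buf[len] = '\0';
    if (len_out != NULL)
        *len_out = len;
    return buf;
}
```
-
- Because the buffer grows on demand, a 1600-character record comes back
- whole, and the only limit left is available memory.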
-
- The GNU version of comm(1) is in the textutils package.
- --
- Barry Margolin
- System Manager, Thinking Machines Corp.
-
- barmar@think.com {uunet,harvard}!think!barmar
-