NetNews Usenet Archive 1992 #20

Internet Message Format | 1992-09-11 | 4.6 KB

Path: sparky!uunet!charon.amdahl.com!amdahl!krs
From: krs@uts.amdahl.com (Kris Stephens [Hail Eris!])
Newsgroups: comp.unix.shell
Subject: Re: Substring of a file name
Summary: long followup because of performance considerations in the scripting
Message-ID: <02tT03R280oV00@amdahl.uts.amdahl.com>
Date: 11 Sep 92 16:39:21 GMT
References: <37016@uflorida.cis.ufl.edu> <JDELL.92Sep10180225@golda.mit.edu>
Distribution: usa
Organization: Amdahl Corporation, Sunnyvale CA
Lines: 139

On 10 Sep 92 15:15:41 GMT, urs@carp.cis.ufl.edu (Uppili Srinivasan) said:
>Nntp-Posting-Host: carp.cis.ufl.edu

>Hi,

>I need to write a script to read a substring of some of the filenames
>in a directory. (i.e) if xy1992 is the name i need to read the last four
>letters only. I guess 'awk' is ideal for this . But I am curious to
>know the other ways of doing this.

Actually, awk isn't that great a choice here.  Yeah, it's got substring
matching and editing capability, but it's better at whole-token work.

A sed example was posted, and it'll work to strip the first two alpha
characters off.  What concerns me is that it'd wind up invoking sed once
for each file in the directory.  A better approach is to locate the
directory and shove all the files through one call to sed.

Beware of one thing: different versions of sh and ksh (levels and OSes)
will decide differently when to fork subshells in the while loop, so
anything the loop sets may be gone once it finishes.  See after the
script for some other mods I'd make if needed.

---- example.sh ----
:
# This is sh or ksh to grab the last four characters of all the filenames
# in the $1 directory and, if they're numeric, associate them with the
# filenames.  The likely point is that the directory contains files which
# are yearly summaries or some such.
#
if [ $# -ne 1 ]
then
    echo "usage: $0 directory" 1>&2
    exit 1
fi
if [ ! -d "$1" ]
then
    echo "$0: '$1' - not a directory" 1>&2
    exit 1
fi
#
# Okay, we've got a directory
#
cd "$1" || exit 1
#
# List the files and use sed to replicate the last four characters as
# a second word if those characters are numeric.
#
ls |
sed 's/[0-9]\{4\}$/& &/' |
#
# Read in filename and year if available, skipping files that didn't
# have a year associated.
#
while read file year
do
    [ -z "$year" ] && continue    # no year at the end; check next file
    : # do what needs to be done
done
---- end example.sh ----

IFF (if and *only* if!) this script will only be needed for files dated
1900-1999, I'd be tempted to rewrite the sed like

    sed 's/19[0-9][0-9]$/& &/' |

because it may be easier to read.

------

From this point on, the while loop won't need to test to see if $year
is null -- the modifications do the job in the pipeline.

If the directory has (or may have) a *lot* of files in it that don't end
in years (i.e. chaff to be deleted), I'd alter the sed:

    #
    # sed to
    #   delete all filenames which don't end in 4 numbers, then
    #   replicate the last four characters (numbers) as word 2
    #
    sed -e '/[0-9]\{4\}$/!d' \
        -e 's/.\{4\}$/& &/' |

to have it strip the non-matching files so we never get them in the loop.

And if there are a bazillion files to handle, some matching some not, I'd
take the filter job away from sed and toss a grep into the pipeline:

    # get the filenames
    ls |
    # get rid of files not ending in four numbers
    grep '[0-9]\{4\}$' |
    # duplicate the last four chars (numbers) as a second word
    sed 's/.\{4\}$/& &/' |

This is because making three small calls (ls, grep, sed), each doing a
little of exactly what it's designed to do, is faster than one program
trying to do a lot against a lot of data.  The stream will definitely
outrun the while read loop.  There are more ways to speed it up if
needed, but usually this is sufficient for even pretty heavy-duty lists.
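
For anyone who wants to try the ls, grep, sed pipeline on its own, here
is a minimal sketch wrapped up as a function.  It assumes a POSIX-style
sh, and the name list_years is made up for illustration; it isn't from
any posted script.  The body runs in a subshell so the cd doesn't leak
into the caller.

```shell
# list_years DIR  (hypothetical helper name, for illustration only)
# Print "filename year" for each entry in DIR whose name ends in
# four digits; entries without such a suffix are dropped by grep.
list_years() {
    (
        # Subshell: the caller's working directory is left untouched.
        cd "$1" || exit 1
        # get the filenames, one per line
        ls |
        # keep only names ending in four digits
        grep '[0-9]\{4\}$' |
        # duplicate the last four chars as a second word
        sed 's/.\{4\}$/& &/'
    )
}
```

Its output can be fed straight into the same kind of loop as in
example.sh, e.g. list_years somedir | while read file year; do ...; done
(where somedir stands for whatever directory you're scanning), and the
loop then has no need to test for an empty $year.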

Testing for subdirectories while correctly handling symbolic links is
another tricky part, but this article is already too long.

One other thing, though.  I used ls with no filename args because if we
did

    ls -d *[0-9][0-9][0-9][0-9] |

we'd actually be processing all the filenames twice in that one line:
once by the shell (to build the argument list) and once by ls (to list
them).  Here's the last possible tweak:

    #
    # list the files that end with four numbers
    #
    echo *[0-9][0-9][0-9][0-9] |
    #
    # turn spaces into line-ends
    #
    tr ' ' '\012' |
    #
    # duplicate the last four chars (numbers) as a second word
    #
    sed 's/.\{4\}$/& &/' |

I don't know of a faster approach at shell level.

...Kris
-- 
Kristopher Stephens, | (408-746-6047) | krs@uts.amdahl.com | KC6DFS
Amdahl Corporation   |                |                    |
     [The opinions expressed above are mine, solely, and do not    ]
     [necessarily reflect the opinions or policies of Amdahl Corp. ]