- Path: sparky!uunet!charon.amdahl.com!amdahl!krs
- From: krs@uts.amdahl.com (Kris Stephens [Hail Eris!])
- Newsgroups: comp.unix.shell
- Subject: Re: Substring of a file name
- Summary: long followup because of performance considerations in the scripting
- Message-ID: <02tT03R280oV00@amdahl.uts.amdahl.com>
- Date: 11 Sep 92 16:39:21 GMT
- References: <37016@uflorida.cis.ufl.edu> <JDELL.92Sep10180225@golda.mit.edu>
- Distribution: usa
- Organization: Amdahl Corporation, Sunnyvale CA
- Lines: 139
-
- On 10 Sep 92 15:15:41 GMT, urs@carp.cis.ufl.edu (Uppili Srinivasan) said:
- >Nntp-Posting-Host: carp.cis.ufl.edu
- >Hi,
- >I need to write a script to read a substring of some of the filenames
- >in a directory. (i.e) if xy1992 is the name i need to read the last four
- >letters only. I guess 'awk' is ideal for this . But I am curious to
- >know the other ways of doing this.
-
- Actually, awk isn't that great a choice here. Yeah, it's got substring
- matching and editing capability, but it's better at whole-token work.
-
- A sed example was posted, and it'll work to strip the first two alpha
- characters off. What concerns me is that it'd wind up invoking sed
- once for each file in the directory. A better approach is to locate
- the directory and shove all the files through one call to sed.
-
- Beware of one thing: different versions of sh and ksh (levels and OSes)
- differ on whether the while loop at the end of a pipeline runs in a forked
- subshell. Where it does, variables set inside the loop are lost once the
- loop ends.
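
One way around the subshell question, sketched under the assumption of a POSIX-style shell: put the loop and whatever consumes its results inside one set of braces. Both then run in the same process whichever way the shell decides to fork. The function name count_year_files and the sample filenames are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count the names on stdin that end in four digits.
# The counter is printed inside the same { ...; } group as the loop, so
# the result is the same whether or not the shell forks for the loop.
count_year_files() {
    sed 's/[0-9]\{4\}$/& &/' |
    {
        count=0
        while read file year
        do
            [ -z "$year" ] && continue   # no year at the end; skip
            count=$((count + 1))
        done
        echo "$count"
    }
}

# Two of the three sample names end in four digits, so this prints 2.
printf '%s\n' xy1992 notes rep2001 | count_year_files
```

Anything that needs the loop's variables afterwards would go inside the braces, after the done.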
-
- See after the script for some other mods I'd make if needed.
-
- ---- example.sh ----
- :
- # This is sh or ksh to grab the last four characters of all the filenames
- # in the $1 directory and, if they're numeric, associate them with the
- # filenames. The likely point is that the directory contains files which
- # are yearly summaries or some such.
- #
- if [ $# -ne 1 ]
- then
-     echo "usage: $0 directory" 1>&2
-     exit 1
- fi
- if [ ! -d "$1" ]
- then
-     echo "$0: '$1' - not a directory" 1>&2
-     exit 1
- fi
-
- #
- # Okay, we've got a directory
- #
- cd "$1" || exit 1
-
- #
- # List the files and use sed to replicate the last four characters as
- # a second word if those characters are numeric.
- #
- ls |
-
- sed 's/[0-9]\{4\}$/& &/' |
-
- #
- # Read in filename and year if available, skipping files that didn't
- # have a year associated.
- #
- while read file year
- do
-     [ -z "$year" ] && continue    # no year at the end; check next file
-
-     :    # do what needs to be done
- done
- ---- end example.sh ----
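
To see what the loop actually receives, here is a small sketch that feeds made-up names through the same sed and read steps instead of a real directory. The function name show_years and the sample names are assumptions for the demonstration only.

```shell
#!/bin/sh
# Hypothetical demo: same sed and read loop as example.sh, but fed from
# stdin so it can be tried without a real directory.
show_years() {
    sed 's/[0-9]\{4\}$/& &/' |
    while read file year
    do
        [ -z "$year" ] && continue   # no trailing four digits; skip
        echo "file=$file year=$year"
    done
}

printf '%s\n' xy1992 notes ab12 summary2001 | show_years
# prints:
#   file=xy1992 year=1992
#   file=summary2001 year=2001
```

The names notes and ab12 pass through sed unchanged, so read leaves $year empty and the loop skips them.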
-
- IFF (if and *only* if!) this script will only be needed for files dated
- 1900-1999, I'd be tempted to rewrite the sed like
-
- sed 's/19[0-9][0-9]$/& &/' |
-
- because it may be easier to read.
-
- ------
- From this point on, the while loop won't need to test to see if $year is
- null -- the modifications do the job in the pipeline.
-
- If the directory has (or may have) a *lot* of files in it that don't end
- in years (i.e. chaff to be deleted), I'd alter the sed:
-
- #
- # sed to
- # delete all filenames which don't end in 4 numbers, then
- # replicate the last four characters (numbers) as word 2
- #
- sed -e '/[0-9]\{4\}$/!d' \
-     -e 's/.\{4\}$/& &/' |
-
- to have it strip the non-matching files so we never get them in the loop.
-
- And if there are a bazillion files to handle, some matching some not, I'd
- take the filter job away from sed and toss a grep into the pipeline:
-
- # get the filenames
- ls |
-
- # get rid of files not ending in four numbers
- grep '[0-9]\{4\}$' |
-
- # duplicate the last four chars (numbers) as a second word
- sed 's/.\{4\}$/& &/' |
-
- This is because making three small calls (ls, grep, sed), each doing a
- little of exactly what it was designed to do, is faster than one program
- trying to do a lot against a lot of data. Either way, the stream will
- outrun the while read loop, so the loop body is where the time goes. There
- are more ways to speed it up if needed, but usually this is sufficient for
- even pretty heavy-duty lists. Testing for subdirectories while correctly
- handling symbolic links is another tricky part, but this article is
- already too long.
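
How much the split pipeline gains will depend on the system and the size of the list, so it is worth measuring on your own machine. The sketch below only checks that the sed-only filter and the grep-plus-sed filter agree on the same input. The function names filter_sed_only and filter_grep_sed and the sample names are hypothetical.

```shell
#!/bin/sh
# Variant 1: sed both deletes the non-matching names and duplicates the year.
filter_sed_only() {
    sed -e '/[0-9]\{4\}$/!d' \
        -e 's/.\{4\}$/& &/'
}

# Variant 2: grep does the filtering, sed only duplicates the year.
filter_grep_sed() {
    grep '[0-9]\{4\}$' |
    sed 's/.\{4\}$/& &/'
}

# Made-up sample list mixing matching and non-matching names.
names='xy1992
notes
rep2001
ab12'

a=`printf '%s\n' "$names" | filter_sed_only`
b=`printf '%s\n' "$names" | filter_grep_sed`
[ "$a" = "$b" ] && echo "both variants agree"
```

Running each function under the shell's time against a large real listing would give a rough comparison of the two.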
-
- One other thing, though. I used ls with no filename args because if we did
-
- ls -d *[0-9][0-9][0-9][0-9] |
-
- we'd actually be processing all the filenames twice in that one line: once
- by the shell (to build the argument list) and once by ls (to list them).
- Here's the last possible tweak:
-
- #
- # list the files that end with four numbers
- #
- echo *[0-9][0-9][0-9][0-9] |
-
- #
- # turn spaces into line-ends
- #
- tr ' ' '\012' |
-
- #
- # duplicate the last four chars (numbers) as a second word
- #
- sed 's/.\{4\}$/& &/' |
-
- I don't know of a faster approach at shell level.
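
One caveat with the glob, offered as a sketch and not as the original method: when nothing matches, Bourne-style shells normally pass the pattern through unchanged, so the pipeline would receive the literal *[0-9][0-9][0-9][0-9]. The version below guards against that and swaps echo and tr for printf '%s\n', which emits one name per line directly. The availability of printf and [ -e ] in your shell, and the function name list_year_files, are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print "name year" for every entry in directory $1
# whose name ends in four digits. Prints nothing if there are none.
list_year_files() {
    (
        cd "$1" || exit 1
        set -- *[0-9][0-9][0-9][0-9]
        # With no match the pattern stays literal and names no real file.
        [ -e "$1" ] || exit 0
        printf '%s\n' "$@" |
        sed 's/.\{4\}$/& &/'
    )
}
```

It would be used as the head of the pipeline, for example list_year_files "$dir" | while read file year, with the loop body unchanged. The parentheses keep the cd from affecting the calling shell.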
-
- ...Kris
- --
- Kristopher Stephens, | (408-746-6047) | krs@uts.amdahl.com | KC6DFS
- Amdahl Corporation | | |
- [The opinions expressed above are mine, solely, and do not ]
- [necessarily reflect the opinions or policies of Amdahl Corp. ]
-