- Comments: Gated by NETNEWS@AUVM.AMERICAN.EDU
- Path: sparky!uunet!paladin.american.edu!auvm!CUNYVM.BITNET!FXTBB
- Message-ID: <SAS-L%92082914054137@UGA.CC.UGA.EDU>
- Newsgroups: bit.listserv.sas-l
- Date: Sat, 29 Aug 1992 13:20:31 EDT
- Reply-To: Philip Tejera <FXTBB@CUNYVM.BITNET>
- Sender: "SAS(r) Discussion" <SAS-L@UGA.BITNET>
- From: Philip Tejera <FXTBB@CUNYVM.BITNET>
- Subject: Re: Duplicate records - IML approach
- Lines: 43
-
- As Tom Abernathy indicated in his posting, the Sort approach only works
- if it is possible to sort by ALL the variables. Even if this is possible
- (and it turns out that sometimes it is not), this could get expensive for a
- goodly number of variables with a large number of cases. So I decided to
- try an IML data-processing approach, and it turns out to be fairly easy.
-
- Of course, I had to make some assumptions. For simplicity, I assumed that
- the dataset was already sorted on a few fields that should result in
- unique cases (obviously they really don't, or there wouldn't be any
- duplicates). Also for simplicity, I assumed that the variables were
- all numeric, although character variables could be dealt with also if
- I had the manuals at home and more time to spend. I used the ANY function,
- which returns 1 if any element of its argument is nonzero, to test whether
- the differences between adjacent records were all zero, although it would
- be easy to substitute one's own definition of difference using IML's
- operators and/or functions. Finally, there must surely be
- some limitations on how big a dataset will fit in memory, but given the
- memory capacities of modern operating systems this shouldn't be too serious.
- So here is a simple example that works. Have fun improving it! For instance,
- it should be easy to implement Phil Gallagher's suggestion to also print
- out observations with duplicate keys but different data values.
-
- data first; /* set up some sample data */
- input id1 id2 a b c;
- cards;
- 01 11 101 201 301
- 01 12 102 202 302
- 02 11 101 201 301
- 02 11 101 201 301
- 02 12 102 202 302
- 03 12 102 202 302
- ;
- proc iml;
- use first; /* assume already sorted */
- read all into zed; /* zed assumed numeric matrix*/
- nr=nrow(zed)-1; /* number of rows for base */
- nv=1:ncol(zed); /* index vector for all vars */
- do i=1 to nr; /* i is base obs - OLD */
-   j=i+1; /* j is next obs - NEW */
-   old=zed[i,nv];
-   new=zed[j,nv];
-   anydiff=any(new-old); /* any non-zero differences? */
-   if anydiff = 0 then print j new; /* print obs. num & values */
- end;
- quit; /* leave IML */
-
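Phil Gallagher's suggestion mentioned above (printing observations with duplicate keys but different data values) could be sketched along the same lines. This is only a rough sketch, not a tested solution: the dataset name SECOND, its sample values, and the variable names nk, kv, and dv are made up for illustration, and it assumes the key fields are the first nk columns of an all-numeric dataset that is already sorted on those keys.

```sas
data second;                        /* hypothetical sample data          */
input id1 id2 a b c;
cards;
01 11 101 201 301
02 11 101 201 301
02 11 101 201 999
03 12 102 202 302
;
proc iml;
use second;                         /* assume already sorted on the keys */
read all into zed;                  /* zed assumed numeric matrix        */
nk=2;                               /* assumption: first nk cols are keys*/
kv=1:nk;                            /* index vector for key columns      */
dv=(nk+1):ncol(zed);                /* index vector for data columns     */
do i=1 to nrow(zed)-1;              /* i is base obs - OLD               */
  j=i+1;                            /* j is next obs - NEW               */
  keydiff=any(zed[j,kv]-zed[i,kv]); /* 1 if any key field differs        */
  datdiff=any(zed[j,dv]-zed[i,dv]); /* 1 if any data field differs       */
  if (keydiff=0) & (datdiff=1) then do;
    old=zed[i,];
    new=zed[j,];
    print i old;                    /* same keys, different data values  */
    print j new;
  end;
end;
quit;
```

Like the original loop, this compares only adjacent observations, so a run of three or more records sharing the same keys is reported pair by pair.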