NetNews Usenet Archive 1992 #19

Text File | 1992-08-29 | 2.6 KB | 55 lines

Comments: Gated by NETNEWS@AUVM.AMERICAN.EDU
Path: sparky!uunet!paladin.american.edu!auvm!CUNYVM.BITNET!FXTBB
Message-ID: <SAS-L%92082914054137@UGA.CC.UGA.EDU>
Newsgroups: bit.listserv.sas-l
Date: Sat, 29 Aug 1992 13:20:31 EDT
Reply-To: Philip Tejera <FXTBB@CUNYVM.BITNET>
Sender: "SAS(r) Discussion" <SAS-L@UGA.BITNET>
From: Philip Tejera <FXTBB@CUNYVM.BITNET>
Subject: Re: Duplicate records - IML approach
Lines: 43

As Tom Abernathy indicated in his posting, the Sort approach only works if it is possible to sort by ALL the variables. Even if this is possible (and it turned out it sometimes is not), this could get expensive for a goodly number of variables with a large number of cases. So I decided to try an IML data-processing approach, and it turns out to be fairly easy.

Of course, I had to make some assumptions. For simplicity, I assumed that the dataset was already sorted on a few fields that should result in unique cases (obviously they really don't, or there wouldn't be any duplicates). Also for simplicity, I assumed that the variables were all numeric, although character variables could be dealt with also if I had the manuals at home and more time to spend. I used the ANY function to determine if the differences between adjacent records were zero, although it would be easy to substitute one's own definition of difference using IML's operators and/or functions. Finally, for sure there must be some limitations on how big a dataset will fit in memory, but given the memory capacities of modern operating systems this shouldn't be too serious.

So here is a simple example that works. Have fun improving it! For instance, it should be easy to implement Phil Gallagher's suggestion to also print out observations with duplicate keys but different data values.
data first;                         /* set up some sample data    */
  input id1 id2 a b c;
cards;
01 11 101 201 301
01 12 102 202 302
02 11 101 201 301
02 11 101 201 301
02 12 102 202 302
03 12 102 202 302
;
proc iml;
use first;                          /* assume already sorted      */
read all into zed;                  /* zed assumed numeric matrix */
nr=nrow(zed)-1;                     /* number of rows for base    */
nv=1:ncol(zed);                     /* index vector for all vars  */
do i=1 to nr;                       /* i is base obs - OLD        */
   j=i+1;                           /* j is next obs - NEW        */
   old=zed[i,nv];
   new=zed[j,nv];
   anydiff=any(new-old);            /* any non-zero differences?  */
   if anydiff = 0 then print j new; /* print obs. num & values    */
end;
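
As a rough sketch of the Phil Gallagher variation mentioned above (duplicate keys but different data values), the same adjacent-row comparison can be done twice, once over the key columns and once over the rest. This is not from the original posting. It assumes the keys are the first two columns (id1, id2), that everything is numeric, and that the data are already sorted by the keys. The dataset name "second" and the altered value 999 are made up for illustration.

```sas
data second;                        /* hypothetical sample: obs 2 and 3    */
  input id1 id2 a b c;              /* share keys but differ in c          */
cards;
01 11 101 201 301
02 11 101 201 301
02 11 101 201 999
02 12 102 202 302
;
proc iml;
use second;                         /* assume already sorted by id1 id2    */
read all into zed;                  /* zed assumed numeric matrix          */
nr=nrow(zed)-1;                     /* number of rows for base             */
nv=1:ncol(zed);                     /* index vector for all vars           */
kv=1:2;                             /* ASSUMPTION: keys are columns 1-2    */
dv=3:ncol(zed);                     /* remaining columns are data values   */
do i=1 to nr;                       /* i is base obs - OLD                 */
   j=i+1;                           /* j is next obs - NEW                 */
   new=zed[j,nv];
   keydiff=any(zed[j,kv]-zed[i,kv]); /* any non-zero key differences?      */
   datdiff=any(zed[j,dv]-zed[i,dv]); /* any non-zero data differences?     */
   if (keydiff = 0) & (datdiff ^= 0) then print j new; /* same key, different data */
end;
quit;
```

With the made-up data above this should flag observation 3 only. Run against the original "first" dataset it should print nothing, since the only repeated key there also has identical data values.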