home *** CD-ROM | disk | FTP | other *** search
- Path: sparky!uunet!utcsri!torn!nott!dgbt!netfs!ub!zaphod.mps.ohio-state.edu!darwin.sura.net!europa.asd.contel.com!paladin.american.edu!auvm!NIHCU.BITNET!HIS
- From: HIS@NIHCU.BITNET (Howard Schreier)
- Newsgroups: bit.listserv.sas-l
- Subject: Tricky Lookback Problem
- Message-ID: <SAS-L%92111210220231@UGA.CC.UGA.EDU>
- Date: 12 Nov 92 15:21:21 GMT
- Sender: "SAS(r) Discussion" <SAS-L@UGA.BITNET>
- Reply-To: Howard Schreier <HIS@NIHCU.BITNET>
- Lines: 121
- Comments: Gated by NETNEWS@AUVM.AMERICAN.EDU
-
- CONTENT: Response
- SUMMARY: DATA step can be more efficient than SQL
-
- > From: GFX@PSUVM.PSU.EDU
- >
- > The dataset is structured like this:
- >
- > OBS Date x1 x2 ... xn N_12Mo
- > 1 04/12/82 1 0 1
- > 2 04/21/82 1 1 0
- > 3 04/28/82 0 0 1
- > .
- > .
- > K
- >
- > Variable N_12Mo should indicate the number of observations
- > that are less than a year old. In other words, how many
- > lines above observation K are less than a year old. We used
- > the LOOKUP procedure in Microsoft Excel. How could it be
- > done in SAS?
- >
- > Thanks,
- > Stephane
-
- I've been thinking about this one off and on for a few days.
- I assume that there is at most one observation per day and
- that the file is sorted, so that "above" translates to
- *strictly* earlier. Also, I agree with Bob Snyder's advice
- on using the SAS convention for representing dates
- internally. So, here is my test file:
-
- data test; input date mmddyy8. x1-x4; cards;
- 04/11/90 0 0 0 1
- 04/12/90 0 0 1 0
- 04/13/90 0 0 1 1
- 04/11/91 0 1 0 0
- 04/12/91 0 1 0 1
- 04/13/91 0 1 1 0
- 04/11/92 0 1 1 1
- 04/12/92 1 0 0 0
- 04/13/92 1 0 0 1
- 04/14/92 1 0 1 0
- 04/15/92 1 0 1 1
- ;
-
- A solution using SQL is fairly natural:
-
- proc sql;
- create table sresult as
- select distinct test.date format=mmddyy8.,
- test.x1,test.x2,test.x3,test.x4,
- n(joindate.date) as n_12mo
- from test left join test as joindate
- on mdy(month(test.date),
- day (test.date),
- year (test.date) - 1) < joindate.date < test.date
- group by test.date
- order by test.date
- ;
-
- The essential technique is to JOIN the data set with itself
- (LEFT JOIN is used, to generate a row even when the answer
- is zero). The ON condition implements the one-year
- lookback; by using MDY instead of just subtracting 365, the
- query gives correct results even when a leap day (Feb. 29)
- intervenes.
-
- So, SQL can do the job, but... (and this is a big ol' "but"
- :-)). The JOIN required is *not* an "equijoin" and
- therefore cannot be optimized very much; in other words,
- PROC SQL will be forced to use brute force in evaluating it.
- Moreover, I suspect that the real data set here may be
- fairly large. If it has just 5K observations, the JOIN will
- entail forming 5Kx5K=25M pairings before the ON condition
- whittles it down and the DISTINCT option purges duplicate
- rows.
-
- So, unless the data set is quite small, the SQL solution
- will be very inefficient, and a different approach is
- needed. This is a situation where the flexibility of the
- DATA step really helps:
-
- data dresult;
- set test;
- year_ago = mdy(month(date),day(date),year(date)-1);
- if backwhen<=year_ago then do until (backwhen>year_ago);
- set test(keep=date rename=(date=backwhen));
- count + 1;
- end;
- n_12mo = _n_ - count;
- format date year_ago backwhen mmddyy8.;
-
- The presence of two SET statements makes this pretty
- unusual. It takes advantage of the fact that multiple SET
- statements operating on the same data set maintain
- independent pointers. The first SET works conventionally,
- advancing one observation per iteration. The second one
- advances as necessary to remain a year behind.
-
- I get the same results using either SQL or the DATA step:
-
- OBS DATE X1 X2 X3 X4 N_12MO
-
- 1 04/11/90 0 0 0 1 0
- 2 04/12/90 0 0 1 0 1
- 3 04/13/90 0 0 1 1 2
- 4 04/11/91 0 1 0 0 2
- 5 04/12/91 0 1 0 1 2
- 6 04/13/91 0 1 1 0 2
- 7 04/11/92 0 1 1 1 2
- 8 04/12/92 1 0 0 0 2
- 9 04/13/92 1 0 0 1 2
- 10 04/14/92 1 0 1 0 3
- 11 04/15/92 1 0 1 1 4
-
- /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
- \ Howard Schreier, U.S. Dept. of Commerce, Washington /
- / MVS 5.18 & 6.07 \
- \ Voice: (202) 377-4180 BITNET: HIS@NIHCU /
- / Fax: (202) 377-4614 INTERNET: HIS@CU.NIH.GOV \
- \/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
-