retain的使用说明


2023年12月27日发(作者:talented的中文谐音)

SUGI 28104-28

Coders' CornerRETAINing Information to Identify Entity Characteristics

John D. Chapman, Ph.D., Markcelian Analytics, Inc., Cohasset, MA

ABSTRACT

A common requirement of data analysis is to identify an entity

characteristic that is not explicit in the data but is derivable from

it. The SAS RETAIN statement is a powerful and flexible tool for

analysts and programmers with such a requirement. This paper

introduces the RETAIN statement, along with its cousin the SUM

statement, and illustrates its use for this class of analytic

requirements. The concepts discussed apply to Base SAS® and

will be of most interest to beginning and intermediate users with

an interest in using the data step to derive information from data.

INTRODUCTION TO THE TASK OF IDENTIFYING

ENTITY CHARACTERISTICS

Data commonly contains information about entities, such as

customers, students, patients or stores. Some elements of data

explicitly describe entity characteristics, such as the location of

stores or the gender of students. Other characteristics may be

derivable from the elements on a single record, e.g., the age of

students derived from date of birth and school year. Frequently,

however, information about entity characteristics is not explicit

and is spread over many data records; the data must be

manipulated in order to express the information in a form that is

usable for analysis or reporting. For example, student records

that list all courses taken and grades received do not explicitly

identify the students who have never received an ‘A’, but taken

together they do contain that information.

OVERVIEW OF RETAIN

Extracting entity characteristics from multiple data records

requires consolidating and/or comparing data elements across

records. The SAS RETAIN statement enables such actions.

By default data elements created in a data step are set to missing

at the beginning of each DATA step iteration. The RETAIN

statement is used to modify this behavior so that a value changes

only at the behest of the programmer. It may also be used to

assign an initial value to a data element. For example, the

statement:

RETAIN sample 0;

will cause the creation of a data element named SAMPLE with an

initial value of 0. The value of SAMPLE will continue to be 0 until

it is explicitly changed, regardless of how many times the data

step iterates.

RETAIN may also be invoked with the SUM statement, as in :

sample + ;

which is equivalent to:

RETAIN sample 0;

sample = sample + ;

RETAINed data elements can be used to carry information

across iterations of the DATA step. In conjunction with other

techniques such as sorting and using the FIRST and LAST

automatic variables, the RETAIN statement enables the

programmer to implement logic that requires information from

multiple data records. Identifying entity characteristics is a

valuable application of this capability.

SOME TYPICAL CASES

The following three case studies demonstrate the application of

RETAIN to the task of identifying entity characteristics.

CASE #1

A simple case is when a single data element has all the

information required about the desired characteristic. Consider,

for example a dataset containing student grades for a term, which

is illustrated by:

data Grade_Book ;

infile datalines ;

input Name $ Class $ Grade $;

datalines;

Alice Math A

Alice English B

Alice Physics A

Alice History B

Alice French C

Alice Art A

Joe Math B

Joe English C

Joe Physics C

Joe History B

Joe French B

Joe Art B

Sue Math A

Sue English A

Sue Physics B

Sue History A

Sue French B

Sue Art A

run;

The requirement is to identify students who qualify for the honor

roll, which requires that they have received at least three A

grades and no grade of C or lower. The data element GRADE in

the GRADE_BOOK dataset has the required information. The

following data step illustrates how information about grades is

retained from record to record, in this case using the SUM

statement, and then evaluated to determine which student names

should be put in the HONOR_ROLL dataset.

data Honor_Roll (keep = name); ;

1 set Grade_Book ;

5 by name ;

3 if then do;

Num_A = 0;

Num_C = 0;

2 end;

if Grade = 'A' then Num_A + 1 ;

if Grade >= 'C' then Num_C + 1 ;

4 if and Num_A >=3 and Num_C = 0

then output;

run;

This data step has several elements that are typical of the

technique:

1. The source dataset is SET by NAME, the variable that

identifies students, which is the entity of interest. This

of course requires that it be sorted or indexed by

NAME.

2. There are one or more counters, NUM_A and NUM_C

1

SUGI 28in this case, which are RETAINed across iterations.

Here they are coded using the SUM statement though

RETAIN could be used explicitly as well.

3. The FIRST automatic variable is used to reset the

counter(s) for each student.

4. The LAST automatic variable is used to control

evaluation of the counter(s) and assignment of

qualifying students to the honor roll. When the last

record for Alice is evaluated NUM_A = 3 and NUM_C =

1. Alice’s C in French disqualifies her so no record is

output. When Joe’s last record is encountered NUM_A

= 2 so he does not qualify and GRADE_BOOK is still

empty. Finally, at Sue’s last record the criteria are met

and Sue’s name is written to GRADE_BOOK.

5. The result is a dataset with one name, Sue.

CASE #2

The objective in the second case is to identify ‘loyal’ customers in

a set of customer purchase data. Loyal customers are defined as

those who have made at least three purchases in any given 90

day period. The data provides a record for each purchase, the

attributes of which are the date and amount of the purchase.

data PurchaseHistory ;

format Date date9. Purchase dollar6.2;

informat Date Date9. ;

infile datalines ;

input Customer_ID $ Date Purchase;

datalines;

1 27OCT2002 40

1 15DEC2002 25

1 9JAN2003 24

1 30JAN2003 50

2 15NOV2002 30

2 27FEB2003 40

2 27MAR2003 40

run;

The illustration is simplified by including only customers with at

least three purchases and considering only the period beginning

with a customer’s first purchase and ending with his/her last

purchase.

This data has two sources of information that must be considered

to determine customer loyalty. The existence of a purchase,

indicated by the presence of a record, is one source of relevant

information, and the difference between dates on successive

records is another. The following data step uses RETAINed

variables to identify the loyal customers.

data Loyal ;

set PurchaseHistory ;

by Customer_ID Date ;

12

retain QualDate1 QualDate2 ;

if er_ID then do ;

QualDate1 = Date ;

QualDate2 = . ;

IsLoyal = 0 ;

end;

else if QualDate2 = . then do ;

QualDate2 = Date ;

end;

else do ;

if Date - QualDate1 > 90

3

then IsLoyal + 1;

QualDate1 = QualDate2 ;

QualDate2 = Date ;

end;

if er_ID and IsLoyal = 0

then output;

Coders' Corner

run;

Features which distinguish this case from the previous one are:

1. The source data is processed by the entity to be

described, as in case #1, and then for each customer

by DATE, since the relationship between dates of

purchase is an aspect of loyalty..

2. The RETAIN statement is used to initialize two

variables that will carry the DATE from one purchase

record to the next so time between purchases can be

computed. At the beginning of each iteration

QUALDATE1 is the date of the second previous

purchase and QUALDATE2 is the date of the most

recent purchase.

3. RETAINed information is used in conjunction with

information in the current record to evaluate each

customer’s loyalty. When a record is encountered for

which the second previous purchase (QUALDATE1)

was more than 90 days before the current one (DATE)

the customer is identified as not loyal. The values of

DATE, QUALDATE1 and QUALDATE2 when this

evaluation is done (i.e., beginning with the third

purchase for each customer) are:

ID QUALDATE1 QUALDATE2 DATE

1 27OCT2002 15DEC2002 09JAN2003

1 15DEC2002 09JAN2003 30JAN2003

2 15NOV2002 27FEB2003 27MAR2003

In both cases for customer 1 the difference between

DATE and QUALDATE2 is less than 90 days, so

ISLOYAL remains 0 until the last record and customer

1 is included in the LOYAL dataset. For customer 2

the difference exceeds 90 days, so ISLOYAL is

assigned a value of 1 and no record is output for

customer 2.

Note that in a situation like this the LAG function may also be

used to RETAIN dates from one period to the next.

CASE #3

In our final case the requirement is to identify polypharmacy

cases from prescription drug claim records. The polypharmacy

cases to be identified are those patients who are taking two or

more drugs, each prescribed by a different doctor, in each of

three drug classes.

data RxClaims ;

infile datalines ;

input Patient_ID $ Doctor $ Drug_Class $

Drug_Num $;

datalines;

1 Miller A 101

1 Sanchez B 111

1 Hayes B 112

1 Corbin C 113

1 Singh C 133

1 Leary A 104

2 Miller A 102

2 Sanchez A 103

2 Hayes B 114

2 Hayes B 116

2 Corbin C 123

2 Singh C 125

3 Anders A 101

3 Leary A 103

3 Jones B 115

3 Sanchez C 123

3 Singh C 126

run;

2

SUGI 28

To simplify matters, the illustration does not have multiple claims

for a single drug.

Identifying the polypharmacy cases requires considering multiple

attributes of each transaction record, in the context of the values

of those attributes on other records. Such cases will often involve

repeating the techniques described in the previous cases in a

‘nested’ fashion.

data PolyPharm (keep = Patient_ID);

set RxClaims ;

by Patient_ID Drug_Class Drug_Num;

1 Retain Doctor1 ;

if t_ID

then Drug_Classes = 0 ;

3 if _Class then do ;

Drug_Nums = 1 ;

Doctor1 = Doctor ;

end;

if Drug_Nums = 1 and Doctor ne Doctor1

then Drug_Nums + 1 ;

if _Class and Drug_Nums>1

2 then Drug_Classes + 1 ;

if t_ID and Drug_Classes>=3

then output PolyPharm;

run;

This code illustrates such nesting of the techniques used in the

previous cases.

1. Data is ordered and set by the entity variable

(PATIENT_ID) and then by two criterion variables

(DRUG_CLASS and DRUG_NUM).

2. A counter variable tracks each criterion variable, with

the incrementing of the second dependent on the value

of the first. The counters, DRUG_NUMS and

DRUG_CLASSES, are RETAINed by use of the SUM

statement.

3. The FIRST automatic variable is invoked iteratively,

first for PATIENT_ID and then for DRUG_CLASS, to

initialize the counters first for each patient and then for

each drug class received by the patient.

It is left as an exercise for the reader to determine the value of

DOCTOR1, DRUG_NUMS and DRUG_CLASSES at each step of

each iteration, and the reason POLYPHARM includes patient 1

and not patients 2 or 3

ALTERNATIVES

It should be noted that in some cases PROC SQL can be used to

accomplish this task more efficiently, most notably in cases (like

the first) for which the requirement can be implemented solely by

counting by sets. Even in these cases, though, there are

instances when the methods discussed here may reasonably be

used. These include when the programmer is unfamiliar with

SQL, and when a data step has to be used anyways for other

purposes.

CONCLUSION

The RETAIN statement is a useful tool for the analyst needing to

identify and label a characteristic of entities when the information

that identifies the characteristic is spread over multiple records.

REFERENCES

SAS Institute, Inc., 1999. SAS Online Doc, Version 8. Cary,

NC; SAS Institute, Inc.

Coders' Corner

Aster, R. , 2000. Professional SAS Programming Logic. Paoli,

PA; Breakfast Communications, Inc.

CONTACT INFORMATION

Your comments and questions are valued and encouraged.

Contact the author at:

John D. Chapman, Ph.D.

Markcelian Analytics, Inc.

25 Virginia Lane

Cohasset, MA 02025

(781) 383-9793

jchapman@

SAS and all other SAS Institute Inc. product or service names are

registered trademarks or trademarks of SAS Institute Inc. in the

USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their

respective companies.

3


本文发布于:2024-09-24 06:25:50,感谢您对本站的认可!

本文链接:https://www.17tex.com/fanyi/39050.html

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。

标签:
留言与评论(共有 0 条评论)
   
验证码:
Copyright ©2019-2024 Comsenz Inc.Powered by © 易纺专利技术学习网 豫ICP备2022007602号 豫公网安备41160202000603 站长QQ:729038198 关于我们 投诉建议