
Paper SD-09
KNN Classification and Regression using SAS®
Liang Xie, The Travelers Companies, Inc.
ABSTRACT
K-Nearest Neighbor (KNN) classification and regression are two widely used analytic methods in the predictive modeling and data mining fields. They provide a way to model highly nonlinear decision boundaries and to fulfill many other analytical tasks, such as missing value imputation, local smoothing, etc.
In this paper, we discuss ways in SAS® to conduct KNN classification and KNN regression. Specifically, PROC DISCRIM is used to build multi-class KNN classifications, and PROC KRIGE2D is used for KNN regression tasks. Technical details, such as tuning parameter selection, are discussed. We also discuss tips and tricks in using these two procedures for KNN classification and regression.
Examples are presented to demonstrate the full process flow of applying KNN classification and regression in real-world business projects.
INTRODUCTION
kNN stands for k Nearest Neighbor. In data mining and predictive modeling, it refers to a memory-based (or instance-based) algorithm for classification and regression problems. It is a widely used algorithm with many successful applications in medical research, business applications, etc. In fact, according to Google Analytics, it is the second most viewed article on my SAS programming blog of all time, with more than 2,200 views a year.
In classification problems, the label of a potential object is determined by the labels of the closest training data points in the feature space. The determination process is either "majority voting" or "averaging". In "majority voting", the object is assigned the label that is most frequent among the k closest training examples. In "averaging", the object is not assigned a label; instead, it receives the ratio of each class among the k closest training data points.
In regression problems, the property of the object is obtained via a similar "averaging" process, where the value of the object is the average value of the k closest training points.
In practice, both the "majority voting" and "averaging" processes can be refined by adding weights to the k closest training points, where the weights are inversely proportional to the distance between the object and the training point. In this way, the closest points have the biggest influence on the final result. In fact, the SAS implementation of kNN classification weights the averaging process by volume size. Interested readers are encouraged to read the manual for details.
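To make the weighted averaging concrete, the sketch below computes an inverse-distance weighted prediction in a DATA step. It assumes a hypothetical data set neighbors that holds, for each scoring point id, the k selected neighbors' responses y and distances dist; these names are illustrative only and are not output by any particular procedure.

proc sort data=neighbors;
   by id;
run;

data knn_wavg;
   set neighbors;
   by id;
   if first.id then do;            /* reset accumulators for each scoring point */
      _sumw  = 0;
      _sumwy = 0;
   end;
   _w = 1 / max(dist, 1e-12);      /* inverse-distance weight; guard against zero distance */
   _sumw  + _w;                    /* sum statements retain across rows */
   _sumwy + _w * y;
   if last.id then do;
      pred = _sumwy / _sumw;       /* weighted average over the k neighbors */
      output;
   end;
   keep id pred;
run;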
CHARACTERISTICS OF KNN-BASED ALGORITHMS
The kNN algorithm has several characteristics that are worth addressing. Understanding its advantages and disadvantages, in theory and in practice, will help practitioners better leverage this tool. Generally, the kNN algorithm has 4 advantages worth noting.
1. It is simple to implement. The naive version of the algorithm is easy to implement: for every data point in the test sample, directly compute the desired distances to all stored vectors, and choose the k closest examples among them. It is, however, computationally intensive, especially as the size of the training set grows. Over the years, many nearest neighbor search algorithms have been proposed, seeking to reduce the number of distance evaluations actually performed. Using an appropriate nearest neighbor search algorithm makes kNN computationally tractable even for large data sets. In SAS, the k-d tree data structure and the associated search algorithm of Friedman et al. are implemented.
2. It is analytically tractable.
3. It is highly adaptive to local information. The kNN algorithm uses the closest data points for estimation; therefore, it is able to take full advantage of local information and form highly nonlinear, highly adaptive decision boundaries for each data point.
4. It is easily implemented in parallel. Because it is instance based, for each data point to be scored, the algorithm checks against the training table for the k nearest neighbors. Since each data point is independent of the others, the search and scoring can be executed in parallel.
On the other hand, the kNN algorithm has 3 disadvantages, which will be addressed in detail in the section PRACTICAL ISSUES ON IMPLEMENTATION.
1. It is computationally very intensive, because for each record to be classified and/or scored, the program has to scan through the available data points to find the nearest neighbors and then determine which ones to use for further calculation.
2. It is demanding in storage, because all records in the training sample have to be stored and used whenever scoring is necessary.
3. It is highly susceptible to the curse of dimensionality, because as the dimension increases, calculating distances and finding a neighborhood that contains at least one relevant record become difficult.
KNN USING SAS
In order to conduct either KNN classification or KNN regression in SAS, there are two basic approaches. The first one utilizes special options in certain SAS procedures to conduct the KNN analysis directly; I call it the "Direct Approach". The second method first finds the nearest data points explicitly, leveraging the search capability of some SAS procedures, and then manually calculates the final response values, which is simple anyway; I call it the "Manual Approach".
To take the manual approach, there are three PROCs that are capable of finding nearest neighbor points in an efficient way, namely PROC FASTCLUS, PROC KRIGE2D, and PROC MODECLUS. Each has its pros and cons, and all need data preparation. Discussion of this issue is beyond the scope of this paper.
To take the direct approach, PROC DISCRIM can be used directly for KNN classification, while PROC KRIGE2D can be used directly for KNN regression, even though the functionality SAS provides is only basic. For kNN-based classification, PROC DISCRIM offers an intuitive programming interface and rich options. For kNN-based regression, even though there is no dedicated procedure available, PROC KRIGE2D can be used to fulfill this task, with some tweaks. In this section, we will demonstrate how to conduct these two data mining tasks in SAS and address some closely related issues. For kNN-based regression, we will discuss the details of how to tweak PROC KRIGE2D.
KNN CLASSIFICATION USING SAS
SAS offers kNN classification capability via PROC DISCRIM, by invoking the following procedure options:
METHOD=NPAR K=
The METHOD=NPAR option asks SAS to use a non-parametric discrimination function; together with the K= option, PROC DISCRIM will perform kNN classification, where K= tells SAS how many neighbors to use in determining the class membership. A typical SAS program for kNN-based classification looks like the following:
proc discrim data=train              (1)
             method=npar k=5         (2)
             testdata=toscore        (3)
             testout=toscore_out     (4)
             ;
   class y;                          (5)
   var x1-x10;                       (6)
run;
We explain the function of each statement below.
1. Statement [1] tells SAS that PROC DISCRIM should be called to process the data set named train.
2. Statement [2] tells SAS to apply the kNN classification method using 5 nearest neighbors.
3. Statement [3] tells SAS that the classification rule is to be applied to a test data set called toscore.
4. Statement [4] tells SAS to output the classification results for the test data and to name the output data set toscore_out.
5. Statement [5] defines Y as the dependent variable to use in PROC DISCRIM.
6. Statement [6] tells SAS to use x1 to x10 as input features in calculating the distance to neighbors.
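The number of neighbors K= is a tuning parameter and is usually selected by comparing hold-out error across candidate values. Below is a minimal sketch under these assumptions: a validation data set valid exists that contains the true label y, and the _INTO_ variable that PROC DISCRIM writes to the TESTOUT= data set holds the assigned class. The data set and variable names follow the example above.

%macro tune_k(klist);
   %local i k;
   %let i = 1;
   %do %while(%scan(&klist, &i, %str( )) ne );
      %let k = %scan(&klist, &i, %str( ));
      proc discrim data=train method=npar k=&k
                   testdata=valid testout=valid_out noprint;
         class y;
         var x1-x10;
      run;
      data _null_;                  /* hold-out misclassification rate for this k */
         set valid_out end=eof;
         _err + (y ne _into_);      /* count disagreements with the true label */
         _n + 1;
         if eof then do;
            rate = _err / _n;
            put "k=&k error rate: " rate 6.4;
         end;
      run;
      %let i = %eval(&i + 1);
   %end;
%mend;

%tune_k(1 3 5 7 9)

The k with the lowest hold-out (or cross-validated) error rate is then used for the final scoring run.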
kNN classification is computationally intensive. Based on the SAS manual [4], for the specific tree search algorithm implemented in PROC DISCRIM, the time usage, excluding I/O time, is roughly proportional to log(N) * (N * P), where N is the number of observations and P is the number of variables used.
kNN is a memory-based method: when an analyst wants to score the test data or new data in production, the whole raw table will be loaded into memory during the scoring process. In terms of memory, PROC DISCRIM uses an amount approximately proportional to the square of the number of variables, P².
Therefore, it would be interesting to see how the cost of computing, in terms of memory usage and time consumption, increases as the number of observations in the training data and the number of features increase.
KNN REGRESSION USING SAS
The kNN technique can be applied to regression problems, too, but the coding in SAS is not as straightforward as in a classification problem. kNN regression uses the average value of the dependent variable over the selected nearest neighbors to generate the predicted value for a scoring data point. In SAS, we can use PROC KRIGE2D to conduct this analysis, but it needs more caution and some tweaks.
The following example code instructs SAS to do a kNN regression on variable Y using 5 nearest neighbors and 2 features: X1 and X2.
proc krige2d data=train outest=pred outn=NNlist;   (1)
   coordinate xc=X1 yc=X2;                         (2)
   predict var=Y NP=5;                             (3)
   model scale=10 range=1E-7 form=exp;             (4)
   grid griddata=toscore xc=X1 yc=X2;              (5)
run;
1. Statement [1] invokes the KRIGE2D procedure, asks SAS to operate on the data set named train, and outputs the prediction to a data set named pred via the OUTEST= option. The OUTN= option asks SAS to output information about the selected nearest neighbors of each data point in the data set specified in the GRID statement. KRIGE2D will directly output the prediction of the dependent variable in the data set from OUTEST=, but it only outputs the corresponding values of the dependent variable at the selected closest data points, as well as their coordinates. This is a disadvantage in the sense that, when cross validation is conducted, the data set from OUTN= does not contain the actual value of the dependent variable, and it has to be merged back to the validation part to obtain actual values so that errors can be computed; see the sketch after this list.
2. Statement [2] tells KRIGE2D the names of the 2 coordinates to be used, or, in the language of data mining, the 2 features to be used. In this example, they are X1 and X2.
3. Statement [3] tells KRIGE2D to use 5 nearest neighbors to predict the dependent variable's value at the current grid position. As discussed later, the number of nearest neighbors is a tuning parameter in this algorithm and should be determined using data-driven approaches, such as cross validation.
4. Statement [4] tells KRIGE2D, via the FORM= option, what model is to be used in kriging, along with the associated parameters, scale and range. Note that, in order to conduct a kNN regression, it is required to specify a large scale value and a small range value. In the example above, scale is set to 10 while range is set to 1E-7. Typically, a scale value larger than 1 and a range value smaller than 1E-7 work well. The type of model is irrelevant. The reason why this works is beyond the scope of this paper, and we will not discuss this issue here.
5. Statement [5] tells KRIGE2D, via the GRIDDATA= option, that the name of the data set to be scored is toscore, along with the corresponding names of the features (coordinates) to be used for scoring.
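As noted in item 1, computing validation errors requires re-attaching the actual response. The sketch below assumes, as in our installation, that the OUTEST= data set pred stores the grid coordinates as GXC and GYC and the prediction as ESTIMATE, and that the validation data set valid carries Y along with the two features; these variable names should be verified against the KRIGE2D documentation.

data pred2;
   set pred;
   rename gxc=x1 gyc=x2 estimate=yhat;   /* align key names with the validation data */
run;

proc sort data=pred2; by x1 x2; run;
proc sort data=valid; by x1 x2; run;

data scored;
   merge valid(in=a) pred2(in=b);
   by x1 x2;
   if a and b;
   sqerr = (y - yhat)**2;                /* squared error per validation point */
run;

proc means data=scored mean;
   var sqerr;                            /* mean squared error for this run */
run;

Because the merge keys are floating-point coordinates, it is safer in practice to round both keys, or to keep the grid data in a known order and match by position.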
PROC KRIGE2D was designed to perform ordinary kriging in two dimensions; therefore it is not fully functional for standard kNN regression practice. There are two major issues with PROC KRIGE2D when it is applied in a kNN regression analysis. The first problem is that PROC KRIGE2D accepts two and only two inputs as features, because it treats kNN regression as two-dimensional kriging. The second problem is due to the nature of the semivariogram model used in kriging: in cases where the data are dense and highly overlapped, PROC KRIGE2D will report singularity in the kriging system. However, these issues should not bother analysts in practice, for good reasons that will be detailed below.
Overcome Fixed Number of Features in KRIGE2D In a real kNN regression application, the number of features ranges from 1 to hundreds if necessary, but PROC KRIGE2D requires exactly two features as inputs. When only 1 feature is used, or when more than 2 features are required, the analyst can take a detour by pre-processing the data up front. We will discuss these two cases separately below.
1. Only one feature. When only one feature is used, we can set up a dummy input, which is a constant number, in both the training data and the scoring data. The dummy variable cancels out in the calculation, so it has virtually no effect. The code below shows a demonstration:
data trainv / view=trainv;
   set train;
   dummy = 1;
run;

data toscorev / view=toscorev;
   set toscore;
   dummy = 1;
run;
proc krige2d data=trainv outest=pred outn=knn;
   coordinate xc=dummy yc=feature1;
   predict var=Y NP=5;
   model scale=10 range=1E-7 form=exp;
   grid gdata=toscorev xc=dummy yc=feature1;
run;
In the example code above, we used SAS data views to avoid rewriting the data sets and to save computing resources. This technique is explained in [10].
2. More than two features. When more than two features are used, there are two approaches to handle this problem. The first approach conducts dimension reduction up front, and PROC KRIGE2D takes the two derived dimensions as features to perform kNN regression; the second approach follows the ensemble principle [5] and randomly selects two features out of the pool of available features. The final kNN regression prediction is an average of the predictions from each of the ensemble regressions. In practice, the first approach can use the ensemble principle, too. There are many dimension reduction techniques; among them, the most widely used is Singular Value Decomposition (SVD), or its equivalent counterpart, Principal Component Analysis (PCA). Details of SVD and PCA are beyond the scope of this paper; interested readers can refer to [7], [4].
The example code in Appendix 1 demonstrates the first method, where we conduct dimension reduction using PCA up front and use the leading two orthogonal bases for kNN regression.
Note that in the example code, we first combine the train and score data sets, because PCA is sample sensitive [7], and using records from both the training data and the scoring data will stabilize the generated orthogonal bases.
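A minimal sketch of this pre-processing, reusing the data set and feature names from the earlier examples, is shown below; PROC PRINCOMP extracts the leading two components from the stacked data, and the scores Prin1 and Prin2 then serve as the two coordinates for PROC KRIGE2D.

data stacked;                      /* stack training and scoring data to stabilize the bases */
   set train(in=a) toscore;
   _train = a;
run;

proc princomp data=stacked n=2 out=stacked_pc std noprint;
   var x1-x10;
run;

data train_pc toscore_pc;          /* split back after projection */
   set stacked_pc;
   if _train then output train_pc;
   else output toscore_pc;
run;

proc krige2d data=train_pc outest=pred outn=knn;
   coordinate xc=prin1 yc=prin2;
   predict var=Y NP=5;
   model scale=10 range=1E-7 form=exp;
   grid gdata=toscore_pc xc=prin1 yc=prin2;
run;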
The second method is demonstrated below. It is worth noting that, when the number of features is small, say 10, it is good practice to select all pairwise combinations as input features and to ensemble all results at the final stage. On the other hand, since the number of such combinations blows up as the number of features increases, a random sample of pairwise combinations works fine, just like in a RandomForest algorithm. In the code below, we assume 100 features whose names are stored in a table called FEATURES, and the ensemble iterates 50 times, each with a pair of randomly selected features.
%macro knnreg(train_dsn, score_dsn,
              feature1, feature2,
              loopid);
proc krige2d data=&train_dsn
             outest=pred&loopid
             outn=nn&loopid;
   coord xc=&feature1 yc=&feature2;
   pred var=Y NP=10;
   model scale=2 range=1E-8 form=exp;
   grid gdata=&score_dsn xc=&feature1 yc=&feature2;
run;
%mend;
%let nloops = 50;
proc surveyselect data=features out=feature2
                  reps=&nloops sampsize=2 method=srs;
run;
data _null_;
   set feature2;
   by replicate;
   retain train_dsn 'train' score_dsn 'toscore';
   retain _id 0;
   array _f{2} $32 _temporary_;
   if first.replicate then do;
      call missing(of _f[*]);
      _id = 1;
   end;
   _f[_id] = varname;
   _id + 1;
   if last.replicate then do;
      /* pass the replicate number as loopid so each run writes pred<loopid> and nn<loopid> */
      call execute('%knnreg('
                   || compress(train_dsn) || ','
                   || compress(score_dsn) || ','
                   || compress(_f[1])     || ','
                   || compress(_f[2])     || ','
                   || compress(put(replicate, best.)) || ')'
                   );
   end;
run;
The CALL EXECUTE statement calls the pre-defined macro %knnreg within a DATA step. For details about the CALL EXECUTE routine, consult [9].
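Each run of %knnreg leaves one prediction data set per replicate (pred1 through pred50, given the loopid argument). If every OUTEST= data set scores the same grid data in the same order, the ensemble prediction can be formed by averaging position-wise. A minimal sketch under that assumption, also assuming the prediction variable in OUTEST= is named ESTIMATE and using the INDSNAME= option of the SET statement (SAS 9.2 or later):

data ensemble;
   length _ds $41;
   set pred1-pred50 indsname=_ds;        /* _ds identifies the source data set */
   if _ds ne lag(_ds) then _pt = 0;      /* new replicate: restart the grid-point counter */
   _pt + 1;                              /* _pt indexes the scoring record */
   keep _pt estimate;
run;

proc means data=ensemble noprint nway;
   class _pt;
   var estimate;
   output out=final_pred(drop=_type_ _freq_) mean=yhat;
run;

Note that this position-wise matching presumes every replicate writes one output row per grid point; if singular points are dropped from OUTEST= instead, carry an identifier through the scoring data and match on it.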
Singularity in the kriging process When the scoring point is extremely close to at least one of the training records, the local ordinary kriging process will be singular and the estimation for the scoring point will be skipped. One solution is to set a large number for scale and a small number for range in the MODEL statement, for example:
MODEL SCALE=10 RANGE=1E-8 FORM=EXP;
In extreme cases, while inconvenient, it is necessary to manually process the data set containing the nearest neighboring points from OUTN= to generate predictions. This turns out to be fairly simple, because ordinary kNN regression only requires averaging every k observations. This technique is demonstrated in Appendix 2. In such cases, it is unnecessary to output predictions from KRIGE2D directly, and the output can be suppressed by specifying OUTEST=_NULL_.
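Because the OUTN= data set lists the k selected neighbors of each grid point on consecutive rows, the manual prediction reduces to averaging every k rows. A minimal sketch with k=5 is below; it assumes the neighbor's response in OUTN= is stored in a variable named VALUE, which should be verified against the KRIGE2D documentation for the installed release.

%let k = 5;

data manual_pred;
   set nnlist;                       /* OUTN= data set from the earlier example */
   _sum + value;                     /* sum statement retains across rows */
   if mod(_n_, &k) = 0 then do;      /* every k-th row closes one grid point */
      yhat = _sum / &k;              /* plain (unweighted) kNN regression */
      output;
      _sum = 0;
   end;
   keep yhat;
run;

Each output row corresponds, in order, to one record of the scoring data, so the predictions can be matched back by position.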
PRACTICAL ISSUES ON IMPLEMENTATION
kNN is a proven technique in many classification and regression problems. To fully appreciate its power in real analyses, there are a couple of issues that need to be addressed appropriately. These issues touch the pain points of efficiency, accuracy, robustness, and more. In the sections below, we will first discuss efficiency, around storage and computation issues, and see how parallelization can improve efficiency. Then we will discuss model selection, which balances accuracy and robustness. Lastly, we will discuss other miscellaneous issues in applying KNN in practice.
OPTIMIZE STORAGE AND COMPUTATION
The kNN algorithm is expensive in both storage and computation. Therefore, it is important to understand the factors that contribute to the time and space complexity of this algorithm, and thus optimize the implementation in SAS.
Optimize Storage kNN classification requires a lot of storage because it is an in-memory algorithm: all the training information needed to score new data will be loaded into memory to build a kd-Tree when scoring is conducted. Therefore, when the size of the training data is large, and when the number of variables used is large, the required memory storage will be non-trivial. According to the SAS/STAT manual, the required storage for a kNN classification is at least c(32v + 3l + 128) + 8v² + 104v + 4l bytes, where v is the number of variables, c is the number of class levels of the dependent variable, and l is the length of the CLASS variable. Apparently, the storage requirement is linear in the latter two factors, which are given by the problem at hand, but quadratic in the number of variables. To optimize storage, it is necessary to reduce the number of variables involved in the analysis.
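As a hypothetical illustration, with v = 10 variables, c = 2 class levels, and a CLASS variable of length l = 8, the minimum is 2(320 + 24 + 128) + 800 + 1,040 + 32 = 2,816 bytes; because of the 8v² term, doubling v to 20 quadruples that quadratic component.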
1. Only choose the necessary variables for the kNN analysis. In many cases, 2 or 3 features are actually enough. Alternatively, as in the RandomForest algorithm, multiple runs of the kNN algorithm can be conducted, each with a small subset of variables, and the resulting classifiers or regressions can be ensembled into the final solution.
2. When the number of selected variables is still relatively large, one commonly used technique for dimension reduction is Singular Value Decomposition (SVD) or, equivalently, Principal Component Analysis (PCA). Because the kNN algorithm is highly adaptive to local information, SVD or PCA won't suffer from the common criticism that the information in the dependent variable is not used during variable reduction. This technique is demonstrated in the Digits Recognition example below.
The kNN algorithm used in classification and regression suffers a lot from the curse of dimensionality; therefore, choosing only the necessary and more useful features for the problem at hand is critical for a successful application and implementation.
Optimize Computation kNN classification is also computationally intensive. Unlike many classification methods, such as logistic regression, discriminant analysis, and other parametric regressions, kNN has no condensed mathematical form to represent the information in the training data, and it lacks an explicit model training step. When scoring is needed, the training data is simply used to populate the sample search space with known results, and scoring is nothing but a search-and-compare process, which is very time consuming. In the most naive way of searching, the algorithm compares the current scoring data point to all data in the training sample, calculates the desired distance metrics, and selects the top k nearest neighbors for their response information. This is computationally redundant and intensive and doesn't leverage the distribution information in the training sample well. Modern algorithms rely on heuristic searching, and one of the most significant is the kd-Tree [2]. kd-Tree stands for k-dimensional tree; it is a space-partitioning data structure that organizes data points in a k-dimensional space for efficient search. In SAS, PROC DISCRIM stores data in a kd-Tree structure. The time required to organize the observations into this specific tree structure is proportional to v*n*ln(n), and the time for performing each tree search is proportional to ln(n). Apparently, the way to reduce computational time is to reduce the number of data points n. On the other hand, it is advised to keep enough data points in the training sample, because k-d trees are not suitable for efficiently finding the nearest neighbor in high dimensional spaces. As a general rule, if the dimensionality is k, the number of data points n should be much larger than 2^k; otherwise, most of the tree will be searched and the efficiency gain over an exhaustive search is lost.
