Risk Control (1): Comparing ROC and K-S Curves, with Python Implementations


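1. First-level metrics

Take binary classification, the simplest kind of classification model, as the example. The model ultimately has to decide whether each sample is 0 or 1, in other words positive or negative. From data collection we know directly which samples are truly positive and which are truly negative. From running the classifier on the same samples we also know which ones the model considers positive and which negative. That gives four basic counts, which I call first-level metrics (the lowest level):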

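True Positive (TP): the number of samples whose actual value is positive and which the model predicts as positive.

False Negative (FN): the number of samples whose actual value is positive and which the model predicts as negative. This is the Type II error in statistics.

False Positive (FP): the number of samples whose actual value is negative and which the model predicts as positive. This is the Type I error in statistics.

True Negative (TN): the number of samples whose actual value is negative and which the model predicts as negative.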

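Note: T means the model's prediction was correct; F means it was wrong.

A predictive classifier should be as accurate as possible. In confusion-matrix terms, we want TP and TN to be large and FP and FN to be small. So once we have a model's confusion matrix, we look at how many observations fall in the cells at the second- and fourth-quadrant positions (TP and TN, assuming the usual layout with TP at the top left and TN at the bottom right); the more there are, the better. Conversely, the fewer observations in the cells at the first- and third-quadrant positions (the errors FN and FP), the better.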
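As a rough illustration of the four counts, here is a minimal pure-Python sketch. The function name `confusion_counts` and the sample labels are made up for this example; they do not come from the article's data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn) for binary labels where 1 means positive."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # correct positive
        elif actual == 1 and predicted == 0:
            fn += 1  # missed positive (Type II error)
        elif actual == 0 and predicted == 1:
            fp += 1  # false alarm (Type I error)
        else:
            tn += 1  # correct negative
    return tp, fn, fp, tn


# Hypothetical labels, invented for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)  # 3 1 1 5
```

Here 8 of the 10 samples land in the TP and TN cells, which is the "large diagonal" pattern described above.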

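2. Second-level metrics

However, a confusion matrix holds raw counts, and with large data sets counts alone make it hard to judge how good a model is. So four further metrics are derived from the basic counts. I call them second-level metrics (obtained by simple arithmetic on the first-level ones):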

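Accuracy: (TP + TN) / (TP + TN + FP + FN), a measure of the model as a whole

Precision: TP / (TP + FP)

Sensitivity, which is the same as Recall: TP / (TP + FN)

Specificity: TN / (TN + FP)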

These turn the counts in the confusion matrix into ratios between 0 and 1, which makes standardized comparison easier.
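The four ratios can be sketched directly from the counts. This is a minimal example under the standard definitions; the helper name `second_level_metrics` and the counts passed in are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def second_level_metrics(tp, fn, fp, tn):
    """Derive the four second-level ratios from confusion-matrix counts.

    Assumes every denominator is non-zero (each class occurs at least once
    and the model predicts at least one positive).
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all samples predicted correctly
        "precision": tp / (tp + fp),     # share of predicted positives that are real
        "recall": tp / (tp + fn),        # sensitivity: share of real positives found
        "specificity": tn / (tn + fp),   # share of real negatives correctly rejected
    }


# Hypothetical counts, invented for illustration only
metrics = second_level_metrics(tp=3, fn=1, fp=1, tn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.8, precision and recall are both 0.75, and specificity is 5/6.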

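3. Third-level metric

Building on these four metrics gives one more, third-level metric: the F1 Score. Its formula is:

F1 = 2 * P * R / (P + R)

where P is Precision and R is Recall.

The F1 Score combines Precision and Recall into a single number. It ranges from 0 to 1, where 1 is the best possible model output and 0 is the worst.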
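To see why F1 is used instead of a plain average, the sketch below computes it for a balanced and a lopsided precision/recall pair. The function name `f1_from_pr` and the input values are hypothetical, and returning 0.0 when both inputs are 0 is a convention chosen for this example.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0 (a convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall pairs, invented for illustration only
balanced = f1_from_pr(0.75, 0.75)
lopsided = f1_from_pr(0.99, 0.10)

print(round(balanced, 4))  # 0.75
print(round(lopsided, 4))  # about 0.18, far below the arithmetic mean of 0.545
```

Because F1 is a harmonic mean, it stays low whenever either Precision or Recall is low, so a model cannot score well by maximizing only one of them.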

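4. ROC curve

ROC curve: the Receiver Operating Characteristic curve. The horizontal axis is FPR (False Positive Rate, FP / (FP + TN)) and the vertical axis is TPR (True Positive Rate, TP / (TP + FN)). Each point on the curve corresponds to one classification threshold.

AUC (Area Under ROC Curve): the area under the ROC curve.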
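The construction can be sketched without any library: sweep the threshold from high to low, record (FPR, TPR) at each step, and integrate with the trapezoid rule. The function names and the sample scores below are hypothetical, and the sketch assumes both classes are present. In practice `sklearn.metrics.roc_curve` and `auc` do the same job, as in section 6.2.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold over every distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under a curve given as (x, y) pairs ordered by x, via the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and model scores, invented for illustration only
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
roc_auc = auc_trapezoid(points)
print(points)
print(round(roc_auc, 4))  # 0.8889
```

The result, 8/9, matches the other common reading of AUC: of the 9 possible (positive, negative) pairs in this data, the positive sample has the higher score in 8.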

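5. K-S curve

The larger the value on the K-S (Kolmogorov-Smirnov) curve, which some risk-control material also calls a Lorenz curve, the better the model separates positive from negative customers. The KS value ranges over [0, 1].

A K-S plot consists of two curves. The horizontal axis is the threshold and the vertical axis shows TPR (the upper curve) and FPR (the lower curve), both in the range [0, 1]. The threshold at which the two curves are farthest apart, that is, where TPR - FPR is largest, is the threshold that best separates the two classes, and that largest gap is the KS value. The plotting procedure is as follows:

In the example plot, the gap between TPR and FPR is largest at a threshold of 0.4, so that threshold can serve as the best cutoff.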
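The search for the largest gap can be sketched in a few lines of plain Python. The function name `ks_statistic` and the sample data are hypothetical (the 0.4 threshold above belongs to the article's own plot, not to this data), and the sketch assumes both classes are present.

```python
def ks_statistic(y_true, scores):
    """Return (ks_value, best_threshold): the largest TPR - FPR gap and where it occurs."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_gap, best_threshold = 0.0, None
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score is at or above the threshold
        tpr = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1) / positives
        fpr = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0) / negatives
        gap = tpr - fpr
        if gap > best_gap:
            best_gap, best_threshold = gap, threshold
    return best_gap, best_threshold


# Hypothetical labels and model scores, invented for illustration only
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

ks_value, ks_threshold = ks_statistic(y_true, scores)
print(ks_value, ks_threshold)  # 0.75 0.7
```

At a threshold of 0.7 the model has already captured three of the four positives (TPR = 0.75) without admitting any negatives (FPR = 0), which is why the gap peaks there.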

6. Code

6.1 Confusion matrix in PySpark

    '''
    TP (True Positive): actual 1, predicted 1
    FN (False Negative): actual 1, predicted 0
    FP (False Positive): actual 0, predicted 1
    TN (True Negative): actual 0, predicted 0
    '''
    from pyspark.sql import functions as F

    # Training set. result_train is assumed to be a DataFrame with
    # "label" and "prediction" columns, where prediction is a probability.
    # Adding a before rounding moves the cutoff from 0.5 to about 0.4.
    a = 0.1
    result_train_tmp = (
        result_train
        .withColumn("tp", F.expr("case when label=1 and round(prediction+{a},0)=1 then 1 else 0 end".format(a=a)))
        .withColumn("fn", F.expr("case when label=1 and round(prediction+{a},0)=0 then 1 else 0 end".format(a=a)))
        .withColumn("fp", F.expr("case when label=0 and round(prediction+{a},0)=1 then 1 else 0 end".format(a=a)))
        .withColumn("tn", F.expr("case when label=0 and round(prediction+{a},0)=0 then 1 else 0 end".format(a=a)))
    )


6.2 ROC curve in Python

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc
    from sklearn.model_selection import KFold

    # Assumes lr is a fitted classifier, gbdt is an unfitted one, and
    # x_train, y_train, x_test, y_test already exist.

    # Predict class probabilities
    y_pred_lr = lr.predict_proba(x_test)

    # Compute AUC
    fpr_lr, tpr_lr, thresholds = roc_curve(y_test, y_pred_lr[:, 1], pos_label=1)
    roc_auc_lr = auc(fpr_lr, tpr_lr)

    # Plot ROC
    plt.figure()
    plt.plot(fpr_lr, tpr_lr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc_lr)
    plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curve - LR')
    plt.legend(loc="lower right")

    # -------------------------------------------------
    # ROC with cross-validation
    n_splits = 5
    kf = KFold(n_splits=n_splits)
    fig = plt.figure(figsize=(7, 5))
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.zeros_like(mean_fpr)
    x_train = np.array(x_train)
    y_train = np.array(y_train)

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        # Fit on the training folds, score the held-out fold
        model = gbdt.fit(x_train[train_index], y_train[train_index])
        probas = model.predict_proba(x_train[test_index])
        fpr, tpr, thresholds = roc_curve(y_train[test_index], probas[:, 1], pos_label=1)
        # Interpolate onto a common FPR grid so the folds can be averaged
        mean_tpr += np.interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i + 1, roc_auc))

    plt.plot([0, 1], [0, 1], linestyle='--', color=(0.6, 0.6, 0.6), label='random guessing')
    mean_tpr /= n_splits
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, 'k--', label='mean ROC (area=%0.2f)' % mean_auc, lw=2)
    plt.plot([0, 0, 1], [0, 1, 1], lw=2, linestyle=':', color='black', label='perfect performance')
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc='lower right')
    plt.show()


6.3 K-S curve in Python

    # Plot the K-S curve
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    def PlotKS(preds, labels, n, asc):
        # preds is a score: asc=1
        # preds is a probability: asc=0
        # n is the number of quantile points to sample; n=10 splits the 0-1 range into tenths
        pred = preds  # predicted values
        bad = labels  # 1 means bad, 0 means good
        ksds = pd.DataFrame({'bad': bad, 'pred': pred})
        ksds['good'] = 1 - ksds.bad

        # First ordering: among tied predictions, good samples come first
        if asc == 1:
            ksds1 = ksds.sort_values(by=['pred', 'bad'], ascending=[True, True])
        elif asc == 0:
            ksds1 = ksds.sort_values(by=['pred', 'bad'], ascending=[False, True])
        ksds1.index = range(len(ksds1.pred))
        ksds1['cumsum_good1'] = 1.0 * ksds1.good.cumsum() / sum(ksds1.good)
        ksds1['cumsum_bad1'] = 1.0 * ksds1.bad.cumsum() / sum(ksds1.bad)

        # Second ordering: among tied predictions, bad samples come first
        if asc == 1:
            ksds2 = ksds.sort_values(by=['pred', 'bad'], ascending=[True, False])
        elif asc == 0:
            ksds2 = ksds.sort_values(by=['pred', 'bad'], ascending=[False, False])
        ksds2.index = range(len(ksds2.pred))
        ksds2['cumsum_good2'] = 1.0 * ksds2.good.cumsum() / sum(ksds2.good)
        ksds2['cumsum_bad2'] = 1.0 * ksds2.bad.cumsum() / sum(ksds2.bad)

        # ksds1 ksds2 -> average
        ksds = ksds1[['cumsum_good1', 'cumsum_bad1']].copy()
        ksds['cumsum_good2'] = ksds2['cumsum_good2']
        ksds['cumsum_bad2'] = ksds2['cumsum_bad2']
        ksds['cumsum_good'] = (ksds['cumsum_good1'] + ksds['cumsum_good2']) / 2
        ksds['cumsum_bad'] = (ksds['cumsum_bad1'] + ksds['cumsum_bad2']) / 2

        # ks
        ksds['ks'] = ksds['cumsum_bad'] - ksds['cumsum_good']
        ksds['tile0'] = range(1, len(ksds.ks) + 1)
        ksds['tile'] = 1.0 * ksds['tile0'] / len(ksds['tile0'])

        qe = list(np.arange(0, 1, 1.0 / n))
        qe.append(1)
        qe = qe[1:]

        ks_index = pd.Series(ksds.index)
        ks_index = ks_index.quantile(q=qe)
        ks_index = np.ceil(ks_index).astype(int)
        ks_index = list(ks_index)

        ksds = ksds.loc[ks_index]
        ksds = ksds[['tile', 'cumsum_good', 'cumsum_bad', 'ks']]
        ksds0 = np.array([[0, 0, 0, 0]])
        ksds = np.concatenate([ksds0, ksds], axis=0)
        ksds = pd.DataFrame(ksds, columns=['tile', 'cumsum_good', 'cumsum_bad', 'ks'])

        ks_value = ksds.ks.max()
        ks_pop = ksds.loc[ksds.ks.idxmax(), 'tile']
        print('ks_value is ' + str(np.round(ks_value, 4)) + ' at pop = ' + str(np.round(ks_pop, 4)))

        # chart
        plt.plot(ksds.tile, ksds.cumsum_good, label='cum_good',
                 color='blue', linestyle='-', linewidth=2)
        plt.plot(ksds.tile, ksds.cumsum_bad, label='cum_bad',
                 color='red', linestyle='-', linewidth=2)
        plt.plot(ksds.tile, ksds.ks, label='ks',
                 color='green', linestyle='-', linewidth=2)

        plt.axvline(ks_pop, color='gray', linestyle='--')
        plt.axhline(ks_value, color='green', linestyle='--')
        plt.axhline(ksds.loc[ksds.ks.idxmax(), 'cumsum_good'], color='blue', linestyle='--')
        plt.axhline(ksds.loc[ksds.ks.idxmax(), 'cumsum_bad'], color='red', linestyle='--')
        plt.title('KS=%s ' % np.round(ks_value, 4) +
                  'at Pop=%s' % np.round(ks_pop, 4), fontsize=15)

        return ksds


6.4 Other metrics in Python

    from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

    # y_pred_lr_new is assumed to hold hard 0/1 predictions, not probabilities
    acc = accuracy_score(y_test, y_pred_lr_new)
    p = precision_score(y_test, y_pred_lr_new, average='binary')
    r = recall_score(y_test, y_pred_lr_new, average='binary')
    f1score = f1_score(y_test, y_pred_lr_new, average='binary')

    print(acc, p, r, f1score)
