Kaggle Heart Disease Classification Prediction: a Data Analysis Case Study (Logistic Regression, KNN, Decision Tree, Random Forest...)

This post is a small demo analyzing the "heart disease classification prediction" dataset on Kaggle.
The overall workflow: inspect the data, process the data, build logistic regression, KNN, and decision tree models, examine the F1 score, the confusion matrix, and the precision-recall curves, plot each model's ROC curve for comparison, and finally do model ensembling using a random forest.
Dataset URL:
Data inspection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# fix Chinese character display in matplotlib
from pylab import mpl
# load the data
df = pd.read_csv('heart_disease_data/heart.csv')
Take a quick look at the overall data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age        303 non-null int64
sex        303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs        303 non-null int64
restecg    303 non-null int64
thalach    303 non-null int64
exang      303 non-null int64
oldpeak    303 non-null float64
slope      303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB
Feature meanings
age: age in years
sex: sex (1 = male, 0 = female)
cp: chest pain type (4 types): 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
trestbps: resting blood pressure
chol: serum cholesterol
fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
restecg: resting electrocardiographic results (values 0, 1, 2)
thalach: maximum heart rate achieved
exang: exercise-induced angina (1 = yes, 0 = no)
oldpeak: ST depression induced by exercise relative to rest (ST refers to positions on the ECG plot)
slope: the slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
ca: the number of major vessels (0-3)
thal: a blood disorder called thalassemia (3 = normal, 6 = fixed defect, 7 = reversible defect)
target: presence of heart disease (0 = no, 1 = yes)
df.describe()
              age         sex          cp    trestbps        chol         fbs     restecg     thalach       exang     oldpeak       slope          ca        thal
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000
mean    54.366337    0.683168    0.966997  131.623762  246.264026    0.148515    0.528053  149.646865    0.326733    1.039604    1.399340    0.729373    2.313531
std      9.082101    0.466011    1.032052   17.538143   51.830751    0.356198    0.525860   22.905161    0.469794    1.161075    0.616226    1.022606    0.612277
min     29.000000    0.000000    0.000000   94.000000  126.000000    0.000000    0.000000   71.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%     47.500000    0.000000    0.000000  120.000000  211.000000    0.000000    0.000000  133.500000    0.000000    0.000000    1.000000    0.000000    2.000000
50%     55.000000    1.000000    1.000000  130.000000  240.000000    0.000000    1.000000  153.000000    0.000000    0.800000    1.000000    0.000000    2.000000
75%     61.000000    1.000000    2.000000  140.000000  274.500000    0.000000    1.000000  166.000000    1.000000    1.600000    2.000000    1.000000    3.000000
max     77.000000    1.000000    3.000000  200.000000  564.000000    1.000000    2.000000  202.000000    1.000000    6.200000    2.000000    4.000000    3.000000
Make a few simple plots to look at the relationships between the features.
df.target.value_counts()
1    165
0    138
Name: target, dtype: int64
plt.xlabel("得病/未得病⽐例")
Text(0.5,0,'得病/未得病⽐例')
df.sex.value_counts()
1    207
0    96
Name: sex, dtype: int64
sns.countplot(x='sex', data=df, palette="Set3")
plt.xlabel("Sex (0 = female, 1 = male)")
Text(0.5,0,'Sex (0 = female, 1 = male)')
plt.figure(figsize=(18,7))
plt.show()
Understanding the data is an important part of the work, but this post focuses on the modeling, so the data exploration ends here.
Data processing
Process the non-continuous (categorical) features cp, slope, and thal with one-hot encoding
first = pd.get_dummies(df['cp'], prefix="cp")
second = pd.get_dummies(df['slope'], prefix="slope")
third = pd.get_dummies(df['thal'], prefix="thal")
df = pd.concat([df, first, second, third], axis=1)
df = df.drop(columns=['cp', 'slope', 'thal'])
df.head(3)
[Output: the first 3 rows of the transformed DataFrame, 3 rows × 22 columns; cp, slope, and thal have been replaced by one-hot columns cp_0-cp_3, slope_0-slope_2, thal_0-thal_3]
Processing done; build the final feature matrix and labels
y = df.target.values
X = df.drop(['target'], axis =1)
X.shape
(303, 21)
Split the dataset and standardize the features
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=6)  # random seed 6
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train = standardScaler.transform(X_train)
X_test = standardScaler.transform(X_test)
Model building -- Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
log_reg.score(X_train,y_train)
0.8810572687224669
log_reg.score(X_test,y_test)
0.8289473684210527
from sklearn.metrics import accuracy_score
y_predict_log = log_reg.predict(X_test)
# compute the classification accuracy with accuracy_score
accuracy_score(y_test,y_predict_log)
0.8289473684210527
Use grid search to find better model parameters
param_grid = [
    {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l2', 'l1'],
        'class_weight': ['balanced', None]
    }
]
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(log_reg,param_grid,cv=10,n_jobs=-1)
%%time
grid_search.fit(X_train,y_train)
Wall time: 2.88 s
GridSearchCV(cv=10, error_score='raise',
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
fit_params=None, iid=True, n_jobs=-1,
param_grid=[{'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l2', 'l1'], 'class_weight': ['balanced', None]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=0)
grid_search.best_estimator_
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
grid_search.best_score_
0.8502202643171806
grid_search.best_params_
{'C': 0.01, 'class_weight': None, 'penalty': 'l2'}
log_reg = grid_search.best_estimator_
log_reg.score(X_train,y_train)
0.8634361233480177
log_reg.score(X_test,y_test)
0.8289473684210527
Check the F1 score
from sklearn.metrics import f1_score
f1_score(y_test,y_predict_log)
0.8470588235294118
from sklearn.metrics import classification_report
print(classification_report(y_test,y_predict_log))
precision    recall  f1-score  support
0      0.87      0.75      0.81        36
1      0.80      0.90      0.85        40
avg / total      0.83      0.83      0.83        76
Plot the confusion matrix
from sklearn.metrics import confusion_matrix
cnf_matrix = confusion_matrix(y_test,y_predict_log)
cnf_matrix
array([[27,  9],
[ 4, 36]], dtype=int64)
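As a quick sanity check, the positive-class precision, recall, and F1 can be recomputed by hand from this confusion matrix. A minimal sketch; the names tn, fp, fn, tp are introduced here only for illustration:
# unpack the confusion matrix: rows are actual (0, 1), columns are predicted (0, 1)
tn, fp, fn, tp = cnf_matrix.ravel()   # 27, 9, 4, 36
precision = tp / (tp + fp)            # 36 / 45 = 0.80
recall = tp / (tp + fn)               # 36 / 40 = 0.90
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)          # f1 ≈ 0.8471, matching f1_score(y_test, y_predict_log) above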
def plot_cnf_matrix(cnf_matrix, description):
    class_names = [0, 1]
    fig, ax = plt.subplots()
    tick_marks = np.arange(len(class_names))
    # create a heat map
    sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap='OrRd', fmt='g')
    ax.xaxis.set_label_position('top')
    plt.tight_layout()
    plt.title(description, y=1.1, fontsize=16)
    plt.ylabel('Actual 0/1', fontsize=12)
    plt.xlabel('Predicted 0/1', fontsize=12)
    plt.show()
plot_cnf_matrix(cnf_matrix, 'Confusion matrix -- Logistic Regression')
decision_scores = log_reg.decision_function(X_test)
from sklearn.metrics import precision_recall_curve
precisions,recalls,thresholds = precision_recall_curve(y_test,decision_scores)
plt.plot(thresholds,precisions[:-1])
plt.plot(thresholds,recalls[:-1])
plt.show()  # the thresholds do not start from the minimum score; sklearn picks its own starting point
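If a specific operating point is needed (for example, favoring recall in a medical screening setting), a threshold can be read off these arrays and applied to the decision scores directly. A minimal sketch; the 0.9 recall target is an arbitrary illustration, not a value from this analysis:
target_recall = 0.9                                   # illustrative requirement, not from the article
ok = np.where(recalls[:-1] >= target_recall)[0]       # indices of thresholds that keep recall >= 0.9
chosen_threshold = thresholds[ok[-1]]                 # the strictest such threshold
# predict positive whenever the decision score clears the chosen threshold
y_predict_custom = (decision_scores >= chosen_threshold).astype(int)
print(chosen_threshold, precisions[ok[-1]], recalls[ok[-1]])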
Plot the ROC curve
from sklearn.metrics import roc_curve
fprs,tprs,thresholds = roc_curve(y_test,decision_scores)
def plot_roc_curve(fprs, tprs):
    plt.figure(figsize=(8, 6), dpi=80)
    plt.plot(fprs, tprs)
    plt.plot([0, 1], linestyle='--')  # diagonal reference line
    plt.ylabel('TP rate', fontsize=15)
    plt.xlabel('FP rate', fontsize=15)
    plt.title('ROC Curve', fontsize=17)
    plt.show()
plot_roc_curve(fprs,tprs)
# compute the area under the curve, which serves as a single score
from sklearn.metrics import roc_auc_score  # AUC: area under curve
roc_auc_score(y_test,decision_scores)
0.8784722222222222
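Since the AUC is literally the area under the curve plotted above, the same number can be recovered by integrating the (fprs, tprs) points, which makes a quick consistency check. A small sketch using sklearn's general-purpose trapezoidal helper:
from sklearn.metrics import auc
print(auc(fprs, tprs))   # trapezoidal area under the ROC curve; should match roc_auc_score above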
Model building -- KNN (k-nearest neighbors)
The basic model setup is skipped here; we go straight to parameter tuning with grid search (a minimal baseline sketch is shown below for reference).
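For reference, the skipped baseline would look roughly like this (a minimal sketch with default hyperparameters; its scores are not reported in this article):
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()        # defaults: k=5, uniform weights
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)           # baseline test accuracy, for comparison with the tuned model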
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 31)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 31)],
        'p': [i for i in range(1, 6)]
    }
]
%%time
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf,param_grid)
grid_search.fit(X_train,y_train)
Wall time: 7.23 s
grid_search.best_estimator_
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=24, p=3,
           weights='distance')
grid_search.best_score_
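A natural next step, mirroring the logistic regression section above, is to take the best KNN estimator and evaluate it on the held-out test set. A minimal sketch reusing the helpers defined earlier; none of these results appear in the original text:
knn_clf = grid_search.best_estimator_                 # the tuned KNN model
print(knn_clf.score(X_test, y_test))                  # test accuracy
y_predict_knn = knn_clf.predict(X_test)
print(f1_score(y_test, y_predict_knn))                # F1 on the positive class
plot_cnf_matrix(confusion_matrix(y_test, y_predict_knn), 'Confusion matrix -- KNN')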
