Handwritten Digit Recognition on the MNIST Dataset
1 Data Acquisition and Dataset Overview
Data source:
Kaggle Competition:Digit Recognizer, Learn computer vision fundamentals with the famous MNIST data.
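The dataset contains tens of thousands of handwritten digit images. The goal is to build a model on the labeled images and use it to classify the unlabeled ones. This is the most entry-level competition in computer vision, and working through it covers the basic workflow for handling unstructured data (images).
2 Preprocessing and Feature Extraction
Here we choose machine learning models suited to the characteristics of image data. Three approaches are applied to the handwritten digit classification problem: PCA+SVM, KNN, and a convolutional neural network, using common modules such as sklearn and keras.
2.1 Loading the Data
# Import the required packages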
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
from time import time
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import neural_network
from sklearn import metrics
import math
import time  # note: this rebinds the name `time` from the function imported above to the module
from collections import Counter
import keras
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.models import Sequential
import warnings
warnings.filterwarnings('ignore')
# Load the data and inspect basic information
PATH="E:/kaggle/digit-recognizer/"
train = pd.read_csv(PATH+'train.csv')
print(train.shape)
print(train.info)
(42000, 785)
<bound method DataFrame.info of label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 \
0 1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ...
41995 0 0 0 0 0 0 0 0 0
41996 1 0 0 0 0 0 0 0 0
41997 7 0 0 0 0 0 0 0 0
41998 6 0 0 0 0 0 0 0 0
41999 9 0 0 0 0 0 0 0 0
pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 \
0 0 ... 0 0 0 0 0
1 0 ... 0 0 0 0 0
2 0 ... 0 0 0 0 0
3 0 ... 0 0 0 0 0
4 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ...
41995 0 ... 0 0 0 0 0
41996 0 ... 0 0 0 0 0
41997 0 ... 0 0 0 0 0
41998 0 ... 0 0 0 0 0
41999 0 ... 0 0 0 0 0
pixel779 pixel780 pixel781 pixel782 pixel783
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
... ... ... ... ... ...
41995 0 0 0 0 0
41996 0 0 0 0 0
41997 0 0 0 0 0
41998 0 0 0 0 0
41999 0 0 0 0 0
[42000 rows x 785 columns]>
train.head()
   label  pixel0  pixel1  pixel2  ...  pixel781  pixel782  pixel783
0      1       0       0       0  ...         0         0         0
1      0       0       0       0  ...         0         0         0
2      1       0       0       0  ...         0         0         0
3      4       0       0       0  ...         0         0         0
4      0       0       0       0  ...         0         0         0
5 rows × 785 columns
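As shown, each image is represented by its pixel values: every image has 28*28=784 pixels. MNIST images are grayscale, so each pixel takes an integer value from 0 to 255, with most background pixels equal to 0. We now classify the images from these pixel values, treating the 784 pixels as 784 features of each sample.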
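As a minimal sketch of this layout, the snippet below builds one synthetic 784-value row (a stand-in for a row of train.csv without the label column; the values are random, not real digits) and reshapes it into a 28×28 grid. The variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in for one image row: 784 grayscale values in 0..255.
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=784)

# Pixel k sits at row k // 28, column k % 28 of the image.
image = row.reshape(28, 28)

k = 100
assert image[k // 28, k % 28] == row[k]
print(image.shape)  # (28, 28)
```

With the real data, the same reshape applied to one row of pixel columns gives an array that can be passed to plt.imshow for a visual check.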
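2.2 Feature Extraction via PCA Dimensionality Reduction
We first try a traditional method, SVM, for image classification. Before classifying, we use PCA to reduce the dimensionality of the data and thereby lower the computational cost.
# Split into training and test sets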
X_train=train.drop(['label'],axis='columns',inplace=False)
y_train=train['label']
from sklearn.model_selection import train_test_split
X_tr,X_ts,y_tr,y_ts=train_test_split(X_train,y_train,test_size=0.30,random_state=4)
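In principal component analysis, n_components is the most important parameter: it is the number of principal components to keep. Setting n_components=16 gives a representation with only 16 features per image, which greatly reduces computation time without losing too much accuracy.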
n_components =16
t0 = time.time()  # `time` is the module here, so call time.time()
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
print("done in %0.3fs"%(time.time()- t0))
X_train_pca = pca.transform(X_train)
done in 1.828s
# Plot a histogram of the explained variance ratios
plt.hist(pca.explained_variance_ratio_, bins=n_components, log=True)
pca.explained_variance_ratio_.sum()
0.5953435812797994
The output shows that the first 16 principal components retain about 59.5% of the variance in the data.
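One way to relate n_components to the retained variance is to look at the cumulative explained variance ratio. The sketch below computes it with plain NumPy on synthetic data (not the MNIST pixels), so the numbers are only illustrative. `explained_variance_ratio` and `components_needed` are hypothetical helper names, not sklearn functions.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                   # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    var = s ** 2                              # proportional to component variances
    return var / var.sum()

def components_needed(ratios, target):
    """Smallest number of leading components whose ratios sum to >= target."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(ratios))                # guard against rounding at target=1.0

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features with decreasing scale.
X = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

ratios = explained_variance_ratio(X)
print(ratios[:3].sum())                # variance kept by the first 3 components
print(components_needed(ratios, 0.9))  # components needed to keep 90%
```

Applied to the fitted PCA object, the same cumulative sum over explained_variance_ratio_ would show how many components are needed to go beyond the 59.5% kept here.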
3 Building the Models
3.1 SVM Classifier
We train on the data with the SVM implementation provided in sklearn (SVC).
param_grid ={"C":[0.1]
,"gamma":[0.1]}
rf = SVC()
gs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=2, n_jobs=-1, verbose=1)
gs = gs.fit(X_train_pca, y_train)
print(gs.best_score_)
print(gs.best_params_)
Fitting 2 folds for each of 1 candidates, totalling 2 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 20.3s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 20.3s finished
0.9430238095238095
{'C': 0.1, 'gamma': 0.1}
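The log line "Fitting 2 folds for each of 1 candidates" reflects that the grid above contains a single parameter combination. As a sketch of how a grid expands, the snippet below enumerates the combinations of a wider, purely hypothetical grid (these values were not tried in the original run) and counts the resulting fits, which is candidates times folds.

```python
from itertools import product

# Hypothetical wider grid; the original run used only C=0.1 and gamma=0.1.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
cv = 2  # number of cross-validation folds, as in the run above

names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

for params in candidates:
    print(params)
print(len(candidates) * cv, "fits in total")
```

Since the SVC fit above already takes about 20 seconds, each extra candidate adds noticeably to the search time, which may be why the grid was kept to one point.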
bp = gs.best_params_
t0 = time.time()
clf = SVC(C=bp['C'], kernel='rbf', gamma=bp['gamma'])
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs"%(time.time()- t0))
done in 18.860s
clf.score(pca.transform(X_ts), y_ts)
0.9568253968253968
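The accuracy on the validation data already reaches 95.7%, with SVM parameters C: 0.1 and gamma: 0.1. Note that X_ts was split off from X_train, and both the PCA and the SVM above were fitted on all of X_train, so this score is measured on samples the model has already seen and is likely optimistic. C is the penalty coefficient: a smaller C regularizes more strongly and helps prevent overfitting, so a suitable C is chosen here for good generalization. gamma is the coefficient of the RBF kernel: it controls how far the influence of a single training sample reaches, with larger values giving a narrower influence and a more complex decision boundary.
Next we write out the results in the required format, that is, we predict the actual label for each unlabeled image. The final performance can be evaluated on the Kaggle online platform.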
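To make the role of gamma concrete, the RBF kernel is K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it in plain Python for two made-up points. `rbf_kernel` here is an illustrative helper written for this example, not the sklearn function.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
y = [1.0, 2.0]  # squared distance to x is 5

for gamma in (0.01, 0.1, 1.0):
    # Larger gamma makes the similarity decay faster with distance.
    print(gamma, rbf_kernel(x, y, gamma))
```

A large gamma therefore lets each training sample influence only its close neighborhood, while a small gamma gives a smoother decision boundary.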
val = pd.read_csv(PATH+'test.csv')
pred = clf.predict(pca.transform(val))
# ImageId,Label
val['Label']= pd.Series(pred)
val['ImageId']= val.index +1
sub = val[['ImageId','Label']]
sub.to_csv(PATH+'submission1.csv', index=False)
The final model reaches 97.1% accuracy, which makes this a fairly efficient approach.
3.2 KNN
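KNN (k-nearest neighbors) is a supervised, instance-based classification method. A KNN classifier assigns each sample the majority label among its k nearest training samples in feature space. A simple KNN module is designed and implemented here.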
%matplotlib inline
# Function that loads the data
def load_data(data_dir):
    train_data = open(data_dir + "train.csv").read()
    # Drop the header row and the last element (empty when the file ends with a newline)
    train_data = train_data.split("\n")[1:-1]
    train_data = [i.split(",") for i in train_data]
    X_train = np.array([[int(i[j]) for j in range(1, len(i))] for i in train_data])
    y_train = np.array([int(i[0]) for i in train_data])
    test_data = open(data_dir + "test.csv").read()
    test_data = test_data.split("\n")[1:-1]
    test_data = [i.split(",") for i in test_data]
    X_test = np.array([[int(i[j]) for j in range(0, len(i))] for i in test_data])
    return X_train, y_train, X_test
# KNN module
class simple_knn():
    def __init__(self):
        pass

    def train(self, X, y):
        # KNN has no fitting step: just store the training data
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1):
        # Compute the distances between test and training samples
        dists = self.compute_distances(X)
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            labels = self.y_train[np.argsort(dists[i, :])].flatten()
            k_closest_y = labels[:k]  # labels of the k nearest neighbors
            c = Counter(k_closest_y)
            y_pred[i] = c.most_common(1)[0][0]
        return y_pred

    def compute_distances(self, X):
        # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, evaluated for all pairs at once
        dot_pro = np.dot(X, self.X_train.T)
        sum_square_test = np.square(X).sum(axis=1)
        sum_square_train = np.square(self.X_train).sum(axis=1)
        dists = np.sqrt(-2 * dot_pro + sum_square_train + sum_square_test[:, np.newaxis])
        return dists
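compute_distances avoids an explicit double loop by expanding the squared distance as ||a - b||^2 = ||a||^2 - 2 a·b + ||b||^2. The sketch below checks this identity on small random arrays against a direct pairwise loop. The array names and sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_test = rng.integers(0, 256, size=(4, 6)).astype(np.float64)
X_train = rng.integers(0, 256, size=(5, 6)).astype(np.float64)

# Vectorized form: one matrix product plus two vectors of squared norms.
dot_pro = X_test @ X_train.T
sq_test = np.square(X_test).sum(axis=1)
sq_train = np.square(X_train).sum(axis=1)
sq_dists = sq_test[:, np.newaxis] - 2 * dot_pro + sq_train[np.newaxis, :]
dists = np.sqrt(np.maximum(sq_dists, 0))  # clip tiny negatives from rounding

# Direct form: Euclidean distance for every (test, train) pair.
direct = np.array([[np.linalg.norm(a - b) for b in X_train] for a in X_test])

print(np.allclose(dists, direct))
```

The vectorized form trades memory for speed: it materializes a full (test, train) distance matrix, which is one reason to predict in batches.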
X_train, y_train, X_test = load_data(PATH)
batch_size =2000
k = 3  # number of nearest neighbors (the KNN hyperparameter)
classifier = simple_knn()
classifier.train(X_train, y_train)
Call the KNN module to predict the test data in batches:
predictions = []
for i in range(int(len(X_test)/batch_size)):
    print("Computing batch " + str(i+1) + "/" + str(int(len(X_test)/batch_size)) + "...")
    tic = time.time()
    predts = classifier.predict(X_test[i * batch_size:(i+1) * batch_size], k)
    toc = time.time()
    predictions = predictions + list(predts)
    print("Completed this batch in " + str(toc-tic) + " Secs.")
print("Completed predicting the test data.")
Computing batch 1/14...
Completed this batch in 53.51499319076538 Secs.
Computing batch 2/14...
Completed this batch in 43.31397557258606 Secs.
Computing batch 3/14...
Completed this batch in 42.59756851196289 Secs.
Computing batch 4/14...
Completed this batch in 43.00966835021973 Secs.
Computing batch 5/14...
Completed this batch in 43.01448702812195 Secs.
Computing batch 6/14...
Completed this batch in 47.93128275871277 Secs.
Computing batch 7/14...
Completed this batch in 44.85835313796997 Secs.
Computing batch 8/14...
Completed this batch in 44.42547106742859 Secs.
Computing batch 9/14...
Completed this batch in 44.020007610321045 Secs.
Computing batch 10/14...
Completed this batch in 44.085976362228394 Secs.
Computing batch 11/14...
Completed this batch in 43.6392982006073 Secs.
Computing batch 12/14...
Completed this batch in 43.603368282318115 Secs.
Computing batch 13/14...
Completed this batch in 45.03933787345886 Secs.
Computing batch 14/14...
Completed this batch in 44.59685492515564 Secs.
Completed predicting the test data.
out_file = open(PATH+"submission2.csv","w")
out_file.write("ImageId,Label\n")
for i in range(len(predictions)):
    out_file.write(str(i+1) + "," + str(int(predictions[i])) + "\n")
out_file.close()
This approach reaches 97.114% accuracy, a slight improvement.
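The batch loop above uses int(len(X_test)/batch_size), which covers every sample only when the test set size is a multiple of batch_size, as is the case for the 14 full batches logged above. A sketch of a batching helper that also yields a final partial batch, under the hypothetical name `batch_slices`:

```python
import math

def batch_slices(n_samples, batch_size):
    """Return slices that cover range(n_samples) in consecutive batches."""
    n_batches = math.ceil(n_samples / batch_size)
    return [slice(i * batch_size, min((i + 1) * batch_size, n_samples))
            for i in range(n_batches)]

data = list(range(10))
for s in batch_slices(len(data), 4):
    print(data[s])
```

With such a helper, the prediction loop could iterate over the slices directly and would not silently skip trailing samples for other batch sizes.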
3.3 NN Model
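Next we try one of the most basic neural network models, the MLP (multilayer perceptron), using the MLPClassifier from sklearn's neural network module for the image classification problem.
# Load the data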
train = pd.read_csv(PATH+"train.csv")
test = pd.read_csv(PATH+"test.csv")
Y = train['label'][:10000]# use more number of rows for more training
X = train.drop(['label'], axis =1)[:10000]# use more number of rows for more training
x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=0.20, random_state=42)
model = neural_network.MLPClassifier(alpha=1e-5, hidden_layer_sizes=(5,), solver='lbfgs', random_state=18)
model.fit(x_train, y_train)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(5,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=18, shuffle=True, solver='lbfgs', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
With the classifier above built, we feed the validation set into it to check the model's performance.
predicted = model.predict(x_val)
print("Classification Report:\n %s:"%(metrics.classification_report(y_val, predicted)))
Classification Report:
              precision    recall  f1-score   support
           0       0.00      0.00      0.00       186
           1       0.97      0.81      0.88       210
           2       0.12      0.99      0.21       220
           3       0.00      0.00      0.00       190
           4       0.00      0.00      0.00       188
           5       0.00      0.00      0.00       194
           6       0.00      0.00      0.00       190
           7       0.00      0.00      0.00       233
           8       0.00      0.00      0.00       197
           9       0.00      0.00      0.00       192
    accuracy                           0.19      2000
   macro avg       0.11      0.18      0.11      2000
weighted avg       0.12      0.19      0.12      2000
:
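The results show that this MLP does poorly: overall accuracy is only 19%, and most samples are predicted as class 2 (recall 0.99 but precision 0.12), while eight of the ten classes have zero precision and recall. As configured here, with a single hidden layer of 5 units, the multilayer perceptron is not well suited to this image classification problem, which motivates switching to other neural network models to see whether they perform better.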
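High recall combined with low precision for a single class is what results when a classifier assigns most samples to that class. The toy example below, with made-up labels rather than the actual predictions, computes per-class precision and recall by hand to show the effect. `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, with 0.0 when undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Made-up labels: the "model" predicts class 2 for almost everything.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [2, 2, 1, 2, 2, 2, 2, 2]

for label in sorted(set(y_true)):
    print(label, precision_recall(y_true, y_pred, label))
```

Class 2 gets perfect recall because every true 2 is caught, but low precision because most of its predictions belong to other classes, and classes that are never predicted score zero on both.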
3.4 CNN
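3.4.1 Data Processing and Preparation
To feed the data into the model properly, some further processing is needed. In a keras CNN, the convolution and related layers already learn to extract image features automatically, so there is no longer any need to hand-craft feature extraction rules.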
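A sketch of what such preparation typically looks like, assuming (this is not shown in the text above) that the CNN expects inputs of shape (samples, 28, 28, 1) scaled to [0, 1] and one-hot encoded labels. It uses synthetic data and plain NumPy, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the pixel columns and the label column.
pixels = rng.integers(0, 256, size=(5, 784))
labels = np.array([1, 0, 4, 7, 9])

# Scale to [0, 1] and add height, width and channel axes.
x = (pixels / 255.0).reshape(-1, 28, 28, 1).astype(np.float32)

# One-hot encode the digit labels into 10 columns.
y = np.eye(10, dtype=np.float32)[labels]

print(x.shape, y.shape)  # (5, 28, 28, 1) (5, 10)
```

The trailing axis of size 1 is the single grayscale channel that a Conv2D layer expects with the channels-last layout.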