
Image Processing in Practice: Handwritten Digit Recognition on the MNIST Dataset


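1 Data Acquisition and Dataset Overview

Data source:

Kaggle Competition: Digit Recognizer, Learn computer vision fundamentals with the famous MNIST data.

The dataset contains image data for tens of thousands of handwritten digits. The goal is to build a model from the labeled images and use it to classify the unlabeled ones. This is one of the most introductory competitions in computer vision, and working through it covers the basic workflow for handling unstructured (image) data.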

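2 Preprocessing and Feature Extraction

Models are chosen here to suit the characteristics of the image data. Several approaches to classifying the handwritten digits are tried: PCA+SVM, KNN, and neural networks (an MLP and a convolutional neural network), using common packages such as sklearn and keras.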

2.1 Data Import

# Import the required packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import neural_network
from sklearn import metrics
import math
import time  # timing helper, called as time.time()
from collections import Counter
import keras
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.models import Sequential
import warnings
warnings.filterwarnings('ignore')

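# Load the data and inspect basic information
PATH = "E:/kaggle/digit-recognizer/"
train = pd.read_csv(PATH + 'train.csv')
print(train.shape)
print(train.info)  # no parentheses: prints the bound method's repr, which includes a DataFrame preview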

(42000, 785)
<bound method DataFrame.info of        label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  \
0          1       0       0       0       0       0       0       0       0
1          0       0       0       0       0       0       0       0       0
2          1       0       0       0       0       0       0       0       0
3          4       0       0       0       0       0       0       0       0
4          0       0       0       0       0       0       0       0       0
...      ...     ...     ...     ...     ...     ...     ...     ...     ...
41995      0       0       0       0       0       0       0       0       0
41996      1       0       0       0       0       0       0       0       0
41997      7       0       0       0       0       0       0       0       0
41998      6       0       0       0       0       0       0       0       0
41999      9       0       0       0       0       0       0       0       0

       pixel8  ...  pixel774  pixel775  pixel776  pixel777  pixel778  \
0           0  ...         0         0         0         0         0
1           0  ...         0         0         0         0         0
2           0  ...         0         0         0         0         0
3           0  ...         0         0         0         0         0
4           0  ...         0         0         0         0         0
...       ...  ...       ...       ...       ...       ...       ...
41995       0  ...         0         0         0         0         0
41996       0  ...         0         0         0         0         0
41997       0  ...         0         0         0         0         0
41998       0  ...         0         0         0         0         0
41999       0  ...         0         0         0         0         0

       pixel779  pixel780  pixel781  pixel782  pixel783
0             0         0         0         0         0
1             0         0         0         0         0
2             0         0         0         0         0
3             0         0         0         0         0
4             0         0         0         0         0
...         ...       ...       ...       ...       ...
41995         0         0         0         0         0
41996         0         0         0         0         0
41997         0         0         0         0         0
41998         0         0         0         0         0
41999         0         0         0         0         0

[42000 rows x 785 columns]>

train.head()

   label  pixel0  pixel1  pixel2  ...  pixel781  pixel782  pixel783
0      1       0       0       0  ...         0         0         0
1      0       0       0       0  ...         0         0         0
2      1       0       0       0  ...         0         0         0
3      4       0       0       0  ...         0         0         0
4      0       0       0       0  ...         0         0         0

5 rows × 785 columns

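As the output shows, each image is stored as a row of pixel values, and every image has 28*28 = 784 pixels. MNIST digits are grayscale images, so each pixel holds an integer intensity from 0 to 255 rather than just 0 or 1. The images are now classified from these pixel values: in what follows, the 784 pixels are treated as 784 features of each sample, and the label column is the target.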
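To make the row layout concrete, here is a minimal sketch that turns one flat row of 784 values back into a 28×28 image and scales it to [0, 1]. The helper name `row_to_image` and the random stand-in row are assumptions for illustration; with the real data the row would come from `train.drop(['label'], axis=1)`.

```python
import numpy as np

def row_to_image(row, side=28):
    """Reshape a flat vector of side*side pixel values into a 2-D image scaled to [0, 1]."""
    pixels = np.asarray(row, dtype=np.float32)
    if pixels.size != side * side:
        raise ValueError(f"expected {side * side} pixels, got {pixels.size}")
    return pixels.reshape(side, side) / 255.0

# Synthetic stand-in for one row of train.csv (label column already removed).
rng = np.random.default_rng(0)
fake_row = rng.integers(0, 256, size=784)
image = row_to_image(fake_row)
print(image.shape, image.min() >= 0.0, image.max() <= 1.0)
```

The resulting array can be passed to `plt.imshow(image, cmap='gray')` to view the digit.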

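2.2 Feature Extraction by PCA Dimensionality Reduction

We first try a traditional method, SVM, for image classification. Before classifying, PCA is used to reduce the dimensionality of the data and so lower the computational cost.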

# Split into training and test sets
X_train = train.drop(['label'], axis='columns', inplace=False)
y_train = train['label']
from sklearn.model_selection import train_test_split
X_tr, X_ts, y_tr, y_ts = train_test_split(X_train, y_train, test_size=0.30, random_state=4)

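In principal component analysis, n_components is the most important parameter: it is the number of principal components to keep. Setting n_components=16 gives the model only 16 features per image instead of 784, which greatly reduces computation time without losing too much accuracy.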

n_components = 16
t0 = time.time()
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
print("done in %0.3fs" % (time.time() - t0))
X_train_pca = pca.transform(X_train)

done in 1.828s

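# Plot a histogram of the explained variance ratios, then report their sum
plt.hist(pca.explained_variance_ratio_, bins=n_components, log=True)
pca.explained_variance_ratio_.sum()

0.5953435812797994

The output shows that the first 16 principal components retain about 59.5% of the variance in the data.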
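Instead of fixing the number of components in advance, it can also be chosen from the cumulative explained variance. The sketch below computes the ratios with a plain NumPy SVD on synthetic data; the function names, the 0.95 target, and the data are illustrative assumptions, not part of the original workflow.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component of X."""
    centered = X - X.mean(axis=0)
    # Singular values of the centered data give the component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def components_for(X, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio(X))
    return int(np.searchsorted(cumulative, target) + 1)

# Synthetic data: 200 samples, 10 features, with most variance in 3 directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3)) * np.array([10.0, 5.0, 2.0])
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

ratios = explained_variance_ratio(X_demo)
k = components_for(X_demo, target=0.95)
print(ratios[:3].round(3), k)
```

sklearn's PCA offers the same selection directly: passing a float between 0 and 1 as n_components together with svd_solver='full' keeps just enough components to explain that share of the variance.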

3 Building the Models

3.1 SVM Classifier

The SVM implementation that ships with sklearn (SVC) is used to train on the data.

param_grid ={"C":[0.1]

,"gamma":[0.1]}

rf = SVC()

gs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=2, n_jobs=-1, verbose=1)

gs = gs.fit(X_train_pca, y_train)


print(gs.best_score_)

print(gs.best_params_)

Fitting 2 folds for each of 1 candidates, totalling 2 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.

[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 20.3s remaining: 0.0s

[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 20.3s finished

0.9430238095238095

{'C': 0.1, 'gamma': 0.1}
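The log reports a single candidate because each list in param_grid holds only one value, so no search actually takes place. The sketch below enumerates the combinations a grid produces; `expand_grid` is a hypothetical helper and the wider grid values are illustrative, not tuned.

```python
from itertools import product

def expand_grid(param_grid):
    """List every parameter combination in a dict of name -> candidate values."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]

# The grid used in the article: a single candidate.
article_grid = {"C": [0.1], "gamma": [0.1]}
# A hypothetical wider grid; these values are illustrative only.
wider_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}

print(len(expand_grid(article_grid)))  # 1
print(len(expand_grid(wider_grid)))    # 9
```

With cv=2, GridSearchCV would fit each candidate twice, so the wider grid costs 18 fits instead of 2.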

bp = gs.best_params_
t0 = time.time()
clf = SVC(C=bp['C'], kernel='rbf', gamma=bp['gamma'])
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs" % (time.time() - t0))

done in 18.860s

clf.score(pca.transform(X_ts), y_ts)

0.9568253968253968

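On the held-out split the model reaches about 95.7% accuracy, with SVM parameters C = 0.1 and gamma = 0.1. Note that clf was fitted on all of X_train, which includes the rows in X_ts, so this figure is optimistic rather than a true validation score. C is the penalty coefficient: a smaller C regularizes more strongly and helps prevent overfitting, so a suitable C gives the best generalization. gamma is the coefficient of the RBF kernel: it controls how far the influence of a single training sample reaches, and a larger gamma makes the decision boundary follow individual samples more closely.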
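To illustrate the role of gamma, the sketch below evaluates the RBF kernel exp(-gamma * ||x - z||^2) for one pair of points at several gamma values. The points and the gamma values are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) between two vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = np.array([0.0, 0.0])
z = np.array([1.0, 2.0])  # squared distance to x is 5

for gamma in (0.01, 0.1, 1.0):
    print(gamma, rbf_kernel(x, z, gamma))
```

As gamma grows, the similarity between the same two points falls toward zero, so each training sample influences only its immediate neighbourhood.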

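Next, the results are written out in the required format, that is, labels are predicted for the unlabeled images. The final performance can then be evaluated on Kaggle's online platform.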

val = pd.read_csv(PATH + 'test.csv')
pred = clf.predict(pca.transform(val))
# ImageId,Label
val['Label'] = pd.Series(pred)
val['ImageId'] = val.index + 1
sub = val[['ImageId', 'Label']]
sub.to_csv(PATH + 'submission1.csv', index=False)

The final model reaches 97.1% accuracy, so this is indeed a fairly efficient approach.

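3.2 KNN

KNN (k-nearest neighbours) is a supervised classification method, not a clustering method. A KNN classifier is built here: it assigns each sample the majority label among the k training samples closest to it in feature space. A simple KNN module is designed and implemented below.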
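The voting rule can be sketched in a few lines of plain Python before looking at the vectorized module. The function name `knn_predict` and the toy 2-D points are assumptions for illustration only.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Squared Euclidean distance keeps the same ordering as the true distance.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D data, purely illustrative: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.5, 0.5)))  # 0
print(knn_predict(points, labels, (5.5, 5.5)))  # 1
```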

%matplotlib inline

# Function that loads the data
def load_data(data_dir):
    # Skip the header row and the empty string after the final newline
    with open(data_dir + "train.csv") as f:
        train_data = f.read().split("\n")[1:-1]
    train_data = [i.split(",") for i in train_data]
    # Column 0 is the label; the remaining columns are pixel values
    X_train = np.array([[int(i[j]) for j in range(1, len(i))] for i in train_data])
    y_train = np.array([int(i[0]) for i in train_data])
    with open(data_dir + "test.csv") as f:
        test_data = f.read().split("\n")[1:-1]
    test_data = [i.split(",") for i in test_data]
    X_test = np.array([[int(i[j]) for j in range(0, len(i))] for i in test_data])
    return X_train, y_train, X_test

# KNN module
class simple_knn():
    def __init__(self):
        pass

    def train(self, X, y):
        # KNN has no fitting step: it only stores the training data
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1):
        # Compute distances between test and training samples
        dists = self.compute_distances(X)
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # Training labels sorted by distance to test sample i
            labels = self.y_train[np.argsort(dists[i, :])].flatten()
            k_closest_y = labels[:k]  # labels of the k nearest neighbours
            c = Counter(k_closest_y)
            y_pred[i] = c.most_common(1)[0][0]  # majority vote
        return y_pred

    def compute_distances(self, X):
        # Uses ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 to avoid explicit loops
        dot_pro = np.dot(X, self.X_train.T)
        sum_square_test = np.square(X).sum(axis=1)
        sum_square_train = np.square(self.X_train).sum(axis=1)
        sq_dists = -2 * dot_pro + sum_square_train + sum_square_test[:, np.newaxis]
        # Clip tiny negative values caused by rounding before taking the root
        dists = np.sqrt(np.maximum(sq_dists, 0))
        return dists
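The distance computation relies on the identity ||a - b||^2 = ||a||^2 - 2 a·b + ||b||^2, which replaces a double loop with one matrix product. The sketch below checks a standalone version against a brute-force loop on random data; the function name and array sizes are illustrative assumptions.

```python
import numpy as np

def pairwise_distances(A, B):
    """Euclidean distances between every row of A and every row of B."""
    sq_a = np.square(A).sum(axis=1)[:, np.newaxis]  # shape (len(A), 1)
    sq_b = np.square(B).sum(axis=1)[np.newaxis, :]  # shape (1, len(B))
    sq_dists = sq_a - 2.0 * (A @ B.T) + sq_b
    # Rounding can push true zeros slightly negative, so clip before the root.
    return np.sqrt(np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 784))   # stand-in for a batch of test images
B = rng.normal(size=(6, 784))   # stand-in for training images

fast = pairwise_distances(A, B)
slow = np.array([[np.linalg.norm(a - b) for b in B] for a in A])
print(fast.shape, np.allclose(fast, slow))
```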

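X_train, y_train, X_test = load_data(PATH)
batch_size = 2000
k = 3  # number of neighbours (the KNN parameter)
classifier = simple_knn()
classifier.train(X_train, y_train)  # store the training data before predicting

Call the KNN module to predict the test data: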

predictions = []
num_batches = math.ceil(len(X_test) / batch_size)  # include a final partial batch, if any
for i in range(num_batches):
    print("Computing batch " + str(i + 1) + "/" + str(num_batches) + "...")
    tic = time.time()
    predts = classifier.predict(X_test[i * batch_size:(i + 1) * batch_size], k)
    toc = time.time()
    predictions = predictions + list(predts)
    print("Completed this batch in " + str(toc - tic) + " Secs.")
print("Completed predicting the test data.")

Computing batch 1/14...
Completed this batch in 53.51499319076538 Secs.
Computing batch 2/14...
Completed this batch in 43.31397557258606 Secs.
Computing batch 3/14...
Completed this batch in 42.59756851196289 Secs.
Computing batch 4/14...
Completed this batch in 43.00966835021973 Secs.
Computing batch 5/14...
Completed this batch in 43.01448702812195 Secs.
Computing batch 6/14...
Completed this batch in 47.93128275871277 Secs.
Computing batch 7/14...
Completed this batch in 44.85835313796997 Secs.
Computing batch 8/14...
Completed this batch in 44.42547106742859 Secs.
Computing batch 9/14...
Completed this batch in 44.020007610321045 Secs.
Computing batch 10/14...
Completed this batch in 44.085976362228394 Secs.
Computing batch 11/14...
Completed this batch in 43.6392982006073 Secs.
Computing batch 12/14...
Completed this batch in 43.603368282318115 Secs.
Computing batch 13/14...
Completed this batch in 45.03933787345886 Secs.
Computing batch 14/14...
Completed this batch in 44.59685492515564 Secs.
Completed predicting the test data.
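Batching keeps the distance matrix manageable: with 8-byte entries, a 2,000 × 42,000 block takes roughly 0.67 GB, whereas all 28,000 test rows at once would need about 9.4 GB. The sketch below shows one way to generate batch bounds that also cover a final partial batch; `batch_slices` is a hypothetical helper, not part of the original module.

```python
import math

def batch_slices(n_items, batch_size):
    """Return (start, stop) index pairs that cover range(n_items) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    n_batches = math.ceil(n_items / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_items))
            for i in range(n_batches)]

print(len(batch_slices(28000, 2000)))   # 14
print(batch_slices(5, 2))               # [(0, 2), (2, 4), (4, 5)]
```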

with open(PATH + "submission2.csv", "w") as out_file:
    out_file.write("ImageId,Label\n")
    for i in range(len(predictions)):
        out_file.write(str(i + 1) + "," + str(int(predictions[i])) + "\n")

This approach reaches 97.114% accuracy, a slight improvement.

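3.3 NN Model

Next we try one of the most basic neural network models: the MLP (multilayer perceptron). The MLPClassifier in sklearn's neural network module is used for the image classification problem.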

# Load the data

train = pd.read_csv(PATH+"train.csv")

test = pd.read_csv(PATH+"test.csv")

Y = train['label'][:10000]# use more number of rows for more training

X = train.drop(['label'], axis =1)[:10000]# use more number of rows for more training

x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=0.20, random_state=42)

model = neural_network.MLPClassifier(alpha=1e-5, hidden_layer_sizes=(5,), solver='lbfgs', random_state=18)

model.fit(x_train, y_train)

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,

beta_2=0.999, early_stopping=False, epsilon=1e-08,

hidden_layer_sizes=(5,), learning_rate='constant',

learning_rate_init=0.001, max_iter=200, momentum=0.9,

n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,

random_state=18, shuffle=True, solver='lbfgs', tol=0.0001,

validation_fraction=0.1, verbose=False, warm_start=False)

With the classifier above built, the validation data is fed to it to check how well the model performs.

predicted = model.predict(x_val)

print("Classification Report:\n %s:"%(metrics.classification_report(y_val, predicted)))

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       186
           1       0.97      0.81      0.88       210
           2       0.12      0.99      0.21       220
           3       0.00      0.00      0.00       190
           4       0.00      0.00      0.00       188
           5       0.00      0.00      0.00       194
           6       0.00      0.00      0.00       190
           7       0.00      0.00      0.00       233
           8       0.00      0.00      0.00       197
           9       0.00      0.00      0.00       192

    accuracy                           0.19      2000
   macro avg       0.11      0.18      0.11      2000
weighted avg       0.12      0.19      0.12      2000

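The report shows the result of classifying with the MLP model: accuracy is only 0.19, and the model predicts class 2 for almost every sample (recall 0.99 but precision 0.12 for that class). As configured here, with a single hidden layer of 5 units and unscaled pixel inputs, the multilayer perceptron performs poorly on this problem, which prompts us to try other neural network models to see whether they do better.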
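The pattern of one class with very high recall and very low precision is what a model that almost always outputs the same label produces. The sketch below computes both metrics by hand on toy labels (not the article's data); `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, cls):
    """Precision and recall for one class; 0.0 when the denominator is empty."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)
    actual = sum(1 for t in y_true if t == cls)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy labels, not the article's data: a model that answers 2 for everything.
y_true = [0, 1, 2, 2, 3, 4, 2, 5]
y_pred = [2] * len(y_true)

print(precision_recall(y_true, y_pred, 2))  # (0.375, 1.0)
print(precision_recall(y_true, y_pred, 0))  # (0.0, 0.0)
```

Every true 2 is found, so recall is perfect, but most predictions of 2 are wrong, so precision is low; every other class scores zero on both.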

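3.4 CNN

3.4.1 Data Processing and Preparation

Before the data can be fed into the model, it needs some further processing. In a keras CNN, the convolution and related layers learn to extract image features automatically, so there is no longer any need to hand-craft rules for extracting features from the images.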
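The text breaks off before showing the preparation code, so the following is only a sketch of a typical preparation step, not the author's code. It assumes a channels-last layout of shape (n, 28, 28, 1), pixel scaling to [0, 1], and one-hot labels; the function name and the random stand-in data are illustrative.

```python
import numpy as np

def prepare_for_cnn(pixels, labels, num_classes=10, side=28):
    """Reshape flat pixel rows to (n, side, side, 1), scale to [0, 1], one-hot the labels."""
    x = np.asarray(pixels, dtype=np.float32).reshape(-1, side, side, 1) / 255.0
    y = np.eye(num_classes, dtype=np.float32)[np.asarray(labels, dtype=int)]
    return x, y

# Synthetic stand-in for a few rows of train.csv.
rng = np.random.default_rng(0)
fake_pixels = rng.integers(0, 256, size=(5, 784))
fake_labels = np.array([1, 0, 1, 4, 0])

x, y = prepare_for_cnn(fake_pixels, fake_labels)
print(x.shape, y.shape)
```

Arrays in this form can be passed to a Sequential model whose first Conv2D layer declares input_shape=(28, 28, 1) and whose loss is categorical cross-entropy.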

Published: 2024-09-25 02:25:51

Article link: https://www.17tex.com/xueshu/326293.html
