机器学习（一）---KNN算法总结（手写体识别）

机器学习（⼀）---KNN算法总结（⼿写体识别）

1、算法综述

Cover和Hart在1968年提出了最初的邻近算法，K-近邻算法（k-Nearest Neighbors）的本质计算未知数据集（测试样本）与训练集的各样本的距离，按从⼩到⼤取前k个距离排序，然后选择k个最相似数据中出现次数最多的分类作为新数据的分类。

(1)根据《机器学习》第10.1节，最近邻分类器虽然简单，但它的泛化错误率不超过贝叶斯最优分类器的错误率的两倍；

(2)对于⼆分类问题k值⼀般取奇数，便于分类；

（3)特征（属性）可以为多个，分类也可以是多个（⽐如考察约会对象的喜欢与否，特征为：玩游戏百分⽐、飞⾏⾥程、冰淇淋消耗；样本分为三类：1-特别喜欢，2-⼀般，3-不喜欢）；

(4)距离⼀般选⽤欧⽒距离或者曼哈顿距离；

(5)算法结果在很⼤程度上取决于k的取值以及距离公式选择。

2、评价

优点：精度⾼、对异常值不敏感、⽆数据输⼊假定(仅仅是把样本存起来，训练时间为0)。

缺点：计算复杂度⾼、空间复杂度⾼（关键是分类图⽚时，背景影响很⼤，这⼀点下⾯后说）。

适⽤数据范围：数值型和标称性。

3、⼯作原理

存在⼀个样本数据集合，也称作训练样本集，并且样本集中每个数据都存在标签，即我们知道样本集中每⼀个数据与所属分类的对应关系。输⼊没有标签的新数据后，将新数据的每个特征与样本集中数据对应的特征进⾏⽐较，然后算法提取样本集中特征最相似数据（最近邻）的分类标签。⼀般来说，我们只选择样本数据及中前ｋ个最相似的数据，这就是k-近邻算法中k的出处，通常k选择不⼤于20的整数。最后，选择k个最相似数据中出现次数最多的分类，作为新数据的分类。

4、K-近邻算法的⼀般流程

（1）收集数据：可以使⽤任何⽅法

（2）准备数据：距离计算所需要的数值，最好是结构化的数据格式

集通信（3）分析数据：可以使⽤任何⽅法

（4）训练算法：此步骤不适⽤k-邻近算法

（5）测试算法：计算错误率

（6）使⽤算法：⾸先需要输⼊样本数据和结构化的输出结果，然后运⾏k-近邻算法判定输⼊数据分别属于哪个分类，最后应⽤对计算出的分类执⾏后续的处理。

5、算法应⽤举例（Python和C++）

1、使⽤K-近邻算法分类爱情⽚和动作⽚

可以以打⽃次数和接吻次数来作为⼆维图的x轴和y轴坐标：

⾸先计算出未知电影与样本集中其他电影的距离（距离的具体计算先不叙述）

按照距离递增排序，可以到k个距离最近的电影，若k=3，则三个最靠近的电影依次是He’s Not Really into Dudes, Beautiful Woman, andCalifornia Man.这三部电影全部是爱情⽚，因此我们判定未知电影是爱情⽚。

C++实现：

数据集

3 10

4 R

2 100 R

1 81 R

101 10 A

99 5 A

98 2 A

代码

#include <iostream>

#include<map>

#include<vector>

#include<stdio.h>

#include<cmath>

#include<cstdlib>

#include<algorithm>

#include<fstream>

using namespace std;

typedef char tLabel;

typedef double tData;

typedef pair<int,double> PAIR;

const int colLen = 2;//导⼊新的数据集时只需要修改⾏列参数

const int rowLen = 6;

ifstream fin;

ofstream fout;

class KNN

{

private:

tData dataSet[rowLen][colLen]; //⽤数组定义样本集

tLabel labels[rowLen];

tData testData[colLen];

int k;

map<int,double> map_index_dis;

solvermap<tLabel,int> map_label_freq;

double get_distance(tData *d1,tData *d2); //计算两两样本间距离函数public:

KNN(int k); //构造函数

void get_all_distance();

void get_max_freq_label();

struct CmpByValue

{

bool operator() (const PAIR& lhs,const PAIR& rhs)

{

return lhs.second < rhs.second;

}

};

KNN::KNN(int k)

{

this->k = k;

fin.open("");//导⼊新的数据集时只需修改⽂件名

if(!fin)

{

cout<<"can not open the "<<endl;

exit(1);

}

for(int i = 0; i < rowLen; i++)

{

功率谱密度for(int j = 0;j <colLen; j++)

{

fin>>dataSet[i][j];

}

fin>>labels[i];

}

cout<<"please input the test data :"<<endl;

//输⼊测试数据

for(int i=0;i<colLen;i++)

cin>>testData[i];

}

double KNN:: get_distance(tData *d1,tData *d2)

{

double sum = 0;

for(int i=0;i<colLen;i++)

{

sum += pow( (d1[i]-d2[i]) , 2 );

}

// cout<<"the sum is = "<<sum<<endl;

return sqrt(sum);

}

//计算测试样本与训练集中每个样本的距离

void KNN:: get_all_distance()

{

double distance;

int i;

for(i=0;i<rowLen;i++)

{

distance = get_distance(dataSet[i],testData);电子导盲仪

//<key,value> => <i,distance>

map_index_dis[i] = distance;

}

//遍历map，打印各个序号和距离

map<int,double>::const_iterator it = map_index_dis.begin();

while(it!=map_d())

{

cout<<"index = "<<it->first<<" distance = "<<it->second<<endl;

it++;

}

//在k值设定的情况下，计算测试数据属于哪个lable，并输出

void KNN:: get_max_freq_label()

{

//将map_index_dis转换为vec_index_dis

vector<PAIR> vec_index_dis( map_index_dis.begin(),map_d() );

//对vec_index_dis进⾏从低到⾼排序，以获得最近距离数据

sort(vec_index_dis.begin(),vec_d(),CmpByValue());

少先队章程

for(int i=0;i<k;i++)

{

cout<<"the index = "<<vec_index_dis[i].first<<" the distance = "<<vec_index_dis[i].second<<" the label = "<<labels[vec_index_dis[i].first]<<" the coordinate ( "< //calculate the count of each label

map_label_freq[ labels[ vec_index_dis[i].first ] ]++;

}

map<tLabel,int>::const_iterator map_it = map_label_freq.begin();

tLabel label;

int max_freq = 0;

//find the most frequent label

while( map_it != map_d() )

{

if( map_it->second > max_freq )

{

max_freq = map_it->second;

label = map_it->first;

}

map_it++;

}

cout<<"The test data belongs to the "<<label<<" label"<<endl;

}

int main()

{

int k ;

cout<<"please input the k value : "<<endl;

cin>>k;

KNN knn(k);

<_all_distance();

<_max_freq_label();

return 0;

}

运⾏结果：

2、Python实现⼿写体识别：

#coding=utf-8

from numpy import *

import operator

from os import listdir

#k近邻算法

def classify0(inX, dataSet, labels, k):

dataSetSize = dataSet.shape[0] #返回dataset这个array的⾏数

diffMat = tile(inX, (dataSetSize,1)) - dataSet #tile(A,reps)将A补成reps规格

sqDiffMat = diffMat**2 #平⽅

sqDistances = sqDiffMat.sum(axis=1) #默认的axis=0 就是普通的相加⽽当加⼊axis=1以后就是将⼀个矩阵的每⼀⾏向量相加 distances = sqDistances**0.5 #开⽅

sortedDistIndicies = distances.argsort() #argsort其实是返回array排序后的下标（或索引）

classCount={} #新建⼀个字典

for i in range(k):

voteIlabel = labels[sortedDistIndicies[i]]

#依次查询cclassCount中是否有该key，有则将取出value再+1，没有则返回添加该key并置value为0，再+1

classCount[voteIlabel] = (voteIlabel,0) + 1 #统计得到各个标签的个数

#按classCount字典的第2个元素（即类别出现的次数）从⼤到⼩排序，即获得得票最⾼的标签

sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)

return sortedClassCount[0][0]

#从⽂本⽂件解析数据

def file2matrix(filename):

fr = open(filename)

numberOfLines = adlines()) #get the number of lines in the file

returnMat = zeros((numberOfLines,3)) #prepare matrix to return

classLabelVector = [] #prepare labels return

fr = open(filename)

index = 0

for line adlines():

line = line.strip() #删除⽂本⾏line后的回车符

listFromLine = line.split('\t') #使⽤’\t’分割字符串str，返回⼀个列表

returnMat[index,:] = listFromLine[0:3]

classLabelVector.append(int(listFromLine[-1]))

index += 1

return returnMat,classLabelVector

#归⼀化特征值

def autoNorm(dataSet):

minVals = dataSet.min(0)

maxVals = dataSet.max(0)

当代文学思潮

ranges = maxVals - minVals

normDataSet = zeros(shape(dataSet)) #shape数组或矩阵的各个维的⼤⼩

本文发布于:2024-09-23 16:19:44，感谢您对本站的认可！

本文链接：https://www.17tex.com/xueshu/326143.html

上一篇：机器学习算法模型——K近邻法（KNN）

下一篇：机器学习--KNN算法基本思想

标签：数据样本距离算法分类集中计算

留言与评论（共有 0 条评论）