

Ex6_Machine Learning_Andrew Ng Course Assignment (Python): SVM (Support Vector Machines)

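Usage notes:

This article is a set of study notes on Andrew Ng's Machine Learning course on Coursera.

Part 1 reviews the material for the corresponding week of the course, with key notes, and imports the libraries used by the code. Part 2 covers the implementation details of the self-created functions used in the code.

Part 3 contains the concrete code that corresponds to the course exercise.

0. Pre-condition

This section introduces the libraries used.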

00. Self-created Functions

This section includes self-created functions.

loadData(path): load the data

# This file includes self-created functions used in exercise 6
import re  # regular expressions for e-mail processing

import matplotlib.pyplot as plt
import nltk.stem.porter  # Porter stemming algorithm for English words
import numpy as np
import pandas as pd
from scipy.io import loadmat
from sklearn import svm

# Load data from the given file
# ARGS: { path: path to the data file }
def loadData(path):
    data = loadmat(path)
    return data['X'], data['y']

plotData(X, y): visualize the data

# Visualize data
# ARGS: { X: training set; y: label set }
def plotData(X, y):
    plt.figure(figsize=[8, 6])
    plt.scatter(X[:, 0], X[:, 1], c=y.flatten())
    plt.ylabel('X2')
    plt.title('Data Visualization')
    # plt.show()

plotBoundary(classifier, X): plot the decision boundary between classes

displayBoundaries(X, y): plot the decision boundaries for different values of the SVM parameter C (linear kernel)

gaussianKernel(x1, x2, sigma): implement the Gaussian kernel

displayGaussKernelBoundary(X, y, C, sigma): plot the decision boundary of a Gaussian-kernel SVM on a dataset

# Plot the boundary between two classes
# ARGS: { classifier: trained classifier; X: training set }
def plotBoundary(classifier, X):
    x_min, x_max = X[:, 0].min() * 1.2, X[:, 0].max() * 1.1
    y_min, y_max = X[:, 1].min() * 1.2, X[:, 1].max() * 1.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    # Use the given classifier to predict the class of every grid point
    Z = classifier.predict(np.c_[xx.flatten(), yy.flatten()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)

# Display boundaries for different situations with different C (1 and 100)
# ARGS: { X: training set; y: label set }
def displayBoundaries(X, y):
    # Build one scikit-learn SVM model with a linear kernel for each value of C
    models = [svm.SVC(C=C, kernel='linear') for C in [1, 100]]
    # Train each model on the training set X and labels y to obtain the classifiers
    classifiers = [model.fit(X, y.flatten()) for model in models]
    # Plot titles
    titles = ['SVM Decision Boundary with C = {}'.format(C) for C in [1, 100]]
    # Plot the decision boundary found by each classifier
    for classifier, title in zip(classifiers, titles):
        plotData(X, y)
        plotBoundary(classifier, X)
        plt.title(title)
    # Show the figures
    plt.show()

# Implement a Gaussian kernel function
# (it can be regarded as a similarity function that measures the distance between a pair of examples)
# ARGS: { x1: example 1; x2: example 2; sigma: Gaussian kernel parameter }
def gaussianKernel(x1, x2, sigma):
    return np.exp(-(np.power(x1 - x2, 2).sum() / (2 * np.power(sigma, 2))))
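Written as a formula, gaussianKernel evaluates the expression below. The symbols $x^{(i)}$, $x^{(j)}$, $n$ (number of features) and $\sigma$ follow the usual course notation and are an assumption about notation, not something this article defines:

$$
K\left(x^{(i)}, x^{(j)}\right) = \exp\left(-\frac{\lVert x^{(i)} - x^{(j)} \rVert^{2}}{2\sigma^{2}}\right) = \exp\left(-\frac{\sum_{k=1}^{n}\left(x^{(i)}_{k} - x^{(j)}_{k}\right)^{2}}{2\sigma^{2}}\right)
$$

The value is 1 when the two examples coincide and decays towards 0 as they move apart; a smaller $\sigma$ makes the decay faster.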

# Display the decision boundary using SVM with a Gaussian kernel
# ARGS: { X: training set; y: label set; C: SVM parameter; sigma: Gaussian kernel parameter }
def displayGaussKernelBoundary(X, y, C, sigma):
    gamma = np.power(sigma, -2.) / 2
    # 'rbf' is the radial basis function, i.e. the Gaussian kernel
    model = svm.SVC(C=C, kernel='rbf', gamma=gamma)
    classifier = model.fit(X, y.flatten())
    plotData(X, y)
    plotBoundary(classifier, X)
    plt.title('Decision boundary using SVM with a Gaussian Kernel')
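The conversion gamma = sigma^(-2) / 2 used above works because scikit-learn's 'rbf' kernel is defined as exp(-gamma * ||x - x'||^2), which matches the Gaussian kernel when gamma = 1 / (2 * sigma^2). The following is a minimal standard-library sketch of that equivalence; the helper names gaussian_kernel and rbf_kernel are illustrative only and are not part of the article's code or of scikit-learn.

```python
import math

def squared_distance(x1, x2):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x1, x2))

def gaussian_kernel(x1, x2, sigma):
    """Gaussian kernel parameterised by sigma, as in gaussianKernel."""
    return math.exp(-squared_distance(x1, x2) / (2 * sigma ** 2))

def rbf_kernel(x1, x2, gamma):
    """RBF kernel parameterised by gamma: exp(-gamma * ||x1 - x2||^2)."""
    return math.exp(-gamma * squared_distance(x1, x2))

sigma = 2.0
gamma = sigma ** -2 / 2  # same conversion as in displayGaussKernelBoundary
x1, x2 = [1, 2, 1], [0, 4, -1]

k_sigma = gaussian_kernel(x1, x2, sigma)
k_gamma = rbf_kernel(x1, x2, gamma)
print(k_sigma, k_gamma)
```

Both calls evaluate exp(-9/8), because the squared distance between the two vectors is 1 + 4 + 4 = 9.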

trainGaussParams(X, y, Xval, yval): compare errors on the cross-validation set to find the best parameters C and sigma

preprocessEmail(email): preprocess an email

email2TokenList(email): perform word stemming and remove non-word content, returning a list of words

# Find the best parameters 'C' and 'sigma' with the least error on the validation set
# ARGS: { X: training set; y: training labels; Xval: cross-validation set; yval: cross-validation labels }
def trainGaussParams(X, y, Xval, yval):
    C_values = (0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.)
    sigma_values = C_values
    best_pair, best_score = (0, 0), 0
    for C in C_values:
        for sigma in sigma_values:
            gamma = np.power(sigma, -2.) / 2
            model = svm.SVC(C=C, kernel='rbf', gamma=gamma)
            model.fit(X, y.flatten())
            # Accuracy on the cross-validation set
            this_score = model.score(Xval, yval.flatten())
            if this_score > best_score:
                best_score = this_score
                best_pair = (C, sigma)
    print('Best pair(C, sigma): {}, best score: {}'.format(best_pair, best_score))
    return best_pair[0], best_pair[1]
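The search pattern in trainGaussParams is an exhaustive grid search: every (C, sigma) pair is scored and the best one is kept. The sketch below isolates that pattern from scikit-learn so it can be run on its own. The names grid_search and toy_score are hypothetical, and toy_score is only a stand-in for "fit on the training set, score on the validation set"; it is not a real model.

```python
from itertools import product

def grid_search(score_fn, c_values, sigma_values):
    """Return the (C, sigma) pair with the highest score, and that score."""
    best_pair, best_score = None, float('-inf')
    for C, sigma in product(c_values, sigma_values):
        score = score_fn(C, sigma)
        # Strict '>' keeps the first pair encountered when scores tie
        if score > best_score:
            best_pair, best_score = (C, sigma), score
    return best_pair, best_score

def toy_score(C, sigma):
    """Hypothetical score that peaks at C = 1.0, sigma = 0.1."""
    return -((C - 1.0) ** 2 + (sigma - 0.1) ** 2)

values = (0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0)
best_pair, best_score = grid_search(toy_score, values, values)
print(best_pair)
```

Starting best_score at negative infinity, rather than 0 as in the function above, is a small design choice that also works for score functions that can be negative; for accuracy values in [0, 1] the two behave the same unless every score is 0.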

# Preprocess an email
# Performs every processing step except word stemming and removal of non-words
def preprocessEmail(email):
    # Lower-case the whole text
    email = email.lower()
    # Strip HTML: match '<', then one or more characters that are neither '<' nor '>', then '>', i.e. <...>
    email = re.sub(r'<[^<>]+>', ' ', email)
    # Normalize URLs: replace every URL with "httpaddr"
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)
    # Normalize e-mail addresses: replace every address with "emailaddr"
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)
    # Normalize dollar signs
    email = re.sub(r'[\$]+', 'dollar', email)
    # Normalize numbers
    email = re.sub(r'[\d]+', 'number', email)
    return email
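To see what these substitutions do, here is a small standalone sketch that applies the same kind of rules to a made-up sample string. The names RULES and normalize and the sample text are illustrative assumptions. Note that the HTML pattern needs the `+` quantifier: without it, `<[^<>]>` only matches tags with a single character between the angle brackets, so `<b>` would be removed but `</b>` would not.

```python
import re

# (pattern, replacement) pairs, applied in order after lower-casing
RULES = [
    (r'<[^<>]+>', ' '),                      # HTML tags such as <b> or </b>
    (r'(http|https)://[^\s]*', 'httpaddr'),  # URLs
    (r'[^\s]+@[^\s]+', 'emailaddr'),         # e-mail addresses
    (r'[\$]+', 'dollar'),                    # dollar signs
    (r'[\d]+', 'number'),                    # runs of digits
]

def normalize(text):
    """Lower-case the text and apply each substitution rule in turn."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

sample = 'Visit <b>https://example.com</b> or mail Bob@example.com to win $100'
result = normalize(sample)
print(result.split())
```

For this sample the printed token list is ['visit', 'httpaddr', 'or', 'mail', 'emailaddr', 'to', 'win', 'dollarnumber']. The order of the rules matters: tags are stripped first so that the URL pattern does not swallow the closing tag.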

# Conduct word stemming and removal of non-words, returning the processed words one by one.
# NLTK's stemmer is used here, since it is more accurate and efficient.
def email2TokenList(email):
    # Preprocess the email
    email = preprocessEmail(email)
    # Instantiate the stemmer
    stemmer = nltk.stem.porter.PorterStemmer()
    # Split the whole email into separate words
    tokens = re.split(r'[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\{\}\,\'\"\>\_\<\;\%]', email)
    # Traverse all the split contents
    token_list = []
    for token in tokens:
        # Remove any non-alphanumeric characters
        token = re.sub('[^a-zA-Z0-9]', '', token)
        # Stem the word
        stemmed_word = stemmer.stem(token)
        # Skip empty strings
        if not len(token):
            continue
        # Append the word to the list
        token_list.append(stemmed_word)
    return token_list

email2VocabularyList(email, vocab_list): get the indices of the words that appear both in the email and in the vocabulary list

# Get the indices of words that exist both in the email and the vocabulary list
# ARGS: { email: email; vocab_list: vocabulary list }
def email2VocabularyList(email, vocab_list):
    token = email2TokenList(email)
    index = [i for i in range(len(vocab_list)) if vocab_list[i] in token]
    return index

email2FeatureVector(email): extract the features of an email

# Extract features from email, turn the email into a feature vector
# (its length equals the vocabulary size; an entry is 1 if the word occurs in the email, otherwise 0)
# ARGS: { email: email }
def email2FeatureVector(email):
    # The provided vocabulary list
    df = pd.read_table('../data/voca<', names=['words'])
    vocab_list = np.asmatrix(df)
    # Same length as the vocabulary list
    feature_vector = np.zeros(len(vocab_list))
    # Set the entry to 1 for every vocabulary word that occurs in the email
    vocab_indices = email2VocabularyList(email, vocab_list)
    for i in vocab_indices:
        feature_vector[i] = 1
    return feature_vector

1. Support Vector Machines

In the first half of this exercise, you will be using support vector machines (SVMs) with various example 2D datasets. Experimenting with these datasets will help you gain an intuition of how SVMs work and how to use a Gaussian kernel with SVMs.

In the next half of the exercise, you will be using support vector machines to build a spam classifier.

The functions called here are described in detail in the "Self-created Functions" section at the top of this article.

1.1 Example dataset 1

# 1. Support Vector Machines
path = '../data/ex6data1.mat'
X, y = func.loadData(path)

# 1.1 Example dataset 1
# Visualize the data
func.plotData(X, y)
# Try different values of C and plot the decision boundary for each
func.displayBoundaries(X, y)

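Data visualization:

Decision boundary (linear kernel, C = 1):

Decision boundary (linear kernel, C = 100):

From the plots above we can see:

When $C$ is large (i.e. $1/\lambda$ is large and $\lambda$ is small), the model penalizes misclassification more heavily. It is stricter, misclassifies fewer examples, and has a smaller margin.

When $C$ is small (i.e. $1/\lambda$ is small and $\lambda$ is large), the model penalizes misclassification less heavily. It is more lenient, tolerates some misclassified examples, and has a larger margin.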
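The role of $C$ can be read off the SVM training objective as presented in the course lectures. The $\mathrm{cost}_1$ / $\mathrm{cost}_0$ notation, $m$ (number of examples) and $n$ (number of features) come from the lectures and are not defined in this article:

$$
\min_{\theta}\; C \sum_{i=1}^{m} \left[ y^{(i)}\, \mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right) + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
$$

$C$ multiplies the misclassification term while the regularization term has a fixed weight, so increasing $C$ has the same effect as decreasing $\lambda$ in regularized logistic regression.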

1.2 SVM with Gaussian Kernels

To find nonlinear decision boundaries with an SVM, we first need to implement the Gaussian kernel. The Gaussian kernel can be thought of as a similarity function that measures the "distance" between a pair of examples $(x^{(i)}, x^{(j)})$. Note that most SVM libraries automatically add the extra feature $x_0 = 1$ and take care of the intercept term $\theta_0$, so there is no need to add them manually.

# 1.2 SVM with Gaussian Kernels
path2 = '../data/ex6data2.mat'
X2, y2 = func.loadData(path2)
path3 = '../data/ex6data3.mat'
df3 = loadmat(path3)
X3, y3, Xval, yval = df3['X'], df3['y'], df3['Xval'], df3['yval']

1.2.1 Gaussian Kernel

# 1.2.1 Gaussian Kernel
res_gaussianKernel = func.gaussianKernel(np.array([1, 2, 1]), np.array([0, 4, -1]), 2.)
print(res_gaussianKernel)  # 0.32465246735834974

1.2.2 Example dataset 2

# 1.2.2 Example dataset 2
# Visualize the data
func.plotData(X2, y2)
# Plot the decision boundary of the Gaussian-kernel SVM on the dataset
func.displayGaussKernelBoundary(X2, y2, C=1, sigma=0.1)

Data visualization:

Decision boundary (Gaussian kernel):

1.2.3 Example dataset 3

Published: 2024-09-24 15:22:21

Article link: https://www.17tex.com/xueshu/570004.html
