Data Analysis: Classification and Regression Models (Part 1)

I. Classification and Regression Methods
This series organizes the main classification and regression algorithms, drawing on material from books and online sources.
II. Classification Methods
The code follows the book 《人工智能:python实现》 (Artificial Intelligence: A Python Implementation), with some modifications.
1. Logistic Regression
Steps for building a logistic regression model:
1. Define candidate features x1, x2, ..., xp according to the mining goal, then screen them. With sklearn's feature_selection module, an F-test gives each feature's F value and p value; keep features with a large F and a small p. RFE (recursive feature elimination) and stability selection (SS) are alternative screening methods (see the sketch after this list).
2. Write down the regression equation: ln(p / (1 - p)) = β0 + β1·x1 + ... + βp·xp + e.
3. Estimate the regression coefficients.
4. Test the fitted model.
5. Use the model for prediction and control.
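As a concrete illustration of the feature-screening step, here is a minimal sketch using sklearn.feature_selection. SelectKBest with f_classif computes each feature's F value and p value and keeps the k features with the largest F (smallest p); RFE is shown for comparison. The iris dataset and k=2 are illustrative choices, not data from this article.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# F-test screening: a large F value (small p value) means the feature separates the classes well
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print("F values:", selector.scores_)
print("p values:", selector.pvalues_)

# RFE (recursive feature elimination) with a logistic regression estimator
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print("RFE ranking (1 = selected):", rfe.ranking_)

With the features chosen, the example below fits a logistic regression classifier on a small hand-made dataset.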
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
from utilities import visualize_classifier
# Define sample input data
X = np.array([[3.1, 7.2], [4, 6.7], [2.9, 8], [5.1, 4.5], [6, 5], [5.6, 5], [3.3, 0.4], [3.9, 0.9], [2.8, 1], [0.5, 3.4], [1, 4], [0.6, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
test=np.array([[3.5,6.7],[4.2,5.5]])
help(linear_model.LogisticRegression)
# Create the logistic regression classifier
clf = linear_model.LogisticRegression(solver='liblinear', C=1)  # C is the inverse regularization strength; a larger C weakens regularization and makes overfitting more likely
# Default signature:
# LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1,
#                    class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn',
#                    verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
# Train the classifier
clf.fit(X, y)
# Print the fitted parameters
print(clf.classes_)      # class labels
print(clf.coef_)         # coefficients of the decision function
print(clf.intercept_)    # intercepts
print(clf.n_iter_)       # number of iterations
print(clf.score(X, y))   # mean accuracy on the training data
y_pred = clf.predict(test)  # predict on new data
print(y_pred)
# Graphical output: plot the decision boundaries
def visualize_classifier(classifier, X, y):
    # Define the minimum and maximum values for X and Y
    # that will be used in the mesh grid
    min_x, max_x = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    min_y, max_y = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0
    # Define the step size to use in plotting the mesh grid
    mesh_step_size = 0.01
    # Define the mesh grid of X and Y values
    x_vals, y_vals = np.meshgrid(np.arange(min_x, max_x, mesh_step_size), np.arange(min_y, max_y, mesh_step_size))
    # Run the classifier on the mesh grid
    output = classifier.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
    # Reshape the output array
    output = output.reshape(x_vals.shape)
    # Create a plot
    plt.figure()
    # Choose a color scheme for the plot
    plt.pcolormesh(x_vals, y_vals, output, cmap=plt.cm.gray)
    # Overlay the training points on the plot
    plt.scatter(X[:, 0], X[:, 1], c=y, s=75, edgecolors='black', linewidth=1, cmap=plt.cm.Paired)
    # Specify the boundaries of the plot
    plt.xlim(x_vals.min(), x_vals.max())
    plt.ylim(y_vals.min(), y_vals.max())
    # Specify the ticks on the X and Y axes
    plt.xticks(np.arange(int(min_x), int(max_x), 1.0))
    plt.yticks(np.arange(int(min_y), int(max_y), 1.0))
    plt.show()
visualize_classifier(clf, X, y)
Randomized Logistic Regression
sklearn.linear_model.RandomizedLogisticRegression
The scikit-learn documentation describes randomized logistic regression as follows:
Randomized Logistic Regression works by subsampling the training data and fitting a L1-penalized LogisticRegression model where the penalty of a random subset of coefficients has been scaled. By performing this double randomization several times, the method assigns high scores to features that are repeatedly selected across randomizations. This is known as stability selection. In short, features selected more often are considered good features.
In other words: resample the training data many times and fit the regression model on each subsample, i.e. run the feature-selection procedure on different data subsets and feature subsets, repeat, and finally keep the features that end up with high scores. This is the stability-selection method; a feature's score is roughly the number of times it was chosen as important divided by the number of subsets in which it was tested.
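Note that RandomizedLogisticRegression was deprecated and has been removed from recent scikit-learn releases. The snippet below is therefore a minimal hand-rolled sketch of the stability-selection idea described above, not the library's own implementation: repeatedly subsample the rows, fit an L1-penalized LogisticRegression on each subsample, and record how often each feature keeps a non-zero coefficient. The helper name stability_selection, the sampling fraction, and the 0.6 threshold are illustrative choices.

import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_selection(X, y, n_rounds=100, sample_frac=0.75, C=1.0, seed=0):
    # For each feature, return the fraction of subsamples in which an
    # L1-penalized logistic regression kept a non-zero coefficient for it
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        idx = rng.choice(n_samples, size=int(sample_frac * n_samples), replace=False)
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
        clf.fit(X[idx], y[idx])
        # A feature survives this round if any class assigns it a non-zero weight
        counts += (np.abs(clf.coef_) > 1e-6).any(axis=0)
    return counts / n_rounds

# Illustrative usage: keep features selected in at least 60% of the rounds
# scores = stability_selection(X, y)
# selected_features = np.where(scores >= 0.6)[0]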
2. Naive Bayes Classification
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn import model_selection
from utilities import visualize_classifier
# Input file containing data
input_file = 'data_'
# Load data from input file
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]
############################################
# Cross validation
# Split data into training and test data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=3)  # split the data into training and test sets
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
# compute accuracy of the classifier
accuracy = 100.0 * (y_test == y_test_pred).sum() / X_test.shape[0]      # fraction of correct predictions
print("Accuracy of the new classifier =", round(accuracy, 2), "%")
# Visualize the performance of the classifier
visualize_classifier(classifier, X_test, y_test)
############################################
# Scoring functions
num_folds = 3    # 3-fold cross-validation
accuracy_values = model_selection.cross_val_score(classifier, X, y, scoring='accuracy', cv=num_folds)
print("Accuracy: " + str(round(100 * accuracy_values.mean(), 2)) + "%")
precision_values = model_selection.cross_val_score(classifier, X, y, scoring='precision_weighted', cv=num_folds)
print("Precision: " + str(round(100 * precision_values.mean(), 2)) + "%")
recall_values = model_selection.cross_val_score(classifier, X, y, scoring='recall_weighted', cv=num_folds)
print("Recall: " + str(round(100 * recall_values.mean(), 2)) + "%")
f1_values = model_selection.cross_val_score(classifier, X, y, scoring='f1_weighted', cv=num_folds)
print("F1: " + str(round(100 * f1_values.mean(), 2)) + "%")
The train/test split gives 100% accuracy, while 3-fold cross-validation gives only 99.75%. Does that mean cross-validation is a worse method? I also tried the two methods below, and both give 99.75% as well. The difference is that a train/test split evaluates the model on only part of the samples, whereas k-fold cross-validation effectively evaluates it on essentially all of the samples and averages over the k folds.
Looking further into k-fold validation, I found the following two additional variants:
(1) Stratified k-fold cross-validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
strKFold = StratifiedKFold(n_splits=3, shuffle=False)   # random_state only applies when shuffle=True
accuracy_values = cross_val_score(classifier, X, y, scoring='accuracy', cv=strKFold)
print("Accuracy: " + str(round(100 * accuracy_values.mean(), 2)) + "%")
Stratified cross-validation also gives 99.75% accuracy.
(2) Leave-one-out cross-validation
If the sample size is n, set k = n and run n-fold cross-validation, leaving out a single sample for validation each time. It is mainly used for small datasets.
from sklearn.model_selection import LeaveOneOut, cross_val_score
loout = LeaveOneOut()
scores = cross_val_score(classifier,X,y,cv=loout)
print("Accuracy: " + str(round(100*an(), 2)) + "%")
Leave-one-out cross-validation also gives 99.75% accuracy.
3. SVM (Support Vector Machines)
SVM classes in sklearn: LinearSVC, LinearSVR, SVC, NuSVC, SVR, NuSVR, OneClassSVM
from sklearn import svm
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC
from sklearn.svm import SVR
from sklearn.svm import LinearSVR
(1) SVM: one-vs-one
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsOneClassifier
from sklearn import model_selection
# Cross validation
# X, y: feature matrix and class labels, assumed to be loaded beforehand (e.g. as in the previous section)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=5)
classifier = OneVsOneClassifier(LinearSVC(random_state=0))    # one binary classifier per pair of classes
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
# Compute the F1 score of the SVM classifier
f1 = model_selection.cross_val_score(classifier, X, y, scoring='f1_weighted', cv=3)    # k-fold cross-validation
print("F1 score: " + str(round(100 * f1.mean(), 2)) + "%")
(2) SVM fitted parameters
# SVC is based on libsvm and is not suited to datasets with more than ~10K samples (high time complexity)
# Data preparation
import scipy.io as scio
data = scio.loadmat("train and test top321.mat")  # load the data
test = data['test']  # samples to classify
testclass = data['testclass']
testclass = np.ravel(testclass)
train = data['train']
trainclass = data['trainclass']
trainclass = np.ravel(trainclass)
# Fit the model and classify
clf = SVC(kernel='linear')
clf.fit(train, trainclass)
weight = clf.coef_  # weight vector of the linear SVM
print(clf.predict(test))
Pretest = clf.predict(test)  # predicted class labels
a = Pretest ^ testclass      # XOR: non-zero where the prediction differs from the true label (assumes 0/1 integer labels)
acc = (a.size-a.sum())/a.size
print("Accuracy:",acc)
print("⽀持向量指数:",clf.support_)
print("⽀持向量",clf.support_vectors_)
print("每⼀类的⽀持向量个数",clf.n_support_)
print("⽀持向量的系数:",clf.dual_coef_)
print("截距:",clf.intercept_)
(3) Multiclass problems
I don't fully understand this part yet, so I'll set it aside and come back to it later.
Approaches for solving multiclass problems with SVMs (a minimal sketch follows).
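As a placeholder until this part is revisited, here is a minimal sketch of the two standard multiclass strategies for SVMs: SVC handles several classes internally with a one-vs-one scheme, while OneVsOneClassifier and OneVsRestClassifier wrap a binary LinearSVC explicitly. The iris dataset is an illustrative stand-in, not data from this article.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

# SVC is one-vs-one internally: it trains k*(k-1)/2 binary SVMs and combines them by voting
ovo = SVC(kernel='linear')
print("SVC (one-vs-one) accuracy:", round(cross_val_score(ovo, X, y, cv=3).mean(), 3))

# Explicit one-vs-one wrapper around a linear SVM
ovo_wrap = OneVsOneClassifier(LinearSVC(max_iter=10000))
print("OneVsOneClassifier accuracy:", round(cross_val_score(ovo_wrap, X, y, cv=3).mean(), 3))

# One-vs-rest: one binary classifier per class, each trained against all other classes
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000))
print("OneVsRestClassifier accuracy:", round(cross_val_score(ovr, X, y, cv=3).mean(), 3))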
