Data Analysis: Classification and Regression Models (Part 1)

I. Classification and Regression Methods
This series organizes the main classification and regression algorithms, drawing on material from books and online sources.
II. Classification Methods
The code follows the book 《人工智能:python实现》 (Artificial Intelligence: A Python Implementation), with some modifications.
1. Logistic Regression
Steps for building a logistic regression model:
1. Define candidate features x1, x2, ..., xp according to the mining goal, then screen them. With sklearn's feature_selection module, an F-test gives each feature's F value and p value; keep features with a large F and a small p. RFE (recursive feature elimination) and stability selection (SS) are alternative screening methods (see the sketch after this list).
2. Write down the regression equation: ln(p / (1 - p)) = β0 + β1·x1 + ... + βp·xp + e.
3. Estimate the regression coefficients.
4. Test the fitted model.
5. Use the model for prediction and control.
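As a concrete illustration of the feature-screening step, here is a minimal sketch using sklearn.feature_selection. SelectKBest with f_classif computes each feature's F value and p value and keeps the k features with the largest F (smallest p); RFE is shown for comparison. The iris dataset and k=2 are illustrative choices, not data from this article.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# F-test screening: a large F value (small p value) means the feature separates the classes well
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print("F values:", selector.scores_)
print("p values:", selector.pvalues_)

# RFE (recursive feature elimination) with a logistic regression estimator
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print("RFE ranking (1 = selected):", rfe.ranking_)

With the features chosen, the example below fits a logistic regression classifier on a small hand-made dataset.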
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
from utilities import visualize_classifier
# Define sample input data
X = np.array([[3.1, 7.2], [4, 6.7], [2.9, 8], [5.1, 4.5], [6, 5], [5.6, 5], [3.3, 0.4], [3.9, 0.9], [2.8, 1], [0.5, 3.4], [1, 4], [0.6, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
test=np.array([[3.5,6.7],[4.2,5.5]])
help(linear_model.LogisticRegression)
# Create the logistic regression classifier
clf = linear_model.LogisticRegression(solver='liblinear', C=1)  # C is the inverse regularization strength; a larger C weakens regularization and makes overfitting more likely
# Default signature:
# LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1,
#                    class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn',
#                    verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
# Train the classifier
clf.fit(X, y)
# Print the fitted parameters
print(clf.classes_)      # class labels
print(clf.coef_)         # coefficients of the decision function
print(clf.intercept_)    # intercepts
print(clf.n_iter_)       # number of iterations
print(clf.score(X, y))   # mean accuracy on the training data
y_pred = clf.predict(test)  # predict on new data
print(y_pred)
# Graphical output: plot the decision boundaries
def visualize_classifier(classifier, X, y):
    # Define the minimum and maximum values for X and Y
    # that will be used in the mesh grid
    min_x, max_x = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    min_y, max_y = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0
    # Define the step size to use in plotting the mesh grid
    mesh_step_size = 0.01
    # Define the mesh grid of X and Y values
    x_vals, y_vals = np.meshgrid(np.arange(min_x, max_x, mesh_step_size), np.arange(min_y, max_y, mesh_step_size))
    # Run the classifier on the mesh grid
    output = classifier.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
    # Reshape the output array
    output = output.reshape(x_vals.shape)
    # Create a plot
    plt.figure()
    # Choose a color scheme for the plot
    plt.pcolormesh(x_vals, y_vals, output, cmap=plt.cm.gray)
    # Overlay the training points on the plot
    plt.scatter(X[:, 0], X[:, 1], c=y, s=75, edgecolors='black', linewidth=1, cmap=plt.cm.Paired)
    # Specify the boundaries of the plot
    plt.xlim(x_vals.min(), x_vals.max())
    plt.ylim(y_vals.min(), y_vals.max())
    # Specify the ticks on the X and Y axes
    plt.xticks(np.arange(int(min_x), int(max_x), 1.0))
    plt.yticks(np.arange(int(min_y), int(max_y), 1.0))
    plt.show()
visualize_classifier(clf, X, y)
Randomized Logistic Regression
sklearn.linear_model.RandomizedLogisticRegression
The scikit-learn documentation describes randomized logistic regression as follows:
Randomized Logistic Regression works by subsampling the training data and fitting a L1-penalized LogisticRegression model where the penalty of a random subset of coefficients has been scaled. By performing this double randomization several times, the method assigns high scores to features that are repeatedly selected across randomizations. This is known as stability selection. In short, features selected more often are considered good features.
In other words: resample the training data many times and fit the regression model on each subsample, i.e. run the feature-selection procedure on different data subsets and feature subsets, repeat, and finally keep the features that end up with high scores. This is the stability-selection method; a feature's score is roughly the number of times it was chosen as important divided by the number of subsets in which it was tested.
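Note that RandomizedLogisticRegression was deprecated and has been removed from recent scikit-learn releases. The snippet below is therefore a minimal hand-rolled sketch of the stability-selection idea described above, not the library's own implementation: repeatedly subsample the rows, fit an L1-penalized LogisticRegression on each subsample, and record how often each feature keeps a non-zero coefficient. The helper name stability_selection, the sampling fraction, and the 0.6 threshold are illustrative choices.

import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_selection(X, y, n_rounds=100, sample_frac=0.75, C=1.0, seed=0):
    # For each feature, return the fraction of subsamples in which an
    # L1-penalized logistic regression kept a non-zero coefficient for it
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        idx = rng.choice(n_samples, size=int(sample_frac * n_samples), replace=False)
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
        clf.fit(X[idx], y[idx])
        # A feature survives this round if any class assigns it a non-zero weight
        counts += (np.abs(clf.coef_) > 1e-6).any(axis=0)
    return counts / n_rounds

# Illustrative usage: keep features selected in at least 60% of the rounds
# scores = stability_selection(X, y)
# selected_features = np.where(scores >= 0.6)[0]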
2. Naive Bayes Classification
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn import model_selection
from utilities import visualize_classifier
# Input file containing data
input_file = 'data_'
# Load data from input file
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]
############################################
# Cross validation
# Split data into training and test data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=3)  # split the data into training and test sets
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
# compute accuracy of the classifier
accuracy = 100.0 * (y_test == y_test_pred).sum() / X_test.shape[0]      # fraction of correct predictions
print("Accuracy of the new classifier =", round(accuracy, 2), "%")
# Visualize the performance of the classifier
visualize_classifier(classifier, X_test, y_test)
############################################
# Scoring functions
num_folds = 3    # 3-fold cross-validation
accuracy_values = model_selection.cross_val_score(classifier, X, y, scoring='accuracy', cv=num_folds)
print("Accuracy: " + str(round(100 * accuracy_values.mean(), 2)) + "%")
precision_values = model_selection.cross_val_score(classifier, X, y, scoring='precision_weighted', cv=num_folds)
print("Precision: " + str(round(100 * precision_values.mean(), 2)) + "%")
recall_values = model_selection.cross_val_score(classifier, X, y, scoring='recall_weighted', cv=num_folds)
print("Recall: " + str(round(100 * recall_values.mean(), 2)) + "%")
f1_values = model_selection.cross_val_score(classifier, X, y, scoring='f1_weighted', cv=num_folds)
print("F1: " + str(round(100 * f1_values.mean(), 2)) + "%")
The train/test split gives 100% accuracy, while 3-fold cross-validation gives only 99.75%. Does that mean cross-validation is a worse method? I also tried the two methods below, and both give 99.75% as well. The difference is that a train/test split evaluates the model on only part of the samples, whereas k-fold cross-validation effectively evaluates it on essentially all of the samples and averages over the k folds.
Looking further into k-fold validation, I found the following two additional variants:
(1) Stratified k-fold cross-validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
strKFold = StratifiedKFold(n_splits=3, shuffle=False)   # random_state only applies when shuffle=True
accuracy_values = cross_val_score(classifier, X, y, scoring='accuracy', cv=strKFold)
print("Accuracy: " + str(round(100 * accuracy_values.mean(), 2)) + "%")
Stratified cross-validation also gives 99.75% accuracy.
(2) Leave-one-out cross-validation
If the sample size is n, set k = n and run n-fold cross-validation, leaving out a single sample for validation each time. It is mainly used for small datasets.
from sklearn.model_selection import LeaveOneOut, cross_val_score
loout = LeaveOneOut()
scores = cross_val_score(classifier,X,y,cv=loout)
print("Accuracy: " + str(round(100*an(), 2)) + "%")
Leave-one-out cross-validation also gives 99.75% accuracy.
3. SVM (Support Vector Machines)
SVM classes in sklearn: LinearSVC, LinearSVR, SVC, NuSVC, SVR, NuSVR, OneClassSVM
from sklearn import svm
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC
from sklearn.svm import SVR
from sklearn.svm import LinearSVR
(1) SVM: one-vs-one
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsOneClassifier
from sklearn import model_selection
# Cross validation
# X, y: feature matrix and class labels, assumed to be loaded beforehand (e.g. as in the previous section)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=5)
classifier = OneVsOneClassifier(LinearSVC(random_state=0))    # one binary classifier per pair of classes
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
# Compute the F1 score of the SVM classifier
f1 = model_selection.cross_val_score(classifier, X, y, scoring='f1_weighted', cv=3)    # k-fold cross-validation
print("F1 score: " + str(round(100 * f1.mean(), 2)) + "%")
(2) SVM fitted parameters
# SVC is based on libsvm and is not suited to datasets with more than ~10K samples (high time complexity)
# Data preparation
import scipy.io as scio
data = scio.loadmat("train and test top321.mat")  # load the data
test = data['test']  # samples to classify
testclass = data['testclass']
testclass = np.ravel(testclass)
train = data['train']
trainclass = data['trainclass']
trainclass = np.ravel(trainclass)
# Fit the model and classify
clf = SVC(kernel='linear')
clf.fit(train, trainclass)
weight = clf.coef_  # weight vector of the linear SVM
print(clf.predict(test))
Pretest = clf.predict(test)  # predicted class labels
a = Pretest ^ testclass      # XOR: non-zero where the prediction differs from the true label (assumes 0/1 integer labels)
acc = (a.size-a.sum())/a.size
print("Accuracy:",acc)
print("⽀持向量指数:",clf.support_)
print("⽀持向量",clf.support_vectors_)
print("每⼀类的⽀持向量个数",clf.n_support_)
print("⽀持向量的系数:",clf.dual_coef_)
print("截距:",clf.intercept_)
(3) Multiclass problems
I don't fully understand this part yet, so I'll set it aside and come back to it later.
Approaches for solving multiclass problems with SVMs (a minimal sketch follows).
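As a placeholder until this part is revisited, here is a minimal sketch of the two standard multiclass strategies for SVMs: SVC handles several classes internally with a one-vs-one scheme, while OneVsOneClassifier and OneVsRestClassifier wrap a binary LinearSVC explicitly. The iris dataset is an illustrative stand-in, not data from this article.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

# SVC is one-vs-one internally: it trains k*(k-1)/2 binary SVMs and combines them by voting
ovo = SVC(kernel='linear')
print("SVC (one-vs-one) accuracy:", round(cross_val_score(ovo, X, y, cv=3).mean(), 3))

# Explicit one-vs-one wrapper around a linear SVM
ovo_wrap = OneVsOneClassifier(LinearSVC(max_iter=10000))
print("OneVsOneClassifier accuracy:", round(cross_val_score(ovo_wrap, X, y, cv=3).mean(), 3))

# One-vs-rest: one binary classifier per class, each trained against all other classes
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000))
print("OneVsRestClassifier accuracy:", round(cross_val_score(ovr, X, y, cv=3).mean(), 3))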
