Data Preprocessing Methods in Data Mining

In the previous article, we began a detailed journey into data mining, covering topics such as the statistical description of data, the concept of data visualization, and various data visualization techniques.

In this article we will discuss:

1. Need for Data Preprocessing
2. Data Cleaning Process
3. Data Integration Process
4. Data Reduction Process
5. Data Transformation Process

1) Need for Data Preprocessing

Data preprocessing refers to the set of techniques applied to databases to deal with noisy, missing, and inconsistent data. The preprocessing techniques used in data mining are data cleaning, data integration, data reduction, and data transformation.

The need for data preprocessing arises because real-time data, and often the data held in a database, is incomplete and inconsistent, which can lead to improper and inaccurate data mining results. To improve the quality of the data on which observation and analysis are performed, it is treated with these four preprocessing steps. The better the data, the more accurate the observations and predictions.

Fig 1: Steps of Data Preprocessing


2) Data Cleaning Process

Data in the real world is usually incomplete, inconsistent, and noisy. The data cleaning process includes procedures that fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Let us discuss the basic methods of data cleaning.

2.1. Missing Values

Assume that you are working with data such as sales and customer records and you observe that values are missing for several attributes. Analysis cannot proceed reliably on data with missing values. The following methods address this problem; let us go through them one by one.

2.1.1. Ignore the tuple:


This method is usually applied when the class label is missing (assuming the mining task involves classification). It is not effective when the percentage of missing values per attribute varies considerably.
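
Ignoring tuples can be sketched in a few lines. This is a minimal illustration, assuming records are Python dictionaries and a missing value is represented by `None`; the field names (`age`, `income`, `label`) and the sample rows are hypothetical.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "label": "buys"},
    {"age": 41, "income": None, "label": "does_not_buy"},
    {"age": 29, "income": 48000, "label": None},  # class label missing
]


def ignore_tuples_without_label(rows, label_key="label"):
    """Drop every tuple whose class label is missing."""
    return [row for row in rows if row[label_key] is not None]


kept = ignore_tuples_without_label(records)
print(len(kept))  # number of tuples that survive the filter
```

Note that the second record is kept even though its income is missing; only the class label decides whether a tuple is dropped in this sketch.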

2.1.2. Enter the missing value manually or fill it with a global constant:

When the database contains a large number of missing values, filling them in manually is time-consuming and not feasible. An alternative is to fill every missing value with the same global constant.

2.1.3. Fill the missing value with the attribute mean or the most probable value:

Filling the missing value with the attribute mean is another option. Filling it with the most probable value relies on regression, a Bayesian formulation, or a decision tree to predict that value.
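
The global-constant and attribute-mean strategies can be sketched as follows. This is an illustration only: the `incomes` column, the `"Unknown"` constant, and the function names are assumptions, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical income column; None marks a missing value.
incomes = [52000, None, 48000, 61000, None]


def fill_with_constant(values, constant="Unknown"):
    """Replace each missing value with the same global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


print(fill_with_constant(incomes))
print(fill_with_mean(incomes))
```

Predicting the most probable value would replace `attribute_mean` with the output of a fitted model, which is beyond this sketch.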

2.2. Noisy Data

Noise is a random error or variance in a measured variable. Given a numerical attribute, you can smooth the data to reduce the noise. Some data smoothing techniques are as follows.

2.2.1. Binning:


Binning sorts the data, partitions it into bins, and then smooths each value using the other values in the same bin.

1. Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.

2. Smoothing by bin median: each value in a bin is replaced by the median value of the bin.

3. Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and every value in the bin is then replaced by the closest boundary value.

Let us understand this with an example.

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partitioned into equal-frequency bins of four values each:

- Bin 1: 4, 8, 9, 15

- Bin 2: 21, 21, 24, 25

- Bin 3: 26, 28, 29, 34

Smoothing by bin means (rounded to the nearest dollar):

- Bin 1: 9, 9, 9, 9

- Bin 2: 23, 23, 23, 23

- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15

- Bin 2: 21, 21, 25, 25

- Bin 3: 26, 26, 26, 34

Smoothing by bin median (taking the upper of the two middle values in each bin):

- Bin 1: 9, 9, 9, 9

- Bin 2: 24, 24, 24, 24

- Bin 3: 29, 29, 29, 29
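
The worked example can be reproduced with a short sketch. It assumes equal-frequency bins of four values, bin means rounded to whole dollars, and the upper middle value as the median of an even-sized bin; the function names are illustrative, and ties in boundary smoothing are sent to the lower boundary by assumption.

```python
from statistics import mean, median_high

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
BIN_SIZE = 4  # equal-frequency bins of four values each

bins = [prices[i:i + BIN_SIZE] for i in range(0, len(prices), BIN_SIZE)]


def smooth_by_means(bin_values):
    # Every value becomes the bin mean, rounded to a whole dollar.
    m = round(mean(bin_values))
    return [m] * len(bin_values)


def smooth_by_median(bin_values):
    # median_high picks the upper of the two middle values in an even-sized bin.
    m = median_high(bin_values)
    return [m] * len(bin_values)


def smooth_by_boundaries(bin_values):
    # Every value moves to whichever bin boundary (min or max) is closer.
    low, high = min(bin_values), max(bin_values)
    return [low if v - low <= high - v else high for v in bin_values]


for name, smooth in [("means", smooth_by_means),
                     ("median", smooth_by_median),
                     ("boundaries", smooth_by_boundaries)]:
    print(name, [smooth(b) for b in bins])
```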

2.2.2. Regression:


Regression smooths data by fitting it to a function that is then used to predict values. Linear regression finds the best-fitting straight line, so that the value of y can be predicted from a given value of x, whereas multiple linear regression predicts the value of a variable from the given values of two or more other variables.
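
Smoothing with simple linear regression can be sketched as a least-squares fit. The sample measurements and the `fit_line` helper are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
# The fitted values replace the noisy observations.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
print(smoothed)
```

Multiple linear regression follows the same idea with more than one predictor variable, and is usually solved with a numerical library rather than by hand.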

3) Data Integration Process

Data integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

3.1. Approaches

There are two major approaches to data integration: the "tight coupling approach" and the "loose coupling approach".

Tight Coupling:

Here, a data warehouse is treated as an information retrieval component. In this coupling, data from different sources is combined into a single physical location through the process of ETL (Extraction, Transformation, and Loading).
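
A minimal ETL sketch, under stated assumptions: the two sources, their field names, and the `customer` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Two hypothetical sources with different schemas.
crm_rows = [{"customer_id": 1, "name": "Ana"}, {"customer_id": 2, "name": "Ben"}]
billing_rows = [("3", "CARLA"), ("4", "DEV")]  # (cust_number, NAME) as text


def extract():
    """Read the raw rows from every source."""
    return crm_rows, billing_rows


def transform(crm, billing):
    """Bring both sources to one schema: (integer id, capitalised name)."""
    unified = [(row["customer_id"], row["name"].title()) for row in crm]
    unified += [(int(number), name.title()) for number, name in billing]
    return unified


def load(rows):
    """Store the unified rows in a single physical location."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    warehouse.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    warehouse.commit()
    return warehouse


warehouse = load(transform(*extract()))
print(warehouse.execute("SELECT id, name FROM customer ORDER BY id").fetchall())
```

After loading, queries run against the warehouse alone and no longer touch the original sources.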

Loose Coupling:

Here, an interface is provided that takes the query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result. The data remains only in the actual source databases.
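
The loose coupling idea can be sketched as a small mediator. Everything here is hypothetical: the two in-memory "source databases", their attribute names, and the `query_customers_in` function are assumptions made for illustration.

```python
# Two hypothetical source databases; the data never leaves them.
SOURCE_A = [{"customer_id": 1, "city": "Pune"}, {"customer_id": 2, "city": "Delhi"}]
SOURCE_B = [{"cust_number": 7, "town": "Pune"}]

# Per-source mapping from the unified attribute names to the local ones.
MAPPINGS = [
    (SOURCE_A, {"id": "customer_id", "city": "city"}),
    (SOURCE_B, {"id": "cust_number", "city": "town"}),
]


def query_customers_in(city):
    """Translate the query for each source, run it there, and merge the answers."""
    results = []
    for source, mapping in MAPPINGS:
        local_city = mapping["city"]  # rewrite the query in the source's terms
        for row in source:
            if row[local_city] == city:
                results.append({"id": row[mapping["id"]], "city": row[local_city]})
    return results


print(query_customers_in("Pune"))
```

Unlike the tight coupling approach, nothing is copied into a central store; only the query results are combined.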

3.2. Issues in Data Integration

There are several issues to consider during data integration: schema integration, redundancy, and the detection and resolution of data value conflicts. These are explained briefly below.

3.2.1. Schema Integration:

Schema integration combines metadata from different sources.

Matching equivalent real-world entities from multiple sources is known as the entity identification problem.

For example, how can the data analyst or the computer be sure that customer id in one database and customer number in another refer to the same attribute?
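
One way metadata can help with entity identification is sketched below. This is a simple heuristic, not a complete solution: the metadata fields (`type`, `min`, `max`) and the attribute descriptions are assumptions made for illustration.

```python
# Hypothetical metadata for attributes taken from different sources.
customer_id = {"name": "customer_id", "type": "int", "min": 1, "max": 99999}
cust_number = {"name": "cust_number", "type": "int", "min": 1, "max": 99999}
order_total = {"name": "order_total", "type": "float", "min": 0.0, "max": 1e6}


def may_be_same_attribute(meta_a, meta_b):
    """Heuristic check: same data type and same permitted value range."""
    return (
        meta_a["type"] == meta_b["type"]
        and meta_a["min"] == meta_b["min"]
        and meta_a["max"] == meta_b["max"]
    )


print(may_be_same_attribute(customer_id, cust_number))
print(may_be_same_attribute(customer_id, order_total))
```

A match only flags candidate pairs; an analyst would still need to confirm that the two attributes mean the same thing.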

3.2.2. Redundancy:


An attribute may be redundant if it can be derived from another attribute or set of attributes.

Inconsistencies in attribute naming can also cause redundancies in the resulting data set.

Some redundancies can be detected by correlation analysis.
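
Correlation analysis for numeric attributes can be sketched with the Pearson correlation coefficient. The attribute values below are hypothetical, and the threshold of 0.95 is an assumed cut-off chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical attributes: annual revenue is derived from monthly revenue.
monthly_revenue = [10.0, 12.5, 9.0, 15.0, 11.0]
annual_revenue = [12 * m for m in monthly_revenue]
store_rating = [3, 5, 4, 2, 4]

REDUNDANCY_THRESHOLD = 0.95  # assumed cut-off, not a standard value

print(abs(pearson(monthly_revenue, annual_revenue)) > REDUNDANCY_THRESHOLD)
print(abs(pearson(monthly_revenue, store_rating)) > REDUNDANCY_THRESHOLD)
```

A coefficient close to +1 or -1 suggests that one attribute carries little information beyond the other, so one of the two is a candidate for removal.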

3.2.3. Detection and resolution of data value conflicts:

This is the third important issue in data integration. For the same real-world entity, attribute values from different sources may differ. An attribute in one system may also be recorded at a lower level of abstraction than the "same" attribute in another.

4) Data Reduction Process

Data warehouses usually store large amounts of data, so data mining operations take a long time to process it. Data reduction techniques reduce the size of the dataset with little or no effect on the analytical result. The following methods are commonly used for data reduction.

1. Data cube aggregation

Aggregation operations are applied to the data to construct a data cube, which helps in analyzing business trends and performance.

2. Attribute subset selection

Redundant or irrelevant attributes (dimensions) are identified and removed.

3. Dimensionality reduction

Encoding techniques are used to reduce the size of the data set.

4. Numerosity reduction

The data is replaced by a smaller representation of the data.

5. Discretization and concept hierarchy generation

Raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is useful for the automatic generation of concept hierarchies.
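
Two of the methods above, data cube aggregation and discretization, can be sketched briefly. The sales figures, the age thresholds, and the concept names (`youth`, `middle_aged`, `senior`) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical quarterly sales facts: (year, quarter, amount).
quarterly_sales = [
    (2022, "Q1", 224), (2022, "Q2", 408), (2022, "Q3", 350), (2022, "Q4", 586),
    (2023, "Q1", 301), (2023, "Q2", 412), (2023, "Q3", 390), (2023, "Q4", 610),
]


def aggregate_by_year(facts):
    """Data cube aggregation: roll quarterly amounts up to annual totals."""
    totals = defaultdict(int)
    for year, _quarter, amount in facts:
        totals[year] += amount
    return dict(totals)


def age_to_concept(age):
    """Discretization: replace a raw age with a higher-level concept."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle_aged"
    return "senior"


annual_sales = aggregate_by_year(quarterly_sales)
age_concepts = [age_to_concept(a) for a in [23, 45, 67, 31]]
print(annual_sales)
print(age_concepts)
```

In both cases the reduced data is smaller than the original: eight quarterly facts become two annual totals, and an open range of ages becomes three labels.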

5) Data Transformation Process

In the data transformation process, data is transformed from one format into another that is more appropriate for data mining.
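
A format conversion of this kind can be sketched as follows. The raw field formats (day/month/year dates, prices written with a dollar sign and thousands separators) and the field names are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical raw records as they might arrive from a flat file.
raw_rows = [
    {"order_date": "24/09/2024", "price": "$1,250.50"},
    {"order_date": "03/01/2023", "price": "$89.00"},
]


def transform_row(row):
    """Convert text fields into typed values in one consistent format."""
    date = datetime.strptime(row["order_date"], "%d/%m/%Y").date()
    price = float(row["price"].replace("$", "").replace(",", ""))
    return {"order_date": date.isoformat(), "price": price}


clean_rows = [transform_row(r) for r in raw_rows]
print(clean_rows)
```

Dates end up in ISO 8601 form and prices become plain numbers, which makes the records easier to sort, compare, and aggregate.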

Published: 2024-09-24 03:24:56

Article link: https://www.17tex.com/xueshu/68579.html
