
Why Do Some Data Mining Models Give Erroneous Results?

Data Mining Models

Social networks and local area networks connected to the Internet generate and store many different types of data. To prepare optimal data mining models, we need to analyze the basic types and basic features of the data set.

The first step in this analysis is to classify the data using computer systems. The data that typically serves as a source for the data mining process can be classified into structured, semi-structured, and unstructured data.

Most of the databases used by businesses contain structured data consisting of numeric and alphanumeric fields, while scientific databases may contain all three types.

Examples of semi-structured data are electronic images of business documents, medical reports, management report summaries, and manuals. Most web documents also fall into this category. Unstructured data includes, for example, video recorded by CCTV cameras in a department store.

The declining cost of network video surveillance equipment has led many businesses to deploy these cameras in their stores, which is why the volume of unstructured data captured by video cameras keeps increasing.

In general, extracting information from such data requires more work and more extensive processing.

Structured data is often referred to as traditional data, while semi-structured and unstructured data are known as non-traditional data (also called multimedia data). Most current data mining methods and business tools have been developed to work with traditional data.

However, data mining tools for non-traditional data, along with techniques for converting such data into structured formats, are developing rapidly.

The standard model of structured data used for data mining has a specific set of features. In the data mining world, the potential measurements are known as attributes, and in most cases they are measured in the same way.

Typically, structured data is represented in tabular form as a single relation (a term used in relational databases). Columns are the attributes of the objects stored in the table, and rows are the values of these attributes for specific entities. A simple graphical representation of a data set and its attributes is shown in the figure below.
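To make the tabular model concrete, here is a minimal sketch in Python with pandas (the column names and values are hypothetical), where each row is one entity and each column is one attribute:

    import pandas as pd

    # Hypothetical customer data set: rows are entities (samples),
    # columns are attributes measured for each entity.
    customers = pd.DataFrame({
        "customer_id": [101, 102, 103],          # identifier
        "age": [34, 51, 27],                     # numeric attribute
        "city": ["Berlin", "Lyon", "Porto"],     # alphabetical (categorical) attribute
        "monthly_spend": [220.5, 180.0, 95.75],  # numeric attribute
    })

    print(customers)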

In the data mining literature, the terms samples or cases are typically used to describe the rows.

A record in structured data is described by different attributes (also called features or variables). However, it is important to note that not all data mining models handle attribute types in the same way, and each type should be used in the right place.

There are several ways to describe features. One common way to look at a feature, more commonly referred to as a variable, is to ask whether it is independent or dependent, that is, whether its values depend on the values of the other variables in the data set.

This is a model-based method for classifying variables.
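In code, this classification amounts to separating the input columns from the output column. A minimal sketch, assuming a hypothetical data set in which churned depends on the other two columns:

    import pandas as pd

    # Hypothetical data set with two independent variables and one dependent variable.
    df = pd.DataFrame({
        "age": [34, 51, 27],
        "monthly_spend": [220.5, 180.0, 95.75],
        "churned": [0, 1, 0],  # its values depend on the other columns
    })

    X = df[["age", "monthly_spend"]]  # independent (input) variables
    y = df["churned"]                 # dependent (output) variable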

In addition to the independent input variables X and the dependent output Y, a real system often has unobserved inputs Z.

An important point to note is that some additional variables affect system behavior, yet their values are not available in the data set during the modeling process. There are several reasons for this problem, including high complexity, the high cost of measuring certain features, and a lack of knowledge or deep understanding of the importance of some factors and their impact on the model.

Attributes of this kind are known as unobserved variables, and they are a main factor in producing models that give erroneous results.

Unknown features are also sometimes described as missing data.
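As an illustration (a purely synthetic sketch, not drawn from any real system), the snippet below simulates an output Y that depends on an observed input X and an unobserved input Z. A model fitted on X alone absorbs the effect of Z into its error, so its predictions remain erroneous no matter how much data it sees:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    x = rng.normal(size=n)  # observed input X
    z = rng.normal(size=n)  # unobserved input Z (absent from the data set)
    y = 2.0 * x + 3.0 * z   # the real system depends on both inputs

    # Fit a least-squares line using only the observed input X.
    slope, intercept = np.polyfit(x, y, 1)
    predictions = slope * x + intercept

    # The residual variance stays near var(3*z) = 9: the influence of Z
    # cannot be explained by X, so the model keeps making large errors.
    print(f"slope={slope:.2f}, residual variance={np.var(y - predictions):.2f}")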

Today’s computers and software tools can process data sets of millions of instances and hundreds of features. Large data sets, including those that combine mixed data types, create an ideal environment for applying data mining techniques.

When a large amount of data is stored on a computer, it is impossible to move straight on to data mining techniques, because data quality issues must be resolved first.

In addition, manual quality analysis is not feasible at that scale. Therefore, data quality analysis must be prepared in the early stages of the data mining process.

Normally, this process should be done in the data pre-processing stage.

Data quality analysis profoundly affects the picture of the system and the model it implicitly describes. Using existing data mining techniques, it is difficult to detect major changes in an organization from low-quality information. In addition, making new discoveries in scientific data of poor quality is almost impossible.

There are various quality indicators associated with data that you should pay attention to in the data mining pre-processing stage.

Some of them are as follows:

  1. Data must be accurate. The analyst should check that names are spelled correctly, that codes fall within the defined range, that values are complete, and so on.
  2. Data must be stored in the appropriate data types. The analyst must ensure that numeric values are not represented as characters, that integer values are stored as integers rather than reals, and so on.
  3. Data must have integrity. Updates must not be lost, even when different users make changes to the data. If such a mechanism is not available by default through the database management system (DBMS), the data must be backed up regularly so that it can be restored if necessary.
  4. Data must be consistent. The form and the content must be the same after merging large data sets from different sources.
  5. Data should not be redundant. In practice, redundant data must be minimized, duplicates controlled, or duplicate records removed.
  6. Data must be timely. The temporal component of the data must be identified either explicitly in the data itself or implicitly from the way the data is ordered.
  7. Data must be well understood. Naming standards are a necessary prerequisite, but they alone are not enough to understand the data. The user should also understand the domain the data describes.
  8. The data set must be complete. The rate of missing data should be kept to a minimum, since missing data can reduce the quality of the model. However, some data mining techniques handle data set analysis well even with missing values.
The important point is that low-quality data must be fixed, so it is essential to follow best practices, especially when you are initially processing data; a sketch of such checks follows below. These processes are often performed using data warehousing technology.
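A minimal sketch of automated checks along these lines, using pandas (the column names, the valid code range, and the sample values are all hypothetical):

    import pandas as pd

    # Hypothetical "dirty" data set illustrating several of the indicators above.
    df = pd.DataFrame({
        "code": [101, 205, 1500, 205],          # 1500 is outside the valid range 100-999 (item 1)
        "age": ["34", "51", "27", "51"],        # numeric values stored as strings (item 2)
        "income": [30000, 72000, None, 72000],  # a missing value (item 8)
    })

    report = {
        # Accuracy: codes must fall within a defined range (item 1).
        "codes_out_of_range": int(((df["code"] < 100) | (df["code"] > 999)).sum()),
        # Appropriate types: numeric values must not be stored as characters (item 2).
        "string_typed_columns": [c for c in ("age", "income") if df[c].dtype == object],
        # Redundancy: duplicate records should be controlled or removed (item 5).
        "duplicate_rows": int(df.duplicated().sum()),
        # Completeness: the rate of missing values should be kept minimal (item 8).
        "missing_values": int(df.isna().sum().sum()),
    }
    print(report)

    # Typical fixes applied in the pre-processing stage.
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df = df.drop_duplicates()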

 
