Data cleaning, like the game of Rock, Paper, Scissors. Is this a game with or without an end? Part 1. Theoretical

1. The source data


Data cleaning is one of the challenges of data analysis. This material describes the findings and decisions that arose while solving the practical problem of analyzing the database used in determining cadastral value. The source is "REPORT No. 01 / -2019 on the results of the state cadastral valuation of all types of real estate (except for land) in the Khanty-Mansiysk Autonomous Okrug - Ugra".

The file under consideration, "Comparative model total.ods", comes from "Appendix B. Results of determining the cadastral value. 5. Information on the method for determining the cadastral value. 5.1 Comparative approach".

Table 1. Statistical indicators of the dataset in the file "Comparative model total.ods"
Total number of fields: 44
Total number of records: 365,490
Total number of characters: 101,714,693
Average number of characters per record: 278.297
Standard deviation of characters per record: 15.510
Minimum number of characters in a record: 198
Maximum number of characters in a record: 363
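
For illustration, these summary indicators can be reproduced with a short script. A minimal sketch, assuming the table has been exported from the .ods file to a semicolon-delimited CSV (the file name and delimiter are assumptions):

```python
import csv
import statistics

# Hypothetical export of "Comparative model total.ods" to CSV
PATH = "comparative_model_total.csv"

with open(PATH, encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter=";")
    header = next(reader)                              # field names
    # characters per record, with the record re-joined by the delimiter
    lengths = [len(";".join(row)) for row in reader]

print("Total number of fields:", len(header))
print("Total number of records:", len(lengths))
print("Total number of characters:", sum(lengths))
print("Average characters per record:", round(statistics.mean(lengths), 3))
print("Std. deviation of characters:", round(statistics.stdev(lengths), 3))
print("Minimum characters in a record:", min(lengths))
print("Maximum characters in a record:", max(lengths))
```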

2. Introductory part. Basic standards


While analyzing this database, the task arose of specifying requirements for the degree of cleaning, since, as everyone understands, this database creates legal and economic consequences for its users. In the process it turned out that no specific requirements for the degree of cleaning of big data have been formulated. Analyzing the legal norms on this issue, I came to the conclusion that they are all shaped by capabilities: a certain task appears, sources of information are assembled for that task, then a dataset is formed and, on the basis of the created dataset, tools for solving the task. The solutions obtained become reference points in the choice of alternatives. This is presented in Figure 1.



Since it is preferable to rely on proven technologies when defining standards, I took as the basis for the requirements the analysis criteria set out in the MHRA GxP Data Integrity Definitions and Guidance for Industry, because I considered this document the most holistic on the issue. In particular, it states: "It should be noted that data integrity requirements apply equally to manual (paper) and electronic data." This wording ties in quite specifically with the concept of "written evidence" in Article 71 of the Code of Civil Procedure, Article 70 of the CAS, Article 75 of the AIC, and "written form" in Article 84 of the GIC.

Figure 2 presents a diagram of how approaches to types of information are formed in jurisprudence.


Fig. 2. Source here.

Figure 3 shows the mechanism of Figure 1 applied to the tasks of the above Guidance. When comparing them, it is easy to see that the approaches used to meet information-integrity requirements in modern information-system standards are significantly narrower than the legal concept of information.


Fig. 3

In the Guidance, the tie to the technical side, that is, to data processing and storage capabilities, is well illustrated by a quote from chapter 18.2, "Relational database": "This file structure is inherently more secure, as the data is held in a large file format which preserves the relationship between data and metadata."

There is nothing abnormal in this approach of starting from existing technical capabilities; in itself it is a natural process, since the expansion of concepts proceeds from the most-studied activity, database design. On the other hand, legal norms appear that make no allowance for the technical capabilities of existing systems, for example the GDPR (General Data Protection Regulation).


Fig. 4. Funnel of technical capabilities (Source).

In these respects it becomes clear that the original dataset (Fig. 1) will, first of all, have to be preserved and, secondly, serve as a basis for extracting additional information. An example: traffic-enforcement cameras are widespread, and the processing systems filter out violators, but the same information can be offered to other consumers, for example as marketing monitoring of the structure of the customer flow to a shopping center. This is a source of additional added value when using big data. It is entirely conceivable that datasets being assembled now will someday have value by a mechanism similar to how rare books from the 1700s are valuable today. After all, historical datasets are unique and unlikely to be repeated in the future.

3. Introductory part. Evaluation Criteria


During processing, the following classification of errors was developed.

1. Error class (based on GOST R 8.736-2011): a) systematic errors; b) random errors; c) gross errors.

2. By multiplicity: a) mono-distortion; b) multi-distortion.

3. By criticality of consequences: a) critical; b) non-critical.

4. By the source of occurrence:

a) Technical: errors arising during equipment operation. A fairly common class for IoT systems and for systems strongly affected by the quality of communication and hardware.

b) Operator: errors in a wide range, from typos made by the operator during data entry to errors in the technical specification for the database design.

c) User: errors made by users, in the entire range from "forgot to switch the keyboard layout" to mistaking meters for feet.

5. Singled out as a separate class (a sketch of checks for these is given after the list):

a) the "separator problem", that is, a space or ":" (in our case) being duplicated;
b) words run together;
c) a missing space after service characters;
d) symmetric paired characters: (), "", "…".
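
Below is a minimal sketch of how this "separator" class could be flagged automatically; the regular expressions and the sample record are illustrative assumptions, not the actual rules applied to the report.

```python
import re

# Illustrative patterns for the "separator" error class (item 5)
CHECKS = {
    "duplicated separator":     re.compile(r"( {2,}|:{2,})"),    # doubled space or ":"
    "words run together":       re.compile(r"[a-zа-я][A-ZА-Я]"), # lowercase glued to uppercase
    "no space after separator": re.compile(r"[:;,]\S"),
}

def unbalanced_pairs(text: str) -> bool:
    """True if () or "" do not occur in matching counts."""
    return text.count("(") != text.count(")") or text.count('"') % 2 != 0

def find_separator_errors(record: str) -> list[str]:
    found = [name for name, rx in CHECKS.items() if rx.search(record)]
    if unbalanced_pairs(record):
        found.append("unbalanced paired characters")
    return found

print(find_separator_errors('Nonresidential building,area:120  sq.m (brick'))
# ['duplicated separator', 'no space after separator', 'unbalanced paired characters']
```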

Together with the systematization of database errors presented in Figure 5, this forms a reasonably effective coordinate system for locating errors and developing a data-cleaning algorithm for this example.


Fig. 5. Typical errors corresponding to the structural units of the database (Source: Oreshkov VI, Paklin NB “Key concepts of data consolidation” ).

Accuracy, Domain Integrity, Data Type, Consistency, Redundancy, Completeness, Duplication, Conformance to Business Rules, Structural Definiteness, Data Anomaly, Clarity, Timely, Usefulness, Adherence to Data Integrity Rules. (Page 334, Data Warehousing Fundamentals for IT Professionals / Paulraj Ponniah, 2nd ed.)

The English formulations from the source are quoted below.

Accuracy. The value stored in the system for a data element is the right value for that occurrence of the data element. If you have a customer name and an address stored in a record, then the address is the correct address for the customer with that name. If you find the quantity ordered as 1000 units in the record for order number 12345678, then that quantity is the accurate quantity for that order.

Domain Integrity. The data value of an attribute falls in the range of allowable, defined values. The common example is the allowable values being "male" and "female" for the gender data element.

Data Type. Value for a data attribute is actually stored as the data type defined for that attribute. When the data type of the store name field is defined as “text,” all instances of that field contain the store name shown in textual format and not numeric codes.

Consistency. The form and content of a data field is the same across multiple source systems. If the product code for product ABC in one system is 1234, then the code for this product is 1234 in every source system.

Redundancy. The same data must not be stored in more than one place in a system. If, for reasons of efficiency, a data element is intentionally stored in more than one place in a system, then the redundancy must be clearly identified and verified.

Completeness. There are no missing values for a given attribute in the system. For example, in a customer file, there must be a valid value for the "state" field for every customer. In the file for order details, every detail record for an order must be completely filled.

Duplication. Duplication of records in a system is completely resolved. If the product file is known to have duplicate records, then all the duplicate records for each product are identified and a cross-reference created.

Conformance to Business Rules. The values of each data item adhere to prescribed business rules. In an auction system, the hammer or sale price cannot be less than the reserve price. In a bank loan system, the loan balance must always be positive or zero.

Structural Definiteness. Wherever a data item can naturally be structured into individual components, the item must contain this well-defined structure. For example, an individual's name naturally divides into first name, middle initial, and last name. Values for names of individuals must be stored as first name, middle initial, and last name. This characteristic of data quality simplifies enforcement of standards and reduces missing values.

Data Anomaly. A field must be used only for the purpose for which it is defined. If the field Address-3 is defined for any possible third line of address for long addresses, then this field must be used only for recording the third line of address. It must not be used for entering a phone or fax number for the customer.

Clarity. A data element may possess all the other characteristics of quality data but if the users do not understand its meaning clearly, then the data element is of no value to the users. Proper naming conventions help to make the data elements well understood by the users.

Timely. The users determine the timeliness of the data. If the users expect customer dimension data not to be older than one day, the changes to customer data in the source systems must be applied to the data warehouse daily.

Usefulness. Every data element in the data warehouse must satisfy some requirements of the collection of users. A data element may be accurate and of high quality, but if it is of no value to the users, then it is totally unnecessary for that data element to be in the data warehouse.

Adherence to Data Integrity Rules. The data stored in the relational databases of the source systems must adhere to entity integrity and referential integrity rules. Any table that permits null as the primary key does not have entity integrity. Referential integrity forces the establishment of the parent-child relationships correctly. In a customer-to-order relationship, referential integrity ensures the existence of a customer for every order in the database.
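Several of these characteristics translate directly into automated checks. A minimal sketch using pandas, with hypothetical column names and allowed values, covering completeness, domain integrity and duplication only:

```python
import pandas as pd

# Hypothetical fragment of such a dataset
df = pd.DataFrame({
    "cadastral_number": ["86:01:1", "86:01:2", "86:01:2", "86:01:3"],
    "object_type":      ["building", "room", "room", "garage?"],
    "area_sq_m":        [120.5, 35.0, 35.0, None],
})

ALLOWED_TYPES = {"building", "room", "construction"}   # assumed domain for object_type

report = {
    # Completeness: count of missing values across all fields
    "missing_values":    int(df.isna().sum().sum()),
    # Domain integrity: values outside the allowed domain
    "bad_object_types":  int((~df["object_type"].isin(ALLOWED_TYPES)).sum()),
    # Duplication: fully repeated records
    "duplicate_records": int(df.duplicated().sum()),
}
print(report)
# {'missing_values': 1, 'bad_object_types': 1, 'duplicate_records': 1}
```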

4. The quality of data cleaning


The quality of data cleaning is a rather problematic question in big data. Answering the question of what degree of data cleansing a given task requires is fundamental for every data analyst. In most current tasks each analyst decides this for himself, and it is unlikely that anyone from the outside can evaluate this aspect of his decision. But for the task at hand this question was extremely important, since the reliability of legal data should tend to unity.

I looked at software-testing technologies for determining reliability. Today there are more than 200 such models. Many of them use a request-service model:


Fig. 6

My reasoning was as follows: "If a found error is an event analogous to the failure event in such a model, then what is the analogue of the parameter t?" I built the following model. Suppose the time a tester needs to check one record is 1 minute (for the database in question); then finding all the errors would take 365,490 minutes, which is approximately 3 years and 3 months of working time. As we understand, this is not a small amount of work, and the cost of checking the database would be unbearable for its compiler.

Here the economic concept of cost appears, and after some analysis I concluded that it is a fairly effective tool. Recall the economic law: "The volume of production (in units) at which a company's profit is maximized lies at the point where the marginal cost of producing one more unit equals the price the company can receive for that unit." Relying on the postulate that finding each subsequent error requires checking more and more records, cost is exactly such a factor. That is, the postulate adopted in testing models makes physical sense in the following regularity: if finding the i-th error required checking n records, then finding the next, (i+1)-th, error will require checking m records, with n < m.

In testing models this postulate is formulated mainly as the requirement that found errors be recorded but not corrected, so that the software is tested in its natural state, i.e. the failure flow is uniform. Accordingly, for our case, checking the records can show two variants of uniformity:

  1. ;
  2. .

To determine the critical value, I turned to the concept of economic feasibility, which in this case, using the concept of social cost, can be formulated as follows: "The cost of correcting an error should be borne by the economic agent who can do it at the lowest cost." We have one agent: a tester who spends 1 minute checking one record. In monetary terms, with earnings of 6,000 rubles/day, this amounts to about 12.2 rubles per record (at today's rates). It remains to determine the second side of the equilibrium in this economic law. I reasoned as follows. An existing error will require the person it concerns, that is, the owner of the property, to spend effort to correct it. Suppose this takes 1 day of actions (submit an application, receive the corrected document). Then, from the public point of view, his cost equals the average salary per day. The average accrued salary in the Khanty-Mansi Autonomous Okrug ("Results of the socio-economic development of the Khanty-Mansiysk Autonomous Okrug - Ugra for January-September 2019") is 73,285 rubles, or 3,053.542 rubles/day. Accordingly, we obtain a critical value of:
3,053.542 / 12.2 ≈ 250.4 units.

This means that, from the public point of view, if the tester checked 251 records and found one error, this is equivalent to the user fixing the error on his own. Accordingly, if the tester needs the time equivalent of checking 252 records to find the next error, then it is better to shift the cost of correction onto the user.

A simplified approach is presented here: from the public point of view one should account for all the added value generated by each specialist, that is, costs including taxes and social contributions, but the model is clear. A consequence of this relationship is the following requirement for specialists: an IT specialist should earn more than the national average. If his salary is below the average salary of the potential database users, then he must check the entire database by hand himself.

Applying the described criterion yields the first requirement for the quality of the database:
I (tr). The share of critical errors should not exceed 1/250.4 ≈ 0.3994%, slightly less than in industrial gold refining. In absolute terms, no more than 1,459 records with errors.
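
For clarity, the same arithmetic in code; nothing here goes beyond the figures already derived above:

```python
# Requirement I (tr) restated as arithmetic, using the figures derived above
records = 365_490                    # total records in the database (Table 1)
critical_records_per_error = 250.4   # records a tester may check per error found

critical_share = 1 / critical_records_per_error     # admissible share of critical errors
max_error_records = int(records * critical_share)   # admissible number of erroneous records

print(f"{critical_share:.4%}", max_error_records)   # 0.3994% 1459
```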

An economic digression.

In fact, by allowing this number of records with errors, the company accepts economic losses in the amount of:

1,459 * 3,053.542 = 4,455,118 rubles.

This amount is determined by the fact that the company has no tools to reduce these costs. It follows that if someone develops a technology that makes it possible to reduce the number of records with errors to, say, 259, this allows society to save:
1,200 * 3,053.542 = 3,664,250 rubles.

But at the same time, for his talent and work he may ask, say, 1 million rubles.
Then social costs are reduced by:

3,664,250 - 1,000,000 = 2,664,250 rubles.

In essence, this effect is the added value from using big data technologies.

But it should be borne in mind that this is a social effect, while the owner of the database is the municipal authority, whose income from the use of the property recorded in this database, at a rate of 0.3%, is 2.778 billion rubles/year. These costs (4,455,118 rubles) do not bother it much, since they are transferred to the property owners. In this respect, a developer of more thorough cleaning technologies for big data will have to demonstrate the ability to convince the owner of the database, and that takes considerable talent.

In this example, the error-estimation algorithm was chosen on the basis of the Schumann model [2] of software verification in reliability testing, because of its prevalence on the net and the possibility of obtaining the necessary statistical indicators. The methodology is taken from Monakhov Yu. M., "Functional stability of information systems" [1]; see Figs. 7-9 under the spoiler.

Fig. 7-9. Schumann model methodology






The second part of this material presents an example of data cleaning in which the results of applying the Schumann model are obtained.
The main results are:
Estimated number of errors N = 3,167 pcs.
Parameter C, lambda and the reliability function:


Fig.17

In essence, lambda is the actual intensity with which errors are detected at each stage. In the second part, the estimate of this indicator was 42.4 errors per hour, which is quite comparable with the Schumann figure. Above, it was determined that the rate at which the developer detects errors should be no lower than 1 error per 250.4 records, checking 1 record per minute. Hence the critical value of lambda for the Schumann model:

60 / 250.4 = 0.239617.

That is, error-locating procedures must be continued until lambda drops from the available 38.964 to 0.239617,

or until the indicator N (the potential number of errors) minus n (the number of corrected errors) drops below the threshold we adopted, 1,459 records.
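
To make the stopping criterion explicit, here is a minimal sketch that assumes the Schumann-type relation lambda = C * (N - n), i.e. detection intensity proportional to the number of remaining errors, and that the quoted 38.964 errors/hour refers to the state before corrections; both assumptions are mine, not a quotation from the report.

```python
# Sketch of the stopping rule under the assumption lambda = C * (N - n)
N = 3167                  # estimated total number of errors (part 2)
LAMBDA_0 = 38.964         # current detection intensity, errors/hour (part 2)
LAMBDA_CRIT = 60 / 250.4  # ~0.2396 errors/hour, the economic threshold above
N_CRIT = 1459             # admissible number of uncorrected erroneous records

C = LAMBDA_0 / N          # proportionality constant, assuming no errors corrected yet

def may_stop(n_corrected: int) -> bool:
    """True when further error locating is no longer economically justified."""
    remaining = N - n_corrected
    return C * remaining <= LAMBDA_CRIT or remaining <= N_CRIT

# First point at which cleaning may stop under these assumptions:
n = next(k for k in range(N + 1) if may_stop(k))
print(n, N - n)           # 1708 errors corrected, 1459 may remain
```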

Literature


  1. Monakhov, Yu. M. Functional Stability of Information Systems. In 3 parts. Part 1: Software Reliability: a textbook. Vladimir: Vladimir State University Publishing, 2011. 60 p. ISBN 978-5-9984-0189-3.
  2. Martin L. Shooman, "Probabilistic Models for Software Reliability Prediction."
  3. Paulraj Ponniah. Data Warehousing Fundamentals for IT Professionals. 2nd ed.

Part two. Theoretical
