Data cleaning is like the game “Rock, Paper, Scissors”. Is it a game with or without a finish? Part 2: Practical

As described in the first part, this publication is based on the dataset of the results of the cadastral valuation of real estate in the Khanty-Mansi Autonomous Area.

The practical part is presented step by step. All cleaning was performed in Excel, since it is the most common tool and the operations described can be repeated by most specialists who know it. Excel is also quite well suited for hand-to-hand work with the data.

At the zero stage, I timed opening and saving the file: it is about 100 MB in size, so when such operations number in the tens and hundreds, they take a significant amount of time.
Opening: 30 seconds on average.
Saving: 22 seconds.

The first stage begins with determining the statistical indicators of the dataset.

Table 1. Statistical indicators of the dataset



Technology 2.1.

We create an auxiliary field; in my workbook it is column AY. For each record, we enter a formula of the form "=LEN(F365502)+LEN(G365502)+ ... +LEN(AW365502)" (LEN is the English name of the Russian ДЛСТР function).
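The same auxiliary indicator can be computed outside Excel as a cross-check. Below is a minimal Python/pandas sketch; the file name, separator and encoding are assumptions, not part of the original material.

    # Minimal sketch: per-record total character count, the pandas analogue
    # of =LEN(F...)+LEN(G...)+...+LEN(AW...). File name and encoding assumed.
    import pandas as pd

    df = pd.read_csv("dataset.csv", sep=";", dtype=str, encoding="cp1251")

    # Sum the lengths of all field values in each record (empty cells count as 0).
    df["AY_total_chars"] = df.fillna("").apply(
        lambda row: sum(len(value) for value in row), axis=1
    )

    print(df["AY_total_chars"].describe())  # mean, std, min, max as in Table 1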

The total time spent on step 2.1 (for Schumann's formula) t21 = 1 hour.
The number of errors found in step 2.1 (for Schumann's formula) n21 = 0 pcs.

Stage 2.
Checking the components of the dataset.
2.2. All values in the records are composed of standard characters, so we will collect statistics on the characters.

Table 2. Statistical indicators of the characters in the dataset, with a preliminary analysis of the results.






Technology 2.2.1.

Create an auxiliary field, "alpha1". For each record, we form a formula of the form "=CONCATENATE(Sheet1!B9; ... Sheet1!AQ9)".
Create a fixed cell, "Omega-1". Into this cell we will enter, one at a time, the Windows-1251 character codes from 32 to 255.
Create an auxiliary field, "alpha2", with the formula "=FIND(CHAR(Omega-1); alpha1_N)", where N is the row number.
Create an auxiliary field, "alpha3", with the formula "=IF(ISNUMBER(alpha2_N); 1; 0)".
Create a fixed cell, "Omega-2", with the formula "=SUM(alpha3_1:alpha3_365498)".
(The function names are given in the English localization; the Russian originals are СЦЕПИТЬ, НАЙТИ, СИМВОЛ and ЕЧИСЛО.)
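The same character census can be taken in one pass outside Excel. A minimal Python sketch, assuming the same input file as above:

    # Minimal sketch: for each Windows-1251 code from 32 to 255, count the
    # records whose glued-together text (alpha1) contains that character.
    import pandas as pd

    df = pd.read_csv("dataset.csv", sep=";", dtype=str, encoding="cp1251").fillna("")

    alpha1 = df.apply("".join, axis=1)          # the "alpha1" field

    for code in range(32, 256):                 # the "Omega-1" loop
        char = bytes([code]).decode("cp1251", errors="replace")
        hits = int(alpha1.str.contains(char, regex=False).sum())  # "Omega-2"
        if hits:
            print(f"code {code:3d} {char!r}: found in {hits} records")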

Table 3. Results of the preliminary analysis


Table 4. Reported errors at this stage


The total time spent on step 2.2.1 (for Schumann's formula) t221 = 8 hours.
The number of corrected errors in step 2.2.1 (for Schumann's formula) n221 = 0 pcs.

Stage 3.
The third stage fixes the state of the dataset by assigning a unique identifier (ID) to each record and to each field. This is needed to map the transformed dataset back to the original; it is also needed to make full use of grouping and filtering. Here we again turn to the character statistics (Table 2) and select a character that is not used anywhere in the dataset. The result is shown in Figure 10.


Figure 10. Assignment of identifiers.
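For reference, the same identifier assignment sketched in Python; the "Z_" prefix follows the Z_348653 example from Figure 13, and the exact ID format is an assumption:

    # Minimal sketch: assign a unique ID to every record and number every
    # field, so the cleaned dataset can always be mapped back to the original.
    import pandas as pd

    df = pd.read_csv("dataset.csv", sep=";", dtype=str, encoding="cp1251")

    # Record IDs in the style of Z_348653 (format assumed for illustration).
    df.insert(0, "record_id", [f"Z_{i:06d}" for i in range(1, len(df) + 1)])

    # Field IDs: prepend an ordinal to each original column name.
    df.columns = ["record_id"] + [
        f"{n:02d} - {name}" for n, name in enumerate(df.columns[1:], start=1)
    ]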

The total time spent on stage 3 (for Schumann's formula) t3 = 0.75 hours.
The number of errors found in stage 3 (for Schumann's formula) n3 = 0 pcs.

Since Schumann's formula requires that a stage end with error correction, we return to stage 2.

Stage 2.2.2.
At this stage, we also fix double and triple spaces.

Fig. 11. The number of double spaces.
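A minimal Python sketch of the same repair, collapsing any run of two or more spaces into a single space and counting the affected cells:

    # Minimal sketch: fix double and triple spaces in all text fields.
    import re
    import pandas as pd

    df = pd.read_csv("dataset.csv", sep=";", dtype=str, encoding="cp1251").fillna("")

    multi_space = re.compile(r" {2,}")
    affected = int(df.apply(lambda col: col.str.contains(multi_space).sum()).sum())
    df = df.apply(lambda col: col.str.replace(multi_space, " ", regex=True))

    print(f"cells that contained repeated spaces: {affected}")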

Correction of the errors recorded at this stage (Table 4).

Table 5. Error correction stage



An example of why such an aspect as the use of the letters «е» and «ё» is significant is presented in Figure 12.


Fig. 12. Disagreement over the letter «ё».
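Two values that differ only in «е» versus «ё» will not match in grouping, filtering or joins. A minimal sketch of a fold used for comparison only (the place name is illustrative):

    # Minimal sketch: normalize "ё" to "е" so that variant spellings
    # of the same name compare as equal.
    def fold_yo(text: str) -> str:
        return text.replace("ё", "е").replace("Ё", "Е")

    a, b = "Берёзовский", "Березовский"
    print(a == b)                      # False: the raw strings differ
    print(fold_yo(a) == fold_yo(b))    # True: equal after normalization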

The total time spent in step 2.2.2 is t222 = 4 hours.
The number of errors found in step 2.2.2 (for Schumann's formula) n222 = 583 pcs.

Stage 4.
At this stage, a check for field redundancy fits in well. Of the 44 fields, the following 6 fields:
7 - Purpose of the structure
16 - Number of underground floors
17 - Parent object
21 - Village Council
38 - Parameters of the structure (description)
40 - Cultural heritage

do not contain a single entry. That is, they are redundant.
The field “22 - City” has a single entry (Figure 13).


Fig. 13. The only entry Z_348653 in the "City" field.

The field “34 - Name of the building” contains records that clearly do not correspond to the purpose of the field (Figure 14).


Fig. 14. An example of an inappropriate entry.

We exclude these fields from the dataset and record the change to 214 records.
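A sketch of the same redundancy check in Python: count the non-empty entries per field, report the empty and single-entry fields, and drop the empty ones.

    # Minimal sketch: detect redundant (empty) fields and exclude them.
    import pandas as pd

    df = pd.read_csv("dataset.csv", sep=";", dtype=str, encoding="cp1251")

    filled = df.notna().sum()                      # non-empty entries per field
    empty_fields = filled[filled == 0].index.tolist()
    single_entry = filled[filled == 1].index.tolist()

    print("empty fields:", empty_fields)           # the six redundant fields
    print("single-entry fields:", single_entry)    # e.g. "22 - City"

    df = df.drop(columns=empty_fields)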

The total time spent on stage 4 (for Schumann's formula) t4 = 2.5 hours.
The number of errors found in stage 4 (for Schumann's formula) n4 = 222 pcs.

Table 6. Analysis of the dataset indicators after stage 4



Analyzing the changes in the indicators as a whole (Table 6), we can say that:
1) The ratio of the average number of characters to the standard deviation is close to 3, that is, there are signs of a normal distribution (the six-sigma rule).
2) The significant deviation of the minimum and maximum values from the average suggests that studying the tails is a promising direction in the search for errors.
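Both observations are easy to verify numerically. A minimal sketch, recomputing the per-record character counts from stage 2.1:

    # Minimal sketch: mean-to-sigma ratio and the tails beyond three sigma
    # for the per-record character counts.
    import pandas as pd

    df = pd.read_csv("dataset.csv", sep=";", dtype=str, encoding="cp1251").fillna("")
    record_len = df.apply(lambda row: sum(len(v) for v in row), axis=1)

    mean, sigma = record_len.mean(), record_len.std()
    print("mean/sigma:", mean / sigma)             # close to 3 per Table 6

    # Records outside mean +/- 3*sigma are the tails worth inspecting.
    tails = record_len[(record_len < mean - 3 * sigma) |
                       (record_len > mean + 3 * sigma)]
    print("records in the tails:", len(tails))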

Let us examine the error-detection results using the Schumann methodology.

Zero-yield stages

2.1. The total time spent on step 2.1 (for Schumann's formula) t21 = 1 hour.
The number of errors found in step 2.1 (for Schumann's formula) n21 = 0 pcs.

3. The total time spent on stage 3 (for Schumann's formula) t3 = 0.75 hours.
The number of errors found in stage 3 (for Schumann's formula) n3 = 0 pcs.

Effective stages
2.2. The total time spent on step 2.2.1 (for Schumann's formula) t221 = 8 hours.
The number of corrected errors in step 2.2.1 (for Schumann's formula) n221 = 0 pcs.
The total time spent in step 2.2.2 is t222 = 4 hours.
The number of errors found in step 2.2.2 (for Schumann's formula) n222 = 583 pcs.

The total time spent in step 2.2 is t22 = 8 + 4 = 12 hours.
The number of errors found in step 2.2 (for Schumann's formula) n22 = 0 + 583 = 583 pcs.

4. The total time spent on stage 4 (for Schumann's formula) t4 = 2.5 hours.
The number of errors found in stage 4 (for Schumann's formula) n4 = 222 pcs.

On the one hand, there are zero-yield stages that should be folded into the first stage of the Schumann model; on the other hand, stages 2.2 and 4 are essentially independent. Since the Schumann model assumes that as testing goes on the probability of detecting errors decreases, i.e., the failure flow thins out, we examine this flow to determine which stage to put first: the stage where the failure density is higher goes first.


Fig. 15.

From the formula in Figure 15 it follows that in the calculations it is preferable to put stage 4 before stage 2.2: stage 4 yields 222 errors in 2.5 hours (88.8 errors per hour), while stage 2.2 yields 583 errors in 12 hours (about 48.6 errors per hour).

Using Schumann's formula, we determine the estimated initial number of errors:


Fig. 16.

From the results in Figure 16 it can be seen that the predicted number of errors is N2 = 3167, which is more than the minimum criterion of 1459.
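The actual computation behind Figure 16 is shown only in the image, so its exact inputs are not available here. As a hedged sketch, the classic two-period moment-matching estimator associated with the Schumann (Shooman) model looks as follows; note that with the stage figures from the text this simplified variant does not reproduce N2 = 3167, so treat it as an illustration of the technique rather than the article's calculation:

    # Minimal sketch: two-period estimate of the initial number of errors N0.
    # From lam1/lam2 = (N0 - n1)/(N0 - n2) it follows that
    # N0 = (gamma * n2 - n1) / (gamma - 1), where gamma = lam1 / lam2.
    def estimate_n0(lam1: float, lam2: float, n1: float, n2: float) -> float:
        # lam1, lam2: error-detection rates in testing periods 1 and 2
        # n1, n2: errors already corrected at the start of each period
        gamma = lam1 / lam2
        return (gamma * n2 - n1) / (gamma - 1)

    # Stage 4 first (222 errors in 2.5 h), then stage 2.2 (583 errors in 12 h).
    print(estimate_n0(222 / 2.5, 583 / 12.0, 0, 222))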

As a result of the correction, we fixed 805 errors, and the predicted remainder is 3167 - 805 = 2362, which is still above the minimum threshold we accepted.

We determine the parameter C, lambda and the reliability function:


Fig. 17.

In essence, lambda is the actual intensity with which errors are detected at each stage. As estimated earlier, this indicator was 42.4 errors per hour, which is quite comparable with the Schumann indicator. In the first part of this material it was determined that the intensity of error detection by the developer should be no lower than 1 error per 250.4 records, at a checking rate of 1 record per minute. Hence the critical value of lambda for the Schumann model:
60 / 250.4 = 0.239617 errors per hour.

That is, the error-locating procedures must continue until lambda drops from the current 38.964 to 0.239617.

Or until the indicator N (the potential number of errors) minus n (the corrected number of errors) falls below the threshold we adopted in the first part: 1459.
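Under the same model assumptions, both stopping criteria can be expressed directly; taking lambda = C * (N - n) and an exponential reliability function is the standard form of this model and is assumed here to match Figure 17:

    # Minimal sketch: parameter C, the critical lambda, and the reliability
    # function R(t) = exp(-lambda * t), with N and n taken from the text.
    import math

    N_pred = 3167            # predicted initial number of errors (Figure 16)
    n_corrected = 805        # errors corrected so far
    lam_current = 38.964     # current detection intensity, errors per hour

    C = lam_current / (N_pred - n_corrected)   # from lambda = C * (N - n)
    lam_critical = 60 / 250.4                  # 1 error per 250.4 records at 1 record/min

    def reliability(t_hours: float, lam: float) -> float:
        return math.exp(-lam * t_hours)

    print(f"C = {C:.6f}, critical lambda = {lam_critical:.6f}")
    print("stop when lambda <= critical or N - n < 1459; now N - n =",
          N_pred - n_corrected)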

Part 1. Theoretical.
