Depersonalizing data is not just randomization



There is a problem in the bank: developers and testers need access to the database. But the database holds a lot of client data that must not be disclosed to the development and testing departments under PCI DSS, Central Bank requirements, and personal data laws.

It might seem that simply replacing everything with some one-way hashes would be enough, and everything would be fine.

Well, it would not.

The thing is, a bank's database is a set of interconnected tables. Somewhere they are linked by the client's name and account number. Somewhere by a unique identifier. Somewhere (this is where the pain begins) through a stored procedure that computes a cross-reference identifier from this table and a neighboring one. And so on.

The usual situation: the developer of the system's first version died or left ten years ago, and the core systems, running in an old hypervisor nested inside a new hypervisor (for compatibility), are still in production.

That is, before anonymizing all this, you first need to understand the database.



Who does depersonalization and why?


Depersonalization, or masking, is done because there are laws and standards. Yes, testing on a snapshot of production is much more convenient, but for such a stunt regulators can revoke a license, that is, shut the business down entirely.

Any depersonalization is a rather expensive and clumsy layer between production systems and development/testing.

The goal of anonymization (masking) projects is almost always to prepare test data that is as similar as possible to the real data stored in production databases. That is, if the data contains errors (a phone number stuffed into an email field, Latin characters instead of Cyrillic in a surname, and so on), the masked data should be of the same quality, yet changed beyond recognition. The second goal is to reduce the volume of the databases used in testing and development. The full volume is kept only for load testing; for all other tasks, a data slice is usually cut according to predefined rules, which is database truncation. The third goal is to get consistent data across the various masked and truncated databases: data in different systems, masked at different times, must be anonymized uniformly.

In terms of computational complexity, depersonalization is roughly comparable to archiving the database with extreme compression; the algorithms are broadly similar. The difference is that archiving algorithms have been honed for years and have reached nearly maximum efficiency, while depersonalization algorithms are written so that they at least work on the current database, stay reasonably universal, and leave the software functional afterwards. Grinding through 40 TB in a night is an excellent result. It also happens that it is cheaper for the customer to run the database through depersonalization once every six months for a week on a weak server; that is an approach too.



How is the data replaced?


Each data type is changed according to rules that the application code will accept. For example, if we replace a name with a random hash full of special characters and digits, the very first data validation in real testing will immediately throw an error.
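A minimal sketch of the problem (the validation rule here is hypothetical, purely for illustration):

```python
import hashlib
import re

# Hypothetical application-side validator: a name may contain only
# Latin/Cyrillic letters, spaces, and hyphens.
NAME_RE = re.compile(r"^[A-Za-zА-Яа-яЁё][A-Za-zА-Яа-яЁё -]*$")

def naive_mask(value: str) -> str:
    # The naive approach: swap the name for a piece of a hex digest.
    return hashlib.sha256(value.encode()).hexdigest()[:12]

masked = naive_mask("Ivan Petrenko")
print(masked)                                # hex string with digits in it
print(bool(NAME_RE.match("Ivan Petrenko")))  # True: real data passes
print(bool(NAME_RE.match(masked)))           # almost surely False: digits
```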

Therefore, the depersonalization system must first determine what type of data is stored in each field. Depending on the vendor, approaches range from manual markup to attempts to discover the database and auto-detect what is stored there. We have experience implementing all the major solutions on the market. Let's walk through one of the options, where a wizard tries to find the data and "guess" what kind of data is stored there.



Naturally, working with this software requires access to real data (usually a copy of a recent database backup). In our banking experience, we first spend two months signing a ton of papers, then come to the bank, where we are undressed, searched, and dressed again, then proceed to a separate room enclosed in a Faraday cage, where two security guards breathe warmly down the backs of our heads.

So, suppose that after all this we see a table with a field called "Name". The wizard has already marked it as a name, and we only have to confirm and choose the type of depersonalization. The wizard offers random replacement from a base of Slavic names (there are bases for different regions). We agree and get replacements like Ivan Ivanovich Petrenko -> Joseph Albertovich Chingachgook. If it matters, gender is preserved; if not, replacements are drawn from the whole name base.

Examples of replacements:
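A sketch of how such dictionary-based replacement with gender preservation can look (the tiny dictionaries here are stand-ins for the vendor's regional name bases):

```python
import random

# Stand-ins for the vendor's name bases.
MALE_NAMES = ["Joseph", "Pyotr", "Semyon"]
FEMALE_NAMES = ["Anna", "Olga", "Maria"]
SURNAMES = ["Chingachgook", "Petrenko", "Smirnov"]

def mask_name(gender: str, keep_gender: bool = True) -> str:
    """Draw a random replacement name, preserving gender if asked to."""
    if keep_gender and gender in ("M", "F"):
        first = random.choice(MALE_NAMES if gender == "M" else FEMALE_NAMES)
    else:
        first = random.choice(MALE_NAMES + FEMALE_NAMES)
    return f"{first} {random.choice(SURNAMES)}"

print(mask_name("M"))  # e.g. 'Joseph Chingachgook'
```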
The next field is a date in Unixtime. The wizard detected this too, but we need to choose the depersonalization function. Dates are usually needed to control the sequence of events, and nobody really needs to test the situation where a client first made a transfer and only then opened an account. So we set a small delta, within 30 days by default. There will still be errors, but if that is critical, you can configure more complex rules by adding your own script to the anonymization processing.
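A sketch of such a date function, assuming each value is shifted independently (which is exactly why small ordering errors can remain):

```python
import random

SECONDS_PER_DAY = 86_400

def mask_unixtime(ts: int, max_delta_days: int = 30) -> int:
    """Shift a Unixtime value by a random delta within +/- max_delta_days.
    Events that were closer together than the delta may swap order;
    if that matters, a custom script must shift related rows together."""
    delta = random.randint(-max_delta_days, max_delta_days) * SECONDS_PER_DAY
    return ts + delta

print(mask_unixtime(1_500_000_000))
```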

An address must validate, so a base of Russian addresses is used. A card number must look like a real one and pass checksum validation. Sometimes the task is to "make all Visas into random Mastercards"; that is also feasible in a couple of clicks.
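Card numbers are validated with the Luhn checksum, so a masked number has to end with a correctly computed check digit. A sketch of the "random Mastercard" generation (classic Mastercard numbers start with 51-55):

```python
import random

def luhn_check_digit(body: str) -> str:
    """Compute the Luhn check digit to append to a partial card number."""
    total = 0
    for i, ch in enumerate(reversed(body)):
        d = int(ch)
        if i % 2 == 0:  # rightmost body digit gets doubled
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def random_mastercard() -> str:
    """A random 16-digit number with a 51-55 prefix and a valid checksum."""
    body = str(random.randint(51, 55)) + "".join(
        str(random.randint(0, 9)) for _ in range(13)
    )
    return body + luhn_check_digit(body)

print(random_mastercard())  # passes any Luhn validator
```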
Under the hood of the wizard is profiling. Profiling is a search for data in the database according to predefined rules (attributes, domains). In essence, we read every cell of the customer's database, apply a set of regular expressions to each cell, compare the values in the cell against dictionaries, and so on. The result is a set of triggered rules on the columns of the database tables. Profiling is configurable: we do not have to read all the tables, and we can take only a certain number of rows, or a certain percentage of rows, from each table.
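A toy version of such a profiler (the rule set here is a small invented sample; real products ship far more rules plus dictionary matching):

```python
import re

# A small invented rule set: attribute name -> regular expression.
RULES = {
    "email":       re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone":       re.compile(r"^\+?\d{10,11}$"),
    "card_number": re.compile(r"^\d{16}$"),
}

def profile_column(sample_values) -> dict:
    """Apply every rule to a sample of one column, return hit ratios."""
    hits = dict.fromkeys(RULES, 0)
    for value in sample_values:
        for name, rx in RULES.items():
            if rx.match(str(value)):
                hits[name] += 1
    n = max(len(sample_values), 1)
    return {name: cnt / n for name, cnt in hits.items() if cnt}

# Read only a sample of rows, as described above.
print(profile_column(["ivan@example.com", "olga@example.com", "+79261234567"]))
# {'email': 0.66..., 'phone': 0.33...} -> the column is probably emails
```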



What is going on inside?


The depersonalization rules we have chosen are applied to every record in the database. Temporary tables are created for the duration of the process, and the replacements are written into them. Each subsequent record is run against these correspondence tables, and if a match is found there, it is replaced in the same way as before. In reality things are a bit more complicated, depending on your scripts and pattern-matching rules (there can be fuzzy matching, for example for grammatical gender, or for dates stored in a different format), but that is the general idea.
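The core of the mechanism, consistency through a correspondence table, fits in a few lines (a sketch; real implementations keep these tables in a DBMS, not in memory):

```python
import random

class CorrespondenceTable:
    """Once a source value has been replaced, every later occurrence of
    the same value gets exactly the same replacement."""

    def __init__(self, generate):
        self.generate = generate  # function producing a new masked value
        self.mapping = {}         # original -> replacement

    def mask(self, value):
        if value not in self.mapping:
            self.mapping[value] = self.generate(value)
        return self.mapping[value]

surnames = CorrespondenceTable(lambda _: random.choice(["Petrenko", "Smirnov"]))
assert surnames.mask("Ivanov") == surnames.mask("Ivanov")  # stable everywhere
```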

If there are marked-up correspondences like "name in Cyrillic, same name in Latin", they should be explicitly declared at the development stage, and then they will map to each other in the substitution table. That is, the name is anonymized in Cyrillic, and the anonymized entry is then transliterated into Latin, for example. Here we step away from the "do not improve the quality of data in the system" principle, but this is one of the compromises you make for the sake of overall system performance. Practice shows that if load and functional testing do not notice the compromise, then effectively there was none. And here comes an important point: depersonalization as a whole is not encryption. If you have a couple of billion records in a table and in ten of them the TIN was left unchanged, so what? Nothing: those ten records cannot be picked out.

After the process ends, the conversion tables remain in the protected database of the depersonalization server. The database is cut down (truncated) and handed to testing without the conversion tables; thus, for the tester, depersonalization is irreversible.

The full anonymized database is handed to testers for load testing.

This means that while the database is being processed, the conversion table "swells" (the exact size depends on the chosen replacements and their types), but the working database stays its original size.

What does the process look like in the operator interface?


General view of the IDE using one of the vendors as an example:



Debugger:



Starting a transformation from the IDE:



Configuring an expression to search for sensitive data in the profiler:



Page with a set of rules for the profiler:



The profiler's output, a web page with the data search results:



Is all data in the database masked?


No. Typically the list of data subject to depersonalization is regulated by industry laws and standards, plus the customer has its own list of specific fields that nobody outside should see.

The logic is this: if we mask a hospital patient's name, the diagnosis may or may not be masked, since no one can tell whose it is anyway. We had a case where notes on bank transactions were simply masked with random letters. There were notes like: "Loan refused because the client came in drunk and vomited on the counter." From a debugging point of view it is just a string of characters. Well, let it stay.

Example strategies:



A dynamic seed table is a transcoding table into which we add every recoding that has already happened. The hashing can vary; in the case of a TIN, more often a new random TIN is generated with the first digits preserved and the check digits recomputed.
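For example, the last digit of a 10-digit Russian organization TIN (INN) is a checksum with fixed weights, so a masked value can keep the leading digits and still validate. A sketch:

```python
import random

# Check-digit weights for a 10-digit Russian INN.
INN10_WEIGHTS = (2, 4, 10, 3, 5, 9, 4, 6, 8)

def inn10_check_digit(first9: str) -> str:
    total = sum(w * int(d) for w, d in zip(INN10_WEIGHTS, first9))
    return str(total % 11 % 10)

def mask_inn10(inn: str, keep_prefix: int = 4) -> str:
    """Keep the leading digits, randomize the rest, recompute the checksum."""
    body = inn[:keep_prefix] + "".join(
        str(random.randint(0, 9)) for _ in range(9 - keep_prefix)
    )
    return body + inn10_check_digit(body)

print(mask_inn10("7707083893"))  # keeps '7707', still passes validation
```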

Is it possible to change data using the DBMS itself?


Yes. There are two main approaches to depersonalizing data: changing the data inside the database using the DBMS itself, or organizing an ETL process and changing the data with third-party software.

The key plus of the first approach is that the data never has to leave the database: there are no network costs, and the fast, optimized tools of the DBMS are used. The key minus is separate development for each system and the lack of conversion tables shared across systems. Conversion tables are needed for reproducible depersonalization and for further data integration between systems.

The key advantage of the second approach is that it does not matter whether the source is a database, a system, a file, or some web interface: once a rule is implemented, it can be used everywhere. The key minus is that the data has to be read from the database, processed by a separate application, and written back.
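A toy ETL pass, with sqlite3 standing in for whatever DBMS the customer actually runs: read rows out, apply the masking rules in the application, write the results back. The same rule functions can then be reused for any other source, which is the point of the approach:

```python
import sqlite3

def etl_mask(db_path: str, table: str, rules: dict) -> None:
    """rules maps a column name to its masking function."""
    conn = sqlite3.connect(db_path)
    cols = list(rules)
    # Extract: read the columns that need masking.
    rows = conn.execute(
        f"SELECT rowid, {', '.join(cols)} FROM {table}"
    ).fetchall()
    # Transform + load: mask each value and write it back.
    set_clause = ", ".join(f"{c} = ?" for c in cols)
    for rowid, *values in rows:
        masked = [rules[c](v) for c, v in zip(cols, values)]
        conn.execute(
            f"UPDATE {table} SET {set_clause} WHERE rowid = ?",
            (*masked, rowid),
        )
    conn.commit()
    conn.close()

# e.g. etl_mask("bank.db", "clients", {...}) with rule functions like
# the name, date, and TIN maskers sketched earlier.
```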

Practice shows that if the customer has several systems that require further integration, only the second approach fits both the final budget and acceptable development timelines.



That is, we can do anything we want, but the ETL approach has proven itself very well in the banking sector.

Why not just spoil the data by hand?


That works exactly once. Someone sits for three days, depersonalizes a pile of data, and prepares a database of 500-1000 records. The difficulty is that the process has to be repeated regularly (after every change to the database structure, every new field and table) and at scale (for different types of testing). A common request is to depersonalize the first 10-50 GB of the database with that volume spread evenly across the tables.
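A sketch of how such a truncation request can be planned, taking the same fraction of every table (the table sizes here are invented):

```python
def truncation_plan(table_sizes_gb: dict, target_gb: float) -> dict:
    """Spread the target volume evenly: the same fraction of every table."""
    total = sum(table_sizes_gb.values())
    fraction = min(1.0, target_gb / total)
    return {t: round(s * fraction, 2) for t, s in table_sizes_gb.items()}

# Cutting an invented 4 TB database down to 40 GB takes 1% of each table:
print(truncation_plan({"clients": 200, "transactions": 3500, "logs": 300}, 40))
# {'clients': 2.0, 'transactions': 35.0, 'logs': 3.0}
```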

What to do if document scans are stored in the database?


If a document can be converted to XML and back (office documents, for example), it can be depersonalized too. But sometimes the database holds binaries such as passport scans in PDF/JPG/TIFF/BMP. In that case the accepted practice is to prepare similar documents with a third-party script and replace the real ones with samples from a base of randomly generated documents. Photographs are the hardest, but there are services that solve that issue in roughly the same way.
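A crude sketch of generating such a replacement "scan" (Pillow stands in for the real generator scripts; the layout and field names are invented):

```python
import random
from PIL import Image, ImageDraw  # Pillow

def fake_passport_scan(path: str) -> None:
    """Generate a crude passport-like image to replace a real scan."""
    img = Image.new("RGB", (600, 400), "lightgray")
    draw = ImageDraw.Draw(img)
    draw.text((20, 20), "PASSPORT (SAMPLE)", fill="black")
    draw.text((20, 60), f"No. {random.randint(10**5, 10**6 - 1)}", fill="black")
    draw.rectangle((400, 40, 560, 240), outline="black")  # photo placeholder
    img.save(path)

fake_passport_scan("fake_scan.jpg")
```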

Who is responsible for what?




When updating the database after a software change, or when "catching up" with production, the process is simpler.

But what if something goes wrong in the tests?


It usually does. First, after the initial depersonalization run, testers formulate their requirements for the database more precisely. We can then change the depersonalization rules or reject records ("here the actions must go in chronological order, not chaotically"). Second, depending on the implementation, we either keep supporting the depersonalization as the database changes, or hand over all the documentation, descriptions of the database structure and processing types, the entire processing code (rules in XML/SQL), and train the customer's specialists.

How to watch a demo?


The easiest way is to email me at PSemenov@technoserv.com.
