Five stages of accepting the inevitable, or How we developed a program for automated profiling

Hi, this is Alexey Filatov (aka afilatov123). In 2017, I was invited to join the SearchInform team to launch a new software solution or, more precisely, to extend the capabilities of the flagship product, the DLP system. The market now wants software that can do more than prevent information leaks and corporate fraud. Customers want the program to predict user behavior: "this employee is getting ready to resign, which means he may ..." or "this person is under stress and will probably make a mistake." And these predictions must be made with high accuracy and automatically.

To solve this problem, vendors usually follow the path of UEBA (or UBA). But we went our own way and started creating automated profiling.

image

Under the cut is the story of the path we took to make the product happen.

I’ll clarify right away that automated profiling can, with big reservations, also be called user behavior analysis. But the difference in methods is significant, and we would like to sort out the confusion in terms in one of the next posts (otherwise this already lengthy story will turn into an endless chronicle).

So, profiling is a long-established technique, but only in an offline format. In the offline world, there are specialist profilers who, by analyzing speech, intonation, and facial expressions, draw conclusions about a person's emotional state, personal qualities, criminal inclinations, and so on. Keeping a profiler (and preferably a dozen) on the staff of even a wealthy company is a utopia. Hence the idea of a program that replaces these bright minds.

We started working on ProfileCenter by choosing what would become the "raw material" for analysis. There are not many options:

  • spoken language - to evaluate linguistics and voice characteristics;
  • keyboard handwriting (keystroke dynamics);
  • Internet traffic and other patterns of user interaction with a computer;
  • facial expressions;
  • user texts.

Spoiler: we chose texts for development, but first I will briefly explain why the other options were eliminated.

Spoken language is an accessible source of information, which is why vendors want to work with it. There is also good scientific work on speech assessment. The most notable are the works of Tim Polzehl, for example Personality in Speech, as well as those of Swati Johar, Koteswara Rao Anne, K. Srinivasa Rao, and Ute Jekosch. But for now the technique is considered crude: voice analyzers can identify the level of stress well, but many experts question their ability to reliably determine personal characteristics.

Another option for working with spoken language is to transcribe it into written text and then analyze it as text. Of course, we also tested speech-to-text tools. But so far, the recognition quality of most offline tools has not satisfied us.

Behavior patterns are statistical indicators of computer use: for example, the time a person spends in a particular application or program, how many emails he sends, and so on. Well-known UEBA (UBA) projects mainly work with this information, revealing, for example, that a person has suddenly started sending not 10 but 100 emails a day (which means you need to take a closer look at him). But this technology has not yet yielded objectively good results in predicting user behavior or, again, in assessing personal characteristics.
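To make the "10 versus 100 messages a day" check concrete, here is a minimal sketch of such a volume rule. It is not taken from any real UEBA product: the function name `is_unusual`, the median baseline, and the `factor` threshold are all assumptions chosen for illustration.

```python
from statistics import median

def is_unusual(history, today, factor=3.0):
    """Flag today's count if it exceeds `factor` times the median of past days.

    `history` is a list of daily message counts; `factor` is an arbitrary,
    hypothetical threshold.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare with
    baseline = max(median(history), 1)  # avoid a zero baseline
    return today > factor * baseline

print(is_unusual([9, 10, 11, 10, 12], 100))
print(is_unusual([9, 10, 11, 10, 12], 14))
```

A rule like this notices the change in volume, but it says nothing about why the behavior changed or what kind of person is behind it.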

A relatively interesting parameter here is the analysis of traffic and search queries, but it says more about the user's current interests than about his character and personality.

Facial expression analysis is one of the most well-developed methods. But more and more people in the scientific community have begun to doubt the correctness of this approach, because a lot of evidence has appeared that facial expressions do not always reflect a person's emotional state and are very “noisy”.

image

As someone directly familiar with FACS (the Facial Action Coding System), I agree with this. An assessment of emotions is mainly useful when you know the context and the exact relationship between stimulus and reaction. In our conditions, unfortunately, this cannot be tracked. Besides, if you take the idea further, you end up with physiognomic analysis, and that already means venturing into unscientific territory.

Keyboard handwriting has not yet met with much skepticism in the scientific community. There are dozens of works that have studied whether personality traits can be determined by how a person “knocks on the keys”, but these works have not yet been turned into practical models.

For now, this technology is narrowly specialized: it analyzes how a person types a username and password and can be used to identify a person. Analysis of arbitrary texts is not developed. But even with these limitations, of all the sources of information listed above, keyboard handwriting is the most interesting to us “for growth”, as they say.

And finally, text analysis. This is the most studied and proven option, since written language is a direct product of thinking. It reflects patterns of thinking, the internal structure of the personality, preferences, values, and other characteristics. The connection between thinking and speech is studied by two sciences: psycholinguistics to a greater extent and psychosemantics to a lesser extent. We are not the only ones who chose written language: ABBYY, Google, and many others use it as a source of information for their products.

There is one more, purely technical advantage of choosing written language as the basis for analysis: there is a lot of it, and it is successfully collected by the DLP system with which ProfileCenter integrates. So the choice was predetermined.

What is noise and how to clean text


So, we established that written speech would be the main source of information for the program. The next stage of work was creating an algorithm for cleaning the text of "noise" and normalizing it. Clearing “noise” means removing elements that carry no semantic load and have no value for analysis. It was easy to start: abstract numbers, Latin words, typos, and some pictures were all classified as noise.

image
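As an illustration of the easy part of noise removal, here is a minimal sketch that drops tokens containing digits or Latin letters from a Russian text. It is not the ProfileCenter algorithm: the function name `strip_noise` and the token rules are assumptions, and typos and pictures are not handled at all.

```python
import re

HAS_DIGIT = re.compile(r"\d")
HAS_LATIN = re.compile(r"[A-Za-z]")

def strip_noise(text):
    """Keep only whitespace-separated tokens without digits or Latin letters."""
    kept = []
    for token in text.split():
        if HAS_DIGIT.search(token) or HAS_LATIN.search(token):
            continue  # treat numbers and Latin words as noise
        kept.append(token)
    return " ".join(kept)

print(strip_noise("Отправил отчет 2020 по email вчера"))
```

Dropping whole tokens is a deliberately blunt choice: it also removes any punctuation attached to a noisy token, which matters once punctuation becomes a feature in its own right.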

With punctuation, everything turned out to be more complicated. Far from everyone puts a period at the end of a sentence in everyday correspondence, so we first had to learn to determine where it should go. The presence and number of commas is also an important parameter. At the same time, in Skype chats or on social networks, punctuation marks are practically ignored.

Another difficulty was isolating informal communication from the correspondence, that is, analyzing texts in which the employee goes beyond professional and official duties. The first source that we connected to the module was email. Standard opening and closing phrases (greetings, "with respect", signatures, etc.) were excluded from this text, and only the substantive part of the correspondence was passed on for analysis. However, people mostly write dry business letters in email, so connecting other sources of information (corporate messengers, social networks, etc.) gives a more accurate result.
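A minimal sketch of stripping stock phrases from an email might look like this. The phrase lists, the function name `substantive_part`, and the rule "everything after a closing phrase is a signature" are assumptions for illustration, not the rules the module actually uses.

```python
GREETINGS = ("hello", "good morning", "good afternoon")
CLOSINGS = ("best regards", "with respect", "sincerely")

def substantive_part(message):
    """Return the message body without greeting lines and without the sign-off."""
    body = []
    for line in message.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if not stripped or lowered.startswith(GREETINGS):
            continue  # skip blank lines and greetings
        if lowered.startswith(CLOSINGS):
            break  # the closing phrase and the signature after it are dropped
        body.append(stripped)
    return "\n".join(body)

email = "Hello, Ivan!\n\nThe report is ready, please check the numbers.\n\nBest regards,\nAlexey"
print(substantive_part(email))
```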

At the next step, we also added correspondence from corporate messengers, Skype, Viber, WhatsApp, Lync, Telegram, and social networks to the analysis.

Working with cleaned text


We got clean text. The next stage, which is also the most difficult, is building user psychotypes from this text. In our conceptual framework, a “psychotype” is a system of behavioral stereotypes, individual and value attitudes, and motivational, emotional, and communicative personality traits needed to describe the differences between people.

There are many psychotypologies in the scientific literature, but in the main they duplicate each other. We relied mostly on the works of Lichko, Leonhard, Sobchik, Glukhov, Kosinski, Seligman, and Belyanin and on the Psychea model of structural-dynamic profiling.

As a result of the synthesis of these typologies, we now rely on eight psychotypes with conventional names: hysteroid, epileptoid, paranoid, emotive, anxious, hyperthymic, schizoid and critical.

But how do you analyze a text automatically so as to assign its author to one of the eight types?


The first hypothesis was this: for each psychotype, create a lexical dictionary, find matches in a person’s vocabulary, and assign him to one of the eight types. For example, it is known that people of the schizoid type more often use low-frequency words (“muselet” instead of “wire” or “octothorpe” instead of #) and long words, while people of the epileptoid type favor verbs more than others.

But these are conclusions at the level of empirical observation. If you try to turn them into algorithms, the idea becomes unworkable: the dictionaries are too large, and each word needs to be assigned a weight (its significance in the overall formula of the type). Who can assign this weight? An expert profiler. Suppose there even is such an abstract "Alexey Filatov" who will take the trouble to sift through all the words of the Russian language to see how each corresponds to the lexicon of a schizoid or an epileptoid. But even in this utopian scenario, the result would be the subjective assessment of one particular expert.

Dictionaries of how frequently a person uses particular words depending on how pronounced individual personality traits are, however, are a completely different matter. Psycholinguistic researchers have them. But even so, this variable is not the most significant one in the formula. What matters much more is not what a person says but how: which parts of speech he uses, how he builds phrases, what morphology he uses, and so on. Many of these parameters are described in the corpus of the Russian language, and this is already a starting point for building formulas.

Another important point. To say how pronounced certain personal qualities are in a person, you need a reference point. A person cannot be simply money-motivated or simply conflict-prone; he is motivated or conflict-prone only in comparison with someone else. Therefore, the conditional "norm" for the program is the median value of personal qualities in the team. For the median to be calculated correctly, the team must have at least 20 people.
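The idea of a team median as the "norm" can be sketched as follows. The function name `relative_to_team` and the choice of a simple difference as the measure are assumptions; only the median baseline and the 20-person minimum come from the text.

```python
from statistics import median

MIN_TEAM_SIZE = 20  # minimum team size stated in the article

def relative_to_team(scores, user):
    """Return how far a user's trait score is from the team median.

    `scores` maps user name -> raw score for one personal quality.
    """
    if len(scores) < MIN_TEAM_SIZE:
        raise ValueError("at least 20 people are needed for a meaningful median")
    return scores[user] - median(scores.values())

team = {f"u{i}": float(i) for i in range(21)}  # toy data: scores 0..20
print(relative_to_team(team, "u20"))  # above the median
print(relative_to_team(team, "u0"))   # below the median
```

A positive value means the quality is more pronounced than in the typical colleague, a negative value means less; the same raw score can therefore read differently in two different teams.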

As a result, the calculation algorithm, from the moment the user’s text is collected to the final assignment of a psychotype, was chosen as follows:

  • extract unstructured user text from messages;
  • identify words in the unstructured text that match the dictionaries of personal qualities;
  • determine the weight of each word based on word frequency in the unstructured text;
  • determine the characteristics of personal qualities;
  • determine quantitative indicators of the user's personal qualities by comparing his characteristics with the median values for all users in the team;
  • determine the user's psychotype.
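The dictionary-matching and frequency-weighting steps above can be sketched as follows. The dictionaries, trait names, and the use of relative frequency as the weight are toy assumptions; the real dictionaries and formulas are not published in this article.

```python
from collections import Counter

# Hypothetical miniature dictionaries of personal qualities
TRAIT_DICTIONARIES = {
    "anxiety": {"worry", "afraid", "risk"},
    "activity": {"do", "start", "go"},
}

def trait_scores(text):
    """Score each trait as the share of the text's words found in its dictionary."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    scores = {}
    for trait, vocabulary in TRAIT_DICTIONARIES.items():
        matched = sum(counts[word] for word in vocabulary)
        scores[trait] = matched / total if total else 0.0
    return scores

print(trait_scores("I worry about risk and I worry again"))
```

In a fuller version, these raw scores would then be compared with the team median and mapped to a psychotype, which is where most of the methodological work lies.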

We decided that in the program interface the user, that is, a security specialist or an HR specialist, sees not the final result of the calculations in the form of a psychotype but an intermediate stage: the breakdown by personal qualities. This is more informative. The psychotype itself is displayed in the so-called extended report.

Hypothesis testing and refinement of formulas


We had decided on the calculation algorithm. But how could we check and adjust the formula, and on whom? SearchInform's own employees became the test subjects: we selected 102 people. With the help of fellow profilers, I profiled them manually. The subjects also completed three standardized questionnaires: the 5PFQ questionnaire (the so-called “Big Five”), the Schwartz questionnaire, and L. N. Sobchik's SMIL and ITO questionnaires. Then we compared the results with the data that the program produced.

The match varied by scale, from 57% to 94%. The scales of extraversion / introversion, anxiety, conflict, activity, etc. were determined very well. The results were worse, for example, for “ambitiousness”.

Based on the statistics obtained, we adjusted the formula. As a result, we “sewed” more than 70 variables into it (for example, the passive voice index, the indices of word length, sentence length, proper names, etc.), along with the weight of each.
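Two of the simpler variables, word length and sentence length, can be sketched like this, together with a weighted sum. The function names, the regular expressions, and the weights are assumptions; indices such as passive voice need morphological analysis and are left out.

```python
import re

def style_features(text):
    """Compute average word length (letters) and average sentence length (words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, any alphabet
    if not sentences or not words:
        return {"avg_word_length": 0.0, "avg_sentence_length": 0.0}
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
    }

def weighted_score(features, weights):
    """Combine selected features using per-feature weights (hypothetical values)."""
    return sum(weights[name] * features[name] for name in weights)

features = style_features("I came. I saw it.")
print(features)
print(weighted_score(features, {"avg_sentence_length": 2.0}))
```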

It took a long time to determine the minimum sufficient amount of written material for analysis. We have now settled on 20 thousand lemmas (a lemma is the base, dictionary form of a word). But we started the analysis with 50 thousand and reduced this volume in steps of 5 thousand.
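The step-down search itself is simple to sketch. How "still good enough" was judged is not described in the article, so the `is_sufficient` callback below is a hypothetical stand-in for that check.

```python
def smallest_sufficient_volume(is_sufficient, start=50_000, step=5_000):
    """Walk down from `start` in `step` decrements.

    Returns the smallest volume for which `is_sufficient(volume)` still holds,
    or None if even the starting volume is not enough.
    """
    best = None
    volume = start
    while volume > 0 and is_sufficient(volume):
        best = volume
        volume -= step
    return best

# Toy criterion: pretend the profile degrades below 20,000 lemmas
print(smallest_sufficient_volume(lambda n: n >= 20_000))
```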

One of the most common questions is why we still have not implemented the evaluation of third-party user text taken from open sources. Why wait for 20 thousand lemmas to accumulate if you can take a specific user's text from the web and analyze it by the same criteria? Technically, this is possible, but then the program needs text not from one person but from a whole team of employees or people of similar professions (the reason is described above: the norm is the team median).

Field testing and limits


When the working model was ready, about two years ago, we started testing the program (as an MVP) not only on our own employees but also on the employees of several dozen clients who agreed to take part in the experiment. By October-November 2018, we had a well-functioning product. We were sure that it produces high-quality data on the so-called primary personal qualities (which we can double-check with questionnaires).

The accuracy of the results of the finished module was evaluated by expert profilers and clients at 75–80%. For a task whose solution no one has previously proposed, these are good indicators. The main thing is that this is enough to solve business problems.

image

There are limits that we still cannot get past. To create the highest-quality psychological portrait possible, you need two to four modalities: text, intonation, traffic, etc. When we add the analysis of voice, social networks, and keyboard handwriting to the module, the quality will be even better. But these tasks are quite difficult to solve (as described above). Each additional percentage point of accuracy comes with increasing difficulty.
We face roughly the same limitations when trying to build profiles for people who write little and whose vocabulary is, frankly, poor. These are the users whose communication comes down to “hello”, “ok” and “come on”. It is difficult to build a correct profile for them on the basis of written speech alone.

So what did we get? The short profile and what's in it


The product of all the research described above is a short personality profile. As I said, this is primary information, “raw material” from which more detailed conclusions can be drawn about both an individual and the team.

In the short profile, we needed to create a portrait of the user that reflects the characteristics that are fundamentally important to a security specialist and an information security service: strengths / weaknesses, fundamental differences between the employee and other users, general type, criminal tendencies, values, and recommendations.

As a result, in the short profile we single out the three strongest and three weakest personality traits.
It looks, for example, like this:

image
(This, by the way, is a screenshot of the profile of one strong leader).
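Picking the three strongest and three weakest traits from a set of scores can be sketched as below. The trait names and scores are invented, and ranking by raw score is an assumption; in practice the scores would already be expressed relative to the team.

```python
def strongest_and_weakest(traits, n=3):
    """Return (n strongest, n weakest) trait names, each ordered from most extreme."""
    ranked = sorted(traits, key=traits.get, reverse=True)
    strongest = ranked[:n]
    weakest = ranked[-n:][::-1]  # reverse so the weakest trait comes first
    return strongest, weakest

profile = {
    "leadership": 0.9, "activity": 0.8, "extraversion": 0.7,
    "emotionality": 0.3, "anxiety": 0.2, "conflict": 0.1,
}
print(strongest_and_weakest(profile))
```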

Next, we compose an index of personal qualities. Why do we need it? Not all personality traits are equally ... stable. The manifestation of some depends strongly on the context, and without a reference point it is impossible to conclude that a quality is pronounced.

For example, when can you say that a person is conflict-prone? When he starts swearing? Hitting others? Shooting? But if we assess conflict in comparison with the opposite quality (as a dichotomy), we can understand how pronounced both are. For example, that a person is more responsive and polite than conflict-prone.

image
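One way to express a quality against its opposite is a normalized balance between the two poles. This is only a sketch: the formula and the function name `dichotomy_index` are assumptions, not the index used in the product.

```python
def dichotomy_index(pole_a, pole_b):
    """Return a value in [-1, 1]: negative leans to pole A, positive to pole B.

    Both inputs are non-negative scores, e.g. conflict (A) vs. politeness (B).
    """
    total = pole_a + pole_b
    if total == 0:
        return 0.0  # neither pole is expressed
    return (pole_b - pole_a) / total

# conflict = 1.0, politeness = 3.0: the person leans toward politeness
print(dichotomy_index(1.0, 3.0))
```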

We also identify criminal tendencies in the short profile (do not forget that our ProfileCenter is a product primarily for security services).

To identify risks for each profile, we again turned to psychology and described, in the language of economic and information security, the risks inherent in particular personal qualities: for example, conflict, talkativeness, the dark triad of personality (manipulativeness), leadership qualities, and emotionality. There are studies that made it possible to compare these data and derive recommendations. Here we relied on a large number of works not only in criminology, criminal psychology, and criminal profiling, but also in personnel security and personnel risk management.
To calculate ambitiousness, we compiled our own linguistic formulas. To select the variables for the formulas that calculate basic values, we used the scientific work of Belyanin and Schwartz.

This is how it all looks together. Short profile report:

image

Ratings, Advanced Reports, and Profile Dynamics


What's next? With information about personal qualities in hand, we set about creating ratings, since this is a useful function for our target audience: security service specialists, and information security specialists in particular. They told us: we have 5,000 users, and we can't keep track of everyone. If you could narrow our focus (identify risk groups), we would know whom to watch more closely.

The difficulty at this stage was not technological but methodological, since it is not enough to simply rank all users by each quality. For security services, it is the “synthetic” personality traits that are informative: not conflict but scandalousness, not a desire for interaction but leadership. Scandalousness and leadership combine several indicators from the short profile. To compile a formula for each rating and determine the weight of each quality in it, we again turned to psychosemantics and psycholinguistics. We processed at least 35 works in Russian and English. As a result, the program now provides 12 ratings, on the basis of which you can create your own.
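A "synthetic" rating built from several qualities can be sketched as a weighted sum followed by ranking. The quality names, the weights, and the function names are invented for illustration; the real formulas come from the psycholinguistic literature mentioned above and are not reproduced here.

```python
def rating(profile, weights):
    """Weighted sum of the qualities that make up one synthetic rating."""
    return sum(weight * profile.get(quality, 0.0) for quality, weight in weights.items())

def risk_group(profiles, weights, top=2):
    """Return the `top` users with the highest rating, highest first."""
    ranked = sorted(profiles, key=lambda user: rating(profiles[user], weights), reverse=True)
    return ranked[:top]

# Hypothetical "scandalousness" rating and toy profiles
scandalousness = {"conflict": 0.6, "emotionality": 0.4}
profiles = {
    "a": {"conflict": 0.9, "emotionality": 0.8},
    "b": {"conflict": 0.1, "emotionality": 0.2},
    "c": {"conflict": 0.5, "emotionality": 0.9},
}
print(risk_group(profiles, scandalousness))
```

Because the weights are plain data, a custom rating is just another dictionary of qualities and weights.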

image

Ratings can identify risk groups: employees who are preparing to resign, demotivated, aggressive, scandalous, etc. Conversely, you can use the ratings to form personnel reserve groups. By the way, we are very good at predicting an employee's resignation, burnout, and high leadership potential.

In principle, creating the extended profile and the profile dynamics involved the same technical and methodological tasks from psycholinguistics: choosing variables for the formulas and determining the weight of each.

In the extended profile, we made additional reports that greatly expand the scope of the program because, in essence, they provide information on the user's core competencies. These are usually evaluated by HR managers and SHL competency managers (the need for power and control, the need for consent, extraversion, general intellect, openness to the new, commitment, emotional stability, achievement motivation).

Dynamics of profile changes: based on this report, you can receive alerts if something happens to a person, for example if he rises to the top of ratings that matter to information security specialists.

image

I attach great importance to the fact that we were able to create a report on the dynamics. Why was this important? If a user's profile and ratings remain generally stable over several recalculations across 2–4 months, this indicates that the user's so-called typical behavior has been found.
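A stability check over several recalculations can be sketched as follows. The tolerance value and the "range of each trait" criterion are assumptions; the article only says that the profile should stay generally stable.

```python
def is_stable(snapshots, tolerance=0.1):
    """Return True if every trait varies by at most `tolerance` across snapshots.

    `snapshots` is a list of dicts (trait -> score), one per recalculation.
    """
    if len(snapshots) < 2:
        return False  # a single calculation cannot show stability
    traits = snapshots[0].keys()
    return all(
        max(s[t] for s in snapshots) - min(s[t] for s in snapshots) <= tolerance
        for t in traits
    )

history = [{"anxiety": 0.50}, {"anxiety": 0.55}, {"anxiety": 0.52}]
print(is_stable(history))
```

Once a stable baseline exists, a later snapshot that falls outside the tolerance is exactly the kind of change a dynamics report would flag.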

This means that the key task of behavioral analysis in information security has been solved.

Interface


Strangely enough, we had to tinker not only with technical and methodological problems. The question of how to present the results graphically caused no less discussion. In my head, the interface looked completely different from what it is now. But it was important to think about what would be most convenient for customers working with the product.

image

The designer worked in emergency mode and went through dozens of options. Every element was criticized: the visualization of the index of personal qualities, known in the project team as the “battery”, the pictograms for basic values and level of ambition, and the block with recommendations.

image
Interface “CIB Searchinform ProfileCenter”, which was released in 2018


“Translation Difficulties”


Another point is terminology. How do you choose names for personal qualities and ratings that are correct from the point of view of science yet informative for our users? For example, in the first version we introduced the “gambling” parameter. In psychology, this means involvement in a process, but most people read it as “a fondness for gambling.”

Because of these differences in terminology, the alpha version received a mixed reception, so definitions and brief explanations of terms appeared in the final version of the report.

The discussions continue to this day, every time we introduce a new rating and need to settle on a name that is concise yet understandable to non-psychologists. We are following the same path with foreign-language terminology: the English version was released last year.

What else are we working on? For now, work is underway on improving reports. The module can now generate about 78,000 variants of extended employee profiles and can determine a user's risk rating. ProfileCenter integrates with the SearchInform CIB DLP system and still needs to learn to find correlations between incidents and human behavior.

We are working on integrating a keyboard handwriting detection module into ProfileCenter and preparing an extended report and additional risks in the field of personnel and information security. In general, there are many more ways to extend the capabilities of the software.

In general, the market is actively developing in this direction, and there are already followers who are trying to automatically assess employee risks in information security. But I emphasize that such work is promising only at the junction of several “modalities”, when the analysis takes into account at least not only “technical” but also psycholinguistic information, and ideally even more.

P.S.


If my long story about profiling didn’t scare you away but made you more interested in the topic, I invite you to take the course “Profiling for the IS Service” starting Monday: 5 classes that we normally hold in person at Center Search and that will now be available online and for free (all because of the quarantine, what else).

The list of topics:

  • 20, 11.00
  • 21, 11.00: “ProfileCenter”
  • 22, 11.00
  • 23, 11.00
  • 24, 11.00

You can register here.
