Neural networks and Process Mining: trying to make friends

Process Mining is a data analysis field that allows you to analyze processes based on the event logs of information systems. Since there are very few publications on Habré about applying machine learning in this field, we decided to share our experience in developing predictive models for process-oriented problems. As part of IT Junior, VTB's program for beginning IT professionals, interns from the Process Mining team tested machine learning methods on the tasks of studying the bank's processes. Under the cut, we will talk about when and how we came up with the idea of solving such problems, what we did, and what results we got.



IT Junior is an annual internship program for beginner IT professionals at VTB Bank, which first started in September 2019. The internship lasts six months. Following the 2019 program, more than half of the interns joined the staff and became employees of the company. More information about the program, the start of selection, and the requirements for candidates can be found here. This is how the trainees of this program approached the bank's tasks.

Traditionally, in order to understand and formalize a process, it is necessary to:

  • conduct an interview with employees;
  • analyze available reports and documentation.

In the Process Mining approach, a digital process model is formed on the basis of not only the expert opinion of the process participants, but also relevant data from information systems.

As a result, we get an objective digital model of the process, which is a reflection of the movement of real data in the IT systems of the process. The resulting model works in real time and allows you to display the current state of the process with the necessary degree of detail.

In our previous article we talked about our Process Mining platform and the real tasks of the bank that are solved with its help. The implemented solution allowed us to significantly reduce the time required to prepare mandatory reports for government agencies, helped to identify and optimize process imperfections, and made it possible to present the current status of in-progress purchases on a daily basis.

Subsequently, our customers needed not only to qualitatively determine the current state of the process, but also to predict its future states.

Next, we will describe step by step how we solved the problem of predicting the duration of the procurement process (using the BPI Challenge 2019 dataset as an example) from a set of already known events, using a high-performance DGX Station kindly provided to us by NVIDIA for research.

Machine Learning Application for Process Mining


To solve the problem, we built a baseline using CatBoostRegressor and then developed a solution with a neural network and embeddings of the categorical variables.

Because the source data contains both categorical and real-valued features, it was decided to use boosting, which can handle categorical features without encoding and works on mixed discrete and real-valued input.

The neural network was used to build fully real-valued features and solve the problem on an entirely continuous input, and then to compare the two approaches and decide whether the network is worth the effort.

Data Description


It was decided to use external data from a similar business domain with a comparable set of attributes. The BPI Challenge 2019 dataset we used includes 250 thousand cases, which amounts to 1.5 million events. The initial data are described by a set of 21 features: 18 categorical (including index features), two Boolean, and one real-valued. The execution time of the procurement process was chosen as the target variable, which matched the real needs of the business. For a detailed description of the features, you can refer to the description of the dataset.

Baseline


Before model training, the data were divided into training (train) and test (test) samples in a ratio of 0.8 / 0.2. The split was made not by events but by cases, so that all events of a case end up in the same sample.
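
A minimal sketch of such a case-level split (pandas and scikit-learn assumed; the file name and the column name case_id are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Event log: one row per event; `case_id` identifies the case the event belongs to.
log = pd.read_csv("bpi_challenge_2019.csv")  # hypothetical file name

# Split by cases, not by events, so that all events of a case
# end up in the same sample.
case_ids = log["case_id"].unique()
train_cases, test_cases = train_test_split(case_ids, test_size=0.2, random_state=42)

train = log[log["case_id"].isin(train_cases)]
test = log[log["case_id"].isin(test_cases)]
```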

To determine how justified it is to use a more complex solution in the form of a neural network, a baseline was built using CatBoost, an advanced library for gradient boosting on decision trees. To build the baseline, minimal data preprocessing was carried out (categorical features were encoded with their frequency in the data), the target variable (case duration) was constructed, and a number of new features were derived (in addition to those already present in the original dataset):

  • Time features derived from the event timestamp (year, month, day, day of week, etc.).
  • Exponential Moving Average (EMA) of the event cost: the EMA captures the trend in the cost of activities within a case (see the sketch after this list).
  • Aggregate statistics over the events known so far (for example, the number of events and the sum of event costs).
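
For example, the EMA feature can be computed per case with pandas; this is only a sketch, and the column names case_id, timestamp, and cumulative_net_worth are hypothetical:

```python
# EMA of the event cost within each case, to capture the trend
# in the cost of activities (span=5 is an arbitrary illustrative value).
log = log.sort_values(["case_id", "timestamp"])
log["cost_ema"] = (
    log.groupby("case_id")["cumulative_net_worth"]
       .transform(lambda s: s.ewm(span=5, adjust=False).mean())
)
```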

After training the CatBoostRegressor on the training set, we got the following result: MAE (Mean Absolute Error) = 17.5 days (that is, the predicted value of the target variable differs from the true value by 17.5 days on average). This result was used as the reference point for evaluating the neural network.

One of the important details here is the development of the target variable for baseline. 

Suppose we have a case. We denote it c_i, from the set C (the set of all cases in our dataset). Each case is an ordered sequence of events, i.e. c_i = (e_0, ..., e_{n_i}), where n_i is the length of the i-th case. For each event we have a timestamp: the exact time it started. Using these timestamps, we can calculate the duration of the case (from its first to its last event). However, assigning such a target to every event, i.e. making the correspondence e_k ∈ c_i, e_k → t_i (where t_i is the duration of the i-th case), is not a good idea. First, similar (typical) events can occur in cases of very different durations. Second, we want to predict the duration of the case from a certain time-ordered subsequence of events (the motivation is that we do not know the whole sequence of events in advance, i.e. we do not know how the case will unfold, but we want to estimate the duration of the entire case from the events of this case that have already occurred).

Therefore, we break each case into time-ordered subsequences of length from one up to the length of the case and assign to each of them a target variable equal to the duration of the case from which the subsequence was obtained. That is, for every c_i ∈ C we build the correspondence c_i → {sub_c_j}_{j=1}^{n_i} (n_i, as before, is the length of the i-th case), where len(sub_c_j) = j.

Thus, we break each case into subsequences and assign the duration of the entire case to each such subsequence.
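
A sketch of this expansion (assuming the event log is a pandas DataFrame with hypothetical case_id and timestamp columns, the latter already parsed to datetime):

```python
# Expand every case into its time-ordered prefixes and assign each prefix
# the duration of the whole case as the target (a sketch).
def case_to_prefixes(case_events):
    case_events = case_events.sort_values("timestamp")
    duration = (case_events["timestamp"].max()
                - case_events["timestamp"].min()).days  # duration of the whole case
    return [(case_events.iloc[:j], duration)            # (subsequence of length j, target)
            for j in range(1, len(case_events) + 1)]

samples = [p for _, case in log.groupby("case_id") for p in case_to_prefixes(case)]
```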

More about subsequences

As mentioned earlier, we break each case into subsequences and assign the duration of the case to each of them. We are going to use boosting, which requires a fixed-size input. So now we have X = {{sub_c_ik}_{k=1}^{n_i}}_{i=1}^{N}, where sub_c_ik is the k-th subsequence of the i-th case, n_i is the length of the i-th case, and N is the number of cases. That is, the data has shape [Σ_{i=1}^{N} n_i, sc, 17], where sc is a variable equal to the length of the subsequence of the corresponding case.

After encoding categorical variables by their frequency, we have real-valued and Boolean variables, as well as the encoded categorical ones (the index variables are not used in training). We can also average the values over a subsequence; for categorical features this gives the average frequency of the occurring categorical values, which can be considered a characteristic describing the aggregation of a subset of events in a case, that is, a characteristic describing the subsequence. We keep it and see what happens.

After averaging over the sc dimension, we obtain data of shape [Σ_{i=1}^{N} n_i, 17].

Model building

Based on the cases, we split train further into a new train and a validation sample, take a CatBoostRegressor with default parameters, fit it on the training sample, validate on the validation sample, pick the best iteration, and use MAE as the validation metric. On the test set we get the results shown below (the test set is prepared with the same pipeline the train was built with: all features are based only on the data available in the test, i.e. we have no features that leak the target variable. The only caveat: if a categorical feature in the test contains a value that we did not see in train, we compute the frequency of this value on the test and update the encoding dictionary).
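
In code, the baseline fit looks roughly like this (a sketch; the names X_train, X_val, X_test and y_train, y_val, y_test are assumed to hold the aggregated subsequence matrices and the case durations in days):

```python
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_error

# Baseline: CatBoost regressor validated by MAE, best iteration kept.
model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.1,
    eval_metric="MAE",
    use_best_model=True,
    verbose=100,
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))

print("Test MAE, days:", mean_absolute_error(y_test, model.predict(X_test)))
```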

Baseline results



Model parameters:

• Iterations: 500.
• Learning rate: 0.1.

Training parameters:

• Training time: less than 2 minutes.
• Hardware: Tesla K80 (Colab).

Results:

• Test MAE: 17.5 days.
• Average case duration in the test set: 66.3 days.

Neural network


Setup


To train the neural network, the data were further prepared: embeddings for categorical variables were built, and the distribution of the target variable was adjusted. Then the neural network was trained on the NVIDIA Tesla K80 (Google Colab) and on the NVIDIA DGX Station.

The following results were obtained:

  • Training time on NVIDIA K80 (Google Colab): 20 minutes.
  • Training time at NVIDIA DGX Station: 8 minutes.

The difference in training time is due to the difference in the technical characteristics of the GPUs used:

  • Google Colab: 1x NVIDIA Tesla K80 12 GB.
  • NVIDIA DGX Station: 4x NVIDIA Tesla V100 32 GB.

Preprocessing


New features

  • EMA of the event cost: we want to capture the trend in the cost of activities within each case.
  • Flow type: the dataset description mentions four types of purchase flows, characterized by some descriptive statistics of the purchase (event); these types are split across the values of two variables in the original dataset. We simply aggregate them back into one feature (if you look at the dataset description, it will be clear what we mean).

Categorical features

We simply encode the unique values of the categorical features with consecutive natural numbers, so that later we can train embeddings.

Embeddings for categorical variables

We determine the dimension of embeddings for each categorical variable:

  • For each categorical variable we limit the embedding dimension from above: MU_i = min(CAT_EMBEDDING_DIM; (len(unique_i) + 1) // 2), where CAT_EMBEDDING_DIM is a global cap on the embedding size and unique_i is the set of unique values of the i-th categorical feature.
  • To keep the dimension from becoming too small, the embedding size of the i-th feature is max(3; MU_i) + 1, where the extra 1 is a separate row reserved for values not seen in train (an unk value).
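
In code, the rule above might look as follows (a sketch; the value of CAT_EMBEDDING_DIM is illustrative, and cat_columns / train are assumed to exist):

```python
CAT_EMBEDDING_DIM = 50  # global cap on the embedding size (illustrative value)

def embedding_dim(n_unique: int) -> int:
    # MU_i = min(CAT_EMBEDDING_DIM, (len(unique_i) + 1) // 2),
    # then at least 3, plus one extra row for unseen ("unk") values.
    mu = min(CAT_EMBEDDING_DIM, (n_unique + 1) // 2)
    return max(3, mu) + 1

emb_dims = {col: embedding_dim(train[col].nunique()) for col in cat_columns}
```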

We adjust the distribution of the target on the train sample.

The initial distribution turned out to be heavily skewed because of outliers (cases that lasted 250 thousand days) and a large number of very short cases, so we compute the 0.05 and 0.95 percentiles and keep only the train data whose target lies between these thresholds.

After that, we still have cases that last about one day and about 100 days, i.e. the target variable spans several orders of magnitude. Therefore, the assumption of constant variance of the target variable around the model's prediction is hardly satisfied: the distribution of the target is close to normal, but its variance is not constant, since the target can be both less than 1 and more than 100. To at least partially mitigate this effect, we normalize the data.
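
A sketch of this target preprocessing (percentile thresholds plus mean/std normalization, as described above; y_train is assumed to be a numpy array of case durations):

```python
import numpy as np

# Keep only train cases whose duration lies between the 0.05 and 0.95 percentiles.
low, high = np.percentile(y_train, [5, 95])
y_train = y_train[(y_train >= low) & (y_train <= high)]

# Normalize with the mean and standard deviation (in the article the statistics
# are computed before the train/validation split, see the caveat below).
mu, sigma = y_train.mean(), y_train.std()
y_train_norm = (y_train - mu) / sigma
```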

The result is shown in the graph below (black line is the normal distribution).



Then we split our data by cases into train and validation. There is an obvious nuance here: we normalize the target with the mean and standard deviation computed over all the data and only then split into train and validation, so there is a slight leak into train; but since we are solving an auxiliary problem here, this leak does not seem critical.

Building features


Idea

  • We take only the categorical features from our train set, encoded with natural numbers.
  • We take not subsequences of cases but individual events; that is, one row in the data for the embeddings is one event described by its encoded categorical features.
  • Each categorical feature is passed through its own embedding layer, and the resulting embedding vectors of an event are concatenated into a single vector describing the event (a value unseen in train is mapped to the extra unk row of the corresponding embedding matrix).
  • This vector is fed into a network that solves an auxiliary regression problem: predicting the (normalized) duration of the case to which the event belongs.
  • The network is an 8-layer fully connected network with elu activations and regularization (including L2); the embedding matrices are trained jointly with it.
  • After training, the auxiliary network itself is discarded, and only the learned embedding matrices are kept.
  • Summary: the embeddings of the categorical variables are obtained as a by-product of an auxiliary supervised problem on single events.
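
A rough sketch of such an embedding-plus-MLP setup (PyTorch is used here purely for illustration; the layer width and other details are assumptions):

```python
import torch
import torch.nn as nn

class EventEmbeddingNet(nn.Module):
    """Auxiliary net: per-feature embeddings -> 8-layer ELU MLP -> normalized duration."""

    def __init__(self, emb_dims, n_hidden=256):
        # emb_dims: list of (cardinality, embedding_dim) per categorical feature
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card + 1, dim) for card, dim in emb_dims]  # +1 row for "unk"
        )
        in_dim = sum(dim for _, dim in emb_dims)
        layers = []
        for _ in range(8):                      # 8 fully connected layers with ELU
            layers += [nn.Linear(in_dim, n_hidden), nn.ELU()]
            in_dim = n_hidden
        layers.append(nn.Linear(n_hidden, 1))   # predicts the normalized case duration
        self.mlp = nn.Sequential(*layers)

    def forward(self, x_cat):
        # x_cat: [batch, n_categorical] integer-encoded feature values
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.mlp(torch.cat(embs, dim=1)).squeeze(-1)
```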


  • Batch size = 1000.
  • Learning rate = 3e-04.
  • Epochs = 15.
  • Hardware: Tesla K80 (Colab) and NVIDIA DGX Station.
  • Training time (Colab): 50 minutes.
  • Training time (NVIDIA DGX Station): 18 minutes.




Data preparation

Now we have embeddings for the categorical variables. There is a nuance here: we honestly took the unique values of the categorical variables on our train set (not the part set aside for training the embeddings, but the train set we split off at the very beginning), so there is a chance that the test data will contain values of categorical variables that we never saw in train, i.e. values for which we have no trained embedding.

For such values, a separate row is created in the embedding matrices, but in our case the problem is that this row is not involved during training and therefore is not learned. So if we encounter a previously unseen value of a categorical variable, we take this vector, although in fact it is simply drawn from the initializing distribution.

How to train this vector is one direction for improving the model. The idea is that very rare values in train could also be encoded with this vector: if a value first appears only in the test set, which makes up roughly 20% of the whole sample, then this value is rare and probably behaves the same way as rare values in train.

In each event, we replace the categorical variables with the corresponding embeddings and concatenate them with the real-valued and Boolean features, obtaining a matrix of size [N, F], where F is the sum of the embedding dimensions of the categorical variables plus the number of real-valued and Boolean features.

We group the events into subsequences (as before). The target variable for a subsequence is the duration of the case from which the subsequence was obtained. We also add the number of events and the sum of the event costs in this subsequence to the subsequence vector.

Now we have a matrix of fixed size that can be fed into the model (before that, we normalize the matrix).
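
A sketch of assembling one subsequence vector (embedding lookup, concatenation with the real-valued and Boolean features, averaging over events, plus the event count and cost sum; all names here are illustrative):

```python
import numpy as np

def subsequence_vector(events, emb_lookup, cat_cols, num_cols):
    # events: DataFrame with the events of one subsequence (a prefix of a case)
    # emb_lookup[col][value] -> embedding vector; unseen values fall back to "unk"
    rows = []
    for _, e in events.iterrows():
        parts = [emb_lookup[c].get(e[c], emb_lookup[c]["unk"]) for c in cat_cols]
        parts.append(e[num_cols].to_numpy(dtype=float))   # real-valued + Boolean features
        rows.append(np.concatenate(parts))
    vec = np.mean(rows, axis=0)                            # aggregate the events
    extra = [len(events), events["cumulative_net_worth"].sum()]  # event count + cost sum
    return np.concatenate([vec, extra])
```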

Parallelization method

  • We build a tower (a copy of the model) for each GPU.
  • At each step, the current parameters are shared between the towers.
  • The batch is split into equal parts, one part per GPU.
  • Each tower computes the forward pass and the gradients for its part of the batch on its own GPU.
  • The gradients from the towers are averaged, and the shared parameters are updated with the averaged gradients.
  • Then the next batch is processed in the same way.
  • The same scheme is reused for the other models we train (for example, the word2vec-style embeddings described below).
  • The main nuance: since each tower runs on its own GPU, the computation scales with the number of GPUs as long as the batch can be split evenly.
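
A minimal sketch of the same multi-tower idea (PyTorch is used here only for illustration; the article's implementation was written by hand, and make_model / loader are assumed to exist):

```python
import copy
import torch
import torch.nn as nn

devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
master = make_model().to(devices[0])          # shared parameters live on one device
optimizer = torch.optim.Adam(master.parameters(), lr=3e-4)
loss_fn = nn.L1Loss()                         # MAE, as in the text

for X, y in loader:
    optimizer.zero_grad()

    # 1. Distribute the current parameters: one tower (model copy) per GPU.
    towers = [copy.deepcopy(master).to(dev) for dev in devices]
    x_parts, y_parts = X.chunk(len(devices)), y.chunk(len(devices))

    # 2. Each tower computes its loss and gradients on its own chunk of the batch.
    for tower, xp, yp, dev in zip(towers, x_parts, y_parts, devices):
        loss_fn(tower(xp.to(dev)).squeeze(-1), yp.to(dev)).backward()

    # 3. Average the gradients from all towers into the master parameters.
    for p_master, *p_towers in zip(master.parameters(),
                                   *[t.parameters() for t in towers]):
        grads = [p.grad.to(devices[0]) for p in p_towers if p.grad is not None]
        if grads:
            p_master.grad = torch.stack(grads).mean(dim=0)

    # 4. Update the shared parameters with the averaged gradients.
    optimizer.step()
```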



  • Architecture: a 7-layer fully connected network with elu activations.
  • Batch size = 1000.
  • Learning rate = 3e-04.
  • Epochs = 15.
  • Hardware: Tesla K80 (Colab) and NVIDIA DGX Station.
  • Training time (Colab): 20 minutes.
  • Training time (NVIDIA DGX Station): 8 minutes.

A piece of the model graph.



Resource consumption and parallelization

Training the neural network on a CPU takes about four times as long as on the NVIDIA DGX Station. At first glance the difference seems insignificant: eight minutes on the NVIDIA DGX Station versus 32 minutes on the CPU. However, this is a small model on a small amount of data. In real projects, where there will be several times more cases and events, training on the CPU would take at least a week. In that case, using the NVIDIA DGX Station reduces the training time to two days, which greatly increases work efficiency.

It was also revealed that the speed of the learning process greatly depends on the number of GPUs used, which shows the advantage of NVIDIA DGX Station.

This is confirmed by earlier experiments on the CPU and GPU of the NVIDIA DGX Station using the original dataset without any preprocessing:

  • Training time on the CPU: 6 minutes 18 seconds.
  • Training time on the GPU: 34 seconds.

GPU load visualization

CPU load visualization



Neural network results




  • Test MAE = 10 days.
  • The average duration of the case on the test = 67 days.
  • Inference time = 20 seconds.

Findings


  • We implemented a pilot to evaluate machine learning methods in the context of Process Mining tasks.
  • We tested and expanded the list of our tools with which we will solve problems that are important for business.
  • One of the interesting results was writing our own implementation of parallel computing on the 4 Tesla V100 cards the DGX Station is equipped with: using several GPUs speeds up training almost linearly with the number of GPUs (the code is parallelized).
  • Switching to a fully continuous input and using a neural network made it possible to improve on the baseline by about a week of MAE.
  • Training time grows from a few minutes to about an hour and a half (training the final architecture plus the embeddings; however, the embeddings can be reused pre-trained, which cuts the time to 20 minutes).

The described experiments show that in the field of Process mining, machine and deep learning algorithms can be successfully applied. In addition, it was revealed that the speed of the learning process greatly depends on the number of GPUs used, which shows the advantage of NVIDIA DGX Station.

What and how can be improved


Word2vec-style embeddings for events

When we made our model, including embeddings for categorical variables, we did not take into account the sequence of events relative to each other, that is, the peculiar semantics of events inside cases. To learn something useful from the order of events inside the cases, you need to train embeddings for these events.

Idea

  • We take one categorical feature and one real-valued feature and split the real-valued one into buckets; then each transaction is characterized by the value of the categorical variable and the bucket into which the value of the real variable falls. We combine these two values and get, in effect, an analogue of a word for the event.
  • We consider the case as a sentence (the set of words in the sentence corresponds to the set of events in the case).
  • We build a vocabulary of such words over the train set, treating each case as a sequence of words.
  • We train embeddings for these words with a Skip-gram or CBOW model.
  • The learned event embeddings are then aggregated over a subsequence and added to its feature vector.
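
A sketch of building such "words" and training embeddings for them (the article trains its own word2vec-style model; here pandas plus gensim 4.x are used only for illustration, and the column names activity, cumulative_net_worth, case_id, and timestamp are hypothetical):

```python
import pandas as pd
from gensim.models import Word2Vec

# "word" = categorical value + bucket of the real-valued feature
log["cost_bucket"] = pd.qcut(log["cumulative_net_worth"], q=10,
                             labels=False, duplicates="drop")
log["event_word"] = log["activity"].astype(str) + "_" + log["cost_bucket"].astype(str)

# a case is a "sentence": the time-ordered list of its event words
sentences = (log.sort_values("timestamp")
                .groupby("case_id")["event_word"]
                .apply(list)
                .tolist())

# Skip-gram embeddings for events (parameters roughly as in the text)
w2v = Word2Vec(sentences, sg=1, window=5, vector_size=20, min_count=1, epochs=10)
event_vectors = {w: w2v.wv[w] for w in w2v.wv.index_to_key}
```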



  • Model: Skip-gram.
  • Window = 5.

  • Batch size = 1000.
  • Learning rate = 3e-04.
  • Epochs = 10.
  • Hardware: Tesla K80 (Colab) and NVIDIA DGX Station.
  • Training time (Colab): 20 minutes.
  • Training time (NVIDIA DGX Station): 8 minutes.
  • Test MAE: 10 days.

Plot



Using the features from these embeddings gives an improvement of a couple of tenths of a day.

In the end

  • The embeddings, of course, turned out to be undertrained, because they were trained only briefly.
  • There are about 290 features from the categorical embeddings and only 20 features from the semantic embeddings (using more makes no sense because the vocabulary is small), so the influence of these semantic features can be washed out by the imbalance in the feature proportions.
  • The semantics of the event order still needs to be added to the training set somehow: since the sequences of events (cases) are ordered, the order matters, and information can be extracted from it.
  • You can use more sophisticated architectures for embeddings.
