Configuring the loss function for a neural network based on seismic data

In a previous article, we described an experiment to determine the minimum number of manually labeled sections needed to train a neural network on seismic data. Today we continue this topic by choosing the most appropriate loss function.

Two basic classes of functions are considered, binary cross-entropy and Intersection over Union, in 6 variants with parameter selection, as well as combinations of functions of different classes. Regularization of the loss function is considered as well.

Spoiler: we managed to significantly improve the quality of the network forecast.



Business Research Goals


We will not repeat the description of the specifics of seismic surveying, the data obtained, and the tasks of their interpretation: all of this is described in our previous article.

The idea of this study was prompted by the results of a competition on finding salt deposits in 2D slices. According to the participants, a whole zoo of loss functions was used to solve that problem, with varying degrees of success.

Therefore, we asked ourselves: for such problems on such data, can the choice of loss function really give a significant gain in quality? Or is this characteristic only of competition conditions, where the fight goes on for the fourth or fifth decimal place of metrics predefined by the organizers?

Typically, in tasks solved with neural networks, tuning of the training process relies mainly on the researcher's experience and a few heuristics. For image segmentation problems, for example, the most commonly used loss functions are based on assessing the overlap between the shapes of recognized zones, the so-called Intersection over Union family.

Intuitively, and judging by their behavior and published research results, functions of this type should give better results than functions not tailored to images, such as the cross-entropy ones. Nevertheless, experiments in search of the best option, both for this type of task as a whole and for each task individually, continue.

The seismic data prepared for interpretation have a number of features that can significantly affect the behavior of the loss function. For example, the horizons separating geological layers are smooth, changing sharply only at faults. In addition, the distinguished zones occupy a fairly large area relative to the image, i.e. small spots in the interpretation results are most often recognition errors.

As part of this experiment, we tried to find answers to the following local questions:

  1. Does a loss function of the Intersection over Union class really give the best result for the problem considered below? The answer seems obvious, but which function exactly? And how much better is it from a business point of view?
  2. Is it possible to improve the results by combining functions of different classes? For example, Intersection over Union and cross-entropy with different weights.
  3. Is it possible to improve the results by adding to the loss function various additions designed specifically for seismic data?

And to a more global question:

Is it worth bothering with the selection of the loss function for seismic data interpretation tasks at all, or is the gain in quality not worth the time spent on such studies? Maybe it is better to pick some function intuitively and spend the effort on tuning more significant training parameters?

General description of the experiment and the data used


For the experiment, we took the same task of isolating geological layers on 2D slices of a seismic cube (see Figure 1).


Figure 1. Example of a 2D slice (left) and the corresponding labeling of geological layers (right) (source)

And the same set of completely labeled data from the Dutch sector of the North Sea. The source seismic data are available on the Open Seismic Repository, Project Netherlands Offshore F3 Block website. A brief description can be found in Silva et al., "Netherlands Dataset: A New Public Dataset for Machine Learning in Seismic Interpretation" [1].

Since in our case we are talking about 2D slices, we did not use the original 3D cube but the ready-made “slicing” available here: Netherlands F3 Interpretation Dataset.

During the experiment, we solved the following problems:

  1. We looked through the source data and selected the slices closest in quality to manual labeling (similar to the previous experiment).
  2. We fixed the neural network architecture, the training methodology and parameters, and the principle of selecting slices for training and validation (similar to the previous experiment).
  3. We chose the studied loss functions.
  4. We selected the best parameters for the parameterized loss functions.
  5. We trained neural networks with different functions on the same data volume and chose the best function.
  6. We trained neural networks with different combinations of the selected function with functions of another class on the same amount of data.
  7. We trained neural networks with regularization of the selected function on the same amount of data.

For comparison, we used the results of the previous experiment, in which the loss function was chosen purely intuitively and was a combination of functions of different classes with coefficients also chosen “by eye”.

The results of this experiment, in the form of evaluation metrics and the slice masks predicted by the networks, are presented below.

Task 1. Data selection


As initial data, we used ready-made inlines and crosslines of a seismic cube from the Dutch sector of the North Sea. As in the previous experiment, simulating the work of an interpreter, we reviewed all the slices and chose only clean masks for training the network. As a result, 700 crosslines and 400 inlines were selected out of ~1600 source images.

Task 2. Fixing the parameters of the experiment


This and the following sections are of interest primarily to Data Science specialists, so the corresponding terminology will be used.

For training, we chose 5% of the total number of slices, with inlines and crosslines in equal shares, i.e. 40 + 40. The slices were selected evenly throughout the cube. For validation, 1 slice between each pair of adjacent training images was used; thus, the validation sample consisted of 39 inlines and 39 crosslines.

The hold-out sample, on which the results were compared, contained the remaining 321 inlines and 621 crosslines.
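
As a rough NumPy sketch of such an even split (the variable names are ours, purely for illustration):

```python
import numpy as np

n_inlines = 400                                    # clean inlines selected in Task 1
train_idx = np.linspace(0, n_inlines - 1, 40).round().astype(int)  # 40 evenly spread slices
val_idx = (train_idx[:-1] + train_idx[1:]) // 2    # 1 slice between neighbors -> 39 slices
holdout_idx = np.setdiff1d(np.arange(n_inlines),
                           np.concatenate([train_idx, val_idx]))
# len(holdout_idx) == 321 inlines; the same scheme on 700 crosslines gives 621
```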

As in the previous experiment, no image preprocessing was performed, and the same UNet architecture with the same training parameters was used.

The target slice masks were represented as binary cubes of dimension H×W×10, where the last dimension corresponds to the number of classes, and each value of the cube is 0 or 1 depending on whether the pixel belongs to the class of the corresponding layer.
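
A minimal sketch of such an encoding (a helper of our own, not from the original code):

```python
import numpy as np

def to_binary_cube(layer_mask: np.ndarray, n_classes: int = 10) -> np.ndarray:
    """Turn an HxW mask of layer indices into an HxWx10 binary target cube."""
    return (layer_mask[..., None] == np.arange(n_classes)).astype(np.float32)
```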

Each network forecast was a similar cube, each value of which is related to the probability that the given image pixel belongs to the class of the corresponding layer. In most cases, this value is turned into an actual probability by applying a sigmoid. However, this should not be done for all loss functions, so no activation was used on the last layer of the network; instead, the corresponding transformations were performed inside the loss functions themselves.

To reduce the influence of the random choice of initial weights on the results, the network was first trained for 1 epoch with binary cross-entropy as the loss function. All further training runs started from the weights thus obtained.

Task 3. The choice of loss functions


For the experiment, 2 basic classes of functions were selected, in 6 variants:

Binary cross entropy:

  • binary cross entropy;
  • weighted binary cross entropy;
  • balanced binary cross entropy.

Intersection over Union:

  • Jaccard loss;
  • Tversky loss;
  • Lovász loss.

A brief description of the listed functions, with Keras code, is given in [2]. Below we give the most important points, with links (where possible) to a detailed description of each function.

For our experiment, it is important that the function used during training be consistent with the metric by which we evaluate the network forecast on the hold-out sample. Therefore, we used our own code implemented in TensorFlow and NumPy, written directly from the formulas below.

The following notation is used in the formulas:

  • p_t - the binary target mask (Ground Truth);
  • p_p - the network prediction mask.

For all functions, unless otherwise specified, it is assumed that the network prediction mask contains per-pixel probabilities, i.e. values in the interval (0, 1).

Binary cross entropy


Description: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

$$\mathrm{BCE} = -\big(p_t \log(p_p) + (1 - p_t)\log(1 - p_p)\big)$$

This function pulls the distribution of the network forecast toward the target one, penalizing not only erroneous but also uncertain predictions.
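
For illustration, a minimal TensorFlow sketch of this formula, with the sigmoid applied inside the loss as described above (the naming is ours, not the original code):

```python
import tensorflow as tf

def bce_loss(y_true, logits):
    """Binary cross-entropy computed from raw network outputs (logits)."""
    eps = 1e-7                                        # guard against log(0)
    p = tf.clip_by_value(tf.sigmoid(logits), eps, 1.0 - eps)
    return -tf.reduce_mean(y_true * tf.math.log(p)
                           + (1.0 - y_true) * tf.math.log(1.0 - p))
```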

Weighted binary cross entropy


$$\mathrm{WBCE} = -\big(\beta \, p_t \log(p_p) + (1 - p_t)\log(1 - p_p)\big)$$

This function coincides with binary cross-entropy at beta = 1. It is recommended for strong class imbalance. For beta > 1 the number of false negatives decreases and recall increases; for beta < 1 the number of false positives decreases and precision increases.
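
A corresponding sketch, continuing the one above (beta = 2 is the value selected in Task 4 below):

```python
def weighted_bce_loss(y_true, logits, beta=2.0):
    """Weighted BCE: beta scales the penalty on the positive class."""
    eps = 1e-7
    p = tf.clip_by_value(tf.sigmoid(logits), eps, 1.0 - eps)
    return -tf.reduce_mean(beta * y_true * tf.math.log(p)
                           + (1.0 - y_true) * tf.math.log(1.0 - p))
```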

Balanced binary cross entropy


$$\mathrm{BBCE} = -\big(\beta \, p_t \log(p_p) + (1 - \beta)(1 - p_t)\log(1 - p_p)\big)$$

This function is similar to weighted cross-entropy, but it corrects the contribution not only of the one-valued but also of the zero-valued pixels of the target mask. It coincides (up to a constant) with binary cross-entropy at beta = 0.5.
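
A corresponding sketch (beta = 0.7 is the value selected in Task 4 below):

```python
def balanced_bce_loss(y_true, logits, beta=0.7):
    """Balanced BCE: beta weights the positive class, (1 - beta) the negative one."""
    eps = 1e-7
    p = tf.clip_by_value(tf.sigmoid(logits), eps, 1.0 - eps)
    return -tf.reduce_mean(beta * y_true * tf.math.log(p)
                           + (1.0 - beta) * (1.0 - y_true) * tf.math.log(1.0 - p))
```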

Jaccard loss


The Jaccard coefficient (aka Intersection over Union, IoU) measures the “similarity” of two areas. The Dice index does essentially the same:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad D(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

It makes no sense to consider both of these functions; we chose the Jaccard one.

For the case when both areas are given by binary masks, the above formula is easily rewritten in terms of the mask values:

$$J = \frac{\sum p_t \, p_p}{\sum p_t + \sum p_p - \sum p_t \, p_p}$$

For non-binary forecasts, optimization of the Jaccard coefficient is a non-trivial task. We use the same formula on the probabilities of the forecast mask as an imitation of the original coefficient, which gives the following loss function:

$$\mathrm{JaccardLoss} = 1 - \frac{\sum p_t \, p_p}{\sum p_t + \sum p_p - \sum p_t \, p_p}$$
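
A sketch of this loss; the smoothing constant is our addition to avoid division by zero on empty masks:

```python
def jaccard_loss(y_true, logits, smooth=1.0):
    """Soft Jaccard loss computed on probability masks."""
    p = tf.sigmoid(logits)
    intersection = tf.reduce_sum(y_true * p)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(p) - intersection
    return 1.0 - (intersection + smooth) / (union + smooth)
```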



Tversky loss


Description: https://arxiv.org/pdf/1706.05721.pdf

$$T(\alpha, \beta) = \frac{\sum p_t \, p_p}{\sum p_t \, p_p + \alpha \sum p_p (1 - p_t) + \beta \sum p_t (1 - p_p)}$$

This function is a parameterized version of Jaccard-coefficient optimization: it coincides with it at alpha = beta = 1 and with the Dice index at alpha = beta = 0.5. Other non-zero, non-equal values shift the emphasis toward precision or recall, just as the weighted and balanced cross-entropy functions do.

The same emphasis shift can be expressed with a single coefficient lying in the interval (0, 1). The resulting loss function looks like this:

$$\mathrm{TverskyLoss} = 1 - \frac{\sum p_t \, p_p}{\sum p_t \, p_p + \beta \sum p_p (1 - p_t) + (1 - \beta) \sum p_t (1 - p_p)}$$
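
A sketch with the single-coefficient parameterization (beta = 0.7 is the value selected in Task 4 below; the epsilon is our addition):

```python
def tversky_loss(y_true, logits, beta=0.7):
    """Tversky loss: beta trades off false positives against false negatives."""
    p = tf.sigmoid(logits)
    tp = tf.reduce_sum(y_true * p)                 # soft true positives
    fp = tf.reduce_sum((1.0 - y_true) * p)         # soft false positives
    fn = tf.reduce_sum(y_true * (1.0 - p))         # soft false negatives
    return 1.0 - tp / (tp + beta * fp + (1.0 - beta) * fn + 1e-7)
```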



Lovász loss


It is difficult to write out a formula for this function, since it is a way of optimizing the Jaccard coefficient via an algorithm built on sorted errors.

A description of the function can be found in Berman et al. [5]; one of the code variants is published by the paper's authors.

Important explanation!


To simplify the comparison of values and graphs, hereinafter by the term "Jaccard coefficient" we will mean one minus the coefficient itself (so lower is better). Jaccard loss is one way of optimizing this quantity, along with Tversky loss and Lovász loss.

Task 4. Choosing the best parameters for parameterized loss functions


To select the best loss function on the same data set, an evaluation criterion is needed. As such, we chose the average/median number of connected components in the resulting masks. In addition, we used the Jaccard coefficient computed on forecast masks collapsed into a single layer by argmax and then split back into binarized layers.

The number of connected components (i.e., solid single-color spots) in each forecast is an indirect measure of how much subsequent refinement the interpreter will need to do. If this value is 10, the layers are identified correctly and at most a minor correction of the horizons is needed. If it is slightly larger, only small areas of the image need to be "cleaned". If it is substantially larger, everything is bad, and a complete re-interpretation may even be required.

The Jaccard coefficient, in turn, characterizes how well the image zones assigned to each class, and their boundaries, coincide.
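
Both criteria can be sketched in NumPy/SciPy terms roughly as follows (our own helper; scipy.ndimage.label counts the solid spots):

```python
import numpy as np
from scipy import ndimage

def evaluate_forecast(pred_probs, target, n_classes=10):
    """Both evaluation criteria for one slice: the number of connected
    components and (1 - Jaccard) averaged over the re-binarized layers."""
    class_map = np.argmax(pred_probs, axis=-1)     # HxWx10 probabilities -> HxW classes
    n_components, jaccards = 0, []
    for c in range(n_classes):
        layer = class_map == c
        _, n = ndimage.label(layer)                # connected components of this class
        n_components += n
        gt = target[..., c].astype(bool)
        union = np.logical_or(layer, gt).sum()
        if union > 0:
            jaccards.append(np.logical_and(layer, gt).sum() / union)
    return n_components, 1.0 - float(np.mean(jaccards))
```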

Weighted binary cross entropy


Based on the experimental results, the parameter value beta = 2 was selected:


Figure 2. Comparison of the quality of the network forecast by the main loss function and the selected criteria


Figure 3. Statistics of the number of connected components for a subset of the beta values

Balanced binary cross entropy


Based on the experimental results, the parameter value beta = 0.7 was chosen:


Figure 4. Comparison of the quality of the network forecast by the main loss function and the selected criteria


Figure 5. Statistics for the number of connected components

Tversky loss


Based on the experimental results, the parameter value beta = 0.7 was chosen:


Figure 6. Comparison of the quality of the network forecast by the main loss function and the selected criteria


Figure 7. Statistics of the number of connected components

Two conclusions can be drawn from the above figures.

First, the selected criteria correlate fairly well with each other, i.e. the Jaccard coefficient is consistent with the estimate of the amount of necessary refinement. Second, the behavior of the cross-entropy loss functions differs quite noticeably from the behavior of the criteria, i.e. training with this category of functions alone, without additional evaluation of the results, is still not a good idea.

Task 5. Choosing the best loss function


Now let us compare the results shown by the six selected loss functions on the same data set. For completeness, we added the forecasts of the network obtained in the previous experiment.


Figure 8. Comparison of the forecasts of networks trained with different loss functions, by the selected criteria

Table 1. Mean values of the criteria




Figure 9. Comparison of network forecasts by the number of predictions with the indicated number of connected components

The presented diagrams and tables suggest the following conclusions regarding the use of "solo" loss functions:

  • In our case, the "Jaccard" functions of the Intersection over Union class really do show better values than the cross-entropy ones. Moreover, significantly better.
  • The best results were shown by Lovász loss.

Let us visually compare the forecasts for the slices with one of the best and one of the worst values of Lovász loss and of the number of connected components. The target mask is shown in the upper right corner; the forecast obtained in the previous experiment is in the lower right:


Figure 10. Network forecasts for one of the best slices


Figure 11. Network forecasts for one of the worst slices

It can be seen that all the networks perform equally well on easily recognizable slices. But even on a poorly recognizable slice, where all the networks are wrong, the forecast of the Lovász-loss network looks visually better than those of the other networks, even though this slice has one of the worst loss values for this function.

So, at this step we have a clear leader, Lovász loss, whose results can be described as follows:

  • about 60% of forecasts are close to ideal, i.e. require at most adjustments of individual sections of the horizons;
  • approximately 30% of forecasts contain no more than 2 extra spots, i.e. require minor refinement;
  • approximately 1% of forecasts contain from 10 to 25 extra spots, i.e. require substantial refinement.

At this step, merely by replacing the loss function, we achieved a significant improvement of the results compared to the previous experiment.

Can it be improved further by combining different functions? Let's check.

Task 6. Choosing the best combination of loss functions


Combining loss functions of different natures is done quite often; however, finding the best combination is not easy. A good illustration is the result of the previous experiment, which turned out even worse than a "solo" function. The purpose of such combinations is to improve the result by optimizing according to different principles at once.

Let us try different combinations of the function selected at the previous step with other functions, though not with all of them in a row. We confine ourselves to combinations with functions of a different type, in this case the cross-entropy ones; considering combinations of functions of the same type makes no sense.

In total, we checked 3 pairs with 9 coefficient combinations each (from 0.1/0.9 to 0.9/0.1). In the figures below, the X axis shows the coefficient before Lovász loss; the coefficient before the second function equals one minus it. The leftmost point corresponds to the cross-entropy function alone, the rightmost to Lovász loss alone.
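
A sketch of such a weighted combination, assuming some lovasz_loss implementation is available (e.g., the one published by the authors of [5]):

```python
def combined_loss(y_true, logits, alpha=0.5):
    """Weighted sum of Lovász loss and BCE; alpha is the coefficient
    before the Lovász term, (1 - alpha) before the cross-entropy term."""
    return (alpha * lovasz_loss(y_true, logits)          # hypothetical implementation
            + (1.0 - alpha) * bce_loss(y_true, logits))  # bce_loss sketched above
```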


Figure 12. Evaluation of the forecast results of networks trained on BCE + Lovász


Figure 13. Evaluation of the forecast results of networks trained on BCE + Lovász

It can be seen that the selected "solo" function was not improved by adding cross-entropy. An improvement of some Jaccard-coefficient values by 1-2 thousandths may matter in a competitive environment, but from the business point of view it does not compensate for the deterioration of the connected-components criterion.

To check whether this behavior is typical of combinations of functions of different types, we ran a similar series of trainings for Jaccard loss. For only one pair did we manage to slightly improve both criteria simultaneously:

0.8 * JaccardLoss + 0.2 * BBCE
Average number of connected components: 11.5695 -> 11.2895
Average Jaccard coefficient: 0.0307 -> 0.0283

But even these values are worse than those of the "solo" Lovász loss.

Thus, on our data, investigating combinations of functions of different natures makes sense only under competition conditions, or given spare time and resources; a significant gain in quality is unlikely.

Task 7. Regularization of the best loss function


At this step, we tried to improve the previously selected loss function with an addition designed specifically for seismic data: the regularization described in Peters et al., "Neural-networks for geophysicists and their application to seismic data interpretation" [6].

That article notes that standard regularizations such as weight decay do not work well on seismic data. Instead, an approach based on the norm of the gradient matrix is proposed, aimed at smoothing the class boundaries. The approach is logical if we recall that the boundaries of geological layers should be smooth.
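
Our reading of this approach can be sketched as follows (not the authors' exact formulation): penalize the spatial gradients of the predicted probability mask, scaled by a coefficient:

```python
def gradient_norm_penalty(logits):
    """Boundary-smoothing term: the mean squared norm of the spatial
    gradients of the predicted probability cube (NxHxWxC)."""
    p = tf.sigmoid(logits)
    dy, dx = tf.image.image_gradients(p)   # finite differences along H and W
    return tf.reduce_mean(tf.square(dy) + tf.square(dx))

def regularized_loss(y_true, logits, lam=0.025):
    """Lovász loss plus the smoothing penalty; lam = 0.025 performed best for us."""
    return lovasz_loss(y_true, logits) + lam * gradient_norm_penalty(logits)
```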

However, with such regularization one should expect some deterioration by the Jaccard criterion, since smoothed class boundaries are less likely to coincide with the possibly abrupt transitions of the manual labeling. But we have a second criterion for verification, the number of connected components.

We trained 13 networks with the regularization described in the article, with the coefficient before it taking values from 0.1 down to 0.0001. The figures below show some of the evaluations by both criteria.


Figure 15. Comparison of the quality of the network forecast by the selected criteria


Figure 16. Statistics of the number of connected components for different values of the regularization coefficient

It can be seen that regularization with a coefficient of 0.025 significantly improved the statistics of the connected-components criterion. The Jaccard criterion in this case expectedly grew, to 0.0357; however, this increase is slight compared to the reduction in manual refinement.


Figure 17. Comparison of network forecasts by the number of predictions with the specified number of connected components.

Finally, let us compare the class boundaries on the target and predicted masks for the previously selected worst slice.


Figure 18. The network forecast for one of the worst slices.


Figure 19. Overlaying part of the horizon of the target mask and forecast

As the figures show, the forecast mask is, of course, mistaken in places, but at the same time it smooths out the oscillations of the target horizons, i.e. it corrects minor errors of the initial labeling.

Summary characteristics of the selected loss function with regularization:

  • about 87% of forecasts are close to ideal, i.e. require no more than adjustments to individual sections of the horizons;
  • approximately 10% of forecasts contain 1 extra spot, i.e. require minor improvements;
  • about 3% of forecasts contain from 2 to 5 extra spots, i.e. require a little more substantial refinement.

Findings


  • Merely by tuning one training parameter, the loss function, we were able to significantly improve the quality of the network forecast and reduce the amount of necessary refinement by about three times.
  • The best results were shown by a function of the Intersection over Union class, Lovász loss. Combining it with cross-entropy functions did not yield an improvement.
  • A further significant gain came from regularization designed specifically for seismic data, which smooths the class boundaries, i.e. reduces the number of extra spots and even corrects minor labeling errors.

Sources:


  1. Reinaldo Mozart Silva, Lais Baroni, Rodrigo S. Ferreira, Daniel Civitarese, Daniela Szwarcman, Emilio Vital Brazil. Netherlands Dataset: A New Public Dataset for Machine Learning in Seismic Interpretation
  2. Lars Nieradzik. Losses for Image Segmentation
  3. Daniel Godoy. Understanding binary cross-entropy / log loss: a visual explanation
  4. Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Tversky loss function for image segmentation using 3D fully convolutional deep networks
  5. Maxim Berman, Amal Rannen Triki, Matthew B. Blaschko. The Lovasz-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks
  6. Bas Peters, Eldad Haber, and Justin Granek. Neural-networks for geophysicists and their application to seismic data interpretation
