Configuring the loss function for a neural network based on seismic data

In a previous article, we described an experiment to determine the minimum number of manually labeled sections needed to train a neural network on seismic data. Today we continue this topic by choosing the most appropriate loss function.

Two basic classes of functions are considered, binary cross-entropy and Intersection over Union, in 6 variants with parameter selection, as well as combinations of functions of different classes. Regularization of the loss function is considered as well.

Spoiler: we managed to significantly improve the quality of the network forecast.



Business Research Goals


We will not repeat the description of the specifics of seismic surveying, the data obtained, and the tasks of their interpretation: all of this is described in our previous article.

The idea of this study was prompted by the results of a competition on finding salt deposits in 2D slices. According to the participants, a whole zoo of loss functions was used to solve that problem, with varying degrees of success.

Therefore, we asked ourselves: for such problems on such data, can the choice of loss function really give a significant gain in quality? Or is this characteristic only of competition conditions, where the fight goes on for the fourth or fifth decimal place of metrics predefined by the organizers?

Typically, in tasks solved with neural networks, tuning of the training process relies mainly on the researcher's experience and a few heuristics. For image segmentation problems, for example, the most commonly used loss functions are based on assessing the overlap between the shapes of recognized zones, the so-called Intersection over Union family.

Intuitively, and judging by their behavior and published research results, functions of this type should give better results than functions not tailored to images, such as the cross-entropy ones. Nevertheless, experiments in search of the best option, both for this type of task as a whole and for each task individually, continue.

The seismic data prepared for interpretation have a number of features that can significantly affect the behavior of the loss function. For example, the horizons separating geological layers are smooth, changing sharply only at faults. In addition, the distinguished zones occupy a fairly large area relative to the image, i.e. small spots in the interpretation results are most often recognition errors.

As part of this experiment, we tried to find answers to the following local questions:

  1. Does a loss function of the Intersection over Union class really give the best result for the problem considered below? The answer seems obvious, but which function exactly? And how much better is it from a business point of view?
  2. Is it possible to improve the results by combining functions of different classes? For example, Intersection over Union and cross-entropy with different weights.
  3. Is it possible to improve the results by adding to the loss function various additions designed specifically for seismic data?

And to a more global question:

Is it worth bothering with the selection of the loss function for seismic data interpretation tasks at all, or is the gain in quality not worth the time spent on such studies? Maybe it is better to pick some function intuitively and spend the effort on tuning more significant training parameters?

General description of the experiment and the data used


For the experiment, we took the same task of isolating geological layers on 2D slices of a seismic cube (see Figure 1).


Figure 1. Example of a 2D slice (left) and the corresponding labeling of geological layers (right) (source)

And the same set of completely labeled data from the Dutch sector of the North Sea. The source seismic data are available on the Open Seismic Repository, Project Netherlands Offshore F3 Block website. A brief description can be found in Silva et al., "Netherlands Dataset: A New Public Dataset for Machine Learning in Seismic Interpretation" [1].

Since in our case we are talking about 2D slices, we did not use the original 3D cube but the ready-made “slicing” available here: Netherlands F3 Interpretation Dataset.

During the experiment, we solved the following problems:

  1. We looked through the source data and selected the slices closest in quality to manual labeling (similar to the previous experiment).
  2. We fixed the neural network architecture, the training methodology and parameters, and the principle of selecting slices for training and validation (similar to the previous experiment).
  3. We chose the studied loss functions.
  4. We selected the best parameters for the parameterized loss functions.
  5. We trained neural networks with different functions on the same data volume and chose the best function.
  6. We trained neural networks with different combinations of the selected function with functions of another class on the same amount of data.
  7. We trained neural networks with regularization of the selected function on the same amount of data.

For comparison, we used the results of the previous experiment, in which the loss function was chosen purely intuitively and was a combination of functions of different classes with coefficients also chosen “by eye”.

The results of this experiment, in the form of evaluation metrics and the slice masks predicted by the networks, are presented below.

Task 1. Data selection


As initial data, we used ready-made inlines and crosslines of a seismic cube from the Dutch sector of the North Sea. As in the previous experiment, simulating the work of an interpreter, we reviewed all the slices and chose only clean masks for training the network. As a result, 700 crosslines and 400 inlines were selected out of ~1600 source images.

Task 2. Fixing the parameters of the experiment


This and the following sections are of interest primarily to Data Science specialists, so the corresponding terminology will be used.

For training, we chose 5% of the total number of slices, with inlines and crosslines in equal shares, i.e. 40 + 40. The slices were selected evenly throughout the cube. For validation, 1 slice between each pair of adjacent training images was used; thus, the validation sample consisted of 39 inlines and 39 crosslines.

The hold-out sample, on which the results were compared, contained the remaining 321 inlines and 621 crosslines.
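
As a rough NumPy sketch of such an even split (the variable names are ours, purely for illustration):

```python
import numpy as np

n_inlines = 400                                    # clean inlines selected in Task 1
train_idx = np.linspace(0, n_inlines - 1, 40).round().astype(int)  # 40 evenly spread slices
val_idx = (train_idx[:-1] + train_idx[1:]) // 2    # 1 slice between neighbors -> 39 slices
holdout_idx = np.setdiff1d(np.arange(n_inlines),
                           np.concatenate([train_idx, val_idx]))
# len(holdout_idx) == 321 inlines; the same scheme on 700 crosslines gives 621
```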

As in the previous experiment, no image preprocessing was performed, and the same UNet architecture with the same training parameters was used.

The target slice masks were represented as binary cubes of dimension H×W×10, where the last dimension corresponds to the number of classes, and each value of the cube is 0 or 1 depending on whether the pixel belongs to the class of the corresponding layer.
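
A minimal sketch of such an encoding (a helper of our own, not from the original code):

```python
import numpy as np

def to_binary_cube(layer_mask: np.ndarray, n_classes: int = 10) -> np.ndarray:
    """Turn an HxW mask of layer indices into an HxWx10 binary target cube."""
    return (layer_mask[..., None] == np.arange(n_classes)).astype(np.float32)
```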

Each network forecast was a similar cube, each value of which is related to the probability that the given image pixel belongs to the class of the corresponding layer. In most cases, this value is turned into an actual probability by applying a sigmoid. However, this should not be done for all loss functions, so no activation was used on the last layer of the network; instead, the corresponding transformations were performed inside the loss functions themselves.

To reduce the influence of the random choice of initial weights on the results, the network was first trained for 1 epoch with binary cross-entropy as the loss function. All further training runs started from the weights thus obtained.

Task 3. The choice of loss functions


For the experiment, 2 basic classes of functions were selected, in 6 variants:

Binary cross entropy:

  • binary cross entropy;
  • weighted binary cross entropy;
  • balanced binary cross entropy.

Intersection over Union:

  • Jaccard loss;
  • Tversky loss;
  • Lovász loss.

A brief description of the listed functions, with Keras code, is given in [2]. Below we give the most important points, with links (where possible) to a detailed description of each function.

For our experiment, it is important that the function used during training be consistent with the metric by which we evaluate the network forecast on the hold-out sample. Therefore, we used our own code implemented in TensorFlow and NumPy, written directly from the formulas below.

The following notation is used in the formulas:

  • p_t - the binary target mask (Ground Truth);
  • p_p - the network prediction mask.

For all functions, unless otherwise specified, it is assumed that the network prediction mask contains per-pixel probabilities, i.e. values in the interval (0, 1).

Binary cross entropy


Description: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

$$\mathrm{BCE} = -\big(p_t \log(p_p) + (1 - p_t)\log(1 - p_p)\big)$$

This function pulls the distribution of the network forecast toward the target one, penalizing not only erroneous but also uncertain predictions.
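
For illustration, a minimal TensorFlow sketch of this formula, with the sigmoid applied inside the loss as described above (the naming is ours, not the original code):

```python
import tensorflow as tf

def bce_loss(y_true, logits):
    """Binary cross-entropy computed from raw network outputs (logits)."""
    eps = 1e-7                                        # guard against log(0)
    p = tf.clip_by_value(tf.sigmoid(logits), eps, 1.0 - eps)
    return -tf.reduce_mean(y_true * tf.math.log(p)
                           + (1.0 - y_true) * tf.math.log(1.0 - p))
```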

Weighted binary cross entropy


$$\mathrm{WBCE} = -\big(\beta \, p_t \log(p_p) + (1 - p_t)\log(1 - p_p)\big)$$

This function coincides with binary cross-entropy at beta = 1. It is recommended for strong class imbalance. For beta > 1 the number of false negatives decreases and recall increases; for beta < 1 the number of false positives decreases and precision increases.
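
A corresponding sketch, continuing the one above (beta = 2 is the value selected in Task 4 below):

```python
def weighted_bce_loss(y_true, logits, beta=2.0):
    """Weighted BCE: beta scales the penalty on the positive class."""
    eps = 1e-7
    p = tf.clip_by_value(tf.sigmoid(logits), eps, 1.0 - eps)
    return -tf.reduce_mean(beta * y_true * tf.math.log(p)
                           + (1.0 - y_true) * tf.math.log(1.0 - p))
```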

Balanced binary cross entropy


$$\mathrm{BBCE} = -\big(\beta \, p_t \log(p_p) + (1 - \beta)(1 - p_t)\log(1 - p_p)\big)$$

This function is similar to weighted cross-entropy, but it corrects the contribution not only of the one-valued but also of the zero-valued pixels of the target mask. It coincides (up to a constant) with binary cross-entropy at beta = 0.5.
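
A corresponding sketch (beta = 0.7 is the value selected in Task 4 below):

```python
def balanced_bce_loss(y_true, logits, beta=0.7):
    """Balanced BCE: beta weights the positive class, (1 - beta) the negative one."""
    eps = 1e-7
    p = tf.clip_by_value(tf.sigmoid(logits), eps, 1.0 - eps)
    return -tf.reduce_mean(beta * y_true * tf.math.log(p)
                           + (1.0 - beta) * (1.0 - y_true) * tf.math.log(1.0 - p))
```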

Jaccard loss


The Jaccard coefficient (aka Intersection over Union, IoU) measures the “similarity” of two areas. The Dice index does essentially the same:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad D(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

It makes no sense to consider both of these functions; we chose the Jaccard one.

For the case when both areas are given by binary masks, the above formula is easily rewritten in terms of the mask values:

$$J = \frac{\sum p_t \, p_p}{\sum p_t + \sum p_p - \sum p_t \, p_p}$$

For non-binary forecasts, optimization of the Jaccard coefficient is a non-trivial task. We use the same formula on the probabilities of the forecast mask as an imitation of the original coefficient, which gives the following loss function:

$$\mathrm{JaccardLoss} = 1 - \frac{\sum p_t \, p_p}{\sum p_t + \sum p_p - \sum p_t \, p_p}$$
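
A sketch of this loss; the smoothing constant is our addition to avoid division by zero on empty masks:

```python
def jaccard_loss(y_true, logits, smooth=1.0):
    """Soft Jaccard loss computed on probability masks."""
    p = tf.sigmoid(logits)
    intersection = tf.reduce_sum(y_true * p)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(p) - intersection
    return 1.0 - (intersection + smooth) / (union + smooth)
```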



Tversky loss


Description: https://arxiv.org/pdf/1706.05721.pdf

$$T(\alpha, \beta) = \frac{\sum p_t \, p_p}{\sum p_t \, p_p + \alpha \sum p_p (1 - p_t) + \beta \sum p_t (1 - p_p)}$$

This function is a parameterized version of Jaccard-coefficient optimization: it coincides with it at alpha = beta = 1 and with the Dice index at alpha = beta = 0.5. Other non-zero, non-equal values shift the emphasis toward precision or recall, just as the weighted and balanced cross-entropy functions do.

The same emphasis shift can be expressed with a single coefficient lying in the interval (0, 1). The resulting loss function looks like this:

$$\mathrm{TverskyLoss} = 1 - \frac{\sum p_t \, p_p}{\sum p_t \, p_p + \beta \sum p_p (1 - p_t) + (1 - \beta) \sum p_t (1 - p_p)}$$
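
A sketch with the single-coefficient parameterization (beta = 0.7 is the value selected in Task 4 below; the epsilon is our addition):

```python
def tversky_loss(y_true, logits, beta=0.7):
    """Tversky loss: beta trades off false positives against false negatives."""
    p = tf.sigmoid(logits)
    tp = tf.reduce_sum(y_true * p)                 # soft true positives
    fp = tf.reduce_sum((1.0 - y_true) * p)         # soft false positives
    fn = tf.reduce_sum(y_true * (1.0 - p))         # soft false negatives
    return 1.0 - tp / (tp + beta * fp + (1.0 - beta) * fn + 1e-7)
```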



Lovász loss


It is difficult to write out a formula for this function, since it is a way of optimizing the Jaccard coefficient via an algorithm built on sorted errors.

A description of the function can be found in Berman et al. [5]; one of the code variants is published by the paper's authors.

Important explanation!


To simplify the comparison of values and graphs, hereinafter by the term "Jaccard coefficient" we will mean one minus the coefficient itself (so lower is better). Jaccard loss is one way of optimizing this quantity, along with Tversky loss and Lovász loss.

Task 4. Choosing the best parameters for parameterized loss functions


To select the best loss function on the same data set, an evaluation criterion is needed. As such, we chose the average/median number of connected components in the resulting masks. In addition, we used the Jaccard coefficient computed on forecast masks collapsed into a single layer by argmax and then split back into binarized layers.

The number of connected components (i.e., solid single-color spots) in each forecast is an indirect measure of how much subsequent refinement the interpreter will need to do. If this value is 10, the layers are identified correctly and at most a minor correction of the horizons is needed. If it is slightly larger, only small areas of the image need to be "cleaned". If it is substantially larger, everything is bad, and a complete re-interpretation may even be required.

The Jaccard coefficient, in turn, characterizes how well the image zones assigned to each class, and their boundaries, coincide.
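
Both criteria can be sketched in NumPy/SciPy terms roughly as follows (our own helper; scipy.ndimage.label counts the solid spots):

```python
import numpy as np
from scipy import ndimage

def evaluate_forecast(pred_probs, target, n_classes=10):
    """Both evaluation criteria for one slice: the number of connected
    components and (1 - Jaccard) averaged over the re-binarized layers."""
    class_map = np.argmax(pred_probs, axis=-1)     # HxWx10 probabilities -> HxW classes
    n_components, jaccards = 0, []
    for c in range(n_classes):
        layer = class_map == c
        _, n = ndimage.label(layer)                # connected components of this class
        n_components += n
        gt = target[..., c].astype(bool)
        union = np.logical_or(layer, gt).sum()
        if union > 0:
            jaccards.append(np.logical_and(layer, gt).sum() / union)
    return n_components, 1.0 - float(np.mean(jaccards))
```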

Weighted binary cross entropy


Based on the experimental results, the parameter value beta = 2 was selected:


Figure 2. Comparison of the quality of the network forecast by the main loss function and the selected criteria


Figure 3. Statistics of the number of connected components for a subset of the beta values

Balanced binary cross entropy


Based on the experimental results, the parameter value beta = 0.7 was chosen:


Figure 4. Comparison of the quality of the network forecast by the main loss function and the selected criteria


Figure 5. Statistics for the number of connected components

Tversky loss


Based on the experimental results, the parameter value beta = 0.7 was chosen:


Figure 6. Comparison of the quality of the network forecast by the main loss function and the selected criteria


Figure 7. Statistics of the number of connected components

Two conclusions can be drawn from the above figures.

First, the selected criteria correlate fairly well with each other, i.e. the Jaccard coefficient is consistent with the estimate of the amount of necessary refinement. Second, the behavior of the cross-entropy loss functions differs quite noticeably from the behavior of the criteria, i.e. training with this category of functions alone, without additional evaluation of the results, is still not a good idea.

Task 5. Choosing the best loss function


Now let us compare the results shown by the six selected loss functions on the same data set. For completeness, we added the forecasts of the network obtained in the previous experiment.


Figure 8. Comparison of the forecasts of networks trained with different loss functions, by the selected criteria

Table 1. Mean values of the criteria




Figure 9. Comparison of network forecasts by the number of predictions with the indicated number of connected components

The presented diagrams and tables suggest the following conclusions regarding the use of "solo" loss functions:

  • In our case, the "Jaccard" functions of the Intersection over Union class really do show better values than the cross-entropy ones. Moreover, significantly better.
  • The best results were shown by Lovász loss.

Let us visually compare the forecasts for the slices with one of the best and one of the worst values of Lovász loss and of the number of connected components. The target mask is shown in the upper right corner; the forecast obtained in the previous experiment is in the lower right:


Figure 10. Network forecasts for one of the best slices


Figure 11. Network forecasts for one of the worst slices

It can be seen that all the networks perform equally well on easily recognizable slices. But even on a poorly recognizable slice, where all the networks are wrong, the forecast of the Lovász-loss network looks visually better than those of the other networks, even though this slice has one of the worst loss values for this function.

So, at this step we have a clear leader, Lovász loss, whose results can be described as follows:

  • about 60% of forecasts are close to ideal, i.e. require at most adjustments of individual sections of the horizons;
  • approximately 30% of forecasts contain no more than 2 extra spots, i.e. require minor refinement;
  • approximately 1% of forecasts contain from 10 to 25 extra spots, i.e. require substantial refinement.

At this step, merely by replacing the loss function, we achieved a significant improvement of the results compared to the previous experiment.

Can it be improved further by combining different functions? Let's check.

Task 6. Choosing the best combination of loss functions


Combining loss functions of different natures is done quite often; however, finding the best combination is not easy. A good illustration is the result of the previous experiment, which turned out even worse than a "solo" function. The purpose of such combinations is to improve the result by optimizing according to different principles at once.

Let us try different combinations of the function selected at the previous step with other functions, though not with all of them in a row. We confine ourselves to combinations with functions of a different type, in this case the cross-entropy ones; considering combinations of functions of the same type makes no sense.

In total, we checked 3 pairs with 9 coefficient combinations each (from 0.1/0.9 to 0.9/0.1). In the figures below, the X axis shows the coefficient before Lovász loss; the coefficient before the second function equals one minus it. The leftmost point corresponds to the cross-entropy function alone, the rightmost to Lovász loss alone.
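
A sketch of such a weighted combination, assuming some lovasz_loss implementation is available (e.g., the one published by the authors of [5]):

```python
def combined_loss(y_true, logits, alpha=0.5):
    """Weighted sum of Lovász loss and BCE; alpha is the coefficient
    before the Lovász term, (1 - alpha) before the cross-entropy term."""
    return (alpha * lovasz_loss(y_true, logits)          # hypothetical implementation
            + (1.0 - alpha) * bce_loss(y_true, logits))  # bce_loss sketched above
```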


Figure 12. Evaluation of the forecast results of networks trained on BCE + Lovász


Figure 13. Evaluation of the forecast results of networks trained on BCE + Lovász

It can be seen that the selected "solo" function was not improved by adding cross-entropy. An improvement of some Jaccard-coefficient values by 1-2 thousandths may matter in a competitive environment, but from the business point of view it does not compensate for the deterioration of the connected-components criterion.

To check whether this behavior is typical of combinations of functions of different types, we ran a similar series of trainings for Jaccard loss. For only one pair did we manage to slightly improve both criteria simultaneously:

0.8 * JaccardLoss + 0.2 * BBCE
Average number of connected components: 11.5695 -> 11.2895
Average Jaccard coefficient: 0.0307 -> 0.0283

But even these values are worse than those of the "solo" Lovász loss.

Thus, on our data, investigating combinations of functions of different natures makes sense only under competition conditions, or given spare time and resources; a significant gain in quality is unlikely.

Task 7. Regularization of the best loss function


At this step, we tried to improve the previously selected loss function with an addition designed specifically for seismic data: the regularization described in Peters et al., "Neural-networks for geophysicists and their application to seismic data interpretation" [6].

That article notes that standard regularizations such as weight decay do not work well on seismic data. Instead, an approach based on the norm of the gradient matrix is proposed, aimed at smoothing the class boundaries. The approach is logical if we recall that the boundaries of geological layers should be smooth.
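
Our reading of this approach can be sketched as follows (not the authors' exact formulation): penalize the spatial gradients of the predicted probability mask, scaled by a coefficient:

```python
def gradient_norm_penalty(logits):
    """Boundary-smoothing term: the mean squared norm of the spatial
    gradients of the predicted probability cube (NxHxWxC)."""
    p = tf.sigmoid(logits)
    dy, dx = tf.image.image_gradients(p)   # finite differences along H and W
    return tf.reduce_mean(tf.square(dy) + tf.square(dx))

def regularized_loss(y_true, logits, lam=0.025):
    """Lovász loss plus the smoothing penalty; lam = 0.025 performed best for us."""
    return lovasz_loss(y_true, logits) + lam * gradient_norm_penalty(logits)
```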

However, with such regularization one should expect some deterioration by the Jaccard criterion, since smoothed class boundaries are less likely to coincide with the possibly abrupt transitions of the manual labeling. But we have a second criterion for verification, the number of connected components.

We trained 13 networks with the regularization described in the article, with the coefficient before it taking values from 0.1 down to 0.0001. The figures below show some of the evaluations by both criteria.


Figure 15. Comparison of the quality of the network forecast by the selected criteria


Figure 16. Statistics of the number of connected components for different values of the regularization coefficient

It can be seen that regularization with a coefficient of 0.025 significantly improved the statistics of the connected-components criterion. The Jaccard criterion in this case expectedly grew, to 0.0357; however, this increase is slight compared to the reduction in manual refinement.


Figure 17. Comparison of network forecasts by the number of predictions with the specified number of connected components.

Finally, let us compare the class boundaries on the target and predicted masks for the previously selected worst slice.


Figure 18. The network forecast for one of the worst slices.


Figure 19. Overlaying part of the horizon of the target mask and forecast

As the figures show, the forecast mask is, of course, mistaken in places, but at the same time it smooths out the oscillations of the target horizons, i.e. it corrects minor errors of the initial labeling.

Summary characteristics of the selected loss function with regularization:

  • about 87% of forecasts are close to ideal, i.e. require no more than adjustments to individual sections of the horizons;
  • approximately 10% of forecasts contain 1 extra spot, i.e. require minor improvements;
  • about 3% of forecasts contain from 2 to 5 extra spots, i.e. require a little more substantial refinement.

Findings


  • Merely by tuning one training parameter, the loss function, we were able to significantly improve the quality of the network forecast and reduce the amount of necessary refinement by about three times.
  • The best results were shown by a function of the Intersection over Union class, Lovász loss. Combining it with cross-entropy functions did not yield an improvement.
  • A further significant gain came from regularization designed specifically for seismic data, which smooths the class boundaries, i.e. reduces the number of extra spots and even corrects minor labeling errors.

Sources:


  1. Reinaldo Mozart Silva, Lais Baroni, Rodrigo S. Ferreira, Daniel Civitarese, Daniela Szwarcman, Emilio Vital Brazil. Netherlands Dataset: A New Public Dataset for Machine Learning in Seismic Interpretation
  2. Lars Nieradzik. Losses for Image Segmentation
  3. Daniel Godoy. Understanding binary cross-entropy / log loss: a visual explanation
  4. Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Tversky loss function for image segmentation using 3D fully convolutional deep networks
  5. Maxim Berman, Amal Rannen Triki, Matthew B. Blaschko. The Lovasz-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks
  6. Bas Peters, Eldad Haber, and Justin Granek. Neural-networks for geophysicists and their application to seismic data interpretation
