Turchin's statistical regularization of ill-posed inverse problems (part 1)

Hello, Habr! Today we want to tell you about the work of the Laboratory of Nuclear Physics Experiment Methods, part of JetBrains Research.

What does JetBrains have to do with nuclear physics, you may ask. We found common ground in our love for Kotlin, although this post is not about it. Our group works on data analysis methods, modeling, and software for scientists, so cooperation and knowledge sharing with IT companies is a natural fit.

In this article we want to talk about the method of statistical regularization, proposed by V. F. Turchin in the 1970s, which we are popularizing, and about its implementations in Python and Julia.

The presentation is quite detailed, so readers already familiar with inverse problems can skip straight to the examples and use the theory sections as a reference.


The origin of the problem: why regularize at all?


Simplifying somewhat, any measurement in an experiment can be described as follows: there is a device that records the spectrum or signal of some process and, as the result of the measurement, outputs a set of numbers. Our task as researchers is to look at these numbers and, knowing how the device is built, figure out what the measured spectrum or signal actually was. In other words, we are facing what is called an inverse problem. Written mathematically, this gives the following equation (which, incidentally, is called the Fredholm equation of the first kind):

$$f(y) = \int_a^b dx\, K(x, y)\, \varphi(x)$$

In essence, this equation describes the following: our measuring device is represented by its instrument (hardware) function $K(x,y)$, which acts on the studied spectrum or other input signal $\varphi$, so that the researcher observes the output signal $f(y)$. The researcher's goal is to reconstruct the signal $\varphi$ from the known $f(y)$ and $K(x,y)$. The same expression can be written in matrix form by replacing the functions with vectors and matrices:

$$f_m = K_{mn}\varphi_n$$

It would seem that signal reconstruction is not a difficult task, since both the Fredholm equation and a system of linear equations (even an overdetermined one) have an exact solution. So let's try. Let the measured signal be the sum of two Gaussians:

$$\varphi(x) = 2N(2,\ 0.16) + N(4,\ 0.04)$$

As the instrument we take the simplest integrator: a matrix built from the Heaviside step function that turns our signal into a cumulative sum:

$$K_{mn} = \theta(x_m - y_n)$$

The shapes of the measured signal and of our instrument, together with the measurement result, are shown in the plot.

Importantly, any real measurement comes with an error, so we will spoil the result a little by adding normal noise corresponding to a five percent measurement error.

We reconstruct the signal by the least squares method:

$$\varphi = (K^T K)^{-1} K^T f$$

And as a result we get:

Actually, we could end the article right here, having once again convinced ourselves of the helplessness of the idealistic methods of mathematics in the face of harsh and ruthless physical reality, and go back to our soldering irons.
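For those who want to reproduce this failure, here is a minimal sketch in plain NumPy/SciPy (the signal, the Heaviside integrator, and the five percent noise follow the text above; the grid sizes and the random seed are our own choices):

import numpy as np
from scipy.stats import norm

np.random.seed(42)

a, b = 0, 5
# true signal: sum of two Gaussians, phi(x) = 2*N(2, 0.16) + N(4, 0.04)
phi = lambda x: 2 * norm.pdf(x, loc=2, scale=0.4) + norm.pdf(x, loc=4, scale=0.2)

x = np.linspace(a, b, 30)   # discretization grid for phi
y = np.linspace(a, b, 50)   # measurement points
dx = x[1] - x[0]

# the integrator "instrument": K[m, n] = theta(x_n - y_m), so f(y_m) ~ integral of phi above y_m
K = np.heaviside(x[None, :] - y[:, None], 1.0) * dx

ftrue = K @ phi(x)
sig = 0.05 * ftrue + 0.01               # five percent measurement error
f = ftrue + np.random.normal(scale=sig)

# naive least squares: phi = (K^T K)^{-1} K^T f
phi_lsq = np.linalg.solve(K.T @ K, K.T @ f)
# phi_lsq oscillates wildly and bears little resemblance to the true phi(x)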

But first, let's figure out why this failure happened to us. Obviously, measurement errors are to blame, but what exactly do they break? The point is that Jacques Hadamard (the same one who contributed the dash to the Cauchy-Hadamard formula) divided all problems into well-posed and ill-posed ones.

Remembering the classics ("There is no point in looking for a solution if it already exists. The question is what to do with a problem that has no solution. This is a deeply fundamental question..."), we will not dwell on well-posed problems and turn straight to the ill-posed ones. Fortunately, we have already met one: the Fredholm equation written above is an ill-posed inverse problem. Even for infinitesimally small perturbations of the input data (and our measurement errors are far from infinitesimal), a solution obtained in an exact analytical way can differ arbitrarily strongly from the true one.

A proof of this statement can be found in the first chapter of the classic book by academician A. N. Tikhonov, "Methods for Solving Ill-Posed Problems". That book also gives advice on what to do with ill-posed problems, but the technique described there has a number of drawbacks that are fixed in Turchin's method. First, however, let us describe the general principle of working with ill-posed problems: what should you do if you run into one?

Since the problem by itself cannot offer us anything more, we have to commit a small crime: supplement the problem with data so that it becomes well-posed, in other words, introduce some additional a priori information about the problem (this process is called regularization). In contrast to the classical Tikhonov method, which is based on introducing a parametrized regularizing functional, Turchin's method proposes to do this using a Bayesian approach.

Theoretical Description of Statistical Regularization


Strategy


Let us formulate our problem in terms of mathematical statistics: given a known realization $f$ (which we measure in the experiment), we need to estimate the value of the parameter $\varphi$. A functional $\hat{S}$ that computes $\varphi$ from $f$ will be called a strategy. To decide which strategies are more optimal, we introduce a quadratic loss function. The real loss function could be anything, so why choose the quadratic one? Because any loss function near its minimum can be approximated by a quadratic one:

$$L(\varphi, \hat{S}[f]) = \lVert \hat{\varphi} - \hat{S}[f] \rVert^{2}_{L_2},$$

where $\hat{\varphi}$ is the best solution. The losses for the chosen strategy are then determined by the risk function:

$$R_{\hat{S}[f]}(\varphi) \equiv E\left[L(\varphi, \hat{S}[f])\right] = \int L(\varphi, \hat{S}[f])\, P(f|\varphi)\, df.$$

Here $P(f|\varphi)$ is the probability density of the ensemble over which the loss is averaged. This ensemble is formed by a hypothetical repetition of the measurement of $f$ many times for a given $\varphi$. In other words, $P(f|\varphi)$ is the known probability density of obtaining the $f$ observed in the experiment.

Following the Bayesian approach, we treat $\varphi$ as a random variable with an a priori probability density $P(\varphi)$ that expresses how plausible the various solutions of our problem are. $P(\varphi)$ is determined from the information available before the experiment. The optimal strategy is then chosen by minimizing the a posteriori risk:

$$r_{\hat{S}}(\varphi) \equiv E_{\varphi} E_{f}\left[L(\varphi, \hat{S}[f]) \mid \varphi\right]$$

In this case, the optimal strategy is well known:

$$\hat{S}[f] = E[\varphi|f] = \int \varphi\, P(\varphi|f)\, d\varphi,$$

where the posterior density $P(\varphi|f)$ is given by Bayes' theorem:

$$P(\varphi|f) = \frac{P(\varphi)\, P(f|\varphi)}{\int P(\varphi)\, P(f|\varphi)\, d\varphi}$$

This approach also lets us determine the variance (correlation function) of the resulting solution:

$$D(x_1, x_2) = E\left[\big(\varphi(x_1) - \hat{S}[f](x_1)\big)\big(\varphi(x_2) - \hat{S}[f](x_2)\big)\right]$$

So, we have obtained the optimal solution of our problem by introducing an a priori density $P(\varphi)$. Can we say anything about the world of functions $\varphi(x)$ that this a priori density describes?

If the answer is no, we would have to accept all possible $\varphi(x)$ as equally probable and return to the non-regularized solution. Therefore, we must answer this question in the affirmative.

This is precisely what the method of statistical regularization consists of: regularizing the solution by introducing additional a priori information about $\varphi(x)$. If the researcher already has some a priori information (an a priori density $P(\vec{\varphi})$), they can simply compute the integral and obtain the answer.

If there is no such information, the next section describes what minimal information a researcher can get by with and how to use it to obtain a regularized solution.

A priori information


As British scientists have shown, everyone in the world loves to differentiate. And while a mathematician will question whether this operation is even legitimate, a physicist optimistically believes that the laws of nature are described by "good", that is, smooth, functions.

In other words, a smoother $\varphi(x)$ should be assigned a higher a priori probability density. So let us try to introduce an a priori probability based on smoothness. To do this, recall that introducing a priori information does a certain violence to the world, forcing the laws of nature to look the way that is convenient for us.

This violence should be kept to a minimum: when introducing the a priori probability density, we require that the Shannon information about $\varphi(x)$ contained in $P(\vec{\varphi})$ be minimal. Formalizing this, we derive the form of the a priori density from the smoothness of the function. To do so, we look for a conditional extremum of the information:

$$I[P(\vec{\varphi})] = \int \ln P(\vec{\varphi})\, P(\vec{\varphi})\, d\vec{\varphi} \to \min$$

Under the following conditions:

  1. Smoothness condition for $\varphi(x)$. Let $\Omega$ be a matrix characterizing the smoothness of the function. We then require that, on average, a certain value of the smoothness functional be attained:

    $$\int (\vec{\varphi}, \Omega\vec{\varphi})\, P(\vec{\varphi})\, d\vec{\varphi} = \omega$$

    An attentive reader will ask how the value of the parameter $\omega$ is determined. The answer is given further in the text.
  2. Normalization of the probability to unity: $\int P(\vec{\varphi})\, d\vec{\varphi} = 1$
    Under these conditions, the minimum of the information functional is delivered by the following density:

    $$P_{\alpha}(\vec{\varphi}) = \frac{\alpha^{\mathrm{Rg}(\Omega)/2} \det(\Omega)^{1/2}}{(2\pi)^{N/2}} \exp\left(-\frac{1}{2}(\vec{\varphi}, \alpha\Omega\vec{\varphi})\right)$$

    The parameter $\alpha$ is related to $\omega$, but since we have no information about the actual value of the smoothness functional, working out exactly how they are related is pointless. What, then, should one do with $\alpha$, you ask? Three ways are open to you:

    1. Pick the value of $\alpha$ by hand, thereby effectively reverting to Tikhonov regularization.
    2. Average (integrate) over all possible $\alpha$, assuming all values of $\alpha$ to be equally probable.
    3. Choose the most probable $\alpha$ from its posterior probability density $P(\alpha|\vec{f})$. This approach is justified because it gives a good approximation of the integral when the experimental data contain enough information about $\alpha$.

The first case is of little interest to us. In the second case, we have to compute the following rather unpleasant integral:

$$\langle \varphi_i \rangle = \frac{\int d\varphi\, \varphi_i\, P(f|\varphi) \int d\alpha\, P(\alpha)\, \alpha^{\mathrm{Rg}(\Omega)/2} \exp\left(-\frac{\alpha}{2}(\vec{\varphi}, \Omega\vec{\varphi})\right)}{\int d\varphi\, P(f|\varphi) \int d\alpha\, P(\alpha)\, \alpha^{\mathrm{Rg}(\Omega)/2} \exp\left(-\frac{\alpha}{2}(\vec{\varphi}, \Omega\vec{\varphi})\right)}$$

In the third case, the integral can be taken analytically for Gaussian noise in the experiment (this will be considered in the section after next).

Note also that nowhere have we actually used the fact that $\Omega$ is a smoothness operator. In fact, any other operator (or a linear combination of operators) could be used here; smoothness of the function is simply the most obvious form of a priori information we can exploit.

Discretization


So far we have talked about functions, but no real device can measure even a countable set of points, let alone a continuum. Measurements are always taken at a finite set of points, so we have to discretize and pass from the integral equation to a matrix one. In the method of statistical regularization we proceed as follows: we expand $\varphi(x)$ over some system of functions $\{T_n\}$:

$$\varphi(x) = \sum_n \varphi_n T_n(x).$$

The expansion coefficients thus form a vector $\vec{\varphi}$ in the chosen function space.

As the function space we can take a Hilbert space or, for example, a space of polynomials. The choice of basis in these spaces is limited only by your imagination (we have tried the trigonometric Fourier basis, Legendre polynomials, and cubic splines).

The elements of the matrix $K$ are then computed as:

$$K_{mn} = \left(\hat{K} T_n(x)\right)(y_m),$$

where $y_m$ are the points at which the measurements were made. The elements of the matrix $\Omega$ are computed by the formula:

$$\Omega_{ij} = \int_a^b \frac{d^p T_i(x)}{dx^p}\, \frac{d^p T_j(x)}{dx^p}\, dx,$$

where $a$ and $b$ are the boundaries of the interval on which the function $\varphi(x)$ is defined.
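As an illustration only (the monomial basis, the choice p = 2, and the Heaviside kernel from the beginning of the article are our own; the statreg package does this internally), the discretized matrices could be assembled like this:

import math
import numpy as np
from scipy.integrate import quad

a, b, N, p = 0.0, 5.0, 6, 2                      # interval, basis size, derivative order
y = np.linspace(a, b, 50)                        # measurement points

def T(n):
    """Basis function T_n(x) = x**n (monomials, purely for illustration)."""
    return lambda x: x ** n

def dT(n, p):
    """p-th derivative of T_n(x) = x**n."""
    if n < p:
        return lambda x: 0.0
    c = math.factorial(n) / math.factorial(n - p)
    return lambda x: c * x ** (n - p)

kernel = lambda x, t: np.heaviside(x - t, 1.0)   # integrator kernel from the example above

# K_mn = (K-hat T_n)(y_m): apply the integral operator to each basis function
K = np.array([[quad(lambda x: kernel(x, ym) * T(n)(x), a, b, points=[ym])[0]
               for n in range(N)] for ym in y])

# Omega_ij = int_a^b T_i^(p)(x) * T_j^(p)(x) dx
omega = np.array([[quad(lambda x: dT(i, p)(x) * dT(j, p)(x), a, b)[0]
                   for j in range(N)] for i in range(N)])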

To propagate the errors, we use the formula for the variance of a linear combination of random variables:

$$D[\varphi(x)] = D\left[\sum_n \varphi_n T_n(x)\right] = \sum_{i,j} T_i(x)\, T_j(x)\, \mathrm{cov}(\varphi_i, \varphi_j).$$
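In code this is a one-liner (a schematic helper, not part of statreg; cov_phi stands for the covariance matrix of the expansion coefficients):

import numpy as np

def phi_variance(x, basis_funcs, cov_phi):
    """Pointwise variance of phi(x) = sum_n phi_n T_n(x),
    given the covariance matrix cov_phi of the coefficients phi_n."""
    t = np.array([T(x) for T in basis_funcs])   # vector of T_i(x)
    return t @ cov_phi @ t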

Keep in mind that in some cases representing a function by a finite-dimensional vector leads to a partial loss or distortion of information. In fact, this algebraization can itself be viewed as a kind of regularization, though a weak one, insufficient to turn an ill-posed problem into a well-posed one. In any case, we have now passed from the search for $\varphi(x)$ to the search for the vector $\vec{\varphi}$, which we will find in the next section.

Gaussian Noise Case


The case where the errors in the experiment are distributed according to Gauss is remarkable in that our problem then admits an analytical solution. Since both the a priori information and the errors are Gaussian, their product is also Gaussian, and the unpleasant integral written above can be taken explicitly. The solution and its error are as follows:

$$\vec{\varphi} = \left(K^T \Sigma^{-1} K + \alpha^{*}\Omega\right)^{-1} K^T \Sigma^{-1} \vec{f}$$

$$\Sigma_{\vec{\varphi}} = \left(K^T \Sigma^{-1} K + \alpha^{*}\Omega\right)^{-1},$$

where $\Sigma$ is the covariance matrix of the multidimensional Gaussian distribution of the errors, and $\alpha^{*}$ is the most probable value of the parameter $\alpha$, determined from the condition of maximum posterior probability density:

$$P(\alpha|\vec{f}) = C'\, \alpha^{\mathrm{Rg}(\Omega)/2} \left|\left(K^T \Sigma^{-1} K + \alpha\Omega\right)^{-1}\right|^{1/2} \exp\left(\frac{1}{2}\, \vec{f}^{\,T} \Sigma^{-1} K \left(K^T \Sigma^{-1} K + \alpha\Omega\right)^{-1} K^T \Sigma^{-1} \vec{f}\right)$$
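Here is a minimal sketch of these formulas in plain NumPy/SciPy (this is not the statreg API; the function names and the bounded scalar search for alpha are our own choices):

import numpy as np
from scipy.optimize import minimize_scalar

def solve_gauss(K, f, Sigma, omega, alpha):
    """Regularized solution and its covariance for a fixed alpha."""
    Sigma_inv = np.linalg.inv(Sigma)
    B = K.T @ Sigma_inv @ K + alpha * omega        # K^T Sigma^-1 K + alpha * Omega
    cov_phi = np.linalg.inv(B)
    phi = cov_phi @ K.T @ Sigma_inv @ f
    return phi, cov_phi

def neg_log_p_alpha(alpha, K, f, Sigma, omega):
    """-ln P(alpha|f) up to an alpha-independent constant."""
    Sigma_inv = np.linalg.inv(Sigma)
    B = K.T @ Sigma_inv @ K + alpha * omega
    g = K.T @ Sigma_inv @ f
    rank = np.linalg.matrix_rank(omega)
    log_p = (0.5 * rank * np.log(alpha)
             - 0.5 * np.linalg.slogdet(B)[1]       # ln sqrt|B^{-1}|
             + 0.5 * g @ np.linalg.solve(B, g))
    return -log_p

# alpha* maximizes P(alpha|f); for a sketch a bounded scalar search is enough:
# res = minimize_scalar(neg_log_p_alpha, bounds=(1e-8, 1e3), method='bounded',
#                       args=(K, f, Sigma, omega))
# phi, cov_phi = solve_gauss(K, f, Sigma, omega, res.x)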


What if my errors are not Gaussian?


The second part of the article will be devoted to this; for now, let us outline the essence of the problem.

$$\langle \varphi_i \rangle = \frac{\int d\varphi\, \varphi_i\, P(f|\varphi) \int d\alpha\, P(\alpha)\, \alpha^{\mathrm{Rg}(\Omega)/2} \exp\left(-\frac{\alpha}{2}(\vec{\varphi}, \Omega\vec{\varphi})\right)}{\int d\varphi\, P(f|\varphi) \int d\alpha\, P(\alpha)\, \alpha^{\mathrm{Rg}(\Omega)/2} \exp\left(-\frac{\alpha}{2}(\vec{\varphi}, \Omega\vec{\varphi})\right)}$$


The main difficulty is that this terrifying integral is, first, multidimensional and, second, taken over infinite limits. Moreover, it is very high-dimensional: the vector $\vec{\varphi}$ can easily have dimension $m$ of 30 to 50, grid methods for computing integrals have complexity of order $O(n^m)$, and so they are inapplicable here. For multidimensional integrals, Monte Carlo integration works well.

Furthermore, since the limits are infinite, we have to use importance sampling, which means we need to choose a sampling function. To automate this, one should use Markov Chain Monte Carlo (MCMC), which can adapt the sampling function to the integrand on its own. We will talk about applying MCMC in the next article.

Practical part


The first implementation of the statistical regularization method was written back in the 1970s in Algol and was successfully used for calculations in atmospheric physics. Although we still have the handwritten sources of that algorithm, we decided to add a bit of modernity and implement it first in Python and then in Julia.

Python


Installation

Install through pip:

pip install statreg

or download the source code .

Examples

As an example, let us see how the statreg module is used to recover data for a matrix equation and for an integral equation.

We import the necessary scientific packages.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.integrate import quad
%matplotlib inline

We define the true signal that we will reconstruct:

a = 0
b = 5
# true signal: sum of two Gaussians
phi = lambda x: 4*norm.pdf(x-2, scale=0.4) + 2*norm.pdf(x-4, scale = 0.5)
x = np.linspace(a, b,100)
plt.plot(x, phi(x));


We define the kernel and the convolution of the functions (note: np.convolve works only with arrays, so we write our own):

kernel = lambda x, y: np.heaviside(x - y, 1)  # kernel: Heaviside step function
convolution =  np.vectorize(lambda y: quad(lambda x: kernel(x,y)*phi(x), a,b)[0])

We generate the measured data and add noise to it using the normal distribution:

y = np.linspace(a, b, 50)
ftrue = convolution(y)
sig = 0.05*ftrue + 0.01  # five percent measurement error plus a small constant
f = norm.rvs(loc = ftrue, scale=sig)
plt.errorbar(y, f, yerr=sig);



We solve the integral equation

We import the solver and auxiliary class for discretization:

from statreg.model import GaussErrorUnfolder
from statreg.basis import CubicSplines

As the functional basis for discretization we use cubic splines, and as an additional condition we require that the function vanish at the edges of the interval.

basis = CubicSplines(y, boundary='dirichlet')
model = GaussErrorUnfolder(basis, basis.omega(2))

We solve the equation:

phi_reconstruct = model.solve(kernel, f, sig, y)

We plot the solution:

plt.plot(x,phi(x))
phir = phi_reconstruct(x)
phiEr = phi_reconstruct.error(x)
plt.plot(x, phir, 'g')
plt.fill_between(x, phir-phiEr, phir + phiEr, color='g', alpha=0.3);



We solve the matrix equation

We import the solver and auxiliary class for discretization:

from statreg.model import GaussErrorMatrixUnfolder
from statreg.basis import CubicSplines

To obtain the matrices we use our functional basis, but obviously the matrices could be obtained in any other way.

cubicSplines = CubicSplines(y, boundary='dirichlet')
omega = cubicSplines.omega(2)
Kmn = cubicSplines.discretizeKernel(kernel,y)

We solve the matrix equation:

model = GaussErrorMatrixUnfolder(omega)
result = model.solve(Kmn, f, sig)

We plot the result:

phir = lambda x: sum([p*bf(x) for p, bf in zip(result.phi,cubicSplines.basisFun)])
plt.plot(x,phir(x))
plt.plot(x,phi(x));



Julia


As we mentioned, further development of the technique requires advanced Monte Carlo integration. We could have used some Python module (for example, we have worked with PyMC3), but, among other things, we take part in a joint project with the Max Planck Institute in Munich.

This project is called the Bayesian Analysis Toolkit. Its goal is to create a framework with tools for Bayesian analysis, first of all MCMC. The team is now working on the second version of the framework, which is written in Julia (the first was written in, let us say, not the best C++). One of the tasks of our group is to demonstrate the capabilities of this framework using statistical regularization as an example, which is why we wrote an implementation in Julia.

using PyCall
include("../src/gauss_error.jl")
include("../src/kernels.jl")

a = 0.
b = 6.

function phi(x::Float64)
    mu1 = 1.
    mu2 = 4.
    n1 = 4.
    n2 = 2.
    sig1 = 0.3
    sig2 = 0.5

    norm(n, mu, sig, x) = n / sqrt(2 * pi*sig^2) * exp(-(x - mu)^2 / (2 * sig^2))
    return norm(n1, mu1, sig1, x) + norm(n2, mu2, sig2, x)
end
x = collect(range(a, stop=b, length=300))

import PyPlot: plot, savefig, fill_between

myplot = plot(x, phi.(x))
savefig("function.png", dpi=1000)


This time we use a different kernel: instead of an integrating step function we take a convolution with a Gaussian, which in effect blurs our data:

function kernel(x::Float64, y::Float64)
    return getOpticsKernels("gaussian")(x, y)
end

convolution = y -> quadgk(x -> kernel(x,y) * phi(x), a, b, maxevals=10^7)[1]
y = collect(range(a, stop = b, length=50))
ftrue = convolution.(y)
sig = 0.05*abs.(ftrue) +[0.01 for i = 1:Base.length(ftrue)]
using Compat, Random, Distributions
noise = []
for sigma in sig
    n = rand(Normal(0., sigma), 1)[1]
    push!(noise, n)
end
f = ftrue + noise
plot(y, f)


Similarly, we take the basis of splines with fixed ends:

basis = CubicSplineBasis(y, "dirichlet")
Kmn = discretize_kernel(basis, kernel, y)
model = GaussErrorMatrixUnfolder([omega(basis, 2)], "EmpiricalBayes", nothing, [1e-5], [1.], [0.5])
result = solve(model, Kmn, f, sig)
phivec = PhiVec(result, basis)

x = collect(range(a, stop=b, length=5000))
plot(x, phi.(x))

phi_reconstructed = phivec.phi_function.(x)
phi_reconstructed_errors = phivec.error_function.(x)

plot(x, phi_reconstructed)
fill_between(x, phi_reconstructed - phi_reconstructed_errors, phi_reconstructed + phi_reconstructed_errors, alpha=0.3)



Real-world example

As an example of real data analysis, we reconstruct the electron scattering spectrum of a hydrogen-deuterium mixture. In the experiment the integral spectrum was measured (that is, the number of electrons with energies above a given value), and we need to reconstruct the differential spectrum. For these data the spectrum had previously been reconstructed by fitting, so we have a reference against which to check our algorithm.

This is how the initial integral spectrum looks:

And here is the result of the reconstruction:

Analysis based on fitting has a number of drawbacks, the main one being model dependence: the fit can only reproduce features that are explicitly built into the fitting function, and the error estimates inherit this model dependence.

Statistical regularization avoids these problems and yields a model-independent result with measurement errors. The solution obtained by regularization agrees well with the fitted curve. Note the two small peaks at 25 and 30 eV. It is known that the peak at 25 eV arises from double scattering, and the fit recovered it only because it was explicitly included in the fitting function. The peak at 30 eV may be a statistical fluctuation (the errors are quite large at this point), or it may indicate the presence of additional dissociative scattering.

Conclusions and announcement of the next part


We have described a useful technique that can be adapted to many data analysis tasks (including machine learning) to obtain an honest "fit": the most rational solution of the equation under the uncertainty caused by measurement errors. As a nice bonus, we also get error estimates for the solution. Anyone who wants to take part in developing or applying the method of statistical regularization is welcome to contribute code in Python, Julia, or anything else.

In the next part we will talk about:

  • Using MCMC
  • Cholesky decomposition
  • As a practical example, we will consider the use of regularization for processing a signal from a model of an orbital proton and electron detector



Posted by Mikhail Zeleny , Researcher at the Laboratory of Nuclear Physics Experiment Methods at JetBrains Research .
