Through thorns to the stars, or data analysis in the affairs of heaven


While we sit in the office drinking hot tea, something truly important is happening somewhere in the expanses of space. What catches the eye of astronomers and researchers all over the world is fascinating, intriguing, and perhaps frightening to those who know a little more about it than we do. New galaxies are born, Elon Musk's Tesla flies along to the songs of the immortal David Bowie; in general, beauty.

But let's return to Earth for a while. It just so happens that data analysis is a cross-disciplinary need, and an attractive one at that: with it you can study everything from the bowels of the Earth to the vastness of space.

I want to talk about one such experience: participating in IDAO 2019, the international data analysis olympiad that has been held for the third year in a row by my alma mater, the Higher School of Economics.

The task boiled down to preventing and detecting "space accidents": satellites with suboptimal trajectories crash into each other in orbit and turn into space debris, which at cosmic speeds could well cause several more collisions, the loss of several million dollars, and someone being called onto the carpet at NASA or Roscosmos. Why does this happen? Obviously, the stars are to blame. Or not; let's figure it out.
By the way, statistics on the number of man-made objects flying in Earth orbit are given below.



It can be seen that the amount of space debris is increasing year by year.

So, here I will try to tell how our team was able to take 22nd place out of 302.

To begin, consider the source data, which are as follows.



Here x, y, z are the coordinates of the object in three-dimensional space, and Vx, Vy, Vz are its velocity components. There are also simulated values, produced by the SGP4 model and marked with the _sim suffix, which we will not use.
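To make the following snippets concrete, here is a minimal loading sketch. The file name train.csv and the column names epoch and sat_id are assumptions about the dataset layout; adjust them to the actual files.

```python
import pandas as pd

# Assumed file and column names (train.csv, epoch, sat_id);
# adjust to the actual competition files.
df = pd.read_csv("train.csv", parse_dates=["epoch"])

# Keep only the real measurements; the simulated *_sim columns are dropped.
cols = ["sat_id", "epoch", "x", "y", "z", "Vx", "Vy", "Vz"]
df = df[cols]
print(df.head())
```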

First, let's build a simple visualization; this will help us understand how the data is arranged. I used plotly. Viewed in a two-dimensional coordinate system, the data looks as follows: the y coordinate of the seventh satellite is plotted below. There are more plots, which you can rotate with the mouse and stare at to your heart's content, in the .ipynb on GitHub.
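A minimal plotly sketch of such a plot, continuing from the loading snippet above (sat_id and epoch remain assumed column names):

```python
import plotly.graph_objects as go

# Select the seventh satellite and order its observations in time.
sat = df[df["sat_id"] == 7].sort_values("epoch")

fig = go.Figure(go.Scatter(x=sat["epoch"], y=sat["y"], mode="lines"))
fig.update_layout(title="Satellite 7: y coordinate over time",
                  xaxis_title="epoch", yaxis_title="y")
fig.show()
```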



During EDA (Exploratory Data Analysis), we noticed that the data contains observations that differ in time by just one second; most likely the same object was detected at the same point twice. These near-duplicates must be removed to keep the seasonality intact.
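A minimal sketch of this cleanup, continuing from the frame above and assuming epoch was parsed as a datetime column: keep an observation only if it is more than one second away from its predecessor.

```python
# Sort by time and compute the gap to the previous observation in seconds.
sat = sat.sort_values("epoch")
gap = sat["epoch"].diff().dt.total_seconds()

# Keep the first row (gap is NaN) and every row further than 1 s from its
# predecessor; one-second "echoes" of the same detection are dropped.
sat = sat[gap.isna() | (gap > 1)]
```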

In short, this time series clearly has a linear trend and a seasonality of 24, i.e. the satellite completes a revolution around the Earth in 24 observations. This will help later when choosing the optimal algorithm.
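One quick way to check this is a classical decomposition into trend, seasonal, and residual components; a sketch with statsmodels, assuming the period of 24 observations (older statsmodels versions call this argument freq instead of period):

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the y coordinate assuming one orbit = 24 observations.
decomposition = seasonal_decompose(sat["y"].values, model="additive", period=24)
decomposition.plot()
plt.show()
```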

Now let's write a function that predicts the values of the time series using the SARIMA algorithm (the implementation from the statsmodels package was used), optimizing the model parameters along the way and choosing the best model by the minimum Akaike Information Criterion (AIC). The AIC balances how well the model fits against how complex it is, thereby penalizing overfitting. The formula is given below.

AIC = 2k - 2 ln(L̂),

where k is the number of estimated parameters and L̂ is the maximized value of the model's likelihood function.

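A minimal sketch of such a function. It is not the team's exact code: the small order grid and the single-series interface are assumptions, but the idea is as described above, fit SARIMAX over a grid of parameters and keep the fit with the lowest AIC.

```python
import itertools
import warnings

from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_best_sarima(series, season=24):
    """Fit SARIMA models over a small order grid, return the one with min AIC."""
    best_aic, best_fit = float("inf"), None
    # (p, d, q) x (P, D, Q) with each order in {0, 1}: 64 candidates.
    for p, d, q, P, D, Q in itertools.product(range(2), repeat=6):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                fit = SARIMAX(series, order=(p, d, q),
                              seasonal_order=(P, D, Q, season)).fit(disp=False)
        except Exception:
            continue  # some order combinations fail to converge
        if fit.aic < best_aic:
            best_aic, best_fit = fit.aic, fit
    return best_fit

model = fit_best_sarima(sat["y"].values)
prediction = model.forecast(steps=24)  # one orbit ahead
```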
The final output looked like this:



Of course, our team arrived at this after several dozen iterations and repeated rewrites of the code. Some things worked, greatly improving our score; others ultimately fell apart, devouring our time like the Langoliers. But one way or another, we obtained predictions of the satellite's position and velocity for the next month.

The quality metric was SMAPE, the symmetric mean absolute percentage error:

SMAPE = (100% / n) * Σ |F_t - A_t| / ((|A_t| + |F_t|) / 2),

where F_t are the predicted values and A_t are the true values.
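As a sanity check, the metric is easy to implement by hand; a small sketch following the formula above:

```python
import numpy as np

def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = (np.abs(actual) + np.abs(predicted)) / 2
    return 100 * np.mean(np.abs(predicted - actual) / denom)

print(smape([1.0, 2.0, 3.0], [1.1, 1.9, 3.3]))  # ≈ 8.06
```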

The final formula looked like this:



In the end, our team was left with heaps of messy .ipynb notebooks, csv files with absolutely illogical names, sleepless nights, thousands of leaderboard refreshes, dozens of failed submissions and the other delights of ML hackathons, and, finally, 22nd place out of 302 teams on the private leaderboard, i.e. a spot in the top 7%.



As ideas for improving the solution: dig deeper into the EDA to understand the data at a lower level, and try other forecasting algorithms. A more detailed analysis is in the repository. Love ML and stay tuned.

Code Link
