A look at promising directions for the development of geological, geophysical and production data management systems

At the request of Gazpromneft, in 2019 I wrote a short review of what geologists and geophysicists have to deal with in data management in general. It may be of interest to a wider circle of readers, so I am publishing it here.

In the TV series “The Big Bang Theory,” the sweet waitress Penny asks theoretical physicist Leonard, just to keep the conversation going, what's new in the world of particle physics. He replies that since the Higgs boson was proposed, i.e. since about 1964, basically nothing. The conversation stalls :)



In the field of databases and banks of geological, geophysical and production (GGiP) data, things are about as uneventful. As Deloitte Digital's research shows, oil, mining and heavy-haul companies are far from any big bangs. New databases and storage systems are created, but almost always to support the data needs of a particular application. Systems built specifically for organizing and storing data face a harsh reality: they fall out of a business process that revolves around application systems for processing, interpretation and modeling. As a result, data keeps accumulating in application databases, flows around the specialized systems, and settles comfortably in Excel spreadsheets and file folders, at best corporate and centralized ones (if such a resource is even offered to users).

The main distinguishing feature of the new data storage and organization systems is, in our opinion, their departure from the relational structure. Historically, all data organization tasks were solved with relational modeling because of the dominant position of relational tools on the market. The situation has changed, and the inefficiency of relational technologies in many archival (write once, read many) systems has become apparent.

If we finally discard the illusion of the one “almighty ring” that will unite them forever, it remains to accept that this set of geographically distributed application databases of various kinds, file systems and spreadsheets is our database. A moderately distributed, extremely heterogeneous, loosely coupled database that we have to learn to work with, because the characteristics above are unlikely to change in the next decade given the nature of the business. As Neil McNaughton, one of the few systematic researchers in the field of data management, put it: “Working in projects is a limitation. We are forced to trim the data down to a project only because we cannot yet work with all the data at once” (a loose quote).

So, having abandoned attempts to systematize the process of data accumulation globally (“upload everything!”) and having left the care of remastering, integrity, backup and protection from disasters to the IT department specialists, we are left with only one concern: access! But it is a significant concern, because access means not only physical transfer and speed, but also permission management and rights, reliability and a certain level of quality, and the structuring of information, both up front and in response to a query, and so on.

We approach data access through search technologies, although search has long gone beyond what we used to mean by the word: virtually all of our interaction with information outside the workplace already happens through built-in search technologies. Let us dwell briefly on the difficulties of organizing transparent access to data and ways to overcome them:

Interfaces with a significant number of systems. Even if the interface is read-only, the task is not trivial. Systems change both in terms of IT infrastructure and in terms of data models (for example, with new versions), and access control in many systems is entirely bespoke. Accordingly, a system that maintains persistent, even the simplest, interfaces with a large number of other systems needs continuous support for those interfaces, which means costs and a continuous development cycle.
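As an illustration, here is a minimal sketch of what such a read-only connector contract might look like; all names below are hypothetical, and each source system would get its own implementation so that a change in one source affects only its connector.

```python
# A read-only connector contract; all names are hypothetical.
# Each source system gets its own implementation, so when a source changes its
# data model or API version, only that connector needs to be updated.
from abc import ABC, abstractmethod
from typing import Iterator


class SourceConnector(ABC):
    """Read-only interface to one application database or file store."""

    @abstractmethod
    def list_changed(self, since: str) -> Iterator[str]:
        """Return identifiers of records changed after the given ISO timestamp."""

    @abstractmethod
    def fetch(self, record_id: str) -> dict:
        """Return one record mapped to a plain dictionary."""


class SeismicArchiveConnector(SourceConnector):
    """Hypothetical connector; real code would wrap the archive's API or SQL."""

    def list_changed(self, since: str) -> Iterator[str]:
        yield from []  # placeholder: query the archive's change log here

    def fetch(self, record_id: str) -> dict:
        return {"id": record_id, "type": "seismic_survey"}  # placeholder record
```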

Push or pull interfaces? Given the nature of the interfaces, it is hard to count on third-party systems preparing and transmitting data outward (even if their developers promise it): this is simply not part of their functional responsibilities. Accordingly, it is mainly the interfaces/agents of the access platform that will be responsible for pulling data out of third-party systems.
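A minimal sketch of such a pull agent, assuming a connector like the one above and an hourly polling interval (both are assumptions):

```python
# A pull agent: the access platform polls a source on a schedule instead of
# waiting for the source to push data. The connector and interval are assumptions.
import time


def run_pull_agent(connector, index_fn, interval_s: int = 3600) -> None:
    last_sync = "1970-01-01T00:00:00Z"
    while True:
        for record_id in connector.list_changed(since=last_sync):
            index_fn(connector.fetch(record_id))  # hand each record to the indexer
        last_sync = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        time.sleep(interval_s)  # wait before the next polling cycle
```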

Multi-model. Each system has its own data model, different from the others. To work with all the data as a whole, you will have to support data in different models, as well as their semantic generalizations, or meta-models, for different business requirements. GPN is actively conducting research in the field of semantic modeling. It is also appropriate here to mention data enrichment, named-entity extraction, and the application of genuinely industrial machine-learning approaches. I would also like to mention the problem of presenting “multi-model” data to the user in a structured way: a mere list of documents will satisfy neither people nor even algorithms. The importance of thoughtful faceting, meta-models, and cross-system queries is hard to overestimate.
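As an illustration of mapping heterogeneous source models onto one simple meta-model with common facet fields, here is a sketch; the source names and field names are assumptions, not a real schema.

```python
# Normalizing records from different source models into one meta-model document
# with common facet fields. Source names and field names are assumptions.
def to_metamodel(source: str, record: dict) -> dict:
    if source == "wells_db":
        return {
            "title": record.get("WELL_NAME"),
            "entity_type": "well",
            "field": record.get("FIELD"),
            "acquired": record.get("SPUD_DATE"),
            "source_system": source,
        }
    if source == "seismic_archive":
        return {
            "title": record.get("survey_name"),
            "entity_type": "seismic_survey",
            "field": record.get("licence_block"),
            "acquired": record.get("acquisition_year"),
            "source_system": source,
        }
    # Unknown models fall back to a minimal common envelope.
    return {"title": str(record), "entity_type": "document", "source_system": source}
```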

Caching and the inverted index. Practice shows that a system that is entirely dependent on other systems to answer queries (and this is how classic portal-style access systems are often organized) has two significant problems:

a. Performance: third-party systems handle requests with unpredictable performance, sometimes extremely slowly.

b. Reliability: with a large number of connected systems, at any given moment one or more of them will inevitably be down and unable to answer the request.

The only way to overcome these ailments is deep indexing of the information and efficient work with the index.
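A minimal sketch of feeding normalized documents into a local Elasticsearch index over its REST API (the cluster URL and index name are assumptions); serving queries from such an index decouples response time and availability from the source systems.

```python
# Pushing a normalized document into a local Elasticsearch index over its REST API.
# The cluster URL and index name are assumptions.
import requests

ES_URL = "http://localhost:9200"   # assumed local cluster
INDEX = "ggip-catalog"             # hypothetical index name


def index_document(doc_id: str, doc: dict) -> None:
    # PUT /<index>/_doc/<id> creates or overwrites the document.
    resp = requests.put(f"{ES_URL}/{INDEX}/_doc/{doc_id}", json=doc, timeout=10)
    resp.raise_for_status()


index_document("well-001", {"title": "Well 001", "entity_type": "well", "field": "North"})
```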

Access control. Access to data must be differentiated, and not all IT systems organize access control in the same way. For some systems, therefore, it is necessary to store (or reconstruct) and constantly keep up to date the mapping of their internal users to the users and groups of the corporate domain.
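One common way to enforce this on top of a search index is document-level filtering: each document carries the corporate-domain groups allowed to see it, and every query is filtered by the requesting user's groups. A sketch with hypothetical field names:

```python
# Document-level access control on top of the index: the query matches the text
# only within documents whose allowed_groups field intersects the user's
# corporate-domain groups. Field names are assumptions.
def secure_query(user_groups: list[str], text: str) -> dict:
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"terms": {"allowed_groups": user_groups}}],
            }
        }
    }
```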

Linkage to master data (NSI). Quality. When connecting a large volume of data from a large number of systems, we inevitably run into the problem of reliably determining the relevance and quality of the data. Any pre- and post-processing of data, including cognitive tools that help us reliably assess the relevance and quality of the data returned in response to a query, is in demand! Interestingly, quality is a function of use, and not the other way around: data is of high quality where someone constantly uses it. The best database with the highest-quality data that has gone unclaimed for 10 years is fraught with many surprises.
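A toy sketch of the "quality is a function of use" idea: records that are accessed often and recently get a higher confidence score; the specific weighting below is purely an assumption.

```python
# A toy usage-based confidence score: frequent and recent access raises it,
# long idle periods lower it. The weighting is purely an assumption.
import math
from datetime import datetime, timezone


def usage_score(access_count: int, last_access: datetime) -> float:
    # last_access is expected to be timezone-aware (UTC).
    days_idle = (datetime.now(timezone.utc) - last_access).days
    return math.log1p(access_count) / (1.0 + days_idle / 365.0)
```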

In the course of solving all the issues above, we find that although we have relatively little data, the volume of indexing, processing and analysis is so significant that we firmly land in the territory of high-load systems and, in production deployments, are forced to rely on serious industrial-grade solutions.

I want to finish this small essay with a paragraph packed with keywords. DevOps as a specialization deserves special attention and development in the company: these people, together with DataScience specialists, are responsible for the future of working with data in the company. JSON still looks like Arabic script to lovers of SQL syntax, but it apparently has to be mastered. With Python, the main problem is not so much mastering the syntax as finding the time and motivation for specialists in geological and geophysical disciplines to use it. Among COSS (commercial open source software), the best is of course Elasticsearch. And, as with any OSS, drop any illusions about an out-of-the-box solution from the OSS vendor itself.
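To make the "JSON instead of SQL" point concrete, here is a minimal sketch of a full-text query with a facet over source systems, sent to the same assumed local Elasticsearch index as above; roughly the equivalent of SELECT ... WHERE ... GROUP BY source_system.

```python
# A full-text query with a facet (terms aggregation) over source systems,
# sent to an assumed local Elasticsearch index named "ggip-catalog".
import requests

body = {
    "query": {"match": {"title": "seismic 3D"}},
    "aggs": {"by_source": {"terms": {"field": "source_system"}}},
    "size": 10,
}
resp = requests.post("http://localhost:9200/ggip-catalog/_search", json=body, timeout=10)
print(resp.json()["hits"]["total"])
```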

Let me finish with a joke from the same series:
- Hi! I see you're busy?
- Oh yes! Dark matter! Reconciling gravity and quantum mechanics. Fermi–Bose symmetry. I have grasped the greatest problems of today's physics!
- And so you solved them all in one fell swoop?
- No, I only realized that they are the greatest problems.
