Features of the data-driven approach in petrochemicals

As a business grows, each of its divisions tends to automate itself independently, and end-to-end data flows between them are rare. As a result, the data cannot be compared across departments, because each department calculates it in its own way. That is not a problem if you only collect a few company-wide metrics, but when it comes to calculating end-to-end indicators, building forecasts, or solving modeling and optimization problems, chaos begins.

Data warehouses (DWH) are not a new story: traditionally they have been used for reporting. But full-fledged modeling and forecasting of end-to-end business processes on DWH data began relatively recently. Using the collected data, modern analysis tools make it possible not only to build dashboards with drop-down filters, but also to set up forecasting and optimization algorithms for individual attributes, to scale game-theoretic algorithms across the entire enterprise, and to build and immediately test hypotheses about the further development of the business on real data.



It all sounds great. Yet not all companies are in a hurry to follow the example of leaders such as Booking.com and Amazon.com; many continue to work as usual. So what is stopping them? At a minimum: judging whether large-scale investment in data processing tools is justified, the laboriousness of setting up data description processes, the appearance of new roles (data curators responsible for data quality, data engineers, data architects, and so on), learning how to calculate the economic effect of data management, clearly identifying cost drivers, making the data office self-sustaining, aligning it with the company's strategy, choosing the initiatives that will actually move the company forward, and much more.

My name is Victoria Krasnova, and I am the head of Corporate Data Management at SIBUR. Together with my colleague Rinat Abdurakhmanov, who leads the Data Governance team, I will tell you how we do it.

When large retailers such as Walmart began to digitize, they had to figure out which digital footprints and artifacts one business process left behind and what the next process took as input; in other words, to describe the end-to-end business process. The same is required for the digitalization of any other company. One way to answer this request is the discipline of data management and data architecture.

In practical terms, this means gathering the company's data, both the most important and the less important, in one place, describing it in clear business language, linking it to business processes, and creating user-friendly ways to access it.

Data architecture, among its other functions, provides clear answers to the questions "what is calculated?", "where is it calculated?", "why is it calculated?", "who is responsible for its quality?" and "where does it live, in which system?".

It is important that these answers can be obtained without pulling the business itself away from its work. It often happens like this: an analyst wants to test a hypothesis. To do so, he has to go to the data owner, ask for the necessary data, explain why it is needed and why it matters, and spend half a day on it. That is the best-case scenario. And in the end he may still be refused. Why? Because the data owner is responsible for granting access to the data and for how it is subsequently distributed, because it is unclear how the analyst will interpret the data, the owner may not like the interpretation, and so on.

Therefore, we need to build a structure and logic that is intuitive, works according to uniform rules, and does not distract either the analyst or the data owner from their immediate tasks.

A logical data model works well for this: a description of the data in business language, detailed down to the technical level, combined with a flexible role model. The analyst then gains access to the warehouse and to data sets based on his role in the company, and assembles the data set he needs using common sense, rather than relying on the knowledge that back in 2005 a certain comrade worked at the company and his file contains the required figures.
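
As a rough illustration of how such a role model can sit on top of a logical model, here is a minimal sketch. The roles, data domains and the can_access helper are hypothetical, invented for this post rather than taken from our actual implementation:

```python
from dataclasses import dataclass

# Each logical entity is described in business terms and assigned to a data domain.
@dataclass(frozen=True)
class LogicalEntity:
    name: str     # business name, e.g. "Shipment"
    domain: str   # data domain, e.g. "supply_chain"
    owner: str    # person responsible for data quality

# A role grants access to whole domains, not to individual files or individual people.
ROLE_DOMAINS = {
    "logistics_analyst": {"supply_chain", "production"},
    "finance_analyst": {"finance"},
}

def can_access(role: str, entity: LogicalEntity) -> bool:
    """Access follows from the role and the entity's domain, not from knowing 'the right person'."""
    return entity.domain in ROLE_DOMAINS.get(role, set())

shipment = LogicalEntity("Shipment", "supply_chain", "logistics data quality officer")
print(can_access("logistics_analyst", shipment))  # True
print(can_access("finance_analyst", shipment))    # False
```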

This approach to structuring lets people analyze quickly, makes the data comparable, and as a result delivers a secondary benefit: the entire business can be digitized in stages.

The challenges we face


At SIBUR, some processes are digitized quite well, for example data preparation for marketing, finance and supply chain management, production data, and plant inspection rounds. Everything else is harder, because SIBUR is a manufacturing business with a production cycle in which, from a business point of view, there is no need to collect information at the same speed as in retail, telecom or banking. Accordingly, the question of speed of data analysis was not raised either. But hard does not mean impossible. This year we plan to optimize processes, make calculations on data more transparent, increase the speed with which data reaches decision makers, and start collecting digital traces at all stages of our processes wherever possible.

Why are digital companies so accurate and fast in decision making? Because they have practically no margin for error if the data suddenly turns out to be wrong. In manufacturing, things are different: production will not stop and plants will not shut down because of some inaccuracy in the analytics data. So data architecture is the force that, against all odds, pulls production in the digital direction, and data management is the library that streamlines the flow of data throughout the company.

Recently we launched a workstream dedicated to data description. For now we are still searching for a tool to describe the data and to store and conveniently access those descriptions. If the description tool is inconvenient and we therefore cannot keep the catalog up to date, using it loses all meaning, and the data in the warehouse itself may stop being trustworthy. Why build anything on data whose expiration date has already passed?

This gives us yet another task: how to motivate the architects who maintain existing information systems to describe the data and keep the descriptions up to date. The principle of "you build it, you run it" is popular in digital companies; historically our setup works the other way around: some people implement a system, others support it. The documentation is often out of date, and some algorithms live only in the minds of old-timers. Describing systems is therefore very time-consuming work, especially when it is done from scratch, as in our case. And the payoff for the architects themselves comes much later, only after a critical mass of data has been described. But in the end, when they introduce yet another new system, they will not have to hunt for the data to feed it. Today that search takes two weeks or more on average.

Data is needed not only to launch new systems, but also to test hypotheses. Hypotheses usually come in large numbers and are tested in batches. In practice it turns out that the data exists, there is a lot of it, and it is diverse, but a great deal of time and money is spent just finding it.

Another issue: changing data "without warning" in one place makes data somewhere else incorrect. For example, the "Production Volume" indicator used to take into account losses at processing stages, and then stopped doing so. Whoever changed it updated their own system, but everyone else was not informed and kept using the indicator as before. As a result, the data behind a management decision is wrong. Or at some point it turns out that the data is not comparable, and people start hunting for errors. That is also labor, only invisible and impossible to account for.
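
To make this concrete, here is a minimal sketch of the kind of impact analysis that becomes possible once such dependencies are written down as metadata. The indicator names and the dependency graph are invented for illustration; this is not our production code:

```python
from collections import deque

# Who consumes what: indicator -> downstream indicators and reports built on it (hypothetical).
DOWNSTREAM = {
    "production_volume": ["yield_per_plant", "monthly_operations_report"],
    "yield_per_plant": ["plant_efficiency_dashboard"],
}

def affected_by(indicator: str) -> list[str]:
    """Breadth-first walk of the dependency graph: everything that must be reviewed
    when the definition of `indicator` changes (e.g. it stops including process losses)."""
    seen, queue, result = {indicator}, deque([indicator]), []
    while queue:
        for consumer in DOWNSTREAM.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                result.append(consumer)
                queue.append(consumer)
    return result

print(affected_by("production_volume"))
# ['yield_per_plant', 'monthly_operations_report', 'plant_efficiency_dashboard']
```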

So, as you can see, we ran squarely into the problem of choosing a tool for working with data. And before you go off and pick a tool, you need to write down sensible selection criteria.

Tool selection criteria


We are looking for a tool that supports describing metadata as an object model, with the ability to add new object types ourselves. Not all products offer this. Some tools, for example, can represent only physical tables as objects and have no object class for conceptual or logical entities.

Flexible configuration of relationships between objects is also very important. For example, today we have three logical levels of abstraction, but we should not be restricted from dropping levels or adding any number of new ones.
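
To show what we mean by an extensible object model, here is a minimal sketch. The object types, names and relationship kinds are invented; a real metadata tool would of course persist all of this rather than keep it in memory:

```python
from dataclasses import dataclass, field

@dataclass
class MetaObject:
    obj_type: str                 # "conceptual_entity", "logical_entity", "physical_table", ...
    name: str
    properties: dict = field(default_factory=dict)

@dataclass
class Relation:
    kind: str                     # "implemented_by", "belongs_to_domain", ...
    source: MetaObject
    target: MetaObject

# New object types can be registered at any time; we are not limited to physical tables.
OBJECT_TYPES = {"conceptual_entity", "logical_entity", "physical_table"}
OBJECT_TYPES.add("report")        # adding another abstraction level is just another type

material = MetaObject("logical_entity", "Material",
                      {"definition": "Anything the company produces or consumes"})
mara = MetaObject("physical_table", "SAP MARA")
relations = [Relation("implemented_by", material, mara)]
```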

Another important criterion is built-in connectors to source systems, especially to SAP. We have a lot of SAP (I suspect any large company does): a huge installation, and shoveling through it by hand is a thoroughly thankless task. It is ideal if a connector exists out of the box; if not, you can always write one yourself.

Let's not forget full-text search and semantic search with the ability to add your own synonym dictionaries (for example, via an embedded Elasticsearch).
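
As a sketch of what synonym-aware search could look like with an embedded Elasticsearch, here is a small example. The index name, fields and synonym pairs are made up, and the call style assumes the 7.x Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local test instance

# Index with a custom synonym filter so that business and technical terms match each other.
es.indices.create(
    index="data_catalog",
    body={
        "settings": {
            "analysis": {
                "filter": {
                    "catalog_synonyms": {
                        "type": "synonym",
                        "synonyms": [
                            "production volume, output volume",
                            "material, nomenclature item",
                        ],
                    }
                },
                "analyzer": {
                    "catalog_text": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "catalog_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "entity_name": {"type": "text", "analyzer": "catalog_text"},
                "description": {"type": "text", "analyzer": "catalog_text"},
            }
        },
    },
)

# A search for "output volume" now also matches entities described as "production volume".
hits = es.search(index="data_catalog",
                 body={"query": {"match": {"description": "output volume"}}})
```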

Feedback capabilities also matter. Ideally there should be commenting, 1-5 star ratings, direct communication with the person responsible for an entity or for a particular attribute of it, flags and tags to draw attention, and the ability to add objects to favorites.

In addition, it would be good to have a native connector to SAS DQ or any other tool that can assess data quality and display a health index for a particular entity, so that the user immediately sees that the data can be trusted because it has been through verification, and can leave feedback on it.
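
The health index itself can be as simple as a weighted combination of quality checks. Below is a minimal sketch of the idea; the weights and metric names are invented and have nothing to do with how SAS DQ actually scores data:

```python
def health_index(completeness: float, freshness: float, checks_passed: float) -> float:
    """Toy health index on a 0-100 scale. Each input is a share between 0 and 1,
    e.g. the fraction of non-empty values, of rows loaded on schedule, of passed DQ rules."""
    weights = {"completeness": 0.4, "freshness": 0.3, "checks_passed": 0.3}
    score = (weights["completeness"] * completeness
             + weights["freshness"] * freshness
             + weights["checks_passed"] * checks_passed)
    return round(100 * score, 1)

# An entity with complete data, slightly stale loads and one failed rule out of ten:
print(health_index(completeness=1.0, freshness=0.8, checks_passed=0.9))  # 91.0
```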

In general, you need something like this:



Here is a typical case: a person saw that the data could be trusted, looked more closely, found an error anyway, and wrote directly to the owner asking for a fix. The result is a kind of data health showcase. This openness and broad availability of data gradually lowers the mistrust of both users and owners. An analyst with even the most basic level of access can quickly get verified information, and at the same time does not depend on the data owner who contributes it. Win-win.

But usually everyone keeps everything in Excel, and that is a big problem (not Excel itself, of course, but the situation). People calculate indicators, then correct them in their own spreadsheet, and nothing changes in the shared system. And the analyst is afraid to take a figure from publicly available corporate sources; it is easier to go to a colleague and ask for a file. This is quite hard to fight. In fact, the success criterion for rolling out a data office is the creation of an environment in which employees, as a rule, rely on analysis results when making decisions and prefer SQL and Python as their tools.

It is worth mentioning separately the maintenance of up-to-date data classification statuses: "commercial secret", "public data", "personal data", "restricted corporate data". For a data analyst it is important to know exactly what he is currently viewing, downloading, or showing to colleagues.
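
Here is a sketch of how such statuses might travel alongside the data so the analyst can see what he is about to share. The labels mirror the ones above; the attribute names and the rule itself are a made-up example:

```python
from enum import Enum

class Classification(Enum):
    PUBLIC = "public data"
    CORPORATE = "restricted corporate data"
    PERSONAL = "personal data"
    SECRET = "commercial secret"

# Every attribute in the catalog carries its classification (hypothetical example set).
ATTRIBUTE_LABELS = {
    "plant_address": Classification.PUBLIC,
    "unit_margin": Classification.SECRET,
    "employee_phone": Classification.PERSONAL,
}

def needs_extra_approval(attributes: list[str]) -> list[str]:
    """Attributes the analyst should not pass on without the curator's approval.
    Unknown attributes default to SECRET as the safe choice."""
    sensitive = {Classification.SECRET, Classification.PERSONAL}
    return [a for a in attributes
            if ATTRIBUTE_LABELS.get(a, Classification.SECRET) in sensitive]

print(needs_extra_approval(["plant_address", "unit_margin", "employee_phone"]))
# ['unit_margin', 'employee_phone']
```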

After all, when the average person turns to the legislation on trade secrets and confidential information, he finds only generalized statements about what could harm a company. Quite often people then start treating everything that contains numbers as critical data (just in case). So when asked to provide data for analysis, the owner starts wondering: "is this a commercial secret?", "will the requester's actions cause harm?", and, playing it safe, often refuses. After all, he is responsible for this information and does not know how the analyst will use it.

There was another case: while working on a list of confidential information for a data democratization project, we discovered that the list contained data which the owners called confidential, yet which we are required by law to publish on the official website. And, of course, it is published there. In other words, where there is no single tool in which everyone can immediately see clearly verified information, a lot of people work in "just in case" mode and over-insure themselves.

So much for the criteria. Now, what exactly we were choosing from.

Search for a solution


We say "were choosing" because we have not chosen yet; we are still looking for the perfect tool. Initially we were choosing among Collibra, SAS DG, Alation, Alteryx Connect and Informatica. We also looked through open-source projects, but swept them aside almost immediately, because none of them can handle the Cyrillic alphabet.

Then there was an unsuccessful experience with Collibra. We almost closed the deal, but it fell through: we could not agree on the terms. In the near future they are moving entirely to the cloud, which is not an option for any Russian company. In effect they would provide not a product but a service: Collibra sells a subscription, and we hand over our data. Formally it is not a trade secret but metadata; in practice, though, if something goes wrong, the business is completely paralyzed.

After that story we realized that choosing a boxed solution would take us a long time: our processes are long, we approach deals, terms and contractors carefully, and we check everything many times to keep the risk minimal. Knowing all this, we started our own development to give users at least a temporary solution. After all, the data keeps pouring in, and it is impossible to use it without descriptions! In parallel, we are taking a closer look at Alation and Alteryx Connect and comparing their functionality and cost with our own solution.

We designed the logical warehouse model ourselves; it is somewhat harder for us than for other industries. For banks and telecoms there are reference data architectures: generally accepted structures and recommendations on what to decompose and how. For a full-cycle petrochemical business there are no publicly available sources to borrow from creatively. Just understanding how the business works as a whole took a year. SIBUR has complex production, a huge nomenclature, and many processes and businesses, all of which is reflected in the systems and therefore has to be analyzed.

What helped us here is what might be called knowledge-intensive leadership. In other industries, managers are often not very well versed in the industry itself. That happens, and in principle it is not terrible; after all, their job is to manage projects, although it does mean that each successive layer of management usually knows a little less than the previous one. In our case, however, the managers are people who can explain to you, in plain terms, every process butadiene can go through over its entire life cycle, for example.

Now, about the solution. A creative search is the kind of thing that can take a year, two, or a couple of lifetimes. Searching is fine, but you need something to work with right now.

So we went down the path of in-house development and called the result dg_light. Developing the front end took 4 weeks (truth be told, not from scratch: we reused the groundwork of the timeshare analysis tool that had recently come off our assembly line).

Project Composition:

  • Backend: Docker, Node.js, NGINX, Crossbar.io
  • Frontend: React, Redux, Material UI, Highcharts, Autobahn.js
  • Data Storage: PostgreSQL
  • Protocols: WAMP, WebSocket, HTTP(S)
  • OS: CentOS 7

The structure of the storage objects and the methodology came as input from the data architects. To work out the front-end design, we sat down a number of analysts of varying maturity levels and asked: "How would you like it to look?" In essence, they designed it for themselves.

All development took 6 weeks.

A reasonable question: what will we do with this solution once we buy an industrial one? The original plan was for both solutions to live in parallel: the "big" DG tool would hold the data models, the glossary and the role model, while dg_light would keep the complex features that are hard to implement in a boxed product, such as data lineage. What happens next, experience will show.

Data model


Physics. It all started with building the warehouse data model. We argued for a long time about how to structure the detailed storage layer and decided not to take a single ready-made concept, but to combine Data Vault 2.0 and Anchor modeling (6NF). Again, because our data sources are very different. On the one hand there is SAP, which deep down is partly OLAP and partly OLTP, plus business logic that lives by its own laws and requires maximum detail. On the other hand there are the manufacturing execution systems (MES) living their eventful lives, with data streams and key-value histories flowing in all the time.

The combination of hubs, satellites and links from DV 2.0 with the maximum granularity of Anchor modeling let us marry all of this in one place. Yes, writing queries against such a schema by hand is difficult, but everything in it is correct, and we can guarantee the integrity of the system even if everything around it suddenly starts to change or fall apart.
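
To make the hub/satellite/link vocabulary concrete, here is a minimal Data Vault-style sketch in SQLAlchemy. The table and column names are invented for illustration; this is not our actual warehouse schema:

```python
from sqlalchemy import MetaData, Table, Column, String, DateTime, Numeric, ForeignKey

metadata = MetaData()

# Hub: the stable business key of an object (here, a material), plus load metadata.
hub_material = Table(
    "hub_material", metadata,
    Column("material_hk", String(32), primary_key=True),   # hash key of the business key
    Column("material_code", String(40), nullable=False),   # business key from the source
    Column("load_dts", DateTime, nullable=False),
    Column("record_source", String(64), nullable=False),
)

# Satellite: descriptive attributes of the hub, historized by load timestamp.
sat_material_props = Table(
    "sat_material_props", metadata,
    Column("material_hk", String(32), ForeignKey("hub_material.material_hk"), primary_key=True),
    Column("load_dts", DateTime, primary_key=True),
    Column("material_name", String(255)),
    Column("density", Numeric(10, 4)),
    Column("record_source", String(64), nullable=False),
)

# Link: a relationship between two hubs (a material produced at a plant).
link_material_plant = Table(
    "link_material_plant", metadata,
    Column("link_hk", String(32), primary_key=True),
    Column("material_hk", String(32), ForeignKey("hub_material.material_hk"), nullable=False),
    Column("plant_hk", String(32), nullable=False),         # FK to a hub_plant table, omitted here
    Column("load_dts", DateTime, nullable=False),
    Column("record_source", String(64), nullable=False),
)
```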

Logic. Having settled how the architecture is organized at the physical level, we moved on to its logical description. Our discussions with colleagues drifted into philosophy as we tried to define for ourselves what an entity is and how entities relate to each other. After arguing for a while, we turned to the DAMA-DMBOK, pulled the definitions out of it and applied them to our context. The result: entities are the nouns we operate with at SIBUR that carry complete business meaning and answer a certain set of questions. There is still debate about whether aggregates and reports should count as entities. Each architect has his own opinion and his own reasoning, and we are now trying to bring our views to a common denominator. This is one of the tasks still ahead of us, among other things for the people we are looking to add to the team.

But the logical model is not everything. On top of it we also built a conceptual model, which management needs in order to understand what is going on at all. The logical model is too detailed for them, so we grouped everything into data domains that map well onto the company's described business processes. We are now negotiating with the process office to bind each such group of logical entities to processes in ARIS.

Then we went broader and higher still: we created a single logical data model into which we enter the logical entities of each system, indicating the source systems and how those systems relate to each other.

We export this knowledge to the corporate architects in Sparx Enterprise Architect; they need it to bind entities to integration flows and interfaces.

Organizing the data architecture this way will help the people who plan to do impact analysis later, since a full-fledged data lineage can be built from it. In general, we expect the solution to be used not only by architects of all stripes, but also by analysts in business units, data scientists, and everyone who is in any way connected with analysis.

Now we face an even more laborious task: teaching employees to use all of this.

People and data


Our global plan is to turn SIBUR into a data-driven company in which absolutely any employee can analyze something. We split the overall strategy into three parts: people, processes, and tools. The tools question is, one might say, settled: we built our own data platform. Now we need people to start using it.

The main peculiarity is the employees' mindset. They work in hazardous petrochemical production, where every safety rule is written in someone's blood. People are trained to follow instructions to the letter; it is literally imprinted on the subcortex. That state of mind runs strongly counter to the free thinking of an analyst.

We started small: gradually weaning employees off making a presentation for every more or less significant occasion and moving them to dashboards. Since people in the company are responsible and diligent, they try to take a finished presentation and redraw exactly the same thing in interactive form. But dashboards live by different laws, and for a person this is a completely different level of effort: the entire data history has to be loaded and verified. The data becomes automatically calculated and tamper-proof: you cannot tweak it by hand, only set it up properly from the start.

In reality, most automation of internal processes ends with the Excel-plus-email combo. And getting people off Excel is an almost impossible task. Well, really, why would we need Python and SQL when Excel can do everything! That attitude is quite hard to fight.


In earlier versions of data management at SIBUR there was the notion of an information asset owner: an employee who grants access to data and knows which figure is correct and where. This approach created the barriers I wrote about above. To break them down, we drew on Gartner's best practices and split the role into a data curator and a data quality officer.

A data curator is a manager at the level of a division director who sets the rules under which he is willing to grant access to data. A data quality officer works directly with the information itself and knows which figure is correct. We are now working to ensure that every figure has a person responsible for its quality who will respond to colleagues' requests in case of an error or inaccuracy. We already divide data into information available to everyone in the company, information available within a particular unit, and information constituting a commercial secret.

And if a manager wants to restrict specific data, we run shuttle negotiations and explain how closing off that information will affect other units that work with it directly or indirectly. As a result, the share of data open within the company has been radically revised. By SIBUR standards this is a real revolution.

Conclusion


We have a working tool, but so far very few people can use it, so we are training them. The process sped up noticeably after we set up cascade training, in which each employee we train takes on the obligation to train the next ones. We chose to train our own employees rather than hire analysts, because in our case it is easier to teach SQL and Python to the gods of Excel macros than to explain pyrolysis and its quirks to brilliant analysts. And watch their faces while doing it.

How we attract people and motivate them to learn is a story worthy of a separate post.

Besides training internal analysts, we are also looking for architects, people knowledgeable about data management. This is a new field not only for Russia but for the world as a whole, and people still interpret the concept of data architecture however they please. With business analysis, systems analysis and enterprise architecture, things are reasonably clear; with data architecture, not yet.

At SIBUR we define data architecture as a discipline at the junction of the systems side, with its databases, and the process side, which belongs to the business. A data architect must understand how the organization he works in is structured, how processes leave traces of themselves as data in different systems, and how to connect the one with the other.
