Structuring risks and decisions when using BigData for official statistics

Translator's preface

The material interested me, primarily because of the table below:



Given that statisticians (and Russians, practically at the genetic level), to put it mildly, dislike everything that deviates from a linear dependence, these guys managed to push through the use of an activation function of parabolic form for determining the degree of risk of using Big Data in official statistics. Well done. Naturally, the statisticians added their note to this work: "1 Any errors and omissions are the sole responsibility of the authors. The opinions expressed in this document are personal and do not necessarily reflect the official position of the European Commission." But the work was published. I think that is enough for today, and the authors have not forbidden anyone from finding their own weights within these aspects.

The work is structured clearly enough to separate where and how statistical methods differ from research methods for Big Data. In my opinion, the greatest benefit of this work comes when talking to a customer and refuting statements such as:

- "We collect the statistics ourselves - what else do you want to research?"
- "Present your results to us so that we can reconcile them with our statistics." On this point, the authors suggest it would be worth reading this work (3 How big is Big Data? Exploring the role of Big Data in Official Statistics).

In this paper, the authors set out their own assessment of the risk level. This figure is given in parentheses and should not be confused with references to sources.

A second observation: the authors use the term BDS (Big Data Source) as an analogue of the Big Data concept (apparently a curtsy to official statistics).

Preface

A growing number of statistical offices are exploring the possibility of using big data sources to produce official statistics. Currently, there are only a few examples where these sources have been fully integrated into actual statistical production; consequently, the full extent of the consequences of their integration is not yet known. Meanwhile, first attempts have been made to analyse the conditions for, and the impact of, big data on various aspects of statistical production, such as quality or methodology. Recently, a task force developed a quality framework for producing statistics from big data in the context of the United Nations Economic Commission for Europe (UNECE) big data project. According to the European Statistics Code of Practice, the provision of high-quality statistical information is the main task of statistical offices. Since risk is defined as the effect of uncertainty on objectives (for example, by the International Organization for Standardization in ISO 31000), we found it appropriate to categorize risks according to the quality dimensions they affect.
The proposed quality framework for statistics derived from big data sources provides a structured view of quality across all phases of the statistical business process and can thus serve as the basis for a comprehensive assessment and management of the risks associated with these new data sources. It introduces new quality dimensions that are specific to, or of particular importance for, the use of big data for official statistics, such as the institutional/business environment or complexity. Using these new quality dimensions, the risks associated with the use of big data sources in official statistics can be identified more systematically.

In this paper, we seek to identify the risks arising from the use of big data in the context of official statistics. We take a systematic approach to identifying risks within the proposed quality framework. By focusing on the newly proposed quality dimensions, we can describe risks that are currently absent from, or do not affect, the production of official statistics. At the same time, we can identify existing risks that will have to be assessed quite differently when big data are used to produce statistics. We then move on to the risk management cycle and provide an assessment of the likelihood and impact of these risks. Since risk assessment involves subjectivity in attributing probability and impact to the various risks, we measure the agreement between scores provided independently by different stakeholders. We then propose options for responding to these risks in four main categories: avoidance, reduction, sharing and retention. According to ISO, one of the principles of risk management should be the creation of value, that is, the resources spent on reducing risks should be lower than the cost of inaction. In line with this principle, we finally assess the possible impact of some risk mitigation measures on the quality of the final outputs, in order to arrive at a more comprehensive assessment of the use of Big Data for official statistics.

1. Introduction


1.1. Background


The rise of "big data" was characterized by Kenneth Neil Cukier and Viktor Mayer-Schönberger in their article "The Rise of Big Data" (2 www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/therise-of-big-data) with the term "datafication". Datafication is described as the process of "taking all aspects of life and turning them into data". For instance, Facebook provides data on personal networks, sensors on all types of environmental conditions, smartphones on personal communication and movements, and wearables on personal conditions. This leads to nearly universal data collection and availability.

As in many other sectors, official statistics has only recently begun to discuss the topic of big data at a strategic level. There is still no common and widespread understanding of the way forward, of whether it is a challenge or an opportunity, whether it is small or big, etc. As part of the High-Level Group for the Modernisation of Statistical Production and Services (3 How big is Big Data? Exploring the role of Big Data in Official Statistics: www1.unece.org/stat/platform/download/attachments/99484307/Virtual%20Sprint%20Big%20Data%20paper.docx?version=1&modificationDate=1395217470975&api=v2), a first SWOT analysis followed by a rough risk/benefit analysis was carried out. It was noted that "a comprehensive risk analysis will also include aspects such as likelihood and impact, and may also be expanded to identify strategies to mitigate and manage risks."

Although this document is still far from a complete risk analysis, it aims to improve the situation precisely by creating the first structured review. We would like to emphasize that this review should be seen as a starting point for stimulating general discussion within the Official Statistical Community (OSC).

1.2. Scope


This article is devoted exclusively to risks, leaving aside not only benefits but also strengths, weaknesses, opportunities and threats. This means that "risks of inaction" (for example, the risk that the OSC will be outcompeted by other players if it does not modernize) are out of scope; that is rather a threat. Instead, we try to highlight the risks that may arise (a) if the OSC takes advantage of the opportunities provided by big data and starts developing or improving a specific "big data-based official statistics product" (BOSP); and (b) risks to the new "business as usual", that is, risks to official statistics production once it is based on big data. (Since all production of official statistics is associated with risks, we restrict ourselves under (b) to the risks specific to big data, i.e. risks that do not exist or are negligible for the "traditional" official statistics production process.)

1.3. Structure


In Section 2, we present the basics underlying this task, starting with the obviously required framework of risks and risk management (Section 2.1). We also present a preliminary quality framework for statistics derived from big data (Section 2.2), since linking the quality framework with risks serves two purposes:

  • It sets the context for identifying risks. The quality dimensions, together with the characteristics considered, express the values that are considered important and crucial for the provision of services to customers and users.
  • It allows specific risks to be assigned to quality dimensions, which are embedded in common hyperdimensions and tied to particular phases of the production of statistical products.

In Sections 3, 4, 5 and 6, we present the risks identified so far in various contexts (4 The business case documents of the ESS (https://www.europeansocialsurvey.org/about/structure_and_governance.html) Big Data project, as well as those of the ESSnet Big Data, contain a list of risks relating partly to the project and partly to the use of big data sources for statistical purposes. The document "A suggested Framework for the Quality of Big Data" mentions some risks related to quality dimensions.). Here we use a classification into data access, the legal environment, data privacy and security, and skills; a reorganization in line with the quality framework for statistics derived from big data (Section 2.2) should be considered as soon as that framework reaches a more mature status. For each of the identified risks, we (i) provide an assessment of likelihood and impact (in line with Section 2.1.3) and (ii) propose strategies to mitigate and manage the risk (see Section 2.1.4).

In the end, we discuss our findings and outline some next steps in Section 7.

2. The basics


2.1. Risks and Risk Management


According to ISO 31000:2009 (5), risk is defined as "the effect of uncertainty on objectives". This means that objectives must be defined or known before risks can be identified. These objectives are usually determined by the institutional context of the organization. Another important consideration is that risks carry an element of uncertainty, that is, it is not clear whether the described event will occur. Risks are therefore measured in terms of the likelihood of the event occurring and its consequences, i.e. the impact the event has on the achievement of objectives. Risk assessment should provide more objective information, ultimately allowing the right balance to be found between realizing opportunities for gain and minimizing adverse impacts. Risk management is an integral part of management practice and an essential element of good corporate governance (6 Statistics Canada: 2014-2015 Report on Plans and Priorities, www.statcan.gc.ca/aboutapercu/rpp/2014-2015/s01p06-eng.htm). It is an iterative process that ideally enables continuous improvement in decision-making and contributes to continuous improvement in performance.

Risks are also linked to quality. The use of a quality framework should make it possible to exploit the opportunities offered by various sources and methodologies to achieve an output of a certain quality level, in the sense that this output satisfies the needs of users. Like risks, quality levels can be derived from the institutional environment and the objectives of the particular institution. In this context, the institutional environment determines the overall level of risk that the organization is prepared to bear in order to achieve its objectives.

The risk assessment and management process can be divided into several stages: establishing the context, identifying risks, analysing risks in terms of probability and impact, evaluating risks and, finally, treating risks.

2.1.1. Institutional context


As a first step, it is necessary to establish a strategic, organizational and risk management context in which the rest of the process will take place. This includes establishing criteria by which risks will be assessed, and determining the structure of the analysis.

2.1.2. Risk identification


In the second stage, events that can affect the achievement of objectives should be identified. Identification should cover the type of risk, when and where events might occur, and how they can prevent, degrade, delay or enhance the achievement of objectives.

2.1.3. Risk assessment


The next step is the identification of existing controls and the analysis of risks in terms of probability as well as potential consequences. In the context of this article, the likelihood or probability of a risk occurring is rated on a scale from 1 (unlikely) to 5 (frequent). The impact of events is measured on a scale from 1 (negligible) to 5 (extreme). As shown in Table 1, the product of probability and impact gives a "risk level" ranging from 1 to 25.



The estimated risk levels can be compared with predefined criteria to strike a balance between potential benefits and adverse outcomes. This allows judgments to be made about management priorities.



Priority for action should be given to critical risks (see Table 2), that is, those that are likely to occur and have major or extreme consequences for the organization's objectives.
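As an illustration of the scoring scheme above, the following minimal sketch (Python) computes the risk level as the product of probability and impact and flags risks that would be prioritized; the threshold used for the "critical" band is an assumption, since Table 2 is not reproduced here.

    # Illustrative sketch of the risk-level calculation described above.
    # Probability and impact use the 1-5 scales from Section 2.1.3; the
    # threshold for "critical" is an assumption (Table 2 is not shown here).

    def risk_level(probability: int, impact: int) -> int:
        """Risk level = probability x impact, giving values from 1 to 25."""
        if not (1 <= probability <= 5 and 1 <= impact <= 5):
            raise ValueError("probability and impact must be on the 1-5 scale")
        return probability * impact

    def is_critical(probability: int, impact: int, threshold: int = 12) -> bool:
        """Flag risks that are likely enough and severe enough to prioritize."""
        return risk_level(probability, impact) >= threshold

    # Example: a hypothetical risk scored with probability 3 and impact 5.
    print(risk_level(3, 5), is_critical(3, 5))  # -> 15 True
    print(risk_level(2, 2), is_critical(2, 2))  # -> 4 False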

2.1.4. Risk response


The final step consists of deciding how to respond to the risks. Some risks that fall below a predetermined risk level can be ignored or tolerated. For others, the cost of mitigation may be so high that it outweighs the potential benefit; in that case, the organization may decide to abandon the activity concerned. Risks can also be transferred to third parties, for example through insurance that compensates the costs incurred. The final option is to accept the risks and take them into account when defining strategies and actions that balance costs against potential benefits. The organization then decides to implement strategies that maximize benefits and minimize potential costs.
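The value-creation principle mentioned in the preface (resources spent on reducing a risk should be lower than the cost of inaction) can be expressed as a simple comparison. The sketch below is only an illustration of that reasoning; the function names and the monetary figures are invented for the example, not taken from the paper.

    # Illustrative application of the value-creation principle: treat a risk
    # only if the expected loss avoided exceeds the cost of the mitigation.
    # All figures are hypothetical.

    def expected_loss(event_probability: float, loss_if_event: float) -> float:
        """Expected loss from inaction, for an event probability in [0, 1]."""
        return event_probability * loss_if_event

    def mitigation_is_worthwhile(p_before: float, p_after: float,
                                 loss_if_event: float, mitigation_cost: float) -> bool:
        """Mitigate only if the reduction in expected loss exceeds its cost."""
        avoided = expected_loss(p_before, loss_if_event) - expected_loss(p_after, loss_if_event)
        return avoided > mitigation_cost

    # Example: a 20% chance of a 500 000 loss; a control costing 40 000 that
    # cuts the chance to 5% avoids 75 000 of expected loss, so it pays off.
    print(mitigation_is_worthwhile(0.20, 0.05, 500_000, 40_000))  # -> True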



2.2. Quality frameworks


A task force composed of representatives of national and international statistical organizations developed a preliminary quality framework for statistics derived from big data in 2014. The task force worked under the auspices of the UNECE/HLG project "The Role of Big Data in the Modernisation of Statistical Production". It extended existing quality frameworks, designed for evaluating statistics based on administrative data sources, with quality dimensions considered relevant for big data sources.

Within this framework, a distinction is made between three phases of the business process: input, throughput and output. The input phase corresponds to the GSBPM "design" and "collect" phases, throughput to the "process" and "analyse" phases, and output is equivalent to the "disseminate" phase.

The framework uses a hierarchical structure taken from the framework for administrative data developed by Statistics Netherlands (7 Daas, P., S. Ossen, R. Vis-Visschers, and J. Arends-Toth (2009), Checklist for the Quality Evaluation of Administrative Data Sources. Statistics Netherlands, The Hague/Heerlen). Quality dimensions are embedded in higher-level groupings called hyperdimensions. The three defined hyperdimensions are "source", "metadata" and "data". Quality dimensions are embedded in these hyperdimensions and assigned to each of the production phases. For the input phase, additional dimensions were proposed to complement the standard quality model: "privacy and security", "complexity" (of the data structure), "completeness" of metadata and "linkability" (the ability to link the data with other data). For each quality dimension, factors relevant to its description are proposed, as well as possible indicators.
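The nesting of production phases, hyperdimensions and quality dimensions described above can be pictured as a simple data structure. The sketch below is illustrative only: it lists just the dimensions named in this section, their placement under the hyperdimensions is our assumption, and it should not be read as the complete UNECE framework.

    # Illustrative sketch of the hierarchy described above: production phases
    # contain hyperdimensions, which in turn contain quality dimensions.
    # Only dimensions mentioned in the text are listed; the placement of each
    # dimension under a hyperdimension is an assumption for illustration.

    quality_framework = {
        "input": {          # corresponds to the GSBPM "design" and "collect" phases
            "source": ["institutional/business environment", "privacy and security"],
            "metadata": ["complexity", "completeness"],
            "data": ["linkability"],
        },
        "throughput": {},   # "process" and "analyse" phases (dimensions omitted here)
        "output": {},       # "disseminate" phase (dimensions omitted here)
    }

    def dimensions_for_phase(phase: str) -> list:
        """List every quality dimension attached to a production phase."""
        return [dim for dims in quality_framework.get(phase, {}).values() for dim in dims]

    print(dimensions_for_phase("input"))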

In the context of this article, risks can be derived from these factors. For example, one factor to be considered when assessing the institutional/business environment dimension is the sustainability of the data provider; a related risk is that the data will no longer be available from that provider in the future. Another example relates to the recently proposed quality dimension of privacy and security. One important factor there is "perception", meaning possible negative perceptions by various stakeholders of the intended use of specific data sources.

3. Risks associated with data access


3.1. Lack of access to data
3.1.1. Description


This is the risk that a project aiming to develop a BOSP does not gain access to the required Big Data Source (BDS).

To date, the OSC has learned the hard way that even getting out of the starting blocks and gaining this access is sometimes an insurmountable obstacle. Sometimes it is easy to access a specific source, such as call detail records (CDR), for testing or research purposes, but much harder (for legal or commercial reasons) to access it for production purposes.

3.1.2. Probability


The probability depends largely on the characteristics of the BDS. For large administrative data it can be as low as 1, in particular if (as is the case with the traffic loop data studied by Daas et al. 8 Daas, P., M. Puts, B. Buelens and P. van den Hurk. 2015. "Big Data as a Source for Official Statistics." Journal of Official Statistics 31 (2). (Forthcoming; publication foreseen for June 2015.)) there are no personal data protection issues. If the BDS is held by a private entity, in particular if it is sensitive (for example, from a data protection point of view) or valuable (from a commercial point of view), the probability can be very high (5).

3.1.3. Impact


The impact depends on the BOSP and on the way the BDS is used. If the BDS is at the very core, the impact can be very high (4 = the BOSP cannot be produced at all), while it can be lower if it is still possible to produce the BOSP (albeit at lower quality) by relying on other data sources, leading to an impact in the range of 2-3.

3.1.4. Prevention


To reduce the risk of lack of access, preliminary contacts with the data provider should be established and a long-term data access agreement concluded. In addition, a comprehensive legal review of the specific combination of BDS and BOSP should be undertaken. The possibilities for accessing the data under current or forthcoming legislation should be assessed.

3.1.5. Mitigation


If there are alternative BDSs that could be used for the BOSP, they could be explored instead. If there is no way to produce the BOSP without the BDS and the lack of access cannot be overcome, the effort must be stopped and the new BOSP will not be produced.

3.2. Loss of access to data
3.2.1. Description


This is the risk that the statistical office loses access to the BDS underlying an existing BOSP.

3.2.2. Probability


If the BOSP is already in production, there is usually some stability, and in some cases the risk can be very low (1). However, particularly in the case of private entities with which insufficiently firm agreements have been concluded, nothing prevents, for example, new management from changing the data provision policy, which leads to a moderate risk of disruption (3). Moreover, if the BDS is associated with a volatile business, there is always a risk that the provider simply goes bankrupt, and the risk may be even higher (4).

3.2.3. Impact


Since it may no longer be possible to produce the existing BOSP, a very strong impact (5) is common. In other cases, where the BDS plays an auxiliary role, the impact is rather a loss of quality, in the range of 2-3.

3.2.4. Prevention


The prevention strategy is similar to the strategy of lack of access to data, but with an increased emphasis on constant vigilance also in the production environment.

Not putting all your eggs in one basket (i.e. having multiple BDSs underlying each BOSP) can also be a strategy, but it may be either impractical or too expensive.

3.2.5. Mitigation


If the BDS stems from a volatile business, it is possible that a new BDS reflecting the same societal phenomenon gradually becomes available. However, initiating a "market scan" only once the BOSP breaks down would be too late; constant vigilance is required, and this can be difficult to achieve.

4. Legal risk


4.1. Failure to comply with relevant legislation
4.1.1. Description


This is the risk that a project developing a BOSP fails to take relevant legislation into account, making the BOSP non-compliant with that legislation. This may concern data protection legislation, regulation of response burden, etc.

4.1.2. Probability


Given the OSC's unfamiliarity with big data, occasional (3) non-compliance may well occur. The probability is typically tied to the BDS: the less "sensitive" the source, the lower the likelihood of creating non-compliance.

4.1.3. Impact


The impact is usually critical (4), in the sense that a non-compliant BOSP would have to be taken out of production (or, if it has not yet reached the implementation stage, its development would have to be stopped). It can even be extreme (5), since the reputational risks arising from non-compliant ("illegal") official statistics can have consequences.

4.1.4. Prevention


For any BOSP, a thorough legal analysis is necessary, and this at several stages (what is acceptable at the development/exploration stage may not be acceptable at the implementation/production stage). This may in turn lead to re-engineering the BOSP to make it compliant.

4.1.5. Mitigation


Depending on the severity of the non-compliance, the first step may be to take the BOSP offline.

Re-engineering the BOSP to make it compliant may be an option, but whether the BOSP can be "saved" in this way depends heavily on the nature of the non-compliance.

4.2. Adverse changes in the legal environment
4.2.1. Description


New legislation relevant to the BOSP may be introduced which effectively makes the BOSP non-compliant.

4.2.2. Probability


It is possible that proponents of enhanced data protection will be able to introduce new requirements that directly or indirectly affect the ability to create specific BOSPs. Probability in the range of 2-3 seems a realistic estimate.

4.2.3. Impact


The impact is usually critical (4), in the sense that non-compliant production will require the BOSP to be shut down.

4.2.4. Prevention


A certain amount of business intelligence should be carried out regularly to monitor legislative developments, possibly also in order to influence them by making the case for official statistics in the relevant (for example, advisory) forums.

4.2.5. Mitigation


Provided that proactive monitoring has been carried out, there may be time for BOSP reengineering to bring it into line with the new legislation from the first day of its entry into force.

If, on the other hand, no monitoring was carried out, so that the new legislation "came as a surprise", or if the legislation is so radical that there is no way to make the BOSP compliant, the only option is to take the BOSP offline.

5. Risks associated with data privacy and security


5.1. Violations of data security
5.1.1. Description


This risk relates to unauthorized access to data stored at statistical offices. Third parties may obtain data that is under embargo, for example because of the release calendar (9 For any BOSP that is based entirely on a single BDS, it is inevitable that the data are implicitly known to the original data owner and, if the methodology is transparent, so are the derived statistics. That situation is not addressed here, but rather under the risk of abuse of position by the data owners.) (10 In addition, such data may carry a risk of confidentiality violation. That risk is considered separately.). This may be, for example, data that stock market investors are waiting for.

5.1.2. Probability


Regarding the technical aspects of protecting the IT environment in the statistical office, the risk is as likely for BDSs as for traditional sources. However, there are two additional aspects that must be considered.

Firstly, with some BDSs, the overall risk is slightly increased due to the fact that the data security of the original owner may be compromised. This may be due, for example, to industrial espionage or hacking.

Secondly, once potentially valuable data are stored at the office, the risk of attracting malicious interest increases. If the stored data are of very high business value, one should be prepared for a high probability of attacks on the IT infrastructure, so the probability of a breach may potentially be higher (4).

If the stored data is not perceived as having value, the overall probability does not appear to be very high - from (1) to (3) depending on the data source.

5.1.3. Impact


The potential reputational damage can be great (5). What matters in the BDS case is that if the security breach occurs at the original owner, the impact on the reputation of the statistical office is expected to be lower than if the breach concerned data stored at the office itself.

On the other hand, it is possible that a breach at the statistical office has negative consequences for the original owner. In this case, a strong negative impact is again possible, due to the damage to trust between the supplier and the statistical office (5).

5.1.4. Prevention


What is specific to the BDS case is that the security procedures of the original owner may be relevant. It is unlikely that statistical offices will have the authority to audit them. Owners whose data are used to produce releases with sensitive publication calendars should be made aware of the implications that potential security breaches on their premises have for official statistics, and formal assurance should be obtained that proper security procedures are in place.

A direct way to prevent a serious impact of a security breach in the owner’s premises on the statistical office is to use multiple sources for the same product so that one compromised source is not enough to get the final figure. The advantage of this approach is that more control is in the hands of the statistical office.

A way to prevent negative consequences for the original data owner from a security breach at the statistical office is to find a way of working that does not involve transferring data that are potentially sensitive from the owner's point of view to the statistical office in raw form. A possible preventive approach is to use aggregated data. It should be remembered, however, that some forms of aggregation, for example those designed to prevent identification of individual members of the population, may not be sufficient in this case. One reason is that the risk to the owner relates to the commercial value of the data, which can remain significant even after anonymization.

5.1.5. Mitigation


In the case of a breach of data managed by the statistical office, the mitigation measures are the same as for traditional sources, provided there is no negative impact on the original owner.

In the event of negative consequences for the original owner, the statistical office should review and strengthen its security procedures and clearly communicate and demonstrate its commitment to this.

If the breach occurred on the premises of the original owner, the statistical office concerned should communicate clearly about the situation and insist on improvements to the owner's security procedures. If necessary, an alternative supplier can be sought.

5.2. Data privacy breaches


5.2.1. Description


This is the risk that the confidentiality of one or more persons in the statistical population is violated. This may result from an attack on the IT infrastructure, from pressure from other government agencies, or from inadequate statistical disclosure control.

5.2.2. Probability


As with the risk of data security breaches, the specifics of storing microdata do not change much with the addition of BDS. However, there are some caveats.

Microdata from certain data sources can be of high business value, so storing them will increase the likelihood of attacks.

In addition, some microdata can be potentially very useful for other government agencies, such as law enforcement, taxation, or healthcare. In certain circumstances, adherence to the principle of statistical confidentiality may come under great pressure.

As for failures in statistical disclosure control, established practice already exists. However, BDS may allow statistics to be produced for small subgroups of the population, or make it possible to link aggregated data from different BDSs, which may increase the likelihood of this risk. In addition, new sources will require new methodological developments, so a real danger is that the disclosure control methodology is not updated accordingly.

In general, with reasonable preventive measures the probability can be kept at acceptable levels, but since many different and diverse factors are involved, a cautious assessment here seems to be that the probability is high (4).

5.2.3. Impact


The potential reputational damage can be great (5). As with the data security breach risk, a breach at the statistical office can have negative consequences for the original owner. Here the impact of such an event can potentially be even greater, especially if current trends in public opinion continue. The damage to the relationship between the data provider and the statistical office is also expected to be very large.

5.2.4. Prevention


A sure way to prevent this risk is not to hold BDS microdata at all (although holding other microdata still carries the corresponding risk, albeit with different probability and impact). As with the data security breach risk, this implies the need to develop other ways of using the data for statistical purposes. Moreover, the different nature of the sources means that new methodologies will have to be developed, with the competing goals of extracting as much useful information as possible and keeping privacy out of danger.

In the case of microdata storage, IT security and access control mechanisms must be at the required level and constantly monitored. Particular attention must be paid to ensuring the security of new methods of obtaining data. Ironically, this new way could be the physical transportation of storage devices (such as hard drives). If this method is used, then the delivery must be physically secure and encryption must be used.

5.2.5. Mitigation


The mitigating measures here are basically the same as in the case of data security breaches. If the cause of the breach is pressure from another government agency, the opportunity should be taken to strengthen institutional independence so that such breaches become more difficult in the future.

5.3. Manipulations with a data source
5.3.1. Description


Data obtained from third-party providers, such as social media data or voluntarily provided data, are at risk of manipulation. This can be done either by the data provider itself or by third parties. For example, a large number of fake posts could be generated on social networks in order to push a statistical index derived from those data in one direction or another, if it is known that the index is computed from such data.

For voluntarily provided data, there may be times when volunteers represent a specific interest group with a specific agenda.

5.3.2. Probability


For data whose manipulation can bring large benefits, the probability is higher. This may be data from which statistics of interest to, for example, the stock market are derived. In light of the recent scandals around LIBOR and Forex, it can be assumed that as long as there is an incentive, attempts to manipulate data are likely.

For statistics based on voluntarily provided data, one only needs to look at the recent PR practice of hiring people who pretend to hold a certain opinion and are paid to express it publicly (for example, on Internet forums) to conclude that the probability is not small. Overall, a figure of 3 to 4 seems adequate.

5.3.3. Impact


The big problem with manipulation is that it can go undetected for a long time. If the manipulation continues for long, the impact on quality can become significant. In addition, the damage to public confidence in official statistics can also be large, especially if the role of statistical offices as providers of quality data is publicly emphasized. On the other hand, if manipulation is detected in time and then made public, this can actually improve public perception. Except in extremely bad cases, an impact of at most (3) can be assumed.

5.3.4. Prevention


Performing regular checking exercises against alternative sources is one possible preventive approach. These alternative sources may be traditional or otherwise. Basing statistics on a combination of sources can also prevent manipulation from having a significant effect. Where provider-initiated manipulation is feared, legal agreements can be another way of preventing such practices.

5.3.5. Mitigation


In terms of damage to public relations, the mitigating measures that should be taken here are not much different from measures to combat any crisis.

In terms of data quality, it would be helpful if past data could be corrected so that, even with a long delay, a correct series could be produced. Regular benchmarking may help here. Note that the goal of benchmarking in this case differs slightly from that of prevention: for prevention it is important to quickly spot and investigate a suspicious mismatch between the benchmark and the BDS, while for mitigation old benchmark data are always useful.
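A minimal sketch of the benchmarking idea discussed above: compare the BDS-based series with a benchmark series and flag periods where the relative deviation exceeds a tolerance. The 10% tolerance and the example figures are assumptions for illustration; detecting real manipulation would of course require proper statistical testing.

    # Illustrative benchmarking sketch: flag periods where the BDS-based series
    # deviates suspiciously from a benchmark series. The tolerance and the
    # example values are assumptions, not part of the original paper.

    def suspicious_periods(bds_series, benchmark_series, tolerance=0.10):
        """Return indices where the relative deviation exceeds the tolerance."""
        flagged = []
        for i, (bds, bench) in enumerate(zip(bds_series, benchmark_series)):
            if bench != 0 and abs(bds - bench) / abs(bench) > tolerance:
                flagged.append(i)
        return flagged

    # Example: a sustained upward push in the BDS-based index from period 3 on.
    bds = [100, 102, 101, 115, 118, 121]
    benchmark = [100, 101, 102, 103, 104, 105]
    print(suspicious_periods(bds, benchmark))  # -> [3, 4, 5]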

In addition, care should be taken to prevent similar manipulations in the future - in particularly delicate cases, this may mean receiving potentially redundant data from several suppliers for comparative analysis.

5.4. Adverse public perception of the use of big data by official statistics
5.4.1. Description


The media and the general public are very sensitive to issues of privacy and the use of personal data from big data sources, especially in the context of secondary use of data by government agencies that take administrative or legal action against citizens. An example of negatively perceived use is the placement of speed traps based on the analysis of navigation data (11 See www.theguardian.com/technology/2011/apr/28/tomtom-satnav-data-police-speed-traps).
The specific case of TomTom Netherlands caused a significant drop in demand for TomTom devices and led to the company's decision to restrict access to the data. In this particular case, the data did not relate to individuals but to speed levels along road sections.

However, there may be applications with big data that are well received by the public. One example is applications that prevent crimes such as burglary based on big data methods.

Positive as well as negative public opinion can have a strong impact on the use of BDS in the context of the production of official statistics.

The consequence of negative public perception may be that:

  • BDS will no longer be available to statistical offices, either due to data provider decisions or government decisions not to use data, or
  • data usage will be restricted, which may hamper the production of certain BOSPs.

5.4.2. Probability


Factors that may affect the likelihood of such an event or its impact on the production of statistics include:

  • the confidentiality of the data, i.e. how easily people can be identified;
  • the amount of information disclosed about individuals, which is increased, for example, by linking data from different sources;
  • the type of data; for example, financial transactions are perceived as more sensitive than other data;
  • the type of action that could potentially be taken against citizens, for example fining people for speeding;
  • an unclear legal environment in which data providers and users operate, or legal conditions that conflict with public ethical opinions and standards.

Estimating when adverse events will occur is not possible, since public mobilization is often triggered by media coverage of events that negatively affect citizens. However, with the increasing use of big data by governments and private enterprises, and especially with the active marketing of data for purposes other than those for which they were originally collected, such events become more likely.

Events that strongly influence public perception are not frequent; the likelihood is rather occasional (3) to remote (2). With increasing use of big data sources, the likelihood will also increase.

5.4.3. Impact


The impact of the event depends heavily on the factors discussed above. In general, the impact is more serious for an already established statistical production, since production might have to be discontinued. The impact also depends on the availability of alternative data sources, although it may happen that public perception does not distinguish between different data sources once the event materializes. In the current state of big data use, it seems that these sources cannot completely replace traditional data sources but rather complement existing statistics; this reduces the impact of such events. Therefore, the impact of the event is assessed in the range from 2 (minor) to 3 (major). At the production stage, the impact may rise to 4 (critical).

5.4.4. Prevention


A preventive measure could be to define ethical guidelines for big data in official statistics. Such guidelines should build on principles such as the European Statistics Code of Practice or the Fundamental Principles of Official Statistics (12 unstats.un.org/unsd/dnss/gp/fundprinciples.aspx). A next step would be to define a communication strategy that makes the ethical guidelines public and can be used to inform stakeholders about the ethical use of a BDS for a BOSP.

A separate risk assessment for a specific BDS can be conducted to identify risks and suggest preventive or mitigating actions based on ethical principles. A separate risk assessment may also include stakeholders, such as data protection agencies, to ensure that all risks are identified and actions are justified.

5.4.5. Mitigation


The communication strategy should also include measures for the event of growing negative public sentiment. The separate risk assessment should collect positive examples of data use as well as measures preventing data misuse. Some actions may necessarily have to be taken at the political level, and the statistical community may not be able to influence them effectively.

5.5. Loss of trust - statistics not obtained by direct observation
5.5.1. Description


Users of official statistics usually have high confidence in the accuracy and reliability of the statistics. This is based on the fact that statistical production is embedded in a sound and accessible methodological framework, together with documentation on the quality of the statistical product. In addition, most statistics are based on observations, i.e. obtained from surveys or censuses, which establishes an easily understood relationship between observation and statistics. Using a BDS that was not collected for the primary purpose of producing statistics carries the risk that this relationship is lost and users lose confidence in official statistics. An example from the last census round (2010) is that in some countries statistics were obtained using several sources and statistical models; in a number of cases, stakeholders disputed those statistics.

5.5.2. Probability


The likelihood of this risk depends on factors such as the complexity of the statistical/methodological model, the soundness of the relationship between the BDS and the BOSP, or the consistency with other statistics. The probability should be in the range of 3 (occasional) to 4 (probable), meaning that this can happen several times or often.

5.5.3. Impact


The impact of this risk materializing will largely depend on whether NSOs can successfully demonstrate the accuracy and reliability of the statistics. If this cannot be achieved, the loss of trust and confidence may also affect other statistical domains, i.e. cast doubt not only on some statistical outputs but on the organization itself. NSOs would lose their competitive advantage over private organizations active in this area.

5.5.4. Prevention


Preventive actions consist of developing and publishing a scientifically sound methodology that is recognized by the scientific community, enriching the data with quality metadata, ensuring the consistency of BOSP with non-BOSP statistics, and carrying out strict quality control.

Before embarking on statistical production, the BOSP could be published as an experimental one, and interested parties would be encouraged to challenge the BOSP in order to validate or improve the BOSP.

5.5.5. Mitigation


Two cases have to be distinguished. If the statistics are disputed but of high or sufficient quality (correct and accurate), it would be sufficient to explain and communicate the statistics to the public, providing simple, easily understood examples.

6. Skills Risks


6.1. Lack of staff with the necessary skills
6.1.1. Description


The analysis of digital traces left by people in the course of their activities requires certain data analysis skills which are currently not the most common in official statistics. First, the use of indirect data on people's activities, instead of asking them directly in surveys, requires the use of statistical models and therefore skills in model-based inference and machine learning. Second, these digital traces often do not have the usual tabular format of survey results, with rows corresponding to statistical units and columns to specific characteristics of those units; digital traces also come in the form of text, sound, image and video. Extracting relevant statistical information from these data types requires skills in natural language processing, audio processing and image processing. Third, these data sources tend to yield massive data sets, whose processing requires a good understanding of distributed computing methodologies.

The risk is that the statistical office obtains data from one of these new big data sources but is unable to process and analyse it properly because its staff lack the necessary skills.

6.1.2. Probability


The probability of this risk will depend on three factors: 1) the specific types of skills needed for each type of big data source, and the likelihood that the statistical office will have the opportunity to explore such a source; 2) the current availability of the necessary skills in the statistical office; and 3) the organizational culture of the statistical office.

Regarding the types of skills that may be required, it should be noted that not all sources require all the skills listed above. Some (for example, data like Google Trends) require neither distributed computing skills, as they arrive already pre-processed by the data holder, nor signal processing skills; they will mainly require statistical modelling skills. However, there is a wide variety of big data sources, most of which do require skills in distributed computing, signal processing and machine learning. Moreover, a proper exploration of these digital traces will require processing several sources. Thus, there is a high probability that big data sources becoming available to a statistical office will require these uncommon skills, and the likelihood of this risk is very high (5).

Regarding the current availability of the necessary skills, this will depend on the particular statistical office. Even if model-based inference is less common than survey methodology, it is also used in official statistics in certain domains. Therefore, even if some redeployment of human resources may be required, statistical offices can find a solution on their own. As for distributed computing skills, which are mainly IT-related, this will depend on how the IT infrastructure is managed in the organization; depending on the degree to which IT is outsourced, solutions may be found within existing arrangements. However, signal processing and machine learning skills generally do not exist in most official statistical offices, and their application cannot be outsourced, as they should be applied by statistical domain experts. From this point of view, therefore, the likelihood of this risk also seems very high (5).

Organizational culture will also influence the likelihood of this risk. Having staff ready to acquire the necessary skills through self-learning can enable an organization to respond to a situation with a new data source that requires skills other than normal. This will depend on the organizational culture of the statistical office, namely on whether it will encourage employees to learn new skills and whether this allows employees time for self-study.

Thus, the likelihood that the statistical office will not be able to process and analyze new data sources due to lack of skills among its employees will be between probable (4) and frequent (5) depending on the organization’s self-learning culture.

6.1.3. Impact


A statistical office being unable to process and analyse big data sources due to a lack of skills among its staff can have two possible negative consequences: 1) the data source is not exploited, at least not fully; 2) the source is used incorrectly.

Not being able to fully exploit the potential of a valuable big data source will have a minor impact (2) in the short term, as statistical offices do have the statistical tools to meet current needs. However, in the long run (and possibly even in the medium term), the consequences of missing this opportunity will be critical (4), as statistical offices increasingly face competition from private providers which do not have the same institutional framework guaranteeing statistical independence to society.

Incorrect use of the source, on the other hand, would have extremely negative consequences for statistical offices, since official statistics depend heavily on their reputation to carry out their mission. However, it can be argued that the most important skill which, if lacking, can lead to incorrect results is statistical inference, in particular model-based inference, which is also the least likely to be absent. Therefore, the expected impact will be critical (4) rather than extreme.

6.1.4. Prevention


Statistical offices can actively prevent this risk in two ways: 1) training; and 2) recruitment.

Statistical offices can provide staff with the necessary skills by identifying in detail the skills needed to use large data sources in statistical production, compiling a list of existing staff skills, identifying training needs, and then organizing training courses.

Statistical offices can also recruit new employees with the necessary skills. This seems to have serious limitations, since statistical offices will not be able to recruit a critical mass of personnel for a situation where the use of large data sources will be widespread in the office and new employees will still need several years to reach the level of experience of existing employees. However, at least some of the new staff recruited as part of a regular staff upgrade may have big data skills.

6.1.5. Mitigation


Faced with a situation where new sources of big data are available without employees with the necessary skills, statistical offices can mitigate the negative effects in two ways: 1) subcontracting; and 2) cooperation.

Statistical offices can enter into agreements for data processing and analysis of new sources of big data with other organizations that provide these types of services. This seems to be a viable solution, as a new sector of enterprises specializing in processing this type of data appears. However, this is a decision which in itself carries certain risks, since the statistical office will have less control over the production of potentially sensitive statistical products. This solution also has the disadvantage that it does not allow employees of the statistical office to learn and acquire the necessary skills.

Collaboration with other organizations that have employees with the necessary skills and who are also interested in exploring the source of big data seems to be a more promising solution. This cooperation can take the form of joint projects with employees of the statistical office and employees of other organizations on an equal footing, who share their knowledge. This would not only reduce the risk of lack of skills, but also allow the statistical office to acquire these skills.

6.2. Leak of experts to other organizations
6.2.1. Description


This risk is that statistical agencies lose their staff to other organizations after they have acquired skills related to big data.

6.2.2. Probability


The probability of this risk will depend on two factors: 1) the existing attractive opportunities in organizations outside of official statistics; 2) working conditions in statistical offices.

As for opportunities in organizations outside official statistics, the likelihood of this risk seems probable (4). There is high demand for people with big data skills in the private sector as well as in other public sector organizations. After acquiring big data skills, official statisticians will have a comparative advantage, being experienced statistical experts as well. Besides specific big data skills, other organizations also need data specialists with more traditional skills, such as assessing user needs and developing key performance indicators (KPIs), which are common among official statisticians. In addition, the employees most likely to learn new skills are also expected to be those more open to career changes and to leaving the statistical office.

As for working conditions in statistical offices, this will obviously depend mainly on the particular office. However, statistical offices in general still offer attractive professional opportunities for people with a quantitative background: they offer a very wide range of possible work domains and a very large selection of data to work with. This somewhat reduces the likelihood that statistical offices will lose their staff (3).

6.2.3. Impact


The impact of this risk will be the same as the risk of a lack of personnel with relevant skills in the first place. Therefore, the impact will be critical (4), as indicated above.

6.2.4. Prevention


Apparently, the only way for statistical offices to prevent this risk is to provide attractive working conditions for their staff. This is true for all staff in general. However, in the specific case of employees who are open to mastering new skills, namely big data skills, working conditions can be improved by offering training opportunities in which they can develop their professional interests. Statistical offices can also take particular care to be open to new, innovative projects and ideas related to new big data sources coming from statisticians working across the various statistical domains. Finally, preventing the loss of staff to other organizations as a consequence of their newly acquired big data skills will depend on properly identifying staff who are able and willing to work with such data, and on providing them with good opportunities for professional development.

6.2.5. Mitigation


Mitigation of this risk will be the same as for the risk of lacking staff with the appropriate skills in the first place: 1) subcontracting; and 2) cooperation.

7. Discussion


From this first review, it is obvious that no single probability or impact can be established for a given "big data risk"; as a rule, both indicators largely depend on the big data source as well as on the "big data-based official statistics product".

Thus, we conclude that the logical next step in this direction is to take a number of possible pilot projects (each involving a combination of one or more BDSs and one or more BOSPs) as a starting point and, for each such pilot, to seek to assess the likelihood and impact of each risk.

To this end, we are about to launch a stakeholder survey, seeking the OSC's assessment of the likelihood and impact (and possible mitigation/management actions) for a number of possible pilot projects, and asking the OSC for suggestions of risks that we have not included in this document.

8. REFERENCES
UNECE (2014), «A suggested Framework for the Quality of Big Data», Deliverables of the UNECE Big Data Quality Task Team, www1.unece.org/stat/platform/download/attachments/108102944/Big%20Dat
a%20Quality%20Framework%20-%20final-%20Jan08-2015.pdf?version=1&modificationDate=1420725063663&api=v2

UNECE (2014), «How big is Big Data? Exploring the role of Big Data in Official Statistics», www1.unece.org/stat/platform/download/attachments/99484307/Virtual%20Sprint%20Big%20Data%20paper.docx?version=1&modificationDate=1395217470975&api=v2

Daas, P., S. Ossen, R. Vis-Visschers, and J. Arends-Toth, (2009), Checklist for the Quality evaluation of Administrative Data Sources, Statistics Netherlands, The Hague/Heerlen

Dorfman, Mark S. (2007), Introduction to Risk Management, Cambridge, UK: Woodhead-Faulkner, p. 18, ISBN 0-85941-332-2

Eurostat (2014), «Accreditation procedure for statistical data from non-official sources» in Analysis of Methodologies for using the Internet for the collection of information society and other statistics, www.cros-portal.eu/content/analysismethodologies-using-internet-collection-information-society-and-other-statistics-1

Reimsbach-Kounatze, C. (2015), “The Proliferation of “Big Data” and Implications for Official Statistics and Statistical Agencies: A Preliminary Analysis”, OECD Digital Economy Papers, No. 245, OECD Publishing. dx.doi.org/10.1787/5js7t9wqzvg8-en

Reis, F., Ferreira, P., Perduca, V. (2014) «The use of web activity evidence to increase the timeliness of official statistics indicators», paper presented at IAOS 2014 conference, iaos2014.gso.gov.vn/document/reis1.p1.v1.docx

Even if not explicitly mentioning risks, this paper in fact addresses many of the risks associated with the use of web activity data for official statistics.

Eurostat (2007), Handbook on Data Quality Assessment Methods and Tools, ec.europa.eu/eurostat/documents/64157/4373903/05-Handbook-ondata-quality-assessment-methods-and-tools.pdf/c8bbb146-4d59-4a69-b7c4-218c43952214

