Choosing a database for analytics



When your analytics needs outgrow pre-built tools, it's probably time to choose a database for analytics. You should not point analytics scripts and queries at your production database: you risk disturbing the data and, most likely, slowing down the application.
And if analysts or engineers are working directly in production, they can accidentally delete important information.

For analytics, you need a separate kind of database. But which one is right?

In this post we will look at recommendations and best practices for an average company that is just getting started. Whatever setup you choose, you can make trade-offs later to improve performance beyond what we discuss here.

Working with a large number of customers, we have found that the most important criteria to consider are:

  • The type of data you analyze
  • How much data you have
  • The focus of your engineering team
  • How quickly you need the data

What types of data do you analyze?


Think about the data you want to analyze. Does it fit neatly into rows and columns, like a huge Excel spreadsheet? Or would it make more sense pasted into a Word document?

If you answered Excel, a relational database such as Postgres, MySQL, Amazon Redshift, or BigQuery will suit your needs. These structured, relational databases are great when you know exactly what data you will receive and how it all relates: essentially, how rows and columns are connected. For most types of user analytics, relational databases work well. User attributes such as names, emails, and billing plans fit neatly into a table, as do user events and their properties.
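As a sketch, here is how such data might look in a relational schema (all table and column names here are hypothetical):

    -- Hypothetical schema for user analytics in a relational database.
    CREATE TABLE users (
        user_id      VARCHAR(64) PRIMARY KEY,
        name         VARCHAR(255),
        email        VARCHAR(255),
        billing_plan VARCHAR(32)
    );

    -- Each user event becomes a row, with its properties as columns.
    CREATE TABLE events (
        user_id     VARCHAR(64) REFERENCES users (user_id),
        event_name  VARCHAR(255),
        occurred_at TIMESTAMP
    );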

On the other hand, if your data fits better into a document, you should look at a non-relational (NoSQL) store such as Hadoop or MongoDB.

Non-relational databases shine when you have an extremely large number (millions) of semi-structured data points. Classic examples of semi-structured data are free text such as email, books, and social media posts, as well as audio-visual and geographic data. If you are doing a lot of text mining, language processing, or image processing, non-relational data stores will most likely be what you need.



How much data are you dealing with?


The next question to ask yourself is how much data you are dealing with. The more data you have, the more useful a non-relational database becomes, because it imposes no structural constraints on incoming data and therefore lets you write to the database faster.



These limits are not strict, and each database can handle more or less data depending on various factors, but we have found that each works best within the following ranges.

If you have less than 1 TB of data, Postgres will give you good performance, but it slows down at around 6 TB. If you like MySQL but need a bit more scale, Aurora (Amazon's proprietary version) can go up to 64 TB. At petabyte scale, Amazon Redshift is usually a good choice, since it is optimized for analytics up to about 2 PB. For parallel processing or even more data, it is probably time to look at Hadoop.

That said, AWS told us that Amazon.com runs on Redshift, so if you have a first-class team of DBAs you may be able to scale beyond the 2 PB “limit”.

What is your engineering team focused on?


This is another important question to ask yourself when choosing a database. The smaller your team, the more likely it is that your engineers are focused on building the product rather than on data pipelines and their upkeep. The number of people you can dedicate to these projects will greatly affect your options.

With some engineering resources you have more choices: you can go with either a relational or a non-relational database. Relational databases take less time to maintain than NoSQL ones.

If you have a few engineers available for the initial setup but no one to dedicate to maintenance, something like Postgres, Google Cloud SQL (a hosted MySQL option), or Segment Warehouses (hosted Redshift) is probably a better option than Redshift, Aurora, or BigQuery, which require periodic data pipeline fixes. If you have more time for maintenance, choosing Redshift or BigQuery will give you faster queries at larger scale.

Relational databases have another advantage: you can use SQL to query them. SQL is well known to both analysts and engineers, and is easier to learn than most programming languages.
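For example, a question such as “how many users signed up each week?” takes only a few lines of SQL (the users table here is hypothetical):

    -- Weekly signup counts from a hypothetical users table.
    SELECT date_trunc('week', created_at) AS week,
           count(*)                       AS signups
    FROM users
    GROUP BY 1
    ORDER BY 1;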

Analyzing semi-structured data, on the other hand, usually requires at least object-oriented programming experience or, better, experience writing big-data code. Even with the advent of analytics tools like Hunk for Hadoop or SlamData for MongoDB, you will need an experienced analyst or data scientist to analyze these kinds of databases.

How fast do you need this data?


While “real-time analytics” is very popular for use cases such as fraud detection and system monitoring, most analyses do not require real-time data or immediate insight.

When you answer questions such as what causes user churn or how people move from your app to your website, accessing your data with a slight delay (at hourly or daily intervals) is perfectly acceptable. Your data does not change that much minute by minute.
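For instance, a churn-style question can be answered from hours-old data; here is a rough sketch against a hypothetical events table:

    -- Users active in January but not in February: a rough churn proxy.
    SELECT user_id FROM events
    WHERE occurred_at >= DATE '2016-01-01' AND occurred_at < DATE '2016-02-01'
    EXCEPT
    SELECT user_id FROM events
    WHERE occurred_at >= DATE '2016-02-01' AND occurred_at < DATE '2016-03-01';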

Therefore, if you are mostly doing after-the-fact analysis, you should look at a database optimized for analytics, such as Redshift or BigQuery. Such databases are designed to hold a large amount of data and to read and join it quickly, which makes queries fast. They can also load data fairly quickly (hourly), as long as someone handles vacuuming, resizing, and monitoring the cluster.
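For instance, hourly batch loads into Redshift are typically done with the COPY command; in this sketch the S3 path and IAM role are hypothetical:

    -- Bulk-load an hourly batch of JSON events from S3 into Redshift.
    COPY events
    FROM 's3://my-analytics-bucket/events/2016/03/01/12/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS JSON 'auto'
    TIMEFORMAT 'auto';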

If you absolutely need real-time data, you should look at an unstructured database such as Hadoop. You can design a Hadoop database to load data very quickly, although queries against it may take longer depending on RAM usage, available disk space, and how the data is structured.

Postgres vs. Amazon Redshift vs. Google BigQuery


You have probably already gathered that a relational database is the best choice for analyzing most kinds of user behavior. Information about how your users interact with your site and apps fits easily into a structured format.

    analytics.track('Completed Order')  →  select * from ios.completed_order



So the question is: which SQL database should you use? There are four criteria to consider.

Size vs. speed


When you need speed, Postgres is worth considering: for databases under 1 TB, Postgres is quite fast at both loading and querying. Plus, it's free. As you approach 6 TB (the storage limit on Amazon RDS), your queries will slow down.

Therefore, when you need more scale, we usually recommend Redshift. In our experience, Redshift offers the best price-to-performance ratio.

The SQL dialect


Redshift is built on a variant of Postgres, and both support good old SQL. Redshift does not support every data type and function that Postgres does, but it is much closer to the industry standard than BigQuery, which has its own SQL dialect.

Unlike many other SQL-based systems, BigQuery's legacy dialect uses comma-separated tables to mean a table union, not a join as in the SQL standard. This means that, without care, familiar SQL queries can fail or return unexpected results, which is why many of the teams we have met struggle to convince their analysts to learn BigQuery's SQL.
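To illustrate the difference (dataset and table names here are hypothetical):

    -- BigQuery legacy SQL: the comma UNIONs the two tables.
    SELECT user_id FROM [mydataset.events_2015], [mydataset.events_2016]

    -- Standard SQL: the comma is an implicit CROSS JOIN, a very different query.
    SELECT a.user_id FROM events_2015 a, events_2016 b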

Third-party ecosystem


Your data warehouse rarely lives on its own. You need to get data into the database, and you need some kind of software to analyze it (unless you plan to run SQL queries from the command line).

This is why people often like that Redshift has a very large ecosystem of third-party tools. AWS has options such as Segment Warehouses for loading data into Redshift from an analytics API, and Redshift also works with almost every data visualization tool on the market. Fewer third-party services connect to Google, so pushing the same data into BigQuery may take more development time, and you will have fewer options for BI software.

You can see Amazon's partners here and Google's here.

However, if you already use Google Cloud Storage instead of Amazon S3, staying in the Google ecosystem may work in your favor. Both services make it easy to load data that already lives in their respective cloud storage, so while it is not a deal-breaker, it is definitely easier if you are already on one of those providers.

Getting set up


Now that you have a clearer idea of which database to use, the next step is to figure out how you will get your data into it.

Many people new to databases underestimate how hard it is to build a scalable data pipeline. You have to write your own extraction layer, data collection API, and query and transformation layers, and each of them has to scale. You also need to work out the right schema, down to the size and type of each column. The MVP is replicating your production database to a new instance, but that usually means using a database that is not optimized for analytics.
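As an illustration of the schema step, here is what a hypothetical events table might look like in Redshift, with column sizes and keys chosen deliberately (all names are illustrative):

    -- Hypothetical Redshift table: DISTKEY and SORTKEY control how rows
    -- are distributed across nodes and ordered on disk for fast scans.
    CREATE TABLE completed_orders (
        user_id     VARCHAR(64)    NOT NULL,
        order_id    VARCHAR(64)    NOT NULL,
        revenue_usd DECIMAL(12, 2),
        received_at TIMESTAMP      NOT NULL
    )
    DISTKEY (user_id)
    SORTKEY (received_at);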

Fortunately, there are several options on the market that can help you get around some of these obstacles and handle the ETL for you automatically.

But whether you build or buy, getting your data into SQL is worth it.

With raw user data in a flexible SQL format, you can answer detailed questions about what your customers are doing, accurately measure attribution, understand cross-platform behavior, build company-specific dashboards, and much more.

Source: https://habr.com/ru/post/undefined/

