The tale of how we tamed BigQuery

Task


In fact, the task that I want to talk about is simple to gloom in its wording: it was necessary to visualize the sales data of the e-commerce department with little blood, i.e., read, almost for nothing.


What is meant by this? The baskets of our stores generate a constant stream of data on online sales in different regions of the world with all the consequences: different currencies, time zones, taxes, types of customers, types of items, orders, etc. In fact, the same is generated by any online store, only, perhaps, the options for the parameters of orders are slightly different.


It is required, sitting on this stream, to issue a certain aggregated dashboard, on which the business could watch online key indicators of the current moment and for long periods online (or rather, regularly). Moreover, it is desirable that these dashboards were somewhat different in varying degrees of detail and specialization.


Search for a solution


Naturally, we started by looking for a ready-made solution. We looked at such data visualizers as, for example, Owox, BI solutions from Microsoft, some enterprise solutions, etc.
All this is great served by advertising. But:


  • not cheap
  • a thing in itself (it’s not so easy to control the mechanics of data processing, and sometimes you just have to rely on the competencies of data scientists of the development company)

By that time (2019), I already had a small pocket data visualization projection in Google Datastudio, which had already existed for at least a year and reduced to the fact that I manually uploaded the report from 1C, uploaded it to the Google Storage Cloud (not to be confused with GoogleDrive), from there - to the BigQuery table, and then, in DataStudio, I did data slices and everything else.


— , , , , .


, , , , . Data Studio , BigQuery , , , , .


, , SQL BigQuery, , .


BigQuery (https://console.cloud.google.com/bigquery), , .


, : 300 . , 2 BigQuery, , .



. , , BigQuery , , . , !


, SELECT, , BQ.


, BQ , . , , , .


: SQL ==> Python ==> BigQuery.


, — .


, SELECT csv- GoogleStorage BQ , Google Functions, Python ( ) . , 6 ( ) .


, ?



, , , SQL . , Orders, , , , , , , id, id .. , , .


, , SELECT pandas dataframe. :


import pandas as pd
import MySQLdb
import MySQLdb.cursors

def sql_query(date0, date1):
    connection = MySQLdb.connect(host=”server”, user=”user”, passwd=”pass”, db=”orders”, charset=”cp1251”, cursorclass = MySQLdb.cursors.SSCursor)
    query = “SELECT * FROM orders WHERE date>=’”+date0+”’ and date<’”+date1+”’ ”
    data =  pd.read_sql_query(query, connection=connection)
#  query -   SELECT   ,     date0  date1,   , ,   5 .          -       .          ,      BQ
# , ,  connection      ,        ..       .

#     -  . ,        :

    data.payment_date = data.payment_date.apply(lambda x: x.tz_localize(‘Europe/Moscow’)).astype(str)

#     str -  .  ,      ,       .     BQ         .

# ,     pandas     BQ,           ,      .

    return data

, 5- , (90 , .. ) , sql_query(date) , BQ.


, , pandas dataframe .



: , BQ MERGE . BQ 1-2 . 20 70 , STRING FLOAT. 70000 .


, . BQ, .


from google.cloud import bigquery
from google.oauth2 import service_account

project_id = “MyBQproject”  #    BQ
dataset_id = “MyDataSet”  #     

#       BQ.       - ( ,    ).       BQ.  ,   ,    ‘Service accounts’  Google Cloud Platform.

credentials = service_account.Credentials.from_service_account_file(“acc.json”)
client = bigquery.Client(project_id, credentials=credentials)

def upload_temp_table(df, table_id, schema):
    job_config = bigquery.LoadJobConfig()
    job_config.schema = schema
   # job_config.autodetect = True  #     
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  #  
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()   #  

def merge_tables(temp_table, base_table, schema, head, merge_param):
    merge_query = “MERGE into “+base_table+” as T USING “+temp_table+” as S \
         ON T.”+merge_param+” = S.”+merge_param+ \
         “ WHEN MATCHED then UPDATE SET “+ \
         ‘,’.join([‘T.’+item+’=S.’+item for item in head])+ \
         “ WHEN NOT MATCHED then INSERT \
         (“+’,’.join(head)+”) VALUES (“+’,’.join([‘S.’+item for item in head])+”);”

#  ,         merge_param (,  id ,        )

#  -  

    job_config = bigquery.LoadJobConfig()
    job_config.schema = schema
    job = client.query(merge_query, job_config=job_config)
    job.result()

, — .


import datetime

for period in range(18):
    date0 = (datetime.datetime.now() - datetime.timedelta(days = period*5 + 5)).strftime(‘%Y%m%d’)
    date1 = (datetime.datetime.now() - datetime.timedelta(days = period*5)).strftime(‘%Y%m%d’)
    df = sql_query(date0,date1)   #    sql   >=date0, <date1
    upload_temp_table(df=df, table_id=”temp_table”, schema)  #    
    merge_tables(temp_table=’temp_table’, base_table = ‘orders’, schema=schema, head=[,,,], merge_param=’id’)

head — . . SQL , . MERGE- . BQ . - , Python , BQ.



.


import sys

log_file = open(“log.txt”, ‘a’)
old_stdout = sys.stdout
sys.stdout = log_file

, print
:


sys.stdout = old_stdout
log_file.close()

print -. , .



, BigQuery , DataStudio (https://datastudio.google.com/) .


, DataStudio , BigQuery Python. , , , apply.


, DataStudio (, ), , , , Datastudio .


, Datastudio Blended Tables, LEFT JOIN - . , , , BigQuery. , , , SQL-, BigQuery . , , 6 , , 6 . 20 70 3-4 BigQuery. .


, . , . - BigQuery, - Python. , merge_tables — BQ . , .. — . , .


, BigQuery , Google Spreadsheets. - id . :


  • Python pandas dataframe
  • BQ JOIN BQ (-, ).

, . - , , , .


. - -, . , , ( , F5). : Data Studio Auto-Refresh. : , View, Data Studio Auto-Refresh, , SQL=>BQ.


, , , :-)


All Articles