🉑 👩🏽‍🔧 📹 我们如何驯服BigQuery的故事 🐢 🐞 ☂️

任务

实际上，我要谈论的任务的措词很简单：必须以很少的精力可视化电子商务部门的销售数据，也就是说，几乎一无所获。

这是什么意思？我们商店的购物篮不断产生有关世界不同地区在线销售的数据，其后果包括：不同的币种，时区，税金，客户类型，物品类型，订单等。实际上，任何在线商店都会生成相同的内容，仅可能是订单参数的选项稍有不同。

需要坐在此流上，发布某个汇总的仪表板，企业可以在该仪表板上监视当前时刻的在线关键指标，并且可以长期在线（或定期）观看。此外，希望这些仪表板在细节和专业化程度方面有所不同。

寻找解决方案

自然，我们从寻找现成的解决方案开始。我们研究了诸如Owox，Microsoft的BI解决方案，某些企业解决方案等数据可视化工具。
所有这些都可以通过广告很好地实现。但：

不便宜
本身就是一件事情（控制数据处理的机制并不容易，有时您只需要依靠开发公司的数据科学家的能力即可）

到那时（2019年），我已经在Google Datastudio中有了一个小型的袖珍数据可视化投影，该投影已经存在了至少一年，归结为我从1C手动下载报告并将其上传到Google Storage Cloud的事实（不要与GoogleDrive），从那里到BigQuery表，然后在DataStudio中，我进行了数据切片和其他所有操作。

— , , , , .

, , , , . Data Studio , BigQuery , , , , .

, , SQL BigQuery, , .

BigQuery (https://console.cloud.google.com/bigquery), , .

, : 300 . , 2 BigQuery, , .

. , , BigQuery , , . , !

, SELECT, , BQ.

, BQ , . , , , .

: SQL ==> Python ==> BigQuery.

, — .

, SELECT csv- GoogleStorage BQ , Google Functions, Python ( ) . , 6 ( ) .

, ?

, , , SQL . , Orders, , , , , , , id, id .. , , .

, , SELECT pandas dataframe. :

import pandas as pd
import MySQLdb
import MySQLdb.cursors

def sql_query(date0, date1):
    connection = MySQLdb.connect(host=”server”, user=”user”, passwd=”pass”, db=”orders”, charset=”cp1251”, cursorclass = MySQLdb.cursors.SSCursor)
    query = “SELECT * FROM orders WHERE date>=’”+date0+”’ and date<’”+date1+”’ ”
    data =  pd.read_sql_query(query, connection=connection)
#  query -   SELECT   ,     date0  date1,   , ,   5 .          -       .          ,      BQ
# , ,  connection      ,        ..       .

#     -  . ,        :

    data.payment_date = data.payment_date.apply(lambda x: x.tz_localize(‘Europe/Moscow’)).astype(str)

#     str -  .  ,      ,       .     BQ         .

# ,     pandas     BQ,           ,      .

    return data

, 5- , (90 , .. ) , sql_query(date) , BQ.

, , pandas dataframe .

: , BQ MERGE . BQ 1-2 . 20 70 , STRING FLOAT. 70000 .

, . BQ, .

from google.cloud import bigquery
from google.oauth2 import service_account

project_id = “MyBQproject”  #    BQ
dataset_id = “MyDataSet”  #     

#       BQ.       - ( ,    ).       BQ.  ,   ,    ‘Service accounts’  Google Cloud Platform.

credentials = service_account.Credentials.from_service_account_file(“acc.json”)
client = bigquery.Client(project_id, credentials=credentials)

def upload_temp_table(df, table_id, schema):
    job_config = bigquery.LoadJobConfig()
    job_config.schema = schema
   # job_config.autodetect = True  #     
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  #  
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()   #  

def merge_tables(temp_table, base_table, schema, head, merge_param):
    merge_query = “MERGE into “+base_table+” as T USING “+temp_table+” as S \
         ON T.”+merge_param+” = S.”+merge_param+ \
         “ WHEN MATCHED then UPDATE SET “+ \
         ‘,’.join([‘T.’+item+’=S.’+item for item in head])+ \
         “ WHEN NOT MATCHED then INSERT \
         (“+’,’.join(head)+”) VALUES (“+’,’.join([‘S.’+item for item in head])+”);”

#  ,         merge_param (,  id ,        )

#  -  

    job_config = bigquery.LoadJobConfig()
    job_config.schema = schema
    job = client.query(merge_query, job_config=job_config)
    job.result()

, — .

import datetime

for period in range(18):
    date0 = (datetime.datetime.now() - datetime.timedelta(days = period*5 + 5)).strftime(‘%Y%m%d’)
    date1 = (datetime.datetime.now() - datetime.timedelta(days = period*5)).strftime(‘%Y%m%d’)
    df = sql_query(date0,date1)   #    sql   >=date0, <date1
    upload_temp_table(df=df, table_id=”temp_table”, schema)  #    
    merge_tables(temp_table=’temp_table’, base_table = ‘orders’, schema=schema, head=[,,,], merge_param=’id’)

head — . . SQL , . MERGE- . BQ . - , Python , BQ.

import sys

log_file = open(“log.txt”, ‘a’)
old_stdout = sys.stdout
sys.stdout = log_file

, print
:

sys.stdout = old_stdout
log_file.close()

print -. , .

, BigQuery , DataStudio (https://datastudio.google.com/) .

, DataStudio , BigQuery Python. , , , apply.

, DataStudio (, ), , , , Datastudio .

, Datastudio Blended Tables, LEFT JOIN - . , , , BigQuery. , , , SQL-, BigQuery . , , 6 , , 6 . 20 70 3-4 BigQuery. .

, . , . - BigQuery, - Python. , merge_tables — BQ . , .. — . , .

, BigQuery , Google Spreadsheets. - id . :

Python pandas dataframe
BQ JOIN BQ (-, ).

, . - , , , .

. - -, . , , ( , F5). : Data Studio Auto-Refresh. : , View, Data Studio Auto-Refresh, , SQL=>BQ.

, , , :-)

我们如何驯服BigQuery的故事

任务

寻找解决方案

More articles: