我们如何驯服BigQuery的故事

任务


实际上,我要谈论的任务的措词很简单:必须以很少的精力可视化电子商务部门的销售数据,也就是说,几乎一无所获。


这是什么意思?我们商店的购物篮不断产生有关世界不同地区在线销售的数据,其后果包括:不同的币种,时区,税金,客户类型,物品类型,订单等。实际上,任何在线商店都会生成相同的内容,仅可能是订单参数的选项稍有不同。


需要坐在此流上,发布某个汇总的仪表板,企业可以在该仪表板上监视当前时刻的在线关键指标,并且可以长期在线(或定期)观看。此外,希望这些仪表板在细节和专业化程度方面有所不同。


寻找解决方案


自然,我们从寻找现成的解决方案开始。我们研究了诸如Owox,Microsoft的BI解决方案,某些企业解决方案等数据可视化工具。
所有这些都可以通过广告很好地实现。但:


  • 不便宜
  • 本身就是一件事情(控制数据处理的机制并不容易,有时您只需要依靠开发公司的数据科学家的能力即可)

到那时(2019年),我已经在Google Datastudio中有了一个小型的袖珍数据可视化投影,该投影已经存在了至少一年,归结为我从1C手动下载报告并将其上传到Google Storage Cloud的事实(不要与GoogleDrive),从那里到BigQuery表,然后在DataStudio中,我进行了数据切片和其他所有操作。


— , , , , .


, , , , . Data Studio , BigQuery , , , , .


, , SQL BigQuery, , .


BigQuery (https://console.cloud.google.com/bigquery), , .


, : 300 . , 2 BigQuery, , .



. , , BigQuery , , . , !


, SELECT, , BQ.


, BQ , . , , , .


: SQL ==> Python ==> BigQuery.


, — .


, SELECT csv- GoogleStorage BQ , Google Functions, Python ( ) . , 6 ( ) .


, ?



, , , SQL . , Orders, , , , , , , id, id .. , , .


, , SELECT pandas dataframe. :


import pandas as pd
import MySQLdb
import MySQLdb.cursors

def sql_query(date0, date1):
    connection = MySQLdb.connect(host=”server”, user=”user”, passwd=”pass”, db=”orders”, charset=”cp1251”, cursorclass = MySQLdb.cursors.SSCursor)
    query = “SELECT * FROM orders WHERE date>=’”+date0+”’ and date<’”+date1+”’ ”
    data =  pd.read_sql_query(query, connection=connection)
#  query -   SELECT   ,     date0  date1,   , ,   5 .          -       .          ,      BQ
# , ,  connection      ,        ..       .

#     -  . ,        :

    data.payment_date = data.payment_date.apply(lambda x: x.tz_localize(‘Europe/Moscow’)).astype(str)

#     str -  .  ,      ,       .     BQ         .

# ,     pandas     BQ,           ,      .

    return data

, 5- , (90 , .. ) , sql_query(date) , BQ.


, , pandas dataframe .



: , BQ MERGE . BQ 1-2 . 20 70 , STRING FLOAT. 70000 .


, . BQ, .


from google.cloud import bigquery
from google.oauth2 import service_account

project_id = “MyBQproject”  #    BQ
dataset_id = “MyDataSet”  #     

#       BQ.       - ( ,    ).       BQ.  ,   ,    ‘Service accounts’  Google Cloud Platform.

credentials = service_account.Credentials.from_service_account_file(“acc.json”)
client = bigquery.Client(project_id, credentials=credentials)

def upload_temp_table(df, table_id, schema):
    job_config = bigquery.LoadJobConfig()
    job_config.schema = schema
   # job_config.autodetect = True  #     
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  #  
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()   #  

def merge_tables(temp_table, base_table, schema, head, merge_param):
    merge_query = “MERGE into “+base_table+” as T USING “+temp_table+” as S \
         ON T.”+merge_param+” = S.”+merge_param+ \
         “ WHEN MATCHED then UPDATE SET “+ \
         ‘,’.join([‘T.’+item+’=S.’+item for item in head])+ \
         “ WHEN NOT MATCHED then INSERT \
         (“+’,’.join(head)+”) VALUES (“+’,’.join([‘S.’+item for item in head])+”);”

#  ,         merge_param (,  id ,        )

#  -  

    job_config = bigquery.LoadJobConfig()
    job_config.schema = schema
    job = client.query(merge_query, job_config=job_config)
    job.result()

, — .


import datetime

for period in range(18):
    date0 = (datetime.datetime.now() - datetime.timedelta(days = period*5 + 5)).strftime(‘%Y%m%d’)
    date1 = (datetime.datetime.now() - datetime.timedelta(days = period*5)).strftime(‘%Y%m%d’)
    df = sql_query(date0,date1)   #    sql   >=date0, <date1
    upload_temp_table(df=df, table_id=”temp_table”, schema)  #    
    merge_tables(temp_table=’temp_table’, base_table = ‘orders’, schema=schema, head=[,,,], merge_param=’id’)

head — . . SQL , . MERGE- . BQ . - , Python , BQ.



.


import sys

log_file = open(“log.txt”, ‘a’)
old_stdout = sys.stdout
sys.stdout = log_file

, print
:


sys.stdout = old_stdout
log_file.close()

print -. , .



, BigQuery , DataStudio (https://datastudio.google.com/) .


, DataStudio , BigQuery Python. , , , apply.


, DataStudio (, ), , , , Datastudio .


, Datastudio Blended Tables, LEFT JOIN - . , , , BigQuery. , , , SQL-, BigQuery . , , 6 , , 6 . 20 70 3-4 BigQuery. .


, . , . - BigQuery, - Python. , merge_tables — BQ . , .. — . , .


, BigQuery , Google Spreadsheets. - id . :


  • Python pandas dataframe
  • BQ JOIN BQ (-, ).

, . - , , , .


. - -, . , , ( , F5). : Data Studio Auto-Refresh. : , View, Data Studio Auto-Refresh, , SQL=>BQ.


, , , :-)


All Articles