The Task
The task I am going to talk about sounds simple: visualize the sales data of an e-commerce department with minimal effort, i.e. almost for free.
What does that mean? The shopping carts of our stores constantly generate data on online sales in different parts of the world, with everything that entails: different currencies, time zones, taxes, customer types, item types, orders and so on. Any online store generates essentially the same thing; only the set of order parameters differs slightly.
We needed to sit on this stream and publish aggregated dashboards on which the business could monitor the key online metrics at the current moment and watch them over the long run (or at regular intervals). On top of that, it was desirable to have dashboards that differ in their level of detail and specialization.
Searching for a Solution
Naturally, we started by looking for ready-made solutions. We studied data-visualization tools such as Owox, Microsoft's BI offerings, various enterprise solutions, and so on.
Judging by the advertising, all of them do the job well. But:
- they are not cheap
- they are things-in-themselves (it is not easy to control the data-processing machinery; sometimes you simply have to rely on the competence of the vendor's data scientists)
By that point (2019) I already had a small pocket data-visualization project in Google Data Studio. It had existed for at least a year and boiled down to this: I manually downloaded reports from 1C and uploaded them to Google Cloud Storage (not to be confused with Google Drive), from there into a BigQuery table, and then in Data Studio I did the data slicing and everything else.
The scheme was entirely manual, but it worked, and that experience settled the choice of stack.
Data Studio is free, has a native connector to BigQuery, and lets you build any number of dashboards on top of the same tables. BigQuery, in turn, understands standard SQL, so all the aggregation logic can live in queries rather than in the visualization layer.
BigQuery lives at https://console.cloud.google.com/bigquery; all you need to start is a Google account and a cloud project.
On signup Google grants $300 of trial credit, and beyond that BigQuery has a permanent free tier (on the order of 10 GB of storage and 1 TB of scanned query data per month). Keep in mind that BigQuery bills by the volume of data a query scans, so it is worth keeping an eye on, but at our volumes everything fit into the free tier!
What remained was to regularly deliver the result of a SELECT from the store database into BQ.
BQ cannot reach our database directly, so something has to sit in between: fetch the data, massage it, and push it on. Hence the scheme:
SQL ==> Python ==> BigQuery.
Python here is just the transport and transformation layer, nothing more.
At first I considered dumping the SELECT to a csv file, uploading it to GoogleStorage and loading it into BQ from there via Google Functions, but a plain Python script running on an ordinary office machine turned out simpler. The script just runs on a schedule (every 6 hours in my case) and re-syncs the data.
So, how does it look in code?
On the store side there is an ordinary SQL database. The central table is Orders, where each order carries its attributes: date, amount, currency, taxes, customer type, item type, order id, customer id and so on. This is exactly what needs to end up in BQ.
First, pulling the result of a SELECT into a pandas dataframe. Like this:
import pandas as pd
import MySQLdb
import MySQLdb.cursors

def sql_query(date0, date1):
    # SSCursor streams rows from the server instead of buffering them all in memory
    connection = MySQLdb.connect(host="server", user="user", passwd="pass",
                                 db="orders", charset="cp1251",
                                 cursorclass=MySQLdb.cursors.SSCursor)
    query = "SELECT * FROM orders WHERE date>='" + date0 + "' AND date<'" + date1 + "'"
    data = pd.read_sql_query(query, connection)
    connection.close()
    # make the naive timestamps timezone-aware, then cast to string for the BQ upload
    data.payment_date = data.payment_date.apply(lambda x: x.tz_localize('Europe/Moscow')).astype(str)
    return data
I pull the data in chunks of 5 days, going 90 days back (i.e. deep enough to cover the period the business actually watches), calling sql_query(date0, date1) for each chunk and shipping the result to BQ.
Why re-sync history at all, rather than just append? Because old orders change: payments arrive, statuses flip, and so a fresh pandas dataframe of the recent past has to replace what is already uploaded.
Hence the key point: instead of rewriting the whole table, I MERGE each chunk into the base table in BQ. Uploading and merging a chunk takes on the order of 1-2 minutes. My tables have from 20 to 70 columns, which I load simply as STRING and FLOAT, and a 5-day chunk holds up to 70000 rows.
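To make the chunking concrete, here is a small sketch of the window arithmetic (the helper name chunk_bounds is mine, not part of the pipeline): half-open [date0, date1) windows stepped back from today tile the range with no gaps and no double-counted days, which is exactly what the date>= / date< condition in sql_query relies on.

```python
import datetime

def chunk_bounds(period, step=5, now=None):
    # hypothetical helper: the [date0, date1) window `period` steps back from `now`
    now = now or datetime.datetime.now()
    date0 = (now - datetime.timedelta(days=period * step + step)).strftime('%Y%m%d')
    date1 = (now - datetime.timedelta(days=period * step)).strftime('%Y%m%d')
    return date0, date1

# consecutive windows share a boundary: date1 of chunk k equals date0 of chunk k-1
now = datetime.datetime(2019, 6, 1)
print(chunk_bounds(0, now=now))  # ('20190527', '20190601')
print(chunk_bounds(1, now=now))  # ('20190522', '20190527')
```

Because the boundary day belongs to exactly one window, a row is never merged twice in a single run.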
Now the BQ side of the exchange. First, uploading a chunk into a temporary table:
from google.cloud import bigquery
from google.oauth2 import service_account

project_id = "MyBQproject"
dataset_id = "MyDataSet"

# service-account key file downloaded from the Google Cloud console
credentials = service_account.Credentials.from_service_account_file("acc.json")
client = bigquery.Client(project=project_id, credentials=credentials)

def upload_temp_table(df, table_id, schema):
    job_config = bigquery.LoadJobConfig()
    job_config.schema = schema
    # overwrite the temporary table on every run
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # wait for the load job to complete
def merge_tables(temp_table, base_table, schema, head, merge_param):
    # upsert the fresh chunk: update rows that already exist, insert the new ones
    merge_query = ("MERGE INTO " + base_table + " AS T USING " + temp_table + " AS S"
                   " ON T." + merge_param + " = S." + merge_param +
                   " WHEN MATCHED THEN UPDATE SET " +
                   ','.join(['T.' + item + '=S.' + item for item in head]) +
                   " WHEN NOT MATCHED THEN INSERT (" + ','.join(head) + ")"
                   " VALUES (" + ','.join(['S.' + item for item in head]) + ");")
    # a plain query job; unlike a load job it needs no schema (the argument is
    # kept only so the call signature stays the same)
    job = client.query(merge_query)
    job.result()
And finally the sync loop itself, chunk by chunk:
import datetime

for period in range(18):
    date0 = (datetime.datetime.now() - datetime.timedelta(days=period*5 + 5)).strftime('%Y%m%d')
    date1 = (datetime.datetime.now() - datetime.timedelta(days=period*5)).strftime('%Y%m%d')
    df = sql_query(date0, date1)
    upload_temp_table(df=df, table_id="temp_table", schema=schema)
    merge_tables(temp_table="temp_table", base_table="orders", schema=schema,
                 head=[...], merge_param="id")  # head: the list of column names to sync
Here head is the list of column names being transferred; the same names must exist in the SQL table and appear in the MERGE query. 18 periods of 5 days give the 90-day depth. A pleasant property of the scheme is that it is safe to interrupt: if a run dies half-way, nothing in BQ is corrupted, and the next run simply merges the same chunks again.
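For illustration, this is roughly the statement merge_tables assembles; the table and column names below are invented for the example:

```python
head = ["id", "date", "amount"]   # hypothetical column list
merge_param = "id"

# the same string-building logic as in merge_tables above
set_clause = ','.join('T.' + c + '=S.' + c for c in head)
merge_query = ("MERGE INTO orders AS T USING temp_table AS S"
               " ON T." + merge_param + " = S." + merge_param +
               " WHEN MATCHED THEN UPDATE SET " + set_clause +
               " WHEN NOT MATCHED THEN INSERT (" + ','.join(head) + ")"
               " VALUES (" + ','.join('S.' + c for c in head) + ");")
print(merge_query)
```

Running the same MERGE twice is harmless: the second pass finds every row already matched and just rewrites it with identical values.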
One more practical detail: logging.
import sys

log_file = open("log.txt", 'a')
old_stdout = sys.stdout
sys.stdout = log_file
After that, everything the script prints goes into the log file instead of the console,
and at the end of the script:
sys.stdout = old_stdout
log_file.close()
So ordinary print calls become log entries. Primitive, but it is enough to see what the last run did and when.
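A tidier stdlib alternative to swapping sys.stdout by hand is contextlib.redirect_stdout, which restores the original stream automatically even if the script raises:

```python
import contextlib

# every print inside the with-block lands in log.txt;
# sys.stdout is restored as soon as the block exits
with open("log.txt", "a") as log_file, contextlib.redirect_stdout(log_file):
    print("sync started")
```

The manual version from the article does the same thing, it just has to remember to undo the swap itself.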
Once the data is flowing into BigQuery, it can be hooked up to DataStudio (https://datastudio.google.com/) and visualized.
Keep in mind that DataStudio's own transformation capabilities are modest, so it is better to prepare the data beforehand, in BigQuery or in Python. Row-level massaging in particular, like the apply call in sql_query above, belongs in Python.
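On that note, the per-row apply in sql_query has a vectorized pandas equivalent via the .dt accessor, noticeably faster on large frames (the timestamps below are invented for the example):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2019-01-01 10:00:00", "2019-01-02 12:30:00"]))
# same effect as s.apply(lambda x: x.tz_localize('Europe/Moscow')).astype(str),
# but applied to the whole column at once
localized = s.dt.tz_localize("Europe/Moscow").astype(str)
print(localized.tolist())  # -> ['2019-01-01 10:00:00+03:00', '2019-01-02 12:30:00+03:00']
```

On a 70000-row chunk the difference is not dramatic, but it costs nothing to take.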
In DataStudio itself, you add the BigQuery table as a data source, lay out charts, tables and filters, and share the finished report with colleagues the way you would share any other Google document.
Datastudio also offers Blended Tables, essentially a LEFT JOIN of several sources right inside the report. They work, but slowly, and only as a LEFT JOIN; anything non-trivial is better expressed as SQL on the BigQuery side, feeding DataStudio a ready-made table. The latency of the whole pipeline suits us: with a sync every 6 hours the dashboards lag by at most 6 hours, and queries over our tables of 20-70 columns take 3-4 seconds in BigQuery. For business dashboards that is plenty.
A word on dividing the work. Some processing is more convenient on the BigQuery side, some in Python. For example, merge_tables above runs entirely inside BQ, where the data already sits, so nothing travels over the network; per-row transformations, on the other hand, are easier done in Python before the upload. There is no universally right split, so do whatever is cheaper in each case.
Besides the store database, BigQuery happily takes data from Google Spreadsheets, which is handy for small hand-maintained reference tables, say a mapping from manager id to manager name. Such a table can be joined in two ways:
- pull it into a Python pandas dataframe and merge it there before the upload
- upload it to BQ and JOIN on the BQ side (to my mind, the cleaner option).
Both work; pick whichever keeps the pipeline simpler.
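The pandas variant of such a join looks like this (the dataframes are invented for the example):

```python
import pandas as pd

# orders as they come from the store database
orders = pd.DataFrame({"id": [1, 2, 3],
                       "manager_id": [10, 20, 10],
                       "amount": [5.0, 7.5, 3.2]})
# hand-maintained reference table, e.g. pulled from a Google Spreadsheet
managers = pd.DataFrame({"manager_id": [10, 20],
                         "manager": ["Ann", "Bob"]})

# left join on manager_id, the same thing a LEFT JOIN in BQ would do
enriched = orders.merge(managers, on="manager_id", how="left")
print(enriched["manager"].tolist())  # -> ['Ann', 'Bob', 'Ann']
```

After the merge the enriched frame can go through upload_temp_table exactly like the raw orders.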
The finishing touch: dashboards on office wall screens. A report left open in a browser does not refresh itself (short of someone walking over and pressing F5). The fix is the Data Studio Auto-Refresh browser extension: open the report in View mode, switch on Data Studio Auto-Refresh, set the interval, and the screen always shows the result of the latest SQL=>BQ sync.
And that is the whole setup: nearly free, fully transparent, and entirely under our own control :-)