Reproducing a Power BI cohort analysis with Python

Good afternoon, dear readers! The idea for this publication came from a webinar I watched on YouTube, dedicated to a cohort analysis of sales. The author used the Power BI Desktop platform to work with the data. I will not link to that video so that this article is not taken as advertising, but along the way I will refer back to the original source in order to better explain the logic of my own solution. The webinar gave me the idea that it would be interesting to reproduce the capabilities of DAX formulas with the functions of the Pandas library.

There are two points I want to highlight. First, this material is intended for beginner analysts who are just taking their first steps with the Python programming language. It is ideal if readers are already familiar with the Power BI analytics platform. Second, since the DAX calculations served as the source of inspiration, I will "copy" the author's algorithms as far as possible, which inevitably means departing from some mainstream programming practices.

That's it for the introduction. Let's hit the road!

We will carry out all calculations in the JupyterLab environment. The notebook with the solution can be found at ( link ).

In Power BI, data is loaded using the Power Query tool (essentially a visual editor that generates queries in the M language). When developing, you should stick to the following rule: do all data preprocessing in Power Query and calculate the metrics in Power Pivot. Since our main library is Pandas, we use its capabilities for both right away.

%%time
# Imports and the path to the source data (numpy and copy are used further down the notebook).
import os
import copy
import numpy as np
import pandas as pd

# Read the CSV file: ";" as separator, parse the "date" column as day-first dates.
path_to_data = "C:/Users/Pavel/Documents/Demo/"
df = pd.read_csv(os.path.join(path_to_data, "ohortAnalysis_2016_2018.csv"), sep=";", parse_dates=["date"], dayfirst=True)

We will time the lines of code as they run in order to identify the most time-consuming sections later on. The os library is used to build the full path to the file being read; to simplify the development of the notebook, you can do without it. The dataset itself is randomly generated, and the CSV file contains 1,048,575 rows. Reading data with the read_csv() function is usually straightforward: it is enough to specify the column separator and, if the array contains one, the date column. If the data was exported with some "peculiarities", you may need to set additional parameters, for example the file encoding or explicit data types for individual columns.
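
As an illustration only (this block is not part of the original notebook), such a "tuned" read might look as follows; the encoding and the dtype mapping are assumed values, and the head() call at the end is simply a quick visual check of the result.

%%time
# Hypothetical example of reading a "problematic" export with extra parameters.
df = pd.read_csv(
    os.path.join(path_to_data, "ohortAnalysis_2016_2018.csv"),
    sep=";",
    parse_dates=["date"],
    dayfirst=True,
    encoding="utf-8",                                   # assumed file encoding
    dtype={"user_id": "int64", "amount": "float64"},    # assumed column types
)
# Quick visual check of the loaded data.
df.head()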

The head() function will be used often throughout the notebook to visually monitor the progress of the data transformations. It will not catch every error, but obvious flaws can be fixed on the spot.

After loading the data into the model, the author of the webinar sorts the data array. This is done in order to add an auxiliary index column. In our case this column will not be used, but we will sort the data anyway to make it easier to verify that the fields in the tables are calculated correctly.

%%time
# Sort the array by user and transaction date for easier visual checks.
df.sort_values(["user_id", "date"], inplace=True)


At the next step, the Power BI solution proposes creating an auxiliary table whose data will then be pulled into the main array. The table is created with the SUMMARIZE() function, which builds a pivot table with aggregated totals for the selected groups:

df_groupby_user = SUMMARIZE(df; df[user_id]; "first_date_transaction"; MIN(df[date]); "total_amount_user"; SUM(df[amount]); "count_transaction_user"; COUNT(df[amount]))

The Pandas library has its counterpart, the groupby() function. To apply groupby(), it is enough to specify the data frame, the columns to group by and, at the end, the columns to which aggregation functions will be applied. The result is converted back to an ordinary data frame with reset_index(). Finally, we rename the fields.

%%time
# Group the transactions by user_id: first purchase date, total amount and number of transactions.
df_groupby_user = df.groupby(by=["user_id"]).agg({"date": "min", "amount": ["sum", "count"]})
df_groupby_user.reset_index(inplace=True)
# Flatten the MultiIndex columns and give them readable names.
new_columns = ["user_id", "first_date_transaction", "total_amount_user", "count_transaction_user"]
df_groupby_user.columns = new_columns

In addition to the "first purchase date" metric, the number of transactions per customer and the total purchase amount per customer for the entire period are calculated. These measures are not used further in the notebook, but we will keep them.

We return to the webinar. The new metric "year-month of first purchase" is calculated. The DAX formula:

first_transaction = FORMAT(df_groupby_user[first_date_transaction]; "YYYY-MM")


In Python, the syntax dt.strftime('%Y-%m') is used. You will find detailed explanations of how it works in online publications about working with dates and times in Python. At this step, something else is important: pay attention to the execution time of the operation.

The performance is not at all pandas-like (24.8 sec.): this line of code is slower than all the previous ones.
This piece of the listing becomes the first candidate for possible refactoring.

%%time
# "Year-month" of the first purchase as a string.
df_groupby_user["first_transaction"] = df_groupby_user["first_date_transaction"].dt.strftime('%Y-%m')

It is time to return to the webinar again. There, the tables are joined on a key field, and the required fields are then pulled into the main table with the RELATED() function. Pandas does not have this function, but it does have merge(), join() and concat(). In this case, merge() is the best fit.

%%time
# Left join of the aggregated user table onto the transaction table.
df_final = pd.merge(df, df_groupby_user, how="left", on="user_id")

Once the date of the first transaction is in the main table, we can calculate the delta. We deliberately use the apply(lambda x: ...) construct to demonstrate how resource-intensive this approach is (39.7 sec.). Here is another candidate for rewriting.

%%time
# Number of days between the transaction and the customer's first purchase.
df_final["delta_days"] = df_final["date"] - df_final["first_date_transaction"]
df_final["delta_days"] = df_final["delta_days"].apply(lambda x: x.days)

The main table now has a delta in days, so the column can be split into cohorts. The principle: 0 (that is, the customer's first sale) falls into cohort 0; values greater than 0 but less than or equal to 30 go to cohort 30; values greater than 30 but less than or equal to 90 go to cohort 90, and so on. In DAX, the CEILING() function can be used for this: it rounds a number up to the nearest multiple of the value passed as its second parameter.


I did not find a similar mathematical function in Python, although I expected to find it in the math module (maybe I did not look hard enough). So I took a different route and applied the cut() function. After the data has been spread across cohorts, the remaining NaN values need to be mapped to the numeric value 0. This cannot be done with fillna() alone, because we are dealing with categorical data: first, the new value has to be added to the categories. At the end of this listing, the data type is changed to int. This is done so that later, when the pivot table is built from the data frame, the new cohort does not end up at the end of the series of values.

%%time
# Split the delta into 30-day cohorts: bins (0, 30], (30, 60], ... labelled 30, 60, ...
cut_labels_days = [x for x in range(30, 1230, 30)]
cut_bins_days = [x for x in range(0, 1230, 30)]
df_final["cohort_days"] = pd.cut(df_final["delta_days"], bins=cut_bins_days, labels=cut_labels_days, right=True)
%%time
# A delta of 0 falls outside the bins and becomes NaN. fillna() alone does not work on
# categorical data, so the new category 0 is added first and only then filled in.
df_final["cohort_days"] = df_final["cohort_days"].cat.add_categories([0])
df_final["cohort_days"].fillna(0, inplace=True)
# Change the data type to int so that, when the pivot table is built later,
# cohort "0" does not end up at the end of the series of values.
df_final["cohort_days"] = df_final["cohort_days"].astype(int)

Using the pivot_table() function, we get the desired pivot table. There are quite a lot of cohorts, so the result does not fit on the screen in its entirety. To avoid this in a real case, you can take a shorter time interval for the analysis or enlarge the value ranges of the cohorts themselves.

%%time
# Pivot table: rows - year-month of first purchase, columns - cohort, values - sum of amount.
df_pivot_table = pd.pivot_table(df_final, values=["amount"], index=["first_transaction"], columns=["cohort_days"], aggfunc=np.sum, fill_value=0)


In Power BI, you need to use the Matrix tool to build such a visualization.


The next stage is plotting. The nuance here is that we need the amounts as a running total. In Power BI, it is enough to select the appropriate "Quick measures" menu item and the necessary DAX formula is generated automatically. With the Pandas library the situation is a little more complicated: we group the existing data frame twice in succession and apply the cumsum() function. Since this result will be used again later, we make a copy of the data frame for building the chart. The accumulated sales values turn out to be quite large, so we divide them by 1,000,000 and round the result to two decimal places.

%%time
# Running total of amount within each "first purchase" month, broken down by cohort.
df_pivot_table_cumsum = df_final.groupby(by=["first_transaction", "cohort_days"]).agg({"amount": ["sum"]}).groupby(level=0).cumsum().reset_index()
df_pivot_table_cumsum.columns = ["first_transaction", "cohort_days", "cumsum_amount"]
%%time
# Copy of the data frame used only for the chart.
df_pivot_table_cumsum_chart = copy.deepcopy(df_pivot_table_cumsum)
# Scale to millions and round to two decimals so the Y axis stays readable.
df_pivot_table_cumsum_chart["cumsum_amount"] = round(df_pivot_table_cumsum_chart["cumsum_amount"] / 1000000, 2)

We use the plotting capabilities built into Pandas (Matplotlib under the hood) to build the graph. The chart takes just one line of code, but the result is not impressive: it clearly loses to visualizations on any BI platform. You could connect the Plotly library and work some magic with the settings, but that is a completely different amount of labor compared to the approach shown in the video.

%%time
# Reshape to wide format (one column per first-purchase month) and plot the running totals.
df_pivot_table_cumsum_chart.pivot(index="cohort_days", columns="first_transaction", values="cumsum_amount").plot(figsize=(15, 11))


Let us draw some brief conclusions.

In terms of calculations, the Pandas library may well replace Power Pivot (DAX).

Whether such a replacement is actually advisable is a separate conversation.

DAX, like the functions of the Python libraries, copes well with operations on entire table columns.

In terms of speed, simplicity and ease of designing visualizations, Power BI is superior to Pandas. In my opinion, the built-in charts (as well as those created with the matplotlib and seaborn libraries) are appropriate in two cases: quick exploratory analysis of data for outliers and local minima/maxima, or preparing slides for a presentation. Developing graphical dashboards is best left to BI solutions.

That's all. Good health, good luck and professional success to everyone!
