Python, pandas and solving three problems from the world of Excel

Excel is an extremely common data analysis tool. It’s easy to learn how to work with it, it is on almost every computer, and the one who has mastered it can solve quite complex problems with it. Python is often considered a tool whose capabilities are almost limitless, but which is harder to master than Excel. The author of the material, the translation of which we publish today, wants to talk about the solution using Python of the three tasks that are usually solved in Excel. This article is a bit of an introduction to Python for those who know Excel well.



Data loading


Let's start by importing the Python library pandasand loading data into data frames that are stored on Excel sheets salesand statesworkbooks. We will give the same names to the corresponding data frames.

import pandas as pd
sales = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/pythonexcel.xlsx', sheet_name = 'sales')
states = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/pythonexcel.xlsx', sheet_name = 'states')

Now we will use the .head()data frame method salesto display the elements at the beginning of the data frame:

print(sales.head())

Compare what will be displayed with what you can see in Excel.


Comparison of the appearance of data output in Excel with the appearance of data output from the pandas data frame

Here you can see that the results of visualizing data from the data frame are very similar to what you can see in Excel. But there are some very important differences:

  • The row numbering in Excel starts at 1, and in pandas the number (index) of the first row is 0.
  • In Excel, columns have letters starting with a letter A, and in pandas, the names of the columns correspond to the names of the corresponding variables.

We will continue to study the possibilities of pandas that allow us to solve problems that are usually solved in Excel.

Implementing the Excel Excel IF Functionality in Python


Excel has a very convenient function IFthat allows, for example, writing something to a cell, based on checking what is in another cell. Suppose you want to create a new column in Excel, the cells of which will tell us if the values ​​in the corresponding cells in the column exceed 500 B. In Excel, Eyou can assign a heading to such a column (in our case, a column ) by MoreThan500writing the corresponding text to the cell E1. After that, in the cell E2, you can enter the following:

=IF([@Sales]>500, "Yes", "No")


Using the IF function in Excel

In order to do the same with pandas, you can use list comprehension:

sales['MoreThan500'] = ['Yes' if x > 500 else 'No' for x in sales['Sales']]


List inclusions in Python: if the current value is greater than 500, Yes is included in the list; otherwise, No

List inclusions are an excellent tool for solving such problems, which simplifies the code by reducing the need for complex constructions like if / else. The same problem can be solved with if / else, but the proposed approach saves time and makes the code a little cleaner. Details on list inclusion can be found here .

Implementing the VLOOKUP Excel Functionality in Python


In our dataset, on one of the Excel sheets, there are city names, and on the other, the names of states and provinces. How to find out exactly where each city is located? An Excel function is suitable for this VLOOKUP, with which you can link the data of two tables. This function works on the principle of the left join, when each record from the data set located on the left side of the expression is saved. Using the function VLOOKUP, we suggest that the system search for a specific value in a given column of a specified sheet, and then return a value that is a specified number of columns to the right of the found value. Here's what it looks like:

=VLOOKUP([@City],states,2,false)

We set the salescolumn heading on the sheet Fas we Stateuse the function VLOOKUPto fill in the cells of this column with the names of the states and provinces in which the cities are located.


Using the VLOOKUP function in Excel

In Python, you can do the same using the methodmergefrom pandas. It takes two data frames and combines them. To solve this problem, we need the following code:

sales = pd.merge(sales, states, how='left', on='City')

Let's analyze it:

  1. The first argument to the method mergeis the original data frame.
  2. The second argument is the data frame in which we are looking for values.
  3. The argument howindicates exactly how we want to join the data.
  4. The argument onpoints to the variable by which the connection should be performed (here you can still use the arguments left_onand right_on, necessary if the data of interest to us in different data frames is named differently).

Summary tables


Pivot Tables is one of Excel's most powerful features. These tables allow you to quickly extract valuable information from large data sets. Create an Excel PivotTable that displays information about total sales for each city.


Creating a PivotTable in Excel

As you can see, to create such a table, just drag the fieldCityinto the sectionRowsand the fieldSalesinto the sectionValues. After that, Excel will automatically display the total sales for each city.

In order to create the same pivot table in pandas, you will need to write the following code:

sales.pivot_table(index = 'City', values = 'Sales', aggfunc = 'sum')

Let's analyze it:

  1. Here we use a method sales.pivot_table, telling pandas that we want to create a pivot table based on a data frame sales.
  2. The argument indexpoints to the column by which we want to aggregate the data.
  3. The argument valuesindicates which values ​​we are going to aggregate.
  4. The argument aggfuncspecifies the function that we want to use in the processing of values (here you can still use the functions mean, max, minand so on).

Summary


From this material you learned about how to import Excel data into pandas, how to implement the capabilities of Excel functions IFand Python tools pandasVLOOKUP, as well as how to reproduce the functionality of Excel pivot tables using pandas. Perhaps now you are wondering why you should use pandas if you can do the same in Excel. There is no single answer to this question. Python allows you to create code that can be fine-tuned and deeply researched. Such code can be reused. Python can describe very complex data analysis schemes. And the Excel features are probably only enough for smaller-scale data research. If you have only used Excel until now, I recommend that you try Python and pandas and find out what you can do.

What tools do you use to analyze data?

, , iPhone. , .


All Articles