Excel is an extremely common data analysis tool. Itβs easy to learn how to work with it, it is on almost every computer, and the one who has mastered it can solve quite complex problems with it. Python is often considered a tool whose capabilities are almost limitless, but which is harder to master than Excel. The author of the material, the translation of which we publish today, wants to talk about the solution using Python of the three tasks that are usually solved in Excel. This article is a bit of an introduction to Python for those who know Excel well.
Data loading
Let's start by importing the Python library pandas
and loading data into data frames that are stored on Excel sheets sales
and states
workbooks. We will give the same names to the corresponding data frames.import pandas as pd
sales = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/pythonexcel.xlsx', sheet_name = 'sales')
states = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/pythonexcel.xlsx', sheet_name = 'states')
Now we will use the .head()
data frame method sales
to display the elements at the beginning of the data frame:print(sales.head())
Compare what will be displayed with what you can see in Excel.Comparison of the appearance of data output in Excel with the appearance of data output from the pandas data frameHere you can see that the results of visualizing data from the data frame are very similar to what you can see in Excel. But there are some very important differences:- The row numbering in Excel starts at 1, and in pandas the number (index) of the first row is 0.
- In Excel, columns have letters starting with a letter
A
, and in pandas, the names of the columns correspond to the names of the corresponding variables.
We will continue to study the possibilities of pandas that allow us to solve problems that are usually solved in Excel.Implementing the Excel Excel IF Functionality in Python
Excel has a very convenient function IF
that allows, for example, writing something to a cell, based on checking what is in another cell. Suppose you want to create a new column in Excel, the cells of which will tell us if the values ββin the corresponding cells in the column exceed 500 B
. In Excel, E
you can assign a heading to such a column (in our case, a column ) by MoreThan500
writing the corresponding text to the cell E1
. After that, in the cell E2
, you can enter the following:=IF([@Sales]>500, "Yes", "No")
Using the IF function in ExcelIn order to do the same with pandas, you can use list comprehension:sales['MoreThan500'] = ['Yes' if x > 500 else 'No' for x in sales['Sales']]
List inclusions in Python: if the current value is greater than 500, Yes is included in the list; otherwise, NoList inclusions are an excellent tool for solving such problems, which simplifies the code by reducing the need for complex constructions like if / else. The same problem can be solved with if / else, but the proposed approach saves time and makes the code a little cleaner. Details on list inclusion can be found here .Implementing the VLOOKUP Excel Functionality in Python
In our dataset, on one of the Excel sheets, there are city names, and on the other, the names of states and provinces. How to find out exactly where each city is located? An Excel function is suitable for this VLOOKUP
, with which you can link the data of two tables. This function works on the principle of the left join, when each record from the data set located on the left side of the expression is saved. Using the function VLOOKUP
, we suggest that the system search for a specific value in a given column of a specified sheet, and then return a value that is a specified number of columns to the right of the found value. Here's what it looks like:=VLOOKUP([@City],states,2,false)
We set the sales
column heading on the sheet F
as we State
use the function VLOOKUP
to fill in the cells of this column with the names of the states and provinces in which the cities are located.Using the VLOOKUP function in ExcelIn Python, you can do the same using the methodmerge
from pandas. It takes two data frames and combines them. To solve this problem, we need the following code:sales = pd.merge(sales, states, how='left', on='City')
Let's analyze it:- The first argument to the method
merge
is the original data frame. - The second argument is the data frame in which we are looking for values.
- The argument
how
indicates exactly how we want to join the data. - The argument
on
points to the variable by which the connection should be performed (here you can still use the arguments left_on
and right_on
, necessary if the data of interest to us in different data frames is named differently).
Summary tables
Pivot Tables is one of Excel's most powerful features. These tables allow you to quickly extract valuable information from large data sets. Create an Excel PivotTable that displays information about total sales for each city.Creating a PivotTable in ExcelAs you can see, to create such a table, just drag the fieldCity
into the sectionRows
and the fieldSales
into the sectionValues
. After that, Excel will automatically display the total sales for each city.In order to create the same pivot table in pandas, you will need to write the following code:sales.pivot_table(index = 'City', values = 'Sales', aggfunc = 'sum')
Let's analyze it:- Here we use a method
sales.pivot_table
, telling pandas that we want to create a pivot table based on a data frame sales
. - The argument
index
points to the column by which we want to aggregate the data. - The argument
values
indicates which values ββwe are going to aggregate. - The argument
aggfunc
specifies the function that we want to use in the processing of values (here you can still use the functions mean
, max
, min
and so on).
Summary
From this material you learned about how to import Excel data into pandas, how to implement the capabilities of Excel functions IF
and Python tools pandasVLOOKUP
, as well as how to reproduce the functionality of Excel pivot tables using pandas. Perhaps now you are wondering why you should use pandas if you can do the same in Excel. There is no single answer to this question. Python allows you to create code that can be fine-tuned and deeply researched. Such code can be reused. Python can describe very complex data analysis schemes. And the Excel features are probably only enough for smaller-scale data research. If you have only used Excel until now, I recommend that you try Python and pandas and find out what you can do.What tools do you use to analyze data?, , iPhone. , .
