Python for the tester: how small pandas scripts help in testing large datasets

I work as a tester on a project whose essence is the collection and storage of various data and the generation of various reports and file exports based on it. When generating such reports, a large number of data-selection conditions are taken into account, so during testing you have to work a lot with SQL queries against the database. But that is often not enough to verify that the data is selected correctly and to hunt for extra or missing records, so I had to look for additional tools.

Since I already had some basic python knowledge, I decided to try writing small scripts that would let me do something with the available data and thereby simplify and speed up the testing process. In this article I will tell you what came of it.

Designing a scenario for a script


To write a script, you need to figure out what exactly the script should do, what data it takes as input and what data is expected as output.

Sample steps for the scenario:

  1. We get a file with data in a certain format (or several files)
  2. We get the data from the file(s)
  3. We select the necessary data
  4. We perform some operations on the data
  5. If necessary, we upload the data to an Excel file (this format is usually the most convenient for further analysis and storage)

Then you need to think about how and where to get the input data. It can be a manually created data file, a filtered data export from the UI to a file, a file with data parsed by another script, a csv file exported from the result of an SQL query against the database (the database usually lets you quickly dump query results to a csv file), a json or xml file with data from an API response, and so on.

We write scripts in python using pandas and other libraries


To write scripts in python, you need to install an interpreter and a suitable IDE. It is also best to create a separate virtual environment for this project.

I use different libraries in my scripts; some of them are built-in python libraries, some need to be installed separately:

  • pandas is a library for data analysis. It lets you work with data from files of various formats and also fetch data directly from a database with an SQL query. Data from files is loaded into data frames (visually the same tables as in Excel), on which you can then perform various operations: combine data from different data frames by analogy with join/union in SQL, select the necessary data under certain conditions, compare data in different columns of a data frame, and so on.
  • openpyxl and xlrd are libraries for working with Excel files.

The simplest script skeleton for working with data from csv, json or Excel files looks like this:

# import the pandas library
import pandas as pd

# load data from a csv file into a data frame
# (here and below the file is assumed to lie in the same directory as the script)
# the values in this csv file are separated by ";"
df = pd.read_csv('./csv_file.csv', sep=';', encoding='utf-8')

# or
# load data from a json file into a data frame
# df = pd.read_json('./json_file.json', encoding='utf-8')

# or
# load data from a given sheet of an Excel file into a data frame
# file_excel = 'Excel_file.xlsx'
# df = pd.ExcelFile(file_excel).parse('1')


# here we select the necessary data and perform some operations on it;
# the result is assumed to end up in the data frame final_df


# if necessary, upload the data from final_df to an Excel file
# (the new file will appear in the same directory as the script;
# the file name 'result.xlsx' is arbitrary)
with pd.ExcelWriter('result.xlsx') as writer:
    final_df.to_excel(writer, sheet_name='1')

In this script, the data from a file of the required format is loaded into a data frame, the necessary data is selected and some operations are performed on it, and then the data is written to a new Excel file.

If you need to work with data obtained as the result of an SQL query against the database, you don't have to export it to a csv file; you can get it into a data frame directly by executing the SQL query from the script itself:

# import the pandas library
import pandas as pd
# import the library for working with the database; our project uses PostgreSQL
# (for another DBMS, take the corresponding driver library)
import psycopg2

# establish a connection to the database
# (the connection parameters below are placeholders)
conn = psycopg2.connect(dbname='db_name', host='host', port='port',
                        user='user', password='password')

# compose the SQL query
q = """select ... 
    from ... 
    where ..."""

# load the result of the SQL query into a data frame
df = pd.read_sql_query(q, conn)


# here we select the necessary data and perform some operations on it;
# the result is assumed to end up in the data frame final_df


# if necessary, upload the data from final_df to an Excel file
# (the new file will appear in the same directory as the script;
# the file name 'result.xlsx' is arbitrary)
with pd.ExcelWriter('result.xlsx') as writer:
    final_df.to_excel(writer, sheet_name='1')

If you need to get data from an xml file, you can use the libraries designed for that. I use the built-in ElementTree library.

Once the data is in a data frame, you can immediately combine it with data from another data frame (analogues of join or union in SQL) or perform some operations on it: for example, remove duplicates, drop rows with empty values in certain cells, compare data in several columns, select the desired rows/columns, etc. See the pandas documentation for details.
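
For a quick feel of these operations, here is a minimal sketch on made-up data frames (all column names and values are invented for the example):

# import the pandas library
import pandas as pd

# made-up data frames purely for illustration
df1 = pd.DataFrame({'id': [1, 1, 2], 'name': ['A', 'A', 'B']})
df2 = pd.DataFrame({'id': [2, 3], 'city': ['X', 'Y']})

stacked = pd.concat([df1, df1])                    # stack rows (analogue of union all in SQL)
deduped = df1.drop_duplicates(subset='id')         # remove duplicates by id
merged = deduped.merge(df2, on='id', how='outer')  # analogue of outer join in SQL
no_empty = merged.dropna(subset=['city'])          # drop rows with an empty city
columns = merged[['id', 'name']]                   # select only the desired columns
rows = merged.loc[merged['id'] > 1]                # select rows by a condition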

Script Usage Options


And now we engage the tester's main tool and pick out the data and features on our project whose verification such scripts could help with.

Files with a small amount of test data for the scripts were generated using generatedata. In reality, data files contain tens of thousands of rows and a large number of columns.

Scenario #1

There are three csv files with data. Each data row contains a field with a unique identifier id. Data from these files is selected subject to certain conditions and entered into a table in the database; these data are then displayed on the UI in a report in the form of a table. The report on the UI can be exported to an Excel file.

Suppose that the conditions for selecting data for a report from source files are as follows:

  • The files can contain duplicates by id; a record with a given identifier must be counted in the report only once (in that case we simply take any one of the rows with this identifier).
  • Rows with an empty cell in the reg_date column must not be counted.
  • In reality there can be more selection conditions; the data can also be compared against data already in the system so that only records whose id intersects with it end up in the report, but for the example we will consider only the two conditions above.

The tester's task: check that the rows with the required objects are correctly selected from the source files and that all these objects are displayed in the report on the UI.

We compose a scenario for the script:

  • We load the data from the three csv files into one data frame (by analogy with union in SQL), remove the duplicates by id and drop the rows with an empty reg_date.
  • We export the report from the UI to an Excel file and load its data into a second data frame.
  • We merge the two data frames (merge, by analogy with outer join in SQL) by the unique identifier and upload the result to an Excel file.
  • We open the resulting file and analyze the rows: rows found in only one of the data frames point to missing or extra objects in the report on the UI.

If the column names in the different data frames coincide, the final file will contain only one id column, and it may not be clear which columns/rows came from which file. Therefore I either give the unique-identifier columns different names in the different files, or add to each file a separate column like "Row from file N" filled with "Yes": when analyzing the resulting Excel file it is then convenient to filter by such columns, since they always contain a value, and, filtering by them, you can already see which data diverges in the corresponding rows.
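
For example, the marker columns can be added like this (a minimal sketch on made-up data; in a real script the data frames are loaded from the source files):

# import the pandas library
import pandas as pd

# toy data frames; in a real script these come from the source files
df_a = pd.DataFrame({'id': [1, 2]})
df_b = pd.DataFrame({'id': [2, 3]})

# a marker column in each data frame makes the origin of every row visible
df_a['Row from file A'] = 'Yes'
df_b['Row from file B'] = 'Yes'

# after an outer merge, NaN in a marker column means the row
# came only from the other file
merged = df_a.merge(df_b, on='id', how='outer')
print(merged)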

Example data from the example1_csv_1.csv file:



Example data from the report_UI.xlsx file:



The python script looks like this:

# import the pandas library
import pandas as pd

# load the data from the three csv files into separate data frames
# (the files are assumed to lie in the same directory as the script)
df_from_file1 = pd.read_csv('./example1_csv_1.csv', sep=';', encoding='utf-8',
                            usecols=['id', 'name', 'email', 'reg_date'])
df_from_file2 = pd.read_csv('./example1_csv_2.csv', sep=';', encoding='utf-8',
                            usecols=['id', 'name', 'email', 'reg_date'])
df_from_file3 = pd.read_csv('./example1_csv_3.csv', sep=';', encoding='utf-8',
                            usecols=['id', 'name', 'email', 'reg_date'])

# combine the data from the three data frames into one
# (by analogy with union in SQL)
df_from_csv = pd.concat([df_from_file1, df_from_file2, df_from_file3]).\
    reset_index(drop=True)
print(df_from_csv)

# remove duplicate rows with the same id
df_from_csv.drop_duplicates(subset='id', keep='first', inplace=True)
print(df_from_csv)

# drop the rows with NaN (empty values) in the reg_date column
df_from_csv = df_from_csv.dropna(subset=['reg_date'])
print(df_from_csv)

# load the data from the Excel file exported from the report on the UI
# into another data frame; the sheet with the data is named '1'
# (the file is assumed to lie in the same directory as the script)
file_excel = 'report_UI.xlsx'
df_from_excel = pd.ExcelFile(file_excel).parse('1')
print(df_from_excel)

# merge the data frame built from the csv files with the data frame
# built from the UI report (by analogy with outer join in SQL);
# 'id_report' is a placeholder for the identifier column in the UI report
df = df_from_csv.merge(df_from_excel, left_on='id', right_on='id_report',
                       how='outer')
print(df)

# upload the result to a new Excel file (the file name is arbitrary)
with pd.ExcelWriter('result.xlsx') as writer:
    df.to_excel(writer, sheet_name='1')

Limitations:

  • When working with files with a very large number of rows, you have to split them into separate files (you need to experiment here; I rarely have files over 30,000 rows).
  • If the data in the UI export (the Excel file) and in the source files differ in format or representation, they first have to be brought to a common form.

Scenario #2

A section contains data in the form of a table for certain objects, coming from a single source. The system will start receiving data for these objects from a second source (an integration) and update the existing table data with it. Each record in the table holds the data of one object with a unique identifier. If an object from the new source matches an existing object by identifier, all fields of the existing record are updated with the data from the new source (and the record is marked as confirmed). If the table does not yet contain an object with an identifier from the second source, a new record is created with the data from the new source. The data from the second system can be uploaded to a json file in advance.

The tester's task: prepare a data file for the test in advance, in order to verify that, when there is a match by identifier, existing records are updated correctly and marked in the database as confirmed, and that when no record with such an identifier exists yet, new records are created correctly and marked as added.

We compose a scenario for the script:

  • We export the data from the section table on the UI to an Excel file (if that is not possible, you can always export the result of the SQL query used in the code to render this table on the UI) and load it into the first data frame.
  • We get the json file with the data from the second source and load it into the second data frame.
  • We merge the two data frames (merge, by analogy with outer join in SQL) by the unique identifier into a new data frame and upload it to an Excel file, in which it is convenient to see which identifiers from the second source match existing records (those should be updated and confirmed) and which do not (those should produce new records); a sketch of this step is shown below.
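
A minimal sketch of this step might look as follows (the file names, sheet name and identifier column here are assumptions, not taken from a real system):

# import the pandas library
import pandas as pd

# data currently in the section table, exported from the UI to an Excel file
# (file and sheet names here are hypothetical)
df_existing = pd.ExcelFile('section_table.xlsx').parse('1')

# data from the second source, uploaded in advance to a json file
# (file name here is hypothetical)
df_new = pd.read_json('second_source.json', encoding='utf-8')

# outer merge by the identifier: rows present in both data frames should be
# updated and confirmed, rows present only in the new source should be created
df = df_existing.merge(df_new, on='id', how='outer',
                       suffixes=('_existing', '_new'))

# upload the result for analysis
with pd.ExcelWriter('test_data_check.xlsx') as writer:
    df.to_excel(writer, sheet_name='1')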

Limitations:

  • When working with files with a very large number of rows, you have to split them into separate files (you need to experiment here; I rarely have files over 30,000 rows).
  • If the json file has several levels of nesting of objects/arrays, the inner levels are loaded into a cell as a whole object/array, so working with json files in pandas is only convenient for data without deep nesting of objects/arrays.

Scenario #3

A request is made to the system's API; the response contains data on objects in json format.

Tester's task: compare the data from the API response with the data from the result of an SQL query against the database.

We compose a scenario for the script:

  • We execute the SQL query against the database, export the query result to a csv file and load this data into the first data frame.
  • We save the data from the API response to a json file and load it into the second data frame.
  • We merge the data (merge, by analogy with outer join in SQL) from the two data frames into one new data frame by the unique identifier and upload it to an Excel file, in which we compare the data by columns using Excel functions.
  • Or we compare the columns of the combined data frame using pandas, unloading the rows with matching/differing column values into a new data frame/Excel file for analysis.

Example data from the example3_csv.csv file:



Example data from the example3_json.json file:

[
    {
        "id": "16421118-4116",
        "name_json": "Tempor Consulting",
        "email_json": "Nullam.lobortis.quam@***",
        "tel_json": "1-821-805-****",
        "reg_date_json": "12-11-16",
        "city_json": "Natales"
    },
    {
        "id": "16040210-2206",
        "name_json": "Odio Etiam Incorporated",
        "email_json": "arcu@***",
        "tel_json": "1-730-291-****",
        "reg_date_json": "26-06-05",
        "city_json": "Viddalba"
    },
...
]

The python script looks like this:

# import the pandas library
import pandas as pd

# load the data from the csv file into a data frame
# (the file is assumed to lie in the same directory as the script)
# the values in this csv file are separated by ";"
df_from_csv = pd.read_csv('./example3_csv.csv', sep=';', encoding='utf-8')
print(df_from_csv)

# load the data from the json file into a data frame
# (the file is assumed to lie in the same directory as the script)
df_from_json = pd.read_json('./example3_json.json', encoding='utf-8')
print(df_from_json)

# merge the two data frames by the identifier
# (by analogy with outer join in SQL)
df_csv_json = df_from_csv.merge(df_from_json, left_on='id',
                                right_on='id', how='outer')
print(df_csv_json)

# if it is more convenient to compare the columns with Excel functions,
# the combined data frame can be uploaded to an Excel file as a whole
# (the commented-out file name is arbitrary)
# with pd.ExcelWriter('result.xlsx') as writer:
#     df_csv_json.to_excel(writer, sheet_name='1')

# or we select the rows in which the data in certain columns differ
# (here, name_csv and name_json)
# and upload them to a separate Excel file for analysis
# (the new file will appear in the same directory as the script)
unequal_data_df = df_csv_json.loc[df_csv_json['name_csv'] !=
                                  df_csv_json['name_json']]
unequal_data_df = unequal_data_df[['id', 'name_csv', 'name_json']]
print(unequal_data_df)

with pd.ExcelWriter('diff_name.xlsx') as writer:
    unequal_data_df.to_excel(writer, sheet_name='1')


Limitations:

  • When working with files with a very large number of rows, you have to split them into separate files (you need to experiment here; I rarely have files over 30,000 rows).
  • If the json file has several levels of nesting of objects/arrays, the inner levels are loaded into a cell as a whole object/array, so working with json files in pandas is only convenient for data without deep nesting of objects/arrays (a possible workaround is sketched after this list).
  • If the column names in the API response coincide with the column names in the result of the SQL query, it is worth renaming them beforehand, for example directly in the SQL query with aliases (as with name_csv in the example), so that it is clear which source each column came from.
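
Where the nesting is shallow, the pandas function json_normalize can flatten such structures into ordinary columns; a minimal sketch with made-up field names:

# import the pandas library
import pandas as pd

# made-up records with one level of nesting, purely for illustration
data = [
    {"id": 1, "contact": {"email": "a@***", "tel": "1-***"}},
    {"id": 2, "contact": {"email": "b@***", "tel": "2-***"}},
]

# flatten the nested dictionaries into columns like 'contact.email'
df = pd.json_normalize(data)
print(df.columns.tolist())  # ['id', 'contact.email', 'contact.tel']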

If the API response comes in xml format, you will first need to parse the necessary data out of the xml file using ElementTree or another library, and only then load it into a data frame.

Scenario #4

From the UI one can download an xml file with data on objects, which is generated on the fly from the data in the database subject to certain conditions (for example, statuses, dates, years or other parameter values of the objects are taken into account).

Tester's task: compare the unique identifiers id of the objects from the xml file (stored in attributes of the company tags) with the identifiers of the objects from the result of an SQL query against the database.

We compose a scenario for the script:

  • We save the xml file downloaded from the UI, extract the necessary data from it using the ElementTree library and load the data into the first data frame.
  • We execute the SQL query against the database, export the query result to a csv file and load this data into the second data frame.
  • We merge the data (merge, by analogy with outer join in SQL) from the two data frames into one new data frame by the unique identifier and upload it to an Excel file.
  • Then we open the resulting file and analyze the data rows.

Example data from the example4_csv.csv file:



Example data from the example4_xml.xml file:



The python script looks like this:

# import the ElementTree library
from xml.etree import ElementTree
# import the pandas library
import pandas as pd

# parse the xml file into a tree
# (the file is assumed to lie in the same directory as the script)
tree = ElementTree.parse("example4_xml.xml")

# get the root element of the tree
root = tree.getroot()

# a list for collecting the data from the xml file
data_list = []
i = 1

# walk over the company elements; the identifier of an object can be stored
# in either the id_type1 or the id_type2 attribute
for child in root.iter("companies"):
    for child_1 in child.iter("company"):
        data_list.append({"row": i,
                          "id": child_1.get("id_type1")
                                or child_1.get("id_type2"),
                          "Row from xml": "Yes"})
        i += 1

# load the collected data from data_list into a data frame
df_from_xml = pd.DataFrame(data_list)
print(df_from_xml)

# load the data from the csv file into a second data frame
df_from_csv = pd.read_csv('./example4_csv.csv', sep=';', encoding='utf-8')
print(df_from_csv)

# merge the two data frames by the identifier
# (by analogy with outer join in SQL)
df = df_from_csv.merge(df_from_xml, left_on='id', right_on='id', how='outer')
print(df)

# upload the result to a new Excel file
# (the new file will appear in the same directory as the script;
# the file name is arbitrary)
with pd.ExcelWriter('result.xlsx') as writer:
    df.to_excel(writer, sheet_name='1')

Scenario #5

On the UI, a section displays data on objects in the form of a table. The table can be exported to an Excel file.

The tester's task: compare the data from the section table with the data that is exported to the Excel file.

We compose a scenario for the script:

  • We ask the developers for the SQL query from the code that is responsible for displaying the data in the section table on the UI.
  • We execute this SQL query against the database, export the data to a csv file and load it into the first data frame.
  • We export the section table on the UI to an Excel file and load its data into the second data frame.
  • We merge the data (merge, by analogy with outer join in SQL) from the two data frames into one new data frame by the unique identifier and upload it to an Excel file, in which we compare the data by columns using Excel functions.
  • Or we compare the columns of the combined data frame using pandas, unloading the rows with matching/differing data into a new data frame/Excel file for analysis; a sketch of this scenario is shown after this list.
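
A minimal sketch of this scenario (all file, sheet and column names here are hypothetical):

# import the pandas library
import pandas as pd

# result of the SQL query, exported to a csv file
df_from_sql = pd.read_csv('./section_query.csv', sep=';', encoding='utf-8')

# the same table exported from the UI to an Excel file
df_from_ui = pd.ExcelFile('section_ui.xlsx').parse('1')

# outer merge by the identifier: rows that do not pair up point to
# missing or extra records on the UI
df = df_from_sql.merge(df_from_ui, on='id', how='outer',
                       suffixes=('_sql', '_ui'))

# upload the result for analysis
with pd.ExcelWriter('section_diff.xlsx') as writer:
    df.to_excel(writer, sheet_name='1')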

Limitations:

  • If the data in the Excel file exported from the UI and the data returned by the SQL query differ in representation (for example, in the formatting of dates or numbers), they first have to be brought to a common form.

Similar scripts can also be used simply to transfer data from json or csv files to Excel files. Or you can combine data from several Excel files by certain columns and upload it into one new Excel file, as in the sketch below.
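
For instance, combining the same columns from several Excel files into one new file could look like this (file, sheet and column names are illustrative):

# import the pandas library
import pandas as pd

# take the same columns from several Excel files
# (file, sheet and column names here are hypothetical)
files = ['part1.xlsx', 'part2.xlsx', 'part3.xlsx']
frames = [pd.read_excel(f, sheet_name='1', usecols=['id', 'name'])
          for f in files]

# stack them into one data frame and upload it to a new Excel file
combined = pd.concat(frames).reset_index(drop=True)
with pd.ExcelWriter('combined.xlsx') as writer:
    combined.to_excel(writer, sheet_name='1')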

Conclusion


These are just a few examples of how python + pandas can be used to speed up the testing process and find bugs. In fact, pandas offers far more capabilities for working with data; you can read more about them in the documentation for the library.

Perhaps your project has other uses for such scripts, and this article will help you start applying them in testers' work.

Source: https://habr.com/ru/post/undefined/

