Web Scraping Part 1

Introduction


Hello everyone. Recently I had the idea to show an interested circle of people how scrapers are written. Since most of the audience is familiar with Python, all further examples will be written in it.


This part is meant as an introduction for those who have not yet tried their hand at this area. If you are already an advanced reader, feel free to scroll past it, but for the sake of continuity I would still advise giving this article a little attention.


print('Part 1. Get started')

Tools


  • Programming language and corresponding libraries
    You cannot get anywhere without one. In our case it will be Python, a fairly powerful tool for writing scrapers if you use it and its libraries correctly: requests, bs4, json, lxml, re (a short setup sketch follows this list).
  • Developer Tools
    Every modern browser has this utility. Personally, I'm comfortable using Google Chrome or Firefox; if you use another browser, I recommend trying one of the above. Here we will need almost all of the tools: Elements, Console, Network, Application, and the debugger.
  • Modern IDE
    The choice here is yours; the only thing I would advise is that your development environment include a compiler, a debugger, and a static analyzer. My preference goes to PyCharm from JetBrains.
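
Before going any further, it is worth checking that the third-party libraries are actually available. Below is a minimal setup sketch (assuming pip is used as the package manager): json and re ship with the standard library, while requests, bs4 and lxml have to be installed separately.

# Third-party packages can be installed with, for example:
#   pip install requests beautifulsoup4 lxml

import importlib

# json and re come with the standard library; the rest are external packages
for module in ('requests', 'bs4', 'lxml', 'json', 're'):
    try:
        importlib.import_module(module)
        print(module, 'ok')
    except ImportError:
        print(module, 'missing, install it with pip')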

A bit about the format


For myself, I distinguish two approaches to extracting and analyzing data: frontend and backend.


Frontend. In this case we obtain information directly from the final HTML page assembled by the web application server. This method has its pros and cons: we always get exactly the information that is already published on the site, but we lose speed, which matters when we need to learn about site updates as quickly as possible.


Backend. In this case we take the data from the site's backend API, usually in JSON or XML format. This is faster and more convenient to process, but not every site exposes such an API openly, and finding the right endpoints takes extra work with the developer tools.
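
To make the difference concrete, here is a minimal sketch of the backend approach. The URL and the JSON field names (products, title, price) are invented purely for illustration; real endpoints are discovered through the Network tab of the developer tools, which we will get to in the next part.

import requests

# Hypothetical JSON endpoint, used only to illustrate the idea
API_URL = 'https://example.com/api/products'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'accept': 'application/json',
}

response = requests.get(API_URL, headers=headers, timeout=10)

if response.status_code == 200:
    # The server returns structured data, so no HTML parsing is needed
    for item in response.json().get('products', []):
        print(item.get('title'), item.get('price'))
else:
    print('Bad result', response.status_code)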


Frontend


Let's get acquainted with the first approach. To start, we need the requests library to fetch the page. Don't forget about headers: many servers check them, so we pass values copied from a real browser, including the user-agent, otherwise some sites simply refuse to respond. As an example, we will take the Kith store.


import requests

headers = {
    'authority': 'www.kith.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
    'sec-fetch-dest': 'document',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
}

session = requests.Session()

response = session.get("https://kith.com/collections/mens-footwear", headers=headers)

if response.status_code == 200:
    print("Success")
else:
    print("Bad result")

Now open the developer tools (Ctrl+Shift+I), inspect the page, and find the elements that hold the data we need. In this example we are interested in three values for each product: the name, the price, and the link to the product page.


from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Each product card on the collection page is an <li class="collection-product">
for element in soup.find_all('li', class_='collection-product'):
    name = element.find('h1', class_="product-card__title").text.strip()
    price = element.find('span', class_="product-card__price").text.strip()
    link = "https://kith.com/" + element.find('a').get('href')


class Product:
    name: str
    price: str
    link: str

    def __init__(self, name, price, link):
        self.name = name
        self.price = price
        self.link = link

    def __repr__(self):
        return str(self.__dict__)

Putting it all together, we get the following script:


import requests
from bs4 import BeautifulSoup

class Product:
    name: str
    price: str
    link: str

    def __init__(self, name, price, link):
        self.name = name
        self.price = price
        self.link = link

    def __repr__(self):
        return str(self.__dict__)

headers = {
    'authority': 'www.kith.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
    'sec-fetch-dest': 'document',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
}

session = requests.Session()

response = session.get('https://kith.com/collections/mens-footwear', headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')

for element in soup.find_all('li', class_='collection-product'):
    name = element.find('h1', class_="product-card__title").text.strip()
    price = element.find('span', class_="product-card__price").text.strip()
    link = "https://kith.com/" + element.find('a').get('href')

    product = Product(name, price, link)

    print(product)


We have covered the basic technologies and principles that we will meet in the following parts. I recommend trying the tasks with the listed libraries on your own, and I would also be glad to hear your suggestions for the site to use as the next example. Next, we will look at how to find backend API endpoints and interact with them.

