Introduction
Hello everyone. I recently had the idea to share with an interested circle of people how scrapers are written. Since most of the audience is familiar with Python, all further examples will be written in it.
This part is meant to introduce those who have not yet tried their hand in this area. If you are already an advanced reader, you can safely scroll on, but to keep the picture complete, I would advise you to give this article a little attention.
print('Part 1. Get started')
Tools
- Programming language and corresponding libraries
Of course, without them there is nowhere to go. In our case, we will use Python. This language is a fairly powerful tool for writing scrapers, if you can use it and its libraries correctly: requests, bs4, json, lxml, re.
- Developer Tools
Every modern browser has this utility. Personally, I'm comfortable using Google Chrome or Firefox. If you use another browser, I recommend trying one of the above. Here we will need almost all of the tools: Elements, Console, Network, Application, Debugger.
- Modern IDE
Here the choice is yours; the only thing I would advise is that your development environment include a debugger and a static analyzer. I give my preference to PyCharm from JetBrains.
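As a quick taste of one of the libraries listed above, here is a minimal sketch of using re to pull a price out of an HTML fragment. The input string is made up purely for illustration:

```python
import re

# Illustrative HTML fragment, not taken from a real page
html = '<span class="price">$120.00</span>'

# Match a dollar amount: a "$", digits, and an optional two-digit cent part
match = re.search(r'\$\d+(?:\.\d{2})?', html)
if match:
    print(match.group())  # $120.00
```

In practice, bs4 or lxml is usually a better fit for HTML than raw regular expressions, but re is handy for cleaning up the extracted strings.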
For myself, I distinguish two approaches to data extraction and analysis: frontend and backend.
Frontend. In this case, we obtain information directly from the final HTML file assembled on the web application server. This method has its pros and cons: we always get information that is actually published on the site, but we lose performance, since sometimes we need to learn about site updates as quickly as possible.
Backend. In this case, we interact with the web application's backend API, which usually returns data as JSON or XML. If a site exposes such an API, working with it is faster and more convenient than parsing HTML, although finding and understanding the right API endpoints takes some extra effort.
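To make the backend approach concrete, here is a minimal sketch of parsing a JSON response with the standard json library. The payload below is a made-up example of what a shop's backend API might return; real endpoints and field names will differ:

```python
import json

# A made-up JSON payload imitating a backend API response
payload = '{"products": [{"title": "Sneaker", "price": "$120"}]}'

# Parse the string into Python dicts and lists
data = json.loads(payload)
for item in data["products"]:
    print(item["title"], item["price"])  # Sneaker $120
```

With requests, the same parsing is available directly via response.json(); we will return to this in the part about backend APIs.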
Frontend
Let's start with something simple: fetching a page with the requests library. Don't forget the headers dict, so that the server takes our request for one coming from an ordinary browser rather than a bot. As an example, let's take the Kith store:
import requests

headers = {
    'authority': 'www.kith.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
    'sec-fetch-dest': 'document',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
}

session = requests.session()
response = session.get("https://kith.com/collections/mens-footwear", headers=headers)
if response.status_code == 200:
    print("Success")
else:
    print("Bad result")
Next, open Developer Tools (Ctrl+Shift+I) and study the page markup to find the elements that hold the data we need.
For each product we are interested in three values: the name, the price, and the link.
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
for element in soup.find_all('li', class_='collection-product'):
    name = element.find('h1', class_="product-card__title").text.strip()
    price = element.find('span', class_="product-card__price").text.strip()
    link = "https://kith.com/" + element.find('a').get('href')
To store the extracted values conveniently, let's define a small class:

class Product:
    name: str
    price: str
    link: str

    def __init__(self, name, price, link):
        self.name = name
        self.price = price
        self.link = link

    def __repr__(self):
        return str(self.__dict__)
Putting everything together, the full script looks like this:
import requests
from bs4 import BeautifulSoup


class Product:
    name: str
    price: str
    link: str

    def __init__(self, name, price, link):
        self.name = name
        self.price = price
        self.link = link

    def __repr__(self):
        return str(self.__dict__)


headers = {
    'authority': 'www.kith.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
    'sec-fetch-dest': 'document',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
}

session = requests.session()
response = session.get('https://kith.com/collections/mens-footwear', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
for element in soup.find_all('li', class_='collection-product'):
    name = element.find('h1', class_="product-card__title").text.strip()
    price = element.find('span', class_="product-card__price").text.strip()
    link = "https://kith.com/" + element.find('a').get('href')
    product = Product(name, price, link)
    print(product)
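The json library from the tools section has not appeared yet; as an optional follow-up, here is a sketch of saving scraped results to a file. The file name and the sample data are arbitrary, chosen only to mirror the shape of the products collected above:

```python
import json

# Sample results in the same shape as the scraped products (made up here)
products = [
    {"name": "Sneaker", "price": "$120", "link": "https://kith.com/products/sneaker"},
]

# Write the list to disk as pretty-printed JSON
with open("products.json", "w") as f:
    json.dump(products, f, indent=2)
```

Persisting results like this makes it easy to diff consecutive runs and spot when the site has been updated.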
We have examined the basic technologies and principles that we will meet in the following parts. I recommend that you try to complete the tasks with the listed libraries on your own, and I would also be glad to hear your preferences for the site to be chosen as an example. Next, we will look at how to find backend API endpoints and interact with them.