Web Scraping, Part 1

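Introduction

Hello everyone. Recently I had the idea of showing anyone who is interested how scrapers are written. Since most readers are familiar with Python, all further examples will be written in it.

This part is an introduction for those who have not yet tried their hand at this area. If you are already an advanced reader, you can safely scroll on, but to keep the thread of the series I recommend giving this article at least a little attention.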

print('Part 1. Get started')

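Tools

Programming language and the corresponding libraries
Naturally, you cannot get anywhere without these. In our case it will be Python. Combined with the right libraries, it is a powerful tool for writing scrapers: requests, bs4, json, lxml, re.
Developer tools
Every modern browser ships with them. Personally, I prefer Google Chrome or Firefox. If you use a different browser, I recommend trying one of these two. We will need almost all of the panels: Elements, Console, Network, Application, Debugger.
A modern IDE
The choice is yours. The only thing I would advise is to check that the development environment has built-in interpreter support, a debugger, and a static analyzer. I prefer PyCharm from JetBrains.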

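A little about formats

For myself, I distinguish two approaches to extracting and analyzing data: frontend and backend.

Frontend. In this case we take the information directly from the final HTML file assembled on the web application's server. This approach has its pros and cons: we always get exactly the information that has already been published on the site, but we lose in performance, which matters when we need to learn about site updates as quickly as possible.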

Backend. In this case we get the data from the site's backend API, in JSON or XML format.
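To make the backend idea a little more concrete, here is a minimal sketch of reading a JSON response body with the standard json module. The payload shape (the products, title, price and handle keys) and the parse_products helper are made up for illustration; a real backend API will use its own field names.

```python
import json

# Hypothetical response body; not the actual API of any site.
SAMPLE_BODY = """
{"products": [
  {"title": "Sneaker A", "price": "120.00", "handle": "sneaker-a"},
  {"title": "Sneaker B", "price": "95.50", "handle": "sneaker-b"}
]}
"""


def parse_products(body):
    """Return (title, price, handle) tuples from a JSON document."""
    data = json.loads(body)
    # A missing "products" key is treated as an empty list.
    return [(p["title"], p["price"], p["handle"]) for p in data.get("products", [])]


for item in parse_products(SAMPLE_BODY):
    print(item)
```

Compared with parsing HTML, the data here arrives already structured, so no selectors are needed.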

Frontend

To fetch the page we use the requests library and send browser-like headers along with the request. As an example, take Kith:

import requests

headers = {
    'authority': 'www.kith.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
    'sec-fetch-dest': 'document',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
}

session = requests.session()

response = session.get("https://kith.com/collections/mens-footwear", headers=headers)

if response.status_code == 200:
    print("Success")
else:
    print("Bad result")
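A single status check can be misleading, because a request may fail only temporarily. Below is a rough sketch of a retry loop built around the same check. The fetch_with_retries helper, the attempt count and the delay are assumptions for illustration; fetch stands for any zero-argument callable that returns an object with a status_code attribute.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=1.0):
    """Call fetch() until it returns status 200 or the attempts run out.

    Returns the last response received, whatever its status code.
    """
    response = None
    for attempt in range(1, attempts + 1):
        response = fetch()
        if response.status_code == 200:
            break
        if attempt < attempts:
            time.sleep(delay)  # wait before the next try
    return response
```

With the session from the listing above it could be called as fetch_with_retries(lambda: session.get(url, headers=headers)), where url is the page address.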

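Next, open the developer tools (Ctrl+Shift+I) and find the elements that hold the data we need.
For each product we extract three things: name, price, and link.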

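from urllib.parse import urljoin
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Each product card is an <li class="collection-product"> element
for element in soup.find_all('li', class_='collection-product'):
    name = element.find('h1', class_="product-card__title").text.strip()
    price = element.find('span', class_="product-card__price").text.strip()
    # urljoin avoids a doubled slash when href starts with "/"
    link = urljoin("https://kith.com/", element.find('a').get('href'))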
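One thing to keep in mind with the loop above: find() returns None when a tag is missing, so calling .text on the result raises AttributeError as soon as the site changes its markup. A small guard can be sketched like this; the text_or_default name and its default value are assumptions, not part of bs4.

```python
def text_or_default(node, default=""):
    """Return the stripped text of a node, or default if the node is None."""
    if node is None:
        return default
    return node.text.strip()
```

Inside the loop it would be used as name = text_or_default(element.find('h1', class_="product-card__title")), so a card without a title produces an empty string instead of stopping the whole run.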

class Product:
    """Holds the three fields scraped for one product."""
    name: str
    price: str
    link: str

    def __init__(self, name, price, link):
        self.name = name
        self.price = price
        self.link = link

    def __repr__(self):
        return str(self.__dict__)
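As an alternative sketch, the same container can be written with dataclasses from the standard library, which generates __init__ and __repr__ automatically. This is not the author's version, and the sample values are made up.

```python
from dataclasses import dataclass, asdict


@dataclass
class Product:
    name: str
    price: str
    link: str


item = Product("Sneaker A", "$120.00", "https://kith.com/products/sneaker-a")
print(item)          # uses the generated __repr__
print(asdict(item))  # plain dict, convenient for serialization
```

The hand-written class works just as well; the dataclass form simply removes the repetition when more fields are added later.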

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


class Product:
    """Holds the three fields scraped for one product."""
    name: str
    price: str
    link: str

    def __init__(self, name, price, link):
        self.name = name
        self.price = price
        self.link = link

    def __repr__(self):
        return str(self.__dict__)


# Browser-like headers copied from a real Chrome request
headers = {
    'authority': 'www.kith.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
    'sec-fetch-dest': 'document',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
}

session = requests.session()

response = session.get('https://kith.com/collections/mens-footwear', headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')

# Each product card is an <li class="collection-product"> element
for element in soup.find_all('li', class_='collection-product'):
    name = element.find('h1', class_="product-card__title").text.strip()
    price = element.find('span', class_="product-card__price").text.strip()
    # urljoin avoids a doubled slash when href starts with "/"
    link = urljoin("https://kith.com/", element.find('a').get('href'))

    product = Product(name, price, link)

    # print() calls __repr__ on the object for us
    print(product)
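The script above only prints each product. One possible next step, sketched here with the json module from the tools list, is to save the results to a file. The dump_products and load_products helpers and the idea of passing products as plain dicts are assumptions for illustration.

```python
import json


def dump_products(products, path):
    """Write a list of product dicts to path as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(products, fh, ensure_ascii=False, indent=2)


def load_products(path):
    """Read back a list of product dicts written by dump_products."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

With the class from the listing, the dicts could come from each object's __dict__, collected into a list inside the loop and written once at the end.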

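We have looked at the basic techniques and principles that the following parts will build on. I recommend trying to complete a task with these libraries on your own, and I would also be glad to hear which site you would like to see used as an example. Next, we will look at how to find backend API endpoints and interact with them.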
