aiohttp + FastAPI + BeautifulSoup Test

Feb 28, 2022

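background: Most of the crawlers you find online (plus the ones I usually write myself) just run locally. Putting requests + BeautifulSoup behind FastAPI actually works fine too, but it's... slow, slow, and still slow...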

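I did find some material, such as FastAPI-aiohttp-example, but after staring at it for a long time I still didn't really get it. It also uses aioresponses to mock/fake web requests in the Python aiohttp package, plus the Coroutine abstract class from collections.abc, and I couldn't make sense of any of it aaaaaaah... So yours truly decided to write a simple example that is easier to follow 😂 and run a little experiment along the way XD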

Here I use a search for "django" jobs on 104 Job Bank (104人力銀行) as the example. Code first:

import time
import aiohttp
from pydantic import BaseModel, Field
from fastapi import FastAPI
from fastapi.encoders import jsonable_encoder
from fastapi.responses import JSONResponse
import requests
from bs4 import BeautifulSoup

app = FastAPI()


# parse jobs title and content
def parse_html(html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title").text
    content = soup.find("meta", {"property": "og:description"}).get("content")
    return {"title": title, "content": content}

# get html content of links
async def aiohttp_fetch(client, link):
    print("exec_aiohttp_fetch")
    async with client.get(link) as r:
        assert r.status == 200
        return await r.text()


# browse all links
async def aiohttp_104(links):
    print("exec_aiohttp_104")
    res = []
    async with aiohttp.ClientSession() as client:
        for link in links:
            print(link)
            html = await aiohttp_fetch(client, link)
            res.append(parse_html(html))
    return res


# return objects model
class Job(BaseModel):
    title: str = Field(title="Job title")
    content: str = Field(title="Job description")


url = "https://www.104.com.tw/jobs/search/?ro=0&keyword=django&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=1&mode=s&jobsource=2018indexpoc&langFlag=0&langStatus=0&recommendJob=1&hotJob=1"
prefix = "https:"


# FastAPI route
@app.get("/api/104/django/1/")
async def get_django_job():
    start_time = time.time()
    print(f"---------- start_time: {start_time}s ----------")
    # get page1 all_url
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, "html.parser")
        links_ele = soup.find_all("a", {"class": "js-job-link"}, href=True)
        links = [prefix + ele["href"] for ele in links_ele]

        # start parse_and_fetch
        jobs = await aiohttp_104(links)

        # make FastAPI return objects
        res = [
            Job(title=job.get("title"), content=job.get("content"))
            for job in jobs
        ]
        json_compatible_item_data = jsonable_encoder(res)
        print(f"---------- execute_time: {time.time() - start_time}s ----------")
        return JSONResponse(content=json_compatible_item_data)

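url is the result of searching 104 Job Bank with django as the keyword. As the parameters show, page=1 XD, so we won't crawl multiple pages here.
Steps: once the FastAPI route kicks off the crawler, it first uses requests to get the response for url, then scrapes all the links_ele (the href of each job), and finally passes links to async def aiohttp_104(), and that's it (?)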
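One thing worth noting about the first step: requests.get() is a blocking call, and inside an async def route it holds the event loop until it returns. Below is a minimal sketch of one way around that, assuming Python 3.9+ for asyncio.to_thread(). The names blocking_get and heartbeat are made up for illustration; blocking_get just sleeps in place of a real requests.get(url), so this is not the code from the route above.

```python
import asyncio
import time


# Hypothetical stand-in for requests.get(url): blocks the calling thread.
def blocking_get(url: str) -> str:
    time.sleep(0.2)
    return f"<html>{url}</html>"


async def heartbeat(ticks: list):
    # Records a tick every 50 ms; it can only run while the event loop is free.
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.perf_counter())


async def main():
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    # to_thread() runs the blocking call in a worker thread,
    # so the event loop keeps serving other coroutines meanwhile.
    html = await asyncio.to_thread(blocking_get, "https://example.com/search")
    await beat
    return html, ticks


if __name__ == "__main__":
    html, ticks = asyncio.run(main())
    print(html, len(ticks))
```

Another option would be to fetch the search page with the same aiohttp session used for the job pages, which avoids the extra thread entirely.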
BeautifulSoup isn't today's star, so I'm mostly skipping that part XDDD
(Basically, ignore anything with soup in it XD)
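For anyone who does want a quick look at the soup part, here is a small sketch of the link-extraction step run against a made-up HTML snippet. SAMPLE_HTML and extract_links are hypothetical and only imitate the shape of the search page; the real 104 markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only imitates the shape of a search results page.
SAMPLE_HTML = """
<div>
  <a class="js-job-link" href="//example.com/job/aaa">Backend Engineer</a>
  <a class="js-job-link" href="//example.com/job/bbb">Django Developer</a>
  <a class="other" href="//example.com/about">About</a>
  <a class="js-job-link">No href here</a>
</div>
"""

prefix = "https:"


def extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all.
    links_ele = soup.find_all("a", {"class": "js-job-link"}, href=True)
    # The hrefs are protocol-relative ("//..."), so the scheme is prepended.
    return [prefix + ele["href"] for ele in links_ele]


links = extract_links(SAMPLE_HTML)
print(links)
# ['https://example.com/job/aaa', 'https://example.com/job/bbb']
```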

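Following the aiohttp example (?), the two async def functions other than the FastAPI route are pretty much copied straight from that pattern. The core idea is to take the client created by the aiohttp.ClientSession() context manager, pass it together with each job link into async def aiohttp_fetch(), and the program will dutifully crawl every link 👍
Finally, jobs = await aiohttp_104(links) gives us the title and content of each job, which we turn into models and hand to FastAPI to return, and we're done~ (confetti 🎉
Let's look at the log: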

---------- start_time: 1646043493.143475s ----------
exec_aiohttp_104
https://www.104.com.tw/job/7g9os?jobsource=hotjob_chr
exec_aiohttp_fetch
https://www.104.com.tw/job/7icb0?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/75ai8?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7c19s?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/793ve?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7duf1?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/76vbt?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/72iao?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7ih51?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7j7ub?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7i7yz?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7hing?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7jo75?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7etdk?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7hf9h?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/73trn?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/73s56?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/6blep?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/76p4d?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/6x67u?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7h65u?jobsource=jolist_d_relevance
exec_aiohttp_fetch
---------- execute_time: 3.1635680198669434s ----------
INFO:     127.0.0.1:49552 - "GET /api/104/django/1/ HTTP/1.1" 200 OK

The whole thing took 3.16s, which should count as fast (?
And then the FastAPI response:

Although I still haven't used create_task or gather, at least aiohttp is now inside FastAPI QQQQQ There are still lots and lots of hard problems to get through Orz..
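As a rough idea of what the gather version might look like, here is a sketch that compares a sequential loop (the same shape as the one in aiohttp_104()) with asyncio.gather(). To keep it runnable without a network, fake_fetch is a hypothetical stand-in for aiohttp_fetch() that just sleeps; the link names and the 0.1s delay are made up, so the timings say nothing about 104 itself.

```python
import asyncio
import time


# Hypothetical stand-in for aiohttp_fetch(): sleeps instead of doing network I/O.
async def fake_fetch(link: str, delay: float = 0.1) -> str:
    await asyncio.sleep(delay)
    return f"<title>{link}</title>"


async def fetch_sequential(links):
    # Same shape as the for loop in aiohttp_104():
    # each await finishes before the next fetch starts.
    return [await fake_fetch(link) for link in links]


async def fetch_concurrent(links):
    # gather() schedules all coroutines at once and
    # returns the results in the same order as the inputs.
    return await asyncio.gather(*(fake_fetch(link) for link in links))


async def timed(coro):
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def main():
    links = [f"job-{i}" for i in range(5)]
    seq_result, seq_time = await timed(fetch_sequential(links))
    con_result, con_time = await timed(fetch_concurrent(links))
    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
    return seq_result, con_result, seq_time, con_time


if __name__ == "__main__":
    asyncio.run(main())
```

With real requests, firing every fetch at once may be too aggressive for the target site, so something like an asyncio.Semaphore to cap concurrency would probably be worth adding.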

Especially that price calendar XDDDDD

Okay, that's it for now. Happy long weekend~~~!

Written by Pingshian Yu
