aiohttp + FastAPI + BeautifulSoup Test

Pingshian Yu
10 min read · Feb 28, 2022


Photo by Hitesh Choudhary on Unsplash

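Background: the crawlers I find online (and the ones I usually write myself) all run directly on my local machine. Putting requests + BeautifulSoup on FastAPI also works perfectly fine, but it is... slow, slow, and still slow...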

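I did find some references, such as FastAPI-aiohttp-example, but after staring at it for a long time I still didn't really get it. It also uses aioresponses (mock/fake web requests in python aiohttp package) and the collections.abc.Coroutine abstract class, which I completely failed to understand aaaaaah... So I humbly want to write a simple example that is easier to follow 😂 and run a little experiment along the way XD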

  • Taking a 104 Job Bank search for "django" jobs as the example, code first:
import time
import aiohttp
from pydantic import BaseModel, Field
from fastapi import FastAPI
from fastapi.encoders import jsonable_encoder
from fastapi.responses import JSONResponse
import requests
from bs4 import BeautifulSoup

app = FastAPI()


# parse jobs title and content
def parse_html(html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title").text
    content = soup.find("meta", {"property": "og:description"}).get("content")
    return {"title": title, "content": content}

# get html content of links
async def aiohttp_fetch(client, link):
    print("exec_aiohttp_fetch")
    async with client.get(link) as r:
        assert r.status == 200
        return await r.text()


# browse all links
async def aiohttp_104(links):
    print("exec_aiohttp_104")
    res = []
    async with aiohttp.ClientSession() as client:
        for link in links:
            print(link)
            # each fetch is awaited before the next one starts (one link at a time)
            html = await aiohttp_fetch(client, link)
            res.append(parse_html(html))
    return res


# return objects model
class Job(BaseModel):
    title: str = Field(title="職稱")  # job title
    content: str = Field(title="工作內容")  # job description


url = "https://www.104.com.tw/jobs/search/?ro=0&keyword=django&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=14&asc=0&page=1&mode=s&jobsource=2018indexpoc&langFlag=0&langStatus=0&recommendJob=1&hotJob=1"
prefix = "https:"


# FastAPI route
@app.get("/api/104/django/1/")
async def get_django_job():
    start_time = time.time()
    print(f"---------- start_time: {start_time}s ----------")
    # get all job links on page 1
    # note: requests.get is a blocking call, so the event loop waits here
    links = []  # stays empty if the search page does not return 200
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, "html.parser")
        links_ele = soup.find_all("a", {"class": "js-job-link"}, href=True)
        links = [prefix + ele["href"] for ele in links_ele]

    # start parse_and_fetch
    jobs = await aiohttp_104(links)

    # make FastAPI return objects
    res = [
        Job(title=job.get("title"), content=job.get("content"))
        for job in jobs
    ]
    json_compatible_item_data = jsonable_encoder(res)
    print(f"---------- execute_time: {time.time() - start_time}s ----------")
    return JSONResponse(content=json_compatible_item_data)
  • url is the 104 Job Bank search result for the keyword django. As the parameters show, it is page=1 XD. We won't crawl multiple pages here.
  • Steps: once FastAPI kicks off the crawler, it first uses requests to get the response for url and scrapes all the links_ele (the href of each job), then passes links to async def aiohttp_104(), and that's it (?)
  • BeautifulSoup isn't today's star, so I'm mostly skipping that part XDDD
  • Basically, just ignore anything with soup in it XD
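  • Following the aiohttp example (?), the two async def functions other than the FastAPI route are basically copied from the same pattern. The core idea: aiohttp.ClientSession() is used as a context manager to create client, and client plus each job link are passed to async def aiohttp_fetch(); the program then obediently crawls every link 👍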
  • Finally, the title and content of the jobs returned by jobs = await aiohttp_104(links) are used to build the model and handed to FastAPI to return, and we're done~ (confetti 🎉
  • Let's look at the log:
---------- start_time: 1646043493.143475s ----------
exec_aiohttp_104
https://www.104.com.tw/job/7g9os?jobsource=hotjob_chr
exec_aiohttp_fetch
https://www.104.com.tw/job/7icb0?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/75ai8?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7c19s?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/793ve?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7duf1?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/76vbt?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/72iao?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7ih51?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7j7ub?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7i7yz?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7hing?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7jo75?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7etdk?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7hf9h?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/73trn?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/73s56?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/6blep?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/76p4d?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/6x67u?jobsource=jolist_d_relevance
exec_aiohttp_fetch
https://www.104.com.tw/job/7h65u?jobsource=jolist_d_relevance
exec_aiohttp_fetch
---------- execute_time: 3.1635680198669434s ----------
INFO: 127.0.0.1:49552 - "GET /api/104/django/1/ HTTP/1.1" 200 OK
  • It ran for 3.16s in total for 21 job pages, which should count as fast, I guess (?
  • And then the FastAPI response:

Even though I still haven't used create_task or gather, at least aiohttp is now running inside FastAPI QQQQQ There are still lots and lots of hard problems to overcome Orz..

Especially that price calendar XDDDDD
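For the gather part I haven't gotten to yet, here is a minimal sketch of what it might look like. This is not the code from this post: fake_fetch is a hypothetical stand-in for aiohttp_fetch that just sleeps instead of hitting 104, and the example.com links are made up, so the numbers only show the shape of the difference between awaiting links one by one (like the loop in aiohttp_104) and handing them all to asyncio.gather.

```python
import asyncio
import time


async def fake_fetch(link: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for aiohttp_fetch: sleeps instead of doing a real request."""
    await asyncio.sleep(delay)
    return f"<html>{link}</html>"


async def fetch_sequential(links):
    # Same shape as the loop in aiohttp_104: each await finishes before the next starts.
    return [await fake_fetch(link) for link in links]


async def fetch_concurrent(links):
    # gather runs all the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_fetch(link) for link in links))


async def timed(fetch_fn, links):
    # Run one of the two strategies and measure how long it took.
    start = time.perf_counter()
    pages = await fetch_fn(links)
    return pages, time.perf_counter() - start


# Made-up links, for illustration only.
links = [f"https://example.com/job/{i}" for i in range(5)]
seq_pages, seq_time = asyncio.run(timed(fetch_sequential, links))
con_pages, con_time = asyncio.run(timed(fetch_concurrent, links))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

If this carries over to the real crawler, the change would probably be limited to the loop inside aiohttp_104, while still sharing one ClientSession. Whether 104 is happy with that many requests arriving at once is a separate question I haven't tested.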

OK, that's it for now. Enjoy the long weekend ~~~!
