Additional requests
Some websites load certain information only after a page interaction, such as clicking a button, scrolling down, or hovering over an element. These interactions usually trigger background requests, and JavaScript then loads their responses into the page.
To extract such data, reproduce those requests using HttpClient.
Include HttpClient among the inputs of your page object, and use an asynchronous field or method to call one of its methods.
For example, simulating a click on a button that loads product images could look like:
import attrs
from web_poet import HttpClient, HttpError, field
from zyte_common_items import Image, ProductPage


@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @field
    async def images(self):
        url = f"https://api.example.com/v2/images?id={self.productId}"
        try:
            response = await self.http.get(url)
        except HttpError:
            return []
        else:
            urls = response.css(".product-images img::attr(src)").getall()
            return [Image(url=url) for url in urls]
Warning
HttpClient should only be used to handle the type of scenarios
mentioned above. Using HttpClient for crawling logic would
defeat the purpose of web-poet.
Making a request
HttpClient provides multiple asynchronous request methods, such as:
http = HttpClient()
response = await http.get(url)
response = await http.post(url, body=b"...")
response = await http.request(url, method="...")
response = await http.execute(HttpRequest(url, method="..."))
Request methods also accept custom headers and body, for example:
response = await http.post(
    url,
    headers={"Content-Type": "application/json;charset=UTF-8"},
    body=json.dumps({"foo": "bar"}).encode("utf-8"),
)
Request methods may either raise an HttpError or return an
HttpResponse. See Working with HttpResponse.
Note
HttpClient methods are expected to follow any redirection except
when the request method is HEAD. This means that the
HttpResponse that you get is already the end of any redirection
trail.
Concurrent requests
To send multiple requests concurrently, use HttpClient.batch_execute, which accepts any number of
HttpRequest instances as input, and returns HttpResponse
instances (and HttpError instances when using
return_exceptions=True) in the input order. For example:
import attrs
from web_poet import HttpClient, HttpError, HttpRequest, field
from zyte_common_items import Image, ProductPage, ProductVariant


@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    max_variants = 10

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @field
    async def variants(self):
        requests = [
            HttpRequest(f"https://example.com/api/variant/{self.productId}/{index}")
            for index in range(self.max_variants)
        ]
        responses = await self.http.batch_execute(*requests, return_exceptions=True)
        return [
            ProductVariant(color=response.css("::attr(color)").get())
            for response in responses
            if not isinstance(response, HttpError)
        ]
You can alternatively use asyncio together with HttpClient to
handle multiple requests. For example, you can use asyncio.as_completed()
to process the first response from a group of requests as early as possible.
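The asyncio.as_completed() approach can be sketched as follows. This is a minimal illustration that uses only the standard library: fake_get is a hypothetical stand-in for an awaitable such as self.http.get(url), and the URLs and delays are made up so that the snippet runs without network access.

```python
import asyncio


async def fake_get(url: str, delay: float) -> str:
    # Hypothetical stand-in for `await self.http.get(url)`; it sleeps
    # instead of performing network I/O.
    await asyncio.sleep(delay)
    return f"response from {url}"


async def first_response() -> str:
    # Wrap the coroutines in tasks so that the slower ones can be cancelled.
    tasks = [
        asyncio.create_task(fake_get("https://example.com/slow", 0.2)),
        asyncio.create_task(fake_get("https://example.com/fast", 0.01)),
    ]
    try:
        # as_completed() yields awaitables in completion order,
        # not in input order.
        for earliest in asyncio.as_completed(tasks):
            return await earliest
    finally:
        # Cancel whatever is still pending once the first result is in.
        for task in tasks:
            task.cancel()


result = asyncio.run(first_response())
print(result)
```

Unlike batch_execute, which returns results in input order, this pattern hands you results in the order in which they finish, so it suits cases where any one successful response is enough.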
Error handling
HttpClient methods may raise an exception of type
HttpError or a subclass.
If the response HTTP status code (response.status) is 400 or higher, HttpResponseError is
raised. In case of connection errors, TLS errors and similar,
HttpRequestError is raised.
HttpError provides access to the offending
request, and HttpResponseError also provides
access to the offending response.
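A rough sketch of how the two error types might be handled separately is shown below. To keep it runnable without web-poet installed, it defines local stand-in classes that mirror the hierarchy described above; the attribute names request and response, the fake_get helper, and the dict-based response are assumptions made for illustration. In a real page object you would import the exception classes from web_poet and check their API reference for the exact attributes.

```python
import asyncio


# Hypothetical stand-ins mirroring the hierarchy described above.
# The `request` and `response` attribute names are assumptions.
class HttpError(Exception):
    def __init__(self, msg, request=None):
        super().__init__(msg)
        self.request = request


class HttpRequestError(HttpError):
    pass


class HttpResponseError(HttpError):
    def __init__(self, msg, response=None, request=None):
        super().__init__(msg, request=request)
        self.response = response


async def fake_get(url):
    # Hypothetical stand-in for `self.http.get(url)`.
    if "missing" in url:
        raise HttpResponseError("404", response={"status": 404}, request=url)
    if "unreachable" in url:
        raise HttpRequestError("connection failed", request=url)
    return {"status": 200}


async def describe(url):
    try:
        response = await fake_get(url)
    except HttpResponseError as error:
        # A response was received, but its status signals an error.
        return f"bad status {error.response['status']} for {error.request}"
    except HttpRequestError as error:
        # No response at all: connection error, TLS error, or similar.
        return f"no response for {error.request}"
    return f"ok {response['status']}"


ok = asyncio.run(describe("https://example.com/item"))
missing = asyncio.run(describe("https://example.com/missing"))
unreachable = asyncio.run(describe("https://example.com/unreachable"))
```

Catching the two subclasses before any generic HttpError handler lets the code treat a bad status differently from a request that never got a response.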
Retrying additional requests
Input validation allows retrying all inputs from a page object. To retry only additional requests, you must handle retries on your own.
Your code is responsible for retrying additional requests until good response data is received, or until some maximum number of retries is exceeded.
It is up to you to decide what the maximum number of retries should be for a given additional request, based on your experience with the target website.
It is also up to you to decide how to implement retries of additional requests.
One option would be tenacity. For example, to try an additional request 3 times before giving up:
import attrs
from tenacity import retry, stop_after_attempt
from web_poet import HttpClient, HttpError, field
from zyte_common_items import Image, ProductPage


@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    # reraise=True makes tenacity raise the last HttpError instead of
    # wrapping it in tenacity.RetryError once the attempts run out.
    @retry(stop=stop_after_attempt(3), reraise=True)
    async def get_images(self):
        # Await inside the retried function so that failures are raised here.
        return await self.http.get(
            f"https://api.example.com/v2/images?id={self.productId}"
        )

    @field
    async def images(self):
        try:
            response = await self.get_images()
        except HttpError:
            return []
        else:
            urls = response.css(".product-images img::attr(src)").getall()
            return [Image(url=url) for url in urls]
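If you would rather not add a dependency, a hand-written retry loop is another option. The following is a minimal sketch using only the standard library; retry_async, flaky_get and TransientError are hypothetical names introduced here, with TransientError standing in for HttpError and flaky_get standing in for a call such as self.http.get(url). The attempt count and delays are arbitrary.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for web_poet's HttpError."""


async def retry_async(make_call, attempts=3, base_delay=0.01):
    # Call `make_call()` up to `attempts` times, doubling the wait
    # between consecutive tries.
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except TransientError:
            if attempt == attempts:
                raise  # Out of retries: let the caller handle the error.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


calls = 0


async def flaky_get():
    # Hypothetical request that fails twice and then succeeds.
    global calls
    calls += 1
    if calls < 3:
        raise TransientError("temporary failure")
    return "images payload"


result = asyncio.run(retry_async(flaky_get))
print(result, calls)
```

Because the last error is re-raised unchanged, the calling field can keep a plain except clause for the original exception type.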
If the reason your additional request fails is outdated or missing data from page object input, do not try to reproduce the request for that input as an additional request. Request fresh input instead.