Built-in framework

web_poet.framework is a built-in web-poet framework for simple use cases.

It is designed to be easy to use for quick proofs of concept, simple scripts, and test fixture generation. It can also serve as a reference implementation for framework authors.

Limitations

The main limitation of the built-in framework is that it is not a complete scraping framework like Scrapy, which can support web-poet thanks to scrapy-poet.

As a web-poet framework, the built-in framework also lacks support for custom input classes, Retry and UseFallback.

Also, browser inputs only support plain GET requests. Requests with a non-GET method, headers or a body raise HttpRequestError.

Installation

To use web_poet.framework, install the framework extra:

pip install web-poet[framework]

For browser support, you also need to install at least one browser with Playwright. For example, to install the main browsers:

playwright install

Basic use

from dataclasses import dataclass
from web_poet import WebPage, field
from web_poet.framework import Framework
from web_poet.utils import ensure_awaitable


@dataclass
class Book:
    title: str


class BookPage(WebPage[Book]):
    @field
    def title(self) -> str:
        return self.response.css("h1::text").get()


framework = Framework()
item = await framework.get_item("https://books.example.com/book/1", BookPage)

# Or, if you prefer, get a page object instance first.
page = await framework.get_page("https://books.example.com/book/1", BookPage)
item = await ensure_awaitable(page.to_item())
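
The snippets on this page use top-level await, which works in an asyncio-aware REPL (for example, python -m asyncio) or in a notebook. In a regular script, the calls need to run inside a coroutine. The following is a minimal sketch of that pattern, and of why ensure_awaitable() is handy: to_item() may be either a regular method or a coroutine. SyncPage, AsyncPage and the local ensure_awaitable() are hypothetical stand-ins, so that the sketch runs without web-poet or network access; the local helper only approximates what web_poet.utils.ensure_awaitable is assumed to do.

```python
import asyncio
import inspect


async def ensure_awaitable(obj):
    # Stand-in (assumption): await awaitables, return anything else as-is.
    if inspect.isawaitable(obj):
        return await obj
    return obj


class SyncPage:
    # Hypothetical page object whose to_item() is a regular method.
    def to_item(self):
        return {"title": "Sync book"}


class AsyncPage:
    # Hypothetical page object whose to_item() is a coroutine.
    async def to_item(self):
        return {"title": "Async book"}


async def main():
    # With the real framework, each page would come from
    # `await framework.get_page(url, PageClass)` instead.
    items = []
    for page in (SyncPage(), AsyncPage()):
        items.append(await ensure_awaitable(page.to_item()))
    return items


# Entry point for a plain script: run the coroutine to completion.
items = asyncio.run(main())
```

In a script that uses the real framework, the body of main() would hold the Framework() and get_item() or get_page() calls shown above.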

Choosing a page object class automatically

If you decorate your page object classes with handle_urls() and make sure they are imported, e.g. with consume_modules(), you can pass get_item() an item class, and let it determine which page object class to use:

from dataclasses import dataclass
from web_poet import WebPage, field, handle_urls
from web_poet.framework import Framework


@dataclass
class Book:
    title: str


@handle_urls("books.example.com")
class BookPage(WebPage[Book]):
    @field
    def title(self) -> str:
        return self.response.css("h1::text").get()


framework = Framework()
item = await framework.get_item("https://books.example.com/book/1", Book)
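
As a rough mental model of the lookup described above, and not web-poet's actual implementation, the idea can be sketched as a registry that maps a URL and an item class to a page object class. The handles() decorator, the _RULES dictionary and page_cls_for() are hypothetical names introduced for illustration only. The sketch compares exact host names, and web-poet's own rule matching should be expected to be richer than that.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical registry: (host name, item class) -> page object class.
_RULES = {}


def handles(domain, item_cls):
    """Hypothetical decorator that records which page class serves a domain."""

    def decorator(page_cls):
        _RULES[(domain, item_cls)] = page_cls
        return page_cls

    return decorator


def page_cls_for(url, item_cls):
    """Return the registered page class for this URL and item class, or None."""
    return _RULES.get((urlparse(url).hostname, item_cls))


@dataclass
class Book:
    title: str


@handles("books.example.com", Book)
class BookPage:
    """Placeholder page object; a real one would subclass WebPage[Book]."""


# The decorator only runs if the module defining BookPage has been imported,
# which is why the page object modules must be imported before the lookup.
chosen = page_cls_for("https://books.example.com/book/1", Book)
missing = page_cls_for("https://other.example.com/book/1", Book)
```

Here chosen is BookPage, while missing is None because no rule was registered for that host.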

Browser

The built-in framework can use Playwright to resolve browser dependencies like BrowserHtml or BrowserResponse.

Chromium is used by default. You can override that by passing default_playwright_engine to Framework. Page objects can also annotate their Playwright engine dependencies with playwright_engine() to specify which engine they require. For example:

from typing import Annotated

from web_poet import WebPage, Item
from web_poet.page_inputs.browser import BrowserResponse
from web_poet.framework import playwright_engine


class MyPageObject(WebPage[Item]):
    response: Annotated[BrowserResponse, playwright_engine("firefox")]

Stats

The built-in framework supports Stats.

By default, Framework creates a DictStatCollector object, exposes it to any page object that requests Stats, and exposes that object as the stats attribute of the framework:

from web_poet.framework import Framework

framework = Framework()
item1 = await framework.get_item("http://example.com/book/1", BookPage)
item2 = await framework.get_item("http://example.com/book/2", BookPage)
all_stats = framework.stats

Framework also supports passing a custom stats collector:

from web_poet.framework import Framework
from web_poet.page_inputs.stats import StatCollector


class MyStatCollector(StatCollector): ...


framework = Framework(stats=MyStatCollector())
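
The class body above is elided. As a minimal sketch of what a custom collector might look like, assuming the collector interface consists of set(key, value) and inc(key, value=1) (check the StatCollector definition in your installed version), the following keeps stats in a dictionary. InMemoryStatCollector and its data attribute are hypothetical names, and the class deliberately does not import web-poet so that it runs on its own; in real use it would subclass StatCollector.

```python
class InMemoryStatCollector:
    """Dictionary-backed collector (hypothetical; would subclass StatCollector)."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        # Overwrite whatever value was stored under this key.
        self.data[key] = value

    def inc(self, key, value=1):
        # Add to the stored number, starting from 0 for unseen keys.
        self.data[key] = self.data.get(key, 0) + value


stats = InMemoryStatCollector()
stats.set("last_url", "http://example.com/book/2")
stats.inc("books/parsed")
stats.inc("books/parsed")
stats.inc("bytes", 512)
```

A collector along these lines could instead forward values to a logging or metrics backend, leaving the page objects that request Stats unchanged.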