API reference
Page Inputs
- class web_poet.page_inputs.browser.BrowserHtml[source]
Bases: SelectableMixin, str
HTML returned by a web browser, i.e. a snapshot of the DOM tree in HTML format.
- css(query) SelectorList
A shortcut to
.selector.css().
- jmespath(query: str, **kwargs) SelectorList
A shortcut to
.selector.jmespath().
- property selector: Selector
Cached instance of
parsel.selector.Selector.
- xpath(query, **kwargs) SelectorList
A shortcut to
.selector.xpath().
- class web_poet.page_inputs.browser.BrowserResponse(url: str | _Url, html, *, status: int | None = None)[source]
Bases: SelectableMixin, UrlShortcutsMixin
Browser response: URL, HTML and status code.
url should be the browser’s window.location, not the URL of the request, if possible.
html contains the HTML returned by the browser, i.e. a snapshot of the DOM tree in HTML format.
The following is optional, since its availability depends on the source of the BrowserResponse:
status should be the integer status code of the HTTP response.
- url: ResponseUrl
- html: BrowserHtml
- property text: str
HTML returned by the browser, identical to self.html.
Provided for compatibility with HttpResponse.
- css(query) SelectorList
A shortcut to
.selector.css().
- jmespath(query: str, **kwargs) SelectorList
A shortcut to
.selector.jmespath().
- property selector: Selector
Cached instance of
parsel.selector.Selector.
- urljoin(url: str | RequestUrl | ResponseUrl) RequestUrl
Return url as an absolute URL.
If url is relative, it is made absolute relative to the base URL of self.
- xpath(query, **kwargs) SelectorList
A shortcut to
.selector.xpath().
- class web_poet.page_inputs.client.HttpClient(request_downloader: RequestDownloaderT | None = None, *, save_responses: bool = False, return_only_saved_responses: bool = False, responses: Iterable[_SavedResponseData] | None = None)[source]
Async HTTP client to be used in Page Objects.
See Additional requests for the usage information.
HttpClient doesn’t make HTTP requests by itself. It uses either the request function assigned to the web_poet.request_downloader_var contextvar, or a function passed via the request_downloader argument of the __init__() method.
Either way, this function should be an async def function which receives an HttpRequest instance, and either returns an HttpResponse instance or raises a subclass of HttpError. You can read more in the Providing the Downloader documentation.
- async request(url: str | _Url, *, method: str = 'GET', headers: dict[str, str] | HttpRequestHeaders | None = None, body: bytes | HttpRequestBody | None = None, allow_status: str | int | list[str | int] | None = None) HttpResponse[source]
This is a shortcut for creating an HttpRequest instance and executing that request.
HttpRequestError is raised for connection errors, connection and read timeouts, etc.
An HttpResponse instance is returned for successful responses in the 100-3xx status code range.
Otherwise, an exception of type HttpResponseError is raised.
Raising HttpResponseError can be suppressed for certain status codes using the allow_status param - it is a list of status code values for which HttpResponse should be returned instead of raising HttpResponseError.
There is a special “*” allow_status value which allows any status code.
There is no need to include 100-3xx status codes in allow_status, because HttpResponseError is not raised for them.
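The allow_status rule described above can be modelled in a few lines. This is only a sketch of the documented behaviour, not web-poet’s implementation; the helper name response_is_returned is hypothetical, and it assumes that status codes given as str and int are treated as equivalent.

```python
def response_is_returned(status, allow_status=None):
    """Return True if a response with this status would be returned, not raised.

    Hypothetical helper that models the rule in the text; not part of web-poet.
    """
    # Statuses below 400 (the 100-3xx range) never trigger an error.
    if status < 400:
        return True
    if allow_status is None:
        return False
    # allow_status may be a single value or a list; values may be str or int.
    values = allow_status if isinstance(allow_status, list) else [allow_status]
    allowed = {str(value) for value in values}
    # "*" is the special value that allows any status code.
    return "*" in allowed or str(status) in allowed


print(response_is_returned(404))                # False
print(response_is_returned(404, [404, "500"]))  # True
print(response_is_returned(503, "*"))           # True
```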
- async get(url: str | _Url, *, headers: dict[str, str] | HttpRequestHeaders | None = None, allow_status: str | int | list[str | int] | None = None) HttpResponse[source]
Similar to request() but performing a GET request.
- async post(url: str | _Url, *, headers: dict[str, str] | HttpRequestHeaders | None = None, body: bytes | HttpRequestBody | None = None, allow_status: str | int | list[str | int] | None = None) HttpResponse[source]
Similar to request() but performing a POST request.
- async execute(request: HttpRequest, *, allow_status: str | int | list[str | int] | None = None) HttpResponse[source]
Execute the specified HttpRequest instance using the request implementation configured in the HttpClient instance.
HttpRequestError is raised for connection errors, connection and read timeouts, etc.
An HttpResponse instance is returned for successful responses in the 100-3xx status code range.
Otherwise, an exception of type HttpResponseError is raised.
Raising HttpResponseError can be suppressed for certain status codes using the allow_status param - it is a list of status code values for which HttpResponse should be returned instead of raising HttpResponseError.
There is a special “*” allow_status value which allows any status code.
There is no need to include 100-3xx status codes in allow_status, because HttpResponseError is not raised for them.
- async batch_execute(*requests: HttpRequest, return_exceptions: bool = False, allow_status: str | int | list[str | int] | None = None) list[HttpResponse | HttpResponseError][source]
Similar to execute() but accepts a collection of HttpRequest instances that are executed as a batch.
The order of the returned HttpResponse instances corresponds to the order of the HttpRequest instances passed.
If any HttpRequest raises an exception upon execution, the exception is raised.
To prevent this, the actual exception can be returned alongside any successful HttpResponse. This enables salvaging any usable responses despite any possible failures. This can be done by setting the return_exceptions parameter to True.
Like execute(), HttpResponseError will be raised for responses with status codes in the 400-5xx range. The allow_status parameter can be used the same way here to prevent these exceptions from being raised.
You can omit allow_status="*" if you’re passing return_exceptions=True. However, failed responses are then returned as HttpResponseError instances instead of HttpResponse instances.
Lastly, an HttpRequestError may be raised in cases like connection errors, connection and read timeouts, etc.
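The ordering and return_exceptions behaviour described for batch_execute() resembles that of asyncio.gather(). The sketch below illustrates those semantics with stand-ins only: fake_execute, FakeResponseError and batch are hypothetical names, and nothing here claims that web-poet is implemented this way.

```python
import asyncio


class FakeResponseError(Exception):
    """Hypothetical stand-in for HttpResponseError."""


async def fake_execute(url):
    # Hypothetical downloader: fails for URLs containing "missing".
    await asyncio.sleep(0)
    if "missing" in url:
        raise FakeResponseError(url)
    return f"response for {url}"


async def batch(*urls, return_exceptions=False):
    # asyncio.gather keeps results in the order the awaitables were passed.
    return await asyncio.gather(
        *(fake_execute(url) for url in urls),
        return_exceptions=return_exceptions,
    )


results = asyncio.run(
    batch(
        "https://example.com/a",
        "https://example.com/missing",
        "https://example.com/b",
        return_exceptions=True,
    )
)
# The failure is returned in place, so the two usable responses are kept.
for result in results:
    print(type(result).__name__, result)
```

With return_exceptions left at False, the first failure would propagate out of batch() instead of being returned.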
- class web_poet.page_inputs.http.HttpRequestBody[source]
Bases: bytes
A container for holding the raw HTTP request body in bytes format.
- class web_poet.page_inputs.http.HttpResponseBody[source]
Bases: bytes
A container for holding the raw HTTP response body in bytes format.
- class web_poet.page_inputs.http.HttpRequestHeaders[source]
Bases: _HttpHeaders
A container for holding the HTTP request headers.
It can be instantiated from an iterable of tuples:
>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")]
>>> HttpRequestHeaders(pairs)
<HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
It also accepts a mapping of key-value pairs:
>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"}
>>> headers = HttpRequestHeaders(pairs)
>>> headers
<HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
Note that this also supports case-insensitive header-key lookups:
>>> headers.get("content-encoding")
'gzip'
>>> headers.get("Content-Length")
'648'
These are just a few of the functionalities it inherits from multidict.CIMultiDict. For more info on its other features, read the API spec of multidict.CIMultiDict.
- classmethod from_bytes_dict(arg: _AnyStrDict, encoding: str = 'utf-8') Self
An alternative constructor for instantiation where the header-value pairs could be in raw bytes form.
This supports multiple header values in the form of List[bytes] and Tuple[bytes] alongside a plain bytes value. A value in str also works and wouldn’t break the decoding process at all.
By default, it converts the bytes value using “utf-8”. However, this can easily be overridden using the encoding parameter.
>>> raw_values = {
...     b"Content-Encoding": [b"gzip", b"br"],
...     b"Content-Type": [b"text/html"],
...     b"content-length": b"648",
... }
>>> headers = _HttpHeaders.from_bytes_dict(raw_values)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'Content-Encoding': 'br', 'Content-Type': 'text/html', 'content-length': '648')>
- classmethod from_name_value_pairs(arg: list[dict]) Self
An alternative constructor for instantiation using a List[Dict] where the ‘key’ is the header name while the ‘value’ is the header value.
>>> pairs = [
...     {"name": "Content-Encoding", "value": "gzip"},
...     {"name": "content-length", "value": "648"}
... ]
>>> headers = _HttpHeaders.from_name_value_pairs(pairs)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
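To make the from_bytes_dict() decoding rules concrete, here is a rough standard-library sketch of how bytes, str, list and tuple values could be flattened into (name, value) string pairs. The function decode_header_pairs is hypothetical and only models the behaviour described above; the real classmethod returns a headers object, not a list.

```python
def decode_header_pairs(raw, encoding="utf-8"):
    """Flatten a raw header mapping into a list of (name, value) str pairs.

    Hypothetical helper for illustration; not part of web-poet.
    """

    def to_str(value):
        # str values pass through untouched; bytes are decoded.
        return value.decode(encoding) if isinstance(value, bytes) else value

    pairs = []
    for name, value in raw.items():
        # A header may carry one value or several (list or tuple).
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            pairs.append((to_str(name), to_str(item)))
    return pairs


raw_values = {
    b"Content-Encoding": [b"gzip", b"br"],
    b"Content-Type": [b"text/html"],
    b"content-length": b"648",
}
pairs = decode_header_pairs(raw_values)
print(pairs)
```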
- class web_poet.page_inputs.http.HttpResponseHeaders[source]
Bases: _HttpHeaders
A container for holding the HTTP response headers.
It can be instantiated from an iterable of tuples:
>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")]
>>> HttpResponseHeaders(pairs)
<HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
It also accepts a mapping of key-value pairs:
>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"}
>>> headers = HttpResponseHeaders(pairs)
>>> headers
<HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
Note that this also supports case-insensitive header-key lookups:
>>> headers.get("content-encoding")
'gzip'
>>> headers.get("Content-Length")
'648'
These are just a few of the functionalities it inherits from multidict.CIMultiDict. For more info on its other features, read the API spec of multidict.CIMultiDict.
- declared_encoding() str | None[source]
Return the encoding detected from the Content-Type header, or None if no encoding is found.
- classmethod from_bytes_dict(arg: _AnyStrDict, encoding: str = 'utf-8') Self
An alternative constructor for instantiation where the header-value pairs could be in raw bytes form.
This supports multiple header values in the form of List[bytes] and Tuple[bytes] alongside a plain bytes value. A value in str also works and wouldn’t break the decoding process at all.
By default, it converts the bytes value using “utf-8”. However, this can easily be overridden using the encoding parameter.
>>> raw_values = {
...     b"Content-Encoding": [b"gzip", b"br"],
...     b"Content-Type": [b"text/html"],
...     b"content-length": b"648",
... }
>>> headers = _HttpHeaders.from_bytes_dict(raw_values)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'Content-Encoding': 'br', 'Content-Type': 'text/html', 'content-length': '648')>
- classmethod from_name_value_pairs(arg: list[dict]) Self
An alternative constructor for instantiation using a List[Dict] where the ‘key’ is the header name while the ‘value’ is the header value.
>>> pairs = [
...     {"name": "Content-Encoding", "value": "gzip"},
...     {"name": "content-length", "value": "648"}
... ]
>>> headers = _HttpHeaders.from_name_value_pairs(pairs)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
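As a rough illustration of what “encoding detected from the Content-Type header” means for declared_encoding(), the standard library can extract the charset parameter of a Content-Type value. The helper declared_charset below is hypothetical, and web-poet’s own detection may normalise encoding names differently.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the lowercased charset from a Content-Type value, or None.

    Hypothetical helper built on the stdlib; not web-poet's implementation.
    """
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return message.get_content_charset()


print(declared_charset("text/html; charset=UTF-8"))  # utf-8
print(declared_charset("text/html"))                 # None
```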
- class web_poet.page_inputs.http.HttpRequest(url: str | _Url, *, method: str = 'GET', headers=NOTHING, body=NOTHING)[source]
Bases: object
Represents a generic HTTP request used by other functionalities in web-poet like HttpClient.
Tip
To build a request to submit an HTML form, use the form2request library, which provides integration with web-poet.
- url: RequestUrl
- headers: HttpRequestHeaders
- body: HttpRequestBody
- class web_poet.page_inputs.http.HttpResponse(url: str | _Url, body, *, status: int | None = None, headers=NOTHING, encoding: str | None = None)[source]
Bases: SelectableMixin, UrlShortcutsMixin
A container for the contents of a response, downloaded directly using an HTTP client.
url should be the URL of the response (after all redirects), not the URL of the request, if possible.
body contains the raw HTTP response body.
The following are optional, since their availability depends on the source of the HttpResponse. For example, the response could simply come from a local HTML file, which doesn’t contain headers and status.
status should be the integer status code of the HTTP response.
headers should contain the HTTP response headers.
encoding is the encoding of the response. If None (default), the encoding is auto-detected from headers and body content.
- url: ResponseUrl
- body: HttpResponseBody
- headers: HttpResponseHeaders
- property text: str
Content of the HTTP body, converted to unicode using the detected encoding of the response, according to the web browser rules (respecting Content-Type header, etc.)
- css(query) SelectorList
A shortcut to
.selector.css().
- jmespath(query: str, **kwargs) SelectorList
A shortcut to
.selector.jmespath().
- property selector: Selector
Cached instance of
parsel.selector.Selector.
- urljoin(url: str | RequestUrl | ResponseUrl) RequestUrl
Return url as an absolute URL.
If url is relative, it is made absolute relative to the base URL of self.
- xpath(query, **kwargs) SelectorList
A shortcut to
.selector.xpath().
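The urljoin() shortcut above resolves a possibly relative URL against the base URL of the response. The standard-library function urllib.parse.urljoin shows the same general idea; this sketch uses a hypothetical base URL and does not model details of web-poet’s own method, such as its RequestUrl return type or how the base URL of a response is determined.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page that the links were found on.
base = "https://example.com/shop/page.html"

# A relative reference is resolved against the base URL.
relative = urljoin(base, "item?id=1")
# ".." climbs one directory before descending again.
parent = urljoin(base, "../img/a.png")
# An absolute URL is returned unchanged.
absolute = urljoin(base, "https://other.example/x")

print(relative)  # https://example.com/shop/item?id=1
print(parent)    # https://example.com/img/a.png
print(absolute)  # https://other.example/x
```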
- web_poet.page_inputs.http.request_fingerprint(req: HttpRequest) str[source]
Return the fingerprint of the request.
- class web_poet.page_inputs.response.AnyResponse(response: BrowserResponse | HttpResponse)[source]
Bases: SelectableMixin, UrlShortcutsMixin
A container that holds either BrowserResponse or HttpResponse.
- response: BrowserResponse | HttpResponse
- property url: ResponseUrl
URL of the response.
- css(query) SelectorList
A shortcut to
.selector.css().
- jmespath(query: str, **kwargs) SelectorList
A shortcut to
.selector.jmespath().
- property selector: Selector
Cached instance of
parsel.selector.Selector.
- urljoin(url: str | RequestUrl | ResponseUrl) RequestUrl
Return url as an absolute URL.
If url is relative, it is made absolute relative to the base URL of self.
- xpath(query, **kwargs) SelectorList
A shortcut to
.selector.xpath().
- class web_poet.page_inputs.page_params.PageParams[source]
Bases: dict[_KT, _VT]
Container class that can hold arbitrary data to be passed into a Page Object.
Note that this is simply a subclass of Python’s dict.
- class web_poet.page_inputs.stats.StatCollector[source]
Bases: ABC
Base class for web-poet to implement the storing of data written through Stats.
- class web_poet.page_inputs.stats.DummyStatCollector[source]
Bases: StatCollector
StatCollector implementation that does not persist stats. It is used when running automatic tests, where stat storage is not necessary.
- class web_poet.page_inputs.stats.DictStatCollector[source]
Bases: DummyStatCollector
Simple StatCollector implementation that stores stats in a dict accessible through the data property.
- class web_poet.page_inputs.stats.Stats(stat_collector: StatCollector | None = None)[source]
Bases: object
Page input class to write key-value data pairs during parsing that you can inspect later. See Stats.
Stats can be set to a fixed value or, if numeric, incremented.
Stats are write-only.
Storage and read access of stats depends on the web-poet framework that you are using. Check the documentation of your web-poet framework to find out if it supports stats, and if so, how to read stored stats.
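The split between a write-only Stats object and a collector that owns storage can be sketched with two small stand-in classes. Everything here is hypothetical: the class names InMemoryCollector and WriteOnlyStats, and the method names set and inc, are assumptions chosen to mirror “set to a fixed value or, if numeric, incremented”, not web-poet’s confirmed API.

```python
class InMemoryCollector:
    """Hypothetical collector that stores stats in a dict."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def inc(self, key, value=1):
        # Missing keys start from zero before being incremented.
        self.data[key] = self.data.get(key, 0) + value


class WriteOnlyStats:
    """Hypothetical page input: it forwards writes and exposes no read API."""

    def __init__(self, collector):
        self._collector = collector

    def set(self, key, value):
        self._collector.set(key, value)

    def inc(self, key, value=1):
        self._collector.inc(key, value)


collector = InMemoryCollector()
stats = WriteOnlyStats(collector)
stats.set("layout", "v2")
stats.inc("items_parsed")
stats.inc("items_parsed", 2)

# Reading happens on the framework side, through the collector.
print(collector.data)  # {'layout': 'v2', 'items_parsed': 3}
```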
Pages
- class web_poet.pages.Injectable[source]
Bases: ABC, FieldsMixin
Base Page Object class, which all Page Objects should inherit from (probably through Injectable subclasses).
Frameworks which are using web-poet Page Objects should use the is_injectable() function to detect if an object is an Injectable, and if an object is injectable, allow building it automatically through dependency injection, using the https://github.com/scrapinghub/andi library.
Instead of inheriting you can also use Injectable.register(MyWebPage). Injectable.register can also be used as a decorator.
- web_poet.pages.is_injectable(cls: Any) bool[source]
Return True if cls is a class which inherits from Injectable.
- class web_poet.pages.ItemPage[source]
Bases: Extractor[ItemT], Injectable
Base class for page objects.
- class web_poet.pages.WebPage(response: HttpResponse)[source]
Bases: ItemPage[ItemT], ResponseShortcutsMixin
Base Page Object which requires HttpResponse and provides XPath / CSS shortcuts.
- response: HttpResponse
- css(query) SelectorList
A shortcut to
.selector.css().
- jmespath(query: str, **kwargs) SelectorList
A shortcut to
.selector.jmespath().
- property selector: Selector
Cached instance of
parsel.selector.Selector.
- async to_item() ItemT
Extract an item from a web page
- xpath(query, **kwargs) SelectorList
A shortcut to
.selector.xpath().
- class web_poet.pages.BrowserPage(response: BrowserResponse)[source]
Bases: ItemPage[ItemT], ResponseShortcutsMixin
Base Page Object which requires BrowserResponse and provides XPath / CSS shortcuts.
- response: BrowserResponse
- css(query) SelectorList
A shortcut to
.selector.css().
- jmespath(query: str, **kwargs) SelectorList
A shortcut to
.selector.jmespath().
- property selector: Selector
Cached instance of
parsel.selector.Selector.
- async to_item() ItemT
Extract an item from a web page
- xpath(query, **kwargs) SelectorList
A shortcut to
.selector.xpath().
- class web_poet.pages.Returns[source]
Bases: Generic[ItemT]
Inherit from this generic mixin to change the item class used by ItemPage.
- class web_poet.pages.Extractor[source]
Bases: Returns[ItemT], FieldsMixin
Base class for field support.
Mixins
- class web_poet.mixins.ResponseShortcutsMixin[source]
Common shortcut methods for working with HTML responses. This mixin could be used with Page Object base classes.
It requires “response” attribute to be present.
- urljoin(url: str) str[source]
Convert url to an absolute URL, taking into account the URL and base URL of the response.
- css(query) SelectorList
A shortcut to
.selector.css().
- jmespath(query: str, **kwargs) SelectorList
A shortcut to
.selector.jmespath().
- property selector: Selector
Cached instance of
parsel.selector.Selector.
- xpath(query, **kwargs) SelectorList
A shortcut to
.selector.xpath().
- class web_poet.mixins.SelectableMixin[source]
Inherit from this mixin and implement the ._selector_input method to get the .selector property and the .xpath / .css / .jmespath methods.
- property selector: Selector
Cached instance of
parsel.selector.Selector.
- css(query) SelectorList
A shortcut to
.selector.css().
- jmespath(query: str, **kwargs) SelectorList
A shortcut to
.selector.jmespath().
- xpath(query, **kwargs) SelectorList
A shortcut to
.selector.xpath().
Requests
- web_poet.requests.RequestDownloaderT
Frameworks that want to support additional requests in web-poet should set the appropriate implementation of request_downloader_var for requesting data.
alias of Callable[[HttpRequest], Awaitable[HttpResponse]]
Exceptions
Core Exceptions
These exceptions are tied to how web-poet operates.
- exception web_poet.exceptions.core.NoSavedHttpResponse(msg: str | None = None, request: HttpRequest | None = None)[source]
Indicates that there is no saved response for this request.
Can only be raised when an HttpClient instance is used to get saved responses.
- Parameters:
request (HttpRequest) – The HttpRequest instance that was used.
- exception web_poet.exceptions.core.PageObjectAction[source]
Base class for exceptions that can be raised from a page object to indicate something to be done about that page object.
- exception web_poet.exceptions.core.RequestDownloaderVarError[source]
The contents of web_poet.request_downloader_var were accessed, but no value had been set by the time requests were executed.
See the documentation section about setting up the contextvars to learn more about this.
- exception web_poet.exceptions.core.Retry(message: str | None = None, max_retries: int | None = None)[source]
The page object found that the input data is partial or empty, and a request retry may provide better input.
message is the reason for the retry.
max_retries is the desired maximum retries. If not specified, the framework defaults are used instead.
- exception web_poet.exceptions.core.UseFallback[source]
The page object cannot extract data from the input, but the input seems valid, so an alternative data extraction implementation for the same item type may succeed.
HTTP Exceptions
These are exceptions pertaining to common issues faced when executing HTTP operations.
- exception web_poet.exceptions.http.HttpError(msg: str | None = None, request: HttpRequest | None = None)[source]
Bases: OSError
Indicates that an exception has occurred when handling an HTTP operation.
This is used as a base class for more specific errors and could be vague since it could denote problems either in the HTTP Request or Response.
For more specific errors, it would be better to use HttpRequestError and HttpResponseError.
- Parameters:
request (HttpRequest) – Request that triggered the exception.
- request: HttpRequest | None
Request that triggered the exception.
- exception web_poet.exceptions.http.HttpRequestError(msg: str | None = None, request: HttpRequest | None = None)[source]
Bases: HttpError
Indicates that an exception has occurred when the HTTP Request was being handled.
- Parameters:
request (HttpRequest) – The HttpRequest instance that was used.
- exception web_poet.exceptions.http.HttpResponseError(msg: str | None = None, response: HttpResponse | None = None, request: HttpRequest | None = None)[source]
Bases: HttpError
Indicates that an exception has occurred when the HTTP Response was received.
For responses in the 100-3xx status code range, this exception shouldn’t be raised at all. However, for responses in the 400-5xx range, it will be raised by web-poet.
Note
Frameworks implementing web-poet should NOT raise this exception.
This exception is raised by web-poet itself, based on the allow_status parameter found in the methods of HttpClient.
- Parameters:
request (HttpRequest) – Request that got the response that triggered the exception.
response (HttpResponse) – Response that triggered the exception.
- response: HttpResponse | None
Response that triggered the exception.
Apply Rules
See Rules for more context about its use cases and some examples.
- web_poet.default_registry
Default RulesRegistry.
- web_poet.handle_urls()
handle_urls() of the default_registry.
- class web_poet.rules.ApplyRule(for_patterns: str | Patterns, *, use: type[ItemPage], instead_of: type[ItemPage] | None = None, to_return: type[Any] | None = None, meta: dict[str, Any] = NOTHING)[source]
A rule that primarily applies Page Object and Item overrides for a given URL pattern.
This is instantiated when using the web_poet.handle_urls() decorator. It’s also returned as a List[ApplyRule] when calling the web_poet.default_registry’s get_rules() method.
You can access any of its attributes:
  - for_patterns - contains the list of URL patterns associated with this rule. You can read the API documentation of the url-matcher package for more information about the patterns.
  - use - The Page Object that will be used in cases where the URL pattern represented by the for_patterns attribute is matched.
  - instead_of - (optional) The Page Object that will be replaced with the Page Object specified via the use parameter.
  - to_return - (optional) The item class that the Page Object specified in use is capable of returning.
  - meta - (optional) Any other information you may want to store. This doesn’t do anything for now but may be useful for future API updates.
The main functionality of this class lies in the instead_of and to_return parameters. Should both of these be omitted, then ApplyRule simply tags which URL patterns the given Page Object defined in use is expected to be used on.
When to_return is not None (e.g. to_return=MyItem), the Page Object in use is declared as capable of returning a certain item class (i.e. MyItem).
When instead_of is not None (e.g. instead_of=ReplacedPageObject), the rule adds an expectation that the ReplacedPageObject wouldn’t be used for the URLs matching for_patterns, since the Page Object in use will replace it.
If there are multiple rules which match a certain URL, the rule to apply is picked based on the priorities set in for_patterns.
More information regarding its usage in Rules.
Tip
The ApplyRule is also hashable. This makes it easy to store unique rules and identify any duplicates.
- class web_poet.rules.RulesRegistry(*, rules: Iterable[ApplyRule] | None = None)[source]
RulesRegistry provides features for storing, retrieving, and searching for the ApplyRule instances.
web-poet provides a default Registry named default_registry for convenience. It can be accessed this way:
    from web_poet import handle_urls, default_registry, WebPage
    from my_items import Product

    @handle_urls("example.com")
    class ExampleComProductPage(WebPage[Product]):
        ...

    rules = default_registry.get_rules()
The @handle_urls decorator exposed as web_poet.handle_urls is a shortcut for default_registry.handle_urls.
Note
It is encouraged to use the web_poet.default_registry instead of creating your own RulesRegistry instance. Using multiple registries would be unwieldy in most cases.
However, it might be applicable in certain scenarios, like storing custom rules to separate them from the default_registry.
- add_rule(rule: ApplyRule) None[source]
Registers a web_poet.rules.ApplyRule instance.
- handle_urls(include: str | Iterable[str], *, instead_of: type[ItemPage] | None = None, to_return: type | None = None, exclude: str | Iterable[str] | None = None, priority: int = 500, **kwargs)[source]
Class decorator that indicates that the decorated Page Object should work for the given URL patterns.
The URL patterns are matched using the include and exclude parameters while priority breaks any ties. See the documentation of the url-matcher package for more information about them.
This decorator is able to derive the item class returned by the Page Object. This is important since it marks what type of item the Page Object is capable of returning for the given URL patterns. For certain advanced cases, you can pass a to_return parameter which replaces any derived values (though this isn’t generally recommended).
Passing another Page Object into the instead_of parameter indicates that the decorated Page Object will be used instead of that for the given set of URL patterns. See Rule precedence.
Any extra parameters are stored as meta information that can be later used.
- Parameters:
include – The URLs that should be handled by the decorated Page Object.
instead_of – The Page Object that should be replaced.
to_return – The item class holding the data returned by the Page Object. This could be omitted as it could be derived from the Returns[ItemClass] or ItemPage[ItemClass] declaration of the Page Object. See Items section.
exclude – The URLs for which the Page Object should not be applied.
priority – The resolution priority in case of conflicting rules. A conflict happens when the include, override, and exclude parameters are the same. If so, the highest priority will be chosen.
- get_rules() list[ApplyRule][source]
Return all the ApplyRule instances that were declared using the @handle_urls decorator.
Note
Remember to consider calling consume_modules() beforehand to recursively import all submodules which contain the @handle_urls decorators from external Page Objects.
- search(**kwargs: Any) list[ApplyRule][source]
Return any ApplyRule from the registry that matches all the provided attributes.
Sample usage:
    rules = registry.search(use=ProductPO, instead_of=GenericPO)
    print(len(rules))  # 1
    print(rules[0].use)  # ProductPO
    print(rules[0].instead_of)  # GenericPO
- overrides_for(url: _Url | str) Mapping[type[ItemPage], type[ItemPage]][source]
Finds all of the page objects associated with the given URL and returns a Mapping where the ‘key’ represents the page object that is overridden by the page object in ‘value’.
- page_cls_for_item(url: _Url | str, item_cls: type) type | None[source]
Return the page object class associated with the given URL that’s able to produce the given item_cls.
- top_rules_for_item(url: _Url | str, item_cls: type) Generator[ApplyRule][source]
Iterates the top rules that apply for url and item_cls.
If multiple rules score the same, multiple rules are iterated. This may be useful, for example, if you want to apply some custom logic to choose between rules that otherwise have the same score. For example:
    from web_poet import default_registry

    def browser_page_cls_for_item(url, item_cls):
        fallback = None
        for rule in default_registry.top_rules_for_item(url, item_cls):
            if rule.meta.get("browser", False):
                return rule.use
            if not fallback:
                fallback = rule.use
        if not fallback:
            raise ValueError(f"No rule found for URL {url!r} and item class {item_cls}")
        return fallback
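The registry pattern documented for RulesRegistry (a decorator that records a hashable rule, plus attribute-based search) can be sketched without web-poet. MiniRule and MiniRegistry below are hypothetical stand-ins that use plain string patterns instead of url-matcher Patterns; they illustrate the idea and are not the library’s implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen dataclasses are hashable, like ApplyRule
class MiniRule:
    pattern: str
    use: type
    instead_of: Optional[type] = None
    priority: int = 500


class MiniRegistry:
    def __init__(self):
        # dict keys keep insertion order and drop duplicate rules.
        self._rules = {}

    def handle_urls(self, pattern, *, instead_of=None, priority=500):
        def decorator(cls):
            self._rules[MiniRule(pattern, cls, instead_of, priority)] = None
            return cls  # the decorated class itself is left unchanged

        return decorator

    def get_rules(self):
        return list(self._rules)

    def search(self, **attrs):
        # Keep rules whose attributes equal every provided keyword.
        return [
            rule
            for rule in self._rules
            if all(getattr(rule, name) == value for name, value in attrs.items())
        ]


registry = MiniRegistry()


class GenericPage:
    pass


@registry.handle_urls("example.com", instead_of=GenericPage)
class ProductPage:
    pass


found = registry.search(use=ProductPage, instead_of=GenericPage)
print(len(found), found[0].use.__name__)  # 1 ProductPage
```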
- web_poet.rules.consume_modules(*modules: str) None[source]
This recursively imports all packages/modules so that the @handle_urls decorators are properly discovered and imported.
Let’s take a look at an example:
    # FILE: my_page_obj_project/load_rules.py
    from web_poet import default_registry, consume_modules

    consume_modules("other_external_pkg.po", "another_pkg.lib")
    rules = default_registry.get_rules()
For this case, the ApplyRule instances are coming from:
  - my_page_obj_project (since it’s the same module as the file above)
  - other_external_pkg.po
  - another_pkg.lib
  - any other modules that were imported in the same process inside the packages/modules above.
If the default_registry had other @handle_urls decorators outside of the packages/modules listed above, then the corresponding ApplyRule instances won’t be returned, unless they were recursively imported in some way similar to consume_modules().
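Recursive importing matters because a decorator only runs once its module has been imported. One way to import every submodule of a package with the standard library is sketched below; import_recursively is a hypothetical helper, demonstrated on the stdlib package email.mime, and it is not a description of how consume_modules() is implemented.

```python
import importlib
import pkgutil


def import_recursively(package_name):
    """Import a package and all of its submodules; return the imported names.

    Hypothetical helper for illustration; not part of web-poet.
    """
    package = importlib.import_module(package_name)
    imported = [package.__name__]
    # Only packages have __path__; a plain module has no submodules to walk.
    search_path = getattr(package, "__path__", [])
    for info in pkgutil.walk_packages(search_path, package.__name__ + "."):
        # Importing the module executes its top-level code, decorators included.
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported


names = import_recursively("email.mime")
print(names)
```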
Fields
web_poet.fields is a module with helpers for putting extraction logic
into separate Page Object methods / properties.
- class web_poet.fields.FieldInfo(name: str, meta: dict | None = None, out: list[Callable] | None = None)[source]
Information about a field
- web_poet.fields.field(method=None, *, cached: bool = False, meta: dict | None = None, out: list[Callable] | None = None)[source]
A Page Object method decorated with the @field decorator becomes a property, which is then used by ItemPage’s to_item() method to populate a corresponding item attribute.
By default, the value is computed on each property access. Use @field(cached=True) to cache the property value.
The meta parameter allows storing arbitrary information for the field, e.g. @field(meta={"expensive": True}). This information can be later retrieved for all fields using the get_fields_dict() function.
The out parameter is an optional list of field processors, which are functions applied to the value of the field before returning it.
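The difference between the default behaviour (recomputed on every access) and cached=True (computed once) is the same as the difference between property and functools.cached_property in the standard library. The sketch below uses those two stand-ins on a hypothetical Page class; it does not use or reproduce the @field decorator itself.

```python
import functools


class Page:
    """Hypothetical class that counts how often each value is computed."""

    def __init__(self):
        self.calls = {"plain": 0, "cached": 0}

    @property
    def plain(self):
        # Comparable to @field: the body runs on every access.
        self.calls["plain"] += 1
        return "value"

    @functools.cached_property
    def cached(self):
        # Comparable to @field(cached=True): the body runs once per instance.
        self.calls["cached"] += 1
        return "value"


page = Page()
page.plain, page.plain
page.cached, page.cached
print(page.calls)  # {'plain': 2, 'cached': 1}
```

Caching is mostly worthwhile for fields that are expensive to compute or that other fields read repeatedly.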
- web_poet.fields.get_fields_dict(cls_or_instance) dict[str, FieldInfo][source]
Return a dictionary with information about the fields defined for the class: keys are field names, and values are web_poet.fields.FieldInfo instances.
- async web_poet.fields.item_from_fields(obj, item_cls: type[~web_poet.fields.T] = <class 'dict'>, *, skip_nonitem_fields: bool = False) T[source]
Return an item of item_cls type, with its attributes populated from the obj methods decorated with the field decorator.
If skip_nonitem_fields is True, @fields whose names are not among item_cls field names are not passed to item_cls.__init__.
When skip_nonitem_fields is False (default), all @fields are passed to item_cls.__init__, possibly causing exceptions if item_cls.__init__ doesn’t support them.
- web_poet.fields.item_from_fields_sync(obj, item_cls: type[~web_poet.fields.T] = <class 'dict'>, *, skip_nonitem_fields: bool = False) T[source]
Synchronous version of item_from_fields().
Annotation support
- web_poet.annotation_encode(obj: Any) Any[source]
Encodes obj for Annotated.
Annotated params must be hashable. This function converts dicts and lists into hashable alternatives (tuples and frozensets).
For example:
    foo = Annotated[Bar, annotation_encode({"a": [1, 2, 3]})]
obj must not contain tuples or frozensets, or unhashable data besides dicts and lists.
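To show why such an encoding yields hashable annotation parameters, here is one possible conversion. The function to_hashable is hypothetical, and the mapping it uses (dicts to frozensets of key-value pairs, lists to tuples) is an assumption; the exact structure produced by annotation_encode() may differ.

```python
from typing import Annotated


def to_hashable(obj):
    """Recursively convert dicts and lists into hashable equivalents.

    Hypothetical helper; the real annotation_encode() may encode differently.
    """
    if isinstance(obj, dict):
        return frozenset((key, to_hashable(value)) for key, value in obj.items())
    if isinstance(obj, list):
        return tuple(to_hashable(item) for item in obj)
    return obj


class Bar:
    pass


encoded = to_hashable({"a": [1, 2, 3]})
hash(encoded)  # works; hashing the original dict would raise TypeError

# Annotated is subscripted, with the metadata after the type.
Foo = Annotated[Bar, encoded]
print(Foo.__metadata__)
```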
- web_poet.annotation_decode(obj: Any) Any[source]
Converts a result of annotation_encode() back to its original form.
- class web_poet.AnnotatedInstance(result: Any, metadata: tuple[Any, ...])[source]
Wrapper for instances of annotated dependencies.
It is used when both the dependency value and the dependency annotation are needed.
- Parameters:
result (Any) – The wrapped dependency instance.
metadata (Tuple[Any, ...]) – The copy of the annotation.
Utils
- web_poet.utils.get_fq_class_name(cls: type) str[source]
Return the fully qualified name for a type.
>>> from web_poet import Injectable
>>> get_fq_class_name(Injectable)
'web_poet.pages.Injectable'
>>> from decimal import Decimal
>>> get_fq_class_name(Decimal)
'decimal.Decimal'
- web_poet.utils.memoizemethod_noargs(method: CallableT) CallableT[source]
Decorator to cache the result of a method (without arguments) using a weak reference to its object.
It is faster than cached_method(), and doesn’t add new attributes to the instance, but it doesn’t work if objects are unhashable.
- web_poet.utils.cached_method(method: CallableT) CallableT[source]
A decorator to cache method or coroutine method results, so that if it’s called multiple times for the same instance, computation is only done once.
The cache is unbounded, but it’s tied to the instance lifetime.
Note
cached_method() is needed because functools.lru_cache() doesn’t work well on methods: self is used as a cache key, so a reference to an instance is kept in the cache, and this prevents deallocation of instances.
This decorator adds a new private attribute to the instance named _cached_method_{decorated_method_name}; make sure the class doesn’t define an attribute of the same name.
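The idea of tying the cache to the instance, rather than keeping the instance alive inside a shared cache, can be sketched as follows. cache_on_instance is a hypothetical, simplified decorator that handles only synchronous methods without arguments; it borrows the attribute naming scheme from the note above but is not web-poet’s cached_method().

```python
import functools


def cache_on_instance(method):
    """Cache a no-argument method's result in a private instance attribute.

    Hypothetical, simplified stand-in; not web-poet's cached_method().
    """
    attr = f"_cached_method_{method.__name__}"

    @functools.wraps(method)
    def wrapper(self):
        # The result lives on the instance, so it is freed together with it.
        if not hasattr(self, attr):
            setattr(self, attr, method(self))
        return getattr(self, attr)

    return wrapper


class Report:
    def __init__(self):
        self.runs = 0

    @cache_on_instance
    def total(self):
        self.runs += 1
        return 42


report = Report()
first = report.total()
second = report.total()
print(first, second, report.runs)  # 42 42 1
```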
- web_poet.utils.as_list(value: Any) list[Any][source]
Normalizes the value input as a list.
>>> as_list(None)
[]
>>> as_list("foo")
['foo']
>>> as_list(123)
[123]
>>> as_list(["foo", "bar", 123])
['foo', 'bar', 123]
>>> as_list(("foo", "bar", 123))
['foo', 'bar', 123]
>>> as_list(range(5))
[0, 1, 2, 3, 4]
>>> def gen():
...     yield 1
...     yield 2
>>> as_list(gen())
[1, 2]