Web

WebDocument

class datahtml.web.WebDocument(url: str, *, html_txt: str, is_root=False)

The main object of the library. It represents an HTML document, which may be either the root page of a site or a subpage.

__init__(url: str, *, html_txt: str, is_root=False)

Parameters:

classmethod parse(url: str, *, crawler: CrawlerSpec, is_root=False) → WebDocument: It crawls and parses the given url.

Deprecated since version 0.3.0: Use datahtml.web.download()

meta_og(keys=['og:url', 'og:image', 'og:description', 'og:type', 'og:locale', 'og:title']) → Dict[str, str]
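The signature above suggests that meta_og returns the page's Open Graph tags as a mapping from key to content. The following is a minimal standalone sketch of that idea using only the standard library; it is not datahtml's implementation, and the names extract_og, _OGParser and DEFAULT_KEYS are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Same default keys as the meta_og signature; the constant name is hypothetical.
DEFAULT_KEYS = ["og:url", "og:image", "og:description",
                "og:type", "og:locale", "og:title"]


class _OGParser(HTMLParser):
    """Collects <meta property="..." content="..."> pairs (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property")
        content = attr.get("content")
        # Keep the first occurrence of each property.
        if prop and content is not None and prop not in self.found:
            self.found[prop] = content


def extract_og(html_txt: str, keys: Optional[List[str]] = None) -> Dict[str, str]:
    """Return the requested Open Graph keys that are present in html_txt."""
    wanted = DEFAULT_KEYS if keys is None else keys
    parser = _OGParser()
    parser.feed(html_txt)
    return {k: parser.found[k] for k in wanted if k in parser.found}


sample = (
    '<html><head>'
    '<meta property="og:title" content="Hello">'
    '<meta property="og:type" content="article">'
    '</head></html>'
)
og = extract_og(sample)
```

In this sketch, keys missing from the page are simply omitted from the result; whether meta_og does the same is not stated here.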

datahtml.web.download(url: str, *, crawler: CrawlerSpec, is_root=True, raise_when_not_200=True) → WebDocument

It crawls the url passed.

Parameters:

Returns:

The crawled page as a WebDocument object.

Return type:

WebDocument

datahtml.web.build_sitemap(url: str, *, crawler: CrawlerSpec, filter_dt: int = 1) → List[SitemapLink]

It tries to get the sitemaps of the site based on the robots.txt protocol. After finding the sitemap links, it crawls each one.

Parameters:

url (str) – Base url of the site
crawler (CrawlerSpec) – A crawler based on the CrawlerSpec
filter_dt (int) – Some sites, such as media sites, have a large number of sitemaps and links; filter_dt helps to filter out old content.

Returns:

A list of links extracted from the sitemaps.

Return type:

List[sitemap.SitemapLink]
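The two steps described for build_sitemap (find sitemap urls in robots.txt, then read the links from each sitemap) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the library's code: the helper names sitemaps_from_robots and links_from_sitemap are hypothetical, the input is already-downloaded text instead of a CrawlerSpec fetch, and date filtering is left out because the unit of filter_dt is not stated here.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the urls declared in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the url keeps its 'https:' part.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def links_from_sitemap(xml_txt: str) -> List[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_txt)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


robots = (
    "User-agent: *\n"
    "Disallow: /admin\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)
sitemap_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc>https://example.com/b</loc></url>"
    "</urlset>"
)
sitemap_urls = sitemaps_from_robots(robots)
page_links = links_from_sitemap(sitemap_xml)
```

The same links_from_sitemap helper also reads a sitemap index file, since index entries use <loc> elements in the same namespace.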

datahtml.web.find_rss_links(url: str, *, crawler: CrawlerSpec, web: WebDocument | None = None) → List[RSSLink]

It scrapes the url, looking for links related to RSS feeds. If it finds any, it then tries to fetch the feed from each of those urls.

Parameters:

url (str) – Base url to crawl; it should be the root url.
crawler (CrawlerSpec) – The CrawlerSpec implementation to use.
web (WebDocument | None) – Optional. If a WebDocument object is passed, the site is not crawled again.

Returns:

A list of RSS links, already parsed.

Return type:

List[rss.RSSLink]
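The discovery step of find_rss_links, looking for feed links in a page, is commonly done by scanning for <link rel="alternate"> tags with a feed MIME type. The sketch below shows that convention with the standard library; it is an assumption about the general technique, not a description of how datahtml does it, and discover_feed_links and _FeedLinkParser are hypothetical names. Fetching and parsing the feeds themselves is not covered.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# MIME types conventionally used to advertise feeds in <link> tags.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class _FeedLinkParser(HTMLParser):
    """Collects hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        type_ = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and type_ in FEED_TYPES and href:
            # Resolve relative hrefs against the page url.
            self.links.append(urljoin(self.base_url, href))


def discover_feed_links(base_url: str, html_txt: str) -> List[str]:
    """Return absolute feed urls advertised in the page's <link> tags."""
    parser = _FeedLinkParser(base_url)
    parser.feed(html_txt)
    return parser.links


page = (
    "<head>"
    '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
    '<link rel="stylesheet" href="/style.css">'
    "</head>"
)
feeds = discover_feed_links("https://example.com/", page)
```

Working from html text that is already in hand mirrors the optional web argument above, which avoids crawling the site a second time.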