Web

WebDocument

class datahtml.web.WebDocument(url: str, *, html_txt: str, is_root=False)

The main object of the library. It represents an HTML document, which can be either the root page of a site or a subpage.

__init__(url: str, *, html_txt: str, is_root=False)
Parameters:
  • url (str) – URL the document belongs to.

  • html_txt (str) – HTML text of the document.

  • is_root (bool) – whether the document is the root page of the site or a subpage.

property html: str
classmethod parse(url: str, *, crawler: CrawlerSpec, is_root=False) WebDocument

Crawls and parses the given URL.

Deprecated since version 0.3.0: Use datahtml.web.download() instead.

social_urls() List[URL]
images() List[Image]
ld_json() Dict[str, Any]
meta_og(keys=['og:url', 'og:image', 'og:description', 'og:type', 'og:locale', 'og:title']) Dict[str, str]
article() ArticleData
keywords() str | None
metas() List[MetaTag]
get_locale() str | None
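
As an illustration of the kind of mapping meta_og() is documented to return (Dict[str, str] keyed by Open Graph property), here is a minimal standard-library sketch. It is not the library's implementation: the helper name extract_meta_og and the parser class are hypothetical, and only the default key list is taken from the signature above.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable

# Default keys mirror the default `keys` argument shown in the meta_og() signature.
DEFAULT_OG_KEYS = ("og:url", "og:image", "og:description",
                   "og:type", "og:locale", "og:title")


class _OGParser(HTMLParser):
    """Collects <meta property="..." content="..."> pairs from a document."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop, content = attr.get("property"), attr.get("content")
        if prop and content is not None:
            # Keep the first occurrence of each property.
            self.found.setdefault(prop, content)


def extract_meta_og(html_txt: str,
                    keys: Iterable[str] = DEFAULT_OG_KEYS) -> Dict[str, str]:
    """Hypothetical helper: return the requested Open Graph tags found in html_txt."""
    parser = _OGParser()
    parser.feed(html_txt)
    return {k: parser.found[k] for k in keys if k in parser.found}


sample = """
<html><head>
<meta property="og:title" content="Example page">
<meta property="og:type" content="article">
<meta name="description" content="not an Open Graph tag">
</head><body></body></html>
"""
og = extract_meta_og(sample)
print(og)
```

Keys that are absent from the page are simply left out of the result in this sketch; whether the library does the same is not stated in this reference.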

download

datahtml.web.download(url: str, *, crawler: CrawlerSpec, is_root=True, raise_when_not_200=True) WebDocument

Crawls the given URL.

Parameters:
  • url (str) – URL to crawl.

  • crawler (CrawlerSpec) – A CrawlerSpec implementation.

  • is_root (bool) – whether the URL is the root page of a site.

Returns:

A WebDocument for the crawled page.

Return type:

WebDocument
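
The signature lists raise_when_not_200 without describing it. The sketch below shows one plausible reading, inferred from the parameter name alone: a non-200 response raises unless the flag is turned off. Everything in it is a hypothetical stand-in (Fetcher, Page, download_sketch, fake_fetch, and the RuntimeError); it does not use or reproduce datahtml.web.download or CrawlerSpec.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical stand-in for a crawler: takes a URL, returns (status_code, body).
Fetcher = Callable[[str], Tuple[int, str]]


@dataclass
class Page:
    """Hypothetical minimal stand-in for WebDocument."""
    url: str
    html_txt: str
    is_root: bool


def download_sketch(url: str, *, fetch: Fetcher, is_root: bool = True,
                    raise_when_not_200: bool = True) -> Page:
    status, body = fetch(url)
    if status != 200 and raise_when_not_200:
        # Assumed behaviour: surface the failure instead of wrapping an error page.
        raise RuntimeError(f"{url} returned HTTP {status}")
    return Page(url=url, html_txt=body, is_root=is_root)


def fake_fetch(url: str) -> Tuple[int, str]:
    """Offline fetcher so the example runs without network access."""
    pages = {"https://example.com": (200, "<html><title>Home</title></html>")}
    return pages.get(url, (404, "<html>not found</html>"))


home = download_sketch("https://example.com", fetch=fake_fetch)
missing = download_sketch("https://example.com/nope", fetch=fake_fetch,
                          is_root=False, raise_when_not_200=False)
print(home.is_root, missing.is_root)
```

Passing the fetcher in as an argument keeps the example testable offline; the real function takes a crawler keyword argument, as shown in the signature.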

build_sitemap

datahtml.web.build_sitemap(url: str, *, crawler: CrawlerSpec, filter_dt: int = 1) List[SitemapLink]

Tries to obtain the sitemaps of the site via the robots.txt protocol. After finding the sitemap links, it crawls each of them.

Parameters:
  • url (str) – Base URL of the site.

  • crawler (CrawlerSpec) – A crawler implementing CrawlerSpec.

  • filter_dt (int) – some sites, such as media sites, have a large number of sitemaps and links; filter_dt helps filter out old content.

Returns:

A list of links extracted from the sitemaps.

Return type:

List[sitemap.SitemapLink]
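
The two steps described for build_sitemap (finding sitemap URLs in robots.txt, then reading links from each sitemap) can be sketched with the standard library alone. This is an independent illustration, not the library's code: Link, sitemaps_from_robots, links_from_sitemap, and max_age_days are hypothetical names. In particular, this reference does not state the unit of filter_dt, so the age filter in days is an assumption made only for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple, Optional

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


class Link(NamedTuple):
    """Hypothetical stand-in for SitemapLink."""
    url: str
    lastmod: Optional[datetime]


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared with 'Sitemap:' directives in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def _parse_lastmod(raw: str) -> datetime:
    """Parse a <lastmod> value; naive timestamps are treated as UTC."""
    text = raw.strip()
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    dt = datetime.fromisoformat(text)
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def links_from_sitemap(xml_text: str, *, now: datetime,
                       max_age_days: Optional[int] = None) -> List[Link]:
    """Extract <url> entries, optionally dropping those older than max_age_days."""
    root = ET.fromstring(xml_text.strip())
    links = []
    for node in root.iter(f"{SITEMAP_NS}url"):
        loc = node.findtext(f"{SITEMAP_NS}loc")
        if not loc:
            continue
        raw = node.findtext(f"{SITEMAP_NS}lastmod")
        lastmod = _parse_lastmod(raw) if raw else None
        # Entries without a date are kept, since their age is unknown.
        if max_age_days is not None and lastmod is not None:
            if now - lastmod > timedelta(days=max_age_days):
                continue
        links.append(Link(loc.strip(), lastmod))
    return links


robots = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/new</loc><lastmod>2024-01-14</lastmod></url>
  <url><loc>https://example.com/old</loc><lastmod>2020-05-01T08:00:00Z</lastmod></url>
  <url><loc>https://example.com/undated</loc></url>
</urlset>"""

now = datetime(2024, 1, 15, tzinfo=timezone.utc)
found = sitemaps_from_robots(robots)
recent = links_from_sitemap(sitemap, now=now, max_age_days=1)
print(found, [link.url for link in recent])
```

The sketch reads only urlset documents; sitemap index files, which point to further sitemaps, would need an extra pass over their sitemap entries.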