Web
WebDocument
- class datahtml.web.WebDocument(url: str, *, html_txt: str, is_root=False)
The main object of the library. It represents an HTML document, which may be the root page of a site or a subpage.
- __init__(url: str, *, html_txt: str, is_root=False)
- Parameters:
url (str) – URL the document belongs to.
html_txt (str) – HTML text of the document.
is_root (bool) – whether the document is the root page of the site or a subpage.
- property html: str
- classmethod parse(url: str, *, crawler: CrawlerSpec, is_root=False) WebDocument
Crawls and parses the given URL.
Deprecated since version 0.3.0: Use datahtml.web.download() instead.
- ld_json() Dict[str, Any]
- meta_og(keys=['og:url', 'og:image', 'og:description', 'og:type', 'og:locale', 'og:title']) Dict[str, str]
- article() ArticleData
- keywords() str | None
- metas() List[MetaTag]
- get_locale() str | None
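As a rough illustration of what a method like meta_og() returns, the sketch below collects Open Graph meta tags from an HTML string using only the standard library. It is not datahtml's implementation: the names OpenGraphParser and extract_og are hypothetical, and keeping only the first occurrence of each key is an assumption made for this example.

```python
from html.parser import HTMLParser
from typing import Dict, List


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags (hypothetical helper)."""

    def __init__(self, keys: List[str]) -> None:
        super().__init__()
        self.keys = set(keys)
        self.found: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property")
        content = attr.get("content")
        if prop in self.keys and content is not None:
            # Assumption: keep the first occurrence of each key.
            self.found.setdefault(prop, content)


def extract_og(html_txt: str, keys: List[str]) -> Dict[str, str]:
    """Return the requested Open Graph keys that are present in html_txt."""
    parser = OpenGraphParser(keys)
    parser.feed(html_txt)
    return parser.found


sample = """<html><head>
<meta property="og:title" content="Hello">
<meta property="og:type" content="article">
<meta name="description" content="ignored">
</head><body></body></html>"""

og = extract_og(sample, ["og:title", "og:type", "og:url"])
print(og)
```

Keys that are requested but absent from the page (og:url in the sample) are simply left out of the result.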
download
- datahtml.web.download(url: str, *, crawler: CrawlerSpec, is_root=True, raise_when_not_200=True) WebDocument
Crawls the given URL.
- Parameters:
url (str) – URL to crawl.
crawler (CrawlerSpec) – A CrawlerSpec implementation.
is_root (bool) – whether it is a root site or not.
- Returns:
A WebDocument object.
- Return type:
WebDocument
build_sitemap
- datahtml.web.build_sitemap(url: str, *, crawler: CrawlerSpec, filter_dt: int = 1) List[SitemapLink]
Tries to get the sitemaps of the site following the robots.txt protocol. After finding the sitemap links, it crawls each one.
- Parameters:
url (str) – Base URL of the site.
crawler (CrawlerSpec) – A crawler based on CrawlerSpec.
filter_dt (int) – Some sites, such as media sites, can have a lot of sitemaps and links; filter_dt helps filter out old content.
- Returns:
A list of links extracted from the sitemaps.
- Return type:
List[sitemap.SitemapLink]
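The robots.txt step described above can be sketched with the standard library alone. This is a minimal illustration of reading Sitemap entries from a robots.txt body, not the code build_sitemap() actually runs; the function name sitemaps_from_robots is hypothetical, and the date filtering done by filter_dt is left out because its exact semantics are not documented here.

```python
from typing import List


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared in 'Sitemap:' lines of a robots.txt body."""
    urls: List[str] = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # Field names in robots.txt are case-insensitive.
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


robots = """User-agent: *
Disallow: /private
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml  # news section
"""

found = sitemaps_from_robots(robots)
print(found)
```

Each URL found this way would then be fetched and parsed as a sitemap, which may itself be an index pointing at further sitemaps.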
find_rss_links
- datahtml.web.find_rss_links(url: str, *, crawler: CrawlerSpec, web: WebDocument | None = None) List[RSSLink]
Scrapes the URL looking for links related to RSS feeds. If it finds RSS links, it then tries to fetch the feed from those URLs.
- Parameters:
url (str) – Base URL to crawl; it should be the root URL.
crawler (CrawlerSpec) – CrawlerSpec implementation to be used.
web (WebDocument | None) – Optional. If a WebDocument object is passed, the site is not crawled again.
- Returns:
A list of RSS links, already parsed.
- Return type:
List[rss.RSSLink]
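Feed discovery of this kind usually relies on link tags in the page head. The sketch below shows one common approach with the standard library; it is not necessarily how find_rss_links() works internally. FeedLinkParser and discover_feeds are hypothetical names, and accepting Atom feeds alongside RSS is an assumption of this example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Assumption: treat both RSS and Atom MIME types as feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, href))


def discover_feeds(base_url: str, html_txt: str) -> List[str]:
    parser = FeedLinkParser(base_url)
    parser.feed(html_txt)
    return parser.links


page = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
</head><body></body></html>"""

feeds = discover_feeds("https://example.com/", page)
print(feeds)
```

Passing already-downloaded HTML into the helper mirrors the idea behind the web parameter: when the document is available, no extra request to the root URL is needed.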