Types
- class datahtml.types.Link(title: str, href: str, internal: bool, is_file: bool)
- title: str
text extracted from the link
- href: str
the actual link target
- internal: bool
whether the link is internal to the site or external (another domain)
- is_file: bool
whether the link points to a file
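As a rough illustration of how the four fields relate, the sketch below fills in `internal` and `is_file` from an href. It is not datahtml's implementation: the `Link` dataclass is a local stand-in that only mirrors the documented fields, and `build_link`, `site_domain`, and `FILE_EXTENSIONS` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical set of extensions treated as files; not taken from datahtml.
FILE_EXTENSIONS = {".pdf", ".zip", ".csv", ".jpg", ".png"}


@dataclass
class Link:
    """Local stand-in mirroring the documented fields of datahtml.types.Link."""

    title: str
    href: str
    internal: bool
    is_file: bool


def build_link(title: str, href: str, site_domain: str) -> Link:
    """Hypothetical helper: derive internal and is_file from the href."""
    parsed = urlparse(href)
    host = parsed.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Relative hrefs carry no host, so they are treated as internal here.
    internal = host == "" or host == site_domain
    # Only the path is inspected, so query parameters do not affect the result.
    extension = splitext(parsed.path)[1].lower()
    return Link(
        title=title,
        href=href,
        internal=internal,
        is_file=extension in FILE_EXTENSIONS,
    )
```

The rule used here (same host or relative href means internal, a known extension means file) is an assumption chosen for the sketch; the library may classify links differently.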
- class datahtml.types.URL(fullurl: str, url_short: str, norm: str, www: bool, secure: bool, domain_base: str, netloc: str, path: str, tld: str, is_social: bool = False)
Represents a URL. Usually obtained by parsing with datahtml.parsers.parse_url().
- Parameters:
fullurl – the original URL as given
url_short – a normalized URL. It is maintained only for compatibility and will be deprecated because it still keeps the www. prefix.
norm – a fully normalized URL: no trailing slash, no query parameters, no www, and no protocol; only domain and path.
www – a boolean indicating whether the original URL has www
secure – whether the protocol is https (secure) rather than http
domain_base – the domain only, without path or protocol
netloc – the netloc value as provided by urllib’s parse function. Unlike the norm attribute, netloc can include the port and www.
path – the path of the URL; it keeps query parameters.
tld – the TLD part of the domain. It may be deprecated in a future release.
is_social – whether the URL belongs to a known social network. It may be deprecated in a future release.
from datahtml.parsers import parse_url
url = parse_url("https://www.algorinfo.com/testing?query=params")
print(url.norm)
# algorinfo.com/testing
New in version 0.4.0rc14: norm attribute
- fullurl: str
- url_short: str
- norm: str
- www: bool
- secure: bool
- domain_base: str
- netloc: str
- path: str
- tld: str
- is_social: bool = False
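To make the difference between netloc and norm concrete, here is a minimal sketch built only on the standard library's `urllib.parse`. It approximates the documented behaviour of norm (no protocol, www, port, or query parameters, and, as an assumption based on the example output above, no trailing slash); `normalize` is a hypothetical helper, not the function datahtml uses internally.

```python
from urllib.parse import urlparse


def normalize(fullurl: str) -> str:
    """Hypothetical helper approximating the documented norm attribute."""
    parsed = urlparse(fullurl)
    # hostname lowercases the host and drops any port, unlike netloc.
    host = parsed.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Using only parsed.path discards the query string and fragment.
    path = parsed.path.rstrip("/")
    return host + path


parsed = urlparse("https://www.algorinfo.com/testing?query=params")
print(parsed.netloc)  # www.algorinfo.com
print(normalize("https://www.algorinfo.com/testing?query=params"))  # algorinfo.com/testing
```

Note that `urlparse` keeps the query string in a separate `query` field, whereas the path attribute documented above is described as keeping query parameters, so the two are not interchangeable.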