Types

class datahtml.types.Link(title: str, href: str, internal: bool, is_file: bool)

title: str: text extract from the link

href: str: the real link

internal: bool: if is internal to site or external (another domain)

is_file: bool: whether the link points to a file
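As a rough illustration of what the internal and is_file flags describe, the sketch below classifies a link relative to the page it was found on. It uses only the standard library. The Link dataclass here is a stand-in that mirrors the documented fields, and build_link and FILE_EXTENSIONS are hypothetical names, so the rules datahtml actually applies may differ.

```python
from dataclasses import dataclass
from urllib.parse import urljoin, urlsplit
import posixpath


@dataclass
class Link:
    # Stand-in mirroring the documented fields; not the library's own class.
    title: str
    href: str
    internal: bool
    is_file: bool


# Hypothetical extension list; the library's own rule may differ.
FILE_EXTENSIONS = {".pdf", ".zip", ".csv", ".jpg", ".png"}


def build_link(title: str, href: str, page_url: str) -> Link:
    """Classify a link relative to the page it appears on (assumed logic)."""
    # Resolve relative hrefs against the page URL.
    absolute = urljoin(page_url, href)
    link_host = urlsplit(absolute).hostname or ""
    page_host = urlsplit(page_url).hostname or ""
    # Take the extension of the URL path, ignoring any query string.
    ext = posixpath.splitext(urlsplit(absolute).path)[1].lower()
    return Link(
        title=title.strip(),
        href=absolute,
        internal=link_host == page_host,
        is_file=ext in FILE_EXTENSIONS,
    )


page = "https://example.com/about"
report = build_link("Annual report", "/files/report.pdf", page)
other = build_link("Elsewhere", "https://other.org/page", page)
print(report.internal, report.is_file)
print(other.internal, other.is_file)
```

Comparing full hostnames is the simplest possible test; it treats subdomains such as blog.example.com as external, which may or may not match what the library considers internal.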

class datahtml.types.URL(fullurl: str, url_short: str, norm: str, www: bool, secure: bool, domain_base: str, netloc: str, path: str, tld: str, is_social: bool = False)

Represents a URL. Instances are usually created by parsing with datahtml.parsers.parse_url().

Parameters:

fullurl – the original URL as given
url_short – a normalized URL. It is maintained only for compatibility and will be deprecated because it still keeps the www. prefix.
norm – a fully normalized URL: no trailing slashes, no query params, no www, and no protocol; only domain and path.
www – whether the original URL includes www
secure – whether the protocol is https rather than http
domain_base – the domain only, without path or protocol
netloc – the netloc value as returned by urllib's parse function. Unlike norm, netloc may include the port and www.
path – path of the URL, including query params.
tld – the TLD part of the domain. It may be deprecated in a future release.
is_social – whether the URL belongs to a known social network. It may be deprecated in a future release.

from datahtml.parsers import parse_url

url = parse_url("https://www.algorinfo.com/testing?query=params")
print(url.norm)
# prints: algorinfo.com/testing

New in version 0.4.0rc14: norm attribute
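To make the difference between netloc and norm concrete, here is a minimal sketch using only urllib.parse from the standard library. The helper sketch_norm is hypothetical and only approximates the documented description of norm (domain and path, without protocol, www, port, query params, or trailing slash); it is not the library's implementation, and parse_url may handle edge cases differently.

```python
from urllib.parse import urlsplit


def sketch_norm(fullurl: str) -> str:
    """Hypothetical approximation of `norm`: host without www, plus path."""
    parts = urlsplit(fullurl)
    # .hostname is lowercased and excludes the port, unlike .netloc.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Assumption: trailing slashes are dropped; the query string is ignored
    # because only parts.path is used.
    return host + parts.path.rstrip("/")


full = "https://www.algorinfo.com:8443/testing/?query=params"
netloc = urlsplit(full).netloc
norm = sketch_norm(full)
print(netloc)  # www.algorinfo.com:8443
print(norm)    # algorinfo.com/testing
```

Under these assumptions, URLs that differ only in protocol, www prefix, port, or query string map to the same norm value, which is what makes a field like this convenient as a deduplication key.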

fullurl: str

url_short: str

norm: str

www: bool

secure: bool

domain_base: str

netloc: str

path: str

tld: str

is_social: bool = False

class datahtml.types.Image(alt: str, src: str)

alt: str

src: str