Types

title: str

The text extracted from the link.

href: str

the real link

internal: bool

Whether the link is internal to the site or external (another domain).

is_file: bool

Whether the link points to a file.
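
The four fields above could be filled in roughly as follows. This is a minimal sketch using only the standard library, not datahtml's implementation: the names LinkSketch, build_link and FILE_EXTENSIONS are hypothetical, and both the "same hostname means internal" rule and the extension list are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Assumption: a link is treated as a file when its path ends with one of
# these extensions. The rule datahtml really applies is not documented here.
FILE_EXTENSIONS = {".pdf", ".zip", ".csv", ".jpg", ".png"}


@dataclass
class LinkSketch:
    """Hypothetical stand-in mirroring the four documented fields."""
    title: str
    href: str
    internal: bool
    is_file: bool


def build_link(title: str, href: str, page_url: str) -> LinkSketch:
    """Resolve href against the page it was found on and classify it."""
    absolute = urljoin(page_url, href)  # turn relative links into absolute ones
    target = urlparse(absolute)
    page = urlparse(page_url)
    # Assumption: "internal" means the link stays on the same hostname.
    internal = target.hostname == page.hostname
    suffix = PurePosixPath(target.path).suffix.lower()
    return LinkSketch(
        title=title.strip(),
        href=absolute,
        internal=internal,
        is_file=suffix in FILE_EXTENSIONS,
    )


report = build_link(" Report ", "/files/report.pdf", "https://example.com/blog/")
print(report.internal, report.is_file)
```

Resolving the href before comparing hostnames means relative links such as /files/report.pdf are classified as internal without special handling.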

class datahtml.types.URL(fullurl: str, url_short: str, norm: str, www: bool, secure: bool, domain_base: str, netloc: str, path: str, tld: str, is_social: bool = False)

Represents a URL, usually obtained by calling datahtml.parsers.parse_url().

Parameters:
  • fullurl – the original url as given

  • url_short – a normalized url, maintained only for compatibility. It will be deprecated because it also keeps the www. prefix.

  • norm – a fully normalized url: no trailing slash, no query parameters, no www and no protocol, only domain and path.

  • www – whether the original url includes www

  • secure – whether the protocol is https (secure) rather than http

  • domain_base – the domain only, without path or protocol

  • netloc – the netloc value as provided by urllib’s parse function. Unlike the norm attribute, netloc may include the port and www.

  • path – the path of the url, including any query parameters.

  • tld – the TLD part of the domain. It may be deprecated in a future release.

  • is_social – whether the url belongs to a known social network. It may be deprecated in a future release.

from datahtml.parsers import parse_url

url = parse_url("https://www.algorinfo.com/testing?query=params")
print(url.norm)
# algorinfo.com/testing

New in version 0.4.0rc14: norm attribute
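
To make the relationship between the attributes concrete, the sketch below derives a few of them with the standard library's urllib.parse. It is an illustration under stated assumptions, not how parse_url() is implemented: describe_url is a hypothetical helper, and domain_base, tld, url_short and is_social are left out because their exact rules are not specified above.

```python
from urllib.parse import urlsplit


def describe_url(fullurl: str) -> dict:
    """Hypothetical helper that approximates some of the documented attributes."""
    parts = urlsplit(fullurl)
    host = parts.hostname or ""  # hostname drops the port and lowercases
    www = host.startswith("www.")
    bare_host = host[4:] if www else host
    # path keeps the query parameters, as described for the path attribute
    path = parts.path + (f"?{parts.query}" if parts.query else "")
    # norm: domain and path only, without protocol, www, query or trailing slash
    norm = (bare_host + parts.path).rstrip("/")
    return {
        "fullurl": fullurl,
        "norm": norm,
        "www": www,
        "secure": parts.scheme == "https",
        "netloc": parts.netloc,  # may still include www and the port
        "path": path,
    }


info = describe_url("https://www.algorinfo.com/testing?query=params")
print(info["norm"])
print(info["netloc"])
```

For the url used in the example above, this sketch yields algorinfo.com/testing for norm, while netloc remains www.algorinfo.com.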

fullurl: str
url_short: str
norm: str
www: bool
secure: bool
domain_base: str
netloc: str
path: str
tld: str
is_social: bool = False
class datahtml.types.Image(alt: str, src: str)
alt: str
src: str