Welcome to datahtml

Datahtml is a library for processing and extracting data from HTML and XML content.

Datahtml lets you:

  • Extract ld+json (JSON-LD) data from HTML

  • Extract frequently used meta tags from HTML (those used for SEO and social media, among others)

  • Extract article data from an HTML page, usually from newspaper sites

  • Parse RSS feeds from sites

  • Crawl specific social media sites such as Google and YouTube

Under the hood, datahtml uses libraries such as BeautifulSoup, Newspaper2k and feedparser, among others.
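To make the first two items in the list above concrete, here is a minimal sketch of what extracting ld+json data and meta tags from an HTML page involves. It does not use datahtml's own API, which is not shown here, nor BeautifulSoup: it relies only on Python's standard library. The names MetadataParser, extract_metadata and SAMPLE_HTML are hypothetical and exist only for this illustration.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Hypothetical helper: collects JSON-LD blocks and <meta> tags."""

    def __init__(self):
        super().__init__()
        self.json_ld = []         # parsed ld+json objects, in document order
        self.meta = {}            # meta "name" or "property" -> "content"
        self._in_json_ld = False  # True while inside an ld+json <script>
        self._buffer = []         # text chunks of the current ld+json block

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta":
            # SEO tags use "name"; Open Graph style tags use "property".
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed ld+json blocks instead of failing


def extract_metadata(html):
    """Return the ld+json objects and meta tags found in an HTML string."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"json_ld": parser.json_ld, "meta": parser.meta}


# Made-up sample page, used only to exercise the sketch.
SAMPLE_HTML = """
<html><head>
<title>Example</title>
<meta name="description" content="A sample page">
<meta property="og:title" content="Sample title">
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle", "headline": "Sample headline"}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE_HTML))
```

Real pages are often messier than this sample (several ld+json blocks, invalid JSON, duplicated meta tags), which is one reason to lean on a dedicated library rather than a hand-written parser.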
