HTML stands for Hypertext Markup Language and is the standard markup language for creating web pages and web applications. Web browsers receive HTML documents from a web server or local storage and render them into web pages.
- HTML uses markup to describe the structure of web pages.
- The building blocks of HTML pages are HTML elements.
- HTML tags represent HTML elements and label the different types of content, such as paragraphs, tables, and headings.
- Web browsers use HTML tags to render page content.
HTML allows images and other objects, such as forms, to be embedded in the rendered page. It provides a way to create structured, organized documents by denoting structural semantics for text, such as paragraphs, lists, and headings.
Many libraries are available for parsing HTML, but parsing HTML is not an exact science. A page can look fine when rendered in a browser even though its markup is not quite right, because browsers silently repair many errors. To handle badly formed HTML pages, programmers need to choose a parsing library that tolerates such markup.
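To make the "looks fine but is not quite right" problem concrete, here is a minimal sketch using Python's standard `html.parser` module rather than any of the .NET libraries listed in this article. The class `TagCollector` and the function `unclosed_tags` are hypothetical names invented for this illustration. The sketch feeds deliberately sloppy markup to a tolerant parser and reports which tags were opened but never explicitly closed.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start and end tag; tolerant of malformed input."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


def unclosed_tags(html):
    """Return tags that were opened but never explicitly closed.

    Simplification: void elements such as <br> would also be reported,
    because they never have an end tag.
    """
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    remaining = list(collector.closed)
    missing = []
    for tag in collector.opened:
        if tag in remaining:
            remaining.remove(tag)  # pair this start tag with one end tag
        else:
            missing.append(tag)
    return missing


# A browser would render this without complaint.
malformed = "<div><p>First paragraph<p>Second <b>bold</div>"
print(unclosed_tags(malformed))  # ['p', 'p', 'b']
```

The parser does not raise on the missing end tags; it simply reports what it sees. A tolerant library goes further and builds a usable document tree from such input, which is the behaviour to look for when choosing a parser.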
In this article, we discuss the top 10 HTML libraries, that is, the most useful libraries in use in 2017.
1. HtmlAgilityPack By ZZZ Projects & darthobiwan
HtmlAgilityPack is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT.
- You don't need to understand XPath or XSLT to use it.
- It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML.
- The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).
For further details, you can visit http://html-agility-pack.net/
2. AngleSharp By FlorianRappl
AngleSharp is a .NET library that gives you the ability to parse angle-bracket-based hypertext formats such as HTML, SVG, and MathML. AngleSharp follows the W3C specifications, and besides its official APIs, it also adds some useful extension methods that make working with the DOM convenient.
- An important aspect of AngleSharp is that CSS can also be parsed.
- AngleSharp tries to minimize memory allocations and reuses elements internally to avoid unnecessary object creation.
- The performance of AngleSharp is quite close to that of browsers; even very large pages can be processed within milliseconds.
3. TinyMCE By tugberk
TinyMCE can convert HTML TEXTAREA fields or other HTML elements to editor instances. TinyMCE is very easy to integrate into other Content Management Systems.
4. GemBox.Document By GemBoxSoftware
GemBox.Document is a .NET component that enables developers to read, write, convert and print document files such as HTML, TXT, PDF, DOCX, DOC, RTF from .NET applications in a simple and efficient way.
- Requires only the .NET Framework and is much faster than Microsoft Word automation.
- No dependency on Microsoft Word.
- Simple and easy-to-use programming interface.
- High-quality rendering and printing.
5. FSharp.Data By tomasp
The F# Data library implements everything you need to access data in your F# applications and scripts. It contains F# type providers for working with structured file formats (CSV, HTML, JSON and XML) and for accessing the WorldBank data. It also includes helpers for parsing CSV, HTML and JSON files and for sending HTTP requests.
This library focuses on providing simple, mostly read-only access to structured documents and other data sources.
For further details, you can visit http://fsharp.github.io/FSharp.Data/
6. Markdig By xoofx
Markdig is a fast, powerful, CommonMark compliant, extensible Markdown processor for .NET with 20+ built-in extensions.
- Very fast, lightweight parser and HTML renderer
- Converter to HTML
- Includes all the core elements of CommonMark
- Compatible with .NET 3.5, 4.0+ and .NET Core (netstandard1.1+)
7. HtmlSanitizer By mganss
HtmlSanitizer is a .NET library for cleaning HTML fragments and documents from constructs that can lead to XSS attacks. It can also shield you from deliberate or accidental "tag poisoning" where invalid HTML in one fragment can corrupt the whole document leading to broken layout or style. It uses AngleSharp to parse, manipulate, and render HTML and CSS.
To facilitate different use cases, HtmlSanitizer can be customized at several levels:
- Configure allowed HTML tags and attributes through the AllowedTags and AllowedAttributes properties respectively.
- Configure allowed CSS property names and at-rules through the AllowedCssProperties and AllowedAtRules properties respectively.
- Configure allowed URI schemes through the property AllowedSchemes.
- Configure HTML attributes that contain URIs (such as "src", "href" etc.) through the property UriAttributes.
- Cancelable events are raised before a tag, attribute, or style is removed.
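The allowlist idea behind these settings can be sketched in a few lines. The example below uses Python's standard library and is not HtmlSanitizer's API. The constants `ALLOWED_TAGS`, `ALLOWED_ATTRIBUTES`, `ALLOWED_SCHEMES`, and `URI_ATTRIBUTES` are hypothetical names that only mirror the properties described above, and `AllowlistSanitizer`, `DROP_WITH_CONTENT`, and `sanitize` are likewise invented for illustration. It is a teaching sketch, not a security-grade sanitizer.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical settings that mirror the properties described above.
ALLOWED_TAGS = {"p", "b", "i", "a"}
ALLOWED_ATTRIBUTES = {"href", "title"}
ALLOWED_SCHEMES = {"http", "https"}
URI_ATTRIBUTES = {"href", "src"}
DROP_WITH_CONTENT = {"script", "style"}  # assumption: remove these with their content


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still emitted
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRIBUTES or value is None:
                continue
            if name in URI_ATTRIBUTES:
                # Relative URLs have no scheme and are dropped in this sketch.
                if urlparse(value.strip()).scheme not in ALLOWED_SCHEMES:
                    continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick="steal()">Hi <a href="javascript:alert(1)" title="t">link</a>'
         '<script>alert(2)</script></p>')
print(sanitize(dirty))  # <p>Hi <a title="t">link</a></p>
```

Everything not explicitly allowed is removed, which is the safer default compared with trying to enumerate dangerous constructs. A real sanitizer also has to handle CSS, character-encoding tricks, and parser differences between browsers, which is why a maintained library is preferable to hand-rolled code.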
8. NReco.PdfGenerator By nreco
This package embeds wkhtmltopdf binaries (x86 Windows), which are extracted on first use, so you don't need to deploy wkhtmltopdf separately.
- PdfGenerator can be used for free in single-deployment projects.
- In 99% of cases, the PDF result is identical to the web browser view.
- The simplest way to generate PDF from .NET applications such as ASP.NET, ASP.NET MVC, WebForms, .NET Core, VB.NET, etc.
For further details, you can visit https://www.nrecosite.com/pdf_generator_net.aspx
9. HtmlAgilityPack.CssSelectors By pbrooks
HtmlAgilityPack.CssSelectors provides an Extension Method for HtmlAgilityPack HtmlDocument and HtmlNode classes. It is a handy tool for Web Scrapers, and a good alternative to HAP XPath queries.
10. HtmlTags By joshuaflanagan
HtmlTags is a simple .NET object model for generating HTML. In general, you should avoid building strings of HTML in your applications. There are plenty of template/view engines that are much more suitable for generating dynamic markup. However, some situations require you to build snippets of HTML from code (e.g., view extensions in FubuMVC or HtmlHelper extensions in ASP.NET MVC).
HtmlTags is the best way to build those snippets. The advantages of HtmlTags are as follows:
- Automatic encoding of HTML entities in your attributes and inner text.
- Ensures proper structure, such as closing tags.
- Easy-to-use chaining API with method names modeled after jQuery.
- Compatible with ASP.NET 4's encoded code expressions <%: %>
- Methods manipulate an internal model, not a string. The HTML string is not generated until ToString() is called.
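The combination of a chaining API, automatic encoding, and deferred string generation can be illustrated with a small sketch. The code below is written in Python and is not the HtmlTags API. The class `Tag` and its methods `attr`, `add_class`, `text`, and `append` are hypothetical names chosen for this example.

```python
from html import escape


class Tag:
    """A tiny object model for one HTML element (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.classes = []
        self.children = []  # raw strings or nested Tag objects

    # Each method mutates the model and returns self to allow chaining.
    def attr(self, key, value):
        self.attributes[key] = value
        return self

    def add_class(self, css_class):
        if css_class not in self.classes:
            self.classes.append(css_class)
        return self

    def text(self, content):
        self.children.append(content)  # stored raw, escaped only at render time
        return self

    def append(self, child):
        self.children.append(child)
        return self

    def __str__(self):
        # The HTML string is produced only here, from the internal model.
        attrs = dict(self.attributes)
        if self.classes:
            attrs["class"] = " ".join(self.classes)
        rendered_attrs = "".join(
            f' {key}="{escape(str(value), quote=True)}"' for key, value in attrs.items()
        )
        inner = "".join(
            str(child) if isinstance(child, Tag) else escape(child, quote=False)
            for child in self.children
        )
        return f"<{self.name}{rendered_attrs}>{inner}</{self.name}>"


link = Tag("a").attr("href", "/search?q=a&b=1").add_class("nav").text("Tom & Jerry <3")
print(link)  # <a href="/search?q=a&amp;b=1" class="nav">Tom &amp; Jerry &lt;3</a>
```

Because the element is held as a model until it is rendered, it can be passed around and modified freely, and the closing tag and entity encoding are applied in one place instead of at every string concatenation.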