Parsing HTML Documents with the Html Agility Pack
Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a price comparison website might screen scrape a variety of online retailers to build a database of products and what various retailers are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser – namely, by making an HTTP request from code and then parsing and analyzing the returned HTML. The .NET Framework offers a variety of classes for accessing data from a remote website, namely the WebClient class and the HttpWebRequest class . These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly rely on string parsing methods like String.IndexOf , String.Substring , and the like, or through the use of regular expressions. Another option for parsing HTML documents is to use the Html Agility Pack , a free, open-source library designed to simplify reading from and writing to HTML documents














