Site Scraping with HtmlAgilityPack and ScrapySharp

Recently a customer asked me to scrape data from a deprecated site and apply that data to their new site. Because the old site held a large amount of data, it made sense to write a .NET Core command-line program that would do the scraping and serialize the results to JSON for us.

The command-line program is invoked like this:

                LocationScraper [location_csv_input_file]

Where location_csv_input_file is a simple CSV file listing all the pages we wish to have scraped.
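
Because the input file has no header row, CsvHelper cannot match columns by name, so the record type has to map its column by position. The following is a minimal sketch of how that could look. LocationRecord and CsvInput are hypothetical names (the actual record class is not shown in this post), and the mapping uses CsvHelper's Index attribute.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;
using CsvHelper.Configuration;
using CsvHelper.Configuration.Attributes;

// Hypothetical row type: one URL per line, in the first column.
public class LocationRecord
{
    [Index(0)] // map by position, since there is no header to map by name
    public string LocationPath { get; set; } = "";
}

public static class CsvInput
{
    // Reads every row from the reader and returns the URLs as strings.
    public static List<string> ReadLocationPaths(TextReader reader)
    {
        var config = new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            HasHeaderRecord = false
        };

        using var csv = new CsvReader(reader, config);
        return csv.GetRecords<LocationRecord>()
                  .Select(record => record.LocationPath)
                  .ToList();
    }
}
```

Taking a TextReader rather than a file name means the same method works with a StreamReader over args[0] or with a StringReader in a quick test.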

The Main(string[] args) method is as we would expect:

        private static void Main(string[] args)
        {
            if (args.Length < 1)
            {
                Console.Error.WriteLine("Outputs individual location files in json format to 'output' folder.");
                Console.Error.WriteLine("usage: LocationScraper location_csv_input_file");
                return;
            }

            try
            {
                var config = new CsvConfiguration(CultureInfo.InvariantCulture)
                {
                    HasHeaderRecord = false
                };

                using (var reader = new StreamReader(args[0]))
                using (var inputCSV = new CsvReader(reader, config))
                {
                    // Read the urls to be scraped
                    // (LocationRecord is an assumed name for the CSV row type that exposes LocationPath)
                    List<string> locationPaths = inputCSV.GetRecords<LocationRecord>().Select(csvRecord => csvRecord.LocationPath).ToList();
                    // Instantiate our page scraper, via IOC
                    IScraperBase pageScraper = _serviceProvider.GetServices<IScraperBase>().First(s => s.GetType() == typeof(PageScraper));
                    // Scrape the desired data from each of the urls
                    var locationData = ScrapeLocation(locationPaths, pageScraper);
                    // Write the location data to multiple JSON files
                }
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine($"Could not parse location paths from specified input file: {ex.Message}");
            }

            Console.WriteLine("Report generation complete");
        }
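
The listing stops at the comment about writing the JSON files, so here is a rough sketch of what that last step might look like. It is not the original code. It assumes System.Text.Json as the serializer, a BranchItem shape inferred from the properties used later in this post, and a hypothetical helper named WriteLocationFiles.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Assumed shape of the data model, based on the properties set by the scraper.
public class BranchItem
{
    public string Name { get; set; } = "";
    public string Description { get; set; } = "";
    public string SeoTitle { get; set; } = "";
    public string PhoneNumber { get; set; } = "";
}

public static class JsonOutput
{
    // Writes one indented JSON file per item into outputFolder
    // and returns the paths of the files that were written.
    public static List<string> WriteLocationFiles(IEnumerable<BranchItem> items, string outputFolder)
    {
        Directory.CreateDirectory(outputFolder); // does nothing if the folder already exists
        var options = new JsonSerializerOptions { WriteIndented = true };
        var written = new List<string>();

        foreach (var item in items)
        {
            // Name is taken from a URI segment and may end in '/', so trim it.
            var path = Path.Combine(outputFolder, item.Name.Trim('/') + ".json");
            File.WriteAllText(path, JsonSerializer.Serialize(item, options));
            written.Add(path);
        }

        return written;
    }
}
```

In Main this would be called as JsonOutput.WriteLocationFiles(locationData, "output"), matching the 'output' folder named in the usage message.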

An abridged version of ScrapeLocation(IEnumerable<string> locationPaths, IScraperBase scraper) looks like this, where BranchItem is our data model:

public static List<BranchItem> ScrapeLocation(IEnumerable<string> locationPaths, IScraperBase scraper)
{
    List<BranchItem> locationData = new List<BranchItem>();
    foreach (var locationPath in locationPaths)
    {
        Uri locationUri = new Uri(locationPath);
        if (locationUri.Segments.Length >= 1)
        {
            try
            {
                var scrapedContent = scraper.ScrapePage(locationUri);
                if (scrapedContent != null)
                    locationData.Add(scrapedContent);
                else
                    Console.Error.WriteLine($"\t\tNo content found for {locationPath}");
            }
            catch (Exception e)
            {
                Console.Error.WriteLine("\t\tError scraping hospital. Exception: {0}", e);
            }
        }
    }
    return locationData;
}
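
The Segments check above is worth a closer look. For an absolute http or https URI, Uri.Segments always contains at least the root "/", and every segment except possibly the last keeps its trailing slash. The small demo below shows this using made-up example.com URLs; LastSegment is a hypothetical helper for getting a clean name out of the final segment.

```csharp
using System;

public static class UriSegmentsDemo
{
    // Returns the last path segment with any slash removed.
    public static string LastSegment(Uri uri) => uri.Segments[^1].Trim('/');

    public static void Main()
    {
        var withSlash = new Uri("https://example.com/locations/springfield/");

        // Segments keep their trailing slashes.
        Console.WriteLine(string.Join(" | ", withSlash.Segments)); // prints: / | locations/ | springfield/
        Console.WriteLine(LastSegment(withSlash));                 // prints: springfield

        // A bare host still has one segment, the root "/".
        Console.WriteLine(new Uri("https://example.com").Segments.Length); // prints: 1
    }
}
```

So a check of Segments.Length >= 1 passes for every absolute web URL, and a URL with a trailing slash yields a last segment that ends in "/", which matters if that segment later becomes a file name.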

The real fun is in the IScraperBase interface and the PageScraper class.

IScraperBase defines a single method:

namespace LocationScraper.Repositories
{
    using System;
    using Models;

    public interface IScraperBase
    {
        BranchItem ScrapePage(Uri locationUri);
    }
}
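
Main resolves its scraper from a service provider, but the registration code is not shown. Here is a sketch of how it might be wired up, assuming the Microsoft.Extensions.DependencyInjection package is the container in use. The types below are simplified stand-ins, and OtherPageScraper, ScraperRegistration, BuildProvider and Resolve are hypothetical names.

```csharp
using System;
using System.Linq;
using Microsoft.Extensions.DependencyInjection;

// Simplified stand-ins so the sketch compiles on its own.
public class BranchItem { public string Name { get; set; } = ""; }

public interface IScraperBase { BranchItem ScrapePage(Uri locationUri); }

public class PageScraper : IScraperBase
{
    public BranchItem ScrapePage(Uri locationUri) => new BranchItem { Name = "page" };
}

// Hypothetical second implementation for a differently structured page.
public class OtherPageScraper : IScraperBase
{
    public BranchItem ScrapePage(Uri locationUri) => new BranchItem { Name = "other" };
}

public static class ScraperRegistration
{
    // Registers every scraper under the shared interface.
    public static IServiceProvider BuildProvider()
    {
        var services = new ServiceCollection();
        services.AddTransient<IScraperBase, PageScraper>();
        services.AddTransient<IScraperBase, OtherPageScraper>();
        return services.BuildServiceProvider();
    }

    // Picks one concrete scraper out of all registered implementations.
    public static IScraperBase Resolve<T>(IServiceProvider provider) where T : IScraperBase =>
        provider.GetServices<IScraperBase>().First(s => s.GetType() == typeof(T));
}
```

Registering all scrapers under one interface is what lets Main ask for the full set with GetServices and then filter by concrete type.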

Using IOC, I created a number of concrete implementations of IScraperBase, each devoted to a single kind of source page with its unique HTML and CSS. The following example is one of them:

public class PageScraper : IScraperBase
{
    // IScrapingToolsRepository is a custom class with methods to extract data
    // from HTML elements
    protected readonly IScrapingToolsRepository _scrapingToolsRepository;
    protected readonly ScrapySharp.Network.ScrapingBrowser _scrapingBrowser;

    public PageScraper(IScrapingToolsRepository scrapingToolsRepository)
    {
        _scrapingToolsRepository = scrapingToolsRepository;
        _scrapingBrowser = new ScrapySharp.Network.ScrapingBrowser();
    }

    public BranchItem ScrapePage(Uri branchUri)
    {
        ScrapySharp.Network.WebPage webPage = _scrapingBrowser.NavigateToPage(branchUri);
        var branchPage = webPage?.Html;
        if (branchPage == null)
            return null;

        // Instantiate our model, using the last segment of the incoming Uri as the name of the future JSON file
        var branchItem = new BranchItem { Name = branchUri.Segments[^1] };

        // Get the branch SEO Description, and extract the Content value from it
        var branchDescription = branchPage.CssSelect("meta[name='description']").FirstOrDefault();
        branchItem.Description = _scrapingToolsRepository.GetAttributeValue(branchDescription, Constants.Selectors.Content);

        // Get the og:title, and extract the Content from it
        var branchSeoTitle = branchPage.CssSelect("meta[property='og:title']").FirstOrDefault();
        branchItem.SeoTitle = _scrapingToolsRepository.GetAttributeValue(branchSeoTitle, Constants.Selectors.Content);

        // Get the branch phone, strip the "tel:" prefix and separators, then reformat it
        var branchPhoneNumber = branchPage.CssSelect("div.location-main-block a[href*='tel:']").FirstOrDefault();
        branchItem.PhoneNumber = _scrapingToolsRepository.GetAttributeValue(branchPhoneNumber, Constants.Selectors.Href).Replace("tel:", "").Replace(".", "").Replace("-", "").Trim();
        branchItem.PhoneNumber = $"{Convert.ToInt64(branchItem.PhoneNumber):(###) ###-####}";

        // etc
        return branchItem;
    }
}
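
One caveat on the phone number step: Convert.ToInt64 throws a FormatException when the cleaned string is empty or still contains non-digit characters, which can happen on a page with no tel: link or an unusual number format. Below is a more forgiving sketch that swaps the numeric format string for plain digit filtering and slicing. PhoneFormatter and FormatTelHref are hypothetical names, and the handling of a leading country code 1 is an assumption about the data.

```csharp
using System;
using System.Linq;

public static class PhoneFormatter
{
    // Turns a tel: href such as "tel:555.123.4567" into "(555) 123-4567".
    // Returns an empty string when no 10-digit number can be found.
    public static string FormatTelHref(string href)
    {
        if (string.IsNullOrWhiteSpace(href))
            return string.Empty;

        // Keep ASCII digits only, which drops "tel:", "+", dots, dashes and spaces.
        var digits = new string(href.Where(c => c >= '0' && c <= '9').ToArray());

        // Assumption: an 11-digit value starting with 1 carries a US country code.
        if (digits.Length == 11 && digits[0] == '1')
            digits = digits.Substring(1);

        if (digits.Length != 10)
            return string.Empty;

        return $"({digits.Substring(0, 3)}) {digits.Substring(3, 3)}-{digits.Substring(6)}";
    }
}
```

Slicing the digit string also preserves leading zeros, which a round trip through a 64-bit integer would drop.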

After execution completes, we'll have a lovely directory full of JSON files containing the data from the source pages, as extracted by ScrapySharp's ScrapySharp.Extensions.CssSelect().
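
To try the selectors without hitting a live site, CssSelect can also be run against an in-memory document. The sketch below assumes the HtmlAgilityPack and ScrapySharp packages are installed. The HTML is made up for illustration, and SelectAttribute is a hypothetical stand-in for the custom GetAttributeValue helper used above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

public static class CssSelectDemo
{
    // Returns the named attribute of the first node matching the selector,
    // or an empty string when nothing matches or the attribute is absent.
    public static string SelectAttribute(string html, string cssSelector, string attributeName)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        HtmlNode node = document.DocumentNode.CssSelect(cssSelector).FirstOrDefault();
        return node?.GetAttributeValue(attributeName, string.Empty) ?? string.Empty;
    }

    public static void Main()
    {
        // Made-up page that mimics the structure the scraper expects.
        const string html = @"<html><head>
            <meta name='description' content='Springfield branch'>
            <meta property='og:title' content='Springfield'>
            </head><body>
            <div class='location-main-block'><a href='tel:555.123.4567'>Call us</a></div>
            </body></html>";

        Console.WriteLine(SelectAttribute(html, "meta[name='description']", "content"));
        Console.WriteLine(SelectAttribute(html, "meta[property='og:title']", "content"));
        Console.WriteLine(SelectAttribute(html, "div.location-main-block a[href*='tel:']", "href"));
    }
}
```

Returning an empty string instead of null keeps later string calls such as Replace and Trim from throwing when an element is missing.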

The library is a little old, but dead simple and easy to use.

Happy Scraping!