Site Scraping with HtmlAgilityPack and ScrapySharp

Recently, a customer asked me to scrape data from a deprecated site and migrate it to the new site. Because the old site held a large amount of data, it made sense to write a .NET Core command-line program to do the scraping and serialize the results to JSON.

The command-line program is used as follows:

                LocationScraper [location_csv_input_file]

Here, location_csv_input_file is a simple CSV file listing all the pages we want scraped.

The Main(string[] args) method is what you would expect:

        private static void Main(string[] args)
        {
            if (args.Length < 1)
            {
                Console.Error.WriteLine("Outputs individual location files in json format to 'output' folder.");
                Console.Error.WriteLine("usage: LocationScraper location_csv_input_file");
                return;
            }

            try
            {
                ConfigureServices();
                var config = new CsvConfiguration(CultureInfo.InvariantCulture)
                {
                    HasHeaderRecord = false
                };

                using (var reader = new StreamReader(args[0]))
                {
                    using (var inputCSV = new CsvReader(reader, config))
                    {
                        // Read the urls to be scraped
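                        // NOTE: LocationRecord is an assumed name for the CSV row model that exposes LocationPath
                        List<string> locationPaths = inputCSV.GetRecords<LocationRecord>().Select(csvRecord => csvRecord.LocationPath).ToList();
                        // Resolve our page scraper from the IoC container
                        IScraperBase pageScraper = _serviceProvider.GetServices<IScraperBase>().First(s => s.GetType() == typeof(PageScraper));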
                        // Scrape the desired data from each of the urls
                        var locationData = ScrapeLocation(locationPaths, pageScraper);
                        // Write the location data to multiple JSON files
                        RenderLocationData(locationData);
                    }
                }
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine("Could not parse location paths from specified input file");
                Console.Error.WriteLine(ex.Message);
                Console.Error.WriteLine(ex.StackTrace);
            }
            Console.WriteLine("Report generation complete");
        }
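Main relies on two pieces that are not shown: the CSV row model and ConfigureServices(). The following is a minimal sketch of both, not the original code. It assumes the CsvHelper and Microsoft.Extensions.DependencyInjection packages; LocationRecord, the stand-in IScraperBase and PageScraper types, and the example.com URLs are all hypothetical.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;
using CsvHelper.Configuration;
using CsvHelper.Configuration.Attributes;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical row model: one URL per line, no header row.
public class LocationRecord
{
    [Index(0)] // map the first (and only) column by position
    public string LocationPath { get; set; }
}

// Stand-ins for the real types so the sketch compiles on its own.
// The real PageScraper takes a constructor dependency, which would
// need its own registration in the container.
public interface IScraperBase { }
public class PageScraper : IScraperBase { }

public static class ServicesSketch
{
    // Assumed shape of ConfigureServices: each scraper is registered against
    // the shared interface so the caller can pick one by concrete type.
    public static ServiceProvider ConfigureServices() =>
        new ServiceCollection()
            .AddTransient<IScraperBase, PageScraper>()
            .BuildServiceProvider();

    // Reads one location path per CSV line.
    public static string[] ReadPaths(TextReader reader)
    {
        var config = new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            HasHeaderRecord = false
        };
        using var csv = new CsvReader(reader, config);
        return csv.GetRecords<LocationRecord>()
                  .Select(r => r.LocationPath)
                  .ToArray();
    }

    public static void Main()
    {
        var input = "https://example.com/locations/springfield\n" +
                    "https://example.com/locations/shelbyville\n";
        var paths = ReadPaths(new StringReader(input));

        using var provider = ConfigureServices();
        var scraper = provider.GetServices<IScraperBase>()
                              .First(s => s.GetType() == typeof(PageScraper));

        Console.WriteLine($"{paths.Length} paths, scraper: {scraper.GetType().Name}");
    }
}
```

Registering every scraper against the same interface is what makes the GetServices lookup by concrete type work; with only one scraper, a plain GetRequiredService call would do the same job.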

An abridged version of ScrapeLocation(IEnumerable<string> locationPaths, IScraperBase scraper) looks like this, where BranchItem is our data model:

public static List<BranchItem> ScrapeLocation(IEnumerable<string> locationPaths, IScraperBase scraper)
{
    List<BranchItem> locationData = new List<BranchItem>();
    foreach (var locationPath in locationPaths)
    {
        Uri locationUri = new Uri(locationPath);
        if (locationUri.Segments.Length >= 1)
        {
            try
            {
                var scrapedContent = scraper.ScrapePage(locationUri);
                if (scrapedContent != null)
                {
                    locationData.Add(scrapedContent);
                }
                else
                {
                    Console.Error.WriteLine($"\t\tNo content found for {locationPath}");
                }
            }
            catch (Exception e)
            {
                Console.Error.WriteLine("\t\tError scraping hospital. Exception: {0}", e);
                continue;
            }
        }
    }

    return locationData;
}

The real fun is in the IScraperBase interface and the PageScraper class.

IScraperBase defines a single method:

namespace LocationScraper.Repositories
{
    using System;
    using Models;

    public interface IScraperBase
    {
        BranchItem ScrapePage(Uri locationUri);
    }
}

Using IoC, I created a number of concrete implementations of IScraperBase, each devoted to a single kind of source page with its own HTML and CSS. The following example is one of them:

using System;
using System.Linq;
using ScrapySharp.Extensions; // provides CssSelect()

public class PageScraper : IScraperBase
{
     // IScrapingToolsRepository is a custom class with methods to extract data
     // from HTML elements
     protected readonly IScrapingToolsRepository _scrapingToolsRepository;
     protected readonly ScrapySharp.Network.ScrapingBrowser _scrapingBrowser;
     public PageScraper(IScrapingToolsRepository scrapingToolsRepository)
     {
         _scrapingToolsRepository = scrapingToolsRepository;
         _scrapingBrowser = new ScrapySharp.Network.ScrapingBrowser();
     }

     public BranchItem ScrapePage(Uri branchUri)
     {
         // Fetch the page; branchUri is already an absolute Uri
         ScrapySharp.Network.WebPage webPage = _scrapingBrowser.NavigateToPage(branchUri);
         var branchPage = webPage?.Html;
         if (branchPage == null)
         {
             return null;
         }

         // Instantiate our model, using the last segment of the incoming Uri as the name of the future JSON file
         var branchItem = new BranchItem { Name = branchUri.Segments[^1] };

         // Get the branch SEO Description, and extract the Content value from it
         var branchDescription = branchPage.CssSelect("meta[name='description']").FirstOrDefault();
         branchItem.Description = _scrapingToolsRepository.GetAttributeValue(branchDescription, Constants.Selectors.Content);

         // Get the og:title, and extract the Content from it
         var branchSeoTitle = branchPage.CssSelect("meta[property='og:title']").FirstOrDefault();
         branchItem.SeoTitle = _scrapingToolsRepository.GetAttributeValue(branchSeoTitle, Constants.Selectors.Content);

         // Get the branch phone
         var branchPhoneNumber = branchPage.CssSelect("div.location-main-block a[href*='tel:']").FirstOrDefault();
         branchItem.PhoneNumber = _scrapingToolsRepository.GetAttributeValue(branchPhoneNumber, Constants.Selectors.Href).Replace("tel:", "").Replace(".", "").Replace("-", "").Trim();
         branchItem.PhoneNumber = $"{Convert.ToInt64(branchItem.PhoneNumber):(###) ###-####}";

         // etc.
         return branchItem;
     }
}
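IScrapingToolsRepository itself is not shown, and ScrapePage as written will throw if the phone link is missing or is not a plain ten-digit number. The sketch below shows one way such helpers might look. It is an assumption, not the original implementation: ScrapingToolsSketch and FormatPhone are hypothetical names, the sample HTML is made up, and the phone number is formatted by slicing the digit string rather than going through Convert.ToInt64. It needs only the HtmlAgilityPack package.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ScrapingToolsSketch
{
    // Returns the attribute value, or null when the node or attribute is missing.
    public static string GetAttributeValue(HtmlNode node, string attributeName) =>
        node?.Attributes[attributeName]?.Value;

    // Keeps only the digits and formats ten-digit numbers as (###) ###-####.
    // Any other input is returned unchanged instead of throwing.
    public static string FormatPhone(string rawHref)
    {
        if (string.IsNullOrWhiteSpace(rawHref))
        {
            return rawHref;
        }

        var digits = new string(rawHref.Where(char.IsDigit).ToArray());
        if (digits.Length != 10)
        {
            return rawHref;
        }

        return $"({digits.Substring(0, 3)}) {digits.Substring(3, 3)}-{digits.Substring(6)}";
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class='location-main-block'><a href='tel:555.123.4567'>Call us</a></div>");

        // XPath is used here only to keep the sketch free of ScrapySharp;
        // the article's CssSelect call would locate the same element.
        var link = doc.DocumentNode.SelectSingleNode("//a[starts-with(@href,'tel:')]");

        var href = GetAttributeValue(link, "href");
        Console.WriteLine(FormatPhone(href));                      // (555) 123-4567
        Console.WriteLine(GetAttributeValue(null, "href") == null); // True
    }
}
```

Returning the raw value for anything that is not ten digits keeps one odd page from aborting the whole run, at the cost of some unformatted numbers in the output that need a manual look.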

Once execution completes, we'll have a lovely directory full of JSON files containing the data from the source pages, as extracted by ScrapySharp's ScrapySharp.Extensions.CssSelect().
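RenderLocationData, the step that writes those files, is not shown in this post. A minimal sketch of what it might look like follows, assuming System.Text.Json as the serializer (the original may well use something else) and a BranchItem limited to the four properties seen in ScrapePage. The outputFolder parameter and the sample values in Main are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Reduced model: only the properties that appear in ScrapePage.
public class BranchItem
{
    public string Name { get; set; }
    public string Description { get; set; }
    public string SeoTitle { get; set; }
    public string PhoneNumber { get; set; }
}

public static class RenderSketch
{
    // Writes one indented JSON file per item into outputFolder.
    public static void RenderLocationData(IEnumerable<BranchItem> locationData,
                                          string outputFolder = "output")
    {
        Directory.CreateDirectory(outputFolder); // does nothing if it already exists
        var options = new JsonSerializerOptions { WriteIndented = true };

        foreach (var item in locationData)
        {
            // Name is the last URL segment, which keeps a trailing '/'
            // when the source URL ends with one.
            var fileName = item.Name.Trim('/') + ".json";
            var json = JsonSerializer.Serialize(item, options);
            File.WriteAllText(Path.Combine(outputFolder, fileName), json);
        }
    }

    public static void Main()
    {
        var items = new List<BranchItem>
        {
            new BranchItem
            {
                Name = "springfield/",
                SeoTitle = "Springfield Branch",
                PhoneNumber = "(555) 123-4567"
            }
        };

        RenderLocationData(items);
        Console.WriteLine(File.Exists(Path.Combine("output", "springfield.json")));
    }
}
```

Writing one file per location, rather than a single large array, makes it easy to re-run the scraper for a handful of pages and replace only their files.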

The library is a little old, but dead simple and easy to use.

Happy Scraping!
