Site Scraping with HtmlAgilityPack and ScrapySharp
Recently a customer asked me to scrape data from a deprecated site and migrate that data to their new site. Since the amount of data on the old site was large, it made sense to write a .NET Core command line program to do the scraping and serialization to JSON for us.
The command line program is used as follows:
LocationScraper [location_csv_input_file]
Where location_csv_input_file is a simple CSV listing all the pages we wish to have scraped.
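Since the input CSV has no header row (note the HasHeaderRecord = false configuration below), CsvHelper maps columns by position. The record class isn't shown in the original post, so here is a hypothetical sketch of what it could look like; the class name LocationCsvRecord is an assumption, though the LocationPath property is the one Main reads:
using CsvHelper.Configuration.Attributes;

// Hypothetical record type for the input CSV (name assumed; the original
// class isn't shown). With no header row, the single column is bound by index.
public class LocationCsvRecord
{
    [Index(0)]
    public string LocationPath { get; set; }
}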
The Main(string[] args) method is as we would expect:
private static void Main(string[] args)
{
    if (args.Length < 1)
    {
        Console.Error.WriteLine("Outputs individual location files in json format to 'output' folder.");
        Console.Error.WriteLine("usage: LocationScraper location_csv_input_file");
        return;
    }
    try
    {
        ConfigureServices();
        var config = new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            HasHeaderRecord = false
        };
        using (var reader = new StreamReader(args[0]))
        {
            using (var inputCSV = new CsvReader(reader, config))
            {
                // Read the urls to be scraped (LocationCsvRecord is the record type sketched above)
                List<string> locationPaths = inputCSV.GetRecords<LocationCsvRecord>().Select(csvRecord => csvRecord.LocationPath).ToList();
                // Instantiate our page scraper, via IOC
                IScraperBase pageScraper = _serviceProvider.GetServices<IScraperBase>().First(s => s.GetType() == typeof(PageScraper));
                // Scrape the desired data from each of the urls
                var locationData = ScrapeLocation(locationPaths, pageScraper);
                // Write the location data to multiple JSON files
                RenderLocationData(locationData);
            }
        }
    }
    catch (Exception ex)
    {
        Console.Error.WriteLine("Could not parse location paths from specified input file");
        Console.Error.WriteLine(ex.Message);
        Console.Error.WriteLine(ex.StackTrace);
    }
    Console.WriteLine("Report generation complete");
}
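ConfigureServices() and the _serviceProvider field aren't shown in the post; here's a minimal sketch of what they might look like, assuming Microsoft.Extensions.DependencyInjection (the concrete ScrapingToolsRepository class name is also an assumption):
using Microsoft.Extensions.DependencyInjection;

private static ServiceProvider _serviceProvider;

private static void ConfigureServices()
{
    // Register the scraping helpers and every per-site scraper against
    // IScraperBase, so Main can select one via GetServices<IScraperBase>()
    var services = new ServiceCollection();
    services.AddSingleton<IScrapingToolsRepository, ScrapingToolsRepository>();
    services.AddTransient<IScraperBase, PageScraper>();
    // ...additional per-site scrapers would be registered the same way
    _serviceProvider = services.BuildServiceProvider();
}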
An abridged version of ScrapeLocation(IEnumerable<string>, IScraperBase) follows:
public static List<BranchItem> ScrapeLocation(IEnumerable<string> locationPaths, IScraperBase scraper)
{
    List<BranchItem> locationData = new List<BranchItem>();
    foreach (var locationPath in locationPaths)
    {
        Uri locationUri = new Uri(locationPath);
        if (locationUri.Segments.Length >= 1)
        {
            try
            {
                var scrapedContent = scraper.ScrapePage(locationUri);
                if (scrapedContent != null)
                {
                    locationData.Add(scrapedContent);
                }
                else
                {
                    Console.Error.WriteLine($"\t\tNo content found for {locationPath}");
                }
            }
            catch (Exception e)
            {
                // Log the failure and move on to the next location
                Console.Error.WriteLine("\t\tError scraping location. Exception: {0}", e);
                continue;
            }
        }
    }
    return locationData;
}
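RenderLocationData isn't shown in full either; a minimal sketch, assuming System.Text.Json and one output file per branch written to the 'output' folder mentioned in the usage message, might look like this:
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public static void RenderLocationData(IEnumerable<BranchItem> locationData)
{
    Directory.CreateDirectory("output");
    var options = new JsonSerializerOptions { WriteIndented = true };
    foreach (var branchItem in locationData)
    {
        // BranchItem.Name (the last URL segment) doubles as the file name
        var path = Path.Combine("output", $"{branchItem.Name}.json");
        File.WriteAllText(path, JsonSerializer.Serialize(branchItem, options));
    }
}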
The real fun is in the IScraperBase and PageScraper classes.
IScraperBase defines a single function:
namespace LocationScraper.Repositories
{
    using System;
    using Models;

    public interface IScraperBase
    {
        BranchItem ScrapePage(Uri locationUri);
    }
}
Using IoC, I created a number of concrete implementations of IScraperBase, each devoted to a single source page with its own unique HTML and CSS. The following is one of them:
public class PageScraper : IScraperBase
{
    // IScrapingToolsRepository is a custom class with methods to extract data
    // from HTML elements
    protected readonly IScrapingToolsRepository _scrapingToolsRepository;
    protected readonly ScrapySharp.Network.ScrapingBrowser _scrapingBrowser;

    public PageScraper(IScrapingToolsRepository scrapingToolsRepository)
    {
        _scrapingToolsRepository = scrapingToolsRepository;
        _scrapingBrowser = new ScrapySharp.Network.ScrapingBrowser();
    }

    public BranchItem ScrapePage(Uri branchUri)
    {
        ScrapySharp.Network.WebPage webPage = _scrapingBrowser.NavigateToPage(branchUri);
        var branchPage = webPage?.Html;
        if (branchPage == null)
        {
            return null;
        }
        // Instantiate our model, using the last segment of the incoming Uri as the name of the future JSON file
        var branchItem = new BranchItem { Name = branchUri.Segments[^1] };
        // Get the branch SEO Description, and extract the Content value from it
        var branchDescription = branchPage.CssSelect("meta[name='description']").FirstOrDefault();
        branchItem.Description = _scrapingToolsRepository.GetAttributeValue(branchDescription, Constants.Selectors.Content);
        // Get the og:title, and extract the Content from it
        var branchSeoTitle = branchPage.CssSelect("meta[property='og:title']").FirstOrDefault();
        branchItem.SeoTitle = _scrapingToolsRepository.GetAttributeValue(branchSeoTitle, Constants.Selectors.Content);
        // Get the branch phone number from its tel: link
        var branchPhoneNumber = branchPage.CssSelect("div.location-main-block a[href*='tel:']").FirstOrDefault();
        branchItem.PhoneNumber = _scrapingToolsRepository.GetAttributeValue(branchPhoneNumber, Constants.Selectors.Href)
            .Replace("tel:", "").Replace(".", "").Replace("-", "").Trim();
        // Format the ten-digit number as (###) ###-#### using a custom numeric format string
        branchItem.PhoneNumber = $"{Convert.ToInt64(branchItem.PhoneNumber):(###) ###-####}";
        // etc
        return branchItem;
    }
}
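The IScrapingToolsRepository methods aren't shown in the post. Since CssSelect() returns HtmlAgilityPack HtmlNode objects, GetAttributeValue can be a thin, null-safe wrapper around HtmlNode.GetAttributeValue; the following is a sketch of one plausible implementation, not the original:
using HtmlAgilityPack;

public class ScrapingToolsRepository : IScrapingToolsRepository
{
    // Null-safe attribute lookup: returns an empty string when the node
    // was not found or the attribute is missing
    public string GetAttributeValue(HtmlNode node, string attributeName)
    {
        return node?.GetAttributeValue(attributeName, string.Empty) ?? string.Empty;
    }
}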
After execution completes, we'll have a lovely directory full of JSON files containing the data from the source pages, as extracted by ScrapySharp's ScrapySharp.Extensions.CssSelect().
The library is a little old, but dead simple and easy to use.
Happy Scraping!