Site Scraping with HtmlAgilityPack and ScrapySharp
Recently a customer asked me to scrape data from a deprecated site and apply that data to their new site. As the amount of data on the old site was large, it made sense to create a .NET Core command line program that would do the scraping and serialize the results to JSON for us.
Usage of the command line program is as follows:
LocationScraper [location_csv_input_file]
Where location_csv_input_file is a simple CSV listing all the pages we wish to scrape.
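For example, the input file might contain one fully qualified URL per line, with no header row (hence HasHeaderRecord = false below). The URLs here are illustrative, not from the actual project:

https://www.oldsite.example/locations/downtown-branch
https://www.oldsite.example/locations/riverside-branch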
The Main(string[] args) method is much as we would expect:
private static void Main(string[] args)
{
    if (args.Length < 1)
    {
        Console.Error.WriteLine("Outputs individual location files in json format to 'output' folder.");
        Console.Error.WriteLine("usage: LocationScraper location_csv_input_file");
        return;
    }

    try
    {
        ConfigureServices();

        var config = new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            HasHeaderRecord = false
        };

        using (var reader = new StreamReader(args[0]))
        using (var inputCSV = new CsvReader(reader, config))
        {
            // Read the urls to be scraped
            List<string> locationPaths = inputCSV.GetRecords<LocationRecord>()
                .Select(csvRecord => csvRecord.LocationPath)
                .ToList();

            // Instantiate our page scraper, via IOC
            IScraperBase pageScraper = _serviceProvider.GetServices<IScraperBase>()
                .First(s => s.GetType() == typeof(PageScraper));

            // Scrape the desired data from each of the urls
            var locationData = ScrapeLocation(locationPaths, pageScraper);

            // Write the location data to multiple JSON files
            RenderLocationData(locationData);
        }
    }
    catch (Exception ex)
    {
        Console.Error.WriteLine("Could not parse location paths from specified input file");
        Console.Error.WriteLine(ex.Message);
        Console.Error.WriteLine(ex.StackTrace);
    }

    Console.WriteLine("Report generation complete");
}
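ConfigureServices() and the CSV record type aren't shown above. A minimal sketch of both might look like the following, assuming Microsoft.Extensions.DependencyInjection for the container; the LocationRecord class name and the exact registrations are illustrative, not the actual project code:

using CsvHelper.Configuration.Attributes;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical CSV record: each headerless row holds a single page URL
public class LocationRecord
{
    [Index(0)]
    public string LocationPath { get; set; }
}

internal static partial class Program
{
    private static ServiceProvider _serviceProvider;

    private static void ConfigureServices()
    {
        _serviceProvider = new ServiceCollection()
            .AddSingleton<IScrapingToolsRepository, ScrapingToolsRepository>()
            // One registration per page-specific scraper;
            // Main resolves the one it wants by concrete type
            .AddTransient<IScraperBase, PageScraper>()
            .BuildServiceProvider();
    }
}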
An abridged version of ScrapeLocation(IEnumerable<string> locationPaths, IScraperBase scraper) looks like this, where BranchItem is our data model:
public static List<BranchItem> ScrapeLocation(IEnumerable<string> locationPaths, IScraperBase scraper)
{
    List<BranchItem> locationData = new List<BranchItem>();

    foreach (var locationPath in locationPaths)
    {
        Uri locationUri = new Uri(locationPath);

        if (locationUri.Segments.Length >= 1)
        {
            try
            {
                var scrapedContent = scraper.ScrapePage(locationUri);

                if (scrapedContent != null)
                {
                    locationData.Add(scrapedContent);
                }
                else
                {
                    Console.Error.WriteLine($"\t\tNo content found for {locationPath}");
                }
            }
            catch (Exception e)
            {
                Console.Error.WriteLine("\t\tError scraping hospital. Exception: {0}", e);
                continue;
            }
        }
    }

    return locationData;
}
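RenderLocationData(...) isn't shown in this post, but a minimal sketch could be as simple as the following, assuming System.Text.Json and one output file per item, named after the model's Name property:

using System.IO;
using System.Text.Json;

public static void RenderLocationData(IEnumerable<BranchItem> locationData)
{
    Directory.CreateDirectory("output");

    foreach (var branchItem in locationData)
    {
        // One JSON file per scraped location
        var json = JsonSerializer.Serialize(branchItem, new JsonSerializerOptions { WriteIndented = true });
        File.WriteAllText(Path.Combine("output", $"{branchItem.Name}.json"), json);
    }
}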
The real fun is in the IScraperBase and PageScraper classes.
IScraperBase defines a single function:
namespace LocationScraper.Repositories
{
    using System;
    using Models;

    public interface IScraperBase
    {
        BranchItem ScrapePage(Uri locationUri);
    }
}
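For reference, a minimal sketch of the BranchItem model, limited to the properties used in this post (the real model presumably carries more fields):

namespace LocationScraper.Models
{
    public class BranchItem
    {
        public string Name { get; set; }
        public string Description { get; set; }
        public string SeoTitle { get; set; }
        public string PhoneNumber { get; set; }
    }
}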
Using IOC, I created a number of concrete instances of IScraperBase solely devoted to a single source page with its unique HTML and CSS. The following example is one of them:
public class PageScraper : IScraperBase
{
    // IScrapingToolsRepository is a custom class with methods to extract data
    // from HTML elements
    protected readonly IScrapingToolsRepository _scrapingToolsRepository;
    protected readonly ScrapySharp.Network.ScrapingBrowser _scrapingBrowser;

    public PageScraper(IScrapingToolsRepository scrapingToolsRepository)
    {
        _scrapingToolsRepository = scrapingToolsRepository;
        _scrapingBrowser = new ScrapySharp.Network.ScrapingBrowser();
    }

    public BranchItem ScrapePage(Uri branchUri)
    {
        ScrapySharp.Network.WebPage webPage = _scrapingBrowser.NavigateToPage(branchUri);
        var branchPage = webPage?.Html;

        if (branchPage == null)
        {
            return null;
        }

        // Instantiate our model, using the incoming Uri as the name of the future JSON file
        var branchItem = new BranchItem
        {
            Name = branchUri.Segments[^1]
        };

        // Get the branch SEO Description, and extract the Content value from it
        var branchDescription = branchPage.CssSelect("meta[name='description']").FirstOrDefault();
        branchItem.Description = _scrapingToolsRepository.GetAttributeValue(branchDescription, Constants.Selectors.Content);

        // Get the og:title, and extract the Content from it
        var branchSeoTitle = branchPage.CssSelect("meta[property='og:title']").FirstOrDefault();
        branchItem.SeoTitle = _scrapingToolsRepository.GetAttributeValue(branchSeoTitle, Constants.Selectors.Content);

        // Get the branch phone, strip the tel: prefix and punctuation, then reformat
        var branchPhoneNumber = branchPage.CssSelect("div.location-main-block a[href*='tel:']").FirstOrDefault();
        branchItem.PhoneNumber = _scrapingToolsRepository.GetAttributeValue(branchPhoneNumber, Constants.Selectors.Href)
            .Replace("tel:", "")
            .Replace(".", "")
            .Replace("-", "")
            .Trim();
        branchItem.PhoneNumber = $"{Convert.ToInt64(branchItem.PhoneNumber):(###) ###-####}";

        // etc
        return branchItem;
    }
}
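The implementation of IScrapingToolsRepository isn't shown here, but since CssSelect() returns HtmlAgilityPack nodes, GetAttributeValue could plausibly be a null-safe wrapper over HtmlAgilityPack's own HtmlNode.GetAttributeValue; the class below is a sketch along those lines, not the actual project code:

using HtmlAgilityPack;

public class ScrapingToolsRepository : IScrapingToolsRepository
{
    // Returns the named attribute's value, or an empty string if the node
    // wasn't found on the page or lacks the attribute
    public string GetAttributeValue(HtmlNode node, string attributeName)
    {
        return node?.GetAttributeValue(attributeName, string.Empty) ?? string.Empty;
    }
}

Keeping the scrapers null-tolerant like this matters in practice: deprecated sites are rarely consistent, and a missing meta tag shouldn't abort the whole run.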
After execution completes, we'll have a lovely directory full of JSON files containing the data from the source pages, as extracted by ScrapySharp's CssSelect() extension method (from ScrapySharp.Extensions).
The library is a little old, but it's dead simple to use.
Happy Scraping!