Migrating Content from Any Website to Sitecore Content Hub One

Sergey Yatsenko
Sitecore Technology MVP & Sr. Director

I've seen my fair share of challenges when it comes to moving content between platforms. Recently, I faced an interesting task: migrating my company's technical blog, with over 10 years of content (mostly about Sitecore). In this post, I'll share my experience of migrating a large amount of inconsistently written and formatted content from a legacy website to Sitecore Content Hub One. I'll walk you through extracting the content, cleaning it up, and uploading it to Sitecore Content Hub One, and give you a practical approach that you can use for your own content migration projects.


The Migration Approach

Before diving into the technical details, let me outline the approach I took to ensure a successful migration:

  1. Content Extraction with JavaScript tools: I chose Puppeteer and Cheerio for content scraping. Puppeteer excels at handling dynamic JavaScript-heavy sites by providing a real browser environment, while Cheerio offers an efficient way to parse and extract content from HTML using familiar jQuery-like syntax. This combination proved to be powerful for handling various content structures and formats.
  2. Content Refinement with AI: Over a decade of blogging naturally led to inconsistencies in formatting and writing style. I leveraged ChatGPT to standardize the content's tone and format while preserving technical accuracy. This step significantly improved the overall quality and readability of the migrated content.
  3. Uploading Content to Content Hub One: The final step involved pushing the refined content to Sitecore Content Hub One, taking advantage of its robust content modeling and API capabilities.

Here's a visual representation of the process:

[Figure: diagram of the migration process]
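
The three steps above can be sketched as a small pipeline. This is a minimal illustration rather than the exact code used for the migration: `migratePost` and the `extract`, `refine`, and `upload` stage names are hypothetical, and each stage is passed in so it can be swapped out or skipped (the AI step, for example, is optional).

```javascript
// Hypothetical orchestrator: runs extract -> refine -> upload for one URL.
// The stage functions are supplied by the caller; all names are illustrative.
const migratePost = async (url, { extract, refine, upload }) => {
  const raw = await extract(url);
  // The refinement stage is optional; fall back to the raw content if omitted
  const refined = refine ? await refine(raw) : raw;
  return upload(refined);
};

// Example wiring with stand-in stages (replace with real implementations)
const demo = () =>
  migratePost("https://example.com/blog/post-1", {
    extract: async (url) => ({ title: "Post 1", content: "  raw text  ", url }),
    refine: async (post) => ({ ...post, content: post.content.trim() }),
    upload: async (post) => ({ id: "item-1", ...post }),
  });
```

Keeping the stages separate also makes it possible to rerun only the upload step against content that has already been scraped and saved to disk.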

A Note on Sitecore Content Hub One

Sitecore Content Hub One is a headless content management system (CMS) designed to provide seamless content delivery across multiple channels. It offers flexibility and scalability, making it an ideal choice for modern digital experiences. As part of the broader Sitecore ecosystem, it provides powerful tools for content management and delivery.


Prerequisites

Before starting the migration process, ensure you have the following:

  1. Node.js installed on your system
  2. Access to a Sitecore Content Hub One instance
  3. Required npm packages: puppeteer, cheerio, turndown, @sitecore/contenthub-one-sdk


Step 1: Scraping Content from the Source Website

The first step in the migration process is to scrape content from the source website. This involves extracting text, images, and other relevant data. Our implementation uses Puppeteer for headless browsing and Cheerio for HTML parsing.


Setting Up the Scraper

// Import the required packages
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");
const fs = require("fs");
const path = require("path");

// Configuration
// Set the root directory for storing the scraped data
const dataRoot = "./data";
// Set the root directory for storing the scraped pages
const pagesRoot = "./data/pages";
// Set whether to save images while scraping
const saveImages = true;


Implementing the Scraper

/**
 * Extracts content from a given page and saves it to the file system.
 *
 * @param {string} pageUrl - The page URL to extract content from.
 * @param {TurndownService} turndownService - The Turndown service instance for converting HTML to Markdown.
 */
const extractDetailsPages = async (pageUrl, turndownService) => {
  const browser = await puppeteer.launch(); // Launch a headless browser instance

  try {
    const page = await browser.newPage(); // Open a new tab

    // Navigate to the page and wait for network activity to settle
    await page.goto(pageUrl, { waitUntil: "networkidle2" });
    const content = await page.content(); // Get the rendered HTML

    // Parse the HTML with Cheerio
    const $ = cheerio.load(content);

    // Extract the title and body; adjust the selectors to match the source site
    const title = $("h1").first().text().trim();
    const bodyContent = $(".article-body").html() || "";
    const markdown = turndownService.turndown(bodyContent); // Convert HTML to Markdown

    // Save the extracted content to the file system
    // (saveBlogData is a helper that is not shown in this snippet)
    await saveBlogData({
      title,
      content: markdown,
      // Add other metadata
    });
  } finally {
    await browser.close(); // Always release the browser, even on errors
  }
};
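
The scraper above calls `saveBlogData`, which is not shown in this post. A minimal sketch of such a helper might look like the following, assuming each post is stored as a JSON file named after a slug of its title. The `slugify` helper, the file layout, and the optional `rootDir` parameter are assumptions made for illustration.

```javascript
const fs = require("fs");
const path = require("path");

// Turn a title into a file-system-friendly slug (hypothetical helper)
const slugify = (title) =>
  title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");

// Write one post to <rootDir>/<slug>.json and return the file path.
// rootDir defaults to the pages folder from the scraper configuration.
const saveBlogData = async (blogData, rootDir = "./data/pages") => {
  await fs.promises.mkdir(rootDir, { recursive: true });
  const filePath = path.join(rootDir, `${slugify(blogData.title)}.json`);
  await fs.promises.writeFile(
    filePath,
    JSON.stringify(blogData, null, 2),
    "utf8"
  );
  return filePath;
};
```

Saving each post to disk before any further processing means the scraping step only has to run once, even if the cleanup or upload steps need several attempts.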


Handling Images

/**
 * Saves an image from a given URL to a given folder path.
 *
 * @param {string} url - The URL of the image to save.
 * @param {string} folderPath - The folder path where the image should be saved.
 * @returns {Promise<void>} - A promise that resolves when the image is saved.
 */
const saveImage = async (url, folderPath) => {
  // Get the file name from the URL path so query strings are not included
  const fileName = path.basename(new URL(url).pathname);

  // Construct the full path where the image should be saved
  const filePath = path.join(folderPath, fileName);

  // Make sure the target folder exists
  await fs.promises.mkdir(folderPath, { recursive: true });

  // Fetch the image from the given URL (global fetch requires Node.js 18+)
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to download ${url}: ${response.status}`);
  }

  // Read the response body into a buffer and write it to disk
  const buffer = Buffer.from(await response.arrayBuffer());
  await fs.promises.writeFile(filePath, buffer);
};
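
Image `src` attributes on older pages are often relative, so they need to be resolved against the page URL before they can be downloaded. Here is a small sketch using Node's built-in `URL` class. `resolveImageUrls` is a hypothetical helper name, and the list of `src` values is assumed to come from the parsed HTML.

```javascript
// Resolve raw src values against the page URL, skipping inline data: URIs
// and anything that is not http(s), and removing duplicates.
const resolveImageUrls = (srcs, pageUrl) => {
  const resolved = new Set();
  for (const src of srcs) {
    if (!src || src.startsWith("data:")) continue;
    try {
      const absolute = new URL(src, pageUrl);
      if (absolute.protocol === "http:" || absolute.protocol === "https:") {
        resolved.add(absolute.href);
      }
    } catch {
      // Ignore values that cannot be parsed as URLs
    }
  }
  return [...resolved];
};
```

The resulting absolute URLs can then be passed one by one to an image download function such as `saveImage`.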


Step 2: Converting Content to Markdown

Once the content is scraped, we convert it to Markdown format using the Turndown library. This makes it easier to maintain and import into Sitecore Content Hub One.

// Import Turndown (HTML-to-Markdown converter)
const TurndownService = require("turndown");

// Create a new instance of TurndownService
// The constructor takes an object with configuration options
const turndownService = new TurndownService({
  // Use ATX headings (# symbols) instead of Turndown's default setext style
  headingStyle: "atx",

  // Fence code blocks with triple backticks instead of indenting them
  codeBlockStyle: "fenced",

  // Treat <code> elements as preformatted so whitespace inside them
  // is not collapsed during conversion
  preformattedCode: true,
});

// Add a custom rule that fences every <pre> element with triple backticks,
// whether or not it wraps a <code> element
turndownService.addRule("fenceAllPreformattedText", {
  filter: ["pre"],
  replacement: function (content, node, options) {
    // Use the raw text of the whole <pre> element; reading only the first
    // child fails on an empty <pre> and can return just leading whitespace
    return (
      "\n\n" +
      options.fence +
      "\n" +
      node.textContent +
      "\n" +
      options.fence +
      "\n\n"
    );
  },
});
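
One edge case with fenced output: if the scraped code itself contains a run of three backticks, a three-backtick fence can close early. A common Markdown technique is to use a fence that is longer than any backtick run in the content. The helpers below sketch that idea; `fenceFor` and `fenceCode` are hypothetical names and are not part of Turndown.

```javascript
// Return a backtick fence that is longer than any run of backticks in text
// (minimum length three, as required for fenced code blocks).
const fenceFor = (text) => {
  const runs = text.match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return "`".repeat(Math.max(3, longest + 1));
};

// Wrap code in a fence that is safe for its content,
// dropping trailing newlines so the closing fence sits right after the code.
const fenceCode = (code) => {
  const fence = fenceFor(code);
  return `\n\n${fence}\n${code.replace(/\n+$/, "")}\n${fence}\n\n`;
};
```

A function like `fenceCode` could be called from a custom rule's `replacement` callback in place of the fixed fence.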


Step 3: Content Cleanup with AI (Optional)

For enhanced content quality, you can use AI services like ChatGPT to clean and format the content. Here's an example of how to integrate with OpenAI's API:

// Uses the openai npm package (v4 or later)
import OpenAI from "openai";

// Create the client once and reuse it across calls
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const cleanupContent = async (content) => {
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content:
          "Refine the following blog post content to fix formatting, syntax/grammar errors, etc. while preserving its original meaning and style:",
      },
      {
        role: "user",
        content: content,
      },
    ],
  });

  // The refined text is in the first choice of the response
  return response.choices[0].message.content;
};
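
Since the goal is to change tone and formatting without altering technical content, it can help to check that code samples survive the AI pass unchanged. The sketch below compares fenced code blocks before and after refinement. `extractCodeBlocks` and `codeBlocksPreserved` are hypothetical helpers, and the regular expression assumes simple triple-backtick fences.

```javascript
// Build the fence marker programmatically so this snippet is easy to embed
const FENCE = "`".repeat(3);
const CODE_BLOCK = new RegExp(`${FENCE}[^\\n]*\\n([\\s\\S]*?)${FENCE}`, "g");

// Collect the bodies of all fenced code blocks in a Markdown string
const extractCodeBlocks = (markdown) =>
  [...markdown.matchAll(CODE_BLOCK)].map((match) => match[1].trim());

// True when both versions contain the same code blocks in the same order
const codeBlocksPreserved = (original, refined) => {
  const before = extractCodeBlocks(original);
  const after = extractCodeBlocks(refined);
  return (
    before.length === after.length &&
    before.every((block, index) => block === after[index])
  );
};
```

If the check fails for a post, one option is to keep the original Markdown for that post and flag it for manual review.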


Step 4: Uploading to Sitecore Content Hub One

The final step involves uploading the processed content to Sitecore Content Hub One using their SDK.


Setting Up the Client

import {
  ClientCredentialsScheme,
  ContentHubOneClientFactory,
  ContentHubOneClientOptions,
} from "@sitecore/contenthub-one-sdk";

const createClient = () => {
  const options = new ContentHubOneClientOptions(
    process.env.CONTENT_HUB_URL,
    process.env.CLIENT_ID,
    process.env.CLIENT_SECRET
  );

  return ContentHubOneClientFactory.create(options);
};
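
Because the client is configured from environment variables, a missing value tends to surface later as a confusing authentication error. A small up-front check is one way to catch this early. The variable names match those used above; `missingEnvVars` and `assertEnv` are hypothetical helpers.

```javascript
// Environment variables read by the client setup in this post
const REQUIRED_ENV_VARS = ["CONTENT_HUB_URL", "CLIENT_ID", "CLIENT_SECRET"];

// Return the names of required variables that are unset or blank
const missingEnvVars = (env = process.env, required = REQUIRED_ENV_VARS) =>
  required.filter((name) => !env[name] || env[name].trim() === "");

// Throw early with a clear message if anything is missing
const assertEnv = (env = process.env) => {
  const missing = missingEnvVars(env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
};
```

Calling `assertEnv()` at the start of the migration script stops the run before any scraping or uploading begins.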


Uploading Content

// createItem and uploadMedia are helper functions (not shown here) that wrap
// the corresponding Content Hub One SDK calls
const uploadBlogPost = async (client, blogData) => {
  try {
    // Create the blog post item in Content Hub One
    const contentItem = await createItem(client, {
      name: blogData.title,
      fields: {
        content: {
          value: blogData.content,
        },
        publishDate: {
          value: blogData.date,
        },
        author: {
          value: blogData.author,
        },
        // Add other fields as needed
      },
    });

    // Upload associated images, if the post has any
    for (const image of blogData.images ?? []) {
      await uploadMedia(client, image, contentItem.id);
    }

    return contentItem;
  } catch (error) {
    // Log the failure and rethrow so the caller can decide how to proceed
    console.error(`Error uploading blog post: ${error.message}`);
    throw error;
  }
};
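
Before calling an upload function like the one above, it is worth checking that each scraped post has the fields the upload expects. The sketch below validates the `title`, `content`, `date`, and `images` properties used in this post. `validateBlogData` is a hypothetical helper, and the specific rules are assumptions to adapt to your own content model.

```javascript
// Return a list of human-readable problems; an empty list means the post looks valid
const validateBlogData = (blogData) => {
  const problems = [];
  if (typeof blogData.title !== "string" || blogData.title.trim() === "") {
    problems.push("title is missing");
  }
  if (typeof blogData.content !== "string" || blogData.content.trim() === "") {
    problems.push("content is missing");
  }
  if (blogData.date === undefined || Number.isNaN(Date.parse(blogData.date))) {
    problems.push("date is missing or not parseable");
  }
  if (blogData.images !== undefined && !Array.isArray(blogData.images)) {
    problems.push("images must be an array");
  }
  return problems;
};
```

Posts that fail validation can be written to a separate report instead of being uploaded, which keeps incomplete items out of Content Hub One.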


Best Practices and Considerations

  1. Error Handling: Implement robust error handling for both the scraping and uploading processes.
  2. Rate Limiting: Respect the source website's robots.txt and implement appropriate delays between requests.
  3. Content Validation: Verify the content structure before uploading to Sitecore.
  4. Media Handling: Ensure all media files are properly processed and linked.
  5. Metadata Preservation: Maintain important metadata during the migration process.
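
The error-handling and rate-limiting points above can be sketched with two small helpers: one that retries a failing async operation with a growing delay, and one that processes items sequentially with a pause between them. `withRetry` and `processSequentially` are hypothetical names, and the default delays are arbitrary starting points rather than recommended values.

```javascript
// Resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async operation, doubling the wait after each failed attempt
const withRetry = async (operation, { attempts = 3, baseDelayMs = 500 } = {}) => {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // All attempts failed: surface the most recent error
  throw lastError;
};

// Process items one at a time, pausing between them to avoid flooding the server
const processSequentially = async (items, handler, delayMs = 1000) => {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    results.push(await handler(items[i]));
    if (i < items.length - 1) await sleep(delayMs);
  }
  return results;
};
```

For example, a list of post URLs could be passed to `processSequentially` with a handler that wraps the scraping call in `withRetry`.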


Conclusion

Migrating content to Sitecore Content Hub One is a streamlined process that can greatly enhance your content management capabilities. By following this guide and utilizing the provided code examples, you can efficiently transfer content from any website to Sitecore, ensuring a seamless digital experience.

The solution provided here is flexible and can be adapted to various content structures and requirements. Whether you're migrating a small blog or a large corporate website, the principles and approaches outlined in this guide will help you achieve a successful migration.

Remember to test thoroughly in a staging environment before proceeding with the actual migration, and always keep backups of your content throughout the process.