SOLR on Azure Kubernetes Service (AKS)
Background
SOLR is the backbone of getting your information to clients quickly and affordably when you use Sitecore. Over the years we have implemented many different versions of SOLR and SOLR replication types. Our newest and proudest is SOLR with ZooKeeper on AKS.
This deployment of SOLR with ZooKeeper on AKS makes the most of SOLR's strengths. It uses ZooKeeper to keep all the SOLR instances up to date, which is faster than SOLR's Master-Slave replication. It also allows any number of SOLR instances at a fraction of the cost. Another benefit is the self-healing of AKS compared with Virtual Machines.
Setup and Automation
The setup of the new SOLR with ZooKeeper is completely command-line based. There is no need to create a VM, RDP into it, install all the requirements, and set everything up exactly right, which we found takes about four to six hours per Virtual Machine on average. Instead, an entire cluster can be spun up in under two hours.
Once we had created a solid process for deploying SOLR on AKS, we scripted it and began automating it. We integrated our scripts into Azure DevOps, so an automated pipeline can now deploy a full SOLR cluster on AKS in under 25 minutes. We tokenized all the different variables to maximize reusability. That means that, using the same pipeline, we could deploy three different environments in under an hour and a half, with all three clusters completely independent from each other and ready to be connected to their Sitecore infrastructure counterparts.
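To illustrate what tokenizing the variables can look like, here is a minimal sketch of token substitution. The #{NAME}# token syntax, the template, and the environment values are all hypothetical and chosen for illustration; the actual pipeline may use a different mechanism.

```python
import re

# Hypothetical token syntax: #{NAME}# (an assumption, not taken from the pipeline itself).
TOKEN_PATTERN = re.compile(r"#\{(\w+)\}#")


def render_template(template: str, values: dict[str, str]) -> str:
    """Replace every #{NAME}# token with the value for one environment."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            # Fail loudly rather than deploy with an unresolved token.
            raise KeyError(f"no value supplied for token {name!r}")
        return values[name]

    return TOKEN_PATTERN.sub(substitute, template)


# Hypothetical template and per-environment values, for illustration only.
TEMPLATE = "cluster=#{CLUSTER_NAME}# replicas=#{SOLR_REPLICAS}#"
ENVIRONMENTS = {
    "dev": {"CLUSTER_NAME": "solr-dev", "SOLR_REPLICAS": "3"},
    "qa": {"CLUSTER_NAME": "solr-qa", "SOLR_REPLICAS": "3"},
    "prod": {"CLUSTER_NAME": "solr-prod", "SOLR_REPLICAS": "3"},
}

if __name__ == "__main__":
    # One template, three independent environments.
    for env_name, env_values in ENVIRONMENTS.items():
        print(env_name, render_template(TEMPLATE, env_values))
```

Because the template is shared and only the values differ, adding a fourth environment is a matter of adding one more set of values.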
Extending to Sitecore Commerce
There is very little difference between using this setup for Sitecore and for Sitecore Commerce. You just need to add the additional commerce cores via an API request and connect SOLR to the additional commerce connection strings.
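As a rough sketch of the API request involved: SOLR running with ZooKeeper operates in SolrCloud mode, where cores are created as collections through the Collections API (action=CREATE). The helper below only builds the request URL. The base URL, collection names, and config set name are assumptions for illustration and should be replaced with the values from your own environment.

```python
from urllib.parse import urlencode


def create_collection_url(
    base_url: str,
    name: str,
    config_name: str,
    num_shards: int = 1,
    replication_factor: int = 3,
) -> str:
    """Build a SolrCloud Collections API CREATE request URL."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical values: adjust the host, names, and config set to your cluster.
SOLR_BASE_URL = "http://solr.example.internal:8983/solr"
COMMERCE_COLLECTIONS = ["CatalogItemsScope", "CustomersScope", "OrdersScope"]

if __name__ == "__main__":
    for collection in COMMERCE_COLLECTIONS:
        print(create_collection_url(SOLR_BASE_URL, collection, "commerce_config"))
```

Each printed URL can then be requested with any HTTP client. A replication factor of 3 is used here to match the three SOLR nodes described in this article.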
Performance Considerations & Auto-healing
For the AKS nodes, we chose Linux because SOLR and ZooKeeper are native to Linux. ZooKeeper specifically does not run as well on Windows nodes for AKS.
When we compared AKS SOLR with ZooKeeper against Windows VM SOLR with Master-Slave replication, we found that the Master VM receives the change and then takes about five minutes to send the updated data to the two Slaves. With AKS, ZooKeeper is much faster at replicating data between the nodes. Also, with AKS you have three dedicated nodes running SOLR, whereas with Master-Slave you only have two Slaves. This gives you more uptime in case something goes wrong.
Recovery is also faster. If a Slave goes down, you must log in to the VM and restart the SOLR Service, which takes about 5 to 10 minutes. If both Slaves go down, you need to restart both SOLR Services, and you are looking at a 15-minute response time. With AKS SOLR with ZooKeeper, on the other hand, we are seeing an average self-healing time of between 45 and 90 seconds for a single pod. When we tested killing all three pods, it took only about 5 minutes before all three were fully operational again.
The AKS SOLR nodes can be scaled out to any number you want; you are only limited by budget. Another advantage is cost: AKS SOLR is between 50% and 66% cheaper to run each month than SOLR on Windows VMs.
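One way to observe self-healing is to poll the SolrCloud Collections API (action=CLUSTERSTATUS) and watch for replicas that are not in the active state. The sketch below only covers the parsing step. The nested response layout reflects our understanding of the CLUSTERSTATUS JSON output, and the sample data, collection name, and replica names are made up for illustration.

```python
def inactive_replicas(cluster_status: dict) -> list[str]:
    """Return 'collection/shard/replica' for every replica that is not active."""
    problems = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for collection_name, collection in collections.items():
        for shard_name, shard in collection.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                if replica.get("state") != "active":
                    problems.append(f"{collection_name}/{shard_name}/{replica_name}")
    return sorted(problems)


# Hypothetical, trimmed-down response: one pod has just been killed.
SAMPLE_STATUS = {
    "cluster": {
        "collections": {
            "sitecore_master_index": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"state": "active"},
                            "core_node2": {"state": "down"},
                            "core_node3": {"state": "active"},
                        }
                    }
                }
            }
        }
    }
}

if __name__ == "__main__":
    print(inactive_replicas(SAMPLE_STATUS))
```

Polling this check on a fixed interval and recording the time between the first non-empty result and the next empty one gives a simple measure of recovery time.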
Lessons Learned
There were a few lessons learned:
The deployment will fail if you do not contact Microsoft and raise the limit on cores available in each region. (The default maximum is 10 cores.)
The only thing we need to assist you in setting the environment up is an Azure Service Principal.
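The core limit in the first lesson can be checked before a deployment with simple arithmetic. This is a minimal sketch: the node counts and cores per node below are hypothetical, and the default of 10 is the figure quoted above, so check the actual quota on your own subscription.

```python
def required_cores(node_count: int, cores_per_node: int) -> int:
    """Total cores the AKS node pool will request."""
    return node_count * cores_per_node


def fits_quota(
    node_count: int,
    cores_per_node: int,
    quota: int = 10,       # default regional limit quoted in this article
    cores_in_use: int = 0,  # cores already consumed in the region
) -> bool:
    """True if the planned node pool fits within the regional core quota."""
    return cores_in_use + required_cores(node_count, cores_per_node) <= quota


if __name__ == "__main__":
    # Hypothetical sizing: three nodes with four cores each needs 12 cores.
    print(fits_quota(node_count=3, cores_per_node=4))
    # The same three nodes with two cores each needs 6 cores.
    print(fits_quota(node_count=3, cores_per_node=2))
```

If the check fails, request a quota increase before running the pipeline rather than after a failed deployment.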
Demonstration video coming soon!
Happy coding!