Sitecore 8.2 Lucene Crawling Error
Missing PDF Indexing
Recently a client running Sitecore 8.2 had a problem with their Lucene index.
They started seeing a huge number of warnings like this one in the crawling log:
6932 11:05:46 WARN Could not compute value for ComputedIndexField: _content for indexable: sitecore://master/{0C39B29F-EED1-40A0-BB7B-E6D5BC5F6883}?lang=en&ver=1
Exception: System.Runtime.InteropServices.COMException
Message: Exception from HRESULT: 0x80048605
Source: Sitecore.ContentSearch
at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.IPersistStream.Load(IStream stream)
at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.InitializeFilterAsPersistStream(IFilter filter, String fileName)
at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.LoadAndInitIFilter(String fileName, String extension)
at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterReader..ctor(String fileName)
at Sitecore.ContentSearch.ComputedFields.MediaItemIFilterTextExtractor.ComputeFieldValue(IIndexable indexable)
at Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor.ComputeFieldValue(IIndexable indexable)
at Sitecore.ContentSearch.LuceneProvider.LuceneDocumentBuilder.<>c__DisplayClass12_0.b__0(IComputedIndexField computedIndexField, ParallelLoopState parallelLoopState)
The underlying issue was that the Adobe PDF IFilter, which Sitecore uses to extract text from PDF files for indexing, had gone missing. You may be wondering where you can get a copy of this now thoroughly dated software. I found some older blog posts that pointed to a now-defunct Adobe download site, but eventually stumbled on an Adobe FTP server with a copy of the required installer:
ftp://ftp.adobe.com/pub/adobe/acrobat/win/11.x/PDFFilter64Setup.msi
After I installed this, the errors went away and the PDF content was indexed properly.
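If you want to gauge how many media items were affected, or confirm that the warnings have really stopped after the install, you can tally them per item from the crawling log. The Python sketch below was not part of the original fix. It assumes the warning lines follow exactly the format shown in the excerpt above, and `failed_indexables` is a hypothetical helper name chosen for illustration.

```python
import re
from collections import Counter

# Matches the warning line format shown in the log excerpt above.
# Assumption: every failing item is logged with this exact wording.
WARNING_RE = re.compile(
    r"WARN\s+Could not compute value for ComputedIndexField: (?P<field>\S+) "
    r"for indexable: (?P<uri>sitecore://\S+)"
)


def failed_indexables(log_lines, field="_content"):
    """Count computed-field warnings per indexable URI for one field.

    log_lines is any iterable of strings, such as an open log file.
    Lines that are not matching warnings (stack frames, etc.) are skipped.
    """
    counts = Counter()
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match and match.group("field") == field:
            counts[match.group("uri")] += 1
    return counts
```

To use it, open a crawling log, pass the file object to `failed_indexables`, and print the result of `.most_common()` on the returned counter. An empty counter for a log written after the install suggests the `_content` field is being computed again.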