
Custom Crawler for Parsing PDF files with Sitecore

I recently had to create a crawler in my Sitecore 6.5 project that looked at PDF files in the Media Library and called an external API to get a list of PDF files to parse and index. You can do this with Sitecore, but the examples for doing it are old and no longer work. It is not a hard process; it was the lack of working examples that made it painful to put all the parts together. I will break this into two parts: 1) create a custom crawler and 2) set up PDF indexing. You need both to make PDF indexing happen, and for both, at least for me, there were no working examples I could find.

Create a Custom Crawler

For the crawler I started with Sitecore’s documentation (section 3.2). The setup it describes did not work for me as written, but it introduces what a crawler is and takes you most of the way there.

1) Create the new class

public class FileCrawler : BaseCrawler, ICrawler
2) Implement the two interface methods.
void ICrawler.Add(IndexUpdateContext context)
void ICrawler.Initialize(Index index)

3) Add some properties

public string Root { get; set; }
public string Database { get; set; }

These properties are set via the configuration file you will set up in a moment. Root defines where in the Sitecore tree your crawler starts working. Database lets you change, via config, whether the crawler looks at the web or master database.

Add is the main entry point to your code. Initialize is also called, but you may or may not want to do anything in it. I did not want to do anything there other than a little logging. The Add method is where all my code started. With this you have all the code you need for a custom crawler. You can put a breakpoint somewhere inside the Add method, and once you have done the next steps you should be able to hit it (yes, that means you can attach as you normally would for Sitecore debugging and debug the crawler).

Here is what my class now looks like:

using System.Diagnostics;      // Stopwatch
using Sitecore;
using Sitecore.Diagnostics;    // Log
using Sitecore.Search;
using Sitecore.Search.Crawlers;
 
public class FileCrawler : BaseCrawler, ICrawler
{
    public string Root { get; set; }
    public string Database { get; set; }
    public float Boost { get; set; }
    long _totalProcessedSize;
    int _fileCount;
    int _successCount;
    int _failureCount;
 
   void ICrawler.Add(IndexUpdateContext context)
   {
       _fileCount = 0;
       _totalProcessedSize = 0;
       _successCount = 0;
       _failureCount = 0;
 
       Stopwatch watch = Stopwatch.StartNew();
 
       ParseMediaLibraryFiles(context);
 
       watch.Stop();
       Log.Info(string.Format("Finished parsing files -- Total files:{0}(Errors:{1}-Success:{2}) -- {3}m:{4}s:{5}ms -- Total bytes {6}",_fileCount, _failureCount,_successCount, watch.Elapsed.Minutes, watch.Elapsed.Seconds,
           watch.Elapsed.Milliseconds, _totalProcessedSize), this);
   }
   void ICrawler.Initialize(Index index)
   {
       Log.Info("File Crawler Init", this);
   }
}

The call to ParseMediaLibraryFiles is where I will get into the PDF part of this post, but for now this is the code for my crawler.

4) Set up your config file

Sitecore’s documentation tells you to create a FileCrawler.config file. If you don’t already have a file in your project that holds your custom indexing configuration, you will need to set this new config file up. If you already have one, you can just add a new index, or a new entry inside an existing index's locations element, in that file (these files are located in <website>/app_config/include). Using the config details they provided I ran into all types of issues, getting errors like “AddIndex method not found” or “Add method not found.” Here is what I set up to get it working.

 
<search>
 <configuration>
   <indexes>
        <index patch:after="index[@id='system']" id="MyIndexName" type="Sitecore.Search.Index, Sitecore.Kernel">
            <param desc="name">$(id)</param>
            <param desc="folder">MyIndexName</param>
            <Analyzer ref="search/analyzer" />
            <locations hint="list:AddCrawler">
                <tqsFiles type="MyNamespace.FileCrawler, MyNamespace">
                    <Database>web</Database>
                    <Root>/sitecore/media library/files/resources</Root>
                    <Tags>PDFFiles</Tags>
                    <Boost>1.0</Boost>
                </tqsFiles>
            </locations>
        </index>
   </indexes>
  </configuration>
</search>

Once you have this you should be able to go into Sitecore -> Control Panel -> Databases -> Rebuild Search Indexes and see your new index (“MyIndexName”). If you see your new index, you can attach to the w3wp process and put a breakpoint in the Add method. When your breakpoint is ready, make sure your new index is checked and click “Rebuild.” You should hit your breakpoint. That is it. You can now write whatever custom code you want in here, using your Database and Root properties to know where to look for the data. The context object passed into the “Add” method is what you use to add new documents to the index. Just make sure you call “context.AddDocument().” Without this the index will never get updated with your information.

Setup PDF Indexing

Now let's set up some code that grabs all the files in the media library and indexes any PDF file it finds. Again I was able to find an old Sitecore document on this subject (section 2.3 and chapter 5) that got me started, but it did not work on its own. Chapter 5 provides a link to some open source libraries you will need, but they are old. Updated libraries for PDFBox can be found here. I will not bore you with all my code here, just the important methods for PDF parsing.

First, download the PDFBox zip file from the link above. When you unzip it, make sure you unblock the files; if you don't, you will get errors when trying to build. Add these dlls as references to your project. The zip file comes with a lot of dlls and I am not sure when each one is needed. Some are loaded only at runtime and are not needed at build time, so make sure they end up in your bin folder as well.

Once you have the dlls and references set up you are ready for the main methods. I will touch on two methods here. The ParsePDF method does what you would think: it takes the stream from a media item and parses it. The AddPDFContent method takes a Lucene.Net Document object and adds the index fields to the document.

protected virtual void AddPDFContent(Document document, MediaItem media)
{
   _totalProcessedSize += media.Size;
   _fileCount++;
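   // Open the media stream once and dispose it when done; it can be null if the item has no file.
   using (Stream stream = media.GetMediaStream())
   {
       if (stream != null && stream.CanRead)
       {
           document.Add(this.CreateTextField(BuiltinFields.Content, this.ParsePDF(stream, media.Name)));
           document.Add(this.CreateTextField(BuiltinFields.Name, media.Name));
       }
   }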
}
private string ParsePDF(Stream mediaStream, string fileName)
{
   PDDocument doc = null;
   ikvm.io.InputStreamWrapper wrapper = null;
 
   try
   {
       Stream stream = mediaStream;
       wrapper = new ikvm.io.InputStreamWrapper(stream);
       doc = PDDocument.load(wrapper);
       PDFTextStripper stripper = new PDFTextStripper();
       var docText = stripper.getText(doc);
 
       _successCount++;
 
       return docText;
   }
   catch (Exception Ex)
   {
       Log.Error("Error parsing " + fileName + " for indexing", Ex, this.GetType());
       _failureCount++;
       return String.Empty;
   }
   finally
   {
       // Close each resource on its own so the wrapper is released even if load() failed.
       if (doc != null)
       {
           doc.close();
       }
       if (wrapper != null)
       {
           wrapper.close();
       }
   }
}

The ParsePDF method does the work of reading the stream from the PDF file and getting the text from it. Then it just returns that string to the AddPDFContent method which puts that string in the BuiltinFields.Content field (when looking at the index this is the “_content” field).

You can add a text field (this.CreateTextField) or a data field (this.CreateDataField) to the document. Text fields are used by the index to find hits, and data fields let you programmatically access information about the document when a hit is found. So if you want a piece of data to be both searchable by the index and accessible from code, add both a text field and a data field for that value.
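
A minimal sketch of that pairing might look like the following. The helper name AddExtensionFields and the "extension" field name are made up for illustration and are not part of the crawler shown above; the method is assumed to sit inside the FileCrawler class so that the BaseCrawler helpers are available.

```csharp
using Lucene.Net.Documents;
using Sitecore.Data.Items;

// Sketch only: a hypothetical member of FileCrawler (which derives from BaseCrawler).
protected virtual void AddExtensionFields(Document document, MediaItem media)
{
    // Hypothetical field name, chosen for this example.
    const string extensionField = "extension";

    // Text field: lets the index match searches against the value.
    document.Add(this.CreateTextField(extensionField, media.Extension));

    // Data field: lets your code read the value back from a hit.
    document.Add(this.CreateDataField(extensionField, media.Extension));
}
```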

That is it. Now just pass in the root path to where your PDF files are, get the media stream from those files and call these methods.
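
As an illustration only, a traversal along these lines would tie the pieces together. It assumes that every descendant of Root is inspected, that only items whose extension is "pdf" are indexed, and that one Lucene document is created per file. The body of ParseMediaLibraryFiles here is a sketch under those assumptions, not a definitive implementation.

```csharp
using System;
using Lucene.Net.Documents;
using Sitecore.Configuration;
using Sitecore.Data;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;
using Sitecore.Search;

// Sketch only: assumed to be a member of FileCrawler, next to AddPDFContent and ParsePDF.
private void ParseMediaLibraryFiles(IndexUpdateContext context)
{
    // Database and Root are the properties populated from the config file.
    Database database = Factory.GetDatabase(this.Database);
    Item root = database.GetItem(this.Root);
    if (root == null)
    {
        Log.Warn("File Crawler root not found: " + this.Root, this);
        return;
    }

    foreach (Item item in root.Axes.GetDescendants())
    {
        MediaItem media = new MediaItem(item);

        // Skip folders and anything that is not a PDF (assumption: extension is stored without a dot).
        if (!string.Equals(media.Extension, "pdf", StringComparison.OrdinalIgnoreCase))
        {
            continue;
        }

        Document document = new Document();
        this.AddPDFContent(document, media);

        // Without this call nothing is written to the index.
        context.AddDocument(document);
    }
}
```

Loading all descendants at once is the simplest approach; for a very large media library you may prefer to walk the tree folder by folder instead.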

After I had finished my coding I finally did find a good example on CodeProject. Hopefully between this post and that one you can get what you need.
