
Content Tagging with Open Calais on Sitecore

As content authors build more and more content, managing and surfacing that content becomes a bigger challenge. A common request is a way for content authors to tag their content with relevant information to help organize and surface it. With Sitecore, this can be done via Sitecore Cortex Content Tagging.
As with almost everything in Sitecore, it is an extensible framework. It uses the Open Calais AI to process content and return standard tagging information. Enabling it takes very little. The Sitecore documentation covers most of the steps you need, and a few other blog posts out there add details as well. What I found lacking in all of them, though, is a working sample of the configuration you need. All you need is a small patch file that supplies your API token.
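
A minimal sketch of such a patch file follows. It assumes the access-token setting is named Sitecore.ContentTagging.OpenCalais.CalaisAccessToken; confirm the name against the Sitecore.ContentTagging.OpenCalais.config in your installation before relying on it. The file name in the comment is hypothetical, and the token value is a placeholder.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical file name: App_Config/Include/zzz/ContentTagging.OpenCalais.Token.config -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <!-- Assumed setting name; verify it in Sitecore.ContentTagging.OpenCalais.config -->
      <setting name="Sitecore.ContentTagging.OpenCalais.CalaisAccessToken">
        <!-- Placeholder: replace with the token shown by "Display My API Token" -->
        <patch:attribute name="value">YOUR-API-TOKEN</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>
```

Keeping the token in its own patch file, rather than editing the shipped config, means the change should survive upgrades and can be swapped per environment.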

Finding your API key can be a little tricky. Log in to Open Calais, click APIs in the navigation, and then click the big button that says "Display My API Token".

Once you have set this up and followed the steps outlined in the documentation to tag some content, how do you customize and manage this process? First, where do all these tags go? After the API call is made to Open Calais, Sitecore pushes the tags back into Sitecore and stores them under /sitecore/system/Settings/Buckets/TagRepository. This is a Sitecore item bucket, so you can search for specific tags, and it stays manageable even as it grows large over time.

Now let's look at what happens when running this process and applying tags.


  • Providers: Provide the essential business logic for how content is tagged. 
    • IContentProvider: Extracts the content from Sitecore and preps it for analysis. 
    • IDiscoveryProvider: Processes the taggable content and extracts tag data.
    • ITaxonomyProvider: Tags all discovered tag data and creates a list of tags that should be assigned.
    • ITagger: Assigns tags to the object being tagged. 
  • Configuration: Control how all the business logic is pulled together. 
  • Pipelines: Extension points for granular control over tagging
    • getTaggingConfiguration: This determines the configuration set to use. It holds two processors. 
      • GetDefaultConfigurationName: This processor is pretty simple; it just looks for a setting with the key Sitecore.ContentTagging.DefaultConfigurationName and returns its value (surprise, it is "Default").
      • BuildConfiguration: Uses the selected configuration name to build and return an ItemContentTaggingProvidersSet.
    • tagContent: Run configured providers
      • RetrieveContent: Uses the configured IContentProviders to select fields. The default provider skips any field whose name starts with a double underscore ("__") and only processes the fields and field types defined in the configuration.
      • Normalize: This processor is actually responsible for triggering another pipeline, "normalizeContent", which out of the box has only one processor that strips HTML from the field.
      • GetTags: Loops through all DiscoveryProviders to get tags for the content generated in the steps above. The default is the OpenCalaisProvider. If you want to get a feel for what content Open Calais returns, you can use their test API. Open Calais returns four groups, and this provider imports all of them regardless of their relevance score.
        • SocialTags: Social tags are derived from the Wikipedia folksonomy. They are periodically updated to keep them current.
        • Industry: The industry Open Calais thinks the content aligns with.
        • Entities: This can have references to companies and products for those companies.
        • Topics: The general topic of the content.
      • StoreTags: This will take all the tags that were returned and add those tags to the above-mentioned storage location. 
      • ApplyTags: This will save the needed tags to the "selected" area for the item that was being tagged. 
    • normalizeContent: Prepares TaggableContent

The Sitecore.ContentTagging.Core.config sets up default providers for all of these. Note that there is also a Sitecore.ContentTagging.OpenCalais.config that overrides the default provider with the Open Calais provider. 
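
The configuration name read by GetDefaultConfigurationName can be overridden with a standard Sitecore patch file. The sketch below assumes a configuration set named "Custom" has already been defined next to the default one; that name is hypothetical and only used for illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <!-- "Custom" is a hypothetical configuration name; the shipped value is "Default" -->
      <setting name="Sitecore.ContentTagging.DefaultConfigurationName">
        <patch:attribute name="value">Custom</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>
```

With this in place, BuildConfiguration should build the ItemContentTaggingProvidersSet from the "Custom" configuration instead of "Default", which is one way to swap providers without touching the shipped config files.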

Points to Note:

  • Open Calais is free, but it does have limits. The main one is that it will only process 1,000 records a day.
  • Using Open Calais means using a standard ontology. If you need a custom ontology or other customizations, you will need a different provider or something in the pipeline that applies your customizations.

