Friday, December 12, 2008


OpenCalais is a tool created by Thomson Reuters for extracting metadata from documents. It uses an engine called ClearForest (a recent acquisition by Reuters) to run a series of NLP (Natural Language Processing) based text analytics on any content sent to it.

Why is OpenCalais interesting as compared to other tools of its kind?
1. It's free.
2. It can be called as a Web Service.
3. It's open to the public and out there on the interwebs.

Here's an example of how you'd use OpenCalais in semi-real life:

Pretend you're an insurance broker with 1,000 clients (you're doing quite
well). Each transaction you make with a client produces a document of some
sort, which you've been storing on your intranet. Someone asks you this
question: How many of your clients have insured properties in Paris, France over
the years? How would you get an accurate count? This information is certainly in
the documents you've created, but it would take hours of relentless searching to
get an answer. So we end up either not knowing statistics like this - or paying
for expensive software to make us enter this information during our workflow.

This is a trivial problem for OpenCalais. OpenCalais extracts entities like people, places, and companies right out of the box. You send it a document, and it sends back all those entities. Simple! Our insurance broker could easily sort his documents by city, person, or company. It does more than just people and places, though: there are over 100 types of entities that OpenCalais looks for by default, from movies to stock splits to medical conditions. Thomson Reuters has chosen entities that will be useful to professionals from various industries.
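To make the broker example concrete, here is a minimal Python sketch of the "index documents by entity" idea. It does not call OpenCalais: `extract_entities` is a hypothetical stand-in that only matches a small hard-coded list of names, and the documents, client names, and entity type labels are invented for illustration. In a real setup, that function would be replaced by a call to the web service.

```python
from collections import defaultdict

# Hypothetical stand-in for an entity extraction service.
# It only recognizes names from this tiny hard-coded list.
KNOWN_ENTITIES = {
    "Paris": "City",
    "London": "City",
    "Alice Martin": "Person",
    "Bob Jones": "Person",
}


def extract_entities(text):
    """Return a set of (entity_type, name) pairs found in the text."""
    return {(etype, name) for name, etype in KNOWN_ENTITIES.items() if name in text}


def build_index(documents):
    """Map each (entity_type, name) pair to the ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for entity in extract_entities(text):
            index[entity].add(doc_id)
    return index


# Invented sample documents, keyed by document id.
documents = {
    "policy-001": "Alice Martin insured an apartment in Paris.",
    "policy-002": "Bob Jones insured a warehouse in London.",
    "policy-003": "Bob Jones added a second property in Paris.",
}

index = build_index(documents)

# Documents that mention Paris.
paris_doc_ids = index.get(("City", "Paris"), set())
paris_docs = sorted(paris_doc_ids)

# People who appear in at least one document that also mentions Paris.
paris_clients = sorted(
    name
    for (etype, name), doc_ids in index.items()
    if etype == "Person" and doc_ids & paris_doc_ids
)

print(paris_docs)          # ['policy-001', 'policy-003']
print(len(paris_clients))  # 2
```

Once the index exists, the "how many clients have insured properties in Paris?" question becomes a set lookup instead of hours of searching. Note that co-occurrence in a document is only a rough proxy; it cannot tell an insured property in Paris from a passing mention of the city.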

Now we get to the part about what makes OpenCalais special. There are other engines out there that extract metadata from documents, sure. But OpenCalais can do this on the fly as a web service. Imagine if every time you wrote a Word document, it extracted the relevant entities for you and presented you with a list. You could tell it to automatically create links to any company names. Perhaps you're a doctor and every time you write a client's name in a document you'd like it to automatically link to their medical records.

I like the idea of on the fly metadata extraction. Even now, as I write this blog, I'm wondering how I will tag it. Hmm.. if only I had some sort of way to automatically extract entities from my documents and create tags.
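As a rough sketch of that tagging idea, the Python snippet below turns a list of extracted entities into blog tags. The input list is invented (including the entity type labels) and simply stands in for whatever an extraction service would return; `entities_to_tags` is a hypothetical helper, not part of OpenCalais.

```python
import re


def entities_to_tags(entities, max_tags=5):
    """Turn (entity_type, name) pairs into lowercase, hyphenated tags."""
    tags = []
    for _etype, name in entities:
        # Replace every run of non-alphanumeric characters with one hyphen.
        tag = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
        # Skip empty results and duplicates, keeping first-seen order.
        if tag and tag not in tags:
            tags.append(tag)
    return tags[:max_tags]


# Invented example of what an extraction step might return for this post.
entities = [
    ("Company", "Thomson Reuters"),
    ("Company", "ClearForest"),
    ("Technology", "OpenCalais"),
    ("Company", "Thomson Reuters"),
]

print(entities_to_tags(entities))
# ['thomson-reuters', 'clearforest', 'opencalais']
```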

What would you use it for?

1 comment:

T Tague said...


Tom from Open Calais here.

First, thanks for noticing Calais. It's great that you took the time to think of a couple of use cases.

I also wanted to point out the power of event extraction in Calais. Not only will it locate entities (people, places, things), but also events like natural disasters, management changes, and many others. Each of those events is then tied back to the entities it has found.

Also, if you're a WordPress blogger or Drupal user, Calais is ready for full integration in your CMS today. Take a look at for an example.