Building a Semantic Recommendation Engine: the Sequel

by Michael Malgeri

Since we discussed the movie business in this post, a sequel seemed appropriate.

First, some background. Our children are in their late teens, but over the years we’ve taken them to almost every animated film produced in the twenty-first century, and plenty of earlier ones too. While there have been some flops, as parents we’ve marveled at how often the writers manage to inject entertaining dialogue for adults while keeping the young ones glued to their seats with simpler chatter and, of course, amazing visuals. They even manage to layer in multiple themes and “moral of the story” messages targeted appropriately at kids and parents. Bravo to these super-talented folks!

Given that, let’s say a fictitious content provider named Netflux knows that a consumer has kids and takes them to the animated hits. The easy thing to do is recommend similar age-appropriate films. But let’s say Netflux really wants to hit the mark. Knowing the parent’s Twitter handle, it decides to leverage social media. Note: in prep for this post, I sent a few tweets that mentioned The Incredibles, Shrek and Finding Nemo along with the hashtags #banter, #repartee and #witticisms.

The #testrec hashtag made it convenient to gather these tweets in an array, but Netflux could just as easily retrieve them via the consumer’s Twitter handle.

Since The Incredibles is a film about superheroes, an obvious recommendation would be the Spider-Man series. However, using search techniques like synonym matching, co-occurrence and stemming, along with custom semantic inferencing rules, Netflux can REALLY impress the consumer by recommending NOT Spider-Man but Monsters vs. Aliens. How would that work?

First, the creators of Monsters vs. Aliens would have to tag the film with descriptive metadata and share it with Netflux. Detailed tagging is table stakes for accurate media recommendation engines. Content providers are not just tagging title-level metadata but also annotating each scene with information about characters, talent, locations, storylines, costumes, product placements and a variety of other attributes, all of which provide valuable insight into the media.
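
To make the tagging idea concrete, here is a minimal sketch of what a tagged title might look like. The field names, scene entries and tag values are all assumptions made up for illustration; they are not a real studio or Netflux schema.

```javascript
// Hypothetical metadata for one title. Every field name and value below is
// an illustrative assumption, not an actual schema or actual scene data.
const film = {
  title: "Monsters vs. Aliens",
  tags: ["animation", "family", "witty dialogue"],
  scenes: [
    { id: 1, characters: ["Susan", "Derek"], tags: ["banter"] },
    { id: 2, characters: ["B.O.B."], tags: ["slapstick"] },
  ],
};

// Collect every distinct tag on the film, title-level and scene-level alike,
// so that a search or recommendation step can treat them uniformly.
function allTags(doc) {
  const sceneTags = doc.scenes.flatMap((scene) => scene.tags);
  return [...new Set([...doc.tags, ...sceneTags])];
}

console.log(allTags(film));
```

Flattening scene-level tags up to the title is only one possible choice; a real engine might keep them separate so it can weight a whole-film trait more heavily than a single scene.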

Next, the assumption is that Netflux tracks information about its consumers, e.g. SS#, bank accounts…just kidding. Using a Twitter handle, Netflux can grab a consumer’s tweets (see the code in Appendix 1, which can be used in MarkLogic’s Query Console) and process them in MarkLogic’s operational data hub (ODH) as follows:

  1. Load all tweets into a staging database as-is. The tweets are loaded into a structure that preserves the original content but allows incremental enrichments to be collected in other areas of the structure. The “envelope” pattern is used for this purpose: it wraps each tweet so that semantic facts and other types of metadata can be collected alongside it.
  2. Harmonize the tweets by leveraging an enrichment service, a process that could tag movie titles and sentiment words and also generate semantic facts such as:
    1. @mmalgeri tweeted #banter
    2. @mmalgeri tweeted #repartee
    3. @mmalgeri tweeted #witticisms
    4. @mmalgeri tweeted "The Incredibles"
    5. @mmalgeri tweeted "Shrek"
    6. @mmalgeri tweeted "Finding Nemo"
  3. Further harmonize these tweets by associating words like banter with synonyms such as repartee, reducing words like witticisms to stems such as witticism, and adding these synonyms and stems to the tweet document.
  4. Perform co-occurrence analysis on these documents to determine which sentiment words appear with movie titles.
  5. Create custom inferencing rules that conclude:
    1. If @mmalgeri has tweets about movies, and
    2. @mmalgeri tweets synonymous sentiment words about movies,
    3. then recommend a movie with the same or synonymous sentiment words
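
Steps 1 through 4 above can be sketched in plain JavaScript. This is a rough illustration under stated assumptions, not MarkLogic data hub code: the tweet wording is placeholder text (including which title goes with which hashtag), the envelope layout is only loosely modeled on the headers/triples/instance idea, and the title list, synonym ring and one-line stemmer are hypothetical stand-ins for a real enrichment service.

```javascript
// Placeholder tweets (assumption): the wording and the pairing of each
// title with a hashtag are invented for this sketch.
const rawTweets = [
  { user: "@mmalgeri", text: "The Incredibles: top-notch #banter #testrec" },
  { user: "@mmalgeri", text: "Shrek is packed with #repartee #testrec" },
  { user: "@mmalgeri", text: "Finding Nemo, so many #witticisms #testrec" },
];

// Assumed lookup data; a real enrichment service would supply these.
const KNOWN_TITLES = ["The Incredibles", "Shrek", "Finding Nemo"];
const SYNONYM_RING = ["banter", "repartee", "witticism"];

// Toy stemmer for this example only: lowercases and drops one trailing "s".
const stem = (word) => word.toLowerCase().replace(/s$/, "");

// Step 1: wrap the untouched tweet in an envelope with room for enrichments.
function toEnvelope(tweet) {
  return { envelope: { headers: {}, triples: [], instance: tweet } };
}

// Steps 2 and 3: tag titles and sentiment words, emit facts, add synonyms.
function harmonize(doc) {
  const { headers, triples, instance } = doc.envelope;
  const hashtags = instance.text.match(/#\w+/g) || [];
  const sentimentTags = hashtags.filter((tag) =>
    SYNONYM_RING.includes(stem(tag.slice(1)))
  );
  headers.titles = KNOWN_TITLES.filter((title) => instance.text.includes(title));
  headers.sentiment = sentimentTags.map((tag) => stem(tag.slice(1)));
  headers.synonyms = headers.sentiment.length
    ? SYNONYM_RING.filter((word) => !headers.sentiment.includes(word))
    : [];
  // Semantic facts of the form "<user> tweeted <hashtag or title>".
  for (const object of [...sentimentTags, ...headers.titles]) {
    triples.push({ subject: instance.user, predicate: "tweeted", object });
  }
  return doc;
}

// Step 4: count how often each sentiment word appears with each title.
function coOccurrence(docs) {
  const counts = {};
  for (const { envelope: { headers } } of docs) {
    for (const title of headers.titles) {
      for (const word of headers.sentiment) {
        const key = `${title}|${word}`;
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  }
  return counts;
}

const docs = rawTweets.map(toEnvelope).map(harmonize);
const counts = coOccurrence(docs);
console.log(JSON.stringify(docs[0], null, 2));
console.log(counts);
```

Note that the original tweet stays untouched under `instance`, while everything the pipeline learns lands in `headers` and `triples`. That separation is the point of the envelope.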

In other words, a smart recommendation engine would realize that @mmalgeri might like animations about heroes for his kids, but…he REALLY likes snappy dialogue. The Spider-Man series would not likely be tagged with this kind of descriptive metadata because that’s not its main characteristic. However, Monsters vs. Aliens contains an abundance of clever dialogue and is hopefully tagged accordingly… “Fresno? Fresno! In what universe is Fresno better than Paris, Derek?”
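
The inferencing rule from step 5 can be mimicked in a few lines of plain JavaScript. This is only a sketch of the rule's logic: MarkLogic's own inferencing works over semantic triples rather than arrays, and the consumer profile, the catalog tags and the `CONCEPT` mapping below are hypothetical values chosen to match the example in this post.

```javascript
// Assumed mapping from sentiment words to a shared concept. Words that map
// to the same concept are treated as synonymous.
const CONCEPT = new Map([
  ["banter", "witty-dialogue"],
  ["repartee", "witty-dialogue"],
  ["witticism", "witty-dialogue"],
]);

// Hypothetical consumer profile, as the harmonization steps might produce it.
const consumer = {
  handle: "@mmalgeri",
  titles: ["The Incredibles", "Shrek", "Finding Nemo"],
  sentiment: ["banter", "repartee", "witticism"],
};

// Hypothetical catalog; the tags are illustrative, not real studio metadata.
const catalog = [
  { title: "Spider-Man", tags: ["superhero", "action"] },
  { title: "Monsters vs. Aliens", tags: ["animation", "repartee"] },
];

function recommend(profile, films) {
  // Premise 1: the consumer has tweeted about movies.
  if (profile.titles.length === 0) return [];
  // Premise 2: the consumer used sentiment words that map to known concepts.
  const concepts = new Set(
    profile.sentiment.map((word) => CONCEPT.get(word)).filter(Boolean)
  );
  if (concepts.size === 0) return [];
  // Conclusion: recommend unseen films tagged with the same or a synonymous word.
  return films
    .filter((film) => !profile.titles.includes(film.title))
    .filter((film) => film.tags.some((tag) => concepts.has(CONCEPT.get(tag))))
    .map((film) => film.title);
}

console.log(recommend(consumer, catalog)); // [ 'Monsters vs. Aliens' ]
```

Spider-Man drops out because none of its tags map to the concept the consumer keeps tweeting about, while the repartee tag on Monsters vs. Aliens matches even though the consumer never has to use that exact word.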

Content providers can leverage features such as multi-model NoSQL and semantic documents, sophisticated search and indexing, semantic facts contained in graphs, and semantic inferencing, and combine them to create a smart recommendation engine. MarkLogic provides these features out of the box. Consider downloading the free developer’s version and, while you’re at it, check out Monsters vs. Aliens…you’ll have fun.

Appendix 1 – JavaScript code to gather tweets in QConsole
