Customized Tokenization

Mary Holstege
Last updated January 3, 2014

MarkLogic provides the ability to modify how the server tokenizes text in certain parts of documents by changing how particular characters are classified. Characters classified as space or punctuation break tokens and are not included in the index. Characters classified as symbol or word characters are included in the index: a symbol breaks tokens and is indexed as a token of its own, while a word character does not break tokens. Characters can even be removed entirely, so that they neither appear in the index nor break tokens. With this simple mechanism you can achieve some powerful effects. Let's see how.
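To make the five classes concrete, here is a minimal Python sketch of a tokenizer driven by character classes. It is a toy model, not MarkLogic's tokenizer: the names `default_class` and `tokenize` are made up for illustration, and the default classification is a rough guess based on Unicode categories.

```python
import unicodedata


def default_class(ch):
    """Crude stand-in for a default character class (an assumption)."""
    if ch.isspace():
        return "space"
    if ch.isalnum():
        return "word"
    if unicodedata.category(ch).startswith("S"):
        return "symbol"
    return "punctuation"


def tokenize(text, overrides=None):
    """Split text into index tokens, honoring per-character class overrides."""
    overrides = overrides or {}
    tokens, word = [], []
    for ch in text:
        cls = overrides.get(ch, default_class(ch))
        if cls == "remove":
            continue  # dropped: not indexed and does not break the token
        if cls == "word":
            word.append(ch)
            continue
        # space, punctuation, and symbol all end the word in progress
        if word:
            tokens.append("".join(word))
            word = []
        if cls == "symbol":
            tokens.append(ch)  # indexed as its own one-character token
    if word:
        tokens.append("".join(word))
    return tokens


text = "@NASA #MSL"
print(tokenize(text))                   # default classes
print(tokenize(text, {"@": "word"}))    # "@" becomes part of the word
print(tokenize(text, {"#": "symbol"}))  # "#" becomes a token of its own
print(tokenize("701-1212", {"-": "remove"}))  # "-" vanishes entirely
```

The overrides are a plain mapping from character to class name, which mirrors the idea of reassigning individual codepoints while leaving everything else at its default.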

A Problem: Searching Tweets

Imagine a database loaded with tweets, each with some metadata about where and when the tweet was sent and who sent it. Tweets use some punctuation in special ways: at signs (@) mark user names, and hash marks (#) mark topics. So "@NASA" is the official NASA user, and "#NASA" indicates a tweet about NASA. Here are some sample tweets encoded in XML:

Sample tweets

Suppose I want to explore this data and distinguish tweets directed at NASA from those about NASA. Since this is a large database, my application will be using unfiltered search for better performance, so I want the matches found in the index to be as accurate as possible, with no need for filtering. For this application, when I search for the user "@NASA" I want the index to only return tweets containing "@NASA" and not those with just the word "NASA" or the topic "#NASA". And when I search for the word "NASA" I do not want the index to return matches for the user "@NASA".

Punctuation is not indexed in MarkLogic, however, so "@NASA", "#NASA", and "NASA" will all show up in the indexes and the lexicons exactly the same way: as "NASA". Estimates and unfiltered searches will not distinguish these in any way. So the following searches

Fetching results from a REST server configured on port 8003
GET /search?q=NASA
GET /search?q=@NASA
GET /search?q=%23NASA

will all return the same set of tweets:

<tweet>@NASA thanks for inviting me to the social! I'm learning so much!</tweet>
<tweet>Rumour has it NASA is announcing MSL finding organic carbon at press conference Tuesday.</tweet>
<tweet>Streambed on Mars! #NASA #MSL</tweet>

Why is this? Let's enable the return-plan search option so that we can see how the query is being resolved. Create the following tweet_options.xml file and POST it to the REST server:

When we search for "@NASA" we will see that the query sent to the index doesn't even include the punctuation mark:

Customized Tokenization to the Rescue

Customized tokenization allows you to control how specific Unicode codepoints will be classified by the tokenizer, which affects how the tokenizer breaks up the text, which ultimately affects how indexing and search will handle those tokens. Documentation is available in the Custom Tokenization section of the Search Developer's Guide.

Searching for Users

Recall the goal: when I search for the user "@NASA" I want the index to only return tweets containing "@NASA", and not those with just the word "NASA" or the topic "#NASA". And when I search for the word "NASA" I do not want the index to return matches for the user "@NASA". To achieve this, I am going to instruct the tokenizer to treat the at sign as a normal word character when it appears in the body of a tweet.

The procedure is simple:

  1. Define a field that encompasses the body of the tweet.
  2. Define a tokenizer override for that field that reassigns the "@" character to the word tokenizer class.
  3. Reindex the data using the new tokenization rules.
  4. Define the field as a query constraint if it is to be used through the higher level APIs, such as the REST interface.

Since tokenization can only be customized on fields, the first step is to create a field. This can be done through the Admin interface as described in the Overview of Fields in the Administrator's Guide. It is also possible to script this using the Admin API, but here I will just use the Admin user interface to create a field over the tweet child of my documents and add the customized tokenization settings.

Tokenization overrides are set on the same page in the Admin interface.

Reindexing is necessary to apply the new tokenization rules to data already in the database. On a large database you want to be sure to set up the tokenization rules before loading data to avoid having to reindex a lot of data.

If we want to make use of the field in the higher level APIs, such as through REST or Java, we'll want to make a binding for it in the search options.

Search Options: Adding the field 'tweet'

That's it! Searches against the new field will produce different results from word searches within those elements. The search for "@NASA" only matches the tweet that has that user name in it, but the other searches are unaffected.

GET /search?options=tweet&q=tweet:NASA
<tweet>Rumour has it NASA is announcing MSL finding organic carbon at press conference Tuesday.</tweet>
<tweet>Streambed on Mars! #NASA #MSL</tweet>
GET /search?options=tweet&q=tweet:@NASA
<tweet>@NASA thanks for inviting me to the social! I'm learning so much!</tweet>
GET /search?options=tweet&q=tweet:%23NASA
<tweet>Rumour has it NASA is announcing MSL finding organic carbon at press conference Tuesday.</tweet>
<tweet>Streambed on Mars! #NASA #MSL</tweet>
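These results can be reproduced with a small Python model of an unfiltered index lookup. This is only a sketch under simplifying assumptions: `tokenize`, `build_index`, and `search` are hypothetical helpers, matching is case-sensitive, and a query simply requires every one of its tokens to be present in the index entry for a document.

```python
TWEETS = [
    "@NASA thanks for inviting me to the social! I'm learning so much!",
    "Rumour has it NASA is announcing MSL finding organic carbon at press conference Tuesday.",
    "Streambed on Mars! #NASA #MSL",
]


def tokenize(text, extra_word_chars=""):
    """Alphanumerics and extra_word_chars build tokens; everything else breaks them."""
    tokens, word = [], []
    for ch in text:
        if ch.isalnum() or ch in extra_word_chars:
            word.append(ch)
        elif word:
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


def build_index(docs, extra_word_chars=""):
    """Map each token to the set of document numbers that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for token in tokenize(text, extra_word_chars):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, extra_word_chars=""):
    """Index-only lookup: documents containing every token of the query."""
    hits = None
    for token in tokenize(query, extra_word_chars):
        docs = index.get(token, set())
        hits = docs if hits is None else hits & docs
    return sorted(hits or [])


plain = build_index(TWEETS)        # default rules: "@" breaks tokens
field = build_index(TWEETS, "@")   # field rules: "@" is a word character

for query in ("NASA", "@NASA", "#NASA"):
    print(query, search(plain, query), search(field, query, "@"))
```

Note that the query has to be tokenized with the same rules as the indexed text. That is why the override only takes effect for searches against the field, not for ordinary word searches.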

A look at the query plan explains these results: now the at sign is regarded as part of the word and is sent intact to the index:

Searching for Topics

For this application, when I search for the topic "#NASA" I want the index to only return tweets containing "#NASA" and not those with just the word "NASA" or the user "@NASA". And when I search for the word "NASA" I do not want the index to return matches for the user "@NASA", but I do want to see matches for tweets containing the topic "#NASA". This time, instead of instructing the tokenizer to treat the hash mark as a normal word character when it appears in the body of a tweet, I will instruct it to treat the hash mark as a symbol. As a symbol it will still be included in the index, but as a separate word token.

Since I already have the field defined, all I need to do is add the tokenizer override in that field to reassign the "#" character to the symbol tokenizer class. This can be done in the field configuration page in the Admin interface.

Now we get different results for the searches for "NASA" and "#NASA" within the field:

GET /search?options=tweet&q=tweet:NASA
<tweet>Rumour has it NASA is announcing MSL finding organic carbon at press conference Tuesday.</tweet>
<tweet>Streambed on Mars! #NASA #MSL</tweet>
GET /search?options=tweet&q=tweet:@NASA
<tweet>@NASA thanks for inviting me to the social! I'm learning so much!</tweet>
GET /search?options=tweet&q=tweet:%23NASA
<tweet>Streambed on Mars! #NASA #MSL</tweet>

This works because now "#NASA" is seen as a phrase of the two words "#" and "NASA". We won't find a match to this phrase in the tweet that has the bare word "NASA". On the other hand, the bare word "NASA" will find a match against the phrase "#NASA" in a tweet, just as a word search for "Mars" will match the phrase "on Mars" in the document.
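The phrase behavior can be sketched in Python as follows. Again this is a toy model rather than MarkLogic's implementation: `tokenize`, `phrase_match`, and `search` are hypothetical names, and the sketch scans the token lists of each tweet directly instead of consulting a real index.

```python
TWEETS = [
    "@NASA thanks for inviting me to the social! I'm learning so much!",
    "Rumour has it NASA is announcing MSL finding organic carbon at press conference Tuesday.",
    "Streambed on Mars! #NASA #MSL",
]


def tokenize(text, word_chars="", symbol_chars=""):
    """word_chars extend words; symbol_chars break words and become tokens."""
    tokens, word = [], []
    for ch in text:
        if ch.isalnum() or ch in word_chars:
            word.append(ch)
            continue
        if word:
            tokens.append("".join(word))
            word = []
        if ch in symbol_chars:
            tokens.append(ch)  # the symbol is kept as a separate token
    if word:
        tokens.append("".join(word))
    return tokens


def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens occur consecutively somewhere in doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def search(docs, query, word_chars="@", symbol_chars="#"):
    """Return the numbers of the documents that contain the query as a phrase."""
    q = tokenize(query, word_chars, symbol_chars)
    return [
        i for i, text in enumerate(docs)
        if phrase_match(tokenize(text, word_chars, symbol_chars), q)
    ]


print(tokenize("Streambed on Mars! #NASA #MSL", "@", "#"))
for query in ("NASA", "@NASA", "#NASA"):
    print(query, search(TWEETS, query))
```

A single-word query is just a phrase of length one, so "NASA" still matches the tweet containing "#NASA", while the two-token phrase "#" "NASA" cannot match a tweet that lacks the "#" token.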

Searching for Phone Numbers

Phone numbers are stored in an inconsistent format: some have no punctuation at all, while others use spaces or parentheses and hyphens. I want to be able to search for a phone number, either as the whole number or with a wildcard, and correctly match the actual phone number regardless of formatting. This could be solved at ingestion time by normalizing the phone numbers, but we can also use custom tokenization to achieve the same result.

First, consider the following queries and matching phone numbers, assuming the database has been configured for trailing wildcards:

GET /search?options=tweet&q=6507011212
GET /search?options=tweet&q=701-650-9921
GET /search?options=tweet&q=650*
<phone>(650)701-1212</phone>
<phone>701 6509921</phone>
GET /search?options=tweet&q=701-650*
<phone>(650)701-1212</phone>
<phone>701 6509921</phone>

Since the punctuation and spacing are inconsistent and create inconsistent token boundaries, it is hard to find searches that give consistent results. The solution is to create another field that covers the mobile child of the document, in which the relevant punctuation and space characters are reassigned to the remove tokenizer class. When a character is in this class, it is as if it didn't appear in the text stream as far as search and indexing are concerned.

Search options for field with phone numbers

Now we can search for phone numbers in this field, and it doesn't matter whether the phone number was formatted in the query or in the document. The proper matches are found:

GET /search?options=tweet&q=phone:6507011212
<phone>(650)701-1212</phone>
GET /search?options=tweet&q=phone:701-650-9921
<phone>701 6509921</phone>
GET /search?options=tweet&q=phone:650*
<phone>(650)701-1212</phone>
GET /search?options=tweet&q=phone:701-650*
<phone>701 6509921</phone>
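Both sets of phone number results can be modeled in a few lines of Python. This is a sketch under stated assumptions: `tokenize`, `term_matches`, and `search` are hypothetical helpers, a trailing "*" is treated as a prefix wildcard, and a multi-token query matches when each of its terms matches some token in the document, without checking positions, which is what the results before the fix suggest.

```python
PHONES = ["(650)701-1212", "701 6509921"]
REMOVE = "()- "  # characters reassigned to the remove class in the field


def tokenize(text, remove_chars=""):
    """Removed characters vanish; other non-alphanumerics (except '*') break tokens."""
    tokens, word = [], []
    for ch in text:
        if ch in remove_chars:
            continue  # as if the character were never in the text
        if ch.isalnum() or ch == "*":
            word.append(ch)
        elif word:
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


def term_matches(term, token):
    """Exact match, or prefix match when the term ends in a trailing wildcard."""
    if term.endswith("*"):
        return token.startswith(term[:-1])
    return token == term


def search(docs, query, remove_chars=""):
    """Return the documents in which every query term matches some token."""
    terms = tokenize(query, remove_chars)
    hits = []
    for text in docs:
        tokens = tokenize(text, remove_chars)
        if terms and all(any(term_matches(t, tok) for tok in tokens) for t in terms):
            hits.append(text)
    return hits


for query in ("6507011212", "701-650-9921", "650*", "701-650*"):
    print(query, search(PHONES, query), search(PHONES, query, REMOVE))
```

With the remove class applied, every phone number collapses to a single token of digits on both the document side and the query side, so exact and wildcard searches line up regardless of formatting.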

Analytics

The new rules can be used for more than just search. Suppose I want to find out which users are mentioned in tweets that contain a certain word. To do that, first I set up a field word lexicon on my "tweet" field. Again, this can be scripted or done on the field configuration page in the Admin interface.

Next, the field word lexicon needs to be set up as the suggestion source in the query options.

Search Options: Adding a suggestion source

By constraining the suggestions to those starting with "@", only mentioned users will be returned.

GET /suggest?options=tweet&partial-q=@&q=social
<search:suggestions xmlns:search="http://marklogic.com/appservices/search">
  <search:suggestion>@NASA</search:suggestion>
</search:suggestions>

It should be noted, however, that on a large database this technique may not perform well, and adding explicit markup is probably a better option.
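The idea behind the suggestion request can be sketched in Python as a prefix lookup over the field's words, restricted to tweets that match the query. This is a toy model, not the suggest API: `tokenize` and `suggest` are hypothetical helpers, matching is case-sensitive, and the sketch scans every tweet rather than using a lexicon.

```python
TWEETS = [
    "@NASA thanks for inviting me to the social! I'm learning so much!",
    "Rumour has it NASA is announcing MSL finding organic carbon at press conference Tuesday.",
    "Streambed on Mars! #NASA #MSL",
]


def tokenize(text, word_chars="@"):
    """Alphanumerics and word_chars build tokens; everything else breaks them."""
    tokens, word = [], []
    for ch in text:
        if ch.isalnum() or ch in word_chars:
            word.append(ch)
        elif word:
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


def suggest(docs, partial, query, word_chars="@"):
    """Field words starting with `partial`, drawn from documents matching `query`."""
    query_terms = tokenize(query, word_chars)
    words = set()
    for text in docs:
        tokens = tokenize(text, word_chars)
        if all(term in tokens for term in query_terms):
            words.update(tok for tok in tokens if tok.startswith(partial))
    return sorted(words)


print(suggest(TWEETS, "@", "social"))
```

Because "@" is a word character in the field, user names survive as whole words such as "@NASA", so a simple prefix constraint is enough to pick them out.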

Set-up Scripts

This section describes the scripts required to produce the final working setup.

  1. Execute the XQuery script setup1.xqy. You can copy this entire script into a QConsole buffer and execute it there.
  2. Execute the shell script setup2.sh. It assumes you have curl available as well as the tweet_options.xml file.
  3. See the code for the scripts below:
