MarkMail is a widely used app that allows technology professionals to easily find content across a huge variety of mailing lists. Its backend runs on MarkLogic Server and contains a searchable collection of over 60 million archived email messages from public mailing lists around the world. MarkLogic comes with lots of cool geolocation features. Using MarkLogic, one can easily write a query that searches within geographic regions, including circles, rectangles, and even arbitrary polygons.
I'm currently a junior at Brown University, where I am majoring in Economics and Computer Science. Over my winter break, I took on an internship at MarkLogic. My main task during the six weeks was to expand MarkMail to include geolocation features and to design a prototype for a new homepage that exposes, in real time, the geographic data ingested into MarkMail's servers.
So how did I accomplish my task?
We first needed to extract geographic information from emails. The Received headers of an email trace the route of the message as it is sent from one server to another. One can read through these headers to follow the path from an email's origin at the sender's client to our MarkMail SMTP servers. Each Received header contains IP addresses and DNS hostnames that identify the servers the message passed through. Using MaxMind's GeoIP database, we were able to map IP addresses and hostnames to geographic locations. However, not every IP can be mapped to a location, because some are private addresses. In that case, we simply used the location of the next server along the route that had a public IP address. Finally, to do this over a huge dataset, we used Hadoop's MapReduce with the MarkLogic Connector for Hadoop to run a batch processing job that enriched the existing emails in the database.
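To make the header walk concrete, here is a minimal Python sketch of the idea rather than the code we ran in production. It assumes IPv4 addresses only, and the function name `first_public_ip` and the sample message are made up for illustration. It walks the Received headers from the sender's end and returns the first address that is publicly routable; the geographic lookup itself (MaxMind in our case) is left out.

```python
import ipaddress
import re
from email import message_from_string

# Loose IPv4 matcher; candidates are validated by ipaddress below.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def first_public_ip(raw_message):
    """Return the public IPv4 address closest to the sender, or None.

    Hypothetical helper for illustration only.
    """
    message = message_from_string(raw_message)
    received = message.get_all("Received", [])
    # Each relay prepends its own header, so the last one is nearest the origin.
    for header in reversed(received):
        for candidate in IPV4_PATTERN.findall(header):
            try:
                address = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # matched the regex but is not a valid address
            if address.is_global:
                return str(address)
    return None


# Made-up message: the sender's laptop has a private address,
# so the first public hop is the next server along the route.
SAMPLE = (
    "Received: from mail.example.org (mail.example.org [8.8.8.8])\n"
    "\tby mx.example.net with ESMTP; Sun, 1 Jan 2012 10:00:01 -0800\n"
    "Received: from laptop (unknown [192.168.1.20])\n"
    "\tby mail.example.org with ESMTP; Sun, 1 Jan 2012 10:00:00 -0800\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(first_public_ip(SAMPLE))
```

The address returned here would then be handed to whatever GeoIP lookup is available to get a latitude and longitude.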
Some of the toughest challenges came with the next part of the project: displaying the data. Trying to display millions of emails on a map was obviously impossible. A heat map seemed like a good solution to our problem, but this would prevent users from being able to click on individual messages. We also considered a hybrid solution, where a heat map would turn into specific messages once the map reached a certain zoom level, but we thought this might be too jarring a transition for the user.
We decided to take a subset of the messages and show those. A tight cluster would implicitly create a heat-map effect. At first we tried taking a random sample of a couple hundred emails to display on the map, but this led to more problems. It turns out that certain locations send a lot more emails than others (e.g. Google's servers), so uniform random selection gives disproportionate weight to those clusters. Towards the end of my break, I realized there was a better solution: drawing a weighted random sample, which produces a decent distribution of messages over the map.
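Here is one way such a weighted sample could be drawn, as a rough Python sketch and not necessarily the exact scheme we used. The weighting is an assumption: each message gets weight 1/n, where n is the number of messages that fall in the same grid cell, so crowded locations are damped. The `lat` and `lon` field names, the `cell_size` parameter, and the function name are made up for illustration. The sampling step uses the Efraimidis-Spirakis method for weighted sampling without replacement: give each item the key u^(1/w) for a uniform random u, and keep the k largest keys.

```python
import heapq
import random
from collections import Counter


def weighted_sample(messages, k, cell_size=1.0, rng=None):
    """Pick up to k messages, down-weighting crowded grid cells.

    Hypothetical helper; each message is a dict with "lat" and "lon".
    """
    rng = rng or random.Random()

    def cell(message):
        # Bucket coordinates into square cells of cell_size degrees.
        return (int(message["lat"] // cell_size),
                int(message["lon"] // cell_size))

    counts = Counter(cell(m) for m in messages)

    # Weight w = 1 / count, so the key u ** (1 / w) becomes u ** count.
    keyed = [
        (rng.random() ** counts[cell(m)], index)
        for index, m in enumerate(messages)
    ]
    # The k largest keys form a weighted sample without replacement.
    return [messages[index] for _, index in heapq.nlargest(k, keyed)]


if __name__ == "__main__":
    # 1,000 messages from one hot spot plus 10 from scattered locations.
    hot = [{"id": i, "lat": 37.4, "lon": -122.1} for i in range(1000)]
    rest = [{"id": 1000 + i, "lat": 10.0 * i - 45, "lon": 20.0 * i - 90}
            for i in range(10)]
    picked = weighted_sample(hot + rest, 10, rng=random.Random(42))
    print(sum(1 for m in picked if m["id"] >= 1000), "of 10 from outside the hot spot")
```

With uniform sampling, almost every pick in this toy dataset would come from the hot spot; with the inverse-count weights, messages from sparse cells are far more likely to make the cut.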
Yet the project was not complete without a prototype of an improved homepage to show off the new geo capabilities of MarkMail. I designed a homepage that shows the most recent emails to arrive at our server on a map and animates the transitions between them. MarkMail's new homepage will send the user flying around the globe in real time.