Anchor Dates for Finding Recent Documents

by Dave Cassel

I wrote up a recipe for finding documents with recent dates in them, and I used this as part of the query:

A reviewer asked, "Is there a disadvantage to specifying pubdate>=xs:dateTime(xs:date('0001-01-01')) score-function=linear?" It turns out that there is.

When using score-function=reciprocal or score-function=linear, values near the anchor value will be more differentiated (and thus more useful for scoring) than values that are far away.
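To make that concrete, here is a toy model in Python. It is only a sketch of the general shape of a reciprocal curve with whole-number scores, not MarkLogic's actual formula; the function name `reciprocal_score`, the one-year scale, and the maximum of 100 are all assumptions chosen for illustration.

```python
def reciprocal_score(distance_days: float,
                     scale_days: float = 365.0,
                     max_score: int = 100) -> int:
    """Toy reciprocal curve (an assumption, not MarkLogic's formula).

    Returns max_score at the anchor and decays as distance grows.
    Rounding to a whole number is what makes far-away values collide.
    """
    return round(max_score / (1.0 + distance_days / scale_days))


# Two values 30 days apart, right next to the anchor.
near_gap = reciprocal_score(0) - reciprocal_score(30)

# Two values 30 days apart, but more than eight years from the anchor.
far_gap = reciprocal_score(3000) - reciprocal_score(3030)

print("gap near the anchor:", near_gap)
print("gap far from the anchor:", far_gap)
```

In this model the same 30-day difference moves the score by several points near the anchor and by nothing at all far from it, which is the effect described above.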

To illustrate this, let's generate some sample data.

The sample data is one hundred simple documents, each containing a date that is one to one hundred months before the current date. Because we're going to use a range query, add a date element range index.
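For readers who want to see the shape of that data, here is a minimal Python sketch of the date arithmetic. It is not the original setup script: the `<doc>` wrapper and the helper names `months_before` and `sample_docs` are hypothetical, and only the `pubdate` name is borrowed from the reviewer's question. Loading the documents and configuring the range index are left out.

```python
import calendar
from datetime import date


def months_before(anchor: date, months: int) -> date:
    """Return the date `months` calendar months before `anchor`.

    The day is clamped to the length of the target month,
    so March 31 minus one month becomes the last day of February.
    """
    index = anchor.year * 12 + (anchor.month - 1) - months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    day = min(anchor.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def sample_docs(today: date, count: int = 100) -> list[str]:
    """Build `count` tiny XML documents (hypothetical shape),
    dated 1..count months before `today`."""
    return [
        f"<doc><pubdate>{months_before(today, n).isoformat()}</pubdate></doc>"
        for n in range(1, count + 1)
    ]


docs = sample_docs(date.today())
print(len(docs), "documents; newest:", docs[0])
```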

My first query uses score-function=reciprocal to see how far the dates in the documents are from today.

When I run this, the documents come back in the correct order. The search results at zero-based indexes 15 and 16 show the first score collision, with clumps of gradually increasing size after that. We get reasonable differentiation based on how far back the documents' dates go; when combined with other relevancy factors, this should produce a good ordering.

Now let's take a look at the opposite approach: how far away are the documents from an ancient time?

All my documents have dates later than year 1, and the further they are from year 1, the higher the score should be. That sounds good, but the math behind the scenes emphasizes dates that are close to the anchor. In this case, the dates are so far from the anchor that every document got the same score. As such, this score contribution is not useful for ordering recent results.
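A toy model shows why this can happen. The sketch below assumes a linear score that grows with distance from the anchor and is capped at a maximum; the cap, the one-year scale, the 30-day approximation of a month, and the name `linear_score` are all assumptions for illustration, not MarkLogic's real implementation.

```python
from datetime import date, timedelta

ANCIENT_ANCHOR = date(1, 1, 1)  # the "year 1" anchor from the reviewer's question


def linear_score(distance_days: float,
                 scale_days: float = 365.0,
                 max_score: int = 100) -> int:
    """Toy linear curve (an assumption): grows with distance, capped at max_score."""
    return min(max_score, round(distance_days / scale_days))


def scores_from_ancient_anchor(today: date, count: int = 100) -> list[int]:
    """Score `count` dates, each roughly one more month (30 days) before `today`."""
    dates = [today - timedelta(days=30 * n) for n in range(1, count + 1)]
    return [linear_score((d - ANCIENT_ANCHOR).days) for d in dates]


scores = scores_from_ancient_anchor(date(2024, 1, 1))
print("distinct scores:", sorted(set(scores)))
```

Every sample date is roughly two thousand years from the anchor, so in this model each one hits the cap and the hundred documents share a single score.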

I also ran the experiment with dateTimes instead of dates and the results were even more dramatic. With the difference in granularity, the equations expect small differences to be significant; therefore big differences are poorly differentiated.

Conceptually, you might think you can approach distance scoring from either direction; in practice, if there's an endpoint you care more about, use that as your anchor.
