[MarkLogic Dev General] What is the Scope of Search Relevance
dsokolsky at marklogic.com
Mon Jan 26 18:13:44 PST 2009
The calculations are based on the number of fragments in the whole
database, not just those in the scope of the search, so yes, your search
scores will be affected by duplicate content in /create and /preview.
Relevance is calculated (by default) using the log-tf*idf formula (the
log of the term frequency times the inverse document frequency for the
search matches).
Relevance will only affect the order in which things are returned from a
search; it will not affect what documents match a search. Whether these
extra copies of documents will really change the relevance order depends
on the number of documents you have in the database. If you have a
relatively small number of documents in the /create and /preview
directories compared to the number of documents in the database, then it
is not likely to change the relevance order. If the duplicates make up a
large share of the database, then they might have a material impact. For
most databases and most real-world searches, I don't think it will
affect your results very much.
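To make the effect concrete, here is a minimal sketch in Python of a textbook log-tf*idf score. It is not MarkLogic's actual implementation (which works on fragments and applies its own normalization); the function names, the toy documents, and the three draft copies standing in for /create are all assumptions made up for illustration. It shows the same set of /publish documents matching either way, while the order changes once the drafts are counted in the idf statistics.

```python
import math

def idf(term, corpus):
    """Inverse document frequency: log(N / n_t) over the given corpus."""
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t) if n_t else 0.0

def score(query, doc, corpus):
    """Sum of log(1 + tf) * idf over the query terms present in doc."""
    return sum(math.log(1 + doc.count(t)) * idf(t, corpus)
               for t in query if t in doc)

def rank(query, candidates, corpus):
    """Rank the matching candidates; idf statistics come from corpus."""
    matches = {name: doc for name, doc in candidates.items()
               if any(t in doc for t in query)}
    return sorted(matches,
                  key=lambda name: score(query, matches[name], corpus),
                  reverse=True)

# Hypothetical /publish documents, already tokenized.
publish = {
    "A": ["xml", "xml", "xml", "index"],
    "B": ["xml", "index", "index", "index"],
    "C": ["index", "notes"],
    "D": ["misc"],
    "E": ["misc"],
}
# Hypothetical /create content: three versions of one unpublished draft.
drafts = [["xml", "draft"]] * 3

query = ["xml", "index"]
publish_only = list(publish.values())

# Statistics from /publish alone: "xml" is the rarer term, so A leads.
print(rank(query, publish, publish_only))           # ['A', 'B', 'C']
# Statistics from the whole database: the drafts make "xml" common,
# so "index" now carries more weight and B moves ahead of A.
print(rank(query, publish, publish_only + drafts))  # ['B', 'A', 'C']
```

Note that A, B and C match in both cases; only their order differs. With three drafts against five published documents the duplicates are a large share of this toy database, which is why the order flips. With a handful of drafts against thousands of documents the idf values would barely move.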
If you find it is affecting the relevance order, then putting them in
separate databases is the right approach. My feeling is this will not
be necessary, but it will depend on your database size, your content,
and your searches.
Hope that helps,
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Mike
Sent: Monday, January 26, 2009 5:32 PM
To: general at developer.marklogic.com
Subject: [MarkLogic Dev General] What is the Scope of Search Relevance
Is search relevance based on all documents in a database or only the
documents included in the scope of a search?
For example, assume I have three folders in the same database: /create,
/preview and /publish. /create contains multiple versions of each
document. /preview contains copies of documents in /create that a user
is considering displaying on a website. /publish contains copies of
documents in /preview that a user wants to display on a website. Thus,
/publish contains a copy of some of the documents in /preview and
/preview contains a copy of some of the documents in /create and /create
contains multiple versions of each document.
If I use XPath to limit a search to include only those documents in
/publish, will search relevance be affected by duplicate documents in
/create and /preview?
Similarly, if a user does a search for documents in all three folders
but the user only has permissions to see documents in /publish, will
search relevance be affected by duplicate documents in /create and
/preview?
Should we create separate databases for /create, /preview, and /publish
to ensure that duplicate documents do not affect search relevance?