[MarkLogic Dev General] Very puzzling bug in wildcard search results

David Sewell dsewell at virginia.edu
Wed Mar 11 13:22:47 PDT 2015

[resubmitting; first time didn't go through, apparently]

I'm trying to figure out what could possibly account for buggy results for 
wildcard searches in certain fringe cases (running MarkLogic 7.0-4.3).

I have two servers running on the same data set of 166K documents, with 
identical host, database and app server settings so far as I can determine (for 
anything related to word query at least). Ordinarily, wildcard searches on 
words return the exact same number of matches on both hosts. For example:

Pattern          H1       H2
democra*       1579     1579
demo*          4354     4354
dem*          16866    16866

But there are certain word stems that produce buggy results on H2, matching 
every document in the database when they shouldn't. Actually I should say "word 
stem" in the singular, because the buggy results all involve words starting 
with "rel". For example:

Pattern          H1       H2
religions*      138      138
religion*      2448   166618
relig*         3810   166618
reli*         14608   166618
rel*          39888    39888
re*          150890   166618
relia*         1084   166618
relie*         8306   166618
relo*           156   166618
relm*             3        3
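
One way to see that the H2 numbers are internally inconsistent, and not just 
different from H1: every document matching a longer prefix must also match any 
shorter prefix of it, so a count should never grow as the prefix gets longer. 
The following is only a rough sketch of that check in Python, assuming the 
figures are document counts. The counts are copied by hand from the table 
above, and find_violations is a hypothetical helper name, not anything from 
MarkLogic.

```python
# Counts copied by hand from the "rel" table above: pattern -> (H1, H2).
COUNTS = {
    "religions*": (138, 138),
    "religion*": (2448, 166618),
    "relig*": (3810, 166618),
    "reli*": (14608, 166618),
    "rel*": (39888, 39888),
    "re*": (150890, 166618),
    "relia*": (1084, 166618),
    "relie*": (8306, 166618),
    "relo*": (156, 166618),
    "relm*": (3, 3),
}


def find_violations(counts, host_index):
    """Return (narrow, broad) pairs where the narrower pattern has the larger count.

    host_index selects the column: 0 for H1, 1 for H2.
    """
    violations = []
    for narrow, narrow_counts in counts.items():
        narrow_prefix = narrow.rstrip("*")
        for broad, broad_counts in counts.items():
            broad_prefix = broad.rstrip("*")
            # "narrow" is strictly narrower if its prefix extends the other one.
            is_narrower = (
                narrow_prefix != broad_prefix
                and narrow_prefix.startswith(broad_prefix)
            )
            if is_narrower and narrow_counts[host_index] > broad_counts[host_index]:
                violations.append((narrow, broad))
    return sorted(violations)


if __name__ == "__main__":
    for name, index in (("H1", 0), ("H2", 1)):
        print(name, find_violations(COUNTS, index))
```

On the numbers above this flags nothing for H1. For H2 it flags only pairs 
whose broader pattern is rel*, since the longer "rel" patterns that report 
166618 all exceed the 39888 reported for rel* itself.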

I have tried unsuccessfully to find other letter sequences that exhibit the bug 
in a wildcard search or that give different result counts for H2. So far it's 
only certain "rel-" examples.

My next step will be a forced reindex of the database on H2 to see if that 
helps, but before I do that I wonder if anyone has a clue what might account 
for this behavior.

Even odder, on two entirely different systems running an entirely different 
MarkLogic software instance, "rel-" searches are also showing discrepancies, 
though I haven't researched that one as thoroughly. Some deep-level indexing 
bug, possibly?


David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: dsewell at virginia.edu   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/

