[MarkLogic Dev General] Uneven load distribution on 3-node cluster

Danny Sinang d.sinang at gmail.com
Tue Aug 21 07:11:55 PDT 2012


*Additional notes :*

text-content-1 : 51 GB
text-content-2 : 74 GB
text-content-3 : 66 GB (hosted on node 1)

*List cache hits :*

text-content-1 43,752,916
text-content-2 49,538,336
text-content-3 31,605,528 (hosted on node 1)

*Compressed tree cache hits :*

text-content-1 17,401
text-content-2 17,443
text-content-3 11,471 (hosted on node 1)


Regards,
Danny

On Tue, Aug 21, 2012 at 7:36 AM, Danny Sinang <d.sinang at gmail.com> wrote:

> Hi,
>
> We have a 3 node ML 4.2-6 cluster.
>
> Since last week, we've seen CPU usage on nodes 2 and 3 skyrocket to around
> 90% each from 6 to 10 pm, while node 1 would hit only about 30% at peak.
>
> We've seen an influx of new customers recently, which could explain the
> sudden load during that period. Moreover, it looks like we need to rewrite
> some of our code to reduce CPU usage.
>
> However, what confounds me is why node 1 isn't taking on as much load as
> the other nodes. I'm thinking maybe the following events / situations
> caused it. Hope somebody here can confirm or point me in the right
> direction.
>
> 1. Expanded Tree Cache Increase / Restart / Seg Fault / Forest Failover
>
> The night before the CPU usage spike, I had to increase the Expanded Tree
> Cache for the cluster from 8 GB to 12 GB (i.e. 12288 MB). This of course
> caused the ML cluster to automatically restart. After the restart,
> everything looked ok from the application perspective. However, two hours
> later, node 1 suddenly encountered multiple "XDMP-OLDSTAMP : Timestamp too
> old for forest" and "Segmentation fault" errors and caused multiple
> restarts. Eventually, the forests on node 1 did a local-disk failover to
> node 2. The following day, we decided to "unfailover" the forest on node 2
> back to node 1. The database status shows everything's back to normal after
> that, except of course the uneven load between the nodes.
>
>
> Questions :
>
> a. Could the forest "unfailover" have failed to tell the cluster that node
> 1 is back in business, thus causing the uneven load?
>
> b. Could it be that, despite the database status showing that the
> "unfailover" was successful, node 2 is still serving the content that
> failed over to it?
>
> c. Could the 12 GB (12288 MB) expanded tree cache be an "uneven" size that
> caused the multiple restarts, the old-timestamp errors, and the
> segmentation faults?
>
> d. Could the 12 GB expanded tree cache be causing the uneven load across
> the 3 nodes? The expanded tree cache partitions setting is 4.
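>
> For reference, the expanded tree cache size is a group-level setting, so a
> change like the 8 GB to 12 GB bump described above can be scripted with the
> Admin API. This is only a sketch (the group name "Default" is an
> assumption, not from this thread):
>
> ```xquery
> xquery version "1.0-ml";
> import module namespace admin = "http://marklogic.com/xdmp/admin"
>   at "/MarkLogic/admin.xqy";
>
> let $config := admin:get-configuration()
> let $group  := admin:group-get-id($config, "Default")  (: assumed group name :)
> (: set the expanded tree cache to 12288 MB;
>    note that saving this change restarts the cluster :)
> let $config := admin:group-set-expanded-tree-cache-size($config, $group, 12288)
> return admin:save-configuration($config)
> ```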
>
>
> 2. Ingestion of content done only on nodes 2 and 3.
>
> By design, we validate incoming content on node 1 and ingest (i.e.
> document-insert) it on nodes 2 and 3. Could this be causing the content
> to be saved only on nodes 2 and 3? I was informed months ago that ML
> automatically saves documents evenly across all the forests making up a
> database. In our case, our production database is made up of 3 forests,
> each hosted on one of the nodes.
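>
> As I understand it, MarkLogic assigns each document to a forest
> deterministically on the database side, independent of which host
> evaluates the insert, so running document-insert on nodes 2 and 3 should
> not by itself skew the forests. One way to check the actual per-forest
> document distribution (a sketch, assuming the forest names above and run
> against the production database):
>
> ```xquery
> xquery version "1.0-ml";
> (: report the document count in each forest of the current database :)
> for $fid in xdmp:database-forests(xdmp:database())
> return fn:concat(
>   xdmp:forest-name($fid), " : ",
>   xdmp:forest-counts($fid)//*:document-count/fn:string())
> ```
>
> If you ever do need to steer documents, xdmp:document-insert also accepts
> an explicit forest-ids argument, but under normal operation the automatic
> assignment should keep the forests roughly balanced.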
>
> Looking at the database status, the forests containing binary files are
> almost the same size, but the text forests vary as follows:
>
> text-content-1 : 51 GB
> text-content-2 : 74 GB
> text-content-3 : 66 GB
>
> Regards,
> Danny
>