
Endeca MDEX Load Balancing

by Vina Rakotondrainibe | Oracle Commerce Expert and Cloud Deployment Specialist, Paris area

Introduction

In this article, we will talk about high availability tests on Endeca (now known as Oracle Guided Search) and how to load balance MDEXes. For these tests, I used the Commerce Reference Store (CRS) and configured dgraphs (the part of MDEX that responds to search requests) on two different VMs for the live environment. Search requests from the store are load balanced via HAProxy. My goal was to find out how to configure Endeca and HAProxy to minimize site navigation unavailability during baseline updates. I assume that you are familiar with Oracle Guided Search and already know about the components we are talking about. If not, you can read the Getting Started Guide.

As a reminder, when you do a baseline update on Endeca (i.e. reindex everything), you replace the entire dgraph index with a fresh one. During this operation, your dgraph no longer responds to search calls. Since the paradigm for an Oracle Commerce-based site is to let Endeca and Experience Manager drive the entire navigation, an unavailable search engine means "no site anymore" for your users.

Configuring HAProxy

HAProxy is a software load balancer and proxy that has been known for its reliability for years. This is the sample configuration file I used for my tests:

#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 2
    timeout http-request    2s
    timeout queue           10s
    timeout connect         2s
    timeout client          2s
    timeout server          2s
    timeout http-keep-alive 5s
    timeout check           3s
    maxconn                 3000
#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
frontend  main *:15000
    option httplog
    log-format %hr\ %r\ %st\ %B\ %Tr\ %cc\ %b/%s\ %cs
    default_backend             livemdexes
#---------------------------------------------------------------------
# static backend for serving up images, stylesheets and such
#---------------------------------------------------------------------
backend static
    balance     roundrobin
    server      static 127.0.0.1:80 check
#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
backend livemdexes
    balance     leastconn
    cookie JSESSIONID prefix indirect nocache
    option httpchk GET /admin?op=ping HTTP/1.0\r\nUser-agent:\ LB-Check
    server mdex1 <MDEX 1 IP>:15000 check inter 2000 fall 2 rise 1 cookie mdex1
    server mdex2 <MDEX 2 IP>:15000 check inter 2000 fall 2 rise 1 cookie mdex2
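Before reloading HAProxy with this file, it is worth validating it; a quick sketch, assuming the standard /etc/haproxy/haproxy.cfg path and a systemd-based install:

# Validate the configuration file, then reload HAProxy without dropping existing connections
haproxy -c -f /etc/haproxy/haproxy.cfg
sudo systemctl reload haproxy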

You notice a few things in the configuration above:

  • I lowered all connection timeout settings to 2 seconds, as Endeca is supposed to respond very quickly. Depending on your site and personalization, you might want to increase this value.
  • The MDEXes expose an admin HTTP interface with a ping operation to check if the dgraph is alive and responding.
  • We set pretty aggressive fall and rise parameters on the backend servers to make sure very few requests go to a dgraph once it becomes unavailable.
  • We set the health check interval to 2 seconds. This is probably too aggressive for a production environment, but my dgraph unavailability lasted only 2 seconds with the CRS dataset. The idea here is to find a polling interval that lets HAProxy detect the dgraph as soon as it gets back into service (the ping endpoint itself can be called by hand, as shown below).
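A quick manual check of the admin ping operation used by httpchk; the <MDEX 1 IP> placeholder is the one from the configuration above:

# A healthy dgraph answers HTTP 200 to the admin ping used by the health check
curl -i "http://<MDEX 1 IP>:15000/admin?op=ping"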

Configuring the MDEXes

To avoid a total blackout, the dgraphs must be placed in different restart groups. This is important because they all end up in the same restart group by default if you do not pay attention to the configuration template assembled for CRS.

This is the declaration for my second dgraph in the LiveDgraphCluster.xml file (this file can be found in the config/script folder of your assembled Endeca application for CRS):

  <dgraph id="DgraphB1" host-id="LiveMDEXHostB" port="15000" post-startup-script="LiveDgraphPostStartup">
    <properties>
      <property name="restartGroup" value="2" />
      <property name="DgraphContentGroup" value="Live" />
    </properties>
    <log-dir>./logs/dgraphs/DgraphB1</log-dir>
    <input-dir>./data/dgraphs/DgraphB1/dgraph_input</input-dir>
    <update-dir>./data/dgraphs/DgraphB1/dgraph_input/updates</update-dir>
  </dgraph>

Note that we set the restartGroup property to 2 for this second dgraph in order to sequentially replace the indexes when doing a baseline update.
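For contrast, the first dgraph stays in restart group 1. A sketch of its declaration (the LiveMDEXHostA host id is an assumption here; adapt it to your own LiveDgraphCluster.xml):

  <dgraph id="DgraphA1" host-id="LiveMDEXHostA" port="15000" post-startup-script="LiveDgraphPostStartup">
    <properties>
      <property name="restartGroup" value="1" />
      <property name="DgraphContentGroup" value="Live" />
    </properties>
    <log-dir>./logs/dgraphs/DgraphA1</log-dir>
    <input-dir>./data/dgraphs/DgraphA1/dgraph_input</input-dir>
    <update-dir>./data/dgraphs/DgraphA1/dgraph_input/updates</update-dir>
  </dgraph>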

Configuring ATG

It is possible to configure ATG to talk to different sets of MDEXes depending on the Endeca application the Assembler invokes.

To do so, you should override the /atg/endeca/assembler/AssemblerApplicationConfiguration service configuration. For my tests, I had only one application (the CRS Endeca application), so I could set the default properties directly:

defaultMdexPort=15000
defaultMdexHostName=my_haproxy_machine_fqdn

Note: We set the HAProxy listen port to 15000, so we basically just needed to point the MDEX host name at the HAProxy machine's FQDN. For multiple Endeca applications, you would need to define several frontends on several ports in HAProxy, as sketched below.
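A minimal sketch of what that could look like, with a hypothetical second application listening on port 15001 and its own pair of MDEXes:

frontend  secondapp *:15001
    default_backend             secondappmdexes

backend secondappmdexes
    balance     leastconn
    option httpchk GET /admin?op=ping HTTP/1.0\r\nUser-agent:\ LB-Check
    server mdex3 <MDEX 3 IP>:15000 check inter 2000 fall 2 rise 1
    server mdex4 <MDEX 4 IP>:15000 check inter 2000 fall 2 rise 1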

Load testing the MDEXes (i.e. the dgraph part)

Load testing an MDEX is fairly easy, though not very well documented. Each time a dgraph is called, it logs the request (including parameters) in a file with a .reqlog extension. You can find this file easily by issuing ps -ef and looking at your running dgraph: the request log file is passed as a parameter to the dgraph process.
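For example, something like this pulls the .reqlog path out of the dgraph command line (the grep pattern is just an illustration):

# Find the running dgraph and extract the .reqlog path passed on its command line
ps -ef | grep '[d]graph' | grep -o '[^ ]*\.reqlog'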

You can extract requests from a file or multiple files and filter these requests by type. In my case, I took only navigation query requests:

/opt/endeca/MDEX/6.5.1/bin/reqlogparser --query-types g --silent DgraphA1.reqlog DgraphB1.reqlog  > /tmp/requests.log

Note: Use the reqlogparser command with the --help option to get all available options.

Now that you have a set of requests regrouped in a single log file, you can use this file to simulate load on your search engines:

/opt/endeca/MDEX/6.5.1/bin/eneperf  haproxy_server_fqdn 15000 /tmp/requests.log 100 10

Again, use the eneperf command with the --help option to get all available options.

The baseline update

You can launch a baseline update by going to your Endeca application control folder (i.e. /opt/endeca/Apps/CRS/control in my case) and issuing:

./baseline_update.sh

You should see the following log entries, which indicate when the dgraph is stopped for the full index replacement:

[10.07.15 18:38:11] INFO: Stopping component 'DgraphB1'.
[10.07.15 18:38:12] INFO: [LiveMDEXHostB] Starting shell utility 'move_dgraph-input_to_dgraph-input-old-1444235892804'.
[10.07.15 18:38:14] INFO: [LiveMDEXHostB] Starting shell utility 'move_dgraph-input-new_to_dgraph-input-1444235894127'.
[10.07.15 18:38:15] INFO: [LiveMDEXHostB] Starting backup utility 'backup_log_dir_for_component_DgraphB1'.
[10.07.15 18:38:16] INFO: [LiveMDEXHostB] Starting component 'DgraphB1'.

During this phase, you will receive HTTP 503 responses if you try to talk to the stopped MDEX. So the goal is to find the right settings to minimize the number of requests your load balancer sends to the stopped search engine.
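To get a feel for how long the blackout actually is, I find it convenient to poll the load balancer once per second during the update; a rough sketch (haproxy_server_fqdn being the same placeholder as in the eneperf command above):

# Log a timestamp every time the ping through HAProxy returns something other than HTTP 200
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://haproxy_server_fqdn:15000/admin?op=ping")
  [ "$code" != "200" ] && echo "$(date '+%H:%M:%S') got HTTP $code"
  sleep 1
done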

The best approach is to programmatically instruct the load balancer to blacklist the stopped resource before starting the index switch. And it is possible! You can define a post-startup-script and a pre-shutdown-script for each dgraph. You need to edit the LiveDgraphCluster.xml file, which is located in the config/script folder of your Endeca application.
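The scripts themselves are outside the scope of this article, but the HAProxy side is simple. A sketch of what a pre-shutdown script could run, assuming the stats socket is declared with level admin (the one in the configuration above is read-only by default):

# Drain mdex2 before the index switch; the matching post-startup script would run "enable server"
echo "disable server livemdexes/mdex2" | socat stdio /var/lib/haproxy/stats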
