21 sept 2010

Tomcat 6: Session replication for failover

Summary: there is a bug in Tomcat 6.0.20 which prevents Tomcat from sending multicasts between instances, so the cluster never forms and sessions are not replicated.

For my current project, I have to cluster Alfresco 3.2r Enterprise on a mini 2-node cluster. Hibernate L2 cache replication already works correctly (you have to rename ehcache-custom.xml.sample.cluster, which was not totally clear after reading the documentation), so the only piece still missing was session replication, which the docs state is supported. (There seems to be a bug which makes session replication fail, but I had no time to verify it. Anyway, I wanted to go ahead and learn how to configure Tomcat for session replication and fail-over.)

As I did not know much about configuring Tomcat, I picked up an existing Tomcat 6.0.20 instance and a small session example. Then I configured Tomcat, following the session replication / cluster how-to. Finally, I copied the Tomcat instance and changed any colliding ports.

But I was not able to make it work. I kept looking in the log for any message about my instances forming the cluster, but without luck. After trying other ports, reconfiguring the network to support multicast ping (ICMP), googling around, reading a lot of docs, etc., I found an email message (which I can't find anymore) suggesting that there is a bug in Tomcat 6.0.20: it does not send the multicasts used for cluster instance detection!

I immediately downloaded a newer version (6.0.29) and configured the two instances. It worked at the first attempt.
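
A quick way to tell a network problem from a Tomcat problem is to listen on the membership multicast group and see whether any heartbeats arrive at all. The following is only a sketch, assuming Python 3 is available on the node; the group address and port are the ones from the server.xml shown further down, while the function names and the 3-second window are made up for illustration.

```python
import socket
import time

GROUP = "228.0.0.4"  # Membership address from the server.xml in this post
PORT = 45564         # Membership port from the server.xml in this post


def membership_request(group, interface="0.0.0.0"):
    """Build the 8-byte ip_mreq value that IP_ADD_MEMBERSHIP expects."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def heartbeat_senders(group=GROUP, port=PORT, seconds=3.0):
    """Join the group and return the source IPs of datagrams seen within `seconds`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    senders = set()
    try:
        # Allow sharing the port with a Tomcat instance on the same host
        # (assumption: the platform permits this for multicast sockets).
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group))
        deadline = time.monotonic() + seconds
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                _, (ip, _port) = sock.recvfrom(65535)
            except socket.timeout:
                break
            senders.add(ip)
    finally:
        sock.close()
    return senders


if __name__ == "__main__":
    seen = heartbeat_senders()
    print("multicast senders:", sorted(seen) if seen else "none seen")
```

If the script reports no senders while both instances are running, either the network is dropping multicast or the instances are not sending it. It cannot tell you which, but it narrows the search.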

I use Apache's mod_proxy_balancer to test the instances. Here is my Apache config file:
<Location /balancer-manager>
    SetHandler balancer-manager
</Location>

<Proxy balancer://ajpCluster>
    BalancerMember ajp://localhost:8809 route=jvm1
    BalancerMember ajp://localhost:8810 route=jvm2
</Proxy>

<Location /sessiontest>
    ProxyPass balancer://ajpCluster/sessiontest stickysession=JSESSIONID nofailover=off
</Location>

<Location /favicon.ico>
    ProxyPass balancer://ajpCluster/favicon.ico
</Location>

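The "route" parameter of BalancerMember names the value that is appended to the session id. It has to match the jvmRoute of the corresponding Tomcat instance, because Tomcat is the one that appends it (after a dot) and the balancer uses it to send sticky requests back to the same instance. The /balancer-manager URL helps you to debug the cluster: it shows whether both instances accept requests and how many have been processed, and it lets you enable or disable any instance. As we can see here, my Tomcat instances are listening for AJP requests on ports 8809 and 8810.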
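
To see which instance a session is bound to while testing, it is enough to look at the part of the JSESSIONID cookie after the dot. A minimal sketch of that check follows; the function name and the sample ids are hypothetical, and it assumes the route follows the first dot in the cookie value.

```python
def split_session_id(jsessionid):
    """Split a JSESSIONID value into (session id, route).

    The route is None when the value carries no ".route" suffix.
    """
    session_id, _, route = jsessionid.partition(".")
    return session_id, (route or None)


if __name__ == "__main__":
    # Hypothetical cookie values, only for illustration.
    for value in ("0123456789ABCDEF.jvm1", "0123456789ABCDEF"):
        print(value, "->", split_session_id(value))
```

After a fail-over, the same session id should show up with the other instance's route, which is an easy thing to watch for in the browser's cookie view.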

Here is the interesting part of the conf/server.xml of both instances (they differ only in their ports and, to match the balancer routes above, in the jvmRoute):
<Engine name="Catalina" defaultHost="localhost" jvmRoute="jvm1">

  <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
           channelSendOptions="8">
    <Manager className="org.apache.catalina.ha.session.DeltaManager"
             expireSessionsOnShutdown="false"
             notifyListenersOnReplication="true"/>

    <Channel className="org.apache.catalina.tribes.group.GroupChannel">

      <Membership className="org.apache.catalina.tribes.membership.McastService"
                  address="228.0.0.4"
                  ttl="15"
                  port="45564"
                  frequency="500"
                  dropTime="3000" />

      <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
                address="auto"
                port="4200"
                autoBind="100"
                selectorTimeout="5000"
                maxThreads="6" />

      <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
        <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"/>
      </Sender>

      <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>

      <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>

    </Channel>
    <Valve className="org.apache.catalina.ha.tcp.ReplicationValve"
           filter=".*\.gif;.*\.js;.*\.jpg;.*\.htm;.*\.html;.*\.txt;" />

    <Deployer className="org.apache.catalina.ha.deploy.FarmWarDeployer"
              tempDir="/tmp/war-temp/"
              deployDir="/tmp/war-deploy/"
              watchDir="/tmp/war-listen/"
              watchEnabled="false" />

    <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>

    <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/>
  </Cluster>
  ...
</Engine>


Keep in mind that for Tomcat instances running on the same machine you have to change every enabled connector port (HTTP, HTTPS, AJP), the SHUTDOWN port, and the cluster message Receiver port (4200 in the example above) so that no two instances share one.
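
Checking two copies of server.xml for leftover port collisions by eye is error-prone, so here is a small sketch that does it with Python's standard XML parser. It assumes complete server.xml files with literal numeric ports (no property placeholders); the function names are made up for illustration. The Membership port is deliberately ignored, since all cluster members share the same multicast address and port.

```python
import sys
import xml.etree.ElementTree as ET


def listening_ports(server_xml):
    """Return the ports one instance binds: shutdown, connectors, cluster Receiver."""
    root = ET.fromstring(server_xml)
    ports = set()
    for tag in ("Server", "Connector", "Receiver"):
        for element in root.iter(tag):
            port = element.get("port")
            if port is not None:
                ports.add(int(port))
    return ports


def colliding_ports(first_xml, second_xml):
    """Return the ports that both configurations would try to bind."""
    return listening_ports(first_xml) & listening_ports(second_xml)


if __name__ == "__main__":
    # Usage: python check_ports.py tomcat1/conf/server.xml tomcat2/conf/server.xml
    with open(sys.argv[1], "rb") as first, open(sys.argv[2], "rb") as second:
        clashes = colliding_ports(first.read(), second.read())
    print("colliding ports:", sorted(clashes) if clashes else "none")
```

An empty result only means the two files do not collide with each other. Other services on the machine may still occupy one of the ports.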