Setup TarMK Cold Standby in AEM 6

With the move from Jackrabbit to Oak, AEM 6 introduces the long-awaited TarMK Cold Standby architecture. It mitigates risk in failover situations and is an alternative to the traditional master-slave setup.

Note:- TarMK Cold Standby sync is linear from the primary to the standby node, with no repository corruption check. If the primary is corrupted, the standby will be corrupted as well. By design, the standby instance is an exact copy of the primary and cannot help if the primary instance gets corrupted.

After completing this tutorial you will have a clear understanding of:-

  • How TarMK Cold Standby works.
  • How to set up TarMK Cold Standby in AEM.
  • How to debug the first-time sync between the primary and standby instances.
  • How to switch from primary to standby during failover.
  • Advantages and disadvantages of using TarMK Cold Standby.

How TarMK Cold Standby works:

The TarMK Cold Standby approach allows one or more standby AEM instances to connect to a primary instance. A standby instance is a live working copy of the primary (master) repository, which allows a quick switch-over without data loss if the primary becomes unavailable for any reason.

On the primary AEM instance, a TCP port is opened to listen for incoming messages. A standby instance sends two types of messages to the primary:

  • A message requesting the segment ID of the current head.
  • A message requesting segment data with a specified ID.

Note:- Standby instances do not serve any requests, because they run in sync-only mode. Only the Felix Console is accessible on a standby instance, for configuring OSGi services and components.
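To make the two message types concrete, here is a minimal Python sketch of a pull-based sync cycle. It is a toy model and not Oak's implementation: the names `Segment`, `ToyPrimary`, and `ToyStandby`, the in-memory dictionaries, and the idea that each segment carries a list of the segments it references are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A toy segment: an ID, a payload, and the IDs of segments it references."""
    segment_id: str
    data: bytes
    references: list = field(default_factory=list)


class ToyPrimary:
    """Stands in for the primary: it only answers the two request types."""

    def __init__(self, segments, head_id):
        self._segments = {s.segment_id: s for s in segments}
        self._head_id = head_id

    def get_head(self):
        # Message type 1: "what is the segment ID of the current head?"
        return self._head_id

    def get_segment(self, segment_id):
        # Message type 2: "send me the segment data for this ID."
        return self._segments[segment_id]


class ToyStandby:
    """Stands in for the standby: it pulls, the primary never pushes."""

    def __init__(self):
        self.store = {}
        self.head_id = None

    def sync_once(self, primary):
        """Run one sync cycle and return the IDs fetched during this cycle."""
        fetched = []
        head_id = primary.get_head()
        pending = [head_id]
        while pending:
            segment_id = pending.pop()
            if segment_id in self.store:
                continue  # already copied in an earlier cycle
            segment = primary.get_segment(segment_id)
            self.store[segment_id] = segment
            fetched.append(segment_id)
            pending.extend(segment.references)
        self.head_id = head_id
        return fetched


if __name__ == "__main__":
    base = Segment("seg-a", b"older content")
    head = Segment("seg-b", b"newer content", references=["seg-a"])
    primary = ToyPrimary([base, head], head_id="seg-b")
    standby = ToyStandby()
    print(standby.sync_once(primary))  # first cycle copies everything reachable
    print(standby.sync_once(primary))  # second cycle finds nothing new
```

The point of the sketch is the direction of traffic: every exchange starts on the standby side, which is why only the primary needs an open TCP port.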

The figure below shows a typical TarMK Cold Standby deployment architecture:-

[Figure: TarMK Cold Standby deployment architecture]

Note:- It is recommended to place a load balancer between the dispatcher and the primary instance. The load balancer should direct all traffic to the primary instance only.

In addition to the above architecture, you can map a network drive and run a scheduled backup (daily, or once or twice a week) of the primary's latest crx-quickstart folder. This helps when the primary instance is corrupted: because the sync is linear, the standby is corrupted as soon as the primary is, so an additional backup gives you something to restore from.

[Figure: primary and cold standby TarMK architecture in AEM]

Set up TarMK Cold Standby in AEM

In this tutorial, I have used the default node store (segment store) for data storage.

Follow the steps below to set up the primary instance:-

  • Install AEM 6.1 and let it create the crx-quickstart folder.
  • Shut down the instance and copy the crx-quickstart (installation) folder from the primary to the standby instance.
    Note:- It is advisable to give the folders descriptive names, such as aem-primary and aem-standby, to tell them apart.
  • Create a folder named install at aem-primary/crx-quickstart/install. If this folder already exists, delete its contents.
  • Create a folder named install.primary (or any descriptive name) at aem-primary/crx-quickstart/install/install.primary to store node store and data store related configuration files.
  • Create a config file named org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.config and enter the values below:-
    org.apache.sling.installer.configuration.persist=B"false"
    mode="primary"
    port=I"8023"
  • Start the primary instance with the “primary” run mode. For example, on Windows:
    set CQ_RUNMODE=primary,crx3,crx3tar
  • Create a new log file to store the logs related to TarMK sync on the primary (the log examples later in this tutorial use the file name tarmk-coldstandby.log).
    • Go to the Felix Console.
    • Create a new Apache Sling Logging Logger for the org.apache.jackrabbit.oak.plugins.segment package.
    • Set the log level to DEBUG.

Note:- Change the log level from DEBUG to INFO or ERROR after the first sync; otherwise the TarMK log file will grow very quickly.
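If you set up more than one environment, typing the config file by hand gets error-prone. The following is a small Python sketch that renders the primary configuration shown in the steps above. The helper names `render_osgi_config` and `write_config` are my own, and the sketch only handles the three value types used in this tutorial (boolean, integer, string) without any escaping, so treat it as a starting point rather than a complete OSGi config writer.

```python
from pathlib import Path

# File name (without extension) taken from the step above.
PID = "org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService"

PRIMARY_PROPERTIES = {
    "org.apache.sling.installer.configuration.persist": False,
    "mode": "primary",
    "port": 8023,
}


def render_osgi_config(properties):
    """Render properties in the key=value style used by the .config files above."""
    lines = []
    for key, value in properties.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            rendered = 'B"true"' if value else 'B"false"'
        elif isinstance(value, int):
            rendered = f'I"{value}"'
        else:
            rendered = f'"{value}"'
        lines.append(f"{key}={rendered}")
    return "\n".join(lines) + "\n"


def write_config(install_dir, pid, properties):
    """Write <pid>.config into install_dir, creating the folder if needed."""
    target_dir = Path(install_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{pid}.config"
    target.write_text(render_osgi_config(properties), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Print the rendered file so it can be compared with the values above.
    print(render_osgi_config(PRIMARY_PROPERTIES), end="")
```

To write the file instead of printing it, call `write_config("aem-primary/crx-quickstart/install/install.primary", PID, PRIMARY_PROPERTIES)` while the instance is stopped.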

Follow the steps below to set up the standby instance:-

  • Go to the standby instance and run the AEM jar file under the aem-standby folder.
  • Create the same logging configuration as on the primary instance.
  • Once done, stop the instance.
  • Delete any content under aem-standby/crx-quickstart/install.
  • Create a folder named install.standby (or any descriptive name) at aem-standby/crx-quickstart/install/install.standby to store node store and data store related configuration files.
  • Create a config file named org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.config and enter the values below:-
    org.apache.sling.installer.configuration.persist=B"false"
    mode="standby"
    primary.host="127.0.0.1"
    port=I"8023"
    secure=B"false"
    interval=I"5"

    Note:- Change primary.host to the IP address of the primary instance.

  • Create a config file named org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStoreService.config and enter the values below:-
    name="Oak-Tar"
    service.ranking=I"100"
    standby=B"true"
    
  • Start the standby instance with the “standby” run mode. For example, on Windows:
    set CQ_RUNMODE=standby,crx3,crx3tar
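Before starting the standby, it can save time to confirm that the standby machine can actually reach the sync port on the primary. Here is a minimal Python sketch for that, run from the standby host. The function name `port_is_reachable` is my own, and the host and port simply mirror the primary.host and port values from the standby config above. It only proves that a TCP connection can be opened; it says nothing about whether the sync itself works.

```python
import socket


def port_is_reachable(host, port, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    # Replace these with the primary.host and port values from your standby config.
    if port_is_reachable("127.0.0.1", 8023):
        print("primary sync port is reachable")
    else:
        print("cannot connect; check primary.host, port, and firewall rules")
```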

The above configuration can also be done from the Felix Console:-

  • Go to Felix Console –> Configuration Manager.
  • Search for “Apache Jackrabbit Oak TarMK Cold Standby service”.
  • Change the settings, save them, and restart the instance for the changes to take effect.

[Figure: Apache Jackrabbit Oak TarMK Cold Standby service configuration]

Note:- Once saved, AEM creates a config file like the one we created above and stores all the configuration values at \aem-primary\crx-quickstart\launchpad\config\org\apache\jackrabbit\oak\plugins\segment\standby\store.

Debug First time Sync between Primary and Standby Instance:-

Once the setup is complete and the standby instance starts syncing with the primary instance, you can verify that it started properly by comparing your logs with the debug logs below.

Standby Instance Logs:-

  • Open tarmk-coldstandby.log on the standby instance. You will see log entries like the ones below for reading, receiving, and writing a segment, which means the sync has started.

*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStore trying to read segment ec1f739c-0e3c-41b8-be2e-5417efc05266

*DEBUG* [nioEventLoopGroup-3-1] org.apache.jackrabbit.oak.plugins.segment.standby.codec.SegmentDecoder received type 1 with id ec1f739c-0e3c-41b8-be2e-5417efc05266 and size 262144

*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStore got segment ec1f739c-0e3c-41b8-be2e-5417efc05266 with size 262144

*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.plugins.segment.file.TarWriter Writing segment ec1f739c-0e3c-41b8-be2e-5417efc05266 to /mnt/crx/author/crx-quickstart/repository/segmentstore/data00016a.tar

  • Open error.log on the standby instance. You will see the sync configuration and its status as started.
*INFO* [FelixStartLevel] org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService started standby sync with 10.20.30.40:8023 at 5 sec.

Primary Instance Logs:-

  • Open tarmk-coldstandby.log on the primary instance. You will see log entries like the ones below; here the client is our standby instance.
*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.plugins.segment.standby.store.CommunicationObserver got message ‘s.d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd’ from client c7a7ce9b-1e16-488a-976e-627100ddd8cd

*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.plugins.segment.standby.server.StandbyServerHandler request segment id d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd

*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.plugins.segment.standby.server.StandbyServerHandler sending segment d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd to /10.20.30.40:34998

*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.plugins.segment.standby.store.CommunicationObserver did send segment with 262144 bytes to client c7a7ce9b-1e16-488a-976e-627100ddd8cd

Note:- Once these logs stop appearing, you can assume that your sync is completed.

  • You can monitor the log entry below to see which tar file is currently being synchronized.
*DEBUG* [defaultEventExecutorGroup-156-1] org.apache.jackrabbit.oak.plugins.segment.file.TarWriter Writing segment 3a03fafc-d1f9-4a8f-a67a-d0849d5a36d5 to /<<CQROOTDIRECTORY>>/crx-quickstart/repository/segmentstore/data00014a.tar
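Reading DEBUG logs by eye gets tedious on a large repository, so here is a small Python sketch that summarizes a standby log: how many segments were received, how many bytes, and which tar file was written last. The function name `summarize_sync` and the regular expressions are my own and are based only on the sample log lines shown in this tutorial; if your log format differs, the patterns will need adjusting. The sketch also assumes forward slashes in the tar path, as in the samples.

```python
import re

# Patterns derived from the sample standby log lines in this tutorial.
GOT_SEGMENT = re.compile(
    r"StandbyStore got segment (?P<id>[0-9a-f-]{36}) with size (?P<size>\d+)"
)
WRITING_SEGMENT = re.compile(
    r"TarWriter Writing segment (?P<id>[0-9a-f-]{36}) to (?P<path>\S+)"
)

SAMPLE_LINES = [
    "*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStore trying to read segment ec1f739c-0e3c-41b8-be2e-5417efc05266",
    "*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStore got segment ec1f739c-0e3c-41b8-be2e-5417efc05266 with size 262144",
    "*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.plugins.segment.file.TarWriter Writing segment ec1f739c-0e3c-41b8-be2e-5417efc05266 to /mnt/crx/author/crx-quickstart/repository/segmentstore/data00016a.tar",
]


def summarize_sync(lines):
    """Count received segments and bytes, and remember the last tar file written."""
    segments_received = 0
    bytes_received = 0
    last_tar_file = None
    for line in lines:
        got = GOT_SEGMENT.search(line)
        if got:
            segments_received += 1
            bytes_received += int(got.group("size"))
            continue
        writing = WRITING_SEGMENT.search(line)
        if writing:
            # Keep only the file name, e.g. data00016a.tar
            last_tar_file = writing.group("path").rsplit("/", 1)[-1]
    return {
        "segments_received": segments_received,
        "bytes_received": bytes_received,
        "last_tar_file": last_tar_file,
    }


if __name__ == "__main__":
    print(summarize_sync(SAMPLE_LINES))
```

To run it against a real file, pass an open file object instead of the sample list, for example `summarize_sync(open("tarmk-coldstandby.log", encoding="utf-8"))`. Running it twice a few minutes apart and comparing the counts is a simple way to tell whether the sync is still progressing.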

Switch from primary to standby during failover:-

Steps to follow when the primary instance goes down or crashes in a production environment:-

  • Remove the primary instance from the load balancer, if you are using one.
  • Stop the standby instance and bring it up as the primary instance (by changing its run mode to primary).
    Note:- Take a backup of the standby's crx-quickstart folder if you suspect the primary instance is corrupted and you want to keep the current instance as primary and create a new standby instance.
  • Add the new primary instance to the load balancer.

[Figure: primary failover procedure in AEM]
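These steps are manual, so something has to notice that the primary is down before anyone can start them. As a rough illustration, here is a Python sketch of a watcher that reports an outage only after several failed checks in a row, so that a single network blip does not trigger a failover. Everything in it is an assumption for illustration: the function name `wait_for_outage`, the threshold of 3, and the 5-second interval are arbitrary, and `check` is a hypothetical callable that you supply (for example, a function that tries to open a TCP connection to the primary).

```python
import time


def wait_for_outage(check, threshold=3, interval_seconds=5.0, sleep=time.sleep):
    """Block until check() has returned False `threshold` times in a row.

    Returns the total number of checks performed. A successful check resets
    the failure counter, so isolated failures are ignored.
    """
    consecutive_failures = 0
    checks = 0
    while True:
        checks += 1
        if check():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return checks
        sleep(interval_seconds)


if __name__ == "__main__":
    # Simulated check results instead of a real probe: up, blip, up, then down.
    results = iter([True, False, True, False, False, False])
    performed = wait_for_outage(lambda: next(results), sleep=lambda seconds: None)
    print(f"primary considered down after {performed} checks; start the failover steps")
```

The sketch only detects and reports. Whether to switch over remains a human decision, which is the safer choice given that a corrupted primary also means a corrupted standby.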

Advantages and Disadvantages of using TarMK Cold Standby

Advantages of using the TarMK Cold Standby architecture:-

  • It is robust: it uses checksums on all packets to detect damaged packets and handles network-related issues automatically.
  • As all instances run within the same intranet, a security breach becomes more difficult. Furthermore, access to the primary or standby instance can be restricted by IP range from the Felix Console.
  • Failover process is simple and fast.

Disadvantages of using the TarMK Cold Standby architecture:-

  • It is not multi-threaded, so multiple cores cannot speed up the sync process.
  • One server is idle most of the time.
  • Failover is not triggered automatically.