Monday 30 May 2016

Distributed SolrCloud setup with external ZooKeeper ensemble

We all know that Solr search performs better than database queries because it uses an inverted index instead of a full table scan. Databases and Solr have complementary strengths and weaknesses, though. In this blogpost we will set up a SolrCloud just like a production system.

SolrCloud or Solr Master Slave
SolrCloud and master-slave both address four particular issues:
  • Sharding
  • Near Real Time (NRT) search and incremental indexing
  • Query distribution and load balancing
  • High Availability (HA) 
If our application just reads data from Solr and needs high availability on reads, then a simple one-master-to-many-slaves hierarchy is more than sufficient. But if you are also looking for high availability on writes to Solr, then SolrCloud is the right option.

Is SolrCloud better? Maintaining SolrCloud needs good infrastructure: we have to watch the availability of the ZooKeepers and the health of the nodes, and use high-performance disks for better replication speed. Other than that, we don't need to worry about data consistency among nodes, as this is taken care of by SolrCloud.

In which cases is it better to choose SolrCloud?
When we need high availability on Solr writes as well as reads, we have to go for SolrCloud. Also, if we cannot afford bigger machines to host the whole index on a single node, we can split the index into shards and keep them on machines with smaller configurations.

In which cases is it better to choose Solr replication?
When our application does not write to Solr in real time, replication is enough and there is no need to get complicated with SolrCloud. Also, it is comparatively easier to set up master-slave than SolrCloud.

Sharding, data consistency and automatic rebalancing of shards are better in SolrCloud. Query distribution and load balancing are automatic in SolrCloud; in a sharded master-slave environment we need to use distributed queries ourselves.

To set up a distributed SolrCloud we need the following -
  1. At least 6 servers (3 for the ZooKeeper cluster setup and 3 for the SolrCloud setup)
  2. ZooKeeper
  3. Solr 6
For this blogpost we will be using a single system with three different ZooKeeper and three different Solr 6 instances running on different ports.

ZooKeeper Cloud Setup 

1. Download Apache ZooKeeper (3.4.6 is the version I am using).
2. Create a directory, let's call it zk_cluster, under $<home>.
3. Create three ZooKeeper instances under zk_cluster; let's call them zookeeper-3.4.6_1, zookeeper-3.4.6_2 and zookeeper-3.4.6_3.
4. Create data and logs directories under the zk_cluster directory.
5. Create a ZooKeeper server ID for each instance: a file named myid that resides in that instance's ZooKeeper data directory.

At this point your ZooKeeper cluster directory will look like this -

zk_cluster
|-- data
|   |-- zookeeper-3.4.6_1
|   |   |-- myid  (contains the number 1)
|   |-- zookeeper-3.4.6_2
|   |   |-- myid  (contains the number 2)
|   |-- zookeeper-3.4.6_3
|       |-- myid  (contains the number 3)
|-- logs
|   |-- zookeeper-3.4.6_1
|   |-- zookeeper-3.4.6_2
|   |-- zookeeper-3.4.6_3
|-- zookeeper-3.4.6_1
|-- zookeeper-3.4.6_2
|-- zookeeper-3.4.6_3
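
The directories and myid files can be created from the shell. A minimal sketch, assuming the ZooKeeper tarball has already been extracted three times under zk_cluster:

cd $<home>/zk_cluster
mkdir -p data/zookeeper-3.4.6_1 data/zookeeper-3.4.6_2 data/zookeeper-3.4.6_3
mkdir -p logs/zookeeper-3.4.6_1 logs/zookeeper-3.4.6_2 logs/zookeeper-3.4.6_3
# each myid file contains nothing but the numeric server id
echo 1 > data/zookeeper-3.4.6_1/myid
echo 2 > data/zookeeper-3.4.6_2/myid
echo 3 > data/zookeeper-3.4.6_3/myid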

6. Prepare the ZooKeeper configuration file zoo.cfg at $<home>/zk_cluster/zookeeper-3.4.6_1/conf/zoo.cfg. Here I will show it for server 1; repeat the same steps with appropriate values (clientPort, dataDir, dataLogDir) for each of the other ZooKeeper servers.


# The number of milliseconds of each tick
tickTime=2000

# The number of ticks that the initial synchronization phase can take
initLimit=10

# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5

# the directory where the snapshot is stored.
# Choose appropriately for your environment
dataDir=$<home>/zk_cluster/data/zookeeper-3.4.6_1

# the port at which the clients will connect
# (use 2182 and 2183 for the other two instances)
clientPort=2181

# the directory where the transaction log is stored.
# this parameter provides a dedicated log device for ZooKeeper
dataLogDir=$<home>/zk_cluster/logs/zookeeper-3.4.6_1

# ZooKeeper server and its port no.
# ZooKeeper ensemble should know about every other machine in the ensemble
# specify server id by creating 'myid' file in the dataDir
# use hostname instead of IP address for convenient maintenance
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

7. Once zoo.cfg has been created for all the servers, we can start the ZooKeeper servers with the bin/zkServer.sh script, which supports the following commands:

  • start
  • start-foreground
  • stop
  • restart
  • status
  • upgrade
  • print-cmd
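
For example, to start all three servers and check which one was elected leader (paths as in the layout above; each instance picks up its own conf/zoo.cfg):

$<home>/zk_cluster/zookeeper-3.4.6_1/bin/zkServer.sh start
$<home>/zk_cluster/zookeeper-3.4.6_2/bin/zkServer.sh start
$<home>/zk_cluster/zookeeper-3.4.6_3/bin/zkServer.sh start

# one instance should report "Mode: leader", the other two "Mode: follower"
$<home>/zk_cluster/zookeeper-3.4.6_1/bin/zkServer.sh status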

SolrCloud Setup

Before we create the Solr instances, we need to create a configset in order to create a collection that can be sharded and replicated across multiple instances. A configset is very specific to our collection. We can also use the pre-built configsets that ship with Solr 6; they are located in solr-6.0.0/server/solr/configsets and require no extra work.

A custom configset requires taking care of the paths and third-party libraries defined in the solrconfig.xml file. We also have to create or update the schema as necessary to map data from the source to a Solr document.

Let's assume we are creating a configset named solr_cloud_example by simply copying the contents of basic_configs. Additional libraries and schema changes can be applied before we actually start creating indexes.
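
A minimal sketch of that copy step, assuming a default Solr 6.0.0 layout (note that the bundled configsets ship with a managed-schema file rather than schema.xml):

cd <base_installation_dir>/solr-6.0.0/server/solr/configsets
cp -r basic_configs solr_cloud_example
# add any extra <lib> directives to solr_cloud_example/conf/solrconfig.xml
# and adjust the schema under solr_cloud_example/conf/ before indexing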

Uploading a configset to ZooKeeper

This step is relevant if we want to upload the configuration ahead of time instead of specifying the configuration to use in the "create" command, or if we are using the Collections API to issue a "create" command via the REST interface.


To upload the configset, we use zkcli.sh, which is in <BASE_INSTALL_DIR>/solr-6.0.0/server/scripts/cloud-scripts. Let's go to that directory and issue the following command:

./zkcli.sh -zkhost localhost:2181,localhost:2182,localhost:2183 -cmd upconfig -confname solr_cloud_example -confdir <base_installation_dir>/solr-6.0.0/server/solr/configsets/solr_cloud_example/conf

This uploads the config directory to the ZooKeeper cluster we set up earlier.
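
To check that the upload worked, one option is to pull the configuration back down from ZooKeeper into a scratch directory (the /tmp path below is just an arbitrary example) using the downconfig command of the same script:

./zkcli.sh -zkhost localhost:2181,localhost:2182,localhost:2183 -cmd downconfig -confname solr_cloud_example -confdir /tmp/solr_cloud_example_check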

Creating Solr Instances

Under the <base_dir>, create a directory <solr_cluster>, then download Solr 6 and place three copies of the installation inside it, so the directory structure looks like this -

<base_dir>/<solr_cluster>
|-- solr-6.0.0_1
|   |-- server
|       |-- solr
|           |-- configsets
|               |-- solr_cloud_example
|-- solr-6.0.0_2
|-- solr-6.0.0_3
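
Assuming the download is available as solr-6.0.0.tgz, the three copies can be created roughly like this (the archive path is just a placeholder):

mkdir -p <base_dir>/<solr_cluster>
cd <base_dir>/<solr_cluster>
tar xzf /path/to/solr-6.0.0.tgz && mv solr-6.0.0 solr-6.0.0_1
tar xzf /path/to/solr-6.0.0.tgz && mv solr-6.0.0 solr-6.0.0_2
tar xzf /path/to/solr-6.0.0.tgz && mv solr-6.0.0 solr-6.0.0_3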


Now we are ready with the setup.

Start SolrCloud

At this point we have everything ready. Before we start the Solr instances, make sure the ZooKeeper cluster is up and running.

Go to
<base_dir>/<solr_cluster>/solr-6.0.0_1 and execute
bin/solr start -cloud -p 8983 -z localhost:2181,localhost:2182,localhost:2183 -noprompt

Go to
<base_dir>/<solr_cluster>/solr-6.0.0_2 and execute
bin/solr start -cloud -p 8984 -z localhost:2181,localhost:2182,localhost:2183 -noprompt

Go to
<base_dir>/<solr_cluster>/solr-6.0.0_3 and execute
bin/solr start -cloud -p 8985 -z localhost:2181,localhost:2182,localhost:2183 -noprompt
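
To confirm that the nodes came up and registered with the ZooKeeper ensemble, bin/solr status (run from any of the three installation directories) lists the local Solr instances together with their cloud information:

bin/solr status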


Once all the instances are running, open
http://localhost:8983/solr/admin/collections?action=CREATE&name=test_solr_cloud&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=solr_cloud_example
to create a collection named test_solr_cloud that uses the solr_cloud_example configset.
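
Alternatively, the same collection can be created from the command line with the bin/solr helper (option names as in the Solr 6 bin/solr help; -n points at the configset already uploaded to ZooKeeper):

bin/solr create -c test_solr_cloud -n solr_cloud_example -shards 2 -replicationFactor 2 -p 8983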

Now go to http://localhost:8983/solr/#/~cloud and you will see the collection along with its shards and replicas distributed across the different nodes.
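
The same layout can also be inspected without the admin UI via the Collections API, for example:

http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=test_solr_cloud&wt=json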