Thursday 31 August 2017

Solr In Production

This blog walks you through setting up SolrCloud with a Zookeeper ensemble, assuming you have fair knowledge of Apache Solr and Zookeeper.
It includes detailed steps for configuring SolrCloud with a Zookeeper ensemble in a production environment using AWS EC2 machines.

Using SolrCloud instead of a single standalone Solr server.

Why SolrCloud ?

Setting up Solr as a cluster of Solr servers (i.e. SolrCloud) gives you fault tolerance and high availability. Other features of SolrCloud:

  • Central configuration for the entire cluster
  • Automatic load balancing and fail-over for queries
  • Zookeeper integration for cluster co-ordination and configuration

Zookeeper 

Coordinating and managing a service in a distributed environment is a complicated process, and Zookeeper makes it easy. Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Why External Zookeeper ?

By default Solr comes bundled with Apache Zookeeper, but using this internal Zookeeper in production is discouraged, because shutting down a Solr instance also shuts down its Zookeeper server.
To handle this we should have an external Zookeeper ensemble (a cluster of Zookeeper servers) with more than half of its servers running at any given time.
Use the formula 2n+1 = number of Zookeeper servers in the cluster, where n is the number of failed servers to tolerate. For example, with n=2 we should have a total of 5 Zookeeper servers in the quorum (Zookeeper cluster); if any 2 out of the 5 servers go down, Zookeeper will still be up and running, because a majority of the servers in the quorum are active.
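The quorum arithmetic can be sanity-checked in the shell:

```shell
# To survive n simultaneous server failures, a Zookeeper ensemble
# needs 2n+1 servers in total (a strict majority must stay up).
n=2
echo $((2 * n + 1))   # prints 5
```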

So, let's start with the configuration steps.

I am using 6 AWS EC2 machines for this setup: 3 for SolrCloud and the remaining 3 for the Zookeeper ensemble. It is not mandatory to use 6 machines; it depends on your requirements and availability.

3 nodes for Zookeeper
private IP : 9.*.*.11  public IP : 5*.**2.1**.*1*
private IP : 9.*.*.12  public IP : 5*.**2.2**.*2*
private IP : 9.*.*.13  public IP : 5*.**2.3**.*3*

3 nodes for SolrCloud
private IP : 9.*.*.14  public IP : 5*.**2.4**.*4*
private IP : 9.*.*.15  public IP : 5*.**2.5**.*5*
private IP : 9.*.*.16  public IP : 5*.**2.6**.*6*

Update Java to the latest version on all 6 machines, because recent versions of Solr need Java 1.8.
  • sudo yum install java-1.8.0
Remove any older versions (change the command accordingly)
  • sudo yum remove java-1.7.0-openjdk
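A quick check that Java 8 is now the active version (a sketch; the exact version string varies by JDK build):

```shell
# Print the active Java version and confirm it is 1.8.x
if java -version 2>&1 | grep -q '"1\.8'; then
  echo "Java 8 is active"
else
  echo "Java 8 not found - check the installation"
fi
```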
Zookeeper setup :
Just to make sure all Zookeeper-related files go into the same folder, create one folder.
  • mkdir zookeeper
  • cd zookeeper
Download Zookeeper from web (change the version accordingly)
  • wget "http://www-eu.apache.org/dist/zookeeper/zookeeper-3.4.8/zookeeper-3.4.8.tar.gz"
Or you can copy it from your local machine (change the user and IP accordingly)
  • sudo scp /home/vinod/zk/zookeeper-3.4.8.tar.gz user@AWS-Public-IP:/home/user/zookeeper
Extract file
  • tar -xvf zookeeper-3.4.8.tar.gz
Create a directory where you would like Zookeeper to store its data
  • mkdir data
Copy the sample Zookeeper configuration file zoo_sample.cfg to zoo.cfg, because by default zk (Zookeeper) reads the zoo.cfg file for its configuration settings at startup.
  • cp zookeeper-3.4.8/conf/zoo_sample.cfg zookeeper-3.4.8/conf/zoo.cfg
Edit the default zoo.cfg file
  • vim zookeeper-3.4.8/conf/zoo.cfg
Update the zoo.cfg file as below and save it. (Change the username and IPs accordingly.)

tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/home/username/zookeeper/data
# the port at which the clients will connect
clientPort=2181

server.1=9.*.*.11:2888:3888
server.2=9.*.*.12:2888:3888
server.3=9.*.*.13:2888:3888


Inside the data folder, create a file named myid for the server-number mapping: the server.1 machine, i.e. 9.*.*.11, should have a myid file with the value 1 inside it.
  • cd data 
  • cat > myid 
Type 1, press Enter, then press Ctrl-D to save.
Follow the same steps on the other two machines. Note that the server.2 machine should have a myid file containing 2, and the server.3 machine a myid file containing 3.
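The myid step can also be sketched as a short script, run once per Zookeeper machine (the data path is the one used in zoo.cfg above; MYID must match that machine's server.N line):

```shell
# Write this machine's server number into the myid file
MYID=1                                  # server.1 machine -> 1, server.2 -> 2, server.3 -> 3
DATA_DIR="$HOME/zookeeper/data"         # must match dataDir in zoo.cfg
mkdir -p "$DATA_DIR"
echo "$MYID" > "$DATA_DIR/myid"
cat "$DATA_DIR/myid"                    # prints 1
```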

Once all this configuration is done on all 3 machines, we are all set to start Zookeeper.
  • cd /home/username/zookeeper/zookeeper-3.4.8
  • bin/zkServer.sh start
Once Zookeeper has started on all 3 machines, check that all 3 machines are connected.
Use zkCli.sh from any of the 3 connected machines to perform simple, file-like operations. Enter the command below; you should see something like [zk: ip:2181(CONNECTED)] in the console.
  • bin/zkCli.sh -server 9.*.*.11:2181,9.*.*.12:2181,9.*.*.13:2181
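Each node's role can also be checked with Zookeeper's four-letter stat command (a sketch assuming nc is available; a healthy 3-node ensemble reports exactly one leader and two followers):

```shell
# Ask every ensemble member whether it is the leader or a follower
for host in 9.*.*.11 9.*.*.12 9.*.*.13; do
  echo "$host: $(echo stat | nc "$host" 2181 | grep '^Mode:')"
done
```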
Check the Apache Zookeeper documentation for the available Zookeeper operations and more details.

SolrCloud setup :
Make sure the Java version is up to date. If not, follow the steps mentioned before the Zookeeper setup.
  • mkdir SolrCloud
  • cd SolrCloud
Download Solr from web (change the version accordingly)
  • wget "http://www.us.apache.org/dist/lucene/solr/6.1.0/solr-6.1.0.tgz"
Or you can copy it from your local machine (change the user and IP accordingly)
  • sudo scp /home/user/solr/solr-6.1.0.tgz user@AWS-Public-IP:/home/user/SolrCloud
Create a Solr home directory for the cloud setup. You can choose any location; here I am creating solr-6.1.0/server/solr/node1/solr
  • cd  solr-6.1.0/server/solr
  • mkdir node1
  • cd node1
  • mkdir solr
Copy zoo.cfg & solr.xml from solr-6.1.0/server/solr to solr-6.1.0/server/solr/node1/solr/ (run these from the SolrCloud directory):
  • cp solr-6.1.0/server/solr/zoo.cfg solr-6.1.0/server/solr/node1/solr/
  • cp solr-6.1.0/server/solr/solr.xml solr-6.1.0/server/solr/node1/solr/
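Together, the layout steps above amount to the following (demonstrated here in a scratch directory with stand-in files, since the real commands run inside the solr-6.1.0 tree):

```shell
# Create the per-node Solr home and seed it with zoo.cfg and solr.xml
cd "$(mktemp -d)"                                # scratch stand-in for the solr-6.1.0 directory
mkdir -p server/solr
touch server/solr/zoo.cfg server/solr/solr.xml   # stand-ins for the stock files
mkdir -p server/solr/node1/solr
cp server/solr/zoo.cfg server/solr/solr.xml server/solr/node1/solr/
ls server/solr/node1/solr                        # lists solr.xml and zoo.cfg
```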
Now we can start Solr in cloud mode. Follow the same steps on the other two machines, then start Solr on all 3 machines in cloud mode, pointing to the external Zookeeper ensemble running on the other 3 machines via its IPs.
  • bin/solr start -cloud -s server/solr/node1/solr -p 8983 -z 9.*.*.11:2181,9.*.*.12:2181,9.*.*.13:2181
Now we have 3 node SolrCloud and 3 node Zookeeper running.
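The cluster can be sanity-checked from any Solr node with the Collections API's CLUSTERSTATUS action (a sketch assuming curl; localhost stands in for a Solr node):

```shell
# "live_nodes" in the JSON response should list all three Solr nodes
curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" \
  || echo "Solr is not reachable from this machine"
```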
Let's create a Solr collection. Before that, we need to upload the Solr configuration files (schema.xml, solrconfig.xml, etc.) to Zookeeper.
By default Solr comes with sample config files (found inside server/solr/configsets) and scripts (under server/scripts) which help upload these to the Zookeeper nodes. From any Solr machine:
  • server/scripts/cloud-scripts/zkcli.sh -zkhost 9.*.*.11:2181 -cmd upconfig -confname solr_configs -confdir server/solr/configsets/sample_techproducts_configs/conf
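To confirm the upload, the files can be listed straight from Zookeeper with zkCli.sh (from the zookeeper-3.4.8 directory on any Zookeeper machine):

```shell
# The uploaded config set should appear under /configs/solr_configs
bin/zkCli.sh -server 9.*.*.11:2181 ls /configs/solr_configs \
  || echo "run this from the zookeeper-3.4.8 directory on a Zookeeper machine"
```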
Create a collection using the configs uploaded to Zookeeper.
Example:
  • http://5*.**2.4**.*4*:8983/solr/admin/collections?
    action=CREATE&name=collection1&numShards=2&replicationFactor=2&collection.configName=solr_configs 
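The same CREATE call can be made with curl from any Solr node (a sketch; localhost stands in for a Solr node). Note that numShards=2 with replicationFactor=2 means 4 cores, so on 3 nodes the default maxShardsPerNode of 1 may need to be raised:

```shell
# Create collection1 with 2 shards, each with 2 replicas (4 cores on 3 nodes)
curl -s "http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=2&collection.configName=solr_configs&maxShardsPerNode=2" \
  || echo "Solr is not reachable from this machine"
```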


Hope this helps,
Vinod

4 comments:

  1. Hi, nice one. I need your kind response: I want to store date-wise data in each shard, e.g. 1st Oct in shard1, 2nd Oct in shard2, 3rd Oct in shard3, and so on. Please help me with how I can do this explicitly, with minimum human intervention.
    kind regards
    Muhammad Rehman KAHLOON
    yoursrehman@gmail.com

    1. Hi, thanks.

      There is something called Document Routing, where you can indicate to which
      shard your document should go. There are 2 types of document routing.
      1. compositeId (default)
      Use the id of the document and prefix it with some name, say windows! or a
      shard name (it need not be the same as a shard name), but keep the prefix
      consistent.
      ex: "id" : "windows!568". Here Solr indexes into one of the available shards,
      and when a doc with a similar id prefix comes in later for indexing
      ("id" : "windows!998"), it routes it to the shard where it indexed the earlier
      doc(s) with the same prefix, like windows.

      > we can even send queries to specific shards using the _route_ param
      ex : q=solr&_route_=windows!
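      For example, a rough sketch with curl (collection1 and localhost are placeholders):

```shell
# Index two docs sharing the "windows" route prefix, then query only that shard
curl -s 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"windows!568"},{"id":"windows!998"}]' \
  || echo "Solr is not reachable from this machine"
curl -s 'http://localhost:8983/solr/collection1/select?q=solr&_route_=windows!' \
  || echo "Solr is not reachable from this machine"
```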

      2. Implicit
      In this case we additionally have to define a router.field parameter in every
      document, so that Solr knows which shard the doc should go to. If the required
      router.field is missing in any doc, it will be rejected when we try to index it.

      Hope this helps. Sorry for the late reply.

      Best,
      Vinod
