Wednesday 16 December 2015

Solr Master - Slave Configuration with DataImportHandler & Scheduling

In this post we will see how we can set up Solr master-slave replication.


For simplicity let's assume that we have two nodes, Node1 and Node2. Node1 is the master node and Node2 is the slave node.

1. Install solr-5.3.1 on both Node1 (master) and Node2 (slave).
2. Create a Solr core on both Node1 and Node2 using the command:

    $> bin/solr create [-c name] [-d confdir] [-n configName] [-shards #] [-replicationFactor #] [-p port]

Let's assume the name of the core is test_core.
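For example, on each node the core can be created with the defaults (a minimal sketch; adjust the port or config directory if your setup differs):

    $> bin/solr create -c test_core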

So on both instances, if we go to ${SOLR_HOME}/server/solr we will see test_core, which contains a conf directory, a core.properties file and a data directory.

Now let's start with the master-slave configuration -

Master Setup 

If we navigate to the conf directory within the test_core directory under server/solr we will see the solrconfig.xml file.

Edit the file and add

<requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
         <str name="enable">${master.replication.enabled:false}</str>
         <str name="replicateAfter">commit</str>
         <str name="replicateAfter">optimize</str>
        <str name="replicateAfter">startup</str>
    </lst>

</requestHandler>

Then add master.replication.enabled=true to the core.properties file of the core (located under ${SOLR_HOME}/server/solr/test_core).
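After this change the master's core.properties would look roughly like this (the name entry is whatever bin/solr create generated):

name=test_core
master.replication.enabled=true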


Slave Setup

If we navigate to the conf directory within the test_core directory under server/solr we will see the solrconfig.xml file.

Edit the file and add

<requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
        <str name="enable">${slave.replication.enabled:false}</str>
        <str name="masterUrl">http://${masterserver}/solr/${solr.core.name}/replication</str>
        <str name="pollInterval">00:05:00</str>
    </lst>
</requestHandler>

Then add

slave.replication.enabled=true
masterserver=52.33.134.44:8983
solr.core.name=<core_name>    (in our example, test_core)

to the core.properties file of the core (located under ${SOLR_HOME}/server/solr/test_core).
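So the slave's core.properties would end up looking roughly like this:

name=test_core
slave.replication.enabled=true
masterserver=52.33.134.44:8983
solr.core.name=test_core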


That's it, we are done with the master-slave configuration.
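To verify the setup, the ReplicationHandler's details command can be queried on both nodes (a quick check, assuming the default port 8983):

curl "http://<master_host>:8983/solr/test_core/replication?command=details"
curl "http://<slave_host>:8983/solr/test_core/replication?command=details"

The slave's response should show the master URL it polls and the current replication status.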

DataImportHandler

Using the Solr DataImportHandler we can build indexes in Solr directly from data stores like MySQL, Oracle, PostgreSQL, etc.

Let's continue with the previous example and configure a data import handler.
1. Edit the solrconfig.xml file under the conf directory of your core and add -

<requestHandler name="/dataimport"                           class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
      <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

2. Create a data-config.xml file within the conf directory with the following content -

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="" user="" password=""/>
    <document name="">
        <entity name="" query=""
                deltaQuery="<some_date_condition> &gt; '${recommendation.last_index_time}';">
            <field column="" name="" />
            ...
            <field column="allcash_total_annualized_return_growth" name="Allcash_total_annualized_return_growth" />
        </entity>
    </document>
</dataConfig>
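As a concrete, purely hypothetical illustration (the database URL, credentials, table and column names below are assumptions, and the standard ${dataimporter.last_index_time} / ${dataimporter.delta.*} variables are used for the delta handling):

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/testdb" user="solr" password="solr"/>
    <document>
        <entity name="product"
                query="SELECT productid, productname, price, category FROM product"
                deltaQuery="SELECT productid FROM product WHERE last_modified &gt; '${dataimporter.last_index_time}'"
                deltaImportQuery="SELECT productid, productname, price, category FROM product WHERE productid = '${dataimporter.delta.productid}'">
            <field column="productid" name="id" />
            <field column="productname" name="productname" />
            <field column="price" name="price" />
            <field column="category" name="category" />
        </entity>
    </document>
</dataConfig>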

3. Create the corresponding field mappings in the managed-schema file for index creation, as sketched below.
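Continuing the hypothetical product example, the remaining field entries in managed-schema might look like this (the id field is usually already defined in the default schema; the names and types below are assumptions and must mirror whatever data-config.xml maps to):

<field name="productname" type="string" indexed="true" stored="true" />
<field name="price" type="float" indexed="true" stored="true" />
<field name="category" type="string" indexed="true" stored="true" />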

4. Make sure the jar file containing the driver class is available in the lib directory (or any other directory) and that it is referenced in the solrconfig.xml file, like:

<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
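For this setup that typically means the DataImportHandler jar shipped with Solr plus the MySQL connector jar; a sketch of the corresponding directives (the ./lib path for the connector is an assumption, point it at wherever you actually placed the jar):

<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<lib dir="./lib" regex="mysql-connector-java-.*\.jar" />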

We are done with DataImportHandler configuration.
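Before adding any scheduling, the handler can be exercised manually, for example (assuming Solr runs locally on the default port):

curl "http://localhost:8983/solr/test_core/dataimport?command=full-import"
curl "http://localhost:8983/solr/test_core/dataimport?command=delta-import&clean=false&commit=true"
curl "http://localhost:8983/solr/test_core/dataimport?command=status"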

Scheduling: 

Solr does not support scheduling of delta imports out of the box.
1. Clone either of

   https://github.com/badalb/solr-data-import-scheduler.git
   https://github.com/mbonaci/solr-data-import-scheduler.git

2. Build a jar file from the cloned project and put that jar in the ${SOLR_HOME}/server/solr-webapp/webapp/WEB-INF/lib directory.

3. Make sure, regardless of whether you run a single-core or multi-core Solr, that you create dataimport.properties in your solr.home/conf directory (NOT solr.home/core/conf) with content like the following:

 #  to sync or not to sync
#  1 - active; anything else - inactive
syncEnabled=1

#  which cores to schedule
#  in a multi-core environment you can decide which cores you want synchronized
#  leave empty or comment it out if using single-core deployment
syncCores=coreHr,coreEn

#  solr server name or IP address
#  [defaults to localhost if empty]
server=localhost

#  solr server port
#  [defaults to 80 if empty]
port=8080

#  application name/context
#  [defaults to current ServletContextListener's context (app) name]
webapp=solrTest_WEB

#  URL params [mandatory]
#  remainder of URL
params=/select?qt=/dataimport&command=delta-import&clean=false&commit=true

#  schedule interval
#  number of minutes between two runs
#  [defaults to 30 if empty]
interval=10
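The values above are the sample values that ship with the scheduler; for the single-core setup in this post, a sketch of what they might look like (assuming the default Solr 5.3.1 port 8983 and the /solr context) is:

syncEnabled=1
syncCores=test_core
server=localhost
port=8983
webapp=solr
params=/select?qt=/dataimport&command=delta-import&clean=false&commit=true
interval=10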

4. Add the application listener to the web.xml of the Solr web app (${SOLR_HOME}/server/solr-webapp/webapp/WEB-INF/web.xml):

<listener>
  <listener-class>org.apache.solr.handler.dataimport.scheduler.ApplicationListener</listener-class>
</listener>

Restart Solr so that changes are reflected.

Happy searching .....

Tuesday 15 December 2015

Integrating Tableau Desktop with Spark SQL

In this post we will see how we can integrate Tableau Desktop with Spark SQL. Tableau’s integration with Spark brings tremendous value to the Spark community – we can visually analyse data without writing a single line of Spark SQL code. That’s a big deal because creating a visual interface to our data expands the Spark technology beyond data scientists and data engineers to all business users. The Spark connector takes advantage of Tableau’s flexible connection architecture that gives customers the option to connect live and issue interactive queries, or use Tableau’s fast in-memory database engine.

Software requirements :-

We will be using the following software to do the integration -
1. Tableau Desktop 9.2.0
2. Hive 1.2.1
3. Spark 1.5.2 for Hadoop 2.6.0

We can skip Hive and work directly with Spark SQL, but for this example we will use Hive, import the Hive tables into Spark SQL and integrate them with Tableau Desktop.

Hive Setup :-

1. Download and install Hive 1.2.1.
2. Download and copy the MySQL connector jar file to the ${HIVE_HOME}/lib directory so Hive will use a MySQL metastore.
3. Start Hive ${HIVE_HOME}/bin $./hive
4. Create a table and insert some data into it, for example:

create table product(productid INT, productname STRING, price FLOAT, category STRING) ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ',';

INSERT INTO TABLE product VALUES(1,'Book',25,'Stationery');
INSERT INTO TABLE product VALUES(2,'Pens',10,'Stationery');
INSERT INTO TABLE product VALUES(3,'Sugar',40.05,'Household Item');
INSERT INTO TABLE product VALUES(4,'Furniture',1200,'Interiors');
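A quick sanity check from the Hive shell:

SELECT * FROM product;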

Hive setup is complete now.

Spark Setup :-

1. Download and extract Spark 1.5.2 for Hadoop 2.6.0
2. Copy hive-site.xml from ${HIVE_HOME}/conf directory to ${SPARK_HOME}/conf directory
3. In the copied hive-site.xml, remove the time-unit suffix from time values (e.g. change 0s to 0 and <xyz>ms to <xyz>), otherwise Spark may throw a NumberFormatException.
4. Define the Spark master IP by adding export SPARK_MASTER_IP=<host_ip_addr> to the spark-env.sh file located in the ${SPARK_HOME}/conf directory (without this the Thrift server will not work).

5. Start spark master and slave
  1. ${SPARK_HOME}/sbin $./start-master.sh 
  2. ${SPARK_HOME}/sbin $./start-slaves.sh 
6. Go to http://localhost:8080/ and check that the worker has started

Now it is time to start the Thrift server -

7. ${SPARK_HOME}/sbin $ ./start-thriftserver.sh --master spark://<spark_host_ip>:<port> --driver-class-path ../lib/mysql-connector-java-5.1.34.jar --hiveconf hive.server2.thrift.bind.host=localhost --hiveconf hive.server2.thrift.port=10001

This will start the Thrift server on port 10001.
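Before moving to Tableau, the Thrift server can be checked with Beeline, which ships with Spark (the user name here is arbitrary):

${SPARK_HOME}/bin $ ./beeline -u jdbc:hive2://localhost:10001 -n <user>

Running show tables; in the Beeline session should list the product table imported from Hive.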


8. Go to http://localhost:8080/ and check that the Spark SQL application has started

Now go to Tableau Desktop
  1. Select Spark SQL.
  2. Enter the host as localhost and the Thrift server port from step 7, here 10001.
  3. Select the type as SparkThriftServer and Authentication as User Name.
  4. Keep the rest of the fields empty and click on OK.
You are done!!! Happy report building using Tableau-Spark.