Spark Solr Integration on Windows

Prerequisites:
  • Install ZooKeeper and Solr 5.3.x.
  • If you have Solr 4.x, you will need to upgrade your Solr 4.x indexed data to Solr 5.x
    (see http://prasi82.blogspot.com/2015/11/migrating-solr-4x-index-data-to-solr-5x.html for details).


1) Download the spark-solr sources from the following link and unzip them to a location, e.g. d:\spark-solr-master. Let's call this location SS_HOME.

https://github.com/LucidWorks/spark-solr

set SS_HOME=d:\spark-solr-master

2) Download Maven and set the following environment variables:

set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;

3) Open a command prompt, make sure M2_HOME is set and %M2_HOME%\bin is added to the PATH, then run:

cd %SS_HOME%
mvn install -DskipTests

This will build spark-solr-1.0-SNAPSHOT.jar (the exact version may differ depending on the checked-out sources) in the %SS_HOME%\target folder.


4) Download the Spark-Solr Java sample from the link below (you will need Eclipse with the Maven 'm2eclipse' plugin):

https://github.com/freissmann/SolrWithSparks

Import it into Eclipse using Import... > Existing Maven Projects > Browse, and point it to the downloaded and unzipped example directory, e.g. D:\SolrWithSparks-master.

The SparkSolrJobApp.java sample needs JDK 1.8 for lambda expressions.

Open the pom.xml and add Java 1.8 compiler settings if not already specified.
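For example, with the maven-compiler-plugin (a sketch; the plugin version and the rest of your build section may differ in your project):

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.3</version> <!-- version is illustrative -->
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>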

 
 

If you are running Eclipse with JAVA_HOME pointing to a JRE path instead of a JDK path, you will get a "Missing artifact jdk.tools:jdk.tools:jar:1.6" (tools.jar) error. To resolve this error, restart Eclipse after making the following change in your eclipse_home\eclipse.ini file:
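For example, add a -vm entry pointing at your JDK, on two separate lines and placed before the -vmargs line (the path below is illustrative; use your own JDK location):

-vm
D:\apps\Java\jdk1.8.0_65\bin\javaw.exe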


See the link below for additional info:
http://stackoverflow.com/questions/11118070/buiding-hadoop-with-eclipse-maven-missing-artifact-jdk-toolsjdk-toolsjar1
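
For reference, here is the SparkSolrJobApp.java sample: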


package de.blogspot.qaware.spark;

import com.lucidworks.spark.SolrRDD;
import de.blogspot.qaware.spark.common.ContextMaker;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import java.util.Arrays;

public class SparkSolrJobApp {

    private static final String ZOOKEEPER_HOST_AND_PORT = "zkhost:zkport";
    private static final String SOLR_COLLECTION = "collection1";
    private static final String QUERY_ALL = "*:*";

    public static void main(String[] args) throws Exception {
        String zkHost = ZOOKEEPER_HOST_AND_PORT;
        String collection = SOLR_COLLECTION;
        String queryStr = QUERY_ALL;

        JavaSparkContext javaSparkContext = ContextMaker.makeJavaSparkContext("Querying Solr");

        SolrRDD solrRDD = new SolrRDD(zkHost, collection);
        final SolrQuery solrQuery = SolrRDD.toQuery(queryStr);
        JavaRDD<SolrDocument> solrJavaRDD = solrRDD.query(javaSparkContext.sc(), solrQuery);

        JavaRDD<String> titleNumbers = solrJavaRDD.flatMap(doc -> {
            Object possibleTitle = doc.get("title");
            String title = possibleTitle != null ? possibleTitle.toString() : "";
            return Arrays.asList(title);
        }).filter(s -> !s.isEmpty());

        System.out.println("\n# of found titles: " + titleNumbers.count());

        // Now use schema information in Solr to build a queryable DataFrame
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Pro Tip: SolrRDD will figure out the schema if you don't supply a list of field names in your query
        DataFrame tweets = solrRDD.asTempTable(sqlContext, queryStr, "documents");

        // SQL can be run over RDDs that have been registered as tables.
        DataFrame results = sqlContext.sql("SELECT * FROM documents where id LIKE 'one%'");

        // The results of SQL queries are DataFrames and support all the normal RDD operations.
        // The columns of a row in the result can be accessed by ordinal.
        JavaRDD<Row> resultsRDD = results.javaRDD();

        System.out.println("\n\n# of documents where 'id' starts with 'one': " + resultsRDD.count());

        javaSparkContext.stop();
    }
}


5) Build the assembly jar of your code:

Modify pom.xml to build the assembly jar of your example code, i.e. a jar with all of its dependencies included.
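One common way to do this is the maven-assembly-plugin with the jar-with-dependencies descriptor, added under <build><plugins> (a sketch; your project may use the maven-shade-plugin instead, and details may differ):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <!-- produces <artifactId>-<version>-jar-with-dependencies.jar in the target folder -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>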


   

Open a command prompt, make sure M2_HOME is set as indicated in step 2) and "%M2_HOME%\bin" is included in the PATH.

cd D:\SolrWithSparks-master
mvn clean
mvn package -DskipTests

This will build, in the "target" folder, a jar of your code with all of its dependency classes included.

6) Run the Spark-Solr example:

Before you run this program, you will need to start ZooKeeper and your SolrCloud shards.
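For example (a sketch, assuming %ZK_HOME% and %SOLR_HOME% point to your ZooKeeper and Solr 5.3.x installs; the ZooKeeper address and Solr port are illustrative):

REM start zookeeper
cd %ZK_HOME%
bin\zkServer.cmd

REM start a SolrCloud node pointing at that zookeeper
cd %SOLR_HOME%
bin\solr.cmd start -c -z localhost:2181 -p 8983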

Then go to a command prompt and run the following:

set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
set java_home=d:\apps\Java\jdk1.8.0_65
set jre_home=%java_home%\jre
set jdk_home=%JAVA_HOME%
set path=%java_home%\bin;%path%
set SPARK_HOME=D:\udu\hk\spark-1.5.1
set SPARK_CONF_DIR=%SPARK_HOME%\conf
call %SPARK_HOME%\bin\load-spark-env.cmd

Now that the Spark environment is set, run the following command to execute your Spark job:

%spark_home%\bin\spark-submit.cmd --class de.blogspot.qaware.spark.SparkSolrJobApp file://d:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Note that you will need to specify the path to your Spark job assembly jar starting with "file://" and use '/' as the path separator instead of '\'.
After the jar path you may specify any additional parameters required by your Spark job.

7) Troubleshooting :


If you get a connection timed out error, make sure that SPARK_LOCAL_IP is set in your %SPARK_HOME%\conf\spark-env.cmd:

set SPARK_LOCAL_IP=127.0.0.1
REM set ip address of spark master
REM set SPARK_MASTER_IP=127.0.0.1


If you get an error of the following type:

Exception Details:
Location:
org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient; @62: areturn
Reason:
Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, stack[0]) is not assignable to 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
Current Frame:

bci: @62
flags: { }
locals:
{ 'org/apache/solr/common/params/SolrParams', 'org/apache/http/conn/ClientConnectionManager', 'org/apache/solr/common/params/ModifiableSolrParams', 'org/apache/http/impl/client/DefaultHttpClient' }
stack:
{ 'org/apache/http/impl/client/DefaultHttpClient' }
Bytecode:
0000000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
0000010: 0099 001e b200 05bb 0007 59b7 0008 1209
0000020: b600 0a2c b600 0bb6 000c b900 0d02 00bb
0000030: 0011 592b b700 124e 2d2c b800 102d b0
Stackmap Table:
append_frame(@47,Object127)

Download the following jars (just Google for them) and copy them to your %HADOOP_HOME%\share\hadoop\common\lib folder.

httpclient-4.4.1.jar
httpcore-4.4.1.jar

See here for details : https://issues.apache.org/jira/browse/SOLR-7948

8) FAQ

How to build Solr queries in a Spark job:

https://svn.apache.org/repos/asf/lucene/solr/tags/release-1.3.0/client/java/solrj/test/org/apache/solr/client/solrj/SolrExampleTests.java
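
A short sketch using the SolrJ SolrQuery API, reusing solrRDD and javaSparkContext from the sample above (the field names are illustrative; see the linked SolrExampleTests for many more options):

SolrQuery query = new SolrQuery("title:spark");      // main query
query.addFilterQuery("type:article");                // filter query (field is illustrative)
query.setFields("id", "title");                      // fields to return
query.addSort("id", SolrQuery.ORDER.asc);            // sort order
query.setRows(100);                                  // page size
JavaRDD<SolrDocument> docs = solrRDD.query(javaSparkContext.sc(), query);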


How to sort a JavaRDD:

http://stackoverflow.com/questions/27151943/sortby-in-javardd
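
A minimal sketch using JavaRDD.sortBy, reusing the titleNumbers RDD from the sample above:

// sortBy(keyFunction, ascending, numPartitions): sort titles alphabetically
JavaRDD<String> sortedTitles = titleNumbers.sortBy(title -> title, true, 1);
sortedTitles.take(10).forEach(System.out::println);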


How to invoke a Spark job from a Java program:

https://github.com/spark-jobserver/spark-jobserver

http://apache-spark-user-list.1001560.n3.nabble.com/Programatically-running-of-the-Spark-Jobs-td13426.html
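
Besides the job-server approach in the links above, Spark 1.4+ also ships a SparkLauncher API that can submit a job from plain Java. A minimal sketch (the Spark home, jar path and class name are taken from the steps above; the master setting is an assumption and may differ in your setup):

import org.apache.spark.launcher.SparkLauncher;

public class SubmitFromJava {

    public static void main(String[] args) throws Exception {
        // Launches spark-submit as a child process with the given settings.
        Process spark = new SparkLauncher()
                .setSparkHome("D:\\udu\\hk\\spark-1.5.1")
                .setAppResource("d:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar")
                .setMainClass("de.blogspot.qaware.spark.SparkSolrJobApp")
                .setMaster("local[*]")   // assumption; point this at your cluster if needed
                .launch();
        spark.waitFor();                 // wait for the submitted job to finish
    }
}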