Spark Solr Integration on Windows

Prerequisites:
  • Install ZooKeeper and Solr 5.3.x.
  • If you have Solr 4.x, you will need to upgrade your Solr 4.x indexed data to Solr 5.x.
( See the "Migrating Solr 4.X Index data to Solr 5.X index" section below for details )

1) Download the spark-solr sources from the following link and unzip them to a location, e.g. d:\spark-solr-master. Let's call this location SS_HOME:

set SS_HOME=d:\spark-solr-master

2) Download maven and set following env variables

set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;

3) Open a command prompt and make sure M2_HOME is set and added to PATH. Run:

cd %SS_HOME%
mvn install -DskipTests

This will build spark-solr-1.0-SNAPSHOT.jar in the %SS_HOME%\target folder.

4) Download the Spark Solr Java sample from here (you will need Eclipse with the 'm2eclipse' Maven plugin).

Import it into Eclipse using Import... > Existing Maven Project > Browse to the downloaded and unzipped example directory, e.g. D:\SolrWithSparks-master.

The sample needs JDK 1.8 for lambda expressions.

Open the pom.xml and add the following if it is not already specified:
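The snippet itself did not survive here; a minimal sketch of the usual Java 8 Maven compiler settings (an assumption, based on the JDK 1.8 requirement noted above):

```xml
<properties>
  <!-- compile the sample sources as Java 8 so lambda expressions work -->
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>
```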


If you are running Eclipse with JAVA_HOME pointing to a JRE path instead of a JDK path, you will get a "1.6 tools.jar missing" error. To resolve this error, restart Eclipse with the following changes in your eclipse_home\eclipse.ini file:
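The eclipse.ini changes themselves were not captured here; the usual fix is to point Eclipse at a JDK VM explicitly with a -vm entry (the JDK path below is an example taken from later in this article — use your own install path). The -vm entry must appear before -vmargs, with the path on its own line:

```
-vm
D:\apps\Java\jdk1.8.0_65\bin\javaw.exe
```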

See link below for additional info :

package de.blogspot.qaware.spark;

import com.lucidworks.spark.SolrRDD;
import de.blogspot.qaware.spark.common.ContextMaker;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import java.util.Arrays;

public class SparkSolrJobApp {

    private static final String ZOOKEEPER_HOST_AND_PORT = "zkhost:zkport";
    private static final String SOLR_COLLECTION = "collection1";
    private static final String QUERY_ALL = "*:*";

    public static void main(String[] args) throws Exception {
        String zkHost = ZOOKEEPER_HOST_AND_PORT;
        String collection = SOLR_COLLECTION;
        String queryStr = QUERY_ALL;

        JavaSparkContext javaSparkContext = ContextMaker.makeJavaSparkContext("Querying Solr");

        SolrRDD solrRDD = new SolrRDD(zkHost, collection);
        final SolrQuery solrQuery = SolrRDD.toQuery(queryStr);
        JavaRDD<SolrDocument> solrJavaRDD = solrRDD.query(javaSparkContext.sc(), solrQuery);

        JavaRDD<String> titleNumbers = solrJavaRDD.flatMap(doc -> {
            Object possibleTitle = doc.get("title");
            String title = possibleTitle != null ? possibleTitle.toString() : "";
            return Arrays.asList(title);
        }).filter(s -> !s.isEmpty());

        System.out.println("\n# of found titles: " + titleNumbers.count());

        // Now use schema information in Solr to build a queryable SchemaRDD
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Pro Tip: SolrRDD will figure out the schema if you don't supply a list of field names in your query
        DataFrame tweets = solrRDD.asTempTable(sqlContext, queryStr, "documents");

        // SQL can be run over RDDs that have been registered as tables.
        DataFrame results = sqlContext.sql("SELECT * FROM documents where id LIKE 'one%'");

        // The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
        // The columns of a row in the result can be accessed by ordinal.
        JavaRDD<Row> resultsRDD = results.javaRDD();

        System.out.println("\n\n# of documents where 'id' starts with 'one': " + resultsRDD.count());
    }
}

5) Build an assembly jar of your code:

Modify the pom.xml to build the assembly jar of your example code, i.e. a jar with all the dependencies included:
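The pom.xml snippet did not survive here; a sketch of the standard approach, using the maven-assembly-plugin with its built-in jar-with-dependencies descriptor (which matches the jar name used in step 6):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <!-- produces *-jar-with-dependencies.jar in the target folder -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>de.blogspot.qaware.spark.SparkSolrJobApp</mainClass>
      </manifest>
    </archive>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>single</goal></goals>
    </execution>
  </executions>
</plugin>
```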


Open a command prompt and make sure M2_HOME is set as indicated in step 2) and "%M2_HOME%\bin" is included in the PATH.

cd D:\SolrWithSparks-master
mvn clean
mvn package -DskipTests

This will build the jar of your code, with all its dependency classes included, in the "target" folder.

6) Run the Solr Spark example:

Before you run this program, you will need to start your ZooKeeper and SolrCloud shards.

Open a command prompt and run the following:

set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
set java_home=d:\apps\Java\jdk1.8.0_65
set jre_home=%java_home%\jre
set jdk_home=%JAVA_HOME%
set path=%java_home%\bin;%path%
set SPARK_HOME=D:\udu\hk\spark-1.5.1
call %SPARK_HOME%\bin\load-spark-env.cmd

Now that the Spark environment is set, run the following command to execute your Spark job:

%spark_home%\bin\spark-submit.cmd --class de.blogspot.qaware.spark.SparkSolrJobApp file://d:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Note that you will need to specify the path to your Spark job assembly jar starting with "file://" and use '/' as the path separator instead of '\'.
After the jar path you may specify any additional parameters required by your Spark job.

7) Troubleshooting :

If you get a connection timeout error, make sure that you have set the Spark master's IP address in your SPARK_HOME\conf\spark-env.cmd:

REM set ip address of spark master
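The actual `set` lines did not survive above; a sketch of what typically goes under that comment (replace the address with your machine's IP):

```
set SPARK_MASTER_IP=192.168.0.10
set SPARK_LOCAL_IP=192.168.0.10
```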

If you get an error of the following type:

Exception Details:
org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient; @62: areturn
Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, stack[0]) is not assignable to 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
Current Frame:

bci: @62
flags: { }
{ 'org/apache/solr/common/params/SolrParams', 'org/apache/http/conn/ClientConnectionManager', 'org/apache/solr/common/params/ModifiableSolrParams', 'org/apache/http/impl/client/DefaultHttpClient' }
{ 'org/apache/http/impl/client/DefaultHttpClient' }
0000000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
0000010: 0099 001e b200 05bb 0007 59b7 0008 1209
0000020: b600 0a2c b600 0bb6 000c b900 0d02 00bb
0000030: 0011 592b b700 124e 2d2c b800 102d b0
Stackmap Table:

Download the following jars (search online for them) and copy them to your %HADOOP_HOME%\share\hadoop\common\lib folder. This VerifyError typically means an older Apache HttpClient (one where DefaultHttpClient does not yet extend CloseableHttpClient) is on the classpath, so newer httpclient/httpcore versions (4.3+) are needed.


See here for details :

8) FAQ

How to build Solr queries in a Spark job:
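The linked answer is not reproduced here; as a sketch, queries are built with SolrJ's SolrQuery class before being handed to SolrRDD (the field names below are hypothetical):

```java
SolrQuery query = new SolrQuery("title:spark");   // main query string
query.addFilterQuery("type:article");             // hypothetical filter field
query.setFields("id", "title");                   // fetch only these fields
query.addSort("id", SolrQuery.ORDER.asc);         // deterministic order for paging
query.setRows(100);                               // page size
```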

How to sort a JavaRDD:
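One way (a sketch, assuming an existing JavaSparkContext named `jsc`) is JavaRDD.sortBy, which takes a key-extractor function, an ascending flag, and a partition count:

```java
JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(3, 1, 2));
// identity key, ascending order, one partition
JavaRDD<Integer> sorted = numbers.sortBy(x -> x, true, 1);
// sorted.collect() returns [1, 2, 3]
```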

How to invoke a Spark job from a Java program:
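The link did not survive; one option is Spark's SparkLauncher API (available since Spark 1.4). The paths below reuse the ones from earlier in this article and will differ on your machine:

```java
import org.apache.spark.launcher.SparkLauncher;

public class JobLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                .setSparkHome("D:/udu/hk/spark-1.5.1")
                .setAppResource("file:///D:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar")
                .setMainClass("de.blogspot.qaware.spark.SparkSolrJobApp")
                .setMaster("local[2]")
                .launch();        // starts spark-submit as a child process
        spark.waitFor();          // block until the job finishes
    }
}
```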

Migrating Solr 4.X Index data to Solr 5.X index

To upgrade your Solr 4.x indexed data to Solr 5.x, run the following command:

java -cp D:\solr-5.3.1\server\solr-webapp\webapp\WEB-INF\lib\* org.apache.lucene.index.IndexUpgrader D:\solr-4.4.0\example\solr\collection1\data\index

Here it is assumed that "D:\solr-4.4.0\example\solr\collection1\data\index" is the directory containing the indexed data that you want to upgrade to Solr 5.3.x.

After this you can copy your Solr 4.4.x collection / core directory (e.g. D:\solr-4.4.0\example\solr\collection1 in the above command) to the Solr 5.3.x home directory.

After copying it to the Solr 5.3.x home directory, you will need to make a few changes in the schema.xml and solrconfig.xml of your collection:

In your collection\conf\solrconfig.xml, comment out the following:

In your collection\conf\schema.xml, change the following field type values to include the "Trie" prefix:
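The field type list itself was not captured; the change has this shape — Solr 5.x dropped the legacy numeric field classes, so each gains a "Trie" prefix (the names below are common defaults; adjust to your schema):

```xml
<!-- Solr 4.x legacy types -->
<fieldType name="int"  class="solr.IntField"/>
<fieldType name="long" class="solr.LongField"/>
<!-- Solr 5.x equivalents -->
<fieldType name="int"  class="solr.TrieIntField"  precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
```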

Setup Apache Spark On Windows

Download Spark and unzip it, e.g.

Spark needs Hadoop jars. Download the Hadoop binaries for Windows (Hadoop 2.6.0) from

Unzip Hadoop at some location, e.g.


If your JAVA_HOME or HADOOP_HOME path contains space characters (' '), you will need to convert the paths to their short (8.3) form:

  • Create a batch script (e.g. shortpath.cmd) with the following contents

echo %~s1

  • Run the batch script with your JAVA_HOME directory as its argument to get the short path for JAVA_HOME
  • Run the batch script with your HADOOP_HOME directory as its argument to get the short path for HADOOP_HOME

set java_home=short path obtained from the above command
set hadoop_home=short path obtained from the above command

Run the following command and copy the classpath it generates for the next step:

         %HADOOP_HOME%\bin\hadoop classpath

Under spark_home\conf, create a file named "spark-env.cmd" like the one below:

@echo off
set HADOOP_HOME=D:\Utils\hadoop-2.7.1
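The classpath copied from the previous step is typically wired in here as well, via SPARK_DIST_CLASSPATH, so Spark can find the Hadoop classes (a sketch — paste your own output):

```
set SPARK_DIST_CLASSPATH=<paste the output of "%HADOOP_HOME%\bin\hadoop classpath" here>
```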

On the command prompt:

          cd %spark_home%\bin
          set SPARK_CONF_DIR=%SPARK_HOME%\conf
          spark-shell.cmd //To start spark shell
          spark-submit.cmd   //To submit spark job

Refer to the link below to create a Spark word count example.

To run the Spark word count example (written in Java) from the above URL:

          spark-submit --class org.sparkexample.WordCount --master local[2]  your_spark_job_jar  Any_additional_parameters_needed_by_your_job_jar

References :