Solr: Exporting an Index To an External File


For a change of pace we are going to look at content flow from a different direction. Instead of importing content we are going to export it. Why would we do that? A few reasons:

  • Having the content in Solr means that we can pre-process the fields during ingestion and export the changes for use in other venues (reports, backups, re-import into databases, etc.)
  • Sometimes you just need to have more than one kind of backup
  • Sometimes you feel like a nut

How would you do that (no, not feel like a nut)? It’s actually pretty straightforward and can be done with any tool that can send a URL and capture the response. Java is usually my poison of choice, but in this case I decided on a simpler solution: curl + bash (yes, this means this is a Unix-specific solution unless you run Cygwin or some equivalent package on Windows).

The Short Version

  • Determine the query that will return the subset of the index that you need. Decide on:
    • keyword
    • desired fields
    • how many items per file (or some really large number if you want them all in one gulp/file)
    • output format (XML, JSON, or CSV)
  • Execute a curl statement to export the entire subset as one really really really big file or
  • Execute multiple curl statements to export the entire subset into multiple files

The Long Version

Determine the query that will return the subset of the index that you need

Keyword: This one is easy. Any keyword will do, and if you want the entire collection then use *:*. Remember that curl will not encode the URL for you, so make sure that you encode your query (for example, spaces become either %20 or +; for *:* use *%3A*).

For this example we will use:

q=*%3A*
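If you would rather not encode queries by hand, curl can do it for you: with -G, any --data-urlencode parameters are encoded and appended to the URL as a GET query string. A minimal sketch (host, port, and collection name are examples and will differ for your installation):

# let curl handle the percent-encoding: -G turns the encoded
# --data-urlencode pairs into GET query parameters
curl -G "http://localhost:8888/solr/collection1/select" \
     --data-urlencode "q=*:*" \
     --data-urlencode "fl=id,title"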

Desired Fields: Do you want 1, 2, or more fields returned in the result, or just go hog wild and get them all? We’ll use the fl parameter and return just the id and title, but if you want them all just use *.

fl=id,title

How many items per file: if you want them all in one search result then I suggest either looking up the total number of items or setting a really large number (like a 1 followed by a bazillion zeroes…or perhaps 2 billion…not 2 billion zeroes, just the number 2 billion; conveniently, Solr parses rows as a 32-bit integer, so a little over 2.1 billion is the ceiling anyway). If you want the result to go into multiple files then decide how many you want per file and get ready to have a file population explosion.

rows=2000000000

Output Format: Solr will output the search result as XML, JSON, or CSV. I like XML, but it seems to have fallen out of favor with the cool kids. Setting the wt parameter to xml, json, or csv will do the trick. I added indent=true for legibility.

wt=xml&indent=true

Execute a curl statement to export the entire subset as one really really really big file or…

So you’re feeling really ambitious and you’ve got the disk space to prove it. The only thing you need to do then is pass the correct URL to curl and you are all set. For example (and the particulars will be different for your installation):

curl "http://localhost:8888/solr/collection1/select?q=*%3A*&wt=xml&indent=true&start=0&rows=2000000000&fl=id,title" > full-output-of-my-solr-index.xml

Execute multiple curl statements to export the entire subset into multiple files

So maybe you have the disk space, but the output file would be too big. Or you would rather stay organized and keep each file to a reasonable size. Call curl multiple times, changing the start parameter appropriately, and send the output to files with slightly different names. For example:

curl "http://localhost:8888/solr/collection1/select?q=*%3A*&wt=$xml&indent=true&start=0&rows=5000&fl=id,title" > partial-output-of-my-solr-index-1.xml
curl "http://localhost:8888/solr/collection1/select?q=*%3A*&wt=$xml&indent=true&start=5000&rows=5000&fl=id,title" > partial-output-of-my-solr-index-2.xml
curl "http://localhost:8888/solr/collection1/select?q=*%3A*&wt=$xml&indent=true&start=10000&rows=5000&fl=id,title" > partial-output-of-my-solr-index-3.xml
...
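If you would rather not copy and paste, the same pattern works as a quick loop. A sketch, assuming you already know roughly how many documents there are (25,000 here) and the same example host and collection:

# page through the index 5000 docs at a time, one file per page
for start in $(seq 0 5000 20000); do
  curl -s "http://localhost:8888/solr/collection1/select?q=*%3A*&wt=xml&indent=true&start=$start&rows=5000&fl=id,title" > partial-output-of-my-solr-index-$((start/5000 + 1)).xml
done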

Of course, the easiest thing to do is to write a shell script that handles all that for you. There is a version of just such a shell script below…

Caveats

(I love the word Caveats. It sounds a lot like caviar…which reminds me: I haven’t had dinner yet)

A couple of things to bear in mind:

  • Asking Solr for an arbitrarily large number of search results all at once does have its consequences: as a search engine it feels certain responsibilities, not the least of which is that it has to collect and sort the entire result before returning it to you. That means that even if you are just looking for the last few results of a query, it is still going to gather the entire result and do its relevancy and sorting magic to get them for you.
  • If you export an index that is in the middle of being modified by incoming changes, expect questionable results.

Good News!

Solr 4.7 will have a cursor! No, Solr is not developing Tourette’s.

Through the use of cursors the first caveat above will be solved and traversing the index will be much more efficient. This will give your mom a great reason to call and tell you how proud she is of you.
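For the curious, here is a sketch of what cursor-based deep paging should look like once it lands (the cursorMark/nextCursorMark parameter names come from the Solr 4.7 documentation; host and collection are examples):

# first page: cursorMark=* plus a sort that includes the uniqueKey field
curl -s "http://localhost:8888/solr/collection1/select?q=*%3A*&fl=id,title&rows=5000&sort=id+asc&cursorMark=*"

# each response carries a nextCursorMark value; pass it back in as cursorMark
# on the next request and repeat until the mark you send is the one you get back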

Code

Finally! The following is a semi-complete script. If you want to change the parameters you will have to change them within the script every time. Could the script have pulled the parameters from the command line? Of course (but once it worked I left it alone)!

lws-export.sh

#! /bin/bash
#
# lws-export.sh
#
# While this will export a Solr index based on a requested query it should be
# remembered that the results will change if items are being indexed into this
# collection at the same time as this is running. Probably shouldn't do that.
#
# Use this only for good.
#
# Carlos Valcarcel
# Created: 4/2/14
#

solrURL=http://localhost:8888/solr
collectionName=nutrition
# this can be either a comma separated list of field names or * to return all fields
fieldNames=id,title
maxDocumentsPerFile=2000
# this must be URL encoded. *%3A* is the equivalent of a full search using *:*
query=*%3A*
# a path would look like this using the following pieces, where N is a number attached
# to the filename to make it unique:
# $destinationPath/$baseFilename-N.$fileExtension
destinationPath=.
baseFilename=solr-export
# this can be xml, json, or csv
# fileExtension=json
# outputFormat=json
# fileExtension=csv
# outputFormat=csv
fileExtension=xml
outputFormat=xml

#
# How many docs are there in total?
#
# use the configured URL/collection; -s keeps curl's progress meter out of the pipe
# (note: awk strings are 1-indexed, so the substring starts at position 1)
maxDocs=`curl -s "$solrURL/$collectionName/select?q=$query&rows=0" | awk '/numFound=/{idx=index($0, "numFound="); totalDocs=substr($0,idx+10); idx=index(totalDocs,"\""); totalDocs=substr(totalDocs, 1, idx-1); print totalDocs}'`

echo "maxDocs: " $maxDocs
exit;

maxPageCount=$(($maxDocs/$maxDocumentsPerFile));
mod=$(($maxDocs % $maxDocumentsPerFile))

# the division might not have been clean; if there is a remainder we need one more
# page (this also covers the case where everything fits into a single file)
if [ $mod -ne 0 ]; then
  let maxPageCount=maxPageCount+1
fi

# echo "maxPageCount: " $maxPageCount

fileNumber=0
pageNumber=0
while [ $pageNumber -lt $maxPageCount ];
do
  offset=$(($pageNumber * $maxDocumentsPerFile))
  let fileNumber=fileNumber+1
  let pageNumber=pageNumber+1

  outputFilename=$destinationPath/$baseFilename-$fileNumber.$fileExtension
  echo "Writing file " $outputFilename

  echo curl "$solrURL/$collectionName/select?q=$query&wt=$outputFormat&indent=true&start=$offset&rows=$maxDocumentsPerFile&fl=$fieldNames"
  curl "$solrURL/$collectionName/select?q=$query&wt=$outputFormat&indent=true&start=$offset&rows=$maxDocumentsPerFile&fl=$fieldNames" > $outputFilename
done
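To use it, save the script (lws-export.sh here, but any name will do), adjust the variables at the top for your installation, make it executable, and run it:

chmod +x lws-export.sh
./lws-export.sh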

References

Common Query Parameters

Thanks

A shout out to Erik Hatcher for the advanced news about the 4.7 cursor.

Disclosures

Carlos Valcarcel is a full-time employee of LucidWorks, but lives in New York as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer, the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag he admits that answers will be harder to give without them. The cat isn’t real, but then neither are you. Enjoy your search responsibly.

 


7 thoughts on “Solr: Exporting an Index To an External File”

  1. Marcelo Daparte

    Hi!

    Just a little bug in the script. At line 38, subtract 1 from the idx variable in order to avoid getting a " character after the number.
    Replace this
    totalDocs=substr(totalDocs, 0, idx); print totalDocs}'`
    with this
    totalDocs=substr(totalDocs, 0, idx-1); print totalDocs}'`

    Thanks in advance,

    Marcelo Daparte

    1. cvalcarcel (post author)

      Excellent catch! For some reason I did not see that in my testing, but sometimes the cat gets in my eyes.

      I will correct it, sight unseen, and hope for the best.

  2. Manuel

    Hi, thank you for this script.
    Just one problem: for the count request, you forgot to change the Solr address to use the vars.

    It should be: maxDocs=`curl "$solrURL/$collectionName/select?q=$query&rows=0" | awk '/numFound=/{idx=index($0, "numFound="); totalDocs=substr($0,idx+10); idx=index(totalDocs,"\""); totalDocs=substr(totalDocs, 0, idx-1); print totalDocs}'`

    1. cvalcarcel (post author)

      Importing the content back is a whole different animal. If you want to import it as Solr XML then you have to run your export through XSLT to turn it into something Solr would be happy to ingest. If you want to import it as CSV or generic XML then you need to configure DIH to take it in and parse it based on your needs. Not trivial, but not necessarily that difficult either.

      I am a fan of CSVs as they are so much easier to ingest.
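      For what it is worth, a CSV re-import can be as simple as posting the file back to the update handler. A sketch, assuming the same example host/collection and a file produced by the export above (solr-export-1.csv is a made-up name):

      # push an exported CSV back into a collection; Solr picks its CSV
      # loader based on the Content-type header
      curl "http://localhost:8888/solr/collection1/update?commit=true" --data-binary @solr-export-1.csv -H 'Content-type:text/csv; charset=utf-8'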

  3. Llorenç Chiner

    First of all, thank you very much for your answer. In fact, what I am trying to do is move an index and its contents from one server to another; I am not sure if this method is the right one to do so.

