Solr: Ingesting a CSV File (Part 1)


There are two ways of pushing a CSV file into Solr: with code and without code (I know: be still my heart). The cat approves of such behavior.

Let’s talk about the no code way first. That always sends me into a tizzy (notice the self-referential link).

The following information can be found at the Solr wiki site at the UpdateCSV page (but not so well told as here). Don’t go there. Read this first. The other page is just a wiki page written by people who know much more than I do, and my self-esteem is fragile.

Using Curl (or Look, Ma, No Code!)

Since an example is worth a thousand words, here is your first example:

curl http://localhost:8983/solr/mycollection/update/csv --data-binary @myfile.csv -H 'Content-type:text/plain; charset=utf-8'

Notice the grace and joie de vivre of the overall line. Let’s break it down:

  • curl – if you don’t have it, get it. On Linux just do an apt-get install curl. On Windows you will have to look for it (what are you doing on Windows doing Solr development, anyway?).
  • the URL pointing to the CSV Solr target: http://[servername | IP address]:port-number/solr/[collection name]/update/csv
  • --data-binary @[filename] – the name of the CSV file to be ingested. The leading @ tells curl to read the request body from that file rather than send the filename itself.
  • -H [HTTP request header information] – in this case the content type and character set of the incoming file. Equivalent to --header.
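
If you would rather not retype that line every time, the pieces above can be wrapped in a small shell sketch. The function names solr_csv_url and post_csv are made up for illustration; only the URL layout and the curl flags come from the example above.

```shell
#!/bin/sh
# Hypothetical helper: build the CSV update URL from its parts.
# Usage: solr_csv_url HOST PORT COLLECTION
solr_csv_url() {
    printf 'http://%s:%s/solr/%s/update/csv\n' "$1" "$2" "$3"
}

# Hypothetical helper: post a local CSV file to the handler.
# Usage: post_csv URL FILE
# The @ prefix makes curl read the request body from FILE.
post_csv() {
    curl "$1" --data-binary "@$2" -H 'Content-type:text/plain; charset=utf-8'
}

# Only build and print the URL here; nothing is sent yet.
url=$(solr_csv_url localhost 8983 mycollection)
echo "$url"
```

With a Solr instance actually running, post_csv "$url" myfile.csv should send the same request as the one-liner.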

It is interesting to note that while the file is a CSV file (technically comma separated) the MIME type listed is text/plain. Why does this still work? Is Solr so smart that it can see through lead-lined planters and decide pink?

Hello? The Solr target URL is http://localhost:8983/solr/mycollection/update/csv. No x-ray vision required; just the right URL.

Yes! That is it. Nothing else to do. Nothing else to see.

Kind of.

What if you want to stream the file explicitly? Then you would define it mostly like this:

curl 'http://localhost:8983/solr/mycollection/update/csv?stream.file=/my/csv/file/is/here.csv&stream.contentType=text/plain;charset=utf-8'

Notice the continued use of the Solr CSV URL target. Notice the continued use of the MIME type text/plain. The difference is in the parameters:

  • stream.file – the absolute/relative path to the CSV file
  • stream.contentType – the MIME type and character set designation
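
One thing to watch with this form: the URL contains & and ;, which the shell treats as its own syntax unless the URL is quoted. Here is a minimal sketch that assembles the URL in a variable so it can be passed to curl safely. The function name stream_url is hypothetical; the parameter names are the two listed above.

```shell
#!/bin/sh
# Hypothetical helper: append the stream.* parameters to a base URL.
# Usage: stream_url BASE_URL FILE_PATH
stream_url() {
    printf '%s?stream.file=%s&stream.contentType=%s\n' \
        "$1" "$2" 'text/plain;charset=utf-8'
}

url=$(stream_url 'http://localhost:8983/solr/mycollection/update/csv' \
                 '/my/csv/file/is/here.csv')
echo "$url"

# When you are ready to send it, keep the double quotes:
#   curl "$url"
```

Left unquoted, everything after the first & would be run as a separate command, and Solr would never see the content type.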

Which do I use? Neither. I let the cat decide.

Wait (I hear you say)! My CSV files don’t use commas as the separator. I thought CSV stood for Character Separated Values! What do I do with my tab-delimited files? My pipe-delimited files? My repeating-group-of-character-delimited files? My…

Yes, I get it. You have a thing about delimits.

Use this as a jumping off point (no suicide jokes, please. I’m saving those for later):

curl 'http://localhost:8983/solr/mycollection/update/csv?commit=true&separator=%09&escape=\&stream.file=/my/csv/file/here.csv'

For the stuff that hasn’t changed, refer to the information above.

The newly introduced parameters:

  • commit – set to true. You’ll be glad you did.
  • separator – (reminds me of the Sade song Smooth Separator) this is where you can add tabs or pipes or anything else that strikes your fancy as a delimiter. Yes, you must encode it so a tab becomes %09. No need to ask.
  • escape – if you are using a separator that is NOT a regular character (how boring) you have to tell the CSV handler to escape the separator so it is used properly. Without it the \t will be treated as a t and that would ruin the after-party.
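
If you can never remember the encoded form of your delimiter, the shell can work it out for you. This is a minimal sketch that assumes a single-byte ASCII separator; the function name encode_char is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: percent-encode one ASCII character.
# A leading quote in a printf argument yields the character's numeric code.
encode_char() {
    printf '%%%02X\n' "'$1"
}

tab=$(printf '\t')
encode_char "$tab"   # prints %09
encode_char '|'      # prints %7C

# Drop the encoded separator into the update URL.
base='http://localhost:8983/solr/mycollection/update/csv'
sep=$(encode_char "$tab")
printf '%s?commit=true&separator=%s\n' "$base" "$sep"
```

The last line only prints the URL; pass it to curl in quotes once it looks right.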

As stated before (repetition becomes me), commas are the default separator, so you don’t need to include them at all in any configuration if they are your delimiter of choice.

[Want to know more about how this works behind the scenes? Read up on RequestHandlers? Then go to the wiki page and have at it. I’m out of time.]

THINGS TO WATCH OUT FOR

If you are doing this on Linux, check that the file has the proper carriage return/linefeed characters or the result will be quite bad. Use vi (doesn’t everyone?) and run :%s/^M/^M/g to convert the newlines to something recognizable (type each ^M as Ctrl-V followed by Ctrl-M, not as a caret and an M). This is your only warning.
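
For those who refuse to open vi, here is a rough command-line alternative. It is a sketch that assumes the file has Windows-style CRLF endings; the file names and the sample data are made up.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)

# Create a tiny sample file with CRLF line endings (made-up data).
printf 'id,name\r\n1,cat\r\n' > "$workdir/crlf.csv"

# Delete every carriage return, leaving plain LF line endings.
tr -d '\r' < "$workdir/crlf.csv" > "$workdir/lf.csv"

# Dump the bytes so any leftover \r would be visible.
od -c "$workdir/lf.csv"
```

If the file uses bare carriage returns with no linefeeds at all, tr '\r' '\n' is the closer match, since it turns each one into a newline instead of deleting it.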

Why is this Part 1? Well, read Part 2 to find out.

References

Updating a Solr Index with CSV

Disclosures

I am a full time employee of LucidWorks. We like Solr. Sometimes even I do.

I make no money from this blog. If you think of a way for me to make money using this blog, let me know. I’ll probably ignore you like I do everyone else.

This blog does not reflect the thoughts, opinions, facts or idiosyncrasies of LucidWorks (my employer). I’m still deciding if it even reflects my idiosyncrasies. LucidWorks has not approved, or rejected, any of the things I said here. They do not approve or disapprove any of the things I have said, not said, may say or may not say in the future or in a competing timeline. I might have written this on company time. But then again, maybe I didn’t. The cat will never tell.

4 thoughts on “Solr: Ingesting a CSV File (Part 1)”

  1. Satish

    Awesome disclosure! Thanks for the article. Ive been trying to understand how I can upload a tab delimited txt file with no headers and you answered some of my questions.

  2. Omer

    Hi

    I did same but somehow was unable to index. I created a demo collection and used your query but data is not posted. No error is thrown and I see below response. Appreciate your support. Thanks.

    [solr@ambari solr]$ curl http://localhost:8983/solr/fbdemo_shard1_replica1/update/csv --data-binary /tmp/solrdata/331076060277979_facebook_comments_korek.csv -H 'Content-type:text/plain; charset=utf-8'

    04

    [solr@ambari solr]$
