LWS: Ingesting a CSV File (Part 2)


Need to ingest a CSV file into Solr and not sure how to do it? Have a seat, my friend (have you seen Part 1?).

Let me start by admitting that I do all of my examples using LucidWorks Search. That means that if you use a plain vanilla Solr installation, I might ask you to do something or to find something that is not there (or that is against your moral code of ethics). Not a pretty picture.

The Short Version

  1. Stop Solr
  2. Edit the solrconfig.xml file for the collection you will be updating with the CSV content (in this case collection1) and change <openSearcher> from false to true
  3. Start Solr
  4. Open up your favorite text editor or IDE and get ready to write some SolrJ code
  5. Within the code:
    1. Create a SolrServer object
    2. Open the CSV file
    3. Read a line from the CSV file
    4. Parse the CSV line
    5. Create a SolrDocument
    6. Add the appropriate content as key/value pairs to the SolrDocument
    7. Send the document to the SolrServer
  6. Repeat from step 3 in the sublist until you have no more lines to parse
  7. Run your SolrJ client code

That wasn’t so hard, was it? Well, okay, maybe it was. If it was, then let’s go over the same steps in a slightly longer version.

The Longer Version

In this particular example we take an existing collection (collection1) and make no schema changes up front. We are going to execute SolrJ code that will automatically/dynamically add fields to the schema (and we will be happy about it).

This will be the input file:

fauxid,title,author,isbn,price
1,book title 1, author 1, 12345678,5.99
2,book title 2, author 2, 12345679,6.99
3,book title 3, author 3, 12345680,7.99
4,book title 4, author 4, 12345681,8.99
5,book title 5, author 5, 12345682,9.99

Stop Solr

New to Solr/LucidWorks Search? Everything I discuss in this blog, until further notice, will always take the perspective of an LWS installation (until LWX or Apollo, and then I will give in and talk from a pure Solr perspective).

Run $LWS/app/bin/stop.sh (if you are on Windows run stop.bat. And step away from the ledge).

Edit the solrconfig.xml file for the collection you will be updating with the CSV content (in this case collection1) and change <openSearcher> from false to true

Go to (not considered harmful in this case) $LWS_HOME/conf/solr/cores/[collection name]/conf/solrconfig.xml:

...
<autoCommit>
 <maxTime>15000</maxTime> 
 <openSearcher>true</openSearcher> 
</autoCommit>
...

Yes, this was the one part of the exercise that left me feeling queasy, but it is a necessary evil if you are using SolrJ to push content into Solr. The content will eventually be committed to the index, but exactly when is not predictable, and you could miss out on a lot of sleep if you decide to wait up for it. By setting openSearcher to true, the hard commits that happen at the rate defined at http://localhost:8989/collections/collection1/indexing_settings/edit (again, adjust to taste) under Auto-commit max time will also open a new searcher and make your content visible. Yes, it may take a few seconds longer, but I promise you it will do the hard commit you need to get to sleep.
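
If editing solrconfig.xml is a bridge too far for you, SolrJ can also request a commit deadline on the add call itself via its commitWithin support. Here is a minimal sketch; the class name is made up, and the 15000 simply mirrors the maxTime above:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinSketch {
        public static void main(String[] args) throws Exception {
            SolrServer server = new HttpSolrServer("http://localhost:8888/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");

            // ask Solr to commit within 15 seconds of receiving this add,
            // rather than waiting on the autoCommit settings
            server.add(doc, 15000);
        }
    }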

Start Solr

Run $LWS/app/bin/start.sh (or start.bat if you are on that other OS).

Open up your favorite text editor or IDE and get ready to write some SolrJ code

I typically use Eclipse. Use what you like.

With Eclipse:

  1. Create a Java Project. Name it something clever like SolrJ CSV Example.
  2. Create a Java class. Name it something clever like now.on.tap.example.CSVImporter.

Create a SolrServer object

    public static void main(String[] args) throws Exception {
        String serverURL = "http://localhost:8888/solr/collection1";
        SolrServer server = new HttpSolrServer(serverURL);

        // open the file with the content
        String filename = "test-data.txt";

        parseFile(server, filename);
    }

1. Open the CSV file AND

2. Read a line from the CSV file AND

3. Parse the CSV line AND

4. Create a SolrDocument AND

5. Add the appropriate content as key/value pairs to the SolrDocument AND

6. Send the document to the SolrServer

    private static void parseFile(SolrServer server, String filename) throws Exception {
        // not important to the example
        FileReader reader = new FileReader(filename);
        BufferedReader bufReader = new BufferedReader(reader);

        // not important
        String line = bufReader.readLine();
        String[] fieldNames = line.split(",");

        // not important
        int idIdx = 0;
        while ((line = bufReader.readLine()) != null) {
            // Kind of important from a processing perspective...nah, not really...
            Map<String, String> map = parseLine(idIdx, fieldNames, line);

            // IMPORTANT! PAY ATTENTION!
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", map.get("id"));
            for (String fieldName : fieldNames) {
                doc.addField(fieldName, map.get(fieldName));
            }

            // ALSO IMPORTANT
            Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            docs.add(doc);

            // YES, ALSO IMPORTANT
            server.add(docs);

            // don't call _server.commit(). let LWS auto-commit when it is
            // ready. solrconfig.xml --> openSearcher == true
        }

        bufReader.close();
    }

Repeat from step 3 in the sublist until you have no more lines to parse

Run your SolrJ client code

Running the SolrJ client code is very straightforward if you are running it from an IDE like Eclipse: you will have already added the necessary JAR files to your build path so that you could get a clean compile. If you are running the code from the command line, make sure you have the JAR files in your classpath or the code won’t run (it probably wouldn’t have compiled either); a sample command follows the list:

  • commons-io-2.1.jar
  • httpclient-4.2.3.jar
  • httpcore-4.2.2.jar
  • httpmime-4.2.3.jar
  • jcl-over-slf4j-1.6.4.jar
  • slf4j-api-1.6.4.jar
  • slf4j-jdk14-1.6.4.jar
  • solr-solrj-4.2.1.jar
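
For example, assuming the JARs above live in a lib directory next to your source (the paths here are illustrative), compiling and running from the command line would look something like:

    mkdir classes
    javac -cp "lib/*" -d classes SolrjCSVClient.java SolrjCSVProperties.java
    java -cp "classes:lib/*" SolrjCSVClient http://localhost:8888/solr/collection1 test-data.txt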

Quite honestly, the SolrJ wiki page has a list with slightly different versions than the ones above, but they worked just as well for this example. Here is the entire SolrJ client with a lot more indirection than it needs, but I wanted to make it generic enough to use for any delimiter:

// SolrjCSVClient.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrjCSVClient {
    private SolrServer         _server;
    private Reader             _reader;
    private SolrjCSVProperties _properties;

    public static void main(String args[]) throws Exception {
        if (args.length != 2) {
            System.err
                    .println("Usage:\njava -cp classes:lib/* "
                            + SolrjCSVClient.class.getName()
                            + " [serverpath including collection name] [input file path]\nFor example:\njava -cp classes:lib/* SolrjExample http://localhost:8888/solr/a-uhg-test test-data-pipe-separated.txt");
            System.exit(0);
        }

        SolrjCSVProperties properties = new SolrjCSVProperties("solr-csv-input.properties");

        SolrjCSVClient app = new SolrjCSVClient();

        // Connect to the search server
        String serverURL = args[0];
        SolrServer server = new HttpSolrServer(serverURL);

        // open the file with the content
        String filename = args[1];
        FileReader reader = new FileReader(filename);

        app.setProperties(properties);
        app.setSolrServer(server);
        app.setReader(reader);

        app.parseFile();
    }

    public void setReader(FileReader reader) {
        _reader = reader;
    }

    public void setSolrServer(SolrServer server) {
        _server = server;
    }

    public void setProperties(SolrjCSVProperties properties) {
        _properties = properties;
    }

    private void parseFile() throws Exception {
        BufferedReader bufReader = new BufferedReader(_reader);

        // get the first line to get the field names. use them as keys in our
        // map
        String line = bufReader.readLine();
        // there should be a check that a line was actually read from the
        // file, but what would we do in that case? The same thing that
        // always happens: an exception gets thrown...
        // Pattern.quote() so regex metacharacters (e.g. |) split literally
        String[] fieldNames = line.split(Pattern.quote(_properties.getSeparator()));

        // which field do we use as the solr id field?
        // in real life the id field may be a composite of multiple fields.
        // this logic could be extended to do that.
        int idIdx = getIdFieldIndex(fieldNames, _properties.getIdFieldName());
        if (idIdx == -1) {
            throw new Exception("No ID field found to use as a document ID.");
        }

        while ((line = bufReader.readLine()) != null) {

            // parse a line
            Map<String, String> map = parseLine(idIdx, fieldNames, line);

            // create the document
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", map.get("id"));
            for (String fieldName : fieldNames) {
                doc.addField(fieldName, map.get(fieldName));
            }

            // store the docs in an ArrayList
            Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            docs.add(doc);

            _server.add(docs);

            // don't call _server.commit(). let LWS auto-commit when it is
            // ready. solrconfig.xml --> openSearcher == true
        }

        bufReader.close();
    }

    private int getIdFieldIndex(String[] fieldNames, String idFieldName) {
        int result = -1;

        for (int i = 0; i < fieldNames.length; i++) {
            if (fieldNames[i].equals(idFieldName)) {
                result = i;
                break;
            }
        }

        return result;
    }

    private Map<String, String> parseLine(int idIdx, String[] fieldNames, String line) {
        Map<String, String> map = new HashMap<String, String>();

        String[] cols = line.split(Pattern.quote(_properties.getSeparator()));

        // trim the values; the sample file has spaces after the separators
        map.put("id", cols[idIdx].trim());
        for (int i = 0; i < cols.length; i++) {
            map.put(fieldNames[i], cols[i].trim());
        }

        return map;
    }

}

Just a class to read the teeny tiny properties file I created for this…

// And a support class
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

public class SolrjCSVProperties {
    private static final String SEPARATOR = "separator.csv.solrj";

    private static final String SOLR_ID_FIELDNAME = "id.field.csv.solr";

    private Properties _properties = new Properties();

    public SolrjCSVProperties(String propertyFilename) throws IOException {
        FileReader fileReader = new FileReader(propertyFilename);

        _properties.load(fileReader);
        fileReader.close(); // don't leak the file handle
    }

    public String getSeparator() {
        String result = null;

        result = getProperty(SEPARATOR);

        return result;
    }

    public String getIdFieldName() {
        String result = null;

        result = getProperty(SOLR_ID_FIELDNAME);

        return result;
    }

    private String getProperty(String fieldname) {
        String result = null;

        result = _properties.getProperty(fieldname);
        result = result == null ? "" : result;

        return result;
    }
}

And an expected properties file:

#
# solr-csv-input.properties
#
# For the purposes of this example the field names are found in the first line of the CSV file
# and correspond to their respective column.
#

# what is the separator string?
separator.csv.solrj=,

# what field from the CSV file should be used as the id?
id.field.csv.solr=fauxid
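
The usage message earlier mentions a pipe-separated test file; thanks to the Pattern.quote() call in parseFile, the same properties file handles that case too, with just the separator swapped:

#
# solr-csv-input.properties (pipe-separated variant)
#
separator.csv.solrj=|
id.field.csv.solr=fauxid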

A Sad Side-effect

If you are like me (and, really, who isn’t?) you just have to know what is happening under the covers. And with Solr the only way to determine that is to call it directly and look at the XML response output. Notice that of the fields we added (fauxid, title, author, isbn, and price) the only one that already exists in the Solr schema is title. However, it ends up with the same configuration the others get: a multi-valued field. Here is a sample after running http://localhost:8888/solr/collection1/select?q=*:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">10</int>
    <lst name="params">
      <str name="q">*</str>
    </lst>
  </lst>
  <result name="response" numFound="5" start="0">
    <doc>
    <str name="id">1</str>
    <arr name="fauxid">
      <str>1</str>
    </arr>
    <arr name="title">
      <str>book title 1</str>
    </arr>
    <arr name="author">
      <str>author 1</str>
    </arr>
    <arr name="isbn">
      <str>12345678</str>
    </arr>
    <arr name="price">
      <str>5.99</str>
    </arr>
    <long name="_version_">1442545339050491904</long>
    <date name="timestamp">2013-08-05T16:00:40.102Z</date>
    </doc>
    ...
  </result>
</response>

Some of you may find this disturbing. Why would Solr create fields of type array? Who is responsible for this travesty? Well, actually, the schema.xml is responsible: in the dynamic field definitions the default dynamic field is multi-valued, and multi-valued fields come back as arrays in the response.
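
Crack open the schema.xml for your collection and you will find a catch-all dynamic field rule along these lines (the exact name and field type vary between versions, so treat this as an illustration rather than a copy of your file):

<!-- unknown field names fall through to a catch-all rule like this one;
     multiValued="true" is what produces the <arr> elements above -->
<dynamicField name="*" type="text_general" indexed="true" stored="true" multiValued="true"/>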

[Image: Fig 1 – The multi-valued checkbox on the field definition page for LucidWorks Search]

Remember this? If you don’t, go to http://localhost:8989/collections/collection1/fields?include_dynamic=true (making the appropriate changes to the server and collection name if you need to). You can either update the schema.xml file directly (you fearless wonder!) or go through the admin page (you lover of simplicity!). I always recommend going through the admin field page because I am not developer enough. Or I am terrified of XML syntax errors. Or both.

To be clear: there is nothing wrong with multi-valued fields. The fact is that defining all of the fields by hand before you index the file is a tedious exercise best left to interns and liberal arts majors (of which I am one). Use the opportunity after indexing to update the field types, adding additional fields as needed and copying values to them, rather than entering them all by hand.

That would draw this, our second blog post, to a close. But wait. As this is Part 2, there must be a Part 1. Go back to Part 1 and look at how you can do the above without writing a single line of SolrJ code (no processing, unfortunately). Not. A. Single. Line. Really. The cat is in agreement. Go to Part 1.

Thanks

A big shout out to Joel Bernstein and Erik Hatcher for their comments and suggestions that led to this deeply moving and heart-warming post.

Disclosures

Nothing I will discuss in this blog will be rocket science. That is until I post something about NASA. Then it might be about rocket science, but still not actual rocket science. To the Rocket Scientists: please accept my apologies.

Carlos Valcarcel is a full time employee of LucidWorks, but lives in New York as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag he admits that answers will be harder to give without them. The cat isn’t real, but then neither are you. Enjoy your search responsibly.
