LWS: How to Index SolrXML


[I will be using Linux 13.03 with LucidWorks Search 2.6.2.]
Following the continuing saga of doing the easy stuff, we will look at another of the standard data sources available in LucidWorks Search: SolrXML.

At its most basic, SolrXML is made up of a root element that tells Solr what to do with the incoming document(s). For this deeply moving episode, the XML document will look something like:

<add>
  <doc>
    <field name="id">document1</field>
    <field name="title">Excitement!</field>
    <field name="body">This is the main text of an incredibly exciting document!</field>
  </doc>
  ...
</add>

You can have multiple SolrXML files with one document each, multiple documents in a single SolrXML file, or some combination of the two. For this experiment we will do both: one SolrXML file with a single document and a second SolrXML file with multiple documents. We will not be updating the collection schema because we’re going to use three existing fields (id, title, and body). If we had used field names that did not exist, the default configuration of LWS would have created multi-valued string fields for them.
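
If you would rather generate files like these than type them, here is a minimal sketch in Python using only the standard library. The function names (build_add_message, write_solrxml) are my own invention for illustration and are not part of LWS or Solr; the only thing taken from above is the add/doc/field layout.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build an <add> element holding one <doc> per dict in docs."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            # Each key/value pair becomes <field name="key">value</field>.
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return add


def write_solrxml(path, docs):
    """Write the <add> message for docs to path as UTF-8 XML."""
    tree = ET.ElementTree(build_add_message(docs))
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(tree)
    tree.write(str(path), encoding="utf-8", xml_declaration=True)


# One document per message, or several: the structure is the same.
single = build_add_message([
    {"id": "document1", "title": "Excitement!",
     "body": "This is the main text of an incredibly exciting document!"},
])
print(ET.tostring(single, encoding="unicode"))
```

Calling write_solrxml with a list of one dict gives you the single-document file, and a longer list gives you the multi-document file.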

The Shorter Version

  1. Create a collection and call it lucidworks-is-awesome
  2. Create a SolrXML data source
    • Name: my-solrxml-files
    • Path: /home/search/Downloads/test-content/solrxml-files [your path will vary]
    • Include paths: .*\.xml
  3. Press the green Create button
  4. Press the not-green Start Crawl button

The Longer Version

The files I used for this example are listed further down, after the steps.

1. Create a collection and call it lucidworks-is-awesome

If that seems too self-serving you can always call it lucidworks-is-great.

There are numerous ways to create a collection in Solr. I am going to use the LWS admin pages to make life simpler for me.

Assuming you already have LWS running, go to the main dashboard page, which lists all of the collections.

  1. Press New Collection (the green button)
    • Name: lucidworks-is-awesome
  2. Press Create.

2. Create a SolrXML data source

Click on the name of the collection to take you to its dashboard page.

  1. Press New Data Source
  2. Select Solr XML
    • Name: my-solrxml-files
    • Path: /home/search/Downloads/test-content/solrxml-files [your path will vary]
    • Include paths: .*\.xml
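
If you are wondering what that include pattern lets through, here is a small sketch that tries it against a few file names. It assumes the pattern has to match the whole path, which is how I read the form but not something I have confirmed against the LWS internals, and it uses Python's regex engine rather than Java's (the two agree on a pattern this simple). The file names notes.txt and backup.xml.bak are made up for the example.

```python
import re

# The "Include paths" value from the data source form.
INCLUDE_PATTERN = re.compile(r".*\.xml")


def is_included(path):
    """Return True if the whole path matches the include pattern (assumed semantics)."""
    return INCLUDE_PATTERN.fullmatch(path) is not None


candidates = [
    "single-doc-solr.xml",
    "multiple-docs-solr.xml",
    "notes.txt",        # hypothetical stray file
    "backup.xml.bak",   # hypothetical backup copy
]
for name in candidates:
    print(name, is_included(name))
```

Under that whole-path assumption, only names that end in .xml get crawled, so stray notes and backup copies in the same directory are skipped.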

3. Press the green Create button

If this fails, it is probably a permission issue. Make sure the user account that started LWS is allowed to read the path.
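
One way to check for a permission problem before blaming anything else is a quick script like the sketch below. The function name unreadable_entries is hypothetical, and the result only means something if you run it as the same user that runs LWS.

```python
import os


def unreadable_entries(root):
    """Return paths under root that the current user cannot read or traverse."""
    # A missing or inaccessible root is reported as the single problem.
    if not os.access(root, os.R_OK | os.X_OK):
        return [root]
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            full = os.path.join(dirpath, name)
            # Directories need both read and execute (traverse) permission.
            if not os.access(full, os.R_OK | os.X_OK):
                problems.append(full)
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.access(full, os.R_OK):
                problems.append(full)
    return problems


if __name__ == "__main__":
    # Path from the data source form above; yours will vary.
    print(unreadable_entries("/home/search/Downloads/test-content/solrxml-files"))
```

An empty list means the current user can read everything under the path; anything else is a list of places to go fix.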

4. Press the not-green Start Crawl button

At the bottom of the page there will be a Crawl History section. Watch it only if you’re bored. Instead, click on Status on the menubar. When that page appears, don’t be afraid to press Hard Commit (the equivalent of a shotgun proposal).

When the page says that 4 documents have been indexed (or however many you put in your files), click on Tools in the menubar. Click on Search without entering any keyword (that returns everything, which in this case isn’t much).
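
If you are not sure how many documents to expect, you can count them in the files before comparing with the Status page. This is a sketch with a made-up function name (count_docs); it assumes every .xml file in the directory is an add message with its doc elements directly under the root, as in the examples in this post.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_docs(directory):
    """Count <doc> elements in every *.xml file directly under directory."""
    total = 0
    for path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(str(path)).getroot()
        # Only direct children of the root element are counted.
        total += len(root.findall("doc"))
    return total


if __name__ == "__main__":
    # Path from the data source form above; yours will vary.
    print(count_docs("/home/search/Downloads/test-content/solrxml-files"))
```

If the number on the Status page is lower than this count, duplicate id values are one thing to look for, since Solr keeps only one document per unique key.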

[Screenshot: search-results-1]

Here are the files I used for this example:

single-doc-solr.xml

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title">History of the World</field>
    <field name="body">This history of the world is so exciting!</field>
  </doc>
</add>

multiple-docs-solr.xml

<add>
  <doc>
    <field name="id">doc2</field>
    <field name="title">The Dark Times</field>
    <field name="body">It was dark.</field>
  </doc>
  <doc>
    <field name="id">doc3</field>
    <field name="title">Global Warming</field>
    <field name="body">It was hot.</field>
  </doc>
  <doc>
    <field name="id">doc4</field>
    <field name="title">Sum of All Relevancies</field>
    <field name="body">Everything is ranked and sorted and no one knew why.</field>
  </doc>
</add>

Taking it One More Step

The fact is that the SolrXML we indexed above is not so much an XML document to be indexed as an XML message that Solr executes based on the type of the message. In the next post I will show you how to use a SolrXML file to insert, update, and delete documents.

The cat always loves going one step further.
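
As a small taste of what "type of the message" means: in stock Solr's XML update format the root element names the operation, so add inserts or replaces documents and delete removes them. The sketch below builds a delete-by-id message in that format. Whether the LWS SolrXML data source treats a file like this exactly the way plain Solr does is something I am assuming here, not demonstrating, and build_delete_message is a name I made up.

```python
import xml.etree.ElementTree as ET


def build_delete_message(ids):
    """Build a <delete> message with one <id> child per document id."""
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = doc_id
    return delete


# Serialize a message that asks Solr to delete the document with id doc1.
print(ET.tostring(build_delete_message(["doc1"]), encoding="unicode"))
```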

Reference

http://docs.lucidworks.com/display/lweug/SolrXML+Data+Sources

Disclosures

I am a full time employee of LucidWorks.

This blog does not reflect the thoughts, opinions, facts or idiosyncrasies of my employer (whoever that might be at any given moment). I’m still deciding if it reflects my idiosyncrasies. My employer has not approved or disapproved any of the things I have said, not said, and/or may or may not say in the future or in a competing timeline. I might have written this on company time. But then again, maybe I didn’t. The cat will never tell.
