Solr Classic/LWS: How To Send A Single Query to Multiple Collections or Multiple Shards


[LucidWorks Search 2.6.2]
Due to circumstances of birth or events beyond your control (I guess that’s redundant), you find yourself with multiple collections, multiple shards, or both, and you need to execute a single query to retrieve distributed results.

The cat understands in a way that only a true critic of Copenhagen could.

For the sake of argument, and an excuse to try a few novel things, let’s assume I have 2 shards running on a local VM. These are their settings:

Shard 1
Core: http://127.0.0.1:18888
Search UI: http://127.0.0.1:18989

Shard 2
Core: http://127.0.0.1:28888
Search UI: http://127.0.0.1:28989

[Notice how I added the number 1 in front of the ports for shard 1 and a 2 in front of the ports for shard 2. That’s called lazy consistency.]

Also, for the sake of argument, let’s say that the ontap-s[12]-collection1 and ontap-s[12]-collection2 collections for both shards each have one document (that would be 4 documents: 2 collections in each of our two shards).

The Easy, but Hard, Way

If I wanted to perform a search against one collection (say ontap-s1-collection1) using the select query handler I could use the standard URL (which would return the default XML output):
http://localhost:18888/solr/ontap-s1-collection1/select?q=*:*
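If you would rather drive this from a script than a browser, here is a minimal Python sketch, assuming the host, port, and collection name from my setup above. It tacks Solr’s standard wt=json parameter onto the query simply because JSON is less painful to pick apart than the default XML:

import json
import urllib.request

# The same single-collection query, with wt=json added so the response
# can be parsed with the json module instead of squinting at XML.
url = "http://localhost:18888/solr/ontap-s1-collection1/select?q=*:*&wt=json"

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# Solr reports the hit count as response.numFound; here it should be 1.
print("documents found:", data["response"]["numFound"])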

For those of you who are more visually oriented, let’s do this using the standard search UI from the LWS admin page (which uses the search request handler, which doesn’t really exist):

http://localhost:18989/admin/collections/ontap-s1-collection1/search?q=

[Screenshot: 1-collection1-shard1]

This returns one document and the data source says it is from ontap-s1-collection1.

[I will do everything from here on in the admin UI, but if you change the base URL to the Solr URL everything will still work. Promise.]

What you are looking for is the parameter that will fix everything in your life. That parameter is shards.

Using the shards parameter the same query looks like this:
http://localhost:18989/admin/collections/ontap-s1-collection1/search?shards=localhost:18888/solr/ontap-s1-collection1&q=

[Screenshot: 2-collection1-shard1]

Also one document. Pretty pointless until we add the ontap-s1-collection2 collection to the shards parameter:
http://localhost:18989/admin/collections/ontap-s1-collection1/search?shards=localhost:18888/solr/ontap-s1-collection1,localhost:18888/solr/ontap-s1-collection2&q=

[Screenshot: 3-collection1-and-2-shard1]

Now we have two documents and the Data Source facet lists ontap-s1-collection1 and ontap-s1-collection2. One query, two collections. I’m feeling a tingle.

So that gets us the ability to send a single query across different collections. How about across additional shards? Try this:

http://localhost:18989/admin/collections/ontap-s1-collection1/search?shards=localhost:18888/solr/ontap-s1-collection1,localhost:18888/solr/ontap-s1-collection2,localhost:28888/solr/ontap-s2-collection1,localhost:28888/solr/ontap-s2-collection2&q=

Notice the new port number (28888) on the last two entries. That adds ontap-s2-collection1 and ontap-s2-collection2 from the second shard.

[Screenshot: 4-collection1-and-2-both-shards]

Now we have four documents and a damn long URL. What happens when we have a bazillion shards? That will be one honkin’ long URL (if you can even find a server or browser that supports a URL longer than 4k).
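If you are determined to stay on this road, the least painful option is to let code build that URL for you. Here is a rough Python sketch, using the toy shard list from above (hosts, ports, and collection names are my local setup, so adjust to taste):

import urllib.parse
import urllib.request

# The shard endpoints we want a single query to cover (host:port/solr/collection).
SHARDS = [
    "localhost:18888/solr/ontap-s1-collection1",
    "localhost:18888/solr/ontap-s1-collection2",
    "localhost:28888/solr/ontap-s2-collection1",
    "localhost:28888/solr/ontap-s2-collection2",
]

def distributed_query(base_url, query, shard_list):
    # Join the shards into the comma-separated list Solr expects and let
    # urlencode worry about escaping the whole thing.
    params = urllib.parse.urlencode({
        "q": query,
        "shards": ",".join(shard_list),
    })
    with urllib.request.urlopen(base_url + "?" + params) as response:
        return response.read()

# One query, two shards, four collections.
xml_results = distributed_query(
    "http://localhost:18888/solr/ontap-s1-collection1/select",
    "*:*",
    SHARDS)
print(xml_results[:200])

It works, but every client still has to know about every shard, which is exactly the itch the next section scratches.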

The Hard, but Easy, Way

The previous example was very mechanical. While we could write code that would take care of creating that URL, or we could always hard-code it, there should be an easier way. What we need is a way to have the request handler take care of the distributed query for us. And, of course, it can.

We will arbitrarily decide to use shard 1 as the entry point. That means any queries we send will go there, and its request handler will take care of distributing the query for us. If you give shard 2 the same configuration then it can serve as an entry point as well.

What is the configuration? The cat wants to know.

There is no UI for this, so you will have to fire up your favorite editor and edit solrconfig.xml. In this case we will edit the solrconfig.xml file for collection1 (found in $LUCIDWORKS_HOME/conf/solr/cores). Why collection1? For 2 reasons:

  1. It is not used as one of the collections we are querying
  2. If we don’t use a collection name in the URL, the system defaults to collection1’s solrconfig.xml configuration (also known as magic behavior)

The requestHandler entry we modify in solrconfig.xml can be any handler except the default one, which in this case is /select. For this example we will mess with /lucid (you can also create a custom requestHandler entry just for the fun of it, but use an existing one as your guide).

First, let’s add the shards entries in the param list for /lucid:

<requestHandler class="solr.StandardRequestHandler" name="/lucid">
  ...
  <lst name="defaults">
    ...
    <str name="shards">localhost:18888/solr/ontap-s1-collection1,localhost:18888/solr/ontap-s1-collection2,localhost:28888/solr/ontap-s2-collection1,localhost:28888/solr/ontap-s2-collection2</str>
    ...
  </lst>
</requestHandler>

The above should look familiar. It is the exact same information we entered in the URL when we were doing this manually through the parameter list.

Save the file and restart shard 1 (or whichever shard’s collection1 you edited).

First call the following URL from the admin UI:
http://localhost:18989/admin/collections/collection1/search?q=

[Screenshot: 5-both-shards-from-lucid]

Four documents! Perfect.

Wait! I hear you scream. The query target is /search, not /lucid. Yes, quite true. The /search target uses the /lucid request handler for its…request handling.

If you wanted the raw output then the URL would look like this (notice the lack of a specific collection name):

http://localhost:18888/solr/lucid?q=

[Screenshot: 6-xml-results-for-both-shards]

Ignore most of the XML. Look for the results count, which is four.
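If you would rather have a script do the squinting, a quick check like this does the trick (same assumptions about my local hosts and ports as before). Note that it sends no shards parameter at all; the defaults we added to /lucid handle the fan-out:

import urllib.request
import xml.etree.ElementTree as ET

# No shards parameter here; the defaults on /lucid in solrconfig.xml
# turn this into a distributed query across all four collections.
url = "http://localhost:18888/solr/lucid?q=*:*"

with urllib.request.urlopen(url) as response:
    result = ET.parse(response).getroot().find("result")

# The <result name="response" numFound="..."> attribute holds the hit count.
print("documents found:", result.get("numFound"))  # should print 4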

Now you know how to call all or some of the shards without having to explicitly list them in the URL. Ain’t life grand?

The cat is exhausted. It isn’t easy being in an entangled state.

References

http://docs.lucidworks.com/display/lweug/Distributed+Search+and+Indexing

Disclosures

Carlos Valcarcel is a full time employee of LucidWorks, but lives in New York as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag he admits that answers will be harder to give without them.

The cat isn’t real, but then neither are you. Enjoy your search responsibly.
