Fusion: How to Resend a Query When You Don’t Like the Initial Results


The cat is hopping mad. Well, at least as hopping mad as a cat in a box can get.

Fusion has been on the streets since September of 2014 and there has been nary a post on this blog to talk about some of the rather inventive things that can be done with it. I am here to break that trend and do some super short blog posts that look at various things users might run into when using Fusion that aren’t necessarily in the documentation because, well, there are more interesting things to write about.

This post assumes you already have Fusion 2.4.1 up and running (you can download it from here) and that you understand the basics of search and Solr. There will not be a lot of background on the Why of Things. This is about the How of Things.

So let’s start by indexing something we can be upset about: when our search doesn’t return any results.

Select a web site that you can safely index without being accused of initiating a DDoS. I chose http://www.cnn.com. Select a site that makes you comfortable.

Start Fusion and log in.

Create a collection for our web pages. Let’s call it..oh, I don’t know..how about web-pages. Create it where you like, but if you let Fusion decide it will create the collection in the Solr that ships with it. Which is fine. A wonderful decision.

Click on the web-pages collection. A Menu panel should appear. All by itself. Alone. In a vast sea of darkness.

Click on Datasources -> Add+ -> Web -> Web.

Enter the following:

  • Datasource ID: web-pages-ds
  • Pipeline ID: web-pages-default (it will be in the drop down list, promise)
  • StartLinks: [the web site you selected. I entered http://www.cnn.com]
    • Press the green plus sign to create a StartLinks entry and then enter the URL
  • Press Save

Congratulations! You now have a web crawling datasource. Well, almost.

Click on the label under the Pipeline ID field called Go To web-pages-default Pipeline. That will open another panel showing the index pipeline stages that are executed on each document (in this case, each web page) on its way to Solr through Fusion.

Do the following (partly for hygiene and partly for parsing):

  1. Click on the Field Mapper stage and press Remove Stage. A dialog box will open for confirmation. Press Yes, Delete. We don’t need that stage in this case so get rid of it.
  2. Click on the drop down with the label Add a New Pipeline Stage and select the Apache Tika Parser stage. You don’t need to configure anything in this case so press Save and go on with your life.

There should only be 2 stages in the web-pages-default pipeline: the Apache Tika Parser and the Solr Indexer.

And that is all there is to it! Well, almost. We are going to have to modify the index schema to make sure the web pages get indexed correctly.

Press the Home icon for the pipeline panel to make the panel transform into the Menu Panel. Select Configuration -> Solr Config to open the panel that allows you to directly edit the configuration files for this collection that are under the protection of ZooKeeper.

Normally, I would not ask you to modify any of these files (the cat acknowledges I am lying. You will always have to change one or more of these files unless your search needs are so simple that you should be using a pen and paper), but in this case the changes can be made before we start crawling.

Click managed-schema to open an editor on the schema file that is automatically created for you by the system and that you should never edit. Never.

Except this one time.

Until the next time.

At line 434 (or on a new blank line inserted above the first <dynamicField> entry), paste the following:

<field name="body" type="text_general"/>
<field name="body.links.anchor" type="text_general" multiValued="true"/>

Press Save. Through the magic of Fusion you have just edited one of the many files under the control of ZooKeeper. Fusion takes care of downloading the file, storing your edits, uploading the file, and reloading the collection so it pays attention to your change (yes, take responsibility for what you just did).

Click the Home icon for the Solr Config panel and select Query -> Search.

Screen Shot 2016-07-21 at 7.52.00 PM

(Yes, the one towards the bottom of the image)

Now you have a way of seeing your handiwork once you start the Datasource.

Start the datasource by pressing Start Crawl on the Datasource panel (which should be the panel to the left).

You will see a message box open to the far right letting you know that the crawl has started. If you want to see how it is going, press Job History. When the dialog opens press the top entry (there should only be one entry the first time you do this) and it will show you how the crawl is going.

Press Stop Crawl to stop the crawl after a few minutes as you have just downloaded more of CNN (or any site for that matter) than you need for an exercise of this type.

Return to the Search panel and press the magnifying glass to see any search results you might have (leave the *:* in the input field). I have 1676 pages indexed. How many do you have?

Your output might look different than mine. I don’t like mine so I am going to ask the Search panel to display the title of the web pages instead of the rather boring site name.

Press the gear icon next to the keyword input field. Click on the Documents tab and drag the title field up to the top by holding the mouse down on the stack of lines to the left of the label. Once you have the title at the top you should see the first line of the output change. Yay, for us.

Click anywhere to make that window close. Go ahead. Try it.

Let’s do a quick search. Enter the word help into the keyword input field. In my index I found 1436 documents. The cat yawns her approval.

Now let’s enter a more interesting keyword (feel free to select a more appropriate word as long as it returns nothing): in this case the name of a non-existent concept. I call it asdfasdf (that is, the letters a-s-d-f twice).

Entering the above keyword gets me 0 results.

The cat is beside herself, but that is due to existing in 2 states at once (and by that I mean New York and Massachusetts).

What if you would prefer to have some content returned instead of no content? While I can’t think of too many situations where that would be true I am sure that someone in Dubai might.

So let’s make that dream a reality. If a keyword returns no search results then substitute a new keyword to return at least something; in this case, we will substitute the keyword help if there are no search results to be had.
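
Stripped of Fusion, the idea fits in a few lines. The sketch below is plain JavaScript, not Fusion code: the `pages` array, `search`, and `searchWithFallback` are hypothetical names made up for illustration, and the substring match is a stand-in for whatever a real search engine does.

```javascript
// Hypothetical stand-in for an index: a handful of page texts.
const pages = [
  "Help center: contact us",
  "Breaking news and video",
  "How to get help with your account",
];

// Naive case-insensitive substring match; a real engine does far more.
function search(docs, keyword) {
  const needle = keyword.toLowerCase();
  return docs.filter((text) => text.toLowerCase().includes(needle));
}

// Run the user's keyword first; if nothing comes back,
// run the fallback keyword instead.
function searchWithFallback(docs, keyword, fallbackKeyword) {
  const hits = search(docs, keyword);
  if (hits.length > 0) {
    return { keyword, hits };
  }
  return { keyword: fallbackKeyword, hits: search(docs, fallbackKeyword) };
}

const result = searchWithFallback(pages, "asdfasdf", "help");
console.log(result.keyword, result.hits.length);
```

Note that the fallback costs a second search whenever the first one comes up empty. The rest of this post gets the same behavior out of the query pipeline.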

We have already confirmed that searching for the ever elusive asdfasdf will return nothing. We have also confirmed that entering the keyword help will return 1436 documents (your document count will vary).

What we need to do now is change the behavior of the query pipeline.

DISCLAIMER: This is a simple example of what can be done with the query pipeline and the Sub Query stage. All manner of magic is possible with other stages including writing a stage using JavaScript or a fully customized stage in Java.

Click the Home icon on the Datasource panel to make it go away (we won’t be using it anymore).

Click on Query -> Query Pipeline.

The Query Pipeline panel will open. Select the web-pages-default pipeline (you did call your collection web-pages, didn’t you?).

Remember how we erased the Field Mapper stage from the index pipeline earlier in this post? We are going to do the same to the Search Fields stage and the Facets stage. While they can be useful they are obstacles to the current phase of world domination. Select each one in turn and press Remove Stage. You’ll be glad you did (I know I was).

The only stage you should now have in your query pipeline is the Query Solr stage. That one is rather important as you can’t actually query Solr otherwise (don’t roll your eyes).

From the Stages drop down button select Sub Query. That will open up the Sub Query stage.

Enter the following:

  • Key: subquery_results (just change the dash to an underscore. It’s important. Just kidding; it’s not really important, but do it anyway or this isn’t going to work)
  • Collection: web-pages
  • Request Handler: select
  • HTTP method: GET
  • Query parameters from the Parent query: q (press the green plus sign to get an input field)
  • Query params:
    • Row 1
      • Parameter name: q
      • Parameter value: ${q}
    • Row 2
      • Parameter name: rows
      • Parameter value: 0

Press Save in the upper right hand corner of the Query Pipeline panel (say that 3 times as fast as you can).

So what does the above do?

  • The parent query is the inbound query
  • The query params are for the subquery that will be sent to Solr just so we can check out the number of results that would be found. We set rows=0 so we don’t waste bandwidth with a larger result than we need. And ${q}? That tells the stage to get the value of the q parameter from the inbound query.
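
For the curious, here is roughly what that probe amounts to, sketched in plain JavaScript. This is an illustration and not how Fusion implements the stage: `buildProbeParams` and `readNumFound` are hypothetical helper names, the empty-string default for a missing q is my own assumption, and the response object is a hand-written example of the shape Solr returns in JSON.

```javascript
// Copy q from the inbound (parent) query and ask for zero rows:
// we only care how many documents match, not what they are.
function buildProbeParams(parentParams) {
  const probe = new URLSearchParams();
  // Plays the role of ${q}. Falling back to "" when q is missing
  // is an assumption made for this sketch.
  probe.set("q", parentParams.get("q") ?? "");
  probe.set("rows", "0");
  return probe;
}

// Solr's JSON response reports the match count at response.numFound.
function readNumFound(solrResponse) {
  return solrResponse.response.numFound;
}

const inbound = new URLSearchParams("q=asdfasdf&rows=10");
const probe = buildProbeParams(inbound);
console.log(probe.toString());

// Hand-written example response for a keyword that matches nothing.
const emptyResponse = { response: { numFound: 0, start: 0, docs: [] } };
console.log(readNumFound(emptyResponse));
```

The inbound query keeps its own rows value; only the probe is trimmed to zero.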

As soon as you save the Sub Query stage it is executed on the Search panel (which you should still have up. Did I ask you to close it? No, I did not). Execute a search on the valid keyword you decided to use and the document count should mirror what you found before.

Not very exciting, is it?

Now it is time for the exciting part. When you enter a term that you know will return zero documents you should expect to see pages returned from the keyword help. How do we do that?

Simple (notice how I made that bold so you would notice its importance). Add another stage that replaces the inbound keyword with our new keyword (in this case, help).

Return to the Query Pipeline panel.

From the Stages drop down button select the Set Query Params stage.

In the Condition box enter the following JavaScript:

ctx.subquery_results.response.numFound == 0

In the Parameters and Values area click the green plus sign and enter:

  • Parameter name: q
  • Parameter value: help
  • Update policy: replace
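
To see what that condition and the replace policy do together, here is a small simulation in plain JavaScript. It is a sketch under assumptions: the `ctx` object is a hand-built mock of what the Sub Query stage stores under the subquery_results key, and `condition`, `setQueryParam`, and `applyStage` are hypothetical names, not Fusion functions. I am also assuming that "replace" means overwriting whatever values the parameter already had.

```javascript
// Mock of the pipeline context after the Sub Query stage has run.
// The shape under subquery_results mirrors a Solr JSON response.
const ctx = { subquery_results: { response: { numFound: 0, docs: [] } } };

// The condition from the stage, wrapped in a function.
function condition(context) {
  return context.subquery_results.response.numFound == 0;
}

// "replace" as assumed here: overwrite any existing values for the parameter.
// Returns a new Map so the inbound parameters are left alone.
function setQueryParam(params, name, value) {
  const updated = new Map(params);
  updated.set(name, [value]);
  return updated;
}

// Only touch the parameters when the condition holds.
function applyStage(context, params) {
  return condition(context) ? setQueryParam(params, "q", "help") : params;
}

const inbound = new Map([["q", ["asdfasdf"]]]);
const outbound = applyStage(ctx, inbound);
console.log(outbound.get("q"));
```

When numFound is anything other than 0 the parameters pass through unchanged, which is why a valid keyword behaves exactly as it did before.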

Press Save. You will see an error on your Search panel. Ignore it. It’s not really there.

The Set Query Params stage is above the Sub Query stage by default (all new stages are stacked at the top). Hold your mouse down over the stack of lines to the left of the Set Query Params stage and drag it below the Sub Query stage.

[Screenshots: the stage order before and after the drag]

Take a breath. The error isn’t there anymore, is it? Told you.
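
Why did the order matter? The error text is not shown here, so this is a guess: the likely explanation is that the condition tried to read ctx.subquery_results before the Sub Query stage had put anything there. The sketch below simulates that in plain JavaScript with a toy pipeline. Every name in it (`subQueryStage`, `setQueryParamsStage`, `runPipeline`) is hypothetical, and the Sub Query stand-in simply pretends the probe found nothing.

```javascript
// Each stage is a function that takes and returns { ctx, params }.
function subQueryStage(state) {
  // Pretend the probe found nothing; a real stage would ask Solr.
  const ctx = { ...state.ctx, subquery_results: { response: { numFound: 0 } } };
  return { ...state, ctx };
}

function setQueryParamsStage(state) {
  // Same condition as in the stage configuration.
  if (state.ctx.subquery_results.response.numFound == 0) {
    return { ...state, params: { ...state.params, q: "help" } };
  }
  return state;
}

// Run the stages top to bottom, starting with an empty context.
function runPipeline(stages, params) {
  return stages.reduce((state, stage) => stage(state), { ctx: {}, params });
}

// Wrong order: the condition runs before subquery_results exists.
let wrongOrderError = null;
try {
  runPipeline([setQueryParamsStage, subQueryStage], { q: "asdfasdf" });
} catch (err) {
  wrongOrderError = err;
}
console.log(wrongOrderError instanceof TypeError);

// Right order: Sub Query first, then Set Query Params.
const fixed = runPipeline([subQueryStage, setQueryParamsStage], { q: "asdfasdf" });
console.log(fixed.params.q);
```

In the toy version the wrong order throws because there is nothing to read numFound from yet; with Sub Query first, q quietly becomes help.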

Enter a valid keyword. You should see results.

Enter asdfasdf. You still get results!

My job here is done. The cat is not stirring (which concerns me).

Disclosures

Carlos Valcarcel is a full time employee of LucidWorks, but lives in New York as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag he admits that answers will be harder to give without them.

The cat isn’t real, but then neither are you. Enjoy your search responsibly.
