Fusion 3.1: Multi-term synonyms!


Yes, my example is going to be trivial.

No, the cat is not happy with that.

Yes, I am doing it anyway.

With the advent of Solr 6.5 we have (drum roll, please) multi-term synonym support! Yes! Do that happy dance, but remember not to scuff up the floor too much.

Let’s run through a trivial example to show it off.

Required Software

Fusion 3.1 (because it ships with Solr 6.5.1 which supports multi-term synonyms)

Sample Stuff

Here is my CSV file:

title,body
Doc 1!,The United States of America sent a man to the moon in 1969
Doc 2!,The USA has won again!
Doc 3!,The states of America continue to fluctuate.
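
If you want to check the file before feeding it to Fusion, here is a minimal Python sketch that parses the same three rows from an in-memory string. The names SAMPLE_CSV and read_rows are mine, not anything Fusion provides.

```python
import csv
import io

# The three sample documents from the post, as CSV text.
SAMPLE_CSV = """title,body
Doc 1!,The United States of America sent a man to the moon in 1969
Doc 2!,The USA has won again!
Doc 3!,The states of America continue to fluctuate.
"""

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(SAMPLE_CSV)
for row in rows:
    print(row["title"], "->", row["body"])
```

Three rows with a title and a body each is all the indexing steps below rely on.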

The Short Version

  1. Start Fusion
  2. Create a collection called synonyms-test
  3. Create a Local Filesystem datasource called synonyms-ds
  4. Modify the auto-created index pipeline to just have the Solr Indexer stage
  5. Create a new parser pipeline called csv-only and just have the CSV parser in it
  6. Index the file just for fun
  7. Clear the datasource
  8. Edit the managed-schema file and change body from strings to text_general
  9. Edit text_general to use SynonymGraphFilterFactory as the synonym analyzer for queries
  10. Index the file again
  11. Edit synonyms.txt and add usa,united states of america
  12. Edit the query pipeline to add an additional parameter: sow=false

 

Yes, a 12-step program.

The Long Story (or The Short Version Annotated)

C’mon! It’ll be fun! If the screen captures are too small just click on them to see them in their original size and splendor.

1. Start Fusion

That means open a command line window, go to the folder where you have installed Fusion and run fusion start.

Open your browser and log in, please (don’t expect this level of courtesy all the time).

Behold the beauty of the Fusion desktop!

2. Create a collection called synonyms-test

Click on Devops -> Collections to open the Collections Manager (ignore my xml-test collection).

Press New and enter a collection name of synonyms-test.

Press Save Collection. When the synonyms-test collection appears, click on it to open the Collections dashboard, which usually looks mighty desolate.

3. Create a Local Filesystem datasource called synonyms-ds

On the menu bar press Datasources Panel -> Add -> Local Filesystem.

When the Local Filesystem configuration panel opens enter the following:

Datasource ID: synonyms-ds

StartLinks: [the location where you have the CSV file above. You did make a CSV file with the contents from above, didn’t you?]

4. Modify the auto-created index pipeline to just have the Solr Indexer stage

Directly below the Pipeline ID field is a label that says Open Synonyms-default Pipeline. Click it to open the Index Pipeline panel. Fusion will take you directly to that pipeline.

Click on the first pipeline stage (Field Mapping) and press Remove. When the confirmation dialog opens select Yes, Delete. We don’t need that stage for this example, and a clean pipeline is a happy pipeline.

Remove the second stage as well (Solr Dynamic Field Name Mapping). Don’t need that one either.

You should now have a pipeline with only the Solr Indexer stage in it. Perfect.

Close this panel by clicking on the X in the upper right hand corner.

5. Create a new parser pipeline called csv-only and just have the CSV parser in it

Directly below the Description field is the Parser pipeline input field. Below that, to the right, is a label that says Open Default Parser. Click it to open the Parser panel. Fusion will take you directly to that parser.

Press the Add+ button to the right of the Filter field (you can find that towards the top left of the Parser panel).

Enter a Parser ID of csv-only.

Select each of the parsers in turn and press the red Remove button for each one except for the CSV parser.

Press Save in the upper right hand corner of the Parser panel.

Your csv-only parser should look something like this:

Close this panel by clicking on the X in the upper right hand corner.

6. Index the file just for fun

Okay, maybe not for fun. This is the first of a few iterations needed to configure search to use multi-term synonyms properly.

So press Run -> Start and count the seconds as Fusion indexes the 4-line CSV file. Then click on the X in the right-hand corner of the Scheduler panel to close it.

In the time it takes you to inhale the crawl will be done. To prove it, press the plus sign in the upper right of the Datasource panel and select Query Workbench. It will automatically display the 3 docs from the CSV file.

Lovely, isn’t it? However, that was not the point of the exercise. The point is actually Step 8.

Press X and close the Query Workbench. We will need to reload the query pipeline (synonyms-test-default) later so just close it.

7. Clear the datasource

Datasources Panel -> Clear Datasource

Are you sure? -> Yes

‘Nuff said.

8. Edit the managed-schema file and change body from strings to text_general

The motivation behind indexing the file for fun was so the managed-schema would automatically create a meager schema that we could then modify.

So:

Datasources Panel -> + -> Solr Config -> managed-schema

Scroll down to about line 414, to this entry:

<field name="body" type="strings"/>

Change the type to text_general:

<field name="body" type="text_general"/>

Don’t close this panel yet.
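
For the curious, the same one-attribute change can be sketched in Python with the standard library's ElementTree. This is only an illustration on a tiny stand-in fragment, not a way to edit Fusion's live managed-schema: the schema wrapper element and the title field are assumptions of mine, and set_field_type is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Stand-in fragment. Only the body field line comes from the post;
# the <schema> wrapper and the title field are assumptions.
SCHEMA_FRAGMENT = """<schema name="example" version="1.6">
  <field name="title" type="strings"/>
  <field name="body" type="strings"/>
</schema>"""

def set_field_type(schema_xml, field_name, new_type):
    """Return the schema XML with the named field switched to new_type."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            field.set("type", new_type)
    return ET.tostring(root, encoding="unicode")

updated = set_field_type(SCHEMA_FRAGMENT, "body", "text_general")
print(updated)
```

Only the body field changes type. Everything else is left alone, which is exactly what the manual edit does.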

9. Edit text_general to use solr.SynonymGraphFilterFactory as the synonym analyzer for queries

Scroll up to around line 238. Look for the fieldType definition for text_general (you could do this for any fieldType, but the analyzable ones are the best to work with as you can search against them).

Each fieldType is divided into two sections: analysis for indexing and analysis for querying. Multi-term synonym support is specifically for the query side of things. The starting definition looks like this (note the SynonymFilterFactory line in the query analyzer):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The new definition looks like this (the synonym declaration now uses SynonymGraphFilterFactory):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Press Save. Close the Solr Config panel (click the X in the upper right corner of the panel).
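
If you would rather see the swap spelled out as code, here is a hedged Python sketch that makes the same change on a copy of the fieldType XML. It touches only the query analyzer and leaves the index analyzer alone. The function names are mine, and it works on a string, not on Fusion's config store.

```python
import xml.etree.ElementTree as ET

# The starting text_general definition from the post.
FIELD_TYPE_XML = """
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
"""

def use_graph_synonyms(field_type_xml):
    """Swap SynonymFilterFactory for SynonymGraphFilterFactory in the query analyzer only."""
    root = ET.fromstring(field_type_xml.strip())
    for analyzer in root.findall("analyzer"):
        if analyzer.get("type") != "query":
            continue
        for flt in analyzer.findall("filter"):
            if flt.get("class") == "solr.SynonymFilterFactory":
                flt.set("class", "solr.SynonymGraphFilterFactory")
    return root

def filter_classes(root, analyzer_type):
    """List the filter classes of one analyzer, in order."""
    return [flt.get("class")
            for analyzer in root.findall("analyzer")
            if analyzer.get("type") == analyzer_type
            for flt in analyzer.findall("filter")]

updated_root = use_graph_synonyms(FIELD_TYPE_XML)
print(filter_classes(updated_root, "query"))
```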

10. Index the file again

Run the datasource again and run a wildcard search in the Query Workbench (*:*). You should see your 3 documents again.

In the Query Workbench enter a search for usa. You should find 1 document.

Do the same for the query united states of america. You should find 2 documents.

Do the same for the query states of america. You should also find 2 documents.

Time for you to see the awesome power of multi-term synonyms in action.

11. Edit synonyms.txt and add usa,united states of america

Datasources Panel -> + -> Solr Config -> synonyms.txt

Empty out the file and just enter the following:

usa,united states of america

Press Save. Press X to close the Solr Config panel.

12. Edit the query pipeline to add an additional parameter: sow=false

This is it! The moment of truth. One last step!

Datasources Panel -> + -> Query Pipelines -> synonyms-test-default -> Add -> Additional Query Parameters -> Parameters and Values -> + -> Parameter Name -> sow -> Parameter Value -> false -> Save

That was a mouthful.
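
For reference, sow is an ordinary Solr request parameter, so the pipeline stage just tacks it onto the query. Here is a small Python sketch of what such a request URL could look like if you talked to Solr directly. The host and port, the /select handler, and df=body as the default search field are all assumptions on my part, so adjust them for your install.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: Solr's usual default port
COLLECTION = "synonyms-test"              # the collection created in step 2

def build_select_url(query, sow=False):
    """Build a /select URL that carries the sow parameter."""
    params = {
        "q": query,
        "df": "body",  # assumption: search the body field by default
        "sow": "true" if sow else "false",
    }
    return f"{SOLR_BASE}/{COLLECTION}/select?{urlencode(params)}"

url = build_select_url("united states of america")
print(url)
```

The sketch only builds the URL and never sends it, so it is safe to run anywhere.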

Close the Query Pipeline panel.

Datasources Panel -> + -> Query Workbench

Query: usa (should return 2 documents: doc 1 and doc 2)

Query: united states of america (should return 2 documents: doc 1 and doc 2)

Query: states of america (should return 2 documents: doc 1 and doc 3)
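
Why does sow matter? With sow=true the query parser splits the query on whitespace before analysis, so the synonym filter only ever sees one word at a time and a four-word synonym can never match. With sow=false the whole query string reaches the analyzer in one piece. Below is a toy Python model of that idea. It is not Solr's implementation. It assumes OR semantics between clauses, ignores stop words, and only expands a chunk that equals a synonym entry exactly. All the function names are mine.

```python
import re

DOCS = {
    "Doc 1!": "The United States of America sent a man to the moon in 1969",
    "Doc 2!": "The USA has won again!",
    "Doc 3!": "The states of America continue to fluctuate.",
}
SYNONYM_LINE = "usa,united states of america"

def tokenize(text):
    """Lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())

def parse_synonym_line(line):
    """Turn 'a,b c d' into a list of token tuples, one per entry."""
    return [tuple(tokenize(entry)) for entry in line.split(",")]

def contains_phrase(tokens, phrase):
    """True if phrase occurs as a contiguous run inside tokens."""
    k = len(phrase)
    return any(tuple(tokens[i:i + k]) == phrase
               for i in range(len(tokens) - k + 1))

def query_clauses(query, group, sow):
    """Build the OR'ed clauses; each clause is a tuple of tokens (a phrase)."""
    words = tokenize(query)
    # sow=True: analyze one word at a time. sow=False: analyze the whole query.
    chunks = [[w] for w in words] if sow else [words]
    clauses = []
    for chunk in chunks:
        if tuple(chunk) in group:
            clauses.extend(group)                     # expand to every synonym
        else:
            clauses.extend((w,) for w in chunk)       # plain single-term clauses
    return clauses

def search(query, sow):
    """Return the titles of documents matching any clause."""
    group = parse_synonym_line(SYNONYM_LINE)
    clauses = query_clauses(query, group, sow)
    return [title for title, body in DOCS.items()
            if any(contains_phrase(tokenize(body), c) for c in clauses)]

for q in ("usa", "united states of america", "states of america"):
    print(q, "->", search(q, sow=False))
print("with sow=True:", search("united states of america", sow=True))
```

In this toy model the three sow=False searches line up with the results listed above, while the sow=True search for united states of america falls back to word-by-word matching and never finds doc 2.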

And there you have it.

Sadly, the cat missed everything after Fusion started.

Thanks!

Much thanks to Steve Rowe for his patience and fortitude.

References

https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/
