Fusion 4: Regex Field Replacement Index Pipeline Stage


This will be a short one. At least the cat hopes so.

The question of how to change a date from something like 2020-04-01 into something less foolish (like 2020-04) came up recently, and I couldn’t help but feel the pull of a simplistic solution (simple as well, but simplistic was the draw). This is applicable to numerous scenarios where a string and its component parts might be better off rearranged (kind of like a kaleidoscope without all the colors).

As Fusion 4 is available I will be using that for this example.

The Short List

1. Input some content with the string in question (in this case 2020-04-01)

2. Open the Index Workbench and add the Regex Field Replacement stage to your index pipeline

3. Configure the Regex Field Replacement to use the proper input field, output field, regex pattern, and regex groups to output

4. Simmer, Stir, Repeat until your input and output match

The Long List

My Input File (named date-sample.csv)

date
2018-01-01
2020-04-06

I prefer CSV files for testing as they are easily created and easily ingested. Better than eating vegetables.
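If you would rather generate the test file than type it, a few lines of Python will do. This is only a sketch: the file name and the two dates come from the sample above, while the function name write_sample is made up for illustration.

```python
import csv
from pathlib import Path


def write_sample(path):
    """Write the sample file from this post: a header line plus two dates."""
    rows = [["date"], ["2018-01-01"], ["2020-04-06"]]
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return Path(path)


if __name__ == "__main__":
    sample = write_sample("date-sample.csv")
    print(sample.read_text())
```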

Before Pre-step 1: download, install, and run Fusion 4 (go ahead, you know you want to).

Pre-step 1

Start Fusion.

Create an app (I called mine Date Test).

1. Input some content with the string in question (in this case 2020-04-01)

Create a local filesystem datasource to ingest the CSV. It can look something like this:

Don’t worry about the configuration of the parser or index pipeline stages. You’ll have plenty to worry about when you get home.

2. Open the Index Workbench and add the Regex Field Replacement stage to your index pipeline

This is where things get interesting (or not, depending on today’s meds). From the Index Workbench you can try out all sorts of things and they will cause no harm to your system. Unless you save the changes…and then I can’t guarantee anyone’s safety.

Open the Index Workbench by going to the menu icons to the left of the Fusion app page and selecting Index Workbench.

That will open the rather plain looking Index Workbench.

Fear not! The best is yet to come!

Press the Load button on the upper right-hand corner of the panel and select your datasource. The Index Workbench will come to life, load your datasource, and attempt to display a sample from your input file. Since my input file only has 3 lines (a header plus 2 content lines) the Index Workbench makes small work of the task.

(Click on the left side of the Parser to see the full list of Parsers)

However, small work that it is, there is too much going on. Press the green dots to turn them white (disabled) for every parser stage except the CSV parser (if you are indexing a CSV file…and why wouldn’t you?), and for the first two stages of the index pipeline. Every time you click on a dot the Index Workbench is going to reindex the sample of content. Be patient and disable everything except the CSV Parser and the Solr Indexer.

Much better. Notice that one of our dates is shown in the field called date and it is formatted as in the input file (your date may vary).

Time to change that.

Press Add in the index pipeline stage section and select the Regex Field Replacement stage.

Ignore the stage listed in the Other category. The cat snuck it in.

3. Configure the Regex Field Replacement to use the proper input field, output field, regex pattern, and regex groups to output

So our input field is called date. We will define an output field called dateOutput with a regex that will take the year and the day (or maybe that’s the month? I don’t know) and put those two pieces in the output field. A date of yyyy-mm-dd will be turned into yyyy-dd.

The regex, after much gnashing of teeth, is (\d+)-\d+-(\d+). The two groups in parentheses are the ones that will be pulled together for the denouement. We can use the $ syntax to create the new string ($1-$2 in this case).
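If you want to poke at the pattern before going anywhere near the Index Workbench, here is a minimal sketch in Python. Two assumptions to keep in mind: Python's re module spells the replacement as \1-\2 where the stage uses $1-$2, and the function names (first_and_last, first_and_second) are invented for this example. The second function is a variant that is not part of the stage configuration in this post; it moves the second group to the middle number in case that is the part you actually want to keep.

```python
import re

# The pattern from the post: capture the first and last numbers, skip the middle.
PATTERN = re.compile(r"(\d+)-\d+-(\d+)")


def first_and_last(value):
    # Python writes the replacement as \1-\2 where the stage uses $1-$2.
    return PATTERN.sub(r"\1-\2", value)


def first_and_second(value):
    # Hypothetical variant: capture the middle number instead of the last one.
    return re.sub(r"(\d+)-(\d+)-\d+", r"\1-\2", value)


if __name__ == "__main__":
    print(first_and_last("2020-04-06"))    # 2020-06
    print(first_and_second("2020-04-06"))  # 2020-04
```

Running it on the second sample date shows the difference between the two patterns: the first keeps the outer numbers, the second keeps the first two.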

Configure the stage with the following:

Regex Rules: press the green plus sign.

Source Field: press the ditto dots to open a panel that allows you to enter as many fields as you like to be regex-ed. Press the green plus sign there and enter a Source Field of date (the name of the field as told to us by the Index Workbench). Press Apply.

Target Field: dateOutput

Write Mode: overwrite

Regex Pattern: (\d+)-\d+-(\d+)

Return if no Match: input_string

No Match Literal Value: [leave blank]

Regex Replacement: $1-$2

Replace Which: all
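For those who think better in code than in form fields, the settings above can be approximated as a small Python function. This is a rough stand-in based on my reading of the setting names, not Fusion's implementation: the function name regex_field_replacement, the use of a plain dict as a document, and the exact handling of a missing source field are all assumptions.

```python
import re


def regex_field_replacement(doc, source="date", target="dateOutput",
                            pattern=r"(\d+)-\d+-(\d+)", replacement=r"\1-\2",
                            replace_all=True):
    """Approximate the stage settings listed above on a dict-shaped document."""
    value = doc.get(source)
    if value is None:
        return doc  # assumption: nothing happens when the source field is absent
    result = dict(doc)
    if re.search(pattern, value) is None:
        # "Return if no Match: input_string" -> copy the input through unchanged.
        result[target] = value
    else:
        # "Replace Which: all" -> count=0 replaces every match; 1 means first only.
        count = 0 if replace_all else 1
        result[target] = re.sub(pattern, replacement, value, count=count)
    # "Write Mode: overwrite" -> any value already in the target field is replaced.
    return result


if __name__ == "__main__":
    print(regex_field_replacement({"date": "2020-04-06"}))
```

The source field is left alone and the result lands in dateOutput, which mirrors what the Index Workbench preview is expected to show after you press Apply.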

When you press Apply for the stage, the Index Workbench will go into Deep Thought(tm) mode and reingest the content from the CSV file and show you the fruits of our labor.

These are what fruits look like in the search industry.

Closing Arguments

If you want to save your handiwork, and actually index the file, press Save in the upper right-hand corner of the Index Workbench panel, return to the Datasource panel, and run your crawl (or just click on the Start Job link in the Index Workbench).

The cat has disappeared. I think that is an unaccounted-for state.

Thanks

Thanks to Brian Land for asking about this and for being such a good guinea pig test subject.

Disclosures

Carlos Valcarcel is a full-time employee of LucidWorks, but lives in New York as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer, the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag, he admits that answers will be harder to give without them.

The cat isn’t real, but then neither are you. Enjoy your search responsibly.
