Category Archives: search

Fusion 4.x: Overeating with the Parallel Bulk Loader


This will be brief but not necessarily quick. I can’t find the cat and I have never felt comfortable without that level of uncertainty.

This post will discuss a rather interesting new feature in Fusion called the Parallel Bulk Loader (or as those in the know call it: PBL. Not to be mistaken for PBJ).

So sometimes you’re in a rush and you just need to get something indexed faster than usual (meaning all the time). Your choices typically depend on your task, but indexing lots of documents leaves you very few paths except to go directly to Solr, which doesn’t always address your pre-processing or scaling needs.

A couple of rules of thumb to bear in mind with Solr: regardless of the number of documents you want to index, you typically want to steer away from indexing more than 250G-300G of stuff into any given shard. And whether you have 1 collection or 1,000 collections (the latter getting way close to the ZooKeeper event horizon for Solr), how fast you can index depends on disk space and how well you’ve sharded your collection.
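As a quick sanity check on that ceiling, the shard-count arithmetic looks like this (the index size is hypothetical; only the ~300G-per-shard rule of thumb comes from above):

```shell
# Back-of-the-envelope sharding: at a ~300G-per-shard ceiling,
# a hypothetical 3000G index needs at least this many shards.
INDEX_GB=3000
PER_SHARD_GB=300
echo $(( (INDEX_GB + PER_SHARD_GB - 1) / PER_SHARD_GB ))   # prints 10
```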

Want the straight beef (or for you vegetarians: the impossible beef)? Here is the documentation that discusses it.

The flash non-fiction: the Parallel Bulk Loader (PBL) allows you to upload documents into Solr faster than you ever have before (with the exception of something like curl with the caveat that it’s hard to control multiple curls running at once so scaling out gets…well, curly).

For today’s shallow introduction to the PBL let me reveal the details of my rather puny environment:

Hardware

  • 4 CPU cores
  • 32G RAM
  • 250G Disk

Software

  • Fusion 4.2.1
    • includes Spark
    • Parallel Bulk Loader

The Short Story

  • Start Fusion
  • Configure the PBL Job
  • Run the PBL job

The Long Story

Let’s pretend you have a 1 million line CSV file and you just have to upload it with the least pain possible (the cat agrees heartily). You could crawl the file or you could use the Parallel Bulk Loader to push the file in as fast as possible (avoiding the Fusion index pipeline and parser) or almost as fast as possible (use the Fusion index pipeline and/or parser and/or SQL processing within Spark).
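Don’t have a million-line CSV handy? A quick way to fake one for testing (the file name and column names are made up for this example):

```shell
# Generate a hypothetical million-line CSV with a header row.
printf 'id,name,value\n' > million.csv
awk 'BEGIN { for (i = 1; i <= 1000000; i++) printf "%d,row-%d,%d\n", i, i, i % 100 }' >> million.csv
wc -l million.csv   # 1000001 lines: one header plus a million rows
```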

Assuming you have a CSV file the following steps should work for you:

  • Start Fusion (don’t have it? Download it, unzip it, and you are ready to go)
    • Log in
    • Create an app (name it something original like bulk-loader-test or cucumber)
  • Go to your app by clicking on the newly created app panel

Figure – The App Panel for the newly created app

  • Go to Apps -> Jobs -> Parallel Bulk Loader
    • Spark Job Id: million-row-file
    • Format: csv
    • Path: [enter the path to your file]
    • Read Options: header true
    • Output Collection: bulk-loader-test
    • Send to Index Pipeline: bulk-loader-test

Figure – Example Configuration of the Parallel Bulk Loader job

A few things to note:

Spark Job ID: that is for Spark’s use. Name it anything you want; it’s what you’ll look for if you need to check things out in the Spark logs or if you start up a Spark master/worker.

Format: In this case all you need to state is csv. That tells Spark and the PBL all it needs to know about the input file. It will handle the rest.

Path: where is your file? Please enter an absolute path. A relative path might work, but why play with uncertainty like that?

Read Options: this has to do with the file format entered previously. In this case, the use of header/true simply tells the PBL job that a header line exists in the CSV file whose values can be used as field names.

Output Collection: where should the documents (each line of the CSV file) go? In this case, the default collection is the output collection.

Send to index pipeline: (optional) Which Fusion index pipeline should each document go through? If you leave this out (blank) the document will be sent directly to Solr. If you fill this in, it will be processed by Fusion first. This adds overhead so if you simply need the content pushed into Solr as fast as it can go then don’t bother putting in an index pipeline name.

There are a lot of other things you can set here (I didn’t even bother enabling the Advanced Section. The cat wouldn’t hear it): Spark settings, clearing the collection before each run (or not clearing the collection), etc. There is much to be written about the PBL one day. That place is not here and that day is not today.
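For the curious, the same job expressed as JSON looks roughly like the following. This is a sketch from memory, not gospel (the path is a placeholder and the exact property names may differ); the Configuration Settings reference below has the authoritative list.

```
{
  "type": "parallel-bulk-loader",
  "id": "million-row-file",
  "format": "csv",
  "path": "/absolute/path/to/million.csv",
  "readOptions": [ { "key": "header", "value": "true" } ],
  "outputCollection": "bulk-loader-test",
  "outputIndexPipeline": "bulk-loader-test"
}
```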

  • Run -> Start

Before

If you run basic timings on the ingestion of a million lines you will see a difference. Using the File System connector with an almost empty parser (just the CSV parser) and an almost empty index pipeline (just the Solr Indexer stage), it took 2 hours (2 hours, 1 minute, and 24 seconds, but who’s counting?) to ingest one million lines of a CSV file.

After

PBL with just the index pipeline (no parser used): 8 mins 47 seconds

PBL with no index pipeline: 5 mins 47 secs
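Put another way, in rough docs-per-second terms (simple integer math on the timings above):

```shell
# One million rows divided by each wall-clock time, in seconds.
echo $(( 1000000 / (2*3600 + 1*60 + 24) ))   # connector + pipeline: 137 docs/sec
echo $(( 1000000 / (8*60 + 47) ))            # PBL + index pipeline: 1897 docs/sec
echo $(( 1000000 / (5*60 + 47) ))            # PBL straight to Solr: 2881 docs/sec
```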

Objections/Whines

Q: Can’t I just upload my docs into Solr directly? Especially something like CSV/XML/JSON?
A: The cat is mildly offended. Of course that can be true, but isn’t always.
Let’s suppose you have 10 collections of 10 shards each. You could write a script to call, say, curl 100 times either in succession (standard loop) or in parallel (set them loose as individual processes) and watch them go to town. That would work. No argument.
Or you could create 10 PBL jobs (or fewer, or more) after tuning Spark to properly handle the multithreading, and have only those jobs to administer instead of 100 lines in a script where anything could go wrong and often does. However, as usual, YMMV; make your life choices wisely.
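For the record, the curl-per-collection approach might look something like this (printed rather than executed here; the collection names and Solr port are hypothetical, and the CSV update handler assumes a stock Solr):

```shell
# Emit one Solr CSV update command per collection; drop the `echo` to run them,
# and append `&` to each call if you want them loose as parallel processes.
for i in $(seq 1 10); do
  echo curl -s "http://localhost:8983/solr/coll-$i/update?commit=true" \
       -H "Content-Type: text/csv" --data-binary @million.csv
done
```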

References

Import with the Bulk Loader

Parallel Bulk Loader

Configuration Settings for the Parallel Bulk Loader

Disclosures

Carlos Valcarcel is a full time employee of LucidWorks, but lives in New York as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag he admits that answers will be harder to give without them.

The cat isn’t real, but then neither are you. Enjoy your search responsibly.


Fusion 4.x: How to Upload Configuration Files for Use by Solr


So you want to upload configuration files that Solr can use to function properly and make your application a screaming success.

The cat is happy to hear this. There is nothing the cat likes more than to put Continue reading

Fusion 4: Regex Field Replacement Index Pipeline Stage


This will be a short one. At least the cat hopes so.

The question of how to change a date from something like 2020-04-01 into something less foolish (like 2020-04) came up recently and I couldn’t help but feel the pull of a simplistic solution (simple as well, but simplistic was the draw). This is something that is applicable to numerous scenarios where a string and its composite parts might be better off being rearranged (kind of like kaleidoscope without all the colors).
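The flavor of the trick, sketched outside of Fusion with plain sed (the stage itself relies on the same capture-group idea):

```shell
# Capture the year-month group, drop the day.
echo '2020-04-01' | sed -E 's/^([0-9]{4}-[0-9]{2})-[0-9]{2}$/\1/'   # prints 2020-04
```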

As Fusion 4 is available I will be using that for this example. Continue reading

Fusion 3.1: Multi-term synonyms!


Yes, my example is going to be trivial.

No, the cat is not happy with that.

Yes, I am doing it anyway.

With the advent of Solr 6.5 we have (drum roll, please) multi-term synonym support! Yes! Do that happy dance, but remember not to scuff up the floor too much.
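For flavor, a multi-word entry in a synonyms.txt file looks like this (the entries are made up; the graph-aware synonym filter that arrived in this Solr era is what makes the multi-term ones behave at query time):

```
# synonyms.txt fragment -- hypothetical entries
usa, united states of america
tv, television, idiot box
```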

Let’s run through a trivial example to show it off. Continue reading

Fusion 3.1.0: How to Use The REST Query Index Pipeline Stage


Fusion 3.1, everybody! The following may or may not work on past, or future, versions of Fusion.

Don’t have it? Go get it! Don’t make the cat do all the work.

Question: How do I use the Fusion REST Query index pipeline stage to add additional metadata to an inbound document?

Answer: This assumes the existence of a Solr collection with metadata and that Fusion knows of its existence (that means either use the default Solr cluster that runs within Fusion or make sure that the external Solr cluster you are using is registered with Fusion).

The basic steps:

  • create a collection to store the metadata and populate it with the metadata of your choice
  • create a collection which will hold the new, enhanced content with additional metadata from the first collection
  • configure the index pipeline of the second collection to include the REST Query stage, which will query the first collection and add some of its content to the inbound documents of the second collection
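The kind of lookup the REST Query stage issues against the first collection, written out as a plain Solr select (the collection, field, and key names here are all hypothetical):

```shell
SOLR="http://localhost:8983/solr"
# Fetch extra metadata fields for a given document key from the metadata collection.
QUERY="$SOLR/metadata/select?q=doc_key:abc-123&fl=author,category&wt=json"
echo "curl -s \"$QUERY\""   # run the printed command against a live cluster
```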

Some detail:
Continue reading

LWS: Connecting to Zabbix


So there I was, minding my own business, surfing the web reading about technology, clouds, development environments and fuzzy handcuffs when I found this absolutely incredible and much needed post on the LucidWorks Knowledge Base:

Installing Zabbix to integrate with LucidWorks

I will vet the steps and discuss this in a future post, but this is a major find. If you try it, let me know what you ran into and if it left a mark.

The cat is ecstatic. The box suddenly doesn’t feel so bad.

Happy Valentine’s Day!
