This will be brief but not necessarily quick. I can’t find the cat and I have never felt comfortable without that level of uncertainty.
This post discusses a rather interesting new feature in Fusion called the Parallel Bulk Loader (or, as those in the know call it, PBL. Not to be mistaken for PBJ).
So sometimes you’re in a rush and you just need to get something indexed faster than usual (meaning all the time). Your choices typically depend on your task, but indexing lots of documents leaves you very few paths except going direct to Solr, which doesn’t always address your pre-processing needs or your scaling needs.
A couple of rules of thumb to bear in mind with Solr: regardless of the number of documents you want to index, you typically want to steer away from indexing more than 250-300 GB of stuff in any given shard. And whether you have 1 collection or 1,000 collections (way close to the ZooKeeper event horizon for Solr), how fast you can index depends on disk space and how well you’ve sharded your collection.
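If you like your rules of thumb as arithmetic, the shard math is just ceiling division. A minimal sketch (the corpus size here is hypothetical; plug in your own):

```shell
# Back-of-the-envelope shard count for the 250-300 GB rule of thumb.
# DATA_GB is a made-up example corpus size.
DATA_GB=1200
MAX_SHARD_GB=250
MIN_SHARDS=$(( (DATA_GB + MAX_SHARD_GB - 1) / MAX_SHARD_GB ))  # ceiling division
echo "at least $MIN_SHARDS shards"   # prints: at least 5 shards
```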
Want the straight beef (or for you vegetarians: the impossible beef)? Here is the documentation that discusses it.
The flash non-fiction: the Parallel Bulk Loader (PBL) allows you to upload documents into Solr faster than you ever have before (with the possible exception of something like curl, with the caveat that it’s hard to control multiple curls running at once, so scaling out gets… well, curly).
For today’s shallow introduction to the PBL let me reveal the details of my rather puny environment:
Hardware
- 4 CPU cores
- 32G RAM
- 250G Disk
Software
- Fusion 4.2.1
- includes Spark
- Parallel Bulk Loader
The Short Story
- Start Fusion
- Configure the PBL Job
- Run the PBL job
The Long Story
Let’s pretend you have a 1-million-line CSV file and you just have to upload it with the least pain possible (the cat agrees heartily). You could crawl the file, or you could use the Parallel Bulk Loader to push the file in as fast as possible (skipping the Fusion index pipeline and parser) or almost as fast as possible (using the Fusion index pipeline and/or parser and/or SQL processing within Spark).
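Don’t have a million-line CSV file lying around? One quick way to manufacture one (the column names here are made up; the PBL only cares that a header row exists):

```shell
# Generate a test CSV: one header row plus 1,000,000 data rows.
ROWS=1000000
{
  echo "id,name,value"
  seq 1 "$ROWS" | awk '{ printf "%d,item-%d,%d\n", $1, $1, $1 % 100 }'
} > million-row-file.csv
wc -l million-row-file.csv   # 1000001 lines: header + rows
```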
Assuming you have a CSV file the following steps should work for you:
- Start Fusion (don’t have it? Download it, unzip it, and you are ready to go)
- Log in
- Create an app (name it something original like bulk-loader-test or cucumber)
- Go to your app by clicking on the newly created app panel
Figure – The App Panel for the newly created app
- Go to Apps -> Jobs -> Parallel Bulk Loader
- Spark Job Id: million-row-file
- Format: csv
- Path: [enter the path to your file]
- Read Options: header true
- Output Collection: bulk-loader-test
- Send to Index Pipeline: bulk-loader-test
Figure – Example Configuration of the Parallel Bulk Loader job
A few things to note:
Spark Job ID: that one is for Spark’s use. Name it anything you like; a recognizable name helps if you need to check things out in the Spark logs or in the Spark master/worker UI.
Format: In this case all you need to state is csv. That tells Spark and the PBL all it needs to know about the input file. It will handle the rest.
Path: where is your file? Please enter an absolute path. A relative path might work, but why play with uncertainty like that?
Read Options: these depend on the file format entered previously. In this case, setting header to true simply tells the PBL job that the CSV file has a header row that can be used for field names.
Output Collection: where should the documents (each line of the CSV file) go? In this case, the default collection is the output collection.
Send to index pipeline: (optional) Which Fusion index pipeline should each document go through? If you leave this blank, the documents will be sent directly to Solr. If you fill it in, they will be processed by Fusion first. This adds overhead, so if you simply need the content pushed into Solr as fast as it can go, don’t bother putting in an index pipeline name.
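For the curious, the UI fields above map onto a job definition that looks roughly like the JSON below. This is a sketch, not gospel: the exact field names are my best guess from the UI labels, so check the Parallel Bulk Loader configuration documentation linked above before copying any of it.

```json
{
  "type": "parallel-bulk-loader",
  "id": "million-row-file",
  "format": "csv",
  "path": "/path/to/million-row-file.csv",
  "readOptions": [ { "key": "header", "value": "true" } ],
  "outputCollection": "bulk-loader-test",
  "outputIndexPipeline": "bulk-loader-test"
}
```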
There are a lot of other things you can set here (I didn’t even bother enabling the Advanced Section. The cat wouldn’t hear it): Spark settings, clearing the collection before each run (or not clearing the collection), etc. There is much to be written about the PBL one day. That place is not here and that day is not today.
- Run -> Start
Before
If you run basic timings on the ingestion of a million lines you will see a difference. Using the File System connector with an almost-empty parser (just the CSV parser) and an almost-empty index pipeline (just the Solr Indexer stage), ingesting one million lines of a CSV file took 2 hours (2 hours, 1 minute, and 24 seconds, but who’s counting?).
After
PBL with just the index pipeline (no parser used): 8 minutes 47 seconds
PBL with no index pipeline: 5 minutes 47 seconds
Objections/Whines
Q: Can’t I just upload my docs into Solr directly? Especially something like CSV/XML/JSON?
A: The cat is mildly offended. Of course that can be true, but isn’t always.
Let’s suppose you have 10 collections of 10 shards each. You could write a script to call, say, curl 100 times, either in succession (a standard loop) or in parallel (set them loose as individual processes), and watch them go to town. That would work. No argument.
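The 100-curl approach sketched out, for flavor. This assumes a stock Solr at localhost:8983 and pre-split chunk files (the host, collection names, and file names are all hypothetical); `/update` with a `text/csv` Content-Type is Solr’s standard CSV handler:

```shell
# Fire 10 chunk uploads per collection, 100 curls total, all in parallel.
SOLR=http://localhost:8983/solr
for c in $(seq 1 10); do
  for s in $(seq 1 10); do
    curl -s "$SOLR/collection$c/update?commit=true" \
      -H 'Content-Type: text/csv' \
      --data-binary "@chunk-$c-$s.csv" &
  done
done
wait   # block until all 100 background uploads finish
```

Herding those 100 background processes (retries, failures, back-pressure) is exactly the part that gets curly.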
Or you could create 10 PBL jobs (or fewer, or more) after tuning Spark to properly handle the multithreading, and have only those jobs to administer instead of 100 lines in a script where anything can go wrong and often does. As usual, YMMV; make your life choices wisely.
References
Configuration Settings for the Parallel Bulk Loader
Disclosures
Carlos Valcarcel is a full-time employee of LucidWorks but lives in New York, as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer, the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag, he admits that answers will be harder to give without them.
The cat isn’t real, but then neither are you. Enjoy your search responsibly.