Category Archives: solr

Fusion 4.x: Overeating with the Parallel Bulk Loader


This will be brief but not necessarily quick. I can’t find the cat and I have never felt comfortable without that level of uncertainty.

This post will discuss a rather interesting new feature in Fusion called the Parallel Bulk Loader (or as those in the know call it: PBL. Not to be mistaken for PBJ).

So sometimes you’re in a rush and you just need to get something indexed faster than usual (meaning all the time). Your choices typically depend on your task, but indexing lots of documents leaves you very few paths except going directly to Solr, which doesn’t always address your pre-processing needs or your scaling needs.

A couple of rules of thumb to bear in mind with Solr: regardless of the number of documents you want to index, you typically want to steer away from putting more than 250GB-300GB of content in any given shard (so, for example, a terabyte of content suggests at least four shards). And whether you have 1 collection or 1,000 collections (getting dangerously close to the ZooKeeper event horizon for Solr), how fast you can index depends on disk space and how well you’ve sharded your collection.

Want the straight beef (or for you vegetarians: the impossible beef)? Here is the documentation that discusses it.

The flash non-fiction: the Parallel Bulk Loader (PBL) allows you to upload documents into Solr faster than you ever have before (with the exception of something like curl, with the caveat that it’s hard to control multiple curl processes running at once, so scaling out gets…well, curly).

For today’s shallow introduction to the PBL let me reveal the details of my rather puny environment:

Hardware

  • 4 CPU cores
  • 32GB RAM
  • 250GB disk

Software

  • Fusion 4.2.1
    • includes Spark
    • Parallel Bulk Loader

The Short Story

  • Start Fusion
  • Configure the PBL Job
  • Run the PBL job

The Long Story

Let’s pretend you have a 1-million-line CSV file and you just have to upload it with the least pain possible (the cat agrees heartily). You could crawl the file, or you could use the Parallel Bulk Loader to push the file in as fast as possible (avoiding the Fusion index pipeline and parser) or almost as fast as possible (using the Fusion index pipeline and/or parser and/or SQL processing within Spark).

Assuming you have a CSV file, the following steps should work for you:

  • Start Fusion (don’t have it? Download it, unzip it, and you are ready to go)
    • Log in
    • Create an app (name it something original like bulk-loader-test or cucumber)
    • Go to your app by clicking on the newly created app panel

Figure – The App Panel for the newly created app

  • Go to Apps -> Jobs -> Parallel Bulk Loader
    • Spark Job Id: million-row-file
    • Format: csv
    • Path: [enter the path to your file]
    • Read Options: header true
    • Output Collection: bulk-loader-test
    • Send to Index Pipeline: bulk-loader-test

Figure – Example Configuration of the Parallel Bulk Loader job

A few things to note:

Spark Job ID: this is for Spark’s use. Name it anything you want; it comes in handy if you need to check things out in the Spark log or if you start up a Spark master/worker.

Format: In this case all you need to state is csv. That tells Spark and the PBL all they need to know about the input file; they will handle the rest.

Path: where is your file? Please enter an absolute path. A relative path might work, but why play with uncertainty like that?

Read Options: this has to do with the file format entered previously. In this case, header/true simply tells the PBL job that a header row exists in the CSV file and can be used for field names.
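
If you’re curious what Format, Path, and Read Options boil down to in Spark terms, here is a minimal PySpark sketch of the equivalent read. It’s purely illustrative (Fusion runs its own Spark, and the file path below is a made-up placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("million-row-file").getOrCreate()

# Format = csv, Read Options = header/true, Path = wherever your file lives
df = (spark.read
      .format("csv")
      .option("header", "true")   # use the CSV header row for field names
      .load("/absolute/path/to/million-rows.csv"))  # hypothetical path

df.printSchema()
print(df.count())  # roughly 1,000,000 rows if all went well
```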

Output Collection: where should the documents (each line of the CSV file) go? In this case, the default collection is the output collection.

Send to index pipeline: (optional) Which Fusion index pipeline should each document go through? If you leave this blank, the documents will be sent directly to Solr. If you fill it in, they will be processed by Fusion first. That adds overhead, so if you simply need the content pushed into Solr as fast as it can go, don’t bother putting in an index pipeline name.
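
Picking up the PySpark sketch from above, the no-pipeline path is conceptually just a DataFrame write straight to Solr. This is a hedged sketch that assumes the Lucidworks spark-solr connector is on the classpath (the PBL uses it internally); the ZooKeeper address and collection name are placeholders for a default local install:

```python
# Assumes `df` from the earlier read sketch and the spark-solr connector.
(df.write
   .format("solr")
   .option("zkhost", "localhost:9983")        # placeholder ZooKeeper connect string
   .option("collection", "bulk-loader-test")  # the Output Collection setting
   .option("commit_within", "5000")           # soft-commit window in milliseconds
   .save())
```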

There are a lot of other things you can set here (I didn’t even bother enabling the Advanced Section. The cat wouldn’t hear it): Spark settings, clearing the collection before each run (or not clearing the collection), etc. There is much to be written about the PBL one day. That place is not here and that day is not today.

  • Run -> Start

Before

If you run basic timings on the ingestion of a million lines you will see a difference. Using the File System connector with an almost empty parser (just the CSV parser) and an almost empty index pipeline (just the Solr Indexer stage), ingesting one million lines of a CSV file took 2 hours (2 hours, 1 minute, and 24 seconds, but who’s counting?).

After

PBL with just the index pipeline (no parser used): 8 minutes 47 seconds

PBL with no index pipeline: 5 minutes 47 seconds

Objections/Whines

Q: Can’t I just upload my docs into Solr directly? Especially something like CSV/XML/JSON?
A: The cat is mildly offended. Of course that can be true, but isn’t always.
Let’s suppose you have 10 collections of 10 shards each. You could write a script that calls, say, curl 100 times, either in succession (a standard loop) or in parallel (set them loose as individual processes), and watch them go to town. That would work. No argument.
Or you could create 10 PBL jobs (or fewer, or more), after tuning Spark to properly handle the parallelism, and have only those jobs to administer instead of 100 lines of script where anything could go wrong and often does. However, as usual, YMMV; make your life choices wisely.
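
For completeness, the direct-to-Solr route the question asks about can be as small as streaming the CSV at Solr’s update handler. A minimal sketch, assuming a stock Solr on localhost:8983 and made-up collection and file names:

```python
import requests

solr_update = "http://localhost:8983/solr/bulk-loader-test/update"  # hypothetical collection

with open("/absolute/path/to/million-rows.csv", "rb") as f:  # hypothetical path
    resp = requests.post(
        solr_update,
        params={"commit": "true"},                    # commit once the upload finishes
        headers={"Content-Type": "application/csv"},  # tell Solr it is getting CSV
        data=f,                                       # stream the file body
    )
resp.raise_for_status()
print(resp.text)
```

Multiply that by 100 shards or collections and you have exactly the pile of script the answer above is wary of.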

References

Import with the Bulk Loader

Parallel Bulk Loader

Configuration Settings for the Parallel Bulk Loader

Disclosures

Carlos Valcarcel is a full-time employee of LucidWorks, but lives in New York as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer, the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag, he admits that answers will be harder to give without them.

The cat isn’t real, but then neither are you. Enjoy your search responsibly.

Fusion 3.1: Multi-term synonyms!


Yes, my example is going to be trivial.

No, the cat is not happy with that.

Yes, I am doing it anyway.

With the advent of Solr 6.5 we have (drum roll, please) multi-term synonym support! Yes! Do that happy dance, but remember not to scuff up the floor too much.

Let’s run through a trivial example to show it off. Continue reading

Fusion 3.0.0: How to Use The JDBC Lookup Index Pipeline Stage


Database lookups in the middle of ingesting content: reasonable request or abhorrent behavior? The cat tries to vote by breaking the radioactive vial, but isn’t alive long enough to vote.

In the meantime, there are 2 ways to make a database call: from the index pipeline or from the query pipeline. While the process is very, very (that’s 2 verys) similar, I wouldn’t assume the logic works exactly the same until you can put a quantum lock on future behavior.

You would use the JDBC Lookup Index Pipeline stage or the JDBC Lookup Query Pipeline stage. The cat prefers the index pipeline. We’ll discuss that one.
Continue reading

Fusion 3.1.0: How to Use The REST Query Index Pipeline Stage


Fusion 3.1, everybody! The following may or may not work on past, or future, versions of Fusion.

Don’t have it? Go get it! Don’t make the cat do all the work.

Question: How do I use the Fusion REST Query index pipeline stage to add additional metadata to an inbound document?

Answer: This assumes the existence of a Solr collection with metadata and that Fusion knows of its existence (that means either use the default Solr cluster that runs within Fusion or make sure that the external Solr cluster you are using is registered with Fusion).

The basic steps:

  • create a collection to store the metadata and populate it with the metadata of your choice
  • create a collection which will hold new enhanced content with additional metadata from the first collection
  • configure the index pipeline of the second collection to include the REST Query stage, which will query the first collection and add some of its content to the inbound document of the second collection (sketched below)
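
To make that third step a bit more concrete, here is a hedged sketch of the kind of lookup the REST Query stage performs conceptually: query the metadata collection for a matching document and fold a couple of its fields into the inbound document. It is plain Python against a plain Solr select handler, and the host, collection, field, and document values are all made up for illustration:

```python
import requests

def lookup_metadata(doc_id):
    # Query the (hypothetical) metadata collection for fields to merge in.
    resp = requests.get(
        "http://localhost:8983/solr/metadata-collection/select",
        params={"q": f"id:{doc_id}", "fl": "category,author", "wt": "json"},
    )
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return docs[0] if docs else {}

# A pretend inbound document headed for the second collection.
inbound_doc = {"id": "42", "title": "Some inbound document"}
inbound_doc.update(lookup_metadata(inbound_doc["id"]))
print(inbound_doc)
```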

Some detail:
Continue reading

Fusion: How to Resend a Query When You Don’t Like the Initial Results


The cat is hopping mad. Well, at least as hopping mad as a cat in a box can get.

Fusion has been on the streets since September of 2014 and there has been nary a post on this blog to talk about some of the rather inventive things that can be done with it. I am here to break that trend and do some super short blog posts that look at various things users might run into when using Fusion that aren’t necessarily in the documentation because, well, there are more interesting things to write about.

This post assumes you already have Fusion 2.4.1 up and running (you can download it from here) and that you understand the basics of search and Solr. There will not be a lot of background on the Why of Things. This is about the How of Things.

So let’s start by indexing something we can be upset about: Continue reading

Solr: Exporting an Index To an External File


For a change of pace we are going to look at content flow from a different direction. Instead of importing content we are going to export it. Why would we do that? A few reasons:

  • Having the content in Solr means that we can pre-process the fields during ingestion and export the changes for use in other venues (reports, backups, re-import into databases, etc.)
  • Sometimes you just need to have more than one kind of backup
  • Sometimes you feel like a nut

How would you do that (no, not feel like a nut)? Continue reading