This will be brief but not necessarily quick. I can’t find the cat and I have never felt comfortable without that level of uncertainty.
This post discusses a rather interesting new feature in Fusion called the Parallel Bulk Loader (or, as those in the know call it, PBL. Not to be mistaken for PBJ).
So sometimes you’re in a rush and you just need to get something indexed faster than usual (meaning all the time). Your choices typically depend on your task, but indexing lots of documents leaves you very few paths except going direct to Solr, which doesn’t always address your pre-processing needs or your scaling needs.
A couple of rules of thumb to bear in mind with Solr: regardless of the number of documents you want to index, you typically want to steer away from indexing more than 250-300 GB of stuff in any given shard. And whether you have 1 collection or 1,000 collections (way close to the ZooKeeper event horizon for Solr), how fast you can index depends on disk space and how well you’ve sharded your collection.
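If you like your rules of thumb as arithmetic, the shard math is just ceiling division. A minimal sketch (the corpus size here is hypothetical; plug in your own):

```shell
# Back-of-the-envelope shard count for the 250-300 GB rule of thumb.
# DATA_GB is a made-up example corpus size.
DATA_GB=1200
MAX_SHARD_GB=250
MIN_SHARDS=$(( (DATA_GB + MAX_SHARD_GB - 1) / MAX_SHARD_GB ))  # ceiling division
echo "at least $MIN_SHARDS shards"   # prints: at least 5 shards
```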
Want the straight beef (or for you vegetarians: the impossible beef)? Here is the documentation that discusses it.
The flash non-fiction: the Parallel Bulk Loader (PBL) allows you to upload documents into Solr faster than you ever have before (with the possible exception of something like curl, with the caveat that it’s hard to control multiple curls running at once, so scaling out gets… well, curly).
For today’s shallow introduction to the PBL let me reveal the details of my rather puny environment:
Hardware
- 4 CPU cores
- 32G RAM
- 250G Disk
Software
- Fusion 4.2.1
- includes Spark
- Parallel Bulk Loader
The Short Story
- Start Fusion
- Configure the PBL Job
- Run the PBL job
The Long Story
Let’s pretend you have a 1-million-line CSV file and you just have to upload it with the least pain possible (the cat agrees heartily). You could crawl the file, or you could use the Parallel Bulk Loader to push the file in as fast as possible (skipping the Fusion index pipeline and parser) or almost as fast as possible (using the Fusion index pipeline and/or parser and/or SQL processing within Spark).
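Don’t have a million-line CSV file lying around? One quick way to manufacture one (the column names here are made up; the PBL only cares that a header row exists):

```shell
# Generate a test CSV: one header row plus 1,000,000 data rows.
ROWS=1000000
{
  echo "id,name,value"
  seq 1 "$ROWS" | awk '{ printf "%d,item-%d,%d\n", $1, $1, $1 % 100 }'
} > million-row-file.csv
wc -l million-row-file.csv   # 1000001 lines: header + rows
```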
Assuming you have a CSV file the following steps should work for you:
- Start Fusion (don’t have it? Download it, unzip it, and you are ready to go)
- Log in
- Create an app (name it something original like bulk-loader-test or cucumber)
- Go to your app by clicking on the newly created app panel
Figure – The App Panel for the newly created app
- Go to Apps -> Jobs -> Parallel Bulk Loader
- Spark Job Id: million-row-file
- Format: csv
- Path: [enter the path to your file]
- Read Options: header true
- Output Collection: bulk-loader-test
- Send to Index Pipeline: bulk-loader-test
Figure – Example Configuration of the Parallel Bulk Loader job
A few things to note:
Spark Job ID: that one is for Spark’s use. Name it anything you like; a recognizable name helps if you need to check things out in the Spark logs or in the Spark master/worker UI.
Format: In this case all you need to state is csv. That tells Spark and the PBL all it needs to know about the input file. It will handle the rest.
Path: where is your file? Please enter an absolute path. A relative path might work, but why play with uncertainty like that?
Read Options: these depend on the file format entered previously. In this case, setting header to true simply tells the PBL job that the CSV file has a header row that can be used for field names.
Output Collection: where should the documents (each line of the CSV file) go? In this case, the default collection is the output collection.
Send to index pipeline: (optional) Which Fusion index pipeline should each document go through? If you leave this blank, the documents will be sent directly to Solr. If you fill it in, they will be processed by Fusion first. This adds overhead, so if you simply need the content pushed into Solr as fast as it can go, don’t bother putting in an index pipeline name.
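For the curious, the UI fields above map onto a job definition that looks roughly like the JSON below. This is a sketch, not gospel: the exact field names are my best guess from the UI labels, so check the Parallel Bulk Loader configuration documentation linked above before copying any of it.

```json
{
  "type": "parallel-bulk-loader",
  "id": "million-row-file",
  "format": "csv",
  "path": "/path/to/million-row-file.csv",
  "readOptions": [ { "key": "header", "value": "true" } ],
  "outputCollection": "bulk-loader-test",
  "outputIndexPipeline": "bulk-loader-test"
}
```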
There are a lot of other things you can set here (I didn’t even bother enabling the Advanced Section. The cat wouldn’t hear it): Spark settings, clearing the collection before each run (or not clearing the collection), etc. There is much to be written about the PBL one day. That place is not here and that day is not today.
- Run -> Start
Before
If you run basic timings on the ingestion of a million lines you will see a difference. Using the File System connector with an almost-empty parser (just the CSV parser) and an almost-empty index pipeline (just the Solr Indexer stage), ingesting one million lines of a CSV file took 2 hours (2 hours, 1 minute, and 24 seconds, but who’s counting?).
After
PBL with just the index pipeline (no parser used): 8 minutes 47 seconds
PBL with no index pipeline: 5 minutes 47 seconds
Objections/Whines
Q: Can’t I just upload my docs into Solr directly? Especially something like CSV/XML/JSON?
A: The cat is mildly offended. Of course that can be true, but isn’t always.
Let’s suppose you have 10 collections of 10 shards each. You could write a script to call, say, curl 100 times, either in succession (a standard loop) or in parallel (set them loose as individual processes), and watch them go to town. That would work. No argument.
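The 100-curl approach sketched out, for flavor. This assumes a stock Solr at localhost:8983 and pre-split chunk files (the host, collection names, and file names are all hypothetical); `/update` with a `text/csv` Content-Type is Solr’s standard CSV handler:

```shell
# Fire 10 chunk uploads per collection, 100 curls total, all in parallel.
SOLR=http://localhost:8983/solr
for c in $(seq 1 10); do
  for s in $(seq 1 10); do
    curl -s "$SOLR/collection$c/update?commit=true" \
      -H 'Content-Type: text/csv' \
      --data-binary "@chunk-$c-$s.csv" &
  done
done
wait   # block until all 100 background uploads finish
```

Herding those 100 background processes (retries, failures, back-pressure) is exactly the part that gets curly.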
Or you could create 10 PBL jobs (or fewer, or more) after tuning Spark to properly handle the multithreading, and have only those jobs to administer instead of 100 lines in a script where anything can go wrong and often does. As usual, YMMV; make your life choices wisely.
References
Configuration Settings for the Parallel Bulk Loader
Disclosures
Carlos Valcarcel is a full-time employee of LucidWorks but lives in New York, as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer, the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag, he admits that answers will be harder to give without them.
The cat isn’t real, but then neither are you. Enjoy your search responsibly.