LWS: How to index a file system (the simple version)


So you got past your approach anxiety and downloaded LucidWorks Search (or Solr, though I won’t go into that).

[From here on I will refer to LucidWorks Search as LWS. Just for the excitement. The version is 2.6.2.]

You installed that puppy.

[Screenshot: installation finished]

You let the installer start the server for you (which you can also do manually by running $LWS_HOME/app/bin/start.[sh|bat]).

[Screenshots: starting the server]

You opened your browser to the URL given to you by the installer (aren’t you clever?) and went to go have a cup of the most delicious coffee you could find while you waited for the system to start.

[Screenshot: the LWS launch pad]

Woo hoo! The launch pad! Where you can do all sorts of…launch paddy things.

What do you do with a search engine anyway? Oh, yeah, search things. Let’s index some local files and see what that means.

  1. Click on Quick Start (that’s the clock icon with the label Quick Start under it).
  2. Click on File System.
  3. Enter a name for this crawl target and enter the path to the files you want to index.
  4. Press Submit.

Depending on how many documents you are pointing to, the crawl may take a few seconds or a few lifetimes. With just 61 documents of varying sizes, I waited a few seconds.

[Screenshot: Quick Start file system crawl in progress]

Yawn (the cat refused to comment). Maybe that was too easy. Let’s dig down one layer and see how it’s really done.

I am going to make a few assumptions:

  • You are running LWS on your local machine (I am using a VirtualBox VM running Kubuntu 13.04, with 2 cores, 4 GB of RAM, and 101 GB of virtual hard disk)
  • You accepted all the default configurations

Everything will run smoothly now. Please ignore the smoke coming out of your server.

The Short Version

  1. Find some content (doc/pdf/ppt)
    • Keep track of the path to the content
  2. Create a collection
    • Give it a name and press Create
  3. Create and configure a data source
    • Give the data source a name and the path from Step #1
  4. Run the crawl

The Long Version

1. Find some content (doc/pdf/ppt)

Somewhere on your hard drive you must have files. Somewhere, anywhere. Some of them might even be yours. Make note of the location. We’ll need that path in a few minutes (maybe seconds depending on your reading speed).
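If you are not sure what is actually sitting in that directory, a quick inventory can help before you hand the path to LWS. The following is a minimal sketch and not part of LWS itself; the `inventory` helper and the extension list (taken from the doc/pdf/ppt in the step title) are my own assumptions, so adjust them to taste.

```python
import os
import sys
from collections import Counter

# Extensions taken from the step title (doc/pdf/ppt); extend as needed.
EXTENSIONS = {".doc", ".pdf", ".ppt"}


def inventory(root):
    """Return (absolute path of root, Counter of matching file extensions)."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext in EXTENSIONS:
                counts[ext] += 1
    return os.path.abspath(root), counts


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python inventory.py <directory>")
    else:
        path, counts = inventory(sys.argv[1])
        print("Path to give the data source:", path)
        for ext, n in sorted(counts.items()):
            print(f"  {ext}: {n}")
```

The absolute path it prints is the one to keep handy for Step 3.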

2. Create a collection

Ah! Welcome to Ease of Use Land where you get to navigate a UI and let it worry about the messy details of doing things like updating the proper configurations or killing off your enemies using a remotely piloted drone.

Click on the LucidWorks Search icon or open your browser to http://localhost:8989/admin/. If this is your first time in, log on with admin/admin.

Notice that two collections already exist: collection1 and quickstart. While they both have good parentage, you can safely ignore them.

  1. Press the green New Collection button
  2. Give the new collection a name like lucidworks-is-awesome.
  3. Press the green Create button.

Excellent. You now have a collection in which to store your files. And its name is pretty spiffy too.

3. Create and configure a data source

  1. Click on the name of your new collection (lucidworks-is-awesome)
  2. Stop and admire the page. Notice the iPhone-like curves of the corners. The use of green to catch the eye.
  3. Push the green New Data Source button
  4. Select File system from the list of possible data sources (we’ll cover more of them in future posts)
  5. Give this data source a name (test-files?) and the full path to your documents
  6. Press the green Create button

You are ready to execute the last step.

4. Run the crawl

Press the Start Crawl button. It is not green. Stop looking at the green Edit buttons. You only want to press those if you want to change the data source configuration (stop looking at those buttons!).

At the bottom of the page is the Crawl History section which will display the crawl status. It starts as Running (shouldn’t crawling be labelled as Crawling?) and ends as Finished.
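For the click-averse, Steps 2 through 4 can probably be scripted against the LWS REST API as well. Treat the following as a rough sketch under loud assumptions rather than a reference: the port (8888), the endpoint paths, the payload field names and the `lucid.aperture` crawler value are all from memory of the 2.x API and should be checked against the documentation for your version. The data source id and the file path in the demo are hypothetical placeholders. The sketch only builds the requests and prints them; nothing is sent.

```python
import json
import urllib.request

# ASSUMPTION: the LWS REST API listens on port 8888 (the admin UI is on 8989).
API = "http://localhost:8888/api"


def build_request(method, url, payload=None):
    """Build (but do not send) a JSON request."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def create_collection(name):
    # Step 2: create a collection (ASSUMED endpoint and payload).
    return build_request("POST", f"{API}/collections", {"name": name})


def create_file_data_source(collection, name, path):
    # Step 3: create a file system data source.
    # ASSUMPTION: field names and the "lucid.aperture" crawler are from memory.
    payload = {"name": name, "type": "file", "crawler": "lucid.aperture", "path": path}
    return build_request("POST", f"{API}/collections/{collection}/datasources", payload)


def start_crawl(collection, datasource_id):
    # Step 4: start the crawl job for one data source (ASSUMED endpoint).
    url = f"{API}/collections/{collection}/datasources/{datasource_id}/job"
    return build_request("PUT", url)


if __name__ == "__main__":
    requests_to_show = [
        create_collection("lucidworks-is-awesome"),
        # "/path/to/your/files" is a placeholder for the path from Step 1.
        create_file_data_source("lucidworks-is-awesome", "test-files", "/path/to/your/files"),
        # The id 1 is hypothetical; the real id would come from the server's reply.
        start_crawl("lucidworks-is-awesome", 1),
    ]
    for req in requests_to_show:
        print(req.get_method(), req.full_url, req.data)
```

To actually send one of these you would pass it to `urllib.request.urlopen`, but only after confirming the endpoints against your own installation.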

To check your handiwork, click on Tools on the top menu bar and press the blue Search button without entering a keyword.

[Screenshot: search results]

Bathe in its pleasant glow. Notice the use of the blue gradient in the background, the black in the document description. The facets to the right of the results. The use of orange to offset the URL to the document (which doesn’t work, of course; you have to configure the web server to find the files).

Click on the title link. As this is an administrative environment it does not take you to the document, but to the information that was indexed during the crawl.

And there you have it. As an exercise, edit the data source configuration and see what happens when you exclude some subdirectories in your path, include others, or skip certain files based on their file extension.
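If you want a feel for how include and exclude rules interact before you start editing, here is a small sketch of the general idea using regular expressions. It is an illustration only, not how LWS implements it: the `should_crawl` helper, the sample paths, and the rule that an exclude match always wins are all assumptions of mine.

```python
import re


def should_crawl(path, includes=(), excludes=()):
    """Return True if path passes the include/exclude regex lists.

    Sketch semantics (an assumption, not documented LWS behavior):
    an exclude match always wins; with no includes, everything else passes.
    """
    if any(re.search(pattern, path) for pattern in excludes):
        return False
    if includes:
        return any(re.search(pattern, path) for pattern in includes)
    return True


# Hypothetical paths, for illustration only.
paths = [
    "/data/docs/report.pdf",
    "/data/docs/archive/old.pdf",
    "/data/docs/notes.tmp",
    "/data/docs/slides/deck.ppt",
]

# Skip an entire subdirectory and skip files by extension.
excludes = [r"/archive/", r"\.tmp$"]
kept = [p for p in paths if should_crawl(p, excludes=excludes)]
print(kept)
```

With these rules only `report.pdf` and `deck.ppt` survive: the archive subdirectory and the `.tmp` file are filtered out.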

The cat was marginally interested, but knew she fared better than last week’s turkey.

Keep those cards and letters coming in.

Carlos Valcarcel is a full-time employee of LucidWorks, but lives in New York as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer, the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag, he admits that answers will be harder to give without them. The cat isn’t real, but then neither are you. Enjoy your search responsibly.
