LWS: Crawling the Web (the simple version)


So I will safely assume that you have read the post on crawling your local file system because you downloaded LWS and couldn’t contain yourself. You are now so beside yourself with excitement that you can’t wait to try crawling the web. The cat is beside me and excited is not what I feel. Let’s start at the Launch Pad again (http://localhost:8989).

The Quick Start Version

  1. Click on Quick Start
  2. Click on Web
  3. Enter a catchy name like web-crawl
  4. Enter a URL like http://docs.lucidworks.com
  5. Leave the Crawl Depth blank. This will cause the most damage, er, read the most content
  6. Click Submit

See the numbers rapidly increasing as the crawler goes hog wild (nothing like visualizing wild hogs to get people motivated).

[Screenshot: Quick Start page after the crawl]

Really? You needed this? You ran into a problem when you entered a URL and pressed Submit? If you did run into a problem then you need to fix that first. I would suspect either a permission or authorization problem, but there is no way for me to tell from behind this donut.

Notice how I crawled the entire documentation subdomain from lucidworks.com (http://docs.lucidworks.com). Some sites can just take the abuse. So let’s not be such babies and do a web crawl the way it was meant to be done: through the UI.

The Shorter Version

  1. Create a collection. Name it the-web-collection
  2. Create a data source for the-web-collection. Name the data source lucidworks-docs and use the URL http://docs.lucidworks.com
  3. Start the crawl

The Longer Version

1. Create a collection. Name it the-web-collection

You can do this. Use the LWS admin UI. I believe in you.
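
If clicking buttons is beneath you, this step can probably be scripted against the LWS REST API. A minimal sketch, assuming the API lives on port 8888 under /api/collections (both of those are assumptions on my part, so check the REST API docs for your install):

```python
# Hedged sketch: create the collection over what I believe is the LWS REST API.
# The port and path are assumptions; verify them against your install's REST docs.
import requests

LWS_API = "http://localhost:8888/api"  # assumption: the core LWS service, not the 8989 Launch Pad

resp = requests.post(f"{LWS_API}/collections", json={"name": "the-web-collection"})
resp.raise_for_status()
print(resp.status_code, resp.text)
```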

2. Create a data source for the-web-collection. Name the data source lucidworks-docs  and use the URL http://docs.lucidworks.com

This is your first time. Have a seat. Most of it is pretty obvious, er, straightforward.

[Screenshot: web crawler data source configuration]

Set the following:

  • Name: lucidworks-docs
  • URL: http://docs.lucidworks.com
  • Crawl Depth: 3 [because that is what the quick start one uses]
  • Constrain To: tree
  • Skip Files Larger Than: 10485760 (that’s 10 MB)
  • Commit Within (seconds): 0
  • Commit When Crawl Finishes: [leave checked]

Press Create.
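
For the script-inclined, the same data source expressed as a REST call might look something like this. Fair warning: the endpoint and every field name below are my guesses mapped from the UI labels above, so treat it as a sketch and look up the real names in the data source REST docs:

```python
# Hedged sketch: create the lucidworks-docs data source over the (assumed) LWS REST API.
# Every field name here is a guess based on the UI labels above, not gospel.
import requests

LWS_API = "http://localhost:8888/api"  # assumption, as before

datasource = {
    "name": "lucidworks-docs",
    "type": "web",                      # a web crawl, as opposed to a file system crawl
    "url": "http://docs.lucidworks.com",
    "crawl_depth": 3,                   # same as the quick start
    "constrain_to": "tree",             # stay under the starting URL
    "max_bytes": 10485760,              # Skip Files Larger Than: 10 MB
    "commit_within": 0,
    "commit_on_finish": True,
}

resp = requests.post(f"{LWS_API}/collections/the-web-collection/datasources", json=datasource)
resp.raise_for_status()
print(resp.json())
```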

3. Start the crawl

Look for the button labeled Start Crawl. Press it. Savor the message towards the bottom of the page. If you are indexing the docs.lucidworks.com site then go get something to drink or have some dinner. It will take a few minutes.
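
Or, if you’d rather watch the pot boil from a terminal, poll the document count while you sip that drink. A hedged sketch, assuming Solr is reachable at the URL below (it may live somewhere else on your install):

```python
# Hedged sketch: poll Solr for the document count while the crawl runs.
# The Solr base URL is an assumption; point it at wherever LWS serves Solr.
import time
import requests

SOLR = "http://localhost:8888/solr/the-web-collection"  # assumption

def doc_count():
    resp = requests.get(f"{SOLR}/select", params={"q": "*:*", "rows": 0, "wt": "json"})
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

last = -1
while True:
    count = doc_count()
    print(f"{count} documents indexed so far")
    # Note: with Commit When Crawl Finishes checked, the count may sit at 0 and then
    # jump to everything at once when the final commit lands.
    if count and count == last:   # no change since the last check; probably done
        break
    last = count
    time.sleep(30)                # plenty of time to go get that drink
```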

[Screenshot: data source page after the manual crawl finishes]

Voila! All done. Life is Beautiful.

[Yes, I notice that the manual crawl is off by 4 documents. I’m working on it.]

Random Notes

Really awesome documentation can be found at the LucidWorks documentation site…which you just finished crawling. The page that goes into somewhat more depth on configuring the web crawler can be found at http://docs.lucidworks.com/display/help/Create+a+New+Web+Site+Data+Source. Since I don’t mind plagiarism, here is a summary of the configuration fields:

  • Name: an arbitrary name so you know which data source this is
  • URL: the URL of the web site you want to crawl
  • Ignore robots.txt: the robots.txt file is a way to tell the crawler what it should and should not do. You can read up on it here [INSERT LINK TO ROBOTS.TXT FILE HERE]
  • Proxy Server: are you trying to crawl the web from a location that only allows access through a proxy? Then this is the field for you. Click on Show and enter the appropriate proxy host, proxy port, username, and password.
  • Authentication Credentials: want to crawl a website that requires you to log in first? No problem! Basic, Digest and NTLM v1 and v2 are all supported.
  • Crawl Depth: at 0 (zero) it will only crawl the starting URL. If Crawl Depth is blank then the crawler will crawl everything on the site, and everything that site links to, and everything that site links to, and everything…I think you get the picture. It would become the never-ending crawl. A crawl depth of 1 (one) will only crawl links one level away from the starting URL (2 will crawl 2 levels from the starting URL, and so on). If the only site you want to index is the one at the starting URL then you want to set Constrain To to tree (there’s a toy sketch of these knobs right after this list).
  • Constrain To: tree or none. Tree means just the pages found under the base URL. None means you might need a lot of hard drives.
  • Include Paths: Specific paths that you want to include in the crawl. You can use regular expressions as well as complete paths.
  • Exclude Paths: Specific paths that you want to exclude in the crawl. You can use regular expressions as well as complete paths.
  • Skip Files Larger Than (bytes): if you think that size matters then this is the field for you. A value of -1 tells the crawler to read files of any size. Blank configures the crawler to ignore files larger than 10M. Rest easy: the largest file the crawler is allowed to read is 2G. [CHECK THIS]
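
If those knobs feel abstract, here is a toy crawler, a hedged sketch and emphatically not how LWS actually does it, that shows what Crawl Depth, Constrain To: tree, robots.txt, and Skip Files Larger Than boil down to:

```python
# Toy sketch (not LWS internals): a tiny crawler that makes Crawl Depth,
# Constrain To: tree, robots.txt, and Skip Files Larger Than concrete.
# Standard library only; point START at something you are allowed to hammer.
import re
import urllib.error
import urllib.request
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin

START = "http://docs.lucidworks.com"
CRAWL_DEPTH = 3          # 0 = only the start URL; blank in LWS = follow links forever
MAX_BYTES = 10485760     # Skip Files Larger Than: 10 MB

robots = urllib.robotparser.RobotFileParser(urljoin(START, "/robots.txt"))
robots.read()

seen = {START}
queue = deque([(START, 0)])               # (url, depth)

while queue:
    url, depth = queue.popleft()
    if not robots.can_fetch("*", url):    # the default, with Ignore robots.txt left unchecked
        continue
    try:
        with urllib.request.urlopen(url) as resp:
            length = resp.headers.get("Content-Length")
            if length and int(length) > MAX_BYTES:
                continue                  # too big, skip it
            body = resp.read(MAX_BYTES).decode("utf-8", errors="ignore")
    except (urllib.error.URLError, ValueError):
        continue                          # broken link; a real crawler would log this
    print(f"depth {depth}: {url}")
    if depth >= CRAWL_DEPTH:
        continue
    for href in re.findall(r'href="([^"]+)"', body):
        link = urljoin(url, href)
        # Constrain To: tree -- only follow links that stay under the start URL
        if link.startswith(START) and link not in seen:
            seen.add(link)
            queue.append((link, depth + 1))
```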

And within the Advanced section:

  • Commit Within (seconds): so you don’t trust Solr to commit the content at a rate you’re happy with? Then tell it how quickly you want the content committed.
  • Commit When Crawl Finishes: if checked, Solr will not commit the content until the crawl finishes. If you don’t need to see the content right away then this is a great way to keep resource usage to a minimum…until it starts committing. Then you might want to stand back. (There’s a short sketch of what these two commit settings boil down to, right after this list.)
  • Log Extra Detail: you want more information about the content being indexed? Then check the box.
  • Fail Unsupported File Types: normally the system ignores failures caused by files that can’t be parsed because the crawler (actually Tika) doesn’t recognize the file type. Checking this will cause those failures to be logged.
  • Add Failed Docs: Checking this box will cause document failures to be stored in the index. Whatever metadata is found will be stored in the index.
  • Log Warnings for Unknown File Types: when the crawler finds a file of a type it does not recognize it normally tries to parse it as a plain text file. If that fails then the crawler ignores it, but does not log an error. Checking this box will cause an event of this type to be logged.
  • Max Retries: how many times do you want the crawler to try to connect to the URL? Be aware: if the crawler can’t reach a web site that it successfully crawled in the past, it will start flagging that site’s documents for deletion. This is your only warning.
  • Output Type: where do you want to send the crawled content? The default value is solr as the default behavior of the crawler is to send any crawled content to Solr through an HTTP call. This will come in handy when we discuss Named Entity Recognition (NER). You normally leave this alone.
  • Output Arguments: the arguments to be used when the URL is created using the Output Type property. You normally leave this alone.
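
If those two commit settings feel abstract: under the hood the crawler is just sending documents to Solr’s update handler, and Commit Within maps onto Solr’s standard commitWithin parameter. A hedged sketch against plain Solr (the base URL is an assumption; the parameters are stock Solr):

```python
# Hedged sketch of what the commit knobs mean at the Solr level.
import requests

SOLR = "http://localhost:8888/solr/the-web-collection"  # assumption

doc = {"id": "http://docs.lucidworks.com/", "title": ["LucidWorks Documentation"]}

# Commit Within (seconds): ask Solr to make the document searchable within ~10 seconds.
# Solr's commitWithin parameter is in milliseconds, hence the * 1000.
requests.post(f"{SOLR}/update", json=[doc], params={"commitWithin": 10 * 1000})

# Commit When Crawl Finishes: skip the rolling commits and do one explicit commit at the end.
requests.post(f"{SOLR}/update", params={"commit": "true"})
```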

That was way too much. Notice the vast majority of the fields are left alone. We’re exciting like that.

BTW, Cassandra Targett rocks (if you don’t know who she is, please find out)!

References

http://docs.lucidworks.com/display/lweug/Getting+Started

Disclosures

Carlos Valcarcel is a full-time employee of LucidWorks, but lives in New York as he prefers hurricanes to earthquakes. Having worked at IBM, Microsoft, and Fast Search and Transfer, the only thing he is sure of is that the font editor he wrote on his Atari 800 was the coolest program he has ever written. While questions can be a drag, he admits that answers will be harder to give without them.

The cat isn’t real, but then neither are you. Enjoy your search responsibly.
