
How Not To Set Up SharePoint Search
September 3, 2009

Posted by moffitar in Uncategorized.

This is a repost from my old blog over at SharePointBlogs.com. Since that site has been down more and more often, ostensibly due to denial-of-service attacks, I am moving my blog over here. I had a heart-stopping couple of days when I thought I had lost everything I had posted there, but when it briefly came up a couple of weeks ago, I seized the opportunity to archive everything.

Original post date: 2-3-2008 on http://www.SharePointblogs.com/wsspectacular


This is going to be a quick little post where I confess to doing something stupid.  You will find that I am not shy about talking about mistakes as well as triumphs. It’s how we learn (and sometimes, I might serve as a cautionary example to others).

We are currently prototyping a MOSS 2007 Enterprise solution for a government organization.  This customer has a fairly large public website, and a similarly large intranet site.  While demonstrating how to configure search scopes, I created a scope pointing at the external site, and changed its settings to “unlimited” without really thinking about what that meant.

Well, what it means is that it’s a good way to fill up your server’s hard drive, and a good way to annoy other departments of the government.  It turns out that in “unlimited” mode, SharePoint crawls every link, then keeps crawling the links found on destination pages, branching out until, I assume, it has indexed the entire Internet.  I also learned that SharePoint doesn’t handle ROBOTS.TXT files correctly, so sensitive information can be crawled along with everything else.  I couldn’t believe this until I found a post that confirms it: SharePoint only looks for ROBOTS.TXT at the root of a site, and ignores it wherever else it may be:

Observations

During our testing we discovered the following.

  1. The robots.txt file is cached for 24 hours following its first request by the crawler. The implication of this is that changes to robots.txt require either a restart of the Office Search Service or a delay of up to 24 hours before they are respected by the gatherer.
  2. Placing a robots.txt anywhere other than the root of a website is completely ineffective.

For example, http://www.website.com/folder/robots.txt will be ignored.
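To make the quoted behavior concrete: every exclusion has to be collapsed into a single file at the web root, because that is the only location the MOSS gatherer checks.  A minimal sketch follows; the disallowed paths are hypothetical, and the wildcard user-agent covers the MOSS crawler along with everyone else:

    # Served from http://www.website.com/robots.txt -- the root is the
    # ONLY location the MOSS gatherer will look; a copy placed at
    # /folder/robots.txt is ignored entirely.
    User-agent: *
    Disallow: /internal/    # hypothetical sensitive directory
    Disallow: /staging/     # hypothetical work-in-progress content

    # Per observation 1 above, the gatherer caches this file for up to
    # 24 hours, so after editing it, restart the Office Search Service
    # if you need the change respected immediately.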

So after running for a weekend, the following Monday I received a much-forwarded email from irate webmasters in another city who wondered what the hell my MOSS server was doing crawling every directory on their site, even in places specifically flagged off-limits by their ROBOTS.TXT files (that upset them more than anything else).

I also saw that the Search database had grown by seven gigabytes over the weekend, which brought my SharePoint server to its knees (the VM we were using had limited disk space). 

So, let my life serve as a warning to others.  Don’t use the “Unlimited” scope setting unless you know where every link on your website goes. 

Edited to add:

I forgot to mention the resolution to this.  One of the customer’s biggest problems with their intranet and public website is search integration.  For an external site, probably the best solution is to use Google Site Search, which basically embeds a Google search field on your web page and restricts the results to your organization’s domain.

This is better, I think, than pointing SharePoint at it, because first, Google handles ROBOTS.TXT files correctly.  Second, especially if you have a large site, Google does all the heavy lifting and even caches versions of the web pages.  If you wanted to, you could drop this functionality into a SharePoint Search Center page and have the Google field sit next to your SharePoint field.
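If you don’t want Google’s hosted embed snippet, the same effect can be had with a plain HTML form that hands the query off to Google with a site restriction.  A rough sketch of the idea (www.example.gov is a placeholder for the organization’s URL; Google Site Search proper generates its own snippet for you):

    <!-- Search box that submits the query to Google, with results
         filtered to one domain via the "sitesearch" parameter.
         "www.example.gov" stands in for the real site. -->
    <form action="http://www.google.com/search" method="get">
      <input type="hidden" name="sitesearch" value="www.example.gov" />
      <input type="text" name="q" size="30" />
      <input type="submit" value="Search" />
    </form>

Dropped into a Content Editor Web Part on a Search Center page, a form like this can sit right alongside the regular SharePoint search box.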

For this customer’s intranet, it is another matter entirely.  A couple of years ago, they were running both sites on a beleaguered NT4 server running IIS 4.  I know, most of you are cringing.  This server had been hacked in the past, but it had been recovered and locked down and was limping dutifully along.  We migrated all the content to IIS 6 without much difficulty, but the one thing that couldn’t be migrated was the search solution.  IIS 4 included a built-in script that allowed basic user searches (via Microsoft Indexing Service), and it actually worked pretty well.  But in IIS 6, Microsoft stripped it out, perhaps out of concern for security, and offered no similar functionality.  We explored and evaluated numerous open-source and third-party replacements, but they were too expensive, too inaccurate, or simply didn’t work as advertised.  The most functional option was the FrontPage search component, which used a web bot to index content, but it turned out it never did this automatically; you had to re-run it every time you updated the website (which was often).

Now this customer has MOSS 2007, and I’m glad to say that MOSS is going to solve their problem once and for all.  We have the ability to create search scopes, target specific types of information, and go far beyond any functionality they ever had before.
