Sunday, November 09, 2008

How does search inside websites work?


I had the same question when I started out in this industry. In my early days, when I was first exposed to RDBMS, I assumed there must be a database and a query behind every such search. Later, as I got deeper into RDBMS, I began to wonder whether a database was really being used at all. Eventually I learnt that there are website search engines available which first index the pages of the site and then render results based on queries against the engine.
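Conceptually, the index these engines build is an inverted index: a map from each term to the pages containing it, so a query is a lookup rather than a scan of every page. Here is a toy sketch of that idea in Java (this is only an illustration of the concept, not how Lucene or any real engine stores its data):

```java
import java.util.*;

// Toy illustration of what a site search engine does: index pages once,
// then answer queries against the index instead of scanning every page.
public class InvertedIndexDemo {
    // term -> set of page URLs containing that term
    static Map<String, Set<String>> index = new HashMap<>();

    // Tokenize the page text and record which page each term appears on.
    static void indexPage(String url, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
        }
    }

    // A query is now a simple lookup in the index.
    static Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        indexPage("/about.html", "About our open source search engine");
        indexPage("/faq.html", "Search tips and frequently asked questions");
        System.out.println(search("search")); // both pages contain "search"
    }
}
```

Real engines add ranking, stemming, and compact on-disk storage on top of this basic structure, but the index-then-query shape is the same.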

How many of you have heard of the word "Lucene"? Lucene, from Apache, is a full-featured search engine library written entirely in Java. It is an open source project and is available for free. For those who want an overview of Lucene, please visit http://lucene.apache.org/.

But I am NOT going to talk about Lucene here. Lucene is a library, so you have to embed its compiled JARs in your application and write the indexing and search code yourself. I wish to talk about two different search engines that do the work we need, and to compare them. Keep in mind that my intention is NOT to recommend any specific implementation.

The two search engines are the Apache Solr project and the ht://Dig project.

Apache Solr (http://lucene.apache.org/solr/)
Solr is again an Apache project, built on top of Lucene. Solr is a search server. It can index and search not just websites, but practically anything under the sun. Let me explain.

Solr is basically a WAR file which can be deployed in any web container. It ships with examples that show the XML format used to upload data, along with shell / DOS scripts to do the upload. All we need to do is get our data into that specific XML format and then upload it to the Solr instance. Once that is done, Solr is ready to render search results for you based on the query string. Solr has its own set of programming APIs as well.
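To make the upload format concrete, here is a small sketch that builds a Solr update document. The field names ("id", "title", "body") are just examples I have chosen; the real field names come from your Solr schema, and the port is the default one used by the Solr example setup:

```java
// Sketch of the XML update format Solr accepts for adding documents.
// Field names here are illustrative; your schema.xml defines the real ones.
public class SolrXmlDemo {
    // Minimal XML escaping so field values don't break the document.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toAddXml(String id, String title, String body) {
        return "<add><doc>"
             + "<field name=\"id\">" + escape(id) + "</field>"
             + "<field name=\"title\">" + escape(title) + "</field>"
             + "<field name=\"body\">" + escape(body) + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("1", "My first page", "Hello Solr"));
        // POST this XML to the update handler (http://localhost:8983/solr/update
        // in the example setup), then send <commit/> to make it searchable.
    }
}
```

The bundled post scripts do essentially this for you: wrap your data in `<add><doc>` XML and POST it to the update handler.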

Solr has an excellent Administrator screen through which all the admin operations for the server can be performed.

A typical scenario where I use Solr: I have an application whose heart is its search feature. I convert all my data to XML and index it with Solr. The same query that takes 2 seconds against my RDBMS comes back from Solr in a very few milliseconds. Experience it yourself by working with Solr.
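Querying is just an HTTP GET against Solr's select handler. A small sketch of building such a query URL, assuming the default host and port from the Solr example setup:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of querying Solr over HTTP: a search is a GET against /select.
// Host, port and the field name in the query are assumptions based on
// the default Solr example configuration.
public class SolrQueryDemo {
    static String queryUrl(String baseUrl, String query) {
        return baseUrl + "/select?q="
             + URLEncoder.encode(query, StandardCharsets.UTF_8)
             + "&wt=xml";
    }

    public static void main(String[] args) {
        System.out.println(queryUrl("http://localhost:8983/solr", "title:lucene"));
        // Fetch this URL with any HTTP client; Solr returns the matching
        // documents in the requested response format.
    }
}
```

Because the interface is plain HTTP, any language that can make a GET request can search against Solr, not just Java.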

ht://Dig (http://www.htdig.org/)
ht://Dig is a website search engine. All you need to do is configure the website URL in the ht://Dig configuration file (htdig.conf), and ht://Dig indexes all the pages of the site and gives you the results.
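For flavour, here is a hedged sketch of what such a configuration file looks like. The attribute names below are the common ones from ht://Dig's documentation, but the paths and URL are placeholders you would replace with your own:

```
# Sketch of an ht://Dig configuration file (plain "attribute: value" text).
database_dir:   /var/lib/htdig/db
start_url:      http://www.example.com/
limit_urls_to:  ${start_url}
maintainer:     webmaster@example.com
```

After editing the config, you run the bundled indexing tools (the `rundig` script drives the dig and merge steps), and the `htsearch` CGI program then answers queries from the generated database.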

ht://Dig is a Unix project and hence best suited to Unix. You need to download the software and follow a few simple steps to set it up. ht://Dig runs as a separate engine and renders search results based on the query URL.

ht://Dig can index multiple websites and render search results for multiple queries at the same time. Once you add an entry for a site to the config file, an indexing run is performed and a database is created. Multiple website URLs can be configured in the same config file, or you can have multiple config files.

The best practice is to have a single config file per site.

Geeks, try both; it took me just 3 hours to try them out. These are good pieces of software worth knowing about.

Jai Hind
