Searchy FAQ (last updated 12/30/04)

You should always check out the README file first...

To compile

In the main directory of searchy-0.1:
./setup
./configure
gmake

To generate an index

The mkdb/ subdirectory parses the data and sorts it into inverted files.
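If the term "inverted file" is unfamiliar, the idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how mkdb/ is implemented: the function name, the in-memory dictionary, and the lowercase whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to a list of (DID, word_pos) postings.

    `docs` maps a numeric DID to the document text. Splitting on
    whitespace after lowercasing is an assumption of this sketch.
    """
    index = defaultdict(list)
    # Visiting DIDs in ascending order keeps every posting list sorted.
    for did in sorted(docs):
        for word_pos, word in enumerate(docs[did].lower().split()):
            index[word].append((did, word_pos))
    return dict(index)

docs = {1: "congestion control in networks", 2: "flow control"}
index = build_inverted_index(docs)
print(index["control"])
```

Each posting records which document a word occurs in and where, which is the information a search later needs to score and locate matches.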

First, you need a file called mkdb.config; an example mkdb.config file is provided in the top-level directory.
Second, there are two ways to create the index for your corpus:

Suppose you specify prefix=ind in the mkdb.config file. The files generated by the indexer will then appear in the current directory as:

If you use method 2 for indexing, it is up to you to generate ind-n2f.db (e.g. in ./createonebigfile.pl) when you feed the files, with their corresponding DIDs, to the indexer.

To search

The query/ subdirectory handles searching.

Similarly, you need a mkdb.config file to tell the search engine where the index files are.
To search, type the command:

query/qe -q "congestion control"

1.56019:5:207
0.840253:3:3593
0.829798:291:4
0.719151:608:2
...
The results are returned as a list, one line per document.
On each line, the first number is the document score, the second is the matching DID, and the third is the word position within that matching document.
You need your own script to translate the DIDs and wc into nice web pages using ind-n2f.db etc.
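As a starting point for such a script, here is a minimal Python sketch that splits result lines into their three fields. It assumes every line has exactly the score:DID:position layout shown above; the Hit type and parse_hit function are names made up for this example and are not part of Searchy.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    score: float    # document score
    did: int        # matching document ID
    word_pos: int   # word position within the matching document

def parse_hit(line):
    """Parse one 'score:DID:word_pos' result line."""
    score, did, word_pos = line.strip().split(":")
    return Hit(float(score), int(did), int(word_pos))

# The sample lines from the query shown above.
sample = """1.56019:5:207
0.840253:3:3593
0.829798:291:4
0.719151:608:2"""

hits = [parse_hit(line) for line in sample.splitlines()]
for hit in hits:
    print(hit.did, hit.score, hit.word_pos)
```

In a real script the lines would come from the output of query/qe instead of a string, and each DID would be looked up in ind-n2f.db to find the document to display.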

To search for "congestion control" only where both words occur in the same sentence:

qe -q "congestion control :."
Fancier searches are possible too, e.g. to find "congestion control" in the same sentence AND jacobson anywhere in the document:
qe -q "jacobson (congestion control :.)"  
or
qe -q "congestion control :. jacobson"

query/printresults.pl is an example CGI script that displays the search results nicely in a web page.

To extend the indexer/search engine

More often than not, you will want to extend Searchy with your domain-specific knowledge about the corpus to get more accurate ranking.

There are many different types of QueryOp objects that process the stream of posting lists (or Matches). Currently, there are 4 major QueryOp objects:

You can implement your own QueryOp object; you need to implement these two virtual methods in your object. You then tie it together with the rest of the QueryOp objects in qe.C:parse_query().
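To give a feel for what an operator over posting streams does, here is a small Python sketch of an AND over two streams. It is not Searchy's QueryOp interface (that lives in the C++ source); the function name is hypothetical, and the sketch assumes each stream yields unique DIDs in ascending order.

```python
def and_op(left, right):
    """Yield the DIDs that appear in both sorted streams."""
    left, right = iter(left), iter(right)
    l = next(left, None)
    r = next(right, None)
    while l is not None and r is not None:
        if l == r:
            # Both streams match this document: emit it, advance both.
            yield l
            l = next(left, None)
            r = next(right, None)
        elif l < r:
            # Advance whichever stream is behind.
            l = next(left, None)
        else:
            r = next(right, None)

congestion = [1, 3, 5, 8]   # DIDs containing one query word
control = [3, 4, 8, 9]      # DIDs containing the other
both = list(and_op(congestion, control))
print(both)
```

Because the operator consumes and produces a stream, its output can be fed to another operator, which is what makes it possible to chain operators for a compound query.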

Tag bits

Tag bits are useful if you want to give extra score to words that have special meanings or that occur in a special context. For example, you might want to double the score of a document if it contains the query word in bold font (figure captions, author list, etc.).

To do this, first you need to extend the indexer by writing a new document parser class that inherits from the base class DocParser in the mkdb/ directory. This new parser, NewDocParser, tags words that appear in bold font by setting the leftmost bit of the word_pos field to 1. More than one tag bit can be set, but they must be the leftmost bits of the word_pos field in PostIt. Note that tag bits take space away from word_pos, so fewer words can be indexed in a document. Second, you need to modify the mkdb.config file to specify how many tag bits you have used (1 in this case), along with a score multiplier (e.g. 2 if you want to double the score).

Tag bits are expensive in terms of space, so you should only use them when their effect is fuzzy and only influences the overall score. If you want to match ONLY certain special words (e.g. only words that appear in bold font), it is better to make a separate special stream that records the positions of the special words and perform an explicit AND with the other streams.

mkdb/ includes an example FilenameDocParser that sets the leftmost tag bit of a word if it appears in the filename.
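The bit arithmetic behind a single tag bit can be sketched in Python as follows. The 16-bit width of word_pos is purely an assumption for illustration (check PostIt in the source for the real width), and the function names, as well as the way the multiplier is applied, are hypothetical.

```python
WORD_POS_BITS = 16   # ASSUMED width of the word_pos field
TAG_BITS = 1         # number of tag bits in use

POS_BITS = WORD_POS_BITS - TAG_BITS      # bits left for the position
POS_MASK = (1 << POS_BITS) - 1           # largest storable position
BOLD_TAG = 1 << (WORD_POS_BITS - 1)      # the leftmost bit of the field

def pack(word_pos, bold):
    """Combine a word position and the bold tag into one field."""
    if not 0 <= word_pos <= POS_MASK:
        raise ValueError("word position does not fit beside the tag bit")
    return word_pos | (BOLD_TAG if bold else 0)

def unpack(field):
    """Return (word_pos, bold) from a packed field."""
    return field & POS_MASK, bool(field & BOLD_TAG)

def tagged_score(base_score, field, multiplier=2.0):
    """Apply the score multiplier when the tag bit is set."""
    _, bold = unpack(field)
    return base_score * multiplier if bold else base_score

field = pack(207, bold=True)
print(unpack(field), tagged_score(1.5, field))
```

With these assumed numbers, one tag bit halves the range of positions from 65536 to 32768, which is the space cost mentioned above.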

