Searchy FAQ (last updated 12/30/04)
You should always check out the README file first...
To compile
In the main directory of searchy-0.1:
./setup
./configure
gmake
To generate an index
The mkdb/ subdirectory deals with parsing the data and sorting it into inverted files.
First, you need a file called mkdb.config; an example mkdb.config file is provided in the top-level directory.
Second, there are two ways to create the index for your corpus:
- Specify the list of path names for the files to be indexed in "myfile", one line per filename, then run:
./mkdb/docparser < myfile | ./mkdb/edsort -t tmp | ./mkdb/mkdb
(-t tmp specifies the directory where edsort will store its large temporary files)
- Feed the contents of all documents as one long input stream in a special format:
./createonebigfile.pl | ./mkdb/docparser -i | ./mkdb/edsort -t tmp | ./mkdb/mkdb
./createonebigfile.pl is your own script that "stitches" your corpus together in the following special format and feeds it to the indexer.
An example file starts with:
!@#$% 123 /disk/files/f1\n
where !@#$% is a special string used to detect out-of-sync documents, 123 is the size in bytes of the file
contents following this special line, and /disk/files/f1 is the name of the file.
This can be the name of a real file on disk, or it could simply be some identifier string (like a URL).
Depending on which you choose, you need to change how you map the DID and word position returned by Searchy back to displayed search context.
Suppose you specify prefix=ind in the mkdb.config file; the files generated by the indexer will appear in the current directory as:
- ind-f: a binary file that contains all the inverted lists for each word appearing in the corpus, sorted by DIDs and then word
positions within a document.
- ind-w2p.db: a Berkeley DB file that maps a word to the offset of its inverted list in the ind-f file.
- ind-n2f.db: a Berkeley DB file that maps a DID to the name of the file, if you use method 1 for indexing.
- ind-f2n.db: a Berkeley DB file that maps filenames to DIDs. This file exists to make sure you don't index duplicate files.
If you use method 2 for indexing, it is up to you to generate ind-n2f.db (e.g. in ./createonebigfile.pl) when you feed the files with
their corresponding DIDs for indexing.
To search
The query/ subdirectory deals with searching.
Similarly, you need a mkdb.config file to tell the search engine where the index files are.
To search, type the command:
query/qe -q "congestion control"
1.56019:5:207
0.840253:3:3593
0.829798:291:4
0.719151:608:2
...
The list of results is returned, one line per document.
The first number is the document score, the second is the matching DID, and the third is the word position of the match within that document.
You need your own script to translate the DIDs and word positions into nice web pages using ind-n2f.db etc.
To search for "congestion control" only within complete sentences:
qe -q "congestion control :."
For fancier searches, e.g. to find "congestion control" in the same sentence AND jacobson anywhere in the document:
qe -q "jacobson (congestion control :.)"
or
qe -q "congestion control :. jacobson"
query/printresults.pl is an example CGI script that displays the search results nicely in a web page.
To extend the indexer/search engine
More often than not, you will want to extend Searchy to include your domain specific knowledge about the corpus for more accurate ranking.
There are many different types of QueryOp objects that process the stream of postings (or Matches).
Currently, there are four major QueryOp objects:
- Word
- And
- Between
- Aggregate
You can implement your own QueryOp object; you need to implement these two virtual methods in your object:
- Match next ()
- Match advance (PostIt to)
You then tie it together with the rest of the QueryOp objects in qe.C:parse_query().
Tag bits
Tag bits are useful if you want to give extra score to words with special meanings or that occur in special context. For example, you
might want to double the score of a document if it contains the query word in bold font (figure captions or author list etc.)
To do this, first extend the indexer by writing a new document parser class, inheriting from the base class DocParser in the mkdb/ directory.
This new parser, NewDocParser, tags words that appear in bold font by setting the leftmost bit of the word_pos field to 1.
More than one tag bit can be set, but they must be the leftmost bits of the word_pos field in PostIt. Note also that
tag bits take space away from word_pos, so fewer words can be indexed in a document.
Second, modify the mkdb.config file to specify how many tag bits you've used (1 in this case),
along with a score multiplier (e.g. 2 if you want to double the score).
Tag bits are expensive in terms of space; use them only when their effects are fuzzy and merely influence the overall score.
If you want to match ONLY certain special words (e.g. only words in bold font), it is better to
make a separate special stream that records the positions of the special words and perform an explicit AND with that stream.
mkdb/ includes an example FilenameDocParser that sets the leftmost tag bit of a word if it appears in the filename.