"OpenFTS Inside"
---------------

MOTIVATION:

This document is a set of comments on the example scripts. If you have
already worked through the examples and are looking for the intrinsic
details, you are in the right place. For a detailed description of the
API see the OpenFTS primer.

INITIALIZATION:

There is no magic at this step. You have to create the tables and
configure OpenFTS. This is what init.pl does.

a)
    $dbi->do("create table txt (
                tid int not null primary key,
                path varchar,
                fts_index tsvector
              );") || die;

Creates the test table 'txt' with three fields: tid - document id,
path - path to the document, and fts_index of type tsvector (see the
tsearch2 documentation), which stores the unique lexemes of the
document.

b)
    my $idx=Search::OpenFTS::Index->init(
        dbi=>$dbi,
        txttid=>'txt.tid',
        tsvector_field=>'fts_index',
        ignore_id_index=>[ qw( 7 13 14 12 23 ) ],
        ignore_headline=>[ qw(13 15 16 17 5) ],
        map=>'{
            \'4\'=>[1], 5=>[1], 6=>[1], 8=>[1], 18=>[1], 19=>[1], # unknown
        }',
        dict=>[
            'Search::OpenFTS::Dict::PorterEng',
            # example how to use snowball stemmer
            # { mod=>'Search::OpenFTS::Dict::Snowball', param=>'{lang=>"english", stop_file=>"/u/megera/app/pgsql/fts/test-suite/Dict/english.stop"}' },
            'Search::OpenFTS::Dict::UnknownDict',
        ]
    );

Creates (instantiates) the fts object with some attributes, which are
stored in the table fts_conf of your database. The table looks like:

    ---------------------------------------------------
    openfts=# \d fts_conf
                      Table "fts_conf"
     Column |       Type        |      Modifiers
    --------+-------------------+---------------------
     name   | character varying | not null
     did    | integer           | not null default -1
     mod    | character varying | not null
     param  | character varying |
    Primary key: fts_conf_pkey
    ---------------------------------------------------

Now we have something to play with. First, you see a bunch of integers
in the attributes 'ignore_id_index', 'ignore_headline' and 'map'.
These numbers designate types of lexemes (see the OpenFTS primer) that
should either be ignored (attributes 'ignore_id_index' and
'ignore_headline') or be handled by particular dictionaries ('map').
For example,

    ignore_id_index=>[ qw( 7 13 14 12 23 ) ]

means that numbers in scientific notation (7), HTML tags (13), HTML
entities (23), the protocol part of a URL (strings like 'http://',
'ftp://') (14), and special symbols (12) should be ignored while
indexing a document.

    map=>'{
        \'4\'=>[1], 5=>[1], 6=>[1], 8=>[1], 18=>[1], 19=>[1], # unknown
    }',

means that lexemes of the listed types should be processed by the
dictionaries whose numbers are given in [] (numbering of dictionaries
starts from zero!), as defined in the 'dict' attribute.

The types of lexemes are defined in the parser, and the specific
numbers are those used by the default OpenFTS parser. So if you write
your own parser (why not?), keep in mind that it has to stay in sync
with the 'init' method. (See simple_parser.pl for an example of a
simple parser in Perl, which reads from STDIN and recognizes
space-delimited words of length >= 2.)

NOTICE: Due to a bug in perl5 you should use the \'4\' notation for
the first element in the map attribute!

The parser passes a lexeme to the dictionaries in the order specified
in the 'dict' attribute until some dictionary recognizes it. In our
example, the Search::OpenFTS::Dict::UnknownDict dictionary deals with
unrecognized words. You may leave it out, in which case those lexemes
will be ignored.

The parser uses the mapping to pass a lexeme to a specific dictionary,
which is not only an optimization but also helps with indexing
mixed-language documents. Some dictionaries (stemmers, for example)
can recognize every lexeme, so the mapping lets us state explicitly
which dictionary should be used for which type of lexeme. In our
example, we use Porter's stemmer (dictionary 0) for English "words"
and UnknownDict (dictionary 1, the target of the 'map' entries above)
for unrecognized words.
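
The routing described above can be illustrated with a small stand-alone
Perl sketch. It does not use the OpenFTS API. The two dictionaries are
hypothetical stand-in subroutines (a toy suffix stripper, which is not
Porter's algorithm, and a pass-through dictionary), the name
route_lexeme is invented for this illustration, the type numbers in the
calls are placeholders, and the rule that an unmapped type tries every
dictionary in 'dict' order is an assumption based on the description
above.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for dictionaries. Each returns a normalized
# lexeme, or undef when it does not recognize the word.
my @dict = (
    # 0: toy "stemmer" that strips one trailing 's' from alphabetic
    #    words. This is NOT Porter's algorithm, only a placeholder.
    sub {
        my $w = lc shift;
        return unless $w =~ /^[a-z]+$/;
        $w =~ s/s$//;
        return $w;
    },
    # 1: "unknown" dictionary that accepts anything unchanged.
    sub { return lc shift },
);

# Same shape as the 'map' attribute: lexeme type => list of dictionary
# numbers, counted from zero.
my %map = ( 4 => [1], 5 => [1], 6 => [1] );

sub route_lexeme {
    my ( $type, $word ) = @_;

    # Assumption: unmapped types try every dictionary in 'dict' order.
    my @order = exists $map{$type} ? @{ $map{$type} } : ( 0 .. $#dict );

    for my $n (@order) {
        my $lexeme = $dict[$n]->($word);
        return ( $n, $lexeme ) if defined $lexeme;
    }
    return;    # no dictionary recognized it, so the lexeme is ignored
}

my ( $n, $lexeme ) = route_lexeme( 1, 'Stars' );
print "type 1: dictionary $n -> $lexeme\n";

( $n, $lexeme ) = route_lexeme( 4, 'user@example.org' );
print "type 4: dictionary $n -> $lexeme\n";
```

In this sketch a mapped type never reaches the toy stemmer, which is the
point of the mapping: a dictionary that accepts everything can be kept
away from lexeme types it should not touch.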

Some dictionaries require parameters, and the commented line

    # { mod=>'Search::OpenFTS::Dict::Snowball', param=>'{lang=>"english"}' },

demonstrates how to define a dictionary in this case. Be sure to read
the documentation for the full list of parameters of the Snowball
stemmer.

If you have several dictionaries, and especially if you index a
multi-language collection, you may *explicitly* define the order in
which dictionaries process a lexeme. For example, if you want to use
our interface to ISpell dictionaries (which provides a sort of
morphology) together with Porter's stemming algorithm, it is a sound
idea to pass a lexeme to ISpell first and, only if ISpell does not
recognize it, to Porter's dictionary. (NOTE: Stemming dictionaries
recognize any word by definition!) In this case, you may map lexemes
as:

    1=>[0,1], 11=>[0,1], 16=>[0,1],    # latin
    2=>[2,3], 10=>[2,3], 17=>[2,3],    # cyrillic

where the dictionaries are assigned as:

    0 - ISpell english
    1 - Porter english
    2 - ISpell russian
    3 - Snowball russian

Read the OpenFTS primer for information about the dictionaries API.

c)
    $idx->create_index;

Creates the index 'gist_key' on field 'fts_index' of table 'txt'. This
index speeds up search operations, but it can significantly slow down
the indexing process. So, as a rule of thumb, for batch indexing of
documents create the index only after indexing has finished. For
online indexing you need to create the index at initialization. In our
test example we use batch mode, but for the sake of clarity we leave
index creation in init.pl.

INDEXING:

Read filenames from STDIN and invoke the method $idx->index to index
each document. Also, 'tid' and 'path' are stored so that the search
script can refer to them later. Actually, there are two operations on
the database:

    $sth->execute( $STID, $file )    - inserting 'tid', 'path'
and
    $idx->index($STID, \*INFILE)     - indexing and inserting 'fts_index'

That's why we need to explicitly invoke $dbi->commit if everything is
ok and $dbi->rollback if something goes wrong.
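
The commit-or-rollback handling of these two operations can be sketched
as follows. This is only an illustration, not the code of index.pl: the
helper store_document and the FakeHandle package are hypothetical, and
FakeHandle merely records calls so that the sketch runs without a
database. With a real DBI handle opened with AutoCommit switched off,
commit and rollback are the usual DBI methods.

```perl
use strict;
use warnings;

# Stand-in for a DBI handle so the sketch runs without a database.
# A real handle would come from DBI->connect with AutoCommit => 0.
package FakeHandle;
sub new      { return bless { log => [] }, shift }
sub commit   { push @{ $_[0]{log} }, 'commit';   return 1 }
sub rollback { push @{ $_[0]{log} }, 'rollback'; return 1 }

package main;

# Hypothetical helper: run both steps for one document and commit;
# roll back if either step dies.
sub store_document {
    my ( $dbi, $insert_row, $index_doc ) = @_;
    my $ok = eval {
        $insert_row->();    # e.g. $sth->execute( $STID, $file )
        $index_doc->();     # e.g. $idx->index( $STID, \*INFILE )
        $dbi->commit;
        1;
    };
    $dbi->rollback unless $ok;
    return $ok ? 1 : 0;
}

my $dbi = FakeHandle->new;
store_document( $dbi, sub { 1 }, sub { 1 } );
store_document( $dbi, sub { 1 }, sub { die "indexing failed\n" } );
print join( ', ', @{ $dbi->{log} } ), "\n";    # prints "commit, rollback"
```

Wrapping both steps in one eval means a failure while indexing also
discards the 'tid' and 'path' row inserted just before it.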
Read the DBI and DBD::Pg documentation for details about transaction
support.

SEARCHING:

The search script can be used for testing, benchmarking and searching.
Invoke ./search.pl without parameters to see its syntax.

    $sql = $fts->_sql( \@ARGV );

The method '_sql' returns the SQL query for a given search query (a
reference to @ARGV). For example:

    ./search.pl -p openfts -vq supernovae stars

    select
        txt.tid,
        rank( '{0.1, 0.2, 0.4, 1.0}', txt.fts_index, '\'supernova\' & \'star\'', 1 ) as pos
    from txt
    where txt.fts_index @@ '\'supernova\' & \'star\''
    order by pos desc

Notice that the query terms are passed through the dictionary
(Porter's stemmer): 'supernovae' becomes 'supernova' and 'stars'
becomes 'star'.

rank, written as relkov in some OpenFTS documentation, is a relevance
function based on the proximity between search terms; it is used for
ranking (order by) the results. The magic numbers can be set when the
fts object is created, see the documentation for Search::OpenFTS
(perldoc Search::OpenFTS). We use the defaults in our example.

Also, for testing purposes, you can invoke search.pl with the -e
option to get the explain output for the SQL command used for
searching (see above).

    $dbi->do("explain $sql" );

For example:

    ./search.pl -p openfts -qe supernovae stars

    NOTICE:  QUERY PLAN:

    Sort  (cost=4.83..4.83 rows=1 width=4)
      ->  Index Scan using gist_key on txt  (cost=0.00..4.82 rows=1 width=4)

For benchmarking, the 'search' method is used; \@ARGV is a reference
to an array of search terms.

    foreach ( 1..$opt{b} ) {
        my $a=$fts->search( \@ARGV );
        $count=$#{$a};
    }

Example:

    ./search.pl -p openfts -b 100 Uma 47

    Found documents:2
    908;328
    Speed gun in use :)...
    Found documents:2, total time (100 runs): 0.39 sec; average time: 0.004 sec

In real-life applications searching usually includes additional
constraints on metadata. The method get_sql returns SQL parts that can
be used to construct an SQL query.
For example:

    my ($out, $condition, $order) = $fts->get_sql( $query, rejected => \@stopwords );
    my $sql="
        select txt.tid, txt.path, $out
        from txt
        where $condition
        order by $order";

@stopwords contains the words recognized by dictionaries as stop words
or rejected by the S attribute. This is quite useful for returning
feedback to the user.

To get a real feeling for searching, invoke search.pl with the '-h'
option. In this mode the search uses an explicit SQL command and the
results are displayed as document fragments with the search terms
highlighted. Highlighting is done using termcap control sequences. You
may use HTML markup instead:

    my $headline=$fts->get_headline(query=>$query,
                                    src=>\*FH,
                                    maxread=>1024,
                                    maxlen=>100,
                                    otag=>'',ctag=>'' );

Here 'otag' and 'ctag' take the opening and closing markup to put
around each search term (for example, the opening and closing HTML
bold tags). Please note that 'maxread' is the maximum number of bytes
to read from 'src' and 'maxlen' is the length of the text fragment.

You're welcome to use your own procedure to generate text fragments.
The default method supplied by OpenFTS is currently not smart enough
to keep text fragments looking nice, i.e. without leading or trailing
punctuation marks. Play with the get_headline2 method, which should be
smarter in this respect. Read the primer for references.

Example:

    ./search.pl -p openfts -h 3 crab nebulae

    ------TID: 1589 WEIGHT:0.077 PATH:/u/megera/app/pgsql/fts/test-suite/apod/1165090
    Energy Crab Nebula Credit: NASA , UIT Explanation: This is the mess that is left when a star explodes. The Crab Nebula is so
    ------TID: 1121 WEIGHT:0.073 PATH:/u/megera/app/pgsql/fts/test-suite/apod/1163277
    . The Crab Nebula is so energetic that it glows in every kind of light known. Shown above are images of the Crab Nebula from
    ------TID: 667 WEIGHT:0.062 PATH:/u/megera/app/pgsql/fts/test-suite/apod/1162865
    M1: Filaments of the Crab Nebula Credit and Copyright: S. Kohle, T. Credner et al. ( AIUB ) Explanation : The Crab Nebula is

(Highlighting is lost here because of cut and paste from xterm.)

TID here is the document id as stored in the database, PATH is the
path to the document, and WEIGHT is the weight of the document in
terms of the relevance function.

USING PREFIXES:

OpenFTS can work with different collections stored in one database,
so separate database connections are not needed for separate
collections. Collections are specified using prefixes (currently,
characters of the English alphabet). You may play with collections
using the example scripts - just use DATABASE:PREFIX instead of
DATABASE. For example:

    ./init.pl openfts:a
    find /path/to/test-collection/apod -type f | ./index.pl openfts:a
    ./search.pl -p openfts:a supernovae stars

another collection

    ./init.pl openfts:x
    find /path/to/test-collection/xfiles -type f | ./index.pl openfts:x
    ./search.pl -p openfts:x spaceship biogenesis

The name of the table (template name) used for storing metadata and
the search index is specified at the init stage. It can be changed in
the init.pl script ( my $TABLE = 'txt'; ).

SEE ALSO:

    The OpenFTS Primer
    perldoc Search::OpenFTS::Search
    perldoc Search::OpenFTS::Index
    perldoc Search::OpenFTS::Parser
    perldoc Search::OpenFTS::Dict::PorterEng
    perldoc Search::OpenFTS::Dict::Snowball
    perldoc Search::OpenFTS::Dict::UnknownDict
    perldoc Search::OpenFTS::Morph::ISpell

TODO:

Simple crawler for indexing a personal web site. Volunteers are
welcome.
Done. See perldoc Search::OpenFTS::Crawler and the example scripts.

FINAL NOTES:

The test suite for OpenFTS is a starting point for novices and can be
used as a base for customization and for writing your own search
application. Consult the OpenFTS primer and the documentation of the
perl modules (use perldoc) for details.

The authors appreciate your ideas and comments about further
development of OpenFTS and its support. Please use the OpenFTS
discussion list
(http://lists.sourceforge.net/lists/listinfo/openfts-general).

--------------------------------------------------------------------
Sat Aug 2 23:08:10 MSD 2003
Comments to Oleg Bartunov