"OpenFTS Inside"
---------------

MOTIVATION:

  This document is a set of comments on the example scripts. If you have
  already worked through the examples and are looking for the internal
  details, you are in the right place. For a detailed description of the
  API, see the OpenFTS primer.


INITIALIZATION:

  There is no magic at this step. You have to create the tables and
  configure OpenFTS. This is what init.pl does.

  a) $dbi->do("create table txt ( tid int not null primary key, path varchar, fts_index tsvector );") || die;
     
     Creates the test table 'txt' with three fields: tid (document id),
     path (path to the document), and fts_index of type tsvector (see the
     tsearch2 documentation), which stores the unique lexemes of the document.

  b)

my $idx=Search::OpenFTS::Index->init( 
	dbi=>$dbi, 
	txttid=>'txt.tid',
	tsvector_field=>'fts_index',
	ignore_id_index=>[ qw( 7 13 14 12 23 ) ],
	ignore_headline=>[ qw(13 15 16 17 5) ],
	map=>'{ 
                \'4\'=>[1],  5=>[1], 6=>[1], 8=>[1], 18=>[1], 19=>[1], # unknown
              }',
        dict=>[
                'Search::OpenFTS::Dict::PorterEng',
# example how to use snowball stemmer
#                { mod=>'Search::OpenFTS::Dict::Snowball', param=>'{lang=>"english", stop_file=>"/u/megera/app/pgsql/fts/test-suite/Dict/english.stop"}' },
                'Search::OpenFTS::Dict::UnknownDict',
	] 
);

   Creates (instantiates) the fts object with a number of attributes, which
   are stored in the table fts_conf of your database. The table looks like:

---------------------------------------------------
 openfts=# \d fts_conf
                 Table "fts_conf"
 Column |       Type        |      Modifiers      
--------+-------------------+---------------------
 name   | character varying | not null
 did    | integer           | not null default -1
 mod    | character varying | not null
 param  | character varying | 
Primary key: fts_conf_pkey
---------------------------------------------------

   Now we have something to play with. First, you see a bunch of
   integers in the attributes 'ignore_id_index', 'ignore_headline' and 'map'.
   These numbers designate types of lexemes (see the OpenFTS primer),
   which should either be ignored (attributes 'ignore_id_index',
   'ignore_headline') or be processed by the corresponding dictionaries
   ('map'). For example,
   
   ignore_id_index=>[ qw( 7 13 14 12 23 ) ]

   means that numbers in scientific notation (7), HTML tags (13), the
   protocol part of a URL (strings like 'http://', 'ftp://') (14), special
   symbols (12), and HTML entities (23) should be ignored while indexing
   a document.
      
	map=>'{ 
                \'4\'=>[1],  5=>[1], 6=>[1], 8=>[1], 18=>[1], 19=>[1], # unknown
              }',

   means that lexemes of the specified types should be processed by the
   dictionaries whose numbers are listed in []. These numbers are positions
   in the 'dict' attribute (numbering of dictionaries starts from zero!).
   Lexeme types are defined by the parser, and the specific numbers
   are those used by the default OpenFTS parser, so if you write your own
   parser (why not?), keep in mind that it has to stay in sync with the
   'init' method.

   (See simple_parser.pl for an example of a simple parser in Perl, which
    reads from STDIN and recognizes space-delimited words of length >= 2.)
  
   NOTICE: Due to a bug in perl5 you should use the \'4\' notation for the
           first element in the 'map' attribute!

   The parser passes a lexeme to the dictionaries in the order specified in
   the 'dict' attribute until some dictionary recognizes it. In our example,
   the Search::OpenFTS::Dict::UnknownDict dictionary deals with unrecognized
   words. You may leave it out, in which case those lexemes will be ignored.

   The parser uses the mapping to pass a lexeme to a specific dictionary,
   which is not only an optimization but also helps when indexing
   mixed-language documents. Some dictionaries (stemmers, for example)
   recognize every lexeme, so the mapping lets us state explicitly which
   dictionary should be used for which type of lexeme. In our example,
   Porter's stemmer ( [0] ) handles English words and UnknownDict ( [1] )
   handles the lexeme types listed in 'map'.
   
   Some dictionaries require parameters, and the commented line
#                { mod=>'Search::OpenFTS::Dict::Snowball', param=>'{lang=>"english"}' },
   demonstrates how to define a dictionary in this case. Be sure to read
   the documentation for the full list of Snowball stemmer parameters.

   If you have several dictionaries, and especially if you index a
   multi-language collection, you may *explicitly* define the order
   in which dictionaries process a lexeme. For example, if you want to use
   our interface to ISpell dictionaries (which provides a sort of morphology)
   together with Porter's stemming algorithm, it is a sound idea to pass a
   lexeme to ISpell first and, if it is not recognized, to Porter's dictionary
   (NOTE: stemming dictionaries recognize any word by definition!).
   In this case, you may map Latin and Cyrillic words as:

       1=>[0,1], 11=>[0,1], 16=>[0,1],  # latin 
       2=>[2,3], 10=>[2,3], 17=>[2,3],      # cyrillic

   where the dictionaries are assigned as:

       0 - ISpell english
       1 - Porter english
       2 - ISpell russian
       3 - Snowball russian

   Read the OpenFTS primer for information about the dictionary API.
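
   The lookup order described above can be sketched in plain Perl. This is
   only an illustration and not the OpenFTS implementation: the two closures
   are hypothetical stand-ins for real dictionaries, 'normalize' is an
   invented helper name, and the fallback for unmapped types is an assumption
   based on the description of the 'dict' attribute.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for real dictionaries: each returns a normalized
# lexeme, or undef when it does not recognize the word.
my @dict = (
    sub { my %known = ( stars => 'star' ); $known{ $_[0] } },   # 0: lookup table
    sub { ( my $w = $_[0] ) =~ s/s$//; $w },                    # 1: naive stemmer, accepts anything
);

# Lexeme type => ordered list of dictionary indexes (same shape as 'map').
my %map = ( 1 => [ 0, 1 ] );

sub normalize {
    my ( $type, $lexeme ) = @_;
    # Assumption: types missing from the map try every dictionary in order.
    my $chain = $map{$type} || [ 0 .. $#dict ];
    for my $i (@$chain) {
        my $res = $dict[$i]->( lc $lexeme );
        return ( $res, $i ) if defined $res;
    }
    return;    # no dictionary recognized the lexeme, so it is ignored
}

my ( $norm, $by ) = normalize( 1, 'Stars' );
print "$norm (dictionary $by)\n";
( $norm, $by ) = normalize( 1, 'nebulas' );
print "$norm (dictionary $by)\n";
```

   The first word is resolved by dictionary 0; the second falls through to
   the stemmer, which is why a stemmer should come last in a chain.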

  c) $idx->create_index;
    
    Creates the index 'gist_key' on the field 'fts_index' of the table 'txt'.
    This index speeds up searching, but it can significantly slow down
    indexing. So, as a rule of thumb, for batch indexing of documents
    create the index only after indexing has finished. For online indexing
    you need to create the index at initialization. Our test example uses
    batch mode, but for the sake of clarity we leave index creation in init.pl.

INDEXING:

    The script reads filenames from STDIN and invokes the method $idx->index
    to index each document. 'tid' and 'path' are also stored so that the
    search script can refer to them later. There are actually two database
    operations:
   
    $sth->execute( $STID, $file )  - inserts 'tid' and 'path'
    and 
    $idx->index($STID, \*INFILE)   - indexes the document and inserts 'fts_index'
    
    That is why we need to explicitly invoke $dbi->commit if everything is ok
    and $dbi->rollback if something goes wrong. 

    Read the DBI and DBD::Pg documentation for details about transaction support.
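
    A minimal sketch of that commit/rollback pattern follows. It is not the
    code of index.pl. It assumes that $dbi is a DBI handle opened with
    AutoCommit switched off and that $sth is the prepared insert for 'tid'
    and 'path'. 'store_document' and the $do_index callback (which would
    wrap the $idx->index call) are hypothetical names used for illustration.

```perl
use strict;
use warnings;

# Store one document inside a single transaction: both database operations
# succeed together, or neither of them is kept.
sub store_document {
    my ( $dbi, $sth, $do_index, $tid, $file ) = @_;
    my $ok = eval {
        $sth->execute( $tid, $file ) or die "insert failed\n";
        $do_index->( $tid, $file )   or die "indexing failed\n";
        $dbi->commit;
        1;
    };
    unless ($ok) {
        warn "document $tid skipped: $@";
        $dbi->rollback;    # undo the insert of 'tid' and 'path' as well
    }
    return $ok ? 1 : 0;
}
```

    Wrapping both steps in one eval means a failure in indexing also discards
    the row inserted just before it.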

SEARCHING:

    The search script can be used for testing, benchmarking and searching.
    Invoke ./search.pl without parameters to see its syntax.

    $sql = $fts->_sql( \@ARGV );

    The method '_sql' returns the sql query for a given search query
    (a reference to @ARGV).
    For example: ./search.pl -p openfts -vq supernovae stars
    
select 
        txt.tid,
rank( '{0.1, 0.2, 0.4, 1.0}', txt.fts_index, '\'supernova\' & \'star\'', 1 ) as pos
from
        txt
where
        txt.fts_index @@ '\'supernova\' & \'star\''
order by pos desc

    Notice that the query terms are passed through the dictionary (Porter's
    stemmer): 'supernovae' becomes 'supernova' and 'stars' becomes 'star'.
    rank is a relevance function based on proximity between search terms
    and is used for ranking (order by) the results. The magic numbers can be
    defined when creating the fts object, see the documentation for
    Search::OpenFTS (perldoc Search::OpenFTS). We use the defaults in our
    example.

    Also, for testing purposes, you can invoke search.pl with the -e option
    to get the explain output for the sql command used for searching
    (see above).

    $dbi->do("explain $sql" );

    For example: ./search.pl -p openfts -qe supernovae stars

NOTICE:  QUERY PLAN:

Sort  (cost=4.83..4.83 rows=1 width=4)
  ->  Index Scan using gist_key on txt  (cost=0.00..4.82 rows=1 width=4)

    For benchmarking, use the 'search' method; \@ARGV is a reference to an
    array with the search terms.

            foreach ( 1..$opt{b} ) {
                my $a=$fts->search( \@ARGV );
                $count=scalar @{$a};   # number of documents found
            }
   
   Example: ./search.pl -p openfts -b 100 Uma 47

Found documents:2
908;328
Speed gun in use :)...
Found documents:2, total time (100 runs): 0.39 sec; average time: 0.004 sec
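
    The timing itself can be done with the core module Time::HiRes. The
    sketch below is an assumption about how such a loop might be timed and
    is not taken from search.pl. 'bench' is a hypothetical helper and the
    workload is a stand-in for the call to the 'search' method.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run $code $runs times; return (total seconds, average seconds per run).
sub bench {
    my ( $runs, $code ) = @_;
    my $t0 = [gettimeofday];
    $code->() for 1 .. $runs;
    my $total = tv_interval($t0);    # elapsed time since $t0
    return ( $total, $total / $runs );
}

# Stand-in workload; in a search script this would be the search call.
my ( $total, $avg ) = bench( 100, sub { my @hits = grep { $_ % 7 == 0 } 1 .. 1000 } );
printf "total time (100 runs): %.2f sec; average time: %.3f sec\n", $total, $avg;
```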

    In real-life applications, searching usually involves additional
    constraints on metadata. The method get_sql returns sql parts which can
    be used to construct an sql query. For example:
    
    my ($out, $condition, $order) = $fts->get_sql( $query, rejected => \@stopwords );
   
    my $sql="
select
    txt.tid,
    txt.path,
    $out
from
    txt
where
    $condition
order by $order";

    @stopwords contains the words recognized by dictionaries as stop words or
    rejected by the 'ignore_id_index' attribute. It is quite useful for giving
    feedback to the user.
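
    As an illustration of adding a metadata constraint, the three parts can
    be combined with an extra condition. This is a sketch only: 'build_query'
    is a hypothetical helper, the strings passed to it are placeholders for
    real get_sql output, and the constraint on txt.path is an invented
    example.

```perl
use strict;
use warnings;

# Combine the three parts returned by get_sql with an optional extra
# condition on metadata columns.
sub build_query {
    my ( $out, $condition, $order, $extra ) = @_;
    my $where = ( defined($extra) && length($extra) )
        ? "($condition) and ($extra)"
        : $condition;
    return "select txt.tid, txt.path, $out from txt where $where order by $order";
}

# Placeholder strings standing in for real get_sql output.
print build_query(
    q{rank(...) as pos},
    q{txt.fts_index @@ '...'},
    q{pos desc},
    q{txt.path like '%/apod/%'},    # hypothetical metadata constraint
), "\n";
```

    Parenthesizing both conditions keeps the full-text condition intact
    whatever operators the extra constraint contains.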

    To get a real feel for searching, invoke search.pl with the '-h' option.

    In this mode the search uses an explicit sql command, and results are
    displayed as document fragments with the search terms highlighted.
    Highlighting is done using termcap control sequences. You may use HTML
    markup instead:

                   my $headline=$fts->get_headline(query=>$query, src=>\*FH,
                                                   maxread=>1024, maxlen=>100,
                                                   otag=>"\e[1m",ctag=>"\e[0m" );
#                                otag=>'<b>',ctag=>'</b>' );

    Please note that 'maxread' is the maximum number of bytes to read from
    'src' and 'maxlen' is the length of the text fragment. You are welcome to
    use your own procedure to generate text fragments. The default method
    supplied by OpenFTS is currently not smart enough to keep text fragments
    looking nice, i.e. without leading or trailing punctuation marks. 
    Play with the get_headline2 method, which should be smarter in
    this respect. Read the primer for references.
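
    A custom procedure could start from something like the sketch below.
    It is not how OpenFTS builds headlines: 'tidy_headline' is a hypothetical
    helper that matches the literal terms it is given, whereas the real
    methods work with lexemes from the dictionaries. It only shows trimming
    stray punctuation and wrapping terms in the 'otag'/'ctag' pair.

```perl
use strict;
use warnings;

# Trim stray punctuation from both ends of a fragment and wrap every
# occurrence of the given terms in $otag/$ctag.
sub tidy_headline {
    my ( $text, $terms, $otag, $ctag ) = @_;
    $text =~ s/^[\s[:punct:]]+//;    # leading punctuation and spaces
    $text =~ s/[\s,;:]+$//;          # trailing separators and spaces
    my $alt = join '|', map { quotemeta $_ } @$terms;
    $text =~ s/\b($alt)\b/$otag$1$ctag/gi if length $alt;
    return $text;
}

print tidy_headline( '. The Crab Nebula is so energetic, ',
    [ 'crab', 'nebula' ], '<b>', '</b>' ), "\n";
# prints: The <b>Crab</b> <b>Nebula</b> is so energetic
```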
   
    Example: ./search.pl -p openfts -h 3 crab nebulae

------TID: 1589 WEIGHT:0.077    PATH:/u/megera/app/pgsql/fts/test-suite/apod/1165090
 Energy Crab Nebula Credit: NASA , UIT Explanation: This is the mess that is left when a star explodes. The Crab Nebula is so
------TID: 1121 WEIGHT:0.073    PATH:/u/megera/app/pgsql/fts/test-suite/apod/1163277
. The Crab Nebula is so energetic that it glows in every kind of light known. Shown above are images of the Crab Nebula from
------TID: 667  WEIGHT:0.062    PATH:/u/megera/app/pgsql/fts/test-suite/apod/1162865
M1: Filaments of the Crab Nebula Credit and Copyright: S. Kohle, T. Credner et al. ( AIUB ) Explanation : The Crab Nebula is

   (Highlighting is lost here because of cut'n'paste from xterm.)

    TID here is the document id as stored in the database, PATH is the path
    to the document, and WEIGHT is the weight of the document in terms of
    the relevance function.


USING PREFIXES:

    OpenFTS can work with different collections stored in one database.
    Storing collections in one database means that you do not need to
    establish separate connections to the database. Collections are specified
    using prefixes (currently, characters from the English alphabet).
    
    You may play with collections using the example scripts: just use
    DATABASE:PREFIX instead of DATABASE. For example:

      ./init.pl openfts:a
      find /path/to/test-collection/apod -type f | ./index.pl openfts:a
      ./search.pl -p openfts:a supernovae stars

      and for another collection:

      ./init.pl openfts:x
      find /path/to/test-collection/xfiles -type f | ./index.pl openfts:x
      ./search.pl -p openfts:x spaceship biogenesis

    The name of the table (template name) used for storing metadata and the
    search index is specified at the init stage. It can be changed in the
    init.pl script ( my $TABLE = 'txt'; ).
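
    A script that accepts DATABASE:PREFIX has to split its argument first.
    The sketch below shows one way to do that. 'parse_target' is a
    hypothetical helper and is not taken from the example scripts, and the
    restriction of the prefix to English letters follows the description
    above.

```perl
use strict;
use warnings;

# Split "DATABASE" or "DATABASE:PREFIX" into its parts.
# Returns the database name and the prefix (empty string when absent).
sub parse_target {
    my ($arg) = @_;
    my ( $db, $prefix ) = $arg =~ /^([^:]+)(?::([A-Za-z]+))?$/
        or die "expected DATABASE or DATABASE:PREFIX, got '$arg'\n";
    return ( $db, defined $prefix ? $prefix : '' );
}

my ( $db, $prefix ) = parse_target('openfts:a');
print "database=$db prefix=$prefix\n";
```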

SEE ALSO:    

    The OpenFTS Primer
    perldoc Search::OpenFTS::Search
    perldoc Search::OpenFTS::Index
    perldoc Search::OpenFTS::Parser
    perldoc Search::OpenFTS::Dict::PorterEng
    perldoc Search::OpenFTS::Dict::Snowball
    perldoc Search::OpenFTS::Dict::UnknownDict
    perldoc Search::OpenFTS::Morph::ISpell

TODO:

    Simple crawler for indexing personal web site.
    Volunteers are welcome.

    Done. See perldoc Search::OpenFTS::Crawler and 
    example scripts.

FINAL NOTES:
    
    The test suite for OpenFTS is a starting point for novices and can be
    used as a basis for customization and for writing your own search
    application. Consult the OpenFTS primer and the documentation of the
    perl modules (use perldoc) for details. The authors appreciate your
    ideas and comments about further development and support of OpenFTS;
    please use the OpenFTS discussion list
    (http://lists.sourceforge.net/lists/listinfo/openfts-general).

--------------------------------------------------------------------
Sat Aug  2 23:08:10 MSD 2003
Comments to Oleg Bartunov <oleg@sai.msu.su>