---------------------------------------------------------- USING TSEARCH AND POSTGRESQL FOR A WEB BASED SEARCH ENGINE ---------------------------------------------------------- This documentation is provided as a short guide on how to quickly get up and running with tsearch and PostgreSQL, for those who want to implement a full text indexed based search engine. It is not meant to be a complete in-depth guide into the full ins and outs of the contrib/tsearch module, and is primarily aimed at beginners who want to speed up searching of large text fields, or those migrating from other database systems such as MS-SQL. A much more detailed (and technical) description of the tsearch module can be found in the proper README.tsearch distributed with the tsearch module. -------- OVERVIEW -------- MS-SQL provides a full text indexing (FTI) system which enables the fast searching of text based fields, very useful for websites (and other applications) that require a results set based on key words. PostgreSQL ships with a contributed module called tsearch, which implements a special type of index that can also be used for full text indexing. Further more, unlike MS' offering which requires regular incremental rebuilds of the text indexes themselves, tsearch indexes are always up-to-date and keeping them so induces very little overhead. Before we get into the details, it is recommended that you have installed and tested PostgreSQL, are reasonably familiar with databases, the SQL query language and also understand the basics of connecting to PostgreSQL from the local shell. This document isn't intended for the complete PostgreSQL newbie, but anyone with a reasonable grasp of the basics should be able to follow it. ------------ INSTALLATION ------------ tsearch is very easy to install, just change the current directory to PGSQL_SRC/contrib/tsearch and type: gmake gmake install gmake installcheck That is pretty much all you have to do, unless of course you get errors. However if you get those, you better go check with the mailing lists over at http://www.postgresql.org since its never failed for me. Note that in the default installation, some common words such as "the" and "who" are not indexed at all. If you wish them to be, or want to add further words that should not be indexed, you will need to create a custom dictionary and recompile and reinstall tsearch. See the section on changing non-indexed words later in this document. ------------------------------------------- ADDING TSEARCH FUNCTIONAILITY TO A DATABASE ------------------------------------------- If you thought installation was easy, this next bit is even easier. Change to the PGSQL_SRC/contrib/tsearch directory and type: psql DATABASE < tsearch.sql This should populate the database specified with all the functions, types and operators required to use txtidx FTI indexes. -------------------------- INDEXING FIELDS IN A TABLE -------------------------- The next stage is to add a text index to an existing table. In this example we already have a table defined as follows: CREATE TABLE tblMessages { intIndex int4, strTopic varchar(100), strMessage text, }; The next stage is to create a special text index which we will use for FTI, so we can search our table of messages for words or a phrase. We do this using the SQL command: ALTER TABLE tblMessages ADD idxFTI txtidx; Note that unlike traditional indexes, this is actually a new field in the same table, which is then used (through the magic of the tsearch operators and functions) by a special index we will create in a moment. The next thing to do is to populate the new field with indexed data, which we do as follows: UPDATE tblTitles SET idxFTI=txt2txtidx(strMessage); Note that this only indexes the field strMessage, so if you want to also add strTopic to the same index, you should instead do the following, which effectively concates the two fields into one field before being fed into the indexing system: UPDATE tblTitles SET idxFTI=txt2txtidx(strTopic || ' ' || strMessage); Two final things you need to do, are create a special index on your newly created field, which you do as follows: CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI); And to set up a trigger so every time a row in this table is changed, the text index is automatically updated. This is easily done using: CREATE TRIGGER txtidxupdate BEFORE UPDATE OR INSERT ON tblMessages FOR EACH ROW EXECUTE PROCEDURE tsearch(idxFTI, strMessage); Or if you are indexing both strMessage and strTopic you should instead do: CREATE TRIGGER txtidxupdate BEFORE UPDATE OR INSERT ON tblMessages FOR EACH ROW EXECUTE PROCEDURE tsearch(idxFTI, strTopic, strMessage); Before you ask, the tsearch function accepts multiple fields as arguments so there is no need to concate the two into one like we did before. ---------------- QUERYING A TABLE ---------------- So you have your indexed data, how do you query it? Well the README.tsearch has some very detailed descriptions of everything you can do, but here are some examples to get you going. If you wanted to search the above example table for all entries which contain the word "minority", you could use the query: SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ## 'minority' Note that searching this way isn't case sensitive, so using "MiNoRiTy" in place of the above will return the same results. A good thing for those wanting to use these as search engines on websites! Now suppose you want to search for more than one word, say "minority" and "report", then you would use the following query instead: SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ## 'minority&report' Note that spaces are a bad thing in these sorts of queries, so you better get your regexps out and filter these away. Also note that the & has special meaning here too, so you can't search for anything with "&" in it either. The last example here relates to searching for a phrase, for example "minority report". This poses a problem with regard to tsearch, as it doesn't index phrases, only words. But there is a way around which doesn't appear to have a significant impact on query time, and that is to use a query such as the following: SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ## 'minority&report' AND strMessage ILIKE '%minority report%' Of course if your indexing both strTopic and strMessage, and want to search for this phrase on both, then you will have to get out the brackets and extend this query a little more. -------------------------- CHANGING NON-INDEXED WORDS -------------------------- Some words such as "and", "the", and "who" are automatically not indexed, since they belong to a pre-existing dictionary of Stop Words which tsearch does not perform indexing on. If someone needs to search for "The Who" in your database, they are going to have a tough time coming up with any results, since both are ignored in the indexes. But there is a solution, and that is to build your own dictionary file for tsearch. This is pretty easy to do, just change to the PGSQL_SRC/contrib/tsearch/makedict directory, and create yourself a text file with a list of words (one per line) that you wish to be ignored. Example files for English and Russian languages are at http://www.sai.msu.su/~megera/postgres/gist/tsearch/ if you need somewhere to start. When you have all the files you need, run the command: ./makedict.pl -l 'C' -s stopwords.txt > ../dict/mydict.dct You can also create a list of word endings in another text file, which will be stripped from any relevant words before being indexed. For example, if you have the word "ending" in your table, it will be indexed as "end" and the "ing" is stripped out. If you choose to have word endings included in your dictionary, then you can create your new dictionary file using the following command instead: ./makedict.pl -l 'C' -e endwords.txt -s stopwords.txt > ../dict/mydict.dct Note that there is no dictionary of words which do appear in the index, only a dictionary of words and word endings that will not. The next stage is to change back to the PGSQL_SRC/contrib/tsearch and edit the source header file dict.h, so it contains the new line: #include "dict/mydict.dct" immediately before the default dictionary, which should be commented out as shown here: //#include "dict/porter_english.dct" The whole dict.h file should now look something like this: #define TABLE_DICT_START ,{ #define TABLE_DICT_END } #include "dict/mydict.dct" //#include "dict/porter_english.dct" #ifdef USE_LOCALE #include "dict/russian_stemming.dct" #endif #undef TABLE_DICT_START #undef TABLE_DICT_END After you've done this, making sure you are in the PGSQL_SRC/contrib/tsearch directory, perform: gmake clean gmake gmake install At this point you will have to restart PostgreSQL, and rebuild any relevant indexes. The latter is done merely by using the same UPDATE statement you used to populate the index in the first place. ------------------------------------------------------- BACKING UP AND RESTORING DATABASES THAT FEATURE TSEARCH ------------------------------------------------------- Believe it or not, this isn't as straight forward as it should be, and you will have problems trying to backup and restore any database which uses tsearch unless you take the steps shown below. And before you ask using pg_dumpall will result in failure every time. These took a lot of trial and error to get working, but the process as laid down below has been used a dozen times now in live production environments so it should work fine. HOWEVER never rely on anyone elses instructions to backup and restore a database system, always develop and understand your own methodology, and test it numerous times before you need to do it for real. To Backup a PostgreSQL database that uses the tsearch module: 1) Backup any global database objects such as users and groups (this step is usually only necessary when you will be restoring to a virgin system) pg_dumpall -g > GLOBALobjects.sql 2) Backup the full database schema using pg_dump pg_dump -s DATABASE > DATABASEschema.sql 3) Backup the full database using pg_dump pg_dump -Fc DATABASE > DATABASEdata.tar To Restore a PostgreSQL database that uses the tsearch module: 1) Create the blank database createdb DATABASE 2) Restore any global database objects such as users and groups (this step is usually only necessary when you will be restoring to a virgin system) psql DATABASE < GLOBALobjects.sql 3) Create the tsearch objects, functions and operators psql DATABASE < tsearch.sql 4) Edit the backed up database schema and delete all SQL commands which create tsearch related functions, operators and data types, BUT NOT fields in table definitions that specify txtidx types. If your not sure what these are, they are the ones listed in tsearch.sql. Then restore the edited schema to the database psql DATABASE < DATABASEschema.sql 5) Restore the data for the database pg_restore -N -a -d DATABASE DATABASEdata.tar If you get any errors in step 4, it will most likely be because you forgot to remove an object that was created in tsearch.sql. Any errors in step 5 will mean the database schema was probably restored wrongly. ---------------- ACKNOWLEDGEMENTS ---------------- Thanks to Oleg Bartunov for taking the time to answer many of my questions regarding this module, and also to Teodor Sigaev for clearing up the process of making your own dictionaries. Plus of course a big thanks to the pair of them for writing this module in the first place!