XPat Overview

History

Until 1998, most DLPS middleware was built using version 5.0 of the OpenText search engine, also known as pat50. The indexes pat50 builds are based on a structure called a Patricia tree, which indexes substrings of the entire text known as semi-infinite strings, or sistrings. For more information on these, see:
Information Retrieval: Data Structures and Algorithms
W. B. Frakes and R. S. Baeza-Yates
1992.
 
Sistrings, as you'll see them called in the literature, have some interesting properties:
  • They start at some offset in the entire string of the database
  • They stretch off to at least the end of the database (implying that they always overlap with each other...)
  • One can change (with the IndexPts section of the pat50 .dd file) where they start
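
To make the properties above concrete, here is a minimal Python sketch. It is not pat50's implementation: it uses a sorted list of offsets in place of a real Patricia tree, and the rule that sistrings start at the beginning of each word is only an assumed stand-in for whatever IndexPts configures. The function names are hypothetical.

```python
import re
from bisect import bisect_left

def index_points(text):
    """Offsets where sistrings begin: here, the start of each word (an assumed rule)."""
    return [m.start() for m in re.finditer(r"\w+", text)]

def build_index(text):
    """Order the index points by the sistring (text from the offset to the end) that starts there."""
    return sorted(index_points(text), key=lambda off: text[off:])

def search(text, index, query):
    """Return the offsets of every indexed sistring that begins with `query`."""
    # Truncating sorted sistrings to len(query) keeps them in sorted order,
    # so a binary search finds the first candidate.
    keys = [text[off:off + len(query)] for off in index]
    pos = bisect_left(keys, query)
    hits = []
    while pos < len(keys) and keys[pos] == query:
        hits.append(index[pos])
        pos += 1
    return sorted(hits)

text = "to be or not to be"
index = build_index(text)
phrase_hits = search(text, index, "to be")  # a phrase search is just a longer prefix
word_hits = search(text, index, "be")
partial_hits = search(text, index, "o")     # only "or" starts at an index point
```

Because every sistring runs to the end of the text, a multi-word phrase needs no special handling: it is matched as a prefix of the sistring that starts at its first word. Moving the index points changes what can be found, which is why the "o" inside "to" and "not" is not matched here.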

One of the most important features of pat50/XPat is its ability to index not only full text but also SGML regions. This makes it possible to build complex searches that reach into regions of text based on the markup elements.
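
The idea of a region-restricted search can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not XPat's region index or query syntax: regions are found with a regular expression at query time rather than indexed in advance, elements of the same name are assumed not to nest, and all function names are hypothetical.

```python
import re

def find_regions(text, tag):
    """Return (start, end) offsets of the content of each <tag>...</tag> element.

    Assumes elements with this name are not nested inside each other.
    """
    name = re.escape(tag)
    pattern = re.compile(r"<%s>(.*?)</%s>" % (name, name), re.S)
    return [(m.start(1), m.end(1)) for m in pattern.finditer(text)]

def matches(text, query):
    """Return the offset of every occurrence of `query` in the text."""
    return [m.start() for m in re.finditer(re.escape(query), text)]

def matches_within(text, query, tag):
    """Keep only the occurrences of `query` that fall entirely inside a `tag` region."""
    regions = find_regions(text, tag)
    return [off for off in matches(text, query)
            if any(start <= off and off + len(query) <= end
                   for start, end in regions)]

doc = "<TITLE>Whale songs</TITLE><P>The whale sang. Whale!</P>"
all_hits = matches(doc, "hale")
title_hits = matches_within(doc, "hale", "TITLE")
para_hits = matches_within(doc, "hale", "P")
```

Both a text match and a region reduce to offsets into the same string, so restricting a search to a markup element is a matter of comparing offsets.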

In 1998 and 1999 we began to use the next generation of the OpenText search engine, ot60, which builds indexes on tokens rather than sistrings. However, we much preferred the Pat tree structure and some of its features.

Happily, in 1999 DLPS was able to acquire the source code to pat50, which OpenText was no longer supporting (see pat50/DLPS recent developments). We can now offer a license to use our version of the original engine, now called XPat. We have already begun to make changes to it and will continue to do so. These enhancements include: