Last updated	2002-05-17 21:50:28 EDT
Doc Title	SGML and XML Indexing Support
Author 1	Wilkin, John Price
CVS Revision	$Revision: 1.8 $

2 SGML and XML Indexing Support

2.1 INTRODUCTION

[Editor's note: This text is in the process of being adapted from the original Open Text manual, chapter 13 in the DBA section. References to sections with a "13" prefix are internal to this document. The original document has a heavy emphasis on MFS index building, which has not yet been corrected, and on "dbbuild", which DLXS does not support or recommend. This text was drawn from OCR, and so many errors exist, and figures are typically no longer meaningful.]

This chapter is a summary of SGML concepts and index-building techniques and thus assumes that the reader has some familiarity with the concepts of tagged text (see Chapters 1 and 4) and SGML. Further information on SGML can be found in the references listed in Section 13.6.

2.2 A BRIEF SUMMARY OF SGML

In order to maximize DLXS XPAT's SGML functionality, it is necessary to be aware of SGML and its capabilities with respect to your documents. This section will introduce some of the terminology that is used in the rest of the documentation. The characteristics and function of SGML tags will be described, along with the configuration files required by the DLXS XPAT software to utilize SGML functionality.

Note: SGML is defined by International Organization for Standardization (ISO) Standard 8879. It is explained in great detail in a number of books; see Section 13.6 for references. The following subsections provide enough of an introduction to get started.

2.2.1 SGML and structure elements

Standard Generalized Markup Language (SGML) is a system that allows you to "mark up" text with special tags. These tags specify the structure of the document you are working with. For instance, if you were writing a book, you would use special tags to say "this block of text is a paragraph", or "this block of text is a chapter title". These tags can be combined: "this block of text, made up of a chapter title and one or more paragraphs, is a chapter". Saying that "this block of text is a paragraph" means that the 'paragraph' is a structure element in your document. And saying that "this block of text, made up of a 'chapter title' and one or more 'paragraphs', is a 'chapter'" shows you that structure elements can be made up of combinations of other structure elements.

2.2.2 SGML tags

In order to tell programs such as xpat that, for instance, "this block of text is a paragraph", you have to surround that block of text with tags. These tags usually exist in pairs: start-tags and end-tags. For the 'paragraph' example, the start-tag is a tag name enclosed in angle brackets (the actual tag name is arbitrary, provided you use it consistently). This tells xpat that the text following the start-tag is to be considered part of a 'paragraph' region. To end the 'paragraph', you would use the corresponding end-tag.

The start-tag and end-tag are almost identical, except that the end-tag's name is preceded by a slash ('/') character. The start and end tag pairs allow both you and the program to quickly and easily find the structure elements in an SGML document.
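The pairing of start-tags and end-tags can be illustrated with a minimal Python sketch. It is not how xpat or sgmlrgn work internally; it assumes that every element has both tags present (no tag omission), and the tag names CHAPTER, TITLE, and PARA are hypothetical names chosen for illustration.

```python
import re

# Matches a start-tag such as <PARA> or an end-tag such as </PARA>.
# Anything after the tag name (for example attributes) is ignored here.
TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def find_regions(text):
    """Return (name, start, end) for each matched tag pair.

    start and end are character offsets of the text between the
    start-tag and the end-tag, i.e. the content of the element.
    """
    open_tags = []  # stack of (name, offset where the content begins)
    regions = []
    for match in TAG_PATTERN.finditer(text):
        is_end_tag = match.group(1) == "/"
        name = match.group(2).upper()
        if not is_end_tag:
            open_tags.append((name, match.end()))
        elif open_tags and open_tags[-1][0] == name:
            _, content_start = open_tags.pop()
            regions.append((name, content_start, match.start()))
        else:
            raise ValueError(f"unexpected end-tag </{name}> at offset {match.start()}")
    if open_tags:
        raise ValueError(f"missing end-tag for <{open_tags[-1][0]}>")
    return regions


# Hypothetical sample text; the tag names are illustrative only.
sample = "<CHAPTER><TITLE>Tags</TITLE><PARA>First.</PARA><PARA>Second.</PARA></CHAPTER>"
for name, start, end in find_regions(sample):
    print(name, repr(sample[start:end]))
```

Regions are reported in the order in which their end-tags are seen, so the enclosing CHAPTER element is listed last.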

2.2.3 SGML document type

You can combine structure elements, such as paragraphs and headlines, together until you get a single element. This element is called the document type. For instance, if you were writing a book, your document type could be BOOK. For a newspaper, you could have a document type called NEWPAPER. (Document type names are, by default, limited to 8 characters, which is why the example uses NEWPAPER rather than NEWSPAPER. That limit can be changed by modifying the SGML declaration.)

2.3 THE SGML DOCUMENT ENVIRONMENT

2.3.1 SGML Document

An SGML document is represented as a sequence of characters organized physically into an entity structure and logically into an element structure. The first entity of an SGML document that is parsed must be the "SGML document entity", which contains the document type definition and other information that governs the parse.

The SGML declaration is a section that defines all concrete syntax, optional features and capacity requirements that affect the parsing of the Document Type Definition (DTD) and the document instance. This SGML declaration can usually be skipped and the system default declaration will be assumed.

After the SGML declaration is a DTD that defines the structure of the document in terms of the elements it contains. Within the DTD, each type of element found in the document is given a name (generic identifier) by which it can be recognized. When placed within special markup delimiter characters, these generic identifiers form the tags that are used to identify the start and the end of each element.

To allow large documents to be generated efficiently, SGML documents can be built up from a series of sub-documents or SGML text 'entities'. Non-SGML data (such as graphics or spreadsheets) can also be referenced through non-SGML data entities. Physically, a typical SGML document entity will look like this:

The SGML declaration is at the top and is followed by the document type declaration. The document instance, which contains the actual data, comes after the document type declaration. However, the SGML declaration "<!SGML ...>" is usually omitted because most applications can rely on the default (see the sgmlrgn man pages in the DBA Reference Guide). The DTD usually exists in a separate file. Therefore, a more common arrangement is depicted as follows:

<!DOCTYPE NEWPAPER SYSTEM "news.dtd">    Document type declaration

<NEWPAPER>    Document instance

</NEWPAPER>

The "<!DOCTYPE ...>" document type declaration defines the location of the DTD information. In the previous example, the DTD is located in the file "news.dtd". The document instance follows the declaration. However, since the location of the DTD may change, a more convenient setup is to move the "<!DOCTYPE ...>" statement into a separate file (suffix '.inp'). This is shown below:

The declaration is located in the 'filename.inp' input file. The document instance is kept separately in 'filename.sgm' to achieve maximum portability of the document instance. When using sgmlrgn, the two files can be concatenated by listing them in sequence on the sgmlrgn command line:

sgmlrgn [...various options...] filename.inp filename.sgm

In this case, the contents of 'filename.inp' will be considered before the contents of 'filename.sgm'. Thus, greater flexibility is achieved by separating the document type declaration from the document. The pure data file 'filename.sgm' can be ported elsewhere and used with a different DTD and INP.

2.3.2 SGML Document Example

To take advantage of the information contained in your SGML tagged document, you first need to tell programs such as sgmlrgn that you have a base document type and then say how that document type is constructed from the different structure elements. In other words, the allowable fields (elements) have to be defined, and legal nestings must be unambiguously declared. This requires two files: (1) the '.inp' file, and (2) the '.dtd' file. The '.inp' file declares the document type and tells the program where to find the '.dtd' file. The '.dtd' (Document Type Definition) file defines the document structure elements (fields) and their allowable nestings. This section will provide examples of both files, as well as an example of SGML tagged text, all of which will be based on the example of a newspaper.

2.3.2.1 The Document Type Declaration (.inp) file

In this section, we will describe the function of the Document Type Declaration file, describe its syntax, and give an example of it. The Document Type Declaration ('.inp') file declares the name of the base document type for your SGML tagged document and identifies the file that holds the formal definition of the document structure (i.e., the DTD). The Document Type Declaration file usually has only one line, which has the following syntax:

<!DOCTYPE doctype SYSTEM "filename.dtd">

where 'doctype' is the name of the base document type, and 'filename.dtd' is the name of the file that contains the formal definition of the document structure (called the Document Type Definition or DTD). The 'filename.dtd' file will be assumed to be in the current directory or in the directory specified by the environment variable SGMLREGION_PATH (Unix only).

Documents can be of two types: SYSTEM or PUBLIC. PUBLIC documents are those known beyond the native system, whereas SYSTEM documents are those that are specific to the system on which they are prepared. Our example is not going to refer to any other documents, so we can specify a SYSTEM entry.

For our newspaper example, we will call the doctype NEWPAPER. The 'filename.dtd' file will be called 'newpaper.dtd'. Our Document Type Declaration file, called 'newpaper.inp', now looks like this:

<!DOCTYPE NEWPAPER SYSTEM "newpaper.dtd">

2.3.2.2 The Document Type Definition (.dtd) file

This section gives an overview of the function and syntax of the Document Type Definition ('.dtd') file, and gives an example of its use. Refer to Section 13.6 for further references on SGML and DTDs.

The Document Type Definition (DTD) formally defines the structure of an SGML document, as well as the relationships between the different structure elements. It describes how simple structure elements, made up of characters, can be combined to form more complex structure elements, including the base document type. For the newspaper example, we can assume a newspaper is made up of Stories, Illustrations, and Ads. Stories may be made up of Paragraphs, a Date, a Byline, an Author, and other pieces. Paragraphs may be made up only of text (character data), not other structure elements.

To construct our DTD, we will start with the topmost element of an SGML document: the base document type itself. We illustrate with the newspaper example.

<!ELEMENT NEWPAPER O O (STORY)*>

The element NEWPAPER is made up of zero or more STORY elements (the '*' means zero or more). The first capital letter 'O' means that the start-tag for this structure element can be omitted from the actual SGML document. The second capital letter 'O' means that the end-tag for this structure element can be omitted.

The next step is to define what a STORY element looks like:

<!ELEMENT STORY - - (TEXT | ILLUST)*>

For our newspaper example, a story is made up of zero or more TEXT or ILLUSTration elements. The vertical bar ('|') means "or". The dashes ('-') mean that the start and end tags must be present in the text in order for the structure element to be recognized as a story. We can also say that the STORY has certain attributes, such as Status, Publisher, Date and Page. Each of these attributes can have a value associated with it. So, for our newspaper example, we add the following:

These "entries" tell us that the structure element STORY has an ATTribute LIST that includes STATUS, PUBlisher, DATE, and PAGE data regions. The STATUS attribute can take one of three values (Draft, Prepare, or Ready), but defaults to Draft. The PUBlisher attribute is made up of characters (CDATA) and defaults to 'Local Newspaper'. The DATE attribute is a number. Its #IMPLIED value tells the system that there is no default for the attribute, and that the system should imply a value if none is given. The PAGE attribute is simply character data and defaults to a blank. The TEXT and ILLUSTration elements are, for the purposes of our newspaper example, made up of characters. To be recognized, they will require both start and end tags. To enter these definitions into the DTD, we write:

<!ELEMENT TEXT - - CDATA>

<!ELEMENT ILLUST - - CDATA>
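The attribute defaulting described above for STORY can be sketched in Python. This is only a model of the behaviour stated in the text, not part of any DLXS tool. The attribute name PUB is an assumption read from the capitalisation "PUBlisher", the DATE value is hypothetical, and the sketch does not check that DATE is numeric.

```python
# Attribute declarations for STORY as described in the text.
# Each entry is (allowed values, default). Allowed values of None
# mean free-form character data; a default of None stands for
# #IMPLIED, where the DTD supplies no default.
STORY_ATTRIBUTES = {
    "STATUS": (("DRAFT", "PREPARE", "READY"), "DRAFT"),
    "PUB": (None, "Local Newspaper"),
    "DATE": (None, None),
    "PAGE": (None, ""),
}


def apply_defaults(declared, given):
    """Return the effective attribute values for one start-tag."""
    unknown = set(given) - set(declared)
    if unknown:
        raise ValueError(f"undeclared attributes: {sorted(unknown)}")
    effective = {}
    for name, (allowed, default) in declared.items():
        value = given.get(name, default)
        if value is None:
            continue  # #IMPLIED and not specified: left to the application
        if allowed is not None and value.upper() not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {allowed}")
        effective[name] = value
    return effective


# A STORY start-tag that specifies only DATE picks up the other defaults.
print(apply_defaults(STORY_ATTRIBUTES, {"DATE": "19920517"}))
```

A start-tag that specifies no attributes at all yields STATUS, PUB, and PAGE with their defaults and no DATE entry.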

By looking at all the entries together, it is easy to see that our base document type, NEWPAPER, is made up of zero or more STORY elements. A STORY is made up of zero or more pieces of TEXT or ILLUSTrations, each of which is in turn made up of characters (CDATA). Each STORY also has a STATUS, PUBlisher, DATE, and PAGE associated with it. The following puts all the information together. The tags that start with '<!--' and end with '-->' are regarded as comments.

2.3.2.3 The SGML document (.sgm) file

The SGML document ('.sgm') file contains the actual SGML tagged text. The DTD is used to interpret the text and its various elements (fields). Notice that the elements defined in the DTD are called tags when surrounded by angle brackets (<>). Also notice that the same line that appeared in the Document Type Declaration ('.inp') file is also the first line of the SGML ('.sgm') document file. The attributes of the STORY element and their associated values are all contained within the STORY tag as well. The following is some sample text that, in this example, would be stored in the SGML document file called 'newpaper.sgm':

2.4 SGML PROCESSING

The SGML processing model closely resembles the traditional model of processing computer programs written in a programming language. Most processing systems for programming languages (e.g., compilers) have the same structure. This structure is depicted below:

Figure 13-1: Programming Language Document Processing Structure (Compiler)

[Figure: input program -> parser -> parse tree -> semantic processing -> output]

The task of the "parser" is to check whether the input is syntactically correct and to build the parse tree. After the parser has done its task, the other part of the system will perform the semantic processing. An SGML processing system has the same general structure as a compiler:

Figure 13-2: SGML Document Processing Structure

[Figure: input SGML document -> SGML parser -> validated document -> SGML application -> output]

An SGML parser, as defined in the SGML Standard, has the same structure as a parser for programming languages. The parser only checks the conformance of the SGML document to its DTD and performs no further semantic processing. The output of most SGML parsers includes a normalized document, which is the document for which all start-tags and end-tags have been fully expanded. At this stage, the document is said to conform to the corresponding DTD. The internal structure of this complete document corresponds to the parse tree in systems for programming languages.

As with programs, the complete document is not the end-stage in processing a document. It merely serves as an intermediate product, in which the correctness of the document has been assessed. Subsequently, the document has to be further processed. This is labeled SGML Application in Figure 13-2. The SGML application may generate code for various output formats. More specifically, sgmlrgn relies on a common, public-domain SGML parser called "sgmls". The various SGML applications are combined into a unified interface as sgmlrgn. Each mode serves as a unique application that generates support information for DLXS XPAT products. For example, the 'region' mode generates the region indices which can be used by xpat. The relationship and application for various SGML supports are illustrated in Figure 13-3:

Figure 13-3: SGML Processing support for 'sgmlrgn', various modes

Each processing mode takes an SGML document as input and uses the SGML parser to produce an output format. The 'check' mode should be used before any other processing mode, since a validated SGML document is vital to the other processing tools. The most commonly used modes will be 'check' and 'region'. This combination validates the SGML document and produces the regions for the Pat search engine. If PatMotif50 and LectorMotif50 are selected as the user's search interface and viewer, the 'filter' mode can be used to support communication between the two programs. The 'spec' mode is used to generate a simple "Lector specification file" for displaying tagged text in LectorMotif50. The 'root' mode is used to generate the topmost level element in the DTD so that it can be included in the "Pat initialization file".

sgmlrgn Processing Modes

The sgmlrgn program has several different SGML application modes. Many of these will be used throughout this section, so we will describe these modes here. The desired mode is specified as part of the '-m' option of the sgmlrgn command line:

Commonly used options are described in the following table in order of importance:

Mode	Function
-m check	This mode validates the SGML document itself with respect to the DTD file. Any syntax or other errors will be reported by sgmlrgn.
-m region -D datadictname.dd	This mode generates all the regions in the file and updates the region information in the DD file. The '-D' option must be included to specify which DD file is to be updated. The name of the region file created uses a '.rgn' extension and the same prefix as that of the text file.
-m root	This mode determines and prints out the root element (also referred to as the base document type for our purposes) of the SGML document.
-m filter	This mode gets sgmlrgn to parse the DTD and wait for standard input. It is suitable for use as a filter between PatMotif50 and LectorMotif50. This configuration is described further below.
-m spec	This mode will generate a simple LectorMotif50 specification ('.spc') file, where all elements are recorded. See the LectorMotif50(1) man pages or the LectorMotif50 section in the DLXS XPATQuery Configuration Guide for a more complete discussion of the specification file.

Refer to the sgmlrgn(5) man page for a more complete discussion of these options. It is important to note why the sgmlrgn program is used here, instead of the xpatrgn or multirgn programs. The xpatrgn program is designed to be used when the document has arbitrary patterns to denote regions. The multirgn program is designed to be used when the document has SGML style tags (i.e., tags surrounded by angle brackets, <>), but no DTD. The sgmlrgn program is designed to be used with fully validated SGML documents that have associated DTDs. This can be summarized with the following table.

Region Builder	Appropriate Use
xpatrgn	text has arbitrary patterns
multirgn	text has SGML type tags but no DTD
sgmlrgn	text has fully validated SGML tags and a DTD
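The choice summarized in the table can be written as a small decision function. This is an illustrative Python sketch only; the function name and its two boolean parameters are hypothetical and are not part of any DLXS tool.

```python
def choose_region_builder(has_sgml_style_tags, has_dtd):
    """Return the name of the region builder suggested by the table."""
    if has_sgml_style_tags and has_dtd:
        return "sgmlrgn"   # fully validated SGML with a DTD
    if has_sgml_style_tags:
        return "multirgn"  # SGML style tags but no DTD
    return "xpatrgn"       # regions denoted by arbitrary patterns


print(choose_region_builder(has_sgml_style_tags=True, has_dtd=True))
```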

Throughout this section, reference will be made to the files 'newpaper.inp', 'newpaper.dtd', and 'newpaper.sgm'. These are the names of the example files described in Section 13.2.

2.4.1 Checking the SGML document correctness

In order to ensure that there are no syntax or other errors in the DTD (contained in the '.dtd' file) or the SGML tagged document (in the '.sgm' file) itself, you should run a test over them. The sgmlrgn program provides an easy way to perform this verification: the check mode.

sgmlrgn -v -m check newpaper.sgm

The '-v' option makes the output verbose, instructing sgmlrgn to describe its operations as it proceeds. The '-m' option selects the mode (see the discussion above and the sgmlrgn(1) man page). Substitute your SGML file for the 'newpaper.sgm' file given here. Please note that all fatal errors must be resolved before other processing modes will be able to use the document. If the verification is successful, messages similar to the following will be returned:

check mode ...

checking total size(125K) time (2s)

Once this simple step has completed, we know that our SGML document and DTD are correct and fully validated. We can now move on to building the regions file.

2.4.2 Building the SGML regions

One of the benefits associated with using SGML documents is that we can define regions of text. A region is the text that exists between the start and end tags of an SGML document structure element. So, for instance, the text between the start-tag and end-tag of a paragraph element will be referred to as a region. When lists of regions are placed into a regions file, xpat can restrict searches to "paragraphs" (for example), or to some other defined region. However, the DD file must exist before you try to build the regions file, as the region builder will update the DD file with the new region information. The SGML region builder is invoked with the following command:

sgmlrgn -v -m region -D newpaper.dd newpaper.sgm

You can substitute your DD name for the 'newpaper.dd' file and your SGML document for the 'newpaper.sgm' file. The region information is derived from the actual SGML document, and the results are placed in the region file. In the above example, sgmlrgn would create a region index file called 'newpaper.rgn'. If the region building operation is successful, then messages similar to the following will be displayed (the number of regions, their sizes, and the time shown are for the example given):

In addition to these messages, a message similar to the following will be given for each different type of region that was built:

built (newpaper.rgn) region (NEWPAPER count= 2)

The sgmlrgn program will clean up the DD file to accommodate the new region information. If the region name previously existed in the DD file, that region definition will be replaced by the newly constructed information. If a region name is no longer a reference to any file, this region definition will be removed.

If for some reason a region is not needed or not wanted for xpat or PatMotif50, it can be manually deleted from the DD after the region building process is completed. For instance, if ILLUSTrations are not needed, the following segment can be deleted from the DD:

<Name> ILLUST</Name>

</Region>

Once that segment is deleted from the DD, the ILLUST region is invisible to xpat or PatMotif50.
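The check-then-build sequence from Sections 2.4.1 and 2.4.2 can be sketched as a short Python script that assembles the two command lines. It uses only the options shown in this chapter ('-v', '-m check', '-m region', '-D') and only prints the commands; the helper function names are hypothetical.

```python
import shlex


def check_command(sgml_file):
    """Command line for validating a document against its DTD."""
    return ["sgmlrgn", "-v", "-m", "check", sgml_file]


def region_command(dd_file, sgml_file):
    """Command line for building regions and updating the DD file."""
    return ["sgmlrgn", "-v", "-m", "region", "-D", dd_file, sgml_file]


def build_plan(dd_file, sgml_file):
    """Return the two commands in the order they should be run."""
    return [check_command(sgml_file), region_command(dd_file, sgml_file)]


for argv in build_plan("newpaper.dd", "newpaper.sgm"):
    print(shlex.join(argv))
```

Each list could be passed to subprocess.run to execute it. Whether sgmlrgn signals failure through its exit status is not stated in this chapter, so that should be confirmed before a script relies on it to stop after a failed check.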

2.5 ADVANCED TOPICS IN SGML PROCESSING

2.5.1 Regions Built by sgmlrgn

The region building mode for sgmlrgn will construct region indices that can be used by the Pat search engine. For every unique element occurring inside the SGML document, sgmlrgn will assign a unique region index for it. For example, assuming the following SGML document has been validated with 'check' mode, the document instance is:

It is important to note that three types of region indices will be built on each region tag. Type 1 indices are built on the contents of the data regions marked by <tag_body> and </tag_body>. Type 2 indices are built on just the contents of the start tags (i.e., <tag_body>). Type 3 indices are built on attributes within a tag_body.

For instance, a Type 1 index would be built on the contents of the inventor data region and would be called INVENTOR (SGML region indices are always named with uppercase letters). A Type 2 index would be built on the start tag <uspatapp ...> and would be called USPATAPP-T (Type 2 indices are always suffixed by "-T"). Type 3 indices would be built on the attributes of uspatapp. For example, the Type 3 index built on the patnum attribute would be called A-PATNUM (Type 3 indices are always prefixed by "A-"). (The figure below illustrates further.) These three different types of SGML indices allow users to restrict queries to very specific sets of regions (e.g., those with specific combinations of attribute values).

Figure 13-4: Scope of SGML indices on regions, tag bodies, and tag attributes

[Figure: nested scopes, from innermost to outermost: the A-PATNUM, A-IMGAVLDATE, and A-APPNUM attribute indices; the USPATAPP-T start-tag index; the USPATAPP region index]
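The naming rules for the three index types can be sketched in Python. The sketch only derives index names from one start-tag; it does not build indices, and it is not sgmlrgn's implementation. The attribute values in the sample tag are hypothetical placeholders.

```python
import re

# An attribute name is a name followed by '='.
ATTRIBUTE_PATTERN = re.compile(r"([A-Za-z][A-Za-z0-9]*)\s*=")


def index_names(start_tag):
    """List the region index names implied by one start-tag."""
    inner = start_tag.strip()[1:-1]  # drop the angle brackets
    element, _, attribute_text = inner.partition(" ")
    element = element.upper()  # index names are always uppercase
    names = [element, element + "-T"]  # Type 1, then Type 2
    for attribute in ATTRIBUTE_PATTERN.findall(attribute_text):
        names.append("A-" + attribute.upper())  # Type 3
    return names


# Hypothetical attribute values, for illustration only.
print(index_names('<uspatapp patnum="0000000" imgavldate="19920101" appnum="000000">'))
```

For the sample tag this lists USPATAPP, USPATAPP-T, A-PATNUM, A-IMGAVLDATE, and A-APPNUM, matching the names used in Figure 13-4.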

2.5.2 External Entity Management

An external entity resides in one or more files. A system identifier is interpreted as a list of filenames separated by colons. If no system identifier is supplied, then the entity manager will attempt to generate a filename using the public identifier. The system filename associated with a public identifier is found by a table lookup. The table is named "sgmlentity.map". The sgmlentity.map file has two white-space delimited fields per document type. The first field is the system filename. The second field is the PUBLIC ID. The following are sample entries for document types in the sgmlentity.map file:

sgmlrgn searches the following locations, in order of precedence, to resolve a PUBLIC ID:

1. the sgmlentity.map file in the local directory.

2. the sgmlentity.map file pointed to by the SGMLREGION_PATH environment variable.

3. the system filename in the local directory.

An External Entity Mapping Example

The examples that follow rely on three files: the document type definition, the document instance, and the input file. The following is a sample document type definition called 'example.dtd'.

The following is an example SGML document instance called 'example.sgm':

<!DOCTYPE doc SYSTEM "example.dtd">

<doc><intro> Introduction

<body>Paragraph 1

Paragraph 2

<concl>Conclusion

The PUBLIC entity "ISO 8879-1986//ENTITIES Added Latin 1//EN" will produce a table lookup in the entity map file 'sgmlentity.map' (or in '$(SGMLREGION_PATH)/sgmlentity.map'). The following is an entry in the sgmlentity.map file:

/usr/app/isolat1.gml "ISO 8879-1986//ENTITIES Added Latin 1//EN"

This particular ISOlat1 public entity will be mapped to the system id "/usr/app/isolat1.gml".
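The table lookup can be sketched in Python. This is a minimal model under stated assumptions, not the entity manager's actual code: it treats the map as plain text with a filename followed by a quoted PUBLIC ID on each line, it takes the map contents as strings already read in precedence order, and it does not model the final fallback to a system filename in the local directory.

```python
def parse_entity_map(text):
    """Map each PUBLIC identifier to its system filename."""
    mapping = {}
    for line in text.splitlines():
        # First field: system filename. Second field: PUBLIC ID,
        # which is quoted because it contains spaces.
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or incomplete lines
        filename, public_id = parts
        mapping[public_id.strip().strip('"')] = filename
    return mapping


def resolve_public_id(public_id, map_texts):
    """Search the maps in precedence order; return None if not found."""
    for text in map_texts:
        filename = parse_entity_map(text).get(public_id)
        if filename is not None:
            return filename
    return None


# Contents of the local map (empty here) and of the map found through
# SGMLREGION_PATH, in that order of precedence.
local_map = ""
path_map = '/usr/app/isolat1.gml "ISO 8879-1986//ENTITIES Added Latin 1//EN"\n'

print(resolve_public_id("ISO 8879-1986//ENTITIES Added Latin 1//EN", [local_map, path_map]))
```

Because the local map is consulted first, an entry there would take precedence over an entry for the same PUBLIC ID in the second map.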

2.5.3 Building SGML Regions in MFS Databases

2.5.3.1 SGML Data in MFS Database

In MFS databases, the MFS system creates a "virtual text" from the text of all the files in the database. The portion of this virtual text that corresponds to each file consists of three pieces: the Meta-Header section, the Data section and the Meta-Trailer section. This breakdown is illustrated in the following diagram:

The data in the Meta-Header and Meta-Trailer sections is highly structured and is uniform across all the files in the MFS database. In contrast, the data in the Data sections may be untagged text, tagged text without a DTD or tagged text with a DTD (SGML data).

The process of building region indices on such databases involves three steps. The first step involves running mfsmeta over the database to build a meta_structure_file. This file contains information about the positions of the Meta-Header, Data and Meta-Trailer sections for each file in the database. The second step involves building regions on the fields in the Meta-Header and Meta-Trailer sections that are common to all files. Refer to the multirgn(1) man page for further details. The third step involves building regions for the Data sections. For the Data sections that contain tagged text without a DTD, this task is accomplished using multirgn. For SGML Data sections (that do have a DTD), this task is accomplished using sgmlrgn.

There are three types of SGML MFS databases. The first type consists of a group of SGML files that all conform to the same DTD and where each file is a complete document. The second type consists of a group of SGML files that conform to several different DTDs, but where each file is still a complete document. The third type consists of a group of SGML files that conform to one or more DTDs and where the files may contain either complete documents or pieces of documents (i.e., the text for specific elements in the DTD). Each of the next three sections discusses how to build regions for one of the above database types.

2.5.3.2 Building Regions for Type 1 SGML Databases

The first step in building region indices for Type 1 SGML databases involves setting up the FilterChain section of the Data Dictionary, which specifies the SGML files to be included. In particular, the DisplayFmt field should be set to the value 'sgml'. For example, the following FilterChain section might be appropriate for a Type 1 SGML database.

Once the FilterChain sections have been set up, the following command can be used to build the SGML regions (usually done separately after dbbuild or individual index-builders have been run):

sgmlrgn -v -m region -M data.str -D data.dd data.inp data.dd

For this example, assume that the meta_structure_file generated by mfsmeta is called 'data.str' and that 'data.inp' contains the <!DOCTYPE ...> declaration for the SGML files in the database. The sgmlrgn program will then use 'data.str' to identify all the 'sgml' format files and will build SGML regions on them.

2.5.3.3 Building Regions for Type 2 SGML Databases

As with Type 1 SGML databases, the first step involves setting up the FilterChain sections of the Data Dictionary. However, because the files conform to more than one DTD, they must be separated into groups, where all the files in a group conform to a particular DTD. A FilterChain section is then set up for each group. The DisplayFmt section of each FilterChain is then set with two values separated by a comma. The first value is the keyword 'sgml' and the second value is a short group name that you pick, which uniquely identifies the group. For example, the following FilterChain sections might be appropriate for a Type 2 SGML database that contains files from two DTDs (having group names 'manual' and 'news').

Once the FilterChain sections have been set up, the following commands can be used to build the SGML regions (each DTD in the database requires one pass with sgmlrgn). For this example, assume the meta_structure_file generated by mfsmeta is called 'data.str'. Assume that the file 'manual.inp' contains the <!DOCTYPE ...> declaration for the 'manual' files. Finally, assume that the file 'news.inp' contains the <!DOCTYPE ...> declaration for the 'news' files.

sgmlrgn -v -m region -M data.str -G manual -D data.dd manual.inp data.dd

sgmlrgn -v -m region -M data.str -G news -D data.dd news.inp data.dd

Note: The '-G' option is used to specify which group to build the regions on in each pass.
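The one-pass-per-group pattern can be sketched as a Python loop that assembles the two command lines shown above. It only prints the commands and uses only the options that appear in this section; the helper function name and the GROUPS table are hypothetical.

```python
import shlex


def group_region_command(structure_file, dd_file, group, inp_file):
    """Command line for building regions on one group of files."""
    return ["sgmlrgn", "-v", "-m", "region", "-M", structure_file,
            "-G", group, "-D", dd_file, inp_file, dd_file]


# Group name -> file holding that group's <!DOCTYPE ...> declaration.
GROUPS = {"manual": "manual.inp", "news": "news.inp"}

commands = [
    group_region_command("data.str", "data.dd", group, inp_file)
    for group, inp_file in GROUPS.items()
]
for argv in commands:
    print(shlex.join(argv))
```

Adding a further DTD to the database then means adding one entry to the table, which yields one more sgmlrgn pass.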

2.5.3.4 Building Regions for Type 3 SGML Databases

As with Type 2 SGML databases, the first step involves setting up the FilterChain sections of the Data Dictionary. Also, as in Type 2 SGML databases, the files must be separated into groups. What is different for Type 3 databases is that the groups not only specify files that use a particular DTD, but may also be further refined to specify files that contain text for a specific element of a DTD.

For example, assume the newspaper documents in the example above consist of two elements, HEADLINE and TEXT. Further, assume that the text for all the HEADLINE regions is in files with the suffix '.hl' and that the text for the TEXT regions is in files with the suffix '.txt'. Then the following FilterChain sections could be used to define this database (which also includes the 'manual' files from the other directory):

Note: A third attribute has been added to the DisplayFmt fields of the 'news' filegroup, which identifies the element that the text in those files corresponds to. Also note that HEADLINE and TEXT groups have different group names ('newshl' and 'newstxt'). Finally, note that there is no element attribute defined for the 'manual' files because they are to be parsed using the entire 'manual' DTD.

Once the FilterChain sections have been set up, the following commands can be used to build the SGML regions. For this example, assume the meta_structure_file generated by mfsmeta is called 'data.str'. Assume that the file 'manual.inp' contains the <!DOCTYPE ...> declaration for the 'manual' files. Finally, assume that the file 'news.inp' contains the <!DOCTYPE ...> declaration for the 'news' files.

Note: The '-G' option is used to specify which group to build the regions on in each pass.

2.6 CHAPTER SUMMARY

Section 13.1 provided a brief overview of SGML, its concepts and structure. Further information can be found in the references listed in Section 13.6 below.

Section 13.2 reviewed the concepts and structure of an SGML document. The base document type of an SGML document is made up of a number of different structure elements, which may themselves be made up of other structure elements. The base document type should first be declared in the Document Type Declaration ('.inp') file. The function of the Document Type Definition ('.dtd') file is then to describe the relationships among the different structure elements of your document, as well as to describe the attributes that are associated with these elements. The SGML ('.sgm') file is the actual SGML tagged document whose structure is defined by the first two files.

Section 13.3 described the concept of SGML processing and the various SGML processing modes supported by sgmlrgn. We then checked the SGML document for correctness. We used sgmlrgn to automatically generate a "regions" index file for our document. We also used the sgmlrgn program to generate a simple LectorMotif50 specification file. We then discussed the need to use an SGML filter between PatMotif50 and LectorMotif50.

Section 13.4 reviewed the internals of sgmlrgn and how regions are constructed for regular SGML documents or minimized SGML documents. If the SGML document refers to external entities, the 'sgmlentity.map' file is used to map the PUBLIC identifier to the system file. Finally, the method for using sgmlrgn to construct regions in an MFS database was described. By identifying the display format, group name and start element, sgmlrgn can jump into any SGML data section to construct regions.

2.7 REFERENCES

Suggested readings and reference materials on SGML:

(1) The SGML Handbook: The annotated full text of ISO 8879 - Standard Generalized Markup Language, Dr. Charles F. Goldfarb. Clarendon Press, Oxford, 1990.

(2) SGML: An Author's Guide to the Standard Generalized Markup Language, Martin Bryan. Addison-Wesley Publishing Company, New York, 1988.

(3) SGML and Related Standards: Document Description and Processing Languages, Joan Smith. Ellis Horwood, New York, 1992.

(4) CAN/CSA-Z243.210-89 (ISO 8879, 9069), Canadian Standards Association.

2.8 ERROR MESSAGES

The sgmlrgn index building program contains a very extensive error reporting facility. The original error messages are adapted from the public domain "SGMLS" parser. A typical error message generated by sgmlrgn will look like:

sgmlrgn: SGML error at <filename>, line <number> at "<char>":

A complete error report contains the '<filename>' and the line '<number>' where the error occurred. It also gives the closest character '<char>' at which the parser started to detect the problem. A specific error message is then produced to briefly explain the problem. Although the severity and the type of error are not reported, the user can refer to the following tables to find them.

SGML is a very strict system of text markup, so errors can occur very easily. All errors must be resolved for the document to conform. It is therefore necessary to use the sgmlrgn program's 'check' mode to find and resolve all problems before other processing modes can be applied. The error messages are classified by their severity ('Code') and the type of error ('Type'). The severity codes are as follows:

Severity Code	Description
I	Information (not an SGML error)
W	Warning (an SGML markup error but it knows what you mean)
E	Error (the parser keeps a count and aborts if too many errors occurred)
C	Critical Error (the parser will abort at this point)

The type of error can be used to identify the nature of the problem and the types are as follows:

Type Code	Description
R	Resource problem
C	Context/Content problem
M	Minimization problem
Q	Quantity problem
S	Syntax problem
D	Declaration problem
U	Unsupported feature
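
As a minimal sketch (not part of sgmlrgn itself), the message header format and the two code tables above could be handled in Python roughly as follows. The function names, the regular expression, and the sample file name 'manual.sgm' are assumptions made for illustration; only the header layout and the code letters are taken from this section.

```python
import re

# Severity and type codes, copied from the two tables above.
SEVERITY = {
    "I": "Information",
    "W": "Warning",
    "E": "Error",
    "C": "Critical Error",
}
ERROR_TYPE = {
    "R": "Resource problem",
    "C": "Context/Content problem",
    "M": "Minimization problem",
    "Q": "Quantity problem",
    "S": "Syntax problem",
    "D": "Declaration problem",
    "U": "Unsupported feature",
}

# Assumed pattern for the header line; it is based only on the sample
# format shown in this section, not on the sgmlrgn source.
HEADER = re.compile(
    r'^sgmlrgn: SGML error at (?P<filename>.+), '
    r'line (?P<line>\d+) at "(?P<char>.*)":\s*$'
)


def parse_header(text):
    """Return (filename, line number, character), or None if no match."""
    match = HEADER.match(text)
    if match is None:
        return None
    return match.group("filename"), int(match.group("line")), match.group("char")


def describe(severity, error_type):
    """Spell out a severity/type pair such as ('E', 'C')."""
    return f"{SEVERITY[severity]}: {ERROR_TYPE[error_type]}"
```

For example, parse_header applied to a hypothetical line such as sgmlrgn: SGML error at manual.sgm, line 42 at "<": would yield the file name, the line number as an integer, and the character; describe("E", "C") spells out the pair of codes found in the table that follows.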

The following table lists the error messages. The first column is the error reference number, the second column is the severity code, the third column is the error type code, and the last column is the text of the error message. Within a message, X and Y are variables that are replaced with the appropriate names where the problem occurred.

E#	Code	Type	Error Message
1	E	C	X element not allowed at this point in Y element
2	E	D	X markup declaration not permitted here; declaration ended
3	E	Q	Length of name number or token exceeded NAMELEN limit
4	E	S	Non-SGML character occurred in markup; character ignored
5	E	C	X end-tag ignored: doesn't end any open element (current is Y)
6	E	Q	X start-tag exceeds open element limit; possible lies from Y on
7	E	M	Start-tag omitted from X with empty content
8	E	S	Illegal entity end in markup or delimited text
9	E	S	Incorrect character in markup; markup terminated
10	E	C	Data not allowed at this point in X element
11	E	C	No element declaration for X end-tag GI; end-tag ignored
12	E	S	X name ignored: not a syntactically valid SGML name
13	E	C	X = "Y" attribute ignored: not defined for this element
14	E	S	X = "Y" attribute value defaulted: invalid character
15	E	Q	X = "Y" attribute value defaulted: token too long
16	E	C	X = "Y" attribute value defaulted: too many tokens
17	E	C	X = "Y" attribute value defaulted: wrong token type
18	E	C	X = "Y" attribute value defaulted: token not in group
19	E	C	Required X attribute was not specified; may affect processing
20	E	M	X end-tag implied by Y end-tag; not minimizable
21	W	M	X start-tag implied by Y start-tag; not minimizable
22	E	C	Possible attributes treated as data because none were defined
23	E	D	Duplicate specification occurred for "X"; may affect processing
24	E	D	"X" keyword invalid; declaration terminated
25	E	C	X = "Y" attribute defaulted: empty string not allowed for token
26	E	S	Marked section end ignored; not in a marked section
27	E	Q	Marked section start ignored; X marked sections open already
28	E	D	One or more parameters missing; declaration ignored
29	E	D	"PUBLIC" or "SYSTEM" required; declaration terminated
30	E	C	X element ended prematurely; required Y omitted
31	E	R	Entity "X" terminated: could not read file
32	E	R	Could not open file for entity "X"; entity reference ignored
33	C	R	Insufficient main memory; unable to continue parsing
34	E	Q	X entity reference ignored; exceeded open entity limit (Y)
35	E	C	No declaration for entity "X"; reference ignored
36	E	C	X entity reference occurred within own text; reference ignored
37	E	S	Entity nesting level out of sync
38	E	D	Parameter entity text cannot have X keyword; keyword ignored
39	W	M	X end-tag implied by Y start-tag; not minimizable
40	E	D	Start-tag minimization ignored; element has required attribute
41	E	C	Required X element cannot be excluded from Y element
42	E	C	No DOCTYPE declaration; document type is unknown
43	E	C	Undefined X start-tag GI was used in DTD; "X O O ANY" assumed
44	E	S	Invalid character(s) ignored; attempting to resume DOCTYPE subset
45	I	C	No declaration for entity "X"; default definition used
46	W	M	X end-tag implied by NET delimiter; not minimizable
47	W	M	X end-tag implied by data; not minimizable
48	W	M	X end-tag implied by short start-tag (no GI); not minimizable
49	W	M	X start-tag implied by data; not minimizable
50	W	M	X start-tag implied by short start-tag (no GI); not minimizable
51	E	C	Short end-tag (no GI) ignored: no open elements
52	E	C	No definition for X document type; "X O O ANY" assumed
53	E	C	No definition for X implied start-tag; "X O O ANY" assumed
54	E	C	X element ended prematurely; required sub-element omitted
55	E	D	Content model token X: connectors conflict; first was used
56	E	D	Duplicate specification occurred for "X"; duplicate ignored
57	E	S	Bad end-tag in R/CDATA element; treated as short (no GI) end-tag
58	E	D	Start-tag minimization prohibited for EMPTY or R/CDATA; ignored
59	E	S	Reference to PI entity not permitted here; reference ignored
60	W	S	Non-SGML character found; should have been character reference
61	E	S	Numeric character reference exceeds 255; reference ignored
62	E	S	Invalid alphabetic character reference ignored
63	E	S	Invalid character in minimum literal; character ignored
64	E	D	Keyword X ignored; "Y" is not a valid marked section keyword
65	E	Q	Parameter entity name longer than (NAMELEN-1); truncated
66	W	Q	Start-tag length exceeds TAGLEN limit; parsed correctly
67	W	C	X attribute defaulted: FIXED attribute must equal default
68	I	D	Duplicate specification occurred for "X"; duplicate ignored
69	E	C	X = "Y" IDREF attribute ignored: referenced ID does not exist
70	E	Q	X = "Y" IDREF attribute ignored: number of IDs in list exceeds GRPCNT limit
71	E	C	X = "Y" ID attribute ignored: ID in use for another element
72	E	C	X = "Y" ENTITY attribute not general entity; may affect processing
73	W	C	X = "Y" attribute ignored: previously specified in same list
74	E	C	"" - "X" name token ignored: not in any group in this list
75	E	Q	Normalized attribute specification length over ATTSPLEN limit
76	E	C	X = "Y" NOTATION ignored: element content is empty
77	E	C	X = "Y" NOTATION undefined: may affect processing
78	E	C	Entity "X" has undefined notation "Y"
79	E	C	X = "Y" default attribute value not in group; #IMPLIED used
80	E	D	#CURRENT default value treated as #IMPLIED for X ID attribute
81	E	D	ID attribute X cannot have a default value; treated as #IMPLIED
82	E	D	X attribute must be token not empty string; treated as #IMPLIED
83	E	D	NOTATION attribute ignored for EMPTY element
84	E	C	X = "Y" NOTATION ignored: content reference specified
85	W	D	#CONREF default value treated as #IMPLIED for EMPTY element
86	E	C	X = "Y" entity not data entity; may affect processing
87	I	D	End-tag minimization should be "O" for EMPTY element
88	E	S	Formal public identifier "X" invalid; treated as informal
89	E	C	Out-of-context X start-tag ended Y document element (and parse)
90	E	D	"X" keyword is for unsupported feature; declaration terminated
91	E	D	Attribute specification list in prolog cannot be empty
92	C	S	Document ended invalidly within a literal; parsing ended
93	E	C	Short ref in map "X" to undeclared entity "Y" treated as data
94	E	R	Could not reopen file to continue entity "X"; entity terminated
95	E	C	Out-of-context data ended X document element (and parse)
96	E	C	Short start-tag (no GI) ended X document element (and parse)
97	E	D	DSO delimiter (X) omitted from marked section declaration
98	E	D	Group token X: duplicate name or name token "Y" ignored
99	E	D	Attempt to redefine X attribute ignored
100	E	D	X definition ignored: Y is not a valid declared value keyword
101	E	D	X definition ignored: NOTATION attribute already defined
102	E	D	X definition ignored: ID attribute already defined
103	E	D	X definition ignored: no declared value specified
104	E	D	X definition ignored: invalid declared value specified
105	E	D	X definition ignored: number of names or name tokens in group exceeded GRPCNT limit
106	E	D	X definition ignored: name group omitted for NOTATION attribute
107	E	D	#CONREF default value treated as #IMPLIED for X ID attribute
108	E	D	X definition ignored: Y is not a valid default value keyword
109	E	D	X definition ignored: no default value specified
110	E	D	X definition ignored: invalid default value specified
111	E	D	More than ATTCNT attribute names and/or name (token) values; terminated
112	E	D	Attempted redefinition of attribute definition list ignored
113	E	Q	Content model token X: more than GRPCNT model group tokens; terminated
114	E	Q	Content model token X: more than GRPGTCNT content model tokens; terminated
115	E	Q	Content model token X: more than GRPLVL nested model groups; terminated
116	E	D	Content model token X: Y invalid; declaration terminated
117	E	D	"PUBLIC" specified without public ID; declaration terminated
118	E	D	"X" keyword invalid (only Y permitted); declaration terminated
119	E	D	"X" specified without notation name; declaration terminated
120	E	D	Parameter must be a name; declaration terminated
121	E	D	Parameter must be a GI or a group of them; declaration terminated
122	E	D	Parameter must be a name or PERO (%); declaration terminated
123	E	D	Parameter must be a literal; declaration terminated
124	E	D	"X" not valid short reference delimiter; declaration terminated
125	E	C	Map does not exist; declaration ignored
126	E	D	MDC delimiter (>) expected; following text may be misinterpreted
127	C	S	Document ended invalidly within prolog; parsing ended
128	E	D	"PUBLIC" or "SYSTEM" or DSO ([) required; declaration terminated
129	E	D	Minimization must be "-" or "O" (not "X"); declaration terminated
130	E	D	Content model or keyword expected; declaration terminated
131	E	D	Rank stem "X" + suffix "Y" more than NAMELEN characters; not defined
132	E	C	Undefined X start-tag GI ignored; not used in DTD
133	C	S	Document ended invalidly within a markup declaration; parsing ended
134	E	Q	Normalized length of literal exceeded X; markup terminated
135	E	D	R/CDATA marked section in declaration subset; prolog terminated
136	E	Q	X = "Y" ENTITIES attribute ignored: more than GRPCNT in list
137	W	D	Content model is ambiguous
138	E	S	Invalid parameter entity name "X"
139	C	S	Document ended invalidly within a marked section; parsing ended
140		D	Element "X" used in DTD but not defined
141	E	S	Invalid NDATA or SUBDOC entity reference occurred; ignored
142	E	C	Associated element type not allowed in document instance
143	E	C	Illegal DSC character; in different entity from DSO
144	E	D	Declared value of data attribute cannot be ID
145	E	S	Invalid reference to external CDATA or SDATA entity; ignored
146	E	R	Could not find external document type "X"
147	E	R	Could not find external general entity "X"
148	E	R	Could not find external parameter entity "X"
149	E	R	Could not find external notation "X"
150	E	R	Could not find entity "X" using default declaration
151	E	R	Could not find entity "X" in attribute Y using default declaration
152	E	S	Confusing non-SGML character found; ignored
153	I	D	End-tag minimization should be "O" for element with CONREF attribute
154	E	D	Declared value of data attribute cannot be ENTITY or ENTITIES
155	E	D	Declared value of data attribute cannot be IDREF or IDREFS
156	E	D	Declared value of data attribute cannot be NOTATION
157	E	D	CURRENT cannot be specified for a data attribute
158	E	D	CONREF cannot be specified for a data attribute
159	E	C	Short reference map for element "X" not defined; ignored
160	C	R	Cannot create temporary file
161	C	D	Document ended invalidly within SGML declaration
162	W	Q	Capacity limit X exceeded by Y points
163	W	D	Amendment 1 requires "ISO 8879:1986" instead of "ISO 8879-1986"
164	E	D	Non-markup non-minimum data character in SGML declaration
165	E	D	Parameter cannot be a literal
166	E	D	Invalid concrete syntax scope "X"
167	E	D	Parameter must be a number
168	E	D	"X" should have been "Y"
169	E	U	Character number X is not supported as an additional name character
170	E	D	Parameter must be a literal or "X"
171	E	D	Bad character description for character X
172	W	D	Character number X is described more than once
173	E	D	Character number plus number of characters exceeds 256
174	W	D	No description for upper half of character set: assuming "128 128 UNUSED"
175	E	D	Character number X was not described; assuming UNUSED
176	E	D	Non-significant shunned character number X not declared UNUSED
177	E	D	Significant character "X" cannot be non-SGML
178	E	U	Unknown capacity set "X"
179	E	D	No capacities specified
180	E	U	Unknown concrete syntax "X"
181	E	D	Character number exceeds 255
182	E	U	Concrete syntax SWITCHES not supported
183	E	U	"INSTANCE" scope not supported
184	E	D	Value of "X" feature must be one or more
185	E	D	"X" invalid; must be "YES" or "NO"
186	E	D	"X" invalid; must be "PUBLIC" or "SGMLREF"
187	E	U	Feature "X" is not supported
188	E	Q	Too many open subdocument entities
189	I	D	Invalid formal public identifier
190	I	D	Public text class should have been "X"
191	W	D	Character number X must be non-SGML
192	W	D	Notation "X" not defined in DTD
193	W	M	Unclosed start or end tag requires "SHORTTAG YES"
194	W	M	Net-enabling start tag requires "SHORTTAG YES"
195	W	M	Attribute name omission requires "SHORTTAG YES"
196	W	M	Undelimited attribute value requires "SHORTTAG YES"
197	W	M	Attribute specification omitted for "X": requires markup minimization
198	E	D	Concrete syntax does not have any short reference delimiters
199	E	D	Character number X does not exist in the base character set
200	E	D	Character number X is UNUSED in the syntax reference character set
201	E	D	Character number X was not described in the syntax reference character set
202	E	D	Character number X in the syntax reference character set has no corresponding character in the system character set
203	E	D	Character number X was described using an unknown base set
204	E	D	Duplication specification for added function "X"
205	E	D	Added function character cannot be "X"
206	E	U	Only reference concrete syntax function characters supported
207	E	U	Only reference concrete syntax general delimiters supported
208	E	U	Only reference concrete syntax short reference delimiters supported
209	E	D	Unrecognized keyword "X"
210	E	D	Unrecognized quantity name "X"
211	E	D	Interpretation of "X" is not a valid name in the declared concrete syntax
212	E	D	Replacement reserved name "X" cannot be reference reserved name
213	E	D	Duplicate replacement reserved name "X"
214	E	D	Quantity "X" must not be less than Y
215	E	U	Only values up to X are supported for quantity "Y"
216	E	C	Exclusions attempt to change required status of group in "X"
217	E	C	Exclusion cannot apply to token "X" in content model for "Y"
218	E	D	An entity with notation "X" has already been declared
219	E	D	UCNMSTRT must have the same number of characters as LCNMSTRT
220	E	D	UCNMCHAR must have the same number of characters as LCNMCHAR
221	E	D	Character number X assigned to both LCNMSTRT or UCNMSTRT and LCNMCHAR or UCNMCHAR
222	E	D	Character number X cannot be an additional name character
223	E	U	It is unsupported for "-" not to be assigned to UCNMCHAR or LCNMCHAR
224	E	Q	Normalized length of value of attribute "X" exceeded LITLEN
225	E	Q	Length of interpreted parameter literal exceeds LITLEN less the length of the bracketing delimiters
226	W	M	Start tag of document element omitted; not minimizable
227	I	U	Unrecognized designating escape sequence "X"
228	I	D	Earlier reference to entity "X" used default entity
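
If the table above is kept as tab-separated text, a short script can turn it into a lookup keyed by error number. The following Python sketch is illustrative only: the function names and the SAMPLE constant are hypothetical, and the three sample rows are copied from the table. Rows that do not have exactly four columns are skipped rather than guessed at.

```python
from collections import Counter


def load_messages(table_text):
    """Parse 'E#<TAB>Code<TAB>Type<TAB>Error Message' rows into a dict."""
    messages = {}
    for row in table_text.splitlines():
        fields = row.split("\t")
        # Skip the header row and any row without exactly four columns.
        if len(fields) != 4 or not fields[0].strip().isdigit():
            continue
        number, severity, error_type, message = (f.strip() for f in fields)
        messages[int(number)] = (severity, error_type, message)
    return messages


def count_by_severity(messages):
    """Count how many messages fall under each severity code."""
    return Counter(severity for severity, _type, _text in messages.values())


# Three rows copied from the table above, used here as sample input.
SAMPLE = (
    "E#\tCode\tType\tError Message\n"
    "1\tE\tC\tX element not allowed at this point in Y element\n"
    "33\tC\tR\tInsufficient main memory; unable to continue parsing\n"
    "137\tW\tD\tContent model is ambiguous\n"
)

messages = load_messages(SAMPLE)
```

With the full table loaded, messages[33] would return the severity code, type code and text for error 33, and count_by_severity gives a quick view of how many messages are warnings, errors or critical errors.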