Last updated | 2002-05-17 21:50:28 EDT |
Doc Title | SGML and XML Indexing Support |
Author 1 | Wilkin, John Price |
CVS Revision | $Revision: 1.8 $ |
[Editor's note: This text is in the process of being adapted from the original Open Text manual, chapter 13 in the DBA section. References to sections with a "13" prefix are internal to this document. The original document has a heavy emphasis on MFS index building, which has not yet been corrected, and on "dbbuild", which DLXS does not support or recommend. This text was drawn from OCR, and so many errors exist, and figures are typically no longer meaningful.]
This chapter is a summary of SGML concepts and index-building techniques and thus assumes that the reader has some familiarity with the concepts of tagged text (see Chapters 1 and 4) and SGML. Further information on SGML can be found in the references listed in Section 13.6.
In order to maximize DLXS XPAT's SGML functionality, it is necessary to be aware of SGML and its capabilities with respect to your documents. This section will introduce some of the terminology that is used in the rest of the documentation. The characteristics and function of SGML tags will be described, along with the configuration files required by the DLXS XPAT software to utilize SGML functionality.
Note: SGML is defined by International Organization for Standardization (ISO) Standard 8879. It is explained in great detail in a number of books. See Section 13.6 for references. The following subsections will provide enough of an introduction to get started.
Standard Generalized Markup Language (SGML) is a system that allows you to "mark up" text with special tags. These tags specify the structure of the document you are working with. For instance, if you were writing a book, you would use special tags to say "this block of text is a paragraph", or "this block of text is a chapter title". These tags can be combined: "this block of text, made up of a chapter title and one or more paragraphs, is a chapter". Saying that "this block of text is a paragraph" means that the 'paragraph' is a structure element in your document. And saying that "this block of text, made up of a 'chapter title' and one or more 'paragraphs', is a 'chapter'" shows you that structure elements can be made up of combinations of other structure elements.
In order to tell programs such as xpat that, for instance, "this block of text is a paragraph", you have to surround that block of text with tags. These tags usually exist in pairs: start-tags and end-tags. For the 'paragraph' example, '<P>' is a possible start-tag (the actual tag name is arbitrary, provided you use it consistently). This tells xpat that the text following the start-tag is to be considered part of a 'paragraph' region. To end the 'paragraph', you would use the end-tag '</P>'.
The start-tag and end-tag are almost identical, except that the end-tag's name is preceded by a slash ('/') character. The start and end tag pairs allow both you and the program to quickly and easily find the structure elements in an SGML document.
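As a small illustration of the chapter and paragraph description above, a tagged chapter might look like the following sketch. The tag name P comes from the text; CHAPTER and TITLE are hypothetical names chosen only for this example, as is the sample wording.

```sgml
<CHAPTER>
<TITLE>Getting Started</TITLE>
<P>This block of text is a paragraph.</P>
<P>This block of text is another paragraph.</P>
</CHAPTER>
```

Here the CHAPTER element is made up of one TITLE element and two P elements, each delimited by its own start-tag and end-tag.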
You can combine structure elements, such as paragraphs and headlines, together until you get a single element. This element is called the document type. For instance, if you were writing a book, your document type could be BOOK. For a newspaper, you could have a document type called NEWPAPER. (Document type labels are, by default, limited to 8 characters. However, that limit can be changed by modifying the SGML declaration.)
An SGML document is represented as a sequence of characters organized physically into an entity structure and logically into an element structure. The first entity of an SGML document that is parsed must be the "SGML document entity", which contains the document type definition and other information that governs the parse.
The SGML declaration is a section that defines all concrete syntax, optional features and capacity requirements that affect the parsing of the Document Type Definition (DTD) and the document instance. This SGML declaration can usually be skipped and the system default declaration will be assumed.
After the SGML declaration is a DTD that defines the structure of the document in terms of the elements it contains. Within the DTD, each type of element found in the document is given a name (generic identifier) by which it can be recognized. When placed within special markup delimiter characters, these generic identifiers form the tags that are used to identify the start and the end of each element.
To allow large documents to be generated efficiently, SGML documents can be built up from a series of sub-documents or SGML text 'entities'. Non-SGML data (such as graphics or spreadsheets) can also be referenced through non-SGML data entities. Physically, a typical SGML document entity is laid out as follows.
The SGML declaration is at the top and is followed by the document type declaration. The document instance, which contains the actual data, comes after the document type declaration. However, the SGML declaration "<!SGML ...>" is usually omitted because most applications can rely on the default (see the sgmlrgn man pages in the DBA Reference Guide), and the DTD usually exists in a separate file. Therefore, a more common arrangement is depicted as follows:
<!DOCTYPE NEWPAPER SYSTEM "news.dtd">    Document type declaration
<NEWPAPER>                               Document instance
</NEWPAPER>
The "<!DOCTYPE ...>" document type declaration defines the location of the DTD information. In the previous example, the DTD is located in the file "news.dtd". The document instance follows the declaration. However, since the location of the DTD may change often, a more convenient setup is to move the "<!DOCTYPE ...>" statement into a separate file (suffix '.inp'), as described below.
The declaration is located in the 'filename.inp' input file. The document instance is kept separately in 'filename.sgm' to achieve maximum portability of the document instance. When using sgmlrgn, the two files can be concatenated by listing them in sequence on the sgmlrgn command line:
sgmlrgn [...various options...] filename.inp filename.sgm
In this case, the contents of 'filename.inp' will be considered before the contents of 'filename.sgm'. Thus, greater flexibility is achieved by separating the document type declaration from the document. The pure data file 'filename.sgm' can be ported elsewhere and used with a different DTD and INP.
To take advantage of the information contained in your SGML tagged document, you first need to tell programs such as sgmlrgn that you have a base document type and then say how that document type is constructed from the different structure elements. In other words, the allowable fields (elements) have to be defined, and legal nestings must be unambiguously declared. This requires two files: (1) the '.inp' file, and (2) the '.dtd' file. The '.inp' file declares the document type and tells the program where to find the '.dtd' file. The '.dtd' (Document Type Definition) file defines the document structure elements (fields) and their allowable nestings. This section will provide examples of both files, as well as an example of SGML tagged text, all of which will be based on the example of a newspaper.
In this section, we will describe the function of the Document Type Declaration file, describe its syntax, and give an example of it. The Document Type Declaration ('.inp') file declares the name of the base document type for your SGML tagged document and names the file that holds the formal definition of the document structure (i.e., the DTD). The Document Type Declaration file usually has only one line, which has the following syntax:
<!DOCTYPE doctype SYSTEM "filename.dtd">
where 'doctype' is the name of the base document type, and 'filename.dtd' is the name of the file that contains the formal definition of the document structure (called the Document Type Definition or DTD). The 'filename.dtd' file will be assumed to be in the current directory or in the directory specified by the environment variable SGMLREGION_PATH (Unix only). Documents can be of two types: SYSTEM or PUBLIC. PUBLIC documents are those known beyond the native system, whereas SYSTEM documents are those that are specific to the system on which they are prepared. Our example is not going to refer to any other documents, so we can specify a SYSTEM entry.
For our newspaper example, we will call the doctype NEWPAPER. The 'filename.dtd' file will be called 'newpaper.dtd'. Our Document Type Declaration file, called 'newpaper.inp', now looks like this:
<!DOCTYPE NEWPAPER SYSTEM "newpaper.dtd">
This section gives an overview of the function and syntax of the Document Type Definition ('.dtd') file, and gives an example of its use. Refer to Section 13.6 for further references on SGML and DTDs.
The Document Type Definition (DTD) formally defines the structure of an SGML document, as well as the relationships between the different structure elements. It describes how simple structure elements, made up of characters, can be combined to form more complex structure elements, including the base document type. For the newspaper example, we can assume a newspaper is made up of Stories, Illustrations, and Ads. Stories may be made up of Paragraphs, a Date, a Byline, an Author, and other pieces. Paragraphs may be made up only of text (character data), not other structure elements.
To construct our DTD, we will start with the topmost element of an SGML document: the base document type itself. We illustrate with the newspaper example.
<!ELEMENT NEWPAPER O O (STORY)*>
The element NEWPAPER is made up of zero or more STORYs. (The '*' means zero or more.) The first capital letter 'O' means that the start-tag for this structure element can be omitted from the actual SGML document. The second capital letter 'O' means that the end-tag for this structure element can be omitted.
The next step is to define what a STORY element looks like:
<!ELEMENT STORY - - (TEXT | ILLUST)*>
For our newspaper example, a story is made up of zero or more TEXTs or ILLUSTrations. The vertical bar ('|') means "or". The dashes ('-') mean that the start and end tags must be present in the text in order for the structure element to be recognized as a story. We can also say that the STORY has certain attributes, such as Status, Publisher, Date and Page. Each of these attributes can have a value associated with it. So, for our newspaper example, we add an attribute list for STORY to the DTD.
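The attribute list for STORY can be sketched as a single ATTLIST declaration. This is a reconstruction from the description in this section rather than the original listing, so the layout and the upper-case spelling of the STATUS values are assumptions.

```sgml
<!-- Attributes of the STORY element (reconstructed; an assumption) -->
<!ATTLIST STORY
    STATUS  (DRAFT | PREPARE | READY)  DRAFT
    PUB     CDATA                      "Local Newspaper"
    DATE    NUMBER                     #IMPLIED
    PAGE    CDATA                      "">
```

Each row gives an attribute name, its declared value (an enumerated group, CDATA, or NUMBER), and its default.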
These "entries" tell us that the structure element STORY has an ATTribute LIST that includes STATUS, PUBlisher, DATE, and PAGE data regions. The STATUS attribute can take one of three values (Draft, Prepare, or Ready), but defaults to Draft. The PUBlisher attribute is made up of characters (CDATA) and defaults to 'Local Newspaper'. The DATE attribute is a number. The #IMPLIED value tells the system that there is no default for the attribute, and that the system should imply a value if none is given. The PAGE is simply character data and defaults to a blank.
The TEXT and ILLUSTration elements are, for the purposes of our newspaper example, made up of characters. To be recognized, they will require both start and end tags. To enter these definitions into the DTD, we write:
<!ELEMENT TEXT - - CDATA>
<!ELEMENT ILLUST - - CDATA>
By looking at all the entries together, it is easy to see that our base document type, NEWPAPER, is made up of zero or more STORYs. A STORY is made up of zero or more pieces of TEXT or ILLUSTrations, each of which is in turn made up of characters (CDATA). Each STORY also has a STATUS, PUBlisher, DATE, and PAGE associated with it. Together, these entries make up the complete DTD. In a DTD, text that starts with '<!--' and ends with '-->' is regarded as a comment.
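Assembled into one file, the DTD for the newspaper example might look like the sketch below. The ELEMENT declarations are the ones given in this section. The ATTLIST is reconstructed from the prose description and should be read as an assumption, as should the wording of the comments.

```sgml
<!-- newpaper.dtd: DTD for the newspaper example -->

<!-- Base document type: zero or more stories. Both tags may be omitted. -->
<!ELEMENT NEWPAPER O O (STORY)*>

<!-- A story: zero or more TEXT or ILLUST elements. Both tags are required. -->
<!ELEMENT STORY - - (TEXT | ILLUST)*>

<!-- Attributes of STORY (reconstructed from the description; an assumption) -->
<!ATTLIST STORY
    STATUS  (DRAFT | PREPARE | READY)  DRAFT
    PUB     CDATA                      "Local Newspaper"
    DATE    NUMBER                     #IMPLIED
    PAGE    CDATA                      "">

<!-- Leaf elements: character data only. Both tags are required. -->
<!ELEMENT TEXT   - - CDATA>
<!ELEMENT ILLUST - - CDATA>
```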
The SGML document ('.sgm') file contains the actual SGML tagged text. The DTD is used to interpret the text and its various elements (fields). Notice that the elements defined in the DTD are called tags when surrounded by angle brackets (<>). Also notice that the same line that appeared in the Document Type Declaration ('.inp') file is also the first line of the SGML ('.sgm') document file. The attributes of the STORY element and their associated values are all contained within the STORY tag as well. In this example, the sample text would be stored in the SGML document file called 'newpaper.sgm'.
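A minimal 'newpaper.sgm' consistent with the description above might look like the following sketch. The story text and the attribute values (including the date and page number) are invented for illustration, and the attribute names assume the STORY attribute list described earlier.

```sgml
<!DOCTYPE NEWPAPER SYSTEM "newpaper.dtd">
<NEWPAPER>
<STORY STATUS=READY PUB="Local Newspaper" DATE=19920315 PAGE="A1">
<TEXT>The city council approved the new library budget on Tuesday.</TEXT>
<ILLUST>Photograph of the council chamber.</ILLUST>
</STORY>
<STORY STATUS=DRAFT PAGE="B4">
<TEXT>Rain is expected through the weekend.</TEXT>
</STORY>
</NEWPAPER>
```

The second STORY leaves out PUB and DATE, so PUB would take its declared default and DATE would be left implied.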
The SGML processing model closely resembles the traditional model of processing computer programs written in a programming language. Most processing systems for programming languages (e.g., compilers) have the same structure. This structure is depicted below:
Figure 13-1: Programming Language Document Processing Structure (Compiler)
input program -> parser -> parse tree -> semantic processing -> output
The task of the "parser" is to check whether the input is syntactically correct and to build the parse tree. After the parser has done its task, the other part of the system will perform the semantic processing. An SGML processing system has the same general structure as a compiler:
Figure 13-2: SGML Document Processing Structure
input document -> SGML parser -> valid document -> SGML application -> output
An SGML parser, as defined in the SGML Standard, has the same structure as a parser for programming languages. The parser only checks the conformance of an SGML document to its DTD and performs no further semantic processing. The output of most SGML parsers includes a normalized document, which is the document in which all start-tags and end-tags have been fully expanded. At this stage, the document is said to conform to the corresponding DTD. The internal structure of this complete document corresponds to the parse tree in systems for programming languages.
As with programs, the complete document is not the end-stage in processing a document. It merely serves as an intermediate product, in which the correctness of the document has been assessed. Subsequently, the document has to be further processed. This is labeled SGML Application in Figure 13-2. The SGML application may generate code for various output formats. More specifically, sgmlrgn relies on a common, public-domain SGML parser called "sgmls". The various SGML applications are combined into a unified interface as sgmlrgn. Each mode serves as a unique application that generates support information for DLXS XPAT products. For example, the 'region' mode generates the region indices that can be used by xpat. The relationships among the various SGML support modes are illustrated in Figure 13-3:
Figure 13-3: SGML Processing support for 'sgmlrgn', various modes
Each processing mode takes an SGML document as input and uses the SGML parser to produce an output format. The 'check' mode should be used before any other processing mode, since a validated SGML document is vital to the other processing tools. The most commonly used modes are 'check' and 'region'. This combination will validate the SGML document and produce the regions for the Pat search engine. If PatMotif50 and LectorMotif50 are selected as the user's search interface and viewer, the 'filter' mode can be used to support communication between the two programs. The 'spec' mode is used to generate a simple "Lector specification file" for displaying tagged text in LectorMotif50. The 'root' mode is used to report the topmost level element in the DTD so that it can be included in the "Pat initialization file".
sgmlrgn Processing Modes
The sgmlrgn program has several different SGML application modes. Many of these will be used throughout this section, so we will describe these modes here. The desired mode is specified with the '-m' option on the sgmlrgn command line.
Commonly used options are described in the following table in order of importance:
Mode | Function |
-m check | This mode validates the SGML document itself with respect to the DTD file. Any syntax or other errors will be reported by sgmlrgn. |
-m region -D datadictname.dd | This mode generates all the regions in the file and updates the region information in the DD file. The '-D' option must be included to specify which DD file is to be updated. The name of the region file created uses a '.rgn' extension and the same prefix as that of the text file. |
-m root | This mode determines and prints out the root element (also referred to as the base document type for our purposes) of the SGML document. |
-m filter | This mode gets sgmlrgn to parse the DTD and wait for standard input. It is suitable for use as a filter between PatMotif50 and LectorMotif50. This configuration is described further below. |
-m spec | This mode will generate a simple LectorMotif50 specification ('.spc') file, where all elements are recorded. See the LectorMotif50(1) man pages or the LectorMotif50 section in the DLXS XPATQuery Configuration Guide for a more complete discussion of the specification file. |
Refer to the sgmlrgn(5) man page for a more complete discussion of these options. It is important to note why the sgmlrgn program is used here, instead of the xpatrgn or multirgn programs. The xpatrgn program is designed to be used when the document has arbitrary patterns to denote regions. The multirgn program is designed to be used when the document has SGML-style tags (i.e., surrounded by angle brackets, <>), but no DTD. The sgmlrgn program is designed to be used with fully validated SGML documents that have associated DTDs. This can be summarized with the following table.
Region Builder | Appropriate Use |
xpatrgn | text has arbitrary patterns |
multirgn | text has SGML-type tags but no DTD |
sgmlrgn | text has fully validated SGML tags and a DTD |
Throughout this section, reference will be made to the files 'newpaper.inp', 'newpaper.dtd', and 'newpaper.sgm'. These are the names of the example files described in Section 13.2.
In order to ensure that there are no syntax or other errors in the DTD (contained in the '.dtd' file) or the SGML tagged document (in the '.sgm' file) itself, you should run a test over them. The sgmlrgn program provides an easy way to perform this verification: the check mode.
sgmlrgn -v -m check newpaper.sgm
The '-v' option makes the output verbose, instructing sgmlrgn to describe its operations as it proceeds. The '-m' option selects the mode (see the discussion above and the sgmlrgn(1) man page). Substitute your SGML file for the 'newpaper.sgm' file given here. Please note that all fatal errors must be resolved before other processing modes will be able to use the document. If the verification is successful, messages similar to the following will be returned:
check mode ...
checking total size(125K) time (2s)
Once this simple step has completed, we know that our SGML document and DTD are correct and fully validated. We can now move on to building the regions file.
One of the benefits associated with using SGML documents is that we can define regions of text. A region is the text that exists between the start and end tags of an SGML document structure element. So, for instance, text between <P> and </P> tags will be referred to as a region. When lists of regions are placed into a regions file, xpat can restrict searches only to "paragraphs" (for example), or some other defined region. However, the DD file must exist before you try to build the regions file, as the region builder will update the DD file with the new region information. The SGML region builder is invoked with the following command:
sgmlrgn -v -m region -D newpaper.dd newpaper.sgm
You can substitute your DD name for the 'newpaper.dd' file and your SGML document for the 'newpaper.sgm' file. The region information is derived from the actual SGML document, and the results are placed in the region file. In the above example, sgmlrgn would create a region index file called 'newpaper.rgn'. If the region building operation is successful, messages reporting the number of regions, their sizes, and the time taken will be displayed.
In addition to these messages, a message similar to the following will be given for each different type of region that was built:
built (newpaper.rgn) region (NEWPAPER count= 2)
The sgmlrgn program will clean up the DD file to accommodate the new region information. If the region name previously existed in the DD file, that region definition will be replaced by the newly constructed information. If a region name is no longer a reference to any file, this region definition will be removed.
If for some reason a region is not needed or not wanted for xpat or PatMotif50, it can be manually deleted from the DD after the region building process is completed. For instance, if ILLUSTrations are not needed, the following segment can be deleted from the DD:
<Region>
<Name> ILLUST</Name>
</Region>
Once that segment is deleted from the DD, the ILLUST region is invisible to xpat or PatMotif50.
The region building mode for sgmlrgn will construct region indices that can be used by the Pat search engine. For every unique element occurring inside the SGML document, sgmlrgn will assign a unique region index. The discussion that follows assumes a document instance, already validated with 'check' mode, that contains a uspatapp element with attributes and an inventor element.
It is important to note that three types of region indices will be built on each region tag. Type 1 indices are built on the contents of the data regions marked by <tag_body> and </tag_body>. Type 2 indices are built on just the contents of the start tags (i.e., <tag_body>). Type 3 indices are built on attributes within a tag body.
For instance, a Type 1 index would be built on the contents of the inventor data region and would be called INVENTOR (SGML region indices are always named with uppercase letters). A Type 2 index would be built on the start tag <uspatapp ...> and would be called USPATAPP-T (Type 2 indices are always suffixed by "-T"). Type 3 indices would be built on the attributes of uspatapp. For example, the Type 3 index built on the patnum attribute would be called A-PATNUM (Type 3 indices are always prefixed by "A-"). (The figure below illustrates further.) These three different types of SGML indices allow users to restrict queries to very specific sets of regions (e.g., those with specific combinations of attribute values).
Figure 13-4: Scope of SGML indices on regions, tag bodies, and tag attributes
<uspatapp patnum="..." IMGAVLDATE="..." appnum="...">...</uspatapp>
Scope of A-PATNUM index: the patnum attribute
Scope of A-IMGAVLDATE index: the IMGAVLDATE attribute
Scope of A-APPNUM index: the appnum attribute
Scope of USPATAPP-T index: the <uspatapp ...> start tag
Scope of USPATAPP index: the uspatapp data region
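To make the index names concrete, a document instance of the kind discussed above might look like the following sketch. The element and attribute names come from the text. The attribute values and the inventor name are placeholders invented for illustration.

```sgml
<uspatapp patnum="0000000" imgavldate="19000101" appnum="000000">
<inventor>Jane Example</inventor>
</uspatapp>
```

Following the naming rules given above, the region builder would be expected to produce, among others, the Type 1 indices USPATAPP and INVENTOR, the Type 2 index USPATAPP-T, and the Type 3 indices A-PATNUM, A-IMGAVLDATE, and A-APPNUM.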
An external entity resides in one or more files. A system identifier is interpreted as a list of filenames separated by colons. If no system identifier is supplied, then the entity manager will attempt to generate a filename using the public identifier. The system filename associated with a public identifier is found by table lookup. The table is a file named "sgmlentity.map". The sgmlentity.map file has two white-space delimited fields per document type. The first field is the system filename. The second field is the PUBLIC ID. A sample entry appears in the mapping example below.
sgmlrgn searches in the following order of precedence to resolve a PUBLIC ID:
1. the sgmlentity.map file in the local directory.
2. the sgmlentity.map file pointed to by the SGMLREGION_PATH environment variable.
3. the system filename in the local directory.
An External Entity Mapping Example
The examples that follow rely on three files: the document type definition ('example.dtd'), the document instance, and the input file.
The following is an example SGML document instance called 'example.sgm':
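A document type definition that would accept the 'example.sgm' instance used in this example might look like the sketch below. It is a hypothetical reconstruction: the element names and the 'type' attribute values are taken from that instance, while the content models, the tag omission settings, the extra 'right' value, and the parameter entity name ISOlat1 are assumptions. It relies on the OMITTAG feature being enabled, as it is in the default SGML declaration.

```sgml
<!-- example.dtd (hypothetical reconstruction) -->

<!-- Reference a PUBLIC entity set, resolved through sgmlentity.map -->
<!ENTITY % ISOlat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">
%ISOlat1;

<!-- End tags may be omitted ('O') for every element -->
<!ELEMENT doc   - O (intro, body, concl)>
<!ELEMENT intro - O (#PCDATA)>
<!ELEMENT body  - O (p)+>
<!ELEMENT p     - O (#PCDATA)>
<!ELEMENT concl - O (#PCDATA)>

<!-- Paragraph alignment; defaults to left -->
<!ATTLIST p type (left | center | right) left>
```

Because this DTD names the entity set only by its PUBLIC identifier, the parser has to consult the entity map to find the corresponding file.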
<!DOCTYPE doc SYSTEM "example.dtd">
<doc><intro> Introduction
<body><p type=left>Paragraph 1
<p type=center>Paragraph 2
<concl>Conclusion
The PUBLIC entity "ISO 8879-1986//ENTITIES Added Latin 1//EN" will produce a table lookup in the entity map file 'sgmlentity.map' (or in '$(SGMLREGION_PATH)/sgmlentity.map'). The following is an entry in the sgmlentity.map file:
/usr/app/isolat1.gml "ISO 8879-1986//ENTITIES Added Latin 1//EN"
This particular ISOlat1 public entity will be mapped to the system id "/usr/app/isolat1.gml".
In MFS databases, the MFS system creates a "virtual text" from the text of all the files in the database. The portion of this virtual text that corresponds to each file consists of three pieces: the Meta-Header section, the Data section and the Meta-Trailer section. This breakdown is illustrated in the following diagram:
The data in the Meta-Header and Meta-Trailer sections is highly structured and is uniform across all the files in the MFS database. In contrast, the data in the Data sections may be untagged text, tagged text without a DTD or tagged text with a DTD (SGML data).
The process of building region indices on such databases involves three steps. The first step involves running mfsmeta over the database to build a meta_structure_file. This file contains information about the positions of the Meta-Header, Data and Meta-Trailer sections for each file in the database. The second step involves building regions on the fields in the Meta-Header and Meta-Trailer sections that are common to all files. Refer to the multirgn(1) man page for further details. The third step involves building regions for the Data sections. For the Data sections that contain tagged text without a DTD, this task is accomplished using multirgn. For SGML Data sections (that do have a DTD), this task is accomplished using sgmlrgn.
There are three types of SGML MFS databases. The first type consists of a group of SGML files that all conform to the same DTD and where each file is a complete document. The second type consists of a group of SGML files that conform to several different DTDs, but where each file is still a complete document. The third type consists of a group of SGML files that conform to one or more DTDs and where the files may contain either complete documents or pieces of documents (i.e., the text for specific elements in the DTD). Each of the next three sections discusses how to build regions for one of the above database types.
The first step in building region indices for Type 1 SGML databases involves setting up the FilterChain section of the Data Dictionary, which specifies the SGML files to be included. In particular, the DisplayFmt field should be set to the value 'sgml'.
Once the FilterChain sections have been set up, the following command can be used to build the SGML regions (usually done separately after dbbuild or individual index-builders have been run):
sgmlrgn -v -m region -M data.str -D data.dd data.inp data.dd
For this example, assume that the meta_structure_file generated by mfsmeta is called 'data.str' and that 'data.inp' contains the <!DOCTYPE ...> declaration for the SGML files in the database. The sgmlrgn program will then use 'data.str' to identify all the 'sgml' format files and will build SGML regions on them.
As with Type 1 SGML databases, the first step involves setting up the FilterChain sections of the Data Dictionary. However, because the files conform to more than one DTD, they must be separated into groups, where all the files in a group conform to a particular DTD. A FilterChain section is then set up for each group. The DisplayFmt section of each FilterChain is then set with two values separated by a comma. The first value is the keyword 'sgml' and the second value is a short group name that you pick, which uniquely identifies the group. The example in this section assumes a Type 2 SGML database that contains files from two DTDs, with the group names 'manual' and 'news'.
Once the FilterChain sections have been set up, the following commands can be used to build the SGML regions (each DTD in the database requires one pass with sgmlrgn). For this example, assume the meta_structure_file generated by mfsmeta is called 'data.str'. Assume that the file 'manual.inp' contains the <!DOCTYPE ...> declaration for the 'manual' files. Finally, assume that the file 'news.inp' contains the <!DOCTYPE ...> declaration for the 'news' files.
sgmlrgn -v -m region -M data.str -G manual -D data.dd manual.inp data.dd
sgmlrgn -v -m region -M data.str -G news -D data.dd news.inp data.dd
Note: The '-G' option is used to specify which group to build the regions on in each pass.
As with Type 2 SGML databases, the first step involves setting up the FilterChain sections of the Data Dictionary. Also, as in Type 2 SGML databases, the files must be separated into groups. What is different for Type 3 databases is that the groups not only specify files that use a particular DTD, but may also be further refined to specify files that contain text for a specific element of a DTD.
For example, assume the newspaper documents in the example above consist of two elements, HEADLINE and TEXT. Further, assume that the text for all the HEADLINE regions is in files with the suffix '.hl' and that the text for the TEXT regions is in files with the suffix '.txt'. FilterChain sections can then be set up to define this database (which also includes the 'manual' files from the other directory).
Note: A third attribute has been added to the DisplayFmt fields of the 'news' filegroup, which identifies the element that the text in those files corresponds to. Also note that HEADLINE and TEXT groups have different group names ('newshl' and 'newstxt'). Finally, note that there is no element attribute defined for the 'manual' files because they are to be parsed using the entire 'manual' DTD.
Once the FilterChain sections have been set up, the SGML regions can be built with sgmlrgn, using one pass per group. For this example, assume the meta_structure_file generated by mfsmeta is called 'data.str'. Assume that the file 'manual.inp' contains the <!DOCTYPE ...> declaration for the 'manual' files. Finally, assume that the file 'news.inp' contains the <!DOCTYPE ...> declaration for the 'news' files.
Note: The '-G' option is used to specify which group to build the regions on in each pass.
Section 13.1 provided a brief overview of SGML, its concepts and structure. Further information can be found in the references listed in Section 13.6 below.
Section 13.2 reviewed the concepts and structure of an SGML document. The base document type of an SGML document is made up of a number of different structure elements, which may themselves be made up of other structure elements. The base document type should first be declared in the Document Type Declaration ('.inp') file. The function of the Document Type Definition ('.dtd') file is then to describe the relationships among the different structure elements of your document, as well as to describe the attributes that are associated with these elements. The SGML ('.sgm') file is the actual SGML tagged document whose structure is defined by the first two files.
Section 13.3 described the concept of SGML processing and the various SGML processing modes supported by sgmlrgn. We then checked the SGML document for correctness. We used sgmlrgn to automatically generate a "regions" index file for our document. We also used the sgmlrgn program to generate a simple LectorMotif50 specification file. We then discussed the need to use an SGML filter between PatMotif50 and LectorMotif50.
Section 13.4 reviewed the internals of sgmlrgn, describing how the regions are constructed for regular SGML documents or minimized SGML documents. If the SGML document refers to external entities, the 'sgmlentity.map' file is used to map the PUBLIC identifier to the system file. Finally, the method for using sgmlrgn to construct regions in an MFS database was described. By identifying the display format, group name and start element, sgmlrgn can jump into any SGML data section to construct regions.
Suggested readings and reference materials on SGML:
(1) The SGML Handbook: The annotated full text of ISO 8879 - Standard Generalized Markup Language, Dr. Charles F. Goldfarb. Clarendon Press, Oxford, 1990.
(2) SGML: An Author's Guide to the Standard Generalized Markup Language, Martin Bryan. Addison-Wesley Publishing Company, New York, 1988.
(3) SGML and Related Standards: Document Description and Processing Language, Joan Smith. Ellis Horwood, New York, 1992.
(4) CAN/CSA-Z243.210-89 (ISO 8879, 9069), Canadian Standards Association.
The sgmlrgn index building program contains a very extensive error reporting facility. The original error messages are adapted from the public domain "SGMLS" parser. A typical error message generated by sgmlrgn will look like:
sgmlrgn: SGML error at <filename>, line <number> at "<char>":
<specific error messages>
A complete error report contains the '<filename>' and the line '<number>' where the error occurred. It also gives the closest character '<char>' at which the parser started detecting the problem. The '<specific error messages>' briefly explain the problem. Although the severity and type of the error are not reported, the user can refer to the following tables for more information about them. SGML is a very strict system of text markup, so errors can occur very easily. All errors must be resolved for the document to conform. It is therefore necessary to use the sgmlrgn program's 'check' mode to find and resolve all problems before other processing modes can be applied. The error messages are classified by their severity ('Code') and the type of error ('Type'). The severity codes are as follows:
Severity Code | Description |
I | Information (not an SGML error) |
W | Warning (an SGML markup error but it knows what you mean) |
E | Error (the parser keeps a count and aborts if too many errors occurred) |
C | Critical Error (the parser will abort at this point) |
The type of error can be used to identify the nature of the problem and the types are as follows:
Type Code | Description |
R | Resource problem |
C | Context/Content problem |
M | Minimization problem |
Q | Quantity problem |
S | Syntax problem |
D | Declaration problem |
U | Unsupported feature |
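As an illustration of the message layout and the two code tables above, the following Python sketch splits the first line of an sgmlrgn error report into its filename, line number, and character, and holds the severity and type codes as lookup tables. It assumes the header line matches the template shown earlier exactly; the file name 'manual.sgm' in the usage note is hypothetical, and the sketch is not part of the sgmlrgn distribution.

```python
import re
from typing import NamedTuple, Optional

# Header layout taken from the template shown above:
#   sgmlrgn: SGML error at <filename>, line <number> at "<char>":
HEADER = re.compile(
    r'^sgmlrgn: SGML error at (?P<filename>.+), '
    r'line (?P<line>\d+) at "(?P<char>.*)":\s*$'
)

# Severity and type codes, copied from the two tables above.
SEVERITY = {
    "I": "Information",
    "W": "Warning",
    "E": "Error",
    "C": "Critical Error",
}
ERROR_TYPE = {
    "R": "Resource problem",
    "C": "Context/Content problem",
    "M": "Minimization problem",
    "Q": "Quantity problem",
    "S": "Syntax problem",
    "D": "Declaration problem",
    "U": "Unsupported feature",
}


class ErrorHeader(NamedTuple):
    filename: str
    line: int
    char: str


def parse_header(text: str) -> Optional[ErrorHeader]:
    """Return the parts of an error header line, or None if it does not match."""
    match = HEADER.match(text)
    if match is None:
        return None
    return ErrorHeader(match.group("filename"),
                       int(match.group("line")),
                       match.group("char"))
```

For example, `parse_header('sgmlrgn: SGML error at manual.sgm, line 42 at "<":')` yields the filename, the line number as an integer, and the character, which can then be used to jump to the offending location in the source file.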
The following is the table of error messages. The first column is the reference error code number, the second column is the severity code, the third column is the error type code, and the last column is the actual error message generated. Inside the error message, X and Y represent variables that are replaced with the appropriate names where the problem occurred.
E# | Code | Type | Error Message |
1 | E | C | X element not allowed at this point in Y element |
2 | E | D | X markup declaration not permitted here; declaration ended |
3 | E | Q | Length of name number or token exceeded NAMELEN limit |
4 | E | S | Non-SGML character occurred in markup; character ignored |
5 | E | C | X end-tag ignored: doesn't end any open element (current is Y) |
6 | E | Q | X start-tag exceeds open element limit; possible lies from Y on |
7 | E | M | Start-tag omitted from X with empty content |
8 | E | S | Illegal entity end in markup or delimited text |
9 | E | S | Incorrect character in markup; markup terminated |
10 | E | C | Data not allowed at this point in X element |
11 | E | C | No element declaration for X end-tag GI; end-tag ignored |
12 | E | S | X name ignored: not a syntactically valid SGML name |
13 | E | C | X = "Y" attribute ignored: not defined for this element |
14 | E | S | X = "Y" attribute value defaulted: invalid character |
15 | E | Q | X = "Y" attribute value defaulted: token too long |
16 | E | C | X = "Y" attribute value defaulted: too many tokens |
17 | E | C | X = "Y" attribute value defaulted: wrong token type |
18 | E | C | X = "Y" attribute value defaulted: token not in group |
19 | E | C | Required X attribute was not specified; may affect processing |
20 | E | M | X end-tag implied by Y end-tag; not minimizable |
21 | W | M | X start-tag implied by Y start-tag; not minimizable |
22 | E | C | Possible attributes treated as data because none were defined |
23 | E | D | Duplicate specification occurred for "X"; may affect processing |
24 | E | D | "X" keyword invalid; declaration terminated |
25 | E | C | X = "Y" attribute defaulted: empty string not allowed for token |
26 | E | S | Marked section end ignored; not in a marked section |
27 | E | Q | Marked section start ignored; X marked sections open already |
28 | E | D | One or more parameters missing; declaration ignored |
29 | E | D | "PUBLIC" or "SYSTEM" required; declaration terminated |
30 | E | C | X element ended prematurely; required Y omitted |
31 | E | R | Entity "X" terminated: could not read file |
32 | E | R | Could not open file for entity "X"; entity reference ignored |
33 | C | R | Insufficient main memory; unable to continue parsing |
34 | E | Q | X entity reference ignored; exceeded open entity limit (Y) |
35 | E | C | No declaration for entity "X"; reference ignored |
36 | E | C | X entity reference occurred within own text; reference ignored |
37 | E | S | Entity nesting level out of sync |
38 | E | D | Parameter entity text cannot have X keyword; keyword ignored |
39 | W | M | X end-tag implied by Y start-tag; not minimizable |
40 | E | D | Start-tag minimization ignored; element has required attribute |
41 | E | C | Required X element cannot be excluded from Y element |
42 | E | C | No DOCTYPE declaration; document type is unknown |
43 | E | C | Undefined X start-tag GI was used in DTD; "X O O ANY" assumed |
44 | E | S | Invalid character(s) ignored; attempting to resume DOCTYPE subset |
45 | I | C | No declaration for entity "X"; default definition used |
46 | W | M | X end-tag implied by NET delimiter; not minimizable |
47 | W | M | X end-tag implied by data; not minimizable |
48 | W | M | X end-tag implied by short start-tag (no GI); not minimizable |
49 | W | M | X start-tag implied by data; not minimizable |
50 | W | M | X start-tag implied by short start-tag (no GI); not minimizable |
51 | E | C | Short end-tag (no GI) ignored: no open elements |
52 | E | C | No definition for X document type; "X O O ANY" assumed |
53 | E | C | No definition for X implied start-tag; "X O O ANY" assumed |
54 | E | C | X element ended prematurely; required sub-element omitted |
55 | E | D | Content model token X: connectors conflict; first was used |
56 | E | D | Duplicate specification occurred for "X"; duplicate ignored |
57 | E | S | Bad end-tag in R/CDATA element; treated as short (no GI) end-tag |
58 | E | D | Start-tag minimization prohibited for EMPTY or R/CDATA; ignored |
59 | E | S | Reference to PI entity not permitted here; reference ignored |
60 | W | S | Non-SGML character found; should have been character reference |
61 | E | S | Numeric character reference exceeds 255; reference ignored |
62 | E | S | Invalid alphabetic character reference ignored |
63 | E | S | Invalid character in minimum literal; character ignored |
64 | E | D | Keyword X ignored; "Y" is not a valid marked section keyword |
65 | E | Q | Parameter entity name longer than (NAMELEN-1); truncated |
66 | W | Q | Start-tag length exceeds TAGLEN limit; parsed correctly |
67 | W | C | X attribute defaulted: FIXED attribute must equal default |
68 | I | D | Duplicate specification occurred for "X"; duplicate ignored |
69 | E | C | X = "Y" IDREF attribute ignored: referenced ID does not exist |
70 | E | Q | X = "Y" IDREF attribute ignored: number of IDs in list exceeds GRPCNT limit |
71 | E | C | X = "Y" ID attribute ignored: ID in use for another element |
72 | E | C | X = "Y" ENTITY attribute not general entity; may affect processing |
73 | W | C | X = "Y" attribute ignored: previously specified in same list |
74 | E | C | "" - "X" name token ignored: not in any group in this list |
75 | E | Q | Normalized attribute specification length over ATTSPLEN limit |
76 | E | C | X = "Y" NOTATION ignored: element content is empty |
77 | E | C | X = "Y" NOTATION undefined: may affect processing |
78 | E | C | Entity "X" has undefined notation "Y" |
79 | E | C | X = "Y" default attribute value not in group; #IMPLIED used |
80 | E | D | #CURRENT default value treated as #IMPLIED for X ID attribute |
81 | E | D | ID attribute X cannot have a default value; treated as #IMPLIED |
82 | E | D | X attribute must be token not empty string; treated as #IMPLIED |
83 | E | D | NOTATION attribute ignored for EMPTY element |
84 | E | C | X = "Y" NOTATION ignored: content reference specified |
85 | W | D | #CONREF default value treated as #IMPLIED for EMPTY element |
86 | E | C | X = "Y" entity not data entity; may affect processing |
87 | I | D | End-tag minimization should be "O" for EMPTY element |
88 | E | S | Formal public identifier "X" invalid; treated as informal |
89 | E | C | Out-of-context X start-tag ended Y document element (and parse) |
90 | E | D | "X" keyword is for unsupported feature; declaration terminated |
91 | E | D | Attribute specification list in prolog cannot be empty |
92 | C | S | Document ended invalidly within a literal; parsing ended |
93 | E | C | Short ref in map "X" to undeclared entity "Y" treated as data |
94 | E | R | Could not reopen file to continue entity "X"; entity terminated |
95 | E | C | Out-of-context data ended X document element (and parse) |
96 | E | C | Short start-tag (no GI) ended X document element (and parse) |
97 | E | D | DSO delimiter (X) omitted from marked section declaration |
98 | E | D | Group token X: duplicate name or name token "Y" ignored |
99 | E | D | Attempt to redefine X attribute ignored |
100 | E | D | X definition ignored: Y is not a valid declared value keyword |
101 | E | D | X definition ignored: NOTATION attribute already defined |
102 | E | D | X definition ignored: ID attribute already defined |
103 | E | D | X definition ignored: no declared value specified |
104 | E | D | X definition ignored: invalid declared value specified |
105 | E | D | X definition ignored: number of names or name tokens in group exceeded GRPCNT limit |
106 | E | D | X definition ignored: name group omitted for NOTATION attribute |
107 | E | D | #CONREF default value treated as #IMPLIED for X ID attribute |
108 | E | D | X definition ignored: Y is not a valid default value keyword |
109 | E | D | X definition ignored: no default value specified |
110 | E | D | X definition ignored: invalid default value specified |
111 | E | D | More than ATTCNT attribute names and/or name (token) values; terminated |
112 | E | D | Attempted redefinition of attribute definition list ignored |
113 | E | Q | Content model token X: more than GRPCNT model group tokens; terminated |
114 | E | Q | Content model token X: more than GRPGTCNT content model tokens; terminated |
115 | E | Q | Content model token X: more than GRPLVL nested model groups; terminated |
116 | E | D | Content model token X: Y invalid; declaration terminated |
117 | E | D | "PUBLIC" specified without public ID; declaration terminated |
118 | E | D | "X" keyword invalid (only Y permitted); declaration terminated |
119 | E | D | "X" specified without notation name; declaration terminated |
120 | E | D | Parameter must be a name; declaration terminated |
121 | E | D | Parameter must be a GI or a group of them; declaration terminated |
122 | E | D | Parameter must be a name or PERO (%); declaration terminated |
123 | E | D | Parameter must be a literal; declaration terminated |
124 | E | D | "X" not valid short reference delimiter; declaration terminated |
125 | E | C | Map does not exist; declaration ignored |
126 | E | D | MDC delimiter (>) expected; following text may be misinterpreted |
127 | C | S | Document ended invalidly within prolog; parsing ended |
128 | E | D | "PUBLIC" or "SYSTEM" or DSO ([) required; declaration terminated |
129 | E | D | Minimization must be "-" or "O" (not "X"); declaration terminated |
130 | E | D | Content model or keyword expected; declaration terminated |
131 | E | D | Rank stem "X" + suffix "Y" more than NAMELEN characters; not defined |
132 | E | C | Undefined X start-tag GI ignored; not used in DTD |
133 | C | S | Document ended invalidly within a markup declaration; parsing ended |
134 | E | Q | Normalized length of literal exceeded X; markup terminated |
135 | E | D | R/CDATA marked section in declaration subset; prolog terminated |
136 | E | Q | X = "Y" ENTITIES attribute ignored: more than GRPCNT in list |
137 | W | D | Content model is ambiguous |
138 | E | S | Invalid parameter entity name "X" |
139 | C | S | Document ended invalidly within a marked section; parsing ended |
140 | | D | Element "X" used in DTD but not defined |
141 | E | S | Invalid NDATA or SUBDOC entity reference occurred; ignored |
142 | E | C | Associated element type not allowed in document instance |
143 | E | C | Illegal DSC character; in different entity from DSO |
144 | E | D | Declared value of data attribute cannot be ID |
145 | E | S | Invalid reference to external CDATA or SDATA entity; ignored |
146 | E | R | Could not find external document type "X" |
147 | E | R | Could not find external general entity "X" |
148 | E | R | Could not find external parameter entity "X" |
149 | E | R | Could not find external notation "X" |
150 | E | R | Could not find entity "X" using default declaration |
151 | E | R | Could not find entity "X" in attribute Y using default declaration |
152 | E | S | Confusing non-SGML character found; ignored |
153 | I | D | End-tag minimization should be "O" for element with CONREF attribute |
154 | E | D | Declared value of data attribute cannot be ENTITY or ENTITIES |
155 | E | D | Declared value of data attribute cannot be IDREF or IDREFS |
156 | E | D | Declared value of data attribute cannot be NOTATION |
157 | E | D | CURRENT cannot be specified for a data attribute |
158 | E | D | CONREF cannot be specified for a data attribute |
159 | E | C | Short reference map for element "X" not defined; ignored |
160 | C | R | Cannot create temporary file |
161 | C | D | Document ended invalidly within SGML declaration |
162 | W | Q | Capacity limit X exceeded by Y points |
163 | W | D | Amendment 1 requires "ISO 8879:1986" instead of "ISO 8879-1986" |
164 | E | D | Non-markup non-minimum data character in SGML declaration |
165 | E | D | Parameter cannot be a literal |
166 | E | D | Invalid concrete syntax scope "X" |
167 | E | D | Parameter must be a number |
168 | E | D | "X" should have been "Y" |
169 | E | U | Character number X is not supported as an additional name character |
170 | E | D | Parameter must be a literal or "X" |
171 | E | D | Bad character description for character X |
172 | W | D | Character number X is described more than once |
173 | E | D | Character number plus number of characters exceeds 256 |
174 | W | D | No description for upper half of character set: assuming "128 128 UNUSED" |
175 | E | D | Character number X was not described; assuming UNUSED |
176 | E | D | Non-significant shunned character number X not declared UNUSED |
177 | E | D | Significant character "X" cannot be non-SGML |
178 | E | U | Unknown capacity set "X" |
179 | E | D | No capacities specified |
180 | E | U | Unknown concrete syntax "X" |
181 | E | D | Character number exceeds 255 |
182 | E | U | Concrete syntax SWITCHES not supported |
183 | E | U | "INSTANCE" scope not supported |
184 | E | D | Value of "X" feature must be one or more |
185 | E | D | "X" invalid; must be "YES" or "NO" |
186 | E | D | "X" invalid; must be "PUBLIC" or "SGMLREF" |
187 | E | U | Feature "X" is not supported |
188 | E | Q | Too many open subdocument entities |
189 | I | D | Invalid formal public identifier |
190 | I | D | Public text class should have been "X" |
191 | W | D | Character number X must be non-SGML |
192 | W | D | Notation "X" not defined in DTD |
193 | W | M | Unclosed start or end tag requires "SHORTTAG YES" |
194 | W | M | Net-enabling start tag requires "SHORTTAG YES" |
195 | W | M | Attribute name omission requires "SHORTTAG YES" |
196 | W | M | Undelimited attribute value requires "SHORTTAG YES" |
197 | W | M | Attribute specification omitted for "X": requires markup minimization |
198 | E | D | Concrete syntax does not have any short reference delimiters |
199 | E | D | Character number X does not exist in the base character set |
200 | E | D | Character number X is UNUSED in the syntax reference character set |
201 | E | D | Character number X was not described in the syntax reference character set |
202 | E | D | Character number X in the syntax reference character set has no corresponding character in the system character set |
203 | E | D | Character number X was described using an unknown base set |
204 | E | D | Duplicate specification for added function "X" |
205 | E | D | Added function character cannot be "X" |
206 | E | U | Only reference concrete syntax function characters supported |
207 | E | U | Only reference concrete syntax general delimiters supported |
208 | E | U | Only reference concrete syntax short reference delimiters supported |
209 | E | D | Unrecognized keyword "X" |
210 | E | D | Unrecognized quantity name "X" |
211 | E | D | Interpretation of "X" is not a valid name in the declared concrete syntax |
212 | E | D | Replacement reserved name "X" cannot be reference reserved name |
213 | E | D | Duplicate replacement reserved name "X" |
214 | E | D | Quantity "X" must not be less than Y |
215 | E | U | Only values up to X are supported for quantity "Y" |
216 | E | C | Exclusions attempt to change required status of group in "X" |
217 | E | C | Exclusion cannot apply to token "X" in content model for "Y" |
218 | E | D | An entity with notation "X" has already been declared |
219 | E | D | UCNMSTRT must have the same number of characters as LCNMSTRT |
220 | E | D | UCNMCHAR must have the same number of characters as LCNMCHAR |
221 | E | D | Character number X assigned to both LCNMSTRT or UCNMSTRT and LCNMCHAR or UCNMCHAR |
222 | E | D | Character number X cannot be an additional name character |
223 | E | U | It is unsupported for "-" not to be assigned to UCNMCHAR or LCNMCHAR |
224 | E | Q | Normalized length of value of attribute "X" exceeded LITLEN |
225 | E | Q | Length of interpreted parameter literal exceeds LITLEN less the length of the bracketing delimiters |
226 | W | M | Start tag of document element omitted; not minimizable |
227 | I | U | Unrecognized designating escape sequence "X" |
228 | I | D | Earlier reference to entity "X" used default entity |
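Because sgmlrgn does not print the severity or type with each message, one way to use the table above is to match the reported message text against the table's templates. The Python sketch below does this for a small excerpt of the table. It assumes that the reported text matches the template word for word once X and Y are replaced, which may not hold for every message. The element names used in the usage note (TITLE, PARA) are hypothetical.

```python
import re

# Excerpt of the table above: (number, severity code, type code, template).
TABLE = [
    (1, "E", "C", "X element not allowed at this point in Y element"),
    (10, "E", "C", "Data not allowed at this point in X element"),
    (33, "C", "R", "Insufficient main memory; unable to continue parsing"),
    (42, "E", "C", "No DOCTYPE declaration; document type is unknown"),
    (137, "W", "D", "Content model is ambiguous"),
]


def template_to_regex(template):
    """Turn a table template into a regex; standalone X and Y match any text."""
    parts = re.split(r"\b([XY])\b", template)
    pattern = "".join(
        "(.+?)" if part in ("X", "Y") else re.escape(part) for part in parts
    )
    return re.compile("^" + pattern + "$")


COMPILED = [(num, sev, typ, template_to_regex(msg)) for num, sev, typ, msg in TABLE]


def lookup(message):
    """Return (number, severity, type) for a reported message, or None."""
    text = message.strip()
    for num, sev, typ, regex in COMPILED:
        if regex.match(text):
            return (num, sev, typ)
    return None
```

Calling `lookup("TITLE element not allowed at this point in PARA element")` identifies the message as entry 1, an error (E) of the context/content type (C). A message that matches no template in the excerpt returns None, so a complete tool would need the full table.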