Last updated 2002-12-01 11:20:19 EST
Doc Title Image Class Data Transformation
Author 1 Weise, John
CVS Revision $Revision: 1.6 $
Image Class Data Transformation

dlxs-info@umich.edu

Contents

Release Notes: DLXS 11

Introduction

Import data into FileMaker Pro

Check/prepare required data fields

Create configuration file using FileMaker Pro

Export data from FileMaker Pro to HTML table file

Set-up the Work Space on the UNIX Development Server

Transfer the Exported Files to the UNIX Development Server

Collection Level Information

Transform the Data into SGML

Release Notes

Field mappings must now appear in collmgr/colldb as well as in the config file discussed here. Also worth noting is that the convention for abbreviating fields has changed from collid.fldabbrev to collid_fldabbrev. However, you should continue to use collid.fldabbrev in the config file.

Introduction

This document describes the process of transforming an image database into SGML for online access as part of the DLXS Image Class. The assumption is that there is a database of text records that describe digital image files. The digital images are typically continuous tone (i.e., grayscale or full color), although this is not a requirement.

Technically speaking, these transformation tools, and the transformation process in general, are unsupported by DLXS. They are provided as one possible method for transforming data into SGML for the Image Class. We hope that you find these tools useful, and we would appreciate knowing the extent to which you use them.

Commercial Software Required

DLXS Software Required

The following Perl programs are provided with the DLXS distribution of the Image Class.

James Clark Freeware Required

Operating System

Import data into FileMaker Pro

First of all, in most cases it is important to load the image files before doing the data transformation. It is OK to proceed without doing this first, but ultimately, it will not be possible to restrict searches to records that absolutely have images if you haven't loaded the images before transforming the data. Please see Image Loading.

Why FileMaker Pro? FileMaker Pro is easy to use, inexpensive, and available for both Macintosh and Windows. More importantly, it is a good tool to use for checking to make sure that the data is as expected. It is also good for making minor alterations to the data, which is sometimes necessary. Also, the data transformation program (idb) expects data to be in the HTML table format that FileMaker exports. The table format is simple and can be generated without FileMaker Pro if necessary. A sample can be seen at $DLXSROOT/prep/s/sampleic/sampleic-data.htm.

Your original data may be in any number of formats including TAB-delimited ASCII text, Microsoft Access, Microsoft Excel and FileMaker Pro. It is not important what format is at hand as long as it can be imported into FileMaker Pro.

So, go ahead, import the data into FileMaker Pro. (Please refer to FileMaker Pro documentation if detailed guidance is needed.)

Tip: The FileMaker import function requires a FileMaker table to be defined first. FileMaker can also "open" a wide variety of formats, in which case it automatically defines the table structure based on the data file being opened.

Tip: When selecting the file for import into FileMaker, be sure to specify the format of the file being imported. This typically makes a difference. FileMaker accepts many formats including tab/comma delimited text and Excel.

Check/prepare required data fields

Once the data is imported, confirm that it meets the minimal data requirements (please read or review this now).

Create configuration file using FileMaker Pro

The configuration file is a FileMaker Pro file that has exactly 4 records and a field corresponding to each of the fields that will be transformed to SGML for online use. The goal here is to create a file that defines the database in a way that the transformation program can understand. The transformation process converts the database data into SGML. The decisions made in the configuration file directly affect accessibility of the data, especially the way in which two or more collections are searched simultaneously and how the results of such a search are displayed.

For example, if the database has 5 fields named ID, Title, Artist, Description, and Filename, and all 5 fields are to be transformed to SGML for online use, then the configuration file must also have 5 fields with the same names. In some cases only a subset of the database's fields are to be used online, in which case the configuration file should have fewer fields.

A great way to start a fresh configuration file is to save a copy of the data file as a clone with no records. This creates an empty database file with the same set of fields as the original data file. When using this method, it is important to redefine all of the fields in the configuration database file simply as "text" (i.e., no calculations, auto-entries, or anything else).

Each of the 4 records of the configuration file serves a different purpose in the transformation process.

The terms "field" and "category" are used synonymously.

Database fields can be of many different types, including numeric, text, date, and calculated. Regardless of the original format of the fields, the fields in the configuration file should all be of the type text.

Record Number Record Name Record Purpose
1 Category Name (a.k.a. Field Name or Base Name) Supplies the transformation process with the common, unabbreviated, name of the field.
2 Category Abbreviation An abbreviated and unique name for the category.
3 Category Metaname Mapping Maps categories to a meta-category, which is used to enable cross-collection searching.
4 Category Administrative Mapping Maps categories to administrative categories for purposes of transformation. Essentially, this is how the transformation program knows which field holds image filename references, for example.

Now, in greater detail...

Record Number 1: Category Name

Supplies the transformation process with the common, unabbreviated, name of the field.

This one is simple; if the field name is "Title", so is the value of the field. Actually, it is common to use a different field name in the online system. Historically, database field names are sometimes terse or abbreviated. If a different or more descriptive name is desired in the online system, this is the place to do it. Pay attention to case, spaces, and spelling.

Example: Record 1 of Configuration File
Field Name Field Value
ID ID
Title Title
Creator Creator
Location Location
View View
Date Range Date Range
Image Filename Image Filename

Record Number 2: Category Abbreviation

An abbreviated and unique name for the category.

For this example, the name of the database/collection is "French Architecture", and its unique abbreviated name is "sampleic" (it used to be "frarch", but has been changed to "sampleic" as a matter of convention).

Example: Record 2 of Configuration File
Field Name Field Value
ID SAMPLEIC.id
Title SAMPLEIC.ti
Creator SAMPLEIC.cr
Location SAMPLEIC.lo
View SAMPLEIC.vi
Date Range SAMPLEIC.da
Image Filename SAMPLEIC.fn
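
Because each category abbreviation must be unique, it can help to check Record 2 for duplicates before exporting. The following is a minimal sketch in Python, assuming the values of Record 2 have been copied by hand into a dictionary. The helper name duplicate_abbreviations is hypothetical and is not part of the DLXS distribution.

```python
from collections import Counter

# Record 2 of the sample configuration file: field name -> category abbreviation.
record2 = {
    "ID": "SAMPLEIC.id",
    "Title": "SAMPLEIC.ti",
    "Creator": "SAMPLEIC.cr",
    "Location": "SAMPLEIC.lo",
    "View": "SAMPLEIC.vi",
    "Date Range": "SAMPLEIC.da",
    "Image Filename": "SAMPLEIC.fn",
}

def duplicate_abbreviations(record):
    """Return abbreviations assigned to more than one field (hypothetical helper)."""
    counts = Counter(record.values())
    return sorted(abbr for abbr, n in counts.items() if n > 1)

# An empty list means every field has its own abbreviation.
print(duplicate_abbreviations(record2))
```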

Record Number 3: Category Metaname Mapping

This record maps categories to meta-categories, which are used to enable cross-collection searching. Please see Guidelines for Mapping to Core Categories for Image Services for detailed guidance on mapping.

Example: Record 3 of Configuration File
Field Name Field Value
ID DC.id
Title DC.ti DC.su DLXS.ma
Creator DC.cr DLXS.ma
Location DC.de
View DC.de
Date Range DC.da
Image Filename

"DC" stands for Dublin Core. The meta-categories are loosely based on Dublin Core categories. "DC.de" is an abbreviation for Dublin Core Description. Since field/category names vary greatly among collections, categories are mapped to the common set of meta-categories. When multiple collections are searched together, searching is done on the meta-categories. Alternatively, a collection may be searched independently by the collection specific categories.

For example, a search across multiple collections using the DC Description field searches all of the collection specific fields that have been mapped to DC Description. In the case of the above example, that is Location and View.
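
The relationship between collection specific fields and meta-categories can be illustrated by inverting Record 3. This is only a sketch in Python of the idea, not how the DLXS middleware is implemented; the dictionary is the sample Record 3 typed in by hand, and fields_by_metaname is a hypothetical helper name.

```python
# Record 3 of the sample configuration file: field name -> space-separated metanames.
record3 = {
    "ID": "DC.id",
    "Title": "DC.ti DC.su DLXS.ma",
    "Creator": "DC.cr DLXS.ma",
    "Location": "DC.de",
    "View": "DC.de",
    "Date Range": "DC.da",
    "Image Filename": "",
}

def fields_by_metaname(record):
    """Invert Record 3: metaname -> list of collection fields mapped to it."""
    inverted = {}
    for field, metanames in record.items():
        for meta in metanames.split():
            inverted.setdefault(meta, []).append(field)
    return inverted

mapping = fields_by_metaname(record3)
print(mapping["DC.de"])    # ['Location', 'View']
print(mapping["DLXS.ma"])  # ['Title', 'Creator']
```

A field with no metaname, such as Image Filename here, simply does not appear under any meta-category.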

While it is highly recommended that you continue to map fields to the Dublin Core categories, it is easy to create and use your own metanames if so desired. If you are using the provided "idb" program for data transformation, then modify the "idb" program to add your metanames and abbreviations to the gGenMetaNamesHash. If you are generating Image Class SGML using some other method, just be sure to include the new metanames as a META element within the GEN element. Then use your mappings in Record 3 of the configuration file. You will also need to use CollMgr to add the new mappings to the relevant collection groups.

Table of Default Metanames
Metaname Abbreviation Metaname
DC.ti Title
DC.cr Creator
DC.su Subject
DC.de Description
DC.pu Publisher
DC.co Contributors
DC.da Date
DC.ty Type
DC.fo Format
DC.id Identifier
DC.so Source
DC.la Language
DC.re Relation
DC.co Coverage
DC.ri Rights
DLXS.ma Main Entry
IC.misc Miscellaneous

Please see Guidelines for Mapping to Core Categories for Image Services for detailed guidance on mapping. Mapping is an imperfect art. Mappings are not set in stone, and you may choose to change them at a later date for a given database.

Main Entry

Notice "DLXS.ma" in the table above. DLXS.ma is used to identify fields that should be used when displaying results in a cross collection search. It is strongly recommended that each collection have at least one field mapped to DLXS.ma.

Mapping for Sorting

Image Class can sort search results by any collection specific or cross collection field. Cross collection fields pose an interesting challenge since there are often multiple collection specific fields mapped to a single cross collection field. Image Class sorts on the value of the first collection specific field in the list of mappings. As of DLXS Release 11, the middleware uses the colldb/collmgr to obtain field mappings. This means that in addition to preparing mappings for data preparation, they must also be added to the colldb. In collmgr, the cross collection mappings are stored in the form, "dc_de:::sampleic_vi sampleic_lo" and results are sorted on the first collection specific field mapped, which in this example is "sampleic_vi". Nothing special needs to be done in the config file.
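
As an illustration of the sorting rule, the following Python sketch splits a mapping string of the form shown above and picks the first collection specific field. The function names are hypothetical, and the parsing assumes that ":::" separates the cross collection field from a space-separated list of collection specific fields, as in the example string.

```python
def parse_collmgr_mapping(entry):
    """Split 'meta:::field1 field2' into (meta, [field1, field2])."""
    meta, _, fields = entry.partition(":::")
    return meta, fields.split()

def sort_field(entry):
    """Return the first collection specific field, which is the one sorted on."""
    _, fields = parse_collmgr_mapping(entry)
    return fields[0] if fields else None

print(sort_field("dc_de:::sampleic_vi sampleic_lo"))  # sampleic_vi
```

To sort on a different field, list that field first in the mapping.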

Finally

It is especially important to note that all fields, even those not mapped to a meta-category, will be searched when the query is not limited to a specific field or fields.

Record Number 4: Category Administrative Mapping

Maps categories to administrative categories for purposes of transformation to SGML. Essentially, this is how the transformation program knows which field holds image filenames, for example.

Example: Record 4 of Configuration File
Field Name Field Value
ID IC.id
Title IC.vi
Creator
Location  
View IC.vi
Date Range
Image Filename IC.fn

Table of Default Administrative Names
Admin Name Abbreviation Admin Name
IC.id ID
IC.vi View/Caption
IC.fn Image Filename
DLXS.ea Entry Auth (please see Image Class Access Control)

The administrative mappings are most important to the process of transforming the descriptive data from a fielded structure into SGML. The program that transforms the data into SGML is informed of the following by the administrative mappings:

  1. which of the database fields holds the data that should be used as the unique identifier for each record;
  2. which field holds image filename;
  3. which field holds captions or view information.

This is a good time to review minimal data requirements.

Additionally, it is important to know that there are significant limitations on the characters that are allowed within SGML IDs. Unique record IDs in image databases can take many different forms and include many different characters. To ensure legal SGML IDs without hassle, the Image Class transformation process filters illegal SGML ID characters into legal, logical representations of the characters. For example, ampersand characters that occur in ID data are changed to "-amp-". This can at times result in very long and very ugly SGML IDs. The unfiltered version of the ID remains searchable and displayable since it is also encoded in a non-ID field in the SGML. If data is encountered that has illegal ID characters that are not filtered properly, contact dlxs-info@umich.edu for guidance.
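
To preview how IDs might be affected, the filtering idea can be sketched in Python as follows. This is not the code used by the transformation program. Only the ampersand rule comes from the description above, and the set of legal characters is an assumption based on the name characters of the SGML reference concrete syntax (letters, digits, "." and "-"). The function name filter_id is hypothetical.

```python
import re

# Only the ampersand rule is documented in the text; other rules used by the
# real transformation program are not listed there and are not guessed here.
REPLACEMENTS = {"&": "-amp-"}

# Assumption: IDs are limited to the name characters of the SGML reference
# concrete syntax (letters, digits, "." and "-").
ILLEGAL = re.compile(r"[^A-Za-z0-9.\-]")

def filter_id(raw_id):
    """Apply the documented replacement, then list characters still not legal."""
    filtered = "".join(REPLACEMENTS.get(ch, ch) for ch in raw_id)
    leftover = sorted(set(ILLEGAL.findall(filtered)))
    return filtered, leftover

print(filter_id("A&B-12"))  # ('A-amp-B-12', [])
print(filter_id("A&B/12"))  # ('A-amp-B/12', ['/'])
```

Any characters reported in the second element are candidates to review in the source data before transformation.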

Tips for successful mapping of administrative categories:

Export data from FileMaker Pro to HTML table file

Once the data are checked and prepared and the configuration file is ready, it is time to export the data from the data file and the configuration file into HTML files. Later, the transformation program will read the HTML files and create the SGML that will be used by the access system.

Make sure that all of the records to be exported are in FileMaker's current "found set". The "Show All Records" command does the trick unless there are some records you want to have excluded. FileMaker only exports the records that are in the current found set. This is also the time to sort the records. The order of export will be the default sort order for search results online.

It is required that you name the exported database file like this: collid-data.htm

And the exported configuration file like this: collid-config.htm

Using the "French Architecture" collection example, these would be sampleic-data.htm and sampleic-config.htm.

Tips for successful data export:

Set-up the Work Space on the UNIX Development Server

The next steps in the transformation process are done in the UNIX environment.

There is a standard directory structure for storing collection specific files.

All SGML transformations happen in the $DLXSROOT/prep directory path. For example, the sampleic collection used earlier in this document is at $DLXSROOT/prep/s/sampleic
where "s" is the first letter of the collection abbreviation "sampleic".

If the collection is new, it will be necessary to create the collection directory, and possibly the directory above if there are not yet any collections with the same first letter of the collection id. For example, if the "sampleic" collection is new, and there are no other collections that begin with "s", then both the "s" directory and the "sampleic" directory will need to be created.
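
The directory convention can be sketched in Python as shown below. The helper names are hypothetical, and "/dlxsroot" is a placeholder for the real value of $DLXSROOT on your server. The same result can be achieved by creating the directories by hand.

```python
from pathlib import Path

def collection_prep_dir(dlxsroot, collid):
    """Build $DLXSROOT/prep/<first letter>/<collid> for a collection."""
    return Path(dlxsroot) / "prep" / collid[0] / collid

def ensure_prep_dir(dlxsroot, collid):
    """Create the collection directory and any missing parent directories."""
    path = collection_prep_dir(dlxsroot, collid)
    path.mkdir(parents=True, exist_ok=True)
    return path

# "/dlxsroot" is a placeholder, not a real installation path.
print(collection_prep_dir("/dlxsroot", "sampleic").as_posix())
```

Calling ensure_prep_dir creates both the first-letter directory and the collection directory when neither exists, and does nothing when they are already present.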

Transfer the Exported Files to the UNIX Development Server

The collid-data.htm and collid-config.htm files need to be transferred to the UNIX development server for transformation to SGML. SCP is the protocol commonly used for making the transfer. (FTP may also be used, though at Michigan we prefer SCP since it is encrypted.)

As noted above, transfer the files into the $DLXSROOT/prep path, and into the collection specific directory. For example,

$DLXSROOT/prep/s/sampleic
Tips for successful file transfer:

With the development space set up and the files transferred into place, our example "sampleic" directory would look like this.

$DLXSROOT/prep/s/sampleic/sampleic-data.htm
$DLXSROOT/prep/s/sampleic/sampleic-config.htm

Collection Level Information

The name of the database, the source of the database, and item level access restrictions need to be established in a text file called the Collection Level Information File. The file is required, and must be located and named like this...

$DLXSROOT/prep/c/collabr/collid-info.txt

The file needs just one line with up to four fields that are delimited by the "#" character; the fourth field is optional. Using the sampleic collection as an example again, you would have...

$DLXSROOT/prep/s/sampleic/sampleic-info.txt

... and the contents of the file might be something like...

French Architecture#Images by Rebecca Price#SAMPLEIC

To fully understand the purpose and importance of the third field of the Collection Level Information file, please see Image Class Access Control Summary and Examples Table as well as Image Class Collection Access Restrictions.

Legal values for the fourth field are "summary", "detail" and "both". If the field is excluded, the default is "both".

To take advantage of this functionality it is also necessary to properly prepare and configure the image file fields in the data. Please see Record Number 4: Category Administrative Mapping.
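
The layout of the Collection Level Information File can be illustrated with a short Python sketch that splits the line into its fields. This is not how idb reads the file. The dictionary keys are illustrative names chosen here, and rejecting lines with fewer than three fields is an assumption.

```python
def parse_collection_info(line):
    """Split one '#'-delimited line into named fields (names are illustrative)."""
    parts = line.rstrip("\n").split("#")
    if not 3 <= len(parts) <= 4:
        raise ValueError("expected three or four '#'-delimited fields")
    name, source, access = parts[:3]
    # The fourth field is optional; the text gives "both" as the default.
    display = parts[3] if len(parts) == 4 and parts[3] else "both"
    if display not in ("summary", "detail", "both"):
        raise ValueError("fourth field must be summary, detail, or both")
    return {"name": name, "source": source, "access": access, "display": display}

info = parse_collection_info("French Architecture#Images by Rebecca Price#SAMPLEIC")
print(info["display"])  # both
```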

As an example, consider the situation where there is an overview image of a building, and 20 additional detail images, and all of these images together are associated with a single record. By specifying the fourth field as "summary", a search that retrieves the record will display the summary image as the lone result for the record. If the fourth field is "detail", the 20 detail images will display, but not the overview or summary image. If the fourth field is "both", then 21 results will appear, all linked to the single record. In any case, all of the images related to a record are linked from the record if the HTML template includes a "relatedviews" place holder.
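
The arithmetic of this example can be written out as a small Python sketch. The function name result_count is hypothetical; it only restates the rule described above and is not part of the Image Class middleware.

```python
def result_count(display, summary_images, detail_images):
    """Number of results shown for one record under each fourth-field value."""
    if display == "summary":
        return summary_images
    if display == "detail":
        return detail_images
    if display == "both":
        return summary_images + detail_images
    raise ValueError("display must be summary, detail, or both")

# One overview image and 20 detail images, as in the example above.
for mode in ("summary", "detail", "both"):
    print(mode, result_count(mode, 1, 20))
```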

Transform the Data into SGML

The "idb" program is used to control the transformation, validation, and normalization of the data.

sgmlnorm is not distributed with the DLXS Image Class.

To execute the idb program to transform the sampleic collection data to sgml, do this...

$DLXSROOT/bin/i/image/idb transform sampleic

For testing purposes it is possible to transform a limited number of records instead of the entire database. For example, to process just the first 10 records of a collection, try the following...

$DLXSROOT/bin/i/image/idb transform sampleic 10
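
If the transformation is run from a script, the command line can be assembled as in the Python sketch below. The helper names are hypothetical and "/dlxsroot" is a placeholder for $DLXSROOT. Whether idb signals failure through a nonzero exit status is an assumption, so the reports it writes should still be checked.

```python
import os
import subprocess

def idb_command(dlxsroot, action, collid, limit=None):
    """Build the idb argument list; 'limit' is the optional record count."""
    cmd = [os.path.join(dlxsroot, "bin", "i", "image", "idb"), action, collid]
    if limit is not None:
        cmd.append(str(limit))
    return cmd

def run_idb(dlxsroot, action, collid, limit=None):
    """Run idb; raises CalledProcessError if the exit status is nonzero."""
    return subprocess.run(idb_command(dlxsroot, action, collid, limit), check=True)

# Only builds and prints the argument list; nothing is executed here.
print(idb_command("/dlxsroot", "transform", "sampleic", 10))
```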

When the program is done running, it will present a report. The report, along with several other reports on the process, is saved in the prep directory and is useful for troubleshooting.

Most importantly, the process creates an SGML file.

The SGML file uses the following name convention...

ic.collid.unnorm.sgm

For the "sampleic" collection, that would be...

ic.sampleic.unnorm.sgm

Viewing and Assessing the SGML

You can view the file with the UNIX command "less" (if your system doesn't have the program "less", try "more"). For example,

less ic.sampleic.unnorm.sgm

You can also use "less" and "more" to view the other files output by the transformation process.

Things to look for in the SGML:

Do ENTRY elements have IDs?

<ENTRY ENTRYID="x-34" COLLID="MCsampleic" CA="sampleic">

If not, then check that the ID field (IC.id) is properly configured and that the data has values in the ID field.

Are there appropriate Entry Auth values?

<ENTRYAUTH MALLOW="SAMPLEIC">

If not, then check the Collection Level Information file (collid-info.txt) and the configuration of the Entry Auth field (DLXS.ea) if there is one.

Do all of the attributes of the ISTRUCT have values? Does the M attribute have an image file name when it should? Does the MS attribute have a "P" value in cases where an image file is present on the disk and an "N" value when the image file is not present?

<ISTRUCT ISENTRYID="s-sampleic-34-1" STID="1" FACE="FRONT" STTY="SUMM" X="1" Y="1" MT="IMAGE" MS="P" M="0034">distant view from Avranches</ISTRUCT>

If not, check the configuration of all image filename (IC.fn) and image view (IC.vi) fields. Has the imageprep program been run to create an index directory for the images, and is the index directory located on the machine that the transformation is being done on?

Are field values present?

<C CN="SAMPLEIC.lo" CM="DC.de">Normandy, France</C>

Empty field instances are not included in the SGML, but if the field is not empty and not showing in the SGML, check the configuration of the field. Make sure that the category abbreviation (e.g., SAMPLEIC.lo) was not assigned to more than one field.
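
Some of these checks can be automated. The following Python sketch scans SGML text for ENTRY start tags without an ENTRYID and tallies the MS values of ISTRUCT elements. It assumes that attribute values are double-quoted as in the examples above. It is a simple pattern match, not an SGML parser, and the function names are hypothetical.

```python
import re
from collections import Counter

# \b keeps <ENTRY ...> from also matching <ENTRYAUTH ...>.
ENTRY_TAG = re.compile(r"<ENTRY\b[^>]*>")
ENTRYID_ATTR = re.compile(r'\bENTRYID="([^"]+)"')
MS_ATTR = re.compile(r'<ISTRUCT\b[^>]*\bMS="([^"]*)"')

def entries_missing_id(sgml_text):
    """Return ENTRY start tags that lack a non-empty ENTRYID attribute."""
    return [tag for tag in ENTRY_TAG.findall(sgml_text)
            if not ENTRYID_ATTR.search(tag)]

def ms_counts(sgml_text):
    """Count ISTRUCT elements by MS value (for example "P" or "N")."""
    return Counter(MS_ATTR.findall(sgml_text))

sample = (
    '<ENTRY ENTRYID="x-34" COLLID="MCsampleic" CA="sampleic">\n'
    '<ENTRYAUTH MALLOW="SAMPLEIC">\n'
    '<ISTRUCT ISENTRYID="s-sampleic-34-1" STID="1" MS="P" M="0034">view</ISTRUCT>\n'
    '<ENTRY COLLID="MCsampleic" CA="sampleic">\n'
)
print(entries_missing_id(sample))
print(dict(ms_counts(sample)))
```

To check a real file, read ic.sampleic.unnorm.sgm into a string and pass it to the same functions.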

Normalization and Cleaning of the SGML

After transformation of data to SGML, the SGML must be normalized and validated. The SGML must be checked against the Image Class DTD (SGML Document Type Definition) to make sure the SGML is valid. At the same time, the SGML is normalized, which means that the tagging is made as consistent as possible within the SGML file. These processes together are referred to as "normalization", though again this includes validation and cleaning too.

The idb command is used as follows to normalize the "sampleic" collection SGML.

$DLXSROOT/bin/i/image/idb norm sampleic

The output is a new, normalized and validated, SGML file in the form...

ic.collid.norm.sgm 

The normalization process may also generate errors. In the majority of instances, the errors are caused by illegal characters in the SGML. On the one hand, it is often possible to ignore illegal characters and successfully build an index and begin searching. On the other hand, it is worthwhile to fix the illegal characters so that they do not interfere with searching and the display of results to the user.

There are two basic ways to handle illegal characters in the Image Class:

  1. Edit the source data so that all characters are ISO 8859 Latin 1 Compliant.
  2. Create a character conversion file that the transformation program can use to convert specific characters to a compliant substitute.

A nice feature of FileMaker Pro's HTML export function is that it normalizes decimal character value differences that are due to differences among fonts. The HTML has HTML decimal character entities, which the transformation program easily converts to legal decimal values in the SGML. Beyond this it is still necessary to convert non ISO 8859 Latin 1 Compliant characters to compliant equivalents or suitable replacements with the character conversion file.
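
Finding the characters that need attention can be approached as in the Python sketch below. It decodes HTML decimal character entities with the standard library and then lists the characters that cannot be encoded as ISO 8859-1. This is an aid for inspecting data, not part of the transformation program, and the function name is hypothetical. Note that Python's "latin-1" codec also accepts the control range U+0080 to U+009F, so this check will not flag those code points.

```python
import html

def non_latin1_characters(text):
    """Return the distinct characters that cannot be encoded as ISO 8859-1."""
    bad = []
    for ch in text:
        try:
            ch.encode("latin-1")
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad

# html.unescape turns decimal entities such as &#226; into the character itself.
decoded = html.unescape("Ch&#226;teau &#8220;view&#8221;")
print([f"U+{ord(ch):04X}" for ch in non_latin1_characters(decoded)])
# ['U+201C', 'U+201D']
```

Here the "a" with circumflex (decimal 226) is within Latin 1, while the curly quotation marks (decimal 8220 and 8221) are not and would need replacements.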

The hardest part about creating a character conversion file is identifying the illegal characters and specifying suitable replacements. This topic is discussed in detail in the document Image Class Character Set Conversion.