Last updated | 2002-12-01 11:20:19 EST |
Doc Title | Image Class Data Transformation |
Author 1 | Weise, John |
CVS Revision | $Revision: 1.6 $ |
Import data into FileMaker Pro
Check/prepare required data fields
Create configuration file using FileMaker Pro
Export data from FileMaker Pro to HTML table file
Set-up the Work Space on the UNIX Development Server
Transfer the Exported Files to the UNIX Development Server
Field mappings must now appear in collmgr/colldb as well as in the config file discussed here. Also worth noting is that the convention for abbreviating fields has changed from collid.fldabbrev to collid_fldabbrev. However, you should continue to use collid.fldabbrev in the config file.
This document describes the process of transforming an image database into SGML for online access as part of the DLXS Image Class. The assumption is that there is a database of text records that describe digital image files. The digital images are typically continuous tone (i.e., grayscale or full color), although this is not a requirement.
Technically speaking, these transformation tools, and the transformation process in general, are unsupported by DLXS. They are provided as one possible method for transforming data into SGML for the Image Class. We hope that you find these tools useful, and we would appreciate knowing the extent to which you use them.
The following Perl programs are provided with the DLXS distribution of the Image Class.
First of all, in most cases it is important to load the image files before doing the data transformation. It is OK to proceed without doing this first, but ultimately, it will not be possible to restrict searches to records that absolutely have images if you haven't loaded the images before transforming the data. Please see Image Loading.
Why FileMaker Pro? FileMaker Pro is easy to use, inexpensive, and available for both Macintosh and Windows. More importantly, it is a good tool to use for checking to make sure that the data is as expected. It is also good for making minor alterations to the data, which is sometimes necessary. Also, the data transformation program (idb) expects data to be in the HTML table format that FileMaker exports. The table format is simple and can be generated without FileMaker Pro if necessary. A sample can be seen at $DLXSROOT/prep/s/sampleic/sampleic-data.htm.
Your original data may be in any number of formats including TAB-delimited ASCII text, Microsoft Access, Microsoft Excel and FileMaker Pro. It is not important what format is at hand as long as it can be imported into FileMaker Pro.
So, go ahead, import the data into FileMaker Pro. (Please refer to FileMaker Pro documentation if detailed guidance is needed.)
Tip: The FileMaker import function requires a FileMaker table to be defined first. FileMaker can also "open" a wide variety of formats, in which case it automatically defines the table structure based on the data file being opened.
Tip: When selecting the file for import into FileMaker, be sure to specify the format of the file being imported. This typically makes a difference. FileMaker accepts many formats including tab/comma delimited text and Excel.
Once the data is imported, confirm that it meets the minimal data requirements (please read or review this now).
The configuration file is a FileMaker Pro file that has exactly 4 records and a field corresponding to each of the fields that will be transformed to SGML for online use. The goal here is to create a file that defines the database in a way that the transformation program can understand. The transformation process converts the database data into SGML. The decisions made in the configuration file directly affect accessibility of the data, especially the way in which two or more collections are searched simultaneously and how the results of such a search are displayed.
For example, if the database has 5 fields named ID, Title, Artist, Description, and Filename, and all 5 fields are to be transformed to SGML for online use, then the configuration file must also have 5 fields with the same names. In some cases only a subset of the database's fields are to be used online, in which case the configuration file should have fewer fields.
A great way to start a fresh configuration file is to save a copy of the data file as a clone with no records. This creates an empty database file with the same set of fields as the original data file. When using this method, it is important to redefine all of the fields in the configuration database file simply as "text" (i.e., no calculations, auto-entries, or anything else).
Each of the 4 records of the configuration file serves a different purpose in the transformation process.
The terms "field" and "category" are used synonymously.
Database fields can be of many different types, including numeric, text, date, and calculated. Regardless of the original format of the fields, the fields in the configuration file should all be of the type text.
Record Number | Record Name | Record Purpose |
---|---|---|
1 | Category Name (a.k.a. Field Name or Base Name) | Supplies the transformation process with the common, unabbreviated, name of the field. |
2 | Category Abbreviation | An abbreviated and unique name for the category. |
3 | Category Metaname Mapping | Maps categories to a meta-category, which is used to enable cross-collection searching. |
4 | Category Administrative Mapping | Maps categories to administrative categories for purposes of transformation. Essentially, this is how the transformation program knows which field holds image filename references, for example. |
Now, in greater detail...
Supplies the transformation process with the common, unabbreviated, name of the field.
This one is simple; if the field name is "Title", so is the value of the field. Actually, it is common to use a different field name in the online system. Historically, database field names are sometimes terse or abbreviated. If a different or more descriptive name is desired in the online system, this is the place to do it. Pay attention to case, spaces, and spelling.
Example: Record 1 of Configuration File | |
---|---|
Field Name | Field Value |
ID | ID |
Title | Title |
Creator | Creator |
Location | Location |
View | View |
Date Range | Date Range |
Image Filename | Image Filename |
Record Number 2: Category Abbreviation
An abbreviated and unique name for the category.
For this example, the name of the database/collection is "French Architecture", and its unique abbreviated name is "sampleic" (it used to be "frarch", but has been changed to "sampleic" as a matter of convention).
Example: Record 2 of Configuration File | |
---|---|
Field Name | Field Value |
ID | SAMPLEIC.id |
Title | SAMPLEIC.ti |
Creator | SAMPLEIC.cr |
Location | SAMPLEIC.lo |
View | SAMPLEIC.vi |
Date Range | SAMPLEIC.da |
Image Filename | SAMPLEIC.fn |
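The uniqueness requirement for Record 2 can be checked mechanically before export. The following is a minimal sketch, assuming the Record 2 values have already been copied into a Python dict. The dict, the function name `check_abbreviations`, and the rule that every abbreviation starts with the upper-cased collection id are illustrative assumptions drawn from the example above, not part of the DLXS tools.

```python
from collections import Counter

# Record 2 of the example configuration file: field name -> category abbreviation.
RECORD_2 = {
    "ID": "SAMPLEIC.id",
    "Title": "SAMPLEIC.ti",
    "Creator": "SAMPLEIC.cr",
    "Location": "SAMPLEIC.lo",
    "View": "SAMPLEIC.vi",
    "Date Range": "SAMPLEIC.da",
    "Image Filename": "SAMPLEIC.fn",
}


def check_abbreviations(record, collid):
    """Return a list of problems found in a Record 2 mapping (hypothetical helper)."""
    problems = []
    # Assumption: abbreviations are prefixed with the upper-cased collection id.
    prefix = collid.upper() + "."
    for field, abbrev in record.items():
        if not abbrev.startswith(prefix):
            problems.append(f"{field}: {abbrev!r} does not start with {prefix!r}")
    # Each abbreviation must be assigned to exactly one field.
    for abbrev, count in Counter(record.values()).items():
        if count > 1:
            problems.append(f"{abbrev!r} is assigned to {count} fields")
    return problems


print(check_abbreviations(RECORD_2, "sampleic"))  # [] for the example above
```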
Record Number 3: Category Metaname Mapping
This record maps categories to meta-categories, which are used to enable cross-collection searching. Please see Guidelines for Mapping to Core Categories for Image Services for detailed guidance on mapping.
Example: Record 3 of Configuration File | |
---|---|
Field Name | Field Value |
ID | DC.id |
Title | DC.ti DC.su DLXS.ma |
Creator | DC.cr DLXS.ma |
Location | DC.de |
View | DC.de |
Date Range | DC.da |
Image Filename |
"DC" stands for Dublin Core. The meta-categories are loosely based on Dublin Core categories. "DC.de" is an abbreviation for Dublin Core Description. Since field/category names vary greatly among collections, categories are mapped to the common set of meta-categories. When multiple collections are searched together, searching is done on the meta-categories. Alternatively, a collection may be searched independently by the collection specific categories.
For example, a search across multiple collections using the DC Description field searches all of the collection specific fields that have been mapped to DC Description. In the case of the above example, that is Location and View.
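The relationship between a meta-category and the collection specific fields behind it can be sketched in a few lines. This assumes the Record 3 values are held in a Python dict with space-separated metanames, as in the example table; the helper name `fields_for_metaname` is hypothetical and only illustrates the lookup, not how the middleware performs it.

```python
# Record 3 of the example configuration file: field name -> space-separated metanames.
RECORD_3 = {
    "ID": "DC.id",
    "Title": "DC.ti DC.su DLXS.ma",
    "Creator": "DC.cr DLXS.ma",
    "Location": "DC.de",
    "View": "DC.de",
    "Date Range": "DC.da",
    "Image Filename": "",
}


def fields_for_metaname(record, metaname):
    """List the collection specific fields mapped to one meta-category."""
    return [field for field, value in record.items() if metaname in value.split()]


print(fields_for_metaname(RECORD_3, "DC.de"))  # ['Location', 'View']
```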
While it is highly recommended that you continue to map fields to the Dublin Core categories, it is easy to create and use your own metanames if so desired. If you are using the provided "idb" program for data transformation, then modify the "idb" program to add your metanames and abbreviations to the gGenMetaNamesHash. If you are generating Image Class SGML using some other method, just be sure to include the new metanames as a META element within the GEN element. Then use your mappings in Record 3 of the configuration file. You will also need to use CollMgr to add the new mappings to the relevant collection groups.
Table of Default Metanames | |
---|---|
Metaname Abbreviation | Metaname |
DC.ti | Title |
DC.cr | Creator |
DC.su | Subject |
DC.de | Description |
DC.pu | Publisher |
DC.co | Contributors |
DC.da | Date |
DC.ty | Type |
DC.fo | Format |
DC.id | Identifier |
DC.so | Source |
DC.la | Language |
DC.re | Relation |
DC.co | Coverage |
DC.ri | Rights |
DLXS.ma | Main Entry |
IC.misc | Miscellaneous |
Please see Guidelines for Mapping to Core Categories for Image Services for detailed guidance on mapping. Mapping is an imperfect art. Mappings are not set in stone, and you may choose to change them at a later date for a given database.
Main Entry
Notice "DLXS.ma" in the table above. DLXS.ma is used to identify fields that should be used when displaying results in a cross collection search. It is strongly recommended that each collection have at least one field mapped to DLXS.ma.
Mapping for Sorting
Image Class can sort search results by any collection specific or cross collection field. Cross collection fields pose an interesting challenge since there are often multiple collection specific fields mapped to a single cross collection field. Image Class sorts on the value of the first collection specific field in the list of mappings. As of DLXS Release 11, the middleware uses the colldb/collmgr to obtain field mappings. This means that in addition to preparing mappings for data preparation, they must also be added to the colldb. In collmgr, the cross collection mappings are stored in the form, "dc_de:::sampleic_vi sampleic_lo" and results are sorted on the first collection specific field mapped, which in this example is "sampleic_vi". Nothing special needs to be done in the config file.
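To make the collmgr form concrete, here is a small sketch that splits such an entry and picks out the field used for sorting. It also shows the change from the `collid.fldabbrev` style used in the config file to the `collid_fldabbrev` style. Both function names are hypothetical, and the lower-casing step is an assumption inferred from the "sampleic_vi" example rather than a documented rule.

```python
def parse_collmgr_mapping(entry):
    """Split 'metafield:::field1 field2' into (metafield, [field1, field2])."""
    meta, separator, fields = entry.partition(":::")
    if not separator:
        raise ValueError(f"missing ':::' separator in {entry!r}")
    return meta.strip(), fields.split()


def to_colldb_name(config_abbrev):
    """Convert a config-file abbreviation (SAMPLEIC.vi) to the colldb style.

    Lower-casing is an assumption based on the example in the text.
    """
    return config_abbrev.replace(".", "_", 1).lower()


meta, fields = parse_collmgr_mapping("dc_de:::sampleic_vi sampleic_lo")
print(meta)       # dc_de
print(fields[0])  # sampleic_vi, the field that results are sorted on
print(to_colldb_name("SAMPLEIC.vi"))  # sampleic_vi
```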
Finally
It is especially important to note that all fields, even those not mapped to a meta-category, will be searched when the query is not limited to a specific field or fields.
Record Number 4: Category Administrative Mapping
Maps categories to administrative categories for purposes of transformation to SGML. Essentially, this is how the transformation program knows which field holds image filenames, for example.
Example: Record 4 of Configuration File | |
---|---|
Field Name | Field Value |
ID | IC.id |
Title | IC.vi |
Creator | |
Location | |
View | IC.vi |
Date Range | |
Image Filename | IC.fn |
Table of Default Administrative Names | |
---|---|
Admin Name Abbreviation | Admin Name |
IC.id | ID |
IC.vi | View/Caption |
IC.fn | Image Filename |
DLXS.ea | Entry Auth (please see Image Class Access Control) |
The administrative mappings are most important to the process of transforming the descriptive data from a fielded structure into SGML. The program that transforms the data into SGML is informed of the following by the administrative mappings:
This is a good time to review minimal data requirements.
Additionally, it is important to know that there are significant limitations on the characters that are allowed within SGML IDs. Unique record IDs in image databases can take many different forms and include many different characters. The Image Class transformation process filters illegal SGML ID characters into legal, logical representations of each character in order to ensure legal SGML IDs without hassle. For example, ampersand characters that occur in ID data are changed to "-amp-". This can at times result in very long and very ugly SGML IDs. The unfiltered version of the ID remains searchable and displayable since it is also encoded in a non-ID field in the SGML. If data is encountered that has illegal ID characters that are not filtered properly, contact dlxs-info@umich.edu for guidance.
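The kind of filtering described here can be illustrated with a rough sketch. Only the ampersand rule ("-amp-") comes from this document. The set of legal characters (letters, digits, period, hyphen) is assumed from the SGML reference concrete syntax, and the hexadecimal fallback for other characters is a hypothetical placeholder, not the replacement table that idb actually uses.

```python
import re

# Only the ampersand rule is taken from the text above.
KNOWN_REPLACEMENTS = {"&": "-amp-"}

# Assumption: letters, digits, period and hyphen are the legal ID characters.
ILLEGAL_ID_CHAR = re.compile(r"[^A-Za-z0-9.\-]")


def filter_sgml_id(raw_id):
    """Replace characters that are not legal in an SGML ID (illustrative only)."""
    def replace(match):
        char = match.group(0)
        # Hypothetical fallback: spell out the character code in hexadecimal.
        return KNOWN_REPLACEMENTS.get(char, f"-x{ord(char):02x}-")
    return ILLEGAL_ID_CHAR.sub(replace, raw_id)


print(filter_sgml_id("A&B"))   # A-amp-B
print(filter_sgml_id("12/7"))  # 12-x2f-7
```

Running a helper like this over the ID column before export is one way to spot records whose IDs will become long or hard to read.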
Tips for successful mapping of administrative categories:
Once the data are checked and prepared and the configuration file is ready, it is time to export the data from the data file and the configuration file into HTML files. Later, the transformation program will read the HTML files and create the SGML that will be used by the access system.
Make sure that all of the records to be exported are in FileMaker's current "found set". The "Show All Records" command does the trick unless there are some records you want to have excluded. FileMaker only exports the records that are in the current found set. This is also the time to sort the records. The order of export will be the default sort order for search results online.
It is required that you name the exported database file like this: collid-data.htm
And the exported configuration file like this: collid-config.htm
Using the "French Architecture" collection example, these would be:
sampleic-data.htm
sampleic-config.htm
The next steps in the transformation process are done in the UNIX environment.
There is a standard directory structure for storing collection specific files.
All SGML transformations happen in the $DLXSROOT/prep directory path. For example, the sampleic collection used earlier in this document is at $DLXSROOT/prep/s/sampleic, where "s" is the first letter of the collection abbreviation "sampleic".
If the collection is new, it will be necessary to create the collection directory, and possibly the directory above if there are not yet any collections with the same first letter of the collection id. For example, if the "sampleic" collection is new, and there are no other collections that begin with "s", then both the "s" directory and the "sampleic" directory will need to be created.
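The directory convention is simple enough to script. The sketch below builds the collection path from a collection id and creates any missing directories in one step. The function names are hypothetical, and the demonstration runs against a temporary directory rather than a real $DLXSROOT.

```python
import os
import tempfile


def prep_dir(dlxsroot, collid):
    """Return <dlxsroot>/prep/<first letter of collid>/<collid>."""
    return os.path.join(dlxsroot, "prep", collid[0], collid)


def ensure_prep_dir(dlxsroot, collid):
    """Create the collection directory, and the letter directory if needed."""
    path = prep_dir(dlxsroot, collid)
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrated against a throwaway directory standing in for $DLXSROOT.
with tempfile.TemporaryDirectory() as root:
    created = ensure_prep_dir(root, "sampleic")
    print(os.path.isdir(created))  # True
```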
The collid-data.htm and collid-config.htm files need to be transferred to the UNIX development server for transformation to SGML. SCP is the protocol commonly used for making the transfer. (FTP may also be used, though at Michigan we prefer SCP since it is encrypted.)
As noted above, transfer the files into the $DLXSROOT/prep path, and in to the collection specific directory. For example,
$DLXSROOT/prep/s/sampleic
Tips for successful file transfer:
With the development space set-up and the files transferred in to place, our example "sampleic" directory would look like this.
$DLXSROOT/prep/s/sampleic/sampleic-data.htm
$DLXSROOT/prep/s/sampleic/sampleic-config.htm
The name of the database, the source of the database, and item level access restrictions need to be established in a text file called the Collection Level Information File. The file is required, and must be located and named like this...
$DLXSROOT/prep/c/collabr/collid-info.txt
The file needs just one line with four fields that are delimited by the "#" character. Using the sampleic collection as an example again you would have...
$DLXSROOT/prep/s/sampleic/sampleic-info.txt
... and the contents of the file might be something like...
French Architecture#Images by Rebecca Price#SAMPLEIC
To fully understand the purpose and importance of the third field of the Collection Level Information file, please see Image Class Access Control Summary and Examples Table as well as Image Class Collection Access Restrictions.
Legal values for the fourth field are "summary", "detail" and "both". If the field is excluded, the default is "both".
To take advantage of this functionality it is also necessary to properly prepare and configure the image file fields in the data. Please see Record Number 4: Category Administrative Mapping.
As an example, consider the situation where there is an overview image of a building, and 20 additional detail images, and all of these images together are associated with a single record. By specifying the fourth field as "summary", a search that retrieves the record will display the summary image as the lone result for the record. If the fourth field is "detail", the 20 detail images will display, but not the overview or summary image. If the fourth field is "both", then 21 results will appear, all linked to the single record. In any case, all of the images related to an image are linked from the record if the HTML template includes a "relatedviews" place holder.
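The layout of the Collection Level Information file can be illustrated with a short parsing sketch. It assumes one "#"-delimited line with three fields plus an optional fourth, as described above. The dictionary keys (`name`, `source`, `access`, `image_results`) and the function name are hypothetical labels chosen for this example, not names used by the DLXS tools.

```python
LEGAL_FOURTH_VALUES = ("summary", "detail", "both")


def parse_info_line(line):
    """Parse the single line of a collid-info.txt file (illustrative only)."""
    parts = line.rstrip("\n").split("#")
    if len(parts) not in (3, 4):
        raise ValueError(f"expected 3 or 4 '#'-delimited fields, got {len(parts)}")
    name, source, access = parts[:3]
    # The fourth field is optional; the documented default is "both".
    fourth = parts[3].strip() if len(parts) == 4 else ""
    image_results = fourth or "both"
    if image_results not in LEGAL_FOURTH_VALUES:
        raise ValueError(f"illegal fourth field: {image_results!r}")
    return {"name": name, "source": source,
            "access": access, "image_results": image_results}


info = parse_info_line("French Architecture#Images by Rebecca Price#SAMPLEIC")
print(info["image_results"])  # both
```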
The "idb" program is used to control the transformation, validation, and normalization of the data.
sgmlnorm is not distributed with the DLXS Image Class.
To execute the idb program to transform the sampleic collection data to sgml, do this...
$DLXSROOT/bin/i/image/idb transform sampleic
For testing purposes it is possible to transform a limited number of records instead of the entire database. For example, to process just the first 10 records of a collection, try the following...
$DLXSROOT/bin/i/image/idb transform sampleic 10
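When the transformation is run from a script, the command line can be assembled programmatically. The sketch below only builds the argument list and does not execute anything. The function name and the `/l1/dlxs` root are hypothetical, and restricting the record limit to the "transform" action is an assumption, since the text only shows the limit used there.

```python
import posixpath


def idb_command(dlxsroot, action, collid, limit=None):
    """Build the argument list for an idb run without executing it."""
    if action not in ("transform", "norm"):
        raise ValueError(f"unexpected idb action: {action!r}")
    command = [posixpath.join(dlxsroot, "bin", "i", "image", "idb"), action, collid]
    if limit is not None:
        # Assumption: a record limit is only meaningful for "transform".
        if action != "transform":
            raise ValueError("a record limit is only shown for 'transform'")
        command.append(str(limit))
    return command


print(idb_command("/l1/dlxs", "transform", "sampleic", 10))
# ['/l1/dlxs/bin/i/image/idb', 'transform', 'sampleic', '10']
```

Passing the resulting list to Python's `subprocess.run(command, check=True)` would execute it on a machine where idb is installed.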
When the program is done running, it will present a report. This report and several other reports on the process are saved in the prep directory and are useful for troubleshooting.
Most importantly, the process creates an SGML file.
The SGML file uses the following name convention...
ic.collid.unnorm.sgm
For the "sampleic" collection, that would be...
ic.sampleic.unnorm.sgm
Viewing and Assessing the SGML
You can view the file with the UNIX command "less" (if your system doesn't have the program "less", try "more"). For example,
less ic.sampleic.unnorm.sgm
You can also use "less" and "more" to view the other files output by the transformation process.
Things to look for in the SGML:
Do ENTRY elements have IDs?
<ENTRY ENTRYID="x-34" COLLID="MCsampleic" CA="sampleic">
If not, then check that the ID field (IC.id) is properly configured and that the data has values in the ID field.
Are there appropriate Entry Auth values?
<ENTRYAUTH MALLOW="SAMPLEIC">
If not, then check the Collection Level Information file (collid-info.txt) and the configuration of the Entry Auth field (DLXS.ea) if there is one.
Do all of the attributes of the ISTRUCT have values? Does the M attribute have an image file name when it should? Does the MS attribute have a "P" value in cases where an image file is present on the disk and an "N" value when the image file is not present?
<ISTRUCT ISENTRYID="s-sampleic-34-1" STID="1" FACE="FRONT" STTY="SUMM" X="1" Y="1" MT="IMAGE" MS="P" M="0034">distant view from Avranches</ISTRUCT>
If not, check the configuration of all image filename (IC.fn) and image view (IC.vi) fields. Has the imageprep program been run to create an index directory for the images, and is the index directory located on the machine that the transformation is being done on?
Are field values present?
<C CN="SAMPLEIC.lo" CM="DC.cr">Normandy, France</C>
Empty field instances are not included in the SGML, but if the field is not empty and not showing in the SGML, check the configuration of the field. Make sure that the category abbreviation (e.g., SAMPLEIC.lo) was not assigned to more than one field.
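Some of these checks can be automated for large files. The following is a rough sketch that counts ENTRY elements without an ENTRYID and tallies the MS values of ISTRUCT elements. It assumes the tags and attributes are written as in the samples above (upper-case names, double-quoted values), which may not hold for every file. The function name and report keys are hypothetical.

```python
import re

ENTRY_TAG = re.compile(r"<ENTRY\b([^>]*)>")      # does not match <ENTRYAUTH ...>
ISTRUCT_TAG = re.compile(r"<ISTRUCT\b([^>]*)>")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def audit_sgml(text):
    """Count missing entry IDs and present/missing image flags (illustrative)."""
    report = {"entries": 0, "entries_without_id": 0,
              "images_present": 0, "images_missing": 0}
    for match in ENTRY_TAG.finditer(text):
        attrs = dict(ATTRIBUTE.findall(match.group(1)))
        report["entries"] += 1
        if not attrs.get("ENTRYID"):
            report["entries_without_id"] += 1
    for match in ISTRUCT_TAG.finditer(text):
        attrs = dict(ATTRIBUTE.findall(match.group(1)))
        if attrs.get("MS") == "P":
            report["images_present"] += 1
        elif attrs.get("MS") == "N":
            report["images_missing"] += 1
    return report


SAMPLE = (
    '<ENTRY ENTRYID="x-34" COLLID="MCsampleic" CA="sampleic">\n'
    '<ISTRUCT ISENTRYID="s-sampleic-34-1" STID="1" FACE="FRONT" STTY="SUMM" '
    'X="1" Y="1" MT="IMAGE" MS="P" M="0034">distant view from Avranches</ISTRUCT>'
)
print(audit_sgml(SAMPLE))
```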
Normalization and Cleaning of the SGML
After transformation of data to SGML, the SGML must be normalized and validated. The SGML must be checked against the Image Class DTD (SGML Document Type Definition) to make sure the SGML is valid. At the same time, the SGML is normalized, which means that the tagging is made as consistent as possible within the SGML file. These processes together are referred to as "normalization", though again it includes validation and cleaning too.
The idb command is used as follows to normalize the "sampleic" collection SGML.
$DLXSROOT/bin/i/image/idb norm sampleic
The output is a new, normalized and validated, SGML file in the form...
ic.collid.norm.sgm
The normalization process may also generate errors. In the majority of instances, the errors are caused by illegal characters in the SGML. On the one hand, it is often possible to ignore illegal characters and successfully build an index and begin searching. On the other hand, it is worthwhile to fix the illegal characters so that they do not interfere with searching and the display of results to the user.
There are two basic ways to handle illegal characters in the Image Class:
A nice feature of FileMaker Pro's HTML export function is that it normalizes decimal character value differences that are due to differences among fonts. The HTML has HTML decimal character entities, which the transformation program easily converts to legal decimal values in the SGML. Beyond this it is still necessary to convert non ISO 8859 Latin 1 Compliant characters to compliant equivalents or suitable replacements with the character conversion file.
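Identifying the characters that fall outside ISO 8859 Latin 1 can be partly automated. The sketch below decodes the decimal character entities in a piece of exported HTML and lists the code points that Latin 1 cannot represent. It uses Python's standard `html.unescape`; the function name and the sample string are made up for illustration, and this is not how the transformation program itself handles entities.

```python
import html


def non_latin1_characters(html_text):
    """Return the code points, as 'U+XXXX' strings, that Latin 1 cannot hold."""
    decoded = html.unescape(html_text)
    return sorted({f"U+{ord(char):04X}" for char in decoded if ord(char) > 0xFF})


# Hypothetical sample: two Latin 1 letters and a pair of curly quotes.
sample = "Ch&#226;teau &#8220;view&#8221; caf&#233;"
print(non_latin1_characters(sample))  # ['U+201C', 'U+201D']
```

The characters reported this way are candidates for entries in the character conversion file.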
The hardest part about creating a character conversion file is identifying the illegal characters and specifying suitable replacements. This topic is discussed in detail in the document Image Class Character Set Conversion.