4 Standardization

We identify the following crucial areas for standardization: data format, data documentation, data provision, and data preservation.

In each case, a balance must be found between the need to ensure that resources remain usable over long periods of time and in many different contexts, and the need to adapt rapidly to changing technology (and fashions) in a rapidly developing field. Fortunately, this is a problem by no means unique to arts and humanities data users or providers.

4.1 Standards Reference Guide

The AHDS will need to identify and specify the standards considered relevant in each of the above-mentioned areas. We propose that an early work item should be the production of a technical report, which will form the basis of a Standards Reference Guide (SRG) defining levels of standardization mandatory on all AHDS-affiliated data services and data sets. Important input to this process will be the FIGIT Framework Program for Standards activities. The SRG will of necessity be an extensible and changing document, but its core provisions should be relatively easy to define.

Some basic principles underlying the SRG can be stated at the outset:

The existence of the SRG will help individual resource providers position their services with respect to overall AHDS policies, and hence to define meaningful service level agreements. It will also provide agencies concerned with the funding of such services with a set of appropriate performance metrics. Its definition is one of the most important of the AHDS strategic functions. Its continued maintenance will also be an essential task, requiring non-trivial effort for the foreseeable future.

4.2 Data Formats

Standardization is currently well advanced in this domain, with many competing or overlapping standards in a number of areas. Fortunately, considerable expertise exists within the academic and industrial data processing communities with respect to data conversion and inter-operability of systems. Service providers and other appropriate groups of experts will be asked to provide short lists of recommendations, perhaps with differing emphases in different subject areas, as to the commonly observed standards for data format.

4.2.1 Textual data

For AHDS purposes, the most appropriate standard is likely to be that provided by the Text Encoding Initiative's ``Guidelines'' (TEI P3, 1994). This provides an internationally-accepted set of recommendations for the standardized encoding of a very wide range of textual and other data formats, based on SGML, which is itself a widely implemented ISO standard (ISO 8879) of increasing importance in both commercial and academic publishing worlds. We know of no other equally relevant standard for the widest variety of arts and humanities data. The TEI Recommendations have the unusual virtue of combining in one integrated framework a set of recommendations for data representation, data interpretation or analysis, and data documentation, which is designed to be independent of any particular application area, software environment or hardware platform.

Although under development since 1989, the TEI ``Guidelines'' were published only in May 1994. Expertise and software capable of exploiting them are rapidly becoming available, in part because of an increased awareness of SGML itself within the academic community, but are by no means widespread. It is therefore important to ensure that large existing collections of data do not remain inaccessible to the AHDS community until they can be converted to conform to the ``Guidelines''.

We anticipate that there will be a considerable start-up cost associated with conversion of ``legacy data'', but that this cost will rapidly decrease.

4.2.2 Tabular data

For the rectangular data sets useful in certain types of research in the social and historical sciences, data in delimited or fixed format columns is widely accepted. Where information has been stored in databases based on the relational model, the export and transfer of data is also generally easy, although the transfer of associated descriptive meta-information may be more difficult. [See note 39] For data sets stored in databases using other models, for example with multi-valued fields, data export and transfer generally pose serious problems, because each such database system will have its own proprietary internal format, and there may be no conversion or import/export facility.
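The difference between the two column conventions, and the reason the descriptive meta-information matters, can be illustrated with a minimal sketch in Python. The column layout, field names and sample records below are hypothetical and chosen only for illustration; the point is that a fixed-format file cannot be interpreted without a separately supplied layout, whereas a delimited file can at least carry its column names in a header row.

```python
import csv
import io

# Hypothetical column layout: (field name, start offset, end offset).
# In a fixed-format file this information lives outside the data itself.
LAYOUT = [("parish", 0, 12), ("year", 12, 16), ("baptisms", 16, 20)]

def fixed_to_rows(lines, layout=LAYOUT):
    """Slice each fixed-format line into a list of stripped field values."""
    return [[line[start:end].strip() for _, start, end in layout] for line in lines]

def rows_to_delimited(rows, layout=LAYOUT):
    """Write rows as comma-delimited text with a header row naming the columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for name, _, _ in layout])
    writer.writerows(rows)
    return out.getvalue()

# Invented sample records in fixed-format columns.
sample = [
    "Headington  1801  42",
    "Iffley      1801   7",
]
rows = fixed_to_rows(sample)
delimited = rows_to_delimited(rows)
```

Here `delimited` begins with the line `parish,year,baptisms`. Richer meta-information, such as units, coding schemes or provenance, would still need to travel separately from the data.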

4.2.3 Graphical data

Many different proprietary formats are used, each of which has its advocates. There is considerable inter-operability, because of a number of proprietary converters and because it is common for software vendors to support more than one format. This is, however, a rapidly changing and developing field.

4.2.4 Audio and other time-based data

Time-based data sets, both in the form of compressed audio or video and in symbolic form (such as an encoded representation of a musical score) will be of increasing importance as equipment capable of managing them becomes more readily available. At present, handling such data sets appears to be a highly specialized activity, although audio and video ``clips'' are becoming a normal part of the material presented in on-line services such as the World Wide Web, and have obvious academic uses.

The AHDS will need to consult with appropriate specialist groups, at the British Library National Sound Archive, the Center for Computer-Assisted Research in the Humanities (CCARH) at Menlo Park, California, and elsewhere, to identify de facto standards in this area and integrate them into its framework. At the time of writing, for example, the CCARH is about to commence distribution on a non-commercial basis of large music data sets comprising a significant proportion of the works of classical composers such as Bach, Handel, Mozart, and others, derived from out-of-copyright scholarly editions. However, the music will be issued in a proprietary binary format (MuseData) together with software tools for data-retrieval and processing. [See note 40]

The commercial protocol Musical Instrument Digital Interface (MIDI) is a very widely-used binary system for control of musical devices, which has an associated Standard MIDI File format (SMF). While this is a de facto standard for musical performance, it has well-recognized drawbacks for the symbolic representation of scores. [See note 41]
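One commonly cited drawback of this kind (not necessarily the one the note has in mind) is that a MIDI note message identifies a pitch only by a key number, so the spelling of a note as written in the score is lost. The following minimal Python sketch illustrates this. The function `midi_key` is a hypothetical helper written for this example, and it assumes the common convention in which middle C (C4) is key number 60.

```python
# Semitone offsets of the natural note names within an octave.
NATURALS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def midi_key(spelling, octave):
    """Map a spelled pitch such as 'C#' or 'Db' to a MIDI key number.

    Assumes the common convention in which middle C (C4) is key 60.
    """
    letter, accidentals = spelling[0], spelling[1:]
    offset = NATURALS[letter] + accidentals.count("#") - accidentals.count("b")
    return 12 * (octave + 1) + offset

# Two notes that a score distinguishes map to the same key number.
c_sharp = midi_key("C#", 4)
d_flat = midi_key("Db", 4)
```

Both calls yield the same key number, so a file that records only key numbers cannot tell an editor which spelling the composer wrote.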

SGML-based initiatives may be of more significance in this connexion, in particular those based on HyTime, which is an ISO standard (ISO/IEC 10744:1992) for the encoding of time-based data, specifically for multimedia purposes. [See note 42] The Standard Music Description Language (SMDL) is a specialised subset of HyTime concerned with the encoding of symbolic representations of music. [See note 43]

Standards for data compression are relevant both to image (e.g. JPEG) and time-based data; the AHDS will have little choice but to follow industry trends.

4.3 Data Documentation

We use this term to refer to the general topic of standards and codes of practice relating to meta-information, that is (in the most mature case) bibliographic standardization such as that embodied by the International Standard Bibliographic Description, well known in library circles, or the Standard Study Description, equally well known in social science. We also include under this heading consideration of the indexing techniques used to facilitate access to image and sound data.

Conventions for the documentation of networked electronic resources are still in a state of ferment, despite a number of years of discussion both within the library and information science community and outside it. It is only a slight exaggeration to say that at present the quality of information available in ``gopher space'' is inversely proportional to its quantity: never before have so many known so little about so much.

The most promising basis for an improvement in this situation will be a careful examination of existing standards for bibliographic description, and their relevance to the description of non-book materials. Relevant standards include AACR2 (1988) and ISO 690 (1987). It will be important to monitor the development of more electronically-relevant standards such as MARC, which are likely to have considerable impact within the library community. The recommendations of the TEI for formal documentation of electronic texts are also of great relevance here, since these are explicitly modelled on the ISBD.

The identification and approval of any pre-existing descriptive thesauri appropriate within given subject areas will be of considerable assistance to subject-specific resource providers. The high cost of building and maintaining indexes based on such thesauri should be set against the enormous increase in data accessibility which they typically facilitate.

This is particularly true for the description of visual images in verbal or encoded form, without which large scale image databases are likely to be entirely unmanageable. Natural language descriptive thesauri (similar to those used for other data sets) are widely used, for example by the Getty Foundation's projects. For representational art in the Western European tradition there also exist abstract descriptive schemes such as Iconclass.

Wide consultation involving creators, users and subject specialists will be necessary to ensure that data documentation supports both informed and responsible data usage. The AHDS will need to formulate clear guidelines as to the level and scope of data documentation which is essential, desirable, or useful, for each of a variety of resources. It should also provide helpful guidelines for those seeking to create documentation tools, such as database proformas, or could initiate and manage projects to develop such tools.

4.4 Data Provision

There are few pre-existing formal standards of relevance in this area. We expect therefore that the AHDS will begin by tabulating and codifying existing codes of practice with a view to agreeing levels of conformity amongst all AHDS service providers. We propose that an effective way of carrying this out might be to set up small task groups to report on specific areas in which individual practice tends to diverge. These task groups should be composed of a small number (five or fewer) of experts, not necessarily involved with the AHDS, who might be required to report back with detailed proposals for standardization within a reasonable time scale (say nine months).

Most of the areas of concern common to all AHDS service providers were discussed above in section 2 Perceived needs and the current situation. Specific topics on which we anticipate that such reports would be particularly useful are:

Each of these is discussed in more detail below; others may be added as the need emerges.

4.4.1 Charging and licensing

It has to be recognized that there are always costs associated with the creation and maintenance of electronic resources. It is not unreasonable to seek to relate the value of such resources for subsequent users directly to these costs. However, some ambivalence exists within the academic community on this point. There is a widely-held view amongst information consumers that all services should be free, just as there is an equally widely-held expectation (often located in the same individuals) that information provision should be remunerated in some way. A variety of mechanisms have been developed to resolve this contradiction, of which the three most popular appear to be: top-slicing -- where access to a resource is centrally-funded once for all, but appears to be free to the individual; differential charging -- where free access to a resource by one group (academics) is effectively subsidized by the high charges made for access to the same resource by another group (commercial users); and subscription -- where payment of a single fee provides unlimited access to members only. In the network environment other modes of charging have also been proposed, such as the notion of ``pay by use'' -- where an automatic charge is levied for each access to a given network resource or group of resources.

It would be premature for the AHDS to begin by proposing any particular charging policy. However, it is an area of policy-making which should be tackled as a matter of urgency. Longer term, the AHDS will need to remain capable of responding flexibly to changes in public attitude, or even legislation, which are the ultimate determiners of such matters. All of that said, it is clearly in the interests of the AHDS community that as few barriers as possible be introduced between users and providers of data resources.

Service providers may in any case need to adopt different cost-recovery procedures. Some may need to recover only media costs; others may need to recover a proportion of the cost of creating and maintaining the resources within a defined period. In some cases, costs may be beyond the control of service providers, being determined by external agencies who make resources available to service providers on certain terms only.

In the case of commercially-funded resource provision, the existing expertise of bodies such as CHEST should be employed to negotiate satisfactory contracts for the AHDS community as a whole, rather than solely for the benefit of the richer members of that community.

Where resources are prepared specifically for the AHDS, clear policies regarding their possible commercial exploitation need to be set in place. There is growing evidence, for example in the TLTP programme, of the commercial viability of certain kinds of Arts and Humanities Data Resources: the expertise of bodies such as the Association for Learning Technology (ALT), of commercial publishers, and of academic industrial liaison officers should be sought to ensure that appropriate royalties are derived from such resources.

Where charges are made, for whatever reason, it is likely that differential rates will be applied to different users. One common model is to provide commercial access at a premium rate, royalties from which can then be used to subsidize academic access. It is not always easy, however, to distinguish ``academic'' from ``commercial'' institutions or usage. Equally, if institutions not funded by the Funding Councils wish to access the service, it is likely that differential charging rates would be applicable.

Cost recovery for the service as a whole is greatly complicated by the fact that very large economies of scale may be obtained by localization of some key strategic functions such as data preservation and data description.

We recommend that, as a first step, existing charging policies of all AHDS service providers be collated, and if possible made consistent. Where no financial charge is made, but certain conditions are imposed (for example, as with the GNU public licence), the AHDS should attempt to define a legally acceptable formulation of the conditions. Where charging at some level is essential to the provision of a resource, guidelines for an acceptable level of charging, and mechanisms for its recovery from the community or elsewhere need to be defined. At this stage, it is probably only possible to identify commonly occurring practices and highlight the strengths or weaknesses of each. The proposed structure allows for cost-recovery and administration at data set, service provider or AHDS level.

4.4.2 Intellectual Property Rights

IPR considerations have always been closely related to charging issues: ownership of an intellectual property right is one way of justifying making a charge for it. However, the AHDS has a wider role to play in defining acceptable agreements with IPR holders than simply setting a fair price. It is widely recognized that this is a highly complex issue: current legislation in this area is both confused and confusing, with entirely different principles applying to information in different formats, or distributed to different countries. As with pricing, it is something which will need to be addressed as a matter of urgency by the AHDS, working in close collaboration with FIGIT and other relevant initiatives, and monitoring closely the developments in European legislation.

It is clearly essential that the AHDS be seen to be behaving in a responsible way with respect to the rights of IPR holders. This implies, for example, that service providers should be required to maintain records of all those with IPR in their holdings, and that appropriate legal instruments be drawn up defining the nature of those rights and the licence under which the holdings are being made available. Definition and provision of such legal instruments should be charged to a small task group of appropriate experts.

4.4.3 User Support and Consultancy

It is anticipated that the level of advisory support available for different data sets may vary considerably. At one extreme, it may be impossible to do more than supply documented data resources on a ``caveat emptor'' basis; at the other, providers may readily offer detailed instruction in their use, complete with tutorials, self-help systems or casebook studies, perhaps even without actually supplying the data resources themselves. Most will fall between these two extremes. In line with other AHDS policies, Service Providers will be required to define the level of service which they can provide, raising it to the minimum levels defined by the SRG where possible.

From the end-user's point of view, what is important is a clear statement of the level of service a given provider can reasonably be expected to provide. This might include simple ``performance indicators'' such as time taken to turn round a request for a resource, or to answer a query, or more imponderable qualitative matters such as the extent to which the service offered actually improved the effectiveness of a researcher's work or a teacher's productivity. In attempting to provide such information, the AHDS will simply be fitting in with the general pattern of performance monitoring now being required throughout academia.

4.4.4 Maintenance, Error Correction, Quality Assessment

Electronic data sets are not necessarily static objects. A few data sets are highly dynamic, changing from day to day, but even those which are supposedly authoritative may need maintenance or error correction. In the software context, users are accustomed to the notion of regular updates -- whereby manufacturers bring out new versions in which existing errors are removed (and replaced by new ones). Similar considerations apply to data sets.

It is the users of data sets who are most likely to identify discrepancies or inconsistencies and they should be encouraged to report these for the benefit of all. Different service providers may have different policies as to whose responsibility it should be to act on such error reports, but we recommend that redistribution of such information should be a pre-requisite for AHDS service providers. Electronic discussion groups and mailing lists provide a natural and effective mechanism for this.

Thought should be given to ways of providing qualitative information about data resources (that is, some indication of the accuracy or reliability of particular resources) in ways compatible with the general principles of detailed data description discussed in section 4.3 Data Documentation. The definition of acceptable levels of accuracy is an important area for AHDS service providers, given both the very variable quality of many present-day networked resources and also the desirability of allocating academic credit for the creation of such resources.

4.5 Data Preservation

All AHDS service providers will need to arrange for the long term preservation and storage of their resources, either locally, or by delegating this function to some other service provider, within the AHDS or elsewhere. In either case, it is clearly essential that this archival function itself be responsibly and adequately carried out. The objective is to ensure that resources can be recovered, without loss of functionality, and made available on computing or other information processing equipment at any time in the foreseeable future.

In general the term archive implies ``for ever''; all archivists know, however, that policies allowing for some culling (or removal) of unused or duplicated resources at some time are necessary. There is no reason to believe that electronic archives will be any different from others in this respect, except perhaps quantitatively. To some extent, the existence of properly defined and agreed quality assurance procedures for the accession of AHDS resources may reduce the urgency of the need to define such policies, but their definition, and a clear statement of the criteria underlying them, will still be necessary at some point. In this respect, the AHDS will naturally look to the wider library and academic communities for guidance.

Unlike the archiving of documents or other artefacts, the archiving of electronically held information resources requires a separation between the medium and its content which may not always be easy to carry out. Hardware changes probably make impossible the indefinite preservation of software. However, there is ample evidence that data sets can be preserved for almost indefinite lengths of time simply by establishing routines for the migration of the data from one carrier medium to another, without loss of information.
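A migration routine of the kind described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not prescribe any particular checking method, and the use of a SHA-256 digest, the function names `sha256_of` and `migrate`, and the temporary directories standing in for two carrier media are all choices made for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def migrate(source, target):
    """Copy source to target and confirm that the content is unchanged."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise IOError(f"migration of {source} altered the data")
    return after

# Demonstration: two temporary directories stand in for the old and new media.
with tempfile.TemporaryDirectory() as old_medium, tempfile.TemporaryDirectory() as new_medium:
    original = Path(old_medium) / "dataset.txt"
    original.write_bytes(b"example data set\n")
    migrated = Path(new_medium) / "dataset.txt"
    recorded_digest = migrate(original, migrated)
    migration_ok = migrated.read_bytes() == original.read_bytes()
```

Keeping the returned digest alongside the data set means that the next migration can be checked against the same record, so that the content is verified independently of whichever medium currently carries it.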

Because this task of archival storage is likely to be delegated in many cases to specialist agencies, it is correspondingly important for the AHDS to specify minimal operational requirements for such agencies. The nature of these specifications will also need constant review as technology changes -- a standard based on the use of magnetic tape might no longer be appropriate in five years' time.

Some aspects are unlikely to change, however: these include the number of copies held, their locations, and the nature and degree of integrity checking carried out, both synchronically across the resources and periodically over time, to check for degradation.
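As a hedged illustration of what such integrity checking might involve, the Python sketch below compares the copies of a resource held at several locations against a digest recorded at accession. The location names, the sample content and the function `audit_copies` are hypothetical, and the choice of SHA-256 is an assumption made for this example rather than a recommendation drawn from the text.

```python
import hashlib

def audit_copies(copies, recorded_digest):
    """Report, per location, whether a copy still matches the recorded digest.

    `copies` maps a location name to the bytes held there.
    """
    return {
        location: hashlib.sha256(content).hexdigest() == recorded_digest
        for location, content in copies.items()
    }

# Digest recorded when the resource was first deposited.
master = b"example data set\n"
recorded = hashlib.sha256(master).hexdigest()

# Hypothetical holdings: two sound copies and one that has degraded.
holdings = {
    "site_a": master,
    "site_b": master,
    "site_c": b"example data sot\n",
}
report = audit_copies(holdings, recorded)
damaged = sorted(name for name, ok in report.items() if not ok)
```

Run at regular intervals, an audit of this kind identifies which copy has degraded, so that it can be replaced from one of the copies that still matches the record.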

In general, economies of scale argue strongly for provision of archival facilities as a strategic AHDS service, which may be contracted out to specialist agencies offering a guaranteed level of service. Individual AHDS service providers may also be able to offer equivalent facilities to others less well endowed. The cost-sharing implications of this are considerable.

