2 Perceived needs and the current situation

In this section we review the general HE framework within which data sets and networked information services are being developed, identify the needs of the Arts and Humanities within this framework, and discuss the current situation in relation to each of the needs.

2.1 Background

2.1.1 Data sets and services today

In some disciplines, and in some countries, deposit of digital data sets in ``approved'' archives is a common and accepted condition of grant. This is frequently not the case in humanities disciplines in the United Kingdom, as a brief informal survey of four funding institutions has shown.

Each year the British Academy funds a significant number of projects which result in the creation of arts or humanities data sets. Since 1992 it has required applicants for grants to provide evidence that they have made adequate provision for the long-term access and preservation of the resources generated by the project.

The British Library Research and Development Department funds several projects every year, some of which result in the creation of relevant data sets, though these often are small and with limited scope for re-use (some are specifically precluded from widespread use due to copyright restrictions). It typically has ``encouraged'' creators of data sets to deposit them at the ESRC Data Archive (see 8.1.3.4 Source of funding), without making this a condition of grant; not all data sets so created have been deposited. The Library is considering a policy which requires deposit as a condition of grant.

The Getty Art History Information Program has funded two projects in the UK explicitly concerned with the creation of data sets. [See note 5] In both cases, the projects were required to deposit copies with the Institute (in California), but not elsewhere.

The Leverhulme Trust funds about 50 humanities projects each year, many of which include data set creation. It does not impose conditions concerning access, deposit or preservation.

The Economic and Social Research Council (ESRC) funds a variety of projects relevant to the humanities, particularly in history and linguistics; deposit of resultant data sets is a condition of grant. However, where researchers in the humanities have been successful in obtaining funding for major projects from more scientifically-oriented research councils, these have not always included any provision for the long-term availability of any resources created. For example, several language engineering projects were funded by the Science and Engineering Research Council (SERC), under the Speech and Language Technology initiative, but their availability to the humanities community at large remains unclear. Often individual resource creation projects have been expected to manage distribution for themselves, with consequent dangers of invisibility outside the immediate research communities served.

There is an expectation that data set provision can be made economically viable and self-sustaining in the long term. This may be true for some kinds of resource where the demand is clear and sustained; it is less self-evident in the general case, particularly for the wide range of materials likely to be of interest to the Humanities.

The provision of services is varied and uneven. There are two formally constituted data archives --- the History Data Unit of the ESRC Data Archive at the University of Essex, [See note 6] and the Oxford Text Archive at Oxford University. Otherwise the management of data sets, including access, documentation, support and preservation, depends very much on the awareness and resources of the individual projects and their host institution.

2.1.2 Commercial data sets

In addition to the academic projects and institutions, several commercial organizations are now becoming involved in producing and publishing digital documents for scholarly use. Frequently, these are traditional paper-based publishers which are diversifying into electronic media. Their products range from novels sold on diskette for a few pounds to corpora sold on CD-ROM for tens of thousands of pounds. [See note 7]

Recently the British Academy carried out a Networked Data Information Survey of all Arts and Humanities departments in the UK. This suggested that there is a demand for commercial data sets, many of which are too expensive for smaller institutions to obtain individually. [See note 8]

While it is in any single publisher's interest to maintain archival copies of its materials, there is no commercial imperative to do so. On the contrary, it is easily possible to imagine commercial pressures which would result in data sets for which there is little demand being discarded. We are not aware of any unified source or catalogue for these materials; nor are we aware of any widely-used standard for works published in electronic form.

2.1.3 Resources for teaching

There have been three major Research Council-funded UK initiatives in the past few years, all aimed at promoting the use of electronic resources in teaching, and at creating a range of transferable materials. First was the Computers in Teaching Initiative (CTI) (see 8.1.2.4 Source of funding), followed more recently by the Information Technology Training Initiative (ITTI), and the Teaching and Learning Technology Programme (TLTP).

2.1.4 Networked Information in the UK

The creation and management of networked electronic resources is currently a major area of activity within the Higher Education community, with a number of bodies and institutions involved, and a number of projects and initiatives.

Within the UK the academic communications infrastructure --- JANET and SuperJANET --- has largely, up to now, been centrally funded by the HEFCs. The funding councils are advised by an Advisory Council on Networking (ACN), and the work of creating and maintaining the networks is carried out by UKERNA (United Kingdom Education and Research Networking Association) [See note 9].

There have also been a number of recent policy reviews, including the SuperJANET Project on Information Resources (SPIRS), [See note 10] and more recently the Joint Funding Councils' Libraries Review (the Follett Report). [See note 11] The Follett recommendations were accepted by the Councils and are now in process of implementation. A large number of the recommendations are concerned with the creation and delivery of electronic resources, and their implementation is being planned and carried out by the Follett Implementation Group on Information Technology (FIGIT). [See note 12]

A number of projects and groups are active in developing network provision and access. Notable among these are the UK Office for Library and Information Networking (UKOLN), based at Bath University, and the recently formed special interest group Information Networking Alliance (INA), which represents librarians (through SCONUL), academic computing professionals (UCISA) and publishers (BIC).

2.1.5 The international dimension

Although, for practical reasons, the focus in our discussion is largely within the UK, there are many relevant institutions and projects outside the UK. There has always been a strong international tradition among scholars, which has been greatly facilitated by the ease with which students and researchers can now cross international frontiers electronically in their pursuit of digital resources. Indeed, there is a clearly emerging tendency for a small number of academic experts working within a comparatively narrow specialization but spread across several countries, to gain ``critical mass'' by working together using the Internet, thereby achieving results greater than the sum of the results they might achieve individually.

Alongside this, the rapid and chaotic growth of the Internet has to be taken into account, comprising as it does a certain number of stable, well-ordered data sets and services and a very large volume of unregulated activity and information. This presents problems of its own, for it has become difficult to distinguish useful high-quality information from the low-grade material available side by side. (``On the one hand everything is available. But on the other hand, everything is available.'' - Alfred Glossbrenner). The difficulty of locating appropriate resources has become a common complaint.

2.2 Needs

The following sections present a number of specific issues in the provision of networked information. They formed a major component of the AHDS Workshop at the British Academy, and continued to be important throughout the Feasibility Study. Discussion is presented under broad headings chosen to reflect the ``life cycle'' of an electronic resource: resource creation, resource management, resource preservation, and resource dissemination and discovery. Within this framework, we identify the specific needs of the Arts and Humanities community, and try to summarize the current situation in each case.

2.3 Resource Creation

The resources covered here include: All these resource types are covered within the FIGIT framework, in particular Programme Areas 2 - Electronic Journals, and 3 - Digitisation. [See note 13]

2.3.1 Academic quality

2.3.1.1 Need

Peer review procedures are needed which provide external validation for AHDS data sets. Without these the creation of such data sets, whether as resources for research or teaching, will not receive recognition from HE institutions or from the academic community as a whole.

2.3.1.2 Current Situation

Funding bodies such as the British Academy and the Leverhulme Trust carry out refereeing procedures when bids are assessed for projects which will create electronic data sets, as do a number of HE institutions when allocating funds internally for such projects. However, there are large numbers of data sets which are conceived and initiated without any such procedure. Even more important, perhaps, peer review of completed projects is relatively rare. It may be that such procedures should be external to the AHDS, but it is nevertheless a matter which the AHDS must take very seriously.

2.3.2 Copyright

2.3.2.1 Need

First, it is important that copyright issues are addressed at a very early stage in the planning of individual projects. More generally, agreements and guidelines are needed at HE community level to provide a coherent and well-understood framework within which resource creation projects can operate.

2.3.2.2 Current Situation

There is discussion going on amongst publishers, and between publishers and HE authors and institutions. The British Library has a working party on Legal Deposit of electronic material which is of relevance here. The development of ``a model agreement with publishers'' is being addressed by FIGIT. [See note 14] The WATCH project, with its aim of creating a database of copyright holders, is of interest in this area. [See note 15] It is also likely that the Copyright Licensing Agency will make their database of copyright holders available on-line in the relatively near future. [See note 16]

2.3.3 Design standards

2.3.3.1 Need

There is a need for more widespread awareness and use of good design practice, both for research data sets and for teaching modules.

2.3.3.2 Current Situation

A number of data analysis and design methodologies are available in the commercial world which can be brought to bear on the development of research data sets. [See note 17] There is also now an extensive literature concerning the design of Computer-Based Learning (CBL) packages. However, there are many projects in both research and teaching contexts which do not use specialist designers or specific design standards.

2.3.4 Re-usability and access

2.3.4.1 Need

Platform-independent standards for data encoding are essential to ensure transferability and re-usability of data. Such re-usability has academic benefits in terms of increasing the resources available for research and teaching, of allowing arguments and conclusions to be challenged by manipulation of a particular data set, and also, where copyright conditions of use permit it, of enabling resources to be ``re-purposed'' --- i.e. modified or adapted to serve a new teaching or research purpose. Re-usability also has economic importance: high-quality electronic resources are expensive to produce, and it is important to ensure that as far as possible their re-use is a matter of academic judgment rather than of technical or financial constraint.

2.3.4.2 Current Situation

Currently platform-independence remains a goal rather than a reality for the most part. However, data encoding is an area of great activity and promise for all data types, with the work of the Text Encoding Initiative (TEI) offering particular promise (see Section 4 Standardization). On the other hand, for certain types of resources, such as databases and CBL packages, platform-independence is somewhat distant, which means that for the foreseeable future AHDS Service Providers would need to cope with a number of proprietary software packages. Even so, it is often possible to ensure platform-independent access to the data, as for example in constructing relational databases so they can be manipulated by means of Structured Query Language (SQL). [See note 18] In hypermedia work, for CBL and other purposes, the HyperText Mark-up Language (HTML) used by the World Wide Web is very popular, in part because of the platform independence consequent upon its use of the Standard Generalized Markup Language (SGML). [See note 19]

Following on from the above, an increasingly important issue is whether the resource is to be accessed and manipulated on-line. If so, consideration has to be given to which access tools users will be able to employ. Among the most common network access mechanisms are Gopher, World Wide Web (WWW), and Wide Area Information Server (WAIS). Use of any of these systems involves additional encoding issues which should be considered when the resource is created.
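
The point about SQL made above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for whichever relational package holds a data set; the table, its columns and its rows are invented for illustration and are not drawn from any archive. The query text itself is plain SQL and does not depend on the package behind it.

```python
import sqlite3

def titles_before(conn, year):
    """Return the titles of texts dated before `year`, oldest first."""
    # The "?" placeholder is the parameter style used by sqlite3;
    # other database drivers may use a different placeholder.
    rows = conn.execute(
        "SELECT title FROM texts WHERE year < ? ORDER BY year",
        (year,),
    )
    return [title for (title,) in rows]

# Hypothetical data set: an in-memory database stands in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO texts VALUES (?, ?)",
    [("Text A", 1600), ("Text B", 1850), ("Text C", 1400)],
)
print(titles_before(conn, 1700))
```

Because the user's request is expressed in SQL rather than in the commands of one product, the same request could in principle be put to the data after it has been moved to a different relational system.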

2.3.5 Resource identification

This is the first in a range of documentation which is required if electronic resources are to be both accessible and usable, and which is best produced by the resource creator.

2.3.5.1 Need

Standards are urgently required which enable resources in the rapidly growing Internet world to be uniquely identified and located. The need is far from simple, as for example in identifying resources which are replicated in a number of places.

2.3.5.2 Current Situation

The Internet Engineering Task Force (IETF) has a working group on Uniform Resource Identifiers (URIs) which aims to provide a standard for the unique identification of networked resources, by means of a Uniform Resource Name (URN) and a Uniform Resource Locator (URL). Where libraries and archives have on-line catalogues, the individual entries will follow the standards used by the institution. [See note 20]
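
The distinction between a name and its locations, and the difficulty of replicated resources noted under 2.3.5.1, can be sketched as a simple lookup. Everything in the registry below is hypothetical: the name, the host names and the file paths are invented, and the sketch does not represent any resolution mechanism proposed by the IETF.

```python
from urllib.parse import urlparse

# Hypothetical registry: one persistent name, several replicated locations.
REGISTRY = {
    "urn:example:text-0001": [
        "ftp://archive-a.example.org/texts/0001.sgm",
        "http://archive-b.example.org/texts/0001.sgm",
    ],
}

def locate(urn, preferred_scheme=None):
    """Return the known locations for a name, optionally filtered by access scheme."""
    urls = REGISTRY.get(urn, [])
    if preferred_scheme is None:
        return list(urls)
    return [u for u in urls if urlparse(u).scheme == preferred_scheme]

print(locate("urn:example:text-0001", "http"))
```

The name stays fixed while the list of locations can grow or change, which is the property a citation to a networked resource needs.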

2.3.6 Resource description

2.3.6.1 Need

Standards are needed for the electronic equivalent of full catalogue entries, to enable prospective users to make a judgment on the suitability of the resource to meet their needs without having to retrieve the whole data set. A considerable amount of technical information is needed, beyond the standard catalogue entries, from the details of the encoding scheme used to the software and hardware required in order to manipulate the resource.

2.3.6.2 Current Situation

Within the Library community there has been a great deal of work on resource description, mostly involving variations on the MARC standard, e.g. USMARC, UKMARC, UNIMARC. The ``TEI Header'' provides another possible approach, and the TEI Guidelines describe how the two approaches may be linked. [See note 21]
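
As a rough illustration of how a header carried with the resource can supply a catalogue-style description, the sketch below reads a few fields from a very small header. The element names follow the file description of the TEI header; the content is invented, and the header is written as XML only so that Python's standard library can parse it, which is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Invented header content; element names follow the TEI header's file description.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>An example text</title></titleStmt>
    <publicationStmt><p>Distributed by a hypothetical archive.</p></publicationStmt>
    <sourceDesc><p>Transcribed from a printed edition.</p></sourceDesc>
  </fileDesc>
</teiHeader>
"""

def describe(header_xml):
    """Extract a minimal description from the header without reading the text itself."""
    root = ET.fromstring(header_xml)
    return {
        "title": root.findtext("fileDesc/titleStmt/title"),
        "publication": root.findtext("fileDesc/publicationStmt/p"),
        "source": root.findtext("fileDesc/sourceDesc/p"),
    }

print(describe(HEADER)["title"])
```

A prospective user, or a cataloguing program, can thus judge the resource from its header alone, without retrieving the whole data set.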

2.3.7 Technical description

2.3.7.1 Need

Standards are required so that all the technical information needed to understand and manipulate the data set is available. Documentation of this type is particularly important for certain types of data set, e.g. for complex relational databases where schemas showing the tables and their relationships are essential. It is also essential whenever proprietary software or hardware is used.

2.3.7.2 Current Situation

It is far from clear that technical descriptions are produced as a matter of course along with every new data set. A number of system design methodologies, e.g. SSADM, include documentation standards, such as ``data structure diagrams'', which can be very helpful. These could form the basis of guidelines for use by academic projects. A number of software packages enable relevant technical information to be embedded within the data set, as do mark-up systems such as SGML.
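
Where the software records its own table definitions, part of the technical description can be generated rather than written by hand. The sketch below is one such case, using Python's built-in sqlite3 module and its sqlite_master catalogue; the two tables are invented for illustration, and other database packages expose their schemas in other ways.

```python
import sqlite3

def table_definitions(conn):
    """Return {table name: CREATE statement} for every table in the database."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return dict(rows)

# Hypothetical two-table data set with one relationship between the tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT, "
    "author_id INTEGER REFERENCES authors(id))"
)

for name, definition in table_definitions(conn).items():
    print(name, "->", definition)
```

The output lists each table with its columns and shows that works refers to authors, which is the kind of information a schema diagram would convey.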

2.3.8 User Guidelines

2.3.8.1 Need

Guidelines are needed to ensure first that users have available the necessary information on how to use the resource, and second that the way in which this information is presented is consistent from one data set to another.

2.3.8.2 Current Situation

It is widely accepted that while a paper version may be appropriate in some circumstances, help for users of on-line resources must be provided on-line. There are no standards as yet for the structure and presentation of such material, but hypertext approaches - often using HTML - are increasingly used.

2.4 Resource Management

2.4.1 Basic procedures

2.4.1.1 Need

Policies, standards and procedures are needed for accessioning data sets, documenting them where necessary, adding entries to the local catalogue(s) and indexes, and where appropriate to central catalogues.

2.4.1.2 Current Situation

All the established data archives have such procedures as a matter of course (as do all libraries).

2.4.2 Information content

2.4.2.1 Need

Procedures are needed to assure the integrity of the resource against accidental or deliberate corruption. Where a resource is subject to amendment, version control procedures are required to ensure that users know which version of the resource they are using, and which versions are available for use.

2.4.2.2 Current Situation

Version control procedures are well established in existing data archives and university computing services. The TEI Guidelines include detailed proposals for embedding version control information within an electronic resource.
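
One common way of detecting accidental or deliberate corruption is to record a checksum when a data set is accessioned and to recompute it later. The sketch below is offered as an illustration only and is not a procedure described by any of the archives mentioned; the sample data is invented, and SHA-256 is simply one widely available choice of checksum.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 checksum of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

def is_intact(data: bytes, recorded: str) -> bool:
    """Compare the data as it is now against the checksum recorded earlier."""
    return fingerprint(data) == recorded

# Hypothetical data set and the checksum recorded at accession.
original = b"An example data set, version 1.\n"
recorded = fingerprint(original)

print(is_intact(original, recorded))         # unchanged copy
print(is_intact(original + b"x", recorded))  # altered copy
```

Recording one checksum per version also gives users a simple means of confirming which version they hold.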

2.4.3 Intellectual history

2.4.3.1 Need

Documentation procedures are required which will allow the usage of the resource to be recorded, along with comments by the users, in particular any reports of ``errors'' in the data. Clearly it would be improper for an archive to ``correct'' such errors silently, but equally it has a duty to report them to future prospective users. Over time, networked resources will inevitably accrue critical commentary (often in electronic form) in exactly the same way as printed resources. It would be desirable for at least the bibliography for this commentary to be included with the documentation for the data set.

2.4.3.2 Current Situation

It seems that most data archives, e.g. the ESRC Data Archive (EDA) and the Oxford Text Archive (OTA), have mechanisms to allow users to comment on the data sets they use (and the service provided by the archive). In the past these have been paper based, but electronic forms, bulletin boards and discussion groups are increasingly being used for this purpose.

2.4.4 Physical access

2.4.4.1 Need

Procedures are needed which ensure that there is no unauthorized access to networked resources.

2.4.4.2 Current Situation

For some data sets and services there is no restriction on access. Where access has to be restricted, electronic ``registration'' is widely used, typically involving the allocation to a user of a ``user identifier'' and ``password''. Where necessary, passwords --- and even complete data sets --- may be encrypted. Usually, registration will entitle the user to the basic services offered by the institution. However, additional procedures are required where particular conditions are attached to the use of specific data sets, as described below.
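
As an illustration of how a registration service can avoid holding passwords in readable form, the sketch below stores a salted, iterated hash in place of the password. This is one widely used technique, not necessarily what any existing archive does; the function names and the iteration count are assumptions made for the example.

```python
import hashlib
import hmac
import os

def register(password, iterations=100_000):
    """Return the record to store for a new user: (salt, iterations, digest)."""
    salt = os.urandom(16)  # random per-user salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest

def check(password, record):
    """Recompute the digest from the offered password and compare it with the stored one."""
    salt, iterations, digest = record
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, digest)

record = register("an example password")
print(check("an example password", record))
print(check("a wrong password", record))
```

The stored record lets the service verify a password without being able to read it back.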

2.4.5 Conditions of use

2.4.5.1 Need

Procedures are needed which ensure that conditions of use imposed by creators, copyright holders, and the institution managing the resource are agreed and respected by users.

2.4.5.2 Current Situation

Typically, as part of the registration process, the prospective user signs a ``Conditions of Use'' form in relation to the archive generally. However, further conditions may be imposed in relation to specific data sets.

2.4.6 Intellectual Property Rights (IPR)

2.4.6.1 Need

Procedures are needed which ensure that the rights of authors and other rights-holders are protected, as far as is within the power of the service. (See 4.4.2 Intellectual Property Rights.)

2.4.6.2 Current Situation

This can be a complex area. At present it is covered typically within the registration and ``conditions of use'' procedures, with the user undertaking to cite all materials used for further publication in the appropriate manner.

2.4.7 Charging

2.4.7.1 Need

Procedures are needed which can cope, potentially, with a variety of charging mechanisms. (See 4.4.1 Charging and licensing.)

2.4.7.2 Current Situation

There are few electronic charging mechanisms in current use. Typically charges are levied as part of the registration or ``conditions of use'' procedures.

2.4.8 Platform-specific data

2.4.8.1 Need

Until such time as platform independence is uniformly enforceable, individual Service Providers will need to be able to manage data sets produced in a number of formats, and to handle both the variety of software common among their constituency of creators and users, and the hardware on which it runs. It is desirable that conversion between formats can be carried out where necessary. It would also be desirable for old non-conformant data sets to be ``upgraded'' to comply with defined standards, particularly where they are likely to be widely used.

2.4.8.2 Current Situation

Typically, existing data archives accept and manage data sets in a variety of formats. In certain circumstances data conversion may be carried out at the request of a user. The Oxford Text Archive has undertaken a phased programme of marking up its non-TEI conformant texts in SGML to the TEI standards, but its policy remains to accept and distribute material in any format.

2.5 Resource Preservation

This is an area of particular importance to the funding bodies, who are anxious to ensure that expensively created resources will remain available for use after the project funding runs out.

2.5.1 Preservation procedures

2.5.1.1 Need

Procedures are needed which can ensure the preservation of electronic resources indefinitely, until a decision is taken to destroy them or to allow them to lapse. Associated documentation procedures are also needed, e.g. to record the versions and formats in which a resource is held.

2.5.1.2 Current Situation

For all too many data sets there are no measures to ensure their preservation. For others, advantage is taken of the routine back-up and archive facilities offered by the local computing service. Where data is deposited with one of the established archives, there are usually more elaborate procedures in operation, with data sets held in several formats, and new copies made periodically.

2.5.2 Preservation media

2.5.2.1 Need

Effective preservation procedures need to take account of the likely longevity of the media on which the data sets are to be stored, both in terms of physical degeneration and of the availability of the necessary hardware and software to decode the data.

2.5.2.2 Current Situation

There is considerable activity in this area among the Archive community. Until recently, the major preservation medium was magnetic tape, with new copies made periodically to protect against physical deterioration. Optical disc technologies are now increasingly widely used, but there are no universally accepted standards. The Council of European Social Sciences Data Archives (CESSDA) has initiated a project on preservation methodology, based at the ESRC Data Archive. [See note 22]

2.6 Resource Dissemination and Discovery

``Discovery'' and ``Dissemination'' are very closely related. The emphasis in ``Dissemination'' is on the steps taken by resource publishers to make materials available and usable; in ``Discovery'' it is on the range of tools needed by users to access and manipulate the resources. However, the latter depends very directly on the former.

2.6.1 Delivery platforms

2.6.1.1 Need

For the foreseeable future it is likely that a variety of delivery methods will be required. Guidelines are needed covering dissemination of resources in parallel: over the networks, on other electronic media such as CD-ROM, and on paper. Authors and funders will be reluctant to limit themselves to any single delivery platform while the capabilities of users and their equipment are so varied.

2.6.1.2 Current Situation

Mixed platform delivery is well established in existing archives, where data sets are available over the Internet, on CD-ROM, on floppy discs, and even on magnetic tape if required. Commercial publishers are increasingly involved, at least in parallel publishing on paper and CD-ROM; a number are also very interested in publication over the Internet. [See note 23] FIGIT Programme Areas 1 - Electronic Document and Article Delivery and 4 - On-Demand Publishing are particularly significant in this context. [See note 24]

2.6.2 Communications infrastructure

2.6.2.1 Need

The networking of resources is possible only if a suitable communications infrastructure is in place, and valuable only if institutional infrastructures provide access to the resources for a critical mass of researchers, teachers, and students.

2.6.2.2 Current Situation

SuperJANET will provide a nation-wide high-speed, high-bandwidth platform, with links to other national and international networks, and capable of carrying the whole range of arts and humanities data types, including high resolution still and moving colour images. There are a number of SuperJANET demonstrator projects, including some from the Arts and Humanities, such as the Remote Access to Museums and Archives (RAMA) and Van Eyck projects. [See note 25] UKERNA in the UK, and the IETF, are concerned with creating the basic platform, and also with the whole range of related issues, such as resource identification and resource discovery tools. It will be important for the AHDS to stay in close touch with developments in this area.

2.6.3 Catalogues and Indexes

2.6.3.1 Need

The creation of comprehensive metadata resources is crucial if the networks are to be a successful method of dissemination. This involves not only on-line catalogues of resources, and catalogues of catalogues, but also searchable indexes of various types, including by subject, by author and by data type.

2.6.3.2 Current Situation

This has long been an area of considerable activity in the library community, and On-line Public Access Catalogues are well established. Many existing data archives have on-line catalogues and indexes. On a larger scale, in the US, the Research Libraries Group's Research Libraries Information Network (RLIN) holds a database of information on more than 56 million items held in over 200 member institutions, accessible by a search service called Eureka.

2.6.4 Gateways

2.6.4.1 Need

Gateways are needed to provide simplified access to resources, saving users a great deal of time and effort. Busy academics need ``one stop shops'' which will provide access to all the resources they require, using a single interface or at worst a small number of consistent interfaces.

2.6.4.2 Current Situation

There is a great deal of activity in this area. One approach, originally intended for Gopher-based services only, has been to develop a National Entry Point (NEP). The UK NEP is currently based at the UK Office for Library and Information Networking, and several other European countries have followed this model. The TopNode project of the Coalition for Networked Information (CNI) is based on a similar concept.

Another approach has been to develop interfaces between gateways of different types. The Common Gateway Interface (CGI), for example, specifies interfaces between WWW and other services. [See note 26] Within the last year the Working Group on Access to Networked Information Resources recommended to the ISSC that a Network Information Development Agency should be established to develop a range of National Entry Point services. On a more modest scale, the Humanities Bulletin Board (HUMBUL) has been transformed into the ``HUMBUL Gateway'' on the World Wide Web. [See note 27]

2.6.5 Service reliability

2.6.5.1 Need

If networked resources are to be utilized to their full potential, it is important that they form part of a well-supported service rather than being just a data set. Further, unless users can rely on the resources to be available, the services will soon lose their credibility.

2.6.5.2 Current Situation

A great many of the resources currently available on the Internet cannot be relied upon. Many exist only on an individual's personal computer, and are therefore unavailable if that machine is switched off or if it has a hardware, software or communications fault. Among the established data archives, the situation is a great deal better, of course. In many cases the file servers are mirrored, so that a fault with one does not bring the service down. It is also common to design communications systems with a degree of redundancy. Another benefit of redundancy can be performance. The DANTE InfoFlow project is based on a design of interlinked (national) data servers, with data replicated on two or more of these, to avoid access bottlenecks as well as to ensure a stable and reliable service (see Section 8.2.5.4 Source of funding).
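
The benefit of mirroring can be sketched from the user's side as a simple fall-back rule: try each copy in turn and use the first that answers. The sketch below is illustrative only and does not describe the DANTE design; the two stand-in servers are invented functions, one of which always fails.

```python
def fetch_from_mirrors(mirrors, name):
    """Try each mirror in turn; return (mirror label, data) from the first that answers."""
    errors = []
    for label, fetch in mirrors:
        try:
            return label, fetch(name)
        except OSError as exc:  # ConnectionError and similar faults are OSErrors
            errors.append(f"{label}: {exc}")
    raise OSError("all mirrors failed: " + "; ".join(errors))

# Hypothetical stand-ins for two servers holding the same data set.
def down(name):
    raise ConnectionError("server unreachable")

def up(name):
    return b"contents of " + name.encode("utf-8")

mirrors = [("primary", down), ("secondary", up)]
print(fetch_from_mirrors(mirrors, "text-0001"))
```

With a single copy the first failure would end the request; with a second copy the user sees only a short delay.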

2.6.6 Documentation, training and support for users

2.6.6.1 Need

These are crucial if a service is to be provided rather than just a data set. Standards are needed to define the form which the documentation, training and support are to take.

2.6.6.2 Current Situation

Without exception, the Data Archives we contacted all stressed documentation, training and user support as being a particularly important aspect of the service they provide. Publicity material includes leaflets sent to Higher Education institutions or distributed at conferences. Other documentation may include bulletins and newsletters, made available on paper and on-line. Training includes self-help guides and scheduled courses and workshops, offered in-house, and at conferences and in HE institutions. Other aspects of user support are typically centred on a help-desk, but increasingly include electronic activities, such as an email query service, an electronic bulletin board, and electronic discussion groups open to all registered users. This group of activities is included in FIGIT Programme Area 5 - Training and Awareness. [See note 28]

2.6.7 Resource discovery tools

2.6.7.1 Need

Standard tools are needed for:

2.6.7.2 Current Situation

The current work on resource location (URIs) has been discussed briefly above. Browsing has also been discussed, in relation to Gopher, WAIS and WWW. The most common means of document retrieval over the Internet is file transfer protocol (FTP). There is considerable work being done based on the Search and Retrieve (SR) and Z39.50 protocols. [See note 29] This work promises to provide tools for location and retrieval of complex multimedia documents. There are also an increasing number of SGML-aware browsers, so that cross-platform access tools will give access to platform-independent resources. In the area of databases, work continues on developing tools based on SQL, including natural language interfaces to SQL. This area of development is covered within FIGIT Programme Area 6 - Access to Network Resources. [See note 30]

2.6.8 Documentation and training in resource discovery

2.6.8.1 Need

Resource packs are needed which give new users basic guidance on what tools are available, and how they can be used to explore the world of electronic resources. On-line documentation also has an important role to play, to provide help once the user has succeeded in making a connection to the network.

2.6.8.2 Current Situation

There are a large number of books about the Internet, how to get started, and how to navigate around Gopher-space and the World Wide Web. The AHDS would not need to re-invent this material, but a simple introduction to electronic resources in the context of the AHDS would be invaluable in getting busy academics started.

