TEI Day in Kyoto 2006: Abstracts

Contents:

Paper presentations

Why was and is TEI unknown in Japan and will it become better known?

TUTIYA Syun (Chiba University)

Japan did send delegates to the first preparatory meeting of the TEI and maintained interest in the developments up to P2. Beyond that, interest remained at the personal level, with the result that researchers within the Humanities who know even of the existence of the TEI are extremely rare. The reason for this is not simply that interest has been low; it also brings to light the fact that researchers in the Humanities and Social Sciences have a concept of textual documents that is very much oriented towards the use of a text. In this paper, I will investigate the concept of textual documents, the actual use of texts and how it changed in the 1990s in Japan, and will illuminate changes that have or have not occurred in some disciplines. Finally, I will try to highlight the problems this raises for the preservation in electronic form of textual documents (in a very broad sense) in Japan and propose some steps towards tackling them.

Languages with scarce textual materials and markup technologies

MATSUMURA Kazuto (University of Tokyo)

Spoken words are inherently transient; unless recorded on tape or written down in some kind of writing system, they will soon be gone, sucked without trace into the spatio-temporal past. It is said that about 6900 languages are spoken on this planet. Most of these are used only in spoken form, the so-called 'minority languages without a writing system'; if the speakers of these languages disappear, their languages will disappear with them without trace from the face of the earth.

On the other hand, languages without a writing system are precious resources for linguists, and there are quite a few languages for which linguists have created a written notation using phonetic transcription. To make these valuable resources available for computer-based processing, digitizing and marking them up is an important task for linguists. This presentation will present some of the experiences and lessons learned.

Marking up spoken dialog corpora

TUTIYA Syun (Chiba University)

ITAHASHI Shuichi (National Institute of Advanced Industrial Science and Technology, National Institute of Informatics)

OHSUGA Tomoko (National Institute of Informatics)

The recording of spoken dialogs is interesting also because it does away with the intrinsic linearity of language. The mechanisms necessary to explore this within the framework of a TEI document will be presented, together with real-world encoding examples. In 1993, 128 recordings of spoken dialog were made at Chiba University, and the dialogs were transcribed along with information about the participants and the length of the utterances; this corpus will be used to illustrate the types of problems encountered in this work. We will also discuss a model for representing the phenomena of spoken language that is the prerequisite for such a transcription and will show how this can be realized in a conforming TEI document.
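The loss of linearity mentioned above can be made concrete with a minimal sketch. It assumes a TEI P5 style transcription in which each utterance (u) points at entries of a timeline through its start and end attributes; the speaker identifiers, the timings and the helper names are invented for illustration and are not taken from the Chiba corpus. The sketch further assumes that every when entry measures its interval from the same origin.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical two-speaker fragment; ids and timings are invented.
SAMPLE = """
<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <timeline unit="s" origin="#t0">
      <when xml:id="t0" absolute="00:00:00"/>
      <when xml:id="t1" interval="1.2" since="#t0"/>
      <when xml:id="t2" interval="2.0" since="#t0"/>
      <when xml:id="t3" interval="3.5" since="#t0"/>
    </timeline>
    <u who="#A" start="#t0" end="#t2">so where do we go next</u>
    <u who="#B" start="#t1" end="#t3">left I think</u>
  </body>
</text>
"""


def load_offsets(root):
    """Map each <when> id to its offset in seconds.

    Assumption: every interval is measured from the timeline origin.
    """
    offsets = {}
    for when in root.iter(f"{{{TEI_NS}}}when"):
        offsets[when.get(XML_ID)] = float(when.get("interval", "0"))
    return offsets


def overlaps(root):
    """Return (speaker, speaker, seconds) for each pair of overlapping utterances."""
    offsets = load_offsets(root)
    spans = []
    for u in root.iter(f"{{{TEI_NS}}}u"):
        start = offsets[u.get("start").lstrip("#")]
        end = offsets[u.get("end").lstrip("#")]
        spans.append((u.get("who"), start, end))
    found = []
    for i, (who_a, s_a, e_a) in enumerate(spans):
        for who_b, s_b, e_b in spans[i + 1:]:
            lo, hi = max(s_a, s_b), min(e_a, e_b)
            if lo < hi:  # the two utterances share some stretch of time
                found.append((who_a, who_b, round(hi - lo, 3)))
    return found


root = ET.fromstring(SAMPLE)
print(overlaps(root))
```

In document order the two utterances follow one another, but on the timeline they overlap, which is exactly the information a purely linear transcript would lose.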

Markup problems: Syntactical analysis and steps to their resolution

OHYA Kazushi (Tsurumi University)

In the process of digitizing textual resources used in the humanities and subsequently adding markup to them, there are, I think, three main reasons why difficulties are encountered:

the source material has not been sufficiently analyzed
markup as a method is at odds with the original aims
markup technologies as such have not been sufficiently mastered.

However, the last point also includes the fact that markup technologies themselves have not yet matured sufficiently: because of the way the XML standard is defined, some needed constructs cannot be written easily. For example, although combining several XML applications (TEI among them) is difficult to achieve because of the way the standard is defined, a large number of applications have nevertheless been defined and are widely used. Combining several applications is something even specialists in markup languages achieve only with difficulty. Beyond data input or data conversion, deciding what to encode and how is indeed a task that requires high-level skills.

On the other hand, adding markup itself is something that is very close to home for a scholar trained in the humanities. In this paper, I will focus on one of the above difficulties, namely the "syntactically induced" problems and pitfalls of markup languages. These arise not from underspecification of the standard, but rather as a consequence of the inherent freedom of description in a markup system. In the application of TEI, some of the problems encountered are due to the way markup languages as such are defined, while others result from the specific text type used. This distinction is helpful in understanding how such problems are handled within the TEI, but it can also be applied to markup languages in general. Markup languages are not something that should be used simply because they are defined as a standard; rather, since they use formal languages to apply annotations as a description on a meta-level, they provide a means to analyze and reflect upon the act of annotation as such.

TEI: an Overview

Syd Bauman (Brown University)

Lou Burnard (Oxford University)

This talk will present a broad overview of the TEI Consortium and the TEI Guidelines (P5).

The Text Encoding Initiative (TEI) Guidelines have become one of the central tools of the digital humanities landscape, and have revolutionized the creation and use of digital texts for scholarly research. Produced and maintained by the TEI Consortium, the Guidelines are now used widely in a range of scholarly applications, including digital libraries, scholarly editions, manuscript and historical archives, linguistic corpora, individual scholarly projects, and thematic research collections. The TEI community has produced an immensely useful text encoding system that works on many levels (for both simple and very complex forms of data representation) to ensure that humanities texts can be created, stored, exchanged, and archived in a manner that is both effective and expressive. This talk will first describe the TEI Guidelines as a text encoding standard, and their diversity of use within the community of TEI projects. Different disciplinary communities have produced their own specifications for using the TEI Guidelines, and these will be briefly discussed with respect to how they relate to the TEI Guidelines themselves. I will then present how the Guidelines themselves are technically organized, with an overview of how customizations for use by individual projects are produced. I will briefly discuss how the TEI Consortium itself is organized, with an emphasis on the work of the TEI Special Interest Groups, which reflect particular community interests and research areas. In concluding, I will describe the various ways in which individuals, projects, and institutions may become more closely involved.

Towards an internationalized and localized TEI

Sebastian Rahtz (Oxford University)

The Text Encoding Initiative Guidelines have been widely adopted by projects and institutions in many countries in Europe, the Americas, and Asia, and are used for encoding texts in dozens of languages. However, the Guidelines are written in English, the examples are largely drawn from English literature, and even the names of the elements are abbreviated English words. We need to make sure that the TEI and its Guidelines are internationalized and localized so that they are accessible in all parts of the world.

The paper describes how the TEI project can develop internationally, including

A review of why localisation and internationalisation matter
A discussion of how the TEI architecture can be leveraged to support internationalised versions
The application of the W3C ITS guidelines to the TEI work
Practical results from a pilot project, and future translation plans
The tools needed to make use of an internationalised TEI
The steps towards ontologies in the TEI

XML mark-up of biographical and prosopographical data

Matthew J. Driscoll (University of Copenhagen)

My paper will present work currently under way within the TEI for marking-up biographical and prosopographical data, in other words information on people, including such things as dates and places of birth and death, marriage and family relations, social origins, places of residence, education, occupation, religion, experience of office and so on.

Exploring TEI XML documents with XQuery

James Cummings (Oxford Text Archive)

This paper commences with a basic introduction and survey of the W3C XML Query Language (XQuery), using example XQueries to convey the basics of the language and the potential of using XML databases. It looks at the various expressions which constitute an XQuery and the functions, mostly inherited from XPath, which can be used to locate and extract data from an XML database. The paper then continues by looking specifically at using XQuery with TEI P5 XML documents in the popular native XML database eXist, and introduces additional aspects such as the use of namespaces and some of the useful extensions implemented by eXist. It demonstrates how to construct queries that return TEI documents, and parts of TEI documents, based on specific criteria. Following this, the paper explains step by step the creation of a basic XQuery web application for retrieving information from an eXist database using XML in conjunction with the Apache Cocoon web publishing framework.
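The role of namespaces when selecting parts of a TEI P5 document can be sketched without a database. The following is not XQuery and does not use eXist: it relies on the limited XPath subset of Python's standard xml.etree.ElementTree module as a stand-in, and the sample document and the function name poem_heads are invented for illustration.

```python
import xml.etree.ElementTree as ET

# TEI P5 elements live in this namespace; the prefix "tei" is a local choice.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document, for illustration only.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample poems</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="poem" n="1"><head>Dawn</head><l>Light on the river</l></div>
      <div type="poem" n="2"><head>Dusk</head><l>Smoke over the hills</l></div>
      <div type="note" n="3"><p>Editorial remark</p></div>
    </body>
  </text>
</TEI>
"""


def poem_heads(root):
    """Return (n, heading) for every div whose type attribute is 'poem'."""
    return [
        (div.get("n"), div.findtext("tei:head", namespaces=NS))
        for div in root.findall(".//tei:div[@type='poem']", NS)
    ]


root = ET.fromstring(SAMPLE)
print(poem_heads(root))

# Without the namespace the same path selects nothing, a common first pitfall.
print(root.findall(".//div"))
```

The second query returns an empty list because an unprefixed div names an element in no namespace, whereas every element of the sample belongs to the TEI namespace.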

Presenting TEI texts using topic maps

Conal Tuohy (New Zealand Electronic Text Centre, Victoria University)

This presentation is about a method for presenting complex TEI texts.

Many electronic text archives transform their TEI texts into HTML for publishing their texts on the World Wide Web. Typically each chapter or page is transformed from TEI into a separate web page. Such a method produces websites that have the same structure as a physical book.

However, TEI is more powerful than HTML and can encode many other features of interest than just chapters, pages, and paragraphs. For example, TEI is also used to encode information about people and places and events, as well as literary criticism, and linguistic analysis. Indeed, TEI is designed to be extended to suit all kinds of scholarly needs.

These more complex aspects of text encoding are more difficult to transform into HTML. Because TEI is designed to be convenient for scholars to encode complex information, rather than for readers to understand it, it is necessary to transform the TEI into another form suitable for display. For instance, where a TEI corpus includes references to people, these references might be collated together to produce an index. For practical purposes, it is often necessary to extract information from TEI into a database, so that it can be queried conveniently and transformed into a web site.
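Collating references to people into an index, as described above, might look roughly like the following sketch. It assumes that each reference is encoded as a persName element whose ref attribute identifies the person; the file names, identifiers and personal names are invented, and build_index is a hypothetical helper rather than part of any published framework.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented chapter fragments keyed by a hypothetical file name.
DOCS = {
    "ch1.xml": (
        '<div xmlns="http://www.tei-c.org/ns/1.0" n="1"><p>'
        '<persName ref="#smith">Mr Smith</persName> wrote to '
        '<persName ref="#jones">Mrs Jones</persName>.</p></div>'
    ),
    "ch2.xml": (
        '<div xmlns="http://www.tei-c.org/ns/1.0" n="2"><p>'
        '<persName ref="#smith">Smith</persName> replied, and '
        '<persName ref="#smith">he</persName> waited.</p></div>'
    ),
}


def build_index(docs):
    """Map each person identifier to the documents that mention that person."""
    index = defaultdict(list)
    for name, xml in sorted(docs.items()):
        root = ET.fromstring(xml)
        for pers in root.iter(TEI + "persName"):
            key = pers.get("ref")
            # Record each document once per person, however often it refers to them.
            if key and name not in index[key]:
                index[key].append(name)
    return dict(index)


for person, places in sorted(build_index(DOCS).items()):
    print(person, "->", ", ".join(places))
```

Because the index is keyed by the ref value rather than by the surface form, "Mr Smith", "Smith" and "he" all collate under the same entry.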

The new "Topic Map" standard of the International Organization for Standardization (ISO) is a suitable technology for solving this problem. A topic map is a kind of Web database with an extremely flexible structure. This presentation will demonstrate and describe a framework for using TEI together with Topic Maps to produce large websites which can be navigated easily in many directions.

Poster presentations

TEI @ RCH

Dot Porter (Collaboratory for Research in Computing for Humanities, University of Kentucky)

The Collaboratory for Research in Computing for Humanities uses TEI P5 in all its new and developing projects. Our poster will highlight current and developing projects in RCH, and the various ways that we are taking advantage of the flexibility offered by the TEI P5.

Through the Neolatin Colloquia Project, directed by Ross Scaife, Professor of Classics, graduate students and faculty associated with the UK Institute for Latin Studies are creating a variety of materials for the renewed study and enjoyment of neo-Latin colloquia scholastica, texts that date primarily from the 16th and 17th centuries. Modules used to encode the colloquia include those for Performance Texts (drama), Names and Dates (namesdates), and Common Core (core) – especially for the tagging of bibliographic citations and references.

The Latin Lexicography Project (LLP), also directed by Scaife, is building a web-accessible Latin dictionary, initially populated by digitizing and harmonizing the markup of several important Latin lexica with coverage up to about 1850 CE, then growing ever more comprehensive through the assimilation of additional lexica. For the LLP, we are using the dictionaries module to progressively mark up a number of classical Latin and neo-Latin dictionaries originally published in print.

Still in the planning stages, the Collectio Dacheriana Project directed by Abigail Firey, History Department, will make extensive use of the Critical Apparatus tags (in the textcrit module) in order to record the many variants in this collection of Carolingian canon law.

Under the direction of Ben Withers, Associate Professor and Chair of the Art and Art History Department, the Old English Hexateuch Project will bring together a group of Anglo-Saxon scholars with a variety of specialties to build an edition of the extensively illustrated tenth-century manuscript, British Library Claudius B iv. The edition will make extensive use of the Manuscript Description module (msdescription), and will also propose extensions to the modules for Text Criticism (textcrit) and Transcription of Primary Sources (transcr).

Venetus A Project, in cooperation with Harvard's Center for Hellenic Studies, is part of the Homer Multitext Project. The Venetus A project seeks to create a complete image-based edition of the Biblioteca Nazionale Marciana, Venice, Venetus A, a tenth-century Byzantine manuscript containing the earliest copy of Homer's Iliad plus several layers of annotations. Venetus A will take advantage of the Manuscript Description module (msdescription), and in addition will illustrate TEI interaction with the Classical Text Services (CTS) protocol, and image-text mapping between TEI and METS.

The Versioning Machine

Susan Schreibman (University of Maryland)

The Versioning Machine is open source software for displaying and comparing multiple versions of texts. The display environment provides for features traditionally found in codex-based critical editions, such as annotation and introductory material, while taking advantage of opportunities of electronic publishing, such as providing a frame to compare diplomatic versions of witnesses side by side, allowing for manipulatable images of the witness to be viewed alongside the diplomatic edition, and providing users with an enhanced typology of notes.

The Versioning Machine supports display of XML texts encoded according to the guidelines of the Text Encoding Initiative (TEI). Texts may be encoded individually (as separate documents) or may be encoded according to the TEI's "critical apparatus tagset" (TEI.textcrit) to encode all witnesses in one XML file. The critical apparatus tagset offers the most efficient and thorough methodology for inscribing variants in a structured, machine-readable format. The Versioning Machine provides for enhanced functionality of texts encoded according to this tagset via synchronized scrolling and line matching.
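Encoding all witnesses in one file means that the text of any single witness has to be reconstructed from the apparatus. The sketch below illustrates the idea; it is not the Versioning Machine's own code. It assumes P5 style app and rdg elements in which the wit attribute holds space-separated pointers to witnesses; the sample line and the function name witness_text are invented.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented verse line with two witnesses, A and B.
SAMPLE = (
    '<l xmlns="http://www.tei-c.org/ns/1.0">The '
    '<app><rdg wit="#A">quick</rdg><rdg wit="#B">swift</rdg></app> fox '
    '<app><rdg wit="#A #B">ran</rdg></app> home</l>'
)


def witness_text(elem, wit):
    """Rebuild the text of one witness from a line containing <app> entries."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI + "app":
            # Keep only the readings attested by the requested witness.
            for rdg in child.findall(TEI + "rdg"):
                if wit in rdg.get("wit", "").split():
                    parts.append(witness_text(rdg, wit))
        else:
            parts.append(witness_text(child, wit))
        # Text following the child belongs to every witness.
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(SAMPLE)
for wit in ("#A", "#B"):
    print(wit, witness_text(line, wit))
```

Displaying the witnesses side by side then amounts to running the same extraction once per witness and aligning the results line by line.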

This poster session will demonstrate the use of and applications of "The Versioning Machine".

Using the TEI gaiji module

Christian Wittern (Kyoto University)

The TEI working group on character encoding has developed a module that allows the representation of characters beyond those encoded in the Unicode standard, which is the character encoding standard used in XML and therefore also in TEI. This module addresses two types of problems:

for a character that has been specified in Unicode, the encoder wishes to describe specifically which glyph has been used to render this character, out of the many possibilities that are used to represent it
for a character that does not (yet?) exist in Unicode, the encoder needs to represent it somehow

In this poster, some examples are given to show the applications and usages of this module.
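How such an encoding might be consumed can be sketched as follows. The sketch assumes the P5 elements charDecl, char, mapping and g, with g pointing at a character declaration through its ref attribute; the identifier gaiji001, the character name, the mapping type value "standard" and both helper functions are assumptions made for illustration, and the header is trimmed to the part that matters here.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed, invented example: one declared character and two references,
# the second of which points at nothing.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <charDecl>
        <char xml:id="gaiji001">
          <charName>HYPOTHETICAL VARIANT FORM</charName>
          <mapping type="standard">&#x4E00;</mapping>
        </char>
      </charDecl>
    </encodingDesc>
  </teiHeader>
  <text><body><p>A<g ref="#gaiji001"/>B<g ref="#unknown"/>C</p></body></text>
</TEI>
"""


def fallback_table(root, mapping_type="standard"):
    """Map '#id' pointers to the Unicode fallback given in each <char>."""
    table = {}
    for decl in root.iter(TEI + "char"):
        for mapping in decl.findall(TEI + "mapping"):
            if mapping.get("type") == mapping_type:
                table["#" + decl.get(XML_ID)] = mapping.text or ""
    return table


def render(elem, table, missing="\ufffd"):
    """Flatten an element to text, replacing each <g> by its fallback."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI + "g":
            parts.append(table.get(child.get("ref"), missing))
        else:
            parts.append(render(child, table, missing))
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
paragraph = next(root.iter(TEI + "p"))
rendered = render(paragraph, fallback_table(root))
print(ascii(rendered))
```

A reference without a declaration is rendered as U+FFFD here; a real application might instead display an image of the glyph or report the error.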

The CBETA electronic Tripitaka: An example of a successful application of TEI to a large premodern Chinese text corpus

N.N. (Chinese Electronic Buddhist Text Association)

The Chinese Electronic Buddhist Text Association (CBETA) embarked in 1998 on an ambitious project to digitize the whole of the Chinese Buddhist Tripitaka. With the help of a small, engaged team and largely through efficient use of text-processing and markup technologies, within the last eight years a total of 100 volumes amounting to more than 120 million characters have been distributed free of charge over the Internet and on CD-ROM in a large variety of formats, ranging from print-ready PDF files to text files formatted to be read on hand-held devices or cell phones.

Considerable effort has been made not only to produce a highly accurate transcription of the source text, but also to correct numerous misprints and to add text-critical notes, making it a resource highly praised by scholars of Buddhism all over the world. Technically, a highly customized version of TEI P4 has been used internally, but more recently an experimental version based on TEI P5 has been made available. Taking advantage of the P5 gaiji module, this version now encodes, in a standardized and exchangeable form, all of the more than 8000 characters or variant forms that are used in these texts but not yet found in Unicode.

In this poster an application for reading and studying these texts will be showcased.

Navigating a Sea of Texts: Topic Maps and the Poetry of Algernon Charles Swinburne

John Walsh and Michelle Dalmau (Indiana University)

Topic Maps, including their XML representation, XML Topic Maps (XTM), are powerful and flexible metadata formats that have the potential to transform digital resource interfaces and support new discovery mechanisms for humanities data sources, such as large collections of TEI-encoded literary texts. Proponents of topic maps assert that topic map structures significantly improve information retrieval, but few user-based investigations have been conducted to uncover how humanities researchers and students truly benefit from the rich and flexible conceptual relationships that comprise topic maps.

The proposed poster will provide an introduction to Topic Maps and show how a collection of TEI-encoded literary texts, specifically the Swinburne Project http://swinburnearchive.indiana.edu, benefits from the use of topic maps. The poster will also provide an overview of the methodology used for the comparative usability study that was designed to assess the strengths and weaknesses of a topic map-driven interface versus a standard search interface. The interfaces that were presented to users will be demonstrated along with key findings from the usability study. Lastly, design alternatives based on the usability findings will also be presented.

The results of this study are intended to move the discussion of topic maps in the digital humanities beyond demonstrating the novel to providing evidence of the impact of Topic Maps and their extension of existing classificatory structures on the humanities researcher's discovery experience. We hope to provide those who are implementing topic maps or similar metadata structures in digital humanities resources with design recommendations that will ensure successful user interaction.

Untangling Āgama literature - A Digital Comparative Edition of the Bieyi za ahan jing

Marcus Bingenheimer (Chung-Hwa Institute of Buddhist Studies)

The Digital Comparative Edition of the Bieyi za ahan jing is a project undertaken by the Chung-hwa Institute for Buddhist Studies, Taipei (www.chibs.edu.tw) and funded by a three-year grant from the Chiang Ching-kuo Foundation for Scholarly Exchange 蔣經國基金會 (www.cckf.org/index-e.htm).

The Bieyi za ahan jing 別譯雜阿含經 (BZA), in 16 fascicles containing 364 sutras, belongs to the early Chinese Buddhist texts collectively called Ahan (Āgama) sutras 阿含經. Ahan literature constitutes the earliest stratum of Buddhist literature. The originals (in Buddhist Sanskrit) are largely lost; only a few fragments have survived. Apart from the Chinese tradition, only the Theravāda tradition has preserved a comprehensive set of these sutras, in Pāli. While the Nikāyas, as the Ahan sutras are called in that tradition, have been extensively studied and fully translated into English, Japanese and German, there are extremely few translations or critical editions of the Chinese Ahan sutras.

Generally, all of the 364 short sutras contained in the BZA have at least one parallel in Chinese and one Pāli parallel (with commentary). Often there are several parallels in Chinese and Pāli, and sometimes even a fragment in Buddhist Sanskrit has survived.

The aim of the project is to create a digital comparative edition of the BZA which clarifies these text-clusters. The edition will be freely available to the public. Moreover, we are working on an English translation of the BZA text. The text base for the Chinese is the CBETA edition; for the Pāli text, the Vipassana Research Institute has granted us permission to use the text of the Chaṭṭha Saṅgāyana CD.

The XML files are marked up according to the encoding scheme of the Text Encoding Initiative (TEI) and transformed into HTML for the user. The markup expresses the basic dialogic structure of the content, identifies names, differentiates between prose and verse parts, and connects them to the authoritative printed versions. For the Pāli and longer Chinese parallels the markup distinguishes between larger parallel and non-parallel passages.

The texts within a cluster are linked through a comparative catalog. If time allows, we will add phrase-level markup for better alignment of the parallels within a text-cluster. Middleware between the source files and the user application will be eXist, an XML database. The delivery system based on eXist is a first for Buddhist Studies as well as Humanities Computing in Taiwan. The end-user selects the cluster s/he wants to view online and can further select which of the texts in the cluster to display, provisionally in a three column layout.

The comparative digital edition:

enables the user to conveniently compare the different texts of a cluster
refines and expands the contents of the 364 clusters
adds a new punctuation to the BZA and the ZA sutras
provides an annotated English translation of selected sections of the BZA
enables statistical analysis by creating parallel corpora
is extensible and allows for further material to be added
serves as model for future projects that try to reorganize and represent the maze of Āgama literature

XXQ: a query language for XML corpora

Lou Burnard (Oxford University)

This poster will introduce XXQ -- the new XML query language currently under development for use with XAIRA (XML Aware Indexing and Retrieval Architecture). The poster will also introduce Xaira, of course, but the main focus will be on the idea of an engine-independent query language for XML text. We could perhaps have called it Xpath-plus, but the key features about it are

(a) it's not Xquery so it doesn't try to pretend XML structures are relations
(b) it's not Xpath so it doesn't restrict you to searching within a single hierarchy
(c) it's not grep so its atoms are lexical tokens rather than characters.

It is a pattern matching language with the expressive power of regular expressions, and comparable weaknesses (no look-ahead), but one which is represented by a simple XML vocabulary.
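The difference between character atoms and lexical-token atoms can be illustrated with a small sketch. This is not XXQ and does not reproduce its XML vocabulary, which the abstract does not show; the pattern notation (a list whose items are an exact token, a set of alternatives, or None for any token) and the function names are invented for illustration.

```python
import re


def tokenize(text):
    """Split text into lower-cased word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def atom_matches(atom, token):
    """An atom is None (any token), a set of alternatives, or an exact token."""
    if atom is None:
        return True
    if isinstance(atom, (set, frozenset)):
        return token in atom
    return atom == token


def find_matches(tokens, pattern):
    """Return (position, matched tokens) for every place the pattern fits."""
    hits = []
    width = len(pattern)
    for i in range(len(tokens) - width + 1):
        window = tokens[i:i + width]
        if all(atom_matches(a, t) for a, t in zip(pattern, window)):
            hits.append((i, window))
    return hits


tokens = tokenize("The cat sat on the mat; the dog sat on the rug.")

# Token-level pattern: "the", then "cat" or "dog", then "sat".
print(find_matches(tokens, ["the", {"cat", "dog"}, "sat"]))

# The token "at" never occurs, although the characters "at" occur inside
# "cat", "sat" and "mat"; a character-based search would report those.
print(find_matches(tokens, ["at"]))
```

Working on tokens means that a query for a word cannot accidentally match part of a longer word, which is the contrast with grep drawn in point (c).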

Markup problems: Syntactical analysis and steps to their resolution

OHYA Kazushi (Tsurumi University)

One of the difficulties in applying markup is often perceived as a result of the syntax of markup languages. In this poster, I will present some of the frequently encountered problems. This is intended primarily for those who are beginning their work with markup and intend to prepare data according to the TEI Guidelines.

Japanese translation project of the TEI Guidelines

OHYA Kazushi (Tsurumi University)

Christian Wittern (Kyoto University)

In order to provide support for the encoding of texts in Japanese, a project to translate the TEI Guidelines into Japanese has been started. At the moment, a rough draft version of large parts of the P4 version has been prepared. This will be enhanced and refined, with the ultimate goal of preparing a Japanese version of P5 as soon as that becomes stable enough to be translated. This poster will present the current state of the work and hopes to attract more collaborators to this project.

Markup of the "Comprehensive Mirror for Aid in Government"

NAKADATE Hamana (Kyoto University)

The research group responsible for the "Construction of a knowledgebase of Chinese-character-based documents" within the 21st Century COE program at Kyoto University "Toward an Overall Inheritance and Development of Kanji Culture" has commenced work on markup of the section pertaining to the Tang period (618-906) of the "Comprehensive Mirror for Aid in Government" (Zizhi tongjian), a well-known history compiled by Sima Guang in the mid-eleventh century. The aim of this work is to annotate names of persons, places and works, to extract and expand this information into a comprehensive networked resource, and thus to combine traditional textual and historical scholarship with the digital technology of the 21st century and lay a basis for new developments within East Asian Studies. In this poster, the state of the work, some examples, problems and possibilities will be shown.