Textual Markup and the Study of Literature

  1. What's in a Text?
  2. Types of Markup
  3. Textual Markup as a Scholarly Activity
  4. XML (Extensible Markup Language)
  5. From XML to (X)HTML
  6. Why Use XML?
  7. XML and the TEI Schema
  8. Encoding the Literary Text
  9. Further Reading

What’s in a Text?

Where is the meaning of a literary text to be found?

  • In the physical appearance of the letters?
  • In the original material form from which all copies derive?
  • In the intent of the author?
  • In the interpretation of the reader?

There are many facets of textual meaning. Some are expressed through layout, structure, and content. Others are interpreted meanings. Textual markup makes the meanings explicit so that they can be processed reliably, either by computer algorithms or by scholars working with the text. This document provides a gentle (and mostly theoretical) introduction to the methods of using textual markup for the exploration of literary meaning.

Types of Markup

  • Procedural markup specifies how content should be processed. It is generally concerned with how the content should appear, not with what it means, and it is typically used only when the content will serve a single purpose.
  • Descriptive markup identifies the logical components of a text. It is generally concerned with the text’s meaning, rather than appearance, and so is readable by humans and machines. It does not tie the document to a single purpose.

Descriptive markup allows us to make explicit distinctions in the text in a formal way. It helps identify what aspects of the text are, rather than what they look like.
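
As a rough illustration of the difference, the fragment below marks up the same word twice, using the angle-bracket tag notation explained later in this document. The wrapper elements <examples> and <example> are hypothetical names chosen for this sketch, and <message> anticipates the sample file in the XML section.

```xml
<examples>
  <!-- Procedural: records only how the word should look (bold italic). -->
  <example><b><i>Hello!</i></b></example>

  <!-- Descriptive: records what the word is (a high-priority message). -->
  <example><message priority="high">Hello!</message></example>
</examples>
```

The first version can only ever be displayed in bold italic. The second can be displayed in any style, searched, counted, or read aloud, because it says what the word is rather than how it looks.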

Textual Markup as a Scholarly Activity

Encoding textual markup requires editorial and interpretive decisions. Markup can help answer research questions, and deciding what markup is needed can be a research activity in itself. Detailed document analysis is needed before encoding if the resulting markup is to be useful. You must ask which features to mark up, why you are choosing to mark up these features, and how consistently you will be able to do so.

XML (Extensible Markup Language)

XML is a language for document markup that was designed specifically for the web. A document’s content is divided into descriptive elements which form a hierarchical tree (a single root and many nodes). XML looks very similar to HTML, except that it must be well-formed (follow strict coding rules) and it is extensible (not limited to a small set of elements). A basic XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<message priority="high">Hello!</message>
</root>

The "priority" part of the element is called an attribute. The first line is called the XML declaration because it tells the reader or computer that this is an XML document. Its attributes include the version of XML and the character encoding system (UTF-8 is suitable for many languages).

Notice that the element tells us that "Hello!" is a message. It does not tell us anything about how the text should look. We can use a separate stylesheet to style elements however we want (e.g. normal, but bold if the value of the priority attribute is "high"). Style and content are kept separate.

From XML to (X)HTML

Web pages are marked up in HTML (or XHTML, the version that follows the rules of XML). Browsers are programmed to display (X)HTML in specific ways; that is, they transform the marked-up content so that it has a pre-determined appearance. Since (X)HTML was not originally intended for this purpose, it is not very good at supplying style, so it may be supplemented by a more powerful styling language such as CSS (Cascading Style Sheets), which provides the browser with further styling instructions.

An XML document has no inherent style properties. Although browsers can apply CSS to XML directly, CSS alone cannot reorganize or filter the content, so the XML elements are usually first transformed into (X)HTML elements in order to appear in a web page. Doing this requires a fairly complex set of instructions. One of the most common ways of writing them is XSLT (Extensible Stylesheet Language Transformations), a language for selecting XML elements and writing them into other documents, which may include (X)HTML and CSS code. A document in this language is called, perhaps misleadingly, an XSLT stylesheet. The language is complicated, but here is what an XSLT stylesheet might do:

  1. Send the browser the basic XHTML code for a web page.
  2. Inside the XHTML code, write a CSS stylesheet which specifies that elements in the "high" class will be in bold.
  3. Find all <message> elements in the XML document above and insert their content inside XHTML <p> elements.
  4. Insert the content of elements with the priority attribute set to "high" inside <p class="high"> instead.

This will produce a web page that says "Hello!" in bold.
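
The four steps above could be written roughly as follows. This is a minimal sketch in XSLT 1.0 rather than a definitive stylesheet: it assumes the sample file with its <root> and <message> elements, it uses "high" as the CSS class name, and the page title "Messages" is invented for the illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">

  <xsl:output method="xml" indent="yes"/>

  <!-- Steps 1 and 2: write the XHTML page skeleton, with a CSS rule
       that makes paragraphs in the "high" class bold. -->
  <xsl:template match="/">
    <html>
      <head>
        <title>Messages</title>
        <style type="text/css">p.high { font-weight: bold; }</style>
      </head>
      <body>
        <xsl:apply-templates select="root/message"/>
      </body>
    </html>
  </xsl:template>

  <!-- Step 3: an ordinary message becomes a plain paragraph. -->
  <xsl:template match="message">
    <p><xsl:value-of select="."/></p>
  </xsl:template>

  <!-- Step 4: a high-priority message becomes a paragraph in the "high" class.
       This more specific pattern takes precedence over the one above. -->
  <xsl:template match="message[@priority='high']">
    <p class="high"><xsl:value-of select="."/></p>
  </xsl:template>

</xsl:stylesheet>
```

Note that the stylesheet never mentions the word "Hello!". It describes only what to do with messages, so the same stylesheet would work for a file containing hundreds of them.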

Why Use XML?

So why use XML? Why not just create a web page directly using (X)HTML and CSS? There are many reasons, and the ones given here are only examples.

  1. The boldface style of Hello! on your screen does not necessarily indicate that this item is a message or that it is high priority. If you did a word search of the document you could find the word "hello", but you couldn’t find a high-priority message.
  2. The XML document is not tied to the single medium of the web browser screen. One could design a program to read the document aloud and have an "audio stylesheet" increase the volume of high-priority elements.
  3. The example above does not demonstrate the powerful ways in which XSLT stylesheets can transform the text. For instance, an XSLT stylesheet could go through a document, find only the messages by a single author, and insert only those into the output XHTML.
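
The third point might look something like the sketch below. It assumes that each <message> element carries an author attribute, which the sample file in this document does not have; both the attribute and the value "Chaucer" are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">

  <xsl:template match="/">
    <html>
      <head>
        <title>Messages by one author</title>
      </head>
      <body>
        <!-- The condition in square brackets keeps only the messages
             whose (hypothetical) author attribute is "Chaucer". -->
        <xsl:for-each select="root/message[@author='Chaucer']">
          <p><xsl:value-of select="."/></p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

The same kind of bracketed condition answers the first point as well: the expression root/message[@priority='high'] selects exactly the high-priority messages that a plain word search cannot find.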

In short, an XML document can be put to more uses than an XHTML document. It is not restricted to the single platform of a web browser. Even in the web environment, it is a great deal more flexible for conveying and manipulating textual meaning.

XML and the TEI Schema

XML documents are usually written to follow a schema, a pre-determined set of elements and attributes. (X)HTML has such a schema built in; with an XML document, however, you can produce your own schema, one that contains the elements and attributes relevant for structuring and analyzing your document. You are not bound to use the ones devised for displaying documents in web browsers.

The rules for creating schemas are complex. One shortcut is to use a pre-made schema and then modify it. The one most suitable for the study of literary texts is the TEI (Text Encoding Initiative) schema. This schema was created specifically for the study of literary documents, especially by scholars working in the humanities. The benefit of using the TEI schema is that your document will be readable by a wide variety of applications which can process TEI-encoded documents.

Using the TEI schema is as simple as stating that you are using it in your XML document and then following the coding guidelines in the TEI documentation. Unfortunately, the guidelines are very complex (there are over 300 TEI elements). That said, a simple TEI document is not too hard to produce.
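
A simple TEI document might look like the sketch below. It assumes the TEI P5 version of the guidelines, in which the namespace on the root <TEI> element is what states that the document uses TEI; checking the file against the schema is normally set up separately in your XML editor. The choice of poem and the wording of the header are illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- The header describes the electronic file and its source. -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Tyger</title>
        <author>William Blake</author>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished teaching example.</p>
      </publicationStmt>
      <sourceDesc>
        <p>First stanza of the poem from Songs of Experience (1794).</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- The text itself: one line group (a stanza) made up of numbered lines. -->
  <text>
    <body>
      <lg type="stanza">
        <l n="1">Tyger Tyger, burning bright,</l>
        <l n="2">In the forests of the night;</l>
        <l n="3">What immortal hand or eye,</l>
        <l n="4">Could frame thy fearful symmetry?</l>
      </lg>
    </body>
  </text>
</TEI>
```

Even this small example involves editorial decisions: calling the line group a "stanza" and numbering the lines are choices made by the encoder, not facts supplied by the schema.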

Applying an XSLT stylesheet to your document is also simple. Underneath the XML declaration, you include a link to your stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="path_to_stylesheet.xsl" type="text/xsl"?>

The suffix .xsl is used for XSLT stylesheets.

Writing the stylesheet is another matter. XSLT is a complex language, much harder to learn than markup languages such as XML or (X)HTML. It is therefore recommended that decisions about how the document will appear in output be left to the end of a project.

Encoding the Literary Text

We thus return to the questions asked at the beginning of this document. How do we find meaning in a literary text, and how can we use textual markup to encode this meaning? The TEI schema provides some useful guidance, and the next portion of this document will explore what its guidelines have to offer. We can also, of course, supplement the TEI schema if necessary, and we should be thinking about elements or attributes we might need but which are not specified by the TEI.

Further Reading

This introduction is heavily indebted to the series of tutorials put together by James Cummings for the Man of Law's Tale Project workshop at Adam Mickiewicz University, Poznan, Poland. Following this link will lead you to other resources from the TEI @ Oxford. More complete information on the TEI can be found on the main Text Encoding Initiative web site.
