Textual Markup and the Study of Literature

This document is in its first draft form. Substantial changes to its code and content are anticipated.

What's in a Text?
Textual Markup as a Scholarly Activity
Types of Markup
The History of Textual Markup
XML (Extensible Markup Language)
From XML to (X)HTML
Why Use XML?
XML and the TEI Schema
Encoding the Literary Text
Further Reading

What’s in a Text?

Where is the meaning of a literary text to be found?

In the physical appearance of the letters?
In the original material form from which all copies derive?
In the intent of the author?
In the interpretation of the reader?

There are many facets of textual meaning. Some are expressed (though not always explicitly) through layout, structure, and content. Others are interpreted meanings. Textual markup makes the meanings explicit so that they can be processed reliably, either by computer algorithms or by scholars working with the text. Here is an example of textual markup for The Taming of the Shrew:

<title rend="italic">The Taming of the Shrew</title>

The title text goes inside a pair of <title> tags (which together define an element). The "rend" (i.e. render) portion is called an attribute, and it indicates that the element should be displayed in italics. We now know what the text is and how it should be displayed. Whilst this may seem obvious to a human reader, it is not obvious to a computer. Furthermore, the same techniques can be used to encode much more sophisticated information about larger texts, which can aid in the exploration of literary meaning.
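
To see why explicit markup helps a computer, here is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module, that reads the title example and reports what kind of element it is, what attributes it carries, and what text it contains. The variable names are chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# The markup from the example above, held in a string.
markup = '<title rend="italic">The Taming of the Shrew</title>'

element = ET.fromstring(markup)

print(element.tag)     # the element name
print(element.attrib)  # a dictionary of the element's attributes
print(element.text)    # the text between the opening and closing tags
```

The program never has to guess that the words are a title; the markup says so.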

Textual Markup as a Scholarly Activity

Encoding textual markup requires editorial and interpretive decisions. Markup can help answer research questions, and deciding what markup is needed can be a research activity in itself. Detailed document analysis is needed before encoding if the resulting markup is to be useful. You must ask which features to mark up, why you are choosing to mark up these features, and how consistently you will be able to do so. The process involves interpretation of the text, and the product is a tool for further interpretation.

Types of Markup

Procedural markup specifies how content should be processed. It is generally concerned with how the content should appear, not with its meaning. It is generally used only when content will serve a single purpose. The typesetting codes used by editors are a typical example of procedural markup.
Descriptive markup identifies the logical components of a text. It is generally concerned with the text’s meaning, rather than appearance, and so is readable by humans and machines. It does not tie the document to a single purpose.

Descriptive markup allows us to make explicit distinctions in the text in a formal way. It helps identify what aspects of the text are, rather than what they look like. In the example above, the <title> element is descriptive, but the "rend" attribute is procedural since it only refers to appearance. Text in italics does not necessarily have to be a title.

The History of Textual Markup

The most important early markup system is the Standard Generalized Markup Language (SGML), which grew out of work begun in the 1960s and was published as an international standard in 1986. It was intended for the sharing of information in government, law, and industry. It is fantastically complicated, and most of the time only scaled-down versions of it were used. One of the most important of these is the Hypertext Markup Language (HTML), created around 1990 for displaying texts on the World Wide Web. As web browsers became more sophisticated, HTML was developed along increasingly procedural lines to increase the possibilities for the display of information in web pages. However, the growing use of the web created a greater and greater need for descriptive markup, so, in the late 1990s, a simplified subset of SGML, the Extensible Markup Language (XML), was developed to address this need. The coding rules for XML were stricter than those of HTML, and, in order to create greater compatibility between the two, HTML was reformulated to follow the rules of XML in 2000. This variety of HTML is known as XHTML, and it has been widely adopted for coding web pages. XML coding can easily be transformed into XHTML. Many of the international standards and recommendations for coding digital texts on the web, including HTML, XML, and XHTML, are maintained by the World Wide Web Consortium (W3C).

XML (Extensible Markup Language)

XML is a language for document markup that was designed specifically for the web. A document’s content is divided up into descriptive elements which form a hierarchical tree (a single root element containing nested child elements). XML looks very similar to HTML, except that it must be well-formed (follow strict coding rules) and it is extensible (not limited to a fixed set of elements). A basic XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<message priority="high">Hello!</message>
</root>

The "priority" part of the element is called an attribute. The first line is called the XML declaration because it tells the reader/computer that this is an XML document. Its attributes include the version of XML and the character encoding system (UTF-8 is suitable for many languages).
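
The requirement that XML be well-formed can be checked mechanically. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser; the function name is_well_formed and the deliberately broken sample are inventions for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = '<root><message priority="high">Hello!</message></root>'
# Two mistakes: the attribute value is unquoted and <message> is never closed.
bad = '<root><message priority=high>Hello!</root>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

An HTML browser would probably tolerate mistakes like these, whereas an XML parser refuses the document outright.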

Notice that the element tells us that "Hello!" is a message. It does not tell us anything about how the text should look. We can use a separate stylesheet to style elements however we want (e.g. normal, but bold if the value of the priority attribute is "high"). Style and content are kept separate.
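
The separation of style and content can be sketched in a few lines of Python. This is only an illustration: the STYLES dictionary stands in for a real stylesheet, and its name, like style_for and DEFAULT_STYLE, is hypothetical.

```python
import xml.etree.ElementTree as ET

document = '<root><message priority="high">Hello!</message></root>'

# Hypothetical "stylesheet": maps a priority value to a display style.
# It lives apart from the document and can be swapped without touching it.
STYLES = {"high": "bold"}
DEFAULT_STYLE = "normal"

def style_for(message):
    """Pick a display style from the element's priority attribute."""
    return STYLES.get(message.get("priority"), DEFAULT_STYLE)

root = ET.fromstring(document)
for message in root.iter("message"):
    print(message.text, "->", style_for(message))
```

Changing how high priority messages look means editing STYLES only; the XML document itself stays as it is.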

From XML to (X)HTML

Web pages are marked up in HTML (or XHTML, the version that follows the rules of XML). Browsers are programmed to display (X)HTML in specific ways. That is, they transform the marked-up content so that it will have a pre-determined appearance. Since (X)HTML was not originally intended for this purpose, it is not very good for supplying style. So it may be supplemented by a more powerful styling language such as CSS (Cascading Style Sheets). CSS stylesheets provide the browser with further styling instructions.

An XML document has no inherent style properties. Browsers can apply CSS to XML directly, but the more common and flexible approach is to transform the XML elements into (X)HTML elements so that they appear in a web page in the usual way. A set of transformation rules is needed to do this. One of the most common methods is XSLT (Extensible Stylesheet Language Transformations). This is a language for selecting XML elements and writing them into other documents, which may include (X)HTML and CSS code. A document in this language is, perhaps deceptively, called an XSLT stylesheet. The language is complicated, but here is what an XSLT stylesheet might do.

Send to the browser the basic XHTML code for a web page.
Inside the XHTML code write a CSS stylesheet which specifies that elements in the "high" class will be in bold.
Find all <message> elements in the XML document above and insert their content inside XHTML <p> (paragraph) elements.
Insert elements with the priority attribute set to "high" inside <p class="high">.

This will produce a web page that says Hello! in bold.
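
The steps listed above can be imitated in Python. This sketch is not XSLT; it is a stand-in that performs the same kind of transformation with the standard library's xml.etree.ElementTree module, so that the logic can be followed without learning XSLT first. The function name to_xhtml and the page title "Messages" are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

SOURCE = '<root><message priority="high">Hello!</message></root>'

def to_xhtml(xml_text):
    """Build an XHTML page from the <message> elements in xml_text."""
    source = ET.fromstring(xml_text)

    # Step 1: the basic XHTML skeleton for a web page.
    html = ET.Element("html", xmlns="http://www.w3.org/1999/xhtml")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = "Messages"

    # Step 2: an embedded CSS rule that makes the "high" class bold.
    style = ET.SubElement(head, "style", type="text/css")
    style.text = "p.high { font-weight: bold; }"

    body = ET.SubElement(html, "body")

    # Steps 3 and 4: copy each message into a <p>, adding class="high"
    # when its priority attribute is "high".
    for message in source.iter("message"):
        paragraph = ET.SubElement(body, "p")
        if message.get("priority") == "high":
            paragraph.set("class", "high")
        paragraph.text = message.text

    return ET.tostring(html, encoding="unicode")

print(to_xhtml(SOURCE))
```

A real XSLT stylesheet would express the same steps declaratively, as templates matched against elements, rather than as a loop.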

Why Use XML?

So why use XML? Why not just directly create a web page using (X)HTML and CSS? There are a large number of reasons, and the ones given here are only examples.

The boldface style of Hello! on your screen does not necessarily indicate that this item is a message or that it is high priority. If you did a word search of the document you could find the word "hello", but you couldn’t find a high priority message.
The XML document is not tied to the single medium of the web browser screen. One could design a program to read the document aloud and have an "audio stylesheet" increase the volume of high priority elements.
The example above does not demonstrate the powerful ways in which XSLT stylesheets can transform the text. For instance, an XSLT stylesheet could go through a document finding only messages by a single author and insert only those in the output document.
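
Two of the points above, finding messages by priority rather than by wording and selecting only the messages of one author, can be sketched in Python. The "author" attribute, the author names, and the two extra messages are invented for this illustration; they are not part of the example document given earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the "author" attribute and the extra messages
# are invented for this illustration.
DOCUMENT = """
<root>
  <message priority="high" author="Kate">Hello!</message>
  <message priority="low" author="Petruchio">Good morrow.</message>
  <message priority="high" author="Petruchio">Come at once.</message>
</root>
"""

root = ET.fromstring(DOCUMENT)

# Find every high priority message, whatever its wording.
high_priority = [m.text for m in root.findall(".//message[@priority='high']")]

# Keep only the messages by a single author.
by_petruchio = [m.text for m in root.findall(".//message[@author='Petruchio']")]

print(high_priority)
print(by_petruchio)
```

Neither search would be possible on the displayed web page, where the only trace of priority is boldface and the author is not recorded at all.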

In short, an XML document can be put to more uses than an XHTML document. It is not restricted to the single platform of a web browser. Even in the web environment, it is a great deal more flexible for the conveying and manipulating of textual meaning.

XML and the TEI Schema

XML documents usually follow a schema, a pre-determined set of elements and attributes. This is the case for (X)HTML, where the schema has been defined by the W3C. However, with an XML document, you can produce your own schema, one that contains the elements and attributes relevant for structuring and analyzing your document. You are not bound to use the ones devised for displaying documents in web browsers. This is important, since different kinds of texts will require different sets of elements. For instance, you can see a portion of President Obama's Change agenda here, and this is what the XML code looks like.

The rules for creating schemas are complex. One shortcut is to use a pre-made schema and then modify it. You could do this with the Obama Change agenda schema, but numerous scholars have already put their minds to creating schemas for literary texts. The schema which is fast becoming the standard for literary research is that of the TEI (Text Encoding Initiative). This schema was created specifically for the study of literary documents, especially by scholars working in the humanities. The benefit of using the TEI schema is that your document will be readable by a wide variety of applications which can process TEI-encoded documents.

Using the TEI schema is as simple as stating that you are using it in your XML document and then following the coding guidelines in the TEI documentation. Unfortunately, the guidelines are very complex (there are over 300 TEI elements). That said, a simple TEI document is not too hard to produce.
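
As a rough idea of what a simple TEI document involves, here is a sketch in Python that holds a very small TEI document in a string and reads its title back. It assumes the P5 version of TEI, in which documents declare the namespace http://www.tei-c.org/ns/1.0 on the root element; the wording inside the elements is placeholder text. Note that the parser used here only checks that the document is well-formed, not that it is valid against the TEI schema.

```python
import xml.etree.ElementTree as ET

# A very small TEI document. The namespace on the root element is how
# the document states that it uses TEI (P5 assumed); the text inside
# the elements is placeholder wording for this illustration.
TEI_DOCUMENT = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Taming of the Shrew</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished teaching example.</p>
      </publicationStmt>
      <sourceDesc>
        <p>No source; written for illustration.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The text of the play would go here.</p>
    </body>
  </text>
</TEI>
"""

NAMESPACES = {"tei": "http://www.tei-c.org/ns/1.0"}

root = ET.fromstring(TEI_DOCUMENT)
title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", NAMESPACES)
print(title.text)
```

The header describes the document (its title, publication, and source), while the text element holds the encoded work itself.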

Applying an XSLT stylesheet to your document is also simple. Underneath the XML declaration, you include a link to your stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="path_to_stylesheet.xsl" type="text/xsl"?>

The suffix .xsl is used for XSLT stylesheets.

Writing the stylesheet is another matter. XSLT is a complex language which is much harder to learn than other markup languages. Therefore, it is recommended that considerations of how the document will appear in output be left to the end of a project.

Encoding the Literary Text

We thus return to the questions asked at the beginning of this document. How do we find meaning in a literary text, and how can we use textual markup to encode this meaning? The TEI schema provides some useful guidance and also allows us to supplement the schema if it does not contain elements or attributes we might need. Designing a schema appropriate for a literary text is a literary research project in and of itself. Thinking about how we would encode the meaning of the text is a valuable way for us to understand that meaning and how it is conveyed.
