Big Picture View on TEI and Character Encoding

**reissued with corrections on pre-defined characters in XML**

These resources will help us think about the “big picture” of what XML and TEI coding are for and why we use them, and I hope you will find them illuminating for our work on the Mitford project. Understanding why we’re coding as we are will really help us to explain our methodologies for the Digital Mitford in our various scholarly and even non-scholarly contexts.

1. From the TEI By Example project: This is a series of tutorials, some of which we may go over together in our workshop. Their Introduction to Text Encoding and the TEI is pithy and educational, explaining the concepts of markup and meta-language in practical contexts. (And it’s not very long, either!)

2.  Unicode standard, fonts, and special characters in the TEI:

See the TEI P5 chapter on Languages and Character Sets: There’s a lot of detail here, but the chapter tells you something of the cultural contexts of the coding and character sets we work with. Much of this may not have an immediate practical bearing on our project, but when it does, we’ll know it, as soon as we try to encode a strange postal mark! The chapter might also show us how lucky we are: Were we working on more ancient texts, we might find ourselves needing to declare alternative character sets! Speaking of which, there are a few things we need to know about Unicode:

3. Unicode and “predefined entities” in XML:

There are five (count ’em, 5!) keyboard characters that are reserved for XML processors. When we’re coding in XML and we type one of them, an “&” for instance, XML processors “read” it as markup to be processed in a particular way rather than as literal text, and this is a very common source of errors and problems with XML coding. To deal with these characters, we find and replace them with their predefined entity equivalents, and we’ll find ourselves doing quite a LOT of this in our project as we code our texts:

<   &lt;

>   &gt;

&   &amp;

"   &quot;    (We only have to worry about this when it’s inside an element’s attribute value, like this: <element attribute=" "quote" "> This really shouldn’t happen in our project!)

'   &apos;    (see above on the quotation mark)

When we need to “escape” these literal characters, we use their predefined entities, written in a special format that begins with the “&” character and ends with a semicolon. <oXygen/> will give us a convenient menu whenever we type the “&” because it’s programmed to “know” the escape routes we need.
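For anyone curious what this escaping looks like outside of <oXygen/>, here is a minimal sketch in Python, using only its standard library. The sample sentence is made up for illustration (it is not from a Mitford letter), and `is_well_formed` is a hypothetical helper name introduced here, not part of any library. This is just one way to see the problem and the fix side by side, not a description of our project workflow.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample text (an assumption for illustration only).
raw = "Rose & Crown, price < 2s"

# escape() replaces &, < and > with &amp;, &lt; and &gt;
escaped = escape(raw)

# Passing an extra mapping also escapes quotation marks and apostrophes,
# which matters only inside attribute values.
attr_safe = escape('the "Rose" inn', {'"': "&quot;", "'": "&apos;"})


def is_well_formed(xml_string):
    """Return True if an XML parser accepts the string, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


# The raw "&" and "<" break the markup; the escaped version parses cleanly.
print(is_well_formed("<p>" + raw + "</p>"))
print(is_well_formed("<p>" + escaped + "</p>"))

# When the parser reads the escaped text, it hands back the literal characters.
print(ET.fromstring("<p>" + escaped + "</p>").text)
```

The last line is the reassuring part: the entities live only in the XML file, and any XML processor reading the file gives us back the literal characters we started with.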

I am posting this here because I want to encourage us to think about what we’re doing as we code: this is not exactly a literal reproduction of Mitford’s texts, but rather a machine-readable surrogate!

Addenda:  1. For your amusement: some wretched poetry written on the occasion of Unicode turning 20 on Sept. 10, 2008!

2. For your edification, especially those of you on my team who study the materiality of texts and fonts. This is more detail on the principles of Unicode.


One comment

  1. Reblogged this on Confessions of a Digital Romanticist and commented:
    I just assembled this on my Digital Mitford blog for our project team, but perhaps it’s of broader interest for those of us who “geek out” over the materiality of text…and code.

