Introducing OpenArabic mARkdown

2 minute read :: Posted on November 8, 2015

TEI XML has long become the standard for tagging humanistic texts for research purposes. It is the standard in most digital libraries (including the Perseus Digital Library). Having texts in a TEI XML format that conforms to the standards of a long-standing library allows one to take advantage of libraries’ infrastructure and analytical tools that have been developed since the appearance of TEI XML. Converting texts into XML, however, is a rather long and complicated process.

Texts in Arabic make things even more complicated. Right-to-left (RTL) and left-to-right (LTR) text in one file is one the major challenges. Since the cursor changes the direction of its movement when crossing the boundary between RTL and LTR text, it is difficult to place the cursor properly, and one often ends up changing a wrong part of the text. The direction of paired characters is visually confusing, and it is often next to impossible to say whether a given angle bracket—perhaps the most important XML character—is an opening character or a closing one. Moreover, the shapes of Arabic letters in a text file are dynamically changing as one types or edits Arabic text, and many text editors do not handle this properly (particularly on Mac). In addition to these technical challenges, there are too many Arabic texts to convert—and most of them are multivolume titles—and too few people who have both training and willingness to do that.

In the beginning of my digital research I have considered TEI XML as a working format, but I had to give up on this option, since converting a 50-volume book (~3,4 million words) would have taken forever. After reviewing existing approaches, I came up with a rather simple tagging system that allowed me to create a structured, machine-readable text, without sacrificing years of my life. In many ways, this system was inspired by markdown—“a text-to-HTML conversion tool … that allows [one] to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML).”

The main goal of mARkdown is to provide a simple system for tagging structural elements in Arabic texts that would facilitate algorithmic analysis in the same way as more complex TEI XML does. In principle, mARkdown does not require any special editor, but my current workflow relies on EditPad Pro, which supports right-to-left languages, Unicode, and large files. However, it is the support of custom highlighting and navigation schemes that makes this text editor particularly convenient for mARkdown.

Since I have been using my mARkdown for my own research purposes, it has not yet been developed into an easily reusable system. This is my first attempt to provide a detailed description and explain how it can be used. I expect that mARkdown will undergo some minor changes in the upcoming months. The most recent description can be accessed from the main menu above.

mARkdown in EditPad Pro activated with the “magic value” in test_textFile

Leave a Comment