Metrical annotation for a verse treebank

T. M. Rainsford¹ and Olga Scrivner²

¹ Institut für Linguistik/Romanistik, Universität Stuttgart
² Indiana University

E-mail: ¹ tmr740-ac@yahoo.co.uk
² obscrivn@indiana.edu

Abstract

We present a methodology for enriching treebanks containing verse texts
with metrical annotation, and introduce a pilot corpus containing one Old
Occitan text. Metrical annotation is based on syllable tokens, and is generated
semi-automatically using two algorithms, one to divide word tokens into syl-
lables, and a second to mark the position of each syllable in the line. Syn-
tactic and metrical annotation is combined in a single multi-layered ANNIS
corpus. Three initial findings based on the pilot corpus illustrate the close
relation between syntactic and metrical structure, and hence the value of en-
riching treebanks in this way.

1 Introduction

The goal of the project presented here is to develop a methodology for enriching
treebanks containing verse texts with detailed metrical annotation. The earliest
texts preserved for many European languages, in this case Occitan, are frequently
in verse, and it is therefore desirable when analysing these texts to take into con-
sideration any possible effect of the verse form on the syntax.

In the present paper, we outline the methodology we have adopted in producing
a small pilot treebank containing a 10th-century Occitan verse text, a fragment
from a verse adaptation of Boethius’ De Consolatione Philosophiæ (henceforth
Boeci). The pilot corpus is available online at www.oldoccitancorpus.org.
In working on the treebank aspects of the corpus, we have built on the work carried
out for the Old Occitan Flamenca text by Scrivner, Kübler, Vance and Beuerlein
[15].



2 Background

2.1 Why enrich treebanks with metrical annotation?

There exists a consensus among linguists that the syntax of verse texts differs from
that of prose, with unusual word orders adopted to fit the constraints of the metre.
For example, in introducing a study of Early Old French (12th-century) syntax,
Labelle [11] feels obliged to acknowledge that “the disadvantage of concentrating
on this period of time is that the available texts are in verse, and we might expect
freer word order, with probably more scrambling to accommodate the rhyme.” This
already difficult task is made harder by the fact that data extracted from
modern treebanks (e.g. the MCVF Old French treebank, Martineau et al. [12]) do
not contain any metrical information. Without referring back to the source
edition, it is not possible to establish, for example, whether a particularly
unusual word order may have been adopted to place a rhyming word at the end of
the line. This problem of information loss can be resolved by adding metrical
annotation to a treebank. Furthermore, such annotation allows researchers to
write combined syntactic and metrical queries, placing them in a position to
demonstrate whether specific metrical constraints, especially at the end of
the line (the rhyme) and at the half-line boundary (the cæsura), are in fact
associated with unusual syntactic structures.

2.2 What information should be included in metrical annotation?

Corpora containing metrical annotation are relatively rare (see section 2.3 below),
and there is little consensus regarding which metrical and/or prosodic features
should be encoded. Indeed, it would not even be desirable for all metrical cor-
pora to contain the same information, since different versification systems exploit
different aspects of linguistic structure (e.g. the distinction between light and heavy
syllables is fundamental in classical Latin verse, but irrelevant for versification in
modern Romance languages). However, every metrical annotation system must
take account at some level of the defining unit of a verse text: the line.¹ Beyond
the line, the annotation scheme may choose to mark:

• Segments bigger than the line, e.g. stanza, poem

• Segments smaller than the line, e.g. half-line, foot, syllable, mora

• Line-linking phenomena, e.g. rhyme, assonance, alliteration

The metre of Boeci is typical of Old Occitan (and Old French) epic texts. The
poem is written in lines of ten counted syllables, divided regularly into two half-
lines by a cæsura between the fourth and fifth counted syllables (stressed syllables
are underlined):

¹ The division of a verse text into such “correlatable and commensurable segments” is considered
a defining feature by metricists, cf. Gasparov [8], p. 1.



(1)  1    2    3    4       /  5      6     7    8    9    10
     Nos  jo-  ve   om- ne  /  quan-  dius  que  nos  es-  tam

     ‘We young men, when young we are [...]’ (l. 1)

The fourth and tenth counted syllables must bear a lexical stress (e.g. om-, -tam),
while post-tonic syllables at the end of the first half-line are not counted (e.g. -ne).
Lines are linked into laisses of irregular length by assonance: a simple form of
rhyme, in which the final stressed vowels of lines, but not necessarily preceding
or following consonants, must be similar. For instance, the first laisse of the poem
contains lines ending with an /a/ vowel (estam : parllam : esperam : annam : fam :
clamam); the third with an /o/ vowel (fello : pejor : quastiazo, etc.). Therefore,
in order to describe the metrical structure of the poem completely, the annotation
scheme should mark both properties of the laisse and those of the syllable in addi-
tion to the line.

It should be noted that multi-layered annotation is not necessary to encode this
kind of information. For example, a major corpus of historical Dutch song, the
Nederlandse Liederenbank², does not annotate stanzas, lines or stressed syllables
explicitly. Instead, metrical properties are given by a complex “stanza form” tag
which is included in the metadata for each text. For instance, the metre of the
text with incipit Doersocht en bekent hebt ghi / Mi Heer mijn sitten mijn opstaen
is given as 3A 4B 3A 4B 3C 4D 3C 4D: eight lines, rhyming ABABCDCD,
containing alternately three and four stressed syllables. However, an approach of this
kind has clear drawbacks when metrical annotation is to be combined with other
annotation layers, since it provides no means of establishing correspondences at
the token level.
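
To make the limitation concrete, the following minimal Python sketch unpacks a “stanza form” tag of the kind quoted above. The function name parse_stanza_form is hypothetical, and the pattern (a stress count followed by a single rhyme letter) is an assumption based solely on the one example given here; actual Liederenbank tags may well be richer.

```python
import re


def parse_stanza_form(tag):
    """Split a tag such as '3A 4B' into (stress_count, rhyme_letter) pairs.

    Assumption: every item is a number followed by one letter, as in the
    single example quoted in the text.
    """
    lines = []
    for item in tag.split():
        match = re.fullmatch(r"(\d+)([A-Za-z])", item)
        if match is None:
            raise ValueError(f"unrecognised item: {item!r}")
        lines.append((int(match.group(1)), match.group(2)))
    return lines


form = parse_stanza_form("3A 4B 3A 4B 3C 4D 3C 4D")
rhyme_scheme = "".join(letter for _, letter in form)
stress_counts = [count for count, _ in form]
print(len(form), rhyme_scheme, stress_counts)
```

The result describes the stanza as a whole, one entry per line, but nothing in it points to any particular word or syllable of the text, which is why such a tag cannot be aligned with token-level syntactic annotation.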

2.3 Which corpora can serve as models?

Corpora containing metrical annotation segmenting the text into units smaller than
the line are relatively rare. For syllabic verse, the Anamètre project³ has produced
a metrically annotated corpus of Classical and Modern French verse, using a series
of Python scripts to mark up the text for syllable structure and to identify vowel
phonemes [3]. A similar approach is adopted for the Corpus of Czech Verse⁴, but
here the metrical annotation also marks stressed and unstressed syllables, since this
distinction is essential to Czech metre. While most metrical information is included
in line-level tags, indicating the metre of the line as a series of “feet”⁵, these tags are
generated by an automated algorithm which divides the line into syllables [7]. The
syllable-level representation in the database includes both a phonetic transcription
of the syllable, and whether it bears a lexical stress [13]. Both corpora are intended
for the study of purely metrical phenomena: the Czech corpus, for example, has
been used to establish a database of metres used in poetry and a database of rhymes.

² http://www.liederenbank.nl/
³ http://www.crisco.unicaen.fr/verlaine/index.html
⁴ http://www.versologie.cz/en/kcv_znacky.html
⁵ A fixed sequence containing one stressed and a number of unstressed syllables, e.g. iamb
(unstressed–stressed), trochee (stressed–unstressed).

Corpora which combine prosodic and syntactic annotation are more widespread,
and share with the present corpus a need for multiple tokenization, since syntac-
tic annotation is based on words and phrase structures, while prosodic or metrical
annotation is based on syllabic structures. The Rhapsodie project has annotated
a corpus of spoken French using two different base units: phonemes for prosodic
structure and lexemes for syntactic structure [9]. Prosodic and syntactic annota-
tion is organized in separate tree structures but they are interconnected by means
of DAGs (directed acyclic graphs). Another method is introduced in the DIRNDL
project⁶. Here, a corpus of German radio news is annotated on prosodic, syntactic
and discourse levels. Each layer is presented as a separate graph that is connected
to others via pipeline links [10]. However, despite some core similarities, it is
important to note that the prosodic annotation of spoken language differs greatly
from metrical annotation, since unlike poetry, spoken language is not designed to
fit a metrical template. Metrical annotation is in this regard rather simpler, as only
phenomena which are metrically relevant (e.g. syllables, stress, rhyme) need be
included. Moreover, there is little need to include audio or even phonetic tran-
scriptions, particularly when dealing with historical texts for which the precise
phonology is often uncertain.

To our knowledge, the only extant corpus which combines metrical and treebank
annotation is the recently-released Greinir skáldskapar⁷ corpus of historical
Icelandic verse, which combines syntactic, phonological and metrical annotation
[6]. The corpus is accessible through a purpose-built online portal in which queries
are formulated using drop-down menus, and the interface is intended to facilitate
combined syntactic and metrical queries (e.g. “find all line-initial subjects that
alliterate”). However, it should be noted that Icelandic alliterative verse is organized
according to very different principles from the syllabic verse of Old Occitan, and thus
the annotation procedure presents very different challenges.

3 Methodology

From the preceding discussion, we may identify two main challenges in enriching
a treebank with metrical annotation:

1. Designing and creating a layer of metrical annotation

2. Combining metrical annotation with a treebank in such a way as to be easily
searchable (ideally using existing tools)

⁶ http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/dirndl.en.html
⁷ http://bragi.info/greinir/


3.1 Creating metrical annotation

With regard to the first challenge, we elected to use a multi-layered approach for
the metrical annotation based on both line and syllable tokens. This multimodal
approach allows for overlapping and hierarchically conflicting layers of annotation
that would otherwise be incompatible in a traditional inline single-level corpus
[16]. The use of syllable-level tokenization is essential to create a detailed repre-
sentation of the metrical structure of the text, and is crucial if the corpus is also
to be used to investigate the metrical characteristics of Old Occitan texts (e.g. the
extent to which stressed syllables are used to create a regular rhythm [14]). More-
over, it also allows automatic identification of the position of the cæsura, a metrical
position which, like the end of the line, is likely to be associated with syntactic
constituent boundaries.⁸

In order to create the annotation, we first devised a simple algorithm to di-
vide the words in the text into syllables, a relatively straightforward task given the
comparatively phonemic orthography of Old Occitan. This (i) identifies syllable
nuclei (i.e. sequences of vowels), (ii) learns permitted onset and coda clusters from
word-initial and word-final consonant sequences and (iii) divides sequences of con-
sonants between vowels into coda and onset accordingly. The results produced by
the algorithm were manually corrected, and the position of the lexical stress was
added.⁹
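
The three steps just described can be sketched in Python as follows. This is an illustrative reconstruction, not the code used for the corpus: the vowel inventory, the function names, the preference for the longest attested onset, and the fallback for unattested clusters are all assumptions, and the toy lexicon is simply the set of words from examples (1) and (2).

```python
import re

VOWELS = "aeiou"  # assumption: a deliberately simplified vowel inventory


def learn_clusters(words):
    """Step (ii): collect word-initial onsets and word-final codas."""
    onsets, codas = set(), set()
    for word in words:
        word = word.lower()
        if not re.search(f"[{VOWELS}]", word):
            continue  # a word without a nucleus tells us nothing
        onsets.add(re.match(f"[^{VOWELS}]*", word).group())
        codas.add(re.search(f"[^{VOWELS}]*$", word).group())
    return onsets, codas


def split_cluster(cluster, onsets, codas):
    """Step (iii): divide a medial consonant cluster into coda + onset."""
    for i in range(len(cluster)):  # longest onset first, never an empty one
        coda, onset = cluster[:i], cluster[i:]
        if onset in onsets and coda in codas:
            return coda, onset
    # Assumed fallback: a single-consonant onset when nothing is attested.
    return cluster[:-1], cluster[-1:]


def syllabify(word, onsets, codas):
    """Step (i) finds nuclei as vowel runs; clusters are then divided."""
    runs = re.findall(f"[{VOWELS}]+|[^{VOWELS}]+", word.lower())
    syllables, current = [], ""
    for idx, run in enumerate(runs):
        medial = 0 < idx < len(runs) - 1
        if run[0] in VOWELS or not medial:
            current += run  # nucleus, word-initial onset or word-final coda
        else:
            coda, onset = split_cluster(run, onsets, codas)
            syllables.append(current + coda)
            current = onset
    syllables.append(current)
    return syllables


lexicon = ["nos", "jove", "omne", "quandius", "que", "estam",
           "ella", "ab", "parlet", "ta", "dolzament"]
onsets, codas = learn_clusters(lexicon)
print(syllabify("estam", onsets, codas))
print(syllabify("dolzament", onsets, codas))
```

Because every vowel run is treated as a single nucleus, a sketch of this kind cannot separate vowels in hiatus (as in Bo.e.ci), which is one reason why manual correction of the output remains necessary.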

The second phase of generating the annotation involved labelling each syllable
according to its position in the line. This is more complex, since some syllables
are subject to variable elision rules, and may not ‘count’ towards the ten syllables
in the line. Two principal elision rules were modelled:¹⁰

• Synalepha: Word-final unstressed vowels may be elided when followed by
a word-initial vowel. For example, in the sequence El.l(a) ab below, the final
unstressed vowel of the pronoun ella ‘she’ is not counted:

(2)  1    2        3    4      /  5     6    7   8     9    10
     El-  l(a) ab  Bo-  e- ci  /  par-  let  ta  dol-  za-  ment

     ‘She spoke so sweetly to Boethius.’

• Syneresis: Some vowel–vowel sequences within words may count either as
one or two syllables. The most notable example is the imperfect ending -ia,
where the two possible scansions presumably reflect two possible pronunci-
ations: /i.a/ or /ja/.

⁸ In order to create a comprehensive representation of the metre of the text, laisse units could also
be marked. However, as we felt this was of less immediate interest for the study of the syntactic
structure of the text, this layer of annotation is not currently implemented.

⁹ The efficiency of this part of the workflow could be improved in future work by integrating
existing syllabification algorithms, such as the finite-state syllabification method presented for Middle
Dutch by Bouma and Hermans [2], pp. 29–31.

¹⁰ These metrical rules are common not just in Old Occitan but in Old Romance in general (see
Chambers [1], pp. 5–7).


Any syllable potentially subject to either synalepha (given the right phonological
context) or to syneresis was manually tagged as such in the input to the algorithm.

Labelling each syllable according to its metrical position in the line was carried
out by a ‘scansion’ algorithm, which operates in the following way:

1. select a variable elision rule (no elision/synalepha/syneresis);

2. apply selected variable elision rule(s);

3. apply positional elision rules (post-tonic syllables at cæsura);

4. if the line has ten counted syllables, and the fourth and the tenth bear lexical
stress, mark the line as correctly scanned;

5. else, select a different variable elision rule, or a combination of rules, and
return to 2;

6. if, once all possible combinations of variable rules have been applied, the
line does not scan correctly, mark as ‘unscannable’.

For example, when scanning the line given in (2), the algorithm begins by assuming
no elision (step 1):

(2′)  1    2   3   4    /  5   6   7     8    9   10    11   12
      El-  la  ab  Bo-  /  e-  ci  par-  let  ta  dol-  za-  ment

This scansion is unchanged by step 3 (since the cæsura is not in the correct position)
and fails at step 4. On the second pass, the algorithm elects to apply synalepha at
step 1, giving the following provisional scansion at step 2:

(2′′)  1    2        3    4   /  5   6     7    8   9     10   11
       El-  l(a) ab  Bo-  e-  /  ci  par-  let  ta  dol-  za-  ment

Step 3 then notes the presence of a word-final unstressed syllable after the cæsura
and marks it as ‘uncounted’, giving the following scansion:

(2)  1    2        3    4   /   5     6    7   8     9    10
     El-  l(a) ab  Bo-  e-  ci  par-  let  ta  dol-  za-  ment

The line is then marked as correctly scanned at step 4.

No manual intervention is needed to correct the output of the algorithm: since
the algorithm is supplied with all accepted rules of Occitan versification, lines
marked as ‘unscannable’ (25 of 257) are genuine metrical exceptions in the manuscript
text (see Chambers [1], pp. 8–9).
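
The scansion procedure and the worked example for line (2) can be approximated by the following Python sketch. It is a simplified illustration under stated assumptions rather than the implementation used for the corpus: the Syllable class and the function names are hypothetical, synalepha and syneresis are collapsed into a single manually assigned elidable flag, an elided syllable is simply labelled "el" instead of sharing a position with its neighbour, and only the two stresses that the check actually requires are marked in the sample data.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Syllable:
    text: str
    stressed: bool = False    # bears lexical stress
    word_final: bool = False  # last syllable of its word
    elidable: bool = False    # manually tagged: open to synalepha/syneresis


def apply_rules(line, elided):
    """Steps 2-4: label the syllables for one choice of variable elisions."""
    labels, count, skip_next = [], 0, False
    for i, syl in enumerate(line):
        if skip_next or i in elided:
            labels.append("el")  # uncounted syllable
            skip_next = False
            continue
        count += 1
        labels.append(count)
        # Positional rule: a post-tonic syllable at the caesura is uncounted.
        skip_next = count == 4 and syl.stressed and not syl.word_final
    counted = [s for s, lab in zip(line, labels) if lab != "el"]
    if len(counted) == 10 and counted[3].stressed and counted[9].stressed:
        return labels
    return None


def scan(line):
    """Steps 1, 5, 6: try no elision first, then ever larger combinations."""
    candidates = [i for i, s in enumerate(line) if s.elidable]
    for size in range(len(candidates) + 1):
        for elided in combinations(candidates, size):
            labels = apply_rules(line, set(elided))
            if labels is not None:
                return labels
    return None  # 'unscannable'


line_2 = [
    Syllable("El"), Syllable("la", word_final=True, elidable=True),
    Syllable("ab", word_final=True), Syllable("Bo"),
    Syllable("e", stressed=True), Syllable("ci", word_final=True),
    Syllable("par"), Syllable("let", word_final=True),
    Syllable("ta", word_final=True), Syllable("dol"), Syllable("za"),
    Syllable("ment", stressed=True, word_final=True),
]
result = scan(line_2)
print(result)
```

As in the worked example, the first pass (no elision) yields twelve counted syllables and fails, while the second pass elides the final vowel of ella, places the stressed syllable of Boeci in fourth position and leaves its post-tonic syllable uncounted.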

3.2 Combining metrical and treebank annotation

The treebank annotation was created using automated part-of-speech tagging and
parsing followed by manual correction, following the method described for the
Flamenca text by Scrivner, Kübler, Vance and Beuerlein [15]. We elected to use
the open-source ANNIS platform as our corpus search engine, which uses PAULA-XML
as its preferred input data format [5].¹¹ The platform’s flexible architecture
allows for multidimensional representations of corpora [16], while the web-based
query engine is suitable not only for various overlapping annotation schemas but
also for different levels of segmentation, which is impossible in most corpus tools.
This is crucial for a corpus in which the two core layers of annotation rely on
different tokenizations: words in the treebank, syllables in the metrical annotation.
The two tokenizations are quite separate, since it is not the case that a word can
be treated simply as a spanned sequence of syllable tokens. Word boundaries also
occur within syllables: for example, of the four word boundaries in the sequence
e te m fiav’ eu ‘in you I trusted’ (l. 75), only two coincide with boundaries in the
syllable tokens e.tem.fi.a.veu.

Metrical annotation was exported directly to the PAULA-XML format from
the scansion module, and was combined with syntactic annotation converted to the
same format. Although it permits multiple layers, PAULA-XML still requires all
units to be defined based on indivisible tokens. We elected to use individual charac-
ters as tokens, defining both syllables and words using character spans. However,
in the ANNIS interface, this arbitrary token level can be hidden, leaving only the
relevant higher-level unit (word, syllable and line) visible to the user. At present,
the user can work offline with a local version of ANNIS, or work online with a
server-based version, which we have created.
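
The idea of defining both words and syllables as spans over character tokens can be illustrated with a few lines of Python. The sketch below does not produce PAULA-XML; the helper spans is hypothetical, and dropping the apostrophe of fiav’ is an assumption made only to keep the two segmentations comparable.

```python
def spans(units):
    """Map consecutive strings to (start, end) character offsets (end exclusive)."""
    result, position = [], 0
    for unit in units:
        result.append((position, position + len(unit)))
        position += len(unit)
    return result


words = ["e", "te", "m", "fiav", "eu"]      # e te m fiav' eu (l. 75)
syllables = ["e", "tem", "fi", "a", "veu"]  # e.tem.fi.a.veu

word_spans = spans(words)
syllable_spans = spans(syllables)

# Internal boundaries, i.e. excluding the end of the whole sequence.
word_bounds = {end for _, end in word_spans[:-1]}
syllable_bounds = {end for _, end in syllable_spans[:-1]}
shared = sorted(word_bounds & syllable_bounds)
print(len(word_bounds), shared)
```

Of the four internal word boundaries, only those at offsets 1 and 4 are also syllable boundaries, so neither segmentation can be derived from the other; both have to be anchored independently in the character layer.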

4 Some sample findings

It is relatively straightforward to use the ANNIS query language to study the rela-
tionship between syntactic and metrical annotation layers. In Figure 1, the query
finds all finite clauses (IP-MAT and IP-SUB elements) ending at the cæsura, i.e.
right aligned (_r_) with the fourth syllable in the line, or with an elided syllable
(“el”) which immediately follows it (.). Queries of this nature have already led us
to some intriguing findings.

Firstly, there is a very strong correlation between the metrical structure of Boeci
and its syntactic structure. Recall that each ten-syllable line is divided into two
half-lines of four and six syllables by a cæsura. It transpires that of the 355 finite
clauses in the text, every single one ends at a half-line boundary: 302 at the end of
the line, 53 at the cæsura.¹² Moreover, there is not a single line which does not end
with a finite clause boundary. While this tendency is not unusual,¹³ it is perhaps
more surprising to see it exceptionlessly applied in our text. It also illustrates
the extent to which the syntax of this text is constrained by the metre: effectively,
every finite clause must be four, six or ten syllables long, or contain one or more
embedded finite clauses of four, six or ten syllables.

cat = /IP-(MAT|SUB).*/
& syll_in_line = "4"
& ( #1 _r_ #2
  | ( syll_in_line = "el"
    & #2 . #3
    & #1 _r_ #3 )
  )

Figure 1: AQL query to identify IP-MAT and IP-SUB constituents ending at the
cæsura.

¹¹ PAULA-XML must however be converted to the native relANNIS format using the
SaltNPepper converter (http://korpling.german.hu-berlin.de/saltnpepper/) before the data is
usable in ANNIS.

¹² We exclude 5 finite clauses which end within lines tagged as ‘unscannable’.

¹³ Devine and Stephens [4] note a similar pattern in Ancient Greek verse, arguing convincingly
that this effect is due to the association of syntactic constituent boundaries with prosodic constituent
boundaries in natural language, such as the intonational or the phonological phrase.

Secondly, there is a strong correlation between the length of a lexical item
and its position in the line. Figure 2 shows that polysyllabic and monosyllabic
words of the same part-of-speech show radically different distributional
tendencies. In all cases, polysyllables are more likely to occur at the end of the
line than monosyllables. In the case of common nouns, 94% of polysyllables are
line-final but only 16% of monosyllables; in the case of finite verbs, only 7% of
monosyllables are line-final; in the case of past participles, no monosyllables are
line-final. Since, as we have seen, line-final position is also usually clause-final,
we can conclude that in this text, polysyllabic lexical items are most likely to
occur at the end of the clause. Whatever the cause of this phenomenon (and it may
not necessarily be purely metrical), it is likely to have important consequences for
word order in the text, and it can only be studied in a corpus which contains
syllable-level annotation.

                             polysyllabic              monosyllabic
Part-of-speech               line-medial  line-final   line-medial  line-final
oxytonic common nouns                  5          81            90          17
oxytonic past participles              7          12            14           0
oxytonic finite verbs                 33          23           166          13

Figure 2: Position of oxytonic (= stress on final syllable) lexical items within the
line; selected parts of speech.
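
The proportions discussed here follow directly from the counts in Figure 2, and can be recomputed with the short Python sketch below. The dictionary layout and the function name line_final_share are merely illustrative choices; the counts themselves are those given in the figure.

```python
# Counts from Figure 2 as (line-medial, line-final) pairs.
figure_2 = {
    "common nouns":     {"polysyllabic": (5, 81),  "monosyllabic": (90, 17)},
    "past participles": {"polysyllabic": (7, 12),  "monosyllabic": (14, 0)},
    "finite verbs":     {"polysyllabic": (33, 23), "monosyllabic": (166, 13)},
}


def line_final_share(medial, final):
    """Percentage of occurrences found in line-final position."""
    return 100 * final / (medial + final)


shares = {
    (pos, length): line_final_share(*cells)
    for pos, by_length in figure_2.items()
    for length, cells in by_length.items()
}

for (pos, length), share in shares.items():
    print(f"{pos:17} {length:13} {share:5.1f}% line-final")
```

For every part of speech in the figure, the line-final share of polysyllables exceeds that of monosyllables.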

Finally, one important area of syntactic variation in Old Occitan (and Old
French) is the relative order of an infinitive and its core complement. Using the
treebank, we can identify 20 cases in which the infinitive and its core complement
(direct object, or directional complement of a motion verb) occur together: ten with
the order CV and ten with the order VC. 18 out of these 20 cases are line-final (9
CV and 9 VC). In all of the CV cases, the infinitive takes an -ar ending,
suggesting that assonance may have played a role in the selection of one word order over
another. Since stress on the penultimate syllable (paroxytonic stress) is excluded
in line-final position, paroxytonic nouns (e.g. ri.que.za, l. 83; chai.ti.ve.za, l. 88)
are only found in CV orders, while the paroxytonic infinitive (metre, ll. 22, 59) is
only found with VC order. It therefore seems possible that metrical factors (final
stress, assonance) contribute to this syntactic variation, and so should be taken into
consideration.

5 Conclusion

Having highlighted the importance of considering metrical factors in syntactic
analysis, we have outlined an implemented, extensible methodology for creating a layer
of metrical annotation and combining it with a treebank using the ANNIS platform.
Our method is not restricted to a single text, nor even to Old Occitan epic
verse in general, but can be applied with few major modifications to texts from
any metrical tradition based primarily on a fixed number of syllables per line.

We present some preliminary findings from our pilot corpus in order to
suggest future directions for linguistic research; however, these are necessarily
limited by the size of the corpus. More far-reaching conclusions may be drawn in
particular from corpora combining verse and prose, in which the prose texts can
be used to establish a ‘baseline’ of frequent syntactic structures to which the verse
texts can be compared. Such an approach may help us to further our general un-
derstanding of the interaction of metrical constraints and syntactic variation.

Acknowledgements

T. M. Rainsford would like to acknowledge the generous support of the British
Academy, through his recent post-doctoral fellowship at the University of Oxford,
in making this research collaboration possible.

References

[1] Chambers, Frank M. (1985) An Introduction to Old Provençal Versification,
Philadelphia: American Philosophical Society.

[2] Bouma, Gosse and Hermans, Ben (2013) Syllabification of Middle Dutch. In
F. Mambrini, M. Passarotti, C. Sporleder (eds) Proceedings of the Second
Workshop on Annotation of Corpora for Research in the Humanities.

[3] Delente, Éliane and Renault, Richard (2009) Les étapes du traitement automa-
tique d’un poème. Presentation given at “Le patrimoine à l’ère du numérique”,
10–11 December 2009, Université de Caen [http://www.crisco.unicaen.
fr/verlaine/ressources/patrimoine_Caen.pdf].



[4] Devine, Andrew M., and Stephens, Laurence D. (1984) Language and Meter:
Resolution, Porson’s Bridge, and their Prosodic Basis, Chico, CA: Scholars
Press.

[5] Dipper, Stefanie (2005). XML-based Stand-off Representation and Exploita-
tion of Multi-Level Linguistic Annotation. In R. Eckstein, R. Tolsdorf (eds),
Proceedings of Berliner XML Tage, pp. 39–50.

[6] Eythórsson, Þórhallur, Karlsson, Bjarki, and Sigurðardóttir, Sigríður Sæunn
(2014) Greinir skáldskapar: A diachronic corpus of Icelandic poetic texts. In
Proceedings of LREC 2014: Workshop on Language Resources and Technolo-
gies for Processing and Linking Historical Documents and Archives – Deploy-
ing Linked Open Data in Cultural Heritage, Reykjavík, Iceland, pp. 35–41.

[7] Ibrahim, Robert and Plecháč, Petr (2011) Toward Automatic Analysis of Czech
Verse. In B. P. Scherr, J. Bailey, E. V. Kazartsev (eds) Formal Methods in
Poetics, Lüdenscheid, RAM, pp. 295–305.

[8] Gasparov, M. L. (1996) A History of European Versification, tr. by G. S. Smith
and Marina Tarlinskaja, ed. by G. S. Smith with Leofranc Holford-Stevens,
Oxford, Clarendon Press.

[9] Gerdes, Kim, Kahane, Sylvain and Pietrandrea, Paola (2012) Intonosyntactic
data structures: The Rhapsodie treebank of spoken French. In Proceedings of
the 6th Linguistic Annotation Workshop, Jeju, Republic of Korea, pp. 85–94.

[10] Eckart, Kerstin, Riester, Arndt and Schweitzer, Katrin (2012) A Discourse
Information Radio News Database for Linguistic Analysis. In C. Chiarcos,
S. Nordhoff, S. Hellmann (eds) Linked Data in Linguistics. Representing and
Connecting Language Data and Language Metadata, Springer, Heidelberg, pp.
65–75.

[11] Labelle, Marie (2007) Clausal architecture in Early Old French. Lingua 117:
289–316.

[12] Martineau, France, Hirschbühler, Paul, Kroch, Anthony and Morin, Yves
Charles (2010) Corpus MCVF annoté syntaxiquement, Ottawa: University of
Ottawa [http://www.arts.uottawa.ca/voies/corpus_pg_en.html].

[13] Plecháč, Petr and Ibrahim, Robert (2014) Database of Czech Verse, Presentation
given at “Frontiers in Comparative Metrics 2”, 19–20 May 2014, Tallinn.

[14] Rainsford, Thomas M. (2010) Rhythmic Change in the Medieval Octosyllable
and the Development of Group Stress. In F. Neveu, V. Muni-Toke, T.
Klingler, J. Durand, L. Mondada and S. Prévost (eds) Congrès mondial de
linguistique française: CMLF 2010, Paris, Institut de linguistique française, pp.
321–36.



[15] Scrivner, Olga, Kübler, Sandra, Vance, Barbara, and Beuerlein, Eric (2013)
Le Roman de Flamenca : An annotated corpus of old Occitan. In F. Mambrini,
M. Passarotti, and C. Sporleder (eds), Proceedings of the Third Workshop on
Annotation of Corpora for Research in Humanities, pp. 85–96.

[16] Zeldes, Amir, Ritz, J., Lüdeling, Anke, and Chiarcos, Christian (2009) AN-
NIS: A search tool for multi-layer annotated corpora. In Proceedings of Corpus
Linguistics, Liverpool.
