Metrical annotation for a verse treebank

T. M. Rainsford¹ and Olga Scrivner²
¹ Institut für Linguistik/Romanistik, Universität Stuttgart
² Indiana University
E-mail: ¹ tmr740-ac@yahoo.co.uk, ² obscrivn@indiana.edu

Abstract

We present a methodology for enriching treebanks containing verse texts with metrical annotation, and introduce a pilot corpus containing one Old Occitan text. Metrical annotation is based on syllable tokens, and is generated semi-automatically using two algorithms, one to divide word tokens into syllables, and a second to mark the position of each syllable in the line. Syntactic and metrical annotation is combined in a single multi-layered ANNIS corpus. Three initial findings based on the pilot corpus illustrate the close relation between syntactic and metrical structure, and hence the value of enriching treebanks in this way.

1 Introduction

The goal of the project presented here is to develop a methodology for enriching treebanks containing verse texts with detailed metrical annotation. The earliest texts preserved for many European languages, in this case Occitan, are frequently in verse, and it is therefore desirable when analysing these texts to take into consideration any possible effect of the verse form on the syntax.

In the present paper, we outline the methodology we have adopted in producing a small pilot treebank containing a 10th-century Occitan verse text, a fragment from a verse adaptation of Boethius’ De Consolatione Philosophiæ (henceforth Boeci). The pilot corpus is available online at www.oldoccitancorpus.org. In working on the treebank aspects of the corpus, we have built on the work carried out for the Old Occitan Flamenca text by Scrivner, Kübler, Vance and Beuerlein [15].

2 Background

2.1 Why enrich treebanks with metrical annotation?

There exists a consensus among linguists that the syntax of verse texts differs from that of prose, with unusual word orders adopted to fit the constraints of the metre.
For example, in introducing a study of Early Old French (12th-century) syntax, Labelle [11] feels obliged to acknowledge that “the disadvantage of concentrating on this period of time is that the available texts are in verse, and we might expect freer word order, with probably more scrambling to accommodate the rhyme.” A difficult task is only made harder by the fact that data extracted from modern treebanks (e.g. the MCVF Old French treebank, Martineau et al. [12]) does not contain any metrical information, so it is not possible to establish, for example, whether a particularly unusual word order may have been adopted to place a rhyming word at the end of the line without referring back to the source edition. This problem of information loss can be resolved by adding metrical annotation to a treebank. Furthermore, it allows researchers to write combined syntactic and metrical queries, placing them in a position to demonstrate whether specific metrical constraints, especially at the end of the line (the rhyme) and at the half-line boundary (the cæsura), are in fact associated with unusual syntactic structures.

2.2 What information should be included in metrical annotation?

Corpora containing metrical annotation are relatively rare (see section 2.3 below), and there is little consensus regarding which metrical and/or prosodic features should be encoded. Indeed, it would not even be desirable for all metrical corpora to contain the same information, since different versification systems exploit different aspects of linguistic structure (e.g. the distinction between light and heavy syllables is fundamental in classical Latin verse, but irrelevant for versification in modern Romance languages). However, every metrical annotation system must take account at some level of the defining unit of a verse text: the line.¹ Beyond the line, the annotation scheme may choose to mark:

• Segments bigger than the line, e.g. stanza, poem
• Segments smaller than the line, e.g. half-line, foot, syllable, mora
• Line-linking phenomena, e.g. rhyme, assonance, alliteration

¹ The division of a verse text into such “correlatable and commensurable segments” is considered a defining feature by metricists, cf. Gasparov [8], p. 1.

The metre of Boeci is typical of Old Occitan (and Old French) epic texts. The poem is written in lines of ten counted syllables, divided regularly into two half-lines by a cæsura between the fourth and fifth counted syllables (stressed syllables are underlined):

(1)  1    2    3    4            5      6     7    8    9    10
     Nos  jo-  ve   om-  ne  //  quan-  dius  que  nos  es-  tam
     ‘We young men, when young we are [. . . ]’ (l. 1)

The fourth and tenth counted syllables must bear a lexical stress (e.g. om-, -tam), while post-tonic syllables at the end of the first half-line are not counted (e.g. -ne). Lines are linked into laisses of irregular length by assonance: a simple form of rhyme, in which the final stressed vowels of lines, but not necessarily preceding or following consonants, must be similar. For instance, the first laisse of the poem contains lines ending with an /a/ vowel (estam : parllam : esperam : annam : fam : clamam); the third with an /o/ vowel (fello : pejor : quastiazo, etc.). Therefore, in order to describe the metrical structure of the poem completely, the annotation scheme should mark both properties of the laisse and those of the syllable in addition to the line.

It should be noted that multi-layered annotation is not necessary to encode this kind of information. For example, a major corpus of historical Dutch song, the Nederlandse Liederenbank², does not annotate stanzas, lines or stressed syllables explicitly. Instead, metrical properties are given by a complex “stanza form” tag which is included in the metadata for each text.
For instance, the metre of the text with incipit Doersocht en bekent hebt ghi / Mi Heer mijn sitten mijn opstaen is given as 3A 4B 3A 4B 3C 4D 3C 4D: eight lines, rhyming ABABCDCD, containing alternately three and four stressed syllables. However, an approach of this kind has clear drawbacks when metrical annotation is to be combined with other annotation layers, since it provides no means of establishing correspondences at the token level.

2.3 Which corpora can serve as models?

Corpora containing metrical annotation segmenting the text into units smaller than the line are relatively rare. For syllabic verse, the Anamètre project³ has produced a metrically annotated corpus of Classical and Modern French verse, using a series of Python scripts to mark up the text for syllable structure and to identify vowel phonemes [3]. A similar approach is adopted for the Corpus of Czech Verse⁴, but here the metrical annotation also marks stressed and unstressed syllables, since this distinction is essential to Czech metre. While most metrical information is included in line-level tags, indicating the metre of the line as a series of “feet”⁵, these tags are generated by an automated algorithm which divides the line into syllables [7]. The syllable-level representation in the database includes both a phonetic transcription of the syllable, and whether it bears a lexical stress [13]. Both corpora are intended for the study of purely metrical phenomena: the Czech corpus, for example, has been used to establish a database of metres used in poetry and a database of rhymes.

² http://www.liederenbank.nl/
³ http://www.crisco.unicaen.fr/verlaine/index.html
⁴ http://www.versologie.cz/en/kcv_znacky.html
⁵ A fixed sequence containing one stressed and a number of unstressed syllables, e.g. iamb (unstressed–stressed), trochee (stressed–unstressed).
Corpora which combine prosodic and syntactic annotation are more widespread, and share with the present corpus a need for multiple tokenization, since syntactic annotation is based on words and phrase structures, while prosodic or metrical annotation is based on syllabic structures. The Rhapsodie project has annotated a corpus of spoken French using two different base units: phonemes for prosodic structure and lexemes for syntactic structure [9]. Prosodic and syntactic annotation is organized in separate tree structures but they are interconnected by means of DAGs (directed acyclic graphs). Another method is introduced in the DIRNDL project⁶. Here, a corpus of German radio news is annotated on prosodic, syntactic and discourse levels. Each layer is presented as a separate graph that is connected to others via pipeline links [10]. However, despite some core similarities, it is important to note that the prosodic annotation of spoken language differs greatly from metrical annotation, since unlike poetry, spoken language is not designed to fit a metrical template. Metrical annotation is in this regard rather simpler, as only phenomena which are metrically relevant (e.g. syllables, stress, rhyme) need be included. Moreover, there is little need to include audio or even phonetic transcriptions, particularly when dealing with historical texts for which the precise phonology is often uncertain.

To our knowledge, the only extant corpus which combines metrical and treebank annotation is the recently-released Greinir skáldskapar⁷ corpus of historical Icelandic verse, which combines syntactic, phonological and metrical annotation [6]. The corpus is accessible through a purpose-built online portal, queries are formulated using drop-down menus, and the interface is intended to facilitate combined syntactic and metrical queries (e.g. “find all line-initial subjects that alliterate”).
However, it should be noted that Icelandic alliterative verse is organized according to very different principles from the syllabic verse of Old Occitan, and thus the annotation procedure presents very different challenges.

⁶ http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/dirndl.en.html
⁷ http://bragi.info/greinir/

3 Methodology

From the preceding discussion, we may identify two main challenges in enriching a treebank with metrical annotation:

1. Designing and creating a layer of metrical annotation
2. Combining metrical annotation with a treebank in such a way as to be easily searchable (ideally using existing tools)

3.1 Creating metrical annotation

With regard to the first challenge, we elected to use a multi-layered approach for the metrical annotation based on both line and syllable tokens. This multimodal approach allows for overlapping and hierarchically conflicting layers of annotation that would otherwise be incompatible in a traditional inline single-level corpus [16]. The use of syllable-level tokenization is essential to create a detailed representation of the metrical structure of the text, and is crucial if the corpus is also to be used to investigate the metrical characteristics of Old Occitan texts (e.g. the extent to which stressed syllables are used to create a regular rhythm [14]). Moreover, it also allows automatic identification of the position of the cæsura, a metrical position which, like the end of the line, is likely to be associated with syntactic constituent boundaries.⁸

In order to create the annotation, we first devised a simple algorithm to divide the words in the text into syllables, a relatively straightforward task given the comparatively phonemic orthography of Old Occitan. This (i) identifies syllable nuclei (i.e. sequences of vowels), (ii) learns permitted onset and coda clusters from word-initial and word-final consonant sequences and (iii) divides sequences of consonants between vowels into coda and onset accordingly. The results produced by the algorithm were manually corrected, and the position of the lexical stress was added.⁹

The second phase of generating the annotation involved labelling each syllable according to its position in the line. This is more complex, since some syllables are subject to variable elision rules, and may not ‘count’ towards the ten syllables in the line. Two principal elision rules were modelled:¹⁰

• Synalepha: Word-final unstressed vowels may be elided when followed by a word-initial vowel. For example, in the sequence El.l(a) ab below, the final unstressed vowel of the pronoun ella ‘she’ is not counted:

  (2)  1    2        3    4            5     6    7    8     9    10
       El-  l(a) ab  Bo-  e-  ci  //   par-  let  ta   dol-  za-  ment
       ‘She spoke so sweetly to Boethius.’

• Syneresis: Some vowel–vowel sequences within words may count either as one or two syllables. The most notable example is the imperfect ending -ia, where the two possible scansions presumably reflect two possible pronunciations: /i.a/ or /ja/.

⁸ In order to create a comprehensive representation of the metre of the text, laisse units could also be marked. However, as we felt this was of less immediate interest for the study of the syntactic structure of the text, this layer of annotation is not currently implemented.
⁹ The efficiency of this part of the workflow could be improved in future work by integrating existing syllabification algorithms, such as the finite-state syllabification method presented for Middle Dutch by Bouma [2], pp. 29–31.
¹⁰ These metrical rules are common not just in Old Occitan but in Old Romance in general (see Chambers [1], pp. 5–7).

Any syllable potentially subject to either synalepha (given the right phonological context) or to syneresis was manually tagged as such in the input to the algorithm.
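As an illustration of the syllabification step described earlier in this section (steps i to iii), the following is a minimal Python sketch rather than the project's actual code. The vowel inventory, the function names, the fallback used when no attested split is found, and the sample word list are all assumptions made for the example.

```python
import re

VOWELS = "aeiouy"  # assumption: simplified vowel inventory, lower-case input


def split_runs(word):
    """Split a word into alternating runs of vowels (nuclei) and consonants."""
    return re.findall(f"[{VOWELS}]+|[^{VOWELS}]+", word)


def learn_clusters(words):
    """Step (ii): collect attested word-initial onsets and word-final codas."""
    onsets, codas = {""}, {""}
    for word in words:
        runs = split_runs(word)
        if not any(run[0] in VOWELS for run in runs):
            continue  # no nucleus (e.g. a clitic such as m): nothing to learn
        if runs[0][0] not in VOWELS:
            onsets.add(runs[0])
        if runs[-1][0] not in VOWELS:
            codas.add(runs[-1])
    return onsets, codas


def syllabify(word, onsets, codas):
    """Steps (i) and (iii): find nuclei, then split intervocalic clusters."""
    runs = split_runs(word)
    syllables, current = [], ""
    for i, run in enumerate(runs):
        intervocalic = run[0] not in VOWELS and 0 < i < len(runs) - 1
        if not intervocalic:
            current += run  # nucleus, word-initial onset or word-final coda
            continue
        # Try the longest onset first; accept the first attested coda|onset split.
        for k in range(len(run) + 1):
            if run[:k] in codas and run[k:] in onsets:
                break
        else:
            k = len(run) - 1  # fallback (assumption): last consonant is the onset
        syllables.append(current + run[:k])
        current = run[k:]
    syllables.append(current)
    return syllables


# Hypothetical training list drawn from word forms quoted in this paper.
sample = ["nos", "jove", "omne", "quandius", "que", "estam", "ta", "ment", "zo"]
onsets, codas = learn_clusters(sample)
print(syllabify("dolzament", onsets, codas))  # ['dol', 'za', 'ment']
```

Because every vowel sequence is treated as a single nucleus, this sketch returns boe.ci for Boeci, whereas the line scans as Bo.e.ci; cases of this kind are what the manual correction step would need to catch.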
Labelling each syllable according to its metrical position in the line was carried out by a ‘scansion’ algorithm, which operates in the following way:

1. select a variable elision rule (no elision/synalepha/syneresis);
2. apply selected variable elision rule(s);
3. apply positional elision rules (post-tonic syllables at cæsura);
4. if the line has ten counted syllables, and the fourth and the tenth bear lexical stress, mark the line as correctly scanned;
5. else, select a different variable elision rule, or a combination of rules, and return to step 2;
6. if, once all possible combinations of variable rules have been applied, the line does not scan correctly, mark as ‘unscannable’.

For example, when scanning the line given in (2), the algorithm begins by assuming no elision (step 1):

(2′)   1     2     3     4           5     6     7     8     9     10    11    12
       El-   la    ab    Bo-   /     e-    ci    par-  let   ta    dol-  za-   ment

This scansion is unchanged by step 3 (since the cæsura is not in the correct position) and fails at step 4. On the second pass, the algorithm elects to apply synalepha at step 1, giving the following provisional scansion at step 2:

(2′′)  1     2        3     4           5     6     7     8     9     10    11
       El-   l(a) ab  Bo-   e-    /     ci    par-  let   ta    dol-  za-   ment

Step 3 then notes the presence of a word-final unstressed syllable after the cæsura and marks it as ‘uncounted’, giving the following scansion:

(2)    1     2        3     4                 5     6     7     8     9     10
       El-   l(a) ab  Bo-   e-    /     ci    par-  let   ta    dol-  za-   ment

The line is then marked as correctly scanned at step 4. No manual intervention is needed to correct the output of the algorithm: since the algorithm is supplied with all accepted rules of Occitan versification, lines marked as ‘unscannable’ (25 of 257) are genuine metrical exceptions in the manuscript text (see Chambers [1], pp. 8–9).
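The six steps of the scansion procedure can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the `Syllable` class, its field names, the decision to apply a selected rule to every tagged syllable at once, and the stress flags on syllables other than the fourth and tenth are all hypothetical. The label "el" for uncounted syllables follows the value used in the corpus query language examples.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Syllable:
    text: str
    stressed: bool = False
    word_final: bool = False
    synalepha: bool = False  # manually tagged as elidable by synalepha
    syneresis: bool = False  # manually tagged as elidable by syneresis


VARIABLE_RULES = ("synalepha", "syneresis")


def scan(line):
    """Return (labels, rules) for a scannable line, or None if unscannable."""
    # Steps 1 and 5: try no rule, then each single rule, then both together.
    for size in range(len(VARIABLE_RULES) + 1):
        for rules in combinations(VARIABLE_RULES, size):
            # Step 2: syllables tagged for a selected rule are not counted.
            counted = [not any(getattr(s, r) for r in rules) for s in line]
            idx = [i for i, c in enumerate(counted) if c]
            # Step 3: a post-tonic word-final syllable after position 4
            # is not counted.
            if len(idx) > 4:
                fourth, following = line[idx[3]], line[idx[4]]
                if (fourth.stressed and not fourth.word_final
                        and following.word_final and not following.stressed):
                    counted[idx[4]] = False
                    idx = [i for i, c in enumerate(counted) if c]
            # Step 4: ten counted syllables, stress on the fourth and tenth.
            if len(idx) == 10 and line[idx[3]].stressed and line[idx[9]].stressed:
                labels, position = [], 0
                for is_counted in counted:
                    if is_counted:
                        position += 1
                        labels.append(str(position))
                    else:
                        labels.append("el")
                return labels, rules
    return None  # Step 6: no combination of rules yields a valid scansion


S = Syllable
# Line (2); stress and word-final flags are illustrative annotations.
line_2 = [
    S("El", stressed=True), S("la", word_final=True, synalepha=True),
    S("ab", word_final=True), S("Bo"), S("e", stressed=True),
    S("ci", word_final=True), S("par"), S("let", stressed=True, word_final=True),
    S("ta", word_final=True), S("dol"), S("za"),
    S("ment", stressed=True, word_final=True),
]
print(scan(line_2))
```

On this input the first pass (no elision) yields twelve counted syllables and fails; the second pass applies synalepha, after which the positional rule leaves -ci uncounted and the line scans with ten syllables.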
3.2 Combining metrical and treebank annotation

The treebank annotation was created using automated part-of-speech tagging and parsing followed by manual correction, following the method described for the Flamenca text by Scrivner, Kübler, Vance and Beuerlein [15]. We elected to use the open-source ANNIS platform as our corpus search engine, which uses PAULA-XML as its preferred input data format [5].¹¹ The platform’s flexible architecture allows for multidimensional representations of corpora [16], while the web-based query engine is suitable not only for various overlapping annotation schemas but also for different levels of segmentation, which is impossible in most corpus tools. This is crucial for a corpus in which the two core layers of annotation rely on different tokenizations: words in the treebank, syllables in the metrical annotation. The two tokenizations are quite separate, since it is not the case that a word can be treated simply as a spanned sequence of syllable tokens. Word boundaries also occur within syllables: for example, of the four word boundaries in the sequence e te m fiav’ eu ‘in you I trusted’ (l. 75), only two coincide with boundaries in the syllable tokens e.tem.fi.a.veu.

Metrical annotation was exported directly to the PAULA-XML format from the scansion module, and was combined with syntactic annotation converted to the same format. Although it permits multiple layers, PAULA-XML still requires all units to be defined based on indivisible tokens. We elected to use individual characters as tokens, defining both syllables and words using character spans. However, in the ANNIS interface, this arbitrary token level can be hidden, leaving only the relevant higher-level units (word, syllable and line) visible to the user. At present, the user can work offline with a local version of ANNIS, or work online with a server-based version, which we have created.
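The idea of defining words and syllables as spans over shared character tokens can be illustrated with a short Python sketch. It does not produce PAULA-XML and is not the project's export code; the helper name `spans` is hypothetical, and dropping the apostrophe and spaces from the character layer is a simplifying assumption.

```python
def spans(units):
    """Map consecutive strings to (start, end) character offsets."""
    result, position = [], 0
    for unit in units:
        result.append((position, position + len(unit)))
        position += len(unit)
    return result


# 'e te m fiav' eu' (l. 75), with the apostrophe dropped (assumption)
words = ["e", "te", "m", "fiav", "eu"]
syllables = ["e", "tem", "fi", "a", "veu"]

# Both layers must cover exactly the same character tokens.
assert "".join(words) == "".join(syllables)

word_spans = spans(words)
syllable_spans = spans(syllables)

# Internal boundaries are the end offsets of every unit except the last.
word_bounds = {end for _, end in word_spans[:-1]}
syllable_bounds = {end for _, end in syllable_spans[:-1]}
shared = sorted(word_bounds & syllable_bounds)

print(word_spans)      # [(0, 1), (1, 3), (3, 4), (4, 8), (8, 10)]
print(syllable_spans)  # [(0, 1), (1, 4), (4, 6), (6, 7), (7, 10)]
print(shared)          # [1, 4]
```

Only two of the four internal word boundaries (after e and after m) fall on syllable boundaries, which is why neither layer can be expressed as a sequence of units from the other and a lower, character-level token layer is needed.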
¹¹ PAULA-XML must however be converted to the native relANNIS format using the SaltNPepper converter (http://korpling.german.hu-berlin.de/saltnpepper/) before the data is usable in ANNIS.

4 Some sample findings

It is relatively straightforward to use the ANNIS query language to study the relationship between syntactic and metrical annotation layers. The query in Figure 1 finds all finite clauses (IP-MAT and IP-SUB elements) ending at the cæsura, i.e. right aligned (_r_) with the fourth syllable in the line, or with an elided syllable (“el”) which immediately follows it (.).

    cat = /IP-(MAT|SUB).*/ &
    syll_in_line = "4" &
    ( #1 _r_ #2
    | ( syll_in_line = "el" & #2 . #3 & #1 _r_ #3 )
    )

Figure 1: AQL query to identify IP-MAT and IP-SUB constituents ending at the cæsura.

Queries of this nature have already led us to some intriguing findings. Firstly, there is a very strong correlation between the metrical structure of Boeci and its syntactic structure. Recall that each ten-syllable line is divided into two half-lines of four and six syllables by a cæsura. It transpires that of the 355 finite clauses in the text, every single one ends at a half-line boundary: 302 at the end of the line, 53 at the cæsura.¹² Moreover, there is not a single line which does not end with a finite clause boundary. While this tendency is not unusual,¹³ it is perhaps more surprising to see it applied without exception in our text. Moreover, it illustrates the extent to which the syntax of this text is constrained by the metre: effectively, every finite clause must be four, six or ten syllables long, or contain one or more embedded finite clauses of four, six or ten syllables.

¹² We exclude 5 finite clauses which end within lines tagged as ‘unscannable’.
¹³ Devine and Stephens [4] note a similar pattern in Ancient Greek verse, arguing convincingly that this effect is due to the association of syntactic constituent boundaries with prosodic constituent boundaries in natural language, such as the intonational or the phonological phrase.

Secondly, there is a strong correlation between the length of the lexical item and its position in the line. Figure 2 shows that polysyllabic and monosyllabic words of the same part-of-speech show radically different distributional tendencies. In all cases, polysyllables are more likely to occur at the end of the line than monosyllables.

                               polysyllabic                monosyllabic
    Part-of-speech             line-medial   line-final    line-medial   line-final
    oxytonic common nouns      5             81            90            17
    oxytonic past participles  7             12            14            0
    oxytonic finite verbs      33            23            166           13

Figure 2: Position of oxytonic (= stress on final syllable) lexical items within the line; selected parts of speech.

In the case of common nouns, 94% of polysyllables are line-final (81 of 86) but only 16% of monosyllables (17 of 107); in the case of finite verbs, only 7% of monosyllables are line-final (13 of 179); in the case of past participles, no monosyllables are line-final. Since, as we have seen, line-final position is also usually clause-final, we can conclude that in this text, polysyllabic lexical items are most likely to occur at the end of the clause. Whatever the cause of this phenomenon (and it may not necessarily be purely metrical), it is likely to have important consequences for word order in the text, and it can only be studied in a corpus which contains syllable-level annotation.

Finally, one important area of syntactic variation in Old Occitan (and Old French) is the relative order of an infinitive and its core complement. Using the treebank, we can identify 20 cases in which the infinitive and its core complement (direct object, or directional complement of a motion verb) occur together: ten with the order CV and ten with the order VC. 18 out of these 20 cases are line-final (9 CV and 9 VC).
In all of the CV cases, the infinitive takes an -ar ending, suggesting that assonance may have played a role in the selection of one word order over another. Since stress on the penultimate syllable (paroxytonic stress) is excluded in line-final position, paroxytonic nouns (e.g. ri.que.za, l. 83; chai.ti.ve.za, l. 88) are only found in CV orders, while the paroxytonic infinitive (metre, ll. 22, 59) is only found with VC order. It therefore seems possible that metrical factors (final stress, assonance) contribute to this syntactic variation, and so should be taken into consideration.

5 Conclusion

Having highlighted the importance of considering metrical factors in syntactic analysis, we have outlined an implemented, extensible methodology for creating a layer of metrical annotation and combining it with a treebank using the ANNIS platform. Our method is not limited to a single text, nor even to Old Occitan epic verse in general, but can be applied with few major modifications to texts from any metrical tradition based primarily on a fixed number of syllables per line.

We have presented some preliminary findings from our pilot corpus in order to suggest future directions for linguistic research; however, these are necessarily limited by the size of the corpus. More far-reaching conclusions may be drawn in particular from corpora combining verse and prose, in which the prose texts can be used to establish a ‘baseline’ of frequent syntactic structures to which the verse texts can be compared. Such an approach may help us to further our general understanding of the interaction of metrical constraints and syntactic variation.

Acknowledgements

T. M. Rainsford would like to acknowledge the generous support of the British Academy, through his recent post-doctoral fellowship at the University of Oxford, in making this research collaboration possible.

References

[1] Chambers, Frank M. (1985) An Introduction to Old Provençal Versification. Philadelphia: American Philosophical Society.

[2] Bouma, Gosse and Hermans, Ben (2013) Syllabification of Middle Dutch. In F. Mambrini, M. Passarotti and C. Sporleder (eds) Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities.

[3] Delente, Éliane and Renault, Richard (2009) Les étapes du traitement automatique d’un poème. Presentation given at “Le patrimoine à l’ère du numérique”, 10–11 December 2009, Université de Caen [http://www.crisco.unicaen.fr/verlaine/ressources/patrimoine_Caen.pdf].

[4] Devine, Andrew M. and Stephens, Laurence D. (1984) Language and Meter: Resolution, Porson’s Bridge, and their Prosodic Basis. Chico, CA: Scholars Press.

[5] Dipper, Stefanie (2005) XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation. In R. Eckstein and R. Tolsdorf (eds) Proceedings of Berliner XML Tage, pp. 39–50.

[6] Eythórsson, Þórhallur, Karlsson, Bjarki and Sigurðardóttir, Sigríður Sæunn (2014) Greinir skáldskapar: A diachronic corpus of Icelandic poetic texts. In Proceedings of LREC 2014: Workshop on Language Resources and Technologies for Processing and Linking Historical Documents and Archives – Deploying Linked Open Data in Cultural Heritage, Reykjavík, Iceland, pp. 35–41.

[7] Ibrahim, Robert and Plecháč, Petr (2011) Toward Automatic Analysis of Czech Verse. In B. P. Scherr, J. Bailey and E. V. Kazartsev (eds) Formal Methods in Poetics. Lüdenscheid: RAM, pp. 295–305.

[8] Gasparov, M. L. (1996) A History of European Versification, tr. by G. S. Smith and Marina Tarlinskaja, ed. by G. S. Smith with Leofranc Holford-Stevens. Oxford: Clarendon Press.

[9] Gerdes, Kim, Kahane, Sylvain and Pietrandrea, Paola (2012) Intonosyntactic data structures: The Rhapsodie treebank of spoken French. In Proceedings of the 6th Linguistic Annotation Workshop, Jeju, Republic of Korea, pp. 85–94.

[10] Eckart, Kerstin, Riester, Arndt and Schweitzer, Katrin (2012) A Discourse Information Radio News Database for Linguistic Analysis. In C. Chiarcos, S. Nordhoff and S. Hellmann (eds) Linked Data in Linguistics. Representing and Connecting Language Data and Language Metadata. Heidelberg: Springer, pp. 65–75.

[11] Labelle, Marie (2007) Clausal architecture in Early Old French. Lingua 117: 289–316.

[12] Martineau, France, Hirschbühler, Paul, Kroch, Anthony and Morin, Yves Charles (2010) Corpus MCVF annoté syntaxiquement. Ottawa: University of Ottawa [http://www.arts.uottawa.ca/voies/corpus_pg_en.html].

[13] Plecháč, Petr and Ibrahim, Robert (2014) Database of Czech Verse. Presentation given at “Frontiers in Comparative Metrics 2”, 19–20 May 2014, Tallinn.

[14] Rainsford, Thomas M. (2010) Rhythmic Change in the Medieval Octosyllable and the Development of Group Stress. In F. Neveu, V. Muni-Toke, T. Klingler, J. Durand, L. Mondada and S. Prévost (eds) Congrès mondial de linguistique française: CMLF 2010. Paris: Institut de linguistique française, pp. 321–36.

[15] Scrivner, Olga, Kübler, Sandra, Vance, Barbara and Beuerlein, Eric (2013) Le Roman de Flamenca: An annotated corpus of Old Occitan. In F. Mambrini, M. Passarotti and C. Sporleder (eds) Proceedings of the Third Workshop on Annotation of Corpora for Research in the Humanities, pp. 85–96.

[16] Zeldes, Amir, Ritz, J., Lüdeling, Anke and Chiarcos, Christian (2009) ANNIS: A search tool for multi-layer annotated corpora. In Proceedings of Corpus Linguistics, Liverpool.