Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
Pfaffenwaldring 5B, D-70569 Stuttgart

Master thesis

How Well Do Language Models Understand Grammar? A Case Study On Japanese

Gerhard Christian Breul

Degree program: M.Sc. Informatik
Examiner: Prof. Sebastian Padó
Supervisor: Dmitry Nikolaev
Start of thesis: 04.05.2022
End of thesis: 04.11.2022

Contents

1 Introduction
2 Background
  2.1 Transformer-based Architectures
  2.2 Linguistics
    2.2.1 Transitivity
    2.2.2 Japanese
3 Related Work
  3.1 Grammatical Knowledge
    3.1.1 Probing Approaches
    3.1.2 Behavioral Approaches
  3.2 Perplexity
4 Methods
  4.1 Defining “Understanding”
  4.2 Selection of a Grammatical Rule
  4.3 Selection of Language Models
    4.3.1 BERT Architecture
    4.3.2 GPT-2 Architecture
  4.4 Dataset Generation
    4.4.1 Verb Selection
    4.4.2 Selection of Arguments and Adverbs
    4.4.3 Sentence Construction
    4.4.4 Alternative Datasets
  4.5 Prediction of Transitivity
5 Results
  5.1 Rare Verbs
  5.2 Relative Clauses
  5.3 Causal Models
  5.4 Bidirectional Models
  5.5 Perplexity Modes
  5.6 Agreements
  5.7 Logistic Regression
6 Discussion
7 Conclusion
A Appendix
  A.1 Agreement figures
  A.2 Zusammenfassung auf Deutsch (German summary)

Abstract

Modern attention-based language models such as BERT and GPT have been shown to outperform previous state-of-the-art models on many NLP tasks. This performance implies a level of understanding of grammatical structures. This work attempts to contribute to the growing body of research assessing this understanding by exploring language models' ability to predict the transitivity of verbs in Japanese, a language that is somewhat underrepresented in this line of research compared to English. I consider a variety of language models with different architectures, tokenization approaches, training data, and training regimes.
In doing so, I find that bidirectional models outperform unidirectional ones, that different types of perplexity calculation can be advantageous in certain situations and should be considered on a case-by-case basis, and that the tested models only gain a somewhat limited understanding of the grammar required for the Transitivity Prediction task.

1 Introduction

Modern language model architectures based on the Transformer architecture proposed by Vaswani et al. (2017) demonstrate an impressive ability to produce and process natural language. Models like BERT (Devlin et al., 2018), GPT (Radford et al., 2018), and XLNet (Yang et al., 2019) have set new standards for a multitude of NLP tasks, such as question answering, translation, summarization, and diverse tasks pertaining to language understanding. To achieve such results, these models have to construct some internal representation of syntactic and semantic information about a given sequence, in order to, for example, extract the correct answer to a given question. For the performance of many NLP tasks, it is likely beneficial to be able to reliably identify parts of speech or dependencies between words, or to recognize the topic of a given context. In other words, a machine needs a degree of understanding of the inner workings of a language in order to be able to process it effectively.

Indeed, recent research shows that it is possible to extract specific information about grammatical structures from word embeddings (Hewitt and Manning, 2019; Liu et al., 2019a). However, not all attempts at extracting grammatical knowledge are as successful (Kogkalidis and Wijnholds, 2022). That models seemingly create representations of some, but not all, of the grammar of a language implies that their understanding of language may be incomplete. What knowledge models learn, and how to extract it, is a topic of ongoing research (Rogers et al., 2020; Lin et al., 2019; Hu et al., 2020). According to this research, not only the type of grammatical structure that is inspected, but also other factors, such as what kind of architecture and training regimen is employed, seem to play a role in the level of grammar understanding a model has (Kim et al., 2020).

English, as the de facto lingua franca of the scientific world, attracts a large share of the research efforts regarding the grammatical capabilities of language models. Generally, models appear to be able to grasp the grammar of this language relatively well, which may in part be owed to the fact that most models are specifically designed with it as their first priority, and in part to some of the specific attributes of the language itself, such as a strict sentence order and simple conjugation. However, as one might suspect, this does not necessarily translate well into other languages. A different syntax necessitates learning different rules, possibly requiring a more complex architecture. An example of how Transformer-based models fail to understand grammar is given by Kogkalidis and Wijnholds (2022). They show that BERT fails to recognize non-context-free patterns of dependencies in Dutch. Many of these analyses inspect a single model on one or more specific tasks. Considering and comparing multiple types of models may yield insights into which model properties affect grammatical understanding in what way. All this is to say that our knowledge about the limits of what Transformer-based language models can and cannot understand is still quite incomplete.
In this work, I assess the grammatical “understanding” of a range of pretrained Transformer-based models. I take into consideration three generative models that utilize variants of the GPT-2 architecture, and five models based on the BERT architecture, four of which are trained on BERT's MLM task, and one of which is trained to discriminate “imposter” tokens, i.e. tokens that have been replaced by a smaller generative model. The selected models represent a range of different training tasks and regimens, different architectures, and tokenization approaches. As the method of determining syntax understanding, I use the prediction of transitivity in Japanese sentences: Models decide, given two Japanese sentences, one with a transitive and one with an intransitive verb, which of the two otherwise identical sequences is more plausible. This method requires the models to be able to perform three separate basic tasks:

• They need to be able to differentiate between transitive and intransitive verbs.

• They need to determine the number and type of arguments referring to the verb.

• They are required to know the rules according to which each type of verb becomes viable or unviable depending on the number and type of arguments.

I use perplexity as a metric to determine a model's preference for one type of verb over the other. As the perplexity of a sequence, or of part of a sequence, can be calculated in multiple ways for some of these models, I consider a total of 19 combinations of models and methods of perplexity calculation.

I find that model architecture, tokenization approach, and training regimen affect performance on the task selected for evaluation, with unidirectional models performing worse than bidirectional ones. Models' predictions tend to agree most strongly with predictions made by models of the same architecture, with different modes of likelihood calculation within a model only occasionally agreeing less with others of the same model than with architecturally similar but distinct models. An exception to this can be observed in the multilingual models: despite using the BERT architecture, and even when their overall performance is decent (as is the case for XLM-RoBERTa), they produce predictions that do not strongly correlate with those of either uni- or bidirectional models, suggesting that the language features they consider are distinct from those of the other models.

This thesis is structured in seven chapters: After this introductory chapter, I will lay the foundation in the Background chapter, followed by a chapter reviewing related work. After this, the approach used for this research will be detailed in the Methods chapter. This is followed by the presentation and consequent discussion of the results in the following two chapters. Finally, I summarize my work in the Conclusion chapter.

2 Background

In this chapter, I will review two of the research areas this work refers to: attention-based deep learning language models based on the Transformer architecture, and Japanese grammar, especially concerning transitivity.

2.1 Transformer-based Architectures

Since the publication of the original Transformer (Vaswani et al., 2017), language models based on its design have become ubiquitous in NLP. The Transformer architecture consists of an encoder and a decoder, each originally made up of a stack of six blocks. One such encoder block is made up of a multi-head self-attention layer followed by a fully connected feed-forward layer.
On the decoder side, each block additionally contains a masked self-attention layer at the bottom, which prevents positions from attending to future tokens and in effect makes the decoder unidirectional. In contrast to state-of-the-art models of the time, which were based on convolution or recurrent layers, this model relies completely on attention mechanisms, making it easier to train, while significantly outperforming them on the translation task. Unsurprisingly, such results inspired the development of models that use the same attention mechanism. Arguably the most well-known of those are the pretrained models BERT (Devlin et al., 2018), which is based on its encoder, and GPT (Radford et al., 2018), which implements its decoder.

Devlin et al. (2018) adapt the number of encoder blocks, the number of attention heads per attention layer, and the hidden size of the feed-forward layer, and propose two architectures: a smaller version with 110M parameters, and a larger one with 340M parameters. These models are trained on two tasks: Masked Language Modeling, also known as the cloze task (Taylor, 1953), and Next Sentence Prediction. For Masked Language Modeling, 12% of tokens are replaced by the [MASK] token, 1.5% of tokens are replaced by a different random token, and 1.5% are selected but not replaced. The model then has to reconstruct the original tokens within the sentence. Replacing some tokens with random ones instead of masks appears to be rather important for performance on downstream tasks, where often the mask token is not used. The Next Sentence Prediction task is the binary decision of, given two sentences A and B, whether sentence B follows sentence A.

BERT received much attention from researchers (Rogers et al., 2020), and many improvements to its training regimen have been proposed. For example, Liu et al. (2019b) find in their own evaluation that the Next Sentence Prediction task is not conducive to downstream task performance, and therefore drop the objective. Instead, their model, RoBERTa, is trained on more training data and longer sequences. Another avenue of research concerns the acquisition of multiple languages by a single model. Multilingual BERT (https://github.com/google-research/bert/blob/master/multilingual.md) is a model that uses an architecture and training objective identical to BERT-base, but is trained on over 100 languages (102 for the original and 104 for the newer, cased model). While it has weaknesses, especially for low-resource languages (Wu and Dredze, 2020), it performs surprisingly well at cross-lingual model transfer (Pires et al., 2019), given its lack of a cross-lingual pretraining objective. Combining the training improvements of RoBERTa with a multilingual approach, Conneau et al. (2019) propose XLM-RoBERTa, effectively a multilingual RoBERTa model. This model is trained on one to two orders of magnitude more data per language and seems to generally outperform Multilingual BERT.

Clark et al. (2020) propose a different method for improving on BERT's language understanding. Instead of masked language modeling, their model, ELECTRA, is trained on a replaced token detection task. The generator, a small masked language model similar to BERT but smaller in size, receives sequences with 15% of their tokens masked and replaces the masked positions with its own predictions. The second model, called the discriminator, which is architecturally identical to BERT, is then tasked with determining which of the tokens in the sequence produced by the generator belonged to the original sequence, and which were replaced.
As a result of this regimen, ELECTRA requires less training, as all tokens of a sequence are considered, and has less of a disconnect between pretraining and fine-tuning.

While Transformer-encoder-based architectures are useful for many applications, given their bidirectional nature, their ability to generate language is limited. For such tasks, a generative language model such as GPT (Radford et al., 2018) or its evolution GPT-2 (Radford et al., 2019) is preferable. These models implement the Transformer's decoder, meaning their defining feature is the masking in their attention layers, which prevents tokens from attending to right-hand context. GPT-2 builds on GPT mainly by increasing its size (from 117M parameters for GPT to 345M parameters for the mid-sized GPT-2 model). The functional differences between the BERT-type and GPT-type models that are expected to be most relevant for understanding grammar are GPT's architectural inability to attend to right-side context and BERT's limited ability to consider multiple sentences as a result of its training objective. As such, a slightly altered masked language modeling objective, for example as employed by RoBERTa, may be beneficial to such a model.

2.2 Linguistics

As this work centers around a linguistics problem, I will survey research on the topic of transitivity, which is the main topic of interest, in this section.

2.2.1 Transitivity

Transitivity is an important concept in languages all over the world. The term refers to a property of a clause which describes the transferal of an action from one party, the agent, to another, the patient. Intuitively, the easiest method of determining whether a phrase has high or low transitivity is to observe the number of participants; with a single participant, no action can be transferred, resulting in low transitivity, while multiple participants indicate some level of action transferal, increasing transitivity. Apart from the number of participants, Hopper and Thompson (1980) identify nine other parameters which affect a clause's transitivity, such as aspect, which considers whether an action is completed, and agency, which considers the ability of the agent to effect the transfer of the action.

Transitivity as determined by participant count often affects the choice of verbs across languages. Generally, some verbs require multiple arguments, while others accept only one. An example in English is the verb to go, which only accepts a subject as argument (for example I go), making it an intransitive verb, while a transitive verb like to throw can take a subject and a direct object, as in The boy throws a rock. Often, transitive and intransitive verbs describing the same situation share an origin. To use another simple English example, I open the door and The door opens share the same verb, even though the first sentence takes two arguments, while the second takes only one. This is what is known as a labile alternation pattern. A different pattern is exhibited by the German verb pair liegen 'to lie' and legen 'to lay sth. down'. Although both words are similar in meaning and share a historic root, the former does not accept a direct object, while the latter requires it. This type of relation, where both verbs share a root and neither is derived from the other, is known as an equipollent pattern.
Haspelmath (1993) looks at verb pairs from 21 languages, and finds stark differences between languages in the way transitive and intransitive verbs are derived: Some languages, such as Greek, German, and English, have a strong preference for labile patterns, using the same verb in clauses with different transitivities. Others, such as Russian and Romanian, tend to use an anticausative pattern, meaning the intransitive verb is derived from the transitive one. In the case of Japanese, the majority of verb pairs display an equipollent pattern. However, causative (where the transitive verb is derived from the intransitive) and anticausative derivation patterns do also occur. Importantly, labile constructions are effectively absent from Japanese, meaning that every verb is either transitive or intransitive. A pair of verbs with differing transitivity, where both alternants share an origin, is called a transitivity pair.

2.2.2 Japanese

The Japanese language has aspects that set it apart from others and which need to be addressed. Likely the most obvious of those is its writing system. Japanese uses three distinct scripts: hiragana, katakana, and kanji. Hiragana and katakana, together known as kana, are syllabaries which essentially denote the same syllables. The difference between the two lies in what they are used for: Hiragana are used for inflections of verbs and for words with a grammatical function like particles, as well as some native Japanese words. Katakana are mostly used to transcribe non-Chinese foreign loanwords into Japanese. Each of these scripts encodes around 100 unique syllables, with which any Japanese word can be spelled out. Finally, kanji are logographic characters originating from Chinese. Rather than syllables, each of these characters represents a meaning and usually has multiple ways to be read. As an example of Japanese script, we consider the sentence '(I) open the door':

ドア を 開ける。
doa wo a-keru

The first two symbols (ドア) are katakana and spell the English loanword 'door'. The next symbol (を) marks the direct object and, being a grammatical particle, is written in hiragana. Lastly, 開ける is the verb of the sentence. It consists of the kanji 開 as its stem, which describes the action of opening, and the inflection written in hiragana. Hiragana used for the inflection of verbs and adjectives in this manner are called okurigana.

I will avoid the use of Japanese script wherever possible for the sake of simplicity. Instead, to make Japanese readable, it will be written in romaji, a transcription of the Japanese script using Latin letters, used for example for typing. Since romaji directly transliterates each kana into a sequence of Latin letters, there are some differences to the notation usually used in the literature, which transcribes the reading of words instead of their spelling. For example, Japan's capital Tokyo is written as Toukyou in romaji, as a written ou is usually read as a long o. Another such case is the accusative marker, which in romaji is written as wo, while usually (but not always) read as o in modern Japanese. Given that exact pronunciation is not the focus of this work, using romaji instead of a transcription based on phonology, as is commonly used in the linguistics literature, should be sufficient.

As stated in the section about transitivity, Japanese strictly distinguishes between transitive and intransitive verbs, and usually derives both from a common root.
For Japanese text, this root in most cases is denoted by the kanji stem of a verb, with the okurigana differentiating transitive from intransitive. The alternants of transitivity pairs therefore tend to look quite similar to each other, and there is no general rule to decide a verb's transitivity purely based on its okurigana, without additional lexical information. In terms of the structure of simple sentences, the language has verb-final order. Apart from this, sentence structure is relatively free, although subject-object-verb is considered the standard. Instead of by position, argument types are conveyed by particles, markers which are appended to the noun phrase. The ga particle marks the subject of a verb. In terms of transitivity, this denotes the agent from whom the action originates. The wo particle marks the direct object of a transitive verb. In certain situations, it is acceptable to leave out such particles (Minashima, 2001). However, this phenomenon is mostly observed in the spoken language, and a short preliminary experiment suggested that BERT does not have a good understanding of it, which is unsurprising, given that its training dataset was Wikipedia. The ha particle is a so-called topic marker. An argument with this marker can, depending on the context, take the role of either subject or object. Like arguments, adverbial phrases can also be placed at any point before the verb. Apart from a few exceptions, relative clause construction in Japanese uses a gap strategy (Comrie and Polinsky, 1993), where the relative clause is a basic clause structure with a missing argument, prefixed to the head noun, which fills the role of the gap.

Another feature of Japanese is that it is a pro-drop language, meaning that pronouns are regularly omitted. This creates difficulties not only for determining the presence or absence of arguments, but also for training, as zero-anaphora constructions may throw language models off. Umakoshi et al. (2021) propose a method to alleviate this problem by using parallel text of a language that is not pro-drop, as in such languages, the dropped argument will usually be explicitly named. The pro-drop tendency, combined with the lack of verb conjugation based on grammatical person and number, has the added effect of making Japanese uniquely sensitive to context, as for example an omitted subject can often not be determined without it.

Languages such as Japanese and Chinese also set themselves apart by their lack of spaces to mark the beginning and end of words. For NLP, this mainly complicates tokenization, which requires more sophisticated means of recognizing words. This aspect and its effects will be examined more closely in coming chapters. A reference grammar of Japanese is given by Martin (2003).

3 Related Work

In this chapter, I will go into some of the existing research concerning neural language models, especially with regard to their capacity for syntactical understanding. On the topic of syntactical understanding, a growing body of research can be found, which, as already alluded to in the introduction, mainly focuses on English models and syntax. The approaches used in this research can roughly be categorized into one of two types: behavioral approaches, which evaluate a model's outputs given specific inputs, effectively considering it as a black box, and probing approaches, where the outputs of a model's layers are evaluated using a so-called probe, which takes a specified layer's output as input features to make a prediction.
By training such a probe in a supervised fashion, one can extract specific information encoded in those layers.

3.1 Grammatical Knowledge

A comprehensive overview of the research done on BERT is given by Rogers et al. (2020). This overview collects proposed improvements and insights gained from more than 150 studies, finding generally applicable statements about the extent and limitations of BERT's syntactic and semantic knowledge. It also introduces many of the training regime improvements that have been suggested, such as RoBERTa (Liu et al., 2019b), which drops the next sentence prediction task and increases the amount of training data by an order of magnitude, and XLNet (Yang et al., 2019), which instead of masking uses word order permutation. While this analysis is mainly concerned with research on monolingual English BERT, and its findings are therefore not guaranteed to be applicable to other languages or multilingual models, it serves as an important overview of what one should expect BERT to be capable of.

Investigating computational limitations, Bhattamishra et al. (2020) explore the ability of transformer-based architectures to recognize formal languages. They find these models to be limited in their ability to recognize certain types of regular languages, as compared to LSTMs. Bai et al. (2021) show that it is possible to improve BERT's and RoBERTa's performance on downstream tasks by training attentions to reflect syntax trees. This implies that where data annotated with such information is available, it is beneficial to utilize this kind of partially supervised training approach.

3.1.1 Probing Approaches

A probe, as mentioned above, often refers to a single (trained) linear transformation or a small neural network consisting of one or a few feed-forward layers. It takes the output of a certain layer of the model as its input and is then trained to make task-specific predictions based on this. Probes are commonly used to extract encoded information from different layers of models, as a method of finding linguistic knowledge in these encodings.

Peters et al. (2018) probe different types of bidirectional models - an LSTM, Transformer, and CNN - for grammatical and contextual knowledge using an array of diverse tasks, such as POS tagging and constituency parsing. Their results show that for many of the posed tasks, good predictors can be found within a model's layer outputs, with the optimal representations for different tasks being found within different layers of the model. It should be noted that the implementation of the transformer model Peters et al. (2018) used, while bidirectional, does not closely resemble BERT, but rather a very small forward and backward GPT, as it utilizes a Transformer decoder rather than an encoder. In similar experiments, Jawahar et al. (2019) probe BERT for syntactic and semantic information and find that syntactic information is generally encoded in the middle layers, with semantic information being encoded further up. Tasks requiring the processing of long-range dependency information, such as verb-subject agreement, are found to require higher layers. Such results are corroborated by Lin et al. (2019), who find that BERT encodes positional information in early layers, with the information encoded in deeper layers becoming increasingly complex.
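To make the general probing setup described above more concrete, the following is a minimal sketch of a linear probe trained on the frozen hidden states of a pretrained model. The model name, the layer index, the toy sentences, and the binary property being probed for are placeholders chosen purely for illustration; actual probing studies train and evaluate such classifiers on annotated corpora and compare them across all layers.

    # Minimal sketch of a probing classifier on top of a frozen language model.
    # The model name, layer index, toy sentences, and labels are illustrative
    # placeholders, not the setup of any particular study cited here.
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")
    model.eval()

    sentences = ["The door opens.", "I open the door."]
    labels = [0, 1]  # hypothetical sentence-level property, e.g. "has a direct object"

    features = []
    with torch.no_grad():
        for sent in sentences:
            enc = tokenizer(sent, return_tensors="pt")
            out = model(**enc, output_hidden_states=True)
            # use the [CLS] vector of a middle layer as the probe's input features
            features.append(out.hidden_states[7][0, 0].numpy())

    # the probe itself: a simple linear classifier trained on the frozen features
    probe = LogisticRegression(max_iter=1000).fit(features, labels)
    print(probe.predict(features))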
A similar approach is used by Liu et al. (2019a). They compare the performance of the layer-wise best-performing probes of transformer-based models (namely BERT and GPT) to state-of-the-art task-specific models and find that performance is competitive for many, but not all, tasks. The authors determine this to be due to the models requiring more precise data for these tasks.

Hewitt and Manning (2019) extract distances between layer token representations of BERT and ELMo. Using these distances, they attempt to recreate the sentence's parse tree. They observe that these trees can indeed be reconstructed relatively reliably from the distances calculated from embeddings extracted at middle layers of the models (layers 7 and 8 for 12-layer BERT), and increasingly less reliably from distances extracted from lower and higher layers. The implication is that the models in question do construct representations of syntax at certain points within them.

Mueller et al. (2022) compare multilingual BERT and XGLM, a multilingual auto-regressive language model, as well as multiple monolingual models, with regard to the subject-verb agreement knowledge encoded in their neurons, and find that multilingual models share syntax-sensitive neurons across languages, with XGLM sharing more than multilingual BERT, and that auto-regressive models encode knowledge in a distinct fashion from masked language models.

What knowledge requires specialized training to get encoded also varies from language to language: Koto et al. (2021) explore the document-level information encoded in Spanish, Chinese, German, and English models. They find that while the point at which the models encode this information tends to be a similar layer across languages, the quality can vary significantly, implying that tasks that are difficult in one language may be simpler in another. An example of a difficult task is given by Kogkalidis and Wijnholds (2022), who probe Dutch BERT's output layer for its knowledge of certain non-context-free constructions. They find BERT largely unable to model such structures. Their results also point towards a possible reason for BERT's performance discrepancies between English and other languages, as the investigated structures would be ungrammatical in English. In a similar fashion, Ueda et al. (2020) perform cohesion analysis on Japanese text by training a probe on top of BERT's output. While their approach outperforms the previous state of the art on multiple tasks, results are likely still far from human scores.

3.1.2 Behavioral Approaches

Another method of evaluating linguistic knowledge in models is to simply observe their behavior. In the case of masked models, one intuitive method to do so would be to mask a certain word and compare the probabilities of grammatical versus ungrammatical tokens. An example of such an approach is demonstrated by Goldberg (2019), who assesses BERT's grammatical knowledge by masking a token and comparing its likelihood with that of a token which is similar in most aspects, but which violates a certain syntactic rule, such as subject-verb agreement. For example, given the sequence "The medicine [MASK] an effect", one could compare the likelihood of "has" to that of "have" in the position of the masked token. This effectively forces the model to make a decision between a grammatical and an ungrammatical sequence. According to the author, BERT performs surprisingly well in this setting. This approach, however, is restricted to specific models and situations.
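The comparison described by Goldberg (2019) can be written down in a few lines. The sketch below masks the verb position and reads off the probabilities the model assigns to a grammatical and an ungrammatical candidate token; the English model name is used only as an example, and the sentence is the one quoted above.

    # Sketch of the masked-token comparison described by Goldberg (2019):
    # mask the verb position and compare the probabilities assigned to a
    # grammatical and an ungrammatical candidate.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    sentence = f"The medicine {tokenizer.mask_token} an effect."
    enc = tokenizer(sentence, return_tensors="pt")
    mask_pos = (enc.input_ids[0] == tokenizer.mask_token_id).nonzero().item()

    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]
    probs = logits.softmax(dim=-1)

    for candidate in ["has", "have"]:  # grammatical vs. ungrammatical filler
        token_id = tokenizer.convert_tokens_to_ids(candidate)
        print(candidate, probs[token_id].item())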
In order to compare a wider range of models, a universally applicable approach is required. Warstadt et al. (2020) construct BLiMP, a dataset of minimal pairs to evaluate models' ability to answer a multitude of syntactical questions. Specifically, this is achieved by comparing the probabilities of two minimally different sentences, where one sentence is ungrammatical, similar to Goldberg (2019). The sequence with the higher probability is considered to be the model's prediction. Transformer-based models outperform the LSTM and 5-gram comparisons on most tasks, with GPT-2 even reaching performance comparable to humans on some of them. Using the probability of a sequence, while intuitive and effectively applicable to any language model, comes with its own problem: sequence probability, as the product of each token's probability, is highly dependent on sequence length, meaning that if one sequence is longer than its counterpart, it is highly likely to have a lower probability, regardless of grammatical acceptability.

Commonly used methods to evaluate the downstream performance of models are benchmarks such as SQuAD and GLUE: SQuAD (Rajpurkar et al., 2016) and SQuAD 2 (Rajpurkar et al., 2018) are datasets to evaluate question answering performance, by supplying the model with a paragraph of text about which it is then asked to answer a question. SQuAD 2 introduces the possibility of unanswerable questions, making the task more difficult. GLUE (General Language Understanding Evaluation) (Wang et al., 2018) is a test suite collecting multiple tests to quantify language understanding, such as a test for linguistic acceptability, where models are asked to decide whether an English sentence is grammatical (Warstadt et al., 2019), sentiment analysis (Socher et al., 2013), and question answering.

Xiang et al. (2021) create CLiMP, a Chinese language understanding benchmark based on minimal pairs. They compare BERT, multiple LSTM models, and 5-gram models. While BERT outperforms its competitors with 81.8% accuracy by a wide margin, it is still far away from the 95.8% human agreement. Predictions are determined analogously to BLiMP. More recently, Song et al. (2022) generally confirm the findings of Xiang et al. (2021) by creating another minimal-pair-based Chinese evaluation dataset named SLiNG, and improving on some of the aspects of CLiMP that were seen as problematic. They expand the scope of models under consideration to include a wide array of mono- and multilingual transformer models. Furthermore, instead of likelihood, they use perplexity (or pseudo-perplexity, if applicable) as the deciding metric. This was one of the criticisms leveled at the prior approach, as likelihood is by design extremely sensitive to sequence length, which varied in some sentence pairs in CLiMP. Chinese monolingual BERT, as the best-performing of the tested models, reaches an accuracy of 84.8% on this set, which is still significantly below the 97.1% average accuracy of the human control. Interestingly, multilingual models seem to perform especially poorly on both of these Chinese benchmarks.

3.2 Perplexity

As mentioned above, likelihood can be problematic as a metric because of its dependence on sequence length. Perplexity, the inverse of the sequence likelihood normalized by sequence length (equivalently, the exponential of the average negative log-likelihood per token), is, at least in theory, not correlated with length. Given this property, the metric is commonly used to judge a model's preference for one sequence over another.
Salazar et al. (2019) argue that, for bidirectional models like BERT, the calculation of likelihood and perplexity needs to be slightly adjusted, as the original formulas only consider one-sided context. To calculate a metric mirroring perplexity, they propose to sequentially mask each token in order to calculate the sequence likelihood, which is then used to compute what they call "pseudo-perplexity". Given that this process requires as many runs of the model as there are tokens in the sequence, it is somewhat computationally expensive compared to perplexity computations on other models. Therefore, the authors propose a perplexity calculation without masking, based on the likelihoods BERT assigns to each token of the unmasked sequence, which only requires one iteration, as a trade-off between task performance and computing requirements. In this work, when the term perplexity is used in the context of masked language models, it refers to what Salazar et al. (2019) call pseudo-perplexity, unless otherwise specified.

An example of the use of perplexity for grammar understanding assessment is given by Marvin and Linzen (2018): Comparing the perplexity of RNNs on minimal-pair sentences that differ in syntactic correctness, they find that, at least according to this metric, such models do not display much grammatical knowledge. Miaschi et al. (2021) investigate the variables that affect perplexity scores for GPT-2 and BERT. They conclude that the two models react to similar, but distinct aspects of sequences. This implies that these models may learn syntax differently due to their architectural differences.

Lee et al. (2021) propose to use perplexity in the context of fact-checking. To do so, they give one or more sentences of evidence for some fact and a claim, which either agrees or disagrees with the evidence, to a pretrained BERT and GPT-2 model. They then measure the perplexity of the evidence. If it is above a certain learned threshold, the claim is considered false, else it is considered true. This approach to fact-checking performs well in a few-shot setting compared to models fine-tuned on this task. GPT-2 outperforms BERT on this task, which the authors assume to be due to the difference between perplexity and pseudo-perplexity, but which may also be due to pretrained BERT not having any pretraining objective that requires consideration of more than two sentences.

Wang et al. (2022) discuss flaws of perplexity as a metric: They find that while it is supposed to be a measure of linguistic acceptability, it is influenced by factors such as sequence length, repetition, and punctuation marks. This should be kept in mind when planning to use perplexity as the deciding metric. Most approaches get around this by using perplexity to force a decision between two minimally different sequences. Such sequences should only differ in one feature in order to eliminate as many sources of difference in perplexity as possible. One should avoid differing punctuation marks or repetitions, and differences in length should be kept to a minimum. The described effect of perplexity decreasing with increasing sequence length can also be observed on the dataset used in this work, for at least three different language models.

4 Methods

In this chapter, I will elaborate on the approach used to determine the grammatical understanding of current language models.

4.1 Defining "Understanding"

The goal of this work is to determine the capacity of modern language models to "understand" grammar. To do so, we first have to agree on a definition of understanding.
There may be a discussion to be had about whether an artificial model can truly understand anything without being conscious. For the purpose of this work, however, this view is somewhat impractical, as consciousness is a notoriously difficult concept to define. Understanding, in the context of this work, thus should not presuppose consciousness. On the other hand, most would likely agree that guessing the right answer to a question based only on statistical analysis, for example as exemplified by n-gram models, does not constitute understanding, in the same way that a student learning all answers to a math quiz by heart, instead of calculating each according to a certain set of rules, can not be said to have understood these rules. Distinguishing between learned rules and statistics requires some ingenuity (Anil et al., 2022). Considering this, understanding, for a language model, should ideally involve learning abstract rules from the training data and being able to apply them to unseen data. For this work, I will define understanding as such: A language model can be said to understand a grammatical rule if it can reliably decide whether a given sequence fulfills it.

4.2 Selection of a Grammatical Rule

To evaluate grammatical understanding, a specific rule or set of rules needs to be selected. I decided to use the rules governing the viability of transitive vs intransitive verbs in simple Japanese sentences for this purpose. I will call the associated problem of deciding whether, given a certain context in Japanese, a transitive verb or its intransitive counterpart is more appropriate, Transitivity Prediction. The decision to use Transitivity Prediction as a means to assess grammatical understanding is motivated by some of the aspects of the language and the problem itself:

• The verb-final sentence structure in Japanese allows unidirectional models to evaluate the verb in a sentence with a similar amount of information about the context as bidirectional models. The only part of the context the verb in a unidirectional model is unable to attend to is the sentence-ending token.

• Having transitivity pairs that share a stem and meaning helps to reduce the difference between transitive and intransitive sentences, potentially reducing the effect of co-occurrence biases and other unwanted sources of noise, as such effects will usually be present in both sentences. In some cases, transitive and intransitive sentences differ by as little as a single token.

• Since Japanese has a relatively free sentence order, apart from the verb generally being bound to the end of a sentence, big datasets can be constructed using relatively few components by changing the order of constituents. Furthermore, this allows us to observe how architectures deal with sentence structures that are in this respect more complicated than those found in languages with stricter restrictions, such as English.

• Determining the actual viability with regard to Transitivity Prediction is as simple as counting the number and type of arguments: If a sentence has two arguments or one argument that is a direct object, it is obligatorily transitive; else, both verbs are grammatically acceptable.

Of course, there are also some aspects of the language that pose challenges: In Japanese, as opposed to many European languages such as English or German, ellipsis of arguments is very common. In such cases, the missing argument is typically implied in the context.
As a consequence, even in sentences that have, for example, no specified direct object, such an argument might still be implied. Importantly for us, this means that sentences with a single argument or even no specified arguments are not automatically intransitive. Thus, while one can construct a sentence that requires a transitive verb by supplying a direct object to the sentence, there is no method to construct a definitively intransitive sentence, since there is always the possibility of an implied direct object. Another challenge of working with Japanese is the lack of spaces as word delimiters. Tokenizers have to use more sophisticated methods to separate text than simply considering each substring between two spaces as a word. For Japanese, successful models employ a grammatical parser that uses a dictionary to identify words, the output of which is then further separated according to the WordPiece algorithm. However, if no language-specific information, such as a dictionary, is used, as may be the case with, for example, some multilingual models, more general tokenization strategies can lead to problems. An example of such a problem is described in the subsection regarding XLM-R within the next section.

4.3 Selection of Language Models

Eight pretrained language models based on state-of-the-art architectures available on huggingface.co were selected for analysis, with the goal of covering a wide range of parameter combinations, such as architecture, training data, training approaches, and tokenization approaches. Five of these are based on the BERT architecture. The other three are variants of GPT-2 in differing sizes. In this section, I will go over the specifics of each model. Later on, I will refer to each model by the name in brackets instead of its full name for convenience.

4.3.1 BERT Architecture

• Japanese whole word masking BERT (BERT): This model (huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking, commit ab68bf4, downloaded 26.05.2022) uses the same architecture as BERT-base (Devlin et al., 2018), meaning it has 12 sets of an attention layer followed by a fully connected feed-forward layer. Each attention layer consists of 12 attention heads, while each feed-forward layer has a hidden size of 768. This results in a model with 110 million parameters. For tokenization, it uses the MeCab morphological parser (Kudo, 2005), an open source text segmentation library for Japanese, with the IPA dictionary, to segment text into words, and then uses the WordPiece algorithm to segment them further, resulting in a vocabulary size of 32000. The training dataset for this model is Japanese Wikipedia, using the dump from September 1, 2019. While it is not specifically stated by the creators, WordPiece is likely trained on the same dataset.

• Japanese character based BERT (cBERT): Architecturally, this model (huggingface.co/cl-tohoku/bert-base-japanese-char, commit 6aa4c7b, downloaded 26.05.2022) is identical to BERT. Training was done on the same dataset as BERT as well. Its unique aspect (among the selected models) is its tokenization: According to its description, this model also uses MeCab with IPA to segment text into words, but then further separates these words into single characters, resulting in a character-based tokenizer. Any word that is not recognized by the MeCab parser is considered an unknown token, and will not be considered for further tokenization. The result is that rare kanji not occurring in words from the dictionary do not find their way into the vocabulary.
Another, slightly different character-based BERT model pretrained by cl-tohoku is also available on huggingface: this model uses Unidic 2.1.2 instead of the IPA dictionary, resulting in a bigger vocabulary of 6144, compared to the original size of 4000. A small preliminary experiment showed both models performing similarly well on a transitivity prediction task. With the first model outperforming the second by a small margin, I decided to only consider the first for further analysis, in order to keep the number of models being analyzed manageable.

• Multilingual BERT (mBERT): This model (huggingface.co/bert-base-multilingual-uncased, commit 800c34f, downloaded 07.09.2022) once again uses the same architecture as those above, but was trained on the Wikipedia datasets of 102 different languages according to its description on huggingface (this number differs from the 104 languages cited by Pires et al. (2019)). To process Japanese text, this model represents each kanji as its own token, as a character-based model would, but uses WordPiece for non-kanji symbols. This results in somewhat of a hybrid character- and WordPiece-based tokenization approach.

• XLM-RoBERTa-base (XLM-R) (Conneau et al., 2019): This multilingual model (huggingface.co/xlm-roberta-base, commit f6d161e, downloaded 17.09.2022) uses the same architecture as the models above. It is, however, trained on a much bigger dataset than the previous models: in order to learn 100 languages, apart from Wikipedia, it also uses CommonCrawl data which has been cleaned following Wenzek et al. (2019), increasing the amount of training data per language by two orders of magnitude on average. Just like its namesake RoBERTa, this model forgoes the next sentence prediction task used in the training of BERT. Instead, it uses consecutive sentences cropped to the maximum sequence length in training. It uses the SentencePiece algorithm for tokenization, which is useful for a multilingual model, as it does not require language-dependent logic, but, in the case of Japanese, creates a problem for language understanding: Due to the lack of spaces as word delimiters, SentencePiece sometimes produces erroneous tokenizations. It occasionally separates sequences in such a way that tokens end up containing parts of multiple words. A crucial example of this misbehavior for our application is its tendency to fuse particles with the stems of verbs, especially if those verbs co-occur with the particles and are otherwise not very common. This becomes particularly noticeable with the "wo" particle, since the symbol for this particle does not appear in any other context. Such behavior is problematic because a token containing a particle and part of a verb may obscure the underlying words to the model, making it difficult to determine their properties. Such a token can not easily be categorized as either the stem of a verb of a certain type or as a particle, but will likely constitute its own category of word to the model. Ultimately, such inconsistencies could impair the model's ability to create an internal representation of a sequence's grammatical structure during training as well as evaluation, resulting in reduced performance.

• ELECTRA base Japanese discriminator (ELECTRA): This model (huggingface.co/izumi-lab/electra-base-japanese-discriminator, commit 9a50f72, downloaded 12.08.2022) is the discriminator part of an ELECTRA base model as described by Clark et al. (2020), trained on the Japanese Wikipedia dump from June 1, 2021.
Instead of masked language modeling, the ELECTRA discriminator is trained to recognize whether a token has been replaced by a smaller-scale masked language model, in this context called the generator. In training, 15% of tokens are masked and then predicted by this generator. If the generator prediction differs from the original token, this replaced token is considered an imposter; if the prediction is correct, it is considered not an imposter. Owing to this pretraining regimen, the prior probability that any given token is an imposter is expected to be below 15%. One potentially beneficial effect of this approach is that the discriminator does not require a mask token in training, making it easier to fine-tune the model for downstream tasks which do not require it, as there is no discrepancy between (masked) pretraining and (maskless) fine-tuning. Another effect is a drastic reduction in training time, since the model can effectively use all tokens of a sequence for training, instead of being limited to masked tokens. The discriminator uses the same architecture as BERT, while the generator is one third of its size. The ELECTRA model I am evaluating uses an approach to tokenization almost identical to the Japanese BERT model: MeCab using the IPA dictionary for word tokenization, WordPiece for subword segmentation. The difference between these tokenizers is ELECTRA's slightly bigger vocabulary of size 32768.

4.3.2 GPT-2 Architecture

• Japanese GPT-2 (GPT-2): Representing a typical mid-sized GPT-2 architecture, this model (huggingface.co/rinna/japanese-gpt2-medium, commit f464b76, downloaded 12.04.2022), which is the most popular monolingual Japanese language model of this specific type on huggingface, was selected. GPT-2 models like this one implement the decoder of the Transformer architecture proposed by Vaswani et al. (2017), in contrast to the Transformer-encoder-based BERT. Specifically, this model consists of 24 Transformer decoder layers, each of which is itself made up of a masked attention layer with 16 attention heads and a feed-forward layer with a hidden size of 1024, resulting in a significantly bigger model than BERT with 345M parameters. It was trained on Japanese CC-100 (Wenzek et al., 2020), a 15GB CommonCrawl dataset, and Wikipedia, with a causal language modelling objective. The exact Wikipedia version is not named, but Git changelogs point towards the August 1, 2021 dump (huggingface.co/rinna/japanese-gpt2-medium/commit/ae4875affd0259f0cd8debaea23174fc524c05df). Tokenization is done using a SentencePiece tokenizer that was trained on the same Wikipedia dataset. As explained in the section on XLM-R above, this tokenizer shows some unintended behavior which could impact performance negatively.
• Japanese GPT-2 small (GPT-2 small): 10 This model implements the small ver- sion of GPT-2, with 12 instead of 24 layers, 12 instead of 16 attention heads per attention layer, and a hidden size of 768 instead of 1024, resulting in 117M parameters, effectively bringing its size in line with BERT. Unlike the two other GPT-2 based models, this model uses simple byte pair encoding without a dictionary for tokenization. In practice, this bypasses SentencePiece’s prob- lematic behavior, at least for the situation described, where SentencePiece considers a verb stem and a preceding argument marker to be one word, on the constructed datasets being used. Training was done on Wikipedia and a subset of the Japanese CC-100 dataset, although details on this subset are not specified. 8https://huggingface.co/rinna/japanese-gpt2-medium/commit/ae4875affd0259f0cd8debaea23174fc524c05df 9huggingface.co/yellowback/gpt-neo-japanese-1.3B, commit 69add76, downloaded 01.09.2022 10huggingface.co/ClassCat/gpt2-base-japanese-v2, commit 52e7199, downloaded 04.09.2022 25 4.4 Dataset Generation To generate sequences to evaluate these four models on, I construct simple sentences consisting of a verb at the end, and up to three further constituents, which can be either arguments or an adverbial. 4.4.1 Verb Selection To select a suitable set of verbs, I start out with a list of 306 verb pairs compiled by Kageyama and Jacobsen (2016). The pair hairu “to enter” - ireru “to put in” is added manually by me. This specific pair is missing in the original list because the verbs are derived from different stems, as indicated by the stems’ differing readings. However, in writing, these verbs use the same root kanji. Since our models can only consider the written form, this pair appears like many other pairs on the list to them. I then filter pairs from this list according to certain criteria: • if according to the JMdict dictionary (Breen, 2004), one or both alternants are usually written using kana alone. For such pairs, models would be much less likely to recognize the verb in its kanji form. • if both alternants share the same transitivity, meaning the list and JMDict conflict. • Differing root kanji, as part of the idea behind Transitivity Prediction is to make the difference between sentence pairs as minimal as possible, which in- cludes differing verb stem writings. • if an alternant is not present in the dictionary at all. For the remaining pairs, I record their number of occurrences in the Wikipedia corpus. This corpus is relevant because as shown above, it is a common training set for all models under evaluation. Verb pairs where the number of occurrences of one alternant is zero are disregarded as well, as in these cases, a model trained on Wikipedia did not have had an opportunity to learn the transitivity of a verb. After 26 this filtering, 225 out of the original 306+1 verb pairs remain candidates for the dataset. Of these 225 verb pairs, the 50 most common ones are selected for the main dataset. I determine how common a verb pair is by the number of occurrences of its less common alternant in the Wikipedia dataset. Using these verb pairs for an evaluation dataset ensures that every model has had sufficient opportunity to learn the relevant grammatical properties of these verbs. One caveat here is that the Wikipedia datasets that were used for training are not identical, as they were created from different dumps at different points in time. 
Presumably, more common verbs would allow the language model to learn whether or not a verb is transitive with more certainty, thus improving performance on a task where the objective is to determine if a sentence should have a transitive or intransitive verb. For Transitivity Prediction, recognizing the transitivity of a verb is a necessary, but not a sufficient condition. The model also needs to recognize the number and type of dependent arguments, and understand that depending on these arguments, an intransitive verb may be ungrammatical. Since abilities concerning arguments are independent of the verb, more knowledge about the verb itself is not expected to improve them. Therefore, while more common verbs are expected to induce better performance on transitivity prediction on average, the value of a model's verb knowledge would plateau at a certain point, after which the ability to recognize arguments likely becomes the deciding factor for improvement.

4.4.2 Selection of Arguments and Adverbs

Apart from verbs, arguments and adverbials are required for the evaluation dataset. The arguments were selected to cover a wide range of attributes, from personal pronoun to intangible concept: watashi 'I', kare 'he', hito 'person', kangofu 'nurse', neko 'cat', tori 'bird', ongaku 'music', jiyuu 'freedom', isu 'chair', zairyou 'ingredients', doa 'door'. The first two arguments are definite personal pronouns. The second pair are nouns representing persons, which are no longer necessarily definite. Neko 'cat' and tori 'bird' are animate but no longer describe a person. The last five nouns represent inanimate things or concepts, with isu 'chair', zairyou 'ingredients', and doa 'door' being inanimate and tangible, while ongaku 'music' and jiyuu 'freedom' are intangible. These nouns and pronouns, combined with one of three particles (wo, ha, ga), make up an argument. Lastly, there are three adverbial phrases that are used to create the dataset: katteni 'voluntarily, of its own accord', yoku 'well, often', and tabun 'probably'. Sentences are formed by prefixing any combination of zero or one adverbial and zero to two differently marked arguments in any order to a verb.

4.4.3 Sentence Construction

Now that all components have been selected, sentences can be constructed from them. By creating all possible combinations of verbs, arguments, and adverbials that can be built in this manner, we arrive at a dataset with 374750 transitive and 374750 intransitive sentences. Sentence construction can be described using a context-free grammar:

S → C + V + .
C → (AGH) | (AGW) | (AHW) | (AH) | (AW) | (AG) | (GH) | (GW) | (HW) | A | H | W | G | {}
(AGH) → (AG)H | (AH)G | (GH)A
(AGW) → (AG)W | (AW)G | (GW)A
(AHW) → (AH)W | (AW)H | (HW)A
(AG) → AG | GA
(AH) → AH | HA
(AW) → AW | WA
(GH) → GH | HG
(GW) → GW | WG
(HW) → HW | WH
V → verb
A → adverbial
G → noun + ga
H → noun + ha
W → noun + wo

Capitals denote non-terminals, and lower-case words are stand-ins that denote one of the associated set of terminals.

This dataset will be called the base dataset, in order to distinguish it from the alternative datasets described next.
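To make the construction scheme above concrete, the following sketch enumerates all orderings of up to two differently marked arguments and at most one adverbial in front of a sentence-final verb, mirroring the grammar given above. It uses romaji stand-ins and a tiny subset of the vocabulary for readability (the actual dataset is built from the full component lists in Japanese script), so it is illustrative rather than the exact generation code.

    # Minimal sketch of the sentence construction scheme: every ordering of up
    # to two differently marked arguments and at most one adverbial, followed
    # by a sentence-final verb. The vocabulary is a tiny illustrative subset.
    from itertools import combinations, permutations, product

    nouns = ["watashi", "neko", "doa"]
    particles = ["ga", "ha", "wo"]
    adverbs = ["katteni", "yoku", "tabun"]
    verb_pair = ("akeru", "aku")  # (transitive, intransitive), shared stem 開

    sentences = []
    for n_args in range(3):  # zero, one, or two arguments
        for markers in combinations(particles, n_args):  # differently marked
            for noun_choice in product(nouns, repeat=n_args):
                args = [n + " " + p for n, p in zip(noun_choice, markers)]
                for adverbial in [None] + adverbs:  # at most one adverbial
                    parts = args + ([adverbial] if adverbial else [])
                    for order in set(permutations(parts)):  # free constituent order
                        for verb in verb_pair:
                            sentences.append(" ".join(order + (verb,)) + " .")

    print(len(sentences), sentences[:3])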
4.4.4 Alternative Datasets

Apart from the main dataset, three smaller datasets were created to further explore certain aspects that may affect Transitivity Prediction, the first of which was already briefly mentioned:

• Rare verbs: For this dataset, I utilize the 50 least common verb pairs of the filtered verb pair list, instead of the 50 most common, creating a dataset of equal size. This dataset is intended to indicate the effect of having less training data on the task. Generally, I expect to see decreased performance for the rare verbs dataset, especially for those models that were trained exclusively on the Wikipedia corpus.

• Relative clauses: This dataset introduces relative clauses to arguments. Specifically, I randomly sample 10,000 sentence pairs from the base dataset and prefix one of three generic relative clauses to one of their arguments. I exhaustively generate all sentences that can be created in this manner, thus creating six new sentence pairs for sentences with two arguments, and three new sentence pairs for sentences with one argument. This results in a dataset with 59,067 sentence pairs. The three generic relative clauses are:

kinpatsu no onna ga shiranakatta Arg 'the Arg which the blonde woman didn't know'
kinpatsu no onna wo tasuketa Arg 'the Arg which helped the blonde woman'
kinpatsu no onna ga ki wo tsuketa Arg 'the Arg the blonde woman was wary of'

The intended effect of modifying the sentences like this is increased structural complexity and the introduction of particles not related to the main verb of the sentence. The first clause introduces a subject-marking ga-particle into the sentence, the second introduces an object-marking wo-particle, and the last one introduces both. Given the additional markers, a model that relies on the co-occurrence of particles with certain types of verbs for its predictions is expected to perform worse on this dataset. For good performance, the model needs to be able to differentiate between particles that belong to a relative clause and those that do not. These specific clauses were selected for their relative neutrality as well as to introduce the different particles, while working with any kind of argument, be it animate or inanimate, tangible or intangible, without becoming nonsensical.

• Longer sentences: Any difference in predictive ability on the relative clauses dataset compared to the base dataset might not be the effect of the relative clauses themselves, but of the elongated input sequence. To isolate the effect of a longer sequence without a relative clause, I add a neutral phrase to the beginning of each sentence in the 10,000-sentence-pair sample used for the construction of the relative clauses dataset. The phrase used is: utsukushii tenki na hi, tokidoki... 'on a day with beautiful weather, sometimes...'

4.5 Prediction of Transitivity

To decide whether a model prefers the transitive or intransitive alternant of a verb in a given sequence, I compare the perplexities of the two alternative sequences to each other. This metric is useful for our application because it takes token likelihoods into account while theoretically being independent of the number of tokens. Technically speaking, perplexity is the inverse likelihood of a sequence, normalized by its length. As such, it can be thought of as a metric describing how acceptable the language model deems a sequence to be, with low perplexity values corresponding to high acceptability.
Perplexity by itself, however, has been shown to not necessarily be a good predictor of grammatical acceptability, as factors such as sequence length can have an effect on it (Wang et al., 2022). In fact, my own results confirm that sequence length has a negative correlation with perplexity for the base dataset as well. While this finding cannot be generalized, as the dataset only reflects very specific, simple Japanese constructions, it is something to be kept in mind. A sequence might have a higher perplexity value than another merely because one argument rarely occurs in the proximity of the other, but since arguments are chosen independently of the verb, this change in perplexity has nothing to do with the actual transitivity of the sentence. Thus, I compare the perplexity value of one sentence to that of a minimally different sentence, with the only difference between them being the verb. This ensures that any bias affecting the perplexity of one sentence is likely also affecting the perplexity of the other.

Given a sequence X of length n consisting of tokens x_1, x_2, ..., x_n, perplexity can be described by the following formula:

PP(X) = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{L_{\mathrm{model}}(x_i, X)}}

where L_{\mathrm{model}}(x_i, X) is the likelihood assigned by the model to token x_i given the sequence X. Thus, to compute perplexity, it is necessary to extract these likelihoods from the models we want to investigate. Due to differences in architecture and training objective, these values will have differing interpretations from model to model.

For the BERT-based masked language models, meaning BERT, cBERT, mBERT, and XLM-R, the likelihood of a token in a sequence represents the probability the model assigns to the token if it is masked:

L_{\mathrm{BERT}}(x_i, X) = P_{\mathrm{BERT}}(x_i \mid X_{\mathrm{masked}})

with X_{\mathrm{masked}} representing the sequence X with token x_i replaced by a mask token:

X_{\mathrm{masked}} = (x_1, \ldots, x_{i-1}, \mathrm{mask}, x_{i+1}, \ldots, x_n)

Given a sequence X = (x_1, x_2, x_3), we can find P_{\mathrm{BERT}}(x_2 \mid X_{\mathrm{masked}}) by masking x_2 and calculating the probability with which the model predicts x_2 for the mask token. As mentioned, perplexity calculated in this manner is usually called pseudo-perplexity.

Alternatively, the likelihood for masked language models can be computed without the use of masking, by using the outputs for the unmasked sequence directly:

L_{\mathrm{BERT}}(x_i, X) = P_{\mathrm{BERT}}(x_i \mid X)

This alternative method may not seem as intuitive for models pretrained on masked language modelling, as for such models we are often interested in predictions for masked tokens, of which we do not have any in this case. It is, however, relevant when fine-tuning the model for downstream tasks, as mask tokens are usually no longer used, meaning that during fine-tuning, the model works with unmasked sequences and their token likelihoods. Not using masks results in much lower perplexity scores because of the generally very high likelihood of each token due to the model's training objective. This is not an issue, as perplexity scores are compared only within a method, not between them.
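As a concrete illustration, pseudo-perplexity, its unmasked variant, and the minimal-pair comparison can be computed roughly as follows with a HuggingFace-style masked language model. This is a minimal sketch, not the evaluation code used in this work: the checkpoint name is only an example, and the helper names are my own.

```python
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Example checkpoint only; any Japanese BERT-style masked LM would work here.
NAME = "cl-tohoku/bert-base-japanese"
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForMaskedLM.from_pretrained(NAME).eval()

def pseudo_perplexity(sentence, masked=True):
    # PP(X): the n-th root of the product of inverse token likelihoods.
    # masked=True masks each token in turn (pseudo-perplexity);
    # masked=False scores the unmasked sequence directly.
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    neg_log_likelihood, n = 0.0, 0
    for i in range(1, len(ids) - 1):              # skip [CLS] and [SEP]
        inp = ids.clone()
        if masked:
            inp[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(inp.unsqueeze(0)).logits[0, i]
        prob = torch.softmax(logits, dim=-1)[ids[i]].item()
        neg_log_likelihood += -math.log(prob)
        n += 1
    return math.exp(neg_log_likelihood / n)

def prefers_transitive(transitive_sent, intransitive_sent, masked=True):
    # Transitivity Prediction: the lower-perplexity alternant wins.
    return (pseudo_perplexity(transitive_sent, masked)
            < pseudo_perplexity(intransitive_sent, masked))
```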
For the GPT variants being investigated, likelihoods are computed in a similar fashion, with the difference that the likelihood is the probability of the token given (only) its left-side context:

L_{\mathrm{GPT}}(x_i, X) = P_{\mathrm{GPT}}(x_i \mid x_1, \ldots, x_{i-1})

To use the same example as above, where we calculate the likelihood of the second token:

L_{\mathrm{GPT}}(x_2, X) = P_{\mathrm{GPT}}(x_2 \mid x_1)

Note that because the sequences we compare to each other differ only in their verbs, which are at the far right of the sequence, the likelihoods of tokens further left do not depend on the differing part of the sequence. Thus, the likelihood of any token left of the verb's tokens in the transitive sequence is equal to its counterpart in the intransitive sequence. This means that by comparing perplexity, we effectively compare the likelihoods of the tokens representing the verbs and their right-side context (which in this case is limited to the sentence-ending punctuation mark).

The probabilities output by the ELECTRA model have a slightly different interpretation from those of the other models. Due to not being trained to predict tokens, but rather to distinguish between tokens from the original data and tokens that have been replaced by a generator, ELECTRA outputs the probability P_{\mathrm{ELECTRA}}(x_i \mid X) of each token x_i being an imposter, given the sequence X. The complementary probability 1 - P_{\mathrm{ELECTRA}}(x_i \mid X) then represents the probability of a token belonging to the original sequence. I will use this complementary probability as the likelihood of token x_i:

L_{\mathrm{ELECTRA}}(x_i, X) = 1 - P_{\mathrm{ELECTRA}}(x_i \mid X)

Since in training, the probability of a token being an imposter was less than 0.15, average likelihoods are generally greater than 0.85, which leads to very low perplexity compared to other models. As we only compare perplexity within, not between, models, this should not be an issue, since the perplexity ratio between two alternative sentences, if calculated for the same model, still remains an indication of which of the two sequences is more likely.

The methods used to calculate perplexity introduced above take the perplexity of all tokens of a sequence into account. A different approach is to only consider the likelihoods of the tokens referring to the verb of the sentence. Formally, this approach is described by a slightly altered formula:

PP_{\mathrm{verb}}(X) = \sqrt[l]{\prod_{i=k}^{k+l-1} \frac{1}{L_{\mathrm{model}}(x_i, X)}}

where k is the index of the first token of the verb, and l is the number of tokens representing it. As the verb is the only part that differs between the intransitive and transitive sentences, it presumably has a large influence on the resulting difference in perplexity as well. Thus, a high perplexity of the verb can be interpreted as the model considering it a bad fit for the context. The idea is that by disregarding the likelihoods of tokens in the context, this method avoids some co-occurrence biases the models might have learned, resulting in less noisy results. One not very obvious drawback of this method is that it requires every token to be either part of the verb or part of the context. This essentially prevents the SentencePiece-based tokenizers of XLM-R, GPT-2, and GPTneo from using this method.
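A sketch of the causal (left-to-right) perplexity, including the verb-only restriction, is given below. Again, this is only an illustration under assumptions of my own: the checkpoint name is just an example (cf. footnote 8), and the first token is skipped because it has no left context, which simplifies the formula above.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint only; any Japanese GPT-2-style model works.
NAME = "rinna/japanese-gpt2-medium"
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(NAME).eval()

def causal_perplexity(sentence, token_range=None):
    # Perplexity with L_GPT(x_i, X) = P(x_i | x_1, ..., x_{i-1}).
    # token_range=(k, k + l) restricts scoring to the verb's tokens,
    # i.e. the PP_verb variant described above.
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]     # (seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    start, end = token_range if token_range else (1, len(ids))
    total, n = 0.0, 0
    for i in range(max(start, 1), end):
        # The logits at position i-1 predict token i from its left context.
        total += -log_probs[i - 1, ids[i]].item()
        n += 1
    return math.exp(total / n)
```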
For GPT-2 small, basing predictions on verb perplexity is possible, but unnecessary: since the context to the left of the verb is identical for both sequences that are being compared, and since generative models do not attend to right-hand context, the likelihoods of the contexts left of the verbs are the same. The verbs' likelihoods are the first likelihoods that differ between the transitive and intransitive sequence, and as such, they are the deciding factor for a prediction based on which sequence has lower perplexity. If a transitive sequence has a higher verb perplexity than its intransitive counterpart, then it will also have higher perplexity across the whole sequence, and vice versa. Just like in the case of calculating perplexity for the whole sequence described above, for masked language models, we have the option of using or forgoing masks to calculate likelihoods. The methods compatible with any given model are shown in Table 1.

Model          Sequence   Sequence-Masked   Verb-Only   Verb-Only-Masked
BERT              +              +              +              +
cBERT             +              +              +              +
mBERT             +              +              +              +
ELECTRA           +                             +
XLM-R             +              +
GPT-2             +
GPT-2 small       +                             +*
GPTneo            +

Table 1: Model-method compatibility (* possible but unnecessary, as explained above).

5 Results

In this section, the results of the experiments across the generated datasets, language models, and perplexity calculation methods are shown. I consider model accuracy under differing conditions, as well as the agreement between models, and finally the coefficients that influence model behavior, generally focusing on each model's best-performing perplexity calculation mode.

The main performance metric is the ratio of transitive predictions on obligatorily transitive sentences. The ratio of transitive predictions on ambiguous sentences, which could take either a transitive or an intransitive verb, also needs to be considered, as a model that always prefers the transitive verb would have an accuracy of 1 on the first metric, without requiring an understanding of when each type of verb is appropriate.

The overall results for model accuracy can be seen in Table 2. Cells are color-coded as follows: scores below 50% are marked in red, a yellow background signifies a score at about 65% accuracy, and a score of 80% is signified by a green background. One very noticeable feature of this table is the poor performance of mBERT on this task. Even with its most favorable mode of perplexity calculation, it barely surpasses random guessing on the base dataset. Curiously, it manages a comparatively strong 70.01% accuracy in verb-masked perplexity mode on the rare verbs dataset. This is likely due to some general bias towards transitive verbs in the set of rare verb pairs, as a jump in the ratio of transitive predictions (from 31.78% for the base set to 66.78% given the rare set) can also be observed on ambiguous sentences. Given this poor performance, mBERT will no longer be considered in further analysis.

Table 3 shows the percentage of ambiguous sentences where the model prefers the transitive variant. Color coding goes from green at 10% to yellow at 35% to red at 60%. Here, lower values are desirable, as high values may indicate that a model performs well based on a general pro-transitive bias, instead of grammatical understanding.

5.1 Rare Verbs

The rare verbs dataset was expected to negatively impact the accuracy of predictions due to models having fewer opportunities to learn the transitivity of a verb. This expectation holds true for the two BERT models, whose training dataset is constrained to an older and thus smaller Wikipedia dump, and which therefore had the least exposure to rare words. While the effect of rare verbs on predictions is modest, decreasing accuracy by around two percentage points for each model's respective best perplexity mode, this indicates that more training data could have a positive effect on grammatical understanding, allowing the models to better recognize verb types.
Model                 base    rare   longer  rel. cl.
BERT masked           70.79   68.69   74.32   63.51
BERT unmasked         69.34   65.73   65.31   59.05
BERT verb masked      66.49   59.36   65.78   56.04
BERT verb unmasked    61.51   62.66   55.20   53.90
cBERT masked          65.54   66.48   63.44   63.97
cBERT unmasked        71.42   69.74   66.96   67.93
cBERT verb masked     59.19   63.22   56.40   57.32
cBERT verb unmasked   67.12   65.67   60.54   63.23
mBERT masked          52.47   56.72   50.58   49.33
mBERT unmasked        51.07   43.42   57.29   46.80
mBERT verb masked     48.82   70.01   50.45   48.39
mBERT verb unmasked   41.11   60.37   37.85   36.77
ELECTRA               73.83   74.49   74.37   70.00
ELECTRA verb          71.80   70.46   68.50   71.33
XLM-R masked          69.71   73.34   63.50   60.50
XLM-R unmasked        71.58   61.22   78.52   65.21
GPT-2                 56.94   65.11   57.10   55.59
GPT-2 small           58.53   61.14   54.93   54.04
GPTneo                61.71   63.49   58.87   57.09

Table 2: Accuracy across datasets in percent.

Model                 base    rare   longer  rel. cl.
BERT masked           25.13   43.78   16.89   20.85
BERT unmasked         28.52   47.85   25.11   27.85
BERT verb masked      18.38   49.24   11.42   17.05
BERT verb unmasked    19.43   46.99   09.13   17.20
cBERT masked          24.10   39.59   08.68   19.79
cBERT unmasked        43.73   49.71   32.88   44.29
cBERT verb masked     28.04   47.78   13.70   27.85
cBERT verb unmasked   40.77   48.94   21.92   35.16
ELECTRA               43.87   40.44   32.88   44.60
ELECTRA verb          47.91   42.27   28.31   53.88
XLM-R masked          35.32   51.39   26.03   27.09
XLM-R unmasked        57.59   52.00   59.36   50.53
GPT-2                 42.92   44.87   42.92   45.51
GPT-2 small           17.99   38.14   12.33   15.98
GPTneo                40.81   39.29   38.81   41.70

Table 3: Percentage of transitive predictions for ambiguous sentences.

For other models using the BERT architecture, the effect is less clear. In the case of both ELECTRA and XLM-R, whether rare verbs increase or decrease accuracy compared to the base set seems to depend on the perplexity mode. In the case of XLM-R, masked mode in fact gains some accuracy in this situation, while unmasked mode loses over ten percentage points. Both modes show the same respective tendency on ambiguous sentences, indicating that at least some of the difference between common and uncommon verb pairs is motivated by a different general bias, rather than a lack of knowledge about uncommon pairs. Given that XLM-R was trained on almost two orders of magnitude more data, this is unsurprising. What is confusing, however, is that these tendencies move in opposite directions: while masked mode tends towards transitive predictions given rarer verbs, unmasked mode shifts towards intransitive predictions for the same verb pairs. This is not the only instance of XLM-R behaving in an unexpected manner, as will be discussed later. It is possible that the model attends to some features that have not been considered.

For ELECTRA, results for rare verb pairs remain relatively stable. While ELECTRA is trained on a more current and therefore bigger version of the Wikipedia dataset, the size of this training data is still similar to that of the two BERT models. As such, its consistent performance on rare verbs suggests a solid understanding of verb types and therefore supports the claims by Clark et al. (2020) of drastically improved training efficiency due to its pretraining objective.

An interesting phenomenon can be observed in the three GPT-based models. For GPT-2, and to a lesser extent GPT-2 small and GPTneo, less frequent verbs seem to have a positive effect on transitive predictions on obligatorily transitive sentences, although only for GPT-2 small is this accompanied by a comparable increase in transitive predictions on ambiguous sentences.
For the other two models, this behavior suggests that their knowledge of the verbs' transitivity is not the limiting factor for performance on this task. This is especially pronounced for GPT-2, which gained around 8 percentage points of accuracy, while transitive predictions on ambiguous sentences increased by only roughly 2 percentage points. While much weaker, the same effect is seen for GPTneo. A possible explanation is that, given the fairly large training sets of these models, they have no issues learning transitivities even for rare verbs, while for common verbs frequency and co-occurrence biases start becoming more relevant to the models. The generally observable trend is that rare verbs do not have a very strong impact on accuracy, but do (at times sharply) increase transitive predictions on ambiguous sentences, especially for models that are otherwise very good at predicting possible intransitive sentences as intransitive.

5.2 Relative Clauses

The introduction of relative clauses affects phrases in multiple ways: it increases sequence length and makes sentence structures more complicated, by adding additional arguments not dependent on the main verb of the sentence and possibly confusing particles. Comparing effects on the relative clauses dataset to those on the longer dataset should in theory allow us to isolate the impact of increased complexity from effects that are merely due to sequence length, as longer sentences are designed to increase the latter without significantly increasing the former.

Results draw an interesting picture: one might have expected to observe a tendency to more frequently assign transitive verbs to longer sentences, as transitive sentences are, on average, longer, given the additional argument they can accommodate. However, this generally does not hold true for the models under inspection.

Apart from BERT and XLM-R, most models, when using their best perplexity modes, seem to perform somewhat similarly on the longer and relative clauses datasets. This implies that these models do in fact recognize these structures relatively reliably, as a negative effect on performance is more likely a result of increased length than of complexity. For XLM-R, here we find another example of strange behavior: its unmasked-mode accuracy on the longer dataset looks quite impressive, but is somewhat put into perspective when considering that it is strongly biased towards transitive predictions, as demonstrated by its behavior on ambiguous sentences. In masked mode, it shows much less bias, but also loses six to nine percentage points of accuracy on the two datasets with longer sequences. BERT loses a similar amount of accuracy when given relative clause sentences. While it seems to handle longer sequences quite well, the increased complexity introduced with relative clauses seems to have a particularly strong effect on this model. Interestingly, the same observation cannot be made with cBERT, which in unmasked mode loses some accuracy for longer sentences in general, but deals surprisingly well with the increased complexity of relative clauses. ELECTRA, while showing some loss of accuracy in its base perplexity mode, likely as a result of more complex sentence structure, retains its position as the most accurate model. Lastly, none of the three GPT-based models display a strong behavioral difference between longer and relative clause sentences, and only GPT-2 small seems to be affected by sequence length.
However, given their poor baseline accuracy scores despite their (apart from GPT-2 small) relatively strong bias towards transitive verbs, their performance does not measure up to that of their BERT-based counterparts.

5.3 Causal models

As mentioned, the generative models in this comparison seem to under-perform. There is some evidence to suggest that the SentencePiece tokenizer used by GPT-2 and GPTneo is partly to blame for this shortcoming: GPT-2 small, while having less than a third of the parameters that GPT-2 has, still outperforms it on almost every metric. Given that the defining differences between the two models are their tokenizers and size, and given that size appears to have a significant positive effect on performance, as seen in the comparison between GPT-2 and GPTneo, it stands to reason that a model of the size of GPTneo using GPT-2 small's tokenizer may very well reach competitive performance.

5.4 Bidirectional Models

The models based on the BERT architecture display a wide variety of behaviors. While BERT performs comparatively well on most datasets but seems to get confused by relative clauses, cBERT handles those much better than expected, instead losing some performance for longer sequences in general. ELECTRA demonstrates consistently high accuracy, implying a good understanding compared to other models, but also has a stronger bias towards transitive verbs, especially in verb perplexity calculation mode. A similar but even more pronounced behavior can be seen in XLM-R in unmasked mode: here, ambiguous sentences consistently have a higher than 50% chance of being predicted as transitive. This should be considered a detriment, since in the Wikipedia dataset, the ratio of transitive to intransitive verbs from the set of common verb pairs is roughly 1. All in all, while all of the bidirectional models in this test (apart from mBERT) outperform the three unidirectional models (at least in their best modes), they display a surprising amount of variance in their strengths and weaknesses.

5.5 Perplexity Modes

The method of calculating perplexity (using masks versus not using masks, considering only the verb's likelihood versus considering that of the whole sequence) is an influential factor for the quality of predictions, which needs to be evaluated on a case-by-case basis for each model. Given that Transitivity Prediction revolves around deciding on the more appropriate verb, a reasonable assumption is that most of the acceptability difference in each sequence can be found in this verb. The likelihoods of other tokens may introduce noise through interactions with other tokens that are unrelated to transitivity. Avoiding this noise is the motivating idea behind the verb perplexity mode. However, the only model for which this mode produces accuracies comparable to those of its other modes is ELECTRA. This implies that in the case of ELECTRA, most of the difference in perplexity between the sentences does in fact stem from the verb. For obligatorily transitive sentences, this seems to hold true most of the time, as in such situations, an intransitive verb is generally correctly considered less likely. For ambiguous sentences, however, tokens other than the verb seem to lose some likelihood if the verb is transitive, resulting in higher sequence perplexity compared to verb perplexity. A different effect can be observed with BERT: regardless of masking, the verb itself displays a bias towards the intransitive prediction when compared to the rest of the sentence.
For the most part, the same holds true for cBERT.

The approach of calculating token likelihoods without masks was proposed by Salazar et al. (2019) as a trade-off between computational demand and accuracy. It also resembles the situation for most downstream tasks that do not implement a mask token. The unmasked mode was therefore not expected to perform as well as the masked mode. This expectation is confirmed by BERT, but not by cBERT, resulting in BERT's best-performing perplexity mode being the masked mode, while for cBERT, it is the unmasked mode. I argue that, in spite of lower accuracy in some situations, ELECTRA's base mode and XLM-R's masked mode should be considered the best modes of their respective models, since their alternative modes display an extreme transitive bias, which can be considered problematic as described above.

5.6 Agreements

To understand how these models differ in their language understanding, it may be helpful to see which models produce similar predictions. Attending to similar features of a sentence should logically produce similar results. Tables showing agreements between all the models for each dataset can be found in Appendix A.

A general trend that emerges is that, unsurprisingly, different modes of the same model tend to agree strongly with each other. For example, BERT's average within-model agreement across all modes for the base dataset is above 80%, which is higher than its agreement with any other model. High agreement within models can be observed for all tested models. Another observation is that the two big GPT-based models agree the most with each other. While their absolute agreement value is not as high as that between other models, their agreement with others is even lower. Interestingly, GPT-2 small does not seem to behave very similarly to these two models, instead agreeing most strongly with BERT's "verb" modes. On one hand, this implies that GPT-2 and GPTneo behave differently due to their size, their tokenizer, or a combination of both. Agreement with BERT verb, on the other hand, is not very surprising when one considers how similar the pretraining tasks of each model become in this situation: predicting a masked verb token at the end of a sentence and predicting the next token in a sentence that only requires a verb are intuitively closely related tasks.

Figures 1 and 2 are an attempt to visualize the different families of models that show similar behavior. To create these visualizations, first, a fully connected graph with models as nodes and their agreements as edge weights was created. Then, starting with the lowest edge weight, edges were removed until the next edge removal would disconnect the graph (a sketch of this pruning step is given below). Node positions were then adjusted according to the ForceAtlas2 algorithm (Jacomy et al., 2014).

Figure 1: Agreement graph of base dataset
Figure 2: Agreement graph of rare verbs dataset

Figure 1, which represents agreements on the base dataset, shows that the models effectively create three clusters: the main cluster contains the BERT-based monolingual models, as well as GPT-2 small, with modes of the same model usually being closer together. Apart from this cluster, there is a cluster made up of the two bigger GPT-type models, and lastly, the cluster of XLM-R. GPT-2 and GPTneo forming a distinct group is expected, as these models already demonstrated some fundamental differences.
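The edge-pruning step referenced above can be sketched as follows. This is a minimal sketch assuming the networkx library, not the actual code used in this work; the ForceAtlas2 layout step is omitted.

```python
import networkx as nx

def prune_agreement_graph(agreements):
    # agreements: dict mapping (model_a, model_b) -> pairwise agreement score.
    G = nx.Graph()
    for (a, b), w in agreements.items():
        G.add_edge(a, b, weight=w)
    # Remove edges from lowest to highest agreement until the next removal
    # would disconnect the graph; node positions would then be computed
    # separately with ForceAtlas2 (not shown).
    for a, b, data in sorted(G.edges(data=True), key=lambda e: e[2]["weight"]):
        G.remove_edge(a, b)
        if not nx.is_connected(G):
            G.add_edge(a, b, **data)   # put the bridging edge back...
            break                      # ...and stop pruning
    return G
```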
XLM-R's distinct behavior has two possible explanations: its SentencePiece-based tokenizer is the chief suspect, but its multilingual training may also be causing it to process language in a different way from its monolingual counterparts. The clustering for the longer and relative clauses datasets draws a similar picture for the GPT cluster, while the XLM-R cluster moves towards the center cluster. However, different behavior can be observed for the rare verbs dataset (Figure 2). Here, of the two clusters that form, one consists of cBERT and XLM-R, while the other includes both the GPT-based and the remaining BERT-based models. GPT models behaving differently for rare verbs had already been expected, given the results we saw on their performance. The relatively high agreement with BERT and ELECTRA suggests that models with GPT architecture consider similar features for verbs where not enough data is present to base decisions on co-occurrence. With increasing familiarity with the verb and its usual context, their agreement starts to drift apart. The high level of agreement between cBERT and XLM-R for this dataset is hard to explain: as these models differ in tokenization, training set, and, to a lesser extent, even pretraining task, cBERT agreeing with XLM-R, as opposed to, for example, the on paper much more similar BERT, seems almost arbitrary.

5.7 Logistic regression

To find out what sentence features inform the decisions of a model, I fit a regression model to its predictions. For this, I use the R library glmnet (Friedman et al., 2010), which implements elastic-net regression, with ten-fold cross-validation. The considered independent variables are the identity of the verb pair, the constituent order (meaning the order in which arguments and adverbials are prefixed to the verb), the identity and marker of the first and second argument, the adverbial used, and, in the case of the relative clauses dataset, the position and type of the relative clause added. The values of the fitted coefficients tell us how strongly each of these features correlates with a transitive or intransitive prediction.

Across all models and modes, there are 181 coefficients to consider, plus 7 more for relative clauses. Ideally, those coefficients that indicate an argument with a wo particle should have high positive values, indicating a high correlation with transitive predictions. The same goes for coefficients indicating the presence of multiple arguments. Factors such as verb pair identity, adverbial identity, and type and position of relative clauses should not play too big of a role. Tables 4, 5, 6, and 7 show the maximum and minimum coefficients for BERT masked and GPT-2.11

11 The four coefficient tables resulting from this regression are available at these links: https://drive.google.com/file/d/15IaFkOiyo0Q85Vk2CKSCgs8UwP36YrBo/view?usp=sharing, https://drive.google.com/file/d/1XeL1VdrN7ADr4PGjQSehYaEV-6jMvPFT/view?usp=sharing, https://drive.google.com/file/d/1pn4M8gZj5qsZuo1IiTgJJd6nQBhZOo9X/view?usp=sharing, https://drive.google.com/file/d/1t_m8AKtvtnb5_Qs-G7mhVAv3J1Xh2G3C/view?usp=sharing
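The same kind of analysis can be approximated in Python. The following is a rough equivalent of the glmnet set-up, using scikit-learn's elastic-net logistic regression instead of R; the column names are hypothetical, and the relative-clause features and interaction terms are omitted.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV

def fit_feature_regression(df):
    # df: one row per sentence with categorical feature columns (names here
    # are hypothetical) and a 0/1 column "predicted_transitive" holding the
    # language model's prediction for that sentence.
    features = ["verb_pair_id", "constituent_order", "arg1_id", "arg1_marker",
                "arg2_id", "arg2_marker", "adverbial"]
    X = pd.get_dummies(df[features].astype(str))      # one-hot encoding
    y = df["predicted_transitive"]
    reg = LogisticRegressionCV(penalty="elasticnet", solver="saga",
                               l1_ratios=[0.5], cv=10, max_iter=5000).fit(X, y)
    # Coefficients analogous to Tables 4-7: sign and magnitude indicate how
    # strongly a feature pushes towards a transitive prediction.
    return pd.Series(reg.coef_[0], index=X.columns).sort_values()
```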
marker1w             2.591735947
arg1 id1.marker1g    2.278545687
marker2w             2.246562109
arg1 id2.marker1g    1.767797672
verb pair id286      1.507920641
COwgav               1.371698718
verb pair id265      1.361102779
arg2 id1.marker2g    1.270775074
arg1 id4.marker1g    1.265723244
arg2 id2.marker2g    1.216086325

Table 4: Max. coefficients for BERT on base

verb pair id141     -4.459019510
arg1 id10.marker1g  -2.562459711
arg1 id8.marker1g   -2.534395423
verb pair id153     -2.512351104
COagv               -2.425492253
arg1 id7.marker1g   -1.994570176
COgv                -1.984526274
verb pair id282     -1.712175691
verb pair id296     -1.524854822
arg1 id11.marker1w  -1.514078384

Table 5: Min. coefficients for BERT on base

X.Intercept.         2.715971e+00
COwav                2.712243e+00
COgwav               2.632784e+00
COhwav               2.622277e+00
arg1 id3.marker1w    1.773101e+00
marker2w             1.693284e+00
verb pair id28       1.665896e+00
arg1 id2.marker1w    1.664022e+00
verb pair id25       1.410473e+00
verb pair id270      1.289851e+00

Table 6: Max. coefficients for GPT-2 on base

verb pair id271     -1.024969e+01
verb pair id153     -9.289100e+00
verb pair id129     -6.962328e+00
verb pair id255     -6.494575e+00
verb pair id141     -5.261678e+00
verb pair id281     -4.998784e+00
verb pair id77      -4.334255e+00
verb pair id140     -4.236711e+00
verb pair id217     -4.175004e+00
verb pair id150     -4.169882e+00

Table 7: Min. coefficients for GPT-2 on base

Particles indicating a direct object, and therefore a sentence requiring a transitive verb, have a strong correlation with transitive predictions across most models and modes. We can observe that such a particle has a stronger impact on the prediction if it is further back in the sentence and therefore closer to the verb. Another feature that shows a strong positive correlation with transitive predictions is a definite personal pronoun followed by the subject marker ga. Almost all tested models seem to assume a subject high in agency, such as a definite person, to be transferring an action to something or someone else, making a transitive verb more likely.

The identity of verb pairs often has a strong negative correlation with transitive predictions for both GPT-2 and GPTneo. In fact, for both of these models, the 20 coefficients most strongly related to intransitive predictions on the base dataset are all verb pair identities. This effect seems to be less severe on the rare verbs dataset, lending credibility to the hypothesis that these models decide based on frequency biases for common verbs. Strong verb bias, as observed here, may also partially explain these models' underwhelming performance. While the two big GPT-based models most strongly exhibit this property, they are by no means the only models with a strong intransitive bias for certain verb pairs. All models seem to behave in this manner to some extent.
Interestingly, verb perplexity modes seem to consistently rely more on verb identity for their predictions than their whole-sequence-considering counterparts. Furthermore, well-performing modes such as BERT masked, cBERT unmasked, and ELECTRA base seem to have few verb pairs with high positive coefficients, implying that their transitive predictions are usually not due to verb bias.

Relative clauses seem to be recognized surprisingly well by most models. A relative clause containing the object-marking particle wo is not correlated with more transitive predictions. In fact, the opposite seems to be true: relative clauses of any kind have, on average, coefficients associated with intransitive predictions. However, the effect of a relative clause is also highly dependent on its position: relative clauses modifying the argument closer to the verb usually cause a moderate intransitive tendency, while those modifying an argument further left cause a slight transitive tendency on average. This behavior is consistent across models.

Adverbial phrases have no strong correlation with predictions. However, GPT-2 and GPTneo seem to consider certain constituent orders, which often include adverbials, as indicators of transitive sentences. Their emphasis on specific constituent orders indicates that these models are looking for sentences following standard sentence order, which for Japanese is SOV. The constructed datasets do not exclusively use this order, giving another possible explanation for the poor performance of these models.

6 Discussion

In this chapter, I will discuss the findings and limitations of this work.

The goal of this research was to determine the extent to which current state-of-the-art pretrained language models are able to learn and apply grammatical rules, as exemplified by the task of Transitivity Prediction. Results show large differences in performance between the best and worst performing models. The worst performing models, apart from mBERT, were the generative GPT models. There are multiple possible explanations for their results: first and foremost, the SentencePiece-based tokenizer used by both GPTneo and GPT-2 creates faulty tokenizations, with tokens that span parts of multiple words, making it difficult for the model to assign a specific grammatical role to a token in the way it normally would. This might also make learning grammatical structures much more difficult. Evidence for this can be seen in the performance of GPT-2 small compared to GPT-2. Despite its smaller size, the former outperforms the latter, likely because it does not have to contend with a faulty tokenizer.

However, when compared to the bidirectional models being analyzed, which all effectively have the same size as GPT-2 small, its performance is still the weakest by a large margin. The explanation for GPT's under-performance is therefore likely to be found in its architecture. Although the task was specifically chosen to create as much of an even playing field as possible between bidirectional and unidirectional models, some differences are hard to compensate for. While the deciding factor, the verb, is at the end of the sentence, giving both architectures access to the complete context, bidirectional models also have the information that the sequence does in fact end after the verb. This information is not available to unidirectional models, making it theoretically possible for them to try to salvage the grammaticality of a sequence by adding more words.
Furthermore, analysis of the regression model coefficients indicates that unidirectional models have much stronger biases due to verb pair identity, a property they share with the verb perplexity modes of the bidirectional models. This is somewhat plausible, considering that in both cases, perplexity is based exclusively on the verb. In fact, the predictions of some of the verb modes do bear a resemblance to those of GPT-2 small in both overall accuracy and labels for individual sentences.

For the bidirectional models, multiple methods of calculating perplexity were explored. Masked models allow for perplexity calculation with and without masking. While not masking tokens is the approach that more closely resembles the reality of most fine-tuning tasks, it is rarely employed when it comes to perplexity calculation. This might be an oversight, as cBERT's results suggest that some models achieve better performance with this perplexity mode. Further research may be required to gain a better understanding of the different strengths and weaknesses of masked and unmasked modes of perplexity calculation.

The second variation in perplexity calculation is to only consider the token(s) under inspection, which may be appropriate depending on the task, and, as mentioned, in some ways resembles the perplexity of unidirectional models. However, this mode ultimately did not reach the performance level of its alternative. A possible reason for this is the presence of biases for or against certain verbs. Other tokens, such as particles, may also have a moderating effect in such cases.

While bidirectional models outperformed unidirectional ones, none of the tested models achieved exceptional results. There are several possible explanations:

• The models do not have the required grammatical understanding.
• The methods used are unable to extract the models' grammatical understanding.
• Flaws in the test set-up introduce noise.

While some results, such as BERT seemingly getting confused by relative clauses, point towards a lack of grammatical understanding, others, like GPT-2's preference for certain sentence structures, suggest that understanding may be present within the model, but may not always be expressed in its behavior. Lastly, within the dataset that is used, most sentences are semantically nonsensical.