Generating random knowledge graphs from rules

Abstract

A knowledge graph is a datastructure that is capable of storing knowledge. Besides that, there are several methods that use knowledge graphs to derive more information. These methods need to be validated with example knowledge graphs. However, real data might not be available or not contain desired properties. Thus, there are use cases that benefit from the generation of synthetic knowledge graphs. To define a synthetic knowledge graph, there is the need of a characterization that expresses how the synthetic data should look. In this thesis, I use Horn clauses for this characterization because of their good balance of expressiveness and complexity, their use in the field of rule mining, and their base role in the logical language Datalog. As clauses are usually not represented perfectly in real data, the goal of this thesis is to generate a knowledge graph that does not perfectly fulfil given Horn clauses, but in a desired degree of fulfillment. During the thesis, I developed and implemented two modifiable algorithms to generate knowledge graphs. On the one hand, I adapted the general hill climbing technique to generate knowledge graphs. On the other hand, I implemented a greedy algorithm which orders a given set of Horn clauses using logical subsumptions between their bodies, and then add edges to fulfil one Horn clause after the other, in the computed order. Both algorithms aim to fulfil the goal of this thesis by generating synthetic knowledge graphs according to given Horn clauses, each with a degree of fulfilment. The degree of fulfilment of any Horn clauses is characterized by body support, the number of times the premise of the Horn clauses is fulfilled, and support, the number of times the premise and conclusion of the Horn clauses are fulfilled. Additionally, there is the confidence which is the fraction of support and body support, i.e., the percentage of cases the Horn clause is fulfilled. All code is published such that anyone can try it. During the evaluation, random sets of Horn clauses were produced and the implementations generated corresponding knowledge graphs. Generated knowledge graphs were compared by considering the difference between the expected and the actual degree of fulfilment for each Horn clause. The result is that generation variant hill climbing with the initial state set to the result of the greedy algorithm with rule order based on subsumption yields the best results. Also, the difficulty of generating a good knowledge graph increases along with the overlapping degree of the input set of Horn clauses. Note that the overlapping degree reflects how many relation names occur in how many Horn clauses of the set. Lastly, the state-of-the-art mining tool AMIE found many Horn clauses in the generated graphs which were not intended by the set given to the generator.

Description

Keywords

Citation

Endorsement

Review

Supplemented By

Referenced By