A virtualization approach to scalable enterprise content management

Wagner, Frank Oliver

Files

diss_20111018.pdf (1.38 MB)

Date

2011

Authors

Wagner, Frank Oliver

Abstract

In this dissertation we present an approach to implementing a scalable content management system that is based on virtualizing the repository. It allows the repository to be scaled out dynamically across multiple machines in order to adapt the system to the current load. This is an important precondition for offering a content management system as a cloud service to customers of different and changing sizes. As an example we look at e-mail archiving.

In personal and especially in business life, one is confronted daily with a huge amount of information, of which e-mails have become a dominant part. Appropriate management of this information is very important: it has to support the typical use cases, cope with huge and changing volumes of information, and be cost-efficient. Moreover, in a globalized world the information has to be accessible from anywhere in the world at any time.

In this dissertation we present an approach to managing this information that allows the system to grow and shrink over a wide range. To this end, we first introduce a repository virtualization layer that decouples the applications from direct use of the repository. For each request it determines the responsible repository and forwards the request to it. New machines can be added to the system, or superfluous ones removed, transparently to the applications and their users.

First we describe the data an e-mail archive has to deal with. These are first and foremost the e-mails themselves, but there is additional metadata, too. To implement the data model we use a combination of a search engine and a relational database. The focus lies on the search engine, as it is better suited to semi-structured and unstructured information. Given the typically fuzzy information need of the user, the search engine provides more relevant results than a rather strict and precise database. When a new e-mail is ingested into the system, its full text is extracted and added to a full-text index. A cluster of search engines makes the full-text indexes accessible to the users. The database is mostly used for consistency information and access control information. Normalization of the information in the search engine is also investigated.
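The division of labour between the full-text index and the database can be illustrated with a minimal sketch. It is not the dissertation's implementation: a plain in-memory inverted index stands in for the search engine (so matching is strict rather than fuzzy or ranked), SQLite stands in for the relational database, and the names `Archive`, `ingest`, `search`, and the `access` table are hypothetical.

```python
import hashlib
import re
import sqlite3
from collections import defaultdict


class Archive:
    """Toy e-mail archive: inverted index for text, database for access control."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> ids of documents containing it
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE access (doc_id TEXT, reader TEXT, "
            "PRIMARY KEY (doc_id, reader))"
        )

    def ingest(self, raw, readers):
        # Assumption: the document id is a SHA-1 hash over the content.
        doc_id = hashlib.sha1(raw.encode("utf-8")).hexdigest()
        # Extract the full text and add every term to the index.
        for term in re.findall(r"\w+", raw.lower()):
            self.index[term].add(doc_id)
        # Record who may see the document in the database.
        self.db.executemany(
            "INSERT OR IGNORE INTO access VALUES (?, ?)",
            [(doc_id, reader) for reader in readers],
        )
        return doc_id

    def search(self, query, reader):
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        # Documents containing all query terms.
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        # Filter by the access control information held in the database.
        rows = self.db.execute(
            "SELECT doc_id FROM access WHERE reader = ?", (reader,)
        )
        return hits & {row[0] for row in rows}
```

In this sketch the index answers the content question and the database answers the permission question, so a search result is the intersection of the two.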

The repository virtualization layer allows the data to be distributed over multiple machines. The starting point for this distribution is the partitioning schema. To achieve good performance and scalability, it is important to keep the communication between the machines to a minimum. In particular, very frequent operations and time-critical interactive operations should be executed locally on one machine, and join operations should be avoided. To distribute the data to the machines we use a distributed hash table from the peer-to-peer system OpenChord. Using consistent hashing, the identifiers of the documents, which we compute as a hash over their content, are uniquely mapped to the currently responsible machine.
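The mapping from content hash to responsible machine can be sketched as a small consistent-hashing ring. This is an illustration under stated assumptions, not OpenChord's API: the class `Ring` and the functions `doc_id` and `node_id` are hypothetical, SHA-1 is assumed as the hash function, and machines are placed on the ring by hashing their names.

```python
import bisect
import hashlib


def doc_id(content):
    """Document identifier: SHA-1 over the content, as an integer."""
    return int.from_bytes(hashlib.sha1(content).digest(), "big")


def node_id(name):
    """Ring position of a machine (assumption: hash of its name)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, names=()):
        self._ids = []    # sorted ring positions
        self._names = {}  # ring position -> machine name
        for name in names:
            self.join(name)

    def join(self, name):
        pos = node_id(name)
        if pos not in self._names:
            bisect.insort(self._ids, pos)
            self._names[pos] = name

    def leave(self, name):
        pos = node_id(name)
        if pos in self._names:
            self._ids.remove(pos)
            del self._names[pos]

    def responsible(self, key):
        """First machine at or after `key` on the ring, wrapping around."""
        if not self._ids:
            raise LookupError("ring is empty")
        i = bisect.bisect_left(self._ids, key)
        return self._names[self._ids[i % len(self._ids)]]
```

The property that matters for scale-out is that a joining machine takes over keys only from its successor on the ring; every other document keeps its current machine, so only a fraction of the data has to move.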

To store the original documents we evaluated different approaches. The simplest approach uses a regular file system. It has the lowest overhead, but also provides the fewest additional features. For example, to rearrange the data one has to implement a separate mechanism, which typically transmits the data over the network. Shared file systems such as network file systems and cluster file systems have an advantage here: with an appropriate organization of the file system it is relatively easy and fast to assign even large fractions of the documents to another machine. When the documents are stored in a content management system, further functionality provided by these systems, such as replication and hierarchical storage management, can be exploited. We also look at peer-to-peer systems, which provide a wide spectrum of different functionality.

Besides forwarding the operations on individual documents to the responsible machine, the repository virtualization layer also has to rearrange the data when a machine joins or leaves the system, or in order to balance the load. While data is being moved from one machine to another, this fraction of the data is not available. To keep this time short, the movement is done in two phases: in the first phase, only the data that is absolutely necessary for the new node to start working is transferred. In the second phase, in which the larger fraction of the data is moved, no locking is necessary. This is implemented as an extension to OpenChord that deals with some additional operations and the movement of the large data sets.
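The two-phase movement can be sketched as follows. The sketch makes several assumptions that the abstract does not state: that the "absolutely necessary" data are small per-document records, that the large originals can still be read from the old machine until they have been copied, and that a simple lock per node is enough. The names `Node`, `phase_one`, and `phase_two` are hypothetical and do not come from the OpenChord extension.

```python
import threading


class Node:
    def __init__(self, name):
        self.name = name
        self.meta = {}      # doc id -> small record needed to serve requests
        self.blobs = {}     # doc id -> large original document
        self.fallback = {}  # doc id -> node that still holds the blob
        self.lock = threading.Lock()

    def read(self, doc_id):
        with self.lock:
            if doc_id not in self.meta:
                raise KeyError(doc_id)
            if doc_id in self.blobs:
                return self.blobs[doc_id]
            source = self.fallback[doc_id]
        # Blob not copied yet: fetch it from the previous owner.
        return source.blobs[doc_id]


def phase_one(source, target, doc_ids):
    """Short and locked: hand over only the small records."""
    with source.lock, target.lock:
        for d in doc_ids:
            target.meta[d] = source.meta.pop(d)
            target.fallback[d] = source


def phase_two(source, target, doc_ids):
    """Long and unlocked: copy the large documents one at a time."""
    for d in doc_ids:
        target.blobs[d] = source.blobs[d]
        del target.fallback[d]
        del source.blobs[d]
```

After `phase_one` the new node already answers requests for the moved documents, so the unavailable period covers only the transfer of the small records; `phase_two` can then run in the background.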

This thesis proposes an approach to implementing a scalable content management system based on the virtualization of the repository. This makes it possible to scale the repository out dynamically across multiple machines to adapt to the current load. This is an important prerequisite for offering a content management system as a service in a cloud to customers of different sizes. E-mail archiving serves as the example application.

Mastering the large volumes of data is a demanding task that requires good planning. In a very dynamic economic environment, however, the requirements can also change again quickly. If the data is then trapped in a system that is no longer adequate, it can be migrated to another system only with very great effort. A flexible and dynamically adaptable system with good scalability is therefore preferable.

This thesis proposes an approach that supports growing and shrinking the system over a wide range. To this end, a virtualization layer is first introduced that separates the applications from direct use of the repository. This makes it possible to forward requests to the machine concerned in each case. Further machines can thus be added, or superfluous ones removed again, transparently to the user.

First, the data that arises in an e-mail archive is described and a data model is set up. The implementation of this data model relies on a combination of a database and a search engine. The emphasis here is on the search engine, since it is better suited to the typically not very precise searches on weakly structured data. When new documents are inserted, the text they contain is extracted and added to a full-text index. A cluster of search engines then provides the user with search functionality over the full-text indexes created. The database is used essentially for internal information to ensure consistency and for access control.

Within the virtualization layer, the data can be distributed over multiple machines. The basis for this is a suitable partitioning schema. An important condition for the performance and scalability of the system is minimal communication between the machines. Emphasis is therefore placed on executing frequently performed as well as time-critical, interactive operations locally on one machine wherever possible. In particular, join operations between large data sets on different machines are to be avoided. To distribute the data to the machines, a distributed hash table from the field of peer-to-peer computing is used. It takes the key of a document, computed as a checksum over its content, and at any time assigns it uniquely to one machine.

Various alternatives are examined for storing the raw archive data. Storage in a file system has the lowest overhead, but it also offers the fewest additional functions. In particular, a separate mechanism must be found for (re)distributing the data, which then typically copies the data over the network. Distributed file systems such as network file systems and cluster file systems offer an advantage here. With an appropriate organization of the files, even large volumes of data can be assigned to another machine relatively easily. When the data is stored in a content management system, functions available there such as replication and hierarchical storage management can be used. P2P systems, which by themselves offer a very wide range of different functions, are also considered.

Finally, the implementation of the virtualization layer is discussed. Besides forwarding operations on individual documents (CRUD operations), an essential function of this layer is reorganizing the data when machines are added and removed, as well as ensuring an even distribution of the load. The implementation of the virtualization layer builds on the P2P system OpenChord, which uses consistent hashing over the content of the documents to distribute them across the available machines.

URI

http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-68266
http://elib.uni-stuttgart.de/handle/11682/2780
http://dx.doi.org/10.18419/opus-2763

Collections

05 Fakultät Informatik, Elektrotechnik und Informationstechnik
