Please use this identifier to cite or link to this resource: http://dx.doi.org/10.18419/opus-9233
Full metadata record
DC Field | Value | Language
dc.contributor.author | Jadhav, Priyanka | -
dc.date.accessioned | 2017-09-26T15:49:19Z | -
dc.date.available | 2017-09-26T15:49:19Z | -
dc.date.issued | 2017 | de
dc.identifier.other | 1002616263 | -
dc.identifier.uri | http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-92502 | de
dc.identifier.uri | http://elib.uni-stuttgart.de/handle/11682/9250 | -
dc.identifier.uri | http://dx.doi.org/10.18419/opus-9233 | -
dc.description.abstract | Multiple solutions have been developed to collect provenance in Data-Intensive Scalable Computing (DISC) systems such as Apache Spark and Apache Hadoop; existing solutions include RAMP, Newt, Lipstick, and Titian. Although these solutions support debugging of dataflow programs, they introduce a space overhead of 30-50% of the input data size during provenance collection. In a production environment, this overhead is too high to track provenance permanently and to store all of the provenance information. Consequently, solutions exist that reduce the amount of provenance data after it has been collected, among them Prox, Propolis, and distillations. However, they do not address the space overhead incurred during the execution of a dataflow program, and existing provenance reduction techniques do not optimize the reduction for particular use cases or applications of provenance. The goal of this thesis is to find and evaluate application-dependent provenance data reduction techniques that can be applied during the execution of dataflow programs. To this end, we survey multiple applications and use cases of provenance, such as data exploration, monitoring, and data quality, and analyze how provenance is used in each of them. Furthermore, we introduce nine data reduction techniques that can be applied to provenance in the context of different use cases. We formally describe and evaluate four of the nine techniques (sampling, histograms, clustering, and equivalence classes) on top of Apache Spark. Since no benchmark is available for testing different provenance solutions, we define six scenarios on two different datasets to evaluate them, taking into account the application of provenance in each scenario. We apply these techniques to obtain reduced provenance data and introduce three metrics to compare the reduced provenance against the full provenance. We perform a quantitative analysis to compare the different techniques based on these metrics, followed by a qualitative analysis that examines the effectiveness of the different reduction techniques in the context of a particular use case. | en
dc.language.iso | en | de
dc.rights | info:eu-repo/semantics/openAccess | de
dc.subject.ddc | 004 | de
dc.title | Workload based provenance capture reduction | en
dc.type | masterThesis | de
ubs.fakultaet | Informatik, Elektrotechnik und Informationstechnik | de
ubs.institut | Institut für Parallele und Verteilte Systeme | de
ubs.publikation.seiten | 119 | de
ubs.publikation.typ | Abschlussarbeit (Master) | de
Appears in collections: 05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Files in this item:
File | Description | Size | Format
MasterThesis_PriyankaJadhav.pdf | | 16.44 MB | Adobe PDF


All resources in this repository are protected by copyright.