Improve content extraction in web pages for browser reader modes

Berg, Jan

Files

ba_thesis_janberg_final.pdf (1.3 MB)

Date

2020

Authors

Berg, Jan

Abstract

Web content extraction is the process of extracting specific information from websites with the help of an algorithm. It is used in a variety of applications. Search engines use it to find the relevant information on a website in order to index it. Browser reader modes improve the user experience by showing only the main content of a website and removing noise such as advertisements and navigational elements. The problem with main content extraction is that there is no perfect solution: algorithms try to guess the important content of a website and do not always succeed. The most widely used main content extraction algorithms today analyze the underlying HTML structure of the website using hand-tuned heuristics such as word count and the HTML tags used. They do not consider visual aspects such as the position and size of elements. In this work we try to improve the accuracy of the main content extraction algorithms currently in use with the help of visual features such as the position and size of elements. To evaluate the results, we implemented two versions of a main content extraction algorithm as a plugin for the Chromium web browser. The first version uses only heuristics based on features that can be read directly from the HTML source file. The second version additionally takes the styling of the website into account, which requires parsing the HTML and CSS files of the website. Based on our measurements, the visual-based algorithm achieved a higher accuracy than the HTML-only algorithm (80.1% instead of 73.2%).
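The difference between the two versions described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the `Block` type, the tag weights, the scoring formulas, and the function names are all hypothetical, chosen only to show how adding position and size to word-count and tag heuristics can change which element is selected as main content.

```python
from dataclasses import dataclass

# Hypothetical tag weights for illustration; not taken from the thesis.
TAG_WEIGHTS = {"article": 25, "main": 25, "p": 5, "div": 0,
               "nav": -25, "aside": -25, "footer": -25}


@dataclass
class Block:
    tag: str       # HTML tag name of the candidate element
    text: str      # visible text of the element
    x: float       # left edge in pixels (rendered layout)
    y: float       # top edge in pixels
    width: float   # rendered width in pixels
    height: float  # rendered height in pixels


def structural_score(block: Block) -> float:
    """Score from HTML-only features: word count (capped) and tag name."""
    words = len(block.text.split())
    return min(words, 100) / 4 + TAG_WEIGHTS.get(block.tag, 0)


def visual_score(block: Block, page_width: float, page_height: float) -> float:
    """Score from rendered geometry: share of page area and horizontal centering."""
    area_share = (block.width * block.height) / (page_width * page_height)
    block_center = block.x + block.width / 2
    center_offset = abs(block_center - page_width / 2) / (page_width / 2)
    return 50 * area_share + 25 * (1 - center_offset)


def pick_main(blocks, page_width, page_height, use_visual):
    """Return the highest-scoring block, optionally adding the visual score."""
    def score(block):
        total = structural_score(block)
        if use_visual:
            total += visual_score(block, page_width, page_height)
        return total
    return max(blocks, key=score)
```

In this sketch, a narrow, text-heavy sidebar can outscore the article body when only the structural score is used, while the larger, centered body wins once the visual score is added.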

URI

http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-110503
http://elib.uni-stuttgart.de/handle/11682/11050
http://dx.doi.org/10.18419/opus-11033

Collections

05 Fakultät Informatik, Elektrotechnik und Informationstechnik
