Efficient solution of sparse linear systems arising in engineering applications on vector hardware

Tiyyagura, Sunil Reddy

Files

PhDtheses_Tiyyagura.pdf (4.44 MB)

Date

2010

Abstract

Block-based Linear Iterative Solver (BLIS) is a scalable software library for solving large sparse linear systems, especially those arising from engineering applications. BLIS was developed during this work specifically to address the performance issues of Krylov iterative methods on vector systems. Several failed attempts to port general public domain linear solvers onto the NEC SX-8 made it clear that the developers of most solver libraries do not focus on performance issues related to vector systems. The same is true of other software projects, because clusters of scalar processors have been the dominant high performance computing installations in the past few decades. With the advent of vector processing units on most commodity scalar processors, vectorization is again becoming an important software design consideration. To understand the strengths and weaknesses of various hardware architectures, benchmarking studies were carried out in this work. These studies show that vector systems are better balanced than most scalar systems with respect to many of the aspects that determine the sustained performance of real-world applications.

This work addresses the two main performance problems of the public domain solvers. The first problem, short vector length, is solved by introducing a vector-specific sparse storage format. The second and more important problem, high memory latency, is addressed by blocking the sparse matrix. Most engineering problems have multiple unknowns (degrees of freedom) to be solved per mesh point. Typically, public domain solvers do not block all the unknowns at each mesh point. Instead, they assemble and solve each unknown separately, which requires a huge amount of memory traffic. The approach adopted in this work reduces the load on the memory subsystem by blocking all the unknowns at each mesh point and then solving the resulting blocked global system of equations. This is a natural approach for engineering simulations and improves performance on scalar systems through cache blocking and on vector systems through reduced memory traffic.
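The effect of blocking on memory traffic can be illustrated with a matrix-vector product over a block compressed sparse row layout. This is only a minimal sketch: the abstract does not describe the storage format BLIS actually uses, so the layout and the names `bsr_matvec`, `row_ptr`, `col_idx` and `blocks` are assumptions chosen for illustration. The point it shows is that one column index is loaded per dense block rather than per matrix entry.

```python
def bsr_matvec(block_size, row_ptr, col_idx, blocks, x):
    """Return y = A * x for A stored in a (hypothetical) block CSR layout.

    row_ptr[i]:row_ptr[i + 1] is the range of blocks in block row i,
    col_idx[k] is the block column of block k, and
    blocks[k] is a dense block_size x block_size block as a list of rows.
    """
    b = block_size
    n_block_rows = len(row_ptr) - 1
    y = [0.0] * (n_block_rows * b)
    for i in range(n_block_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]  # one indirect index per block, not per entry
            blk = blocks[k]
            for r in range(b):
                acc = 0.0
                for c in range(b):
                    acc += blk[r][c] * x[j * b + c]
                y[i * b + r] += acc
    return y


# Two mesh points with two unknowns each, i.e. the 4 x 4 matrix
# [[4, 1, 1, 0], [1, 4, 0, 1], [1, 0, 4, 1], [0, 1, 1, 4]].
row_ptr = [0, 2, 4]
col_idx = [0, 1, 0, 1]
blocks = [
    [[4.0, 1.0], [1.0, 4.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[4.0, 1.0], [1.0, 4.0]],
]
x = [1.0, 2.0, 3.0, 4.0]
y = bsr_matvec(2, row_ptr, col_idx, blocks, x)
print(y)
```

In a point-based format the same matrix would need one column index for each of its 12 nonzeros; here 4 block indices cover all 16 stored entries, at the cost of storing a few explicit zeros.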

Preconditioning is one of the areas in linear solvers that is still actively researched. A preconditioned system of equations has better spectral properties, so solution methods converge faster than on the original system. The key consideration is to keep the time needed for the additional work of preparing the preconditioner as low as possible while improving the condition number of the resulting system as much as possible. Block-based splitting methods and scaling are more effective preconditioners than their point-based counterparts and are also efficient. The block-based incomplete factorization implemented in BLIS is also more efficient than the corresponding point-based method. Robust scalable preconditioners such as the algebraic multigrid method are also available in BLIS. Performance measurements are presented for three application codes that run on the NEC SX-8 and use BLIS to solve their linear systems.
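As a rough illustration of block-based scaling, the sketch below applies a block Jacobi preconditioner: each diagonal block is inverted once during setup, and every application is then a small dense multiply per mesh point. It assumes two unknowns per mesh point and is not taken from BLIS; the function names `invert_2x2`, `block_jacobi_setup` and `block_jacobi_apply` are hypothetical.

```python
def invert_2x2(m):
    """Invert a dense 2 x 2 block given as a list of two rows."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0.0:
        raise ZeroDivisionError("singular diagonal block")
    return [[d / det, -b / det], [-c / det, a / det]]


def block_jacobi_setup(diag_blocks):
    """Invert every diagonal block once, before the iteration starts."""
    return [invert_2x2(m) for m in diag_blocks]


def block_jacobi_apply(inv_blocks, r):
    """Return z = D^{-1} r, where D is the block diagonal of the matrix."""
    z = [0.0] * len(r)
    for i, inv in enumerate(inv_blocks):
        r0, r1 = r[2 * i], r[2 * i + 1]
        z[2 * i] = inv[0][0] * r0 + inv[0][1] * r1
        z[2 * i + 1] = inv[1][0] * r0 + inv[1][1] * r1
    return z


# Diagonal blocks of a system with two mesh points (illustrative values).
diag_blocks = [
    [[4.0, 0.0], [0.0, 2.0]],
    [[2.0, 1.0], [1.0, 1.0]],
]
inv_blocks = block_jacobi_setup(diag_blocks)
z = block_jacobi_apply(inv_blocks, [8.0, 4.0, 3.0, 2.0])
print(z)
```

A point-based scaling would use only the diagonal entries and ignore the coupling between the unknowns at a mesh point; the block version accounts for that coupling while keeping the setup cost to one small inversion per mesh point.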

Lastly, the memory bandwidth limitations of new hardware architectures such as multi-core systems and the STI CELL Broadband Engine are studied. The efficiency and scaling of BLIS are tested on multi-core systems. In addition, the performance of the blocked sparse matrix-vector product kernel is studied on the STI CELL processor.

Block-Based Linear Iterative Solver (BLIS) is a scalable numerical software library for solving large sparse linear systems of equations, such as those that arise in particular in engineering applications. BLIS was developed with the particular characteristics of parallel vector systems in mind. Achieving high floating-point performance was our central goal. Several failed attempts to run common public domain solvers on the NEC SX-8 vector computer made it clear that their developers had not taken the characteristics of vector systems into account. The same holds for other software projects, because clusters of scalar processors have dominated high performance computing in recent decades. In the meantime, however, vector units have found their way into scalar processors. Vectorizable algorithms and their implementations are gaining in importance. To assess the differences between the various architectures, we carried out several benchmark studies. These studies showed that vector systems are better balanced than common scalar systems with respect to the aspects that determine the sustained performance of many real applications.

This work addresses two major problems that limit the performance of public domain solvers. The problem of unsuitably short vector lengths was solved by using a vector-friendly data format for sparse matrices. This format is equally suitable for modern scalar processors. The second problem, the harmful effect of high memory latencies, was solved by using blocks in the matrix structure. Most engineering applications have multiple unknowns (degrees of freedom) at each node of the problem to be solved. Typical public domain solvers do not exploit this property. Instead, they assemble and solve each unknown separately and thereby increase the pressure on the memory system, particularly for indirect access. Our approach reduces this pressure by blocking the nodal variables and solving the resulting global system of block equations. This is a natural approach for engineering simulations and enables performance improvements on scalar systems through cache blocking and on vector systems through reduced memory traffic.

Preconditioning is one of the areas of solution methods for linear systems in which research is still active. A preconditioned system of equations has better spectral properties. Solution methods therefore converge faster than on the original system. The key to a successful preconditioning method is to keep the condition number of the modified matrix as small as possible at little additional time cost for the preconditioning. Block-based splitting methods and block-based scaling are numerically more effective preconditioners than their point-based counterparts and use the hardware more efficiently. Block-based incomplete factorization (ILU), as implemented in BLIS, is likewise more efficient than the corresponding point-based method. Robust scalable preconditioners such as the algebraic multigrid method are also available in BLIS.

Performance measurements on the NEC SX-8 are presented for three application programs that use BLIS. In addition, the bandwidth limitations of new hardware architectures such as the STI Cell Broadband Engine are analyzed. The efficiency and scaling of BLIS on multi-core systems are tested. The performance of the matrix-vector product kernel for blocked sparse matrices is tested and described.

URI

http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-55170
http://elib.uni-stuttgart.de/handle/11682/1895
http://dx.doi.org/10.18419/opus-1878

Collections

04 Fakultät Energie-, Verfahrens- und Biotechnik
