Visualization Research Center, University of Stuttgart (VISUS) UNIVERSITY OF STUTTGART Universitätsstraße 38 D-70569 Stuttgart MASTER'S THESIS NO. 3422 Statistical Analysis and Comparative Visualization of Cellular Particle-based Simulations Ana Cristina Pintilie Course of studies: INFOTECH Examiner: Prof. Dr. Thomas Ertl Supervisor: Dipl.-Inf. Martin Falk Starting date: 2 November 2012 End date: 4 May 2013 CR Classifications: J.2, I.3.4 Abstract The present work integrates a set of functions, properties and views to the CellVis framework developed by Martin Falk in order to lay the foundations for a statistical analysis and comparative analysis/visualization module. The CellVis framework employes a stochastic simulation to model and analyze cellular signal transduction. The cell is represented as sphere con- sisting of a nucleus, proteins, obstacles and cytoskeleton filaments. The particles can move inside the cell by diffusion or by direct transportation along the cytoskeleton filaments The statistical analysis is performed by estimating the first four statis- tical moments of the sample spatial distribution of the molecules involved in the analyzed simulation. These moments can be visualized in a tree-like view and also with help of a set of time series plots, that illustrate the evo- lution of these moments through the lifetime of the simulation. The spatial distribution of the concerned particles is separated on the three spatial di- mensions. This is possible due to the fact that the motions of one particle on each axis is independent with respect to the other ones. A histogram is also calculated and graphically represented in a 2D plot in order to facilitate understanding the meaning of the estimated sample moments values. To justify the use of moments describing the spatial distribution, an external tool is used to estimate a probability density function that fits the simu- lated spatial distribution with respect to the estimated sample moments. The results are satisfactory with respect to the matching of the calculated histogram and the histogram of the fitted probability density function. So we could conclude that for typical types of distributions the use of moments can help in finding a suitable probability density function that can be used later on for predicting the spatial distribution for certain types of particles in other simulations. For a more particular type of analysis of the signal transduction process, the trajectory and reaction history of an individual particle are calculated and determined. To visualize them, 2D graphs are employed and integrated in the framework. The trajectory of a particle is plotted as the distance to the nucleus over time. The reactions and interactions are indicated with markers on the same graph. iii The comparative visualization is performed by employing difference plots and difference volume. For this, the framework is extended with the loading of a second data set. The difference graphs are used to show differences in the statistical sample moments, in the concentrations and in the histograms of the two loaded data sets. The difference volume is performed with respect to the differences of concentrations in the considered data sets. The view of the difference volume can be enabled by the user. However the option of viewing both data sets at the same time remains as a possibility of future work. iv Contents Contents v List of Figures vii Acronyms ix 1 Introduction 1 1.1 Motivation and Scope . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Contribution and Aims . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Biological Fundamentals 7 2.1 Cellular Architecture . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Signal Transduction . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 MAPK-Cascade . . . . . . . . . . . . . . . . . . . . . 9 2.3 Transport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4 Reactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4.1 First-Order Reactions . . . . . . . . . . . . . . . . . . 11 2.4.2 Second-Order Reactions . . . . . . . . . . . . . . . . . 11 2.4.3 Enzymatic Reactions . . . . . . . . . . . . . . . . . . . 12 3 Moment-Based Statistical Analysis 13 3.1 First-order moment: Mean . . . . . . . . . . . . . . . . . . . . 14 3.2 Second-order moment: Variance . . . . . . . . . . . . . . . . . 15 3.3 Third-order moment: Skewness . . . . . . . . . . . . . . . . . 16 3.4 Forth order moment: Kurtosis . . . . . . . . . . . . . . . . . . 17 3.5 Correlation and Covariance . . . . . . . . . . . . . . . . . . . 18 3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4 History of Individual Molecules 29 4.1 Trajectory of Single Molecules . . . . . . . . . . . . . . . . . . 29 4.2 Identification of Reactions and Reaction Partners . . . . . . . 30 4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 v 5 Comparative Visualization 37 5.1 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.2 Comparative Visualization in CellVis . . . . . . . . . . . . . . 38 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.3.1 Concentration and Histogram-based comparison . . . 40 5.3.2 Statistical Parameters-based comparison . . . . . . . . 44 5.3.3 Concentration Difference Volume . . . . . . . . . . . . 52 6 Conclusions and Future Work 55 6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Bibliography 59 A Identification of reactions and reaction partners 63 Declaration 65 vi List of Figures 2.1 Eukaryotic cell . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Diagram of the basic structures of an actin filament . . . . . 8 2.3 Signal transduction process . . . . . . . . . . . . . . . . . . . 9 3.1 Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Standard deviation . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 Skewness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4 Kurtosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.5 Tree view of statistical parameters . . . . . . . . . . . . . . . 21 3.6 Time series plot of the first four moments . . . . . . . . . . . 22 3.7 Correlation coefficients . . . . . . . . . . . . . . . . . . . . . . 22 3.8 Test configuration . . . . . . . . . . . . . . . . . . . . . . . . 23 3.9 Histogram first time step . . . . . . . . . . . . . . . . . . . . 24 3.10 Histogram last time step . . . . . . . . . . . . . . . . . . . . . 25 3.11 Distribution fitting for molecule of type mol2 . . . . . . . . . 26 3.12 Distribution fitting for molecule of type Receptor A . . . . . 27 3.13 Distribution fitting for molecule of type Mol B Skeleton . . . 27 4.1 Trajectory view . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.2 Trajectory of type mol3nucleus . . . . . . . . . . . . . . . . . 32 4.3 Reaction view of type mol3nucleus . . . . . . . . . . . . . . . 33 4.4 MAPK simulation configuration . . . . . . . . . . . . . . . . . 33 4.5 Particle reaction history-MAPK . . . . . . . . . . . . . . . . . 34 4.6 Particle reaction history-MAPKp . . . . . . . . . . . . . . . . 35 4.7 Receptor history . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.1 Radial concentration profile . . . . . . . . . . . . . . . . . . . 40 5.2 Concentration and histogram difference plot-case I . . . . . . 41 5.3 Concentration and Histogram difference plot-case II . . . . . 42 5.4 Concentration and histogram difference plot-case III first time step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.5 Concentration and histogram difference plot-case III last time step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.6 Mean and standard deviation difference plot-case I . . . . . . 45 vii 5.7 Skewness and kurtosis difference plot-case I . . . . . . . . . . 46 5.8 Correlation coefficients difference plot-case I . . . . . . . . . . 46 5.9 Mean and standard deviation difference plot-case II . . . . . . 47 5.10 Skewness and kurtosis difference plot-case II . . . . . . . . . . 48 5.11 Correlation coefficients difference plot-case II . . . . . . . . . 49 5.12 (a) Mean and (b) standard deviation difference plot-case III . 50 5.13 (a) Skewness and (b) kurtosis difference plot-case III . . . . . 50 5.14 Correlation coefficients difference plot-case III . . . . . . . . . 51 5.15 Color map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.16 Concentration difference volume-case I . . . . . . . . . . . . . 52 5.17 Concentration difference volume-case II . . . . . . . . . . . . 53 5.18 Concentration difference volume-case III . . . . . . . . . . . . 54 5.19 Concentration difference volume-different density scales . . . 54 viii Acronyms DNA Deoxyribonucleic Acid DVR Direct Volume Rendering MAPK Mitogen-Activated Protein Kinase MAPKp phosphorylized Mitogen-Activated Protein Kinase MCA Metabolic Control Analysis SVG Scalable Vector Graphics ix Chapter 1 Introduction Systems biology is concerned with the study of biological functions and mechanisms. A biological system of interacting entities is analyzed as a whole rather than analyzing each individual entity separately. Systems biol- ogy aims to provide means of understanding biological processes on a large scale, like the association of gene with function, the detailed modeling of the interaction among proteins and metabolites and the function of cells. It has a wide-range of applications, as it involves a number of underlying disci- plines, including biology, computer science, mathematics, physics, chemistry and the social sciences. Analyzing the individual functions of genes, proteins or other cellular components is, however, not enough to capture the complexity of a cell, its functions and organisms. The whole system needs to be analyzed as an ensemble. For this we need to determine its organization and its controlling mechanism by investigating the dynamic of genes and proteins, together with the interactions between them. The group of molecules that work to- gether to control the cell functions, such as cell division or death represent a signaling pathway. A signaling pathway consists of a series of chemical reactions that enable passing the external signal from one molecule to an- other until the signal reaches the target. Signal transduction is the process by which cells communicate with each other and their environment and in- volves a multitude of proteins that can be in active or inactive states. In their active (phosphorylated) state they act as catalysts for the activation of subsequent steps in the signaling cascade. The result is the activation of a transcription factor which in turn initiates a gene transcription event. The ordered sequence of the reactions in this process is known as the signal transduction pathway. These pathways exhibit a high dynamic complexity and an adaptive behavior. The general goal of theoretical systems biology is therefore to develop computer models that predict the properties of the large, adaptive, intercon- 1 1.1. Motivation and Scope nected networks that are found in living organisms. One can distinguish two broad categories of questions that guide the direction of systems biology re- search. The first category encompasses topics of medical importance and is typically characterized by forward-engineering approaches that focus on pre- venting or combating diseases [1]. The second category includes problems of industrial interest such as the genetic engineering of microbes in order to maximize product formation, the creation of robust-production strains and so on. The applications of the second category comprise an important reverse-engineering component whereby microbes with attractive properties are scrutinized for the purpose of transferring any insights learned from their functions to the further improvement and optimization of production strains [1]. A few examples of approaches that have been developed in recent years and successfully applied to relatively small systems are: metabolic con- trol analysis (MCA), signaling pathways, reconstruction of flux mapsa and metabolic engineering. Computational simulations are a useful tool to fully capture the tempo- ral aspect of the entire biological system. Thus, simplified or mesoscopic cell models are often used in systems biology. Particle-based simulation are well suited for covering the different components of the cell’s complex architecture [2]. The present work deals with developing methods to help the data anal- ysis process of the signal transduction pathway. 1.1 Motivation and Scope The motivation for this thesis resulted from the need to reduce the com- plexity of the spatial-temporal data of the particle-based simulations used in systems biology. The intracellular architecture, particle features, the exact molecule count and the molecules precise position can be visualized directly from the simulation, e.g, with glyphs [2]. However, only the spatial distri- bution at a given time can be easily observed. In order to better describe the spatial distribution at a given time step and to see how this distribution evolves trough time, it is necessary to find and estimate a set of parame- ters that are capable of quantifying the spatial distribution. Once a set of parameters is chosen, their evolution throughout the entire simulation can be easily viewed with a time series plot, thus reducing the spatial-temporal data of the crowded environment of the cell to a simple time series plot where the distribution tendencies of the different types of molecules can be more clearly shown. Once we have a set of parameters that are able to roughly describe the spatial distribution of the different molecule types involved in one simula- tion, the idea of a comparative visualization of several simulations comes 2 CHAPTER 1. INTRODUCTION naturally. The simulations might simply differ by the initial random seeds or might employ different parameter sets. For the comparative visualization we can consider either two data sets that are shown side-by-side or several data sets are combined and the differences are highlighted. Though the set of parameters that describe the evolution and the outcome of one simula- tion can be quite large, the comparison between two simulation runs has to be done by considering those parameters that can actually provide a mean- ingful information. Since the possibilities for comparison are plenty, this work proposes a comparison based on concentration and statistical sample moments. The diffusion process of one specific molecule and the reactions in which the molecule is involved are also hard to spot [2]. Therefore the utility of visualizing the path that one molecule follows arises. The trajectory of one molecule can be very important since the main interest is to determine if and how many of the molecules of interest reach their target, in this case the nucleus. Having a view of the several molecule paths provides the possibility to analyze the rate of success in reaching the target and also to understand the role of the motor proteins in propagating the signal. Besides the interest in whether or not one molecule reaches the target, the determination of the other molecules that may interact or react with the selected molecule might also be of interest. Developing an approach that determines when reactions take place and which particles are involved provide the means to estimate a reaction distribution and to estimate the approximate region where different reactions are more likely to take place. Considering the motivation of developing a module capable of providing a number of statistical parameters in order to simplify the analysis of the simulation data by experts, the scope of this work is to implement and integrate this module in the existing framework CellVis of Martin Falk. The main goals of the thesis are: • estimation of a set of statistical parameters that describes the spatial distribution of certain molecule types and provide a simple overview of their time evolution • trajectory visualization of individual molecule, with the possibility to include the potential reaction partners • developing a visualization for the comparison of two simulation runs 1.2 Contribution and Aims This thesis lies the basis for a statistical analysis module. It provides the means to analyze the data sets resulting from the simulation with a set of estimated parameters specific to a spatial distribution, in this case: mean, 3 1.3. Thesis Outline variance, skewness, kurtosis and correlation coefficients. These parameters represent the basic parameters that come to mind when describing a distri- bution. But in order to have a deeper understanding of the processes that take place inside the cell other functionalities have to be added to the sta- tistical module, e.g.: determining clusters of molecules and their evolution in time focusing on their movement and size development. Considering the statistic data involving a single molecule, a trajectory and a reaction history view is integrated. The possibility to track and an- alyze the activity of a molecule can be extended to track certain types of molecules of interest and extract other statistical information like how many of these molecules reach their target or what is life expectancy of a certain type of molecule. The algorithm for determining the reactions can be used also to provide information about the reaction distribution or to compare the number of reactions at one time point with the theoretical reaction rate for validation purposes. The CellVis framework was mainly designed to deal just with one data set. In this thesis a functionality that enables loading information about another data set is implemented. At the time being, only a small part of the information that can be extracted from a second data set is used and the visualization is based on differences of the two simulations. For the future work, a complete module of loading a second data set could be included so that two data sets can be viewed at the same time together with a visualization of highlighted differences. 1.3 Thesis Outline Chapter 2 provides a brief introduction in certain biological aspects and is supposed to give an insight on the biological terms, processes and systems that are taken in consideration in this work. Chapter 3 describes the statistical methods that were implemented in order to have better understanding not only of the outcome of the simulation but also throughout its entire evolution and how these parameters vary from one time step to another. The concept of a statistical moment is presented together with the problematic of the method of moments used to estimate a probability density function. The estimated sample moments resulting from the simulation are illustrated and discussed in the results section. The intracellular processes do not involve only the movement of molecules but also different type of reactions and interactions. Thus Chapter 4 presents an approach for visualizing the trajectory of one molecule and also its re- action history. The history does not only include the potential reaction partners of the molecule in question but also the reaction partners of the reaction partners. Thus we have a complete description of the reaction chain involving the molecule of interest. 4 CHAPTER 1. INTRODUCTION Though analyzing one data set might be useful, sometimes the informa- tion that we obtain from one data set is not enough and it needs to be put in perspective and analyzed together with other data sets. In Chapter 5 the basics of a comparative visualization approach are introduced and the possibility of loading a second data set is added to CellVis in order to view a volume in the difference of molecule concentrations. The last chapter summarizes the presented work, presenting the result- ing conclusions and outlining the possibilities of future development of the statistical analysis module. 5 Chapter 2 Biological Fundamentals In order to better understand the role of this thesis, of the implemented statistical methods and of the few simulation aspects that were analyzed, it is important to have a clear description of the biological cell architecture and of the intracellular processes that are considered in this work. 2.1 Cellular Architecture From the biological point of view cells are of two types: prokaryotic and eukaryotic. Prokaryotes are a type of organisms enclosed by a plasma mem- brane but lack a membrane-bound nucleus. Eukaryotes, unlike prokayotic cells, contain a nucleus surrounded by double membrane in which the ge- netic material is stored. Besides having a nucleus the eukaryotic cells also contain other enclosed compartments called organelles. The organelles of an eukaryotic cell can be seen in Figure 2.1. Since CellVis framework is a tool designed to simulate and visualize signal transduction processes in cells that have a well defined nucleus, the cell architecture of interest in this case will be the one of the eukaryotic cells. Figure 2.1: Eukaryotic cell[3] 7 2.1. Cellular Architecture The nucleus and mitochondrion are bounded by a double membrane having an intermembrane space as a separator. All other organelles are sur- rounded by a single membrane. The nucleus can be considered the “brain” of the cell, because it manages and directs the cell activities and it contains the genetic material called chromosomes made out of DNA. Mitochondria are responsible for producing energy from food; they represent also the site of cellular respiration. The lysosome breaks down worn-out or unneeded cellular components and some ingested materials with the help of diges- tive enzymes. The Golgi apparatus is in charge of modifying, sorting and packaging proteins. The endoplasmic reticulum represents an internal mem- brane system where certain types of proteins are assembled with the help of ribosomes. The plasma membrane acts as boundary between the cell and its environment and regulates cell’s absorption and diffusion of certain materials. Between the plasma membrane and the nucleus lies the cytoplasm that comprises all the other organelles and the cytosol consisting of an aqueous medium. The main structure consisting the cytoplasm of the eukaryotic cells is the cytoskeleton. The cytoskeleton is a complex network of protein fibers that define the shape of the cell. The cytoskeleton also serves as a traffic network for the organelles in the cytoplasm and also most of them are attached to the cytoskeleton. This protein fiber network is consisted of three types of filaments that differ by size, type of subunit and the arrangement of these subunits (Figure 2.2) [3]. Figure 2.2: Diagram of the basic structures of an actin filament (AF), in- termediate filament (IF) and microtubule (MT)[3]. Actin filaments (AF), also called microfilaments (7 - 8nm) have a twisted 8 CHAPTER 2. BIOLOGICAL FUNDAMENTALS two-stranded structure. The microtubules (MT) are hollow structures re- sembling a tube (24 - 25nm) and the intermediate filaments (IF) have a rope-like structure (10nm) [3, 4]. The cytoskeleton also plays an impor- tant role in the transport of signal molecules, as the different filaments that contain the cytoskeleton serve as a “highway” towards the target, e.g. the nucleus, for the proteins involved in the signal transduction [5]. 2.2 Signal Transduction The multitude of processes involved in how a cell converts an external stimulus into a specific response is referred to as signal transduction. Signal transduction starts with an external signal to a receptor and re- sults in a change of the cell status or of one its functionalities (Figure 2.3). This process is intermediated via a cascade of messenger signal molecules inside the cell. It begins with a signal to a receptor and ends with a change in cellular function which is typically mediated via a cascade of additional messenger signals within the cell. There are two main ways of adjusting the state of the cell. One way is binding an external signal-molecule as a ligand to a membrane receptor. The receptor protein is activated by this binding and can further activate other signaling molecules that eventually carry the message to the desired target. The second way is when the external signal can itself move through the plasma membrane and to the intracellular receptors [4]. Figure 2.3: Signal transduction process [4] 2.2.1 MAPK-Cascade Sequential activation of kinases is a common mechanism of signal trans- duction in many cellular processes. This work deals with the MAPK (mitogen- activated protein kinase) signaling cascades. The MAPKs are a group of 9 2.3. Transport protein Serine/threonine kinases that are activated as a result of an exter- nal signal and mediate signal transduction process. The MAPK pathway can be generalized by the following sequence: Stimulus > MAPKKK > MAPKK > MAPK > Response where MAPKK is the kinase of MAPK and MAPKKK is the kinase of MAPKK. The MAPKKK is activated by G proteins such as Ras, Rac and Rap1 or other enzymes that are activated by external signals [6, 7]. This cascade amplifies the transmitted signals that can eventually activate sev- eral regulatory molecules in the cytoplasm and in the nucleus to initiate a number of cellular processes such as proliferation, differentiation, survival, development and death [8]. Deviated or uncontrolled course of the MAPK signaling pathways is involved in the development of many human diseases including Alzheimer’s disease (AD), Parkinson’s disease (PD), amyotrophic lateral sclerosis (ALS) and various types of cancers [9, 10]. Thus these path- ways are considered potential therapeutic targets . 2.3 Transport In the crowded environment of the cell, molecules can move by diffusion which is characterized by a random movement on very short distances. The signaling cascade involves several reactions and dozens of proteins. Diffu- sion is a very slow transportation method for a reaction product to the next reaction partner along the pathway. To overcome this inefficiency in trav- eling through large spaces between different compartments of a cell, cells have developed a mechanism that cluster molecules in a common pathway together, assembling on a common “scaffold”[3]. Besides the undirected diffusion, some signaling cascades also involve the directed transportation along the cytoskeleton with the help of motor proteins. Most often the target of the signaling molecules is the nucleus, where these molecules can influence and trigger certain genes. The trans- port process in this case is a combination of undirected diffusion and linear transport along the cytoskeleton[11]. 2.4 Reactions The partitioning in oragnelles of the eukaryotic cell permits different kinds of biochemical reactions to take place. Although each organelle per- forms a specific function in the cell, together they form an ensemble and cooperate in order to meet the overall needs of the cell. For example, biochemical reactions in a cell’s mitochondria transfer energy from fatty 10 CHAPTER 2. BIOLOGICAL FUNDAMENTALS acids and pyruvate molecules into an energy-rich molecule called adenosine triphosphate (ATP) [3]. One or more particles can participate in a reaction, resulting in one or more products. For a reaction to take place it is necessary that the involved molecules are sufficiently close to each other. In addition they have to feature the correct orientation with respect to their position and that the energetic conditions are proper to initiate the reaction. 2.4.1 First-Order Reactions A first-order reaction depends linearly only on the concentration of just one reactant (see eq. 2.4.1), even though other reactants might be present. This translates in a spontaneous decay or in a change in the particle’s state. An example for first-order reactions can be considered the deactivation or unbinding process. d dt ci = −kici (2.4.1) where ki represents reaction rate constant and cirepresents the concentration of molecule i and the reaction rate is defined by the reaction law as follows: ri = kici (2.4.2) The rate of the reactions can be measured in Ms-1, where M represents moles per liter and s represents seconds, therefore the rate unit is molar per second. As the reactant is being depleted, the reaction rate decreases proportionally. The first-order reaction rate can be viewed also as a proba- bility per unit of time and this probability is independent of the molecules concentration. 2.4.2 Second-Order Reactions Second-order reaction, also called bimolecular reactions, involve two re- actants. These two molecules collide with each other if they are close enough together. The rate of this reaction is determined by the reactants concen- trations and by a reaction rate constant. rij = kijcicj (2.4.3) The value of the second-order reaction rate constant kij is determined by the rate of the diffusion of the molecules. The rate of the diffusion depends on the size and shape of the molecules and the temperature and viscosity of the medium. The influence of these factors is summed up in the diffusion co- efficient D, with unit ofm2s−1. The diffusion coefficient represents a measure that describes how a particle moves in its medium. Therefore, the reaction 11 2.4. Reactions rate constant can be defined by the diffusion coefficient and the area of in- teraction between molecules and is described by the Debye-Smoluchowski equation [12]: kij = 4pi b(Di +Dj)N0103 (2.4.4) where b represents the interaction radius between the two molecules, mea- sured in meters, Di and Dj are the diffusion coefficients of the respective particles and N0 represents Avogadro’s number. The scaling with 103 is done so that the unit is converted into M−1s−1. 2.4.3 Enzymatic Reactions An enzyme is a protein that speeds up the chemical reaction in a living organism. It acts as a catalyst for specific chemical reactions, converting a certain types of reactants called substrates into specific products. Enzymatic reactions represents a special case of the second-order re- actions, because the enzyme involved in the reaction has the role of both reactant and product. As mentioned before, the enzyme E converts the substrate S into the product P by forming an intermediate complex ES. E + S ES → E + P (2.4.5) The reaction rate constant can be determined using just two parameters: the unimolecular reaction rate constant kcat and the Michaelis constant KM . kES = − kcat KM + cS (2.4.6) However, in this case kES is not a constant anymore, since it depends on the concentration of the substrate, Michaelis constantcS . 12 Chapter 3 Moment-Based Statistical Analysis Statistical analysis is considered an essential research component that plays a vital role in a considerable number of research areas like: agriculture, biology, education, economics, management, engineering and psychology. Systems biology usually is dealing with large sets of data, whether they are result from simulations or from experiments. The raw numerical data is hard to analyze and the need to extract corresponding features and characteristics in order to put in evidence certain trends emerges. Statistical analysis refers to a set of methods that process large amounts of data with the purpose to reveal these trends, if any exist. It is a common practice in statistical analysis to describe the shape of the distribution of a population by means of a finite set of quantities sum- marizing the location, dispersion, skewness, peakedness and tail behavior of the unknown population. Classical measures of distributional shape have been defined by means of algebraic moments of different orders, resulting in the mean for location estimation, the variance to measure the spread and the standardized measures of skewness and kurtosis to further reveal any asymmetries and levels of peakedness. The Method of Moments [13] is one of the most common used methods of estimating a set of parameters that characterize a distribution. It was introduced by Karl Pearson in the 1880’s. From a known probability dis- tribution it relatively straightforward to extract the respective sequence of moments, however describing a distribution with a known sequence of mo- ments is a complex problem that it is still discussed by mathematicians and finding a corresponding probability density function is not always possible [14, 15]. Given a data set that is sufficiently large and has a tendency of clustering around a particular value, it could be useful to try to characterize its shape and spreadness by a set of parameters called moments of distribution. These 13 3.1. First-order moment: Mean moments represent a quantization of the average of powers of the variables included in the considered data set. The Kth moment mk of a probability distribution X is defined: mk = E(Xk) for k = 1, 2, 3, ... (3.0.1) where E() is the expectation operator. The Kth sample moments mˆk is estimated: mˆk = 1 n n∑ i=1 Xki for k = 1, 2, 3, ... (3.0.2) and the Kth centered sample moment or Kth sample moment mˆk ′ about the mean is: mˆk ′ = 1 n n∑ i=1 (Xi − mˆ1)k for k = 1, 2, 3, ... (3.0.3) where X1,X2, . . . , Xn represent a random sample of size n from the considered X distribution. It is important to differentiate between dealing with a complete popula- tion and samples taken from it. When dealing with a complete population, the value of an individual moment is actually a constant and represents a parameter of the population’s distribution, but when dealing with a sam- ple population an individual moment is considered a random variable and its value differs from sample to sample. Therefore the value of one sample moment represents an estimation of the respective moment of the complete population. In the present case, the considered data set in represented by the posi- tions of each type of particle present in the simulation. Though sometimes it is useful and necessary to analyze just one particle, most times is better and easier to first look at the all particles of one type as a group. In this thesis, the first four moments will be computed and analyzed so that the evolution of the spatial distribution of the inner cell particles can be quantized and described with a set of standardized parameters. The first two moments-mean and variance-measure the ability of a distribution to center around certain value. The third moment -skewness- is most useful to investigate when dealing with non-normal distributions, as it measures the asymmetry of the distribution and the forth moment -kurtosis- also used to describe non-normal distributions, as an indicator that helps determining the flatness of the distribution. 3.1 First-order moment: Mean The first-order moment about zero is also called mean or expectation 14 CHAPTER 3. MOMENT-BASED STATISTICAL ANALYSIS Figure 3.1: Mean of distribution and it can be computed using the following formula: mˆ1 ′ = 1 n n∑ i=1 Xi (3.1.1) The mean is also considered the mass center because it indicates the balance point of the respective distribution. As it can be seen, the mean represents the average of all sample points of the given distribution. 3.2 Second-order moment: Variance Variance represents the second-order moment about the mean: mˆ2 ′ = 1 n n∑ i=1 (Xi − mˆ1′)2 (3.2.1) and it indicates the spread of the distribution, where mˆ1 ′ is the first order moment: mean. As it can be observed, the term (Xi − mˆ1′) measures the deviation of Xi from the estimated sample mean, therefore the variance is also called the mean squared deviation. Another valuable indicator of the spreadness of the data is the standard deviation, which is calculated as the square root of the variance 3.2.2. The standard deviation is a measure of the dispersion about the mean and it is often used instead of the variance as an indicator of how much the data is deviated from the center. σ = √ mˆ2 ′ (3.2.2) 15 3.3. Third-order moment: Skewness Figure 3.2: Standard deviation of distribution In order to obtain an unbiased estimator of the variance, the sample variance is best to be computed as follows: mˆ2 ′ = 1 n− 1 n∑ i=1 (Xi − mˆ1′)2 (3.2.3) An estimator is considered unbiased if the mean value of an estima- tor equals the true value of the quantity it estimates, in this case m2, the variance of the complete population. Therefore the squared differences are divided by n − 1and not by n. However, when dealing with a large set of data, dividing by n− 1or by n doesn’t produce a significant difference. 3.3 Third-order moment: Skewness The third sample moment about the mean is called skewness. While the mean, variance and standard deviation are dimensional quantities and share the same unit measure as the individuals from the populations, skewness is defined as adimensional, a pure number that describes the shape of the distribution. mˆ3 ′ = 1 n n∑ i=1 Xi − mˆ1′√ mˆ2 ′ 3 (3.3.1) Skewness is an indicator of symmetry or more correct of asymmetry in the distribution. A symmetric distribution is characterized by a skewness equal to zero. A positive skewness indicates that the tail of the distribution 16 CHAPTER 3. MOMENT-BASED STATISTICAL ANALYSIS tends more to the right side, as for a negative skewness which indicates a tailing more to the left side( Figure 3.3). In practice, it is almost impossible to have an exact zero skewness, therefore small deviations from zero of the skewness do not actually reflect any asymmetry in the distribution of the data set. (a) (b) Figure 3.3: Skewed distribution (a) to the right (b) to the left Regarding the sample skewness, the same problem as with the sample variance arises. Considering the fact that the actual mean of the distribution is not known and we operate only with a an estimator, the actual division should also be with n− 1. 3.4 Forth order moment: Kurtosis The forth moment about the mean, called kurtosis, is a also a nondimen- sional quantity, the same as the skewness. Its value is a relative measure of flatness or peakedness of the distribution. It considered a relative mea- sure, because the meaning of its value is always taken in consideration with respect to the one belonging to a normal distribution. The sample kurtosis can be computed as follows: mˆ4 = 1 n n∑ i=1 ( Xi − mˆ1√ mˆ2 )4 (3.4.1) In order to facilitate the analysis of the sample kurtosis, a term of −3is introduced as follows: mˆ4 = 1 n n∑ i=1 ( Xi − mˆ1√ mˆ2 )4 − 3 (3.4.2) so that for a normal distribution the sample kurtosis is approximately zero. 17 3.5. Correlation and Covariance A distribution characterized by a positive sample kurtosis is called leptokur- tic, as for a distribution with negative sample kurtosis is called platykurtic. The in-between distributions are referred to as mesokurtic. From Figure 3.4 we can get an intuitive idea of how the value of kurtosis can be interpreted. On the top right corner of the figure the kurtosis values of several distributions are listed. Figure 3.4: Distributions with different values for Kurtosis [16]. 1.D: Laplace distribution, also known as the double exponential distribution. 2.S: hyper- bolic secant distribution. 3.L: logistic distribution. 4.N: normal distribution. 5.C: raised cosine distribution. 6.W: Wigner semicircle distribution. 7.U: uniform distribution. It can be observed that a high positive kurtosis value for is an indicator of a steep peaked distribution, as for a negative value of kurtosis points to more flat distribution up to uniform distributions. 3.5 Correlation and Covariance When dealing with a set of random variables, there is an interest in how these variables influence each other. Therefore the need of finding a measure that indicates the strengthness of the relationship between these variables. Covariance and correlation are two measures that suggest if there is a relationship between two or more random variables and the direction of this relationship. However these two indicators are not able to indicate the cause and effect of the relationship, just to reflect whether the random variables are independent or not and if not, how strong is their dependency. Covariance is an unstandardized measure for the relationship between two random variables: X and Y. Covariance represents a generalization of 18 CHAPTER 3. MOMENT-BASED STATISTICAL ANALYSIS the variance, when more the one random variable is considered. Since we are working with a sample from the whole population, the sample covariance instead of the covariance will be taken in consideration. Cov(X,Y ) = 1 n n∑ i=1 ( Xi − X¯ ) ( Yi − Y¯ ) (3.5.1) where X¯ and Y¯ represent the estimated sample mean of X and Y. Please note that the division by n−1 can be used for the same reasons explained in the previous sections. The covariance is an unstandardized measure because it depends on the units in which the two random variables are measured and so analyzing the strengthness of the X and Y relationship by the value of the covariance becomes a subjective process and is different from experiment to experiment. Thus the notion of correlation and correlation coefficients emerged. Correlation is a standardized measure and it is actually a form of covariance normalized by the standard deviation of each of the random vari- ables. And so, regardless of the unit type of measurement, the correlation coefficients will have the same type of unit. The definition of the correlation coefficient between X and Y is: rXY = Cov(X,Y ) σXσY (3.5.2) where σX and σY are the respective standard deviations of X and Y. Corre- lation coefficients are therefore invariant to measure units and are actually bounded: − 1 ≤ rXY ≤ 1 (3.5.3) If the correlation coefficient is zero, it means that there is a very weak or nonexistent relationship between X and Y. Since the correlation coeffi- cients only detect linear dependencies between X and Y, it is not safe to say that these two variables are independent if their correlation coefficient is null. However in the reverse case, if it is known that the two variables are independent then we can consider the correlation coefficients equal to 0. If |rXY | is close or equal to 1, it means that between X and Y exists a strong dependency and we can say that X and Y a strongly correlated. A negative correlation coefficient indicates a negative linear dependency, for example if X increases then Y decreases. This dependency is also called anti-correlation. As for a positive correlation coefficient, X and Y tend to increase together. Covariance and correlation coefficients are symmetric, meaning: 19 3.6. Results Cov(X,Y ) = Cov(Y,X) (3.5.4) rXY = rY X When working with more then two random variables it is useful to de- termine the variance-covariance matrix, in order to have one data structure that is able to characterize the relationship between the considered variables. If we assume three random variables: X, Y and Z, their variance-covariance matrixC looks as following: C =  Cov(X,X) Cov(X,Y ) Cov(X,Z)Cov(Y,X) Cov(Y, Y ) Cov(Y, Z) Cov(Z,X) Cov(Z, Y ) Cov(Z,Z)  (3.5.5) Since on the diagonal lies the covariance between one random variable and itself, this is actually the variance of the respective random variable, thus the name of variance-covariance matrix. We already stated that the covariance is a symmetric operator. This means that the variance-covariance matrix is a symmetric matrix with respect to the first diagonal. In the same way, the correlation c matrix is defined: c =  rXX rXY rXZrY X rY Y rY Z rXZ rXY rZZ  (3.5.6) The correlation matrix maintains the symmetry of the covariance matrix and because it contains on the main diagonal the correlation coefficients of the variables with themselves, the diagonal will always contains just terms of 1’s. 3.6 Results The implementation was done in C++ and Qt and the new developed functions were integrated in the CellVis framework. In the present case, the data set representing the sample population was considered as the spatial distribution of different types of particles inside the cell. The spatial distribution is described by a position vector. This vector is three-dimensional, where the dimensions correspond to the Cartesian co- ordinates x, y and z. The estimated statistical parameters were computed with respect to each axis and to each type of molecule. Thus, functions to estimate the first four moments and the correlation coefficients between the three coordinates were implemented and a tree view was integrated in the graphical interface of CellVis in order to inspect the numerical values of these parameters and to browse through all the time steps of the simula- 20 CHAPTER 3. MOMENT-BASED STATISTICAL ANALYSIS tions and through molecule types. A print screen of this view can be seen in Figure 3.5: Figure 3.5: Tree view of statistical parameters Having a view that shows the numerical value is of help when the interest is focused on one time step or to compare these parameters between several time steps. However, just from this view it is hard to have an insight of these parameters tendencies and to visualize the general trend of the different types of molecules involved in the simulation. To facilitate this, several time series plots are integrated in CellVis. Thus a graphical view of how all these parameters evolve through time is provided. Time series plots are available for all the parameters included in the tree view and they are grouped 3 by 3 in one window, so that in one window the tendencies on each the three axes are visible. Figure 3.6 the time evolution of the first for moments visualized in CellVis and in Figure 3.7 the evolution of the correlation coefficients is depicted. 21 3.6. Results (a) (b) (c) (d) Figure 3.6: Time series plot of the first four moments (a) mean (b) standard deviation (c) skewness (d) kurtosis Figure 3.7: Time series plot of the correlation coefficients The statistical analysis is very sensitive to the number of individuals in the sample population. In order to obtain reliable information from the es- timated statistical parameters, the number of molecules of each type should 22 CHAPTER 3. MOMENT-BASED STATISTICAL ANALYSIS be sufficiently large. The initial count of the molecules present at the begin- ning of the simulation can be set by the user. Since some type of particles are products of a reaction between two or more particles, their total number cannot be controlled by the user and might be significantly low. In this case, the statistical analysis will not provide meaningful and reliable information. As it can be observed from Figure 3.6 the yellow curve is highly unstable, especially for the higher order moments, such as skewness or kurtosis. The yellow curve corresponds to the type of molecules that are able to actually reach the nucleus, as it can be seen in Figure 3.8 and in the current simula- tion their number varies from 2 to 8, which is much less information than is necessary for the statistical analysis to extract realistic characteristics about this type’s distribution. Severe fluctuations of these parameters can also be observed when particles of a certain type are formed for the first time and their number continues to increase until it reaches a rather stable count. Thus, it is recommended to look at the values of the statistical parameters from the moment when the curve tends to stabilize and has a smother trend. Figure 3.8: Simulation configuration To make it more easier to put these parameters into context, a function that computes the histogram of each time step over the cell’s diameter of each molecule type’s distribution was implemented. The size of the interval set to compute the histogram is left as a parameter and can be further adjusted if the simulation characteristics requires it. To view this histogram, a 2-D plot was added to the main window. The plot represents the histogram of the current time step and during the visualization the view is redrawn so that the new loaded time step is taken in consideration. Considering the same simulation used in Figure 3.6 we can have a look at the histogram of each type of molecules at the first time step (Figure 3.9 ). From this figure we can already form an opinion on how the data is distributed. The dark green curve corresponds to the receptor molecules 23 3.6. Results that lie only at the membrane of the cell, therefore the respective distribution is almost uniform. If we consider the orange curve, we can observe normal distribution like shape. But if we look at the corresponding kurtosis value, we see that this value is almost constant and is close to -1 for all three axes, meaning that this distribution resembles more a Wigner semicircle distribution. This makes sense, since none of the particles can occupy the space inside the nucleus, so even though the mean value is almost always the same as the center of the nucleus, the number of particles close to the mean is smaller then without the occupied space of a nucleus. Figure 3.9: Histogram view at time step 0 The red curve, initially has a bounded distribution between 3 and 7, as it can seen in Figure 3.9 and at the end of the simulation its distribution is bounded only the membrane radius (Figure 3.10). If we take a look at the evolution of the statistical parameters, we can see how this evolution takes place. The estimation of the mean doesn’t change, because the majority of the respective type of particles still tends to group around the center of the cell, but the standard deviation increases from 1 to almost 2.5, indicating a spreading of the molecules, which can also be observed from the histogram at the last time step. In almost all cases, the skewness is approximately zero. Only the light blue curve has a negative skewness on X and Z axis, meaning that the cor- responding distribution has a longer tail to the left side of the mean. This is due to the fact that the current simulation is characterize by an asym- metric cytoskeleton, that is polarized in the same direction. The light blue molecules represent signaling molecules that bounded with motor proteins, thus with the cytoskeleton and can move along the filaments. Therefore an asymmetry in the cytoskeleton also translates into an asymmetry in this type of molecule distribution. 24 CHAPTER 3. MOMENT-BASED STATISTICAL ANALYSIS Figure 3.10: Histogram view of last time step The histogram view, together with the computed statistical parameters provide quantifiable information about the particles distribution, making it easier to understand the dynamics involved in the simulation and how the different type of molecules are in involved or influenced in the signal transduction process. In order to verify and justify the need for estimating these parameters, we used an external tool to try to fit a distribution given just the first four moments and the size of the sample population. The tool that was used is called R and it represents both a free software programming language and a software environment for statistical computing and graphics [17]. Using two special packages like “SuppDists” and “PearsonDS” to include additional distributions such as Pearson or Johnson, we were able to find distributions that were characterized by the same mean, variance, skewness and kurtosis as the ones obtained from the simulation. In this case, the Pearson distribu- tion of type IV was used, because there is a direct connection between the first four moments and the parameters that characterize this distribution. Pearson Type IV is used to fit observed distributions obtained from data or Monte Carlo simulations and its probability density function is given by: f(x)dx = k [ 1 + ( x− λ a )2]−m exp [ −ν tan−1 ( x− λ a )] dx (m > 1/2) (3.6.1) where m, ν, a and λare real-valued parameters and -1 < x < 1 (k is a normalization constant that depends on m, ν and a)[18]. The symbols of m, ν, a and λ described by Karl Pearson in [15]. To obtain the desired distribution, we can just use the first four esti- mated sample moments: mˆ1 ′ , mˆ2 ′ , mˆ3 ′ and mˆ4 ′ and the Pearson’s type IV 25 3.6. Results parameters can be derived from the following equations: r = 2(m− 1) = 6(mˆ4 ′ − mˆ3′2 − 1) 2mˆ4 ′ − 3mˆ3′2 − 6 (3.6.2) ν = − r(r − 2)mˆ3 ′√ 16(r − 1)− mˆ3′2(r − 2)2 (3.6.3) a = √ mˆ ′ 2[16(r − 1)− mˆ3 ′2(r − 2)2] 4 (3.6.4) λ = mˆ1 ′ (r − 2)mˆ3′ √ mˆ2 ′ 4 (3.6.5) Using the Pearson type IV distribution inside the R package and the statistical parameters corresponding to several types of particles at a given time step we can try to find a probability density function that fits our data distribution. For starting we choose to take a look at the molecule type “mol2” from the simulation configuration shown in Figure 3.8 and we can compare the probability density function obtained with R with the histogram obtained from the simulation e.g. at times step zero for X axis (Figure 3.11). (a) (b) (c) Figure 3.11: Distribution fitting for molecule of type mol2: (a) mol2 his- togram from CellVis. (b) mol2 histogram from R. (c) mol2 probability density function from R. It can be observed that the fitting results correspond approximately with the initial histogram data and the resulting probability density function can be saved and used further to compare with other simulation results in order to make a generalization. The considered distribution is symmetrical and we would like to see if it is also possible to fit a probability density function on other types of distributions. For example, lets consider the case of the molecules of type Receptor A from Figure 3.8. 26 CHAPTER 3. MOMENT-BASED STATISTICAL ANALYSIS (a) (b) (c) Figure 3.12: Distribution fitting for molecule of type Receptor A: (a) Re- ceptor A histogram from CellVis. (b) Receptor A histogram from R. (c) Receptor A probability density function from R. From Figure 3.12 we can see that also in the case of an almost uniform distribution the fitting operation provides encouraging results for the use of the method of moments. As a last example, we consider the distribution of molecules of type Mol B Skeleton from Figure 3.8, which is a skewed distribution and we would like to establish if we can find a probability density function that has approximately the same four moments and a resembling histogram with the one computed from CellVis. (a) (b) (c) Figure 3.13: Distribution fitting for molecule of type Mol B Skeleton: (a) Mol B Skeleton histogram from CellVis. (b) Mol B Skeleton histogram from R. (c) Mol B Skeleton probability density function from R. For the Mol B Skeleton type of particles, the last time step was consid- ered instead of the first, due to the fact that these type of molecules are a result of other particles interaction with the cytoskeleton and therefore they are not present in the initial state of the simulation. Figure 3.13 shows that also in the case of an asymmetrical distribution, fitting a probability density function using the method of moments can provide satisfactory results. In order to make it possible to handle the data resulting from the sta- tistical analysis module in CellVis with other tools specially designed for statistical analysis and computation, a “save” function was implemented. With this function it is possible to export the plot as gnuplot or svg. 27 Chapter 4 History of Individual Molecules In this work the history of individual molecules is represented by the particle trajectory and its possible reaction partners. Traditionally two or three dimensional plots are employed to visually represent the trajectory and dynamics of a single protein [19, 20]. Instead of graphs, Falk et al. proposed a visualization tool that uses three dimensional glyphs [2]. 4.1 Trajectory of Single Molecules It is of interest to investigate the trajectory of a certain particle and to observe if the respective particle changes its type as a result of a reaction with other molecules or interaction with the cytoskeleton or nucleus. A particle is identified by a unique ID. By tracking a certain molecule it is possible to investigate how this particle propagates through the cell’s environment. It is also possible to determine the type of motion of the particle, if is by diffusion or by motor proteins. The trajectory might indicate if the particle was involved in a reaction, if it was replaced by another protein and if the particle maintained its direction towards the nucleus. Particles can also remain blocked within the cytoskeleton filaments. The trajectory is determined by computing the distance from the con- sidered particle’s position to the nucleus position at each time step available in the visualization. Thus, another 2D plot view is added to CellVis to show the trajectory as a 2D curve, where the X axis represents the distance to the nucleus and is bounded by the nucleus radius as minimum and the membrane radius as maximum and Y axis represents the number of time steps. The choice of using a 2D graph to show the particles trajectory resulted from the intention of providing a clear and simple graphical representation of the particles motion. Since mostly the particle’s movement is achieved by 29 4.2. Identification of Reactions and Reaction Partners diffusion, characterized by Brownian motion, a 3D view of the particle’s tra- jectory can become fuzzy and the general trajectory or direction may become hard to observe. By plotting the trajectory with respect to the particle’s distance to the nucleus, it is easy to observe the particle’s motion tendency, whether is directed towards the nucleus, tends to remains in certain region or even switches its direction. 4.2 Identification of Reactions and Reaction Part- ners To fully understand the dynamics of a certain particle, besides analyzing its motion, it is important to identify also the transformations the respective particle suffers along its path. These transformations might be the result of a reaction in which the particle might become involved. There are several types of reaction that are covered and can be simulated in CellVis and they correspond to the types of reactions presented in Section 2. In the context of CellVis, reactions are a result of a collision between two particles or the spontaneous decay of one particle [2]. The newly formed particle is placed at the same position of the involved reactants, since other placements might not be possible due to the presence of other cell components. Not only reactions can influence the state of a particle, but also inter- actions. Interactions consist of a binding process between two proteins. As a result of this binding, a protein complex might be formed or a protein might carry another protein through the cytoplasm. In this thesis, two types of interactions are considered, the first represents interaction with the cytoskeleton and the second type is an interaction with the nucleus. With the available information about the particles trajectory, we can now try to determine when a reaction takes place and what are the particles are involved. Determining when a reaction occurs is relativity easy, since information about each particle’s type is available at each time step. So when a change of type is found or when the particle’s ID is not found in the next time step, we can suspect that a reaction or interaction took place. Finding the other reactions partners, is a more sensitive problem, since it can be a matter of guessing. If the data resolution coincides with the sim- ulation resolution then the identification of reaction partners yields precise results. When a reaction or interaction has taken place, can be easily de- termined by investigating the type of the chosen particle. To determine the other reaction partners, we need to analyze the information corresponding to the time step where the particle’s type has changed and also the informa- tion at the previous time step. For each type of particle information about the reactions or interactions that it can be involved in is available from the simulation configuration. Further, for each type of reaction or interaction the number of molecules involved are stored together with the role of each 30 CHAPTER 4. HISTORY OF INDIVIDUAL MOLECULES type of molecule: reactant, product or enzyme. Depending on what role the particle of interest plays in the reaction, e.g. is a reactant, product or enzyme, the type of the other reaction partners is identified and group of potential reaction partners is chosen with respect to their distance to the particle of interest. From this group of potential partners, depending on how many partners are left to be determined, we choose the ones that corre- spond to a smaller distance. The search for reaction partners is performed by handling two consecutive time steps, the time step when a change in the state of the selected particle has occurred and its corresponding previous or next time step. The pseudocode for determining reactions and reaction partners is listed in the appendix A. In the case of an interaction, just two proteins are involved and the signaling molecule receives the type of the carrying protein. Therefore we just signal the interaction by changing the color of the protein to the new type and mark the time step when this change is identified. The reference distance is chosen also to include larger distances up to 1/5th of the cell radius. The information from consecutive saved time steps are not necessary two consecutive time steps in the actual simulations. As a result it may seem that a reaction just took place and we would expect that the reaction partners in the next time step should be really close to the particle of interest, but in the real case it might not be so and the reaction partners can be actually further apart. A simulation can contain more than 1 million time steps. It is not possi- ble to store all the time steps from the simulation. By omitting information from the simulation, the process of determining the reaction partners is a probabilistic process, since there is no certain way to determine if the chosen reaction partners are the real reaction partners. However, the incremental changes in the particles positions from one time step to another are insignif- icant and it is possible to choose a data resolution without loosing valuable information. 4.3 Results To illustrate the results in determining the trajectory and the reaction partners of a certain molecule, a test configuration was considered as in Figure 3.8. An example of trajectory view is shown in Figure 4.1. The curve color matches the color of the molecule type, as it was set initially in the simulation configuration. The red markers are meant to indicate an event in the history of the particle. This event can be either a reaction either an interaction. The two cases shown in the above figure cover two types of interaction as they were defined in Figure 3.8. In both cases it is noticeable that when an interaction takes place, either with the cytoskeleton filaments or with the nucleus the particle’s trajectory becomes much smother and is 31 4.3. Results not characterized by a Brownian motion anymore. (a) (b) Figure 4.1: Trajectory view (a) with cytoskeloton interaction (b) with nu- cleus interaction The particle can be picked from the 3D view of the CellVis by clicking on it. If the add particle trajectory button is pushed, then, besides the 3D view of its path already implemented, also the plot trajectory is triggered. The trajectory computation takes all time steps from the current loaded simula- tion into consideration. Since the particle’s trajectory is only computed as a result of a user’s interaction, time steps are loaded separately and all other views are not affected by this. Therefore it is possible to compute a particle’s whole trajectory at any time step in the simulation and once the plotting trajectory action is triggered, all the time steps are loaded sequentially and the particle’s position is searched in every time step Figure 4.2: Trajectory of type mol3nucleus The trajectory view can be used not just for one particle, but also for several molecules, in case we are interested in several particles. To illustrate 32 CHAPTER 4. HISTORY OF INDIVIDUAL MOLECULES this, a group of molecules that have reached the nucleus by the end of the simulation is chosen (Figure 4.2). This view can be useful, if we want to determine and observe any general tendencies that might be indicated by the graph. Unfortunately, the graph in this case becomes hard to understand and the user can have difficulties to distinguish between the trajectories of dif- ferent particles. In this sense, another graph is introduced, a graph where only the particles distance to the nucleus when a reaction or interaction takes place are plotted. Consequently, it would be more easy to observe where some types of reaction are more likely to take place (Figure 4.3) and a reaction distribution can be observed. Figure 4.3: Reaction view of type mol3nucleus The X axis remains as the distance towards the nucleus, but Y axis becomes a count of the number of reactions in which a particles is involved. Now the graph is clearer and an idea of how several types of reactions are distributed can be formed. This type of graph can be useful also to illustrate a chain of reactions between several molecules. Figure 4.4: Simplified MAPK cascade simulation configuration To illustrate an example of reaction history for a chosen particle, a simu- lation with a simplified MAPK cascade was used (Figure 4.4). The enzymes 33 4.3. Results from the receptor complex activate the MAPK, resulting the phosphoryl- ized MAPKp. The MAPKp can be deactivated, dephosphorylized by the active Phosphatase. As a result of this type of reaction the Phosphatase also becomes inactive, but can return to its active state by suffering the reactivation reaction. It can be observed from the simulation configuration, that in this case we are dealing with enzymatic reactions, first and second order reaction, cytoskeleton interactions and nucleus interactions. Figure 4.5 shows the trajectory of the molecules involved in the signal transduction process. The gray curve represents the trajectory of an enzyme from the receptor complex situated at the membrane. This enzyme binds with the MAPK and now MAPK becomes MAPKp, which is then attacked by the Phosphatase. At time step 38 we can notice that a dephosphoraliza- tion reaction takes place and interferes in the signal transduction process by phosphorylizing the MAPKp and preventing it from reaching the nucleus and transmitting the signal. (a) (b) Figure 4.5: Particle history (a) Particle history including trajectories (b) Reaction history The reaction history view is shown in 4.5(b) for a more clearer look at the reactions placement. However, in this case the reaction history view turns out to be inexact. From Figure 4.5 we can see that reactions do not align, since on Y axis we count the reactions for each particle. In this case, the first reaction for the Phosphatase corresponds to the second reaction of the MAPK and thus results in erroneous information. If we were to properly align the reactions, then the Y-axis does not make sense anymore, since the reaction count will be wrong. Therefore, it is recommended to use this type of plot just for illustrating the reaction of one particle per reaction. In order to maintain the original approach, of using a graph that shows all the partners involved in a reaction and the positioning of the reaction, the Y axis will have to eliminated and the graph represented just as tree-type of view. 34 CHAPTER 4. HISTORY OF INDIVIDUAL MOLECULES (a) (b) Figure 4.6: Particle history (a) Particle history including trajectories (b) Reaction history An example of activation pathway that ends in the signaling protein reaching the nucleus can be seen in Figure 4.6. As a study case, we might also be interested to determine how many MAPKs one receptor is able to activate. For this we picked a random enzyme from the receptor complex, computed its trajectory and determined several MAPKs that were activated by this enzyme. In Figure 4.7(a) the trajectories of the implicated MAPKs and the chosen enzyme are drawn. It can be observed that the only protein that finally reaches the nucleus is the one that also benefits by transport with the motor proteins. The role of motor proteins in the signal propagation will be further investigated in Chapter 5. For this case, we can also use the Reaction History Plot, since in this case the MAPKs do not react with each other and the reactions do not have to be aligned (Figure 4.7(b)). (a) (b) Figure 4.7: Receptor history (a) Receptor history including trajectories (b) Reaction history 35 Chapter 5 Comparative Visualization 5.1 Previous work In scientific data analysis, comparison of different data sets is an impor- tant function, whether the data comes from the same source or not. It is of interest to find similarities or differences in several data sets. In order to do so, a comparison can be made between different models, different res- olutions, differences in algorithms, different experimental results, etc. One of the most important motivation for comparative data analysis is model validation. In this case, the output of numerical models can be compared with experimental data. A more suggestive method for comparative data analysis is to visually represent similarities or differences between the data sets of interest. This method is called comparative visualization. Simple 2D graphs are often used to show time development of numerical characteristics of data sets and therefore, can be easily used to show the differences between these characteristics, by “A-B” type of comparison. More complex meth- ods for comparative visualization can be represented by three main areas: image-based, data-level and topological comparisons [21, 22]. Image level comparison can be considered one of the simplest and com- mon approach for interpreting the differences between two or more data sets. This method first generates images from the data sets that are about to be compared. The simplest way to perform the comparison is to show these images side-by-side. However, this approach burdens the user with the task of identifying the regions that might represent a difference and to quantify it. Although showing images side-by-side does not provide any reliable and quantifiable comparison data, this approach is well suited for comparing more then two data sets. Another method of image level comparison is to use difference images. In this case, two registered images are simply subtracted and the compari- son is done pixel-by-pixel. This approach provides the user with an actual numeric information about the differences between the two data sets and 37 5.2. Comparative Visualization in CellVis with the use of different masks or filters, certain types of differences can be highlighted. The interpretation of the image difference is dependent on the used color map. Since the data color mapping is typically a non-linear transformation, the interpretation should be done accordingly. The draw- back of this method is that one-to-one correspondence between pixels must exist. Also, image level comparison is dependent on the transfer function used in the direct volume rendering (DVR) process. For example, some DVR algorithms interpolate color values and other interpolate data values, thus the result can be significantly different. Therefore, another drawback of the image based comparison is that this approach does not take in consid- eration the methods of rendering. To overcome this, Williams and Uselton [23] provided guidelines to objectively compare rendered images, by speci- fying certain rendering parameters, methodology for difference classification and a set of metrics to quantify and classify the differences, or similarities between two volume rendered images. Data level comparison techniques use actual raw data of the compared data sets. Thus the comparison is done first, at the data level and after the result is finally visualized. This approach was used by Trapp and Pagen- darm [24] to compare aircraft design and more recently by Childs et. al. using cross-mesh field evaluations and derived quantities [25]. To make use of these method, one must chose a proper metric for performing the com- parison. This can translate in a drawback for this approach, since the data comparison looses its generality and becomes application dependent. Sev- eral metrics used for data level comparison in volume rendering algorithms are analyzed in detail by Kim and Pang [26]. Feature based comparison does not use the raw data but certain features extracted from the data set. This technique is more flexible then the other two comparison method, because the user has a range of features to chose from to identify differences in the data sets. Also, feature level comparison enables a more complex analysis of the data sets. Feature based comparison can be viewed as an extension of data level comparison. The main differ- ence is that feature based comparison doesn’t compare the data itself but the features extracted from it. An example on how to compare two meth- ods of calculating streamribbons using a combination by side-by-side and overlaying comparison was presented by Pagendarm and Post [27]. 5.2 Comparative Visualization in CellVis In this work, the raw data is represented by the particles position and types at each time step. Since comparing the particles position by directly calculating the difference of positions does not make any sense, other mea- sures of quantifying the particles spatial distribution were chosen to perform the comparison. These measures are represented by the concentration of the 38 CHAPTER 5. COMPARATIVE VISUALIZATION particles inside the cell and the histogram of their spatial distribution. In chapter 3 a set of statistical parameters were estimated and visualized for a single data set. Thus, the idea of comparing two data sets with respect to these parameters comes naturally. The two different simulation data can be obtain by setting differently their initial state. This can be accomplished by choosing different random seeds for the same configuration or by keeping the same configuration and modifying the random seeds. The comparison is however limited to data sets that are characterized by the same number of simulation steps and with the same shape characteristics of the cell and nucleus. The comparative visualization is achieved by performing the comparison between the first four statistical moments of the two data sets and between the concentrations and histograms. These differences can indicate a change in clustering, spreadness, skewness and uniformity of the spatial distribution described by the two data sets. To visualize these differences a set of 2D graphs were employed. Besides the 2D graph, for the differences in concen- trations, also a 3D volume is rendered. To render these differences, another transfer function is chosen to map the differences in the density values to color and opacity. The protein concentration or density is approximated by defining volume regions and computing the corresponding local densities. These regions are defined by subdividing the cell into volume elements (voxels). Then for each voxel the number of enclosed molecules is determined. The density is then computed by dividing the number of molecules by the voxel’s volume [28]. The histogram is computed in very similar way as the concentration, just that in this case the histogram is computed for each of the three dimensions and the range of the histogram is the cell’s diameter. This range is divided into equal intervals and for each interval the number of contained molecules is determined. The decision to have also a histogram view besides the con- centration was made in order to illustrate also possible asymmetries in the spatial distribution. Since CellVis was originally designed to support just one data set at a time, the task of loading and providing the same functionalities for a second data has to first reconsider the original structure of the framework. Therefore, the actual comparison is performed using external tools, like MATLAB and Gnuplot. Since every plot can be saved and its data exported to external files, the manipulation of the actual data using external tools is relatively easy. The difference in the statistical parameters is carried out in MATLAB, the result is saved as gnuplot files and visualized using Gnuplot. In CellVis a function that loads the information of a second data set is integrated. This function just handles the simulation information, but does not update any views for the second data set. The function is created so that a rendered volume representing the difference in concentrations of two data sets can visualized if the user desires it. 39 5.3. Results For comparison, three types of different data sets representing a simpli- fied MAPK cascade (Figure 4.4) are chosen. In the first two cases, data sets with same starting random seeds are considered, but different simula- tion configurations. In the first case the transport by motor proteins was deactivated for the second data set and in the second case the inactiva- tion reaction of the Phosphatase was disabled. Last, we consider two data sets characterized by the same simulation configuration but with different starting random seed. 5.3 Results 5.3.1 Concentration and Histogram-based comparison Before performing the comparison between two radial concentration pro- files corresponding to different simulation data sets, we used a rectangular filter to smooth the computed concentration profile. Initially the concen- trations were computed by dividing radius in a fixed number of intervals. In this manner the resulting concentrations present abrupt changes because each interval is considered separately. By using a rectangular filter, we consider also proteins from neighboring intervals when computing the final corresponding concentration. The window of this rectangular filter is con- sidered a parameter when computing the radial concentrations, thus it can be easily modified to adjust the filter size. Figure 5.1: Radial concentration profile Figure 5.1 shows the two concentration profiles at the same time step. 40 CHAPTER 5. COMPARATIVE VISUALIZATION The first graph illustrates the results of using a rectangular filer, as the second one represents the initial calculated concentrations. It is easy to see that by using this filter we obtain smoother curves that might reflect better the reality. However, when dealing with molecules that are localized only on certain region, is best to adapt the size of the interval according to the size of the region. This could be the case of the particles that form the receptor complex, that can be found only at the membrane. In this case the use of a rectangular filter is not indicated since it would suggest a wider spreading of these molecules than in the real case. To present the results of the comparative visualization with respect to radial concentration and histogram, we chose to compare the concentration profile and the spatial distribution histogram at the first and last time step of the simulation, in order to indicate the differences in the starting state of the simulation and in its outcome. (a) (b) Figure 5.2: (a) Concentration and (b) histogram difference plot for two different simulations, with the second simulation having transport by motor proteins disabled However for the first two cases that consider data sets with the same starting spatial distribution, the difference in concentrations at time step zero would be equal to zero and therefore was not chosen to be viewed. In Figure 5.2 the difference in the concentrations and histogram at the end of the simulation are represented and it can be observed that high differences 41 5.3. Results are indicated in the case of Target type molecules and MAPKp*. Con- sidering the fact that the second data set corresponds to a configuration that does not enable transport by motor proteins, it is to be expected that the respective data set will not include MAPKp that can interact with the cytoskeleton ( MAPKp* ). By disabling transport by motor proteins, the number of proteins that reach the nucleus ( Target ), is significantly smaller or even equal to zero, thus indicating the important role of motor proteins in the signal transduction process. From the histogram difference plot one can see that the number of phos- phorylized MAPK ( MAPKp ) is lower in the first data set (indicated by a negative difference), due to the presence of motor proteins that transform a part of MAPKp into MAPKp*. In Figure 5.3 the differences in histograms and concentrations corre- sponding to the last time step of the simulation for the second type of comparison are shown. In this case all types of molecules are present at the end of the simulation. It can be seen that the difference in concentrations is almost insignificant. Considering that the reactivation reaction of the Phos- phatase is disabled for the second data set’s configuration, one can conclude that this type of reaction rarely takes place and therefore its deactivation does not produce significant changes in the outcome of the simulation. (a) (b) Figure 5.3: (a) Concentration and (b) histogram difference plot for two simulations, with second simulation having the inactivation reaction disabled 42 CHAPTER 5. COMPARATIVE VISUALIZATION For the third type of comparison two data sets that differ by the initial spatial distribution of the molecules are considered. In this case it is neces- sary to have a look at the differences of concentrations and histograms also at the first time step, to identify significant initial similarities or disparities and whether or not they are constant or evolve during the simulation. The initial difference is illustrated in Figure 5.4. There are no significant dif- ferences in any of the present molecule types. This can be justified by the fact that even though the spatial distribution is different the total number of molecules and the placement of the cytoskeleton is kept the same. This type of comparison is useful when we are investigating several samples of the same type of biological system to see if its dynamic presents the same char- acteristics through the entire simulation or it presents a completely random behavior. (a) (b) Figure 5.4: (a) Concentration and (b) histogram difference plot for two simulations with the same starting random seeds, at time step zero The difference in the outcome of the two simulation is shown in Figure 5.5, where it can be observed that in both cases the target (nucleus) is being reached by a similar number of signaling molecules and also cytoskeleton interactions occur in similar quantities. Again, the calculated differences are insignificantly small, suggesting that just by varying the initial positions of the particles involved in the simulation, without considering any change in the reaction configuration or cytoskeleton polarization, does not produce 43 5.3. Results meaningful differences in the signal transduction process. From this we can conclude that the dynamic of the intracellular processes is not affected by changing of the initial particles positions as long as they present the same distribution. (a) (b) Figure 5.5: (a) Concentration and (b) histogram difference plot for two simulations with the same starting random seeds, at the last time step 5.3.2 Statistical Parameters-based comparison Performing also a statistical parameters based comparison completes the comparative visualization and enriches the data analysis process. Analyzing similarities or differences in these parameters together with the difference of concentrations and histograms provides more information about the actual similarities or differences in the considered data sets, the outcome of the two simulations and in the signal propagation process. This comparison is accomplished by computing the differences of the evolution of the first four statistical sample moments and the results can be viewed in similar manner as in the previous section, by the use of 2D graphs. The difference of the mean can indicate a different tendency of clustering around a certain value in the considered data sets, as for the difference of the standard deviation can reflect differences in the data spreadness. Taking in consideration also the difference in skewness and flatness of the data distribution results in a more complex comparative analysis. 44 CHAPTER 5. COMPARATIVE VISUALIZATION For comparison, the same three cases as analyzed in the previous section were considered. In Figure 5.6, 5.7, 5.8 the results of the comparison of the two data sets corresponding to the first type of comparison are illustrated. The absence of Target and MAPKp* proteins in the case of the data set with deactivated transport by motor proteins can be observed also here. (a) (b) Figure 5.6: (a) Mean and (b) standard deviation difference plot between two simulation with the same random seeds but, with the second simulation having transport by motor proteins disabled In Figure 5.6 the evolution through time of the difference between the mean and standard deviation is shown. While the graph representing the dif- ference in mean values does not indicate any significant difference in the time evolution of the two data sets, the standard deviation comparison indicates a larger spreadness in the case of inactive Phosphotase and the phosphoryl- ized MAPK. This is caused by the lack of transport by motor proteins in the second simulation and thus the transport towards the nucleus is significantly slowed and the phosphorylized MAPK tend to remain more closely to the membrane. Since the motion of the MAPKp is slowed down and not many of them can get closer to the nucleus, it means that less dephosphoryliz- ing reactions take place further from the membrane, therefore the inactive Phosphatase is unlikely to appear closer to the nucleus, thus justifying the larger spatial deviation of the inactive Phosphatase in the first data set. 45 5.3. Results (a) (b) Figure 5.7: (a) Skewness and (b) kurtosis difference plot between two simu- lation with the same random seeds but, with the second simulation having transport by motor proteins disabled Figure 5.8: Correlation coefficients difference between two simulation with the same random seeds but, with the second simulation having transport by motor proteins disabled 46 CHAPTER 5. COMPARATIVE VISUALIZATION Figure 5.7 shows the difference of the evolution of skewness and kurtosis of the considered data sets. The difference in skewness is hard to interpret and reflects more the stochastic nature of the signal transduction process. The difference in kurtosis, however reflects the difference in the simulation configuration. If we consider again at the phosphorylized MAPK case, one can observe positive difference in the evolution of kurtosis. This indicates that in the case with inactive direct transportation the spatial distribution of the MAPKp is more flat and does not present a high clustering around the mean (nucleus). This is again an effect of the slow motion of the particles, that in this case move only by diffusion. Figure 5.8 shows the difference of the evolution of the correlation coeffi- cients that describe the correlation between the spatial distributions of one type of particle on each dimension. However, in the considered simulations, there are no general tendencies of the correlation coefficients and therefore also their difference presents a noisy like evolution. (a) (b) Figure 5.9: (a) Mean and (b) standard deviation difference plot between two simulation with the same random seeds but, with the second simulation having the inactivation reaction disabled The results of the second type of comparison, where the effect of disabling the reactivation reaction of the Phosphatase was taken in consideration, are presented in Figure 5.9, 5.10, 5.11. In Figure 5.9 the difference of the evolution of the first two statistical moments can be visualized. It can 47 5.3. Results be observed that when the curves become stable there are no significant differences. The high differences in both mean values and standard deviation values are an indicator of a delay in the appearance of certain types of molecules in the case of the second simulation. This delay can be observed in the case of the inactive Phosphatase, the phosphorylized MAPK that binds to the cytoskeleton ( MAPKp* ) and the the MAPKp that reach the nucleus. Analyzing Figure 5.10 and 5.11, where the difference in skewnees, kurto- sis and correlation coefficients is visualized, the same delay can be observed when higher difference values are reached. Besides the observed delay, the two simulations seem relatively similar with respect to their outcome, con- clusion that was also drawn in the previous section, when the concentration profile and histograms were compared. (a) (b) Figure 5.10: (a) Skewness and (b) kurtosis difference plot between two sim- ulation with the same random seeds but, with the second simulation having the inactivation reaction disabled In Figure 5.11 we can notice differences in the correlation coefficients that have an absolute value higher then 1. We already know that correla- tion coefficients are bound between -1 and 1, therefore their difference will be bound by -2 and 2. If of interest is just to see if there is a difference in correlation and not also the direction of this correlation, the difference can be evaluated by using the absolute values of the respective correlation coef- 48 CHAPTER 5. COMPARATIVE VISUALIZATION ficients and for further analysis of the direction of the correlation to consult directly the coefficients of the two simulations. Since none of the consid- ered configurations provide any linked restraints in the particles movement on each axis, these correlation coefficients are expected to be close to zero when the molecule count stabilizes and their differences would result just in a noisy signal. Figure 5.11: Correlation coefficients difference plot between two simulation with the same random seeds but, with the second simulation having the inactivation reaction disabled The results of the simulation comparison between two data sets that share the same simulation configuration but with different initial random seeds can be viewed in Figure 5.12, 5.13 and 5.14. In the first figure, the difference of the mean and standard deviation values are presented. In this case, again it is observable a delay in the inactivation of the Phosphatase, in the use of motor proteins and in reaching the nucleus. Regarding the mean value, on all the three axes the differences vary slightly around zero, when large enough number of molecules of the respective type is reached. The difference in standard deviation values indicate some difference in spreadness in the case of the inactive Phosphatase and MAPKp transported along the cytoskeleton’s filaments. 49 5.3. Results (a) (b) Figure 5.12: Mean and standard deviation difference plot for two simulation with different starting random seeds (a) (b) Figure 5.13: Skewness and kurtosis difference plot for two simulation with different starting random seeds 50 CHAPTER 5. COMPARATIVE VISUALIZATION In Figure 5.13 , we can observe differences in the skewness and kurtosis of the two simulations. Since skewness and kurtosis can both take also negative values, to correctly interpret the result of the differences between different values they take, it is necessary to analyze the subtracted values themselves and their sign. For example on the Z axis, in the inactive Phospahatse case, a negative difference in skewness is present. Without knowing the sign of the original skewness values of the two simulations, we can not state whether this indicates a skewness on different sides or simply a more skewed distribution on the same side. Also, observing that the higher differences appear in the first half of the simulation, we can conclude that these differences can be influenced also by the small number of particles of the respective type and that a more reliable trend of these differences can be viewed after a sufficiently large number of particles has been reached, e.g. in the second half of the simulation. Since in this case the comparison is made keeping the same simulation configuration, it is to be expected that the results of the two simulations would be similar. Therefore, it is not of so much interest in comparing the spatial distributions, when there is no difference in the configuration that might alter the shape of these distributions. In this case, information from several simulation runs can be extracted to make an analysis on how fast the external signal propagates and how many of the signaling proteins are likely to reach the target, therefore performing another statistical analysis with respect to several data sets that present the same characteristics in their molecules distributions. Figure 5.14: Correlation coefficients difference plot for two simulations with different starting random seeds 51 5.3. Results 5.3.3 Concentration Difference Volume For rendering the concentration difference volume the difference in con- centrations in computed. The protein density computation was already an integrated function in CellVis and is done automatically when loading the particles positions into the 3D view. To compute the difference of densi- ties, we load the second set of proteins over the first one and when setting the shader parameters we use a negative density scale value, therefore sub- tracting the concentrations of the second data set from the first one. The resulting scalar field is mapped to a color range. For positive values of concentrations the color map from Figure 5.15, where small differences are mapped to blue and big differences are mapped to red. The negative values are mapped to a grayscale color map, where large negative differences are close to black and small negative differences are close to white. (a) (b) Figure 5.15: Color map (a) for positive differences (b) for negative differences Very small differences in concentrations, that are approximately zero are mapped with a total transparency, to illustrate the similarities in the clustering of proteins of the two considered data sets. To present the results for this last part of the comparative visualization section, the same three cases of comparison as in the previous sections were chosen. (a) (b) (c) Figure 5.16: Concentration difference volume for two data sets, one with no cytoskeleton intereaction (a) for time step 0 (b) for intermediate time step (c) for last time step Figure 5.16 illustrates the evolution over time of the differences in con- centrations when one of the two data sets does not include transport by motor proteins. For the first time step, no volume is visible since the ini- tial positions of all implicated molecules are the same for both data sets 52 CHAPTER 5. COMPARATIVE VISUALIZATION and therefore the difference in null. When analyzing the resulting rendered volume at an intermediate time step, it can be observed that positive dif- ferences are migrating towards the center, as for negative differences remain closer to the membrane. This is justified by the fact that the first data set benefits of the transport by motor proteins that facilitate and speeds the signal transduction process. Since proteins can move faster towards the nucleus in first simulation data, a higher concentration for this data set is expected around the target. For the second data set, transport is only en- sured by diffusion and most proteins can not move as fast and remain at a considerable distance from the nucleus. The last time step shows the same effect: the outcome of the first simulation data produces a more crowded environment around the nucleus. Figure 5.17 illustrates the results from the second comparison case, when the reactivation of Phosphatase reaction is deactivated for the second loaded data set. The first time step reveals the same effect as the first case. The two data sets having the same initial states, the starting concentrations difference volume is also equal to zero. The intermediate time step does not suggest any visible tendencies in the concentration difference. The last time step however, reveals a larger distribution of the negative difference values, especially around the nucleus. Considering that in the second simulation data, the Phosphatase can not be reactivated and therefore the activated MAPKp can not be attacked by it and deactivated, it could make sense that in the case of the second data set more phosphorylized MAPKs are able to reach the target. (a) (b) (c) Figure 5.17: Concentration difference volume for two data sets, one with reactivation reaction of Phosphatase disabled (a) for time step 0 (b) for intermediate time step (c) for last time step Figure 5.18 illustrates concentration differences between two data sets that have the same simulation configuration but have a different initial state. In this case, the evolution and the outcome of both simulations seem to be very similar, as it was deduced also from comparing the radial concentrations 53 5.3. Results and the statistical parameters. In this case, as a plus, it can be observed that the original distribution of concentration differences tend to remain the same. Therefore we can conclude then that initial crowded regions do not migrate significantly, they just suffer a relatively small change in shape. (a) (b) (c) Figure 5.18: Concentration difference volume for two data sets with the same simulation configuration, but different initial random seeds (a) for time step 0 (b) for intermediate time step (c) for last time step (a) (b) (c) Figure 5.19: Concentration difference volume for different density scales (a) same scale with a low value (b) same scale with a higher value (c) first data set with higher scale Figure 5.19 shows how the concentration difference volume can be ma- nipulated. For Figure 5.19(a) and (b) equal density scales were used for both data sets. In the first case a relative low density scale is chosen and the resulting volume can be seen in Figure 5.16(a). If we want to emphasize larger differences (both positive and negative) we can choose a larger den- sity scale as in Figure 5.19(b). In Figure 5.19(c) different density scales were chosen. This could be justified if we want to visualize a weighted difference, in order to emphasize the concentrations of one data set over the other one. 54 Chapter 6 Conclusions and Future Work 6.1 Summary The objective of this work is to provide means for a statistical analysis over the previous obtained simulation data and to expand the CellVis frame- work to handle two sets of data for performing a comparative visualization. For the statistical analysis module, the method of moments is chosen to describe the spatial distribution of certain types of particles involved in the simulation. The spatial distribution is considered separately on every of the three dimensions, since the movements on each of the three axes of a particle are independent. Hence, the first four sample moments are estimated for every time step stored from the simulation, for the spatial distribution on each axis. The evolution of these moments throughout the simulation is illustrated with the use of a set of time series plots, that are updated with every new loaded time step. To facilitate placing these estimated parameters into context, the spatial distribution’s histogram is also calculated and visualized. The histogram for each type of particle is shown and computed for each axis, to provide a schematic view of the shape of the spatial distribution. This shape is further characterized and described by the estimated sample moments. To justify the relevance of using the methods of moments to describe a distribution, an external tool is used to estimate probability function given the first four moments resulting from the analysis of the simulation data. The results are satisfactory with respect to the matching of the histograms, demonstrating that the method of moments can provide enough informa- tion to find a probability density functions that fits the distribution of the sample population. This probability density function is not unique and the fitting operation can provide different results with each try. The estimated probability density function can further be used to predict the behavior of 55 6.1. Summary molecules in other simulations. The correlation coefficients between the spatial coordinates on the three axes were also estimated. The poor correlation between these coordinates is demonstrated by visualizing the evolution of their coefficients. However, if these coordinates are independent, can not be decided just by investigating these coefficients. The statistical analysis section focuses on the general aspects of the simu- lation data. However, it is important also to provide means of investigating particular aspects of the simulation, like the individual trajectory or the chain reaction of a certain particle. The trajectory is computed as the dis- tance from the particle of interest to the nucleus and is illustrated in a 2D graph, showing the trajectory development over time. Analyzing the par- ticles trajectory we can observe if the chosen signaling protein succeeds in transmitting the signal to the target, in this case the nucleus. An analysis of the reactions in which a particle is involved is performed in order to further investigate the particularities of the signal transduction pathway. Determin- ing the reaction partners of a certain molecule can be a probabilistic process when the data disk resolution does not match the data simulation resolu- tion. The only means of identifying the possible reactants is to investigate their position with respect to the particle of interest. It might be possible to detect a reaction, but not find any reactants, since the present search does not cover the whole cell space, but just a volume with the radius of 1/5th of the cell’s membrane around the particle of interest. However, this is an unlikely event, since the considered volume is sufficiently large with respect to the cell’s size. It might also be possible to detect false reactants if two reactions of the same type occur in the same stored time step. This case can also be considered unlikely. To illustrate the results of determining the trajectory and the reaction partners of a selected molecule, a simulation of a simplified MAPK cascade is considered and an example of how the external signal propagates through the signaling molecules is shown. The comparative visualization is mostly carried with the help of other external tools, such as MATLAB and Gnuplot, since the task of providing the same functionalities for a second data set is more complex and CellVis’s structure needs to be reconsidered. However, for the visualization of the differences in molecules concentrations as a rendered volume, a function that completes the loading of a second data set is integrated in the CellVis framework. For the comparative visualization, a set of 2D plots were used to illus- trate the differences in statistical parameters, concentrations and histograms of two second data sets. These plots provide quantifiable information about the similarities or differences between two second data sets. For a more suggestive visualization, the difference in the concentration profile of two simulations were mapped to colors according to a predefined transfer func- 56 CHAPTER 6. CONCLUSIONS AND FUTURE WORK tion and rendered in a form of a volume. This type of visualization also provides the user with the possibility to emphasize the differences by tuning a density scale included in the Property View of CellVis. For the comparative visualization it is important how we choose the con- sidered data sets. For example, if it is of interest to investigate the general tendencies of a type of system, we can choose data sets that are charac- terized by the same configuration, but differ in their initial state. If we desire to investigate the influence of one type of molecules or of different types of transport, we can choose second configuration that has enabled or disabled a certain reaction or activates/deactivates transport by motor pro- teins. According to these cases, a set of different configurations representing a simplified MAPK cascade were chosen and compared. 6.2 Outlook The integrated methods, views and graphs represent just the basics of a data analysis module. While the method of moments is the oldest method of deriving point estimators and almost always produces asymptotically unbi- ased estimators, the resulting estimators are not always the best. To find a better probability density function that fits the simulation data, we can use the first four moments as starting values for a maximum likelihood fitter. The estimation of the first four moments provide useful information in the case of typical distributions, however if some types of particles form multiple clusters and do not have just one central value, the estimated mo- ments are not able to reflect that. In this case, it is useful to determine the forming of clusters and to find means to characterize them. The clusters can be described by their size, e.g. estimated volume or shape, if the shape is conventional or by their motion. It is of interest also to investigate how these clusters evolve trough time, whether they increase or decrease in vol- ume or number of molecules and what is their life expectancy or how fast they form and how fast they spread. Methods to measure the correlation between the motion of several types of particles or the occurring of several types of reactions or interactions can be further developed to understand the role and the connections between all the consisting entities of the cell. It is also important to determine the multiplication factor of each molecule type from one time step to another or their expected lifetime an average travel distances, therefore completing the statistical analysis process. Once we are able to identify when reactions take place, we can perform an analysis of the reaction distribution. It might be of interest to determine regions where certain types of reaction are more likely to take place. For this, we can extract all reactions of a certain type and graphically represent their position. We can also compare the number of reactions that take place 57 6.2. Outlook in a single time step with the theoretical expected reaction count presented in Section 2, in order to validate the employed model. The reaction partners identification algorithm can be further improved by using an intelligent search with respect to distance. This algorithm can start with a minimum distance and further increase it until viable reaction partners are found. For the comparative visualization, first the framework has to be modified to provide enable all the functionalities also for a second data set. Therefore we can integrate the difference graphs obtained with MATLAB and Gnuplot. It could be useful to give the possibility to visualize two simulations side-by- side, together with their corresponding volume of concentrations difference. Also other features can be extracted from the simulation data to be used in the comparative visualization process. We can consider the number of clusters, their occupied volume, their lifetime. The multiplication factor of each type of proteins can be used to make a comparison of two simulation runs, in order to compare the efficiency of the activation process of MAPK. To sum up, the statistical analysis together with the visualization process provide a complete tool for interactive data analysis of the cell’s mechanisms and functionalities. The need for comparative analysis and visualization is present in any research area. Future work can consist of several improve- ments and extend the data analysis process with other functionalities or by employing other approaches. The possibilities mentioned above are just a few improvements that can be further integrated in this framework. How- ever, a collaboration and consultation with biologists can reveal many other needs and other ideas of enriching the data analysis process can emerge. 58 Bibliography [1] I. Rigoutsos and G. Stephanopoulos. Systems Biology: Volume II: Net- works, Models, and Applications (Series in Systems Biology). Oxford University Press, 2006. [page 2] [2] M. Falk, M. Klann, M. Reuss, and T. Ertl. Visualization of signal trans- duction processes in the crowded environment of the cell. Proceedings of IEEE Pacific Visualization Symposium 2009, PacificVis ’09:169–176, 2009. [pages 2, 3, 29, and 30] [3] H. Lodish, A. Berk, P. Matsudaira, C. A. Kaiser, M. Krieger, M. P. Scott, L. Zipursky, and J. Darnell. Molecular Cell Biology. W. H. Freeman Publishers, 2004. [pages 7, 8, 9, 10, and 11] [4] M. Klann. Development of a Stochastic Multi-Scale Simulation Method for the Analysis of Spatiotemporal Dynamics in Cellular Transport and Signaling Processes. PhD thesis, University of Stuttgart, 2011. [page 9] [5] G. Forgacs, S. H. Yook, P. A. Janmey, H. Jeong, and C. G. Burd. Role of the cytoskeleton in signaling networks. Journal of Cell Science, 117:2769–2775, 2004. [page 9] [6] T. Wada and J. M. Penninger. Mitogen-activated protein kinases in apoptosis regulation. Oncogene, 23:2838–2849, 2004. [page 10] [7] B. Kofahl and E. Klipp. Modelling the dynamics of the yeast pheromone pathway. Wiley InterScience, 21:831–850, 2004. [page 10] [8] R. Seger and E. G. Krebs. The mapk signaling cascade. The FASEB Journal, 9 no. 9:726–735, 1995. [page 10] [9] E. K. Kim and E.-J. Choi. Pathological roles of mapk signaling pathways in human diseases. BBA - Molecular Basis of Disease, 1802(4):396–405, 2009. [page 10] [10] L. Santarpia, S. L. Lippman, and A. K. El-Naggar. Targeting the mitogen-activated protein kinase ras-raf signaling pathway in cancer therapy. Expert Opin Ther Targets, 16(1):103–119, 2012. [page 10] 59 Bibliography [11] B. N. Kholodenko. Four-dimensional organization of protein kinase signaling cascades: the roles of diffusion, endocytosis and molecular motors. The Journal of Experimental Biology, 206:2073–2082, 2003. [page 10] [12] T. D. Pollard, W. C. Earnshaw, and J. Lippincott-Schwartz. Cell Bi- ology. Saunders/Elsevier, 2008. [page 12] [13] G. Casella and R. L. Berger. Statistical Inference. Cengage Learning; 2nd edition, 2001. [page 13] [14] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes 2nd edition: The Art of Scientific Computing. Cam- bridge University Press, 2007. [page 13] [15] K. Pearson. Contributions to the mathematical theory of evolution.-ii. skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London A, 186:343, 1895. [pages 13 and 25] [16] Kurtosis of well-known distributions. http://en.wikipedia.org/ wiki/File:Standard_symmetric_pdfs.png. [page 18] [17] R. Ihaka and R. Gentleman. R: A language for data analysis and graph- ics. Journal of Computational and Graphical Statistics, 5 no.3:299–314, 1996. [page 25] [18] J. Heinrich. A guide to the pearson type iv distribution. Collider Detector at Fermilab internal note, 6820, 2004. [page 25] [19] G. L. Hazelbauer. Myriad molecules in motion: Simulated diffusion as a new tool to study molecular movement and interaction in a living cell. Journal of Bacteriology, 187(1):23–25, 2005. [page 29] [20] K. Lipkow, S. S. Andrews, and D. Bray. Simulated diffusion of phos- phorylated chey through the cytoplasm of Escherichia coli. Journal of Bacteriology, 187(1):45–53, 2005. [page 29] [21] V. Verma and A. Pang. Comparative flow visualization. IEEE Trans- actions on Visualization and Computer Graphics, 10(6):609–624, 2004. [page 37] [22] H. G. Pagendarm and F. H. Posts. Comparative visualization - ap- proaches and examples. Fifth Eurographics Workshop on Visualization in Scientific Computing, 1994. [page 37] [23] P. L. Williams and S. P. Uselton. Foundations for measuring volume rendering quality. Technical Report NAS-96-021, NASA Numerical Aerospace Simulation, 1996. [page 38] 60 BIBLIOGRAPHY [24] J. Trapp and H. G. Pagendarm. Data level comparative visualization in aircraft design. Visualization ’96. Proceedings., pages 393–396, 1996. [page 38] [25] H. Childs, S. Ahern, J. Meredith, M. Miller, and K. I. Joy. Com- parative visualization using cross-mesh field evaluations and derived quantities. Scientific Visualization: Interactions, Features, Metaphors, 2011. [page 38] [26] K. Kim and A. Pang. Ray-based data level comparison of direct vol- ume rendering algorithms. Scientific Visualization, Dagstuhl Workshop Proceedings, page 137, 1999. [page 38] [27] H. G. Pagendarm and F. H. Post. Studies in comparative visualization of flow features. Scientific Visualization, Overview, Methodologies and Techniques, pages 211–227, 1997. [page 38] [28] M. Falk, M. Klann, M. Reuss, and T. Ertl. 3d visualization of concen- trations from stochastic agent-based signal transduction simulations. Proceedings of the 2010 IEEE international conference on Biomedical imaging: from nano to Macro, pages 1301–1304, 2010. [page 39] 61 Appendix A Identification of reactions and reaction partners for (time step t = last:first) { load time t if particle type has changed { if the particle of interest is reactant { identify the types of the other n reactants, and m enzymes for each particle of these types { compute the distance to the particle of interest chose the ones with a distance smaller then 1 place them in an ordered list according to type } take the first n, and m members from each list load the time step t+1 identify the types of the s corresponding products for each particle of these types { compute the distance to the particle of interest chose the ones with a distance smaller then 1 place them in an ordered list according to type } take the first s members from each list } if the particle of interest is enzyme { identify the types of the other n reactants for each particle of these types { compute the distance to the particle of interest chose the ones with a distance smaller then 1 place them in an ordered list according to type } take the first n members from each list load the time step t+1 identify the types of the s corresponding products for each particle of these types { compute the distance to the particle of interest 63 chose the ones with a distance smaller then 1 place them in an ordered list according to type } take the first s members from each list } if the particle of interest is product { identify the types of the other s products for each particle of these types { compute the distance to the particle of interest chose the ones with a distance smaller then 1 place them in an ordered list according to type } take the first s members from each list load the time step t-1 identify the types of the the other n reactants, m enzymes for each particle of these types { compute the distance to the particle of interest chose the ones with a distance smaller then 1 place them in an ordered list according to type } take the first n,m members from each list } } } 64 Declaration I hereby declare that I have independently written the thesis submitted by me. I have indicated all passages whose exact wording or analogy have been derived from published or unpublished works of others. All sources and other means used for this thesis have been indicated by me. Neither the exact contents nor major segments of the thesis have yet been submitted to an examination authority. Stuttgart, 3 May 2013 Ana Cristina Pintilie 65