Atomic Architectural Component Recovery for Program Understanding and Evolution

Evaluation of Automatic Re-Modularization Techniques and Their Integration in a Semi-Automatic Method

Von der Fakultät Informatik der Universität Stuttgart zur Erlangung der Würde eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigte Abhandlung

Vorgelegt von Rainer Koschke aus Stuttgart
Hauptberichter: Prof. Dr. E. Plödereder
Mitberichter: Prof. Dr. J. Ludewig
Tag der mündlichen Prüfung: 27. Oktober 1999

Institut für Informatik, Universität Stuttgart, 2000

Acknowledgments

I would like to thank my thesis advisor, Erhard Plödereder, for his continual support over all my years in his group. He has provided valuable guidance and advice and has shared many interesting ideas with me. I am thankful to Thomas Eisenbarth, Jean-François Girard, and Georg Schied for many interesting discussions on the topics of my thesis that have broadened my perspective. I appreciate the support of Hiltrud Betz and Thomas Eisenbarth in the implementation of the analyses described in this thesis. I would also like to thank Andreas Zendler for his guidance in finding appropriate statistical tests for the experiment conducted for this thesis. Last but not least, I want to thank Hausi Müller and his group for making Rigi available to me.

Table of Contents

List of Abbreviations
Abstract
Zusammenfassung

Part I Prelude

Chapter 1 Introduction
  1.1 State of the Practice
  1.2 The Importance of Software Architecture
  1.3 Research in Architecture Recovery
  1.4 Scientific Questions Addressed in this Dissertation
  1.5 Project Context
  1.6 Overview of the Following Chapters

Chapter 2 Terminology
  2.1 Reengineering and Software Maintenance Terminology
    2.1.1 Forward Engineering
    2.1.2 Reverse Engineering
    2.1.3 Restructuring
    2.1.4 Reengineering
    2.1.5 Program Evolution and Software Maintenance
  2.2 Terminology in Software Architecture
  2.3 Architecture Recovery
  2.4 Embedding this Work into Architecture Recovery
    2.4.1 The Horseshoe Model of Architecture Recovery
    2.4.2 A Revised Conceptual Model of Architecture Recovery
  2.5 Components
    2.5.1 Cohesion of Modules
    2.5.2 Recovered Atomic Components
      2.5.2.1 Abstract Data Types
      2.5.2.2 Abstract Data Objects
      2.5.2.3 Hybrids: State-based ADTs or ADOs with Subordinated Types
      2.5.2.4 Strongly Connected Components
      2.5.2.5 Other Kinds of Atomic Components
      2.5.2.6 Primary Target Atomic Components
      2.5.2.7 Abstractness of Atomic Components

Chapter 3 Basic Structural Information
  3.1 Base Entities and Relationships
    3.1.1 Architectural Quarks and their Relationships
    3.1.2 Record Components
      3.1.2.1 Enclosing Relationship for Record Components
      3.1.2.2 Refining the Reference Relationship for Record Components
      3.1.2.3 Modeling References
    3.1.3 Entity-Relationship Model for Base Entities
  3.2 Components
    3.2.1 Atomic Components
    3.2.2 Subsystems
    3.2.3 Entity-Relationship Model for Components
  3.3 Module Decomposition
  3.4 Resource Usage Graph
    3.4.1 Nodes
    3.4.2 Edges
    3.4.3 Attributes
    3.4.4 Notational Conventions
  3.5 Predicates and Functions for Nodes and Edges
    3.5.1 General Predicates
    3.5.2 General Neighbor Functions
    3.5.3 Neighbor Functions for Base Entities
    3.5.4 Elements of Components
      3.5.4.1 Direct Elements of a Component
      3.5.4.2 Indirect Elements of a Component
      3.5.4.3 Partial Subset Relationship
  3.6 Views of the Resource Usage Graph
  3.7 Identity, Affinity, and Correspondence of Components
    3.7.1 Affinity for Atomic Components
    3.7.2 Affinity for Subsystems
    3.7.3 Properties of the Correspondence Relationship
    3.7.4 Correspondence Relationship and Views

Chapter 4 Components in the Programming Language C
  4.1 Analyzing C Code
  4.2 The Programming Language C
    4.2.1 Modules and Macros
    4.2.2 Name Spaces
    4.2.3 Types
      4.2.3.1 Typedefs
      4.2.3.2 Enums
      4.2.3.3 Structs and Unions
      4.2.3.4 Anonymous Types
    4.2.4 Global Variables
    4.2.5 Constants
    4.2.6 Record Components
    4.2.7 Functions
    4.2.8 References
      4.2.8.1 Reference Information Approximation
      4.2.8.2 References and Dereferences
  4.3 Information Hiding in C
  4.4 The Language and Other Factors

Part II Automatic Techniques

Chapter 5 Basic Techniques and Metrics of Component Detection
  5.1 What All Techniques Have in Common
    5.1.1 Working Example
    5.1.2 Characteristics of the Techniques
  5.2 Global Object Reference Heuristic
  5.3 Same Module Heuristic
  5.4 Part Type Heuristic
  5.5 Same Expression Heuristic
  5.6 Internal Access / Non-Abstract Usage Heuristic
  5.7 Delta-IC
  5.8 Internal and External Connectivity
  5.9 Schwanke's Arch Approach
  5.10 Type-based Cohesion
  5.11 Strongly Connected Components
  5.12 Dominance Analysis
  5.13 Preliminary Taxonomy of Basic Structural Techniques

Chapter 6 Evaluation of the Basic Techniques
  6.1 Reference Corpus
    6.1.1 Systems Studied
    6.1.2 Obtaining the Reference Atomic Components
      6.1.2.1 Reference Components for Aero, Bash, and CVS
      6.1.2.2 Reference Components for Mosaic
  6.2 Comparison of Candidate and Reference Components
    6.2.1 Classification of Matches
    6.2.2 Detection Quality
  6.3 Benchmark Results for the Basic Techniques
  6.4 Analysis of False Positives
    6.4.1 Average Size of False Positives Before Manual Validation
    6.4.2 Overlooked Atomic Components
    6.4.3 Common Patterns of False Positives
  6.5 Qualitative Comparison
  6.6 Distinctive Contribution of Individual Techniques
  6.7 Analysis of True Negatives
  6.8 Summary

Chapter 7 Similarity Clustering
  7.1 Overview of the Approach
  7.2 Similarity Between Groups of Entities
  7.3 Similarity Between Entities
    7.3.1 Features
    7.3.2 Indirect Relations
    7.3.3 Feature Weights
    7.3.4 Direct Relations
    7.3.5 Informal Information
  7.4 Clustering Result
  7.5 Integration of Other Approaches
  7.6 Establishing the Similarity Metric Parameters
    7.6.1 Statistical Analysis of Edge Distribution
      7.6.1.1 Scope for the Data
      7.6.1.2 Used Data
      7.6.1.3 Data for Aero, Bash, CVS, and Mosaic
    7.6.2 How to Find the Parameters
      7.6.2.1 Traditional Optimization Strategies
      7.6.2.2 Sample Partitioning
      7.6.2.3 Evaluation of the Training Methods
      7.6.2.4 Comparison to Other Techniques
  7.7 Implementation Hints
    7.7.1 Used Data Structures
    7.7.2 Initialization
    7.7.3 A Refined Clustering Algorithm
    7.7.4 Matrix Representation
    7.7.5 Implementing the Priority Queue
  7.8 Differences from Previous Approaches
  7.9 Summary

Part III The Semi-Automatic Method

Chapter 8 Combined and Incremental Techniques
  8.1 Ways of Combinations
  8.2 User Information
    8.2.1 User Information and Components Views
    8.2.2 Constraints for Components Views
    8.2.3 User Actions
  8.3 Combining Operators
    8.3.1 Composition
      8.3.1.1 Restrictions Imposed by the Interactive Employment
      8.3.1.2 Algorithm for the Composition Operator
      8.3.1.3 Composition for Connection-based Techniques
      8.3.1.4 Composition for Metric-based Techniques
      8.3.1.5 Composition for Graph-based Techniques
      8.3.1.6 Partitioning Clusters with Mutually Exclusive Elements
      8.3.1.7 Transforming Clusters into Components
    8.3.2 Set Operators for Combining Components Views
      8.3.2.1 Deep Union
      8.3.2.2 Deep Intersection
      8.3.2.3 Deep Symmetric Difference
  8.4 Voting Approach
    8.4.1 Summarized Agreement
    8.4.2 Ways of Using the Voting Approach
    8.4.3 Agreement of Individual Techniques
      8.4.3.1 Agreement of Connection-based Techniques
      8.4.3.2 Agreement of Metric-based Techniques
      8.4.3.3 Agreement of Graph-Based Techniques
  8.5 Summary

Chapter 9 A Semi-Automatic Method to Detect Components
  9.1 Method Overview
  9.2 Analysis Selection and Application
  9.3 Metric Selection, Adjustment, and Ranking
  9.4 Presentation, Validation, and Acceptance
  9.5 Detection Strategy
  9.6 Extensibility of the Framework

Chapter 10 Experiments to Evaluate the Semi-Automatic Method
  10.1 Goals of the Experiments
  10.2 Experimental Subjects
  10.3 Experiment to Evaluate the Semi-Automatic Method
    10.3.1 Hypotheses
    10.3.2 Experimental Materials
    10.3.3 Tool Support
    10.3.4 Experimental Design
    10.3.5 Measurement of the Dependent Variable
    10.3.6 Experimental Results
    10.3.7 Statistical Analysis
      10.3.7.1 Exact U-Test
      10.3.7.2 Exact Fisher-Pitman Test
    10.3.8 Summary
  10.4 Case Study for Maintenance Support
    10.4.1 Task 1 - Change of Data Structure
    10.4.2 Task 2 - Lifetime Analysis
    10.4.3 Task 3 - Exact Interface Identification
    10.4.4 Task 4 - Concept Recognition and Clone Detection
    10.4.5 Summary

Part IV Finale

Chapter 11 Related Research
  11.1 Other Automatic Component Detection Techniques
    11.1.1 Metric-based Approaches
      11.1.1.1 Belady and Evangelisti
      11.1.1.2 Hutchens and Basili
      11.1.1.3 Girard and Würthner
      11.1.1.4 Mancoridis et al.
    11.1.2 Concept Analysis
      11.1.2.1 Mathematical Background
      11.1.2.2 Lindig and Snelting's Approach
      11.1.2.3 Siff and Reps's Approach
      11.1.2.4 Sahraoui et al.'s Approach
      11.1.2.5 Canfora et al.'s Approach
      11.1.2.6 Summary of Concept Analysis
    11.1.3 Dataflow-based and Domain-based Approaches
      11.1.3.1 Valasareddi and Carver
      11.1.3.2 Gall and Klösch
  11.2 Semi-Automatic Methods
    11.2.1 Müller et al. (Rigi)
    11.2.2 Kazman and Carrière (Dali)
    11.2.3 Yeh et al. (ManSART)
    11.2.4 Gall, Klösch, and Weidl
    11.2.5 Murphy, Notkin, and Sullivan (Software Reflexion Model)
  11.3 Plan Recognition Techniques
  11.4 Connector Detection Techniques

Chapter 12 Conclusions
  12.1 Conclusions
  12.2 Future Research
    12.2.1 Data Flow Information
    12.2.2 Domain Knowledge
    12.2.3 Research Directions Concerning the Method
    12.2.4 Role Identification
      12.2.4.1 Intra-Component Roles
      12.2.4.2 Inter-Component Roles
    12.2.5 Protocols
    12.2.6 Subsystem Detection
    12.2.7 Architectural Conformance
  12.3 Final Remarks

Appendix A Entity-Relationship Model for Basic Structural Information
Bibliography
Index

List of Abbreviations

ADO   Abstract Data Object
ADT   Abstract Data Type
CPP   C Preprocessor
ExtC  External Connectivity
HC    Hybrid Component
IC    Internal Connectivity
ICS   Institute of Computer Science
IESE  Fraunhofer Institute of Experimental Software Engineering
IML   Intermediate Language
IntC  Internal Connectivity
KLOC  Thousand (Kilo) Lines of Code
LOC   Lines of Code
RS    Related Subprograms
RUA   Resource Usage Analysis
RUG   Resource Usage Graph

Abstract

The literature is rich in fully automatic and semi-automatic techniques for component recovery, and their number is still growing. The abundance of published methods calls for frameworks to unify, classify, and compare them in order to make informed decisions. This thesis introduces a classification of component recovery techniques based on a unification of 23 techniques. Focusing on structural techniques, 16 fully automatic techniques are classified into connection-, metric-, graph-, and concept-based subcategories, and the commonalities and variabilities of these techniques are discussed in depth. Beyond the qualitative comparison, 12 structural techniques are evaluated quantitatively (concept-based techniques were excluded). To that end, an evaluation scheme is introduced that makes it possible to measure recall and precision of component recovery techniques with respect to a set of reference components ascertained by software engineers. Among the evaluated techniques is our new metric-based technique named Similarity Clustering. The evaluation, based on a set of expected components compiled manually by 5 software engineers for four C systems with altogether 136 KLOC, shows that Similarity Clustering is among the best techniques for all systems, but it also has more false positives than other techniques. The overall result of this comparison is that none of the fully automatic techniques has sufficient detection quality.

In order to overcome this problem, a semi-automatic method is presented in this thesis in which computer and maintainer collaborate to detect components. The method is supported by a framework that integrates the existing fully automatic techniques. In this framework, the automatic techniques can be run successively and their results validated by the user. For this purpose, all the techniques are enhanced to work incrementally. The unification of the automatic techniques makes it possible to implement incremental variants for whole classes of techniques. The results of the techniques can be combined by high-level operators modeled on intersection, union, and difference for fuzzy sets. An alternative way of integration is offered by a voting approach that summarizes the individual agreement of the automatic techniques.

Despite the new ways of combining the automatic techniques, the semi-automatic method inherits weaknesses of the integrated techniques.
Future research should investigate whether these weaknesses may be overcome with additional, more precise information gained from dataflow analyses and more domain-oriented information. However, all methods will always have to cope with the vagueness and subjectivity of the grouping criteria for components.

Zusammenfassung

In der Literatur findet sich eine große Anzahl voll- und halbautomatischer Techniken zur Komponentenerkennung und ihre Zahl wächst stetig. Die Fülle der publizierten Techniken macht eine Klassifikation und Bewertung notwendig, um begründete Entscheidungen bei der Auswahl einer geeigneten Technik zu ermöglichen. In dieser Arbeit wird eine Klassifikation basierend auf einer Vereinheitlichung von 23 Techniken zur Komponentenerkennung eingeführt. Eine engere Betrachtung von 16 strukturellen Techniken liefert eine Subkategorisierung in verbindungs-, metrik-, graph- oder begriffsbasierte Techniken, deren Gemeinsamkeiten und Unterschiede eingehend vorgestellt werden. Über den rein qualitativen Vergleich hinaus werden 12 strukturelle Techniken quantitativ beurteilt (begriffsbasierte Techniken werden nicht näher ausgewertet). Zu diesem Zweck wird ein Auswerteschema für Komponentenerkennungstechniken vorgestellt, mit dessen Hilfe die Erkennungsqualität hinsichtlich einer Menge von Referenzkomponenten, die durch Software-Ingenieure manuell ermittelt werden, genau bestimmt werden kann. Unter den bewerteten Techniken befindet sich unsere neue metrikbasierte Technik Similarity Clustering. Bei der Auswertung anhand des eingeführten Bewertungsschemas und der von Software-Ingenieuren erkannten Referenzkomponenten für vier C-Systeme mit zusammen ca. 136 KLOC befindet sich Similarity Clustering bezüglich seiner Wiederfindungsrate stets unter den besten Techniken; allerdings ist auch eine höhere Anzahl unzuordenbarer Komponenten als bei anderen Techniken zu verzeichnen. Als Resultat ergibt sich insgesamt, dass keine der automatischen Techniken eine ausreichende Erkennungsqualität aufweisen kann.

Um diesen Mangel auszugleichen, wird eine halbautomatische Methode eingeführt, in der Computer und Mensch bei der Erkennung zusammenwirken. Die Methode wird unterstützt durch eine Integration der vollautomatischen Techniken, bei der die Analysen sukzessive mit anschließender Validierung durch den Benutzer ausgeführt werden können. Hierzu werden die Techniken zu inkrementellen Techniken erweitert. Die Vereinheitlichung der Techniken erlaubt die einheitliche Implementierung inkrementeller Varianten für ganze Klassen von Techniken. Die Resultate der Techniken können mittels Operatoren kombiniert werden, denen die Mengenoperationen Schnitt, Vereinigung und Differenz für unscharfe Mengen zugrunde liegen. Eine alternative Art der Integration ist der sogenannte Abstimmungsansatz, bei dem die individuellen Zustimmungen der Techniken zusammengefasst werden.

Trotz der neuen, mächtigen Möglichkeiten, die automatischen Techniken zu kombinieren, wird die halbautomatische Methode durch Schwächen der einbezogenen automatischen Techniken beeinträchtigt. Zukünftige Forschung sollte untersuchen, ob die Schwächen der Techniken mit präziseren Informationen, die durch Datenflussanalysen hergeleitet werden können oder sich durch Vorwissen über das Anwendungsgebiet ergeben, beseitigt werden können. Nichtsdestotrotz werden auch zukünftige Verbesserungen stets mit der Vagheit und teilweisen Subjektivität der Gruppierungskriterien für Komponenten konfrontiert sein.
Part I Prelude

Chapter 1 Introduction

In 1985, Lehman and Belady stated the so-called Lehman's laws. Of the original five, the two "laws" (hypotheses, really) most relevant to the context of this thesis are repeated here:

(1) The law of continuing change: A program that is used in a real-world environment necessarily must change or become progressively less useful in that environment.

(2) The law of increasing complexity: As an evolving program changes, its structure tends to become more complex. Extra resources must be devoted to preserving and simplifying the structure.

The work presented in this thesis aims at methods and tools to preserve and simplify the structure of a system in order to support program evolution. Program evolution here means any modification of a software product that takes place after delivery to correct faults, to improve performance or other attributes, to adapt the product to a changed environment, or to add functionality.

1.1 State of the Practice

Software is an increasingly important factor in the expenses and returns of marketed products, not only in the traditionally software-dominated domains, such as telecommunications and information systems, but also in other technology-oriented lines, such as mechanical engineering, aviation, astronautics, or the entertainment industry, whose share of software in production costs is estimated at 30-50 percent. The average Fortune 100 company has 35 million lines of code in operation, with a growth of 10 percent per year (Buss et al., 1994).

It is known from diverse case studies that 60-80% of the costs of a software product arise from program evolution (Nosek and Palvia, 1990). Interestingly enough, industry and research have made surprisingly little effort to cope with the problems of program evolution compared to the attention given to the development of new systems. The "year 2000 problem" has put program evolution in the limelight. However, even this example of a mass change has not changed the situation very much (McCabe, 1998).

More than 50% of the time needed for program evolution is spent on understanding the program before the actual change can be designed and realized, as several case studies have shown (Fjeldstadt and Hamlen, 1984). This is because the necessary information for the task at hand is often not completely and correctly documented and therefore has to be derived from the source code. The maintainers, being badly informed and pressed for time, tend to fix the problem only locally, mostly in those subsystems they are familiar with. These local code fixes often disregard the original design and – since they are no real solutions but treat the problem only phenotypically – provoke errors at other sites of the system and complicate future understanding. This is a vicious circle that ends in a non-maintainable system unless preventive measures are taken.

1.2 The Importance of Software Architecture

Large systems are divided into subsystems. These subsystems, also known as components, and the dependencies that exist among the components form the software architecture of a system. The software architecture is a key asset affecting most attributes of a system. An inappropriate or deteriorated architecture can have a disastrous effect on maintainability. Garlan and Perry (1995) describe the major impacts of a software architecture on the following aspects of a system, with a focus on the development of new systems rather than maintenance.
I will describe these aspects more from the maintainer's point of view.

Understandability. The software architecture provides an overview of a system at a higher level of abstraction. This overview exposes the high-level constraints on system design that a maintainer has to observe and allows a more focused search oriented toward architectural information. Many original design decisions and the consequences of disregarding them become clear only at this level.

Reuse. In the architecture, the maintainer can identify not only the reusable components but also the existing dependencies on other parts of the system that need to be handled before the components can be reused. Current work on reuse generally focuses on component libraries. Architectural design supports, in addition, both the reuse of large components and frameworks into which components can be integrated. Architecture recovery is also an enabling technology for the product line approach, in which common parts of the architectures of a family of systems are united and generalized into a generic architectural framework for a particular domain; the architectures of the actual systems in this domain can then be realized as instantiations of the general framework (Bayer et al., 1999).

Evolution. The software architecture can be viewed as the skeleton of the system. Having a description of this skeleton enables the maintainer to identify load-bearing and potentially weak parts that need to be carefully addressed when a system is to be evolved. Furthermore, having a clear picture of a component's dependencies allows one to modify the component itself without affecting other parts of the system, or to change the dependencies in order to handle evolving concerns about performance, interoperability, and reuse. Errors no longer have to be fixed where they appear but where they were caused, by identifying the responsible components or the undocumented dependencies and constraints.

Analysis. If the recovered architecture is specified by a separate architectural description, new opportunities for analyses are provided, including high-level forms of system consistency, conformance to an architectural style, conformance to quality attributes, and domain-specific analyses for architectures that conform to a specific style. Furthermore, the architectural description can be used to check whether changes to the system conform to the design principles of the architecture.

Management. Maintenance assignments can be made on the basis of subsystems. Moreover, the software architecture provides a basis for a more rigorous estimation of the costs and risks of a change. The quality of a system can be assessed by judging the load-bearing capacity of its architecture. Weak parts can be identified, and measures to overcome these weaknesses can be better examined and targeted. For particularly problematic components, it can be decided whether they should be reengineered or newly developed. Reengineering of large systems is only feasible if it is done subsystem by subsystem. For this incremental migration, the dependencies have to be known and the wrapping of the not yet reengineered parts has to be planned.

Since all these factors are essential for a system's capability to evolve, a description of the software architecture must be recovered when it is lost.
Ideally, the documentation should be kept up to date with future changes once the architecture has been recovered, and the need for recovery should never arise again. However, even then it might be necessary to inspect the architecture as built in order to recognize and analyze differences from the documented architecture. Furthermore, the maintainer may need to explore the architecture as built when the higher-level description abstracts from certain details. Recovering the software architecture and exploring the architecture as built is costly, and the only available tool support in practice is far too often a symbolic debugger to trace the system at a very low level.

1.3 Research in Architecture Recovery

Architecture recovery comprises the detection of components (the computational parts) and connectors (the means and points of communication) of systems. It is aimed at supporting the process of program understanding for software maintenance and evolution.

Component recovery. One major research topic in component recovery is the detection of subsystems (Schwanke, 1991); another is the recovery of objects and abstract data types. Though abstract data type and object detection is commonly driven by reuse or object-oriented system migration, it does support architecture recovery at a lower level of components. Abstract data types and objects consist solely of subprograms, types, and global variables. They are only two examples of architectural concepts we can form with these kinds of base elements. Other examples are sets of related subprograms or hybrid components. We will refer to such low-level components built solely from types, variables, and subprograms as atomic components.

Connector recovery. Connectors for concurrent and distributed systems have been the primary target of connector recovery (Harris et al., 1995; Fiutem et al., 1996). However, most systems, especially legacy systems, are sequential and monolithic. The function call is the most primitive and dominating type of connector in such systems. Another common way of communication is via shared global variables. At the next higher level of connectors, we find atomic components. For example, two architectural components may communicate by means of a pipe where the pipe is implemented as an abstract data type. That is, atomic components can be connectors at a higher level of architecture. Detecting them can therefore also aid in understanding how larger components communicate.

The goal of our research is to find techniques and methods for atomic component detection in the general framework of architecture recovery. In a case study, we have evaluated several published approaches to detect abstract data types and objects (Girard, Koschke, Schied, 1997c). The overall result was that none of the techniques has the needed precision. There are several alternative approaches to overcome this: the techniques can be combined, other sources of information can be considered (for example, dataflow information or domain knowledge), or the user can be integrated into the search. This thesis proposes a semi-automatic method in which computer and maintainer work hand in hand to detect atomic components. Within this interactive framework, the techniques can be combined by simple operations triggered by the user. Due to the complexity, vagueness, and to some degree subjectivity of the grouping criteria, it is questionable whether we can ever find precise techniques that fit all cases.
Therefore, atomic component recovery is a problem that has to be tackled in concert with a maintainer anyway. Hence, how this can be achieved effectively should be investigated first before we search for other sources of information.

1.4 Scientific Questions Addressed in this Dissertation

In more detail, the following questions are going to be addressed in this dissertation (the respective chapters devoted to these questions are given in brackets):

1. What published structural techniques exist and how can they be unified and classified (Chapter 5)?
2. What is the recall rate and precision in atomic component detection of published techniques (Chapter 6)?
3. How can these techniques be improved individually (Chapter 5 and Chapter 7)?
4. How can these techniques be combined (Chapter 8)?
5. How can the user be integrated in atomic component detection (Chapter 9)?
6. Do automatic techniques support a maintainer in atomic component detection (Section 10.3)?
7. Are the techniques and methods for atomic component detection discussed in this work also helpful for other typical maintenance tasks (Section 10.4)?

This thesis focuses on techniques for atomic component detection that leverage only structural information and investigates how far we can get with such methods. Other potential sources of information are the results of control and data flow analyses and domain knowledge. This thesis does not deal with control and data flow analyses, but in the course of the thesis, I will point out where information derived from these analyses could support the structure-oriented methods. Domain knowledge comes into play through the maintainer within the interactive scenario, but automatic ways to leverage domain knowledge are not explored. Still, I give at least one hint on how one of the approaches could profit from the vocabulary of a domain (Section 7.3.5).

1.5 Project Context

The work described in this thesis is embedded in the Bauhaus project. Bauhaus is a research collaboration between the Institute for Computer Science of the University of Stuttgart (ICS) and the Fraunhofer Institute for Experimental Software Engineering in Kaiserslautern (IESE). The goal of Bauhaus is to find methods and techniques for architecture recovery, to explore languages to describe recovered architectures, and to investigate analyses to compare architectures as built to the specified architectures.

The first step toward the general goal of Bauhaus was to investigate methods and techniques to recover the architecture based on structural information. The first researchers of this project were Jean-François Girard (IESE), Georg Schied (ICS), and I (ICS). Until mid-1998, the three of us worked in close cooperation in the field of atomic component detection. All the work of this period was jointly published. When Georg Schied left the Bauhaus project in March 1998, our teamwork was reorganized. Jean-François Girard has concentrated on the detection of subsystems since then. This is the reason why my work does not deal with subsystem detection. However, the combinations of the basic techniques that I propose in Chapter 8 are designed so that they will work in the presence of recovered subsystems.
My work has continued the detection of atomic components by exploring possible combinations of the basic techniques, evaluating and inventing new techniques, providing incremental variants of the basic techniques, and delving into ways to integrate the user into the process of atomic component detection. To give a complete picture of atomic component detection based on structural information, this thesis reports not only on new improvements since March 1998 but also on previous joint work with Jean-François Girard and Georg Schied, which I will explicitly point out in the following.

1.6 Overview of the Following Chapters

This thesis consists of two main parts. The first part deals with automatic techniques and the second part with a semi-automatic method for atomic component detection. Before we get to these two main parts, the terminology and concepts used in this thesis are introduced in Chapter 2 and Chapter 3, and the issues related to the target language are discussed in Chapter 4.

The first main part begins with Chapter 5, which describes published techniques for atomic component detection and suggests individual improvements to these techniques. These techniques are evaluated in Chapter 6. We, namely Jean-François Girard, Georg Schied, and I, extended one of the basic techniques described in Chapter 5 in so many ways that it is presented as a technique of its own in Chapter 7.

The second main part proposes ways to combine the basic techniques and shows how the techniques can be modified to work incrementally (Chapter 8). Then it presents a method in which the maintainer uses the incremental versions of the basic techniques to detect atomic components (Chapter 9) and describes a controlled experiment and a case study conducted to evaluate the method (Chapter 10).

The last part discusses related research (Chapter 11) and summarizes the conclusions of this thesis (Chapter 12).

Chapter 2 Terminology

This chapter introduces the terminology used throughout this thesis.

2.1 Reengineering and Software Maintenance Terminology

The following three sections contain standard terminology in reengineering. The definitions of reverse engineering, restructuring, and reengineering were proposed by Cross and Chikofsky (1990). Figure 2-1 sketches the relationships between these terms graphically.

[Figure 2-1. Relationships between terms: requirements, design, and code, connected by forward engineering, reverse engineering, architecture recovery, restructuring or change, and reengineering.]

2.1.1 Forward Engineering

Software engineering was primarily thought of as aiming at the development of new systems, though it covers reverse engineering and reengineering as well. To avoid the connotations of the term software engineering, the term forward engineering is introduced. Forward engineering is the process of moving from high-level abstractions and logical, implementation-independent designs to the physical implementation of a system.

2.1.2 Reverse Engineering

Reverse engineering has the reverse objective of forward engineering. Reverse engineering is the process of analyzing a subject system to

• identify the system's components and their interrelationships and
• create representations of the system in another form or at a higher level of abstraction.

It is important to note that reverse engineering in and of itself does not involve changing the subject system or creating a new system based on the reverse-engineered subject system.
It is a process of examination, not a process of change or replication.

2.1.3 Restructuring

Restructuring is the transformation from one representation form to another at the same relative abstraction level, while preserving the subject system's external behavior (functionality and semantics). Restructuring is often used as a form of preventive maintenance to improve the physical state of the subject system with respect to some preferred standard.

2.1.4 Reengineering

Reengineering, also known as renovation and reclamation, is the examination and alteration of a subject system to reconstitute it in a new form and the subsequent implementation of the new form. Reengineering generally includes some form of reverse engineering (to achieve a more abstract description) followed by some form of forward engineering or restructuring.

2.1.5 Program Evolution and Software Maintenance

Reengineering is often seen as part of software maintenance. However, the ANSI/IEEE standard 729-1983 defines software maintenance as the "modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a changed environment", whereas the goal of reengineering is often to add new functionality to the system, which is not covered by the definition of software maintenance if one interprets that definition narrowly. One may argue that "adapting the product to a changed environment" includes adding new functionality, but as Turski (1981) pointed out, this would be a gross abuse of the term maintenance: the addition of a new wing to a building would never be described as maintaining that building. That is why I consider adding new functionality to be program evolution but not maintenance. Reengineering is therefore a part of program evolution.

As opposed to reengineering, reverse engineering may in fact be viewed as an activity within software maintenance, since its purpose is to recover information that can be used for software maintenance tasks and it does not imply any change.

2.2 Terminology in Software Architecture

There are still debates about the definition of software architecture, but most agree that it should include at least components and connectors and their hierarchical decomposition. Components are the computational parts, and connectors describe the interactions between these components (Garlan and Shaw, 1993; Perry and Wolf, 1992). General examples of components are abstract data types, producer and consumer tasks, or a compiler front end; examples of connectors are procedure calls, shared global variables, pipes, or Unix sockets.

2.3 Architecture Recovery

Architecture recovery is a discipline of reverse engineering that is aimed at recovering the software architecture of a system. It has to be demarcated from design recovery. The term design recovery was introduced by Ted Biggerstaff (1989) and is the process of recreating design abstractions from a combination of code, existing design documentation, personal experience, and general knowledge about problem and application domains. Biggerstaff argues that design recovery in the broad sense is so inherently unstructured and unpredictable that formal deduction alone is not sufficient and, therefore, fuzzy reasoning should be used as an additional way of deriving information. The derived information is used to populate a domain model that is used to understand the software.
Design recovery is distinguished by the sources and span of information it should handle. As Biggerstaff says: "The domain model differentiates design recovery research from such superficially similar efforts as reverse engineering, which automatically abstracts code to a specification level such that the specifications can be modified and revised code can be automatically regenerated." (Biggerstaff, 1989)

It is unclear what definition of reverse engineering Biggerstaff had in mind when he wrote this. The definition of reverse engineering by Chikofsky and Cross (Section 2.1.2) would cover design recovery as well, but it came after Biggerstaff's definition of design recovery. Furthermore, the term design recovery only suggests that the design is to be recovered in the true sense of the word, whereas Biggerstaff's definition explicitly requires a domain model, which narrows the term unnecessarily. That is why I prefer the more neutral term architecture recovery.

2.4 Embedding this Work into Architecture Recovery

This section presents a general framework of architecture recovery proposed by Kazman et al. to accommodate analytical and transformational processes in architecture recovery. Kazman et al.'s model is then refined toward a more conceptual framework that is used to show where the work described in this thesis fits in.

2.4.1 The Horseshoe Model of Architecture Recovery

Kazman et al. (1998) present a framework that can accommodate analysis and transformation processes in architecture recovery. This framework is called the horseshoe model and consists of four different levels:

• source level: source code in textual representation
• code structure level: the source code in an intermediate representation that enables syntax-aware analyses
• function level: relationships among functions, data, and modules, providing a global system overview
• architectural level: architectural elements, i.e., connectors and components

Architecture recovery usually starts at the source level. The source code is parsed, syntactically and semantically analyzed, and then represented in some intermediate representation at the code structure level (often by abstract syntax trees). This intermediate representation is further processed by control and data flow analyses. The result is basic information that can be used for deriving the software architecture. However, the elements represented at this level, i.e., the declarations, statements, and expressions of a system and their relationships, do not appear in the architectural description (except for global declarations), since this description should give a more abstract overview of the system. Yet, a complete architectural description provides a mapping between the architectural concepts and the implementing statements and expressions.

From the architectural perspective, the source and code structure levels can be merged into a single code level. We can omit a discussion of the code level for the purpose of this thesis. More about suitable intermediate representations for reverse engineering can be found in a separate paper by us (Koschke, Girard, Würthner, 1998).

An architectural description typically suppresses statements, expressions, and local declarations present at the code level, but at least global declarations of constants, variables, functions, and user-defined types turn up in such a description since they are architecturally relevant.
Constants are used to fix specific aspects permanently, while global variables represent state that can change over time by modifying the value of the variables. Global variables can also act as connectors. Global functions are primitive components, and user-defined types correspond to higher concepts either of the programming domain (stacks, lists, etc.) or the application domain (deposit, person, etc.).

We will refer to these four kinds of entities (global constants, variables, functions, and user-defined types) as architectural quarks because they are the building blocks of architectural elements. Conceptually, they belong to both the code level and the architectural level. They form the seam between these two levels, so to speak. In the horseshoe model, this seam is elevated to a level of its own, the function level. According to Kazman et al., the function level also shows how the architectural quarks are grouped into modules. Actually, Kazman et al. used the term file in their paper instead, but we may assume that they had module in mind and were thinking of older languages in which files are used as a substitute for modules. Above the function level is the architectural level, which consists of the components and connectors of the software architecture.

2.4.2 A Revised Conceptual Model of Architecture Recovery

The horseshoe model is meant as a framework for analytical and transformational processes in architecture recovery and architecture-based development. In this context, it makes sense to distinguish between the source and code structure levels. At the source level, textual pattern matching based on regular expressions is the only kind of analysis possible, whereas at the code structure level syntactic and semantic as well as control and dataflow analyses can be applied. However, there is conceptually no difference between the source and code structure levels, since the latter is more or less an exact representation of the source code. In a conceptual model, we do not distinguish these two levels.

Any kind of grouping of architectural elements, including that of architectural quarks, is an architecture property. Module organization is one kind of grouping and, therefore, belongs to the architecture domain rather than to the function level as in the horseshoe model. If we exclude module organization from the function level in the horseshoe model, only architectural quarks remain. These, however, also belong to the code level. The function level is therefore rather the seam between the code level and the architectural level than a level of its own. In the revised model, the function level is omitted.

Because of the problems of the horseshoe model just described, I propose a revised conceptual model that consists of the following levels (the first three are illustrated by a small C sketch below):

• lower code level: expressions and statements in function bodies as well as nested functions
• global code level: architectural quarks, i.e., global constants, global variables, global functions, and user-defined types, as well as the relationships among them
• lower architectural level: groupings of architectural quarks
• higher architectural levels: subsystems and connectors

Above the lower architectural level, there can be several higher levels representing the architecture at different levels of abstraction; the connections among these levels display the hierarchical decomposition of the architectural elements.
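The following hypothetical C fragment makes the first three levels concrete. It is only an illustrative sketch; the identifiers are invented and do not stem from any of the systems analyzed in this thesis.

    /* Global code level: the four kinds of architectural quarks. */
    #define MAX_ITEMS 64                  /* global constant   */
    typedef struct { int key; } Item;     /* user-defined type */
    static Item store[MAX_ITEMS];         /* global variable   */
    static int  count = 0;                /* global variable   */

    int store_insert(Item item)           /* global function   */
    {
      /* Lower code level: statements and expressions in a body. */
      if (count == MAX_ITEMS)
        return 0;
      store[count++] = item;
      return 1;
    }

    /* Lower architectural level: the grouping {Item, store, count,
       store_insert, ...} forms an atomic component; the grouping
       itself cannot be expressed directly in C. */

Subsystems built from such groupings, and the connectors among them, would populate the higher architectural levels.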
Both components and connectors can be hierarchical and what is viewed as a compo- nent at one level may be a connector at the next higher level. This thesis is about recovery of smaller groupings at the lower level of architec- ture that consist solely of architectural quarks, i.e., elements at the global code level. Such groupings will be called atomic architectural components, or sim- ply atomic components in the following. They are atomic in the sense that they do not consist of further groupings but only and directly of architectural quarks. Atomic components are therefore the smallest components at the architectural level (besides functions which can also - under some circumstances - be consid- ered components). This merely structural definition will be refined in the follow- ing section. Atomic components may be building blocks for larger architectural components at the next higher architectural level. For example, in a case study, we have used dominance analysis to detect subsystems based on atomic components (Girard and Koschke, 1997a). Furthermore, some of the atomic components may even play the role of connectors at a higher level of abstraction, e.g., an abstract data type Queue can be used as a pipe between two components. This way, atomic components can be the starting point for detection of connectors and larger com- ponents. Terminology 34 2.5 Components Large systems are decomposed into subsystems that can be managed individually. These subsystems can be again decomposed into smaller subsystems. The small- est decomposition is a module that may consist solely of functions, subprograms, and type declarations whereas a subsystem is a grouping of modules or lower- level subsystems. Subsystems and modules are static architectural components that differ in the degree of granularity. Dynamic architectural components are instances of computational units that are created at runtime; e.g., concurrent tasks (with an own thread of control) or queues (without own thread of control). This thesis is about static components only. However, recognizing static components is often a prerequisite for finding dynamic components since the latter are often just instances of the former, such as a queue X created on the heap at runtime that is an instance of an abstract data type Queue implemented by a static component. Good design results in a decomposition in which modules, as well as subsystems, have high cohesion and low coupling. The cohesion of a module is the extent to which its individual components are needed to perform the same task (Fenton and Pfleeger, 1997). Coupling is the degree of interdependence between modules (Yourdon and Constantin, 1979). There is no standard definition of a module. Yourdon and Constantin (1979), for example, propose the following definition: A module is a contiguous sequence of program statements, bounded by boundary elements, having an aggregate identifier. This definition, stated in the late seventies when structured design was the pro- posed design method, sounds nowadays very much like the definition of a func- tion. Today, we have programming languages that support the concept of module. An example is Modula-2 that does not only have the word module as keyword in its syntax but also in its name (Wirth, 1985). A module in modern programming languages is a syntactic unit that supports encapsulation. It consists of an inter- face of the exported parts and an optional hidden implementation. 
The exported elements are global constants, variables, subprograms, user-defined types, and sometimes nested modules. 35 Components In its first design, a system’s decomposition may be indeed so that modules reveal low coupling and high cohesion (see Parnas, 1972, on the criteria of modulariza- tion, and Parnas et al., 1985, on the modular structure of complex systems), but during continuous maintenance the original decomposition may deteriorate. For example, a function F that actually would have belonged to module A was put into module B. This results in lower cohesion of module B and in higher coupling between A and B, since F will need details of the implementation of A. Further- more, the underlying concept of A is delocalized because it is also partly realized by B. High coupling and low cohesion make changes more difficult. Reengineer- ing has to restructure the system such that the underlying concept of A is imple- mented by module A and only by A to simplify future maintenance. The discussion reveals that a real module does not always match its underlying concept, i.e., there is a divergence between the syntactic unit and the logical unit. To distinguish these two kinds of units, we will call the latter atomic component. With module, we solely mean the syntactic unit further on and follow the typical programming language terminology in doing so. A module is a syntactic unit that is used to group entities. It consists of an interface and an optional implementation. Entities in its interface are acces- sible by other modules; the implementation is the module’s secret. A component is a group of related elements with a unifying common goal or concept relevant at the architectural level. An atomic component is a non-hierarchical component that consists of related global constants, vari- ables, subprograms, and/or user-defined types. As opposed to an atomic component, a subsystem is a hierarchical component consisting of related atomic components and/or lower-level subsystems. The goal of (re-)structuring a system is to realize an atomic component by one module and one module implements only one atomic component for the sake of maximal cohesion and minimal coupling. In practice, the degree of cohesion of a module can vary. The next section discusses this in more detail. Terminology 36 2.5.1 Cohesion of Modules Yourdon and Constantin (1979) list the following degrees of cohesion: • Functional: the module performs a single well-defined function. • Sequential: the module performs more than one function, but they occur in an order prescribed by the specification. • Communicational: the module performs multiple functions, but all on the same body of data (which is not organized as a single type or structure). • Procedural: the module performs more than one function, and they are related only to a general procedure affected by the software. • Temporal: the module performs more than one function, and they are related only by the fact that they must occur within the same time span. • Logical: the module performs more than one function, and they are related only logically. • Coincidental: the module performs more than one function, and they are unre- lated. These categories of cohesion are listed from most desirable (functional) to least desirable. This classification was established in the late seventies when the func- tional paradigm was dominating and structural design was the common design method. 
The more recent trend towards the object-oriented paradigm - and hence to languages and methods that support data abstraction by modules - at first sight appears to contradict the traditional ideas. A module based on data abstraction may perform several different functions; but all are related in the sense that they characterize the abstract data type, or more generally: the atomic component. Modules based on data abstraction form a special category, and Macro and Bux- ton (1987) have extended the cohesion classification to include it. They say, a module has abstract cohesion precisely when it is an abstract data type. We can generalize this to: “A module has abstract cohesion precisely when it is an atomic component.” Good design and restructuring should aim at abstractly cohesive modules. 37 Components 2.5.2 Recovered Atomic Components This section answers the questions raised by the new definition of abstract cohe- sion: What exactly is an abstract data type and what are the other kinds of atomic components recovered by the approaches described in this thesis? 2.5.2.1 Abstract Data Types Liskov and Zilles define an abstract data type (ADT) as an abstraction of a type which encapsulates all the type’s valid operations and hides the details of the implementation of those operations by providing access to instances of such a type exclusively through a well defined set of operations (1974). Abstract data types may be constrained by global constants. For example, the maximal length of a list may be determined by a constant specific to this type. Such constants are an integral part of the ADT. Furthermore, the implementation of two or more types is sometimes so much interleaved that the types cannot really be separated into distinct ADTs, e.g., a hash table and a hash table entry type. That is to say, an ADT does not necessarily consist of one type only. To sum it up, ADTs consist of a set of types (usually, only one type) and their accessor functions; global constants may also belong to the ADT if they specify aspects of the ADT. In a modern programming language supporting encapsulation, the data structure of an ADT can be hidden such that only the subprograms that belong to the ADT may access it. All other subprograms may only declare objects of the ADT and call the accessor routines. An example ADT in a modern programming language. Figure 2-2 shows the interface of an ADT stack of integers in the programming language Ada (ANSI/ ISO/IEC-8652:1995). The internal parts of the ADT are explicitly declared pri- vate (the full type declaration is given in the private section of the package for the purpose of efficient separate compilation). Only subprograms declared in this package may access these internal parts (actu- ally, in Ada 95, subprograms declared in child units of this package have also vis- ibility to the private part, but this is not important to the discussion here). The interested reader may learn more about support for data abstraction by modern Terminology 38 programming languages in Robert Sebesta’s book on concepts of programming languages (1998). Abstract data types are related to classes in object-oriented programming lan- guages but whether there is a direct correspondence between the two of them depends on the terminology. We will follow the more proper notion of a class as a set of types, whereas many object-oriented programming languages use class syn- onymously to type, e.g., SmallTalk and C++. 
In the proper sense, an abstract data type may be an element of a class but not a class as such. I want to point out here that the relationships among the abstract data types, i.e., whether they are members of the same class and how they are derived from each other, are beyond the scope of this thesis.

Figure 2-2. An example ADT stack of integer in Ada.

    package Stacks is
      type Stack is private;  -- The ADT stack of Integer
      -- accessor routines
      function Top (S : Stack) return Integer;
      procedure Push (S : in out Stack; Item : Integer);
      procedure Pop (S : in out Stack);
    private
      type Stack_Contents is array (1..1000) of Integer;
      type Stack is record  -- the hidden data structure
        Contents : Stack_Contents;
        Stack_Pointer : Natural := 0;
      end record;
    end Stacks;

2.5.2.2 Abstract Data Objects

An abstract data object (ADO) is a group of global variables and constants together with the routines which access them. These clusters are also called abstract objects (Ghezzi et al. 1991) or object instances (Yeh et al. 1995). ADOs are used to capture state. The state can be manipulated and queried by the accessor routines of the ADO. No other routine may access the variables since the ADO is considered abstract.

An example of an ADO. Figure 2-3 contains an Ada package that implements an abstract data object stack of integer. The package interface lists the accessor routines that can be called by a client. The global variables are hidden from clients in the package body.

Figure 2-3. An example ADO stack of integer in Ada.

    package Stack is
      -- accessor subprograms
      function Top return Integer;
      procedure Push (Item : Integer);
      procedure Pop;
    end Stack;

    package body Stack is
      -- global variables
      Stack_Pointer : Natural := 0;
      Contents : array (1..1000) of Integer;
      ...
    end Stack;

This example illustrates the difference between an ADO and an ADT: An ADT is built around a type. This type can be used to create as many instances of the ADT as needed (either by declaring variables of this type or by using dynamic allocation), whereas there is always exactly one instance of an ADO since a client has no instantiation handle for ADOs and there is only one package Stack. (In Ada, one could make the package Stack generic and then instantiate it many times to get several ADOs Stack; however, we do not have generics in older programming languages.)

An ADO could be generalized to an ADT by simple transformation rules:
• a new record type is introduced that contains a component for each global variable of the ADO
• the new record type is added to the parameter list of each ADO accessor routine
• in the body of an ADO accessor routine, accesses to the global variables are replaced by accesses to the components of the new record type; variables of the new record type are passed as actual parameters to the routine
• at the client side, variables of the new record type must be declared that are then passed as actual parameters in calls to the accessor routines

Because of the simple shift from an ADO to an ADT, an ADO is often considered an ADT. Nevertheless, we will distinguish these two kinds of atomic components since there is a conceptual difference and also because the recovery strategies for the two are different. However, the transformation rules mentioned above make clear that ADOs are at the same cohesion level as ADTs.

2.5.2.3 Hybrids: State-based ADTs or ADOs with Subordinated Types

In real programs, we often find mixtures of ADTs and ADOs, i.e., atomic components that contain both types and variables.
There are two different catego- ries of such mixtures. A state-based ADT is an ADT having state information by way of global variables. An example is an ADT that counts in a global variable how many instances are created at runtime. An ADO with subordinated types is an ADO that contains types that are an integral part of it. An example is an ADO hash table that contains two type declarations, one for hash items and one for lists of hash items (to resolve external collisions). Both types might be so special to the hash table that they cannot be reused in other contexts and are therefore no ADTs of their own. Whether we deal with a state-based ADT or an ADO with subordinated types is often hard to judge. If the distinction is not important, we will refer to both kinds as hybrid atomic components, or short hybrid components. A hybrid atomic component has abstract cohesion because it can be considered an ADT or ADO. 2.5.2.4 Strongly Connected Components Strongly connected components are sets of subprograms that call each other recursively. These subprograms form a component because none of them can be omitted without losing a piece of information for the understanding of the other subprograms in the component. They arise from the call graph; that is why pro- gramming languages do not need means to specify strongly connected compo- nents explicitly. Strongly connected components do not have abstract cohesion in the sense of Macro and Buxton’s definition since they consist solely of subprograms and are therefore no abstract data types. What kind of cohesion they actually have depends on the logical function they perform, but we can use structural informa- tion at least as a clue: It may be that there is only one entry E of the cycle from 41 Components outside and thus all other parts of the strongly connected components are subordi- nated to E. This is a strong hint that the strongly connected component performs a single function and has therefore functional cohesion. When there is more than one entry, the strongly connected component has at least logical cohesion since the functions within the component depend on each other: one cannot change or remove a single subprogram without affecting the others in the strongly con- nected component. 2.5.2.5 Other Kinds of Atomic Components Another relevant type of atomic component are sets of logically related subpro- grams (short related subprograms), as, for example, functions of a mathemati- cal library. A construct that could be used as a point of crystallization, as user- defined types for ADTs and variables for ADOs, does not necessarily exist for logically related subprograms and they are therefore harder to detect, especially when they are not even directly connected to each other, i.e., when they do not call each other. 2.5.2.6 Primary Target Atomic Components Related subprograms can only be found by some of the techniques (e.g., Similar- ity Clustering, Type-based Cohesion), though this is not their primary goal. The main targets of the techniques described in this thesis are abstract data types and abstract data objects. They represent a clear concept and are often used in practice (see Section 6.1.1). Less frequently used but still representing a clear concept are hybrid components. Strongly connected components are also useful for program understanding, but whether they also correspond to a specific concept has to be decided for each single component. 
For example, it could be that parts of the strongly connected component belong to different, yet related atomic compo- nents, or it could be that the strongly connected component is a complete part of an atomic component. As to strongly connected components, some techniques may also be able to detect clusters that belong to different atomic components, for example, communicating parts of related atomic components, – or even form an atomic component which we have no name for yet – and nevertheless contribute to program understanding. This should be borne in mind when the techniques are judged. Terminology 42 2.5.2.7 Abstractness of Atomic Components The definitions of abstract data types, abstract data objects, and hybrid compo- nents provided above describe the ideal situation in which programmers would always be aware and respect the encapsulation of these atomic components. In practice, in languages like C, there is only limited support to hide the implementa- tion details of atomic components and even in modern programming languages in which means to hide implementation details exist, software developers do not always use these means. As a result, the encapsulation of atomic components is often violated by direct accesses which bypass the accessor functions of the atomic components. In general, we use the adjective pure in front of an atomic component to denote that all accesses to its internal parts proceed through its interface and the adjective permissive in front of atomic components which suffer from encapsulation viola- tion. We use the convention that, when no adjective is in front of an atomic com- ponent, it is a permissive atomic component. This convention was selected because permissive components are much more frequent than pure components among the atomic components identified by a group of software engineers that analyzed several systems manually for our evaluation. Actually, their task involved deciding which function accessing internal elements of a potential atomic component were part of the abstraction and which were not. The presence of these encapsulation violations should be taken into consideration by reverse engineering techniques which attempt to identify atomic components in an automatic or semi-automatic fashion. In other words, the techniques and methods described in this thesis are aimed at identifying permissive components. These permissive components can then be encapsulated by further automatic pro- gram transformations if necessary. Such program transformations are, for exam- ple, described by Fanta and Rajlich (1999) but are not further discussed in this thesis. 43 Chapter 3 Basic Structural Information As stated in the introduction, the techniques described in this thesis are mainly based on structural information that is directly derivable from source code. This chapter presents the exploited structural information in a programming-language independent manner. This chapter also introduces the means to describe the tech- niques in a detailed way. The next chapter shows how the abstract model intro- duced in this chapter can be instantiated for the programming language C. 3.1 Base Entities and Relationships The leveraged basic information can be described by an entity relationship model whose entities are the smallest significant elements at the architectural level, namely, the architectural quarks: user-defined types, global subprograms, vari- ables and constants. 
This section describes the relationships among architectural quarks used to detect atomic components. These relationships can be found in most procedural programming languages. How these relationships are derived from C code is described in Chapter 4, and how they are actually used by heuristics for atomic component detection is described in Chapters 5 and 7.

The entity-relationship model used to describe the information leveraged for component recovery makes use of inheritance for both entities and relationships. The model will be extended successively in the following sections, both in terms of additional entities and relationships and in terms of refinement of existing entities and relationships by means of inheritance. A summary of the final model can be found in Appendix A.

3.1.1 Architectural Quarks and their Relationships

The architectural quarks are summarized by Figure 3-1. Variables and constants are subsumed by an abstract class object. (An abstract class is a class of which no instances exist; it is used to express common properties of its derived classes. Abstract classes will be printed in italics in the following inheritance hierarchies.) An object should not be confused with an abstract data object. An abstract data object is an atomic component that contains objects (variables and constants). Neither should it be confused with an "object" in the sense of object-oriented programming.

Figure 3-1. Architectural quarks type hierarchy (architectural quark with the subclasses user-defined type, subprogram, and object; object with the subclasses variable and constant).

Figure 3-2 shows the relationships among architectural quarks leveraged by the techniques for atomic component detection as an entity-relationship model. The relationships and the respective roles of the involved entities are listed in Table 3-1. This entity-relationship model is going to be refined in Section 3.1.2.

Figure 3-2. Entity relationship model for architectural quarks (entities: subprogram, user-defined type, object; relationships: call, signature-type, actual-parameter-of, reference, local-obj-of-type, of-type, part-type, same-expression).

Two of the relationships of Figure 3-2 can be refined: The signature-type of a subprogram is a type that occurs in its signature either as return or parameter type, and an object can be referenced by using its value or taking its address; a variable can additionally be set. Altogether, we have the relationships summarized by Figure 3-3. We will define the relationships in Figure 3-3 in the context of the language our analyses are aiming at in Chapter 4; a brief explanation can be found in Table 3-1. Most relationships in Table 3-1 should be fairly self-explanatory, except for the part-type relationship, which will be explained here.

Figure 3-3. Relationship type hierarchy (base relationship with the subclasses signature-type, refined into parameter-of and return; reference, refined into use, set, and take-address-of; call; of-type; local-obj-of-type; actual-parameter-of; part-type; and same-expression).

A type T1 can be used in the declaration of another type T2. In this case, we consider T1 a part-type of T2 (Ogando et al. 1994). T2 is the composite type of T1. For example, in the following C type declarations, Item is a part type of Node.

    typedef ... Item;
    struct Node {Item i; struct Node *next;};

The part-type relationship is transitive, i.e., if T is a part-type of S and S is a part-type of U, then T is also a part-type of U.
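To make the transitive extension concrete, the following C sketch computes the transitive closure of the direct part-type relation with Warshall's algorithm. It is only an illustration, not part of the analyses described in this thesis; the type universe of Item, Node, and a hypothetical List that contains Node is an assumption made for the example.

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical type universe: Item, Node, and a List assumed to contain Node. */
    enum { ITEM, NODE, LIST, NUM_TYPES };
    static const char *type_name[NUM_TYPES] = { "Item", "Node", "List" };

    /* part_type[s][t] holds if s is a directly declared part-type of t. */
    static bool part_type[NUM_TYPES][NUM_TYPES] = {
        [ITEM][NODE] = true,   /* Item is used in the declaration of Node */
        [NODE][LIST] = true,   /* Node is assumed to be used in the declaration of List */
    };

    /* Warshall's algorithm: extend the relation to its transitive closure. */
    static void close_part_type(void)
    {
        for (int k = 0; k < NUM_TYPES; k++)
            for (int s = 0; s < NUM_TYPES; s++)
                for (int t = 0; t < NUM_TYPES; t++)
                    if (part_type[s][k] && part_type[k][t])
                        part_type[s][t] = true;
    }

    int main(void)
    {
        close_part_type();
        for (int s = 0; s < NUM_TYPES; s++)
            for (int t = 0; t < NUM_TYPES; t++)
                if (part_type[s][t])
                    printf("%s is a part-type of %s\n", type_name[s], type_name[t]);
        return 0;
    }

Running the sketch prints the two declared pairs plus the derived pair, i.e., Item is also reported as a part-type of List.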
Table 3-1. Relationships among architectural quarks.
    Relationship          Source S     Target T            Meaning
    call                  subprogram   subprogram          S calls T
    set                   subprogram   global variable     S sets the value of T
    use                   subprogram   object              S uses the value of T
    take-address-of       subprogram   object              S takes the address of T
    parameter-of          subprogram   user-defined type   S has a formal parameter of T
    return                subprogram   user-defined type   S returns a value of T
    local-obj-of-type     subprogram   user-defined type   S has a local object of type T
    actual-parameter-of   object       subprogram          S is an actual parameter in a call to T
    of-type               object       user-defined type   S is of type T
    same-expression       object       object              S and T occur in the same expression
    part-type             type         type                S is a part type of T

3.1.2 Record Components

The entity-relationship model in Figure 3-2 contains only the principal constituents that are to be grouped into atomic components by the techniques and the principal relationships between these constituents that are considered for this. Some of the techniques presented in the course of this thesis go beyond these entities and relationships by also considering references to record components of formal parameters and of local and global objects. That is why we enhance the entity-relationship model in Figure 3-2 by explicitly modeling record components.

Record Types. The syntax for record type declarations may vary across programming languages, but their essence is to specify the record components of a user-defined record type and their respective types. For example, the following C record declarations:

    struct Complex {float re, im;};
    struct List {
      struct List *next;
      struct Complex c1;
      struct Complex c2;
    };

define two record types Complex and List. Complex has the record components re and im of type float. List is a list of pairs of Complex and has, therefore, a List pointer to the next element in the list and two record components c1 and c2 for the two complex numbers.

Record Objects. Variables and constants of record types are called record variables and record constants, respectively; both are referred to as record objects. As instances of a record type, they comprise all record components of the type of which they are declared and, transitively, of all types of these record components. This way, accesses to record components across multiple levels of record objects are possible. For example, given the following declaration of a record variable in C (corresponding to the declarations of List and Complex above):

    struct List mylist;

the following record component access is across two levels:

    mylist.c1.im

Modeling Record Components. Each record object has its own separate set of record components (possibly of multiple levels) as specified by the record components of the record type and its part types. That is, we actually have two kinds of record components:
1. Record component specifier: A record component within a type declaration, which defines a part of the structure of all instances of this type.
2. Record component instance: An actual record component of a record object that is associated with a memory location.
Record components are separated from architectural quarks in the extended entity type hierarchy because only the latter may be grouped into atomic components; record components always belong to their enclosing type or object, respectively.

Figure 3-4. Base entity type hierarchy (base entity with the subclasses architectural quark and record component; record component with the subclasses record component specifier and record component instance).

3.1.2.1 Enclosing Relationship for Record Components

In order to capture where a record component actually belongs, a new enclosing relationship is introduced. If R is a record component specifier and T is the type in which R is declared, T is the enclosing of R. Furthermore, if an object V is declared of this type T, a new record component instance R' is added for R whose enclosing is V. If R itself is of a record type T', a record component instance R'' is added for each record component of T', and R' is the enclosing of each R''. Table 3-2 summarizes the enclosing relationship. Note that it is defined for two different domains.

Table 3-2. Relationships among base entities.
    Relationship   Source S                     Target T                                   Meaning
    enclosing      record component specifier   type                                       S is enclosed by T
    enclosing      record component instance    global object, record component instance   S is enclosed by T

Furnishing each object with its own set of record components allows us to distinguish accesses to the same logical record component of different objects and also to make a distinction between an access to a logical record component of a global object on the one hand and to the same logical record component of a local object or parameter on the other hand.

Example. The variable mylist in Figure 3-5 has the tree of record components shown on the right-hand side of the figure. Note that it inherits the record components re and im twice because it has two record components of type Complex, namely, c1 and c2. Transitively, mylist has 7 (partly composite) record components, i.e., there are 7 ways to access the internal components of mylist. The type struct List, on the other hand, has only three record components reachable by reverse enclosing edges.

Figure 3-5. Enclosings of a composite variable (the declarations struct Complex {float re, im;}, struct List {struct List *next; struct Complex c1; struct Complex c2;}, and struct List mylist, and the resulting tree of record component instances next, c1, c1.re, c1.im, c2, c2.re, c2.im enclosed by mylist).

With this model, we can distinguish three different record component accesses in the following code as shown in Figure 3-6:

    struct List gl1, gl2;
    foo (struct List pl)
    {
      gl1.c1 = gl2.c1 = pl.c1;
    }

Figure 3-6. Record component references (the references of foo to the record component c1 of gl1, to the record component c1 of gl2, and to the record component specifier c1 of the type struct List).

Note that we do not explicitly model references to local objects and parameters (see Section 3.1.2.3). Note also that Figure 3-6 does not capture the induced references to the global variables gl1 and gl2 since we are primarily interested in references to record components in this section. The next two sections describe how references to objects and record components are distinguished.
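The distinction between record component specifiers and record component instances can be made concrete by a small C sketch. The data model and names below are hypothetical and serve only as an illustration, not as the representation actually used in this thesis; the sketch rebuilds the tree of record component instances of mylist from the specifiers of List and Complex and prints each instance together with its enclosing.

    #include <stdio.h>

    /* Hypothetical data model: a record type is a list of record component
       specifiers; a specifier may itself be of a record type. */
    struct record_type;

    struct specifier {
        const char *name;
        const struct record_type *type;  /* non-null if the component is again a record */
    };

    struct record_type {
        const char *name;
        int n;                           /* number of record component specifiers */
        const struct specifier *comp;
    };

    /* The example types: struct Complex { float re, im; };
       struct List { struct List *next; struct Complex c1, c2; }; */
    static const struct specifier complex_comps[] = { { "re", 0 }, { "im", 0 } };
    static const struct record_type complex_type = { "Complex", 2, complex_comps };

    static const struct specifier list_comps[] = {
        { "next", 0 },                   /* pointer component, treated as non-record here */
        { "c1", &complex_type },
        { "c2", &complex_type },
    };
    static const struct record_type list_type = { "List", 3, list_comps };

    /* Create (here: print) one record component instance per specifier,
       recursing into the part types; 'enclosing' names the object or
       component instance the new instance is enclosed by. */
    static void instantiate(const char *enclosing, const struct record_type *t)
    {
        for (int i = 0; i < t->n; i++) {
            char path[128];
            snprintf(path, sizeof path, "%s.%s", enclosing, t->comp[i].name);
            printf("instance %-14s enclosing %s\n", path, enclosing);
            if (t->comp[i].type)
                instantiate(path, t->comp[i].type);
        }
    }

    int main(void)
    {
        /* struct List mylist;  yields the seven component instances of mylist */
        instantiate("mylist", &list_type);
        return 0;
    }

The output lists the seven instances mentioned above (next, c1, c1.re, c1.im, c2, c2.re, c2.im), each with the object or component instance that encloses it.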
3.1.2.2 Refining the Reference Relationship for Record Components

There is a substantial difference between an access to a record object as a whole and an access to the record components of such an object: According to the information hiding principle, only accessor routines of the atomic component are allowed to access record components since this requires knowledge of the underlying data structure. In the following example:

    struct List mylist1, mylist2;
    mylist2 = rest (mylist1);     /* statement 1 */
    mylist2 = mylist1.next;       /* statement 2 */

statement 1 may occur within accessor routines of List as well as in all other functions, whereas statement 2 may only be used within accessor routines according to the information hiding principle.

In order to distinguish references to record components from references to objects as a whole, we refine the set, use, and take-address-of relationships that were introduced as parts of the entity-relationship model in Section 3.1.1. The refined model is shown in Figure 3-7. References to an object as a whole are either obj-address-of, obj-set, or obj-use; references to record components are comp-address-of, comp-set, or comp-use. A relationship comp-set (f, c) has to be understood as "function f sets record component c". This always implies that f also partially sets the enclosing object of c, which is explicitly modeled by a corresponding obj-set relationship.

In compiler terminology, comp-set is comparable to a partial set. I use another term because, first, the object of a partial set relationship in compiler terminology is the composite object and not the record component and, second, the reference relationship covers only references to record components, whereas dereferences of pointers and array subscripts are also considered partial references in compiler terminology.

Figure 3-7. Reference relationships hierarchy (reference with the subclasses set, use, and take-address-of, each refined into an object and a component variant: obj-set/comp-set, obj-use/comp-use, obj-address-of/comp-address-of).

Table 3-3 lists the domains and meanings of the new reference relationships. Since it is irrelevant in the context of this thesis to further distinguish references to variables from references to constants, no additional subclasses of object references are introduced, in order to simplify the presentation in the following. For reasons of conformity, obj-set is used when a global variable is set instead of a more appropriate var-set.

Table 3-3. Reference relationships.
    Relationship      Source S     Target T           Meaning
    obj-set           subprogram   global variable    S sets the value of T
    obj-use           subprogram   object             S uses the value of T
    obj-address-of    subprogram   object             S takes the address of T
    comp-set          subprogram   record component   S sets the value of T
    comp-use          subprogram   record component   S uses the value of T
    comp-address-of   subprogram   record component   S takes the address of T

3.1.2.3 Modeling References

There are basically the following objects that can be referenced by functions: global and local variables, formal parameters, global and local constants (which must not be set), and dynamic data structures. Dynamic data structures are anonymous and can therefore not be grouped. However, dynamic data structures are accessible via pointers, these pointers are visible in the source, and they may be brought in for grouping.

Global variables and constants are explicitly captured by the entity-relationship model proposed in this chapter.
References to these can be directly represented by the reference relationships introduced in the previous section, i.e., if a record component of a variable is set, a comp-set of the record component instance and an obj-set of the variable are added. If the variable is set as a whole, only an obj-set of the variable is added.

Local variables and formal parameters are not captured by the entity-relationship model. They are only used to induce local-obj-of-type and signature-type relationships. Therefore, references to record components of local variables and formal parameters are redirected to the respective record component specifier of their type. This may at first appear confusing since a record component specifier cannot be referenced, only its instances. However, references to specifiers should be viewed as information about how the type is used. This view saves us from representing formal parameters and local variables in the entity-relationship model. Furthermore, if formal parameters and local variables were represented and a function had several parameters of the same type, the information about the usage of the type by the function would be spread over the distinct parameters, though it is not of interest how each single parameter is referenced but only how the type is used altogether. References to local objects and formal parameters as a whole are not represented as references to their type since the local-obj-of-type and signature-type relationships are already in place.

Example. Figure 3-8 illustrates the newly introduced concepts. Note that there is no obj-use for the parameter s because parameters are not explicitly represented.

Figure 3-8. Example representation of record component accesses (F obj-sets v and comp-sets its record component instance c.a; the access to s.b becomes a comp-use of the record component specifier b of struct S1, to which F also has a parameter-of edge).

    struct S1 {int a, b;};
    struct S2 {struct S1 c, d;} v;
    void F (struct S1 s)
    {
      v.c.a = s.b;
    }

3.1.3 Entity-Relationship Model for Base Entities

Due to the modeling of record components, a new entity type and its relationships have been introduced in the previous section. The updated entity-relationship model is given in Figure 3-9.

Figure 3-9. Final base entity-relationship model for component detection (the model of Figure 3-2 extended by global variable, constant, record component specifier, and record component instance together with the refined reference relationships and the enclosing relationship).

It is important to note that record components and the relationships associated with them are only used as additional information for grouping; record components themselves are not to be grouped because they always belong to their enclosing object or type, respectively. Nevertheless, in a second analysis, once the atomic components have been identified, the constituents of atomic components could themselves be clustered in order to identify cohesive subparts that represent separate interfaces or services of the atomic component. However, this is beyond the scope of this thesis and will not be pursued further.

3.2 Components

As stated in Section 2.5, there are basically two kinds of static components that are to be detected by architecture recovery: subsystems and atomic components.
The two differ in their level of granularity: Subsystems may comprise architectural quarks, atomic components, and lower-level subsystems, whereas atomic components consist of related global constants, variables, subprograms, and/or user-defined types only. This section discusses how the base entity-relationship model introduced in the last section can be extended to incorporate components.

3.2.1 Atomic Components

An atomic component can be seen as a named set of architectural quarks. In the relational model introduced in the previous section, we can capture this as follows:
• atomic components are represented by a new entity type
• the fact that an entity E belongs to an atomic component AC is expressed by a part-of relationship: E is a part of AC.

Of course, the part-of relationship is equivalent to set membership when an atomic component is regarded as a set of architectural quarks. We will use both views in the following, depending upon which one is more comprehensible in the given context.

Notation. Graphically, we will picture the two equivalent views as shown in Figure 3-10. The one on the left-hand side is the relational view, the other the set view.

Figure 3-10. Two equivalent graphical notations for atomic components (the relational view with part-of edges from V1, V2, V3, V4 to AC, and the set view with V1, V2, V3, V4 drawn inside AC).

In the following, the identifiers AC and ACi will be used to denote atomic components.

3.2.2 Subsystems

Subsystems are a means to represent hierarchical sets of related elements (architectural quarks, atomic components, and other subsystems), whereas atomic components can be thought of as flat sets of related architectural quarks. Subsystems must contain at least one atomic component; a component with architectural quarks only is considered an atomic component.

We make a clear distinction between a subsystem and its structure. A subsystem as such is an entity in the relational model; the subsystem structure is a description of the hierarchical composition of the subsystem. The part-of relationship is the spanning relationship for this hierarchy. While an atomic component can only have direct parts, a subsystem may also have indirect parts that are transitively derivable by the part-of relationship. Using the part-of relationship, one can think of a subsystem structure as a graph whose nodes are the parts of a subsystem and whose edges denote the part-of relationship. The root of the graph is a subsystem entity. The graph has the following properties:
• A subsystem structure is not necessarily a tree: The part-of relationship expresses that an element completely belongs to another element of which it is a part. For a component C that is part of another component C', this means that all parts of C are also part of C'. However, an element can be part of several components, i.e., we allow these hierarchies to overlap since it may not always be definitely clear where an element belongs.
• A subsystem structure is acyclic: Since a subsystem structure should be a real hierarchy, there must not be any cycle in the spanning part-of relationship.
• A subsystem structure does not contain redundant part-of edges, in order to be a concise description: An edge E from A to B is redundant if it is transitively derivable from other edges, in other words, if there is a path from A to B that does not contain E.

The incremental techniques described in Chapter 8 generate subsystems when an entity could be added to more than one existing atomic component (see Section 8.3.1.7).
Then, the entity and the atomic components are subsumed under a common subsystem. Furthermore, the user is allowed to add subsystems. Other than that, this thesis discusses no techniques aimed at detecting subsystems since it concentrates on the lower level of architecture (as discussed in Section 1.5, detection of subsystems is the subject of another thesis within the Bauhaus project). However, taking subsystems into account in the detection of atomic components allows for a future integration with techniques targeted at detecting subsystems.

Notation. The graphical notation introduced for atomic components can be naturally extended to subsystems as exemplified by Figure 3-11. The node C of the example in Figure 3-11 is a subsystem. Its subsystem structure is the graph spanned by the part-of relationship and rooted by C, including the nodes V1, V2, V3, V4, V5, V6, AC1, AC2, and C.

Figure 3-11. Two equivalent graphical notations for subsystems (the relational view with part-of edges from V1, V2 to AC1, from V3, V4, V5 to AC2, and from AC1, AC2, and V6 to C, and the equivalent set view with AC1, AC2, and V6 drawn inside C).

In the following, the identifiers C and Ci will be used to denote subsystems.

3.2.3 Entity-Relationship Model for Components

Components are not covered by the base entity-relationship model of Figure 3-9 on page 53. The base entity-relationship model, representing the relationships among base entities, is therefore extended in this section. For the purpose of this thesis, it is sufficient to model components and their part-of relationship.

Base entities (architectural quarks, more precisely) are the constituents of components. They do not contain other architectural quarks; that is why they are separated from components in the entity type hierarchy of Figure 3-12. Atomic components and subsystems are both architectural components.

Figure 3-12. Entity type hierarchy (entity with the subclasses base entity and component; component with the subclasses atomic component and subsystem).

The part-of relationship among components and architectural quarks is specified in Figure 3-13. Note that the part-of relationship is an n:m relationship, i.e., the same entity can be part of different components. This is due to the concession to overlapping components.

Figure 3-13. Part-of relationship among architectural quarks and components (architectural quarks and components can be part of components, i.e., of atomic components or subsystems).

3.3 Module Decomposition

Components are used to describe cohesive parts and hence provide a logical view on the system. These components may differ from the actual or physical decomposition of the system into files. In order to describe the physical decomposition, means similar to those used for components can be used. In order to describe modules, a new module entity is introduced that is derived from entity. Analogously to components, the part-of relationship will be used to identify the entities of a module. More precisely, part-of (E, M) holds for an entity, E, and a module, M, if and only if E is declared in M.

3.4 Resource Usage Graph

An instance of the entity-relationship model introduced in the previous sections to capture the basic structural information leveraged for atomic component detection can be represented by a graph. This graph is referred to as resource usage graph and is going to be used to describe the techniques for atomic component detection in more detail.
In the reverse engineering literature, the resource usage graph is also often referred to as resource flow graph (Müller and Uhl, 1990). However, it may be argued that the term resource flow graph is inappropriate since there is no real flow of resources in the information represented by this graph. The graph contains only very crude control and data flow information, namely (unordered) calls among subprograms and (directly visible) references to objects by subprograms. Additionally, it describes other aspects like the parameter types a subprogram has or the decomposition of types into other types or record components. Hence, we prefer the term resource usage graph in the following.

More formally, a resource usage graph is an attributed typed multi-graph G = (N, E) whose set of nodes, N, consists of architecture-relevant elements and whose edges, E, are relationships among these elements. Edges can be either directed or undirected. The resource usage graph is typed in the sense that we distinguish different types of nodes (entities) and different kinds of edges (relationships). Nodes and edges are refined by means of subtyping and may be annotated with additional information (e.g., a signature edge of a function, F, to a composite type, T, is annotated if F accesses internal components of a parameter of type T).

3.4.1 Nodes

The nodes of the resource usage graph are all entities introduced in the course of this thesis; the edges are the relationships. A summary of all entities and relationships can be found in Appendix A.

3.4.2 Edges

Non-symmetric relationships are represented as directed edges. For directed edges between two nodes, the two involved nodes can be referred to as follows: Let e be a directed edge; then source(e) is the node at which e starts and target(e) is the node at which e ends.

Symmetric relationships, such as same-expression, will be represented by undirected edges for which source and target are undefined. For undirected edges, we will use the term nodes to refer to the involved nodes: Let e be an undirected edge between the nodes n1 and n2; then nodes(e) = {n1, n2}. The definition of nodes can be extended to directed edges as follows (let e be a directed edge): nodes(e) = {target(e), source(e)}. Note that nodes(e) is {n} when e connects the node n with itself.

Example. What kinds of entities and relationships we actually have depends on the programming language that is to be modeled. Though the exact way of deriving the resource usage graph from C code will only be described in Chapter 4, a simple anticipating example will illustrate the concept. The resource usage graph for the C program in Figure 3-14(a) is shown in Figure 3-14(b).

Figure 3-14. Example resource usage graph. Part (a) shows the program below; part (b) is the graph with the call edges from main to put and get, the obj-set edge from put to var, the obj-use edge from get to var, the parameter-of edge from put to T, the return edge from get to T, and the of-type edge from var to T.

    typedef int T;
    T var;
    void put (T t) { var = t; }
    T get () { return var; }
    main () { put (5); get (); }

3.4.3 Attributes

Nodes and edges of the resource usage graph may be annotated: x.a denotes an attribute a of node or edge x. In the course of this thesis, only boolean attributes are needed. If a boolean attribute is used in an expression as in

    e = signature (S, T) ∧ e.non-abstract

it is to be read as "it is the case that e is non-abstract".
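As an illustration of how such an attributed typed multi-graph might be represented, the following C sketch encodes the resource usage graph of Figure 3-14 as plain node and edge tables. The layout and the names are assumptions made for this example only and do not reflect the actual implementation used in this thesis.

    #include <stdio.h>

    /* Minimal sketch of an attributed, typed multi-graph. */
    enum node_type { SUBPROGRAM, VARIABLE, TYPE };
    enum edge_type { CALL, OBJ_SET, OBJ_USE, OF_TYPE, PARAMETER_OF, RETURN_TYPE };

    struct node { const char *name; enum node_type type; };
    struct edge {
        int source, target;     /* indices into the node table */
        enum edge_type type;
        int non_abstract;       /* example of a boolean edge attribute */
    };

    /* The resource usage graph of the small program in Figure 3-14. */
    enum { PUT, GET, MAIN, VAR, T };
    static const struct node nodes[] = {
        [PUT]  = { "put",  SUBPROGRAM },
        [GET]  = { "get",  SUBPROGRAM },
        [MAIN] = { "main", SUBPROGRAM },
        [VAR]  = { "var",  VARIABLE },
        [T]    = { "T",    TYPE },
    };
    static const struct edge edges[] = {
        { MAIN, PUT, CALL, 0 },         { MAIN, GET, CALL, 0 },
        { PUT,  VAR, OBJ_SET, 0 },      { GET,  VAR, OBJ_USE, 0 },
        { PUT,  T,   PARAMETER_OF, 0 }, { GET,  T,   RETURN_TYPE, 0 },
        { VAR,  T,   OF_TYPE, 0 },
    };

    int main(void)
    {
        static const char *kind[] =
            { "call", "obj-set", "obj-use", "of-type", "parameter-of", "return" };
        for (unsigned i = 0; i < sizeof edges / sizeof edges[0]; i++)
            printf("%s -%s-> %s\n", nodes[edges[i].source].name,
                   kind[edges[i].type], nodes[edges[i].target].name);
        return 0;
    }

The field non_abstract stands in for a boolean edge attribute such as e.non-abstract; a complete representation would, of course, also need the subtype hierarchies of nodes and edges as well as undirected edges.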
3.4.4 Notational Conventions

We will use an italic sans-serif font for nodes and edges of the resource usage graph to make explicit that we mean a concept of the resource usage graph in a given context as opposed to a programming language concept: Type is a concept of the resource usage graph, whereas "type" is a concept of programming languages. This increases readability, as in "There is only one call among two subprograms though there may actually be many calls between these subprograms" instead of "There is only one call edge among the two subprogram nodes that are used to model the two given subprograms in the C code though there may actually be many calls between the subprograms in the C code."

3.5 Predicates and Functions for Nodes and Edges

The resource usage graph will be used to describe the techniques to detect atomic components. The descriptions of the techniques contain primitive predicates and functions for nodes and edges that will be introduced in this section.

3.5.1 General Predicates

The information represented by a resource usage graph can be written out using a predicate notation; each relationship is represented by a clause, referred to as edge type predicate in the following. The arguments of a clause are entities. For example, the fact that the subprogram get uses the variable var can textually be represented as obj-use (get, var). The first argument of a clause is the subject, the second the object of the relationship, or in terms of the resource usage graph: the first is the source of the edge that represents the relationship and the second is the target of this edge. We will use predicate notation within text; in figures, however, we will prefer edges.

Unary clauses will be used for node and edge type predicates. For example, to express that x is a subprogram, we can write subprogram (x) as a node type predicate. This notation can also be used to specify the type of an edge or relationship, respectively. For example, set (e) says that the relationship e is a set relationship. Given the relationship type hierarchy introduced before, the predicate reference (e) is true whenever set (e) holds. Before we can define this more formally, the notation to express the is-a relationship among types of entities and types of relationships is introduced.

Let T1 and T2 be types of either entities or relationships. Then T1 is-a T2 expresses that T1 is a T2, i.e., whatever is true for T2 is also true for T1 (but not necessarily vice versa). The is-a relation is transitive, i.e., if T1 is-a T2 and T2 is-a T3, then T1 is-a T3. Using the is-a relationship, we can define node and edge type predicates: T(x) holds when x is of type T or of a type T' where T' is-a T.

In some contexts, we want to abstract from a specific subclass of a base entity or relationship, yet without treating all kinds of base entities or relationships alike. For example, we may treat two edges of the kinds parameter-of and return equally because they are both signature-type edges and the distinction does not matter in the given context. But we do not want to consider parameter-of equivalent to of-type because the semantic gap is too wide. Based on the is-a relationship, we consider two relationships, A and B, equivalent (denoted by A ~ B) if there is a relation type, T, where T(A) and T(B) and T is-a base relation and T ≠ base relation.
Likewise, we consider two entities, A and B, equivalent if there is an entity type, T, where T(A) and T(B) and T is-a base entity and T ≠ base entity.

3.5.2 General Neighbor Functions

The successors and predecessors of a node n in a resource usage graph R = (N, E) can be defined as follows:

    successors(n) = {x | ∃ e ∈ E: source(e) = n ∧ target(e) = x}
    predecessors(n) = {x | ∃ e ∈ E: source(e) = x ∧ target(e) = n}

In the case of successors and predecessors, the direction of the edges is relevant; that is why they are only defined for directed edges. If undirected edges are involved, we can use the concept of neighbors. The set of neighbors of a node n is the union of its successors and predecessors plus all nodes that are directly connected to n by undirected edges. This can be expressed as follows:

    neighbors(n) = ∪_{e ∈ E} {n' | nodes(e) = {n, n'}}

Note that for the above definition of neighbors, n is its own neighbor exactly when:

    n ∈ neighbors(n) ⇔ ∃ e ∈ E: nodes(e) = {n', n} ∧ n = n' ⇔ ∃ e ∈ E: nodes(e) = {n}

The definitions for successors, predecessors, and neighbors do not take into account by what kind of edges the nodes are connected, which is generally relevant to the techniques to detect atomic components. Furthermore, for some of the techniques, only edges for which a certain boolean edge attribute holds are relevant. Therefore, we will extend the definitions of the neighbor functions above. Let ST be a set of type predicates for edges and a be a boolean edge attribute:

    successors(n, ST, a) = {x | ∃ e ∈ E: source(e) = n ∧ target(e) = x ∧ (∃ T ∈ ST: T(e)) ∧ e.a}
    predecessors(n, ST, a) = {x | ∃ e ∈ E: source(e) = x ∧ target(e) = n ∧ (∃ T ∈ ST: T(e)) ∧ e.a}
    neighbors(n, ST, a) = ∪_{e ∈ E ∧ (∃ T ∈ ST: T(e)) ∧ e.a} {n' | nodes(e) = {n, n'}}

In order to leave out the restricting boolean edge attribute in cases in which it is not needed, the following abbreviations are defined (let true be an attribute that is always true):

    successors(n, ST) = successors(n, ST, true)
    predecessors(n, ST) = predecessors(n, ST, true)
    neighbors(n, ST) = neighbors(n, ST, true)

These neighbor functions can be naturally extended to sets of nodes. Let N be a set of nodes:

    successors(N, ST) = ∪_{n ∈ N} successors(n, ST)
    predecessors(N, ST) = ∪_{n ∈ N} predecessors(n, ST)
    neighbors(N, ST) = ∪_{n ∈ N} neighbors(n, ST)

Example. For the example resource usage graph in Figure 3-14 on page 59, the neighbors would be as follows:

    successors (put) = {var, T}
    predecessors (put) = {main}
    neighbors (put) = {var, T, main}
    successors (put, {obj-set}) = {var}
    predecessors (main, {call}) = ∅
    neighbors (put, {obj-set, call}) = {var, main}
    successors ({put, get}, {reference}) = {var}

It is also useful to define the transitive closure for the neighbor functions. First, we introduce the n-th application of a neighbor function NF (successors, predecessors, neighbors):

    NF^1 (n, ST) := NF (n, ST)
    NF^n (n, ST) := NF (NF^(n-1) (n, ST), ST)  where n > 1

The transitive closure of a neighbor function NF is then:

    transitive_closure (NF (n, ST)) := NF^∞ (n, ST)

The definition of the n-th application and transitive closure of a neighbor function whose first argument is a set of nodes is analogous.
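The following C sketch shows how successors(n, ST) with an edge-type filter and its transitive closure can be computed by simple iteration until a fixed point is reached. The small hard-wired example graph and all names are assumptions made for the illustration only.

    #include <stdio.h>

    enum node { MAIN, PUT, GET, VAR, NUM_NODES };
    enum edge_type { CALL, REFERENCE, NUM_EDGE_TYPES };

    struct edge { enum node source, target; enum edge_type type; };

    static const char *node_name[NUM_NODES] = { "main", "put", "get", "var" };
    static const struct edge edges[] = {
        { MAIN, PUT, CALL },      { MAIN, GET, CALL },
        { PUT,  VAR, REFERENCE }, { GET,  VAR, REFERENCE },
    };
    #define NUM_EDGES (sizeof edges / sizeof edges[0])

    /* successors(n, ST): mark every x with an edge n -> x whose type is in ST. */
    static void successors(enum node n, const int in_st[NUM_EDGE_TYPES],
                           int result[NUM_NODES])
    {
        for (unsigned i = 0; i < NUM_EDGES; i++)
            if (edges[i].source == n && in_st[edges[i].type])
                result[edges[i].target] = 1;
    }

    /* transitive_closure(successors(n, ST)): iterate until nothing new is added. */
    static void successors_closure(enum node n, const int in_st[NUM_EDGE_TYPES],
                                   int result[NUM_NODES])
    {
        successors(n, in_st, result);
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int v = 0; v < NUM_NODES; v++) {
                if (!result[v])
                    continue;
                int step[NUM_NODES] = { 0 };
                successors((enum node) v, in_st, step);
                for (int w = 0; w < NUM_NODES; w++)
                    if (step[w] && !result[w])
                        result[w] = changed = 1;
            }
        }
    }

    int main(void)
    {
        int st[NUM_EDGE_TYPES] = { 1, 1 };   /* ST = {call, reference} */
        int result[NUM_NODES] = { 0 };
        successors_closure(MAIN, st, result);
        for (int v = 0; v < NUM_NODES; v++)
            if (result[v])
                printf("%s\n", node_name[v]);
        return 0;
    }

For this example graph, successors_closure(MAIN, {call, reference}) yields put, get, and var, i.e., exactly transitive_closure(successors(main, {call, reference})).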
3.5.3 Neighbor Functions for Base Entities

Specific variants of the neighbor functions will frequently be used in the description of the techniques for atomic component detection and are therefore defined here in Table 3-4, either in terms of the neighbor functions defined above or as the relational inverse (denoted by f^-1) of another frequently used neighbor function. If f(x) is defined as f(x) = predecessors(x, S, a), then the relational inverse f^-1 is defined as f^-1(x) = successors(x, S, a). If f(x) is defined as f(x) = successors(x, S, a), then the relational inverse f^-1 is defined as f^-1(x) = predecessors(x, S, a).

Table 3-4. Frequently used neighbor functions for (sets of) nodes.
    Name                         Domain           Range            Definition
    caller(s)                    subprograms      subprograms      predecessors(s, {call})
    callee                       subprograms      subprograms      caller^-1
    referencing-subprograms(v)   objects          subprograms      predecessors(v, {reference})
    referenced-objects           subprograms      objects          referencing-subprograms^-1
    signature-types(s)           subprograms      types            successors(s, {signature-type})
    signature-subprograms        types            subprograms      signature-types^-1
    referred-entities(s)         subprograms      objects, types   successors(s, {reference, signature-type})
    referred-by(s)               objects, types   subprograms      referred-entities^-1

3.5.4 Elements of Components

In the course of this thesis, we often need to refer to the elements of a component, i.e., all its parts. Since atomic components may only consist of architectural quarks, the definition of their elements is straightforward. Because subsystems may contain further lower-level components, the elements of a subsystem can be defined with respect to the level of granularity. This section defines the elements of the diverse kinds of components.

3.5.4.1 Direct Elements of a Component

The direct elements of a component are all entities that are directly part of this component, in other words, for which there is a part-of edge from the entity to the component:

    direct-elements (C) = {e | part-of (e, C)}

In many contexts, we are interested in particular subsets of the direct elements of a component, namely, in its subprograms, objects, and types:

    subprograms (C) = {S | subprogram (S) ∧ S ∈ direct-elements (C)}
    objects (C) = {O | object (O) ∧ O ∈ direct-elements (C)}
    types (C) = {T | type (T) ∧ T ∈ direct-elements (C)}

Analogously, the enclosing components of an entity, E, are the components of which E is a part (there can be more than one):

    enclosing-components (E) = {c | part-of (E, c)}

If C is an atomic component, direct-elements (C) denotes all its base entities. However, if C is a subsystem, direct-elements (C) yields only the direct descendants of C and not all elements in a subsystem structure with multiple levels. The definitions in the following section will refine the notion of elements for subsystems.

3.5.4.2 Indirect Elements of a Component

Because subsystem structures can have several levels, we can distinguish different kinds of elements depending on how far away they are from the root of the subsystem structure in terms of part-of edges (let C be a component):

    elements^1 (C) = direct-elements (C)
    elements^n (C) = ∪_{part-of (e, C)} elements^(n-1) (e)  for n > 1

Note that elements^n (C) = elements (C) for all n ≥ 1 if C is an atomic component.
3.5.4 Elements of Components

In the course of this thesis, we often need to refer to the elements of a component, i.e., all its parts. Since atomic components may only consist of architectural quarks, the definition of their elements is straightforward. Because subsystems may contain further lower-level components, elements of a subsystem can be defined with respect to the level of granularity. This section defines the elements of the diverse kinds of components.

3.5.4.1 Direct Elements of a Component

The direct elements of a component are all entities that are directly part of this component; in other words, there is a part-of edge from the entity to the component:

direct-elements(C) = {e | part-of(e, C)}

In many contexts, we are interested in particular subsets of the direct elements of a component, namely in its subprograms, objects, and types:

subprograms(C) = {S | subprogram(S) ∧ S ∈ direct-elements(C)}
objects(C) = {O | object(O) ∧ O ∈ direct-elements(C)}
types(C) = {T | type(T) ∧ T ∈ direct-elements(C)}

Analogously, the enclosing components of an entity E are the components of which E is a part (there can be more than one):

enclosing-components(E) = {c | part-of(E, c)}

If C is an atomic component, direct-elements(C) denotes all its base entities. However, if C is a subsystem, direct-elements(C) yields only the direct descendants of C and not all elements in a subsystem structure with multiple levels. The definitions in the following section refine the notion of elements for subsystems.

3.5.4.2 Indirect Elements of a Component

Because subsystem structures can have several levels, we can distinguish different kinds of elements depending on how far away they are from the root of the subsystem structure in terms of part-of edges (let C be a component):

elements^1(C) = direct-elements(C)
elements^n(C) = ∪_{part-of(e, C)} elements^{n-1}(e)  for n > 1

Note that for an atomic component C this recursion immediately bottoms out, since its direct elements are architectural quarks, which have no parts of their own.

All elements of a component are the base entities and lower-level components in the transitive closure of the definition above:

elements(C) = ∪_{n ≥ 1} elements^n(C)

If C is an atomic component, elements(C) = direct-elements(C).

Example. The elements^1 of subsystem C in Figure 3-11 on page 56 are the atomic components AC1 and AC2 and the base entity V6, whereas elements^2(C) = {V1, V2, V3, V4, V5}. All its elements are {V1, V2, V3, V4, V5, V6, AC1, AC2}.

3.5.4.3 Partial Subset Relationship

At several places in this thesis, we have to compare two components with each other to ascertain their degree of congruence. This can be done in terms of their elements. Since the elements of components are basically sets of entities, one important piece of information is whether the elements of one component are a subset of the other's elements. However, this is sometimes too strict. A less strict way of comparison is the following partial subset relationship ⊆p:

A ⊆p B  if and only if  |A ∩ B| / |A| ≥ p,  where 0.5 ≤ p ≤ 1.0

The tolerance parameter p in this relationship can be specified by the maintainer. If set to 1.0, A must be completely contained in B. This definition still considers a component with elements {a, b, c, d} at least a partial subset of a component with elements {a, b, d, e, f} for every p ≤ 0.75, even though c is not present in the latter set of elements. Note that the partial subset relationship is not transitive for p ≠ 1. For example, {a, b, c} ⊆0.6 {a, b, d} and {a, b, d} ⊆0.6 {b, d, e}, but not {a, b, c} ⊆0.6 {b, d, e}.
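As a small illustration of the partial subset relationship, the following C sketch checks A ⊆p B for element sets encoded as integer identifiers; the encoding and the example sets are assumptions made only for this illustration.

#include <stdio.h>
#include <stddef.h>

/* Membership test over a small unsorted set of element identifiers. */
static int contains(const int *set, size_t n, int x)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == x)
            return 1;
    return 0;
}

/* A subset_p B  iff  |A intersect B| / |A| >= p  (Section 3.5.4.3) */
static int partial_subset(const int *a, size_t na,
                          const int *b, size_t nb, double p)
{
    size_t common = 0;
    for (size_t i = 0; i < na; i++)
        if (contains(b, nb, a[i]))
            common++;
    return na > 0 && (double)common / (double)na >= p;
}

int main(void)
{
    enum { A = 1, B, C, D, E };
    const int s1[] = { A, B, C }, s2[] = { A, B, D }, s3[] = { B, D, E };
    /* Reproduces the non-transitivity example of the text for p = 0.6. */
    printf("%d %d %d\n",
           partial_subset(s1, 3, s2, 3, 0.6),    /* 1 */
           partial_subset(s2, 3, s3, 3, 0.6),    /* 1 */
           partial_subset(s1, 3, s3, 3, 0.6));   /* 0 */
    return 0;
}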
3.6 Views of the Resource Usage Graph

A resource usage graph represents the facts of a system that are leveraged by the atomic component detection techniques presented in this thesis. These facts are either derived from the system directly or by means of further analyses. Often, we do not need all information but a specific excerpt. Views are parts of a resource usage graph representing special aspects. In the terminology of graphs, they are subgraphs of a resource usage graph, i.e., all nodes and edges of the view are also in the resource usage graph and, for all edges in the view, both ends are in the view as well. Examples of important views are the call view (or call graph in compiler terminology), which renders the call relationship among subprograms, or the type view, which indicates how types are related.

Views will be used in the description of resource usage graph-based analyses to point out the parts of the resource usage graph that are relevant to the analyses. They will also be used to describe the output of analyses. Table 3-5 lists some important elementary views that are directly derived from source code and which form the basis for atomic component detection. It also mentions components views that are used to represent the decomposition of components as a result of atomic component detection techniques. Components views are further described in Section 8.2. Whereas the components view describes the logical structure of the system, i.e., identifies the cohesive parts, the module view describes the actual structure of the system as a collection of modules and their declarations as it can be directly derived from the source code. The module view is also called the physical file or module structure.

Table 3-5. Common views.
• call view. Nodes: subprograms. Edges: call. Explanation: the call graph.
• type composition view. Nodes: types, record components. Edges: part-type, enclosing. Explanation: makes apparent how types are built.
• signature view. Nodes: subprograms, types. Edges: signature-type. Explanation: specifies the parameter interface of subprograms.
• type usage view. Nodes: subprograms, types, objects. Edges: signature-type, of-type, local-obj-of-type. Explanation: cross-reference for the usage of types.
• object reference view. Nodes: subprograms, objects. Edges: reference. Explanation: indicates which objects are set or used or whose address is taken by subprograms.
• same expression view. Nodes: objects. Edges: same-expression. Explanation: identifies objects that occur in the same expression.
• actual parameter view. Nodes: subprograms, objects. Edges: actual-parameter-of. Explanation: describes which object is an actual parameter of a subprogram.
• base view. Nodes: base entities. Edges: base relationships. Explanation: the union of the call, type composition, signature, type usage, object reference, same-expression, and actual parameter views.
• components view. Nodes: architectural quarks, components. Edges: part-of. Explanation: describes the decomposition of one or more components (their part-of relationships).
• module view. Nodes: base entities, modules. Edges: part-of. Explanation: describes the module decomposition; an entity is part-of a module if the entity is declared in the module.

Example. The object reference view and the call view of the resource usage graph in Figure 3-14 on page 59, for example, can be found in Figure 3-15.

Figure 3-15. Example views: the object reference view (put and get referencing var by obj-set and obj-use edges) and the call view (main calling put and get).

3.7 Identity, Affinity, and Correspondence of Components

At different places in this thesis, we have to compare components to each other. A comparison for atomic components can basically be based on two aspects of a component: its identity as an entity and its set of elements. In the case of a subsystem, the subsystem structure is a third aspect.

Since components are entities, they have their own identity. In terms of the resource usage graph, two components, or more generally, two entities, A and B, are identical if they are represented by the same node, denoted by A = B.

If components are not identical, we can still consider them comparable if they share a vast majority of their elements. Such components are said to be affine. An exact definition of affinity for atomic components and subsystems follows below. Two components are said to correspond to each other when they are either affine or identical:

correspond(A, B) = affine(A, B) ∨ (A = B)    (3.1)

If two components do not correspond to each other, they are considered dissimilar. The definition of affinity follows in the next two sections. Since base entities do not have further elements, they can only correspond to each other when they are identical.

3.7.1 Affinity for Atomic Components

Atomic components can be compared in terms of their elements using ordinary set relationships: we could require that the two atomic components have to have the same elements in order to be comparable. However, this is often too strict. A less strict comparison based on the set of elements can be made when only a certain degree of overlap is required.
Hence, two atomic components A and B are considered affine when their degree of overlap is above or equal to a user-determined affinity tolerance parameter Θ (if a maintainer insists on the requirement that the sets of elements must be equal, she can simply set Θ = 1.0):

affine(A, B) :⇔ |correspondings(A, B)| / (|correspondings(A, B)| + |dissimilars(A, B)|) ≥ Θ    (3.2)

where the correspondings of two atomic components are the elements they share:

correspondings(A, B) = elements(A) ∩ elements(B)    (3.3)

and the dissimilars are the elements that are in either A or B but not in both:

dissimilars(A, B) = (elements(A) \ elements(B)) ∪ (elements(B) \ elements(A))    (3.4)

Inequation (3.2) is equivalent to:

|elements(A) ∩ elements(B)| / |elements(A) ∪ elements(B)| ≥ Θ    (3.5)

The reason why we prefer (3.2) to (3.5) is that affinity for subsystems will be defined with the same structure as (3.2). It is obvious that affinity according to (3.2) is symmetric.
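The following minimal C sketch evaluates affinity of two atomic components via the equivalent formulation (3.5). The integer encoding of elements and the example sets (those of Figure 3-20, discussed in Section 3.7.3) are illustrative assumptions, not part of the analysis implementation.

#include <stdio.h>
#include <stddef.h>

static int member(const int *s, size_t n, int x)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] == x)
            return 1;
    return 0;
}

/* affine(A, B) iff |A intersect B| / |A union B| >= theta   (formula 3.5) */
static int affine(const int *a, size_t na, const int *b, size_t nb,
                  double theta)
{
    size_t common = 0;
    for (size_t i = 0; i < na; i++)
        if (member(b, nb, a[i]))
            common++;
    size_t unions = na + nb - common;          /* |A| + |B| - |A intersect B| */
    return unions > 0 && (double)common / (double)unions >= theta;
}

int main(void)
{
    /* Element sets of Figure 3-20: AC1 = {a, b}, AC2 = {a, b, c}, AC3 = {b, c}. */
    enum { a = 1, b, c };
    const int ac1[] = { a, b }, ac2[] = { a, b, c }, ac3[] = { b, c };
    printf("%d %d %d\n",
           affine(ac1, 2, ac2, 3, 0.6),    /* 1: overlap 2/3 */
           affine(ac2, 3, ac3, 2, 0.6),    /* 1: overlap 2/3 */
           affine(ac1, 2, ac3, 2, 0.6));   /* 0: overlap 1/3, hence not transitive */
    return 0;
}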
3.7.2 Affinity for Subsystems

Affinity for subsystems is more complicated since it cannot just be based on the set of elements. It must also take the structure into account. As an example, consider the two subsystems C1 and C3 in Figure 3-16, where elements(C1) = {A, B, C, D, E, F, AC1, C2} and elements(C3) = {A, B, C, D, E, F, AC2, AC3}. The following observations can be made for this example:

1. With respect to (3.2), |elements(C1) ∩ elements(C3)| / |elements(C1) ∪ elements(C3)| = 3/5, which is ≥ Θ for all Θ ≤ 3/5.
2. C1 and C3 differ in structure.
3. AC1 and AC2 are affine, but they enter the comparison based on elements according to (3.2) as if they were dissimilar.

Figure 3-16. Dissimilar subsystems.

Because of observation (1), these two subsystems would be considered affine (for Θ ≤ 0.6) if the definition of affinity for subsystems were only based on the set of elements according to (3.2). However, since the two of them differ in structure, we do not want to consider them affine. Moreover, the fact that AC1 and AC2 are affine should also be taken into account by a definition of affinity for subsystems.

One alternative to accommodate these requirements is to consider two subsystems affine when their subsystem structures are graph-isomorphic. A graph G1 = (N1, E1) is isomorphic to a graph G2 = (N2, E2) if there is a bijective mapping f from N1 to N2 where:

∀ n1, n2 ∈ N1: (n1, n2) ∈ E1 ⇔ (f(n1), f(n2)) ∈ E2

An additional requirement is that the bijective function must be the identity for nodes that are in both structures. Otherwise, the bijective mapping could be chosen arbitrarily, and therefore C1 and C4 in Figure 3-17 would be considered affine though the base entities (i.e., the leaves of the subsystem structure) are not at corresponding positions in C1 and C4.

Figure 3-17. Isomorphic subsystem structures.

However, graph isomorphism is too strict because, first, both subsystems must have the same number of nodes and, second, the subsystem structures must be structurally identical. Instead, we want to tolerate small divergences in elements and structure. This can be achieved by a recursive definition based on (3.2) that allows discrepancies by requiring that the direct elements of two subsystems have to be affine only to a large extent (more precisely, not to be affine but to correspond, in order to capture identical elements as well). Since the definition is recursive for the direct elements of a subsystem, a partial correspondence of the entire subsystem structures can be ensured.

The definition of affinity for subsystems is analogous to the one for atomic components. The following definition is therefore a unifying definition for both atomic components and subsystems. It differs from (3.2) only by the replacement of the term atomic component by component, which comprises both atomic components and subsystems: Two components A and B are affine if their degree of overlap is above or equal to a user-determined affinity tolerance parameter Θ:

affine(A, B) :⇔ |correspondings(A, B)| / (|correspondings(A, B)| + |dissimilars(A, B)|) ≥ Θ    (3.6)

However, the definitions of correspondings and dissimilars for subsystems differ from the ones for atomic components. The definition of these two sets for atomic components was based on the elements of the atomic components, while we restrict correspondings and dissimilars of subsystems to the direct elements of the subsystems (correspond is defined by (3.1) on page 70):

correspondings(A, B) = {(a, b) | a ∈ direct-elements(A) ∧ b ∈ direct-elements(B) ∧ correspond(a, b)}    (3.7)

dissimilars(A, B) = {a | a ∈ direct-elements(A) ∧ ¬∃(b ∈ direct-elements(B)): correspond(a, b)} ∪ {b | b ∈ direct-elements(B) ∧ ¬∃(a ∈ direct-elements(A)): correspond(b, a)}    (3.8)

The definitions of correspondings and dissimilars were made in such a way that (3.6) is equivalent to (3.2) if A and B are both atomic components. In order to show this, let A and B be two atomic components. Because direct-elements(A) = elements(A) for an atomic component, any a ∈ direct-elements(A) is a base entity. Furthermore, two base entities a and b can only correspond to each other if they are identical, i.e., correspond(a, b) ⇔ a = b. Hence:

correspondings(A, B) = {(a, b) | a ∈ direct-elements(A) ∧ b ∈ direct-elements(B) ∧ correspond(a, b)}
= {(a, b) | a ∈ direct-elements(A) ∧ b ∈ direct-elements(B) ∧ a = b}
= {(a, a) | a ∈ direct-elements(A) ∧ a ∈ direct-elements(B)}
= direct-elements(A) ∩ direct-elements(B)
= elements(A) ∩ elements(B)

Hence, the numerator of (3.6) is equal to the numerator of (3.2). Likewise, the denominators of (3.2) and (3.6) are equivalent when a = b:

dissimilars(A, B) = {a | a ∈ direct-elements(A) ∧ ¬∃(b ∈ direct-elements(B)): correspond(a, b)} ∪ {b | b ∈ direct-elements(B) ∧ ¬∃(a ∈ direct-elements(A)): correspond(b, a)}
= {a | a ∈ direct-elements(A) ∧ a ∉ direct-elements(B)} ∪ {b | b ∈ direct-elements(B) ∧ b ∉ direct-elements(A)}
= (elements(A) \ elements(B)) ∪ (elements(B) \ elements(A))

Thus, (3.6) is equivalent to (3.2) for atomic components.

The equivalence allows us to use (3.6) as the general definition for affine components. Using the general definition, we can compare subsystems to atomic components with respect to their affinity. For example, the atomic component AC and the subsystem C1 in Figure 3-18 are affine for Θ ≤ 4/7. The example in Figure 3-18 shows another consequence of basing the definition of affinity for subsystems on direct elements only: affinity as defined by (3.6) is robust against the size and structure of non-affine subcomponents, such as AC1.

Figure 3-18. Affine components.
As already said at the beginning of this section, two components correspond if they are identical or affine. Since correspondence of direct elements is required in the definition of affinity for subsystems, correspondence of subsystems is defined recursively. The recursion in the definition of correspondence ends at the leaves of a subsystem structure, which are generally base entities. Since leaves have no further elements, they correspond if and only if they are identical. That is, correspondence of subsystems is well-defined.

The definition of affine subsystems according to (3.6) is more general than the previously discussed alternative definition based on graph isomorphism. Two subsystems whose structure is graph-isomorphic (and where the bijective function is the identity for nodes that are in both structures) are obviously affine. However, the opposite direction is not necessarily true. This can be shown with the example in Figure 3-19: C1 and C2 are affine subsystems, but their structures are not isomorphic since there does not exist any bijective function between them.

Figure 3-19. Non-isomorphic, yet affine subsystems.
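To summarize the recursive structure of (3.1) and (3.6)-(3.8), here is a hedged C sketch. The component representation, the fixed tolerance THETA, and the way correspondings are counted as matching pairs are simplifying assumptions made for this illustration, not the implementation used in this thesis.

#include <stddef.h>

#define THETA 0.6

struct component {
    int id;                                     /* node identity in the RUG      */
    int is_base;                                /* 1 for base entities (quarks)  */
    const struct component *const *children;    /* direct elements               */
    size_t n_children;
};

static int correspond(const struct component *a, const struct component *b);

/* affine(A, B): |correspondings| / (|correspondings| + |dissimilars|) >= THETA,
 * with both sets restricted to direct elements as in (3.7) and (3.8). */
static int affine(const struct component *a, const struct component *b)
{
    size_t pairs = 0, lonely = 0;
    for (size_t i = 0; i < a->n_children; i++) {
        int found = 0;
        for (size_t j = 0; j < b->n_children; j++)
            if (correspond(a->children[i], b->children[j])) { pairs++; found = 1; }
        if (!found) lonely++;                   /* element of A with no correspondent */
    }
    for (size_t j = 0; j < b->n_children; j++) {
        int found = 0;
        for (size_t i = 0; i < a->n_children; i++)
            if (correspond(b->children[j], a->children[i])) { found = 1; break; }
        if (!found) lonely++;                   /* element of B with no correspondent */
    }
    return pairs + lonely > 0 &&
           (double)pairs / (double)(pairs + lonely) >= THETA;
}

/* correspond(A, B) = (A = B) or affine(A, B); the recursion bottoms out at
 * base entities, which correspond only if they are identical (3.1). */
static int correspond(const struct component *a, const struct component *b)
{
    if (a->id == b->id)
        return 1;
    if (a->is_base || b->is_base)
        return 0;
    return affine(a, b);
}

Setting THETA to 1.0 would recover the strict requirement that the direct elements of the two components correspond pairwise.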
3.7.3 Properties of the Correspondence Relationship

Correspondence as defined in the previous sections has the following properties:
• reflexive: ∀(A, Θ): correspond(A, A)
• symmetric: ∀(A, B, Θ): correspond(A, B) ⇒ correspond(B, A)
• not transitive: ¬∀(A, B, C, Θ): correspond(A, B) ∧ correspond(B, C) ⇒ correspond(A, C)
• ambiguous: ¬∀(A, B, C, Θ): correspond(A, B) ∧ correspond(A, C) ⇒ B = C

Note that ¬∀(A, B, C, Θ): correspond(A, B) ∧ correspond(A, C) ⇒ correspond(B, C) can immediately be derived from symmetry and non-transitivity (see below).

The correspondence relationship is formally defined as:

correspond(A, B) = (A = B) ∨ affine(A, B)

Correspond(A, A) holds because the two arguments are identical and identity is a reflexive relationship. Thus, correspond is reflexive.

Since identity is obviously symmetric, it remains to be shown that affinity is also symmetric. The symmetry of correspond can be shown by induction. The induction begins with the smallest components, i.e., atomic components. It cannot begin with base entities since for these, affinity is not defined.

Begin of induction. We have already shown that (3.6) is equivalent to (3.2) for atomic components. Inequation (3.2) is obviously symmetric. Thus, affinity for components at height 1 is symmetric.

Inductive step. We assume that the affinity for the direct elements of two subsystems A and B is symmetric. Then, we have to show that:

affine(A, B) ⇔ affine(B, A)

which is equivalent to

|correspondings(A, B)| / (|correspondings(A, B)| + |dissimilars(A, B)|) ≥ Θ ⇔ |correspondings(B, A)| / (|correspondings(B, A)| + |dissimilars(B, A)|) ≥ Θ

Because dissimilars(A, B) = dissimilars(B, A) is obviously true, it remains to be shown that the numerators of the fractions above are equal. Because symmetric correspondence of the direct elements of A and B was assumed, the following holds:

∀(a ∈ A, b ∈ B): correspond(a, b) ⇔ correspond(b, a)

Therefore:

correspondings(A, B) = {(a, b) | a ∈ direct-elements(A) ∧ b ∈ direct-elements(B) ∧ correspond(a, b)}
= {(a, b) | a ∈ direct-elements(A) ∧ b ∈ direct-elements(B) ∧ correspond(b, a)}
= {(b, a) | a ∈ direct-elements(A) ∧ b ∈ direct-elements(B) ∧ correspond(b, a)}
= correspondings(B, A)

The counter-example in Figure 3-20 shows that affinity is generally not transitive. With the element sets of Figure 3-20 (AC1 = {a, b}, AC2 = {a, b, c}, AC3 = {b, c}) and Θ = 0.6, AC1 is affine to AC2 (overlap 2/3) and AC2 is affine to AC3 (overlap 2/3), but AC1 is not affine to AC3 (overlap 1/3). However, for Θ = 1.0 it is indeed transitive because there must not be any dissimilar elements then (without proof).

Figure 3-20. Non-transitively affine components.

The property ¬∀(A, B, C, Θ): correspond(A, B) ∧ correspond(A, C) ⇒ correspond(B, C) can be derived from symmetry and non-transitivity by contradiction. Let us assume that ∀(A, B, C, Θ): correspond(A, B) ∧ correspond(A, C) ⇒ correspond(B, C); then:

correspond(A, B) ∧ correspond(A, C) ⇒ correspond(B, C) ⇔ correspond(B, A) ∧ correspond(A, C) ⇒ correspond(B, C)

Hence, correspond would be transitive, which was shown to be wrong.

The ambiguity of correspond is also obvious: let B ≠ C and direct-elements(B) = direct-elements(C); then, if there is an A for which correspond(A, B) holds, correspond(A, C) holds too, i.e., ∃(A, B, C, Θ): correspond(A, B) ∧ correspond(A, C) ∧ B ≠ C. Hence, there may be more than one correspondent to each component.

3.7.4 Correspondence Relationship and Views

Chapter 8 introduces incremental techniques that take a components view as input and produce a components view as output. The techniques may add elements to existing components; that is to say, the output components view may contain components of the input components view (Figure 3-21). When two incremental techniques, T1 and T2, are applied to the same input view, Vinput, the two output components views T1(Vinput) and T2(Vinput) may both contain identical components. Hence, the same component can be in the input view as well as in both output views. As a consequence of the application of two different techniques, the same component may have different elements in the two output views. This is so because the two techniques propose to add different elements to the same component. That is why identical components in different components views need not be affine; e.g., the component AC has different elements in T1(Vinput) and T2(Vinput) in Figure 3-21. However, identical components can only have different elements with respect to different views.

Figure 3-21. Identical, yet non-affine components in different views.

Example. The atomic components AC1 and AC2 in views (a) and (b) in Figure 3-22 are affine (for Θ = 0.5) but not identical. The atomic component AC1 in view (a) in Figure 3-22 is obviously identical to AC1 in view (c), but the component has different elements in (a) and (c) and the two structures are, hence, not affine with respect to these two views.
The atomic components AC2 and AC1 in views (b) and (c) of Figure 3-22 are neither identical nor affine.

Figure 3-22. Example atomic components: AC1 = {V1, V2, V3} in view (a), AC2 = {V1, V2, V4} in view (b), and AC1 = {V3, V4, V5} in view (c); an edge from an entity to a component means the entity is part of the component.

Chapter 4
Components in the Programming Language C

As a proof of concept, the approach proposed in this thesis has been implemented for programs in the programming language C. The decision to use C as a target language has practical reasons as well as reasons that lie in the language itself. Many legacy systems are written in C, and many large C systems are available in the public domain. Furthermore, C is widely used as a target language in the reverse engineering community, which allows comparable results. C supports abstraction by allowing the user to define his own types and by offering means to hide details of the implementation (see Section 4.3). Yet the support for information hiding is quite limited and commonly unused, so that reverse engineering can make a real contribution to program comprehension of C programs. Despite the dissemination of the advantages of information hiding, C is still one of the most popular programming languages. Programmers who are acquainted with the ideas of information hiding try to simulate the missing means of expression. However, many programmers do not know these principles or simply ignore them. All that makes C an interesting language from the reverse engineering researcher's point of view: there is abstraction in the language, yet not enough; programs are designed with the ideas of information hiding in mind, yet these ideas are often ignored. Last but not least, C is anything but a toy language: it has many idiosyncrasies, such as pointer arithmetic, an unsafe type system, or gotos, that make analyses of C programs difficult. If an approach works for C, it is likely that it also works for languages that are at a higher level of abstraction than C. Whether this is also true for more primitive languages like Fortran77 is discussed in Section 4.4.

This chapter describes how C is mapped onto the resource usage graph that is henceforth used as the basis of the methods proposed in the course of this thesis. An overview of the technical steps of the mapping will be described first. Then follows a brief overview of the relevant features of C that also shows how these features are mapped to the resource usage graph. Section 4.3 describes the way information hiding can be achieved in C, and Section 4.4 discusses the role of the programming language and other factors for atomic component detection in general.

4.1 Analyzing C Code

Before we describe the mapping of features in C to the resource usage graph, we will give a short overview of the technical intermediate steps of this mapping. This is helpful to understand some properties of the mapping.

Figure 4-1 presents an overview of the intermediate steps. First, the C code is preprocessed by a standard C preprocessor (CPP). As a consequence, macros are replaced before the actual analysis takes place; the discussion of macros is picked up in Section 4.2.1.
Only the preprocessed C code is analyzed by our C front end (CF), which generates an intermediate representation (IML) for each analyzed C program unit (Koschke et al., 1998). The individual intermediate representations generated for the individual program units are then linked by an IML linker into a global description of the system. Global external references are resolved at this stage. Finally, the actual resource usage analysis (RUA) takes place for the global system and maps the C program to the resource usage graph (RUG). All subprograms that do not have a body within the intermediate representation of the global system are assumed to belong to libraries; the same holds for external variables that do not have a definition. Since the preprocessor handles include files and there is no distinction between type declarations and definitions, it is not directly known to the resource usage analysis whether type declarations belong to the system. This information is provided by a command line argument to the resource usage analysis that specifies the library paths. Any declaration of a file that is in one of the given library paths is considered a library unit. Details of the resource usage analysis are described in the next section.

Figure 4-1. Mapping C code to a resource usage graph: each C file (a.c, b.c, c.c) is preprocessed (CPP) and analyzed by the C front end (CF) into an intermediate representation (a.iml, b.iml, c.iml); the IML linker combines these into system.iml, from which the resource usage analysis (RUA) derives the resource usage graph (RUG).

4.2 The Programming Language C

In Chapter 3, we introduced an abstract entity-relationship model that is mostly language-independent (we will find these entities and relationships in virtually all procedural programming languages). In this section, we refine the model for the programming language C by describing the C entities and their possible relationships that we model. The mapping of the information derived from C source to entities and relationships of the resource usage graph will be called resource usage analysis (RUA).

4.2.1 Modules and Macros

There is no module concept in C. In order to decompose a system into individually manageable parts, the programmer has to use ordinary files. Programmers often simulate the lacking concept by using one file that contains the code, thus simulating the body of the module, and one file that contains the exported declarations of the module - its specification. The latter is a so-called header file. A convention is to give the two files the same name except for the file suffix; the header file has the suffix ".h" whereas the "body" file has the suffix ".c". However, this is only a convention that helps a human reader to find the corresponding header file to a given body file. Header files have no meaning for a C compiler.
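As an illustration of this convention, a hypothetical module simulated by a header/body pair might look as follows; the file and function names are invented for this sketch, and the include directive used in the body is explained next.

/* stack.h - the simulated module specification: exported declarations only */
extern void push(int value);
extern int pop(void);

/* stack.c - the simulated module body */
#include "stack.h"            /* import the module's own specification */

static int elements[100];     /* hidden state, not visible outside stack.c (see Section 4.3) */
static int top = 0;

void push(int value) { elements[top++] = value; }   /* no error handling in this sketch */
int pop(void)        { return elements[--top]; }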
For example, they are often used to define constants. When macros are expanded, only meaningless numbers remain where symbolic names were used. In the early days of C, macros were the only way to define constants. Today, the new ANSI standard for C has means to specify con- stants. However, this is too late for legacy systems and even nowadays, program- mers often stick to macros to define constants (they even have to in many cases due to an ANSI rule that excludes constants in static expressions; see Section 4.2.5). Unfortunately, macros are very hard to handle since their usage is virtually unrestricted. The source code could be written in a way that it only obeys the syn- tax rules of C after the contained macro calls are replaced. This makes analyzing source code with arbitrary macros nearly impossible and is the reason why most reverse engineering tools ignore them, i.e., they analyze the code exactly how it is presented to the C compiler when all macros are replaced. This is also true for the C front end we are using (see Section 4.1). A better and pragmatic solution would be to treat at least simple macros as if they were expressions. On the other hand, syntactic analysis is more difficult to implement when syntax errors can be caused by omitted macro replacements. 4.2.2 Name Spaces C has three different name spaces: one for enumeration, struct, and union types, one for goto labels, and one for all other kinds of identifier. Within a name space, an identifier can occur only once at the same nesting level. However, the same name can occur in all name spaces. For example, the following is legal C code (note the semantic error in the call to malloc which is likely to occur due to over- laps in name spaces: the argument List of the operator sizeof binds to the formal parameter and therefore sizeof yields the number of bytes needed to encode a 85 The Programming Language C pointer where the space required for struct List was actually meant; the compiler cannot detect such errors): struct List {struct List* next; int value;}; void f (struct List *List) { if (List->next == null) goto List; List->value = 1; List: List->next = (struct List *)malloc (sizeof (List)); } Name binding for non-external names is done by the front end and external names are resolved by the linker. However, the different name spaces for enumeration, struct, and union types on one hand and ordinary identifiers on the other hand have to be reflected in the resource usage graph as well (goto labels are not repre- sented in the resource usage graph), e.g., a struct Queue and a type Queue are repre- sented by two separate nodes in the resource usage graph. 4.2.3 Types In C, the programmer is able to write user-defined types. This way, more abstract programs can be written. However, the type system has many idiosyncrasies that make the resource usage analysis more difficult. C has numeric (discrete and floating-point) and character base types - where the latter is actually a subtype of int. Boolean types have to be simulated by discrete types. Base types of the programming language C will be ignored by the resource usage analysis since they provide little information useful for architecture recov- ery. User-defined types can be built on top of these base types using the common type constructors pointer, array, struct, and union. Structs and unions compare to records and will be discussed in Section 4.2.3.3. 
In the terminology of C, the type con- structors array and pointer (denoted by [] and *) are declarators and only structs, unions, enums, and all base types are considered “real types”. This differs from the terminology used for most other programming languages; we will therefore stay with the common terminology and consider a declaration such as typedef int *My_Integer_Pointers[10]; Components in the Programming Language C 86 as a declaration of a type My_Integer_Pointers that is an array with range 0..9 of pointers to int. 4.2.3.1 Typedefs In other procedural programming languages, there is an explicit construct to introduce a new type. Such type declarations have a name for the new type and describe its data structure. Typedefs in C are similar to type declarations in other languages: typedef struct Node List; This typedef introduces a name List that can be used to declare objects whose data structure is actually struct Node. However, the semantics of a typedef in C is only to introduce a synonym or abbreviation for another type. Whether one declares an object of type struct Node or List for the example above does not make a differ- ence. According to the definition in Section 3.1.1, struct Node is a part-type of List. However, there is a subtle semantic difference to the kind of part-type relationship introduced in Section 3.1.1. For example, in a record declaration like: struct Node {Item value; Key key; struct Node *next;}; Item, Key, and struct Node are part-types of struct Node, i.e., a type can have several part-types (it can even be a part-type of itself), in other words, each part-type describes only a part of struct Node. On the other hand, in the case of a typedef, the data structure of the new type is completely determined by the existing type within the typedef declaration. In order to distinguish a typedef from a type declara- tion that induces several part-types, a delineate relationship is added to the entity relationship model of Section 3.1.1. Since the delineate relationship is a special kind of the part-type relationship, delineate is derived from the part-type relation- ship (delineate is-a part-type): delineate(A,B) expresses that B is defined in terms of A as a synonym or as a new type, in other words, A delineates the structure of B. The definition of delineate covers two kinds of situations in which typedefs are used in C: 87 The Programming Language C 1. to introduce a synonym or abbreviation as in: typedef struct list List; where a programmer can use List where struct list is meant and save the effort for putting struct in front of list 2. to implement a new type by an existing type as in: typedef List Queue; where Queue and List have the same data structure and a programmer is able to call the accessor functions of List with a variable of type Queue, but Queue is meant to be a new type. The semantics of typedef in C is really to introduce a synonym, hence, variables of type List could also be actual parameters of new accessor functions of Queue; they could even be assigned to variables of type Queue and vice versa. However, pragmatically, programmers often use a typedef as a type constructor introducing a new type. Whether a typedef is meant as a declaration of a synonym or a new type cannot be decided for C in general. A case like typedef struct list List; is likely a synonym but not necessarily. 
Even less obvious are declarations like: typedef List list; Due to these ambiguities, the definition of delineate does not distinguish between typedefs for synonyms and new types in C. In a language that has means to specify the distinction explicitly, one would derive two subtypes of delineate for the two usages. For example, in Ada there are two explicit concepts to distinguish delin- eate: • In order to introduce a synonym to an existing type, a subtype declaration can be used: subtype Queue is List; Queue and List are equivalent. • In order to introduce a new type based on an existing type, a new type can be derived by the keyword new: type Queue is new List; Components in the Programming Language C 88 In this case, Queue inherits all primitive operations of List. However, variables of Queue cannot be assigned to variables of type List, nor can variables of type List be assigned to variables of type Queue; the two types are really distinct. 4.2.3.2 Enums Enumerations of a finite set of discrete values can be expressed by enums in C. The following declaration introduces a new enum type E whose range consists of a, b, and c: enum E {a, b, c}; In C, the values of an enum type are treated as constants of type int, i.e., assign- ments between enum and int variables are allowed. Enums provide little abstraction and atomic component detection techniques often ignore them. However, they are frequently part of a more abstract composite structure and there are indeed examples where they constitute an atomic compo- nent, e.g.: typedef enum {true, false} Boolean; Boolean and (Boolean left, Boolean right); Boolean or (Boolean left, Boolean right); Boolean not (Boolean operand); Therefore, we explicitly capture enums in the resource usage graph. 4.2.3.3 Structs and Unions Another way to build a new type is available by declaring struct and union types as in: struct Complex {float re, im;}; This declares a new record Complex with the components re and im of type float. As opposed to all other types, the types of struct and union variables must be name equivalent if the variables are to be assigned to each other. If such a type declaration is to be used, the keyword struct has to be added when the struct name is used to identify the name space. For example, in order to declare a function with Complex as parameter, one has to write: 89 The Programming Language C void f (struct Complex c); A typedef can be used to introduce an abbreviation: typedef struct Complex Complex; void f (Complex c); As mentioned in Section 4.2.2, it is legal to use the identifier Complex twice in this context due to the different name spaces. However, f could also be specified with the explicit struct Complex in its parameter list and it would not make a dis- tinction from the point of view of the language, but as noted in Section 4.2.3.1, we do make a distinction between the struct Complex and the typedef Complex. They are different types in the resource usage graph. Otherwise, all the following func- tions would be considered similar, whereas we optimistically assume that there is a reason why the programmer introduced three different types: void f (struct Complex c;) void g (Complex c); typedef struct Complex Polar; void h (Polar c); A union is a type that stores values of different types at the same storage location. One can think of it as a struct whose components have an offset of 0. Thus, assign- ing a value to component a of an instance of the following union U will override the value of component b. 
union U {int a; float b;}; Unions are used for type conversions or for storage optimizations, when only one of the components of an object is valid at any given time. As opposed to structs, they cannot be used to implement a heterogeneous composite concept, i.e., a con- cept with several distinct components, because they can only store one compo- nent at a time and therefore generally provide little abstraction. That is why most reverse engineering techniques ignore unions. However, we consider both structs and unions. Unions are, for example, useful to recognize the similarity of two functions when they both are related to a union. The components of structs and unions are also captured by the resource usage anal- ysis. Each component is modeled by a record component no matter whether it is part of a struct or a union. The relationship enclosing identifies the struct or union to Components in the Programming Language C 90 which the record component belongs. The types of the record components are con- sidered part-types of the struct or union. More about record components follows in Section 4.2.6. 4.2.3.4 Anonymous Types One does not need a named type in C in order to declare objects or to specify parameters. The following declarations are legal: int *a[10]; char *f (int b[]); typedef struct {int a;} T; Types without name will be called anonymous. In the last example, the struct has no name. Since it has no name, it cannot be used anywhere else than in this very type declaration of T. Therefore, one could treat this declaration as if T were the record type. However, anonymous structs can also occur in object declarations. Under such circumstances, a type is needed for the object. That is why we intro- duce a type for each struct without name no matter whether the context is an object or type declaration. The latter is done for reasons of uniformity. An artificial and unique name is assigned to the anonymous type. Anonymous unions and enums are treated analogously. A type that has a name (even if it is artificial as for anonymous structs, unions, and enums) will be called a named type; it is either an intrinsic type (int, char, float, double and their respective signed/unsigned and long/short variants) or a struct, union, enum, or typedef. Declarators and hence anonymous types can appear wherever a type is allowed; thus, anonymous types are quite frequent in C programs, though using a telling name would convey more information. There is one specific context in which anonymous types are really appropriate. Parameters are always passed by value in C and changes to the parameters therefore do not have any effect on the actual argument since the parameter is a copy of the passed object. If the value of the passed object is to be changed, call-by-reference must be simulated. Consider the following C code: 91 The Programming Language C T t; void f (T pt) { pt = ...; } f (t); The assignment to pt in the body of f does not change the value of t. If it is to be changed, the parameter type of pt must be changed to a pointer to T and the address of t must be passed at the call site. Furthermore, all occurrences of pt in the body of f must be replaced by dereferences of pt. Thus, we have to rewrite the code as follows: T t; void f (T *pt) { *pt = ...; } f (&t); Now, the type of parameter pt has become an anonymous type. Note that the pro- grammer could also introduce a new pointer type for the formal parameter, such as: typedef T *PT; and use PT instead of *T for pt in the signature. 
However, the new pointer type in the signature may be misleading since it suggests f to be a function for PT as opposed to be a function with a reference parameter for T. Anonymous pointer types are adequate for reference parameters and therefore frequently used in C. This scheme often involves arrays, too, since they are more or less equivalent to pointers in C. On the other hand, it is sometimes not quite clear whether an anonymous pointer type is in fact a reference parameter. For example, given the following type declarations: typedef int Item; #define MAX 100 typedef Item stack[MAX]; We would expect signatures for accessor functions of stack as follows: void push (stack s, Item i); /* example 1 */ Item pop (stack s); Components in the Programming Language C 92 However, the programmer could have mistakenly used a type descriptor “Item *” instead of the type stack in the parameter list of a function pop: void push (stack s, Item i); /* example 2 */ Item pop (Item *s); Then, s in pop looks like a reference parameter and not like an accessor function of stack. Programmers could even write the program without the type stack as fol- lows: void push (Item []s, Item i); /* example 3 */ Item pop (Item []s); or even with anonymous types only void push (int *s, int i); /* example 4 */ int pop (int *s); All these alternative declarations are equivalent in C. However, in the latter two alternatives, the connection of push and pop as accessor functions of stack is no longer visible. In example 3, we could at least assume that push and pop are related to Item. But if only one the two functions mentions stack in its signature as in example 2, there is no easy way to find out that “Item *” actually means stack; s could also be a reference parameter of type Item. In example 4, it is not even clear that push and pop deal with objects of type Item, and one cannot even know whether the two anonymous types “int *” denote the same concept or whether they are two distinct types with the same structure. If we knew that push and pop are accessor functions of the same atomic compo- nent, we would be able to conclude that the two anonymous types “int *” in example 4 are in fact equal. However, because atomic components and their accessor functions are still to be detected, it is not known whether two anonymous types denote the same concept when the resource usage analysis takes place. Fur- thermore, because there are so many anonymous types in typical C programs that all need to be represented for each occurrence when nothing is known about their meaning, the resource usage graph would contain a huge number of anonymous types - of which most are probably only reference parameters - and, consequently, be of little help to the maintainer. For these reasons, the resource usage analysis ignores all anonymous types and reduces them to their named base types, i.e., we 93 The Programming Language C will treat a parameter declaration as in f (T *t) as if it were the signature f (T t). This makes sense because t is probably a reference parameter of type T. We will handle a declaration f (T t[]) analogously since it is equivalent to f (T *t) in C. More debat- able is the fact that we also reduce more complex anonymous types to their named base type, thus f(T **t) is also treated as signature f(T t) by the resource usage analysis. However, since anonymous types are involved, there is no other type this could be attached to. Hence, we rely on the abstraction that programmers add by using named types. 
In order to specify more precisely how we handle anonymous types, we define a few terms needed. The base type of an anonymous type is the type of object a pointer can point to or the type of elements of an array, respectively. The named base type of an anonymous type can then be defined recursively. Let A be an anonymous type expression, then: • if the base type of A is a named type T, then the named base type of A is T • if the base type of A is an anonymous type A’, then the named base type of A is the named base type of A’ Example. Let us make an example to illustrate these definitions. Given the fol- lowing anonymous type: T *a[] The anonymous type of a is an array of pointers to T, its base type is a pointer to T; this one’s base type is T, and the named base type of a is T. Only user-defined named base types occur in the resource usage graph and the resource usage analysis reduces any occurrence of an anonymous type to its user- defined named base type if there is any, i.e., in declarations as in int *a[]; the type of a is ignored. 4.2.4 Global Variables Only global variables are represented in the resource usage graph since local vari- ables are not relevant at the architectural level; nevertheless, local variables are Components in the Programming Language C 94 not completely disregarded as Section 4.2.7 will show. Henceforth, if nothing else is said, variable always means global variable. The type of the variable in the resource usage graph is its user-defined named base type, if there is any, referred to by an of-type edge. If the variable has no user- defined named base type, no of-type edge is introduced. If two variables (or constants), A and B, occur in the same expression on the left and right hand side of assignments, in conditions of control statements, or in an actual parameter list of a function call, then same-expression (A,B) holds. It is suffi- cient for them to be once in the same expression, how often they actually occur is not regarded, and they also may occur at different nesting levels of the expression. This definition will be extended in Section 5.5 when the motivation for this rela- tionship becomes clearer. 4.2.5 Constants In C, there are basically two ways to declare constants: By means of a macro and by means of the keyword const. Using macros is a historical relict from Kernighan and Richie C (1978) where the keyword const did not exist. The only way to intro- duce symbolic names for constants was to use simple macros such as: #define Max 0 In ANSI-C (1989), the same could be expressed as: const int Max = 0; Such constants may be initialized at their declaration and thereafter not be changed anymore. Nevertheless, many programmers are still using macros to define constants partly because constants must not occur in constant expressions (Arnold and Peyton, 1992). C is unlike C++ in this regard. As said in Section 4.2.1, macros are already expanded for our front end by a pre- processor. That is to say, only constants annotated with const are recognized and represented in the resource usage graph. 95 The Programming Language C There are also objects declared as regular variables that are in fact used as con- stants, i.e., they are initialized in their declaration but never set elsewhere yet not annotated with const. This may be just for that particular system under investiga- tion in which the subsystem that contains the variable is used or because the pro- grammer forgot to declare the variable explicitly as constant. 
It is safe to say that a variable is actually a “constant” if it is set only once and its address is never taken. The opposite conclusion is not necessarily true: A variable whose address is taken could still be a constant when no dereference changes the variable’s value. A better approximation could therefore be ascertained by data flow analy- ses. However, since the automatic techniques for atomic component detection treat constants like variables, we do not need data flow analyses. 4.2.6 Record Components Record components are used to model the components of structs and unions types. For each record component R of a type T, a record component specifier for R is introduced whose enclosing is T. For each global object V of type T, V inherits all transitive record component instances of T as described in Section 3.1.2. 4.2.7 Functions In C terminology, subprograms are called functions no matter whether they return a result or not. A function has a definition that specifies the signature of the func- tion and its body and it has an optional declaration that only specifies its signa- ture. The latter is used when the program is decomposed into several files and the function is to be exported. However, this is not a must. If functions are neither declared nor have a definition, their designated default return type is int and their parameter list is assumed to be variable and unspecified, i.e., any number and type of actual parameters can be passed. From the standpoint of program under- standing, little is to learn about such functions. Even worse are contradicting function declarations. A function can be declared in one file as void f(int i); and in another file as int *f(float f; char *s); Components in the Programming Language C 96 Since the compiler analyzes always one file (more precisely, the file and its include files; see Section 4.2.1) per compilation, these contradicting declarations will not be detected. The compiler will generate code for the current declaration of f. This can lead to run-time errors that are hard to track down. Our linker per- forms a global analysis and finds such contradicting function declarations (see Section 4.1). Functions can only be declared at file level and are either extern or static. Extern functions are globally visible, i.e., they can be called from other files. Static func- tions are local to a file. If nothing is said, functions are extern per default. Both kinds of subprograms are represented in the resource usage graph in the same way. Not all functions need to have a definition. The code of functions can be linked into the program from libraries. When the resource usage graph is generated by a global analysis (see Section 4.1), functions without definition can be identified. They are considered library units. In C, there are basically two ways of calling a function: either calling it directly by its name or via a function pointer. Generally in the case of function pointers, several different functions could be called via a given function pointer at a spe- cific location in the source depending on the function pointer’s value at run-time. Static data flow analysis can yield an estimation which functions may be called. The set of possibly called functions is usually based on conservative assumptions. At least compilers have to make conservative assumptions to avoid generation of erroneous code. To which degree reverse engineering analyses have to be conser- vative depends on the task at hand and is still an open question in general. 
Because we do without data flow information, we have to ignore calls via func- tion pointers. This does not cause any harm, for the contribution of calls to atomic component detection is generally rather limited and calls via function pointers are even less relevant, at least for the main representatives abstract data type and abstract data object. The only approaches that consider calls are Arch (Section 5.9) and Similarity Clustering (Chapter 7) and even for these, a statistical analysis revealed that calls have not much weight (Section 7.6.1). However, function point- ers can be a problem when subsystems are to be detected as one of our case stud- ies indicates (Girard, Koschke, 1997a). 97 The Programming Language C Actually, there are more than these two kinds of calls. It is also possible to switch between the contexts of functions whose activation record is currently on the runtime stack by means of setjmp and longjmp under Unix. This is often used to implement exception handlers. However, since this is not part of the language and depends completely on the runtime system, we will ignore such calls. Functions make use of types by mentioning them in their signature or using them to declare local variables. For the former, we introduce signature edges and for the latter, local-obj-of-type edges to the named base types of the parameters and func- tion result, or local variables, respectively (see Section 4.2.3). Types can even occur as type of an expression at any level though they are not explicitly mentioned in the body of the subprogram. For example, in typedef struct S {int a; } T; T f (); void g () { int i; i = f().a; } function g accesses a component of type T though it has neither parameters nor local variables of type T. Such a type relationship will be reflected as local-obj-of- type since it is analogous to declaring a temporary local variable that receives the intermediate result. 4.2.8 References References to local variables, formal parameters, and global variables are repre- sented as described in Section 3.1.2.2. The references captured by the resource usage analysis are only an approximations of the actual references. This section describes the approximation and discusses the divergent points of view of the fields of compiler construction and reverse engineering on references. 4.2.8.1 Reference Information Approximation The captured references are relationships between subprograms and declared entities that can be explicitly found in the code. They represent only an approxi- mation of the real sets/uses/takes-address-of relationships. First, it could be that a Components in the Programming Language C 98 reference occurs in a path of the control flow that will never be taken (our approx- imation is an overestimation then) and, second, an object could be referenced via an alias (an object is aliased if it has more than one access path by means of static name binding or dereferenced pointers; in the following, no distinction is made between access paths via different static names or dereferenced pointers). In the case of hidden references via aliases, our approximation is an underestimation since hidden references are not detected by the resource usage analysis. Extensive control and data flow analyses would be needed to detect potential hidden refer- ences. However, these analyses can only yield an estimation since possible con- trol flow as well as aliasing are statically undecidable in general because they may depend on values that can only be established at run-time. 
Fortunately, it is not a problem for the purpose of atomic component detection when a reference occurs in an unfeasible path. Any reference, no matter whether it is actually executed, creates a relevant static dependency and therefore must not be ignored. Furthermore, hidden references by way of aliasing to an object are generally not relevant to atomic component detection. Remember that relevant objects to the resource usage analysis are either global objects or types. The rela- tionships for the latter are caused by parameters if it is a signature relationship or by local objects if it is a local-obj-of-type relationship. Therefore, the following combinations of aliasing that could be relevant are possible: 1. aliasing between a global object and a parameter 2. aliasing between a global object and a local object 3. aliasing between global objects 4. aliasing between parameters 5. aliasing between local objects 6. aliasing between parameters and local objects Ad (1): If a global object is aliased by a formal reference parameter but the sub- program references the global object only by means of the formal reference parameter, then it is quite likely that the subprogram does not have anything to do with the global object. The subprogram is more general and handles different objects, otherwise the programmer would have hardly introduced a parameter. If the subprogram references the global object also directly, then ignoring aliasing may result in undetected additional references to the global object by means of 99 The Programming Language C the formal parameter. However, at least the directly visible accesses are taken into account. Ad (2): An alias between a global object and a local object can arise if one’s address is taken and assigned to the other one. Assigning the address of a local object to a global object will cause run-time errors when the local object’s mem- ory location is accessed by way of the global object after the function ended. It could be used to transfer the local object to another function that is transitively called by the current function. Then the activation record of the function would still exist and hence, the local object be defined. Still, there should be a strong argument why the local object was not passed as parameter (maybe because the call went through a library whose functions have a standard signature), otherwise this could not be justified to the maintainer. Therefore, I do not believe this has any significance. Assigning the address of a global object to a local object is often used by C pro- grammers for looping over global arrays or for shorthanded accesses to compo- nents as in: char g1[10]; void f1 () { /* loop over array */ char *p; for (p = g1; *p != ’\0’; p++) *p = ’a’; } struct {int c;} g2; void f2 () { /* shorthanded access */ int *i = &g2.c; *i = 1; /* use *i instead of g2.c */ } These schemes in fact lead to a wrong mapping to the resource usage graph because the modification of g2.c by f2 via the dereferenced local variable i will be re-directed to an internal access to the type of i since local variables are not repre- sented in the resource usage graph. However, at least the fact that the addresses of the variable g2 and the record component g2.c are taken is captured. Likewise, the fact that f1 modifies g1 is not detected; only that f1 takes the address of g1 is cap- tured. 
Components in the Programming Language C 100 Ad (3): Ignoring aliasing between global objects causes a miss of the interrelation between subprograms that access apparently different, yet aliased global objects. However, only if the aliasing is permanent, the miss is really problematic; other- wise these subprograms may indeed be viewed as only weakly coherent. Fortu- nately, aliasing among global objects is a very rare phenomenon. Ad (4): Aliasing between parameters would occur if both parameters were refer- ence parameters and the same actual argument is passed to both of them. But again this is not the typical way of using the function, otherwise the function would have only a single parameter instead of two aliased parameters, and there- fore the function should be treated in its general case in which the two parameters are not aliased. Ad (5) and (6): Aliasing between local objects can only take place if the address of a local object is taken and assigned to another local object because subpro- grams cannot be nested in C. This will not let the references with respect to the type go unnoticed. Remember that the local-obj-of-type is a relationship between a function and a type induced by any local object of this type. If such a relationship exists, any reference to a record component of this type will be considered (except for the references to record components of parameters); it does not matter which actual local object it is and whether the type of the actually referenced local object is the one of the local-obj-of-type relationship. Similarly, if a local object and a parameter are aliases, the access to the record component of the type will be recorded in all cases either by way of the local-obj-of-type or the signature relation- ship. 4.2.8.2 References and Dereferences From a compiler’s point of view, all occurrences of b but the first one in the fol- lowing statements are uses of the variable b: struct {int a; } *b;/* 1 */ b = 0; /* 2 */ ...= *b; /* 3 */ *b = ...; /* 4 */ b->a = ...; In the expressions of 2-4, the value of b is used to address the designated object of b; that is why a compiler considers case 3 and 4 as usages of b. However, from a 101 Information Hiding in C program understanding point of view, the designated object of b is primarily rele- vant and a programmer will abstract from the difference of b and its designated object when reading this code. Consequently, from the point of view of a reverse engineer, cases 3 and 4 can pragmatically be considered settings of b. The distin- guishing factor between cases 3 and 4 and the usage of b in 2 is that b occurs on the left hand side of the assignments in 3 and 4. The resource usage analysis used to produce the results described in this thesis follows the compiler’s point of view, hence, captures the actual semantics of dereferenced variables. Future extensions should investigate whether following the reverse engineer’s point of view is more appropriate. 4.3 Information Hiding in C In previous sections, C was criticized as a language that provides only limited support for information hiding. Nevertheless, information hiding can be realized by using the rudimentary means offered by C. Unfortunately, these means are often not used by programmers. This is probably not because the language does not support information hiding in a more direct way, but because programmers ignore these means, which suggests the conclusion that they would neither follow these principles in a more abstract language. 
This section shows how they could apply the principles of information hiding by the existing means C provides. The lack of modules in the language must be compensated by header files for the interface and files with the actual code; let us call these body files. The header file lists all extern declarations, i.e., exported declarations, and nothing else. Local function and object declarations can be hidden in the body by the keyword static. Thus, we can easily realize an abstract data object in C by putting the global objects (declared as static) into the body file. Unfortunately, this is not so easy for abstract data types. Header files are ordinary files without semantics. There is no private section, as in Ada, where we could hide private type declarations. On the other hand, for rea- sons of efficient separate compilation, the type information has to be imported by any other body file that uses the type; this is needed to ascertain the needed stor- Components in the Programming Language C 102 age for objects of this type. For these reasons, we cannot use arbitrary data struc- tures for abstract data types in C. However, the opaque types of Modula-2 can be simulated by exporting a struct name and a pointer to this struct name which repre- sents the abstract data type. The full declaration of the struct follows in the body file and is therefore hidden for all clients. This is illustrated in Figure 4-2. Actually, the type is not really abstract because it is explicitly declared as a pointer with the consequence that assignment and comparison of such abstract data types have always reference semantics. However, clients of this module are not able to access the internal fields of this abstract data type. If these programming styles are observed and if any simulated module contains only one concept, the concepts and its accessor routines can immediately be found. Unfortunately, these means are rarely used. Current practice is to export global variables and the full type structure, to distribute accessor routines among different modules, and to put several distinct concepts into the same module. 4.4 The Language and Other Factors It is a trivial realization that the programming language has to provide means to express what we want to detect. However, these means can be very rudimentary. Talented programmers are thoroughly able to simulate higher concepts with only little support by the language. Nevertheless, the degree of support by the pro- gramming language does determine the chances of an automatic technique. There are two extremes. At one end, there are assembly languages. None of the tech- niques that we will get to know in Chapter 5 works for such primitive languages. There should be at least subprograms, objects, and types in the language. How- Figure 4-2. Example abstract data type List in C. struct list; typedef struct list *List; extern void insert (List l, Item i); struct list {List next; Item i; }; void insert (List l, Item i) { ¼ } List.h List.c 103 The Language and Other Factors ever, even assembly languages have macros, symbolic addresses, and data specifi- cations that could be used as a starting point to find substitutes for subprograms, types, and objects. At the other extreme, there are modern programming lan- guages, such as Ada, Modula-2, and C++, that provide the means for data abstrac- tion and information hiding in general. 
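For C, the two idioms just described would look roughly as follows (a minimal sketch; file and identifier names are illustrative and mirror the spirit of Figure 4-2):

/* stack.h -- interface: only the exported declarations                */
void stack_init (void);
void stack_push (int i);
int  stack_pop  (void);

/* stack.c -- body: the abstract data object's state is hidden by      */
/* declaring the global objects static                                 */
static int contents[100];     /* not visible to clients                */
static int stackpointer;
void stack_init (void)  { stackpointer = 0; }
void stack_push (int i) { contents[stackpointer++] = i; }
int  stack_pop  (void)  { return contents[--stackpointer]; }

/* List.h -- abstract data type simulated by an opaque pointer         */
struct list;                  /* incomplete declaration                */
typedef struct list *List;    /* clients see only a pointer            */
extern List list_empty (void);
extern void list_prepend (List l, int item);

/* List.c -- the full declaration is hidden from all clients           */
struct list { struct list *next; int item; };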
However, there is no guarantee that these means are properly used by programmers, i.e., even for these languages the tech- niques presented in this thesis could be helpful. There are two main atomic components we are searching for: abstract data objects and abstract data types. If we want to detect abstract data objects, there should be ways to express state in the language. Pure functional languages, for example, do not support the concept of state. However, in all procedural lan- guages, we have global variables whose values can be manipulated, and these are the languages that are used in practice. The language need not have a way to spec- ify the relatedness of the global objects; such means are important to abstract data types where the constituents must be explicitly united in a type. Our experiences with the systems we studied indicate quite the opposite. An abstract data object mostly consists of a set of objects that have primitive types. Programmers avoid the effort of introducing a new record type whose components are the constituents of the abstract data object, though this would make their connection obvious. If an abstract data type is to be detected, there obviously has to be a type in the first place and the user should be able to define new types using type constructors. Among common type constructors, records are the most important ones because they allow to realize heterogeneous concepts. Records have been part of all the most popular programming languages since the early 1960s when they were introduced by COBOL. Notable exceptions are past versions of Fortran. For Fortran77, one has to find the connections of single base types that together form a heterogeneous abstract concept before the heuristics described in Chapter 5 can be brought into play. The phenomenon of apparently distinct elements together forming an abstract concept is called tupling (Koschke et al., 1997). My experiences with the large Fortran77 library Spicelib of the Jet Propulsion Laboratory are that programmers use such tuples as if they were records with a fixed order of contiguous compo- nents, i.e., if they occur in a parameter list, the components of the tuples are Components in the Programming Language C 104 mostly in the same order and there is no constituent of another tuple within the sequence of components. Other indicators derived from control and data flow analysis could be used to detect tuples, e.g., accesses to some components of the tuple may always be control-dependent on other components of the tuple. Fur- thermore, domain knowledge is probably a key factor for detecting tuples. Records and likewise simulated records, i.e., tuples, are the most relevant instru- ments to model concepts from an application domain since most such concepts are heterogeneous. Knowing the domain is often crucial for deciding whether cer- tain constituents together form a unity when this unity corresponds to a concept of the application domain. Tupling is its own field of research and it was only stated as a problem but no solutions have been proposed so far. 
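The following hypothetical C rendering of such a tuple illustrates the phenomenon (the observations in this thesis concern Fortran77 code, so this is only an illustration of the pattern, not an excerpt from Spicelib):

/* Tupling: the two parameters contents and stackpointer together      */
/* simulate one record ("stack"), but nothing in the code declares     */
/* this connection. They appear contiguously and in the same order     */
/* in every parameter list.                                            */
void stack_push (int contents[], int *stackpointer, int item)
{
  contents[*stackpointer] = item;
  (*stackpointer)++;
}

int stack_pop (int contents[], int *stackpointer)
{
  (*stackpointer)--;
  return contents[*stackpointer];
}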
I will not discuss tupling in this thesis, yet, essential observations of this field are also relevant to atomic component detection and will be summed up here: • the target of the search is a concept that could convey important information to the maintainer • the concept is not properly specified because the programming language does not provide the means to model this concept or a programmer has ignored these means • if the language does not provide adequate means of expression, programmers find ways to simulate these means in part • there are different sources of information that can be leveraged to find the con- cept: - the typical ways how programmers simulate the lacking means of expression - control and data flow analysis - domain knowledge Most techniques described in this thesis mainly follow the first path and try to find atomic components by the way the programmer could have organized them in C. To some degree, they are based on assumptions that could be derived by control and data flow analyses, but these assumptions are only rough approxima- tions and are not validated by the techniques. Two techniques indeed perform control flow analyses on the call graph, namely, dominance analysis and strongly connected component analysis. However, other control flow analyses are not being used. Data flow analyses in particular are not exploited at all. Neither is domain knowledge, other than using the conventions of programmers to convey 105 The Language and Other Factors the information with the rudimentary means of the language, leveraged by any of the automatic approaches. The method that is proposed in Chapter 9 integrates the user in the detection process and is heading toward leveraging domain know- ledge. However, a completely automatic approach incorporating domain know- ledge is not yet in sight (see Section 11.2). Components in the Programming Language C 106 107 Part II Automatic Techniques 109 Chapter 5 Basic Techniques and Metrics of Component Detection Many (semi-)automatic techniques for atomic component detections have been proposed in the literature. However, no attempt has been made so far to compare and evaluate these methods. This chapter describes published techniques and some extensions we made. It also unifies and classifies these techniques. An eval- uation of the techniques will follow in Chapter 6. 5.1 What All Techniques Have in Common At a higher level of abstraction, an abstract data type consists of a domain of val- ues for the type and some allowed operations on that type. In an implementation of an abstract data type, the domain of values is implemented by a data structure which is read and set by routines - its operations. The user of an abstract data type can declare objects of that type and pass them as actual parameters to the opera- tions. Consequently, it is a necessary prerequisite for operations of an abstract data type to mention the data type in their signature, i.e., their parameter list or their return type in the case of functions. That is, all routines with a data type T in their signature are candidates for an operation of the abstract data type T. How- ever, this prerequisite is necessary but not sufficient. Some routines simply pass a value of T to other routines and are not true operations of T. Many routines have more than one parameter type so that it is necessary to decide which one they belong to. For all kinds of routines which convert one type into another type this can be hard to judge. 
Sometimes - especially in programming languages that do not provide record types such as Fortran77 - one even has to look at several base Basic Techniques and Metrics of Component Detection 110 types of the underlying programming language in a parameter list to form one abstract data type. For example, one can have a stack implementation that passes two parameters, one for the stack contents realized by an array and one for the stack pointer implemented by an integer type. Similarly, an abstract data object represents an abstraction of a state and the oper- ations that manipulate the state. The state is implemented by a set of global objects. These objects are set and used by the operations of the abstract data object. Most of the time, programmers do not make the effort to group the global objects of an ADO together as components of a record structure to make the con- nection of the objects obvious. In many old programming languages they even would not have a chance to do so because user-defined data types are not sup- ported. Even in programs written in modern programming languages, one often finds accesses to these global objects by routines that do no belong to the ADO because of efficiency considerations. All that makes it difficult to find the objects that together make up the abstract state and the routines that really represent the ADO’s operations. Considering these facts, it is obvious that the naïve approach of grouping types together with all routines whose signature refers to them and of aggregating objects with all routines that set or use them leads to erroneously large candidate components. This strategy is discussed in the next subsection for global object references. In the rest of this section, we present heuristics proposed to avoid erroneously large ADT and ADO candidates by going beyond the simple refer- ence criterion. Some of them can detect ADTs as well as ADOs, some of them are specialized in one type of atomic component. Those that can detect both types of atomic components merge the results of ADO and ADT detection into a hybrid candidate if there is a routine that belongs to both an ADT and an ADO. 5.1.1 Working Example The C code example in Figure 5-1 will be used to illustrate the various heuristics in the following. It consists of an abstract data type List of Item with the accessor routines empty, first, rest, and prepend and an abstract data object stack of Item with the accessor routines init, push, pop, and size whose implementation is based on the abstract data type List. Type Item is declared in a file of its own, whereas the List and stack concept are declared in the same file though it would probably 111 What All Techniques Have in Common have been better to separate these distinct concepts into two files. Note also the violation of the information hiding principle in the accessor function size of stack. Because there was no function length available for List that a programmer could have used, the programmer implemented size by a direct access to the component length of type List. Hence, function size, which actually belongs to the abstract data object stack, breaks the encapsulation of List. An excerpt of the resource usage graph for this example containing all nodes and most edges is shown in Figure 5-2 (actual-parameter-of, of-type, component references are not shown). The example in Figure 5-2 also demonstrates that resource usage graphs can become quite complicated even for short programs. 
As a matter of fact, resource usage graphs even for medium size programs at the 30KLoc level cannot reasonably be visualized anymore. Figure 5-1. Example C program. /*--- file list.c ---*/ typedef struct { int len; Item cont[100];} List; List empty () { List result; result.len = 0; return result; } Item first (List l) { return l.cont[l.len-1]; } List rest (List l) { l.len--; return l; } prepend (List *l, Item t) { l->cont[l->len] = t; l->len++; } /*--- file list.c ---*/ static List stack; void init () { stack = empty (); } void push (Item i) { prepend (&stack, i); } Item pop () { Item result = first (stack); stack = rest (stack); return result; } int size () { return stack.length; } /*--- file item.h ---*/ typedef ... Item; Basic Techniques and Metrics of Component Detection 112 5.1.2 Characteristics of the Techniques From a mathematical point of view, all the techniques are basically clustering methods that differ in the underlying cluster criterion, the clustering domain and range, and whether the clusters are disjoint. A cluster is basically a set of entities that meet the underlying clustering criterion of the method (the terms cluster, group, and candidate are interchangeable). Clustering criterion. The clustering criterion is a heuristic approximation of “real” criteria for atomic component composition. Because each technique has its own criterion, this is the most distinguishing characteristic. Other properties of the clustering methods may be shared. Domain and range. The domain of a clustering method specifies what kind of entities and relationships are considered for clustering. The range, on the other hand, specifies what type of entities can be grouped together at all and hence what kind of atomic components can be detected. Disjoint clusters. Most methods produce disjoint clusters, but there are excep- tions. In the following description of the clustering methods, we will state the domain, range, clustering criterion, and whether the methods produce disjoint clusters. In doing so, the domain will be specified in terms of the views listed in Table 3-5 on Figure 5-2. Resource usage graph for the example in Figure 5-1. List firstempty rest prepend stack init pushpop size Item use use use setset call call call call parameter-of return type subprogram variable 113 Global Object Reference Heuristic page 68. The range will be stated as the kind of atomic components the method can detect: • Abstract data types (ADT) consist of user-defined types and subprograms. • Abstract data objects (ADO) consist of global objects and subprograms. • Hybrid components (HC) consist of global objects, user-defined types, and subprograms. • Related subprograms (RS) consist of subprograms only. Furthermore, we will give the reference to the originators and specify the exten- sions or modifications we made. Since the extensions and modifications are part of joint work within the Bauhaus project, they are mainly common ideas of Jean- François Girard, Georg Schied, and me. 5.2 Global Object Reference Heuristic Clustering criterion. Global objects and all the routines that reference these objects, regardless of where they are declared, are grouped together. In the case of a global error variable used in many parts of the system, this approach will collapse a large part of the system into one atomic component. Yeh et al. propose to exclude frequently used objects from the analysis to avoid this unwanted effect. Example. 
Applied to the example of Figure 5-1, Global Object Reference would find the ADO {stack, init, push, pop, size}.

Name: Global Object Reference
Reference: Yeh et al., 1995
Domain: Object Reference View
Range: ADO
Disjoint Clusters: Yes

Algorithm. For Global Object Reference, we can use the generic algorithm 5-1, which will also be used for other methods that produce disjoint clusters. The algorithm iterates over the subprograms and groups them with their relevant connected entities. What a relevant connected entity is depends upon the respective technique; in terms of the algorithm, this is decided by a generic parameter that yields for each subprogram all entities that have to be part of the same atomic component as the subprogram. This function could also exclude frequently used objects as proposed by Yeh et al. for the Global Object Reference heuristic. Algorithm 5-1 uses the union-find implementation for disjoint sets (Hopcroft and Ullman, 1983) where:
• new_set (e) defines a new set {e}
• union (s1, s2) unites the two sets s1 and s2; after the call, the two set identifiers denote the same set, i.e., s1 ∪ s2
• find (e) yields the set that contains e; since all entities will be initially put into a set and since the sets are disjoint, there is exactly one such set

Algorithm 5-1. Generic algorithm to detect disjoint atomic components.
Generic parameter:
• function connected_entities: Entity → set of Entities
Input:
• input view V
Output:
• disjoint clusters
Algorithm:
1. put each base entity in V into a set of its own:
   for each entity E in V loop
     new_set (E);
   end loop;
2. clustering:
   for each entity E in V where subprogram (E) loop
     for each entity E' in connected_entities (E) loop
       union (find (E), find (E'));
     end loop;
   end loop;
3. results: each remaining disjoint set is a cluster

To instantiate the generic algorithm for Global Object Reference, the function referenced_objects as defined in Table 3-4 on page 65 serves as the actual parameter for connected_entities, which returns all global objects referenced by a given subprogram (a small C sketch of this instantiation is given below):

(5.1)   connected_entities(S) = referenced-objects(S)

5.3 Same Module Heuristic

In the ideal situation, i.e., when the system is properly decomposed, each module contains one single atomic component. When we count on good design, we can group all declarations of a module together to an atomic component which represents the abstract functionality of the module. This is the underlying clustering criterion of the Same Module heuristic.

Clustering criterion. All related subprograms, user-defined data types, and global objects that are declared in the same module are grouped together. A subprogram is related to a data type when the data type occurs in the signature of the subprogram. Likewise, a subprogram is related to an object when the subprogram references the object.

Name: Same Module
Reference: Girard, Koschke, 1997
Domain: Object Reference View + Signature View
Range: ADO, ADT, HC
Disjoint Clusters: Yes

This heuristic was first published in a case study by Girard and Koschke (1997). However, the initial idea for this heuristic stems from Erhard Plödereder (1997). In the case of Ada, a package body and its specification would form a module. In C, modules do not exist, but programmers simulate the lacking concept by a header file f.h for the specification and a C file f.c for the body.
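The following is a minimal C sketch of the generic algorithm 5-1, instantiated for Global Object Reference. The graph access functions is_subprogram and connected_entities are assumed to be provided by the surrounding resource usage graph infrastructure; this is an illustrative sketch, not the implementation used in the thesis.

#include <stddef.h>

#define MAX_ENTITIES 10000
static int parent[MAX_ENTITIES];

static int find (int e)                   /* find with path halving    */
{
  while (parent[e] != e) { parent[e] = parent[parent[e]]; e = parent[e]; }
  return e;
}

static void union_sets (int a, int b)     /* unite two disjoint sets   */
{
  parent[find (a)] = find (b);
}

/* assumed interface to the resource usage graph */
extern int is_subprogram (int e);
extern size_t connected_entities (int e, int *out); /* here: referenced */
                                                     /* global objects  */
void cluster (int n_entities)
{
  static int neighbors[MAX_ENTITIES];
  for (int e = 0; e < n_entities; e++)    /* step 1: singleton sets    */
    parent[e] = e;
  for (int e = 0; e < n_entities; e++) {  /* step 2: clustering        */
    if (!is_subprogram (e)) continue;
    size_t k = connected_entities (e, neighbors);
    for (size_t i = 0; i < k; i++)
      union_sets (e, neighbors[i]);
  }
  /* step 3: each remaining root in parent[] represents one cluster    */
}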
Same Module assumes that programmers are disciplined and follow this header/body convention. The Same Module heuristic will fail when parts of the abstract functionality of a module are implemented elsewhere. However, it will still find individual atomic components in very large modules that have many different logical functions unless they are implemented by overlapping sets of subprograms, types, and objects.

Example. For the example of Figure 5-1, Same Module would propose one ADT {List, empty, prepend, first, rest} and one ADO {stack, init, push, pop, size} despite the fact that these declarations are in one common file: the only connections between parts of the stack and parts of the List are call relationships, which are not considered by the Same Module heuristic (see Figure 5-2 on page 112). However, if Item were declared in the same file, the Same Module heuristic would have assumed one big hybrid component consisting of the union of the two sets above (including type Item) because the stack ADO's accessor routines also mention type Item in their signature.

Algorithm. Same Module can be implemented by instantiating the generic algorithm 5-1 for disjoint clusters where referred-entities (defined in Table 3-4 on page 65) is used for connected_entities; referred-entities returns the signature types and the referenced objects of a subprogram; the module in which an entity, X, is declared, is denoted by module(X):

(5.2)   connected_entities(S) = referred-entities(S) ∩ {X | module(X) = module(S)}

5.4 Part Type Heuristic

Often, we find abstract data types that represent some sort of container of other abstract data types. For example, queues as containers of processes, or an account containing data about its owner and the deposited money. For such abstract data types there usually exists an operation that takes an element and puts it into the container. For a process queue, for example, there will be an insert routine with two arguments: the process to be inserted and the queue itself. Even though both types are mentioned in its signature, we would not consider insert to be an operation for processes but for queues. The Part Type heuristic reflects this perception. It is based on the part-type relationship which was already defined in Section 3.5.4.3.

Clustering criterion. Part Type groups a routine with those types in its signature that are not a part type of another type in the same signature. The basic assumption is that a part type is actually used to be put into its container or to be retrieved from it. The heuristic does not check this assumption; data flow analysis could validate it.

Example. This can be illustrated with the following declarations:

typedef ... Item;
typedef struct {int len; Item contents [100];} List;
void prepend (List *l, Item t);

Here, Item is a part type of List. That is why prepend would be an operation of List according to Part Type and not of Item, though both types are mentioned in the signature of prepend. The detected ADT consists of {List, empty, prepend, first, rest}.

Name: Part Type
Reference: Liu and Wilde, 1990
Domain: Signature View + Type Composition View
Range: ADT
Disjoint Clusters: Yes

Algorithm. Again, the generic algorithm for disjoint clusters can be used to implement the Part Type heuristic.
The connected_entities of a subprogram are all signature types excluding part types:

(5.3)   connected_entities(S) = {T | T ∈ signature-types(S) ∧ ¬∃ T' (T' ∈ signature-types(S) ∧ part-type(T, T'))}

5.5 Same Expression Heuristic

Name: Same Expression
Reference: Unpublished; Rainer Koschke
Domain: Object Reference View + Same Expression View
Range: ADO
Disjoint Clusters: No

Often, the state of an abstract data object is implemented by a set of separate global objects instead of a single record object. For example, an ADO stack could be based on the following variable declarations:

Item contents [100];
int stackpointer;

However, the fact that the apparently separate constituents of an abstract data object contribute to a common purpose often becomes visible when they occur in the same expression, especially when one of the constituents is a composite data structure and the other constituent is used as some kind of selector for the composite data structure. In the above example, we can expect to find an expression such as:

contents [stackpointer]

The level at which the selector appears does not matter. The stack example could also be as follows:

Item **contents;
int stackpointer;

...(*contents)[stackpointer]...

Likewise, common occurrence in a mathematical expression can be a hint for relatedness:

printf ("area = %d\n", length * width);

We will also consider objects jointly occurring in the actual parameter list of the same call as being in the same expression, extending the definition of same expression given in Section 4.2.4. This extension has turned out to be useful in the evaluation of the Same Expression heuristic reported in Chapter 6. That is, the three variables in the following code will also be considered to be in the same expression:

set_coordinates (window, length, width);

As was already discussed in Section 4.4, the phenomenon of apparently distinct actual parameters together forming an abstract concept is called tupling (Koschke et al., 1997). Tupling primarily occurs in older languages, like Fortran77, that do not have record types. However, as our experiences with the systems investigated indicate, examples for tupling can also be found in languages with record types, like C. Because the Same Expression heuristic is aimed at detecting composite abstract data objects consisting of several objects, it ignores clusters with only one object, in which no same expression relationship can occur.

Before we can specify the clustering criterion, a definition is needed. A connected graph component is a subgraph whose nodes are all (transitively) connected; in other words, each node is reachable from all other nodes within the subgraph where the direction of edges is ignored.

Clustering criterion. All objects that are in the same connected graph component in the same-expression view (see Table 3-5 on page 68) are grouped together with all the subprograms that reference at least one of these objects in the object reference view. Subprograms may belong to distinct clusters. Clusters with only one object are ignored.

Algorithm. Algorithm 5-2 computes non-disjoint clusters based on the same expression relationship among global objects.

Example. In the working example introduced in Section 5.1.1, there is only one global object and, therefore, the Same Expression heuristic is not applicable.
Let us consider the example in Figure 5-3 instead that contains two abstract data objects; the first one is shaped by a solid line, the second by a dashed line. Object c is neither part of the first nor the second abstract data object though it is accessed by accessor routines that belong to these abstract data objects. The example illustrates also that overlapping candidates can result. Algorithm 5-2. Same Expression heuristic algorithm. Input: • input view V Output: • a set of non-disjoint clusters Algorithm: 1. extract all objects: Vars:= {var | object (var) Ù var ÎV}; Clusters := Æ; 2. clustering: while Vars ¹ Æ loop Var := arbitrary_element (Vars); -- choose an arbitrary element C := transitive_closure (neighbors (Var, {same_expression})); Vars := Vars \ C; if | C | > 1 then Clusters := Clusters È {C È referencing_subprograms (C)}; end if; end loop; 3. results: Clusters is the result 121 Internal Access / Non-Abstract Usage Heuristic 5.6 Internal Access / Non-Abstract Usage Heuristic The purpose of an abstract data type is to hide implementation details of the inter- nal data structure by providing access to it exclusively through a well-defined set of operations. The idealized encapsulation principle entails that all routines that access internal components of the abstract data type are considered to be the data type’s operations which is exactly the attitude of the Internal Access heuristic. An internal access for a record type is any record component selection. Extensions. Yeh et al. proposed to consider internal access to record types only, but the same principle can be applied to arrays and pointers as well: • if T is an array then any index subscript is an internal access; • if T is a pointer then any dereference is an internal access. Originally, the subprograms are associated with those types in their signature whose corresponding formal parameter is internally accessed in the body of the subprogram. However, if we check only whether parameters are internally accessed, a frequent pattern in the presence of pointers that simulate call-by refer- ence in C will be missed: The value of a simulated call-by reference parameter is Figure 5-3. Example of groupings by Same Expression heuristic. Name Internal Access / Non-Abstract Usage Reference Yeh et al., 1995 (extensions: Girard, Koschke, 1997) Domain Signature View (extensions: Object Reference View) where edges are annotated with internal access information Range ADT (extensions: ADO, HC) Disjoint Clusters Yes a b c d e f1 f2 f3 reference same_expression Basic Techniques and Metrics of Component Detection 122 assigned to a local variable which is then used within the rest of the body. Before the function ends, the value of the local variable is assigned to the designated object of the pointer. This way, the programmer does not have to use the derefer- ence operator for each occurrence of the reference parameter: typedef struct {int a;} T; void f (T *pt) { T t = *pt; t.a = ...; /* internal access to local variable */ *pt = t; } This pattern disguises the fact that the designated object of the parameter is inter- nally accessed. Data flow analysis would reveal the internal access. As an approx- imation without need for data flow analysis, we count as internal access any access to the internal parts of a local variable of the same type as the parameter. We have found that, for C, our approximation matches reality in most cases. 
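The following hypothetical fragment summarizes what counts as a non-abstract usage under these extensions (type and function names are invented):

typedef struct { int len; int buf[100]; } Queue;

void clear (Queue *q)         { q->len = 0; }       /* record component selection   */
int  get   (Queue *q, int i)  { return q->buf[i]; } /* component selection plus      */
                                                    /* index subscript               */
void drop  (Queue *q)         { (*q).len--; }       /* dereference of the pointer    */

typedef int Handle;
int  same  (Handle a, Handle b) { return a == b; }  /* predefined operator applied   */
                                                    /* to a user-defined type        */

/* In contrast, forward merely passes a Queue value on; it mentions the  */
/* type in its signature but uses it abstractly.                         */
extern void store (Queue q);
void forward (Queue q)        { store (q); }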
Though originally only proposed for finding abstract data types, the Internal Access heuristic can also be extended to find abstract data objects in cases in which state is implemented by record, array, or pointer variables, i.e., we can also take internal accesses to composite global objects into consideration. Furthermore, what an internal access basically represents is a non-abstract usage of a type. The data structure, which should have been hidden according to the information hiding principle, is no longer transparent. The same takes place when a predefined operator is applied to an object. In this case, the programmer also exploits the knowledge about the object’s data structure. As a conclusion, we can widen the definition of internal access to include applications of standard opera- tors to local and global objects as well as parameters. In doing so, the term inter- nal access is no longer appropriate; that is why we will use non-abstract usage instead. To sum it up, non-abstract usage includes: • internal access for composite data structures as defined above • application of predefined operators to global objects as well as parameters and local objects of a user-defined type Clustering criterion. The Internal Access/Non-Abstract Usage heuristic associ- ates subprograms with all its non-abstractly used objects and signature types (a 123 Internal Access / Non-Abstract Usage Heuristic type T is non-abstractly used when there is a non-abstract usage of any expression – excluding global objects – that has type T; this includes references to parameters, local variables, or intermediate expressions). Note that the clustering criterion does not consider every non-abstract usage of local variables but only of those local variables whose type also appears in the signature of the subprogram. The reason for this restriction is derived from the necessary criterion for accessor functions of an ADT to mention the type in the signature: A subprogram with a local variable of type T that does not mention T in its signature is a client of the abstract data type T but not an accessor function. Relaxing this restriction would lead to erroneously large candidates for systems with limited or no information hiding. Algorithm. Internal Access can be implemented by using the generic algorithm 5-1. For the generic parameter Connected_Entities, the instantiation uses function internally-accessed (S) that yields all internally accessed objects and types of a subprogram, S, in the union of object reference and signature view (where we assume that reference and signature-type edges have an attribute non-abstract that is true if there is a non-abstract usage by the subprogram S; see also Section 3.4.3 for a description of attributes and Section 3.5.2 for neighbor functions restricted by attributes): (5.4) (5.5) Example. In the example of Figure 5-1, empty, prepend, first, and last would be added by Internal Access to the type List because of their internal access to it. Internal Access would only detect a part of the ADO {stack, init, push, pop, size} because the routines init, push, pop do not access the internal record components of stack. Interestingly enough, the part {stack, size} is detected because the pro- grammer breaks the information hiding principle while init, push, and pop are not considered part of the ADO because information hiding is obeyed. The two atomic components of Figure 5-1 are an example for layered atomic components, i.e., a component is built on top of another component. 
Layering is not only used among atomic components but also often within large components. Equations (5.4) and (5.5) referenced above are:

(5.4)   connected_entities(S) = internally_accessed(S)
(5.5)   internally_accessed(S) = successors(S, {reference, signature-type}, non-abstract)

Frequently within large components, there are a few core functions accessing the internal data structure and several higher-level functions whose services are implemented using the core accessor functions. Internal Access/Non-Abstract Usage can only identify the core functions within layered components and only the lower-level components when components are built on top of each other.

5.7 Delta-IC

Name: Delta-IC
Reference: Canfora et al., 1993, 1996
Domain: Object Reference View (extensions by Rainer Koschke: Signature View where edges are annotated with internal access information)
Range: ADO (extensions: ADT, HC)
Disjoint Clusters: No

High cohesion in the case of an abstract data object S implies that each of the subprograms in S references many objects of S; low coupling implies that each of the subprograms of S references only very few objects that do not belong to S and that only few subprograms from outside of S reference objects of S. The approach proposed by Canfora et al. is heading in this direction. It basically consists of two parts. At first, objects and subprograms are clustered to ADOs according to a specific usage pattern. Then all resulting clusters are rejected whose internal connectivity is below a given threshold. The internal connectivity metric proposed by Canfora et al. is described below.

The clustering pattern and the evaluation metric is defined on the object reference view that describes the usage of global objects by subprograms. They can be explained more easily in terms of the following definitions, given a subprogram S and a global object V (we are using the concepts introduced in Section 3.1; Canfora et al. describe their approach in a slightly different, yet equivalent way):

subprograms related to S are all subprograms which set or use referenced objects of S:

(5.6)   subprograms-related-to(S) = ⋃_{e ∈ referred-by(S)} {F | F ∈ refer-to(e)}

where refer-to(e) = referencing-subprograms(e) and referred-by = refer-to⁻¹, hence, referred-by(S) = referenced-objects(S). The reason why refer-to/referred-by are introduced here — instead of using referencing-subprograms/referenced-objects directly — is that Delta-IC is extended to types by re-defining refer-to below (referred-by is always defined as refer-to⁻¹).

closely-related subprograms of S are all subprograms which set or use only referenced objects of S:

(5.7)   closely-related-subprograms(S) = ⋃_{e ∈ referred-by(S)} {F | F ∈ refer-to(e) ∧ referred-by(F) ⊆ referred-by(S)}

Example. Given the object reference view of Figure 5-4 and F as the subprogram under consideration, then the objects referred by F are {v1, v2}, the subprograms related to F are {F, f1, f2, f3}, and the closely related subprograms are {F, f1, f2}.

The candidate that is proposed as an abstract data object consists of all closely related subprograms of the given subprogram S plus the objects referred by S, i.e., all objects set or used by S:

(5.8)   candidate-cluster(S) = closely-related-subprograms(S) ∪ referred-by(S)

[Figure 5-4. Example object reference graph: objects v1, v2, v3; subprograms F, f1, f2, f3.]

Example. In the example of Figure 5-4, the candidate cluster is {v1, v2, F, f1, f2} for the given subprogram F.
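As a sketch of how this candidate could be computed, the following small C program re-derives the cluster for F, assuming the reference edges implied by the worked numbers in the text (F and f1 reference v1 and v2, f2 references only v2, f3 references v2 and v3):

#include <stdio.h>

enum { F, f1, f2, f3, N_SUBS };
enum { v1, v2, v3, N_OBJS };
static const char *sub_name[] = { "F", "f1", "f2", "f3" };

/* refs[s][o] != 0 iff subprogram s references object o (assumed edges) */
static const int refs[N_SUBS][N_OBJS] = {
  /* v1 v2 v3 */
  {  1,  1,  0 },   /* F  */
  {  1,  1,  0 },   /* f1 */
  {  0,  1,  0 },   /* f2 */
  {  0,  1,  1 },   /* f3 */
};

/* referred-by(s) is a subset of referred-by(given)?                    */
static int only_refs_objects_of (int s, int given)
{
  for (int o = 0; o < N_OBJS; o++)
    if (refs[s][o] && !refs[given][o]) return 0;
  return 1;
}

int main (void)
{
  int given = F;
  printf ("candidate-cluster(%s): objects", sub_name[given]);
  for (int o = 0; o < N_OBJS; o++)          /* referred-by(F)           */
    if (refs[given][o]) printf (" v%d", o + 1);
  printf ("; subprograms");
  for (int s = 0; s < N_SUBS; s++) {        /* closely related subs     */
    int related = 0;
    for (int o = 0; o < N_OBJS; o++)
      if (refs[given][o] && refs[s][o]) related = 1;
    if (related && only_refs_objects_of (s, given))
      printf (" %s", sub_name[s]);
  }
  printf ("\n");    /* expected: objects v1 v2; subprograms F f1 f2     */
  return 0;
}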
Note that the proposed clusters depend upon the given subprogram. Suppose F also referenced object v3, then the cluster for F would be {v1, v2, v3, F, f1, f2, f3}; from the perspective of f3 we would get the cluster {v2, v3, f2, f3}. Thus, clusters can overlap.

The candidate cluster is ranked by the internal connectivity metric and only proposed if this metric yields a value greater than a user-determined threshold. The internal connectivity measure (IC) and the improvement in internal connectivity (ΔIC) are defined as:

(5.9)   IC(S) = Σ_{e ∈ referred-by(S)} |{F | F ∈ refer-to(e) ∧ referred-by(F) ⊆ referred-by(S)}|  /  Σ_{e ∈ referred-by(S)} |{F | F ∈ refer-to(e)}|

(5.10)  ΔIC(S) = IC(S) − Σ_{e ∈ referred-by(S)} |{F | referred-by(F) = {e}}| / |refer-to(e)|

IC(S) is the portion of references to individual variables of the cluster from subprograms also inside the cluster (closely related subprograms) with respect to the number of all references. If there is no reference from outside the cluster, IC(S) is 1. In the example of Figure 5-4, IC(F) is as follows: (2 + 3) / (2 + 4) = 0.83. The subtrahend in the definition of ΔIC reflects the portion of subprograms that reference only a single variable of the cluster with respect to the number of all references to that variable. In the example of Figure 5-4, the subtrahend of ΔIC is 1/4: f2 is the only subprogram that accesses a single variable only, namely, v2, which is referenced by 4 subprograms. Consequently, ΔIC(F) = 0.83 − 0.25 = 0.58.

The underlying intuition of the definition of ΔIC is to have only few references of objects from outside the cluster (this motivates the internal connectivity measure IC) and only few routines in the cluster that reference only one object of the cluster (the second term in the formula for ΔIC). The latter is aimed at clusters whose parts are more tightly coupled. We will discuss this below in more detail.

Clustering criterion. A candidate for a given subprogram S is candidate-cluster(S) where ΔIC(S) ≥ Θ.

Algorithm. The original approach uses the clustering algorithm of Canfora et al. (1996), shown as Algorithm 5-3. It may be a sign of loose relatedness when a candidate's internal connectivity is below the threshold. The reason may be that the subprograms implement distinct logical functions and therefore reference unrelated objects. The code of such subprograms could be separated into distinct parts that correspond to the distinct logical functions by means of program slicing (Weiser, 1984). This is what Canfora et al. proposed. However, for a pure reverse engineering process, which must not change the system, slicing subprograms is out of the question. Furthermore, if applied only fully automatically and non-iteratively, the nodes referred-by(S) will not be collapsed. Omitting the slicing and validation steps reduces the outer loop in algorithm 5-3 to one iteration (since the object reference view does not change) and overlapping candidates may result. The subsection on extensions discusses how overlapping candidates can be handled.

The most distinguishing characteristic of this approach is that it first generates clusters and then uses a metric to assess the generated clusters whereas the other techniques simply cluster without rating their candidates.
We will take up this idea again in Chapter 9 where we will use several metrics to assess the atomic component candidates within the interactive framework. Algorithm 5-3. Original Delta-IC approach. repeat build object reference view for each subprogram S loop if DIC(S) ³ Q then let the user validate candidate-cluster (S) if accepted, collapse referred-by (S) into a single representative variable else slice S using different objects of candidate-cluster (S) end if; end loop; until graph contains only isolated subgraphs consisting of an object grouping with one or more functions Basic Techniques and Metrics of Component Detection 128 Moreover, the threshold used to filter out candidates in this approach offers the user a way to influence the search for atomic components. All other techniques generate always the same candidates; whereas in an interactive environment, the user can play with different thresholds for Delta-IC and, hence, get different results. Though the threshold makes the approach more flexible, it also compels the user to search for a reasonable setting. Canfora et al. propose to establish the threshold statistically by a smaller sample of the system. However, it is not yet clear how big the sample has to be in order to allow a usable prediction. Revisiting the DIC definition. The definition of DIC consists of two parts. The second part, i.e., the subtrahend of DIC, covers substructures of the candidates that consist of only one object and those subprograms that access solely this object. It is motivated by the fact that the clustered variables are collapsed in Algorithm 5-3. Yet, in the first iteration, there are no clustered variables, i.e., each variable stands for itself. In that case, the subtrahend actually represents the inter- nal connectivity of clusters around a single variable that consist of subprograms that only access this variable and no other variable, i.e., if Cv is a cluster that con- sists of a single variable V and all subprograms that only refer to this and no other variable, then the following equation holds (which was neither shown nor men- tioned by the original authors): This will now be shown. The set subprograms-related-to is the same for all sub- programs in CV. Let S be a subprogram of Cv, then referred-by(S) = {V} and therefore: where refer-to(V) depends only upon V. Furthermore, it is also true that: S CV: Î" IC S( ) F referred-by (F) e{ }={ } refer-to (e )-------------------------------------------------------------------- e referred-by S( )Î å= subprograms-related-to (S) F F refer-to(e)Î{ } e referred-by (S)Î È= F F refer-to V( )Î{ }= refer-to= V( ) 129 Delta-IC Thus, we can conclude that: when we have a cluster Cv with one object V and all subprograms that only access V. Such substructures around a single variable might be considered a candidate on their own and therefore it could make sense to subtract their internal connectivity from the overall internal connectivity of the composite structure. Yet, this is intu- itively not appropriate for the following reasons: • The decision to consider only subclusters of single variables is arbitrary. Why not considering subclusters with two or more variables? • Furthermore, one should think that a subprogram that references one variable only and this variable is in the cluster, the subprogram should definitely also be in the same cluster. 
An example is an abstract data object stack based on two global variables stack_content (array for the stack content) and stack_pointer closely-related-subprograms (S) F F refer-to(e)Î referred-by(F) referred-by (S)ÍÙ{ } e referred-by (S)Î È= F F refer-to(V)Î referred-by(F) referred-by (S)ÍÙ{ }= F F refer-to(V)Î referred-by(F) V{ }ÍÙ{ }= F referred-by (F)={V}{ } F F CvÎ{ }== IC(S) {F F refer-to e( )Î referred-by F( ) referred-by S( )}ÍÙ e referred-by (S)Î å {F F refer-to e( )Î } e referred-by (S)Î å ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------= {F F refer-to V( )Î referred-by F( ) referred-by S( )}ÍÙ {F F refer-to V( )Î }--------------------------------------------------------------------------------------------------------------------------------------= closely-related-subprograms (S) subprograms-related-to (S) ----------------------------------------------------------------------------------= F referred-by (F) V{ }={ } refer-to (V )---------------------------------------------------------------------= Basic Techniques and Metrics of Component Detection 130 (index into stack_content) having an accessor function size to return the num- ber of elements on the stack; size would need to reference stack_pointer only and still does belong to the cluster. Moreover, the metric considers only coupling but not cohesion of the candidate clusters though one would expect that both coupling and cohesion should be taken into account. A minor point of critique is that the term internal connectivity is misleading. What the formula for the internal connectivity measures is the frac- tion closely-related-subprograms versus related-subprograms with respect to individual variables of the cluster, which is a relationship between the cluster and its environment as opposed to an internal property. Extensions. The original Delta-IC method as proposed by Canfora et al. involves slicing of subprograms that are part of more than one cluster. This, however, means that the system is changed. In the context of Canfora et al.’s work this makes sense since their work is aimed at finding reusable components. Subpro- grams involved in more than one atomic component fulfill different logical func- tions for different atomic components. Reusing one atomic component then would imply the need to import other atomic components as well, since the multi- function subprogram shares code with many atomic components. Our goal is to recover information for program understanding. The system must not be changed during this reverse engineering phase (it may be changed after- wards), i.e., slicing is out of the question. Leaving out slicing may result in over- lapping candidates. Merging these overlapping candidates regardless of the degree of overlap is not satisfactory. This approach was taken in an earlier evalua- tion of the techniques (Girard and Koschke, 1997) in order to get a fair evaluation since the other methods always produce distinct candidates. For the application of the Delta-IC method, we can do better: We can merge two candidates when they share a large amount, otherwise they remain distinct and overlapping. In particu- lar, this is the right approach when the user can be consulted. Merging similar candidates frees the user from an overwhelming number of similar candidates: She or he has to judge only critical cases. 
The modified Delta-IC algorithm 5-4 merges candidates only when they have many elements in common (Step 3). We treat one component S as a match of 131 Delta-IC another component T when S Íp T according to the partial subset relationship introduced in Section 3.5.4.3, which allows for inexact matches. Step 3 merges the overlapping candidates. Algorithm 5-4. Extended Delta-IC analysis. Input: • object reference view V • DIC threshold Q Output: • set of atomic component candidates C Algorithm: 1. generate candidates: for each subprogram S in V loop clusters (S) := candidate-cluster (S, V); end loop; 2. filter candidates whose DIC is less than Q: for each subprogram S in V loop if DIC (clusters (S)) < Q then clusters (S) := Æ; end if; end loop; 3. merge overlapping candidates: while $ a pair of subprograms {S1, S2} in clusters where clusters (S1) Íp clusters (S2) Ú clusters (S2) Íp clusters (S1) loop clusters (S1) := clusters (S1) È clusters (S2); clusters (S2) := Æ; end loop; 4. return results: for S in clusters’Range where | clusters (S) | > 1 loop C := C È {clusters (S)} end loop; Basic Techniques and Metrics of Component Detection 132 Generalization for types. The internal connectivity metric was originally pro- posed only for abstract data objects. However, we can extend the domain of con- nectivity to types as well. Before we actually generalize the metric, we state some observations. There are two different kinds of entities of an abstract data object: variables and constants that we do not want to be accessed from outside of the abstract data object and subprograms that act as public accessor routines. According to these two classes, there are the following different kinds of relationships that we implicitly distinguished above: 1. non-abstract usage: an object is directly referenced by a subprogram. There are two categories of non-abstract usage: a. the object is non-abstractly used by a subprogram within the cluster b. the object is non-abstractly used by a subprogram outside of the cluster 2. abstract usage: an object is not used directly by a subprogram S outside of the cluster but by an accessor routine of the atomic component called by S, in other words: S is accessing the object only by means of the accessor routine associ- ated with the object. Cases 1.a and 2 conform to the information hiding principle, case 1.b does not. Hence, metrics for cohesion and coupling should penalize 1.b. The metrics for objects in this section are defined with this in mind. As opposed to objects, we do not want to hide types - they would not be of any use then. Instead, we want to hide the underlying data structure of a type. This corresponds to the idea of the Internal Access heuristic. Types should be used abstractly by subprograms outside of the abstract data type. A non-abstract usage of a type is an internal access according to the definition in Section 5.6. Now that we have a unifying concept non-abstract usage for both types and objects, we can generalize the specification of refer-to and referred-by accord- ingly. The formulas (5.6) - (5.10) need not be changed. So far, refer-to(v) has been defined as referencing-subprograms(v) which is defined as predecessors(v, {refer- ence}) (see Section 3.5.3). 
Hence, the definition of refer-to can be extended as follows in order to include the restricted signature-types relationships (only those signature types are considered that are annotated by an internal access; see Section 5.6):

(5.11)  $\textit{refer-to}(e) = \textit{predecessors}(e, \{\textit{reference}\}) \cup \textit{predecessors}(e, \{\textit{signature-type}\}, \textit{non-abstract})$

Since referred-by is the relational inverse of refer-to:

(5.12)  $\textit{referred-by}(s) = \textit{refer-to}^{-1}(s) = \textit{successors}(s, \{\textit{reference}\}) \cup \textit{successors}(s, \{\textit{signature-type}\}, \textit{non-abstract})$

Only signature types are considered by the definition above; local-obj-of-type edges annotated as non-abstract are ignored for the same reason as for the Internal Access heuristic stated in Section 5.6.

By re-definition of refer-to, equation (5.10) is now also applicable to types. However, the fault found with the original definition of Delta-IC for abstract data objects, namely, that the definition is aimed at avoiding subclusters consisting of a single object and those subprograms that access only this object, is even more problematic for types. Let us assume information hiding is applied for a component, C, consisting of a type T and its accessor functions S1,…,Sn (n ≥ 1). Then the proposed cluster does not depend upon the chosen subprogram and is always {T, S1,…,Sn}. The internal connectivity IC for this cluster is exactly 1 since only the accessor functions S1,…,Sn are referring to the type according to (5.11). However, the subtrahend in the definition of DIC in (5.10) is 1, too (let S ∈ {S1,…,Sn}):

$\sum_{e \in \textit{referred-by}(S)} \frac{|\{F \mid \textit{referred-by}(F) = \{e\}\}|}{|\textit{refer-to}(e)|} = \frac{|\{F \mid \textit{referred-by}(F) = \{T\}\}|}{|\textit{refer-to}(T)|} = 1$

Hence, DIC(S) = 1 − 1 = 0 for every subprogram S referring to T.

Interestingly enough, DIC is also 0 for the following scenario:
• there is a set of subprograms, S = {S1,…,Sn}, and a type, T, where T ∈ referred-by(Si) for all Si ∈ S
• only a subset S′ ⊂ S refers to T only; all other subprograms of S non-abstractly use other types as well

Then, for any subprogram, s ∈ S′ (hence, referred-by(s) = {T}), the following holds:

$\textit{closely-related-subprograms}(s) = \{F \mid \textit{referred-by}(F) = \{T\}\} = S' \subset S$

and

$\textit{subprograms-related-to}(s) = \textit{refer-to}(T) = S$

Hence:

$IC(s) = \frac{\sum_{e \in \textit{referred-by}(s)} |\{F \mid F \in \textit{refer-to}(e) \wedge \textit{referred-by}(F) \subseteq \textit{referred-by}(s)\}|}{\sum_{e \in \textit{referred-by}(s)} |\{F \mid F \in \textit{refer-to}(e)\}|} = \frac{|\{F \mid F \in \textit{refer-to}(T) \wedge \textit{referred-by}(F) \subseteq \textit{referred-by}(s)\}|}{|\{F \mid F \in \textit{refer-to}(T)\}|} = \frac{|\{F \mid \textit{referred-by}(F) = \{T\}\}|}{|\textit{refer-to}(T)|} = \sum_{e \in \textit{referred-by}(s)} \frac{|\{F \mid \textit{referred-by}(F) = \{e\}\}|}{|\textit{refer-to}(e)|}$

That is, IC(s) equals the subtrahend of DIC in (5.10), and hence DIC(s) = 0.

That is to say, the candidate-cluster for this scenario, in which subprograms outside of the candidate non-abstractly use the type, has the same DIC value as a candidate-cluster that represents a pure abstract data type. Of course, the equation holds also for abstract data objects that contain only one object. However, abstract data types with only one type are much more frequent than abstract data objects with only one object.

To sum it up, Delta-IC is not really appropriate for atomic components containing only one object or type. In order to see whether Delta-IC is better suited for components with more than one object or type, let us revisit its definition (5.10). The upper bound of DIC is 1 and is reached for a cluster that contains at least two types or objects and whose subprograms are referring to more than one type or object in the cluster (hence, the subtrahend of (5.10) is 0) and to no object or type outside of the cluster (IC is 1). This makes sense, though it is not clear why highly cohesive clusters with more than one type or object should have a higher DIC than highly cohesive clusters with only one object or type.

The lower bound of DIC depends upon the number of objects in the cluster. Given a subprogram, S, that refers to objects V1,…,Vn where each object Vi in V1,…,Vn is accessed by m other subprograms Si,1,…,Si,m and referred-by(Si,j) = {Vi} for 1 ≤ j ≤ m (see Figure 5-5). Then, IC(S) = 1 because of closely-related-subprograms(S) = related-subprograms(S). However, the subtrahend of DIC is as follows:

$\sum_{e \in \textit{referred-by}(S)} \frac{|\{F \mid \textit{referred-by}(F) = \{e\}\}|}{|\textit{refer-to}(e)|} = n \times \frac{m-1}{m}$

where (m−1)/m approaches 1 for large m. Hence, DIC(S) ≈ 1 − n and, therefore, clusters based on Si,j are preferred due to DIC(Si,j) = 0, which is what one would expect since S appears like a badly designed initialization function that incorporates initialization code for different components.

Figure 5-5. Cluster with low DIC. (The figure shows a subprogram S referring to objects V1,…,Vn, where each Vi is additionally referred to by the subprograms Si,1,…,Si,m only.)

Another problematic property of DIC is that coupling and cohesion are unbalanced: While the value of IC, which approximates coupling, can only be between 0 and 1, the subtrahend of DIC, which approximates lack of cohesion, can be between 0 and n (where n is the number of objects and types in a cluster). It would be useful to adjust the balance between coupling and cohesion with respect to specific system characteristics.

The extensions to Delta-IC presented in this section are adaptations of the approach to a pure recovery process that does not allow for changes in the system. The ideas of the original authors are preserved. The critique of the metrics used for Delta-IC leads to a more substantial change of the approach that will be described in the next section.
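Before moving on, a minimal Python sketch of the extended Delta-IC analysis (Algorithm 5-4) may help make the control flow concrete. It is illustrative only and not part of the thesis tool set: the candidate-cluster construction and the DIC metric of (5.10) are assumed to be supplied as functions, clusters are sets of entity names, and all identifiers are placeholders.

def partial_subset(s, t, p=0.7):
    # S is a partial subset of T if at least the fraction p of its elements is in T.
    return bool(s) and len(s & t) / len(s) >= p

def extended_delta_ic(subprograms, candidate_cluster, dic, threshold, p=0.7):
    # Step 1: generate one candidate cluster per subprogram.
    clusters = {s: frozenset(candidate_cluster(s)) for s in subprograms}
    # Step 2: discard candidates whose DIC value falls below the threshold.
    clusters = {s: (c if dic(c) >= threshold else frozenset())
                for s, c in clusters.items()}
    # Step 3: merge candidates that overlap according to the partial subset relation.
    changed = True
    while changed:
        changed = False
        live = [s for s in clusters if clusters[s]]
        for i, s1 in enumerate(live):
            for s2 in live[i + 1:]:
                c1, c2 = clusters[s1], clusters[s2]
                if partial_subset(c1, c2, p) or partial_subset(c2, c1, p):
                    clusters[s1], clusters[s2] = c1 | c2, frozenset()
                    changed = True
                    break
            if changed:
                break
    # Step 4: return all remaining candidates with more than one element.
    return {c for c in clusters.values() if len(c) > 1}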
5.8 Internal and External Connectivity

Name: Internal-External-Connectivity
Reference: Rainer Koschke, unpublished
Domain: Object Reference View and Signature View (where edges are annotated with internal access information)
Range: ADO, ADT, and HC
Disjoint Clusters: No

The general aim of the definition of DIC is to minimize external connectivity and to maximize internal connectivity. However, what is called internal connectivity (IC) in the Delta-IC approach is rather a measurement between the cluster and its environment and, therefore, no internal property. Moreover, IC depends upon the considered subprogram. If one wants to use the metric to rate an arbitrary cluster (which need not be a cluster in the sense of Delta-IC), one has to select one of the subprograms of this cluster to be able to compute the metric, and it is generally not clear which one to choose.

The metrics defined in this section will only depend on the cluster as a whole. Furthermore, the new metrics allow balancing coupling and cohesion and do not put clusters with only one object or type at a disadvantage.

Internal and external connectivity as defined in this section are primarily proposed in this thesis for assessing candidates (see Section 8.4 and Section 9.3). However, they can also be used as the underlying metric of the Delta-IC algorithm 5-4. Moreover, they can also be used to establish a partition of the union of object reference and signature view (including non-abstract usage information) that minimizes external connectivity and maximizes internal connectivity. In order to establish such a partition, genetic algorithms could be used similar to the approach of Mancoridis et al. (1999). Moreover, if the atomic components are known, one could also use these metrics to assess the degree of information hiding of the system.

First of all, we make a clear distinction between internal and external connectivity, in other words: between cohesion and coupling. The internal connectivity (IntC) is only based upon relationships within the atomic component; the external connectivity (ExtC) is a measurement between an atomic component and its environment.

The internal connectivity of a cluster is defined as the degree to which the objects and types of the cluster are referred by subprograms within this cluster (subprograms was already defined in Section 3.5.4.1 and refer-to is defined by (5.11) on page 133):

(5.13)  $IntC(C) = \frac{1}{|OT(C)|} \times \sum_{e \in OT(C)} \frac{|\textit{refer-to}(e) \cap \textit{subprograms}(C)|}{|\textit{subprograms}(C)|}$

where OT(C) = objects(C) ∪ types(C) is the set of objects and types in C (see Section 3.5.4.1 for the definition of objects and types). The motivation of this formula is that subprograms should refer to many objects and types of the cluster to be part of it; thus, we are aiming at a high cohesion. The IntC(C) value is always in the range of 0 and 1. Good design should aim at a high IntC(C) value.

Analogously, we can define external connectivity as the degree to which objects and types of a cluster are referred by subprograms outside of the cluster (in relation to their total usage):

(5.14)  $ExtC(C) = \frac{1}{|OT(C)|} \times \sum_{e \in OT(C)} \frac{|\textit{refer-to}(e) \setminus \textit{subprograms}(C)|}{|\textit{refer-to}(e)|}$

External connectivity corresponds to coupling, i.e., we are striving for atomic components with low external connectivity. The ExtC(C) value is always in the range of 0 and 1. Good design should aim at ExtC(C) = 0.

Altogether, in order to minimize external connectivity and maximize internal connectivity, only clusters for which the degree of connectivity

(5.15)  $\textit{connectivity}(C) = \frac{a \times IntC(C) - ExtC(C) + 1}{a + 1}$

is above the threshold should be accepted. Factor a is used to balance between internal and external connectivity. For normalization, the constant 1 is added to the numerator and the term is divided by a + 1. Because both IntC(C) and −ExtC(C) + 1 yield values between 0 and 1, connectivity(C) is always between 0 and 1.
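A minimal Python sketch of (5.13)-(5.15) follows; it is illustrative only. It assumes that refer-to has already been computed from the object reference and signature views as in (5.11) and is given as a mapping from each object or type to the set of subprograms referring to it; the small example cluster at the end is invented.

def int_c(ot, subs, refer_to):
    # (5.13): average fraction of the cluster's subprograms referring to each element.
    if not ot or not subs:
        return 0.0
    return sum(len(refer_to[e] & subs) / len(subs) for e in ot) / len(ot)

def ext_c(ot, subs, refer_to):
    # (5.14): average fraction of outside references relative to total usage.
    if not ot:
        return 0.0
    return sum(len(refer_to[e] - subs) / len(refer_to[e])
               for e in ot if refer_to[e]) / len(ot)

def connectivity(ot, subs, refer_to, a=1.0):
    # (5.15): balance internal against external connectivity with factor a.
    return (a * int_c(ot, subs, refer_to) - ext_c(ot, subs, refer_to) + 1) / (a + 1)

# Example: a small abstract data object whose variable leaks to one outside caller.
refer_to = {"counter": {"inc", "reset", "main"}}
print(connectivity({"counter"}, {"inc", "reset"}, refer_to))   # about 0.83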
5.9 Schwanke's Arch Approach

Name: Schwanke's Arch Approach
Reference: Schwanke, 1991; Schwanke and Hanson, 1994
Domain: Base View
Range: RS
Disjoint Clusters: Yes

The techniques described above compare pairs of entities by their direct relationships in order to decide whether they belong to the same atomic component. However, a complementary source of information is the environment of the compared entities, as stated by Schwanke (1991):

"If two procedures use several of the same unit-names, they are likely to be sharing significant design information, and are good candidates for placing in the same module."

And not only what kind of entities (unit-names) they commonly use increases their relatedness but also by which common entities they are used. For example, the implementations of a sine and a cosine function will both have a float parameter and result type, but they also will likely be used in the same context, i.e., have common callers. Schwanke's approach takes this into account.

Schwanke's work is aimed at module detection. Subprograms are clustered into modules based on a similarity metric (clustering algorithm 5-5).

Algorithm 5-5. Similarity Clustering algorithm.

place each routine in a group by itself
repeat
   identify the two most similar groups
   combine them
until the existing groups are satisfactory

Clustering criterion. In each iteration, the most similar groups are combined using the similarity metric described below.

Similarity between subprograms. The group similarity used to combine groups in algorithm 5-5 is based on a similarity between subprograms. Given two subprograms A and B, the similarity metric used during clustering is defined as follows:

(5.16)  $Sim(A, B) = \frac{Common(A, B) + k \times Linked(A, B)}{n + Common(A, B) + d \times Distinct(A, B)}$

wherein Common(A,B) reflects the common features of A and B and Distinct(A,B) reflects the distinct features. Linked(A,B) is 1 if A calls B or B calls A, otherwise it is 0. The two parameters k ≥ 0 and d ≥ 0 are weights given to Linked and Distinct in Sim. They have to be ascertained by experiments on a sample of the subject system. The parameter n ≥ 0 is used for normalization purposes; it will be considered 0 in the following.

Features of a subprogram A are all non-local names that A uses, including the names of procedures, macros, typedefs, objects, and even the individual record component names of structured types and objects. Record component names of structured types and objects are treated as if they were unique, i.e., if two record types or objects accidentally have record components of the same name, then these names are considered distinct. But not only what A uses is a feature of A; the fact that A is used by another subprogram (i.e., the other subprogram calls A) is also considered a feature of A. Technically, a feature binding can be expressed by a pair (A, B) where A uses the non-local name B, or, in other words, A has feature B. The fact that A is called by B is expressed by (A, B*). The notation B* denotes a synthetic name derived from B. This represents the difference between a name used in a unit and the name of a place where the unit is used. The distinction is made so that (A, B) and (A, B*) are distinct pairs.
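The following Python sketch illustrates the feature sets and the basic similarity (5.16). It is illustrative only: common and distinct features are counted unweighted here (the weighted sums W that Schwanke actually uses follow below), and the example maps for uses and calls are invented.

def features(sub, uses, calls):
    feats = set(uses.get(sub, ()))
    # Being called by B is itself a feature, recorded under the synthetic name B*.
    feats |= {caller + "*" for caller, callees in calls.items() if sub in callees}
    return feats

def sim(a, b, uses, calls, k=1.0, d=1.0, n=0.0):
    fa, fb = features(a, uses, calls), features(b, uses, calls)
    common = len(fa & fb)            # unweighted stand-in for Common(A, B)
    distinct = len(fa ^ fb)          # unweighted stand-in for Distinct(A, B)
    linked = 1.0 if b in calls.get(a, ()) or a in calls.get(b, ()) else 0.0
    denom = n + common + d * distinct
    return (common + k * linked) / denom if denom else 0.0

# Invented example: two routines sharing a type name, a record component, and a caller.
uses = {"push": {"Stack", "top"}, "pop": {"Stack", "top"}, "log": {"fprintf"}}
calls = {"main": {"push", "pop"}}
print(sim("push", "pop", uses, calls))   # high similarity
print(sim("push", "log", uses, calls))   # no common features, similarity 0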
Common and Distinct are computed as weighted sums (features(A) denotes the features of A):

$Common(A, B) = W(\textit{features}(A) \cap \textit{features}(B))$

$Distinct(A, B) = W(\textit{features}(A) \setminus \textit{features}(B)) + W(\textit{features}(B) \setminus \textit{features}(A))$

It obviously makes a difference whether two subprograms have a rare or a frequent feature in common. For example, an error object which is used everywhere in the system is less distinctive than an object that is only used by a small portion of the system. Using the Shannon information content to ascertain the individual feature weights takes this into account (Shannon, 1972). It gives frequent features less weight and vice versa:

$W(X) = \sum_{x \in X} w_x$  where $w_x$ is the weight of feature x, and  $w_x = -\mathrm{ld}(\textit{Probability}(x))$

where Probability(x) is the fraction of all entities that have x in common. The hypothesis is that rarely used entities are more significant than frequently used entities.

Group similarity. Based on the similarity between subprograms, the similarity for groups is defined as the maximum similarity between any pair of group members (one from each):

(5.17)  $GSim(A, B) = \max\{Sim(a, b) \mid a \in A \wedge b \in B\}$

Extensions. An extension to this approach was proposed by Schwanke himself in joint work with Hanson in 1994. In the extension, they use a nearest neighbor approach to classify components.

The nearest neighbor approach works as follows. The basic similarity measure between individual entities must be monotonic and matching, as described by Tversky (1977). This makes it reasonable to use the similarity measure to compare one entity, S, to each of the other entities, Ti, and rank them by their similarity to S. This ranking identifies which of the Ti are the nearest neighbors of S. An entity, S, in an existing software system can then be classified by identifying the existing group G that contains a plurality of the near neighbors of S. Schwanke and Hanson use an arbitrary cutoff by defining "near neighbors" to be the five nearest neighbors (in the following, the number of nearest neighbors considered is denoted by k). For tie-breaking, points according to nearness are assigned. Then S belongs in the group G for which the neighbors of S that belong to G represent the most points. The point values are arbitrary, but observe the following characteristics (let Ni be the i-th nearest neighbor): N1 is worth k points, N2 is worth k−1, and so forth; i.e., Ni is worth k−i+1 points where i is in the range of 1…k.

Example. Consider the following example scenarios:
1. Each near neighbor is in a different group: the nearest neighbor wins.
2. Group G1 has N1, G2 has N2 and one of N3, N4, or N5: any two near neighbors are worth at least as much as the single nearest neighbor.
3. G1 has N1 and N2, G2 has N3, N4, and N5: the two nearest neighbors beat the next three.

Looking at these kinds of scenarios could help decide what point assignments should be used. The "point" is, the weights are chosen to make ordinal comparisons do the right thing.

For this k-nearest-neighbor approach, a group similarity measure is not needed, and that is why Schwanke and Hanson's paper does not investigate group similarity measures; yet Schwanke suggested in a personal communication (1998) a possible group similarity measure based on the same principle:

(5.18)  $GSim(A, B) = \frac{p(A \cup B) - p(A) - p(B)}{|A \cup B| \times (|A \cup B| - 1)}$  with  $p(X) = \sum_{S, T \in X} p_S(T)$

where pS(T) is the number of points assigned to T for being a near neighbor of S.
That is, this similarity varies directly with the number of near neighbors of members of B that belong to A, and the number of near neighbors of members of A that belong to B. Dividing by the number of relationships in pairs in (A ∪ B) prevents large groups from getting larger just because more neighbor relationships are involved. Qualitatively, the similarity is the average reduction in member loneliness that merging the groups would produce. (Member loneliness is the extent to which the near neighbors of a member belong to other groups.)

The effect of this group similarity measure on clustering can be exemplified as follows. First, the clustering algorithm would pair up all the {S,T} that were mutually nearest neighbors (10 points/2 = 5). Next it would pair up all the remaining {U,V} for which U was the nearest neighbor of V, and V was second-nearest for U (9 points/2 = 4.5).

Schwanke proposes the following exercise to understand the contribution of the group similarity for clustering: compute p({A,B,C}) for all possible neighbor rankings among A, B, and C, and find the cases where the algorithm would prefer forming a triplet from a pair and a singleton rather than forming a new pair out of two singletons. For example,

{a,b,c} := {a,b} + {c}  vs.  {d,e} := {d} + {e}

The maximum possible value of p({a,b,c}) is (5+4+5+4+5+4) = 27. The minimum possible value of p({a,b}), given the values in the previous line, is (5+5) = 10, because the algorithm would have formed that cluster first. So, GSim({a,b},{c}) in this case is (27−10)/6 = 2.83 and this triplet {a,b,c} would not be formed until most units were in pairs. Only pairs worth (5+0)/2, (4+1)/2, (3+2)/2, etc., would be formed later. So, almost any unit whose nearest neighbor is not already in a group would be paired with that neighbor before any triplets were formed.

The problem of learning the appropriate weights is addressed by using a feedforward neural network and backpropagating errors (Schwanke and Hanson, 1994). The network is designed to mirror the model of similarity judgment according to (5.16).

In order to distinguish the original approach from 1991 from the extension of 1994, the former approach is called the Arch approach and the latter the iArch approach, following the terminology Schwanke used in his papers. The original Arch approach was extended to detect atomic components by Jean-François Girard, Georg Schied, and me in many ways. The enhancements are so manifold that the extension can be considered a new approach. It will be discussed in Chapter 7.
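As an illustration only (not Schwanke's implementation), the point scheme and the group similarity (5.18) can be sketched in Python as follows. The similarity function, the set of entities, and the existing grouping are assumed to be given; points for entities outside the k nearest neighbors are taken to be 0; all names are placeholders.

def knn_points(s, entities, similarity, k=5):
    # Rank the other entities by similarity to s; N_i receives k-i+1 points.
    ranked = sorted((e for e in entities if e != s),
                    key=lambda e: similarity(s, e), reverse=True)[:k]
    return {e: k - i for i, e in enumerate(ranked)}   # i = 0 yields k points

def classify(s, entities, group_of, similarity, k=5):
    # s is placed into the group whose near neighbors collected the most points.
    votes = {}
    for e, pts in knn_points(s, entities, similarity, k).items():
        votes[group_of[e]] = votes.get(group_of[e], 0) + pts
    return max(votes, key=votes.get) if votes else None

def gsim(a, b, entities, similarity, k=5):
    # Group similarity (5.18): average reduction in "member loneliness".
    def p(group):
        return sum(knn_points(s, entities, similarity, k).get(t, 0)
                   for s in group for t in group)
    u = set(a) | set(b)
    return (p(u) - p(a) - p(b)) / (len(u) * (len(u) - 1))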
5.10 Type-based Cohesion

Name: Type-Based Cohesion
Reference: Patel, Chu, and Baxter, 1991
Domain: Type Composition View + Specific Usage
Range: RS
Disjoint Clusters: Yes

Patel et al. propose an approach similar to Schwanke's similarity clustering, grouping subprograms that share a large number of types. The main difference to Schwanke's approach is that Schwanke considers any shared non-local features and not just types. We will discuss further differences after the introduction of Patel et al.'s approach.

Patel et al. take into account every type of an expression occurring in the body of a subprogram (both left-hand side as well as right-hand side expressions) and all types that are used in declarations of the subprogram (declarations of local objects, parameters, and record components as well as type declarations).

Each occurrence of a type or record component in any expression or declaration within function f is counted as follows:
• If a reference is made to object (or parameter) V and V has type T, then each reference to V increments the counter associated with type T.
• Moreover, if T is a part-type of some other type T', then the counters of all types and components in the path from T to the root of the structured type are increased by one.
• If object V is a local object of function f with type T, then the counter of T is increased by one.

Example. The approach is based on the type composition view. It will be exemplified by the C code in Figure 5-6 (the example is a translation of Patel's example into C).

typedef struct {float real_part; float imag_part;} Complex;
typedef Complex Position_Vector [100];
void f (float rp, float ip)
{
   Position_Vector A;
   Complex X;
   int i;
   X.real_part = rp;
   X.imag_part = ip;
   for (i = 0; i < 100; i++) {
      A[i].real_part = X.real_part;
      A[i].imag_part = X.imag_part;
   }
}

Figure 5-6. Example C code for Type-based Cohesion.

Function f is related to the following types: int (local variable i), float (parameters rp and ip, components real_part and imag_part), Complex (local variable X), and Position_Vector (local variable A). Furthermore, it references the components real_part and imag_part of record Complex. The relations of the types are as in the type composition view in Figure 5-7.

For example, the counter of int is 6 because
• int occurs in the declaration for i
• i occurs in the initializing expression of the loop
• i occurs in the termination condition of the loop
• i occurs as loop increment
• i occurs twice as index to A

The counter for Complex is 7 because
• a variable X is declared of type Complex
• there are two assignments to variable X
• there are two uses of variable X
• there are two assignments to A[i], which is of type Complex

The counter for float is 10 because
• there are two parameters, rp and ip, of type float
• parameter rp is used once
• parameter ip is used once
• X.imag_part (of type float) is used once and set once
• X.real_part (of type float) is used once and set once
• A[i].real_part (of type float) is set once
• A[i].imag_part (of type float) is set once

And finally, the counter of Complex.real_part is 3 (likewise for Complex.imag_part) because
• X.real_part is set once
• X.real_part is used once
• A[i].real_part is set once

Figure 5-7. Type composition view for types related to function f in Figure 5-6. (It relates int, float, Complex, Position_Vector, and the components real_part and imag_part via of-type, part-type, and enclosing edges.)

Similarity between subprograms. The relatedness of a function f to a sequence S of n types [T1, T2,…, Tn] (including record components) can then be represented as an n-dimensional vector RS(f) = (c1, c2,…, cn) where ci is the counter associated with Ti computed as described above. Two functions f1 and f2 are compared based on RS(f1) and RS(f2) using the cosine of the angular separation between binary vectors (inspired by work on information retrieval principles; Salton, 1968):

(5.19)  $Sim(X, Y) = \frac{\sum_{i=1}^{n} x_i \times y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \times \sum_{i=1}^{n} y_i^2}}$
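A small Python sketch of the comparison in (5.19) follows; it is illustrative only. The per-function counters are assumed to have been collected by the counting rules above and are represented as dictionaries mapping type (or record component) names to counts; rs_f reproduces the counters derived for function f of Figure 5-6, while rs_g is an invented second function.

from math import sqrt

def cosine(rs1, rs2):
    # Cosine of the angle between two type-counter vectors (5.19).
    types = set(rs1) | set(rs2)
    dot = sum(rs1.get(t, 0) * rs2.get(t, 0) for t in types)
    norm = sqrt(sum(v * v for v in rs1.values())) * sqrt(sum(v * v for v in rs2.values()))
    return dot / norm if norm else 0.0

rs_f = {"int": 6, "Complex": 7, "float": 10,
        "Complex.real_part": 3, "Complex.imag_part": 3}
rs_g = {"Complex": 4, "float": 6, "Complex.real_part": 2}   # hypothetical function
print(cosine(rs_f, rs_g))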
Similarity between groups. The similarity for two groups of subprograms, A and B, where A = {a1, a2,…, ap} and B = {b1, b2,…, bq}, can be defined in terms of the similarity between all subprograms in the union of A and B. Let S = A ∪ B = {s1, s2,…, sn}:

(5.20)  $GSim(S) = \frac{\sum_{(i,j) \in N} Sim(s_i, s_j)}{n \times (n-1) / 2}$  where  $N = \{(i, j) \mid i, j \in \{1, \ldots, n\} \wedge i > j\}$

Patel et al. propose this as a cohesion metric for subprograms in a module, but it can also be used to cluster similar groups within algorithm 5-5 to find groups of related subprograms.

Differences to Schwanke's approach. Schwanke's and Patel's approaches are similar, but there are substantial differences:
• Patel et al.'s similarity between groups is based on the angular separation of vectors, whereas Schwanke proposes to use the maximum of the similarities between two groups or to use the k-nearest neighbor approach.
• The name of an entity counts only once at most in the approach of Schwanke. In the approach of Patel et al., it counts as often as it occurs.
• Direct relations between subprograms (i.e., the two subprograms call each other) are regarded by Schwanke; Patel et al. do not regard direct relations between subprograms.
• Schwanke also considers the fact that two subprograms are called by the same subprogram or both call the same subprogram, i.e., indirect relations of subprograms other than relations to types.
• All types and names of record components count equally in Patel et al.'s approach; Schwanke weighs them.
• Patel et al.'s similarity is basically a relation between subprograms and types, though this relation may be derived through used objects, whereas Schwanke's similarity is a relation between subprograms and non-local names. This can lead to very different results. An example will make this difference clear. Consider the following C code:

T a, b;
f () { a = b; }
g () { T i, j; }

Patel et al.'s metric considers f and g similar because a, b, i, and j are all of type T, whereas the similarity between f and g is 0 by Schwanke's metric.

Patel et al.'s approach can, to some extent, be considered a special case of Schwanke's approach. The major advantage of Schwanke's approach is that similarity is a relation based not only on related types but on all non-local names. Whether it makes sense to count each occurrence of a type (or record component), as proposed by Patel et al., has to be validated in practice.

5.11 Strongly Connected Components

Name: Strongly Connected Components
Reference: Cimitile, A. and Visaggio, G. (1995)
Domain: Call View
Range: RS
Disjoint Clusters: Yes

Mutually recursive subprograms form an atomic component because none of them can be omitted without losing a piece of information for the understanding of the other subprogram in the component. Mutually recursive subprograms form a cycle in the call graph, which corresponds to the notion of strongly connected components in graph theory. For example, in the call graph in Figure 5-8, we find two strongly connected components {4, 8} and {6, 10, 11}. They can be detected using the linear-time algorithm of Tarjan (1974).

Clustering criterion. All strongly connected components in the call graph form an atomic component.

Figure 5-8. Example of strongly connected components.
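The following is a minimal Python sketch of strongly connected component detection on a call graph, following the standard recursive formulation of Tarjan's linear-time algorithm (a generic textbook scheme, not the thesis implementation). The mapping 'calls' relates each subprogram to its callees; the two cycles in the example come from Figure 5-8, the remaining edges are invented for the sketch.

def strongly_connected_components(calls):
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in calls.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:            # v is the root of a strongly connected component
            scc = set()
            while True:
                w = stack.pop(); on_stack.discard(w); scc.add(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in calls:
        if v not in index:
            visit(v)
    return sccs

calls = {4: {8}, 8: {4, 9}, 6: {10}, 10: {11}, 11: {6}, 1: {4, 6}}
print([c for c in strongly_connected_components(calls) if len(c) > 1])
# yields the atomic component candidates {4, 8} and {6, 10, 11}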
5.12 Dominance Analysis

Name: Dominance Analysis
Reference: Cimitile, A. and Visaggio, G. (1995)
Domain: Call View, Object Reference View, Type Usage View, and Components View
Range: Yields additional local subprograms/objects.
Disjoint Clusters: Yes

Atomic components often have local subprograms that offer basic services to the accessor routines that constitute the interface of the atomic components. These subprograms do not necessarily mention the type of the ADT in their signature or reference the global objects of the ADO, respectively, and they may remain undetected by all approaches described above. Because these basic service subprograms are local to the atomic components, they are an essential part of them and an atomic component cannot be understood without them. Local in this context means that they are only used by routines in the atomic component. It does not mean that they are local in the sense of nested scopes; quite the opposite: because they may be used by several other routines in the atomic components, they must be visible to all of them, not to mention that C does not allow nested subprograms. These kinds of local routines can be detected by means of dominance analysis.

Locality in the mentioned sense can be viewed as a dominance relation in graph theory. Before we show how, we define the dominance relation: A node, N, is said to dominate another node, M, in a directed graph, G, if each path from the root of G to M contains N. If N is a dominator of M and every other dominator N′ of M is also a dominator of N, then N is called an immediate or direct dominator of M. The dominance relationship can be represented as a dominance tree where a node's parent is its immediate dominator.

Cimitile and Visaggio (1995) propose to apply dominance analysis to call graphs to identify candidates for reusable modules. In their approach, cycles (i.e., strongly connected components) are collapsed before dominance analysis is applied. This approach is applied here to detect additional entities local to the components already found. The main differences are that not only cycles are collapsed, but also any atomic component and, furthermore, dominance analysis will here be applied not only to the call view, but to the union of the call, type usage, and object reference view (see Table 3-5 on page 68). This way, objects and types local to components can be detected.

The algorithm involves the following basic steps:
1. All members of an atomic component are collapsed to a single node and all edges to and from members are redirected to and from the substituting node (this step is denoted by Collapse).
2. Dominance analysis is applied to the collapsed graph resulting from the previous step.
3. In the dominance tree, each component C absorbs its (transitively) dominated subprograms that are not dominated by any other component dominated by C.

Any (transitive) descendant of node A in the dominance tree is local to A. That is, we can add all descendants of a component C in the dominance tree (i.e., the transitive closure of C with respect to dominate edges) to C. However, as the example dominance tree in Figure 5-9 illustrates, there may be another component C′ (transitively) dominated by C. Then, the descendants of C′ in the dominance tree should rather be considered part of C′ instead of C since they are primarily local to C′; C′ can then be considered a part of C.
For the example in Figure 5-9, S1 and C′ are part of C whereas S2 and S3 are part of C′.

Figure 5-9. Example dominance tree. (C dominates S1 and the component C′; C′ dominates S2 and S3.)

Clustering criterion. A base entity N is considered part of the first component on the path from N to the root in the dominance tree. More formally, the entity is added to its primarily dominating atomic component. An atomic component AC is said to primarily dominate an entity N if and only if AC dominates N and there is no other atomic component dominating N that is also dominated by AC.

The clustering criterion adds only base entities to components, thus completing existing components. However, the same technique can also be used to detect subsystems by subsuming components under other components according to the dominance relationship. Detecting subsystems by means of dominance analysis was the subject of one of our case studies (1997a) and is further explored in the thesis of Jean-François Girard.

Algorithm 5-6 adds all local entities to atomic components according to the clustering criterion. The function is started at the root of the dominance tree and recursively traverses the dominance tree in depth-first order.

Algorithm 5-6. Detecting parts of components in the dominance tree.

function DFS (Root : Node) return Node_Set is
   Descendants : Node_Set := ∅;
begin
   for each D in dominatees (Root) loop
      Descendants := Descendants ∪ DFS (D);
   end loop;
   if Is_Atomic_Component (Root) then
      Add Descendants to Root;
      return ∅;
   else
      -- a node that is not itself a component is a candidate for absorption
      -- by the first component above it
      return Descendants ∪ {Root};
   end if;
end DFS;

Example. The single steps of the approach can be explained with the example of Figure 5-10. Part (a) shows the input graph that contains two atomic components AC1 = {4,8} and AC2 = {6,10,11} detected by previous analyses; they are collapsed in step (1). The result is shown in part (b). The result of applying dominance analysis to part (b) is presented in part (c) of Figure 5-10. Function DFS of Algorithm 5-6 applied to the dominance tree in Figure 5-10(c) adds the two subprograms 12 and 13 and object Z to atomic component AC2 as well as object X to atomic component AC1.

Figure 5-10. Example of dominance analysis: (a) input graph, (b) collapsed graph, (c) dominance tree.

5.13 Preliminary Taxonomy of Basic Structural Techniques

The techniques described in this section are based on structural information only. They neither leverage data flow information nor domain knowledge. They can roughly be classified as follows (this classification will be extended in Section 12.1):
• Connection-based approaches cluster entities based on a specific set of direct relationships (and their quality) between entities to be grouped. For example, a routine must have a type in its signature (i.e., the two of them must be directly connected) and the corresponding parameter of this type must be internally accessed by the routine (quality of the relationship) in order to be grouped together. Global Object Reference, Same Module, Part Type, Internal Access, and Same Expression belong to this category.
• Metric-based approaches cluster entities based on a metric using an iterative clustering approach. Schwanke's Similarity Clustering and Type-based Cohesion fall in this category. The metric-based approaches are based on connections, too, but they differ from connection-based approaches by the degree of freedom that is offered by the metric parameters and the threshold that can be varied to find atomic components with varying confidence. Delta-IC is a hybrid within this classification.
It is based on a metric but is not really iterative unless one may view it as clustering with only one iteration. Furthermore, it actually consists not only of the metric but also of a clustering heuristic in the first place: to group closely related subprograms and the referenced objects to a given subprogram. The metric is only used to filter out non-relevant clusters. Nevertheless, I will consider it a metric-based approach because the metric is the predominant factor of this approach and because all connection-based approaches can be viewed as metric-based by expressing their underlying heuristic as a metric (this will be shown in Section 8.4). The opposite direction, i.e., viewing all metric-based approaches as connection-based, does not hold because connection-based approaches always yield the same fixed pattern and there are no parameters to influence clustering.
• Graph-based approaches derive clusters from a graph by means of graph-theoretic analyses. The difference to connection-based approaches is that the whole graph has to be considered, whereas connection-based approaches regard only direct relationships between entities in order to decide whether they should be grouped. Strongly-Connected Components Analysis and Dominance Analysis belong to this category.

In this thesis, only the listed connection-based, metric-based, and graph-based techniques are explored. Other techniques are described in Chapter 11.

Chapter 6
Evaluation of the Basic Techniques

In the last chapter, proposed basic techniques for atomic component detection were presented. All but Same-Expression have been published and do not originate from us (though we proposed some improvements in the last chapter). However, the original authors rarely described any comparable quantifying evaluation of their techniques. In 1997, we conducted an evaluation of most of these basic techniques (Girard, Koschke, Schied, 1997b) by comparing the atomic components recovered by the approaches presented to those identified by software engineers. The results have already been published (Girard, Koschke, and Schied, 1999; Girard and Koschke, 2000). These results are repeated here. However, this chapter also includes new techniques and systems to achieve a more complete comparison and provides more detail about the method of comparison.

6.1 Reference Corpus

In order to establish a comparison point for the detection quality of the automatic recovery techniques, software engineers manually compiled a list of reference atomic components (short: reference components or references) for diverse C systems. These reference atomic components are used for statistical analyses, for calibrating parameters of diverse metrics, and to evaluate the automatic techniques. For the evaluation, we compared the components proposed by automatic techniques, called candidate atomic components (short: candidate components or candidates), to the reference components. The sets of references for this comparison are called the reference sets or reference corpora. This section summarizes how the reference sets were obtained and validated.
6.1.1 Systems Studied

The reference components were obtained for several medium size C programs (see Table 6-1 for their characteristics). Aero is an X-window-based simulator for rigid body systems (Keller, 1995), Bash is a Unix shell (Ramey, 1994), CVS is a tool for controlling concurrent software development (Berliner, 1990), and Mosaic is a world-wide web browser (NCSA, 1997). All figures about program length in terms of lines of code throughout this thesis are ascertained with the Unix tool wc, hence include comments and blank lines. Most systems have additional libraries that often encapsulate platform dependencies. These libraries were not investigated. Table 6-1 lists only the size of the core systems that were analyzed by the software engineers.

Table 6-1. Suite of analyzed C systems.

System Name   Version             Lines of Code   # User Types   # Global Objects   # Routines
Aero          1.7                 31 Kloc         57             480                488
Bash          1.14.4              38 Kloc         60             487                1002
CVS           1.8                 30 Kloc         41             386                575
Mosaic        2.6 (without GUI)   37 Kloc         79             269                564

6.1.2 Obtaining the Reference Atomic Components

The reference components of Aero, Bash, and CVS were compiled by human analysts in 1997. The reference components for Mosaic are the result of the experiment described in Chapter 10. The actual numbers of all major forms of atomic components (abstract data types, abstract data objects, hybrid atomic components) that were identified for each studied system are listed in Table 6-2. The rest of this section gives more detail about how the reference components were established for the respective systems and why they provide a reasonable basis for comparison.

Table 6-2. Number of atomic components in analyzed systems.

System   #ADT   #ADO   #Hybrid   #Total
Aero     9      16     1         26
Bash     18     16     5         39
CVS      13     35     6         54
Mosaic   12     28     13        53

6.1.2.1 Reference Components for Aero, Bash, and CVS

We asked five software engineers to identify atomic components in Aero, Bash, and CVS. Table 6-3 summarizes their experience and how the task was divided among them.

Table 6-3. Human analysts.

Software Engineer   Programming Experience   System Analyzed
se1                 2 years research         Bash
se2                 2 years research         Bash
se3                 5 years research         Bash
se4                 5 years research         CVS
se5                 > 5 years industry       Aero

There was no overlap of their work. They needed between 20 and 35 hours for each system to gather the atomic components of the respective systems. The software engineers were provided with the source code of each system, a summary of connections between global variables, types, and functions, and the guidelines given in Figure 6-1. Because the analysis of Bash was distributed among three software engineers, they performed a review of each other's work and came to a consensus on the final reference components.

The guidelines did not exclude overlap between atomic components, i.e., sharing elements among components. There was a small degree of overlap of the reference components of Aero and Bash and no overlap for CVS.

The fact that our reference components used as comparison point were produced by people raises the question whether other software engineers would identify the same atomic components. In order to answer this question, Jean-François Girard performed an experiment on a subset of CVS containing 2.8 KLOC and composed of the following key files: history.c, lock.c, cvs.h. These source files were distributed along with a cross-reference table indicating the relations among types, global variables, and functions.
Four software engineers had the task of identifying the atomic components present. He collected a description of the procedure they followed along with their results, then looked for cases where they seemed to have broken their own rules and asked them to refine either their procedure or their results. He also revisited with them those atomic components for which a comment indicated that they were unsure or something was unclear and corrected their results according to their conclusions.

Figure 6-1. Guidelines for human analysts.

Identify the existing atomic components present in this system. These are abstract data objects (ADO), and abstract data types (ADT), or a combination of both.
• Here we specified abstract data types, abstract data objects, and hybrid components by the definitions already presented in Chapter 3.
• The key difference of ADT and ADO is that an ADT is built around a type and an ADO around a set of simple global variables. This can be decided automatically, so do not waste time writing it down. Just identify the functions, variables, and types which belong together because they are cohesive and correspond to the idea of ADO and/or ADT.
• In practice, programmers sometimes break the encapsulation principle, therefore we widen the definition of abstract data objects and abstract data types to clusters of types or variables, respectively, with their accessor routines. The internal representation of ADTs and ADOs can be public.
• Nota bene: not all functions, variables, or types have to be put into ADO, ADT, and hybrid components.
• In general, your experience and understanding has more value than rules; you are the last judge of what constitutes an ADO/ADT.

The four software engineers agreed on the basic principles that characterize an atomic component and proposed very similar components. There were some divergences on the details; for example, one of them added functions to an abstract data type which did not have the type T of the abstract data type in their signature, but applied a cast of type T to one of their parameters. These divergences occurred rarely.

Jean-François Girard performed a second experiment on a subset of Bash containing 5.9 KLOC and composed of the following key files: copy_cmd.c, dispose_cmd.c, execute_cmd.c, make_cmd.c, print_cmd.c, and command.h. He followed the same procedure but distributed the subsystem to two software engineers who did not know the system to avoid learning effects.

Finally, in order to assess if these experiment results from a system subset can be generalized to a complete system, one software engineer identified atomic components in the whole Bash system. The atomic components he identified were compared to those of the reference components used in this thesis (those obtained by consensus).

A quantitative evaluation of the degree of agreement among the software engineers showed, first, that the software engineers agreed to a very high degree on the atomic components of these systems and, second, that the agreement gained on a smaller subset can indeed be generalized to the rest of the system. Therefore, we may conclude that the reference components for Aero, Bash, and CVS are a suitable oracle. The procedure used for the quantitative evaluation and its exact results are explained by Girard, Koschke, and Schied (1999).
6.1.2.2 Reference Components for Mosaic

The reference components for Mosaic were obtained by the experiment described in Chapter 10 in which human analysts had to detect atomic components either manually or with tool support. The task of the experimental subjects was to recover as many atomic components for Mosaic as possible within 6 hours. In order to obtain comparable results, we reduced the possible search space for atomic components to a size that could be handled within the given time frame, i.e., all experimental subjects should be able to look at all source files within the available time. Therefore, we excluded the files that are mainly devoted to the graphical user interface, namely, all files whose names begin with the prefix gui. The 8 excluded files comprise 15 KLOC, i.e., 40 files consisting of 37 KLOC were to be analyzed. To obtain a common basis of comparison, the atomic components separately detected by each individual were merged and then validated by at least two participants. Only those atomic components were accepted for which a consensus could be reached.

6.2 Comparison of Candidate and Reference Components

Candidate components and reference components are compared using an approximate matching to accommodate the fact that the distribution of functions, global variables, and types into atomic components is sometimes subjective and, pragmatically, we have to cope with matches of candidates and references that are incomplete, yet "good enough" to be useful. "Good enough" means that candidate and reference overlap to a large extent and only few elements are missing. More precisely, we treat one component S as a match of another component T if S is a partial subset of T (denoted by S ⊆p T) according to the definition of partial subset in Section 3.5.4.3. For the results reported in this chapter, p = 0.7 is assumed, i.e., at least 70 percent of the elements of S must also be in T. This number is arbitrary, but motivated by the fact that at least three elements of a four-element atomic component must also be in the other atomic component to be an acceptable match.

6.2.1 Classification of Matches

Based on the approximative matching criterion, the generated candidates are classified into three categories according to their usefulness to a software engineer looking for atomic components:
• Good when the match between a candidate C and a reference R is close (i.e., C ⊆p R and R ⊆p C). This case is denoted 1~1. Matches of this type require a quick verification in order to identify the few elements which should be removed or added to the candidate component.
• Ok when the ⊆p relationship holds only in one direction for a candidate C and a reference R:
  - C ⊆p R, but not R ⊆p C. The candidate is too detailed. This case is denoted as n~1.
  - R ⊆p C, but not C ⊆p R. The candidate is too large. This case is denoted as 1~n.
  Partial matches of this type require more attention to combine or refine a component.
The denotation n~1 and 1~n reflects the fact that multiple Ok matches may exist for a given R or C. Altogether, we have three classes of matches: 1~1, 1~n, and n~1, where the latter two are both considered Ok.

Example. Consider the example in Figure 6-2. C1 and R1 are a good match. Because only partial matches are required, there can be another reference R4 (with R4 ∩ R1 = ∅) that is a partial subset of C1 (of C1 \ R1, more precisely). C2 is an Ok match with R2, and so is C3.
C2, C3, and R2 constitute an n~1 match. That is, the technique has produced finer-grained candidates than what was expected. Note that we cannot necessarily conclude that C2 ∪ C3 and R2 are a good match because R2 could be much bigger than C2 ∪ C3. R3 and C4 constitute a 1~n match, where no other reference than R3 can be matched with C4. C5 and R5 do not match at all.

As the example indicates, it is not enough just to count the matches in order to judge the detection quality of a technique. For example, R3 is a partial subset of C4 and, therefore, considered at least an Ok match. However, C4 could be huge and the match just be coincidence. The next section proposes a measurement for detection quality based on multiple aspects that considers this imprecision.

6.2.2 Detection Quality

There are several aspects in a comparison of a set of candidates with a set of references to consider when the matches have been established as described in the last section:
• Number of false positives: The number of candidates that neither match a reference nor are matched by any reference, i.e., candidates that cannot be associated with any reference. Technically speaking, these are candidates that are neither involved in a 1~1, 1~n, nor n~1 match. This number should be 0.
• Number of true negatives: The number of references that neither match a candidate nor are matched by any candidate, i.e., references that are not even partially detected. Technically speaking, these are references that are neither involved in a 1~1, 1~n, nor n~1 match. This number should be 0.
• Granularity of matches: Are the candidates at the right level of granularity? Technically speaking, there should only be good matches and no Ok matches.
• Precision of matches: The degree of correspondence between candidates and reference matches. This is discussed in the following in more detail. The precision should approach 1.0.

Since the partial subset relationship is used to establish a match, the matching candidates and references need not be equal. That is, there may be elements of the candidate not in the reference and vice versa: C\R ≠ ∅ and R\C ≠ ∅. In other words, there may be a flaw in a good match; even more so for Ok matches because of (let R be a reference and Ci be candidates for which Ci ⊆p R holds):

$\bigcup_i C_i \subseteq_p R \not\Rightarrow R \subseteq_p \bigcup_i C_i$

Figure 6-2. Example correspondences of candidates and references. (Candidates C1-C5 and references R1-R5 related by ⊆p.)

Accuracy for two matching components. In order to indicate the quality of imperfect matches of candidate and reference components, an accuracy factor is associated with each match. The similarity between two components, and thus the accuracy of a candidate vis-a-vis a reference component, is computed using the following formula:

(6.1)  $\textit{accuracy}(A, B) = \textit{overlap}(\textit{elements}(A), \textit{elements}(B))$  where  $\textit{overlap}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$

In 1~n and n~1 matches (and sometimes even in 1~1 matches) several components may match with one other component. The accuracy as defined above, however, involves only two single components. Therefore, the definition is extended for sets of components as follows.

Accuracy for two sets of components. The overlap for two matching components can be used to ascertain the accuracy of sets of components:

(6.2)  $\textit{accuracy}(\{A_1, \ldots, A_a\}, \{B_1, \ldots, B_b\}) = \textit{overlap}\left(\bigcup_{i=1 \ldots a} \textit{elements}(A_i),\ \bigcup_{i=1 \ldots b} \textit{elements}(B_i)\right)$
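As a small illustration (not taken from the thesis), the overlap and set accuracy of (6.1) and (6.2) can be computed as follows in Python; the component names and element sets are invented.

def overlap(x, y):
    # overlap(X, Y) = |X intersect Y| / |X union Y|
    return len(x & y) / len(x | y) if x | y else 0.0

def accuracy(a_sets, b_sets):
    # Accuracy for two sets of components (6.2): compare the united element sets.
    union_a = set().union(*a_sets) if a_sets else set()
    union_b = set().union(*b_sets) if b_sets else set()
    return overlap(union_a, union_b)

# Example: two fine-grained candidates against one larger reference component.
candidates = [{"f1", "f2", "v"}, {"f3", "t"}]
reference = [{"f1", "f2", "f3", "t", "v", "f4"}]
print(accuracy(candidates, reference))   # 5/6, about 0.83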
Accuracy for classes of matches. The accuracy for two sets of components is used to establish the accuracy for a whole class of matches where the two sets {A1,…,Aa} and {B1,…,Bb} are corresponding components in a match within a given class of matches. More precisely, let the matching components of a candidate or reference, X, be defined as follows:

(6.3)  $\textit{matchings}(X) = \{Y \mid Y \subseteq_p X\}$

Using the matching components, we can specify the degree of agreement for the diverse classes of matches:
• 1~1 match: $\textit{accuracy}(\textit{matchings}(C), \textit{matchings}(R))$
• n~1 match: $\textit{accuracy}(\textit{matchings}(R), \{R\})$
• 1~n match: $\textit{accuracy}(\{C\}, \textit{matchings}(C))$

To put it in words: the accuracy is ascertained based on the united matching components. The way of handling 1~1 matches may first be astonishing, but is motivated by the fact that a 1~1 match does not necessarily mean that there is no other component that is a partial subset of one of the components in the 1~1 match. This was already touched upon in the example of Figure 6-2 on page 160 where C1 and R1 are a 1~1 match and R4 is still a partial subset of C1. If there is no such additional 1~n or n~1 match, then:

$\textit{accuracy}(\textit{matchings}(C), \textit{matchings}(R)) = \textit{accuracy}(R, C)$

That is, we subsume an additional 1~n or n~1 match in a 1~1 match. This is justified because there is a clear 1~1 relationship in the first place and the additional 1~n or n~1 match can only be comparatively small. Such overlapping matches can also exist for pure 1~n and n~1 matches, as the example in Figure 6-3 illustrates. However, the overlap of 1~n and n~1 matches is ignored since there is no dominating correspondence as in the case of overlaps with 1~1 matches. That is, the two overlapping matches in Figure 6-3 are handled as two distinct matches.

Figure 6-3. Overlapping 1~n and n~1 matches.

Now that we have the means to establish the accuracy of a single match with respect to its class (1~1, 1~n, n~1), we can extend the accuracy to the whole class of matches. The classes of matches are defined as follows:

$\textit{GOOD} = 1{\sim}1 = \{(\textit{matchings}(c), \textit{matchings}(r)) \mid c \subseteq_p r \wedge r \subseteq_p c\}$
$1{\sim}n = \{(\{c\}, \textit{matchings}(c)) \mid \textit{matchings}(c) \neq \emptyset \wedge \forall r: r \subseteq_p c \Rightarrow \neg(c \subseteq_p r)\}$
$n{\sim}1 = \{(\textit{matchings}(r), \{r\}) \mid \textit{matchings}(r) \neq \emptyset \wedge \forall c: c \subseteq_p r \Rightarrow \neg(r \subseteq_p c)\}$
$\textit{OK} = 1{\sim}n \cup n{\sim}1$

Then, the accuracy for a whole class of matches is defined as the average accuracy of the members of the class (let M be a class of 1~1, 1~n, or n~1 matches):

(6.4)  $\textit{accuracy}(M) = \frac{\sum_{(a,b) \in M} \textit{accuracy}(a, b)}{|M|}$

Overall recall rate. In the following, the detection quality of a technique is described by a vector of the number of false positives and true negatives and the average accuracies of 1~1, 1~n, and n~1 matches according to (6.4), along with their respective absolute numbers to indicate the level of granularity. These figures provide a detailed picture for the comparison of the techniques. However, an additional summarizing value is useful for a quick comparison. The following equation characterizes the overall recall rate (GOOD and OK are defined above):

(6.5)  $\textit{Recall} = \frac{\sum_{(a,b) \in \textit{GOOD}} \textit{accuracy}(a, b) + \sum_{(a,b) \in \textit{OK}} \textit{accuracy}(a, b)}{|\textit{GOOD}| + |\textit{OK}| + |\textit{true negatives}|}$

To illustrate the definition of the recall rate, consider the example in Figure 6-4, in which the matching components of each candidate and reference component of Figure 6-2 have been merged for the comparison. There are two OK matches and one good match. R5 is not matched at all and, therefore, considered a true negative; likewise, C5 is a false positive because it does not correspond to any reference.

Figure 6-4. Example merged correspondences of candidates and references. (C1 and R1 ∪ R4 form the good match; C2 ∪ C3 with R2 and C4 with R3 form the OK matches.)

The example also illustrates that the denominator of (6.5) cannot simply be the number of the original references because not only candidates but also references can be united for the comparison, which reduces the number of references actually used for the comparison.
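A minimal Python sketch of the match classification just defined follows; it is illustrative only. Candidates and references are assumed to be given as sets of element names, p is the partial-subset threshold, and all identifiers are placeholders; the recall rate of (6.5) can then be obtained from the returned classes together with the per-match accuracies of (6.1)-(6.4).

def psubset(s, t, p=0.7):
    # S is a partial subset of T: at least the fraction p of S's elements is in T.
    return bool(s) and len(s & t) / len(s) >= p

def classify_matches(cands, refs, p=0.7):
    good, one_n, n_one = [], [], []
    for c in cands:
        refs_in_c = [r for r in refs if psubset(r, c, p)]      # matchings(c)
        if any(psubset(c, r, p) and psubset(r, c, p) for r in refs):
            good.append(c)                                     # 1~1 match
        elif refs_in_c:
            one_n.append((c, refs_in_c))                       # 1~n: candidate too large
    for r in refs:
        cands_in_r = [c for c in cands if psubset(c, r, p)]    # matchings(r)
        if cands_in_r and not any(psubset(r, c, p) for c in cands_in_r):
            n_one.append((cands_in_r, r))                      # n~1: candidates too detailed
    # Candidates and references involved in no match at all.
    false_pos = [c for c in cands
                 if not any(psubset(c, r, p) or psubset(r, c, p) for r in refs)]
    true_neg = [r for r in refs
                if not any(psubset(c, r, p) or psubset(r, c, p) for c in cands)]
    return good, one_n, n_one, false_pos, true_neg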
The recall rate (6.5) abstracts from the level of granularity (since good and OK matches are treated equally by this definition) and ignores false positives. The number of false positives is a different aspect and is not captured by this definition because, depending on the task at hand, a higher number of false positives in favor of a higher recall rate may be acceptable.

6.3 Benchmark Results for the Basic Techniques

We applied the techniques listed in Table 6-4 to the reference corpus described in Section 6.1. Note that some of the techniques are only designed for detecting one type of atomic component, while some are basically able to detect both ADTs and ADOs; therefore, Table 6-4 summarizes once again what kind of atomic components occurring in the reference corpus are detected by the respective technique (a "–" means that the technique cannot detect atomic components of this kind, a "✓" says that the type of atomic component can be detected). Results for hybrid components are not directly reported because there were not enough of them in the reference corpus for a valid evaluation (Mosaic would have been the only exception). Instead, since hybrid components can be viewed as extended ADTs or ADOs, hybrid components are part of the reference set to which the candidates are compared. A technique suitable for ADT or ADO detection can at least partially detect a hybrid component. Hence, the reference set for a technique suitable to detect ADTs contains all reference ADTs and hybrid components, while the reference set for a technique detecting ADOs consists of abstract data objects and hybrid components of the reference corpus. Note that the original Delta-IC method was proposed as an iterative approach involving human validation and

Table 6-4. Evaluated combinations of atomic component detection techniques.

Method                    ADT    ADO
Global Object Reference   –      ✓
Same Module               ✓      ✓
Internal Access           ✓      ✓
Part Type                 ✓      –
Same Expression           –      ✓
Delta-IC (a)              –      ✓
Arch (Schwanke)           ✓      ✓
Type-based Cohesion       (✓)    –

a. Previously published evaluations of Delta IC (Girard, Koschke, and Schied, 1997b, 1997c, 1999; Girard, Koschke, 2000) are based on a definition of IC that differs from the original one; in the evaluation presented in this chapter, the original definition is used.

Table 6-5. Detected ADTs and ADOs. Method System ADT ADO Good OK Good OK too large too detailed too large too detailed # acc. # acc # acc # acc.
# acc # acc Global Object Refer- ence Aero - --- - --- - --- 3 0.91 0 0.00 0 0.00 Bash - --- - --- - --- 5 0.92 0 0.00 0 0.00 CVS - --- - --- - --- 8 0.92 1 0.61 6 0.41 Mosaic - --- - --- - --- 12 0.89 5 0.49 4 0.29 Same Module Aero 3 0.81 0 0 1 0.34 7 0.88 3 0.10 1 0.27 Bash 4 0.81 0 0.00 0 0.00 5 0.88 4 0.33 1 0.64 CVS 6 0.80 1 0.23 3 0.29 25 0.90 4 0.68 10 0.51 Mosaic 5 0.94 0 0.00 6 0.33 13 0.90 5 0.46 4 0.30 Internal Access Aero 2 0.87 0 0.00 1 0.60 0 0.00 4 0.28 1 0.11 Bash 10 0.97 3 0.44 1 0.59 3 0.89 1 0.26 1 0.43 CVS 4 0.93 1 0.26 4 0.35 6 0.87 0 0.00 15 0.35 Mosaic 5 0.90 3 0.37 6 0.42 2 0.88 1 0.67 11 0.48 Evaluation of the Basic Techniques 166 slicing of functions. Since we are comparing the technique to non-interactive techniques, we leave out the validation and slicing steps for this evaluation. Note that Type-based Cohesion can only find groups of related routines; it groups neither types nor variables. However, the job of the software engineers was only to find ADTs, ADOs, and hybrid components and, therefore, the reference corpus does not contain sets of related subprograms. Nevertheless, because Type-based Cohesion is based on the types the subprograms share, one can attempt to find at Part Type Aero 2 0.90 0 0.00 1 0.60 - --- - --- - --- Bash 9 0.93 1 0.56 0 0.00 - --- - --- - --- CVS 4 0.95 4 0.37 5 0.26 - --- - --- - --- Mosaic 6 0.95 2 0.68 5 0.33 - --- - --- - --- Same Expres- sion Aero - --- - --- - --- 2 0.94 1 0.19 0 0.00 Bash - --- - --- - --- 4 0.88 0 0.00 2 0.50 CVS - --- - --- - --- 5 0.79 2 0.18 13 0.50 Mosaic - --- - --- - --- 1 0.73 0 0.00 5 0.50 Delta-IC Aero - --- - --- - --- 4 0.81 2 0.25 0 0.00 Bash - --- - --- - --- 4 0.82 1 0.47 2 0.26 CVS - --- - --- - --- 9 0.85 0 0.00 15 0.45 Mosaic - --- - --- - --- 18 0.86 5 0.52 10 0.44 Arch Aero 1 0.80 0 0.00 5 0.28 3 0.79 6 0.28 1 0.47 Bash 0 0.00 4 0.39 6 0.41 4 0.70 5 0.28 1 0.21 CVS 2 0.86 5 0.46 3 0.37 4 0.76 10 0.46 18 0.48 Mosaic 6 0.79 3 0.49 4 0.36 7 0.72 4 0.47 13 0.47 Type Based Cohesion Aero 0 0.00 1 0.20 4 0.32 - --- - --- - --- Bash 2 0.69a 0 0.00 7 0.32 - --- - --- - --- CVS 0 0.00 1 0.14 5 0.34 - --- - --- - --- Mosaic 0 0.00 0 0.00 8 0.41 - --- - --- - --- a. Note that the accuracy of good matches can also be below the threshold of the partial subset relationship. For example, if R and C both have 10 elements and 7 elements of R are in C and 7 elements of C are in R, then R Í0.7 C Ù C Í0.7 R holds and, hence, R and C are a good match. However, the overlap of R and C is only 7/13 = 0.54 < p = 0.7. Table 6-5. Detected ADTs and ADOs. Method System ADT ADO Good OK Good OK too large too detailed too large too detailed 167 Benchmark Results for the Basic Techniques least the accessor functions of an ADT with it. On the other hand, in the evalua- tion, one should be aware that the type itself is not in the candidate ¾ which decreases the recall rate slightly ¾ and groups of related subprograms are not in the reference corpus ¾ which may increase the number of false positives. The techniques often produce candidates with less than three elements. Since components of this size are very rare (Aero has one, Bash has none, CVS has five, and Mosaic has three reference components with only two elements), candidates with less than three elements are filtered out. This way, less false positives were produced. The numbers given in the following were established after the filter for small components has been applied. 
Furthermore, since the largest reference components for the subject systems have less than 50 elements, another filter was applied that ignores candidates with more than 75 elements; i.e., the size of the candidates was not allowed to exceed 50% of the largest reference component. This restriction mainly affected Global Object Reference and Part Type (in partic- ular, 1~n matches), which both tend to produce very large candidates. Candidates of that size would require too much effort for validation and are therefore of little help to a maintainer. The detailed information on the number of detected components is shown in Table 6-5, and the number of false positives and true negatives are listed in Table 6-6. Figure 6-5 and Figure 6-6 summarize the recall rates for ADT and ADO detec- tion, respectively, according to equation (6.5). To be fair, we must also mention that the reference components were used to calibrate Delta-IC and Type-based Cohesion that both require parameter adjustment. In practice, one does not have the reference components in advance and has to estimate parameters based on a manually extracted list of reference components of a representative sample of the system. That is, in practice the results can be worse for Delta-IC and Type-based Cohesion. ADT recall. According to these summaries, the effectiveness of a technique strongly depends upon the system. In the case of the ADTs of Aero, all tech- niques are similarly effective; Same Module is only slightly ahead. Moreover, Type-based Cohesion is among the techniques with the least recall rate for all sys- Evaluation of the Basic Techniques 168 Table 6-6. Number of false positives and true negatives. Technique Aero Bash CVS Mosaic false positives true negatives false positives true negatives false positives true negatives false positives true negatives Global Reference ADT --- --- --- --- --- --- --- --- ADO 7 14 17 16 4 27 3 13 Same Module ADT 4 6 8 19 4 8 2 14 ADO 17 5 24 8 8 1 1 14 Internal Access ADT 2 7 5 7 9 11 3 10 ADO 12 10 12 16 5 20 1 27 Same Expres- sion ADT --- --- --- --- --- --- --- --- ADO 9 13 11 15 9 22 1 35 Part Type ADT 3 7 11 13 8 5 2 11 ADO --- --- --- --- --- --- --- --- Delta-IC ADT --- --- --- --- --- --- --- --- ADO 18 11 18 14 10 19 3 9 Arch ADT 13 4 29 13 36 9 11 10 ADO 38 6 46 11 42 17 11 15 Type- based Cohesion ADT 24 5 12 14 17 13 18 17 ADO --- --- --- --- --- --- --- --- Figure 6-5. Recall rate for ADTs. 0,15 0,16 0,1 0,13 0,22 0,17 0,27 0,33 0,23 0,55 0,27 0,37 0,28 0,14 0,33 0,27 0,24 0,39 0,37 0,36 Aero Bash CVS Mosaic 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 Part Type Same Module Internal Access Arch Type- based Cohesion 169 Benchmark Results for the Basic Techniques tems and has also many false positives. In the case of Bash, Internal Access is far better than all other techniques. Internal Access is also among the best techniques for the other systems. Likewise, Part Type has a constantly high recall rate for all systems while Same Module fails for Bash. Arch is one of the less effective tech- niques and has most false positives. ADO recall. Same Module identifies more abstract data objects than any other approach (except for Mosaic where Delta-IC finds most ADOs). Arch has also one of the higher recall rates for all systems but also most false positives. Global Object Reference is among the best techniques only for Mosaic. Internal Access for ADOs is far less effective than it is for ADTs. 
The recall rates of Same Expression are comparatively low, but Same Expression has the fewest false posi- tives (except for CVS where it is on average). Delta-IC has comparatively high recall rates for all systems ¾ in the case of Mosaic, it is even clearly the best. The number of false positives of Delta-IC is average. The overall result of this evaluation is that the recall rate of all automatic tech- niques does not compare to the human recall rate, neither for ADTs nor for ADOs. The best recall rate of Same Module for CVS is a rare exception. In most cases, the best recall rates are between 20 and 40 percent. Figure 6-6. Recall rates for ADOs. 0,22 0,2 0,34 0,54 0,28 0,21 0,32 0,35 0,13 0,22 0,26 0,080,08 0,16 0,26 0,19 0,42 0,35 0,75 0,42 0,16 0,22 0,24 0,42 Aero Bash CVS Mosaic 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 Global Reference Same Module Internal Access Same Expression Arch Delta IC Evaluation of the Basic Techniques 170 Furthermore, the number of false positives and true negatives is high for many automatic techniques. False positives deserve special attention and, hence, we devote Section 6.4 to their analysis. 6.4 Analysis of False Positives The techniques proposed some atomic components for which no corresponding reference components existed and, therefore, were classified as false positives. We investigated these false positives to learn more about the weaknesses of the techniques. It turned out that a few false positives are indeed correct positives; many of these were too small to be considered by the software engineers as atomic components, others were simply overlooked by the analysts. The analysis of false positives was only done for Aero, Bash, and CVS and for the techniques Same Module, Part Type, Internal Access, Global Object Reference, and Delta-IC. The false positives of Type-Based Cohesion and Arch were not investigated since the primary goal was to see whether these apparent false posi- tives are really false positives or simply overlooked and to find common patterns of these false positives rather than to detect the specific weaknesses of each tech- nique. 6.4.1 Average Size of False Positives Before Manual Validation The average size of false positives is an important factor to consider when evalu- ating approaches, because the time a user of the approach needs to discard a false positive is related to the size of the candidate. For this reason, Table 6-7 reports also the average size of false positives identified by the various techniques on each system (measured in terms of the number of functions, variables, and types of the false positive candidates). 6.4.2 Overlooked Atomic Components Jean-François Girard, Hiltrud Betz who was one of the original analysts, and I browsed the lists of false positives for the techniques and classified each candi- date either as overlooked positive when it actually could be regarded as reason- 171 Analysis of False Positives able atomic component or else as real false positive. Figure 6-7 shows the number of real false positives and overlooked positives. The figure reveals that 42% of the ADO candidates and 41% of the ADT candidates originally classified as false positives are indeed overlooked positives. The presence of such overlooked positives is understandable since browsing a 30 KLOC program manually is a tedious process and even larger components can easily be overlooked. 
This is interesting because it stresses the importance of automatic support and it shows that an automatic technique does not have to be perfect in order to be useful. Moreover, some of the false positives could be justi- fied from a different point of view (see “Different views” in the following sec- tion), i.e., the automatic techniques may provide the maintainer with another perspective. 6.4.3 Common Patterns of False Positives The analysis of false positives revealed certain common patterns that could be used to filter out false positives in a post analysis after the candidates were pro- posed by the chosen technique. These patterns were generally found in the set of false positives of any of the examined techniques. Static local variables. Some global variables are only referenced by one routine; thus, they act as static local variables of this routine but the programmer did not take advantage of the ability in C to express this explicitly. A routine with such Table 6-7. Average size of false positives before filtering overlooked references Technique Aero Bash CVS ADT Same Module 3.0 7.3 8.5 Part Type 5.7 6.2 14.5 Internal Access 3.0 4,6 8.9 ADO Same Module 15.0 9.9 6.8 Global Object Reference 20.0 5.9 5.5 Internal Access 11.1 4.2 7.0 Delta-IC 14.7 6.6 7.2 Same Expression 18.8 8.1 13.2 Evaluation of the Basic Techniques 172 static local variables alone can hardly be considered an ADO in a narrower sense ¾ even though the local variables indeed clearly belong to the subprogram. Nested routines. Some candidates consist of a few routines among which only one is called from outside and all other routines are only called from routines within the candidate; thus, the latter are local to the routine. Hence, if the vari- ables in the candidate are also only accessed by routines of this candidate, the candidate is rather one routine with nested subroutines (which cannot be expressed in C) and some static local variables. Usually, an atomic component has several interface routines and we would therefore not consider this kind of candi- Figure 6-7. Reviewed number of false positives Part Type Same Module Internal Access 0 1 2 3 4 5 Overlooked Real Aero ADTs Global Ref. Same Module Internal Access Delta IC 0 2 4 6 8 10 12 14 16 18 Aero ADOs Part Type Same Module Internal Access 0 2 4 6 8 10 12 Bash ADTs Global Ref. Same Module Internal Access Delta IC 0 3 6 9 12 15 18 21 24 Bash ADOs Part Type Same Module Internal Access 0 1 2 3 4 5 6 7 8 9 10 CVS ADTs Global Ref. Same Module Internal Access Delta IC 0 2 4 6 8 10 CVS ADOs 173 Analysis of False Positives date a valid atomic component. Such clusters do provide useful information and should be presented to the user ¾ but not as an abstract data object. Parameter passing. A more complex pattern that we also found in all systems consists of variables used for parameter passing in the presence of call backs. A call back is a call of a function F in a component A where F’s address has been transmitted from another component B to A as a function pointer. Aero, as an X- window-based application, uses this schema in its user interface code. CVS has a general recursion handler that traverses directories and has a call back for each file it finds during traversal. The client of this recursion handler does not have to care about the traversal; he only has to code the function that should be applied to each file and to convey it to the recursion handler as a call back function pointer. 
These functions, designated by function pointers, must have the same signature. To convey additional parameters from B to the call back function, a global variable is used. The variable is set before the call back is done, and the call back function then reads the global variable. Figure 6-8 illustrates this kind of parameter passing in the presence of call backs. Variables only used for parameter passing are conceptually different from variables of an abstract data object that model state. Both kinds of variables are needed by the abstract data object, yet, probably due to the conceptual difference, the software engineers did not group "parameter variables" with the abstract data object. It would be helpful to a maintainer to provide additional analyses that characterize the variables of a candidate component as "parameter variables", state variables, or other kinds of usages of a variable. However, in order to detect variables used for parameter passing, control and data flow analyses are needed.

Figure 6-8. Additional parameter passing in the presence of call backs (component B conveys an additional parameter to its call-back function B_g, which component A invokes through a function pointer).

    float B_parameter;

    void B_f ()
    {
      B_parameter = 1.0;            /* parameter passing */
      A_install_call_back (B_g);
      ...
    }

    void B_g (int i)
    {
      ... B_parameter ...           /* call-back function reads the parameter */
    }

    typedef void (*Function) (int i);

    void A_install_call_back (call_back)
      Function call_back;
    {
      ...
      call_back (5);                /* call back */
      ...
    }

System parameters. Variables used at many places in the system often represent global system parameters, e.g., variables that indicate whether a certain command line switch was set when the program was invoked. It is often recommended to exclude frequently used variables (Yeh et al., 1995). However, simply excluding frequently used variables may also affect variables of an abstract data object that the programmer made public. A more reliable method is to exclude variables that are directly data dependent on the parameter argv of the main routine, which contains the command line arguments of the invoked program in batch-oriented systems. Other frequently used variables are so-called mode variables that indicate the general state of the system as a whole, as opposed to the state of an individual abstract data object. A mode variable, for example, may indicate that an error occurred and the system is in recovery mode. It is still not clear how to distinguish these from frequently used public variables of an abstract data object.

Function pointer and enumeration types. Furthermore, the analysis of false positives revealed that function pointer types and enumeration types are generally not helpful for the detection of abstract data types. Function pointer types are often declared just for the purpose of call backs, and enumeration types are either used for control variables or are part of more complex abstract data types. Nevertheless, since they are represented in the resource usage graph, they are clustered like any other user-defined type. That is to say, if there is an enumeration or function pointer type, T, all subprograms for which T is the only user-defined type in the signature may be grouped together by a technique based on signatures (dependent upon the underlying restrictions of the technique) even if the subprograms are otherwise unrelated. In particular, Part Type will propose such candidates because the part-type filter for signatures with only one user-defined type cannot have any effect.
But also Internal Access (if the function pointer parameter is dereferenced or a standard operator is applied to the enumeration parameter, which is legal in C because enumeration values are actually integer values) as well as Same Module (if the type is declared in the same module as the subpro- grams) and the metric-based approaches Type-based Cohesion and Schwanke’s Arch approach (since the subprograms share at least an enumeration or function pointer type) may group these subprograms together. Different views. Sometimes there were different possible views and one was cho- sen by the software engineer and the alternative view was chosen by the tech- 175 Qualitative Comparison nique. Yet, both views could be justified. For example, in Bash, there is a file print_cmd.c that provides print commands for different data types. The software engineer decided to group these print routines with the data types, taking an object-oriented view. However, these print routines share global variables that define the current indentation and the increment of indentation. Some techniques grouped the print commands with the global variables taking a more functional view. Interestingly enough, the original programmer of Bash obviously had the functional view in mind, too, since he grouped these routines with the global vari- ables in the same file. 6.5 Qualitative Comparison After the quantitative comparison described in the last sections, we analyzed divergences between candidate and reference components for the respective tech- niques. Global Object Reference. If programmers followed the information hiding prin- ciple, Global Object Reference would detect all abstract data objects without any false positive. However, this is only the case for a few ADOs. When there is a subprogram that accesses variables of different ADOs, Global Object References unites the elements of these ADOs to one single candidate. This could be observed for many ADOs of all systems. Because very large candidates were fil- tered, Global Object Reference achieved a better recall only for Mosaic. Appar- ently, Mosaic has a better decomposition; probably because of the shorter maintenance history it has, which is also indicated by the good performance of Same Module for Mosaic. Part Type. As opposed to Same Module, Part Type does not rely on the program- mer’s distribution of routines into modules. However, it assumes that the parame- ter of a part type is actually used to be put into its container or to be retrieved from it. Since it does not analyze the actual usage any further, it is going to fail if this assumption is false. Moreover, in most signatures, there is no part type and, therefore, the Part Type heuristic equals Global Object References for ADOs with its problem of erroneously large candidates. Evaluation of the Basic Techniques 176 Same Module. The postulate of Same Module is that the programmer structures files according to atomic components. If a programmer puts each routine in a sep- arate file, Same Module cannot yield any result. Moreover, for modules with sev- eral distinct abstract data types containing conversion routines between each other, this heuristic groups all those routines and data types together in one large component. Detection of abstract data types with this heuristic did not work well for Bash. Bash has a header file with system-wide type declarations. The routines, however, are implemented in several other C files that include the type declarations. 
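The situation can be illustrated with a small schematic example (file and identifier names are invented and do not stem from Bash):

    /* list.h -- header with a system-wide type declaration */
    typedef struct list {
      int          value;
      struct list *next;
    } List;

    /* create.c -- one accessor routine of the type lives in this file ... */
    List *list_create (int value)
    {
      /* ... allocate and initialize a node ... */
      (void) value;
      return 0;
    }

    /* print.c -- ... while another accessor routine lives in a different file */
    void list_print (List *l)
    {
      /* ... traverse and print the list ... */
      (void) l;
    }

Because the type is visible in a header included by many modules and its routines are scattered over several implementation files, no single module contains the complete abstract data type, so Same Module cannot propose it as one candidate.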
Detec- tion of abstract data objects succeeded better since global variables are never declared in header files (they can only be declared there as external). Moreover, the programmers of the subject systems often take advantage of the means of the programming language C for information hiding of global variables: These vari- ables are often declared static. The limited means for information hiding of ADTs, on the other hand, are rarely used. Large files can be a problem for Same Module. In CVS, for example, we found a type RCS node which encapsulates dependencies on the underlying revision con- trol system (RCS) of CVS. This node type is declared in one huge file where many routines use it as a parameter. Consequently, Same Module created a very large atomic component candidate. The group of software engineers has refined this candidate into different aspects of the RCS subsystem. Same Expression. Evaluation of the false positives revealed that there is in fact a strong semantic relationship among variables that occur in the same expression in most cases. Sometimes these false positives were simply overlooked by the soft- ware engineers or were too small to be selected. However, there were a few inter- esting exceptions. If there are global state or mode variables that occur in the same condition, very large candidates can be proposed. For example, the global variable interactive_shell in Bash indicates whether the shell is interactive; only then, certain services are available. In the source code conditions like if ((interactive_shell == 0)&&(def_buffered_input == fd))¼ are frequent. Often used mode variables lead to a union of otherwise separate ADO candidates of Same Expression. 177 Qualitative Comparison Internal Access. Internal Access groups user-defined data types and global vari- ables with the routines that access their internal parts. For ADT detection, Internal Access really checks how the parameter type is used, as opposed to Part Type. However, in real programs one often finds the encapsula- tion principle violated. This frequently happens for reasons of efficiency or con- venience in the case of data types of which the programmer is convinced that their representation will never change. If there are many violations of the information hiding principle, Internal Access yields very large candidates analogously to Glo- bal Object Reference or Part Type. This heuristic did badly in ADO detection for at least two reasons. First, if there is a global table, such as an array of error messages, all readers of this table are con- sidered operators of this abstract data object which yields an erroneously large component. Second, it misses all accessor routines that only set or use the vari- ables of the ADO as a whole. In order to group a subprogram with a variable by Internal Access, the subprogram must use the variable non-abstractly, i.e., either internally access the variable or apply a standard operator to it. If the variable is of a primitive type and the subprogram only sets or uses the variable, however, the subprogram will not be grouped with the variable. Unfortunately, it is a common phenomenon for abstract data objects to have separate global variables that together form an object. For example, stacks are often declared as two distinct variables, one for the contents and one for the stack pointer. The latter is declared as integer and used as an index to the array implementation for the contents. 
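Such a stack typically looks like the following sketch (identifier names are invented):

    /* Two separate global variables that together implement one stack. */
    static int stack_contents[100];   /* the contents */
    static int stack_top = 0;         /* the stack pointer, an index into stack_contents */

    void push (int x)
    {
      stack_contents[stack_top++] = x;   /* non-abstract use: indexing and incrementing */
    }

    int size (void)
    {
      return stack_top;                  /* uses the stack pointer only as a whole */
    }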
Then a function returning the size of the stack by returning the value of the stack pointer will not be recognized as part of the abstract data objects, since there is no non-abstract usage. If programmers put the distinct variables together in a record type, the connection among the variable would be obvious and each access to the variables as record components would be a non-abstract usage. However, because abstract data objects have only one instance, programmers do not make the effort. Delta-IC. A problem of Delta-IC is to establish the right threshold. It depends on the specificity of the system studied and cannot necessarily be directly reused. For the systems in this evaluation, these values were as listed by Table 6-8. One practical solution is to take a sample of the system and perform manual recovery by applying the approach on the sample and adjusting the threshold accordingly. Evaluation of the Basic Techniques 178 It is necessary to compromise between the effort required to analyze a sample manually and the quality of the results. Moreover, the definition of DÿIC is aimed at filtering out candidates that have many accessor functions that only access one single variable of the ADO. As a consequence, any ADO that actually has only one single variable cannot be detected. Furthermore, we often find accessor routines that do access only one single variable. For example, in the stack consisting of two variables for the con- tents and the stack pointer, a function size returning the number of elements on the stack needs only access one variable, the stack pointer. As a matter of fact, the recall rate of Delta-IC increases and the number of false positives decreases for Aero, Bash, and CVS when filtering is only based on the internal connectivity as defined by (5.9) on page 126 as opposed to DÿIC as proposed by (5.10) as Table 6- 9 shows. In the case of Mosaic, the detection quality is more or less the same. Arch. One problem of the Arch approach is to find the right parameters. This is true for all metric-based approaches. However, Arch has more parameters than Delta-IC and Type-based Cohesion. Schwanke and Hanson propose to use neural networks to learn the appropriate parameters (1994), whereas the parameters for Arch as evaluated in this thesis were calibrated on the reference components by systematic hand-tuning. In both cases, a set of references is needed to calibrate Table 6-8. Selected thresholds of Delta-IC. System Threshold Aero 0.25 Bash 0.38 CVS 0.25 Mosaic - 0.3 Table 6-9. Results of Delta IC based on IC only. System Recall False Positives Threshold Aero 0.31 11 0.55 Bash 0.30 21 0.72 CVS 0.39 8 0.53 Mosaic 0.51 3 -0.1 179 Qualitative Comparison the approach. In practice, one does not have these references in advance and, hence, has to compile the components for a sample of the system and calibrate the parameters on this sample. Whether the parameters established on the sample are also appropriate for the rest of the system has to be shown. Furthermore, it is not clear in advance how big the sample should be to find a suitable parameter set- ting. The Arch metric was originally defined to group related subprograms, i.e., only subprograms had to be compared. However, in our application, the entities to be grouped are heterogeneous since we group also types and objects. Since the met- ric does not make a distinction among different types of entities, the weight of an entity depends upon its frequency as a neighbor only. 
There is no way to assign certain types of entities more weight. For example, if we are searching for abstract data objects, variables are the crystallization points, whereas types are secondary. Likewise, because we are searching for specific kinds of atomic com- ponents, primarily for abstract data types and abstract data objects, certain rela- tionships are more important than others. Arch does not make a distinction whether the neighbor is a type used to declare a local variable or a type in a signa- ture, though it is intuitively clear that signature relationship is more important than the local-obj-of-type relationship when searching for abstract data types (which will actually be confirmed by a statistical analysis in Section 7.6.1). Another draw- back of the approach is that it is not directly recognizable for the maintainer why two entities are in the same candidate since different aspects are considered, namely, common, distinct, and direct relations, which makes validation more dif- ficult, whereas all other techniques have only one simple criterion. On the other hand, considering different aspects is also an advantage. The connection-based approaches, for example, rely on direct relations only and cannot detect groups of related subprograms that may only be detectable when called by the same subpro- grams. Type-based Cohesion. Type-based Cohesion is very similar to Arch (see the comparison in Section 5.10). Its advantage over Arch is that it has only one parameter: the threshold that determines whether two subprograms are similar enough to be in the same component. On the other hand, it can only group sub- programs based on the portion of types they share. The types themselves, how- ever, are not clustered. Hence, if abstract data types are to be detected, these types Evaluation of the Basic Techniques 180 must be added by the maintainer once Type-based Cohesion has proposed its can- didates. Moreover, Type-based Cohesion considers also intrinsic types of the pro- gramming language, like float and int. This may be an additional clue, however, it is rather questionable whether two subprograms are actually related if they share only intrinsic type. Furthermore, the metric of Type-based Cohesion does not make a distinction between user-defined types and intrinsic types, nor what the type is used for. Hence, a function, F1, having a local integer variable has the same similarity to a function, F2, having a local integer variable and a signature type, T, as a function, F3, that has signature type T as well but no integer variable. One would expect that F3 and F2 are more similar than F1 and F2. 6.6 Distinctive Contribution of Individual Techniques Before deciding to select one approach or a combination of approaches, one has to consider the additional information provided by the other techniques. As a first estimate of the contribution of each approach, we considered the good candidates from each technique. The first section of Table 6-10 (grey zone in the table) con- tains the distinctive contribution of each technique as the number of reference components identified by only one approach (good match). Table 6-10. Number of reference components identified. 
Aero Bash CVS Mosaic Global Object Reference 3 1 Same Module 4 3 15 Internal Access 2 1 Part Type 2 1 Same Expression Delta-IC 2 1 1 5 Arch 5 Type-Based Cohesion Found by more than one approach 7 13 16 20 Found by no approach 13 15 19 23 Identified by software engineers 26 39 54 53 181 Analysis of True Negatives These data suggest that by combining approaches instead of using a single approach, one would significantly improve the discovery of the reference compo- nents. Yet, between 35 and 50 percent of the components could not be matched by a good candidate. (However, they may be matched partially; Table 6-10 contains only good matches.) 6.7 Analysis of True Negatives In Section 6.5, only the weaknesses of individual techniques have been discussed, which gives insight into why a single technique did not detect certain compo- nents. However, as the last section showed, there are even many references that were not detected as a 1~1 match by any of the techniques. To go further into the question why reference components were true negatives for all techniques, I ana- lyzed the references of Aero, Bash, and CVS not detected as a 1~1 match with a candidate component of all techniques. The inspection revealed several reasons: • Two-element reference components: In a few cases, the reference compo- nents contained only two elements. Because a filter was used to ignore candi- dates with less than three elements, these reference components could not be detected by a 1~1 match. If the smallest possible candidate matching a two-ele- ment reference contains three elements, the overlap of the candidate and the reference cannot be higher than 2/3. However, 2/3 is below the threshold 0.7 used as tolerance parameter for the partial subset relationship. Hence, such a match is classified as a 1~n match. The filter was effective in reducing the number of false positives and is there- fore recommended. Reference components with less than three elements are rare. Aero has one, Bash has two, and CVS has five overlooked references with only two elements. • Interleaving: As a consequence of interleaved functions ¾ i.e., functions con- taining multiple, interwoven strands of computation, each responsible for accomplishing a distinct goal (Rugaber et al. 1995) ¾ clusters are merged that would have been otherwise proposed as separate candidates. An interleaved function, for example, may access variables of two different abstract data objects or internally access two different abstract data types. As a consequence, the interleaved function, F, introduces a link between the two clusters built Evaluation of the Basic Techniques 182 around the global variables and the composite data structure related to F, result- ing in a merge of the clusters. On the other hand, considering semantic argu- ments (that are not captured by purely structure-oriented techniques), the engineer has recognized the distinction among the two concepts and the inter- leaving of F and identified two separate components. Merging two intrinsically different candidates may result in two references detected by a 1~n match instead of two 1~1 matches depending upon the size of the candidates. • Lack of abstraction: A phenomenon, similar to interleaving, leading to 1~n matches is lack of abstraction. All techniques assume at least a minimum of abstraction. 
If all components are permissive, accessed from many different places, and all constituents of components are arbitrarily assigned to modules, large clusters arise and the chances to detect components as a 1~1 match by the proposed techniques are small. They may even not be detected as a 1~n match if the size of the clusters is above the upper threshold for acceptable candidates. Note that interleaving is not the same as lack of abstraction. In the case of lack of abstraction, a function, F, may access internal elements of a component, C, though F does not belong to C. On the other hand, in the case of interleaving, the function, F, accesses internal elements of more than one component and F actually belongs to these components. For example, in the case of two related abstract data types matrix and vector, the multiplication of a matrix with a vec- tor may be implemented by accessing the internal data structures of both types due to efficiency considerations, and F actually belongs to both of these types. • Layering: Interestingly enough, the use of information hiding was also a rea- son why some references were only partially detected. Some components are structured as layers and only the lowest layer accesses the variables or record components of the atomic component, whereas upper layers deal with user interface issues or implement services on top of the lowest layer. The engineers have sometimes grouped all layers together because there was no finer-grained structuring possible. Hence, Internal Access and all other techniques that are based on direct variable accesses can only identify the lowest layer. If these layers are additionally implemented by different modules, which is a reason- able decomposition for large components because each layers constitutes a dif- ferent kind of service, Same Module is neither be able to detect the other layers. Part Type may still be able to detect at least abstract data types among layered component (but not abstract data objects). However, because Part Type tends to produce very large clusters that are filtered out if they exceed the acceptable 183 Summary size for candidates, the layered component may happen to be among the fil- tered clusters. Then, the corresponding reference component is not detected at all. • Debatable references: There are also a few debatable components among the references. Some are incomplete, i.e., there are base entities that could be added to the component; some contain spurious elements, e.g., a function that accesses many variables of another component and that may rather be consid- ered a part of that component. Sometimes it was difficult to make a clear cut between complex clusters of interwoven base entities and a different way of decomposing the cluster could be justified as well. 6.8 Summary The comparison of automatic techniques with respect to findings of software engineers described in this chapter revealed the following points: • The effectiveness of a technique depends upon system characteristics, like degree of information hiding, proper module decomposition, and layering. • None of the investigated techniques has a sufficient recall rate; The best recall rate we obtained was 75% of the abstract data objects (in CVS, as detected by Same Module). In the worst case, namely, the abstract data types of Aero, the best technique reached only a recall rate of 28%. 
• Many candidates the techniques provide correspond only roughly to the refer- ence components; i.e., elements of these atomic components were superfluous or lacking. • Combining the automatic approaches instead of using a single approach, one would significantly improve the discovery of the reference components. • Yet, between 35 and 50 percent of the components still could not be completely and directly found by any of the techniques. However, the components may at least partially be matched. • In evaluating these automatic techniques, one also has to state what we observed by reviewing the false positives: It turned out that 42% of the ADO candidates and 41% of the ADT candidates classified as false positives could indeed be considered correct positives; they were either too small to be consid- Evaluation of the Basic Techniques 184 ered, simply overlooked in the manual process, or represented alternative views. • We found common patterns of false positives in all systems that could be used to filter out false positives from the set of candidates. • Moreover, whereas the groups of software engineers needed about 20 - 35 hours to compile the list of atomic components for each of our subject systems, each atomic component produced by the techniques can be checked by soft- ware engineers within minutes. To browse the whole list of false positives of all automatic techniques, we needed less than 6 hours per system. The time needed for validation can even be reduced by merging similar candidates of different techniques based on the partial subset relationship because there were many similar false positives among the candidates. In order to find more components with fewer false positives, we chose the most flexible technique, namely, Schwanke’s Arch approach, and tried to improve it. The next chapter describes the approach and its evaluation. Furthermore, due to the degree of vagueness of reasonable decompositions and the complex semantic issues involved, the user should be integrated into atomic component detection. For this reason, the second part of this thesis is devoted to effective ways of user integration. 185 Chapter 7 Similarity Clustering The overall result of the evaluation of the basic techniques described in the last chapter is that none of the techniques reaches human detection quality. In order to achieve better results, we chose the most flexible approach, namely, Schwanke’s Arch approach (1991), and improved it. This chapter describes the approach and its evaluation. Schwanke’s work is aimed at detecting subsystems using a similarity metric between routines (see Section 5.9). The similarity clustering approach described in this chapter applies the idea of Schwanke’s work to atomic component identifi- cation by generalizing the similarity metric, adding informal information, edge- dependent weights, and adapting many of its parameters. The two approaches will be contrasted in more detail in Section 7.8. The enhancements of this technique were joint work with Jean-François Girard and Georg Schied. My improvements to the technique after it has been jointly published in 1997 are explicitly listed in Section 7.8. Name Similarity Clustering Reference Girard, Koschke, Schied, 1997 Domain Base View Range RS, ADT, ADO, HC Disjoint Clusters Yes Similarity Clustering 186 Cluster analyses are used in many areas and scientific disciplines in which large amounts of data need to be reduced to few units of meaning that are easy to grasp. 
Cluster analyses emerged as early as the beginning of the seventies; Steinhausen and Langer, for example, summarized the existing cluster analyses and techniques already in 1977. The general approach of clustering is therefore well understood. The challenge in applying cluster analysis to a particular problem is to define an appropriate similarity metric.

7.1 Overview of the Approach

The similarity clustering approach, Similarity Clustering for short, groups base entities (subprograms, user-defined types, and global variables) according to the proportion of features (entities they access, their name, the file where they are defined, etc.) they have in common. The intuition is that if these features reflect the correct direct and indirect relationships between these entities, then entities that have the most similar relationships should belong to the same atomic component.

Functions, variables, and types are grouped according to the algorithm already outlined in Section 5.9 and repeated as Algorithm 7-1 for ease of reading. In each iteration of this algorithm, a similarity metric measures the proportion of shared features. The algorithm terminates when "existing groups are satisfactory"; i.e., when the similarity of the most similar groups is below a certain user-determined threshold.

Algorithm 7-1. Flat similarity clustering algorithm.

    place each entity in a group by itself
    repeat
      identify the two most similar groups
      combine them
    until the existing groups are satisfactory

The final result of the clustering algorithm outlined in Algorithm 7-1 is a set of "flat" groups of similar entities, but the information about the similarity among the group elements is lost. However, this information is of great interest to the maintainer. Instead of presenting only the final clusters to the maintainer, one should keep a log of the order of combination as additional information. Since the two most similar groups are combined per iteration, the order of combination can be represented by a binary tree in which the leaves are the initial groups and in which the inner nodes are combinations of groups. The farther a combination is away from the root of the tree, the higher is its degree of similarity. This procedure is called hierarchical clustering (Steinhausen and Langer, 1977). Algorithm 7-2 outlines a hierarchical clustering algorithm.

Algorithm 7-2. Hierarchical similarity clustering algorithm.

    place each entity in a group by itself
    repeat
      identify the most similar groups Si and Sj
      combine Si and Sj
      add a subtree with children Si and Sj to the clustering tree
    until the existing groups are satisfactory or only one group is left

If this tree is presented to the maintainer, one can also repeat hierarchical similarity clustering until everything is combined into one single group instead of stopping when a certain minimal similarity is reached. The maintainer can then "climb up the tree" starting at the leaves and stop at inner nodes for which the combination is doubtful.

The similarity metric used in this algorithm to identify the two most similar groups is constructed of three layers:
• the similarity between two groups of entities, which is defined in terms of the similarity between entities across groups;
• the similarity between two entities, which is a weighted sum of various aspects of similarity;
• each specific aspect of similarity between two entities.
These layers will be discussed in the following sections.

7.2 Similarity Between Groups of Entities

There are several alternatives to define the similarity between two groups of entities. The one originally proposed by Schwanke in 1991 is to use the maximal individual similarity of elements in the two groups:

\[ GSim(A, B) = \max_{a \in A,\ b \in B} Sim(a, b) \]   (7.1)

We found that this has the effect of creating very large groups, which is not very helpful. The same observation was made in other applications of clustering (Steinhausen and Langer, 1977). If we had to group circular structures, i.e., when the cluster is a cycle of pairwise similar elements whose similarity to other elements is otherwise low, basing similarity on the maxima of individual elements would be the right choice. However, this is generally not the case for atomic components; we expect all their elements to be related to each other. To achieve this goal, we decided to define the group similarity as the average of the similarities of all pairs of entities in the two groups (1997):

\[ GSim(A, B) = \frac{\sum_{a \in A,\ b \in B} Sim(a, b)}{\lvert A \times B \rvert} \]   (7.2)

Using the average group similarity demands that the elements of a group be considerably similar to many other elements of the group and, hence, aims at cohesive components. As described in Section 5.9, a newer variant of Schwanke's Similarity Clustering uses a k-nearest-neighbor approach (Schwanke and Hanson, 1998). The motivation for the k-nearest-neighbor approach according to Schwanke and Hanson is that the various factors affecting the weights of features and other terms in the similarity measure prevent it from being validly used to compute ratios or even averages. On the other hand, the nearest-neighbor approach may lead to components in which not all parts are strongly related to each other.

7.3 Similarity Between Entities

The similarity between two entities is the weighted sum of various aspects of similarity which fall into the following categories:
• direct relations
• indirect relations
• informal information
Direct relations are relations between the two entities compared. Indirect relations are relations with common third entities. Informal information is the information in the program source code which is not captured by the semantics of programming languages but is used by programmers to communicate the intent of a program (e.g., comments, identifier names, file organization, etc.). Informal information can be used as a complementary source of information as suggested by Biggerstaff (1999). The various aspects of similarity can be united as follows (the factors x_i are used to adjust the influence of the diverse specific similarities):

\[ Sim(A, B) = x_1 \cdot Sim_{indirect}(A, B) + x_2 \cdot Sim_{direct}(A, B) + x_3 \cdot Sim_{informal}(A, B) \]   (7.3)

7.3.1 Features

The individual aspects of similarity Sim_indirect and Sim_direct are going to be defined in terms of features. A single feature describes a specific relationship of an entity to another entity in its environment. A feature has basically three facets:
• the partner in this relationship
• the modality of this relationship
• the role of the entity in this relationship
For example, for a subprogram, it is of interest what global variables it accesses and by what other subprograms it is used (partner). Furthermore, it is relevant whether the accessed variables are set or used and whether the subprogram is used by directly calling it or just by taking its address (modality). And last but not least, it makes a difference whether a subprogram is the caller or the callee in a call relationship (role). The role information is, of course, only relevant to non-symmetric relationships.

We will use the term feature as a tuple of these three facets. In terms of the resource usage graph, a feature is thus a triple (n, e, r) where n is a node, e is an edge type, and r is the role. The node n is the partner in the relationship, e expresses the modality, and r is either agent when the entity is the agent in the relationship (technically, the source of the edge), patient when it is the patient in this relationship (the target of the edge), or simply partner when the relationship is symmetric.

Notation. For a feature (n, e, r) of a node m, we will use the notation \( m \xrightarrow{e} n \) if m is an agent, \( m \xleftarrow{e} n \) if m is a patient, and \( m \xleftrightarrow{e} n \) if the relation is symmetric. The predicate \( (m \xrightarrow{e} n)? \) is true if and only if n is related to m by relationship e in which m is an agent. Similar predicates are used for the two other kinds of features. In some equations, we will use a place holder \( \mu \) for the direction of a relation: \( m \stackrel{e}{\mu} n \) means that there is a relationship e between m and n where the role of m is unspecified. When this place holder is used in a functor or predicate such as \( \bigcup_{\mu} \), the place holder \( \mu \) implicitly iterates over \( \{\rightarrow, \leftarrow, \leftrightarrow\} \). For example,

\[ n(a, b) = \bigcup_{\mu} \{\, a \mathrel{\mu} b \mid (a \mathrel{\mu} b)? \,\} \]

is equivalent to

\[ n(a, b) = \{\, a \rightarrow b \mid (a \rightarrow b)? \,\} \cup \{\, a \leftarrow b \mid (a \leftarrow b)? \,\} \cup \{\, a \leftrightarrow b \mid (a \leftrightarrow b)? \,\} \]

Example. If subprogram A calls subprogram B, then:
• \( A \xrightarrow{call} B \) is a feature of A
• \( B \xleftarrow{call} A \) is a feature of B
Moreover, \( (A \xrightarrow{call} B)? \) is true, whereas \( (B \xrightarrow{call} A)? \) is false if B does not call A.

7.3.2 Indirect Relations

Two entities can be considered similar if they use the same entities and if they are used by the same entities, even more so when they use them or are used by them in the same way. Though each individual common relationship to their environment is probably not sufficient to call them similar, the confidence of their being similar increases with each one.

Judging two entities by their relationship to the environment is a phenotypic kind of comparison. A genotypic point of view would rather compare the degree of complexity, the number of statements, and so forth. However, these "inner" values are generally of very limited use for the decision whether entities should be grouped together. There is no point in putting subprograms of the same complexity or with the same number of statements in a common module.

Taking only common features into account may distort the result. Two subprograms, for example, may be called by the same subprogram and may set the same variable; but when they are called by many other distinct subprograms and set many distinct variables, we would not consider them similar anymore. Therefore, distinct features must also be borne in mind.

The definition of similarity with respect to indirect relations captures the proportion of features (as introduced in Section 7.3.1) two entities share:

\[ Sim_{indirect}(A, B) = \frac{W(Common(A, B))}{W(Common(A, B)) + d \cdot W(Distinct(A, B))} \]   (7.4)

where Common(A, B) reflects common features of A and B, Distinct(A, B) reflects distinct features, d ≥ 0 is a parameter which regulates the importance given to distinct features, and W(X) is the weighted sum as described below. Common and Distinct can roughly be described as follows (we will later refine this definition):

\[ Common(A, B) = features(A) \cap features(B) \]   (7.5)

\[ Distinct(A, B) = features(A) \oplus features(B) \]   (7.6)

The term features(X) refers to the set of features of X (see Section 7.3.1). The operator \( \oplus \) denotes the symmetric difference for sets. W(X) is the weighted sum of these features:

\[ W(X) = \sum_{x \in X} weight(x) \]   (7.7)

where weight(x) ≥ 0 is a weighting factor which allows assigning certain features more influence on the global value of the metric. Weights will be discussed in Section 7.3.3.

The common and distinct neighbors are relevant to the similarity of two entities for obvious reasons. The other facets of features offer additional ways to specify and use Common and Distinct. These alternatives will be discussed in the following.

Modality. There are several alternative ways of distinguishing the modality of a usage, technically speaking, of considering the edge type. We discuss them by means of the scenarios in Figure 7-1.

Figure 7-1. Scenarios for typical relations. (In the four scenarios (a)-(d), two functions F1 and F2 are related to a type T via return, local-obj-of-type, and parameter-of edges as discussed in the text.)

In Figure 7-1(a), for example, there are two functions related to a type T: one by having a local variable of type T and one by having a return type T. The modality of using T is very different for the two functions. It is plausible that the two functions would be more similar if their usage of the type had the same modality as in Figure 7-1(b). That is the point of view reflected by the first alternative (it is even stricter since it does not consider the functions in Figure 7-1(a) to be similar at all):

Alternative 1: All nodes that are reachable by the same kind of edge are common features.

The first alternative obviously goes wrong for scenario Figure 7-1(c) where one function returns a type and the other has a parameter of this type. These two functions could be accessor routines of an abstract data type T. The distinguishing factor of this example is that the two edge types are two special cases of the same abstract kind of edge. The second alternative refines alternative 1 by taking subtyping of edges into account (see Section 3.5.1 for the definition of equivalent edges):

Alternative 2: All nodes that are reachable by equivalent edges are common features.

Alternative 2 would not consider the two functions in Figure 7-1(a) similar at all. However, the fact that the two of them share at least type T, even if they use T differently, is an important piece of information. It can also be a complementary piece of information as in Figure 7-1(d) where we have an additional relationship to T by means of a local variable. Frequently, constructors of abstract data types are implemented by using a local variable of this type that is initialized and eventually returned. We should therefore not ignore any kind of edge that leads to a common neighbor. This results in the third alternative:

Alternative 3: All nodes that are reachable by any kind of edge are common features.

Yet, we would like to make a difference between scenarios Figure 7-1(a) and (b).
One attempt could be to assign different weights to different edge types. For example, return edges have a higher weight than local-obj-of-type and therefore a subprogram pair as in Figure 7-1(b) would be preferred to Figure 7-1(a). Still, that does not work because it is rather questionable why two subprograms having both a local variable of type T should be more similar than two functions having both return type T. The solution is to separate common features into such regarded by alterna- tive 2 and such covered by alternative 3. Similarity Clustering 194 Alternative 4: Make a distinction between nodes that are reached via equivalent and non-equivalent edges. A more formal description of alternative 4 will follow below as soon as we have discussed the influence of roles. Role. Not only the modality of a relationship to a common neighbor is important; we also have to distinguish whether the roles are identical or not. For example, it makes a difference whether two subprograms both call the same subprogram, S, or whether one subprogram calls S and the other one is called by S. For these reasons, I propose the following strategy that also takes alternative 4 from above into consideration. We will distinguish two cases: • Commoneq(A,B) is the set of equivalent features of A and B, i.e., the common neighbors of A and B reachable by equivalent edges and with the same role. • Commonne(A,B) is the set of non-equivalent features of A and B, i.e., the common neighbors of A and B that are reachable by non-equivalent edges only or with different roles. In the case of common features, there are nodes reachable by both entities either via different or same edge types. However, in the case of distinct features, a node is either reached from one node or from the other one such that we need not dis- tinguish features as in Commoneq and Commonne. Therefore, Distinct (A,B) denotes all features of A and B that are not shared (as originally proposed by equation (7.6)), and thus, equation (7.4) can be refined to: (7.8) Parameter Ieq in this equation is used to determine the influence of equivalent fea- tures. Its value should be greater than one because the same modality of a rela- Simindirect A B,( ) Ieq W´ Commoneq A B,( )( ) W Commonne A B,( )( )+ Ieq W´ Commoneq A B,( )( ) W Commonne A B,( )( ) d W Distinct A B,( )( )×+ + --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- = 195 Similarity Between Entities tionship to a common neighbor is more significant than just a common neighbor that is related in different ways. So far, we have used Common and Distinct in a vague way. Now that we have seen what role they play and how they are used, they can be defined in detail: (7.9) where e1 ~ e2 holds when the two edges are equivalent as defined in Section 3.5. (7.10) Neighbors yields all neighbors of a node as defined in Section 3.5. (7.11) Note that we explicitly exclude direct relations among two entities to compare. Such direct relations will be considered in Section 7.3.4. Commoneq A B,( ) A e1Cµ B e2Cµ, A e1Cµ( )? B e2Cµ( )? e1 e2~ C A¹ B,Ù Ù Ù{ }µ È = Common ne A B,( ) X eCµ X e Cµ( )? X A B,{ }Î C N A B,( ) A B,{ }¤ÎÙ Ù{ } µ È Commoneq A B,( )– = where N A B,( ) Neighbors A( ) Neighbors B( )Ç= Distinct A B,( ) X eCµ X eCµ( )? X A B,{ }Î C D A B,( ) A B,{ }¤ÎÙ Ù{ } µ È = where D A B,( ) Neighbors A( ) Neighbors B( )Å= Similarity Clustering 196 Example. 
Given the resource usage graph in Figure 7-2, the following common and distinct features of F1 and F2 can be found (using the abbreviations actual = actual-parameter-of, param = parameter-of, set = obj-set, and use = obj-use): The sets of common and distinct features enter equation (7.8) as a weighted sum . How features are weighted is discussed in the fol- lowing section. 7.3.3 Feature Weights The weight factors are introduced to give more influence to certain features. There are several alternatives for their definition based on the facets of the rela- tionship the feature represents. Remember that a feature consists of three parts (n, e, r) where • n is the partner entity of the relationship, • e is the modality of the relationship, Figure 7-2. Example for common and distinct features. Commoneq F1 F2,( ) = F1 callF5® F2 callF5® F1 paramT® F2 return T®, , , F1 set V2® F2 use V2®, Common ne F1 F2,( ) = F1 actualV1¬ F2 setV1® F1 call F4¬ F2 callF4®, , , Distinct F1 F2,( ) = F1 callF3¬ F2 callF6®, F1 F2 T call F4 F5 F3 F6 V1 V2 actual-parameter-of obj-setobj-set return obj-use parameter-of W X( ) weight x( ) x XÎå= 197 Similarity Between Entities • and r is the role the partner plays in the relationship, i.e., whether it is the agent. patient, or partner. Role. The role information can only make a difference in situations in which an entity can be both agent and patient, i.e., if we deal with non-symmetric relation- ships whose arguments are of the same domain, in other words, if they connect equivalent entities. In the entity-relationship model presented in Figure 3-2 on page 45, the relationships part-type, same-expression, and call are such cases. In Figure 7-3(a), for example, the question is raised whether it is more important that T2 is a part-type of T1 than that T1 is a part-type of T3 if we compare T1 to both F1 and F2. An argument could be that an abstract data type consisting of T3 and F2 needs T1 whereas an abstract data type consisting of T2 and F1 does not need T1. On the other hand, the abstract data type that includes T1 is based upon T2 and therefore {F1, T2} is closer to T1. None of the two arguments outdoes the other one. In general, considering the role an influence factor is rather question- able. At any rate, assigning different weights depending upon whether an entity is an agent or patient adds more complexity to this approach and complicates its cal- ibration. Therefore, we do without the distinction. Note that this is no argument against the roles as such. This information is rele- vant when we compare relationships of the same class. For example, in Figure 7- 3(b) it is plausible that F4 is more similar to F2 than to F1 because both F2 and F4 call the same subprogram. The definition (7.9) for Commoneq takes this already in account. That is why we have introduced roles in the first place. Figure 7-3. Example for agent/patient difference. F1 F2 return return T1 T2 T3 part-type part-type (a) F1 F2 call F3 F4 F5 call call call (b) Similarity Clustering 198 Partner. Another facet of a feature is the related entity. This is the only facet Schwanke considered in his approach. 
He proposed to use the Shannon information content from information theory as the weighting factor weight(x) (see Section 5.9):

$$weight(x) = -\operatorname{ld}(probability(x)) \qquad (7.12)$$

Probability(x) is the number of times the entity x occurs in any relationship, divided by the number, N, of all entities of the system:

$$probability(x) = \frac{occ(x)}{N} \quad \text{where } occ(x) = |\{\, y \mid x \to y \vee y \to x \,\}| \qquad (7.13)$$

The quality of the Shannon information content is that it decreases with the frequency of an entity, hence lessening the relevance of frequently used entities. Just to get an impression of what the actual weight proposed by the Shannon information content is, let us compare the weight for an entity, x, that occurs only once and an entity, y, whose probability is 0.125. We would consider the latter entity to occur unusually often. The weight ratio of the two of them is as follows:

$$\frac{weight(x)}{weight(y)} = \frac{-\operatorname{ld} N^{-1}}{-\operatorname{ld} 2^{-3}} = \frac{\operatorname{ld} N}{3} = 1/3 \times \operatorname{ld} N$$

Table 7-1 shows how this ratio evolves when N increases. For example, in a system with 65,536 entities, the weight ratio of an entity that occurs only once and an entity that occurs 8,192 times more often is only 16/3. In a system with 1,024 entities, the factor is 10/3 though the other entity occurs only 128 times more often. The reason for this is that the negative gradient of the logarithm rapidly decreases the farther it is from 0. For obvious reasons, we cannot reduce the number of entities in the system, but we may be able to reduce the basis on which the occurrence of an entity is established. We will come back to this below.

Table 7-1. Example Shannon information content weights.
  N                      2^7 = 128    2^10 = 1,024    2^13 = 8,192    2^16 = 65,536
  occ(x)                 1            1               1               1
  occ(y)                 2^4          2^7             2^10            2^13
  weight(x)/weight(y)    7/3          10/3            13/3            16/3

Establishing the weights with the Shannon information content over all nodes is fine when we do not care about the kinds of relationships that actually occur. In Schwanke's approach, only subprograms are grouped and any usage of a non-local name counts, no matter what it means. In our approach, the way of using the non-local entity is relevant. That is why we have to refine the determination of the weights by Shannon information content. For example, the subprogram F3 in Figure 7-4 may be called by thousands of other subprograms and thus gets a low weight according to equation (7.12), while F1 and F2 have a high weight since each is called only once. All these subprograms may set only one single global variable, and therefore one should assume they are all equally similar with respect to this variable reference. Nevertheless, F1 and F2 are considered more similar than F1 and F3 because of the lower weight of F3.

Figure 7-4. Example for Shannon information content weights. [F1, F2, and F3 each set the global variable V; F3 additionally receives many calls.]

Obviously, the way we have introduced to compute the weights is misleading when we take the kind of relationships into consideration (as was the case in the variant of Similarity Clustering published with Girard and Schied in 1997). Equations (7.12) and (7.13) should therefore be refined as follows, where we distinguish between different classes of equivalent relationships, e, and the roles, r, an entity plays in a relationship:

$$weight(n, e, r) = -\operatorname{ld}(probability(n, E, r)) \qquad (7.14)$$

where E is a representative for all edge types equivalent to e.
$$probability(n, E, r) = \frac{occ(n, E, r)}{N_E} \qquad (7.15)$$

$$\text{where } occ(n, E, r) = \begin{cases} |\{\, y \mid \exists(n \xrightarrow{e} y) \wedge e \sim E \,\}| & \text{if } r = agent \\ |\{\, y \mid \exists(n \xleftarrow{e} y) \wedge e \sim E \,\}| & \text{if } r = patient \\ |\{\, y \mid \exists(n \xleftrightarrow{e} y) \wedge e \sim E \,\}| & \text{if } r = partner \end{cases}$$

$N_E$ is the number of relationships of type E and all its equivalent relationships in the system (in the case of symmetric relationships, $N_E$ is twice the number of actual relationships); probability(n, call, agent), for example, tells us the relative frequency of n as a caller.

In equation (7.13), the basis on which we established the probability is the set of all nodes. In equation (7.15), on the other hand, the set of a specific kind of edge is used. This may increase or decrease the basis, depending upon the kind of edge. For part-type, for example, $N_{part\text{-}type} < N$ (where N is the number of nodes) is quite likely; for call, however, $N_{call} > N$ can be expected. As was discussed above, decreasing the basis makes the Shannon information content more sensitive.

The Shannon information content is a problem-independent way to establish weights that does not take advantage of the kind and quality of the relationships among entities. However, depending upon what kind of atomic components one searches for, different kinds of relationships are of different significance. For example, when looking for an abstract data type, edges connected to user-defined types are more important than call edges and should hence be given more weight. An alternative to Shannon information content is therefore to assign fixed edge weights $w_e$ to the kinds of edges between entities. This strategy will be referred to as relational weights.

$$weight(n, e, r) = w_e \qquad (7.16)$$

Both ways to establish weights, namely Shannon information content and relational weights, are orthogonal. Their respective strengths can be combined by multiplying the two values. This strategy allows tuning clustering for specific patterns and makes frequently occurring entities less important. It will be referred to as the combined weight strategy. The results reported in this thesis are based on the combined weight strategy:

$$weight(n, e, r) = -\operatorname{ld}(probability(n, e, r)) \times w_e \qquad (7.17)$$

7.3.4 Direct Relations
Direct relations represent immediate connections between two entities. Such relationships were explicitly excluded in the definition of common and distinct features in order to rate direct relations among two entities differently from relations to common and distinct neighbors. If two entities are directly related, they can generally be considered more dependent than if they were only related by a common third entity. The contribution of direct relations to the similarity is computed as $Sim_{direct}(A, B)$. In terms of the resource usage graph, this is defined as the weighted sum of edges between A and B:

$$Sim_{direct}(A, B) = W(Link(A, B)) \qquad (7.18)$$

where Link(A, B) denotes the actual links between A and B. Link can be defined in terms of features as follows:

$$Link(A, B) = \{\, A \xrightarrow{e} B \mid \exists(A \xrightarrow{e} B) \,\} \cup \{\, B \xrightarrow{e} A \mid \exists(B \xrightarrow{e} A) \,\} \qquad (7.19)$$

This definition has two interesting properties. First, we count each relationship twice since $A \xrightarrow{e} B \Leftrightarrow B \xleftarrow{e} A$ and $A \xleftrightarrow{e} B \Leftrightarrow B \xleftrightarrow{e} A$. This is necessary because the weight of a feature depends on the related entity of the feature, and it could be that $weight(A \xrightarrow{e} B) \neq weight(B \xleftarrow{e} A)$ due to a different Shannon information content of A and B. Second, the value of equation (7.18) is not normalized, as opposed to a previous definition that we proposed in 1997, which divided $W(Link(A, B))$ by all theoretically possible links between A and B. The normalized version promoted relationships that are the only possible ones between certain kinds of entities (such as the of-type relationship between variables and types) in an unjustified manner.
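To make the interplay of the combined weight strategy (7.17) with the direct and indirect similarity aspects concrete, the following Python sketch computes $Sim_{direct}$ and $Sim_{indirect}$ from precomputed feature sets. It is an illustration only, not the implementation used in this thesis; the representation of features as (partner, edge type, role) triples, the lookup tables, the default values of $I_{eq}$ and d, and all names are assumptions.

```python
# Minimal sketch, assuming features are (partner, edge_type, role) triples and
# that probabilities per (partner, edge_type, role) and relational edge weights
# have been precomputed elsewhere.
from math import log2

def weight(feature, probability, edge_weight):
    """Combined weight strategy (7.17): Shannon information content times the
    relational weight of the edge type."""
    n, e, r = feature
    return -log2(probability[(n, e, r)]) * edge_weight[e]

def weighted_sum(features, probability, edge_weight):
    """W(X): weighted sum over a set of features."""
    return sum(weight(f, probability, edge_weight) for f in features)

def sim_direct(link, probability, edge_weight):
    """Equation (7.18): weighted sum of the edges between the two entities."""
    return weighted_sum(link, probability, edge_weight)

def sim_indirect(common_eq, common_ne, distinct, probability, edge_weight,
                 i_eq=2.0, d=1.0):
    """Equation (7.8).  i_eq > 1 favors equivalent common features; d scales
    the discriminating influence of distinct features.  Both defaults are
    placeholders, not values prescribed by the thesis."""
    w_eq = weighted_sum(common_eq, probability, edge_weight)
    w_ne = weighted_sum(common_ne, probability, edge_weight)
    w_di = weighted_sum(distinct, probability, edge_weight)
    denominator = i_eq * w_eq + w_ne + d * w_di
    return (i_eq * w_eq + w_ne) / denominator if denominator > 0 else 0.0
```

Applied to the feature sets of the example in Figure 7-2, such a sketch would reward F1 and F2 mainly for their equivalent common features and discount them for the two distinct call edges.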
7.3.5 Informal Information
Programmers capture part of the meaning of programs in comments and in the names of functions, variables, and types. This helps them and other programmers to find their way around in a program. Another guide in a program is the file organization: related functions, variables, and types are often put together in one file (as already expressed by the Same Module heuristic). Both of these means of communication among programmers are examples of informal information. Usually, informal information is ignored by reverse engineering techniques (a notable exception is Biggerstaff, 1989), which focus on the information derived by a compiler. This section discusses how the information contained in the names of program identifiers and in the file organization can be relevant to the identification of atomic components.

Names of Identifiers. The naming of functions, variables, and types is an important source of information about a program given to a human reader. It has been observed (Biggerstaff, 1989) that even the author of a program has difficulties in recognizing the purpose of an excerpt from his code once significant identifier names have been replaced by insignificant ones (e.g., f1 instead of top_stack). The naming of identifiers also conveys information relevant to the identification of atomic components. For example, in one of the systems investigated, routines that belong to an abstract data type list had similar names: list_insert, list_remove, and list_create. Two naming conventions are widely used for long identifiers built from many words: separate words with an underscore ('_') or start each new word with a capital letter (e.g., InsertWord). The following metric based on the number of common words between two identifiers exploits these conventions:

$$Sim_{words}(X, Y) = \frac{|words(X) \cap words(Y)|}{|words(X) \cup words(Y)|} \qquad (7.20)$$

where words(X) denotes the set of words of X, i.e., all substrings of X separated by underscores or capital letters as delimiters.

An interesting feature of this definition is that it abstracts from the word lengths. Considering the length of the common words is generally not justified. For example, the similarity of the identifier pairs (list_insert, new_list) and (list_insert, setInsert) is equal according to equation (7.20) because of:

$$Sim_{words}(list\_insert, new\_list) = \frac{|\{list, insert\} \cap \{new, list\}|}{|\{list, insert\} \cup \{new, list\}|} = \frac{1}{3}$$

$$Sim_{words}(list\_insert, setInsert) = \frac{|\{list, insert\} \cap \{set, insert\}|}{|\{list, insert\} \cup \{set, insert\}|} = \frac{1}{3}$$

If the word length counted, list_insert and setInsert would be more similar. Actually, as a human, one would expect list_insert and new_list to be more similar, because list is a noun which probably stands for an abstract data type. An alternative, more functional point of view could be to consider setInsert and list_insert more related, depending upon whether a functional design was preferred to an object-oriented design. Without knowledge of the meaning of words, we cannot make such decisions. An interesting avenue not explored in this thesis is to investigate to which extent the meaning of words could be captured. Because we generally have a very restricted domain of discourse within programs, and typical adjectives and verbs, such as new and insert, are very frequent, taking the vocabulary of a domain into account could be a promising and complementary approach. The purpose of this thesis, however, is to explore to which degree structural aspects can be leveraged.
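As an illustration of equation (7.20), the following Python sketch splits identifiers at underscores and capital letters and computes the word-set similarity. It is a minimal example under the stated naming conventions, not the tool used for the experiments.

```python
import re

def words(identifier: str) -> set[str]:
    """Split an identifier into words at underscores and before capital letters."""
    parts = re.split(r'_|(?=[A-Z])', identifier)
    return {part.lower() for part in parts if part}

def sim_words(x: str, y: str) -> float:
    """Equation (7.20): shared words divided by all words of both identifiers."""
    wx, wy = words(x), words(y)
    union = wx | wy
    return len(wx & wy) / len(union) if union else 0.0

# Both pairs from the example above yield 1/3:
assert abs(sim_words("list_insert", "new_list") - 1/3) < 1e-9
assert abs(sim_words("list_insert", "setInsert") - 1/3) < 1e-9
```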
When such word conventions are not used, identifiers are frequently constructed using a common prefix or postfix. For such cases, the following metric is used:

$$Sim_{suffix}(X, Y) = \begin{cases} 1 & X = Y \\ \dfrac{prefix(X, Y) + postfix(X, Y)}{1 + prefix(X, Y) + postfix(X, Y)} & X \neq Y \end{cases} \qquad (7.21)$$

where prefix and postfix are the lengths of the common pre- and postfix of their two arguments if the length is longer than three characters; otherwise they are zero.

Organization of Files. The division of a program into files also conveys some information about the meaning of a program. Related functions and variables are often put in the same file or in files with a common substring in their name (e.g., client-db and client-service). The previous metric for identifier name similarity based on pre- and postfixes is used to compare the names of the files without extensions (i.e., only file in file.c or file.h) in which the entity, X, is declared, denoted by filename(X):

$$Sim_{filename}(X, Y) = Sim_{suffix}(filename(X), filename(Y))$$

Informal knowledge should be seen as a complementary source of information. In an interactive approach, one should always use two modes: one that considers informal information and one that does not. Informal information can be helpful, but may also be misleading.

7.4 Clustering Result
The result of the hierarchical clustering described by Algorithm 7-2 is a dendrogram, i.e., a binary tree (or a forest of binary trees if clustering ends before all entities are grouped because all remaining entities are not similar enough) whose leaves are clustered entities and whose inner nodes represent unions of two subclusters or entities (see Figure 7-5). Each inner node is associated with the similarity value of its two subtrees. The farther the nodes are from the root node, the more similar they are in terms of the similarity metric, because the dendrogram is generated bottom-up and the most similar subtrees are combined first.

Figure 7-5. Example dendrogram. [Leaves a, b, c, d, e, f; inner nodes with the similarity values 0.6, 0.5, 0.5, 0.3, and 0.2.]

A dendrogram is useful information for the user because it shows the order in which entities are clustered and the respective most similar entities. Hence, the result of Similarity Clustering should be presented as a dendrogram to the user. The user can then manually select components from the dendrogram. However, in order to retrieve components automatically and also to further process the results of Similarity Clustering with other techniques that expect the results as sets of entities (as will be described in Chapter 8), subtrees of a dendrogram can also be flattened using a user-determined similarity threshold.
The threshold determines the minimal acceptable similarity of a subtree to be flattened, where the similarity of a subtree is the similarity value associated with its root. As already mentioned, the similarity value of all dendrogram nodes below this root is greater than or equal to the similarity value of the root node. In order to retrieve candidates from a dendrogram, Algorithm 7-3 is used.

Algorithm 7-3. Retrieving candidates from a dendrogram.
Input:
• dendrogram D
• threshold Θ
Output:
• list of flat candidates C
Algorithm:
1. initialization:
   for each leaf L in D loop
      visited (L) := false;
   end loop;
   C := empty_list;
2. retrieval:
   for each leaf L where not visited (L) loop
      N := L;
      while parent (N) ≠ ⊥ and then Sim (parent (N)) ≥ Θ loop
         N := parent (N);
      end loop;
      if Sim (N) ≥ Θ then
         add {l | l is a leaf of N} to C;
      end if;
      mark all leaves of N as visited;
   end loop;

Consider the example dendrogram in Figure 7-5. If 0.5 is used as a threshold, the algorithm starts at leaf a and climbs up until it reaches the node with a parent whose similarity value is below the threshold, namely, the node associated with 0.5. The leaves of the subtree rooted at this node, namely, a, b, and c, are clustered into a candidate {a, b, c}, which is added to the list of candidates. Because b and c have been marked visited, the next bottom-up traversal starts at d. This time, only one step up is taken and the proposed candidate is {d, e}. The leaf f is not clustered because its parent is associated with a similarity value below the threshold.

7.5 Integration of Other Approaches
The basic connection-based techniques introduced in Chapter 5 can partly be integrated with the general approach of Similarity Clustering without major changes. Some of the heuristics are even already a constituent of Similarity Clustering, though they are used to yield additional clues among other leveraged aspects rather than in the definite manner of the other approaches. For example, the assumption of Same Module that entities declared in the same module are more related than entities in different modules is covered by Similarity Clustering as the informal similarity aspect $Sim_{filename}$. Moreover, the similarity of objects occurring in the same expression can be increased by assigning more weight to the same-expression relationship; hence, the Same Expression heuristic is incorporated as well. Likewise, Global Object Reference can be simulated by assigning high weights to object reference relationships and setting all other weights as well as the informal information parameters to 0.

Part Type. The integration of the Part Type heuristic needs more work. The claim of Part Type is that a subprogram does not belong to a type of its parameter list that is a part-type of another type in this list. Given the resource usage graph on the left-hand side in Figure 7-6, from the point of view of the Part Type heuristic, the parameter-of edge from F1 to T1 should be eliminated.

Figure 7-6. Part Type captured in similarity metric. [Resource usage graphs over F1, F2, T1, and T2 with return, parameter-of, and part-type edges, contrasting the original graph with its treatment by Part Type and by Similarity Clustering.]

This could be simulated in Similarity Clustering by setting the weight of this parameter-of edge to 0. However, this means that the weight of a relationship does not only depend upon its type, the related node, and its role, but also on the context of a specific instance of this relationship. Similarity Clustering can be extended in this respect.
When the similarity metric is computed for the base entities during initialization, the specific contexts can be checked and the weights for such signature edges can be lowered or even set to 0. The advantage of Similarity Clustering is that it does not necessarily have to set this value to 0; it could also be decided only to reduce the weight. In particular, the weight could be reduced only if there is also an internal access to the container type in this context, because only then could the part type really be put into the container type (or retrieved from it).

Note that even if the weight is set to 0 for Similarity Clustering, the approaches need not necessarily yield the same result. Part Type iterates over the subprograms and groups them with the related types, producing the candidates {F1, T2} and {F2, T1}, whereas Similarity Clustering could first cluster T1 and T2 due to the part-type relationship (ignored by Part Type) and then add the other two functions. However, this is only possible if the part-type relationship has a higher weight than the return relationship.

Internal Access. The Internal Access heuristic groups all subprograms that access a record component of the same type or record variable (see Section 5.6). This may be correct from the information hiding point of view. However, in reality, for reasons of efficiency, some of the record components of a type may be accessed from subprograms that do not belong to the type or variable. These are often components of a simple type telling something about the general state of the type or variable. For example, a list data structure has a component length of type int that counts the number of elements of the list. Instead of providing a function that returns the value of this component, the programmer may allow access to this component from outside. There may be other components that are not intended to be used from outside because they are associated with a more complicated control logic. Thus, there can be two kinds of record components: public and private record components. The Internal Access heuristic does not distinguish between them and therefore produces components that are too large if there are public components.

Since there is no means in the language C to distinguish public from private record components, they cannot easily be told apart. However, it is likely that a record component is meant to be public when it is frequently accessed. Another indicator is that public record components are generally only read by subprograms that do not belong to the type; writing these components is mostly done by explicit accessor routines of the type because there may be complex consistency constraints to maintain among the record components of this type. Similarity Clustering as proposed so far is capable of addressing these two attributes. Frequently referenced record components get less weight according to the Shannon information content used to ascertain the weights of features in Section 7.3.3. In order to distinguish references to record components from references to variables, we have already refined the set, use, and take-address-of relationships in Section 4.2.7, where we distinguish references to record components from references to the object as a whole.
Furthermore, we have specified in Section 4.2.6 that each composite variable has its own tree of statically accessible record components according to the type of the variable. Using this information, Internal Access is integrated as follows:
• frequently used public record components get less weight by way of the Shannon information content
• accesses to record components get higher weights than ordinary accesses by assigning higher edge weights to comp-set, comp-use, and comp-address-of than to obj-set, obj-use, and obj-address-of
• internal sets get higher weights than internal uses by assigning higher weights to comp-set than to comp-use

7.6 Establishing the Similarity Metric Parameters
A problem of Similarity Clustering in practice is that it has many parameters that have to be adjusted. Research in cluster analysis has produced statistical analyses to improve clustering (Steinhausen and Langer, 1977). However, they cannot be applied to our problem because of the way we had to define our similarity metric. A similarity metric is normally defined with respect to specific features that are absolute for all entities to be grouped. For example, in order to cluster cells in microbiology, we can measure their size, coloring, protein content, and so forth. Each of these features constitutes an absolute scale on which each cell can be measured and, hence, we can represent a cell as a vector of such absolute measurements. Similarity between two cells can then be expressed by alternative distance metrics, typically the Euclidean metric (see Steinhausen and Langer, 1977, for other metrics). Our similarity metric is defined as a relation directly between two entities rather than with respect to a third, absolute point of comparison. That is why we cannot use the statistical methods suggested for traditional similarity metrics.

There are basically three layers at which parameters have to be adjusted:
• edge weights (7.16)
• parameters in individual aspects of similarity, namely, d and $I_{eq}$ in $Sim_{indirect}$ (7.8)
• influence factors $x_i$ of the aspects of similarity on the similarity between two entities (7.3)
The similarity for groups of entities does not have parameters. In the following, we are going to discuss how the parameters at the respective layers can be established.

7.6.1 Statistical Analysis of Edge Distribution
As a complement to Schwanke's suggestion to use the Shannon information content to establish the weights of features, we have proposed to assign weights to the kinds of relationships, or, technically speaking, to the edge types. In order to answer the questions of whether the edge types matter at all and how one can find appropriate edge weights, we investigated the distribution of edge types in the reference components described in Section 6.1.

7.6.1.1 Scope for the Data
Which edges we consider for the statistical analysis described below depends on the nodes regarded in the first place. The nodes will be described first before we go into the details of how the statistical analysis is performed.

Nodes considered. Because only a portion of all entities in the subject systems really belongs to an atomic component (according to our analysts), a global analysis of the edge distribution does not make sense. Instead, we regard only entities within atomic component contexts. An atomic component context contains any node within an atomic component and any node that is a neighbor of an atomic component element.

Example.
There is one atomic component {f3, f5, f6, v1, v2} in Figure 7-7. Solid and dashed edges represent calls and references, respectively. f1, f2, f4, and f8 are neighbors of at least one atomic component element. Therefore, {f3, f5, f6, v1, v2} ∪ {f1, f2, f4, f8} is the atomic component context. f5, f6, and f7 do not belong to this context because none of their neighbors belongs to the atomic component.

Figure 7-7. Example for atomic component contexts. [Resource usage graph over subprograms f1–f8 and variables v1 and v2; solid edges are calls, dashed edges are references.]

Edges considered. In the following, we will regard only edges that are relevant to the similarity between nodes in atomic component contexts. These are all edges that occur in Link, Common, or Distinct of nodes within an atomic component context (where we consider only one atomic component context at a time). If a and b are within the same atomic component context, Common(a,b) and Distinct(a,b) can only contain edges with at least one end within the atomic component context, according to the definition of Common and Distinct; however, the other end of the edge could be a node outside of the atomic component context, like the target of the call from f4 to f5 in Figure 7-7. For edges in Link(a,b), both ends must even be in the atomic component context because Link contains only edges between a and b, which are both in the atomic component context.

Concentrating on atomic component contexts helps us to gather information on why some nodes are jointly inside an atomic component and others are not. Therefore, we will count the edge types that occur in Link, Common, or Distinct, respectively, in a comparison of nodes that are both inside the same atomic component and of nodes in different atomic components. Nodes that do not belong to any atomic component at all will be considered an atomic component of their own. The distinction of the edge distribution in same and different atomic components gives us insight into what the contribution of an edge type is for two elements being in the same or in different atomic components. If an edge type occurs more often among nodes in the same atomic component than among nodes in different components, it should get more weight, and vice versa. The same edge can occur several times in the statistics because it can occur in Link, Common, or Distinct of different pairs of nodes. Distinguishing among Link, Common, and Distinct can give us additional hints on the weighting of these individual aspects of similarity. Finally, we have to distinguish among different kinds of atomic components because it is clear from their definition that certain edge types can never occur within specific atomic components. For example, we cannot find a use edge within an abstract data type; otherwise there would have to be a variable within the atomic component and thus it would no longer be an abstract data type but a hybrid component.

To sum up, the comparison has the following dimensions:
• edge types
• Link / Commoneq / Commonne / Distinct
• same / different atomic component
• kinds of atomic component

7.6.1.2 Used Data
The data in the following statistical analysis will be ascertained for one atomic component at a time. For reasons of readability, we will not explicitly use indices for individual atomic components in the following presentation of the way the information is computed. For the same reason, we will also do without any index for the kind of atomic component.
The following equations should be understood as being applied to individual atomic components of the same kind. Recall that the edge type weight is multiplied by the Shannon information content to obtain the resulting feature weight, as proposed by equation (7.17). If we simply counted the occurrence of each edge type, the statistics would not be adequate when the Shannon information content is used to balance frequently used entities. Therefore, the statistics have to be based on equation (7.17). That is why the accounting of edge types is by means of $W(X) = \sum_{x \in X} weight(x)$ with $weight(n, e, r) = -\operatorname{ld}(probability(n, e, r)) \times w_e$ in the following, where $w_e = 1$ is assumed for all edge types. If the Shannon information content is not considered, one has to compute the following formulas with weight(n, e, r) = 1 to get an edge distribution based on pure edge occurrences.

We are going to compute the following figures for each edge type, e, for the context of a given atomic component, A (assuming $w_e = 1$ for the computation of W(X); furthermore, it is assumed that the nodes are enumerated):

$$Link^s_e(A) = \sum_{a_i, a_j \in A \,\wedge\, i < j} W(F_e(Link(a_i, a_j))) \qquad (7.22)$$

$$Link^d_e(A) = \sum_{a \in A \,\wedge\, b \in context(A) \setminus A} W(F_e(Link(a, b))) \qquad (7.23)$$

where $F_e(X) = \{\, m \xrightarrow{e} n \mid (m \xrightarrow{e} n) \in X \,\}$ is a filter that sorts out all features not of type e. $Link^s_e(A)$ considers only elements that are both in the same atomic component A, whereas $Link^d_e(A)$ comprises features of entities where only one node is part of A.

Given these definitions, we can compute for each edge type and similarity aspect the ratio of edges between nodes in the same atomic component, A, and edges between nodes within A and nodes outside of A:

$$LinkRatio_e(A) = \frac{Link^s_e(A)}{Link^s_e(A) + Link^d_e(A)} \qquad (7.24)$$

The average of this ratio over all atomic components is the probability for an edge of kind e to link nodes in the same atomic component; in other words, it describes the “natural” portion of edges of type e within an atomic component (let $\mathcal{A}$ be the set of reference components used for calibration):

$$LinkRatio_e = \frac{1}{|\mathcal{A}|} \times \sum_{A \in \mathcal{A}} LinkRatio_e(A) \qquad (7.25)$$

Only if $LinkRatio_e$ is greater than 0.5 is the edge type e significant. Hence, the edge weight can be set in correlation to this ratio. We can even use this ratio itself as the weight for e. If the ratio is greater than 0 but less than 0.5, the weight should not be 0 (or even negative), because at least some edges of type e are within detected atomic components. Analogously, the weight for e should not be 1 (or greater than 1) if the ratio is greater than 0.5 but less than 1, because then at least some edges of type e are between entities not in the same atomic component. Hence, LinkRatio represents an appropriate weight.

Definitions analogous to equations (7.22), (7.23), (7.24), and (7.25) can be made for Commonne and Commoneq. Because Distinct is used to discriminate entities, as opposed to the other similarity aspects, we use the following formula to measure the Distinct ratio, where a high value indicates that the edge type is an important grouping factor:

$$DistinctRatio_e(A) = 1 - \frac{Distinct^s_e(A)}{Distinct^s_e(A) + Distinct^d_e(A)} \qquad (7.26)$$

where
• $Distinct^s_e(A)$ is defined analogously to (7.22)
• $Distinct^d_e(A)$ is defined analogously to (7.23)
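The following Python sketch shows one way the ratios of equations (7.22) to (7.25) could be tallied over a set of reference components. The data structures (components as node sets, edges as (source, target, type) triples) and the default weight of 1 are assumptions made for illustration; as described above, the thesis additionally weights each edge by the Shannon information content of its partner entity.

```python
from collections import defaultdict

def link_ratio_per_edge_type(components, edges, weight=lambda edge: 1.0):
    """Average LinkRatio_e over a set of reference components.
    'components' is a list of node sets; 'edges' lists (source, target, type)
    triples.  With the default weight this reduces to pure edge counting."""
    ratios = defaultdict(list)                    # edge type -> per-component ratios
    for component in components:
        same = defaultdict(float)                 # Link^s_e(A), equation (7.22)
        across = defaultdict(float)               # Link^d_e(A), equation (7.23)
        for edge in edges:
            a, b, e = edge
            if a in component and b in component:
                same[e] += weight(edge)
            elif (a in component) != (b in component):   # exactly one end inside A
                across[e] += weight(edge)
        for e in set(same) | set(across):
            total = same[e] + across[e]
            if total > 0:                         # simplification: skip undefined ratios
                ratios[e].append(same[e] / total) # LinkRatio_e(A), equation (7.24)
    # equation (7.25): average over the reference components
    return {e: sum(values) / len(values) for e, values in ratios.items()}
```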
7.6.1.3 Data for Aero, Bash, CVS, and Mosaic
This section describes the edge ratios for Link, Commonne, Commoneq, and Distinct as defined in the previous section for the systems Aero, Bash, CVS, and Mosaic. For readability reasons, the bars for Distinct actually present 1 − DistinctRatio as defined by (7.26), and the edge type names are abbreviated as listed in Table 7-2.

Table 7-2. Abbreviations for edge types.
  actual-parameter-of   AP      comp-set       CS      obj-address-of   OA
  call                  CL      comp-use       CU      obj-set          OS
  same-expression       SE      parameter-of   PA      obj-use          OU
  local-obj-of-type     LT      return         RE      part-type        PT
  comp-address-of       CA      of-type        OT      delineate        DE

Ratios for abstract data types. For ADTs the Link ratio of obj-address-of, obj-set, obj-use, same-expression, and actual-parameter-of must be 0 because otherwise a variable would be contained in the component and, therefore, the component would be categorized as hybrid. The edge ratios are shown in Figures 7-8, 7-9, 7-10, and 7-11 for the respective systems.

Figure 7-8. Ratios for ADTs in Aero. [Bar chart of Link, CommonEQ, CommonNE, and Distinct ratios (0–1) per edge type AP–DE.]
Figure 7-9. Ratios for ADTs in Bash. [Analogous bar chart.]
Figure 7-10. Ratios for ADTs in CVS. [Analogous bar chart.]
Figure 7-11. Ratios for ADTs in Mosaic. [Analogous bar chart.]

• Link: It is not surprising that delineate has such a high significance for Link; programmers introduce a synonym for a type and adhere to it. Similarly, the expected higher Link ratios for signature and component references can be found in the charts; only in the case of Aero are return edges less significant among direct links.
• Commoneq: The ratios for Commoneq are in most cases close to the Link ratios for the diverse edge types. The peak of comp-address-of in CVS can be explained by the very small number of cases in which the addresses of record components are taken, which all happened to be taken by subprograms of the same atomic component. Nor is the peak for of-type within Mosaic representative. In comparison to other systems, the value for Commoneq is surprisingly high for Mosaic. This means that a variable that is an actual parameter to one accessor function is also an actual parameter to most other accessor functions of the same component.
• Commonne: The charts for Commonne suggest that non-equivalent features are less significant. The difference between the ratios of Commoneq and Commonne supports the distinction of shared features into equivalently and non-equivalently related ones that I proposed as an improvement to the original approach.
• Distinct: The data for all systems suggest no significant difference among edge types within Distinct. Note that the bars for Distinct actually present 1 − DistinctRatio, i.e., the actual DistinctRatio as defined by (7.26) is virtually 1 for all edge types.

ADO ratios. In ADOs, parameter-of, return, and part-type cannot appear, as otherwise we would deal with a hybrid (Figures 7-12, 7-13, 7-14, and 7-15).
For ADOs, the remaining reference relationships take-address-of, set, and use (in particular, those for global objects) are dominant, as expected. Furthermore, same-expression is also an important factor, which justifies the Same Expression heuristic (the peak of same-expression in Mosaic is not representative because there is only one such edge). Nor is it surprising that obj-set is more important than obj-use in most cases. The data support our hypothesis that setting objects is often a more critical operation because it may involve checking certain consistency constraints or performing non-trivial algorithms on a complex structure; for this reason, clients of a component avoid changing the state of a component directly. The reason why comp-set is less significant than comp-use, which apparently contradicts our hypothesis in the case of ADTs, is the way dereferences are handled by the resource usage analysis. As discussed in Section 4.2.8.2, an assignment to a record component by means of a dereferenced pointer, like c->a = 1, is actually considered a usage of the record component. Because most types of ADTs are dynamic data structures and, hence, accessed by means of dereferenced pointers, many references to record components that would be classified as assignments by a reverse engineer are actually considered comp-uses by the resource usage analysis, which follows the compiler's point of view.

It is also interesting to see how the ratios for all kinds of references change from system to system, indicating that Aero and Bash are more permissive than CVS and Mosaic with respect to references to global variables and constants of abstract data objects. The high ratios for Commoneq in Bash for return edges and in Mosaic for comp-address-of edges are due to the very small number of such edges, which all happened to be in the same atomic component. These figures are not representative.

The overall conclusion drawn from these data is that the actual influence of the edge types in the respective similarity aspects depends on the system. As a consequence, the weights gained for one system cannot necessarily be used for a different system. However, in this thesis, we considered very different kinds of systems from different authors. For a family of systems of a common application domain and for programmer teams with established programming conventions, there could be less divergence of the edge ratios. On the other hand, the data clearly reveal differences among the edge types, and these differences are similar for all systems, which supports our approach of assigning different weights to different edge types. Furthermore, the different aspects of similarity, namely, Link, Commoneq, Commonne, and Distinct, actually have different influence; Link and Commoneq are most important.

Figure 7-12. Ratios for ADOs in Aero. [Bar chart of Link, CommonEQ, CommonNE, and Distinct ratios per edge type.]
Figure 7-13. Ratios for ADOs in Bash. [Analogous bar chart.]
Figure 7-14. Ratios for ADOs in CVS. [Analogous bar chart.]

The figures of Link for record component and object references also allow insight into the degree of information hiding of the respective systems.
There is generally only little information hiding for ADTs in all systems; in the case of CVS, there is even virtually no encapsulation of ADTs at all. On the other hand, information hiding for ADOs ranges from little (Aero) through medium (Bash) to good (CVS and Mosaic). These data indicate that the means for information hiding are more often used for ADOs than for ADTs. However, even for ADOs, there are still many data references that bypass the accessor functions.

Figure 7-15. Ratios for ADOs in Mosaic. [Bar chart of Link, CommonEQ, CommonNE, and Distinct ratios per edge type.]

7.6.2 How to Find the Parameters
As the statistics in Section 7.6 show, the parameters of Similarity Clustering have to be adjusted before it can be applied in practice. This can be done by compiling the atomic components of a sample of the system either manually or by using one of the other techniques. The sample is then used to calibrate the diverse parameters, namely, the edge weights and the individual influence factors of the similarity aspects. With these parameters, Similarity Clustering is applied to the whole system. Browsing and selecting the proposed candidates, one adds reference components, which are appended to the previous sample and which can be used for calibration in the next cycle. Of course, not all candidates have to be validated. In the simplest case, only one candidate component could be treated per iteration. This calibration process is repeated until a parameter setting is found that works well for a good part of the system or until the parameters no longer seem to change. Fortunately, all steps of this process apart from the validation can be automated. How the detection quality can be judged was already discussed in Section 6.2.2.

Two alternative classes of calibration approaches to find appropriate parameters for a given sample are explored in this thesis. The first one consists of two steps: (1) ascertaining the “typical” edge occurrences in atomic component contexts and (2) finding adequate influence factors with established optimization strategies. The second approach is based on a multi-dimensional contingency table, derived from the sample, that describes the probability for two entities having a certain vector of Link, Commoneq, Commonne, and Distinct values to be in the same atomic component. Both approaches are presented and evaluated in this section.

7.6.2.1 Traditional Optimization Strategies
Similarity Clustering has a large number of parameters, mainly due to the diverse edge types. In order to reduce the problem space of Similarity Clustering, we first try to find appropriate weights for the edge types and then adjust the other parameters.

Edge weights. The edge weights contribute to the direct and indirect similarities between two entities. A high value has the effect of attracting two entities during clustering. Therefore, the likelihood for them to be in the same atomic component increases. The model described by (7.3) on page 189 assumes equal weights for the same edge type no matter to which similarity aspect the edge type contributes, i.e., whether it appears in Link, Common, or Direct. The influence parameters of the similarity aspects are used to make this distinction.
That is, for establishing the edge type weights, we do not care about whether there is a different edge type distribution among Link, Commoneq, Commonne, and Direct; we can simply use the relative frequency of edge types in the context of atomic components as follows:

$$EdgeRatio_E = \frac{inside_E}{inside_E + across_E} \qquad (7.27)$$

where $inside_E$ is the number of edges of type E within an atomic component and $across_E$ is the number of edges across the border of atomic components. It makes good sense to use the edge type ratios as edge weights because they tell something about the “natural” composition of an atomic component. Using a high edge ratio as the weight of set edges draws many set edges into the atomic component, while a low edge ratio for actual-parameter-of exerts only little attraction. The edge ratio can then be combined with the Shannon information content as discussed in Section 7.3.3. Note that the Shannon information content is based on the frequency of nodes and does not interfere with the edge type frequency.

For the automation of the search for good parameter values, there are several alternatives that will be presented in the following paragraphs. There are many other approaches that are not investigated here; it would go beyond the scope of this thesis to try all of them. The selected ones can be viewed as typical.

Grid Search. Given an interval that defines the search space for each influence factor and a step by which the search progresses, we can search all over a grid of the four-dimensional space spanned by Link, Commoneq, Commonne, and Distinct. The visited space coordinates of the grid depend on the chosen step. On the one hand, the step should not be too wide, otherwise maxima could be skipped; on the other hand, a short step will dramatically increase the time needed for the search. Due to its very large number of iterations, grid search is only feasible for small samples.

Gauß-Seidel Strategy. The Gauß-Seidel strategy follows the strategy of hill climbing by varying a single parameter at a time and analyzing the effect on the detection quality. It first checks for the direction in which the parameter is to be changed. If the detection quality increases when the parameter is increased, the parameter will be further increased. If the detection quality decreases instead, the parameter will be further decreased. Then the parameter will be changed in the chosen direction until no improvement can be achieved anymore. The parameter is frozen and the next parameter is adjusted.
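A schematic Python rendering of the Gauß-Seidel strategy just described. The quality function (the detection quality achieved on the training sample), the step width, and the single pass over the parameters are placeholders and simplifications, not the implementation used for the experiments.

```python
def gauss_seidel(params, quality, step=0.1):
    """One pass of coordinate-wise hill climbing: for each parameter in turn,
    probe the direction in which the detection quality improves, keep moving
    in that direction until there is no further improvement, then freeze the
    parameter.  'params' maps parameter names to values; 'quality' evaluates
    a parameter setting on the training sample."""
    best = quality(params)
    for name in list(params):
        for direction in (+step, -step):
            moved = False
            while True:
                trial = dict(params)
                trial[name] = params[name] + direction
                q = quality(trial)
                if q > best:
                    params, best, moved = trial, q, True
                else:
                    break
            if moved:
                break        # improving direction found and exhausted; freeze parameter
    return params, best
```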
Simulated Annealing. Simulated annealing is a widely used algorithm for combinatorial optimization problems. It is based on the following analogy between a combinatorial optimization problem and a physical system (Aarts and Korst, 1990):
• Solutions in a combinatorial optimization problem are equivalent to states of a physical system.
• The cost of a solution is equivalent to the energy of a state.
In condensed matter physics, annealing is known as a thermal process for obtaining low-energy states of a solid in a heat bath. The process contains the following two steps:
1. Increase the temperature of the heat bath up to a maximum value at which the solid melts.
2. Carefully decrease the temperature of the heat bath until the particles arrange themselves in the ground state of the solid.
In the liquid phase, all particles of the solid arrange themselves randomly. In the ground state, the particles are arranged in a highly structured lattice and the energy of the system is minimal. The physical annealing process can be modeled successfully by using computer simulation methods from condensed matter physics. The general algorithm is well known and therefore not presented here (the reader is referred to Aarts and Korst, 1990). Simulated annealing was introduced above as a search for a minimum. Since the maximal detection quality is sought on the sample of reference components, the simulated annealing algorithm is adjusted to find a state of maximal energy. Furthermore, the process ends when the maximal number of iterations (set to 15) is reached or the improvement with respect to the best solution found is less than five percent after at least half of the maximal iterations have been completed.

Evolution Strategies. Another family of optimization techniques that have recently attracted attention are evolution strategies. In order to adopt evolution strategies to find reasonable parameters, one could view parameter settings as individuals exposed to evolution. However, evolution strategies require a large number of individuals per generation (typically at least 50) and several iterations. Since each clustering process with Similarity Clustering can take several minutes on a Sun Sparc Ultra-60, the time needed with evolution strategies would simply be too long. Doval et al. (1999) have explored evolution strategies for clustering in more detail.

7.6.2.2 Sample Partitioning
Unfortunately, the optimization approaches described above need several iterations for calibration. The calibration itself involves clustering of the sample and is therefore a rather expensive operation. It would be preferable to have a more direct way to find the parameters.

The optimization approaches above were used to calibrate the individual influence factors of the similarity aspects within equation (7.3) on page 189. The model described so far assumes equal weights for edge types no matter to which similarity aspect they contribute. This need not necessarily be the case but was done to limit the degrees of freedom of Similarity Clustering. Adding individual influence factors of the similarity aspects depending on the edge types adds more parameters. However, if there is a sample of reasonable size, these parameters can automatically be ascertained.

In experimental designs, so-called contingency tables are used to test main effects and interactions of factors (Winer et al., 1991). A contingency table is an n-dimensional table where each dimension $i \in 1 \ldots n$ represents an independent variable $V_i$ with levels $1 \ldots m_{V_i}$; a cell $p_{v_1 \ldots v_n}$ of the contingency table is the frequency of observed values of the dependent variable at level $v_1$ of variable $V_1$, $v_2$ of variable $V_2$, and so forth. For example, if one is interested in whether sex and body height have any effect on the choice of Ada as the favorite programming language, one could take a sample of programmers and divide them into male and female and into short, medium, and tall (where intervals of the exact body height would have to be specified in order to define these categories).
The contingency table would be as follows (the data are fictive):

                            Ada    others    percentage
  male      short            20        50          0.29
            medium           40        30          0.57
            tall             30        20          0.60
  female    short            10        15          0.40
            medium           30        40          0.43
            tall             10        10          0.50

For this contingency table, statistical tests can be used to validate hypotheses like “the sex of a programmer does not influence her/his choice of Ada as the favorite programming language” or “tall males prefer other languages”.

The idea of the contingency table can be adapted to the problem of establishing the parameters of Similarity Clustering. The independent variables of this adaptation are the discretized similarity aspects. For the dependent variable, we count how often two elements with a certain combination of discretized values of the independent variables are in the same atomic component versus in different atomic components. The percentage is the probability that two elements with a certain combination of similarity aspects belong to the same component and can be used for clustering, just as in the example above, in which the probability that a male and tall programmer prefers Ada is 30/(30+20) = 0.6.

Before the use of the contingency table is explained in detail, a few preliminary remarks follow. Given the atomic components for a sample of the system, we could look for the patterns involving two entities in the same component (positive examples) or in different components (negative examples). These patterns could be very detailed, like “if the two subprograms have five calls to common neighbors, share two types in their signature, but access no variable, then they belong to the same component”. If the same pattern occurs more than once but does not always apply to entities in the same component, its validity can be determined as the number of positive examples divided by the number of occurrences. Hence, one can derive the patterns and their degree of confidence from the sample and use this information to try to find the other components in the rest of the whole system. However, if these patterns are too detailed, then occurrences with similar, yet different characteristics will be missed. Therefore, there should be some sort of abstraction, in other words, a formation of equivalence classes.

The similarity metric for two entities is such an abstraction; it abstracts from the exact numbers and kinds of relationships between two entities. However, the abstraction might go too far, for at least two reasons:
• Since the similarity metric is the weighted sum over individual aspects of similarity, all similarity aspects should be sufficiently high when two entities are to be grouped. However, it could well be the case that two entities belong together only when one similarity aspect is high while another is low.
• At a lower level, as stated above, the similarity metric assumes that the influence of the edge types does not depend upon which aspect they occur in.
Both of these assumptions need not be valid. Clustering based on single, very detailed patterns as described above does not make these assumptions. However, it requires all possibly needed patterns of the whole system to be in the sample. A first abstraction of the very detailed patterns, also without the assumptions of the similarity metric, could be to categorize the LinkRatio, CommonRatioeq, CommonRatione, and DistinctRatio for all patterns as introduced in Section 7.6.1.2.
Recall that these ratios state the probability of finding a certain edge type within a similarity aspect (Link, Commoneq, Commonne, and Distinct) between entities in the same component. A pattern for two entities A and B can then be characterized by a vector:

($Link_{e_1}$(A,B), $Commoneq_{e_1}$(A,B), $Commonne_{e_1}$(A,B), $Distinct_{e_1}$(A,B),
 $Link_{e_2}$(A,B), $Commoneq_{e_2}$(A,B), $Commonne_{e_2}$(A,B), $Distinct_{e_2}$(A,B),
 …,
 $Link_{e_N}$(A,B), $Commoneq_{e_N}$(A,B), $Commonne_{e_N}$(A,B), $Distinct_{e_N}$(A,B))

where $e_1, e_2, \ldots, e_N$ are edge types. Then the likelihood for two entities to be in the same component, given a certain vector, can be ascertained as the relative frequency of positive examples for a vector of this kind. Since the similarity aspect ratios are floating-point numbers, near-misses should be avoided by forming equivalence intervals on the ranges of possible values. That is to say, a floating-point number is mapped onto a discrete range and all values in this range are handled alike. In other words, a multi-dimensional contingency table is established whose indices are the discretized ranges of the similarity aspect ratios and whose entries are as follows:

$$ct(\vec{v}) = \frac{positive\_examples(\vec{v})}{positive\_examples(\vec{v}) + negative\_examples(\vec{v})} \qquad (7.28)$$

where $\vec{v}$ is a discretized vector of similarity aspects across all edge types. When the contingency table has been populated with the data drawn from the sample, it can be used to estimate the likelihood that two arbitrary entities of the rest of the system belong to the same component: only the similarity aspects for the two entities are computed and then the likelihood is looked up in the table. However, the order of this table is

$$(range(Direct) \times range(Common_{eq}) \times range(Common_{ne}) \times range(Distinct))^N$$

where N is the number of edge types and range(X) is the number of intervals for the discretization of X (it is assumed for all edge types that a similarity aspect is discretized into the same number of subintervals). The fact that a contingency table is very large is not the problem, because it is also very sparse, which allows an efficient table implementation; the problem is that the number of undefined entries, i.e., entries that have neither positive nor negative examples, increases with the order of the table. In such cases, the contingency table information is not available and it is unclear how to classify the two entities at hand.

A compromise between abstraction and subcategorization is to consider only the top-level similarity aspects, i.e., to classify according to the following kind of vector that does not differentiate among edge types:

(Link(A,B), Commoneq(A,B), Commonne(A,B), Distinct(A,B))

The edge type ratios are relevant insofar as they are used to establish the weight of the edge type as in the more detailed model. But then, the edge types are summarized by the top-level similarity aspects and are not used for the index range of the contingency table.
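The following Python sketch illustrates how the contingency table just described could be populated from a training sample and then used to look up the likelihood that two entities belong to the same component. The discretization width and the data structures are assumptions made for illustration.

```python
from collections import defaultdict

def discretize(value, width=0.1):
    """Map a similarity-aspect value onto a discrete interval index."""
    return int(value / width)

def build_table(samples, width=0.1):
    """Populate the contingency table of equation (7.28).  'samples' yields
    ((link, common_eq, common_ne, distinct), same_component) pairs: the vector
    of top-level similarity aspects for an entity pair and whether that pair
    lies in the same atomic component."""
    positive, total = defaultdict(int), defaultdict(int)
    for aspects, same_component in samples:
        key = tuple(discretize(a, width) for a in aspects)
        total[key] += 1
        if same_component:
            positive[key] += 1
    return {key: positive[key] / total[key] for key in total}   # ct(v)

def same_component_likelihood(table, aspects, width=0.1):
    """Look up ct(v) for a new entity pair; returns None for undefined entries."""
    return table.get(tuple(discretize(a, width) for a in aspects))
```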
7.6.2.3 Evaluation of the Training Methods
There are two factors by which an optimization approach will be judged:
• the number of iterations needed to calibrate the parameters on the sample
• the size of the sample needed to find reasonable parameters
On average, Gauß-Seidel optimization took nine iterations and simulated annealing eleven, whereas the approach based on a contingency table always needs only one iteration.

Training. For the evaluation, increasingly large subsets of the reference components were used to calibrate the parameters of Similarity Clustering with the methods described above. The components used for training are called training components. The set of training components is called the training set. The set of all reference components, of which the training set is a subset, is called the reference set. Sizes of the training sets in the range of 0.1 to 1.0 (step = 0.1) were used for calibration, where the size of the training set is defined as the number of base entities that are part of training components divided by the total number of base entities that are part of the reference set.

The training components were randomly chosen from the reference set and were always complete in order to avoid false information. If a training component were not complete, its atomic component context used for the training might contain relationships erroneously classified as connecting entities of different components just because one of the two partners of the relationship, although actually in the same original component, does not belong to the training component. As a consequence, the following data are only valid for a usage model in which a maintainer identifies single components as completely as possible rather than beginning with several subsets of components in parallel. Furthermore, because the reference components were used completely, the exact share of the training subset may differ from the nominal values $i \times 0.1$ ($i \in \{1, \ldots, 10\}$).

Because we are interested in the question of how big a training set should be to find appropriate parameters, each training for a new subset begins with the same default parameters rather than using the results of the previous training. In real usage, one would use the parameters of the previous training and try to improve these. Moreover, in order to get comparable results for the different sizes of subsets, the training components were always selected in the same order from the reference set and the random generator used by simulated annealing was reset for each training. Hence, all methods use the same training sets and each training with simulated annealing iterates over the same random values.

Threshold. Similarity Clustering, as a hierarchical clustering technique, returns a dendrogram of similar elements rather than a fixed set of candidates. In order to compare the results of Similarity Clustering with those of techniques that yield only sets of “flat” clusters, the dendrograms were flattened as described in Section 7.4. The retrieval of clusters from a dendrogram depends upon a threshold that specifies the minimal acceptable similarity for the flattened clusters. In order to investigate the influence of this threshold, four different thresholds were used. A high threshold has the effect of retrieving smaller clusters. Whether it also decreases the number of clusters depends: for example, a subtree is split into two clusters when the similarity associated with the root of the subtree is below the threshold but the similarities associated with the children of the root are above the threshold. Then, two clusters are proposed, while using a lower threshold would unite these two clusters. On the other hand, a very high threshold generally decreases the number of clusters because only few clusters will be similar enough. A lower threshold yields larger clusters.
As for higher thresholds, we cannot predict whether a lower threshold also increases the number of clusters: More clusters with lower similarities may be accepted, but the size of acceptable subtrees of the dendrogram also increases and, hence, more clusters are united.

There is an obvious connection between the number of candidates and the number of false positives: Since the reference set has a fixed size, the more candidates are proposed, the more false positives will be created, and vice versa. However, as discussed, there is not necessarily a direct correlation between the threshold and the number of candidates and, thus, the number of false positives. Moreover, there is also no simple connection due to the filter used to exclude clusters with less than 3 and more than 75 elements (as was done in the comparison in Chapter 6). A low threshold may produce very large clusters that are then filtered out, and a high threshold may produce clusters too small to be proposed as candidates. Nevertheless, as the following figures for the thresholds between 0.1 and 0.4 show, the lower a threshold is, the lower is the number of false positives in general, despite the fact that the curves in the following charts intersect in some cases. This correlation can be explained by the circumstance that a lower threshold will generally produce larger and, hence, fewer candidates.

In order to see whether the number of false positives increases linearly with the increase of the recall rate as the threshold decreases, a linear regression analysis was performed. The statistical analysis showed that in 7 out of 16 cases the number of false positives increases linearly with the recall rate at a significance level above 0.8. In the other cases, a linear connection could not be shown.

An interesting case of the influence of the threshold is the recall rate for Mosaic using Gauß-Seidel optimization shown in Figure 7-18. The thresholds 0.1 and 0.2 yield good results, but then, when the threshold is further increased, the number of candidates above the threshold immediately drops to 0. Fortunately, this phenomenon seems to be rare.

Results. The calibration results for the respective methods and kinds of atomic component are shown in Figures 7-16, 7-17, 7-18, 7-19, 7-20, and 7-21. The charts present the recall rate as defined by (6.5) on page 163 and the number of false positives with respect to the size of the subset used to calibrate the parameters. Each chart contains data for four different thresholds used to retrieve candidates from the tree produced by Similarity Clustering. For the Contingency Table approach, a different set of thresholds was used for the charts because the threshold 0.4 did not yield any candidates.

In most cases, there is no substantial difference between calibration according to Gauß-Seidel and Simulated Annealing. In a few cases, the Gauß-Seidel technique yielded worse parameters (ADO detection for Bash; ADT detection for Bash and CVS with threshold 0.1). The Contingency Table approach is clearly worse than the other two approaches. Only for Mosaic were fewer false positives generated at the same recall rate (ADO detection) or at a recall rate slightly worse than that of the other approaches (ADT detection).

An overall result of this evaluation is that the size of the subset does not have any discernible influence on the recall rate and the number of false positives: Sometimes they increase, sometimes they decrease. Furthermore, even when the parameters are calibrated on all reference components of the system, the recall rate is far from 1.0. This may partly be due to the few iterations of calibration conducted, but is certainly also because the Similarity Clustering metric is an abstraction of the actual patterns that is in many cases too coarse. Moreover, for the evaluation, the candidates are derived from the dendrogram using a threshold that is decisive for all subtrees of the dendrogram. Hence, it is assumed that the average similarity among elements of components is comparable, which need not be the case; e.g., for one atomic component, a naming convention may be established such that informal information can be leveraged by Similarity Clustering, while for other components, the names of their elements are not similar at all and, consequently, have a lower average similarity. On the other hand, a positive result of this evaluation is that a subset size of about 20% seems to be sufficient to find suitable parameters.

Figure 7-16. ADO detection with Simulated Annealing. (Charts of recall rate and number of false positives versus training subset size for Aero, Bash, CVS, and Mosaic, one curve per threshold 0.1, 0.2, 0.3, 0.4.)
Figure 7-17. ADT detection with Simulated Annealing. (Same chart layout.)
Figure 7-18. ADO detection with Gauß-Seidel. (Same chart layout.)
Figure 7-19. ADT detection with Gauß-Seidel. (Same chart layout.)
Figure 7-20. ADO detection with Contingency Table. (Same chart layout; thresholds 0.001, 0.10, 0.20, 0.35.)
Figure 7-21. ADT detection with Contingency Table. (Same chart layout; thresholds 0.001, 0.10, 0.20, 0.35.)

7.6.2.4 Comparison to Other Techniques
Figure 7-22 and Figure 7-23 contain the recall rate of Similarity Clustering in comparison to the other approaches, using simulated annealing as the training method and picking suitable thresholds that balance recall rate and number of false positives.
Table 7-3 on page 238 contains the accuracies for the 1-1, 1-n, and n-1 matches, where the columns with heading "th." contain the minimal similarity thresholds used to retrieve components from the dendrogram. These thresholds were chosen to balance the recall rate and the number of false positives. Table 7-4 on page 238 lists the number of false positives for Similarity Clustering and Schwanke's Arch approach. The table demonstrates that the number of false positives for Similarity Clustering could actually be reduced with respect to its predecessor Arch.

Figure 7-22. ADT recall rates. (Bar chart of the ADT recall rates of Part Type, Same Module, Internal Access, Arch, Type-based Cohesion, and Similarity Clustering for Aero, Bash, CVS, and Mosaic.)

Figure 7-23. ADO recall rates. (Bar chart of the ADO recall rates of Global Reference, Same Module, Internal Access, Same Expression, Arch, Delta IC, and Similarity Clustering for Aero, Bash, CVS, and Mosaic.)

Table 7-3. Detected ADTs and ADOs for Similarity Clustering.

            ADT                                                 ADO
  System    th.    Good       OK, too large   OK, too detailed  th.    Good       OK, too large   OK, too detailed
                   #   acc.   #   acc.        #   acc.                 #   acc.   #   acc.        #   acc.
  Aero      0.20   3   0.84   2   0.44        0   0.00          0.20   5   0.81   4   0.47        0   0.00
  Bash      0.20   4   0.79   8   0.51        0   0.00          0.13   1   1.00   0   0.00        12  0.41
  CVS       0.10   5   0.85   3   0.32        4   0.30          0.20   22  0.82   13  0.41        6   0.51
  Mosaic    0.40   1   0.86   8   0.53        5   0.41          0.40   13  0.86   12  0.49        4   0.39

Table 7-4. Number of false positives and true negatives.

  Technique                        Aero           Bash           CVS            Mosaic
                                   f.p.   t.n.    f.p.   t.n.    f.p.   t.n.    f.p.   t.n.
  Similarity Clustering   ADT      11     5       17     3       27     7       3      4
                          ADO      36     5       32     6       17     1       2      8
  Arch                    ADT      13     4       29     13      36     9       11     10
                          ADO      38     6       46     11      42     17      11     15
  (f.p. = false positives, t.n. = true negatives)

As Figure 7-22 and Figure 7-23 show, Similarity Clustering is always among the techniques with the highest recall rates. However, as Table 7-4 indicates, it also has a higher number of false positives than most approaches. On the other hand, in comparison to its ancestor, the Arch approach, the number of false positives was substantially reduced. Similarity Clustering also needs more computational effort than any of the other approaches. This was the reason why only a few training iterations could be done. The next section discusses how time and space complexity can be reduced.

7.7 Implementation Hints
The basic outline of the clustering algorithm was already given in Figure 7-2 on page 187. Because the algorithm suggests comparing each entity with every other entity, the space complexity is O(n²), where n is the number of entities, and the time complexity to compute Simdirect and Siminformal is O(n²). In the case of Simindirect, the time complexity even seems to be O(n³) because for each pair all neighbors need to be ascertained and the number of neighbors can be n in the worst case. However, analyzing the similarity metric in more detail shows that at least the computation of Simdirect and Simindirect, which requires the most time, is linear. Computing Siminformal is comparatively cheap. Furthermore, the storage needed for saving the similarity between entities, which is useful because the similarity is computationally expensive and needed more than once, can also be reduced by leveraging the fact that most similarities are 0.
In this section, we will refine the algorithm and give some hints on possible optimizations. First, some data structures needed to implement the approach efficiently are described. Then it is shown how Simdirect and Simindirect can be computed in linear time to initialize the matrix used to store the similarities among individual entities. Finally, the algorithm will be refined, and possible optimizations to reduce the space needed for the matrix will be discussed.

7.7.1 Used Data Structures
Let N be the set of nodes {n1, n2, …, nm} to be clustered. We assume that the nodes in N are enumerated and can be identified by a unique number. To represent the disjoint clusters (or groups) of base entities, we can use the union-find data structure and algorithms for a partition¹ S = s1, …, sm of N by Hopcroft and Ullman (1983) that were already introduced in Section 5.2.

¹ A partition S of a set N = {n1, n2, …, nm} is a set of sets si, i = 1…k, such that N = s1 ∪ … ∪ sk and si ∩ sj = ∅ for all i, j ∈ {1…k} with i ≠ j.

In each step of the clustering algorithm, the pair of clusters with maximal similarity has to be found. Instead of iterating over the whole matrix each time to find this maximum, we will use a priority queue to record the pairs with descending similarity. For the priority queue, the following subprograms are assumed (let p be a pair of set identifiers):
• procedure insert (p, s) inserts a pair p of set identifiers into the priority queue with similarity s; if this pair is already present in the queue, the present pair is removed and the new pair is added; only those pairs will actually be added to the queue whose similarity is above the minimal threshold Θ
• function empty_queue is true if the priority queue is empty
• function head returns the pair of set identifiers with the highest similarity
• procedure remove_head removes the pair of set identifiers with the highest similarity

7.7.2 Initialization
Initially, each node of N is put in a set of its own using new_set; hence, we start with a partition S = s1, …, sm where set identifier si = {i} for all i in 1…m. Because the similarity relation is computationally expensive and needed more than once, it is computed only once and stored in a matrix simN with range 1..m, where simN(i, j) denotes the similarity between i and j according to equation (7.3). In the following, the capitalized Sim denotes the similarity metric, whereas sim denotes a similarity matrix.

For the group similarity described in Section 7.2, we can use another matrix simS with range s1, …, sm, where simS(si, sj) denotes the similarity between si and sj. This matrix is initialized with:

simS(find(i), find(j)) = simN(i, j)

Since si initially consists only of i, simS equals simN in the beginning. However, during clustering the values in simS change with the following invariant: simS(si, sj) denotes the average similarity between all nodes in si and sj according to equation (7.2) if si ≠ sj, and is 0 if si = sj. The values in simN will not be changed.

Because the similarity metric is symmetric, the similarity relation can be stored in a triangular matrix. Therefore, the two similarity matrices simS and simN can be stored in one single physical matrix, one in the upper and one in the lower part. Nevertheless, the space complexity for the similarity relations is still O(m²), where m is the number of nodes to be clustered, because we compare any node with any other node.
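A minimal sketch of this initialization (Python, illustrative only; sim_nodes stands for the node similarity of equation (7.3), and all other names are chosen for the example) stores only the non-zero entries of the triangular matrix and starts with singleton clusters:

    def initialize_similarities(num_nodes, sim_nodes):
        # simN: similarities between individual nodes; store each pair (i, j)
        # with i < j only once (triangular matrix) and only if non-zero.
        sim_n = {}
        for i in range(1, num_nodes + 1):
            for j in range(i + 1, num_nodes + 1):
                s = sim_nodes(i, j)          # equation (7.3)
                if s > 0.0:                  # the matrix is very sparse
                    sim_n[(i, j)] = s
        # Every node starts in a singleton cluster, so the cluster similarity
        # simS initially coincides with simN, and every cluster has size 1.
        sim_s = dict(sim_n)
        sizes = {i: 1 for i in range(1, num_nodes + 1)}
        return sim_n, sim_s, sizes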
The time complexity for computing the cells of this matrix is O(m² × k), where k is the maximal number of neighboring nodes (in Simindirect all neighbors of a node pair need to be ascertained). Consequently, the time complexity is O(m³) in the worst case. However, statistical analyses of the number of neighbors per node showed that k is a very low constant, much smaller than m, and does not increase with the system size.

By a closer look at the way similarity between entities is defined, one can see that Simdirect(ni, nj) = 0 (equation (7.18)) for nodes that are not directly connected, and Simindirect(ni, nj) = 0 (equation (7.4)) for nodes that do not have a common neighbor. Let nodes that are direct neighbors of a node n be the first-degree neighbors of n, and let the nodes that are not directly reachable, but only one node away from n, be the second-degree neighbors of n; then Simdirect(n, nj) has to be computed only for nj ∈ first-degree neighbors(n), and Simindirect(n, nj) has to be computed only for nj ∈ first-degree neighbors(n) ∪ second-degree neighbors(n). Since there is only a small number of first- and second-degree neighbors for each node in practice, Simdirect and Simindirect are basically computed in time linear in the number of nodes, i.e., with time complexity O(m). Because these are the computationally most expensive parts of the similarity between nodes, one can reduce the time needed to compute the similarity matrix immensely. However, computing Simdirect and Simindirect is only one part of the matrix initialization. In the case of informal information, each entity is compared to every other entity, no matter whether the entities are neighbors or not. Therefore, the overall time complexity remains O(m²) unless one does not consider informal information. Below, we will also discuss optimizations to reduce the space needed to store the similarity relation, leveraging the fact that the similarity relation is usually greater than 0 for only a few pairs of nodes, i.e., the matrix is very sparse.

7.7.3 A Refined Clustering Algorithm
The outline of the clustering algorithm in Figure 7-2 on page 187 can be refined to the one in Figure 7-4. In this refined algorithm, it is still open how the pair with the maximal similarity can be found efficiently. Instead of iterating over the matrix simS each time it has changed, and thus ending up with a cubic algorithm, one can use the priority queue introduced in Section 7.7.1 to keep track of the order of similarity. Using the priority queue, one can rewrite the algorithm in Figure 7-4 as in Figure 7-5. Initially, the queue is filled with all pairs in the similarity matrix. The procedure recompute in this algorithm recomputes the similarities of the newly united cluster to all other clusters. During this recomputation, all recomputed pairs will be added to the priority queue; all obsolete similarity values for the recomputed pairs are thereby removed.

Equation (7.2) reduces the similarity of two clusters to the average similarity between their members. Its computation becomes increasingly expensive as the involved clusters grow. An equivalent, but less costly approach is to compute the new similarity for the union from the similarities of the two clusters to be united using the following equation:

(7.29)    Sim(A ∪ B, C) = (|A| · Sim(A, C) + |B| · Sim(B, C)) / (|A| + |B|)

The equivalence of (7.29) and (7.2) can be proven by induction.

Algorithm 7-4. Refined clustering algorithm.
    Input:
    • a partition S
    • similarity matrix simS for the elements in the partition S
    • a minimal threshold Θ for the union of clusters
    Output:
    • a new partition for S
    • a tree that describes the hierarchical clustering for S

    while ∃ (si, sj): simS(si, sj) > Θ loop
      let p = (si, sj) where simS(si, sj) > Θ is maximal;
      union (si, sj);
      add a subtree with children si and sj to the clustering tree;
      recompute (si, sj);
    end loop;

Algorithm 7-5. Optimized refined clustering algorithm.
    while not empty_queue loop
      let (si, sj) = head;
      remove_head;
      union (si, sj);
      add a subtree with children si and sj to the clustering tree;
      recompute (si, sj);
    end loop;

Algorithm 7-6. Algorithm recompute.
    Input:
    • a pair of set identifiers (si, sj) whose similarities to all other clusters have to be recomputed
    Algorithm:
    for all sk in S where k ≠ i ∧ k ≠ j loop
      simS (find (si), sk) := Sim (find (si), sk) using equation (7.2) on page 188;
      -- find (si) denotes the union of si and sj
      insert ((find (si), sk), simS (find (si), sk));
    end loop;
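Building on the structures of the previous sketch, the following illustrative Python rendering of Algorithm 7-5 draws pairs from a priority queue in order of descending similarity, skips entries that have become stale, and recomputes the similarity of a union to the remaining clusters with the weighted average of equation (7.29). It records only the merge steps instead of an explicit dendrogram and ignores informal information; the names are again invented for the example.

    import heapq

    def key(a, b):
        # Store each pair with the smaller identifier first (triangular matrix).
        return (a, b) if a < b else (b, a)

    def cluster(sim_s, sizes, threshold):
        # sim_s: dict mapping key(a, b) to the similarity of clusters a and b.
        # sizes: dict mapping each cluster identifier to its number of entities.
        heap = [(-s, a, b) for (a, b), s in sim_s.items() if s > threshold]
        heapq.heapify(heap)
        alive = set(sizes)
        merges = []                                   # (a, b, similarity) per union
        while heap:
            neg_s, a, b = heapq.heappop(heap)
            # Skip entries made obsolete by an earlier union or recomputation.
            if a not in alive or b not in alive or sim_s.get(key(a, b)) != -neg_s:
                continue
            merges.append((a, b, -neg_s))
            alive.discard(b)                          # a now represents a ∪ b
            # recompute: similarity of the union to every remaining cluster (7.29).
            for c in alive - {a}:
                s = (sizes[a] * sim_s.get(key(a, c), 0.0)
                     + sizes[b] * sim_s.get(key(b, c), 0.0)) / (sizes[a] + sizes[b])
                sim_s[key(a, c)] = s
                if s > threshold:
                    heapq.heappush(heap, (-s, a, c))
            sizes[a] += sizes[b]
        return merges

For instance, cluster(sim_s, sizes, 0.3) keeps merging until no pair of clusters with a similarity above 0.3 remains, which corresponds to the minimal threshold Θ of Algorithm 7-4.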
Begin of induction. The equivalence of (7.29) and (7.2) on page 188 is obviously true when A, B, and C are all disjoint sets with only one element:

    GSim(A ∪ B, C) = GSim({a, b}, {c})
                   = (Sim(a, c) + Sim(b, c)) / (|{a, b}| · |{c}|)
                   = (|{a}| · Sim(a, c) + |{b}| · Sim(b, c)) / (|{a}| + |{b}|)
                   = Sim(A ∪ B, C)

Assumption of induction. To see that this equivalence holds even for disjoint sets with more than one element, let us assume that:

(7.30)    Sim(A, C) = GSim(A, C) = ( Σ_{a ∈ A, c ∈ C} Sim(a, c) ) / (|A| · |C|)

(7.31)    Sim(B, C) = GSim(B, C) = ( Σ_{b ∈ B, c ∈ C} Sim(b, c) ) / (|B| · |C|)

Step of induction. Inserting equations (7.30) and (7.31) into (7.29) leads to (note that A and B are always disjoint):

    Sim(A ∪ B, C) = ( |A| · ( Σ_{a ∈ A, c ∈ C} Sim(a, c) ) / (|A| · |C|)
                      + |B| · ( Σ_{b ∈ B, c ∈ C} Sim(b, c) ) / (|B| · |C|) ) / (|A| + |B|)
                  = ( Σ_{a ∈ A, c ∈ C} Sim(a, c) + Σ_{b ∈ B, c ∈ C} Sim(b, c) ) / ((|A| + |B|) · |C|)
                  = ( Σ_{u ∈ U, c ∈ C} Sim(u, c) ) / (|U| · |C|)   where U = A ∪ B and A ∩ B = ∅
                  = GSim(A ∪ B, C)

Conclusion of induction. Hence, equation (7.29) indeed computes (7.2) when (7.30) and (7.31) hold. By induction, we conclude that (7.29) computes (7.2).

The advantage of using (7.29) instead of (7.2) is not only that it is easier to compute, but also that there is no need to keep the original values for the similarities of entities, simN, that were necessary to compute (7.2). Thus, we can save the space for the similarity matrix simN.

7.7.4 Matrix Representation
Because the similarity between nodes as defined by equation (7.3) is greater than 0 only for a few pairs of nodes (actually, as already discussed, if informal information is not considered, it is definitely 0 for all nodes that do not have a common neighbor and are not directly linked), the matrix to store the similarity relation is very sparse. Instead of allocating a quadratic matrix, it is more reasonable to represent the matrix as linked nodes as outlined in Figure 7-24. Each cell of the matrix whose value is greater than zero is part of two lists: one list for its row and one for its column.
A node in this data structure has the following components:
• a real number for the value of this cell
• a column pointer to the next cell in this column
• a row pointer to the next cell in this row
• the column and row index of this cell in the original matrix; this is necessary to identify the cell during row or column traversal

The matrix as a whole is then represented by a column and a row vector whose fields point to the first node in the column or row, respectively.

Figure 7-24. Sparse matrix representation. (A matrix over nodes n1 to n6 represented by a column vector and a row vector of list heads; each non-zero cell is linked into one column list and one row list.)

One could argue that the cost of accessing cell values is a high price that we have to pay for this sparse representation. This would be true if we accessed the matrix randomly. Fortunately, this is not the case. Only when we merge two clusters do cells have to be accessed, namely to recompute the similarity of the union of these clusters to all the remaining clusters. The last section proposed to use (7.29) on page 243 to recompute the similarity relation for Sim(A ∪ B, Ci), which requires recomputing the shaded parts in the matrix in Figure 7-25(a) by reading the values in sim(A, Ci) and sim(B, Ci) for all Ci ≠ A, B. After the merge of A and B, one of them ceases to exist while the other represents the union of A and B from then on. Let us assume that the columns and rows of A are used to store the values of Sim(A ∪ B, Ci). Figure 7-25(a) graphically pictures three recomputations of the similarity of A ∪ B to C1, C3, and C4 where a complete matrix is assumed. The recomputation proceeds along the columns for A and B and combines the two values at the corresponding row index for the respective Ci. If a complete matrix is assumed, we would also have to recompute the rows of A and B accordingly. However, since it is de facto a triangular matrix without the diagonal, the recomputation actually advances as illustrated in Figure 7-25(b): When the cell for sim(X, Y) does not exist, the value of sim(Y, X) is used instead.

Figure 7-25. Access pattern for triangular similarity matrix. (Panels (a) and (b): recomputation of sim(A ∪ B, Ci) = f(sim(A, Ci), sim(B, Ci)) for C1, C3, and C4 in a complete and in a triangular matrix.)

Thus, there is a regular access pattern to the cell contents of a triangular similarity matrix, obtained by iterating simultaneously over the rows and columns of A and B as shown in Figure 7-26(a), that allows an implementation of the rows and columns as linked lists.

Figure 7-26. Similarity matrix before and after recomputation due to union. (Panels (a) and (b): the matrix before the union and after the union, when A has taken over the rows and columns of the united cluster and the rows and columns of B have been removed.)
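A direct rendering of this representation (Python, illustrative only; the class and attribute names are invented for the example) threads every stored cell into one list per row and one per column; cells whose value has been reset to 0.0 are skipped during traversal, which anticipates the lazy removal described below.

    class Cell:
        # One non-zero cell of the sparse triangular similarity matrix.
        def __init__(self, row, column, value):
            self.row = row                   # row index in the original matrix
            self.column = column             # column index in the original matrix
            self.value = value               # the similarity stored in this cell
            self.next_in_row = None          # next cell of the same row
            self.next_in_column = None       # next cell of the same column

    class SparseMatrix:
        def __init__(self, size):
            self.row_head = [None] * (size + 1)      # first cell of each row
            self.column_head = [None] * (size + 1)   # first cell of each column

        def insert(self, row, column, value):
            # Prepend a new cell to its row list and to its column list.
            cell = Cell(row, column, value)
            cell.next_in_row, self.row_head[row] = self.row_head[row], cell
            cell.next_in_column, self.column_head[column] = self.column_head[column], cell
            return cell

        def row_cells(self, row):
            # Traverse a row, skipping cells that were lazily deleted (value 0.0).
            cell = self.row_head[row]
            while cell is not None:
                if cell.value != 0.0:
                    yield cell
                cell = cell.next_in_row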
We can arbitrarily choose one of A and B to store the new similarity values for the union of A and B; the rows and columns of the other cluster of the union can be released. In the example of Figure 7-26(b), we have chosen A as the new representative for the union, and the rows and columns of B are removed. This will speed up future traversals. In the data structure we have chosen for the matrix, each cell node is a member of two lists: one for the column, one for the row. Since these lists are only linked in one direction, the nodes cannot easily be removed from the lists. We reject doubly linked lists due to the space overhead. The easiest solution to this problem is just to set the cell value to 0.0 and to remove cells with a 0.0 content on the fly during future traversals.

7.7.5 Implementing the Priority Queue
The priority queue can be implemented as an integral part of the similarity matrix by a list of nodes in the matrix, as illustrated by Figure 7-27. The head of the list represents the cell with the highest value. When two clusters are united, one can sort the newly computed nodes and then traverse the priority queue, replacing the changed nodes by their newly computed values from the sorted list of nodes. The traversal of the priority list stops when all newly computed nodes are at their appropriate place. This way, only one (partial) traversal of the priority queue is necessary.

Figure 7-27. Sparse matrix representation combined with priority queue. (The sparse matrix of Figure 7-24 with an additional priority-queue link threading the cells in order of descending similarity.)

7.8 Differences from Previous Approaches
Similarity Clustering differs from the other base techniques described in Chapter 5 in that it generates candidates using a metric which provides a more refined view of the relations between program entities. The previous approaches compute connected components or other fixed patterns in a graph. In contrast, whether these patterns are present or not, Similarity Clustering deals with continuous values which reflect the degrees of similarity. Schwanke defines a similarity metric used to group functions into modules (1991, 1994). Our new similarity clustering approach exploits these nuances, but extends the metric in the directions summarized in Table 7-5. Further refinements that took place after Jean-François Girard, Georg Schied, and I jointly developed and published these extensions in 1997 are described in column "My Additions".

Table 7-5. Differences from previous approaches.
• clustering method: Schwanke's approach: hierarchical clustering algorithm / our approach 1997: non-hierarchical clustering algorithm / my additions: hierarchical clustering algorithm
• domain: Schwanke: routines / 1997: routines, user-defined types, global variables
• weights: Schwanke: Shannon information for non-local names / 1997: Shannon information for partners (globally established), edge weights reflecting the modality / additions: refined Shannon information (established considering modality and partner)
• features: Schwanke: usage of non-local names including macros / 1997: type relationships, reference relationships, call relationships, no macros / additions: roles, references to record components, same-expression, indirect links
• common and distinct: Schwanke: common and distinct / 1997: common and distinct / additions: common separated into Commoneq and Commonne
• direct links: Schwanke: values = 0 or 1 / 1997: continuous values between 0 and 1 / additions: continuous values ≥ 0
• informal information: Schwanke: not used / 1997: tokens in identifiers / additions: pre- and postfix in identifiers, organization of files
• similarity between groups: Schwanke: maximum similarity between elements (1991), k-nearest neighbors (1994) / 1997: average similarity between elements
• establishment of feature weights: Schwanke: neural networks / 1997: systematic hand tuning (grid search) / additions: contingency table, Gauß-Seidel approach, simulated annealing

7.9 Summary
Similarity Clustering is the most general approach described in this thesis. It can detect abstract data types, abstract data objects, and hybrid components, as well as groups of related routines. All connection-based techniques described in Chapter 5 can be subsumed under Similarity Clustering. Similarity Clustering goes beyond other approaches in that it also considers relations to common third entities and informal aspects. Similarity Clustering can be used in two different modes: search for specific user-defined patterns and search for similar patterns of already found atomic components.
Whereas connection-based techniques always yield the same candidates, Similarity Clustering can be adjusted by the maintainer to different search patterns by changing its edge weights. The adjustable parameters of Similarity Clustering offer more flexibility. On the other hand, when the maintainer wants to search for atomic components similar to those already found, these parameters can be automatically calibrated from the set of known components using traditional optimization techniques, such as simulated annealing or Gauß-Seidel optimization. The sample used to calibrate Similarity Clustering can be ascertained with other techniques or with Similarity Clustering itself after specific user-defined adjustments have been applied. Experiments with calibration methods for the subject systems used in this thesis indicate that a sample of 20-30% of the groupable base entities of a system grouped to atomic components is a sufficient training set (a base entity is said to be groupable if it actually belongs to an atomic component; recall that not all base entities were grouped to components by the software engineers for our subject systems). However, the data have not shown that one could improve the recall rate of Similarity Clustering by using larger samples. This is probably because of the diversity of characteristics among atomic components. For example, some atomic components may be properly encapsulated such that high weights for record components and variable references will yield good results. Others may be permissive atomic components for which higher weights for record components and variable references will also add many non-accessor functions that break the information hiding principle.

Similarity Clustering is one of the most effective techniques as far as the recall rate is concerned. On the other hand, it also has more false positives than other approaches (except for Arch, which has even more false positives). In earlier variants, the number of false positives was even worse (Girard et al. 1997c). My additions described in Table 7-5 in the last section resulted in a substantial reduction of the false positives in comparison to previously published results.
Another advantage of Similarity Clustering, as a hierarchical clustering method, is that it yields a dendrogram of clustered entities instead of a set of flat candidates. This is particularly useful for validation. In the quantitative evaluation of Similarity Clustering, branches of the dendrogram were cut and converted into candidates using the same similarity threshold. However, this assumes that the same threshold is suitable for all components. Using a single threshold was necessary for a fair comparison with other automatic techniques; in an interactive approach, one does not need a threshold at all. Hence, fewer false positives can be expected for a hierarchical view.

There are also some drawbacks of Similarity Clustering. For all techniques other than Similarity Clustering, there is one single criterion used for clustering. Hence, the reason why a technique has grouped entities together is obvious. This is less obvious for Similarity Clustering when the similarity metric considers several aspects at the same time. This complicates validation of candidates proposed by Similarity Clustering.

When Similarity Clustering considers informal information, it may happen that entities are clustered that are not even transitively connected to each other, just because they have similar names. This may be useful when groups of related subprograms are to be detected. However, for ADT and ADO detection, the entities are always at least transitively connected via call, type, or reference relationships. Fortunately, unconnected entities can easily be filtered from candidates if this is necessary. In the quantitative comparison reported in this chapter, a filter for candidates with unconnected entities was not used. Using such a filter will probably lead to fewer false positives. Furthermore, future extensions of Similarity Clustering should try to use informal information only for nodes that are either first- or second-degree neighbors. Then, informal information would only be an additional hint but not a sufficient criterion for two nodes to be in the same component. This would also reduce the time complexity for computing the similarity matrix, as discussed next.

The computational effort needed for Similarity Clustering is higher than for all other techniques. This is mainly due to establishing the similarities among the entities, while the clustering as such is comparatively fast. Section 7.7 gave hints on how the complexity can be reduced. It turned out that time and space complexity for Similarity Clustering are basically linear in the number of entities, n, when informal information is excluded (assuming a constant upper limit on the number of neighbors an entity can have). However, when informal information is used, each entity has to be compared to every other entity, resulting in a time complexity of O(n²). If the proposal above to use informal information only for first- and second-degree neighbors is put into action, however, the complexity can be reduced to O(n).

Part III
The Semi-Automatic Method

Chapter 8
Combined and Incremental Techniques

In the previous chapters, the basic techniques and their evaluation were described. The evaluation has shown that none of the basic techniques has a detection quality that compares to human judgement. Therefore, further improvements are necessary. Advances can be expected from combinations of the basic techniques. Another avenue to progress in atomic component detection is to integrate the maintainer in the detection process.
Furthermore, new techniques could be invented or existing techniques refined by means of control and data flow analyses, for example. However, before new techniques are tackled, possible improvements by combinations of existing techniques should be explored first. Moreover, the maintainer has to validate the candidates proposed by automatic techniques at any rate, because we can rarely expect perfect recall from any automatic technique due to the vague criteria and the sometimes subjective rules for constituting atomic components. That is, integrating the maintainer is needed in any case.

For these reasons, this thesis presents only generic ways of combining existing techniques and leaves new techniques to further research. The combinations are designed with applicability to a semi-automatic detection process with human intervention in mind. Chapter 9 proposes a possible semi-automatic method that uses the extensions discussed in this chapter.

8.1 Ways of Combinations
There are basically two ways of combining the heuristics described in Chapter 5: They can either be integrated technically, such that the underlying heuristics of several basic techniques are implemented by one combining analysis, or the results of the techniques, and not the techniques as such, are combined. The latter strategy has the advantage of more flexibility in both practical operation and implementation: The user can decide on his own on the selected heuristics as well as the order of their application, and the developer of new analyses can write the analyses independently; otherwise, adding a new analysis to an existing suite of N analyses would require developing N combinations of the new analysis with the existing ones if pairs of combinations are considered. If more than two techniques are to be combined, the implementation effort is even worse.

That is why I propose to combine the results of the techniques instead of the techniques themselves. Manifold combinations are possible by means of a few combining operators such as union, intersection, and difference operators for components views. Chapter 9 describes a semi-automatic method in which the maintainer uses these operators to combine and tailor the basic methods for his own use. The operators are simple enough that the maintainer can compose them by simple mouse clicks.

One of the operators, namely the composition operator, applies two techniques successively. The output of the first technique is fed into the second technique. As a consequence, the second technique has as input a description of the system that does not only contain the base entities and their relationships but also a set of atomic components already detected by the first technique. The second technique must be prepared for this, i.e., we need incremental versions of the basic techniques.

Incremental analyses are not only useful for intermediate steps in combinations of techniques but also for the interactive method described in Chapter 9. The maintainer validates the candidates proposed by the diverse techniques and in doing so produces a partial description of the system's atomic components. The techniques then find new atomic components based on this partial description. Chapter 9 will elaborate more on that. In this chapter, however, we discuss how the information provided by the user can be captured and used by the incremental analyses.
This chapter will also describe a voting approach as a complement to the combining operators, in which the agreement of each technique with a given component, i.e., its underlying heuristic, is expressed as a metric. An atomic component candidate can then be assessed by summing up the agreements of the individual techniques to it. 8.2 User Information Validating the candidates, the user has several choices: He or she can com- plete, accept, and reject partially and entirely. All this information has to be kept for the next iteration of an iterative detection process because con- firmed information should not be presented for validation once more. For the purpose of representing components accepted by the user we can adopt the means introduced in Section 3.2.1 and Section 3.2.2 to represent atomic components and subsystems proposed by techniques. In this section, we will also introduce a few extensions to capture other aspects of user infor- mation. 8.2.1 User Information and Components Views To distinguish candidates from reference components, we can make use of the resource usage graph views introduced in Section 3.6. To capture the point of view of the user, we use one dedicated view (per user), called the user view in the interactive method described in Chapter 9. The user view contains all information added by the user, inclusive acceptance of results of automatic analyses. Therefore, in order to ascertain whether something has already been established or rejected, we can simply consult this view. A view that shows the decomposition of components is called components view (see Section 3.6). The user view is one example; all the results of the basic techniques can likewise be represented by components views. There are basically two kinds of information contained in a components view (possibly added by the user): positive and negative information. Positive information is any decision of the user that certain elements belong Combined and Incremental Techniques 258 together; it is considered negative information when the user decides that certain elements do not belong together. We can use components views and the formalism introduced in Section 3.2.1 and Section 3.2.2 to represent positive information: The fact that the elements belong together is simply added to the user view in a relational fashion. However, we have not introduced means to express negative information yet. We will represent this fact by a new symmetric mutually exclusive relationship: mutually-exclusive (a, b) expresses that entity a and entity b must not be direct elements of the same component, nor may one be a direct part of the other one. Note that a and b may also be components, i.e., atomic components or sub- systems, in the definition above. Note also that the definition above does not exclude the example in Figure 8-1 in which a and b are both transitive parts of C (a is even a direct element of C) though a and b are mutually exclusive. The defi- nition above specifies only that a and b must not be part of the same component; otherwise, mutually-exclusive edges could not be used to separate elements within the same subsystem. The user could, for example, decompose the whole system into subsystems with one single root; in this case, no mutually-exclusive edges could be used at all. A components view will be used as additional input parameter to the combining operators and incremental analyses in order to describe the currently established component decomposition. 
Likewise, the result of operators and analyses is also a components view. Since the operators and analyses do not have to care whether the input components view stems from the user or was generated by previous analyses, we can define the operators and analyses in the following in a uniform way, i.e., we need not care whether they are applied to the user view or in a com- Figure 8-1. Example for mutually-exclusive. C A ba mutually-exclusivepart-of 259 User Information position of techniques. Since all analyses produce components views, we can eas- ily combine the results by operators described below. The next section describes the assumptions we are going to make about compo- nents views that have to be maintained by the operators and incremental analyses. 8.2.2 Constraints for Components Views Components views were already introduced in Section 3.6 (Table 3-5 on page 68). Here, we extend components views to incorporate mutually exclusive edges. Therefore, the following additional assumptions will be made about com- ponents views: • Nodes: A components view may contain subsystems, atomic components, and base entities. • Edges: A components view may contain part-of and mutually exclusive edges among base entities, atomic components, and subsystems. • Constraints: - A base entity is only in the components view if it is part of a component in the view or if it is an end of a mutually exclusive edge. - The constraints for subsystem structures as stated in Section 3.2.2 (may over- lap, is acyclic, and has no redundant part-of edge) hold. 8.2.3 User Actions Altogether, the user can manipulate a components view in the following ways: • creation: create a new component • assignment: add an entity to a component • rejection: remove a bound entity from its component (which does not imply that the entity and its component are mutually exclusive from now on) • exclusion: mark two entities as mutually exclusive (which implies rejection if one is part of the other one) • confirmation: confirmed information is added to the user view Note that the user is offered two ways of removing an entity from a component. If the entity is only rejected, the entity may be re-added to the component by subse- Combined and Incremental Techniques 260 quent analyses that take the current components views as input, which is useful when the user is unsure and wants to see whether another technique would also add the entity to the component. If the user is sure that the two entities do not belong together, he can mark the entities as mutually exclusive. All user actions, excluding confirmation, affect only the components views manipulated by the user. Only on confirmation, information is transmitted to the user view and may affect newly started analyses. 8.3 Combining Operators The heuristics we have described in Chapter 5 use diverse strategies to find atomic components. Some of them may be particularly powerful for certain kinds of atomic components, but rather weak for others. Some of them are not even able to detect certain kinds of atomic components at all; for example, the Part Type heuristic is not able to detect ADOs. It makes sense to combine these techniques in order to leverage individual strengths of the techniques and to compensate their weaknesses. 
8.3 Combining Operators
The heuristics we have described in Chapter 5 use diverse strategies to find atomic components. Some of them may be particularly powerful for certain kinds of atomic components, but rather weak for others. Some of them are not even able to detect certain kinds of atomic components at all; for example, the Part Type heuristic is not able to detect ADOs. It makes sense to combine these techniques in order to leverage the individual strengths of the techniques and to compensate for their weaknesses.

The base techniques are basically functions that take a view as input and produce a components view containing the atomic components detected, denoted as:

T: V → V

(where T is a technique for atomic component detection and V is the set of views), and the application of a technique is denoted by T(V) (this will be refined in Section 8.3.1). Therefore, we can use functional composition to combine the techniques: The result view of one technique is the input view of another technique. Since the results are basically sets of atomic components, we can use union and intersection as further ways of combining them. Thus, the following operators can be used for the combination of the results of the basic techniques (let V, V1, and V2 be components views and T1 and T2 be techniques):
• union (V1, V2) = V1 ∪ V2
• intersection (V1, V2) = V1 ∩ V2
• composition (V, T1, T2) = T2 (T1 (V))
These are the core operators. In the following sections, we will discuss them in more detail. Here we give only a brief overview. The union operator is useful for techniques that produce very different kinds of atomic components and is, therefore, especially suited to combine techniques that are restricted to one class of atomic component. For example, Delta IC can only detect ADOs and Part Type only ADTs. Applying the union operator to these two heuristics allows the detection of both kinds of atomic components.

The intersection is used to reveal the agreement of two techniques: Only atomic components detected by both techniques will be present in the resulting view. This gives us higher confidence in the resulting components. A good example is the intersection of Part Type with Internal Access. Remember that Part Type assumes that the parameter of the part type in the signature is put into or retrieved from the parameter of the container type. This can only be the case when the parameter of the container type is internally accessed.

The composition of techniques is the consecutive application of two techniques. The application of the second technique tries to group base entities that were not grouped by the first technique. The second technique can create new atomic components or add not yet bound entities to atomic components detected by the first technique.

The exact definitions of these combining operators will follow. At this point, we only point out that applying the set operators union and intersection is more than just composing, uniting, and intersecting the sets of atomic components represented by views in the sense of set theory. For the intersection, for example, we can hardly expect to find two exactly equal atomic components by both techniques. A simple definition of intersection according to set theory would therefore yield empty result views in most cases.

Composition, intersection, and union are the core operators to combine the techniques, but there are other useful operators. Because some techniques consider variables and types, it may be desirable for their combination with other techniques to restrict them to one kind of base entity. For example, one might want to apply Internal Access to detect ADTs only and to apply Delta IC to detect ADOs.
Therefore, we add the following unary operator:
• restriction (V, k) = Vk, where Vk is the view V restricted to entities of kind k
Since it generally does not make sense to exclude subprograms for the combination of techniques, k denotes either variables or types; the resulting view will always contain all the subprograms of the non-restricted view. Internal_Access (restriction (V, Type)), for example, will apply Internal Access to the view VType that contains all the subprograms and types of V but not the variables of V. In order to apply Delta IC to detect ADOs and Internal Access to detect ADTs in view V, for example, one can compute:
• union (Delta-IC (restriction (V, Variable)), Internal-Access (restriction (V, Type)))
Finally, a difference operator may also be useful in an interactive approach. It allows the maintainer, for example, to inspect the difference between his own point of view, i.e., the set of atomic components he has specified, and the set of candidates generated by an analysis:
• difference (V1, V2) = V1\V2 ∪ V2\V1
These operators allow powerful yet simple combinations of base techniques. Yet, not all possible combinations may be sensible; in particular, one has to be aware that multiple intersections may result in empty sets of atomic components and multiple unions may produce many overlapping atomic components. In the following sections, we will discuss the combining operators in more detail.
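The shape of these combinations can be sketched with ordinary function composition (Python, purely illustrative; delta_ic, internal_access, union_views, and restricted_to are placeholders, since the real operators work on resource usage graph views and, as noted above, are more than plain set operations).

    def compose(view, first, second):
        # composition (V, T1, T2) = T2(T1(V)): the components found by the first
        # technique become part of the input of the second, incremental technique.
        return second(first(view))

    def restrict(view, kind):
        # restriction (V, k): keep all subprograms but only base entities of kind k.
        return view.restricted_to(kind)

    def detect_ados_and_adts(view, delta_ic, internal_access, union_views):
        # The example combination from the text: Delta IC for ADOs on the variable
        # view, Internal Access for ADTs on the type view, results united.
        ados = delta_ic(restrict(view, "Variable"))
        adts = internal_access(restrict(view, "Type"))
        return union_views(ados, adts)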
8.3.1 Composition
The composition operator applies two techniques consecutively. The output of the first technique is the input of the second technique. The second technique may add entities that are still free to the existing components (and create new components as well). Hence, the composition operator presumes incremental techniques.

The base techniques as introduced in Chapter 5 have only one input parameter, namely, the base view. Recall that the base view consists of all base entities and their relationships that can be directly derived from source code (see Table 3-5 on page 68). The incremental versions of these techniques have an additional parameter: the components view that describes the atomic components that have already been found. The result is a components view meeting the constraints listed in Section 8.2.2. Hence, the application of an incremental technique, T, can be described by a function:

T: V × V → V × V

(where V is the set of views), and the application of a technique is denoted by T(V1, V2), where V1 is the specific input view for the technique that describes the base entities and relationships considered by T and V2 is a components view. Note that T(V1, V2) = (V1, V2') holds, i.e., the first input view is not changed. We will call base entities bound if they are already part of a component in the input components view. All other entities are considered free.

8.3.1.1 Restrictions Imposed by the Interactive Employment
The composition operator is used to support the semi-automatic method that is going to be described in Chapter 9. The interactive employment of the operator imposes further restrictions on the allowable operations on components views:
(R1): Only free entities may be grouped.
(R2): If a free entity, E, is to be clustered with a bound entity, F, entity E is to be added to the enclosing components of F.
The restrictions will be motivated in the following.

Ad (R1). There are basically two kinds of scenarios for the application of the composition operator:
1. the operator is applied to the user view
2. the operator is applied to an intermediate view (a view that results from analyses or combining operators)
When applied to the user view, the operator must preserve all information contained in the user view; in other words, the operator must not remove bound entities or reorganize existing components because the user has previously approved the elements of the user view. In principle, it would also be useful for a more general usage within maintenance to allow the analyses to restructure existing components; the analyses would then propose alternative decompositions of existing components. However, the focus of this thesis is on recovery rather than restructuring. Nevertheless, the thesis provides building blocks to support restructuring. A simple way to support restructuring would be to allow clustering of all entities of the user view; then, the difference operator can be applied to the user view and the resulting view in order to investigate discrepancies. Another way to analyze an existing decomposition is to let the analyses rate the components as described in Section 8.4. Because the user has previously approved the elements of the user view and because the focus of the operators described in this thesis is on recovery, an incremental technique may only create new components comprising free entities and add free entities to existing components, i.e., a bound entity must not be removed from its components nor may it be added to another component (because this would be another kind of restructuring).

The arguments against allowing an incremental technique to regroup or remove bound entities also hold for the second scenario, in which the operator is applied to intermediate views. The operator is only applied by the user, and if she selects a components view to be refined by the composition operator, we may assume that she agrees with the decomposition within the intermediate view; otherwise she would have modified the intermediate view by hand or rejected the intermediate view as a whole.

Ad (R2). Furthermore, during clustering, an analysis may decide to group a free entity, E, with another entity, F. If F is also a free entity, the two of them may be used to create a new atomic component when all free entities are clustered. However, if the input view of the composition operator is not empty, F may already be bound. Because bound entities cannot be removed from their enclosing components (as discussed above), there are basically two options for how to deal with a free entity, E, and a bound entity, F, that the analysis wants to group together:
1. E and F are used to create a new component (but F also remains a part of its enclosing component)
2. E is added to the enclosing component of F
The first option is not possible due to restriction (R1), which implies that a bound entity must not be added to another component. However, the second option is not only preferred because the first option is excluded by (R1). First, if restriction (R1) were not in place and E and F were added to a new component, the new component would overlap with the enclosing components of F, which would force the user to check the addition of F to the new component even though she has already decided on F. Second, during the course of clustering, an incremental technique whose input components view is empty (i.e., all entities are free) puts E and F into the same new cluster if they are related.
The same would be expected of an incremental technique whose input component view is not empty. For example, if the base entities a, b, and c in Figure 8-2(a) are related in the sense of the heuristic under- lying a technique, T, that is to be applied, T will create a new atomic component, AC, that consists of a, b, and c as sketched in Figure 8-2(b). If the user decides to reject b in the result of technique T (Figure 8-2 (c)), AC will no longer contain b (Figure 8-2 (d)). However, because b was only rejected temporarily and not excluded (see Section 8.2.3), one would expect that T will add b again to AC when applied to Figure 8-2 (d). This effect is only achieved by the second alterna- tive that adds b to the enclosing component of a and c. The second alternative is also appropriate if two different techniques are applied successively. Consider the example in Figure 8-3 in which a composition of two techniques S and T is computed based on the second alternative. Only a and c are related with respect to technique S and therefore grouped by S into AC. When T is applied to the intermediate result in Figure 8-3(b) and b is related to c according to the clustering criterion of T, c is already bound to AC. Following the second Figure 8-2. Successive application of the same technique. a b c A related a b capply T apply T Aa b c reject Aa b c apply T (a) (b) (c) (d) Combined and Incremental Techniques 266 alternative, T groups b to AC as well. The second alternative allows to succes- sively completing an atomic component using different criteria, i.e., different techniques can complement each other, while the first alternative would always generate a new component and assumes that one criterion is sufficient for entities to be in the same component. For these reasons, the second alternative is pre- ferred. 8.3.1.2 Algorithm for the Composition Operator The composition can be organized in the following steps (the term cluster is used here to make clear that the individual techniques generate sets of related elements that only become components in the last stage of the composition): 1. Iterate over the free entities and cluster them. The result are clusters with related-to information (see below). 2. Split all clusters so that there are no mutually exclusive entities in the same subcluster where each subcluster contains only entities that are (transitively) related to each other. 3. Transform clusters into components. The first step depends on the analysis. It is described as composition in the fol- lowing sections. The second and third steps are identical to all analyses and are explained in Section 8.3.1.6 and Section 8.3.1.7. An advantage of organizing the composition this way instead of letting the analyses be in charge of mutually exclusive entities is that the analyses do not have to take care of negative informa- tion. Other consequences of this decision will be discussed in Section 8.3.1.6. We will discuss the diverse classes of approaches, namely, connection-based, metric-based, and graph-based techniques (as described in Section 5.13) sepa- rately. The clusters generated by all techniques are sets of related entities which Figure 8-3. Successive application of different techniques. a c apply T (c)(a) b a c b A a c b Aapply S (b) related w.r.t. S related w.r.t. T 267 Combining Operators are to be grouped together. 
More precisely, a cluster is represented as a connected graph, so-called related-to graph, whose nodes are the entities to be grouped together and whose edges represent the symmetric related-to relationship that specifies which entities are related to each other in the sense of the underlying heuristic of the technique (the related-to relationship is a virtual relationship defined by each technique in its own specific way in the following and, therefore, not represented in the resource usage graph; one can think of the related-to graph as a separate data structure distinct from the resource usage graph). All elements of a cluster are at least transitively related to each other; otherwise, the techniques would not have grouped the elements together in the first place. However, the related-to annotation is needed when the cluster is split into subclusters due to mutually exclusive entities. Then, a subcluster may only contain related entities and no mutually exclusive entities. The related-to information also plays a role when clusters are transformed into components by step 3 as it will be explained in Section 8.3.1.7. 8.3.1.3 Composition for Connection-based Techniques Connection-based approaches cluster entities based on a specific set of direct relationships in between entities to be grouped. They differ only in the types and characteristics of the relationships they consider. These relationships are present in the base view (or in a subview of the base view, respectively). The techniques iterate over the free entities in the base view and collect the connected entities that could be grouped. These connected entities themselves may be either bound or free. Free connected entities will belong to the same cluster of the entity that is under consideration. Bound connected entities cannot be grouped again because they already belong to an atomic component. Instead, the entity under consider- ation should belong to the same atomic component the connected bound entity belongs to (or the set of atomic components if the connected entity is bound to more than one). This will technically be solved in algorithm 8-1 by adding the enclosing atomic component(s) in lieu of the connected entity. Initially, each entity is put into a set of its own using the disjoint sets algorithm already introduced in Section 5.2. For simplification, we assume that all entities (bound or free) and atomic components are enumerated from 1 to Last_Entity. The generic parameter Connected_Entities of this algorithm is specific to each connection-based approach and yields all the base entities that could be grouped Combined and Incremental Techniques 268 Algorithm 8-1. Generic incremental connection-based clustering. Generic Parameter: • Connected_Entities : Entity ® Set of Entities Input: • base view B • components view A Output: • set of disjoint clusters with related-to information Algorithm 1. Put each free entity and component in a set of its own: for E in 1..Last_Entity where not Is_Bound (E, A) or Component (E) loop New_Set (E); end loop; 2. Iterate over free entities and cluster connected entities: for E in 1..Last_Entity where not Is_Bound (E, A) and not Component (E) loop for C in Connected_Entities (E) loop if not Is_Bound (C, A) then Union (Find (E), Find (C)); add related-to (E, C); else for AC in Enclosing_Components (C, A) loop Union (Find (E), Find (AC)); add related-to (E, AC); end loop; end if; end loop; end loop; 3. Result: Each disjoint set represents a cluster that constitutes a candidate. 
269 Combining Operators according to the underlying heuristic of the approach. The function Enclosing_Components returns a set of atomic components and subsystems to which the given entity belongs. When two entities A and B are grouped together by Union, the related-to information is added. The result of this algorithm is a set of disjoint clusters with related-to informa- tion. A cluster can consist of base entities and components. How the result is interpreted and converted into components is described in Section 8.3.1.7. Algorithm 8-1 enhances all connection-based techniques to work incrementally. The techniques themselves need not be changed, i.e., the specification for Connected_Entities remains unaffected. 8.3.1.4 Composition for Metric-based Techniques Delta-IC. As previously said, Delta-IC is a hybrid of metric-based and connec- tion-based approaches. The actual clustering is connection-based while the metric is only used to remove bad candidates. Hence, an incremental extension of Delta IC can be organized analogously to connection-based approaches. A single step in the incremental connection-based techniques basically consists of two sub- steps: (1) select a cluster and (2) replace bound entities within these clusters by their enclosing components. This scheme can be adopted for Delta-IC as follows. At first, the clusters are identified according to the procedure explained in Section 5.7 and then bound entities are replaced within these clusters as described by algorithm 8-2 (only step 4 and 5 are new, all other steps were already explained in algorithm 5-4 on page 131). The reasons for replacing bound entities by their enclosing components were given in Section 8.3.1.1. If a cluster is split into subclusters due to mutually exclusive entities, a subpro- gram, S, should be in a subcluster that contains an object, O, referenced by the subprogram. That is why a related-to(S, O) annotation is added to the cluster in step 4 of algorithm 8-2. If an entity, E, is related to an entity, E’, that is already bound, the induced related-to information with respect to the enclosing compo- nents of E’ is added as part of step 5 of algorithm 8-2. Combined and Incremental Techniques 270 Algorithm 8-2. Extended incremental Delta IC. Input: • object reference view V • components view A • DIC threshold Q Output: • non-disjoint clusters with related-to information Algorithm: 1. - 3. -- See algorithm 5-4 on page 131. -- clusters is an array of clusters where DIC (clusters (C)) ³ Q 4. add related-to information: for each C in clusters’Range loop for each subprogram S in clusters (C) loop for each object O in referenced-objects (S) in V loop add related-to (S, O); end loop; end loop; 5. handle bound entities: for C in clusters’Range loop for each element E in clusters (C) loop if Is_Bound (E, A) then clusters(C):=(clusters(C)\{E})ÈEnclosing_Components(E,A); -- clusters (C) is a set of entities, i.e., a single cluster for each E’ in clusters (C) where related-to (E, E’) loop for each AC in Enclosing_Components (E, A) loop add related-to (E’, AC); -- induced related-to information end loop; end loop; end loop; end loop; C" 271 Combining Operators Other metric-based techniques. Both Type-based Cohesion and Similarity Clustering use the same hierarchical clustering algorithm that was proposed in Chapter 7; they only differ in the underlying metric. In an incremental approach, atomic components have already been detected and yet unbound entities have to be clustered. 
This compares to a snapshot of the clustering algorithm after a few runs when there are already some clusters and yet more iterations ahead. Considering this, the clustering algorithm can easily be modified to work incrementally. Only a pre- and a post-processing phase is necessary. In the pre-processing phase, the similarity relation is computed among all atomic components and all unbound entities using the group sim- ilarity (equation (7.2) on page 188). The clustering algorithm then clusters all atomic components and free entities based on the similarity relation just computed. The result of the clustering algorithm is a dendrogram whose subtrees are flattened into clusters using a user-determined threshold for the acceptable similarity among the elements of the subtrees as described in Section 7.4. If atomic components have been added in lieu of their elements in the pre-processing phase, the dendrogram may contain both components and base entities and, hence, the retrieved clusters contain components besides entities, too. Note that it need not necessarily be the case that the similarity among all elements of a cluster retrieved from a dendrogram is above the threshold. Because the average as defined by (7.2) on page 188 is used as group simi- larity, a less similar entity may be balanced by strongly similar entities. Consider the example similarity matrix in Figure 8-4 that contains two base entities a and b and two atomic components A1 and A2. According to Figure 8-4(a), {a} and {b} are the most similar groups and are therefore united. The similarity of the union of {a} and {b} to all other elements is re-com- puted using the group average. The resulting similarity matrix is Figure 8- 4(b) in which {a,b} and {A1} are the two most similar groups. The final dendrogram for the similarity matrix Figure 8-4(a) is shown in Figure 8- 4(c). In this example, a and A2 are in the same cluster if a threshold Qÿ ³ 0.35 is used to retrieve clusters from the dendrogram in Figure 8-4(c) even though the similarity among a and A2 is only 0.1 according to Figure 8-4(a). Combined and Incremental Techniques 272 In this example, the high similarity among b and A2 has lead to the inclusion of A2 into the group. To sum it up, the group similarity is used by hierarchical clustering to generate a dendrogram and the threshold is used to derive flat clusters from the dendrogram. The threshold is determined by the user and expresses his degree of tolerance about when two elements are similar enough to be in the same cluster. Using the average group similarity, an entity is added to a group that contains very similar entities even though there may also be a few entities for which the similarity is less than the threshold. The less similar entities are outweighed by the very simi- lar entities. If the clusters are split into subclusters due to mutually exclusive enti- ties, the resulting subclusters should contain those entities that are similar enough according to the user, i.e., whose individual similarity is above the threshold. Hence, the cluster is annotated with related-to (a, b) for all elements a and b of the cluster for which Sim (a, b) ³ Q according to equation (7.3) on page 189 holds. The clusters derived from the dendrogram are then treated in a post-pro- cessing phase as described in Section 8.3.1.7. 
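The incremental use of hierarchical clustering can be sketched as follows (illustrative Python, not the thesis implementation). It assumes that the pre-processing phase has already filled sim with the pairwise similarities between all free entities and existing atomic components (component entries computed with the group similarity of equation (7.2)); merging groups until no pair reaches the threshold stands in for building and then cutting the dendrogram, and the example values loosely follow Figure 8-4.

from itertools import combinations

def incremental_clustering(items, sim, theta):
    groups = [frozenset([x]) for x in items]        # singletons: entities and components
    def pair_sim(a, b):
        return sim.get((a, b), sim.get((b, a), 0.0))
    def group_sim(g, h):                            # average group similarity
        pairs = [(a, b) for a in g for b in h]
        return sum(pair_sim(a, b) for a, b in pairs) / len(pairs)
    while len(groups) > 1:
        (i, j), best = max((((i, j), group_sim(groups[i], groups[j]))
                            for i, j in combinations(range(len(groups)), 2)),
                           key=lambda t: t[1])
        if best < theta:                             # threshold cut of the dendrogram
            break
        groups[i] = groups[i] | groups[j]
        del groups[j]
    # related-to holds only pairs whose individual similarity reaches the threshold
    related_to = {(a, b) for g in groups for a, b in combinations(sorted(g), 2)
                  if pair_sim(a, b) >= theta}
    return groups, related_to

# Two free entities a, b and two atomic components A1, A2
sim = {("a", "b"): 1.0, ("a", "A1"): 0.9, ("a", "A2"): 0.1,
       ("b", "A1"): 0.6, ("b", "A2"): 0.7, ("A1", "A2"): 0.3}
groups, related = incremental_clustering(["a", "b", "A1", "A2"], sim, theta=0.35)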
8.3.1.5 Composition for Graph-based Techniques Dominance Analysis was already presented as an incremental analysis: Domi- nance Analysis is applied to the collapsed base view and a base entity is added to Figure 8-4. Example dendrograms. b A1 A2 a 1.0 0.9 0.1 b 0.6 0.7 A1 0.3 A1 A2 a,b 0.75 0.4 A1 0.3 A2 a,b,A1 0.35 a b 1.0 A1 a b 1.0 0.7 A1 a b 1.0 0.7 0.3 A2 (a) (b) (c) 273 Combining Operators its primarily dominating component. The collapsed view for a view, V, results from the following transformations: 1. A bound subprogram, variable, and type in V is replaced by its enclosing atomic component. 2. Edges between unbound entities are kept whereas edges to a bound entity are replaced by edges to its atomic component and analogously, edges from a bound entity are replaced by edges from the atomic component. Recall that base entities can directly be part of a subsystem. The fact that these base entities are part of a subsystem expresses only that they are in some way related but not related enough that they could be grouped into an atomic compo- nent. That is why we collapse only base entities that are part of an atomic compo- nent. There are basically four kinds of possible combinations to consider when edges are to be added to the collapsed view (see Figure 8-5): 1. The edge is among entities that are not part of an atomic component; in this case, both entities will be added to the collapsed view including the edges among them. 2. The source of the edge is a free entity, S, and the target is a bound entity that is part of an atomic component, A, then S and A and only the induced edge from S to A will be added to the collapsed view. 3. The target of the edge is a free entity, S, and the source is a bound entity that is part of an atomic component, A, then S and A and only the induced edge from A to S will be added to the collapsed view. 4. The edge is among entities that are parts of atomic components A1 and A2, respectively; then A1 and A2 and an induced edge between A1 and A2 (only if A1¹ A2) are added to the collapsed view. There are two consequences for the collapsed view one should be aware of. First, collapsing may result in dependencies that were not present before, and second, collapsing overlapping atomic components is problematic. The examples in Figure 8-6 illustrate the first property. In example (a), a cycle results because x and y are collapsed into one node though there was previously Combined and Incremental Techniques 274 no such cycle. In example (b), collapsing x and y has as result that b is reachable from a though it was not reachable from a before. (The discussion of the effect of collapsing nodes will be continued in the discussion of Strongly Connected Com- ponents Analysis below.) The problem with collapsing overlapping atomic components is that all involved atomic components inherit the dependencies of the overlapping parts. This is illustrated by the example in Figure 8-7 where AC1 and AC2 overlap in y and x. Both AC1 and AC2 are assumed to call a and to be called by b in the collapsed view. Furthermore, because AC1 contains x which calls y that is also contained in AC2, AC1 is assumed to call AC2 and vice versa. Hence, the ambiguity that was present in the overlapping atomic components continues in the resulting collapsed view and the user is, therefore, urged to limit overlapping in the atomic compo- nents. Figure 8-5. Induced relationships. Figure 8-6. Examples for changed dependencies in the collapsed view. Figure 8-7. 
Examples for collapsing overlapping atomic components. S S’ S’S A S’ A S A1 S S’ A2 (1) (2) (3) (4) part-of part-of part-of part-of induced induced induced x y b a b x y a b a b a Example (a) Example (b) x a b AC1 AC2 a b AC1 AC2y 275 Combining Operators In the dominance tree for the collapsed view, the clusters proposed by the Domi- nance Analysis heuristic consist of an atomic component and all its primarily dominated entities. If such a cluster is to be split due to mutually exclusive enti- ties, an entity should be added to those subclusters that contain its dominator, its dominatees, or its siblings in the dominance tree since these are the entities most related according to the dominance relation. As an example, consider the domi- nance tree in Figure 8-8. The clusters proposed by Dominance Analysis for the dominance tree Figure 8-8(a) are shown in Figure 8-8(b) including the related-to information. The related-to information for the atomic component is omitted for readability reasons. Strongly Connected Components Analysis detects cycles in the call view, or in other words, mutually recursive subprograms. There are basically two avenues for an incremental Strongly Connected Components Analysis: • Strongly Connected Collapsed Component Analysis: One can apply strongly connected components analysis to the collapsed view as it was already described for dominance analysis, i.e., atomic components are collapsed first and only then, the analysis searches for cycles. • Strongly Connected Base Component Analysis: One can apply strongly con- nected components analysis first to the call view and then try to add the cycles to existing atomic components. The first approach may yield artificial cycles due to the collapsing of nodes as illustrated by the examples in Figure 8-6 and Figure 8-7. In Figure 8-6(a), the Figure 8-8. Dominance example. 1 2 5 7 1213 9 X AC1 AC2 Z Y variable subprogram AC3 {AC1, 2, 9, X} {AC2, 12, 13, Z} {AC3, 7} dominates related-to (a) dominance tree (b) clusters Combined and Incremental Techniques 276 cycle is introduced because x and y are treated as one node. Recall that strongly connected components are considered a relevant kind of atomic component because all the parts in the cycle are necessary to understand the component. This is not only so for mutually recursive subprograms. This also applies to artificial cycles as in Figure 8-6(a): Because x and y are part of the same atomic compo- nent, their respective comprehension is very much related. But what about the cycle in Figure 8-7 that causes the two overlapping compo- nents to be considered a strongly connected component? Because of the ambigu- ity in the overlapping part, the two atomic components are obviously very much related and the Strongly Connected Component Analysis does not reveal more than what was already present. If the user does not want this effect, a better way of modeling the relatedness of two overlapping components and the unclear clas- sification of the ambiguous part is offered by means of the subsystems introduced in Section 3.2.2. A better representation for Figure 8-7 is given in Figure 8-9 where AC1 and AC2 as well as the overlapping part x and y are all distinct parts of a subsystem. This avoids the artificial cycle in the collapsed view. The second approach that applies Strongly Connected Component Analysis to the call view yields only real cycles. 
On the other hand, it is therefore not able to detect atomic components that are mutually dependent, though this is an important piece of information for the maintainer: one cannot reuse one of these atomic components in another context without taking on all of them. Both approaches have their advantages and disadvantages, so it is only pragmatic to offer both to the user.

The two approaches can technically be reduced to the incremental solution for connection-based approaches where the elements of the strongly connected component act as connected entities, i.e., in order to instantiate algorithm 8-1 on page 268, the following definition of Connected_Entities can be used:

Connected_Entities (S) = { s | s ∈ called (S) ∧ S ∈ called (s) } \ { S }        (8.1)

where called (s) = transitive_closure (successors (s, {call})).

Figure 8-9. Better representation for overlapping atomic components.

In the case of Strongly Connected Base Component Analysis, Connected_Entities is established on the call view, whereas Strongly Connected Collapsed Component Analysis ascertains Connected_Entities on the collapsed call view. When cycles are split due to mutually exclusive entities, an entity should be in the subcluster that also contains its direct neighbors in the cycle since these are the elements of the cycle most needed to understand the entity. Hence, the proposed clusters are annotated by related-to (a, b) for all elements a and b of the cluster that are direct neighbors in the cycle.

8.3.1.6 Partitioning Clusters with Mutually Exclusive Elements

The previous sections explained how the individual techniques can be adapted to work incrementally. They all yield clusters of related entities that are to be transformed into actual components. The clusters may contain mutually exclusive entities since the techniques as described above do not take negative information into consideration. Clusters with mutually exclusive entities must be split into subclusters without conflicting entities, which is uniformly handled by a separate step after clustering. This section explains how.

The input to the partitioning stage is the set of clusters that were formed by the previous step and are annotated with mutually-exclusive and related-to information, i.e., these clusters are associated with an interference graph that describes the entities that mutually exclude each other and the related-to graph that describes which entities are related.

Generally, not all entities in a cluster are mutually exclusive and, therefore, the cluster is not completely useless. Instead of throwing away the whole cluster, the cluster should be partitioned into subclusters without mutually exclusive entities. A second requirement for a reasonable splitting is that the subclusters should be as large as possible. Subclusters with only one element obviously do not have any conflict, but they are not very helpful either. Unfortunately, we are facing the NP-complete graph coloring problem here, i.e., for an optimal solution, we may need exponential time. The graph coloring problem is to assign a minimal number of colors to nodes in a graph where no two neighboring nodes may have the same color. This is equivalent to partitioning a graph’s nodes into sets of nodes where no two neighboring nodes are in the same set. This problem can be tackled heuristically as follows:
1. Remove nodes from the interference graph in the order of least to most conflicts and put them onto a stack. When all nodes are on the stack, the nodes with most conflicts are at the top of the stack.
2. Then, the stack’s nodes are popped and assigned to a partition such that no neighboring nodes are in the same partition.

Algorithm 8-3 implements this strategy. Function Choose_Node selects a node with minimal conflicts and function Partition_Number yields the minimal partition number that can be assigned to a given node N where the corresponding partition does not contain a node that is in conflict with N.

Algorithm 8-3. Partitioning clusters with mutually exclusive elements.
Input:
• Cluster C
Output:
• A set of clusters C1, C2, ..., Cn where:
  C = C1 ∪ C2 ∪ ... ∪ Cn,
  ∀ i, j ∈ {1..n}: i ≠ j ⇒ Ci ∩ Cj = ∅, and
  ∀ i ∈ {1..n}: ¬∃ a, b ∈ Ci: mutually_exclusive (a, b)
Algorithm:
1. Initialization:
   C’ := C;
2. Build stack; top most element has most, bottom element has least conflicts:
   while not Is_Empty (C’) loop
     Node := Choose_Node (C’);
     Remove_Node (C’, Node);
     Push (Node_Stack, Node);
   end loop;
3. Partitioning (graph coloring):
   while not Is_Empty (Node_Stack) loop
     Node := Pop (Node_Stack);
     Assign (Node, Partition_Number (Node, C));
   end loop;

Using the algorithm in the described way yields subclusters that do not contain mutually exclusive entities. The number of generated subclusters is a local optimum, that is to say, there may be better solutions with fewer, hence larger, subclusters, but the found solution is generally a good approximation. Finding the optimal solution would require exponential time in the worst case.

However, the algorithm used in the described way can yield subclusters of unrelated entities. For example, given the cluster in Figure 8-10, this algorithm may yield two subclusters {T2} and {T1, F1, F2, F3} though F1 and F3 are not related to T1. In order to avoid unrelated subclusters, function Partition_Number needs to be provided with a preference rule: the entity is preferably added to a cluster that has most related entities (and no mutually exclusive one). Because the answer depends on the heuristic that produced this cluster in the first place, the detection techniques annotate the generated clusters with related-to information.

However, modifying Partition_Number solves the problem of unrelated entities only partly because algorithm 8-3 treats all elements equally when putting them onto the stack. For example, F1 and F3 could be above T2 in the stack (Figure 8-10) when F1 and F3 have conflicts, too. Then the top element, say F1, would be assigned to a subcluster first. When F3 is to be added, there is no connection to F1 and, hence, F3 could be added to a separate subcluster. Only then is T2 handled, and it may be added either to the subcluster of F1 or to that of F3, resulting in one of F1 and F3 becoming orphaned.

In general, types as well as variables constitute crystallization points for clustering and, therefore, should be distributed to subclusters before subprograms. Algorithm 8-3 can be adjusted to this strategy by first putting all subprograms onto the stack and only then the remaining types and variables. This way, types and variables are above all subprograms in the stack and, therefore, get partitioned first. The only necessary modification is to adapt Choose_Node accordingly. The resulting subclusters can then be transformed into components as described by the following section.
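The heuristic just described, including both adjustments, can be sketched as follows (illustrative Python, not the thesis implementation). The conflict and related-to relations are modeled as plain dictionaries, and the is_subprogram predicate as well as the example relations for Figure 8-10 are assumptions of the example.

def partition(cluster, conflicts, related, is_subprogram):
    # Stack construction: subprograms are pushed first, then types and variables,
    # each from least to most conflicts; types and variables thus end up on top
    # and are assigned to partitions first (they act as crystallization points).
    members = set(cluster)
    def by_conflicts(entities):
        return sorted(entities, key=lambda e: len(conflicts.get(e, set()) & members))
    stack = (by_conflicts([e for e in cluster if is_subprogram(e)])
             + by_conflicts([e for e in cluster if not is_subprogram(e)]))
    partitions = []
    while stack:
        node = stack.pop()
        # admissible partitions contain no entity that conflicts with the node
        admissible = [p for p in partitions if not (p & conflicts.get(node, set()))]
        if admissible:
            # preference rule: take the admissible partition with most related entities
            max(admissible, key=lambda p: len(p & related.get(node, set()))).add(node)
        else:
            partitions.append({node})
    return partitions

# Example in the spirit of Figure 8-10 (assumed relations): T1 and T2 exclude
# each other, F1 and F3 are related to T2, F2 is related to T1.
cluster = ["T1", "T2", "F1", "F2", "F3"]
conflicts = {"T1": {"T2"}, "T2": {"T1"}}
related = {"F1": {"T2"}, "F2": {"T1"}, "F3": {"T2"},
           "T1": {"F2"}, "T2": {"F1", "F3"}}
print(partition(cluster, conflicts, related, lambda e: e.startswith("F")))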
8.3.1.7 Transforming Clusters into Components The (sub-)clusters generated by the previous step as described in the last section contain related entities that are not mutually exclusive. These clusters are basi- cally sets of entities and are now to be transformed into actual components. In general, clusters with only one element are not useful and are, therefore, rejected. Likewise, huge clusters may also be rejected. In an interactive approach, the user is able to specify lower and upper bounds for reasonable components (the upper bound should only be set when the user does not plan to refine the candi- date components by a second analysis). Principally, the clusters generated by the incremental techniques fall into one of the following categories: Figure 8-10. Cluster with mutually exclusive entities. T1 T2 F1 F2 mutually exclusive related-toF3 281 Combining Operators 1. The cluster contains only base entities. Such clusters represent completely new atomic components. 2. The cluster contains a single atomic component, all other elements are base entities. This is the case when an incremental technique wants the base entities of the cluster to be added to an existing atomic component. Remember that the enclosing atomic component is added to a cluster in lieu of a bound entity. 3. The cluster contains more than one atomic component. This happens when at least one base entity can be added to more than one existing atomic component. For the first type of cluster, we can create a new atomic component that contains all the elements of the cluster (Figure 8-11(a)). In the second case, we add all ele- ments to the atomic component contained in the cluster (Figure 8-11(b)). If clusters contain more than one atomic component, it is not clear to which atomic components the base entities of the cluster should belong. Therefore, such clusters have to be presented to the user as a whole and he or she has to decide. To achieve this, a new subsystem is introduced and all atomic components of the clus- ter are considered part of the subsystem. The related-to annotation determines the possible elements of the atomic components. These elements can be derived as follows: 1. The related-to graph is ascertained whose nodes are entities and whose edges represent the related-to relationship (Figure 8-12(a)). 2. The restricted related-to graph is derived from the related-to graph by omit- ting components, i.e., the restricted related-to graph contains base entities only (Figure 8-12(b)). Figure 8-11. Transforming clusters into components. {F1, F2, T1} {F1, F2, T1, AC1} AC F1 F2 T1 (a) (b) AC1 F1 F2 T1 new part-of Combined and Incremental Techniques 282 3. The direct elements of an atomic component, A, are all members of connected subgraphs (in a connected graph, each node can be reached from each other node) of the restricted related-to graph that contain at least one node, N, for which related-to (N, A) holds in the original related-to graph (Figure 8-12(c)). Furthermore, all atomic components are part of a new subsystem node. Restricting the related-to graph before identifying the related entities is necessary in order to add only those entities to a component that are (transitively) related to the component itself. Otherwise, entities would be added to components that actu- ally belong to other components, e.g., T2 is primarily related to AC2 and not to AC1. The fact that basically all elements are related in the cluster is expressed by subsuming them under the same subsystem. 
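A minimal sketch of this derivation for clusters with more than one atomic component follows (illustrative Python, not the thesis implementation); the representation of the cluster, its components, and the related-to relation as plain Python collections is an assumption of the example.

from collections import defaultdict

def assign_elements(cluster, components, related_to):
    # cluster: all entities of the (sub-)cluster; components: the atomic
    # components contained in it; related_to: undirected related-to pairs.
    base = [e for e in cluster if e not in components]
    adjacent = defaultdict(set)                     # restricted related-to graph
    for a, b in related_to:
        if a in base and b in base:
            adjacent[a].add(b)
            adjacent[b].add(a)
    seen, subgraphs = set(), []
    for entity in base:                             # connected subgraphs (DFS)
        if entity in seen:
            continue
        stack, subgraph = [entity], set()
        while stack:
            n = stack.pop()
            if n in subgraph:
                continue
            subgraph.add(n)
            seen.add(n)
            stack.extend(adjacent[n] - subgraph)
        subgraphs.append(subgraph)
    # a component receives every subgraph that contains a node related to it;
    # the components themselves become parts of a new subsystem (not shown here).
    elements = {c: set() for c in components}
    for subgraph in subgraphs:
        for c in components:
            if any((n, c) in related_to or (c, n) in related_to for n in subgraph):
                elements[c] |= subgraph
    return elements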
As the example in Figure 8-12 illustrates, overlapping components may be pro- posed using this strategy. However, the overlap arises from the actual related-to information. It is the user’s decision how the overlap is to be resolved. Moreover, if the deep intersection operator is applied to the resulting components view, the overlap may also be resolved by the judgement of another technique. An alternative strategy to assign base entities to atomic components in clusters with more than one component would be to use dominance analysis applied to the related-to graph assuming a subsystem node as root to which all atomic compo- nents of the cluster are related. Then, entities would be added to their dominating component. However, if an entity is (transitively) related to more than one compo- nent, the entity would be dominated by the root node and not by one of the atomic Figure 8-12. Handling clusters with more than one component. AC1 F1 F2 T1 AC2 F3 F4 T2 F1 F2 T1 F3 F4 T2 AC1 F1 F2T1 AC2 F3 F4T2 (a) related-to graph (b) restricted related-to graph (c) component decomposition C new part-of related-to 283 Combining Operators components, i.e., the information about the overlap would be lost. For this reason, the first strategy is preferred. 8.3.2 Set Operators for Combining Components Views The former section explained one of the core operators to combine components views, namely, the composition. This section describes further operators that are modeled on set operations. Components views are basically sets of components and, hence, can be combined by set operations. In order to perform the set union, intersection, and difference for components views, we have to identify matching components of the two views to be combined; e.g., the intersection operator yields only those compo- nents that are in both views. This raises the question how do we determine whether two components match? Remember that in our relational perspective of components, a component has two faces: A component is characterized by a com- ponent entity and a set of its parts. Thus, a component is actually a named set. As a consequence, we can compare two components by name as well as by their ele- ments. A comparison by name would consider the two atomic components AC1 and AC2 equal when they are identical, i.e., AC1 = AC2, in other words, when they are represented by the same entity; a comparison by elements, on the other hand, would consider AC1 and AC2 equal when their elements are the same, i.e., ele- ments (AC1) = elements (AC2). The comparison by name makes sense when two components views are the result of different incremental techniques applied to the same input view. Then, the same component may occur in both views if it is present in the input view. On the other hand, when two incremental techniques are applied to disjoint views, identical components cannot occur. Then, we want to compare components in terms of their contents rather than name. For example, because AC1 and AC2 in Figure 8-13 have the same elements, they can be consid- ered equal from the set perspective though AC1 and AC2 may be different entities in the relational perspective. However, using set operations for combining views by elements is still too sim- plistic to be useful for two reasons. 
First, small differences of the elements lead to duplication or removal of similar components for union and intersection, respec- tively, and second, the comparison based on elements does not come up to the Combined and Incremental Techniques 284 hierarchical structure of subsystems. In the following, these problems will be dis- cussed in detail. The purpose of a union operator is to combine two techniques that detect more or less distinct entities, i.e., their results are mainly disjoint sets of disjoint candi- dates such that the union operator can be interpreted as a union of sets. However, there is no guarantee that the results really do not overlap, and if they do overlap, using the simple union for sets based on elements may lead to almost duplicated candidates. For example, in Figure 8-14, the simple union, so-called shallow union, of the sets of candidates Result 1 and Result 2 treats any component as an atomic set member and produces five candidates where four of them are very sim- ilar. If this is presented to the maintainer, she has to check almost equal compo- nents twice. The outcome for intersecting the two components views in Figure 8-14 is even worse since the resulting view is empty. For set intersection, a component of one view is only in the resulting view when it has an exact match in the other view. Beside the problem of similar, yet not equal components, the common set opera- tors do not come up to the hierarchical structure of subsystems. In the case of atomic components, we can consider their elements in order to find out whether a comparable atomic component exists in the other view; but considering only the Figure 8-13. Two equal components AC1 and AC2 from different perspectives. Figure 8-14. Shallow union and intersection based on elements. AC1 F1 F2 T1 AC2 F1 F2 T1 AC1 F1 F2 T1F1 F2 T1 AC2 relational perspective set perspective part-of a b c d e f g a b h d e i a b c d e f g a b h Result 1 Result 2 shallow intersection È Ç d e i shallow union Æ 285 Combining Operators elements is not appropriate for subsystems since subsystems with the same set of elements may still differ in structure. In the following, ways of combining two views that are modeled on set union, intersection, and difference are described that overcome the deficiencies of the shallow set operator semantics. In particular, the proposed operators • are based on both comparison by name and by elements, • tolerate divergences of the elements of the components, • consider the structure of subsystems, • and also combine the parts of subsystems at all levels. In order to support comparison by name and by elements as well as to tolerate divergences, the matching criterion of the new combining operators for two com- ponents is the correspondence as defined in Section 3.7: We consider two compo- nents corresponding when they are identical or affine where affinity is associated with a tolerance parameter that allows inexact matches. The correspondence of components is defined at all levels such that a component in one view may be matched with a component in the other view that is a subcomponent of a larger subsystem. For example, C1 and C2 of the components views VL and VR in Figure 8-15 are a match where C2 is a part of C3; C4 and C5 ¾ that are transitive parts of C1 and C2, respectively ¾ are a match, too. The corresponding components are then combined by uniting, intersecting, or building the difference of their direct elements depending on the respective operator. 
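A minimal sketch of this matching criterion is given below (illustrative Python, not the thesis implementation). The precise affinity condition is condition (3.2) on page 70 and is not repeated here; the degree of overlap used in the sketch is inferred from the worked example later in this section, and the helper names are assumptions.

def degree_of_overlap(left_elems, right_elems, correspond):
    matched = sum(1 for l in left_elems if any(correspond(l, r) for r in right_elems))
    distinct_left = len(left_elems) - matched
    distinct_right = sum(1 for r in right_elems
                         if not any(correspond(l, r) for l in left_elems))
    total = matched + distinct_left + distinct_right
    return matched / total if total else 0.0

def corresponds(left, right, elements, correspond, theta):
    # components correspond when they are identical or affine (tolerance theta)
    return left == right or degree_of_overlap(elements(left), elements(right),
                                              correspond) >= theta

# Along the lines of the worked example below: AC1 = {1,2,3,4} and AC3 = {2,3,4,5}
# share three of five distinct direct elements, so they are affine for theta <= 3/5.
elements = {"AC1": {1, 2, 3, 4}, "AC3": {2, 3, 4, 5}}
assert corresponds("AC1", "AC3", elements.get, lambda a, b: a == b, theta=0.6)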
Only the direct elements of corre- sponding components are combined since only these were the basis on which cor- respondence was established. Moreover, if there are corresponding transitive parts of two subsystems, they will likewise be combined at their respective level. The combination of C1 and C2 of Figure 8-15, for example, combines the direct elements of C1 and C2 as sketched on the right hand side of Figure 8-15 by apply- ing the underlying set operator to their direct elements a1 and a2. Furthermore, the corresponding parts of C1 and C2, namely, C4 and C5 will also be combined analogously. Whether C is also in the result view despite of the fact that there is no correspondent, depends upon the operator (see below). Combined and Incremental Techniques 286 Because not only components views as such are united or intersected but also cor- responding components contained in these views, these operators are called deep union, deep intersection, and deep difference. Beside corresponding compo- nents, the deep set operators also determine how nodes that do not have a corre- sponding node, so-called singles, have to be treated. In the case of the intersection operator, for example, they have to be ignored; in the case of the union operator, they have to be added to the result view. The deep set operators basically consist of the following three steps and differ only in the actual set operation Ä that is to be applied: 1. Find the corresponding components in the two components views. 2. Handle correspondents: Apply the set operation Ä to the direct elements of all corresponding components. 3. Handle singles. How these steps can be implemented in a general manner for all deep set opera- tors is explained in this section (see algorithm 8-4). The following sections describe the specific parts of the concrete deep set operators. Handling correspondents and singles. The generic algorithm 8-4 has three generic parameters. Handle_correspondents combines two corresponding com- ponents. The other two generic parameters are used to handle singles. There are two generic parameters for singles because the symmetric difference operator treats singles in the two argument views differently. The view that is given as first Figure 8-15. Deep set operator outline. C1 a1 C2 a2 C3 a1Ä a2 C1|C2 VL VR correspond part-of Vresult = VL Ä VR C4 a4 C5 a5 correspond a4 Ä a5 C4 | C5 C3 part-of 287 Combining Operators Algorithm 8-4. Generic algorithm for deep set operators. 
Generic • handle_correspondents: View ´ View ´Entity ´ Entity ´ View • handle_left_single: View ´Entity ´View • handle_right_single: View ´ Entity ´ View Input • components views VL, VR Output • components view V that contains the result of the deep set operator Algorithm -- Identify corresponding components for each L in reverse-topological-order (VL) loop if $ (R Î VR) identical (L, R) then add R to correspondents (L); add L to correspondents (R); end if; for each R in Potentially_Affine (L) loop if affine (L, R) then add R to correspondents (L); add L to correspondents (R); end if; end loop; end loop; -- Handle correspondents and singles for each L in reverse-topological-order (VL) loop if correspondents (L) = Æ then handle_left_single (VL, L, V); else for each R in correspondents (L) loop handle_correspondents (VL, VR, L, R, V); end loop; end if; end loop; for each R in reverse-topological-order (VR) where correspondents (R) = Æ loop handle_right_single (VR, R, V); end loop; Combined and Incremental Techniques 288 argument to the operator is called left argument view, the one given as second argument is called right argument view. Finding correspondents. Before the set operation can be applied, the corre- sponding components have to be identified. Correspondence of base entities can immediately be established by identity; identity is their only way of correspon- dence since they do not have further elements. Correspondence of atomic compo- nents can also be quickly established based on their direct elements. Ascertaining corresponding subsystems is more difficult since subsystems can have several lev- els and the correspondence at one level depends on the correspondence at the deeper levels, i.e., the correspondents of the parts of a component have to be established before one can determine its own correspondents. Since a subsystem structure is an acyclic directed graph, the subsystem structure in one components view can be traversed in reverse topological order (inverting the part-of edges) where the visited nodes are compared to nodes in the subsystem structures of the other components view. Traversal of acyclic graphs in reverse topological order ensures that a node is only visited when all its parts have been visited. As a conse- quence, we can assume that the corresponding nodes of its parts have already been established. When a node is visited, its corresponding nodes in the other components view can be determined by identity or affinity. Whether there is an identical node in the other components view can be decided immediately. In order to identify affine nodes, however, there has to be a component in the other view whose direct ele- ments correspond to the direct elements of the currently visited node. Fortunately, we do not have to compare all possible pairs of components in the two compo- nents views in order to find affine components. 
According to the definition of affinity given in Section 3.7.2, a node A can only be affine to a node B if That is, the principal components that could be affine to A are as follows: a direct-elements A( )Î( ) b direct-elements B( )Î( )correspond a b,( )$$ Potentially_Affine A( ) = B a direct-elements A( )Î correspond a b,( ) b direct-elements B( )ÎÙ Ù{ }= B part-of a A,( ) correspond a b,( ) part-of b B,( )Ù Ù{ }= 289 Combining Operators Since all constituting conditions for membership of this set (i.e., part-of and cor- respond when it is saved once computed) can be checked in constant time and the number of parts can be assumed to be below a constant limit, this set can be com- puted in constant time. This set allows only to identify the potentially affine com- ponents; by iterating over its elements and checking each element for affinity with A as defined in Section 3.7.2, the actual affine components can be identified. Note that each component may have more than one correspondent as it was shown in Section 3.7.3. That is why correspondents (A) in Figure 8-4 denotes a set of entities. One could argue that traversing only the left argument view is not sufficient because the correspondence information of nodes in the right argument view is not yet available. But this is not the case. Let us consider the scenarios in Figure 8-16. When a node A of the left argument view is visited, this node can either be a leaf or an inner node in a subsystem structure. Because leaves do not have further elements, the only possible way of correspondence with a node B in the other view is by identity (Figure 8-16(a) and (b)). Hence, the direct elements of B are irrelevant to the correspondence with A. Now, let us assume A is an inner node, i.e., it has direct elements in the compo- nents view. Again, B can either be a leaf or an inner node. If B is a leaf, the only possible way of correspondence is the one by identity again (Figure 8-16(c)). Thus, the only situation in which correspondence can be by affinity is when both A and B are inner nodes (Figure 8-16(d)). However, if A is an inner node, its parts have already been visited and their correspondents are all known. Hence, all needed information to find out whether A and B correspond is available. Algorithm 8-4 traverses the left argument view twice. This is needed because the following descriptions of the actual procedures to handle correspondents and sin- gles assume that the correspondence information has been established for all Figure 8-16. Constellations of correspondence. A B A B (a) (b) b A B (c) a A B (d) a b correspond Combined and Incremental Techniques 290 direct elements of its argument entities. To illustrate that actually two traversals are needed, consider Figure 8-17. Let us assume that the nodes 1, 2, 3, and 4 have already been visited and AC1 is the currently visited node during the first traversal in reverse topological order. At this point, it is known that AC1 and AC3 corre- spond to each other. Moreover, the singles and nodes with correspondent among the direct elements of AC1 are known as well. However, it is not necessarily known for the direct elements of AC3 that do not have a correspondent among the direct elements of AC1 whether they do not have a correspondent at all. For exam- ple, the direct element 5 of AC3 actually has a correspondent, but its correspon- dent in VL has not yet been visited and is therefore not yet known. 
If handle_correspondent (AC1, AC3) were called during the first traversal, handle_correspondent would consider node 5 a single though it actually has a correspondent. The generic algorithm 8-4 implements the steps to combine two views. The deep union, intersection, and difference are instantiation of this general scheme by put- ting the generic parameters into concrete terms. The generic parameters represent the underlying shallow set operation that is to be applied to the direct elements of corresponding components. For example, the deep union operator unites all direct elements of corresponding components using the regular set union (by providing an adequate implementation for handle_correspondents). Since the correspond- ing components do not depend on the deep set operation and, therefore, the regu- lar set operation is applied to the same pairs of entities independent from the actual deep set operation, the following equation holds: Figure 8-17. Example traversal of subsystem structures. 2 3 4 5 6 7 81 AC1 AC2 C1 3 4 52 AC3 VL VR correspond part-of deep-union A B,( ) deep-intersection A B,( ) deep-symmetric-difference A B,( )È= 291 Combining Operators on the analogy of set theory: We do not prove this equation since it becomes plausible by the working example that is used to explain the deep set operators in the following. Working example. Figure 8-18 contains two subsystems C1 and C2 that are to be compared. The figure already connects the corresponding components, assum- ing as tolerance parameter Q = 3/5. Corresponding base entities are not explicitly highlighted for ease of readability. Since base entities correspond by identity, the pairs of corresponding base entities are obvious. Here, we will explain the generic algorithm 8-4 by describing a traversal and listing the generic subprogram param- eters that are called during this traversal. The following sections, in which the concrete subprograms to instantiate the generic algorithm for the actual deep set operators are described, will only present the result for the respective instantia- tion. One possible reverse topological order of C1 (among others) is to visit the nodes in the order of 0, AC0, 1, 2, 3, 4, AC1, 5, 6, 7, 8, AC2, 10, C1. Base entity 0 has no identical node in the other view and, therefore, handle_left_single(VL, 0, V) is called. Likewise, AC0 has no identical node, nor has its direct element 0 has a cor- respondent and, therefore, AC0 has no correspondent. That is why handle_left_single (VL, AC0, V) is called. 1 has no correspondent (handle_left_single (VL, 1, V) is called), but 2, 3, and 4 have identical nodes in the other view, which results in calls handle_correspondents (VL, VR, 2, 2, V), handle_correspondents (VL, VR, 3, 3, V), and handle_correspondents (VL, VR, 4, 4, V). AC1 has no identical node but its direct elements 2, 3, and 4 have corre- sponding entities whose enclosing component is AC3. Therefore, Potentially_Affine (AC1) is {AC3}. AC1 and AC3 are in fact affine for every toler- ance factor Q £ 3/5 and, hence, handle_correspondents (VL, VR, AC1, AC3, V) is called. Then, handle_correspondents is called for the pairs of identical nodes (5,5), (6,6), (7,7), and (8,8). AC2 has an identical component in the other sub- system structure; thus, handle_correspondents (VL, VR, AC2, AC2, V) is called. 
union A B,( ) intersection A B,( ) symmetric-difference A B,( )È= Combined and Incremental Techniques 292 Note that the two identical components are not affine for Q = 3/5 because their degree of overlap according to condition (3.2) on page 70 is only 2/5. 10 does not have a correspondent, which leads to a call to handle_left_single (VL, 10, V). Eventually, C1 is visited. Its direct elements are AC0, AC1, AC2, and 10. AC1 and AC2 have corresponding components in the other view whose enclosing compo- nent is C2. Hence, Potentially_Affine (C1) is {C2}. The corresponding direct ele- ments of C1 and C2 are (AC1, AC3) and (AC2, AC2). Distinct elements of C1 and C2 are AC0, 10, 6, AC4. The degree of overlap for C1 and C2 is, therefore, 2/(2+4) < Q = 3/5. Thus, C1 and C2 are distinct and handle_left_single (VL, C1, V) is called. After the second reverse topological traversal of VL, handle_right_single is called for each remaining single of the components view given as second argument of the deep set operator. To sum it up, the generic operations are executed in the order given in Table 8-1. Notation. The nodes that represent the combination of two corresponding nodes L and R are represented in the result of the combination as a pair (L, R) if L ¹ R. If L=R, only the single node L is added to the output view in order to preserve its identity for future combinations. A single entity L is equivalent to a pair (L, L). 8.3.2.1 Deep Union The deep union operator unites corresponding components of its argument com- ponents views in order to avoid duplicated similar components. All entities that do not have a correspondent are added to the output components view, too, since this is the expected semantics of a union. That is, handle_correspondents will Figure 8-18. Working example of partly corresponding components. 2 3 4 5 6 7 81 AC1 AC2 C1 3 4 5 6 7 8 92 AC3 AC2 C2 1112 AC4 VL VR AC0 0 10 correspond part-of 293 Combining Operators unite its two arguments and handle_left_single / handle_right_single simply add their argument to the output view by taking on their parts. Handling correspondents. In more detail, for any pair of corresponding entities of the two views, a new entity (L, R) is added to the output view that rep- resents the union of the two corresponding entities L and R (algorithm 8-5). The direct elements of (L, R) are the pairs of correspondents among the direct ele- ments of L and R plus all direct elements of L and R, respectively, that do not have a correspondent. Consider the example in Figure 8-19. The two components L and R are to be united at an intermediate stage. Their direct elements have already been visited, hence, pairs (l, r) have been added to the output view when l is a direct element of L and r is a direct element of R and l and r correspond. If they do not have a correspondent, they have been added as non-combined nodes. In the Table 8-1. Executed generic subprogram parameters. 
handle_left_single (VL, 0, V) handle_left_single (VL, AC0, V) handle_left_single (VL, 1, V) handle_correspondents (VL, VR, 2, 2, V) handle_correspondents (VL, VR, 3, 3, V) handle_correspondents (VL, VR, 4, 4, V) handle_correspondents (VL, VR, AC1, AC3, V) handle_correspondents (VL, VR, 5, 5, V) handle_correspondents (VL, VR, 6, 6, V) handle_correspondents (VL, VR, 7, 7, V) handle_correspondents (VL, VR, 8, 8, V) handle_correspondents (VL, VR, AC2, AC2, V) handle_left_single (VR, 10, V) handle_left_single (VR, C1, V) handle_right_single (VR, 9, V) handle_right_single (VR, 11, V) handle_right_single (VR, 12, V) handle_right_single (VR, AC4, V) handle_right_single (VR, C2, V) Combined and Incremental Techniques 294 example, b1 corresponds to b2 and c1 corresponds to c2 and c3 while a and d do not have any correspondent. Thus, the nodes a, (b1, b2), (c1, c2), (c1, c3), and d have been added. These and only these nodes are the direct elements of the new node (L, R) that represents the union of L and R. Algorithm 8-5 implements this step and is used as handle_correspondents in the instantiation of the generic algo- rithm 8-4. Handling singles. The entities that do not have a correspondent have to be added to the output view in the case of the union operator. When a single is added to the result view, all its direct elements in the original view become its direct elements in the result view, too. Due to the traversal in reverse topological order, all ele- ments of a single have already been added to the output view. Thus, all what has to be done is to enrol the direct elements of the component in the input view as its direct elements in the output view. When direct elements are added, direct elements with and without correspondent have to be distinguished. For the former, pairs of correspondents have been added, for the latter, only plain nodes. The example in Figure 8-20 illustrates this. Algorithm 8-6 implements the union for singles and is used as parameter handle_left_single to instantiate the generic algorithm 8-4. The implementation of handle_right_single differs from handle_left_single only by the order of ele- ments of corresponding pairs (e, e‘): The first element of such a pair is assumed to be contained in the left argument view and the second element of the pair is assumed to be contained in the right argument view; i.e., within the body of handle_right_single, we would replace (e, e‘) by (e‘, e). Figure 8-19. Union for corresponding entities. a L R b1 c1 db2c2 c3 (L, R) d(b1,b2)(c1, c2)a (c1, c3) correspond part-of 295 Combining Operators The result of the deep union operator applied to the working example in Figure 8- 18 on page 292 is shown in Figure 8-21. Algorithm 8-5. Handle_correspondents for deep union. Figure 8-20. Union for singles. 
Input • Components views VL, VR • Entity L in VL • Entity R in VR • Output components view V Algorithm add (L, R) to output view V; for each l Î direct-elements (L in VL) loop if correspondents (l) ¹ Æ then for each r Î correspondents (l) Ç direct-elements (R in VR) loop add part-of ((l,r), (L,R)) to output view V; end loop; else add part-of (l, (L,R)) to output view V; end if; end loop; for each r Î direct-elements (R in VR) where correspondents (r) = Æ loop add part-of (r, (L,R)) to output view V; end loop; a L b1 c1 b2 c2 c3 L (b1,b2) (c1, c2)a (c1, c3) correspond part-of Combined and Incremental Techniques 296 8.3.2.2 Deep Intersection In the case of the union operator, correspondents and singles have to be added to the result view whereas singles have to be ignored by the intersection operator since only entities to which both techniques agree ¾ in other words, entities with correspondent ¾ may be in the output view of the intersection. Handling correspondents. The parameter handle_correspondents for the instan- tiation of the generic algorithm 8-4 is a simplified version of Algorithm 8-6. Handle_left_single for deep union. Figure 8-21. Result of deep union. Input • Components view VE • Entity E in VE • Output components view V Algorithm add E to output view V; for each e Î direct-elements (E in VE) loop if correspondents (e) ¹ Æ then for each e‘ Î correspondents (e) loop add part-of ((e,e‘), E) to output view V; end loop; else add part-of (e, E) to output view V; end if; end loop; 2 3 4 5 6 7 81 (AC1,AC3) AC2 C1 9 C2 1112 AC4AC0 0 10 part-of Note that (AC2,AC2) º AC2 297 Combining Operators handle_correspondents for the union operator, in which the part that handles direct elements without correspondent is omitted (see algorithm 8-7). Handling singles. The specifications of handle_left_single and handle_right- single is trivial: A node without correspondent must not be in the output view. Thus, these two subprograms do nothing. The result of the deep intersection operator for the working example in Figure 8- 18 on page 292 is shown in Figure 8-22. As one can observe, the result of the deep intersection operator is a subset of the result of the union operator as one would expect of the common union and intersection operations. Algorithm 8-7. Handle_correspondents for deep intersection operator. Figure 8-22. Result of deep intersection. Input • Components views VL, VR • Entity L in VL • Entity R in VR • Output components view V Algorithm add (L, R) to output view V; for each l Î direct-elements (L in VL) loop for each r Î correspondents (l) Ç direct-elements (R in VR) loop add part-of ((l,r), (L,R)) to output view V; end loop; end loop; 2 3 4 7 8 (AC1,AC3) AC2 part-of Note that (AC2,AC2) º AC2 Combined and Incremental Techniques 298 8.3.2.3 Deep Symmetric Difference The deep symmetric difference operator, or simply difference operator, is used to investigate the discrepancy between two views. There are basically two scenarios for its usage (let VL be the left argument view and VR be the right argument view): 1. One wants to see what has been added to a view after an analysis has been applied to it. Then, the simple difference VR\VL of the original components view VL to the components view VR resulting from applying a technique to VL is of interest. Because the incremental techniques as presented so far can only add entities to a components view and do not remove entities, VL is a subset of VR. 
In an interactive environment, the simple difference operator will be most often applied with respect to the user view.

2. One wants to compare components views generated by different techniques. In this case, differences can be additions as well as removals. Given a components view VL to be compared to a components view VR, VR\VL denotes the additions and VL\VR denotes the removals.

Since we are interested in both VR\VL and VL\VR, we are actually interested in the symmetric difference of VL and VR, denoted by VL ⊕ VR = VR\VL ∪ VL\VR. In this section, the symmetric difference for components views is introduced instead of the simple difference since the former is more general and also appropriate for the first scenario: if VR results from an incremental technique applied to VL, then VL ⊆ VR holds and, hence, VL ⊕ VR = VR\VL ∪ ∅ = VR\VL, which is exactly what is of interest in the first scenario.

Handling correspondents. As opposed to a shallow symmetric difference, the deep symmetric difference operator also yields the symmetric difference of the corresponding components. If L and R are corresponding components, the differences between L and R can be (let L be in the left and R in the right argument view):
1. a direct element of L has no correspondent in R
2. a direct element of R has no correspondent in L

Because the first argument of the difference operator is the view to which the other view is to be compared, case 1 is considered a removal and case 2 an addition. The resulting view has to contain these differences tagged accordingly. The entities that are in both L and R (more precisely, direct elements of L with a correspondent in R) will be ignored since they do not represent a difference. Algorithm 8-8 computes the symmetric difference for two corresponding components with respect to their direct elements. Because L and R are both present in the input views, (L, R) is actually no difference between the input views. However, adding (L, R) to the output view is needed since the entities in direct-elements(L)\direct-elements(R) and direct-elements(R)\direct-elements(L) have to be attached to it: it is the part-of edge that makes the difference, and an edge needs both source and target. In order to make explicit that (L, R) is no real difference, it is added non-tagged.

Algorithm 8-8. Handle_correspondents for the deep symmetric difference.
Input
• Components views VL, VR
• Entity L in VL
• Entity R in VR
• Components view V
Algorithm
add (L, R) to V non-tagged;
add each element in {r | part-of(r, R) ∈ VR ∧ ¬∃l (part-of(l, L) ∈ VL ∧ correspond(l, r))} to (L, R) in V with tag added;
add each element in {l | part-of(l, L) ∈ VL ∧ ¬∃r (part-of(r, R) ∈ VR ∧ correspond(l, r))} to (L, R) in V with tag removed;

Handling singles. Above, differences between the direct elements of correspondents were classified as removals or additions. Likewise, singles contained in the left argument view of the difference operator are considered removed and singles contained in the right argument view are considered added. In both cases, the singles and their direct elements have to be enrolled in the output view. In doing so, we again have to make a distinction between direct elements with and without correspondent since the former were added as pairs and the latter as single entities. The distinction was already explained for the deep union operator. Algorithm 8-9 implements handle_left_single for the deep symmetric difference.
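A minimal Python sketch of this step may help to make the tagging explicit. It assumes that the direct-element and correspondence relations are given as plain dictionaries and that the output view is collected in two sets of tagged nodes and edges; all names are illustrative and do not correspond to the actual implementation.

def handle_left_single(L, direct_elements, correspondents, out_nodes, out_edges):
    # The single L exists only in the left argument view, hence it is a removal.
    out_nodes.add((L, "removed"))
    for l in direct_elements.get(L, ()):
        partners = correspondents.get(l, ())
        if partners:
            # l was enrolled in the output view as pairs (l, r) earlier
            # (reverse topological order), so attach the pairs to L.
            for r in partners:
                out_edges.add((((l, r), L), "removed"))
        else:
            # l has no correspondent and was enrolled as a plain node.
            out_edges.add(((l, L), "removed"))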
The algorithm for handle_right_single is analogous, except that nodes and edges are tagged added instead. The result of the deep symmetric difference operator applied to the working example in Figure 8-18 on page 292 is shown in Figure 8-23. Note that both nodes and edges need to be tagged (nodes that are present in both input views and, therefore, do not represent a difference are shown in italic font).

Algorithm 8-9. Handle_left_single for deep symmetric difference.
Input
• Components view VL
• Entity L
• Output components view V
Algorithm
add L to output view V tagged as removed;
for each l ∈ direct-elements (L in VL) loop
  if correspondents (l) ≠ ∅ then
    for each r ∈ correspondents (l) loop
      add part-of ((l, r), L) to output view V tagged as removed;
    end loop;
  else
    add part-of (l, L) to output view V tagged as removed;
  end if;
end loop;

Figure 8-23. Result of the deep symmetric difference (legend: removed entity, added entity, removed part-of, added part-of).

8.4 Voting Approach

In the previous sections, we have discussed combinations of techniques based on operators modeled on set operations. This section presents an alternative approach for combining the techniques.

When a yet free entity is to be bound during clustering, the question arises to which atomic component this entity should be added. So far, this has been answered by running one of the base techniques separately. However, what if we do not want to rely on a single heuristic? Then, we could run several analyses, compare the results, and add the entity to the atomic component on which most techniques agree.

Agreement among techniques can be established by means of the intersection operator. The intersection operator is defined for two arguments but could easily be extended to an arbitrary number of components views. However, the likelihood that the resulting components view is empty increases with the number of techniques that have to agree because all techniques have to subscribe. A more practical approach when several techniques are considered at the same time is to accept atomic components when a certain number of techniques agree, but not necessarily all. Furthermore, the agreement of a technique is only binary for the intersection operator, though the actual degree of certainty of the technique may lie somewhere between 0 and 1.

8.4.1 Summarized Agreement

Dropping total agreement and allowing continuous degrees of agreement between 0 and 1 leads us to the so-called voting approach: Given a base entity E that is to be added to one of the atomic components in SAC = {AC1, …, ACn} and a set of techniques ST = {T1, …, Tm}, E is added to the atomic component ACj ∈ SAC for which the agreement among the techniques of ST is maximal; i.e., quantitatively, the value

total-agreement(E, ACj) = ( Σi=1..m xTi × agreementTi(E, ACj) ) / ( Σi=1..m xTi )    (8.2)

has to be maximal, where xT is the weight of technique T used to give more trusted techniques more influence and agreementT(E, AC) is the individual agreement of technique T that E should be added to AC.

The allowable range of values for the individual agreement is between -1 and 1, where values in the range of -1 to 0 express varying degrees of disagreement and values greater than 0 are considered agreements. However, as will be discussed in the following sections, the actual range of agreement of most techniques is between 0 and 1.
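Put as code, equation (8.2) is simply a weighted average. The following minimal Python sketch assumes that each technique is supplied as a pair of a weight and an agreement function returning a value between -1 and 1; the function and parameter names are illustrative, not those of the prototype.

def total_agreement(entity, component, techniques):
    # techniques: list of (weight, agreement) pairs; agreement(entity, component)
    # returns the individual agreement of that technique, a value in [-1, 1]
    weighted = sum(w * agree(entity, component) for w, agree in techniques)
    return weighted / sum(w for w, _ in techniques)

def best_component(entity, components, techniques):
    # the entity is added to the atomic component with maximal total agreement
    return max(components, key=lambda ac: total_agreement(entity, ac, techniques))

As noted above, most of the concrete agreement functions plugged in below will return values between 0 and 1 rather than using the full range from -1 to 1.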
This is the case when the underlying clustering criterion is inherently positive, i.e., when the clustering criterion does not specify attributes to exclude entities. Because the individual agreements are normalized to yield a value between -1 and 1, total-agreement is also normalized. Normalization simplifies the application of the voting approach for the user. Adjusting the weights xT, i.e., the influence of individual techniques, is easier when the range of agreement of a technique is known to be between -1 and 1. Furthermore, the total-agreement is easier to inter- pret because it is a relative value between -1 and 1, whereas the left and right lim- its of total-agreement would be unspecified if total-agreement were not normalized. The individual agreement is specific to each technique. Before the individual agreement is defined for each technique, different applications of the voting approach are discussed. The agreement of some techniques is not applicable to all possible usage scenarios of the voting approach. The following sections will point out techniques not applicable to a certain kind of application. 8.4.2 Ways of Using the Voting Approach Before we go into detail of defining the level of agreement for each technique, we discuss some usages of the voting approach. There are basically four kinds of sce- narios in which the voting approach is helpful: • single entity assignment: one entity it to be added to existing atomic compo- nents • multiple entities assignment: several entities are to be added to existing atomic components • clustering: free entities are to be grouped to new or existing atomic compo- nents • assessment: existing atomic components are to be assessed 303 Voting Approach These scenarios will be discussed in the following. Single entity assignment. The voting approach was already motivated as an alternative to the intersection operator for determining the corresponding atomic component of a single free entity when several techniques are to be consulted. The entity is added to the atomic components for which the total-agreement of all techniques is maximal. This kind of usage of the voting procedure is useful dur- ing clustering, but a similar situation arises when the maintainer implements a new function of the system after the atomic components have been detected: The voting techniques could then be used to determine to which atomic component ¾ or module, respectively ¾ the function should be assigned (this is known as the orphan problem; Tzerpos and Holt, 1997). N.B.: Same Module cannot be polled because the new function has no module yet. Multiple entities assignment. Assignment of several entities to existing atomic components is a straightforward extension of the latter scenario. However, one has to be aware that the assignment depends on the order in which the entities are handled, i.e., the addition of an entity to an atomic component, AC, can influence the agreement for another entity with respect to AC ¾ the agreement can increase or decrease. That is why the entities should be assigned in the order of the current agreement as described by algorithm 8-10. This way, the voting procedure can be used in a clean-up phase when the atomic components have been detected to a large extent and the remaining free entities are to be added to the found atomic components. Clustering. In the previous two scenarios, entities were to be added to existing atomic components. 
When new atomic components are to be created, we can use the voting procedure as a regular clustering method by using total-agreement as defined by equation (8.2) as a similarity metric for the clustering algorithm described for Similarity Clustering. Total-agreement was defined above with respect to an entity and an atomic com- ponent. However, in the beginning, when no atomic components are present, one can treat each entity as an atomic component of its own. Combined and Incremental Techniques 304 Assessment. Techniques may yield a large number of candidate components. In an interactive approach, it is useful to rank these candidates when presenting them to the user. The user can then validate the most promising candidates first. The voting procedure can be used to rate these components. It makes sense to include the vote of the technique that has produced the candidates, too, since the tech- nique need not necessarily be entirely convinced, hence, its agreement may be below 1. Furthermore, the voting procedure can also be used to assess an existing decom- position of a system into files. It can be applied under the assumption that each file constitutes an atomic component. This way, those modules may be identified that need to be restructured and atomic component detection can be applied more goal-oriented. The assessment of the voting procedure can then also be used to detect the mavericks that cause low cohesion of some modules. N.B.: It is clear that Same Module cannot be used for an assessment of existing modules. In order to identify mavericks of a module, we can use equation (8.2) directly by iterating over the elements of an atomic component and computing the total- agreement to the chosen entity, E, with respect to its atomic component, AC, (excluding the element) and to all other atomic components. The element E is a maverick when the entity should actually be in a different atomic component, i.e., if there is an atomic component AC’ for which Algorithm 8-10. Multiple entities assignment. Input: • set S of free entities • set SAC of atomic components Algorithm while S ¹ Æ loop ascertain E Î S, AC Î SAC where "E’ÎS,AC’ÎSAC: total-agreement(E’, AC’) £ total-agreement(E, AC) add E to AC S := S \ {E} end loop; 305 Voting Approach total-agreement (E, AC’) > total-agreement (E, AC-{E}) In order to rate an atomic component as a whole, one can look at the absolute and relative number of mavericks (with respect to the number of elements in the atomic component) as well as compute the average on total-agreement over all elements in the atomic component AC: Modules with a high absolute and relative number of mavericks or a low average- agreement may represent candidates for restructuring in preventive maintenance. On the other hand, atomic component candidates with a low absolute and relative number of mavericks or a high average-agreement should be validated first in an interactive atomic component detection process. 8.4.3 Agreement of Individual Techniques The definition of total-agreement according to equation (8.2) is based on the indi- vidual agreement of the basic techniques. In this section, this agreement will be defined for the connection-based, metric-based, and graph-based techniques listed in Chapter 5. The agreement has to be defined as a function of an entity and an atomic compo- nent. It has to express the degree of certainty of the technique that the entity belongs to this atomic component. 
That is, the agreement is high if and only if the technique would add the entity to the atomic component during clustering. There- fore, the following definitions of agreement are modeled on the actual action of the techniques when they would have to cluster this element. In other words, the definitions reflect one single clustering step. In order to obey the information that has been contributed by the user, there is an overriding rule for the definitions following in the next sections: If the agreement for an entity, E, and an atomic component, AC, is to be computed, the agreement is -1 if there is an entity, E’, in AC where E and E’ are mutually exclusive. average-agreement AC( ) total-agreement E AC E{ }–,( ) E ACÎ åè øæ ö AC¤= Combined and Incremental Techniques 306 8.4.3.1 Agreement of Connection-based Techniques The principal procedure of clustering is the same for all connection-based tech- niques and can be described by the generic algorithm 8-1 on page 268. These techniques differ only in the actual specification of the generic function parameter Connected_Entities of the algorithm that yields all the connected entities to which a given entity is to be grouped. That is, the definition of agreement for all connec- tion-based techniques can be uniformly made based on the abstract function Connected_Entities. Among the elements yielded by Connected_Entities to be grouped with the entity may be bound and free entities. We are going to ignore free entities in the follow- ing definition of agreement because the question for the voting approach is to which existing atomic component a given entity belongs. However, if free entities must not be ignored, e.g., for general clustering, we can consider each free entity as an atomic component of its own. The following function is a filter for free enti- ties: Applying this filter to the result of Connected_Entities leaves the bound entities. As described by algorithm 8-1 on page 268, the incremental connection-based approaches add the entity to the enclosing atomic component of the bound enti- ties. That is, it is sufficient for an atomic component to have one single entity that is related to the entity under consideration (in the sense of the heuristic) to be a candidate for receiving this free entity. This rule is adequate for the incremental techniques since their results are intended to be presented to and validated by the user. The user is then the final judge sifting all possible alternatives. However, in this section, we have to define a finer gradation of the agreements of the tech- niques than just yes or no. The definition of agreement will, therefore, be based on the number of entities of an atomic component that are a reason for the given entity to be added to the atomic component. It is the fraction of entities of the atomic component connected to the entity under consideration, relative to all con- nected entities. More precisely, (8.3) d S( ) s s SÎ Is_Bound s( )Ù{ }= agreement E AC,( ) x x d Connected_Entities E( )( ) direct-elements AC( )ÇÎ{ } d Connected_Entities E( )( )----------------------------------------------------------------------------------------------------------------------------------------------= 307 Voting Approach where d(Connected_Entities (E)) are all bound entities to which E can be grouped. So far, the actual functions used as Connected_Entities for the specific techniques have been defined for subprograms only. However, the entity could also be a vari- able or type. 
In these cases, we simply use the relational inverse of the original function for Connected_Entities. Example. Consider the example in Figure 8-24. Connected_Entities (E) = {x1, x2, x4, x5, x6} where x1 and x2 are both free, hence, d(Connected_Entities (E)) = {x4, x5, x6}. Because AC1 has two elements to which E is connected, the agree- ment (E, AC1) is 2/3, while AC2 has only one connected entity and whose agree- ment (E, AC2) is, therefore, only 1/3. Note that the definition of agreement is only relative to the number of connected entities and does not depend on the size of the involved atomic components. However, the more elements an atomic component has, the likelier it is that an atomic component contains many connected entities. In other words, since a com- ponent with N elements cannot have more than N connected entities, smaller components are at a disadvantage. This is undesirable for clustering because it leads to few large components. The definition of agreement above was made from the perspective of the entity: It expresses the connection of the entity to the atomic component relative to all other components to which the entity could belong. For clustering, we can flip over the perspectives and define the agreement as the extent to which the entity fits into the atomic component. This can be Figure 8-24. Connected entities example. E x1 x2 x3 x7 x6 x5 x4 AC1 AC2 connected Combined and Incremental Techniques 308 achieved by defining it relatively to the number of elements of the atomic compo- nent as follows: (8.4) For clustering and assessment, we will prefer equation (8.4), for entity assign- ment, however, we will prefer (8.3). 8.4.3.2 Agreement of Metric-based Techniques For the definition of agreement for metric-based approaches, we can use their metrics. However, Delta IC is an exception. As it was already discussed in Sec- tion 5.13, Delta IC is a hybrid of strictly connection-based and metric-based approaches. Delta IC. Delta IC consists of two steps: cluster formation and cluster filtering. The metric is only used for filtering. Moreover, the metric is defined for subpro- grams only and has other disadvantages (see Section 5.7). Two other metrics, namely, Internal and External Connectivity, have been proposed to overcome the restrictions of the Delta IC metric. (The definition of agreement for Internal and External Connectivity follows below.) Due to the restrictions of Delta IC and the fact that Internal and External Connectivity are too similar to the Delta IC metric ¾ and hence, similar characteristics would enter total-agreement twice ¾ the agreement of Delta IC will be based on its primary criterion for cluster formation. The actual cluster comprises the closely related subprograms and the referred entities of a subprogram (see equation (5.8) on page 125): Therefore, given a subprogram, S, we can define the agreement of Delta IC anal- ogously to agreement of connection-based techniques as the fraction of entities of the atomic component that are also in the cluster: agreement E AC,( ) x x ACÎ x d Connected_Entities E( )( )ÎÙ{ } elements AC( )--------------------------------------------------------------------------------------------------------------= candidate-cluster (S) closely-related-subprograms (S) referred-by (S)È= agreement S AC,( ) x x ACÎ x d cluster S( )( )ÎÙ{ } d cluster S( )( )------------------------------------------------------------------------------= 309 Voting Approach However, what if the entity is not a subprogram but a type or object? 
For the con- nection-based approaches, we used the relational inverse of Connected_Entities. A similar approach can be used for Delta IC. The relational inverse of referred-by is refer-to. Closely-related-subprograms, on the other hand, was defined as fol- lows: An inverse analogon for types and objects is: A cluster for a type or an object, E, can then be defined as: Hence, we can use candidate-cluster as Connected_Entities in (8.3) and (8.4), respectively, to specify the agreement of Delta-IC. Internal and external connectivity. Internal and external connectivity are two different aspects; this is why agreement is defined for each one separately in this section rather than defining a combined agreement in terms of connectivity as defined by (5.15) on page 137. Internal and external connectivity are defined for a given atomic component whereas the agreement is based on an atomic component and an entity. However, we can use them for the definition of agreement for internal and external connec- tivity, respectively, by assuming the given entity were part of the atomic compo- nent and measure the difference of connectivity with and without the entity as follows: (8.5) closely-related-subprograms S( ) = {F F refer-to e( )Î referred-by F( ) referred-by S( )}ÍÙ e referred-by S( )Î È closely-related-entities E( ) = {e e referred-by s( )Î refer-to e( ) refer-to E( )}ÍÙ s refer-to E( )Î È candidate-cluster E( ) closely-related-entities E( ) refer-to E( )È= agreement E A,( ) D= IntC E A,( ) IntC A E{ }È( ) IntC A( )–= Combined and Incremental Techniques 310 (8.6) Equation (8.5) is used for the agreement based on internal connectivity and equa- tion (8.6) is used for the agreement based on external connectivity. Note the two different orders in the differences (8.5) and (8.6). The definition of DIntC is aimed at rewarding an increase in internal connectivity, whereas the definition of DExtC promotes a decrease of external connectivity. Furthermore, the value of internal and external connectivity as defined by (5.13) and (5.14) on page 137, respec- tively, is between 0 and 1. Hence, the differences (8.5) and (8.6) are in the range of -1 and 1. Type-Based Cohesion, Schwanke’s Arch Approach, and Similarity Cluster- ing. The agreement of the similarity-based techniques Type-based Cohesion, Schwanke’s Arch approach, and Similarity Clustering can be based on the under- lying metric of these techniques. In the case of Schwanke’s Arch approach, and Similarity Clustering, the group similarities as defined by (5.18) on page 141 and (7.2) on page 188, respectively, can be used treating the entity to be compared to the component as a group of its own. This compares to a real clustering step when the elements of A have been grouped together and the similarity of A and the group that contains the single entity is to be re-computed. More precisely, given a component, A, and an entity, E, the agreement can be defined as follows: The group similarity for Type-based Cohesion as defined by (5.20) on page 146 cannot be used to define the agreement of Type-based Cohesion because the group similarity unites the two groups to be compared and computes the average similarity over all pairs of the union. Hence, the similarity among the elements that were already in component A to which the entity E is to be compared would also contribute to the agreement. 
As a consequence, if this group similarity were used and the component A as such has already a high agreement by Type-based Cohesion, the resulting agreement of Type-based Cohesion for A È {E} would also still be high even if E has nothing to do with A. That is to say, the agreement should measure the similarity of E to all other elements in A but not the similarity among elements of A. In the course of this thesis, two alternative group similari- ties have been introduced. Schwanke proposed to use the maximal similarity between the two groups whereas Similarity Clustering uses the average similarity agreement E A,( ) D= ExtC E A,( ) ExtC A( ) ExtC A E{ }È( )–= agreement E A,( ) GSim direct-elements A( ) E{ },( )= 311 Voting Approach between elements of different groups. Both of these group similarities can also be used for the agreement of Type-based Cohesion. However, because the group similarity based on the maximal individual similarities was less effective in prac- tice, the average group similarity is used instead. Hence, the following definition of agreement for Type-based Cohesion is used (the formula is simplified leverag- ing that {E} contains only one element): Sim is the similarity between two base entities specific to Type-based Cohesion and is defined by (5.19) on page 146. 8.4.3.3 Agreement of Graph-Based Techniques Cycles in the call view are rare and most constituents of an atomic component are not dominated by the atomic component. Moreover, the atomic component must be known to a large extent before Dominance Analysis can be applied in the first place. Therefore, Strongly Connected Component Analysis is commonly used only in the beginning of the detection process whereas Dominance Analysis is preferably used at the end of the process. That is, these two graph-based tech- niques play a minor role in the actual clustering process (other than in a prepara- tion and a clean-up phase) in which the voting approach has its place. Nevertheless, their agreement may be an additional piece of information and is, therefore, defined here. As discussed in Section 8.3.1.5, Strongly Connected Components Analysis clus- ters all entities of a cycle. Technically, this is achieved by using the generic algo- rithm 8-1 on page 268 for connection-based approaches where the entities in a cycle are returned by the actual parameter for Connected_Entities. Likewise, we can reduce the definition of the agreement of Strongly Connected Components Analysis to the agreement for connection-based approaches as defined by equa- tion (8.3) on page 306 for entity assignment or equation (8.4) when assessment or clustering is required. Function Connected_Entities therein is the set of entities within a cycle as defined by (8.1) on page 277. Consider Figure 8-25 as an exam- ple. Given the entity, E, for which the agreement is to be ascertained, the cycle containing E consists of a, b, c, and E. According to the definition (8.1) of agreement E A,( ) 1 elements A( ) -------------------------------- Sim E e,( ) e elements A( )Î å´= Combined and Incremental Techniques 312 Connected_Entities, the connected entities of E are a, b, and c. Entity a is part of A1 and the other two entities b and c are part of A2. Thus, agreement (E, A1) = 1/3 and agreement (E, A2) = 2/3 according to (8.3). Dominance Analysis adds an entity to its primarily dominating atomic component (see Section 5.12). If the primarily dominating atomic component exists, it is always unique. 
Hence, as opposed to the other approaches, there is maximally one atomic component to which the entity is added. Therefore, the agreement of Dom- inance Analysis is binary, either 1 if the given atomic component is the primarily dominating atomic component or 0 otherwise. 8.5 Summary In this section, high-level operators were introduced that allow manifold combi- nations of the basic techniques in a flexible way. These operators are preferred to a technical integration of the diverse heuristics since integration of new tech- niques in the operator framework is straightforward. These operators are espe- cially suited for combinations triggered by a user and are, therefore, used in the semi-automatic method presented in the next chapter. In order to support compo- sitions of techniques, the basic techniques have been extended to work incremen- tally. This section also proposed a voting approach that allows to combine the heuris- tics on the basis of their agreement overcoming the shortcomings of the intersec- tion operator. Though the intersection operator could principally also be used to establish agreement among techniques, it is too strict when more than two heuris- Figure 8-25. Example for Strongly Connected Components. A1 a A2 b cE connected entities call part-of 313 Summary tics are to be combined. The voting approach can be used for entity assignment, clustering, and assessment. When a new technique is added to the atomic compo- nent detection framework, its heuristic must be cast into a metric that reflects the agreement of the technique that a given entity belongs to a certain atomic compo- nent. This agreement is a value between -1 and 1. The basic techniques introduced in Chapter 5 and Chapter 7 were evaluated in Chapter 6 and Section 7.6.2.3, respectively. The combinations of the basic tech- niques as described in this chapter are not evaluated with respect to the evaluation framework of Chapter 6 because first, there is an infinite number of possible com- binations and, second, these ways of combinations are primarily thought as being used interactively in order to support the semi-automatic method described in the next chapter. Chapter 10 describes experiments conducted to evaluate the semi- automatic method that will also partly allow conclusions for the combined tech- niques. Combined and Incremental Techniques 314 315 Chapter 9 A Semi-Automatic Method to Detect Components The techniques introduced in Chapter 5 are all fully automatic which is desirable especially for large systems. However, their evaluation revealed that none of them has the detection quality that compares to human detection. There are basically two ways to improve the detection quality process. We can search for more sophisticated techniques or include the user in the detection process. The purpose of this thesis was to see to which extent structural information can be leveraged. Future research toward using data flow information and domain knowledge may produce more powerful techniques. However, even then, the user will remain the final judge. Due to the complexity, vagueness, and to some degree subjectivity, it is questionable whether we can ever find precise techniques that fit all cases. Therefore, atomic component recovery is a problem that has to be tackled in con- cert with a maintainer at any rate. This chapter describes a method in which human and computer interact to detect atomic components. It depicts how the process can be split into tasks assigned to either the computer or the maintainer. 
The outcome of each task carried out by one of the two partners, human or computer, is used for the other partner’s next task. A Semi-Automatic Method to Detect Components 316 9.1 Method Overview This section presents an overview of the method with the main tasks of both com- puter and human and the way and order of their interaction. A detailed description of the individual parts will follow after this overview. Figure 9-1 contains the main steps of the method. The inner cycle, consisting of analysis application, metric ranking, presentation, and bookkeeping of detected atomic components, is the core of the detection process. The user controls the detection process by selecting analyses and metrics and by validating the candi- dates proposed by the automatic techniques. The task of the computer comprises the automatic analyses, computation of the metrics for the proposed candidates, presentation of the results, and bookkeeping of the user decisions. The base view contains the base entities and their relationships needed for com- ponent detection and is automatically derived from source code. The so-called user view logs the information contributed by the user. It records the atomic com- ponents that have been detected and confirmed so far. In the beginning, when no atomic component is known, the user view is empty. The user selects an analysis that is to be applied. The analysis takes into consideration the components that were confirmed by the user (in the first iteration there are none). Thus, the analy- ses are applied incrementally. Chapter 8 already discussed how the analyses intro- duced in Chapter 5 can be modified to work incrementally by clustering only those base entities that have not been clustered before and by forming new atomic Figure 9-1. Semi-automatic method for atomic component detection. user view catalog memory computer task human task analysis selection metric selection metric adjustmentvalidation analysis application and combinationanalyses metricsmetric ranking presentation & acceptance flow of data flow of control consideration base view 317 Method Overview components or adding free base entities to existing atomic components. Gener- ally, the techniques propose many candidates. The user should not be swamped with all of them. Instead, the candidates should be presented in their presumed quality. Metric ranking is supported by letting the user select and adjust certain metrics. After the candidates have been ranked by user-selected metrics, the can- didates are presented to the user for acceptance. The presentation is a crucial and non-trivial task. It must be in such a way that the user’s validation can be as quick as possible. Additional information the user may need has to be provided on demand. For example, the maintainer will probably also want to inspect the source code. The user validates the candidates and those atomic components he or she accepts enter the atomic component memory, i.e., the user view. In each iteration, the user selects and combines different analyses to find compo- nents that could not be found by previous analyses. The process ends when the found components are sufficient for the task at hand or no further component can be found anymore. The user does not have to select, apply, and validate only one analysis at a time. Instead, several analyses can be selected and applied in paral- lel. Then, the intersection, union, and differences of these analyses can automati- cally be ascertained and the user can investigate and validate these. 
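One pass through this inner cycle can be sketched as follows. The parameters analyses, rank, and validate stand for the selected incremental analyses, the metric ranking, and the maintainer's accept/reject decision; they are placeholders for illustration, not actual interfaces of the prototype.

def detection_cycle(base_view, user_view, analyses, rank, validate):
    candidates = []
    for analysis in analyses:
        # each analysis works incrementally, i.e., it respects the components
        # already confirmed in the user view
        candidates.extend(analysis(base_view, user_view))
    # rank the proposed candidates and let the maintainer decide; accepted
    # candidates enter the user view and are not clustered again
    for candidate in sorted(candidates, key=rank, reverse=True):
        if validate(candidate):
            user_view.append(candidate)
    return user_view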
Particular large candidates of some techniques can be refined by applying other techniques to these individually. Because the typical maintenance task usually does not require to find all atomic components of a system but only a few relevant ones in a specific part of the sys- tem, the domain of search can be restricted to certain modules. We are using Rigi for presentation and interaction with the user (Müller, 1994). Rigi is a customizable graph editor developed to support reverse engineering and offers many useful capabilities such as: • support for annotateable nodes and edges of different types • hierarchical nodes and views • direct linkage to the correspondent source code by clicking on nodes • automatic layout and context-preserving browsing capabilities • filter and selection mechanisms • Rigi command language for customizations A Semi-Automatic Method to Detect Components 318 Rigi was extended in many directions to adapt it to our needs. The adaptations were opportunistic; not everything what might have been useful could be worked into Rigi, e.g., an undo mechanism would have been helpful. But all of our major requirements were more or less easy to fulfill with Rigi. The following sections go into more detail of the individual steps of the method. 9.2 Analysis Selection and Application Currently, all connection-based, metric-based, and graph-based techniques listed in Chapter 5 and Chapter 7 are supported. The user selects a subset of the basic analyses and specifies the order in which they should be applied. This is neces- sary since for many basic analyses, the order of composition is relevant. The analysis takes into account the atomic components that have been detected so far, i.e., that have been confirmed by the user. Chapter 8 anticipated what kind of information the user can add: Positive information expresses that a base entity (variable, type, or subprogram) belongs to a given atomic component. Negative information conveys that two entities do not belong together. Every analysis must preserve all positive information, that is to say, an analysis may only add to the components a user has confirmed and never remove any of their elements, and likewise, an analysis must not cluster entities that were not supposed to be grouped together. The metric-based approaches use a metric to measure the similarity of entities that are to be grouped together. These metrics can be used for both the assessment and clustering of candidates. That is why the analysis application in Figure 9-1 is not only controlled by the selected analysis but also by the used metrics and their actual parameter settings. The analyses can be selected, combined, and started from within Rigi by means of list boxes and menus. It was important to us that the selection and combination is easy to do with simple mouse clicks such that the user need not learn a complex language. The result of an analysis is represented by a single hierarchical analysis 319 Metric Selection, Adjustment, and Ranking node that is the root of the actual candidates. This makes it possible to further process the results of an analysis by direct manipulation. For example, the differ- ence to the currently accepted atomic components can be shown, it can be inter- sected or united with the results of other analyses (deep intersection, deep union), or the next kind of analysis can be applied to it (composition). For the interactive approach, a variant of the composition operator is useful that can be applied to individual components. 
The so-called individual composition is a form of com- position in which only the elements of a single component may be clustered. 9.3 Metric Selection, Adjustment, and Ranking Metrics are used to assess and rank the candidates that have been proposed by the analysis. There is a catalog of metrics that the user can choose of. The catalog comprises the metrics of the metric-based approaches and metrics that express the underlying heuristics of the other non-metric-based techniques as described in Section 8.4 for the voting approach. Established intra-modular and inter-modular metrics, such as number of lines of code, McCabe or Shepperd complexity (Fen- ton and Pfleeger, 1997), could also be integrated to measure additional aspects of the candidates but were not implemented for the current prototype. The metric used for ranking is a composite metric that is the normalized weighted sum over the individual metrics of the techniques. The composite metric is used to guide the user through the large set of candidates. The metric is computed once and then a threshold is used to control the presenta- tion. All candidates above the threshold come to the fore. The next section will discuss how this can actually be done. Independent of the way of presentation, we would start with a high threshold that is decreased step by step. In each step, the candidates above the chosen threshold are validated by the maintainer. Once the elements of a candidate have been accepted, they are not clustered again since the user has already decided where they belong to. Some metrics have parameters that need to be adjusted. Altogether, there are, hence, three dimensions of variability: • the influence factors for the basic metrics within the composite metric, A Semi-Automatic Method to Detect Components 320 • the inherent parameters of the basic metrics, • and the filtering threshold. These parameters can be adjusted by the user and the presentation updated accordingly. Several distinct metric settings can be tried without need to rerun the automatic clustering; only the metrics have to be re-computed. 9.4 Presentation, Validation, and Acceptance For the presentation of the results of automatic techniques, we are using the means offered by Rigi, mainly flat and hierarchical perspectives of the component decomposition. Base entities and components are expressed as nodes, their rela- tionships as edges. We are using a special node type, so-called analysis node, to represent the results of an analysis. Introducing an analysis node gives the user a handle for direct manipulation of analysis results. For example, the user can select two analysis nodes and apply the intersection operator to them. The analysis node can be unfolded. Then, the actual subsystem and atomic com- ponent candidates are shown. The user can browse these candidates by clicking on the nodes or viewing the node hierarchy as a whole. The node hierarchy is especially interesting when the results of Similarity Clustering or Type-Based Cohesion are viewed. These two clustering approaches return a tree that indicates the order in which elements were grouped together and, therefore, immediately show what is more similar and what is less. The maintainer can then “climb up the tree” starting at the leaves and stop at any inner node for which the combina- tion is doubtful. Direct validation is possible in any view. 
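A minimal sketch of this ranking scheme follows, assuming the individual metrics are already normalized to values between 0 and 1; the names and signatures are illustrative and not those of the prototype.

def composite_metric(candidate, metrics, weights):
    # metrics: dict mapping a metric name to a function of the candidate;
    # weights: dict mapping the same names to their influence factors
    total_weight = sum(weights.values())
    return sum(weights[name] * metric(candidate)
               for name, metric in metrics.items()) / total_weight

def candidates_to_present(candidates, metrics, weights, threshold):
    # rank all candidates once and bring those above the threshold to the fore
    scored = [(composite_metric(c, metrics, weights), c) for c in candidates]
    scored.sort(key=lambda vc: vc[0], reverse=True)
    return [c for value, c in scored if value >= threshold]

Lowering the threshold step by step then corresponds to re-filtering this list without rerunning the clustering itself.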
The composite metric mentioned in the last section can be used for the purpose of presentation of the candidates either to filter candidates below a threshold or to emphasized candidates above the threshold. An alternative way of presentation would be to color the nodes or to set the size of the nodes according to their met- ric value. This way, the user could see all candidates at once and yet immediately find the most promising candidates. Unfortunately, this would have been more difficult to achieve with Rigi. 321 Detection Strategy The returned candidates can be accepted or rejected individually or as a whole by direct manipulation. The atomic components can be renamed by the user to give them a meaningful name. Single base entities within candidates can be accepted or rejected. Cut and paste capabilities are available to move single or whole sets of base entities as a group at once from one atomic component to the other. Any base entity can be added to a candidate. The maintainer is also able to create her own atomic components. Everything confirmed by the user is moved in the user view. The user view, basi- cally, is represented by the same means as any other analysis view such that most commands available for analysis results are also available for the user view in a uniform manner. Only those that make no sense for the user view were excluded, such as accepting nodes or viewing the difference to the user view. 9.5 Detection Strategy The framework described in the previous sections has many degrees of freedom. Therefore, some guidelines should be given on how to use it successfully. The recommended strategy for component recovery consists of two main parts: Detec- tion of atomic components and then identifying the relationships among the atomic components. Component detection. Detection of atomic components can be done in the fol- lowing phases (each phase may consist of several iterations): 1. Apply all connection-based analyses and Strongly Connected Base Component Analysis in parallel. Use deep union, deep intersection, and composition to combine the techniques (for reasonable combinations see below). If you want to poll the agreement of more than two techniques, use the voting approach instead of the intersection (the result may be empty otherwise). Add negative information during validation when you find entities that should not be grouped together. Rationale: One gets only few promising candidates in the beginning which can form points of crystallization for the subsequent analyses. The added mutually exclusive information will break up larger candidates in the subsequent runs of the analyses. A Semi-Automatic Method to Detect Components 322 2. Apply connection-based approaches once again but one at a time. This time, they will leverage the information the user has contributed and return different clusters (containing no entities that have already been grouped and no mutually exclusive entities). Validate the non-intersected results (thus, using less strict criteria). If particular large candidates occur, refine them with other techniques by means of intersection or individual composition. If the results of one tech- nique have been validated, run the next analysis. Rationale: The crystallization points of the first step are extended and new atomic components are built that were dropped out by the intersection in the first step. Applying the techniques successively guarantees that all validation information is respected by the analyses. 3. 
The metric-based approaches are associated with parameters whose values may not be known in advance. Fortunately, the previous two steps lead to a set of components that can be used to calibrate the metric-based approaches. Varying this calibration reveals further clusters. In the case of Similarity Clustering, the parameters should be set specific to one kind of atomic component. For example, if abstract data types are to be detected, one gives signature and local-obj-of-type relationships more and variable reference relationships less weight. If a hierarchical clustering is used, one starts the validation of the results at the leaves and then climbs up the tree toward the root until a metric value is reached that does not indicate sufficient confidence in the candidate anymore.
Rationale: The connection-based techniques are based on fixed patterns and, therefore, will always yield the same candidates. The metric-based techniques allow more variability by changing their parameters.

4. Finally, Dominance Analysis can be applied to find local utility functions of atomic components.
Rationale: The dominance analysis for atomic components can only be applied when the atomic components exist in the first place. Hence, it can only be used late in the detection process. Dominance Analysis will detect the local entities that might not be detectable by other approaches. For example, a subprogram that provides a special service for one abstract data type only need not have any recognizable relation to the type (signature relation or a local variable of this type) other than being called by the functions of this abstract data type. Therefore, none of the connection-based techniques will cluster it to the abstract data type since they all ignore calls. Likewise, the chances that Similarity Clustering finds it are low since the calling relationship alone is usually not very significant.

Combining components views. In the first phase of the method above, the diverse techniques are combined by means of the combining operators described in Section 8.3, namely, restriction, composition, deep intersection, and deep union.

It is often helpful to restrict the search to one kind of atomic component at a time because the search criteria are mostly different. For this reason, it is possible to restrict all the analyses to a particular set of entity types. For example, if one searches for abstract data types, global variables can be ignored. If a technique was restricted to certain entity types, entities of other types may be left. Then, another technique can be applied in a composition to the result of the previous technique in order to cluster free entities. Composition can also be used to refine the results of possibly too large components of one technique by another technique.

When the intersection is applied to components views generated by techniques that propose very distinct atomic components, the resulting components view is likely to be empty. This is, in particular, the case when techniques are combined that consider different kinds of entities. By a look at Table 9-1, which summarizes what kinds of base entities are considered by the respective base technique, one can quickly decide which intersections make sense. A check mark (✓) in Table 9-1 means that a certain kind of base entity is considered, a dash (–) means that it is not considered. A sensible intersection can be expected of those techniques that consider a common set of base entity kinds.

Table 9-1. Domains of the basic techniques.

Technique                         Subprograms   Variables   Types
Global Variable Reference         ✓             ✓           –
Same Module                       ✓             ✓           ✓
Part Type                         ✓             –           ✓
Same Expression                   ✓             ✓           –
Internal Access                   ✓             ✓           ✓
Delta IC                          ✓             ✓           –
Similarity Clustering             ✓             ✓           ✓
Type-Based Cohesion               ✓             –           –
Strongly Connected Components     ✓             –           –
Dominance Analysis                ✓             ✓           –

Interestingly enough, even if the results of techniques that are not compatible in the sense of Table 9-1 are intersected, the results need not necessarily be empty. All techniques that have been introduced so far consider at least subprograms. So, even when Delta IC, which considers variables and subprograms, is intersected with Part Type, which considers types and subprograms, a few groups of subprograms may remain. The result may, however, not be very useful because the reason why the subprograms were brought together is not clear anymore.

The deep union operator is useful to join together the results of two different techniques for further processing by a third analysis. The union operator may produce overlapping components (the intersection and composition do not yield overlapping atomic components other than those already produced by the applied techniques). Overlapping candidates are a problem when presented to a maintainer for validation because all overlapping candidates have to be investigated to decide where a given entity (in the overlapping part) belongs. In the case of non-overlapping components, the maintainer can simply accept or reject the entity at hand. However, this is only a problem when the final result contains overlapping components. For intermediate results during combination, overlapping components are useful. This way, several alternative candidates can be investigated in parallel until a decision is made in the course of combination.

Detection of relations among components. Once the atomic components have been detected, their relationships can be analyzed by applying Strongly Connected Component Analysis and Dominance Analysis to the base view in which the atomic components are collapsed (see Section 8.3.1.5). Strongly Connected Component Analysis yields sets of atomic components that mutually depend on each other, and Dominance Analysis reveals whether atomic components are local to each other.

The recovered components are documented by the maintainer and saved for future maintenance. They can be used to explain the system at a higher level of abstraction above the code level and are candidates for reuse. Ideally, the module decomposition of the system will be restructured to conform to the atomic component structure, i.e., a module contains exactly one atomic component. If this is not immediately possible, the programmer should regard the system at the atomic components views rather than at the module view (see Table 3-5 on page 68). The source location of each base entity can be used as a mapping from the components view to the module view.

If the system is changed during further maintenance, the captured decomposition of the system into components has to be adapted. If entities are removed from a component, the voting approach can be used to analyze whether the system should be restructured. If an entity is added, the voting approach can be used to give hints to which atomic component or module, respectively, the entity should be assigned (see Section 8.4).
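Referring back to Table 9-1, the compatibility check for intersections can be sketched as a simple set intersection over the techniques' domains. The dictionary below merely restates the table; the function name is illustrative.

# Entity kinds considered by each basic technique, as listed in Table 9-1.
DOMAINS = {
    "Global Variable Reference":     {"subprogram", "variable"},
    "Same Module":                   {"subprogram", "variable", "type"},
    "Part Type":                     {"subprogram", "type"},
    "Same Expression":               {"subprogram", "variable"},
    "Internal Access":               {"subprogram", "variable", "type"},
    "Delta IC":                      {"subprogram", "variable"},
    "Similarity Clustering":         {"subprogram", "variable", "type"},
    "Type-Based Cohesion":           {"subprogram"},
    "Strongly Connected Components": {"subprogram"},
    "Dominance Analysis":            {"subprogram", "variable"},
}

def common_domain(technique_a, technique_b):
    # the larger the common set of considered entity kinds, the more sensible
    # an intersection of the two techniques' results
    return DOMAINS[technique_a] & DOMAINS[technique_b]

For instance, common_domain("Delta IC", "Part Type") yields only subprograms, which matches the observation above that intersecting these two techniques leaves at most a few groups of subprograms.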
9.6 Extensibility of the Framework The framework supporting the semi-automatic method can be extended in many ways: • The way how combination of techniques is organized allows for quick addition of new analyses. • Further metrics can be brought in to rank the candidates, in particular, tradi- tional intra-modular software engineering metrics, such as lines of codes, com- plexity and information flow metrics. • The way how results are presented to the user can be changed independently from the analyses. Coloring the candidates according to their metric ranking is just one example. • Multiple user views could be enabled instead of the view of one single main- tainer in order to collect the components for large systems by the joint work of many maintainers. From the point of view of one maintainer, the user view of other maintainers is like any other analysis. The combining operators can then be used to reach a consensus among the maintainers or to analyze divergences. A Semi-Automatic Method to Detect Components 326 Or the different views may indeed be used as different perspectives on the sys- tems for different programmer groups. The method could also be applied with no automatic analyses at all. One could, for example, start with the actual module decomposition and restructure the mod- ules by hand without any proposal by automatic techniques. The framework would then only be used to keep track of the manual findings and as a convenient cross-reference tool. Whether the automatic analyses are helpful at all is yet an open question. The next chapter reports on an experiment conducted to look into this. 327 Chapter 10 Experiments to Evaluate the Semi- Automatic Method This chapter describes an experiment conducted to empirically evaluate the impact of the automatic techniques for atomic component detection within the semi-automatic method and a case study to see whether our tools for atomic com- ponent detection may also help maintainers in other typical maintenance activi- ties. 10.1 Goals of the Experiments In Chapter 6, the detection quality for atomic component recovery of diverse techniques was evaluated. The results indicate that none of these techniques reaches the human recall rate. However, the techniques did find many relevant atomic components very quickly and, therefore, can support the maintainer. Chapter 9 proposed a method in which maintainer and computer work hand in hand to find atomic components. The automatic techniques are to support the user in the detection process. However, it is not per se clear that a maintainer using the automatic techniques is faster than one without the analyses. For this reason, I conducted an experiment to examine the aid of automatic techniques. The framework described in the last chapter is primarily aimed at atomic compo- nent detection. Finding atomic components is a prerequisite for migration of a legacy procedural system to an object-oriented system, it supports identification of reusable components, and may help in program understanding by providing a more abstract view of the system above the code level. Beyond component recov- Experiments to Evaluate the Semi-Automatic Method 328 ery, the prototype provides the maintainer also with many other base facilities that may support more typical maintenance tasks. Among these are cross-referencing, search mechanisms, automatic derivation of exact interfaces, and so forth. 
In order to gather experience on how useful they actually are and what extensions would be necessary for a broader support of maintenance, we performed a case study in which students were asked to perform typical maintenance tasks with or without the prototype.

There were several constraints for the experiments that I want to point out before the experiments are described in the following sections. Only very few companies are willing to spend time and money on an experiment whose outcome is unclear. Since this was our first experiment, we could not approach industry by showing previous success stories. Therefore, I decided to conduct the experiment with students. Unfortunately, it was very difficult to find even students for this task. Only a few students volunteered. Due to their limited available time, the experiments including training and discussion could not exceed 20 hours. For this reason, it was not affordable to analyze more than one system per experiment since systems of a realistic size were to be used and the experimental subjects should have enough time to do more than a superficial analysis.

Since the experimental subjects were students, since their number was low, and also because only one system per experiment was analyzed, we refrain from generalizing the results too far. The general objective of these experiments was not to yield a definite empirical proof for the usefulness of the semi-automatic method for all kinds of systems and settings but to learn about the strengths and weaknesses of the method and to investigate where further research should be directed. Furthermore, the detailed description of the experimental layout of this pilot study and its statistical analysis in the following sections can be used to repeat the experiment in an industrial setting and with a larger number of experimental subjects for other systems.

10.2 Experimental Subjects

Nine students volunteered for the experiment. Since we chose a two-block design, we asked the system administrator of our department to participate in order to get equal group sizes. The system administrator was not involved in programming for our project. At the time of the experiment, the students were studying computer science at the University of Stuttgart (six at the graduate level, three at the undergraduate level). All of them had at least two years of programming experience and were familiar with the programming language C. See Table 10-1 for their individual profile.

Table 10-1. Profile of the experimental subjects.

experimental subject    # semester      programming experience
S1                      11              good
S2                      9               good
S3                      9               good
S4                      3               good
S5                      5               average
S6                      9               good
S7                      3               good
S8                      3               average
S9                      9               good
S10                     professional    good

10.3 Experiment to Evaluate the Semi-Automatic Method

The experiment described in this section addresses the impact of the automatic techniques within the interactive and incremental method.

10.3.1 Hypotheses

The general hypothesis is that the semi-automatic method as described in Chapter 9 yields more atomic components than a pure manual process. By manual search, we mean that only the cross-reference capabilities of our atomic component detection framework as well as common textual pattern matching tools, such as grep, are used, whereas for the semi-automatic method, all automatic analyses are available.
The independent variable is therefore:
• semi-automatic method versus manual search
The dependent variable is:
• the recall of atomic components
The null hypothesis and the two-sided and single-sided alternative hypotheses are as follows:
Null hypothesis H0: There is no difference in the recall rate between the semi-automatic and the manual method.
Alternative hypothesis H1: The recall rates for the semi-automatic method and the manual search differ (for two-sided tests).
Alternative hypothesis H2: The recall rate for the semi-automatic method is greater than the one for manual search (for single-sided tests).

10.3.2 Experimental Materials

The task of the experimental subjects was to recover the atomic components of Mosaic (see Table 6-1 on page 154). In order to obtain comparable results, we reduced the possible search space for atomic components to a size that could be handled within the given time frame, i.e., all experimental subjects should be able to look at all source files within the available time. Therefore, we excluded the files that are mainly devoted to the graphical user interface, namely, all files whose names begin with the prefix gui. The 8 excluded files comprise 15 KLOC, i.e., 40 files consisting of 37 KLOC were to be analyzed. None of the experimental subjects was familiar with the implementation of the system.

Mosaic was selected for several reasons. First, the students were all acquainted with the application domain of web browsers. Second, its implementation involved several programmers, so that different programming styles could be expected. Third, Mosaic is used in many other reverse engineering studies, which makes the results more comparable to other approaches.

10.3.3 Tool Support

The tools used for the experiment were as follows:
• standard tools: the editors emacs and vi, the Gnu C compiler gcc, and grep (the Gnu tool for pattern matching based on regular expressions); these tools are widely used on Unix platforms and can be considered standard tools for maintainers (all participants were familiar with these tools before the experiment)
• plain Rigi: the graph editor Rigi without the analyses, used as a graphical cross-reference tool and to capture the results of the search; plain Rigi can be considered representative of source code browsing tools; even though similar tools of this type are available on the market (mostly for object-oriented systems, primarily to browse the inheritance hierarchy, and often only text-based), these tools are rarely available to the typical maintainer
• extended Rigi: the graph editor Rigi with an integration of the analyses described in Chapters 5, 8, and 9

Limitations of extended Rigi. At the time of the experiment, the extended Rigi had some limitations. Some of the ideas presented in previous chapters were only inspired by the experiment. The restrictions were as follows:
• Among the combining set operators, only deep intersection was available. Composition was implemented in the form of incremental analyses but not as an operator that could be applied to individual components.
• Neither was the voting approach implemented.
• Similarity Clustering could not be run in parallel and, therefore, one had to wait up to ten minutes for the results (for this reason, it was rarely used by the experimental subjects during the experiment, which was limited to six hours).
• Moreover, it was not possible to accept components in the hierarchical view returned by Similarity Clustering. The hierarchical view had to be flattened before components could be accepted. Hence, an advantage of Similarity Clustering could not be leveraged.
• There were also problems with Rigi's update strategy (both plain and extended Rigi) for visualized component views: whenever a component was accepted, the layout of all windows was lost and people had to re-arrange their component views.
• Furthermore, hierarchical subsystems were not supported. Interestingly enough, people in both groups complained that there was no higher grouping mechanism on top of atomic components that would have allowed them to group related atomic components. This shortcoming motivated the generalization of the combining operators from atomic components to subsystems as described in Chapter 8.

10.3.4 Experimental Design

The experimental subjects were randomly assigned to two groups that differed in the tools available for the search for atomic components:
• Group SAM (semi-automatic method): extended Rigi
• Group MS (manual search): standard tools and plain Rigi
Table 10-2 shows how the experimental subjects were randomly assigned to the two groups. The size of each group was five persons.

Table 10-2. Experimental groups.

  group   experimental subjects
  SAM     S1, S3, S5, S6, S8
  MS      S2, S4, S7, S9, S10

In order to avoid too much variance in the set of experimental subjects, all experimental subjects were jointly trained as follows:
1. The diverse kinds of atomic components that were to be detected were explained and exemplified. (30 minutes)
2. The available tools (emacs, vi, grep, and plain and extended Rigi) were introduced. The available analyses in the extended Rigi were introduced even to members of group MS, though they would not use the extended Rigi. This was done to avoid giving the SAM group a head start, since introducing the analyses (which was necessary to use them at all) means teaching the heuristics that are associated with the analyses. (1 hour)
3. The experimental subjects were trained on an example system (the Unix spreadsheet calculator sc with about 10 KLOC). The subjects had to detect as many atomic components as possible within a fixed period of time. The available tools for this task were emacs, vi, grep, and plain Rigi for all trainees. Members of group SAM could also use the extended Rigi. (3 hours)
4. The result of the training was jointly discussed in order to achieve an agreement on the notion of atomic components among the subjects. This discussion revealed a consensus about the general notion of atomic components among the experimental subjects. (1 hour)
In the actual experiment after the training, every subject had to analyze Mosaic for 6 hours. The system was already preprocessed, i.e., the resource usage graph for the analyzed system was available to the experimental subjects.

10.3.5 Measurement of the Dependent Variable

Since the number of atomic components in Mosaic was not known in advance, a termination criterion for the search for atomic components did not exist. That is why we stopped the search after 6 hours. Limiting the available time makes the experiment even more realistic since, in an industrial setting, one can generally not afford to spend unrestricted time on a problem.

Two distinct ways of measuring the dependent variable were chosen:
1. using the absolute number of clustered elements for each subject (individual absolute recall, IAR for short)
2. comparing the components of each individual to the joint set of components of all individuals (reference corpus recall, RCR for short)
The first alternative does not require agreement among the experimental subjects and hence measures only how many elements were clustered in a given time by each individual.

For the second alternative, the individual results were joined and the result of each individual was judged with respect to the joined result. The joined list of components was individually reviewed by the experimental subjects. The review work was distributed among the subjects so that each atomic component was reviewed by at least two persons. In order to reduce the effort for the experimental subjects in reviewing the reference corpus, each subject had to review only a part of the reference corpus. The reviewing subjects could accept components as a whole or in parts as well as add elements to the components. Overlapping atomic components were allowed. The set of accepted atomic components formed the reference corpus to which the proposed atomic components of each subject were compared. The comparison followed the method described in Section 6.2. The dependent variable was measured as the recall rate defined in Section 6.2.2 with respect to the reference corpus.

Because not every individual reviewed the whole reference corpus, it may have happened that people would not agree with certain parts. That is why both the reference corpus recall and the individual absolute recall will be evaluated in the following.

10.3.6 Experimental Results

The recall rate of each experimental subject with respect to the reference corpus and the individual absolute recall are listed in Table 10-3. The numbers are listed in descending order; for reasons of anonymity, this order does not correspond to the order in which the experimental subjects are listed in Table 10-1. Both the reference corpus recall and the individual absolute recall are approximately the same for both groups.

Table 10-3. Results for Mosaic (where $\bar{x}_{*i} = (\sum_i x_{*i}) / 5$).

           x*1    x*2    x*3    x*4    x*5    Σ x*i    x̄*i
  RCR SAM  0.42   0.28   0.28   0.20   0.16   1.34     0.27
      MS   0.48   0.35   0.27   0.24   0.12   1.46     0.29
  IAR SAM  433    400    290    204    248    1575     315
      MS   498    275    275    326    150    1524     305

10.3.7 Statistical Analysis

The common way to evaluate statistical data of controlled experiments is to apply analysis of variance (ANOVA). The F statistic, for example, may be used to test the hypothesis that the population means of the two groups are equal (Winer et al., 1991). However, the F statistic and other statistical tests of ANOVA assume a certain distribution of the population or themselves approach a normal distribution only for large samples, and a normal distribution cannot be assumed for our experiment. There have not yet been any large-scale experiments on the recall rate of programmers in finding atomic components and, hence, the actual distribution is unknown. Furthermore, the size of our sample is too small to evaluate it with the F statistic. There are other statistics, so-called non-parameterized statistics, that do not assume any distribution and are applicable to small samples. The power of these tests is generally better than the power of parameterized tests.
According to Lienert (1973), there are basically two kinds of statistics appropriate for the design chosen for this experiment: the exact U-test by Mann and Whitney (1947) and the exact Fisher-Pitman randomization test for two independent samples (Pitman, 1939). These two methods differ in the scaling information of the data that they leverage. The exact U-test assumes data at an ordinal scale, i.e., the data can only be compared in terms of a greater/lesser relationship, whereas the exact Fisher-Pitman test is based on interval information. Since the recall rate is actually at an interval scale, the Fisher-Pitman test seems to be the appropriate test. However, it assumes that the samples are an exact image of the whole population, which cannot really be justified since we dealt almost exclusively with students. Therefore, both tests are used for the evaluation.

10.3.7.1 Exact U-Test

The exact U-test consists of the following steps (Mann and Whitney, 1947; Lienert, 1973):
1. The data of both groups are united and ordered.
2. Each value of SAM is compared to all values of MS. Let $G_i$ be the number of elements of MS that are greater than element $i$ of SAM, and $L_i$ be the number of elements of MS that are smaller than $i$.
3. Sum up these numbers: $G = \sum_{i \in SAM} G_i$ and $L = \sum_{i \in SAM} L_i$. The smaller figure is the observed U value.
4. The expected value of U under the null hypothesis is $\mu_U = (N_{SAM} \cdot N_{MS})/2$, where $N_{SAM}$ is the size of group SAM and $N_{MS}$ is the size of group MS. In our experiment, $N_{SAM} = N_{MS} = 5$. The more U differs from $\mu_U$, the less likely it is that the null hypothesis holds.
5. The likelihood of getting the observed U value or a smaller value is determined by the number $Z_U$ of those combinations, out of the $\binom{N_{SAM}+N_{MS}}{N_{SAM}} = \binom{N_{SAM}+N_{MS}}{N_{MS}}$ possible permutations of SAM and MS recall rates, that yield a U value not greater than the observed U value. For a single-sided test:
   $P = Z_U \,/\, \binom{N_{SAM}+N_{MS}}{N_{SAM}}$
   and for a two-sided test:
   $P = (Z_U + Z_{U-1}) \,/\, \binom{N_{SAM}+N_{MS}}{N_{SAM}}$
6. Reject the null hypothesis if P is less than a certain threshold.

The united and ordered data of Table 10-3 are listed in Table 10-4 and Table 10-5.

Table 10-4. Ordered reference corpus recall rates for SAM and MS (Gi and Li are given below the SAM elements).

  rank    1     2     3     4     5     6     7     8     9     10
  recall  0.12  0.16  0.20  0.24  0.27  0.28  0.28  0.35  0.42  0.48
  group   MS    SAM   SAM   MS    MS    SAM   SAM   MS    SAM   MS
  Gi            4     4                 2     2           1
  Li            1     1                 3     3           4

Table 10-5. Ordered individual absolute recall for SAM and MS (Gi and Li are given below the SAM elements).

  rank    1     2     3     4     5     6     7     8     9     10
  recall  150   204   248   275   275   290   326   400   433   498
  group   MS    SAM   SAM   MS    MS    SAM   MS    SAM   SAM   MS
  Gi            4     4                 2           1     1
  Li            1     1                 3           4     4

For the reference corpus recall, L = 12 and G = 13, hence U = L = 12; for the individual absolute recall, L = 13 and G = 12, hence U = G = 12.

For the probability P, one can either look up P in the table provided by Owen (1962, Table 11.3) or write a small program that computes $Z_U$ by evaluating all permutations of MS and SAM members. For U = 12, such a program computes $Z_U = 126$ and $Z_{U-1} = 106$, hence, for a two-sided test:
$P = (126 + 106) \,/\, \binom{5+5}{5} = 232/252 = 0.92$
Since we chose U = L = 12 for the reference corpus recall and U = G = 12 for the individual absolute recall, the null hypothesis holds for both measurements with a probability of 0.92. In other words, a positive effect of the automatic analyses for atomic component detection could not be shown.
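The enumeration described in step 5 can be carried out by a few lines of code. The following Python fragment is only an illustrative sketch, not the program used for this evaluation; it applies the procedure above to the reference corpus recall rates of Table 10-3, and the counts it produces can be compared with the values Z_U = 126 and Z_{U-1} = 106 reported above (exact agreement depends on how ties are handled).

```python
from itertools import combinations

def exact_u_test(sam, ms):
    """Exact (enumeration-based) Mann-Whitney U test for two small samples.

    Returns the observed U together with the single-sided and two-sided
    probabilities obtained by enumerating all C(N_SAM + N_MS, N_SAM)
    assignments of the pooled values to the two groups.
    """
    def smaller_pairs(a, b):
        # Number of pairs (x, y) with x from a, y from b, and y < x.
        return sum(1 for x in a for y in b if y < x)

    l = smaller_pairs(sam, ms)          # L: MS values smaller than SAM values
    g = smaller_pairs(ms, sam)          # G: MS values greater than SAM values
    u = min(g, l)                       # observed U (step 3)

    pooled = list(sam) + list(ms)
    n = len(sam)
    z_u = z_u_1 = total = 0
    for idx in combinations(range(len(pooled)), n):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        # Use the same kind of count that yielded the observed U.
        stat = smaller_pairs(a, b) if l <= g else smaller_pairs(b, a)
        total += 1
        z_u += stat <= u
        z_u_1 += stat <= u - 1
    return u, z_u / total, (z_u + z_u_1) / total

# Reference corpus recall rates from Table 10-3.
sam_rcr = [0.42, 0.28, 0.28, 0.20, 0.16]
ms_rcr = [0.48, 0.35, 0.27, 0.24, 0.12]
print(exact_u_test(sam_rcr, ms_rcr))
```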
10.3.7.2 Exact Fisher-Pitman Test

As opposed to the exact U-test, the exact Fisher-Pitman test leverages the interval information of the recall rate data (Pitman, 1939; Lienert, 1973). However, its assumption is that the sample is an exact image of the whole population.

The exact Fisher-Pitman test is based on the following observations: the whole sample (i.e., the union of SAM and MS) of the $N_{SAM} + N_{MS} = N$ values may be split into two single samples with $N_{SAM}$ and $N_{MS}$ values in $\binom{N}{N_{SAM}} = \binom{N}{N_{MS}}$ different ways. Each of them is equally likely with respect to the null hypothesis. If we use the difference of the locations (means or medians), i.e., the difference $D = \bar{x}_{SAM} - \bar{x}_{MS}$, as the value against which we test, then we can compute the D values for all $\binom{N}{N_{SAM}}$ two-sample combinations and determine whether the observed test value D is among the Z + z highest D values (where Z is the number of two-sample combinations whose value is greater than the observed D value and z is the number of two-sample combinations whose value equals the observed D value). If so, we reject the null hypothesis.

An equivalent but simpler evaluation strategy is as follows:
1. The sum S of the recall rates of the smaller sample is the value against which we test (in our case, both samples are of the same size):
   $S = S_{SAM} = \sum_{i=1}^{N_{SAM}} x_{SAM_i}$
   Let $S_{MS}$ be the sum of the recall rates for MS, i.e., $S_{SAM} + S_{MS} = T$. Then:
   $D = \frac{S_{SAM}}{N_{SAM}} - \frac{S_{MS}}{N_{MS}} = S_{SAM}\left(\frac{1}{N_{SAM}} + \frac{1}{N_{MS}}\right) - \frac{T}{N_{MS}} \;\;\Longleftrightarrow\;\; S = S_{SAM} = \left(D + \frac{T}{N_{MS}}\right) \cdot \frac{N_{SAM} \cdot N_{MS}}{N_{SAM} + N_{MS}}$
   Since $T/N_{MS}$ and $N_{SAM} N_{MS}/(N_{SAM} + N_{MS})$ are both constants, the test distribution of $S_{SAM}$ is functionally connected to the test distribution of D. Therefore, we can use the more easily computable distribution of $S_{SAM}$ instead of D.
2. The test of whether an observed S value is sufficient to reject the null hypothesis is as follows. First, we determine the number Z of two-sample combinations whose S value is greater than the observed S value and the number z of two-sample combinations whose S value equals the observed S value. Then, for a single-sided alternative hypothesis, the probability of the null hypothesis is:
   $P = (Z + z) \,/\, \binom{N}{N_{SAM}}$

Since the distribution of S depends upon the samples and, therefore, differs from test to test, it cannot be tabulated. One may write a program that computes Z and z over the possible two-sample combinations of the values of $S_{SAM}$ and $S_{MS}$. For the data given in Table 10-4, Z = 154 and z = 2, and therefore:
$P = (154 + 2) \,/\, \binom{10}{5} = 156/252 = 0.62$
In other words, the likelihood that the alternative hypothesis H2 holds is 38 percent. For the data in Table 10-5, Z = 110 and z = 2 and, hence, P = (110 + 2) / 252 = 0.44. Therefore, the likelihood of the alternative hypothesis H2 is 0.56.

In comparison to the exact U-test, the likelihood of the alternative hypothesis has increased both for the reference corpus recall and the individual absolute recall. Yet, these results are still at a low significance level and, therefore, are not sufficient to reject the null hypothesis.
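As with the U-test, the counts Z and z can be obtained by a small enumeration program. The following Python sketch is again only an illustration (hypothetical, not the program used for the thesis). It expresses the reference corpus recall rates of Table 10-3 in hundredths so that the sums can be compared exactly; because the published recall rates are rounded, its counts for the reference corpus data may deviate slightly from the Z = 154 and z = 2 reported above, whereas the individual absolute recall data are exact counts.

```python
from itertools import combinations

def exact_fisher_pitman(sam, ms):
    """Single-sided exact Fisher-Pitman randomization test.

    Counts the two-sample combinations whose group sum S is greater than (Z)
    or equal to (z) the observed sum of the SAM group and returns
    (Z, z, P) with P = (Z + z) / C(N, N_SAM).
    """
    pooled = list(sam) + list(ms)
    n = len(sam)
    s_observed = sum(sam)
    greater = equal = total = 0
    for idx in combinations(range(len(pooled)), n):
        s = sum(pooled[i] for i in idx)
        total += 1
        if s > s_observed:
            greater += 1
        elif s == s_observed:
            equal += 1
    return greater, equal, (greater + equal) / total

# Reference corpus recall from Table 10-3, given in hundredths.
print(exact_fisher_pitman([42, 28, 28, 20, 16], [48, 35, 27, 24, 12]))

# Individual absolute recall from Table 10-3 (the text reports Z = 110, z = 2, P = 0.44).
print(exact_fisher_pitman([433, 400, 290, 204, 248], [498, 275, 275, 326, 150]))
```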
10.3.8 Summary

A positive effect of the automatic analyses could not be shown by the experiment on the evaluation of the semi-automatic method. However, the conclusions should not be generalized too far, for the following reasons:
• The subject system used for the experiment, Mosaic, is well-structured. The experimental subjects were allowed to use the module view, which consists of the modules of the system and their respective contents. The module view corresponds to the Same Module heuristic without the restriction of Same Module that the elements of a component have to be (transitively) connected to each other. Since Same Module - and hence the module view, if the modules are not too large - is one of the most effective techniques for well-structured systems, the advantage of people with automatic support was only marginal.
• Besides the fact that the subject system was in better shape than many legacy systems, the experimental subjects may also not be considered typical average programmers. All participants were computer science students with grades better than average. A future experiment should investigate whether less talented programmers would profit more from automatic analyses.
• Furthermore, even members of group MS could use automatic support to some degree. They could use plain Rigi as a cross-reference and browsing tool and to log their findings. In practice, programmers can mostly use only very simple search tools, like grep.
• One must also note that, at the time of the experiment, the framework did not offer all the functionality described in Chapter 8 and Chapter 9. Section 10.3.3 describes the limitations of the extended Rigi used for the experiment.

However, even for less well decomposed systems and less talented programmers, for which automatic analyses may be more helpful, we should be aware that the support of the automatic analyses is limited to gathering candidates. This part, even when done by hand, is small compared to the time needed for validation by the maintainer, who will always have to look at the source code. As long as there are no absolutely precise techniques whose candidates can be accepted without checking, there will be a constant human factor in the process of component detection that cannot be eliminated. The goal for automatic analyses must be to be as reliable, flexible, and quick as possible for an interactive application. Here, the framework inherited in part the weaknesses of the automatic techniques used. As the evaluation in Chapter 6 has shown, the techniques may produce bad candidates and do not find all components either. Future research should be aimed at finding more precise techniques that consider other sources of information, like data flow information or domain knowledge. The techniques described in this thesis leveraged only structural information.

The experiment described in this chapter could not show any positive effect of the analyses (nor a negative effect). On the other hand, experiences with the overlooked false positives described in Section 6.4 indicate that automatic analyses may be useful when larger parts of a system are to be analyzed. In the experiment with Mosaic, the search was limited by time (6 hours). In contrast, in gathering the references in Aero, Bash, and CVS used to evaluate the automatic techniques, which took between 20 and 35 hours per system, 42% of the ADO candidates and 41% of the ADT candidates proposed by the automatic techniques and formerly categorized as false positives were indeed overlooked or alternative components.
341 Case Study for Maintenance Support 10.4 Case Study for Maintenance Support In order to find out whether the framework could also be used to support mainte- nance tasks other than atomic component detection, a case study with the students of the previous experiment was performed. The goal of this case study was also to learn what other types of analyses would be useful for maintainers. All but one of the experimental subjects of the last experiment participated in this case study. The participants were in the same groups as in the previous experi- ment. The size of group SAM was four and the size of group MS was five. Mem- bers of group SAM could use the extended Rigi, while members of group MS were only allowed to use the standard tools grep, vi, emacs, and a C compiler, i.e., they could not use plain Rigi. The system used for the case study was XCoral (version 3.2), an X-window-based editor consisting of 73 KLOC of C code. The editor also contains a large sub- system SMAC that implements an interpreter for a subset of the programming language C. The structure of the system is highly deteriorated, information hiding is disregarded, and many function clones (duplicated and slightly modified func- tions) exist. The assignments for the participants involved typical maintenance problems like change of data structures, lifetime analysis of variables, interface identification, concept recognition, and clone detection. The tasks were as follows (the com- ments in italic font were not mentioned to the participants): 1. All clients of the ADT List directly access the record components next and pre- vious of List. These components should now be hidden, i.e., no function ¾ except for the accessor functions of List ¾ may access internal components of List. Instead, an iterator for List should be provided and used at the client site. What changes are necessary? This task was aimed at impact analysis of changes to a data structure. How- ever, the primary question was whether the participants would find the already existing iterator for ADT List that was implemented in another file and then avoid the re-implementation of the found iterator. 2. The subsystem SMAC has a memory leak at its hash table because the destruc- tor HashTable_Empty is never called. Insert this function call at the sites where the lifetime of the hash table ends. If the lifetime of the hash table lasts till the Experiments to Evaluate the Semi-Automatic Method 342 end of the program, it does not need to be explicitly released; in this case, it will be implicitly released with the program termination. This task was aimed at lifetime analysis of variables. SMAC has 8 global hash tables, one of them not even used at all, and no local hash tables (except for one in the constructor for hash tables). The lifetime of all global variables ends with the program. As a matter of fact, in the narrower sense, there was no mem- ory leak of the table itself. But there was one call to HashTable_Empty freeing the entries of the hash table that was removed from the source. Hence, the memory allocated for the entries was not released, which in turn would lead to an error in the application because the table needed to be empty at some points. 3. What is the exact interface between SMAC and XCoral, i.e., what parts of SMAC are used within XCoral and what parts of XCoral are used within SMAC? As a matter of fact, there is one file smacXCoral that is supposed to be the interface between XCoral and SMAC. 
However, there are a few other declarations of SMAC used in XCoral that are not listed in smacXCoral. Interestingly enough, there are also declarations of XCoral used within SMAC.
4. File file_dict.c implements an ADT FileRec as a list of filenames. Moreover, this file contains a global variable file_dict. What is the concept behind these declarations? How could this concept be reused? And where (within the system)?
On the surface, this task was aimed at concept recognition. The actual purpose was to detect existing clones of this concept.

The total time available for all these tasks together was six hours. The tasks were to be handled in the given sequence. Participants were allowed to skip a task when they could not find a solution. Actually, there was another task aiming at the understanding of a central data structure Text that consists of many "inlined" subconcepts whose accessor functions are implemented in different files. However, only three participants arrived at this task within the given time.

The individual times needed for these assignments are shown in Table 10-6. The figures in brackets are the time spent on a task before a participant gave up (these numbers are not considered for the average and median).

Table 10-6. Time needed in minutes for tasks 1-4.

              Task 1   Task 2   Task 3   Task 4
  SAM1        85       115      60       70
  SAM2        90       75       (40)     95
  SAM3        60       90       75       65
  SAM4        60       30       120      60
  Average SAM 74       78       85       73
  Median SAM  73       83       75       68
  MS1         35       80       130      40
  MS2         125      60       75       30
  MS3         110      50       (45)     120
  MS4         80       80       37       40
  MS5         74       56       92       54
  Average MS  85       65       84       57
  Median MS   80       60       84       40

Table 10-6 shows a high variance among the participants for most tasks in both groups. Members of group SAM needed less time than members of MS for task 1, similar time for task 3, and more time for tasks 2 and 4.

10.4.1 Task 1 - Change of Data Structure

The variance of the time needed for task 1 can partly be explained by the varying degree of detail in the answers of the participants. Participant MS1, for example, identified only the files that are affected by the proposed change, while participant MS2 even listed all source positions that would have to be changed and described how these sites should be changed.

Members of group MS used grep to identify the places where previous and next occur. This strategy would have been more difficult if there had been another data structure with components named alike. One of the members of group MS even noted that certain sites where previous and next occurred would not have to be changed because they belong to a global variable Memo. However, the participant overlooked that Memo is of type List.
An enhanced Internal Access analysis would then report reliable results within a minute where primitive tools, like grep, yield only a very rough approximation that has to be validated by the maintainer. In both groups, there were three people that did not realize the existence of an iterator for List in the source. The assumption was that members of group SAM would overlook the iterator more often because they would rely on the results of Internal Access and, therefore, not look at the source code. However, members of MS who were far more attached to the source code overlooked the iterator equally often. 10.4.2 Task 2 - Lifetime Analysis The support for lifetime analysis by the extended Rigi is very limited. Finding the hash tables in the system could be done by the cross-reference facilities. But the available control and dataflow information in the resource usage graph is very rough. The call view abstracts from the exact order of function calls and contains only directly visible calls; limited data flow information is only available as the set and use relationships for global variables. Likewise, people solely using grep could identify the global hash tables. Most people in both groups concluded from the scope of the global variables that the lifetime of the variables would end with the program, others were unsure. None realized the missing call to the function that empties the hash table. One participant of group MS proposed to check for a defined hash table in certain initialization routines and then to release the hash table when already defined before calling the constructor. However, firstly, this would not be a remedy since the initialization routines are only called once in the 345 Case Study for Maintenance Support beginning and, secondly, this could even result in an error when the hash table were aliased and the initialization functions were called more than once. 10.4.3 Task 3 - Exact Interface Identification Members of the MS group looked at the include statements in order to identify the exact interface. Those files were closely investigated that included a file from the other subsystem. Since all files of SMAC are in a subdirectory of the XCoral source directory, the files of XCoral used in SMAC could be identified by include statements like “../file.h”. This strategy did not work for files of SMAC used by XCoral since there was no include statement like “Smac/file.h”. Instead, explicit extern declarations for SMAC elements were used within XCoral. Looking only at files that are explicitly included may also result in overlooking usages of routines for which not even an extern declaration within the using file exists since routines are implicitly declared in C when no declaration can be found. In Rigi, there is a built-in function that allows the extraction of exact interfaces of composite nodes. The most simple solution would have been to collapse all files of SMAC into one node (identified by their path) and all other files into a second node. Then the exact interface could have been ascertained for both nodes. This would have been work of few minutes. However, members of SAM did not see this ability. They rather used the cross-reference capabilities. Of both groups, one participant gave up at this task. 10.4.4 Task 4 - Concept Recognition and Clone Detection There are approaches that try to automatically assign concepts to pieces of code using typical coding patterns for data structures and algorithms (Wills, 1992; Qui- lici, 1997). 
The extended Rigi does not offer such capabilities. The only support it provides for this kind of task is to find groups of related elements without assign- ing any meaning to these groups, and offering a higher level of abstraction by visualizing global declarations and their relationships only. The actual goal of this task, however, was to see whether a more abstract view would help or hinder in realizing function clones. On one hand, providing a more abstract view may sup- press important details because the source code is not immediately visible and the abstracted information may not be sufficient to recognize function clones. On the Experiments to Evaluate the Semi-Automatic Method 346 other hand, merely looking at the source code may lead to get lost in details. This case study did not provide evidence for either hypothesis. In both groups, there was one member that did not see the function clones. Both members recom- mended to use another more general data structure in the subsystem SMAC instead. 10.4.5 Summary The case study described in this section was headed at the ability of the extended Rigi to support maintenance tasks other than those it was originally designed for, namely, atomic component detection. The goal of this case study was also to learn what other types of automatic analyses would be useful for maintainers. The average time needed for the diverse maintenance tasks performed in this case study was at least 1 hour. Cases in which participants needed 2 hours were not rare. However, at least three of these tasks could easily be automated (to some degree). Tasks 1 and 3 fall into the same category of static name binding for a given set of declarations at the atomic component and subsystem level which can reliably be done with a global semantic analysis as far as visible declarations are concerned. In the case of interfaces, there may also be hidden dependencies that are harder to track down. For example, there could be an external file which one part of the sys- tem writes into and the other part reads from. Such hidden dependencies may be partly derived by control and data flow analyses, e.g., by means of constant prop- agation. However, it need not necessarily be recognizable or decidable that the same external file is meant at two different sites of the system. Problems of the kind of task 2 deal with lifetime and protocols of components (a protocol is a specification on the allowed order of actions associated with a com- ponent). How long a component exists is often undecidable statically; for exam- ple, when the component is allocated on the heap, one would have to find out whether there is no pointer to the component left in order to ascertain the end of the lifetime of the component. The protocol of the hash tables in the above assign- ments would require to call HashTable_Empty at the end of the lifetime of each hash table. Since the end of the lifetime is often undecidable, this protocol speci- fication cannot always be checked statically. On the other hand, if points-to analy- 347 Case Study for Maintenance Support ses or user assertions exclude aliasing of the hash table, the lifetime of the hash table may in fact be derived and it can be checked whether HashTable_Empty is properly called. In order to support clone detection, as requested in task 4, several automatic approaches have been proposed. Baker uses pattern matching techniques (1995), Mayrand et al. 
compare values of certain metrics in order to identify pieces of code that perform similar functions (1995), and Baxter et al. match abstract syntax trees (1998).

Part IV Finale

Chapter 11 Related Research

This chapter summarizes research in architecture recovery related to atomic component detection.

11.1 Other Automatic Component Detection Techniques

Most automatic techniques for component detection have already been presented in Chapter 5. Others not considered in this thesis follow in this section.

11.1.1 Metric-based Approaches

Several clustering techniques for module and subsystem detection have been proposed that are based on a similarity metric. Schwanke's work (1992) and the work of Patel, Chu, and Baxter (1991) have already been discussed in Section 5.9 and Section 5.10, respectively.

11.1.1.1 Belady and Evangelisti

Belady and Evangelisti's approach groups related subprograms using a similarity metric based on data bindings (1982). A data binding is a potential data exchange via a global variable. Several kinds of data binding can be distinguished according to the following levels of accuracy: A potential data binding is defined as an ordered triple (p, x, q) where p and q are procedures and x is a variable within the static scope of both p and q. A used data binding is a potential data binding where p and q either set or use x. An actual data binding is defined as a used data binding where p assigns a value to x and q uses x. A control flow data binding is defined as an actual data binding where there is a "possibility" of control passing to q after p has had control. A possibility is said to exist whenever either (1) there exists a chain of calls from p to q or vice versa, or (2) there exists a procedure r such that there are chains of calls from r to p and from r to q and there exists a path in the directed control flow graph connecting the call chain to p with the call chain to q. The similarity metric is based on the percentage of the bindings that connect to either of the two components and are shared by the components. Varying reliability can be achieved by selecting different degrees of data bindings.

11.1.1.2 Hutchens and Basili

Hutchens and Basili extend Belady and Evangelisti's work by using a hierarchical clustering technique to identify related subprograms and subsystems (1985).

11.1.1.3 Girard and Würthner

Girard and Würthner's work is aimed at the identification of functionally cohesive components and subsystems (Eisenbarth et al., 1999). Functionally cohesive components are groups of routines that together implement a certain functionality. Candidates for functionally cohesive components may be identified as subtrees of the dominance tree for the call view. The root of the subtree is the interface function; all other functions of the subtree are local service functions of the interface function. In order to retrieve functionally cohesive components from the dominance tree, two heuristics were proposed. The first heuristic is based on the size of the subtree. During a bottom-up traversal of the dominance tree, a subtree is selected whose size (in terms of number of nodes) is not larger than a user-determined threshold. However, selecting subtrees by size does not necessarily say anything about the cohesion of a component and, therefore, another heuristic based on shared variables is proposed.
The underlying assumption of the second heuristic is that a subcomponent, S, that has a different functionality than a component, C, dominating S may be distinguished by its references to global data. If S implements a functionality that is only needed in the context of C, it is quite likely that S references the same data as C\S (the set of subprograms in C without those in S). If S implements a functionality that differs from the one of C, S probably needs further data. Hence, a dissimilarity can be defined as:
$\text{dissim}(S, C) = \frac{|V(S) \setminus V(C \setminus S)|}{|V(C)|}$
where V(X) is the set of variables referenced by subprograms in X. That is, the dissimilarity measures the ratio of the variables additionally used by S. An incremental algorithm can be used to cut the dominance tree so that the dissimilarity among components is maximized. Once the functionally cohesive components have been identified, they can be grouped into subsystems based on the variables they share. Likewise, variables can be clustered based on the functionally cohesive components that reference them. This approach is particularly suited for a language like Fortran, in which global variables are very frequent.

11.1.1.4 Mancoridis et al.

Mancoridis et al. propose measurements very similar to internal and external connectivity as defined in Section 5.8 for clustering modules into subsystems (1999). They propose a genetic algorithm based on these metrics for finding a partition of the module view that minimizes external connectivity and maximizes internal connectivity. The approach was applied to modules only; hence, very few nodes, compared to the number of base entities in a system, are to be clustered. Whether the use of genetic algorithms scales to a finer granularity remains to be shown. Furthermore, external connectivity as defined by Mancoridis et al. ignores the actual number of dependencies existing among modules, i.e., modules with only one dependency have the same connectivity as modules with hundreds of dependencies.

11.1.2 Concept Analysis

Concept analysis is also based on structural information; yet, as will be shown in this section, it is a class of approaches quite different from those presented in Chapter 5 and represents its own field of research. For this reason, it has not been explored further in this thesis. This section introduces the basic ideas of concept analysis and summarizes existing applications of concept analysis to atomic component detection. Because concept analysis is also based on structural information, yet not investigated in this thesis, the diverse approaches based on concept analysis are presented in more detail than other techniques within this discussion of related research.

Concept analysis provides a way to identify groupings of entities that have common attributes. Its mathematical foundation was laid by G. Birkhoff in 1940. Gregor Snelting introduced concept analysis to software engineering in 1998. Since then it has been used to evaluate class hierarchies (Snelting and Tip, 1998), to explore configuration structures of preprocessor statements (Krone and Snelting, 1994; Snelting, 1996), and to perform atomic component detection (Lindig and Snelting, 1997; Siff and Reps, 1997; Sahraoui et al., 1997; Graudejus, 1998; Canfora et al., 1999).
11.1.2.1 Mathematical Background

Concept analysis is based on a relation R between a set of objects $\mathcal{O}$ and a set of attributes $\mathcal{A}$, hence $R \subseteq \mathcal{O} \times \mathcal{A}$. An object in the sense of concept analysis can be anything, not only objects as defined by Section 3.1.1. Within this section on concept analysis, the term object will be used in the sense of concept analysis. The triple $C = (\mathcal{O}, \mathcal{A}, R)$ is called a formal context. For any set of objects $O \subseteq \mathcal{O}$, the set of common attributes is defined as
$\sigma(O) = \{a \in \mathcal{A} \mid \forall o \in O: (o, a) \in R\}$
Similarly, for any set of attributes $A \subseteq \mathcal{A}$, their set of common objects is
$\tau(A) = \{o \in \mathcal{O} \mid \forall a \in A: (o, a) \in R\}$
As an example, let us consider the fictitious binary relation described by Table 11-1. An object Oi has attribute Aj when there is an × in row i and column j of Table 11-1.

Table 11-1. Example relation.

        A1   A2   A3   A4   A5   A6   A7   A8
  O1    ×    ×
  O2              ×    ×    ×
  O3              ×    ×         ×    ×    ×
  O4              ×    ×    ×    ×    ×    ×

In this table, for example, the following two equations hold:
$\sigma(\{O_1\}) = \{A_1, A_2\}$ and $\tau(\{A_7, A_8\}) = \{O_3, O_4\}$
The two functions σ and τ form a Galois connection, i.e., a pair of two antimonotone functions:
$O_1 \subseteq O_2 \Rightarrow \sigma(O_2) \subseteq \sigma(O_1)$ and $A_1 \subseteq A_2 \Rightarrow \tau(A_2) \subseteq \tau(A_1)$
and both $\sigma \circ \tau$ and $\tau \circ \sigma$ are closure operators; e.g., $\tau(\sigma(O))$ determines the biggest set of objects that have the same attributes as O.

A pair (O, A) is called a concept if $A = \sigma(O)$ and $O = \tau(A)$, i.e., all objects share all attributes. For a concept c = (O, A), O is the extent of c, denoted by extent(c), and A is the intent of c, denoted by intent(c). Informally, a concept corresponds to a maximal rectangle of filled table cells modulo row and column permutations. For example, Table 11-2 contains the concepts for the relation in Table 11-1.

Table 11-2. Concepts for Table 11-1.

  C1   ({O1, O2, O3, O4}, ∅)
  C2   ({O2, O3, O4}, {A3, A4})
  C3   ({O1}, {A1, A2})
  C4   ({O2, O4}, {A3, A4, A5})
  C5   ({O3, O4}, {A3, A4, A6, A7, A8})
  C6   ({O4}, {A3, A4, A5, A6, A7, A8})
  C7   (∅, {A1, A2, A3, A4, A5, A6, A7, A8})

The set of all concepts of a given relation forms a partial order via
$(O_1, A_1) \le (O_2, A_2) \Leftrightarrow O_1 \subseteq O_2$, or, equivalently, $(O_1, A_1) \le (O_2, A_2) \Leftrightarrow A_1 \supseteq A_2$
If $c_1 \le c_2$, then c1 is said to be a subconcept of c2 and c2 is a superconcept of c1. For example, ({O2, O4}, {A3, A4, A5}) ≤ ({O2, O3, O4}, {A3, A4}) in Table 11-1. The set of concepts and the partial order form a complete lattice, called the concept lattice:
$\mathcal{L}(C) = \{(O, A) \in 2^{\mathcal{O}} \times 2^{\mathcal{A}} \mid A = \sigma(O) \wedge O = \tau(A)\}$
In this lattice, the infimum (or meet) of two concepts is computed by intersecting their extents:
$(O_1, A_1) \wedge (O_2, A_2) = (O_1 \cap O_2,\, \sigma(O_1 \cap O_2))$
Note that $A_1 \cup A_2 \subseteq \sigma(O_1 \cap O_2)$, as $O_1 \cap O_2$ has at least the common attributes $A_1 \cup A_2$. Thus, an infimum describes the set of attributes common to two sets of objects. Similarly, the supremum (or join) is computed by intersecting the intents:
$(O_1, A_1) \vee (O_2, A_2) = (\tau(A_1 \cap A_2),\, A_1 \cap A_2)$
Again, $O_1 \cup O_2 \subseteq \tau(A_1 \cap A_2)$. Thus, a supremum describes a set of common objects which fit two sets of attributes.

Graphically, the concept lattice for the example relation in Table 11-1 can be represented as a graph whose nodes are the concepts in Table 11-2 and whose edges denote the < relationship, as shown in Figure 11-1 (a). The graph for a concept lattice would be difficult to read if each node showed its complete concept, i.e., if the nodes Ci were replaced by their contents according to Table 11-2. Fortunately, there is a better strategy for labelling nodes. A graph node in Figure 11-1(b) is labelled with attribute $a \in \mathcal{A}$ if it is the largest concept having a in its intent; it is labelled with an object $o \in \mathcal{O}$ if it is the smallest concept having o in its extent.
The (unique) lattice element labelled with a is thus
$\mu(a) = \bigvee \{c \in \mathcal{L}(C) \mid a \in \text{intent}(c)\}$
and the element labelled with o is
$\gamma(o) = \bigwedge \{c \in \mathcal{L}(C) \mid o \in \text{extent}(c)\}$
The equivalent graph for Figure 11-1(a) using this labelling strategy is shown in Figure 11-1(b). A concept represented by a node N in this graph consists of all objects at and below N and of all attributes at and above N. For example, the concept labelled with O2 and A5 in Figure 11-1(b) is ({O2, O4}, {A3, A4, A5}).

Figure 11-1. Example lattice: (a) the concepts C1-C7 of Table 11-2 ordered by <; (b) the same lattice with sparse labelling by objects and attributes.

The concept lattice and the table, T, originally used to represent the relation are two equivalent ways to represent the relation, i.e., they can be reconstructed from each other, formally stated as:
$(o, a) \in T \Leftrightarrow \gamma(o) \le \mu(a)$
However, the concept lattice is a much more comprehensible representation, allowing direct insight into the structure of the original relation. For example, we can immediately derive from Figure 11-1 that there are two disjoint sets of objects: O1 has attributes A1 and A2, and the objects O2, O3, and O4 share the other attributes. Furthermore, among O2, O3, and O4, O4 has all attributes of O2 and O3 but not vice versa. A3 and A4 are common to all of the objects O2, O3, and O4. This information can, of course, also be derived from the table, but for larger tables this is more difficult.

In the following sections, it is discussed how concept analysis can be applied to detect atomic components. The approaches differ in the kinds of objects and attributes considered, the way interferences are handled, and how the concept lattice is interpreted.

11.1.2.2 Lindig and Snelting's Approach

  Name               Concept Analysis by Lindig and Snelting
  Reference          Lindig and Snelting (1997)
  Domain             Object Reference View
  Range              ADO
  Disjoint Clusters  Yes

A formal context for concept analysis in the approach of Lindig and Snelting (1997) consists of subprograms as objects and global variables as attributes. The underlying relation is the variable reference, i.e., a subprogram S has attribute V if and only if S references variable V. An atomic component candidate is a concept, i.e., it shows up as a maximal rectangle in the table of the variable reference relation. For example, if Table 11-1 represents a variable reference relation, then ({O1}, {A1, A2}) forms a candidate. The maximal rectangles in the table, however, need not be completely filled - not every subprogram in a component uses all variables, and not all variables are used by all procedures - as long as the candidate corresponds to an independent sublattice in the concept lattice. An independent sublattice is connected to other sublattices only via the top and bottom elements; it has a single entry (from the bottom element) and a single exit (to the top element), so to speak. The concepts associated with the nodes labelled with A3 and A4, with O2, with O3, and with O4 in Figure 11-1, for example, form an independent sublattice; the concept labelled with O1 forms another independent sublattice. Given an independent sublattice, the constituents of the proposed candidate are all objects (subprograms) and attributes (variables) of the concepts in the sublattice. Hence, {O2, O3, O4, A3, A4, A5, A6, A7, A8} and {O1, A1, A2} are two candidates detected by concept analysis.

Ideally, the variable reference relation is horizontally decomposable, i.e., the corresponding lattice consists of independent sublattices only.
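To make the notion of concepts as candidates more concrete, the following Python sketch enumerates all concepts of a small formal context by closing every subset of objects. It is a naive illustration only (hypothetical names, exponential enumeration, not the algorithm used by Lindig and Snelting) and takes the relation of Table 11-1 as input; applied to that relation, it reproduces the seven concepts of Table 11-2, from which the two independent sublattices above can be read off.

```python
from itertools import chain, combinations

def concepts(relation, objects, attributes):
    """Enumerate all formal concepts (O, A) of a context, i.e. pairs with
    A = sigma(O) and O = tau(A).  Naive enumeration over object subsets;
    adequate for small examples, not for whole systems."""
    def sigma(objs):
        return frozenset(a for a in attributes
                         if all((o, a) in relation for o in objs))
    def tau(attrs):
        return frozenset(o for o in objects
                         if all((o, a) in relation for a in attrs))
    result = set()
    subsets = chain.from_iterable(combinations(objects, k)
                                  for k in range(len(objects) + 1))
    for objs in subsets:
        a = sigma(objs)
        result.add((tau(a), a))      # closure of the object subset is a concept
    return result

# The relation of Table 11-1: subprograms O1..O4 against attributes A1..A8.
R = {('O1', 'A1'), ('O1', 'A2'),
     ('O2', 'A3'), ('O2', 'A4'), ('O2', 'A5'),
     ('O3', 'A3'), ('O3', 'A4'), ('O3', 'A6'), ('O3', 'A7'), ('O3', 'A8'),
     ('O4', 'A3'), ('O4', 'A4'), ('O4', 'A5'), ('O4', 'A6'), ('O4', 'A7'), ('O4', 'A8')}
objs = ['O1', 'O2', 'O3', 'O4']
attrs = ['A%d' % i for i in range(1, 9)]
for extent, intent in sorted(concepts(R, objs, attrs), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```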
However, in reality, there are often subprograms in one component that directly access variables of other components, which results in interferences in the concept lattice that prevent us from identifying components as independent sublattices. The subprogram S in Figure 11-2, for example, accesses two variables V1 and V2 that belong to different sublattices. As a consequence, the concept labelled with S causes an interference between the two sublattices. Snelting also notes that such interferences can be detected automatically and then, in principle, be removed by program transformations. For example, the access to V1 by S could be replaced by a call to an appropriate accessor function of the atomic component containing V1. Snelting notes that even in the case of many interferences the system can be modularized by means of so-called block relations (also called weak congruencies) that correspond to rectangle shapes in the table and induce a factor lattice (see Lindig and Snelting, 1997, for details). However, even though interferences are automatically detectable, the user has to be called on or further heuristics have to be used to decide how the interferences should actually be handled.

Figure 11-2. A horizontal decomposition and an interference (a subprogram S referencing variables V1 and V2 of two different sublattices).

Lindig and Snelting report on a case study in which concept analysis was applied to a 20-year-old aerodynamics system used for airplane development after several manual restructuring efforts had failed (1997). The program was written in Fortran and consisted of about 100 KLOC, 317 subroutines, and 492 global variables in 49 COMMON blocks. The concept lattice for this program contains 2249 elements and is so full of interferences that it could not be horizontally decomposed. Several attempts to restructure at least some parts of the system and to use block relations and other approaches failed, too. It was decided to cancel the reengineering project and to develop a new system from scratch.

11.1.2.3 Siff and Reps's Approach

  Name               Concept Analysis by Siff and Reps
  Reference          Siff and Reps (1997)
  Domain             Signature View and non-abstract usage information
  Range              ADT
  Disjoint Clusters  No

Siff and Reps use concept analysis to detect abstract data types. The objects are subprograms, as in the approach of Lindig and Snelting, while the attributes are signature relationships and accesses to record components as opposed to variable references. A subprogram, S, has attribute param T if and only if signature-type(S, T) holds. S has attribute access T if and only if:
$S \in \text{predecessors}(T, \{\text{local-obj-of-type}, \text{signature-type}\}, \text{non-abstract})$
Siff and Reps also consider disjunctions of attributes and negative attributes. Disjunctive attributes allow the user to specify properties of the form "A or B", like "internally accesses type stack or type queue". This may be useful when the user is aware of the similarity of two types. Negative attributes are a means to leverage the absence of an attribute in order to eliminate interferences. Consider the example in Figure 11-3, which describes in table (a) that subprograms S1 and S2 have parameter type T2 and both internally access an instance of type T2; likewise, subprogram S3 has parameter types T1 and T2 and internally accesses an instance of type T1. Table (a) is decomposable into three concepts ({S1, S2}, {param T2, access T2}), ({S3}, {param T1, access T1, param T2}), and ({S1, S2, S3}, {param T2}), as shown by the shaded rectangles.
Suppose the programmer breaks the information hiding principle and subprogram S3 also internally accesses an instance of type T2. Then, a concept ({S1, S2, S3}, {param T2, access T2}) replaces the concepts ({S1, S2}, {param T2, access T2}) and ({S1, S2, S3}, {param T2}), and the concept ({S3}, {param T1, access T1, param T2, access T2}) replaces ({S3}, {param T1, access T1, param T2}). The replacing concepts overlap such that the two abstract data types that originally corresponded to distinct concepts can no longer be easily detected from the concept lattice. The introduction of a negative attribute ¬access T1 reveals the original atomic component consisting of S1, S2, and T2 again, as shown in table (c): because S1 and S2 do not internally access an instance of type T1, a new maximal rectangle shows up in the table, corresponding to the concept ({S1, S2}, {param T2, access T2, ¬access T1}). However, adding negative attributes can only lead to new concepts in the lattice but never to the removal of concepts. Hence, the concept ({S1, S2, S3}, {param T2, access T2}) is still present.

Figure 11-3. Example for negative attributes: (a) the original context for S1, S2, S3 over the attributes param T1, param T2, access T1, access T2; (b) the context after S3 also internally accesses T2; (c) the context of (b) extended by the negative attribute ¬access T1.

Not every negative attribute helps to reveal relevant concepts. For example, adding a negative attribute ¬access T2 fails to detect the component consisting of S1, S2, and T2 in the example above since S1 and S2 as well as S3 internally access T2. Furthermore, negative attributes can lead to many additional concepts, i.e., to many new potential candidates. Siff has, therefore, refined the first idea of adding negative attributes for all attributes (fully complemented context), published in 1997, by adding them only when they can be used to make a difference between two overlapping concepts and when there is a negative attribute that can be used to distinguish a concept from all other concepts (Siff, 1998); the latter is called the uniquely complemented context.

Whereas Lindig and Snelting consider only independent sublattices as candidates, Siff and Reps treat each concept as a potential candidate. The number of concepts, however, can be huge, and not all possible combinations of concepts actually make sense. Overlapping atomic components are rather rare according to our experience with components compiled by software engineers for several systems, described in Section 6.1.1. For this reason, Siff and Reps propose to partition the concept lattice into non-overlapping concepts; in other words, each subprogram is assigned to exactly one candidate. Formally, given a formal context (O, A, R), a concept partition is a set of concepts whose extents form a partition of O. That is, $P = \{(O_1, A_1), (O_2, A_2), \ldots, (O_k, A_k)\}$ is a concept partition if and only if
$O = \bigcup_{1 \le i \le k} O_i \quad\text{and}\quad \forall i \ne j: O_i \cap O_j = \emptyset$
Each concept partition is a possible system decomposition and can be validated by a maintainer. In the simple example of Figure 11-1 on page 357, there is only one possible concept partition, ({O1}, {A1, A2}), ({O2, O3, O4}, {A3, A4}), since the two concepts labelled with O2 and O3 both contain O4.
But generally, there can be a large number of possible partitions.

The extent of each concept in a partition identifies the subprograms that belong to an atomic component candidate. However, what are the types that belong to the candidate? Though not explicitly said in their paper, we may assume from the examples Siff and Reps are using that they add to the candidate those types that are associated with the attributes in the intent of a concept. Recall that the attributes Siff and Reps consider are not types directly but kinds of type usages, such as accesses record components of T, has return type T, and has parameter T. That is, the candidate formed for a concept ({S1, S2, S3}, {access T1, return T2}), for example, is {S1, S2, S3, T1, T2}.

Selecting candidates from concept partitions as proposed by Siff and Reps has the following disadvantages:
1. A type may be assigned to more than one candidate within a given partition.
2. Not all types are assigned to candidates.
3. There are many possible partitions.
4. The 1:1 relationship between a concept and a candidate is often too strict.

Ad (1): Because Siff and Reps partition the objects, i.e., the subprograms, rather than the attributes, i.e., the types, candidates may result that overlap in their sets of types. For example, if the two concepts in the lattice of Figure 11-4 labelled by T2 and T3 are selected, the resulting candidates overlap in type T1. As a matter of fact, it is even worse because Siff and Reps do not only consider types as such as attributes but also the way a type is related. That is to say, because the two distinct attributes param T and access T can be in different concepts, the type is added to different components. In the systems we have investigated, types always belong to at most one component. Overlap in subprograms is more frequent, yet still rare.

Figure 11-4. Lattice interpretation (a lattice in which the concept labelled by T1 has the subconcepts labelled by T2 and T3 above the bottom element).

Ad (2): Consider the example lattice in Figure 11-4. The concept labelled by T1 can be selected for a partition. Then, its (transitive) subconcepts cannot be selected for the partition because their extents are subsets of the concept labelled by T1. As a consequence, the types T2 and T3 that are associated with the subconcepts will not be assigned. They may only be assigned in a different partition, i.e., one that selects these subconcepts. Hence, it is not sufficient to look at a single partition in order to assign all types.

Ad (3): Siff and Reps report, for the fully complemented context of a smaller program consisting of 372 subprograms and 8 user-defined types, altogether 34 concepts and 63 possible partitions of the lattice. For another program with 26 subprograms and 3 user-defined types, 28 concepts were detected and 153 possible partitions of the fully complemented lattice were proposed. The high number is mainly due to the introduction of negative attributes in a fully complemented context. However, even for a uniquely complemented context according to the refinements discussed above, the approach was not usable for systems at the 30 KLOC level because of the huge number of identified partitions (Siff, 1998). For this reason, Siff and Reps propose to select the partitions interactively. An improvement was proposed by Graudejus (1998). He suggests selecting only the concepts located directly below the top element of the concept lattice and achieved better results.
Concepts directly under the top element in the lattice comprise all subprograms of their transitive subconcepts and consist of only a few types (the higher a concept, the more objects and the fewer attributes it has).

Ad (4): Forming a candidate by a single concept is very strict because then all subprograms must be related to all types in the concept (positively or negatively); otherwise it would not be a concept. This constraint is acceptable for abstract data types because an abstract data type usually consists of one single type only, and in fact all accessor functions of an abstract data type should be related to this type. For abstract data objects, however, this is too rigid. Not necessarily all subprograms of an abstract data object reference all of its variables.

11.1.2.4 Sahraoui et al.'s Approach

Sahraoui et al. apply concept analysis to the object reference view in order to detect abstract data objects. Their approach differs from Lindig and Snelting's approach by distinguishing different kinds of variable references and by the interpretation of the concept lattice.

Name: Concept Analysis by Sahraoui et al.; Reference: Sahraoui et al. (1997); Domain: Object Reference View; Range: ADO; Disjoint Clusters: No.

Variable references are distinguished into setting and using a variable. The way a variable is referenced can help in two cases:
1. A global variable that is never modified can be considered a constant and be removed from the variable reference view. Sahraoui et al. note that this decision is not easy to make when pointer arithmetic is used. However, why pointer arithmetic for global variables should be a problem is unclear. What they probably meant is that the variable could be modified when passed as a (simulated) reference parameter to a function that modifies the dereferenced parameter. A conservative decision without need for data flow analysis is to exclude global variables that are never modified directly and whose address is never taken.
2. The specific kind of reference can also be used for method identification (see below).

Sets and uses of variables could also be used as two different attributes for concept analysis, i.e., one could make a distinction between subprograms that only use a variable and subprograms that set a variable. Though not explicitly said, but suggested by the example they give, Sahraoui et al. apparently do not distinguish sets and uses for concept analysis as such but have only one attribute for referencing a variable.

The identification of atomic components in the concept lattice is divided into three steps:
1. Identification of sets of variables.
2. Merging of overlapping sets of variables.
3. Identification of subprograms.

Identification of sets of variables. The first step is based on the concept lattice for the variable reference relation. In the paper of Sahraoui et al., the concept lattice is computed for the inverse variable reference relation. This does not have any influence on the outcome since attributes and objects are interchangeable for concept analysis (the concept lattice is then flipped upside down). However, we prefer to follow the conventional way of treating the subprograms as objects and the variables as attributes. In order to identify sets of variables from the concept lattice, a set NS that contains the not-yet-selected variables is used. In the initial state, NS contains all variables.
The iterative identification process starts at the top element of the lattice and stops when NS is the empty set, hence, when all variables have been assigned to a candidate. Concepts with a larger cardinality of their subprogram sets, i.e., concepts closer to the top, are selected first. In the case of equal cardinality of subprogram sets, concepts with fewer variables are preferred in order to avoid large objects. If these two criteria are still not sufficient to decide which concept should be selected, priority is given to the cluster C that has the higher cardinality of the set intent(C) ∩ NS. Each time a group is selected, the variables it contains are removed from NS. Groups with intent(C) ∩ NS = ∅ and groups consisting of only one variable of a basic type are ignored.

Merging overlapping sets of variables. The previous step can lead to overlapping sets of variables. These overlapping sets may be merged. In order to detect overlapping sets of variables, concept analysis is applied to the formal context (O, A, R) where O consists of the sets of variables, A consists of the global variables, and (o, a) ∈ R if o contains a. Two sets of variables overlap when they have a common superconcept other than the top element. Sahraoui et al. merge two sets of variables only if they have at least two common variables. As a consequence, two candidates may overlap in one variable.

Method identification. Once the sets of variables for the candidates have been established, the subprograms are identified. Interestingly enough, Sahraoui et al. do not consider the concept lattice in order to assign subprograms to candidates. Instead, they propose three rules based on the following definitions: Let V be the set of global variables, SV ⊆ V be the set of variables identified by the previous step, and F the set of functions. For each function f ∈ F, two sets referenced-by(f) and set-by(f) are defined as follows:

referenced-by(f) = referenced-objects(f) ∩ SV
set-by(f) = set-objects(f) ∩ SV, where set-objects(f) = {v | obj-set(f, v)}

(referenced-objects is defined in Table 3-4 on page 65 and obj-set is defined in Table 3-3 on page 51). Note that set-by(f) is a subset of referenced-by(f). Based on the possible categories of values of referenced-by(f) and set-by(f) relevant for assigning a function to a candidate, three rules can be stated:
• If |referenced-by(f)| = 1, then f is assigned to the set of variables containing the unique variable in referenced-by(f).
• If |referenced-by(f)| > 1 ∧ |set-by(f)| = 1, then f is assigned to the set of variables identified by the previous two steps within the concept lattice that contains the single variable in set-by(f), since a modification suggests a higher coupling than just reading the value of a variable.
• If |referenced-by(f)| > 1 ∧ |set-by(f)| > 1, then f cannot be clearly assigned to one set of variables. Sahraoui et al. propose to slice this function since their approach is aimed at migrating to an object-oriented language.

Note that these rules do not guarantee that no function is assigned to more than one candidate: the sets of variables merged by the previous step may still overlap in one variable, and if the function happens to set only one variable that is in more than one set of variables, the function will be added to all these candidates. Moreover, in a pure reverse engineering approach, it is unclear what to do with functions in the third category.
An attempt not explored by the authors could be to assign the function to the set of variables with the maximal number of variables referenced by the function. This strategy could also be used in the second category. It is questionable whether a single modification of a variable in one set of variables outweighs many uses of variables in another set of variables.

Sahraoui et al. motivate the top-down traversal of the concept lattice for component detection by the fact that the higher a group is in the lattice, the higher is the cardinality of its subprogram set (the extent); their hypothesis is that a set of variables can be considered to form an abstract data object if these variables are simultaneously accessed by a larger number of subprograms. However, counterexamples against this hypothesis are frequently used system state or mode variables, such as an error variable. Another consequence of the top-down traversal is that smaller sets of variables are preferred. The argument of the authors is that they want to avoid atomic components with a large number of variables. However, this is counter-intuitive in the context of concept analysis. The definition of a concept requires all subprograms to reference all variables, and since variables are rarely referenced by the same set of subprograms in reality, concepts with a huge number of variables virtually do not exist. Thus, the concepts we can expect to find do not comprise a very large number of variables, and concepts with many variables represent higher cohesion among the subprograms in the concept than concepts with a lower number of variables.

11.1.2.5 Canfora et al.'s Approach

In a case study, Canfora et al. used concept analysis to detect persistent objects, consisting of data files and their accessor routines, in a COBOL system for distributed system migration (1999a). The objects are COBOL programs and the attributes are files. The relationship is defined as (o, a) ∈ R if program o accesses file a.

Name: Concept Analysis by Canfora et al.; Reference: Canfora et al. (1999a); Domain: File Usage View (see below); Range: Persistent Objects; Disjoint Clusters: Yes.

Data files and file accesses have not been captured by the resource usage graph used so far. The extensions, however, are straightforward:
• adding new entity types data file and program,
• adding a new relationship type file access, and
• adding a new view, the file usage view, that describes which programs access data files.

The authors used concept analysis in this case study as the main driving method only; they also used ideas underlying other techniques to reduce the number of interferences in the concept lattice, which distinguishes their approach from the approaches discussed in the previous sections. Some of the interferences could be eliminated by
1. grouping synonymous files, where synonymity was detected manually (Canfora, 1999b),
2. considering only files corresponding to application domain objects,
3. excluding programs accessing only one file, and
4. excluding programs with many file accesses after a manual inspection of the lattice and the source code.

The persistent objects were detected in the concept lattice as independent sublattices, i.e., sublattices only connected to the bottom and top element in the lattice. The resulting concept lattice was simple enough to be inspected manually and to find further persistent objects.
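The file usage view that drives this case study is easy to set up from extracted (program, file) access pairs. The following sketch also applies two of the interference-reduction ideas, excluding programs that access only one file and, optionally, programs with many file accesses; the function name and the numeric threshold are mine, the threshold merely standing in for the manual inspection described above.

    def file_usage_context(accesses, max_files=None):
        """Build the file usage view as a formal context and filter interfering programs.

        'accesses' is an iterable of (program, data_file) pairs. Programs accessing
        only one file are dropped (idea 3); programs accessing more than 'max_files'
        files are dropped as well (a crude stand-in for idea 4).
        """
        files_of = {}
        for program, data_file in accesses:
            files_of.setdefault(program, set()).add(data_file)
        kept = {p: fs for p, fs in files_of.items()
                if len(fs) > 1 and (max_files is None or len(fs) <= max_files)}
        objects = set(kept)                                              # COBOL programs
        attributes = set().union(*kept.values()) if kept else set()     # data files
        incidence = {(p, f) for p, fs in kept.items() for f in fs}
        return objects, attributes, incidence

Grouping synonymous files and restricting the view to application domain objects (ideas 1 and 2) would be applied to the access pairs before this step.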
Though a first step toward a combination of existing techniques, this approach is still limited. Firstly, two of the ideas, namely 1 and 3, are rather trivial. Secondly, observing the second idea leaves many programming domain objects undetected. Furthermore, the approach relies heavily on manual inspection of the lattice. This may be possible at the level of programs and files. However, at the level of global declarations, at which atomic components are detected, the lattices are much bigger.

11.1.2.6 Summary of Concept Analysis

Godin et al. (1995) showed that the worst-case space complexity of the concept lattice is O(2^k · |O|) for the finite upper bound k on the number of attributes for an object in the formal context. Thus, the space complexity is basically linear. Our own experiences with the resource usage graphs for the systems we investigated indicate that k is small when we equate objects with entities and attributes with relationships. But this is not so when negative attributes are considered. Negative attributes are the complement of a relationship, and since the plain relationship is sparse, its complement has many members. This effect already became visible in the discussion of the approach of Siff and Reps.

On the computational side, concept analysis has an exponential time complexity in the worst case. Experiences for atomic component detection indicate that it takes cubic time on average (Snelting, 1999).

Concept analysis has a sound mathematical background, and the insight into the relationships among system components that it can offer makes it an interesting technique for atomic component detection. On the other hand, its demanding time complexity jeopardizes the scalability of concept analysis for larger systems. Furthermore, it is still not clear whether the computational costs for concept analysis really pay off. Horizontally decomposable sublattices can also be identified by the simple Global Reference heuristic at much less effort using a union-find algorithm. The problem of concept analysis is the interferences in the concept lattice, as for any other of the basic structural techniques. Moreover, Delta IC is very similar to concept analysis. The set of closely related subprograms corresponds either to a single concept or, if not all subprograms in the set access all variables, to a set of concepts that are in a subconcept relationship. Delta IC tolerates interferences by means of the quotient of the numbers of closely-related-subprograms and related-subprograms (see Section 5.7).

The overview and analysis of the diverse approaches using concept analysis in this section has revealed that detecting atomic components by means of concept analysis is a field of its own that still requires a lot of research. This field is too wide to be explored in this thesis in concert with the other techniques. For this reason, concept analysis was not investigated further in this thesis.

11.1.3 Dataflow-based and Domain-based Approaches

Only a few techniques exist that leverage data flow information, and there is only one approach, to my knowledge, that is based on information about the application domain.

11.1.3.1 Valasareddi and Carver

Valasareddi and Carver identify objects in a variant of the program dependency graph that describes control and data dependencies (1998). A node in the program dependency graph represents a statement; an edge represents either a control or a data dependency.
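A minimal sketch of the dependency-graph representation just described follows, including a merge operation that anticipates the restructuring step; the class names and fields are my own and do not follow Valasareddi and Carver's notation.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        statements: set          # statements represented by this node
        variables: set           # variables referenced by these statements

    @dataclass
    class PDG:
        nodes: list = field(default_factory=list)
        edges: set = field(default_factory=set)   # (source, target, kind), kind in {"control", "data"}

        def merge(self, a: int, b: int) -> None:
            """Collapse node b into node a, e.g., when the two are judged highly cohesive."""
            self.nodes[a].statements |= self.nodes[b].statements
            self.nodes[a].variables |= self.nodes[b].variables
            remap = lambda n: a if n == b else n
            self.edges = {(remap(s), remap(t), k) for s, t, k in self.edges
                          if remap(s) != remap(t)}   # drop self-loops created by the merge
            self.nodes[b] = None                     # index b is no longer used (illustrative only)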
The dependency graph is restructured by merging nodes with high cohesion. Within the resulting dependency graph, each node represents a potential accessor function, and the variables referenced by the node constitute an object. The advantage of this approach is that it may identify components below the global code level, i.e., at the level of statements within functions. However, because no evaluation of this approach is given, one cannot judge whether meaningful components can be identified at this lower level. Moreover, the authors do not specify how they deal with nodes whose sets of referenced variables overlap. If these sets overlap, overlapping candidates are proposed.

11.1.3.2 Gall and Klösch

Gall and Klösch's approach is based on data flow information as well as on domain information. The approach starts at dataflow diagrams (domain information), then follows part-type relationships among data store entities of the dataflow diagrams and user-defined records, and pursues data dependencies among record components to identify objects with application semantics (1995). Any data store (file) found is considered because information that is stored in a file seems to be essential for the program; a user-defined record that (transitively) depends upon data stores is considered because it is related to something already considered application-related. The accessor functions of atomic components are identified as subprograms that reference the files and record variables selected.

This approach is primarily aimed at abstract data objects that have application semantics in order to recover an application-oriented object model. It falls short of identifying components that rather belong to the programming domain, like stacks and queues. Though these kinds of components are at a lower level, they are necessary to understand the system as a whole and are often better suited for reuse than application-related components. Moreover, it is not clear what the coverage of this approach is: Are all and only application-related objects found? In analogy to program slicing, where slices often cover about 80% of the program, it may be the case that more or less all record types are considered application-related. Hence, the question arises whether the approach excludes programming-domain components effectively.

11.2 Semi-Automatic Methods

This section describes further semi-automatic approaches that integrate the user in the detection process.

11.2.1 Müller et al. (Rigi)

Müller et al. point out that architecture recovery cannot be fully automated (1993); thus, the role of the human software engineer constitutes a central and integral part of architecture recovery, and, consequently, tools for architecture recovery should integrate the user. On the other hand, as much as possible should be automated. Therefore, the tool supporting their approach, namely Rigi, allows human intervention and offers automatic operations for subsystem detection, too. The available operations for subsystem detection are removal of omnipresent entities as well as composition by interconnection strength, common clients/suppliers, centricity, and name. Furthermore, metrics can be used to assess the structure of the subsystem decomposition, and exact interfaces of subsystems can automatically be derived. Moreover, because many analyses for architecture recovery are system-dependent, Rigi offers a scripting language that allows a maintainer to write his or her own clustering techniques.
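One of the operations just listed, the removal of omnipresent entities, can be approximated by a simple degree threshold on the resource usage graph. The sketch below only approximates that idea; the threshold and the edge representation are assumptions of mine, not Rigi's actual implementation.

    def remove_omnipresent(edges, threshold):
        """Drop entities connected to an unusually large share of the graph.

        'edges' is a set of (source, target) pairs of a resource usage graph; an entity
        whose degree exceeds 'threshold' (e.g., a logging routine or a global error flag)
        is treated as omnipresent and removed together with its incident edges.
        """
        degree = {}
        for s, t in edges:
            degree[s] = degree.get(s, 0) + 1
            degree[t] = degree.get(t, 0) + 1
        omnipresent = {e for e, d in degree.items() if d > threshold}
        kept = {(s, t) for s, t in edges if s not in omnipresent and t not in omnipresent}
        return kept, omnipresent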
Because the semi-automatic method proposed in this thesis uses Rigi for visual- ization and user interaction, Müller’s and my approach have a great deal in com- mon. However, my work focuses on atomic component recovery and, therefore, offers a wider selection of atomic component recovery techniques, while Müller’s work is also aimed at hierarchical subsystems. Moreover, the only way of com- bining different techniques in the original Rigi is to apply the techniques succes- sively, whereas the new Rigi supporting my semi-automatic method, namely, extended Rigi, handles alternative results in parallel and offers deep set operators and composition to combine these results. 11.2.2 Kazman and Carrière (Dali) Another extension of Rigi is Dali, developed by Kazman and Carrière (1997). Dali uses an SQL database to store information about systems to be reverse engi- neered. As a consequence, SQL can be used to specify clustering patterns. Kazman and Carrière distinguish two different levels of patterns: application- independent patterns and application patterns. Application-independent patterns are used to aggregate declarations according to the language semantics, like local variables with their enclosing function or data and function members with their class (another type of low-level application-independent patterns filters noise introduced by insufficient tools for parsing and semantic analysis; these patterns are not needed when a capable C++ frontend is available). The purpose of appli- cation patterns is to group functions and classes to subsystems. These patterns leverage naming conventions or are enumerations of related elements. The advantage of Dali is its support for SQL queries to specify clustering pat- terns; patterns can be written in a declarative manner and in a language that is widely used. The patterns can be viewed as a specification of the system structure and be used to re-generate the visualization of the system when the system has changed (in which case, the patterns have to be updated accordingly). Moreover, 373 Semi-Automatic Methods the patterns can be employed to group elements quickly instead of grouping them manually. In the case of patterns that are mere enumerations, however, this advan- tage is only limited. On the other hand, there is no analytic capability in Dali. Hence, the patterns have to be written by someone already familiar with the sys- tem and cannot be reused for other systems in many cases. Refinement of the results of a pattern is difficult because the user cannot interact with the system by direct manipulation since the refinement has to be done in the pattern. For the same reason, combinations of results are difficult. In principle, combining opera- tors like deep union and intersection could be written in SQL as well using tem- porary tables and logical or and and operators, but the result (probably refined during validation) would have to be expressed as an SQL query afterward to obtain the advantages of a separate specification of the system structure as SQL patterns. 11.2.3 Yeh et al. (ManSART) The extended Rigi has also much in common with ManSART, a tool developed at MITRE that supports architecture recovery (Yeh et al. 1997). ManSART visual- izes different views of the system that are directly derived from source code, such as task-spawning views, dataflow between procedures and data files, and abstract data type views. In order to combine different views for presentation purposes, several operators are offered to the user. 
The purpose of the operators is to con- nect distinct views at different levels of abstractions (e.g., a task-spawning view with the abstract data type view), whereas the extended Rigi offers operators to find agreements and differences of component views or to unite two views where all views are at the same level of abstraction. In order to establish correspondence among concepts in different views when views are combined, a containment rela- tion is used by ManSART that is based on source positions of statements imple- menting the concepts. When the extended Rigi combines two views, it can consider the base entities contained in components because the corresponding components are at the same level of abstraction. 11.2.4 Gall, Klösch, and Weidl All approaches that have been presented in this section so far, including the one underlying extended Rigi, are primarily bottom-up approaches. The search starts at the level of declarations extracted from source code, which are then grouped Related Research 374 together by automatic techniques and human judgement. Gall, Klösch, and Weidl complement bottom-up clustering by a top-down approach (1998). They also use bottom-up clustering heuristics as described in Section 11.1.3. However, they go beyond a pure bottom-up process by using a domain model built by a human engi- neer (e.g., using the unified model language UML). The domain model describes the application concepts and their relationships. Part of the recovery process is to bind the domain concepts to the components found by the bottom-up phase. In order to establish this mapping, a similarity metric is used based on similarity of names of domain concepts and source code identifiers and on similarity of types (Weidl and Gall, 1998). Because it is not always possible to establish the mapping using the similarity metric only, the user is integrated in the binding process. The domain model may be used to make the recovery process more goal-directed and may increase the chance to find components with application semantics. On the other hand, other programming concepts may be missed that are also neces- sary to understand the system or could be reused in another context. Moreover, the explicit domain model and a mapping from domain concepts to components in the source implementing these domain concepts is a valuable documentation. The bottom-up approaches discussed above also use a domain model ¾ but it exists only in the head of the maintainer. However, building a domain model needs additional effort and the necessary degree of detail of a useful domain model is not known in advance. 11.2.5 Murphy, Notkin, and Sullivan (Software Reflexion Model) Another top-down approach is the Software Reflexion Model by Murphy, Notkin, and Sullivan (1995). The Software Reflexion Model is to capture and exploit the differences that exist between the source code organization and the designer’s mental model of the high-level system organization. An engineer defines a high- level model of the structure of the system and specifies how the model maps to the source. A tool then computes a software reflexion model that shows where the engineer’s high-level model agrees with and where it differs from a model of the source. The primary purpose of this technique is to streamline the amount of time it takes for someone unfamiliar with the system to understand its source code structure. 
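The comparison such a tool performs can be summarized in a few lines. The sketch below assumes the usual classification of a software reflexion model into convergences, divergences, and absences; the data representation is mine, not that of Murphy, Notkin, and Sullivan's tool.

    def reflexion_model(high_level_deps, source_deps, mapping):
        """Classify source-level dependencies against a high-level model.

        high_level_deps: set of (module, module) pairs the engineer's model allows
        source_deps:     set of (entity, entity) pairs extracted from the source
        mapping:         dict entity -> module, as specified by the engineer
        """
        lifted = {(mapping[s], mapping[t]) for s, t in source_deps
                  if s in mapping and t in mapping}
        convergences = lifted & high_level_deps     # expected and found
        divergences = lifted - high_level_deps      # found but not expected
        absences = high_level_deps - lifted         # expected but never found
        return convergences, divergences, absences

A divergence points the engineer at source dependencies the high-level model does not anticipate; an absence points at expected dependencies that the source does not realize.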
375 Plan Recognition Techniques 11.3 Plan Recognition Techniques A research area that would directly profit from atomic component detection is plan recognition. A plan is a stereotypical coding pattern for algorithms or data structures used to implement frequently used concepts, comparable to design pat- terns at the design level. For example, a stack is often implemented by a stack pointer and an array; push adds its argument to the array and increases the stack pointer. Such known coding patterns can be used to identify the pieces of codes that implement the concept. For example, van Deursen, Woods, and Quilici used plans for leap year calculations in order to find code pieces that would incorrectly handle the leap year 2000 (1997). There are many approaches to plan recognition that can be distinguished by the intermediate form they work on (source code, abstract syntax tree, lambda expression, control flow graph, or data flow graph), the underlying recognition technique (textual pattern matching, tree matching, graph parsing), and the classes of plans (control concepts, data structure con- cepts) (Wills, 1992; Harandi and Ning, 1990; Hartman, 1991; Quilici et al. 1997). A detailed classification of existing plan recognition techniques was com- piled by Wills (1992). The difference to atomic component detection is that plan recognition does not only identify the pieces of a concept (functions, variables, types) but is also able to assign a concept to these pieces. However, automatic plan recognition is com- putationally demanding. Atomic component detection techniques can be used as pre-analyses that identify the cohesive elements; then, plan recognition tech- niques could investigate each single atomic component individually, which would reduce the search space by an order of magnitude. 11.4 Connector Detection Techniques Research for techniques to detect architectural connectors has mostly concen- trated on connectors for parallel and distributed systems, such as spawning pro- cesses, remote procedure calls, socket calls, and so forth. These kinds of connectors can often be identified as a set of function calls of libraries imple- menting synchronization and communication means for parallel and distributed systems. Plan recognition techniques can be used to find these patterns. Harris et Related Research 376 al. identify such function calls within the abstract syntax tree using tree pattern matching (1995). Fiutem et al. additionally exploit data flow information in order to establish communication channels between different parts of a system and pro- pose a language to specify plans for connector detection (1996). However, most systems are sequential and even parallel systems have a large sequential part and would also profit from connector detection. Simple connec- tors, like procedure calls, are trivially detected. Calls via function pointers need points-to analyses. Actual data bindings via shared variables would also require control and data flow analyses. Moreover, atomic components can often be con- sidered connectors between different parts of the system; e.g., one parts fills a queue and the other reads from the queue. I have no knowledge of any approach that investigates the role of atomic components as connectors. 377 Chapter 12 Conclusions This chapter summarizes the contributions and conclusions of this thesis and pro- poses further research directions. 12.1 Conclusions Section 1.4 has stated the scientific questions addressed in this work. 
This section briefly summarizes the answers to these questions given in previous chapters. What published structural techniques exist and how can they be unified and classified? Chapter 5 described a taxonomy for structural techniques for atomic component detection. The class of concept-based approaches was added to this taxonomy in Section 11.1.2. The class of structural techniques comprises: • Connection-based approaches cluster entities based on a specific set of direct relationships (and their quality) between entities to be grouped. All the connection-based approaches can be unified by using the same generic algorithm for clustering based on a function Connected_Entities that yields the elements that are to be grouped with an entity. The approaches differ only in the exact specification of this function. • Metric-based approaches cluster entities based on a metric using an iterative clustering approach. The metric-based approaches are based on connections, too, but they differ from connection-based approaches by the degree of free- Conclusions 378 dom that is offered by the metric parameters and the threshold that can be var- ied to find atomic components with varying confidence. • Graph-based approaches derive clusters from a graph by means of graph-the- oretic analyses. The difference to connection-based approaches is that the whole graph has to be considered whereas connection-based approaches regard only direct relationships between entities in order to decide whether they should be grouped. • Concept-based approaches use concept analysis to compute a lattice of con- cepts. They use heuristics to detect atomic components within this lattice. The approaches differ in the objects and attributes considered and the way the con- cept lattice is interpreted to retrieve components. Beyond structure-based techniques, there are other automatic techniques based on data flow information and domain information (Section 11.1.3). Semi-automatic techniques integrate the user in the detection process. Most of the semi-automatic techniques are bottom-up approaches that start at the global code level (Chapter 9, Sections 11.2.1 and 11.2.2). Top-down approaches start at a high-level model of the system and try to map the concepts in the high-level model to the source (Sec- tions 11.2.4 and 11.2.5). Combined semi-automatic techniques tackle the problem of component recovery from both sides. An overview of the taxonomy and the respective techniques for atomic component detection is shown in Figure 12-1. Techniques that are in more than one class of the taxonomy are marked by é. What is the recall rate and precision in atomic component detection of pub- lished techniques? In Chapter 6, an evaluation of the basic techniques was described in which the results of the automatic techniques were compared to components found by soft- ware engineers. The evaluation has revealed the following points: • The effectiveness of a technique depends upon system characteristics, like degree of information hiding, proper module decomposition, and layering. • None of the investigated techniques has a sufficient recall rate. The best recall rate we obtained was 75% of the abstract data objects. In the worst case, the best technique reached only a recall rate of 34% (Similarity Clustering for ADTs of Aero; Figure 7-22 on page 237). 379 Conclusions Figure 12-1. Taxonomy for atomic component detection techniques. 
The figure assigns to each class of the taxonomy (connection-based, metric-based, graph-based, and concept-based structural techniques; dataflow-based and domain-based techniques; and semi-automatic bottom-up and combined approaches) the techniques belonging to it, together with their authors and the sections in which they are described: Same Module, Global Object Reference, Internal Access, Part Type, Same Expression, Delta-IC, Internal/External Connectivity, Type-based Cohesion, Arch, Dominance Analysis, Strongly Connected Components, Functional Cohesion, and Data Bindings among the structural techniques (Sections 5.2 to 5.12 and 11.1.1); the concept-based approaches of Lindig and Snelting, Siff and Reps, Sahraoui et al., and Canfora et al. (Section 11.1.2); the dataflow-based and domain-based approaches of Valasareddi and Carver and of Gall and Klösch (Section 11.1.3); and the semi-automatic approaches of Müller et al., Kazman and Carrière, and Gall, Klösch, and Weidl (Section 11.2). Techniques that fall into more than one class, such as Delta-IC, Functional Cohesion, and Gall and Klösch's approach, are cross-referenced within the figure.

• Many candidates the techniques provide correspond only roughly to the reference components; i.e., elements of these atomic components were superfluous or lacking.
• By combining the automatic approaches instead of using a single approach, one could significantly improve the discovery of the reference components.
• Yet, between 35 and 50 percent of the components still could not be completely found even by uniting the results of the techniques. However, the components may at least partially be matched.
• It turned out that 42% of the ADO candidates and 41% of the ADT candidates classified as false positives in the evaluation could indeed be considered correct positives after an analysis of the false positives for Aero, Bash, and CVS and a selection of automatic techniques; they were either too small to be considered, simply overlooked in the manual process, or represented alternative views.
• We found common patterns of false positives in all systems that could be used to filter out false positives from the set of candidates.
• Whereas the groups of software engineers needed about 20 to 35 hours to compile the list of atomic components for each of our subject systems (except for Mosaic, which was used in the experiment), each atomic component produced by the techniques can be checked by software engineers within minutes. To browse the whole list of false positives of all automatic techniques, we needed less than 6 hours per system. The time needed for validation can even be reduced by merging similar candidates of different techniques based on the partial subset relationship because there were many similar false positives among the candidates.

How can these techniques be improved individually?

Chapter 5 proposed several smaller improvements for published techniques. One of these techniques, namely Schwanke's Arch approach, was enhanced in so many ways that it can be considered a new technique.
The extended technique, Similarity Clustering, is described in Chapter 7 in detail. The properties of Simi- larity Clustering are summarized here: • Similarity Clustering is the most general approach described in this thesis. All connection-based techniques can be subsumed under it. It can detect all kinds of atomic components. It goes beyond other approaches in that it also considers relationships to common third entities and informal aspects. 381 Conclusions • Similarity Clustering can be used both to search for specific user-defined pat- terns and search for similar patterns of already found atomic components. • The adjustable parameters of Similarity Clustering offer more flexibility. If atomic components similar to those already found are to be searched for, the parameters can automatically be calibrated on the known components using traditional optimization techniques, such as Simulated Annealing or Gauß- Seidel optimization. • Similarity Clustering is one of the most effective techniques as far as the recall rate is concerned. On the other hand, it had also more false positives than other approaches in the evaluation (except for Aero, which has more). However, this is partly because the same threshold was used for all atomic components when the candidates were retrieved from the tree of clustered entities generated by Similarity Clustering. In an interactive application, the maintainer would begin the validation at the leaves and move up the tree toward the root until a node is reached for which the component membership is doubtful; i.e., thresholds indi- vidual to components would be used. • For all techniques other than Similarity Clustering, there is one single criterion used for clustering. Hence, the reason why a technique has grouped entities together is obvious. This is less obvious for Similarity Clustering when the similarity metric considers several aspects at the same type, which may com- plicate validation of candidates proposed by Similarity Clustering. • Time and space complexity for Similarity Clustering is basically linear to the number of entities considered when informal information is excluded (assum- ing an upper constant limit of neighbors an entity can have). However, when informal information is used, each entity has to be compared to any other entity resulting in a time complexity of O(n2). How can the techniques be combined? Chapter 8 described high-level operators like deep intersection, union, and com- position offered to a maintainer in order to combine the results of the techniques instead of combining the methods technically. This way, new techniques can be added with very little effort and the maintainer has all flexibility to combine the analyses. The specification of the combining operators was extended to hierarchi- cal subsystems as a consequence to the wish of many participants of the experi- Conclusions 382 ment described in Chapter 10 to have a grouping mechanism beyond atomic components in order to group related atomic components. Another way of combining the techniques was presented as the voting approach in which the agreement of each technique is polled, weighted, and summarized to a total agreement whether a given cluster is a promising candidate. In order to add a new technique to the voting approach, its underlying heuristic has only to be expressed as a metric. How can the user be integrated in atomic component detection? 
In Chapter 9, an incremental semi-automatic method was described that integrates the user into atomic component detection. The analyses, selected by the user, are used to propose candidates that are then validated by the user. The information added by the maintainer is used by the techniques in the next iteration. In order to realize this iterative process, the techniques had to be enhanced to work incre- mentally. Chapter 9 gives also some advice in which order the analyses should be applied. Do automatic techniques support a maintainer in atomic component detec- tion? Section 10.3 has presented a controlled experiment in order to find out whether the automatic techniques are helpful within the semi-automatic method. The eval- uation of the experiment could neither show a positive nor a negative effect on atomic component detection. However, the following restrictions of the experi- ment restrain us from generalizing the results too far: • The system used for this experiment was well structured such that the experi- mental subjects using automatic techniques had only little advantage. • The extended Rigi used within the experiment did not offer all the functionality proposed in this thesis. Many improvements were only inspired by the experi- ment. • The experimental subjects were all well trained students of computer science and had grades above the average. Less talented programmers might profit more from automatic analyses. 383 Conclusions Despite of these considerations, there are two fundamental realizations one has to be conscious of: Firstly, the semi-automatic method can only be as good as the underlying analyses and secondly, the effort of validating candidates is a constant factor that cannot be eliminated. Consequently, a way to improve atomic compo- nent detection with the semi-automatic method is to use more reliable and cover- ing analyses. More reliable techniques are needed to reduce the number of candidates to be validated; more coverage is needed to find as many components as possible in order to avoid manual search. The prototype supporting the semi- automatic method used structure-based techniques only and inherited their weak- nesses that are described in Chapter 6 and Chapter 7. However, even with more sophisticated automatic support, the effort for component recovery can only be reduced to the constant factor of validation. Validation will always be necessary because I doubt that we can ever find absolutely reliable techniques since the cri- teria for cohesive components are vague to some degree and legacy systems rarely employ information hiding. However, if more reliable techniques are available, the semi-automatic method and its realization in the extended Rigi is flexible enough to integrate these new techniques quickly. The method as such and the way how user interaction is sup- ported would not have to be changed. Are the techniques and methods for atomic component detection discussed in this work also helpful for other typical maintenance tasks? In Section 10.4, a case study was described that was performed to investigate the ability of the extended Rigi to support maintenance tasks other than those it was originally designed for, namely, atomic component detection. The goal of this case study was also to learn what other types of automatic analyses would be use- ful for maintainers. 
At least two of the four tasks assigned to the participants in this case study and dealing with global name binding are supported by the extended Rigi or could easily be solved by simple enhancements. These analyses could return the results within a minute where participants needed up-to two hours. For the other tasks, at least Rigi’s cross-reference and browsing capabili- ties were helpful. In order to find function clones or answer lifetime and protocol questions, more rigorous analyses are needed. Conclusions 384 12.2 Future Research This sections proposes future research directions based on the results of this the- sis. 12.2.1 Data Flow Information The evaluation of the automatic structural techniques has shown that their recall rate does not compare to human detection and in the experiment for the semi- automatic method, a positive effect of the structure-based techniques could not be shown. This is partly due to system characteristics; if programmers obeyed to the rules of information hiding, there are structural techniques, like Internal Access for ADTs and Global Object Reference for ADOs, that would reliably identify all atomic components. However, because programmers do break information hiding principles very often in practice, distinct concepts in the source code are merged to single candidates by the techniques. This is because structural techniques lever- age only coarse information about the relationships among types, variables, and subprograms. Control and data flow information might be an avenue to come to finer-grained analyses with more reliability. It was already mentioned that the assumption of Part Type that the part type is put into or retrieved from the container type could be checked by data flow analysis. Same Expression could be extended to group variables that are in the same pro- gram slice (a program slice comprises all statements that contribute to the value of a variable). Slicing techniques could also be used to check whether there are independent slices in a subprogram with respect to different global variables; if so, the subprogram likely performs multiple functions and should be excluded from clustering and be presented to the user instead. Excluding subprograms with multiple purposes avoids merging of candidates. Control flow analysis could reveal control dependencies among variables, which could be used as another grouping criterion. On the other hand, data flow analysis is expensive both in terms of computational costs and the costs to build an infrastructure for data flow analyses. The advan- tage of structural approaches is that they are comparatively simple and fast. 385 Future Research 12.2.2 Domain Knowledge Programmers understand programs not only in terms of programming semantics; they also base their understanding on their knowledge about the domain. Actu- ally, understanding a program involves to establish a mapping between the con- cepts in the domain and the concepts found in the code. In the semi-automatic method, domain knowledge comes into play by the user who selects appropriate analyses and validates their results. The structural techniques, however, do not use domain knowledge. On one hand, the analyses can therefore be used in differ- ent contexts. On the other hand, domain knowledge may improve automatic com- ponent recovery. A first approach by Gall et al. (1998) toward using domain knowledge for component recovery was already discussed in Section 11.2. 
In order to establish a mapping between domain model and recovered candidates, Gall et al. propose a similarity metric based on name similarity and type similarity. This approach still needs human intervention. Earlier approaches, like the one of DeBaud and Rugaber, were purely manual (1995). More research is necessary to explore further ways to establish the mapping. Meanwhile, at least some elements of a (partial) domain model would be helpful. For example, a domain model defines the vocabulary used. A vocabulary would be useful in order to let Similarity Clustering (which uses naming conventions) know that Create_Account is more similar to Release_Account than to Create_List. Moreover, if the implementation characteristics for certain domain concepts are known, plan recognition techniques could be used to find the concepts as implemented in the source code. Hence, atomic component detection and plan recognition could complement each other.

Furthermore, a domain model exists only in rare cases, for example, when it was created as part of a process toward a product line. If a domain model is created only for atomic component detection, it is not per se clear whether the additional effort for setting up the domain model really pays off. One may argue that domain models have additional benefits beyond component recovery, like documentation or support for reuse. However, at least for documentation purposes, a domain model may be too general since it describes a whole domain and not just the system at hand. Further research should be devoted to evaluating the costs and benefits of domain models, where the costs and benefits should be measured not only with respect to component recovery but also with respect to other aspects like the value added to documentation and reuse.

12.2.3 Research Directions Concerning the Method

The semi-automatic method could be further improved by allowing more interaction. For example, the user should be able to annotate variables and record components as public once the constituents of the atomic component have been found; the automatic analyses would then exclude public entities from consideration. Another example of more interaction is that, in the case of abstract data type detection, the user should be able to pick a single data type for which the accessor functions are to be detected, i.e., only one candidate is created at a time. Currently, many candidates are created, and consequently, distinct components are merged into a single candidate when there is a subprogram that apparently belongs to both components. Clustering only one abstract data type at a time would hence avoid erroneously large components. However, this would not solve the problem of merged components in abstract data object detection because ADOs mostly comprise several variables and not just one type, as is the case for most abstract data types. Alternatively, one could allow overlapping candidates instead of merging them (subprograms in more than one candidate should then be highlighted).

The experiment to evaluate the semi-automatic method should be repeated. The experimental design and the statistical analysis described in Section 10.3 can be reused. However, future experiments should take the structure of the system into account. That is to say, instead of a single independent variable, one would use two independent variables: tool support and the shape of the system. Furthermore, more experimental subjects, and preferably common programmers, should be involved.
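Returning to the vocabulary idea of Section 12.2.2: a domain vocabulary can be used to weight identifier tokens so that domain terms count more than generic verbs. The following sketch is purely illustrative; the weighting scheme and the toy vocabulary are my own and are not Weidl and Gall's similarity metric.

    def name_similarity(a, b, vocabulary, domain_weight=2.0):
        """Crude token-based similarity of two identifiers.

        Tokens found in the domain vocabulary (e.g., "account") count more than
        generic tokens (e.g., "create"), so Create_Account ends up closer to
        Release_Account than to Create_List.
        """
        def tokens(name):
            return {t.lower() for t in name.split("_") if t}
        def weight(ts):
            return sum(domain_weight if t in vocabulary else 1.0 for t in ts)
        ta, tb = tokens(a), tokens(b)
        shared, union = ta & tb, ta | tb
        return weight(shared) / weight(union) if union else 0.0

    vocabulary = {"account"}   # hypothetical (partial) domain vocabulary
    print(name_similarity("Create_Account", "Release_Account", vocabulary))  # 0.5
    print(name_similarity("Create_Account", "Create_List", vocabulary))      # 0.25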
12.2.4 Role Identification The future research directions proposed above are aiming at atomic component detection as such. Other research paths should take atomic component detection as a starting point. These paths should aim at the semantics of atomic compo- nents. While techniques for atomic component detection group related entities, more knowledge is necessary to understand the purpose, or role, of the compo- nents. Two levels of roles are associated with atomic components whose identifi- cation would help in program understanding: • intra-component roles: roles of base entities within the atomic component • inter-component roles: roles of atomic components within the system 387 Future Research 12.2.4.1 Intra-Component Roles The constituting base entities of a component can be distinguished into public and private elements. Public elements can be used by clients of the component, whereas private elements are hidden within the component. The decision of whether an entity should be considered private or public has to be made by the maintainer. Automatic analyses may only identify the current state. Accessor functions of an atomic component can further be classified as follows: • constructor: creates a new (instance of the) component (hints: call to memory allocation routines like malloc, setting record components with literals) • destructor: releases an existing (instance of the) component (hints: call to memory deallocation routines like free) • selector: returns information about an existing (instance of the) component without changing it (hints: the data of the component are not changed) • modifier: changes the state of an existing (instance of the) component (hints: the data of the component are changed) This is no strict classification because in real-world programs there are often rou- tines that fall into more than one category, for example, due to efficiency consid- erations, a routine may act as modifier and selector. The categorization of a routine into the above classes describes a part of the impact of a routine call (only a part because a routine could also have side-effects) and is a useful information to a maintainer. Further analyses using plan recognition techniques could reveal more detailed characteristics of individual accessor functions. For example, Hartman has devel- oped a technique to recognize so-called control concepts like “do-loop”, “read- process loop”, “succeed-fail loop”, and “bounded linear search” (1991). Restrict- ing recognition to control concepts allows for efficient detection. The characteris- tics of accessor functions in terms of their underlying (control) concepts could be used as indices in a framework that supports reuse. These characteristics can be used to retrieve the accessor functions from the base of reusable components. Conclusions 388 12.2.4.2 Inter-Component Roles Inter-component roles of atomic components describe their association to other components and their purpose within the system. One major motivation for atomic component detection is migration of procedural programs to object-oriented programs (Fanta, 1999). In order to leverage the expressiveness of object-oriented languages and to simplify the system structure by removing redundant code, inheritance relationships should be identified for this kind of migration. Most legacy systems, however, are not designed based on the object-oriented paradigm and, therefore, inheritance cannot be recognized. 
On the other hand, in three of the five systems we have used to evaluate the automatic techniques and the semi-automatic method, namely, Aero, Bash, and XCoral, we have found atomic components that are actually so similar to each other that they could be implemented as types of the same class. Inheritance relationships may be identified via variant records (in C, as unions of structs), record types with record components of similar names and/or types and similar accessor functions. Here, several different reverse engineering techniques could be integrated: atomic component detection to identify cohesive components, concept analysis for iden- tifying similar types (where record component names and/or types would act as attributes), and clone detection techniques to identify similar functions. Another kind of relationship useful for documentation of data models and explic- itly represented in object-oriented modeling languages is aggregation. Actually, aggregation is already present in the resource usage graph as the part-type rela- tionship. Other associations may be identified as calls to accessor routines or ref- erences to data of other components. These associations may be a starting point for gathering more specific information about the role of associated components, like: • one component C1 is a wrapper to another component C2 if all accesses to C2 are via C1 • one component C is a data store if C contains some kind of buffer and there are read and write accesses to C, but C does not access other components unless they are themselves data stores 389 Future Research • one component C is a connector between two components C1 and C2 if C is a data store and C1 writes into C and C2 reads from C (in Belady and Evange- listi’s terminology, is a data binding via C) The examples above represent simple design patterns (Gamma et al. 1994). Future research could investigate to which degree more complicated design pat- terns can be recognized. This would not only be useful for program understand- ing but also for validation whether a design pattern has actually been implemented correctly. Recognition of most design patterns require control and data flow analyses, in particular, when non-global objects as instances of abstract data types are investi- gated that may pass through chains of function calls as actual parameters. 12.2.5 Protocols The exact interface of an atomic component, C, can be identified as the declara- tions of C used outside of C and the declarations C uses from its context (Müller, 1993). The former constitutes the actual syntactic interface. Though deriving the actual syntactic interface provides interesting information, it falls short of a cli- ent’s need for information on how to use the component. The interface of a com- ponent is conjoined with a protocol that specifies the allowable order of actions associated with the component. An action associated with a component is a call of one of its accessor functions or a reference to its data. For example, a client has to know whether it is necessary to call a constructor or a destructor and under which circumstances a call to a modifier or selector is allowed. Assuming the protocol of a component is observed, hints on its form can be derived from the actual usage of the component within a given program. 
For example, one attempt toward protocol recovery can be made by extracting the actions associated with a component as regular path expressions from the pro- gram (Tarjan, 1981) and presenting them as finite state machines to the user. The user can then use the extracted order of actions in order to specify the protocol. Extracting protocol information as actions visible outside of the component con- siders the component a black-box. A complementary glass-box approach would look inside of the component in order to identify pre- and post-conditions of indi- Conclusions 390 vidual accessor functions to get hints on the admissible order of actions. Both black-box and glass-box approaches are necessary: Black-box approaches sup- port protocol recovery only for actual, given usages of the component and there may be more allowable usages. On the other hand, in many cases, the assumption that a certain subprogram of the component has been called is (undocumented) part of a precondition. Hence, the actual order of actions established by black-box approaches would be of benefit to glass-box approaches. 12.2.6 Subsystem Detection Related atomic components themselves may be grouped to lower-level sub- systems and lower-level subsystems can be grouped to higher-level subsystems in order to derive a hierarchical decomposition of the system as-built. In particular for large systems, subsystems are an important grouping mechanism for under- standing, maintenance, and management purposes. Atomic component detection is a starting point for subsystem detection. Tech- niques similar to atomic components detection can be used to detect subsystems. For, example Same Module could be extended to Same Directory, Similarity Clus- tering could be used with a similarity metric based on atomic components instead of base entities, Dominance Analysis can be used as described in Section 11.1; if atomic components to be hidden are known, Internal Access could be extended, and so forth. The main difference to atomic component detection is that a hierar- chy of components is to be detected instead of a list of flat components. Techniques for subsystem detection could easily be integrated into the extended Rigi since the combining operators have already been defined for hierarchical clusters. 12.2.7 Architectural Conformance Recovering architectural information from source code is not only necessary for system understanding. It is also necessary to validate architectural specifications unless the code is automatically generated from the specification and the genera- tion itself is reliable. A software architect may specify aspects like the structure of the system (atomic components, subsystems, hidden parts etc.), protocols of com- ponents, or configurations, e.g., as design patterns. In order to validate these spec- 391 Final Remarks ifications, it is necessary to recover the architecture as-built and compare it to the specification. Hence, three major aspects need to be addressed by research in architectural conformance: • Specification: - What is to be specified in software architecture? - What are suitable methods, notations, and tools for specifying software archi- tecture? - What kind of analyses are supported by these methods and notations? • Recovery: - How can we retrieve architectural information that needs to be validated? • Validation: - How can the architecture as-built be checked for conformance to the specifi- cation? 
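Returning to the protocol recovery idea of Section 12.2.5: as a crude black-box starting point, the actions observed for a component can be folded into a successor relation and presented as a finite state machine. The sketch below is my own illustration and does not implement the extraction of regular path expressions after Tarjan (1981); the trace data is hypothetical.

    def observed_protocol(traces):
        """Derive a crude finite state machine over component actions.

        'traces' is a list of observed call sequences of one component's accessor
        functions. States are the actions themselves; a transition a -> b is recorded
        whenever b was observed directly after a, and "START" marks the initial state.
        """
        transitions = set()
        for trace in traces:
            previous = "START"
            for action in trace:
                transitions.add((previous, action))
                previous = action
        return transitions

    # Hypothetical usage for a file-like component:
    print(sorted(observed_protocol([["open", "read", "read", "close"],
                                    ["open", "close"]])))

The resulting transitions can be presented to the user as a state machine whose admissible orders of actions the user can then confirm, restrict, or generalize into the component's protocol.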
12.3 Final Remarks

This thesis has contributed to architecture recovery by evaluating, improving, and combining automatic techniques for component recovery and by integrating these techniques in a general framework supporting an interactive and incremental process. The information gathered by the methods and techniques described in this thesis is helpful for program understanding, for other reverse engineering and reengineering techniques, and for software validation.

This thesis is a stepping stone toward automatic support for architecture recovery. The need for automatic support will increase as the size and complexity of systems and projects increase. However, the summit has not yet been reached. As this chapter has discussed, there is still much work to do in component recovery and in architecture recovery in general.

Pour soulever un poids si lourd,
Sisyphe, il faudrait ton courage!
Bien qu'on ait du coeur à l'ouvrage,
L'Art est long et le temps est court.
[To lift so heavy a weight, Sisyphus, your courage would be needed! Although one has the heart for the work, Art is long and Time is short.]
– Charles Baudelaire: Les fleurs du mal.

Appendix A
Entity-Relationship Model for Basic Structural Information

This appendix summarizes all entities and relationships introduced in the course of this thesis. The entity type hierarchy is shown in Figure 1-1 (abstract types are printed in italics); the non-abstract entity types are explained in Table 1-1. The relationship type hierarchy is presented in Figure 1-2; the non-abstract relationships are explicated in Table 1-2. Figure 1-3 is the entity-relationship model consisting of the entities of Table 1-1 and their relationships of Table 1-2.

Figure 1-1. Entity type hierarchy (abstract types printed in italics). [Diagram: is-a hierarchy over entity, base entity, architectural quark, object, user-defined type, subprogram, variable, constant, record component, record component specifier, record component instance, component, atomic component, subsystem, and module.]

Table 1-1. Alphabetic list of non-abstract entity types (entity: meaning, section).
atomic component: named flat set of cohesive base entities (Section 3.2.1)
constant: a global object whose value cannot be changed (Section 3.1.1)
module: a syntactic unit that is used to group base entities (Section 3.3)
record component instance: an actual record component of a record object that is associated with a memory location (Section 3.1.2)
record component specifier: a record component within a type declaration, which defines a part of the memory layout of all instances of this type (Section 3.1.2)
subprogram: a function or procedure of a program (Section 3.1.1)
subsystem: hierarchical set of related elements (architectural quarks, atomic components, and other subsystems) (Section 3.2.2)
user-defined type: a type introduced by a programmer (Section 3.1.1)
variable: a global object whose value can be changed (Section 3.1.1)

Figure 1-2. Relationship type hierarchy. [Diagram: is-a hierarchy over relationship, base relationship, signature-type, parameter-of, return, use, actual-parameter-of, part-type, take-address-of, of-type, set, call, reference, same-expression, local-obj-of-type, delineate, enclosing, obj-address-of, comp-address-of, obj-set, comp-set, obj-use, comp-use, mutually-exclusive, and part-of.]

Table 1-2. Alphabetic list of non-abstract relationships (relationship: source S → target T; meaning, section).
actual-parameter-of: object → subprogram; S is an actual parameter in a call to T (Section 3.1.1)
call: subprogram → subprogram; S calls T (Section 3.1.1)
comp-address-of: subprogram → record component; S takes the address of T (Section 3.1.2.2)
comp-set: subprogram → record component; S sets the value of T (Section 3.1.2.2)
comp-use: subprogram → record component; S uses the value of T (Section 3.1.2.2)
delineate: type → type; T is defined in terms of S as a synonym or as a new type (Section 4.2.3.1)
enclosing: record component specifier → type; S is enclosed by T (Section 3.1.2.1)
enclosing: record component instance → object or record component instance; S is enclosed by T (Section 3.1.2.1)
local-obj-of-type: subprogram → user-defined type; S has a local variable of type T (Section 3.1.1)
mutually-exclusive: architectural quark or component → architectural quark or component; S and T must not be in the same component and one must not be a part of the other (Section 8.2.1)
obj-address-of: subprogram → object; S takes the address of T (Section 3.1.2.2)
obj-set: subprogram → global variable; S sets the value of T (Section 3.1.2.2)
obj-use: subprogram → object; S uses the value of T (Section 3.1.2.2)
of-type: object → user-defined type; S is of type T (Section 3.1.1)
parameter-of: subprogram → user-defined type; S has a formal parameter of type T (Section 3.1.1)
part-of: architectural quark or component → component; S is part of T (Section 3.2.1)
part-type: type → type; S is a part type of T (Section 3.1.1)
return: subprogram → user-defined type; S returns a value of type T (Section 3.1.1)
same-expression: object → object; S and T occur in the same expression (Section 3.1.1)

Figure 1-3. The entities and their relationships. [Entity-relationship diagram over the entities of Table 1-1 and the relationships of Table 1-2; abstract entities are marked, and cardinalities ≤ 1 and ≥ 0 are annotated on the relationship ends.]

Bibliography

Aarts, E. and Korst, J. (1990), Simulated Annealing and Boltzmann Machines, Wiley-Interscience Series in Discrete Mathematics and Optimization.

Ada 95 Reference Manual (1995), ANSI/ISO/IEC-8652:1995.

Aho, A.V., Sethi, R., and Ullman, J.D. (1986), Compilers: Principles, Techniques and Tools, Addison-Wesley.

American National Standard Programming Language C (1989), ANSI X3.159-1989, American National Standards Institute, New York.

Arnold, K. and Peyton, J. (1992), A C User's Guide to ANSI-C, Addison-Wesley.

Baker, B. (1995), 'On Finding Duplication and Near-Duplication in Large Software Systems', Proceedings of the Working Conference on Reverse Engineering, pp. 86-95, IEEE Computer Society Press.

Baxter, I., Yahin, A., Moura, L., Sant'Anna, M., and Bier, I. (1998), 'Clone Detection Using Abstract Syntax Trees', Proc. of the Int. Conference on Software Maintenance, pp. 368-377, IEEE Computer Society Press.

Bayer, J., Girard, J.-F., Würthner, M., Apel, M., and DeBaud, J.-M. (1999), 'Transitioning Legacy Assets - a Product Line Approach', Proceedings of the SIGSOFT Foundations of Software Engineering, Toulouse, pp. 446-463, Association of Computing Machinery.

Belady, L.A. and Evangelisti, C.J. (1982), 'System Partitioning and its Measure', Journal of Systems and Software, vol. 2, no. 1, February, pp. 23-29.

Berliner, B. (1990), 'CVS II: Parallelizing Software Development', Proceedings of the 1990 Winter USENIX Conference, Washington, D.C.

Biggerstaff, T. (1989), 'Design Recovery for Maintenance and Reuse', IEEE Computer, pp. 36-49, July.

Binkley, A.B. and Schach, S.R. (1998), 'Validation of the Coupling Dependency Metric as a Predictor of Run-Time Failures and Maintenance Measures', Proc. of the Int. Conference on Software Engineering, pp. 452-459, April, IEEE Computer Society.
Booch, G., Rumbaugh, J., and Jacobson, I. (1997), Unified Modeling Language Reference Manual, Addison-Wesley. Buss, E., De Mori, R., Gentleman, W. et al. (1994), ‘Investigating Reverse Engineering Tech- nologies for the CAS Program Understanding Project’, IBM Systems Journal, vol. 33, no. 3, pp. 477-500, February. Canfora, G., Cimitile, A., Munro, M., and Taylor, C.J. (1993), ‘Extracting Abstract Data Types from C Programs: A Case Study’, Proceedings of the International Conference on Software Maintenance, pp. 200-209, IEEE Computer Society Press, September. Canfora, G., Cimitile, A., and Munro, M. (1996), ‘An Improved Algorithm for Identifying Objects in Code’, Journal of Software Practice and Experience, vol. 26, no. 1, pp. 25–48, January. Canfora, G., Cimitile, A., De Lucia, A., and Di Lucca, G.A. (1999a), ’A Case Study of Apply- ing an Eclectic Approach to Identify Objects in Code’, Workshop on Program Comprehen- sion, pp. 136-143, Pittsburgh, IEEE Computer Society Press. Canfora, G. (1999b), personal communication at the Workshop on Program Comprehension, Pittsburgh, May 4th-6th. Chikofsky, E.J., Cross II, J. H. (1990), ‘Reverse Engineering and Design Recovery: A Taxon- omy’, pp. 13-17, IEEE Software, January. Choi, S.C., Scacchi, W. (1990), ‘Extracting and Restructuring the Design of Large Systems’, IEEE Software, vol. 7, no. 1, pp. 66-71, January. Cimitile, A. and Visaggio, G. (1995), ‘Software Salvaging and the Call Dominance Tree’, Jour- nal of Systems Software, 28, pp. 117–127. DeBaud, J.-M., Rugaber, S. (1995), ’A Software Re-Engineering Method using Domain-Mod- els’, Proc. of the Int. Conference on Software Maintenance, IEEE Computer Society Press, pp. 204-213, October. 401 Doval, D., Mancoridis, S., Mitchel, B.S, Chen, Y., and Gansner, E.R. (1999), ’Automatic Clus- tering of Software Systems using a Genetic Algorithm’, http://www.mcs.drexel.edu/~serg/ Projects/Bunch/publication/index.html. Eisenbarth, T., Koschke, R., Plödereder, E., Girard, J.-F., and Würthner, M. (1999), ’Projekt Bauhaus: Interaktive und inkrementelle Wiedergewinnung von SW-Architekturen’, Konfer- enzband zum Workshop Reengineering, pp. 17-25, Bad Honnef, Universität Koblenz. Fanta, R. and Rajlich, V. (1999), ’Restructuring Legacy C Code into C++’, Proc. of the Int. Conference on Software Maintenance, pp. 77-85, IEEE Computer Society Press. Fenton, N.; Pfleeger, S. (1997), Software Metrics: A Rigorous and Practical Approach Pws Pub. Fiutem, R., Tonella, P., Antoniol, G., and Merlo, E. (1996), ‘A Cliche-based Environment to Support Architectural Reverse Engineering’, pp. 319-328, Proc. of the Int. Conf. on Soft- ware Maintenance. Fjeldstadt, R.K., and Hamlen, W.T. (1984), ’Application Program Maintenance Study: Report to Our Respondents’, Proc. GUIDE 48, IEEE Computer Society Press, April. Gall, H. and Klösch, R. (1995), ‘Finding Objects in Procedural Programs: An Alternative Approach’, Proc. of the Second Working Conference on Reverse Engineering, pp. 208-216, IEEE Computer Society Press. Gall, H., Klösch, R., and Weidl, J. (1998), ’Resolving Uncertainties in Object-Oriented Re- Architecturing of Procedural Code’, 7th int. Conf. on Information Processing and Manage- ment of Uncertainty in Knowledge-Based Systems, pp. 726-733, July. Gamma, E., Helm, R, Johson, R., and Vlissides, J. (1994), Design Patterns, Addison-Wesley. Garlan, D., Perry D.E. (1995), ‘Introduction to the Special Issue on Software Architecture’, IEEE Transactions on Software Engineering, Vol. 21, No. 4, pp. 269-274, April. 
Garlan, D, Shaw, M. (1993), ‘An Introduction to Software Architecture’, Advances in Software Engineeering and Knowledge Engineering, Volume 1, World Scientific Publishing Com- pany, New Jersey. Girard, J.F., Koschke, R. (1997a), ‘Finding Components in a Hierarchy of Modules: a Step Towards Architectural Understanding’, International Conference on Software Maintenance, pp. 58-65, IEEE Computer Society Press. Girard, J.F., Koschke, R., and Schied, G. (1997b), ‘Comparison of Abstract Data Type and Abstract State Encapsulation Detection Techniques for Architectural Understanding’, Work- Bibliography 402 ing Conference on Reverse Engineering, pp. 66–75, Amsterdam, The Netherland, October, IEEE Computer Society Press. Girard, J.F., Koschke, R., and Schied, G. (1997c), ‘A Metric-based Approach to Detect Abstract Data Types and Abstract State Encapsulation’, Conf. on Automated Software Engineering, Lake Tahoe, pp. 82-89, IEEE Computer Society Press. Girard, J.F., Koschke, R., and Schied, G. (1999), ‘A Metric-based Approach to Detect Abstract Data Types and Abstract State Encapsulation’, Journal on Automated Software Engineering, no. 6, October, pp. 357-386, Kluwer Academic Publishers. Girard, J.F. and Koschke, R., (2000), ‘A Comparison of Abstract Data Type and Objects Recov- ery Techniques’, Journal Science of Computer Programming, vol. 6, issue 2-3, March, pp. 149-181, Elsevier Science. Ghezzi, G., Jazayeri, M., and Madrioli, D. (1991), Fundamental Software Engineering, Pren- tice Hall International. Godin, R., Missaoui, R., and Aloui, H. (1995), ’Incremental Concept Formation Algorithms Based on Galois (Concept) Lattices’, Computational Intelligence, 11(2). Graudejus, H. (1998), Implementing a Concept Analysis Tool for Identifying Abstract Data Types in C Code, Diplomarbeit, University of Kaiserslautern, Germany. Harandi, M. and Ning, J.Q. (1990), ’Knowledge-based Program Analysis’, IEEE Software, Jan- uary, pp. 74-81. Harris, D.R., Reubenstein, H.B., and Yeh, A.S (1995), ’Recognizers for Extracting Architec- tural Features from Source Code’, Proceedings of the Working Conference on Reverse Engi- neering, pp. 227-236, Toronto, IEEE Computer Society Press. Hopcraft, Ullman (1983), Data structures and algorithms, Addison-Wesley. Hutchens, D.H. and Basili, V.R. (1985), ‘System Structure Analysis: Clustering with Data Bindings’, IEEE Transactions on Software Engineering, vol. SE-11, no. 8, pp. 749-757, August. Karypsis, G., (Sam) Han, E.-H., and Kumar, V. (1999), ‘Chameleon: Hierarchical Clustering Using Dynamic Modeling’, IEEE Computer, pp. 68-75, August. Kazman, R. and Carrière, S. J., (1997), ‘Playing detective: reconstructing software architecture from available evidence’, Technical Report CMU/SEI-97-TR-010, ESC-TR-97-010, Soft- ware Engineering Institute, Pittsburgh, USA. 403 Kazman, R., Woods, S., and Carrière, S. (1998), ‘Requirements for Integrating Software Archi- tecture and Reengineering Models: CORUM II’, Working Conference on Reverse Engineer- ing, Hawaii, pp. 154-163, October, IEEE Computer Society Press. Keller, H., Stolz, H., Ziegler, A., and Bräunl, T. (1995), ‘Virtual Mechanics Simulation and Animation of Rigid Body Systems with Aero’, Simulation for Understanding, vol. 65, no. 1, pp. 74–79, July. Kernighan, B.W., Ritchie, D.M. (1978), The C Programming Language. Prentice-Hall, Engle- wood Cliffs, NJ. Koschke, R., Rugaber, S., Steirwalt, K., and Wills, L. (1997), ‘Tupling’, Unpublished. 
Koschke, R., Girard, J.-F., and Würthner, M., (1998) ‘An Intermediate Representation for Inte- grating Reverse Engineering Analyses’, Working Conference on Reverse Engineering, Hawaii, October, pp. 256-267, IEEE Computer Society Press. Krone, M., Snelting, G. (1994), ’On the Inference of Configuration Structures From Source Code’, Proc. of the Int. Conference on Software Engineering, pp. 49-57, May, IEEE Com- puter Society Press. Lehman, M.M., Belady, L. (1985), ‘Program Evolution’, Processes of Software Change, Aca- demic Press, London. Lienert, G.A. (1973), Verteilungsfreie Methoden in der Biostatistik, Verlag Anton Hain, Meisenheim am Glan, Germany. Lindig, C. and Snelting, G. (1997), ’Assessing Modular Structure of Legacy Code Based on Mathematical Concept Analysis’, Proc. of the Int. Conference on Software Engineering, pp. 349-359, Boston. Liskov, B., Zilles, S. N. (1974), ‘Programming with Abstract Data Types’. SIGPLAN Notice, vol. 9, no. 4, pp. 50–60, April. Liu, S.S. and Wilde, N. (1990), ‘Identifying Objects in a Conventional Procedural Language: An Example of Data Design Recovery’, Int. Conf. on Software Maintenance, November, pp. 266-271, IEEE Computer Society Press. Livadas, P.E. and Johnson, T. (1994), ’A New Approach to Finding Objects in Programs’, Jour- nal of Software Maintenance: Research and Practice, no. 6, 249-260. Macro, A. and Buxton, J. (1987), The Craft of Software Engineering, Addison-Wesley, Read- ing, MA. Bibliography 404 Mann, H.B., and Whithney, D.R. (1947), ‘On a Test of Whether One of Two Random Variables is Stochastically Larger than the Other‘, Annuals of Mathematical Statistics, 18. Mayrand, J., Coallier, F. and Merlo, E. (1996), ‘Experiment on the Automatic Detection of Function Clones in a Software System Using Metrics’, Proc. of the Int. Conference on Soft- ware Maintenance, pp. 244-253, IEEE Computer Society. McCabe, T. (1998), keynote address at the Working Conference on Reverse Engineering, Hawaii, October. Müller, H., Uhl, J.S. (1990), ‘Composing Subsystem Structures Using (K,2)-Partite Graphs’, Proceedings of the Conference on Software Maintenance, pp. 12-19, IEEE Computer Soci- ety Press. Müller, H., Wong, K., and Tilley, S. (1992), ‘A Reverse Engineering Environment Based on Spatial and Visual Software Interconnection Models’, Proc. of ACM SIGSOFT Symposium on Software Development Environments, pp. 88-98, December. Müller, H., Orgun, M., Tilley, S., and Uhl, J. (1993), ’A Reverse Engineering Approach to Sub- system Structure Identification’, Journal of Software Maintenance: Research and Practice, vol. 5, no. 4, pp. 181-204, December. Murphy, G., Notkin, D., and Sullivan, K. (1995), ’Software Reflexion Models: Bridging the Gap between Source and High-Level Models’, Proc. of the ACM SIGSOFT Symp. Founda- tions of Software Engineering, pp. 18-28. NCSA (1997), ‘NCSA Mosaic Home Page’, National Center for Supercomputing Applications, http://www.ncsa.uiuc.edu/SDG/Software/Mosaic. Nosek, J.T. and Palvia, P. (1990), ’Software Maintenance Management: Changes in the Last Decade’, Journal of Software Maintenance, 2(3), pp. 157-174, Sept. Ogando, R.M., Yau, S.S., and Wilde, N. (1994), ‘An Object Finder for Program Structure Understanding in Software Maintenance’, Journal of Software Maintenance, vol. 6, no. 5, pp. 261–83, September-October. Owen, D.B. (1962), Handbook of Statistical Tables, Addison-Wesley. Parnas, D.L. (1972), ‘On the Criteria To Be Used in Decomposing Systems into Modules’, Communications of the ACM, Vol. 15, No. 12, pp. 
1053-1058, December. Parnas, D.L., Clements, P.C., and Weiss, D.M. (1985), ‘The Modular Structure of Complex Systems’, IEEE Transactions on Software Engineering, Vol. SE-11, No. 3, pp. 408-419, March. 405 Patel, S., Chu, W., and Baxter, R. (1992), ‘A Measure for Composite Module Cohesion’, Proc. of the 14th Int. Conf. on Software Engineering, pp. 38-48, Melbourne, May, Association of Computing Machinery. Perry, D.E., Wolf, A.L. (1992), ‘Foundations for the Study of Software Architecture’, ACM SIGSOFT, vol. 17, no. 4, pp. 40–52, October. Pitman, E.J.G., (1937), ‘Significance Test Which May be Applied to Samples From Any Popu- lation‘, Journal of the Royal Statistics Society 4. Plödereder, E. (1997), personal communication, Stuttgart, Germany. Quilici, A., Woods, S,. and Zhang, Y. (1997), ‘New Experiments With a Constraint-Based Approach to Program Plan Matching’, Working Conference on Reverse Engineering, pp. 114-123, IEEE Computer Society. Ramey, C. (1994), ‘Bash - The GNU shell’, Linux Journal, issue 3, August, Specialized Sys- tems Consultants. Rugaber, S., Stirewalt, K., and Wills, L. (1995), ‘The Interleaving Problem in Program Under- standing’, Proc. of the 2nd Working Conference on Reverse Engineering, pp. 166-175, Tor- onto, July, IEEE Computer Society Press. Sahraoui, H., Melo. W, Lounis, H., and Dumont, F. (1997), ’Applying Concept Formation Methods to Object Identfication in Procedural Code’, Proc. of Conference on Automated Software Engineering, Nevada, pp. 210-218, November, IEEE Computer Society. Salton, G. (1968), Automatic Information Organization and Retrieval, McGraw-Hill, New York. Salton, G. (1975), Dynamic Information and Library Processing, Prentice-Hall, Englewood Cliffs, N.J., pp. 80. Schwanke, R. W. (1991), ‘An intelligent tool for re-engineering software modularity’, Interna- tional Conference on Software Engineering, pp. 83–92, May. Schwanke, R.W., Hanson, S.J. (1994), ‘Using Neural Networks to Modularize Software’, Machine Learning, 15, pp. 136-168. Schwanke, R.W. (1998), personal communication by electronic mail, November. Sebesta, R. (1998) Concepts of Programming Languages, Addison-Wesley Publishing Com- pany, Inc. Bibliography 406 Shannon, C. E. (1972), The mathematical theory of communication. Urbana University of Illi- nois Press, ISBN 0-252-72548-4. Siff, M. and Reps, T. (1997), ‘Identifying Modules via Concept Analysis’, Proc. Int. Confer- ence on Software Maintenance, Bari, pp. 170-179, October, IEEE Computer Society. Siff, M. (1998), Techniques for System Renovation, Ph.D. Thesis, University of Wisconsin - Madison. Snelting, G. (1997), ’Reengineering of Configurations Based on Mathematical Concept Analy- sis’, ACM Transactions on Software Engineering and Methodology 5, 2, pp. 146-189, April. Snelting, G. and Tip, F. (1998), ‘Reengineering Class Hierarchies Using Concept Analysis’, Proc. ACM SIGSOFT Symposium on the Foundations of Software Engineering, November, pp. 99-110. Steinhausen, D., Langer, C. (1977), Clusteranalyse, Walter de Gruyter, Berlin, New York. Tarjan, R. (1972), ‘Depth-First Search and Linear Graph Algorithms’, SIAM, vol. 1, no. 2, pp. 146–160. Tarjan, R. (1974), ‘Finding Dominators in Directed Graphs‘, SIAM Journal of Computing, vol. 3, no. 1, pp. 62-87, March. Tarjan, R. (1981), ’Fast Algorithms for Solving Path Problems’, Journal of the Association for Computing Machinery, vol. 28, no. 3, July, pp. 594-614. Turski, W. (1981), ‘Software Stability’, Proc. 6th ACM Conf. on Systems Architecture, London. Tversky, A. 
(1977), ‘Features of Similarity’, Psychological Review, vol. 84, no. 4, July. Tzerpos, V. and Holt, R. (1997). ‘The Orphan Adoption Problem in Architecture Maintenance’, Proceedings of the Fourth Working Conference on Reverse Engineering, pp. 76-82, Amster- dam, October, IEEE Computer Society Press. Valasareddi, R.R. and Carver, D.L. (1998), ‘A Graph-based Object Identification Process for Procedural Programs’, Proc. of the Fifth Working Conference on Reverse Engineering, Honolulu, pp. 50-58, October, IEEE Computer Society Press. Van Deursen, A., Woods, S., and Quilici, A. (1997), ’Program Plan Recognition for Year 2000 Tools’, Proc. of the Fourth Working Conference on Reverse Engineering, pp. 124-133, IEEE Computer Society Press. 407 Weidl, J. and Gall, H. (1998), ’Binding Object Models to Source Code: An Approach to Object-Oriented Re-Architecturing’, Proc. of the 22nd Computer Software and Applications Conference, IEEE Computer Society Press, August. Weiser, M. (1984), ‘Program Slicing’, IEEE Transactions on Software Engineering, vol. SE- 10, no. 4, pp. 352-357, July. Winer, B.J., Brown, D.R., and Michels, K.M. (1991), Statistical Principles in Experimental Design, 3rd edition, McGraw-Hill Series in Psychology. Wills, L.M. (1992), Automated Program Recognition by Graph Parsing, Ph.D. Disseration, MIT, Cambridge, MA. Wirth, N. (1985), Programmieren in Modula-2, Teubner, third edition. Yeh, A.S., Harris, D., and Reubenstein, H. (1995), ’Recovering Abstract Data Types and Object Instances From a Conventional Procedural Language‘, Second Working Conference on Reverse Engineering, pp. 227–236, July. IEEE Computer Society Press. Yeh, A.S., Harris, D.R., and Chase, M.P. (1997), ’Manipulating Recovered Software Architec- ture Views’, Proc. of the Int. Conference on Software Engineering, pp. 184-194, Association of Computing Machinery. Yourdon, E. and Constantine, L.L. (1979), Structured Design, Prentice Hall, Englewood Cliffs, NJ. 
Bibliography 408 409 Index A abstract class 44 abstract cohesion 36 abstract data object 38, 113 with subordinated types 40 abstract data objects with subordinated types 40 abstract data type 113 state-based 40 abstract object 38 abstract usage 132 actual parameter view 69 actual-parameter relationship 46 actual-parameter-of 396 ADO 38 ADT 37 affine 70, 73 affinity tolerance parameter 70, 73 agent 190 aliased object 98 aliasing 98 analysis node 320 analysis of variance 335 anonymous type 90 ANOVA 335 antimonotone 354 Arch 142 architectural connector 375 architectural level 31 global code 33 higher architectural 33 lower architectural 33 lower code 33 architectural quark 32, 393 architecture recovery 29 aspects of similarity 189 assessment 302, 304 assignment 259 atomic architectural component 33 atomic component 23, 33, 35, 394 permissive 42 pure 42 atomic component context 210 atomic components affine 70 attribute non-abstract 123 average group similarity 188 B base entity 393 base relationship 51, 395 base type 93 base view 69 Bauhaus 24 block relation 359 body file 101 bound 263 bound entity 263 C call 396 relationship 46 call back 173 call view 68 callee neighbor function 65 caller neighbor function 65 candidate 112 CF 82 Choose_Node 278, 280 closely-related subprograms 125 cluster 112, 266 clustering 302, 303 clustering criterion 112 clustering domain 112 clustering range 112 code level 31 code structure level 31 cohesion 34, 36 abstract 36 coincidental 36 communicational 36 functional 36 logical 36 procedural 36 sequential 36 temporal 36 Collapse 149 collapsed view 273 Common 139, 191 common attributes 354 common objects 354 Common-eq 194, 195 Common-ne 194, 195 comp_address_of relationship 51 comp_set relationship 51 Index 410 comp_use relationship 51 comp-address-of 396 comparison by elements 283 comparison by name 283 component 22, 29, 35, 393 static architectural 34 component view mutually exclusive 259 components correspond 70 dissimilar 70 dynamic architectural 34 identical 70 matching 161 components view 69, 257 constraints 259 comp-set 396 comp-use 396 concept 355 concept lattice 356 concept partition 362 confirmation 259 connected graph component 119 connectivity 137 connector 22, 29, 375 constant 94, 394 constructor 387 contingency table 224 control flow data binding 352 correspond 70 correspondent 288 corresponding components 70 coupling 34 CPP 82 creation 259 D Dali 372 data binding 351 actual 351 potential 351 used 351 data file 368 deep difference 286 deep intersection 286 deep symmetric difference operator 298 deep union 286, 292 degree of connectivity 137 delineate 86, 396 dendrogram 204 design recovery 29 destructor 387 difference deep 286 difference operator 298 direct dominator 149 direct element 65 direct-elements 65 disjoint clusters 112 disjunctive attribute 360 dissimilar components 70 Distinct 139, 191, 194 distinct components 70 dominance tree 149 dominate 149 directly 149 immediately 149 primarily 150 E edge of the resource usage graph 58 source of 59 target of 59 elements 66 enclosing 396 relationship 48 enclosing components 66 Enclosing_Components 269 entity 393 bound 263 free 263 enum 88 enumeration 88 equivalent entities 61 equivalent features 194 equivalent relationships 61 evolution strategy 223 exact Fisher-Pitman randomization test 335 exact Fisher-Pitman test 337 exact interface 389 exact U-Test 335 exclusion 259 expected 336 expected U value 336 extended Rigi 331 extent 355 external connectivity 137 F F statistic 335 factor lattice 359 feature 189, 
190 features 139 equivalent 194 non-equivalent 194 file access 368 file usage view 368 first-degree neighbor 241 Fisher-Pitman Test 337 formal context 354 forward engineering 28 free 263 free entity 263 fully complemented context 361 function clone 341 function level 31 function pointer 96 functionally cohesive component 352 G Galois connection 354 Gauß-Seidel strategy 222 grid search 222 group 112 H Handle_correspondents 286 header file 83 hierarchical clustering 187 horizontally decomposable 358 horseshoe model 31 hybrid atomic component 40 hybrid component 40, 113 I IAR 333 iArch 142 identical components 70 IML 82 immediate dominator 149 improvement in internal connectivity 126 independent sublattice 358 indirect relation 191 individual absolute recall 333 individual composition 319 induced edge 273 411 infimum 356 informal information 189 intent 355 inter-component role 388 interface exact 389 syntactic 389 interference 359 interference graph 277 interleaving 181 internal connectivity 126, 136 intersection deep 286 intra-component role 387 intrinsic type 90 J join 356 L Last_Entity 267 left argument view 288 Linked 139 local-obj-of-type 396 relationship 46 logically related subprograms 41 longjmp 97 M macro 84 maintenance 29 ManSART 373 matching components 161 matchings 161 maverick 304 maximal individual similarity 188 meet 356 modality 189 mode variable 174 modifier 387 module 34, 35, 57, 83, 394 module structure 68 module view 69 MS 332 multiple entities assignment 302, 303 mutually exclusive 258 mutually-exclusive 396 N name equivalent 88 name space 84 named base type 93 named type 90 negative attribute 360 negative example 225 negative information 258, 318 neighbor second-degree 241 neighbor function 62 transitive closure 64 neighbors 62 nested routines 172 node of the resource usage graph 58 nodes 59 of an edge 59 non-abstract 123 non-abstract usage 132 non-equivalent features 194 non-parameterized statistics 335 O obj_address_of relationship 51 obj_set relationship 51 obj_use relationship 51 obj-address-of 396 object 44, 393 object instance 38 object reference view 68 objects direct elements 66 obj-set 396 obj-use 396 observed U value 336 of-type 396 relationship 46 operator difference 298 overlooked positive 170 P parameter passing 173 parameter variable 173 parameter-of 396 relationship 46 partial subset relationship 67 Partition_Number 278 partner 189, 190 part-of 54, 65, 66, 397 part-type 86, 397 relationship 46 patient 190 permissive atomic component 42 persistent object 368 physical file structure 68 physical module structure 68 plain Rigi 331 plan 375 plan recognition 375 positive example 225 Positive information 257 positive information 318 Potentially_Affine 288 predecessors 62 primary dominator 150 private record component 208 program 368 program dependency graph 370 program evolution 19 program slice 384 public record component 208 pure atomic component 42 R RCR 333 real false positive 171 record component 393 record component instance 394 record component specifier 394 redundant edge 55 reengineering 28 reference 395 reference corpus 154, 334 reference corpus recall 333 reference set 154, 228 referenced-objects neighbor function 65 referencing-subprograms neighbor function 65 referred-by 125, 132 referred-entities neighbor function 65 refer-to 125, 132 Index 412 rejection 259 related subprogram 113 related subprograms 41 related-to graph 267, 277, 281 related-to relationship 267 relation indirect 191 relational inverse 64 relationship 395 mutually exclusive 258 
restricted related-to graph 281 restructuring 28 return 397 relationship 46 reverse engineering 28 right argument view 288 Rigi 317, 371 role 189 RUA 83 S SAM 332 same expression 94, 119 same expression view 68 same-expression 94, 397 relationship 46 second-degree neighbor 241 selector 387 set 395 relationship 46 setjmp 97 sets of logically related subprograms 41 shallow union 284 Shannon information content 140 signature relationship 97 signature view 68 signature-subprograms neighbor function 65 signature-type 395 relationship 45 signature-types neighbor function 65 similarity aspect 189 simulated annealing 223 single 286 single entity assignment 302, 303 software architecture 29 software maintenance 29 Software Reflexion Model 374 source 59 source level 31 standard tools 331 state variable 173 state-based abstract data type 40 static local variable 171 Strongly Connected Base Component Analysis 275 Strongly Connected Collapsed Component Analysis 275 strongly connected component 40, 147 subconcept 356 subprogram 394 subprograms direct elements 66 logically related 41 subprograms related to 124 subsystem 34, 35, 55, 394 subsystem structure 55 successors 62 supremum 356 syntactic interface 389 system parameter 174 T take-address-of 395 relationship 46 target 59 tolerance 73 total-agreement 302, 305 training component 228 training set 228 transitive closure 64 transitive_closure 64 tupling 103, 119 type composition view 68 type declaration 86 type usage view 68 types direct elements 66 U U value 336 union deep 286, 292 shallow 284 uniquely complemented context 361 usage abstract 132 non-abstract 132 use 395 relationship 46 user view 257, 316 user-defined type 85, 394 U-Test 335 V variable 94, 394 view actual parameter 69 base 69 call 68 components 69 module 69 object reference 68 same expression 68 signature 68 type composition 68 type usage 68 view of the resource usage graph 67 voting approach 301 W weak congruency 359