Validation of Metrics-Based Quality Control

 

Welf Löwe

Applied Research in System Analysis (ARiSA) – Växjö University

 

1. Introduction to the research field

More than half of the total cost of ownership of a software system is maintenance cost. Hence, it is important to control software qualities like maintainability, re-usability, and portability directly during the development of a software system.

The ISO/IEC 9126 standard describes internal and external software qualities and their connection to attributes of software in a so-called Quality Model, cf. ISO/IEC 9126-1. The Quality Model follows the Factor-Criteria-Metric model and defines six quality characteristics (or factors), which are refined into sub-characteristics (or criteria). Sub-characteristics in turn are assessed by metrics, which measure the design and development process and the software itself. Currently, the proposed measurements need to be performed manually as they require human insight, e.g. in code reviews. Manual approaches, however, have a series of drawbacks:

(1) They are error-prone since they depend heavily on the subjects performing the measurement. Hence, they are not measurements in the mathematical sense, which are required to be objective and repeatable. Humans might overlook or even deliberately ignore certain problems.

(2) They are time-consuming. When taking, e.g., code reviews seriously, people have to read and understand code that they did not write themselves.

(3) They might cause tensions in the organization. There is a conflict of interest when, e.g., the project/quality manager requests reports from a developer that are at the same time used to evaluate the performance of that developer.

The drawbacks of manual measurement become more severe given current trends in software development, like outsourcing of development and integration of open source components into proprietary systems. Reliable software production must guarantee not only the functional correctness of external components but also their internal qualities. Manual quality measurement is not an option in these settings.

Finally, many customers of software systems, especially governmental organizations or those operating in security and safety critical areas, demand ISO 9000 quality certification from their vendors. ISO 9000 requires reasonable quantitative, statistical control over product quality as a basis for continuous quality improvement in the software and the software development process. This is, for the aforementioned reasons, hard to establish with manual quality control.

 

2. Project Proposal

2.1. Project Goal

We replace the ISO/IEC 9126 metrics that assess internal qualities manually with metrics that allow automatic measurement. This defines an adapted Quality Model. We validate the significance of this Quality Model with experiments in selected software development projects of our industry partners.

2.2. Expected Results

The project will deliver both research insights and practical methods and tool support for participating companies.

On the research side, we expect two major contributions:

(1) We define a novel Quality Model assessing internal quality (sub-)characteristics as defined by an industry standard with well-established metric analyses as proposed by the research community. This quality model is published as a compendium of software quality standards and metrics (cf. www.arisa.se/compendium/ for the structure and core of the compendium).

(2) We statistically validate the significance of that novel Quality Model, i.e. we support or disprove the hypothesis that static metric analyses allow for an assessment of internal qualities of software systems.

Together, (1) and (2) provide the theoretical basis for quality management assessing industrially standardized software qualities in an effective (since significant and objective) and efficient (since automated) way.

On the practical side, we produce tools and methods with a profound theoretical basis that support the quality assessment of software under development. By implementing them in the participating partner companies, we gain an understanding of their practicability and usefulness.

(3) We get insights on how our theory, tools and methods integrate with different quality management processes existing in industry. This includes insights on initial efforts for integration and possible/necessary adaptations of these processes.

(4) We quantify the speed-up of assessing internal quality automatically rather than manually, since we implement both approaches; the manual standard approach serves to validate the significance of the new automated approach.

As a side effect, we expect a higher awareness of internal quality issues in participating companies as a first result. We even expect improvements of software quality in the companies as well as improvements in their development process. These effects, however, will be evaluated qualitatively and subjectively in the first place and not validated statistically.

Once this project is successfully finished, a spin-off of the project will provide the acquired knowledge in a consulting service for software development companies.

2.3. Prerequisites Established Before

For the project to succeed, we have already solved a few practical problems.

The project will relate ISO/IEC 9126 standard and internal quality metrics in a Quality Model. A definition and documentation framework for this Quality Model has been established with a compendium of software quality standards and metrics (cf. www.arisa.se/compendium/).

The VizzAnalyzer/Vizz3D framework is instantiated such that the tool already implements the information extraction, the metric analyses and visualizations needed for the project (cf. www.arisa.se/index_tools.html).

Participating companies need to be aware of the standards we address, the notions, and the tools in order to speak the same language in the project. This helps to speed up collaboration in general and to establish realistic expectations of the project on all sides. In addition to the project-planning meeting, we have conducted a one-day workshop for developers, internal consultants, and quality managers from partner companies (cf. www.arisa.se/Workshop-08-05/). The workshop included an introduction to the industrial quality standards involved, measurement theory, well-established metrics, the compendium, our approach to assessing quality, and a VizzAnalyzer/Vizz3D tool demo.

2.4. Project Implementation

In Work Package 1, we adjust and formalize the Quality Model of ISO/IEC 9126:

- The characteristics and related sub-characteristics of the standard remain. However, we concentrate on the following characteristics: (re-)usability, efficiency, maintainability, and portability, and their sub-characteristics excluding compliance. The excluded characteristics functionality and reliability appear to be only indirectly assessable by means of general static software metrics (if at all). The sub-characteristic compliance needs application-domain-specific information and is, hence, not suitable for a general approach either.

- We map the remaining sub-characteristics to well-known static metrics such as McCabe Cyclomatic Complexity, Weighted Method Count, Lines of Code etc. For each metric, we have a hypothesis (a) whether it is highly or only moderately appropriate for assessing the related sub-characteristics, and (b) whether it is proportionally or inversely proportionally related (cf. www.arisa.se/compendium/ for examples).

- To assess sub-characteristics numerically, we normalize the metrics to values between 0 and 1 and multiply them by the weight 1 or ½, according to classification (a), and by an additional weight 1 or -1, according to classification (b). Finally, the quantitative value of a sub-characteristic is defined as the weighted sum over the related metric values. This defines our initial Quality Model, i.e. the assumption on the quantitative connection between metrics and sub-characteristics. It may be adapted during the project, cf. Work Package 6, by changing the metrics and/or the weights.
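
The weighting scheme of Work Package 1 can be illustrated with a minimal Python sketch. It is not the project's implementation: the class MetricMapping, the function names, the min-max normalization, and the example mapping and values are all assumptions made for illustration. The text only prescribes normalization to values between 0 and 1, the weights 1 or ½ for classification (a), and the weights 1 or -1 for classification (b).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricMapping:
    """Hypothetical record tying one model metric to a sub-characteristic."""
    name: str
    highly_appropriate: bool  # classification (a): weight 1 if True, else 1/2
    proportional: bool        # classification (b): weight 1 if True, else -1

    @property
    def weight(self) -> float:
        relevance = 1.0 if self.highly_appropriate else 0.5
        direction = 1.0 if self.proportional else -1.0
        return relevance * direction


def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumption of this sketch."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def sub_characteristic_values(raw, mappings):
    """Weighted sum of normalized metric values, one result per sub-system.

    raw maps a metric name to its raw values, listed in the same
    sub-system order for every metric.
    """
    normalized = {m.name: min_max_normalize(raw[m.name]) for m in mappings}
    count = len(raw[mappings[0].name])
    return [
        sum(m.weight * normalized[m.name][i] for m in mappings)
        for i in range(count)
    ]


# Made-up values for three sub-systems; the mapping below is illustrative only.
mappings = [
    MetricMapping("cyclomatic_complexity", highly_appropriate=True, proportional=False),
    MetricMapping("lines_of_code", highly_appropriate=False, proportional=False),
]
raw = {
    "cyclomatic_complexity": [2, 10, 6],
    "lines_of_code": [100, 500, 300],
}
print(sub_characteristic_values(raw, mappings))  # [0.0, -1.5, -0.75]
```

With inversely related metrics only, the values are non-positive; what matters for the later classification is the relative order of the sub-systems, not the absolute value.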

In Work Package 2, we define and document the requirements on development processes and quality management in the participating companies needed to perform the validation of the redefined Quality Model. This concerns infrastructure support and additional process data that are needed to validate the conclusions from quality measurements. For instance:

- The sub-characteristic testability as assessed by metrics is validated via the actual coverage of test cases, which requires a unit test infrastructure in the industrial project.

- The sub-characteristic changeability as assessed by metrics is validated via evaluating the actual changes, which requires a version control infrastructure on the one hand and a certain discipline in commenting changes on the other hand.

- The characteristic maintainability as assessed by metrics is validated via comparing change requests due to bug reports with actual changes and new bug reports, which requires bug databases, version control infrastructure, and process data connecting bug reports to changes in code.

These relations between metrics and validating feedback process data are formally defined in hypotheses to be tested in the actual experiments. This phase also includes the design of feedback forms and setting up an appropriate content management system.

In Work Package 3, we carefully document development processes and quality management in the participating companies, select suitable projects and discuss additional documentation requirements for the validation.

In Work Package 4, we install a Quality Monitor for the selected project. In practice, this means integrating the VizzAnalyzer software measurement tool in the build process of the software development project. Remember, the VizzAnalyzer implements the metrics in question already. Hence this is more a task of installing than programming. Moreover, a Web questionnaire and a content management system are installed for storing the metrics information and additional process data. Finally, a Web project management system is installed for reporting times spent on the project.

In Work Package 5, along with the software development projects, we follow up automated quality measurement as well as quality and process feedback for their validation. Data is reported continuously. This work package is the actual controlled experiment.

In Work Package 6, we evaluate the results statistically. Intermediate results are evaluated continuously. The basis for the statistical analysis is hypothesis testing; the hypotheses have been established in Work Package 2 and the required measurements are collected in Work Package 5. If the null hypothesis can be rejected, conclusions can be drawn. Otherwise, we need to change the Quality Model, e.g. add/remove/adapt metrics and change their quantitative connection to the sub-characteristics, and adjust the hypotheses.

Finally, in Work Package 7, we draw conclusions. Provided validation succeeds, we define and implement a consulting service for installing Quality Monitors in software development projects. In case validation does not succeed, we have reasons for assuming that internal software quality cannot be assessed uniformly with the same static metrics (or not at all with static metrics). This would, however, be a valuable research result too, since it falsifies common assumptions on software metrics in the research community.

3. Summary of the Group’s Previous Research

The Applied Research in System Analysis (ARiSA, www.arisa.se) group at Växjö University has contributed to the state of the art in the fields of software analysis and visualization for software engineering purposes. Moreover, we provide flexible software tools, VizzAnalyzer and Vizz3D, making this state of the art accessible to other researchers and practitioners in software engineering.

We have combined behavioral and structural analyses and showed how these different aspects of software can be visualized [1]. We applied this idea to develop a first analysis and visualization framework for special classes of programs [2]. The synergy between different kinds of analyses and visualizations is a general research theme in our group [3]. Successful applications of this idea have been identified in architecture recovery [4,5,6] and pattern recognition [7,8,12]. Theory and methods of software analysis and visualization constitute the theoretical basis of the proposed project.

In addition to the theory and methods, we are interested in their applicability in practice. Generator technology simplifies the development of analyses [9]. A suitable architecture allows the reuse of existing analysis and visualization components [11,13,18]. Our frameworks implementing such an architecture, VizzAnalyzer [10,14] and Vizz3D [16], prove practicability and put us in a position to integrate quality measurement in industrial production processes as proposed in the project.

In a controlled experiment with students, we have validated improvements in efficiency and quality in software analysis due to the VizzAnalyzer framework [17]. As a side effect, this work provides practical experiences in controlled experiments with humans involved and the knowledge of the statistics for profound evaluations. This reduces the learning efforts required for the proposed project.

Meanwhile our frameworks have been successfully applied in a number of architectural analyses and quality assessment scenarios [15,17,18]. The proposed project takes this work one step further and systematically relates metrics to standards for quality assessment.

Finally, we have defined a compendium for software quality standards and metrics [19]. This is the core of an on-line collection of community knowledge mapping metrics proposed in the research community to industrial quality standards and vice versa. It helps in defining the hypotheses to validate as well as in disseminating the validation results of the project.

4. Scientific Approach and State of The Art

4.1. Scientific Approach

The project goal is a Quality Model allowing for automated, metrics-based quality assessment with validated significance.

We construct a Quality Model with required properties, cf. Work Packages 1 – 4.

For validating its significance, we apply the Goal-Question-Metric (GQM) approach [20]. GQM suggests defining the experiment goals, specifying questions on how to achieve the goals, and collecting a set of metrics that answer the questions in a quantitative way. To avoid confusion, we distinguish model metrics and validation metrics. The former are static metrics in the new Quality Model mapped to sub-characteristics. The latter are metrics assessing the (sub-)characteristics directly, but with much higher effort, i.e. with dynamic analyses or human involvement, or a posteriori, i.e. by looking backward in the project history. For instance:

- The validation metric of the sub-characteristic testability is the actual coverage of test cases, which requires dynamic analyses.

- The validation metric of the characteristic maintainability compares change requests due to bug reports, bug-fixing changes, and new bug reports, which again requires backward analyses and human annotations.

The Goal is to validate the significance of our Quality Model based on the model metrics. Questions and sub-questions are derived from the ISO/IEC 9126 directly:

Q1: Can one significantly assess re-usability with the model metrics proposed in the Quality Model?

Q1.1 – Q1.4: Can model metrics significantly assess understandability, learnability, operability, and attractiveness, respectively, in a reuse context?

Q2: Can one significantly assess efficiency with the model metrics proposed in the Quality Model?

Q2.1 – Q2.2: Can model metrics significantly assess time behavior and resource utilization?

Q3: Can one significantly assess maintainability with the model metrics proposed?

Q3.1 – Q3.4: Can model metrics significantly assess analyzability, changeability, stability, and testability?

Q4: Can one significantly assess portability with the model metrics proposed?

Q4.1 – Q4.2: Can model metrics significantly assess adaptability, and replaceability?

For answering each sub-question, we need both a number of model metrics (as defined in our Quality Model) and validating metrics (as defined in our experimental setup, cf. examples above).

The basis for the statistical analysis of an experiment is hypothesis testing. A hypothesis is defined formally. The data collected during the course of the experiment is used, if possible, to reject the hypothesis. If the hypothesis can be rejected, conclusions can be drawn. A null hypothesis H0 states that correlations of model and validation metrics are only coincidental. This hypothesis must be rejected with as high a significance as possible. For all our analyses, we start with the conventional significance level of 0.05, i.e. we reject H0 only for p < 0.05, accepting at most a 5% probability of rejecting H0 although the observed correlation is coincidental. The alternative hypothesis H1 is the one that we can assume in case H0 is rejected.

To define the hypothesis, we classify the measured values as high, average, and low. For this classification, we use a self-reference in the software systems under development: systems are naturally divided into sub-systems, e.g. packages, modules, classes etc. For each (sub-)characteristic c and each sub-system s:

1. We perform corresponding measurements of model and validation metrics.

2. The weighted sum as defined in the Quality Model determines aggregated values VM(c,s) from values measured with model metrics. We abstract even further from these values and classify them with abstract values AM(c,s), defined as follows:

- AM(c,s) = high iff VM(c,s) is among the 25% highest values of all sub-systems,

- AM(c,s) = low iff VM(c,s) is among the 25% lowest values of all sub-systems, and

- AM(c,s) = average otherwise.

3. The validation metrics provide values VV(c,s) for direct assessment of (sub-)characteristics.

4. Our statistical evaluation studies the effect of changes in AM(c,s) (independent variables) on changes in VV(c,s) (dependent variables). The hypotheses H0 and H1 have the following form:

H0: There is no correlation between AM(c,s) and VV(c,s).

H1: There is a correlation between AM(c,s) and VV(c,s).
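
The classification in step 2 can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's tooling: the function name classify and the sub-system names are hypothetical, and the text does not say how to round 25% for small numbers of sub-systems or how to break ties. The sketch takes the floor of n/4 sub-systems at each end and breaks ties by sort order.

```python
def classify(vm):
    """Map each sub-system to AM(c, s) in {'high', 'average', 'low'}.

    vm maps a sub-system name to its aggregated value VM(c, s).
    Assumption: floor(n / 4) sub-systems form each extreme class.
    """
    ranked = sorted(vm, key=vm.get)          # ascending by VM(c, s)
    k = len(ranked) // 4
    labels = {s: "average" for s in ranked}
    for s in ranked[:k]:                     # the 25% lowest values
        labels[s] = "low"
    for s in ranked[len(ranked) - k:]:       # the 25% highest values
        labels[s] = "high"
    return labels


# Made-up VM(c, s) values for eight hypothetical packages.
vm = {
    "pkg.a": 0.1, "pkg.b": 0.9, "pkg.c": 0.5, "pkg.d": 0.3,
    "pkg.e": -0.2, "pkg.f": 0.7, "pkg.g": 0.4, "pkg.h": 0.6,
}
am = classify(vm)
print(am["pkg.e"], am["pkg.c"], am["pkg.b"])  # low average high
```

Because the classes are defined relative to the other sub-systems of the same system, AM(c,s) needs no externally calibrated thresholds.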

In order to find out which dependent variables were affected by changes in the independent variables, we may use, e.g., the Univariate General Linear Model [21], as part of the SPSS system [22], provided the obtained data satisfies the preconditions of the test.
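
To make the test of H0 concrete, the following Python sketch computes a one-way ANOVA F statistic, which corresponds to a univariate general linear model with AM(c,s) as the single categorical factor and VV(c,s) as the dependent variable. It does not reproduce the SPSS procedure: as a swapped-in technique, the p-value is estimated by a permutation test so that the example needs only the standard library. Function names and data are made up for illustration.

```python
import random
from statistics import fmean


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of VV(c, s) values."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(labels, values, rounds=2000, seed=0):
    """Estimate p under H0 by repeatedly shuffling the AM(c, s) labels."""
    def grouped(lbls):
        by_label = {}
        for lbl, v in zip(lbls, values):
            by_label.setdefault(lbl, []).append(v)
        return list(by_label.values())

    observed = f_statistic(grouped(labels))
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(shuffled)
        if f_statistic(grouped(shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (rounds + 1)


# Made-up data: AM class and measured test coverage for 16 sub-systems.
labels = ["low"] * 4 + ["average"] * 8 + ["high"] * 4
values = [0.20, 0.25, 0.30, 0.22,
          0.50, 0.55, 0.48, 0.60, 0.52, 0.58, 0.47, 0.53,
          0.85, 0.90, 0.88, 0.80]
p = permutation_p_value(labels, values)
print("reject H0" if p < 0.05 else "cannot reject H0")
```

The permutation test makes no normality assumption, which is convenient for a sketch. A parametric model such as the one named above additionally requires checking its preconditions.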

4.2. State of the Art

On the one hand, Quality Models have existed for quite a while. The Factor-Criteria-Metric model [23] was already developed in 1977. A further development, the aforementioned Goal-Question-Metric (GQM) approach [20], is not much younger. However, implementations of these models for software quality assessment, e.g. ISO/IEC 9126 from 1991, fail to integrate easily assessable metrics.

On the other hand, there exists a rich body of metrics and tools allowing automated measurement. A good overview of metrics is given in the handbook of the FAMOOS EU-ESPRIT project [24]. The significance of individual metrics is mainly supported by intuition or case studies; few are validated in experiments. A mapping to a generally accepted Quality Model is lacking. However, the Guide to the Software Engineering Body of Knowledge emphasizes measurement as a foundation for software engineering, especially software engineering management such as quality management [25].

With the success and size of open source projects [26] and the resulting availability of databases of source code, design documents, and process data to the research community, multi-version program analysis has emerged as a relatively new discipline; overviews are given in [27,28]. These analyses allow investigating the quality of development projects over time, e.g. [29], as opposed to the once-in-time analyses of Quality Models and metrics – an approach that we will exploit as well.

Still, experimentation is underrepresented in Computer Science: a survey of over 400 papers showed that 40% of the papers lacked experimentation although they needed it [30]; a similar survey of 612 such papers confirms this for 30% of the papers [31]. Quantitative/statistical evaluation of experiments in software engineering was proposed for the first time in 1986 [32]. A good overview of theories, methods and tools is given in [33].

5. Time Schedule

Work packages WP 1 – 7 are defined in Section 2.4. The total effort of the activities is 36 Person Months (PM) for the ARiSA group at MSI and 3 – 4 PM per major participating company:

WP 1: Responsible: ARiSA, 3 PM (ARiSA).
Milestone: Quality Model is defined and published in the compendium.
Start – Deadline: 1/1/06 – 31/3/06.

WP 2: Responsible: ARiSA, 1 PM (ARiSA) + ¼ PM per company.
Milestone: Specification of software development project requirements and additional deliverables of the software development projects. Setup of Web content management system and Web questionnaire.
Start – Deadline: 1/4/06 – 30/4/06.

WP 3: Responsible: participating companies, 4 PM (ARiSA) + ¼ PM per company.
Milestone: One software development project meeting the requirements is selected; developers are instructed to document the process according to deliverables defined in WP 2.
Start – Deadline: 1/5/06 – 31/8/06.

WP 4: Responsible: ARiSA, 4 PM (ARiSA) + 1 PM per company.
Milestone: Quality Monitor is installed and running in the selected software development projects, measurements can be documented electronically.
Start – Deadline: 1/9/06 – 31/12/06.

WP 5: Responsible: participating companies, 9 PM (ARiSA) + 1 – 2 PM per company (accumulated). Continually, measurements are documented, propagated to ARiSA and discussed in meetings (one per quarter and project). Meeting protocols and conclusion are prepared and propagated back by ARiSA.
Start – Deadline: 1/1/07 – 31/8/08 (first measurements 31/3/07, final measurements 31/8/08).

WP 6: Responsible: ARiSA, accumulated 12 PM (ARiSA).
Confirmation/rejection of the hypothesis; adaptation of Quality Model and hypothesis, if necessary; feedback to companies.
Start – Deadline: 1/4/07 – 30/9/08 (first feedback 30/4/07, final feedback 30/9/08).

WP 7: Responsible: ARiSA, 3 PM (ARiSA) + ½ PM per company.
Milestone: Results are reported in the compendium and in a second software quality workshop. If applicable, a Quality Monitor consulting service is defined.
Start – Deadline: 1/10/08 – 31/12/08.

Measurements are continuously documented and reported (WP 5), and feedback is returned (WP 6). Hence, these packages overlap. All data is made available instantly to the partners via a project Web page. For this purpose, we use Artisan’s Web questionnaire, content management and publishing systems. ARiSA is responsible for installation and dissemination.

Obviously, there are many small, fine-grained but similar individual tasks in WP 5 and 6. We use Artisan’s Web-based project management system for reporting and keeping track of these tasks and accumulated project times.

Publications at conferences, e.g. the International Conference on Software Maintenance and the International Conference on Software Engineering, a journal publication, e.g. in IEEE Trans. Software Engineering or IEEE Software, and updates on the compendium site are activities parallel to the Work Packages. Altogether, our research will lead to a PhD thesis. ARiSA is responsible for publications.

6. Bibliography

Members and students of the Software Technology Group are marked in bold face.

1. W. Löwe, A. Ludwig, and A. Schwind: Understanding Large Software Systems – Static and Dynamic Aspects. In: 17th International Conference on Advanced Science and Technology, Chicago, 2001.

2. W. Löwe and A. Liebrich: VizzScheduler: A Framework for the Visualization of Scheduling Algorithms. In: Europar '01: Parallel Processing, LNCS 2150, pp. 62 ff, 2001.

3. W. Löwe, M. Ericsson, J. Lundberg and Th. Panas. Software Comprehension - Integrating Program Analysis and Software Visualization. In: Michael Mattsson (ed.), Second Conference on Software Engineering Research and Practise in Sweden, pp. 9-20. Blekinge Institute of Technology, Oct. 2002.

4. W. Löwe and J. Lundberg: A Low-Level Analysis Library for Architecture Recovery. In: SC'03 Workshop on Software Composition, in conjunction with the ETAPS, 2003.

5. J. Lundberg and W. Löwe: Architecture Recovery by Semi-Automatic Component Identification. In: SC'03 Workshop on Software Composition, in conjunction with the ETAPS, 2003.

6. Th. Panas, W. Löwe and U. Assmann. Towards the Unified Recovery Architecture for Reverse Engineering. In: Int. Conf. Software Engineering Research and Practice (SERP), Las Vegas, June 2003.

7. D. Heuzeroth, Th. Holl, and W. Löwe: Combining Static and Dynamic Analyses to Detect Interaction Patterns. In: IDPT'02 - 6th World Conference on Integrated Design and Process Technology, 2002.

8. D. Heuzeroth, Th. Holl, G. Högström, W. Löwe: Automatic Design Pattern Detection, In: 11th Int. Workshop on Program Comprehension, co-located with 25th IEEE ICSE, Portland, May 2003.

9. D. Heuzeroth, Welf Löwe, S. Mandel: Generating Design Pattern Detectors from Pattern Specifications. In: 18th Int. IEEE Conf. On Automated Software Engineering (ASE), Oct. 2003.

10. W. Löwe, M. Ericsson, J. Lundberg, Th. Panas and N. Pettersson. VizzAnalyzer - A Software Comprehension Framework. In: 3rd Conference on Software Engineering Research and Practise in Sweden, pp. 127-136. Lund University, Sweden, Oct. 2003.

11. Th. Panas. Towards a Generic Architecture for Reverse Engineering. Licentiate Thesis, Växjö, Sweden, Nov. 2003.

12. D. Heuzeroth and W. Löwe. Generating Structure and Behavior Visualizations of Systems to Understand their Architecture. Annals of Software Engineering, Special Volume on Software visualization. Published in Chapter 9 of: Software-Visualization - From Theory to Practice, Kluwer, 2003.

13. Th. Panas, J. Lundberg, and W. Löwe. Reuse in Reverse Engineering. In: 12th International Workshop on Reverse Engineering. Bari, Italy, June 2004.

14. W. Löwe. VizzAnalyzer - A Reverse Engineering Framework. In workshop proceedings: 11th IEEE Working Conf. on Reverse Engineering (WCRE'04), Delft, The Netherlands, Nov. 2004.

15. Th. Panas, R. Lincke, J. Lundberg, and W. Löwe. A Qualitative Evaluation of a Software Development and Re-Engineering Project, In: 29th IEEE/NASA Software Engineering Workshop (SEW-29), Greenbelt, USA, April 2005.

16. Th. Panas, R. Lincke, and W. Löwe. Online-Configuration of Software Visualizations with Vizz3D, In: ACM SoftVis, St.Louis, USA, May 2005.

17. Th. Panas and M. Staron. Evaluation of a Framework for Reverse Engineering Tool Construction. In: IEEE International Conference on Software Maintenance (ICSM 2005), Budapest, Hungary, Sept 25-30, 2005.

18. W. Löwe and Th. Panas. Rapid Construction of Software Comprehension Tools, International Journal of Software Engineering and Knowledge Engineering. Special Issue on Maturing the Practice of Software Artifacts Comprehension (accepted for publication), 2005.

19. R. Lincke and W. Löwe. Compendium of Software Quality Standards and Metrics. www.arisa.se/compendium/, 2005.

20. V.R. Basili and H.D. Rombach. The TAME project: Towards improvement-oriented software environments. IEEE Trans. Software Engineering 14, 6 (June), 758–773, 1988.

21. R.E. Walpole. Probability and Statistics for Engineers and Scientists. Prentice Hall, NJ, 2002.

22. SPSS. http://www.spss.com, 2005.

23. J.A. McCall, P.K. Richards, and G.F. Walters. Factors in Software Quality, Tech. Rep. NTIS AD/A-049 014,015,055, US Rome Air Development Center, 1977.

24. H. Bär, M. Bauer, O. Ciupke, S. Demeyer, St. Ducasse, M. Lanza, R. Marinescu, R. Nebbe, O. Nierstrasz, M. Przybilski, T. Richner, M. Rieger, C. Riva, A. Sassen, B. Schulz, P. Steyaert, S. Tichelaar, J. Weisbrod. The FAMOOS Object-Oriented Reengineering Handbook, http://www.iam.unibe.ch/~famoos/handbook/, October 15, 1999.

25. Guide to the Software Engineering Body of Knowledge, SWEBOK. A project of the IEEE Computer Society Professional Practices Committee, www.swebok.org, 2005.

26. E.S. Raymond. The Cathedral & the Bazaar. O'Reilly, 1999, available at http://www.catb.org/~esr/writings/cathedral-bazaar.

27. International Workshop on Mining Software Repositories (MSR), Co-located with ICSE 2004 and 2005.

28. Dagstuhl Workshop on Multi-Version Program Analysis, Seminar Nº 05261, May 2005.

29. Audris Mockus, Roy T. Fielding, and James Herbsleb. Two case studies of open source software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology, 11(3):1–38, July 2002.

30. W.F. Tichy, P. Lukowicz, L. Prechelt, E.A. Heinz. Experimental Evaluation in Computer Science: A Quantitative Study, Journal of Systems and Software, 28(1), 1995.

31. M.V. Zelkowitz and D. Wallace. Experimental Validation in Software Engineering. Information and Software Technology 39(11), Nov 1997.

32. V.R. Basili, R.W. Selby, and D.H. Hutchens. Qualitative Evaluation of Software Engineering, In: Proc. 1st Pan Pacific Computer Conference, Melbourne, Australia, 1986.

33. C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, A. Wesslén. Experimentation in Software Engineering - An Introduction, Kluwer Academic Publishers, 2000.