Method for Evaluation of Quality Properties in SaaS Rejuvenation using Markov Model

DOI : 10.17577/IJERTV3IS041922

Download Full-Text PDF Cite this Publication

Text Only Version

Method for Evaluation of Quality Properties in SaaS Rejuvenation using Markov Model

Huynh Quyet-Thang, Nguyen Ngoc-Dung, Nguyen Hung-Cuong

School of Information and CommunicationTechnology Hanoi University of Science and Technology

Hanoi, Vietnam

Abstract – Software fault-tolerance techniques have been widely used in computing systems to achieve high level of quality. Rejuvenation, a modern software fault-tolerance technique, has attracted a large number of researchers in software engineering area. Evaluating the effectiveness and feasibility of this technique becomes extremely important in selecting, comparing and applying it in actual software systems. The study of important-quality-attributes is the scientific basis for assessing the performance of software fault-tolerance techniques. This paper presents availability, reliability, safety evaluation of rejuvenation systems. Derived mathematical relations between failure probabilities and modeling parameters enable us to gain a great deal of quantitative results.

Keywordssoftware reliability, software availability, software safety, Markov chain

  1. INTRODUCTION

    Nowadays, computer science has more and more application in human life, from economic to society, from education to medicine. So there is a requirement that developer has to build a high quality system to support user work. Most regular properties of software quality are reliability, availabilityand safety, that are studied in many fields: in component based by Larsson [1], consider maintenance and security issues by Xiong [2], base on properties and architecture of system by Roshandel [3], etc The reliability relates with the correctness of result of work, whereas the availability ensures that system is ready to serve and the safety minimize the probability that a serious accident occurs in running time. There are two main approaches to analysis those properties of fault-tolerant software: practical testing and modeling. Results of practical testing are more believable than those from modeling, which is more well-known. However, testing can only establish the presence of errors but cannot assure their absence. Also, for highly dependable systems, the testing method is not always feasible and tends to be expensive to implement and then, to obtain statistically significant results.

    Markov chain (more specific: discrete-time Markov chain) is a stochastic math system, containing a set of finite (or countable) number of possible states and a set of transitions between two of them. Given the past and present circumstance of Markov chain system, future behavior only depends on the present state and not on the past one. This model has large number of applications in natural science.

    There are several techniques to evaluating quality properties of computer system with different approaches [6,7,8]. Thus, based on advantages of Markov chain model, this study introduces a model and applies it to evaluate the quality attributes of rejuvenation-software systems.

    This paper is organized as follow: after this introduction section, section 2 explains definition of three aspects of software quality: reliability, availability and safety. Next, section 3 introduces rejuvenation – a software fault-tolerance technique. Section 4 proposes a method for evaluation those quality properties in rejuvenation systems using Markov model. Then section 5 shows experimental result of proposed method in simulation experiments. In this section we present also the experiment results in real system BKOJ software, run as SaaS in the BKCLoud system. Section 6 discusses some related and future problem to extend current work.

  2. BASIC ASPECTS OF SOFTWARE SYSTEM QUALITY

    Larsson [1] introduce dependability is the main qualityattribute of safety-critical system development. Althoughsoftware quality has complex meaning and depends on manyproblems, there are three aspects that are discussed following.

    1. Reliability

      Definition of reliability is based on probability that a system will fail in a specific period of time in given context and can be reflected by mean time to failure () equation:

      1

      Since the concept of fault-tolerant software was presented so far, many techniques has been proposed and applied

      Reliability() =

      () (1)

      successfully in practice. Rejuvenation (preventive maintenance – PM) is a new software fault tolerance technique, which Y.Huang was proposed in 1995 [2] and now it has attracted the interest and the research of many scientists [2], [3], [4], [5]. Assessing effectiveness and feasibility of this technology becomes extremely important in choosing, comparing andapplying it to practical software systems.

      While A is a module and () is a probability that this

      module fails per time unit. Being an important property of system, reliability is focused widely by researchers: Roshandel [3], Hoang P. [4], etc… It often used as an indicator for software release policy and can be got by using practical analysis of math models. Roshandel [3] introduce technique to calculate system reliability from this property of element. Hoang P. [4] summaries some statistical models and focuses on NHPP models.

    2. Availability

      Although problems of availability is larger than reliability,Xiong [2] notes that availability is a probability that systemisready-for-work in given time and given environment. Relate with reliability attribute, Larsson [1] introduce formalcalculation:

      ()

      Availability() = (2)

      () + ()

      While is mean time to repair. This attribute of systemhas high commercial contribution: users will satisfy if theycan use product service at every time. The dierence between reliability and availability is that availability dependson the dynamic state of the system.

    3. Safety

      environment of system resources, etc. Process aging will aect the performance of the application and eventually cause theapplication to fail [2].

      If an application is developed in a perfect developmentenvironment and it operates correctly in the scenario work,the implementation process associated with this applicationwill not be aging. However, practical software systems rarelyare perfect. Therefore, their processes will be aging in theoperating environment. The process aging and the softwareaging are fairly dierent. Software aging is related to sourceprogram, which will be inappropriate when requirements andmaintenance are changing after many years. On the contrary, process aging is related to the decrease of applicationfunctions after several working days or weeks.

      Larsson [1] consider software safety as an attribute thatrelates with the interaction between the system and the environment. It is a full-system property, either a component or an assembly property. Safety depends on where and howthe system is deployed, in other way is dependent on the environment of system, so a top-down approach should be used in analyzing process. Safety of system is more important inthe safe-critical systems, which will cause heavy damage topeople or environment if they encounter a failure.

    4. General method to evaluate fault-tolerantsystems

    Authors K. S. Trivedi and Goseva-Popstojanova [5, 6]have

    Completion of repair

    Error

    System failure

    Operating

    Potential Error

    Completion of rejuvenation

    Preventive Maintenance

    Rejuvenation

    proposal to use Markov model in evaluating fault- tolerantsystem:

    Step 1. Markov model implementation

    In the firt step,the Markov state map is being developedby identifying the status of the system and the transition between states.

    Step 2. Building Chapman-Kolmogorov equations

    In the second step, the Markov state chart, whichhas been developed, is being converted to a collection of the Chapman- Kolmogorov equations to find the matrix of transition state probability of the system.

    Step 3. Solving Chapman-Kolmogorov equations

    Solving Chapman-Kolmogorov equations is relativelycomplex. Some current resolutions such as analytics analysis, Laplace-Stieltjes transform or use ODEs inMatlab can simplify this task.

    Step 4. Calculating and assessing the attributes of the fault- tolerant software

    With each specific system, the software attributes will be evaluated according to specific parameters.This is general mechanism when using Markov chain in modeling. Real application depends on properties of environment, context and meaning.

  3. SOFTWARE REJUVENATION

    When software applications execute continuously for longperiods of time (scientific and analytical applications runfor days or weeks, servers in client-server systems are expected to run forever), the processes corresponding to thesoftware in execution age or slowly degrade with respect toeective use of their system resources. The causes of process aging are: memory leaking, unreleased file locks, filedescriptor leaking, data corruption in the operating

    Figure1.Status model of Rejuvenation system

    Software preventive maintenance (software rejuvenation) is a concept related to periodically reboot the system and turn the application back to the initial clean status after each maintenance [2], [3]. Here, we have an overview figure describing four states of the system when rejuvenationtechnique is applied (Figure1).

  4. PROPOSED METHOD FOR EVALUATION OF RELIABILITY, AVAILABILITY AND SAFETY OF

    REJUVENATIONSYSTEMS

    1. System Status Implementation

      Used symbols are showed in table 1 as followed:

      Table 1.MEANING OF SYMBOLS

      Symbol

      Meaning

      Probability when changing from state (available) to state

      (recovery)

      Probability when changing from state

      (available) to state (rejuvenation)

      ()

      Probability when having transactions inqueue at the time

      Estimated time to complete the process ofrecovering from

      errors

      Estimated time to complete the process ofpreventive

      maintenance

      Time when system is in state

      Speed of transactions to system

      (. )

      Speed for serving

      (. )

      Failure rate

      ()

      Mean processing time since the system rejuvenated last

      Supposed that software system is built following server- client model with a queue containing finite number of requests. System exists only errors, which seriously aectingthe functionality of total system and we are not

      interestedin the other errors, which are considered as they occur andare repaired immediately, do not decrease the reliability ofthe system. When the system encounters serious error, allrequests will be canceled and the system will become unsafe(state ), then evoke the error-recovery process.

      1 2 3 k-1 k

      A

      Error

      B C

      Preventive

      Recovering

      Operating

      Maintaining

      Figure2.Status and behavior of rejuvenation system

      We consider two dierent policies, which determine the timeto perform preventive maintenance:

      • Policy I. Purely time based: Preventive maintenance is initiated after a constant time has elapsed since it was started (or restarted).

      1 2 3 k-1 k

      Figure3.Markov process with policy I

    2. Policy I

      0 = + (11)

      • Policy II. Instantaneous load and time based: The actualpreventive maintenance interval is determined by

      =

      +

      0

      the sumof preventive maintenance wait and the time it

      +1

      1

      (12)

      takes forthe queue to get empty from the point onward.

      Let be the steady state probability that software is instate

      + +

      1 i k

      ( , , ). From the well know relation = ,we

      =

      +

      (13)

      have:

      1 1 1

      (3)

      1

      =

      = [ , , ] = [ ,

      2 2

      , 2

      ]

      (14)

      Let be a random variable denoting the sojourn time instate

      with its expectation[]. The steady state availability can be given as:

      = Pr{System is in state }

      1 i k

      For = (.) and = where L(t) is defined by:

      []

      =

      (4)

      =

      (15)

      [] + +

      =0

      Substituting the values of , , :

      []

      The set of ODEs is first augmented by the following

      differential equation:

      =

      +

      (5)

      + []

      =

      (16)

      The steady state safety can then be obtained as followed:

      = 1 Pr{System is in state } (6)

      And:

      = 1

      [] +

      +

      (7) 0

      Substituting the values of , , :

      = 1

      [] + +

      (8)

      In policy I, system is surveyed in the period (0, ] , so averagereliability can be obtained as:

      [ ()]

      1 2 3 k-1 k

      = 0

      (9)

      In policy II, system is surveyed in the period (0, ) , so average reliability can be obtained as:

      [ ()]

      = lim 0

      (10)

      1 2 3 k-1 k

      Figure4.Markov process with policy II

      The initial conditions: 0 0 = 1, 0 = 0 for1 i L

      and 0 = 0 for0 i k. Then

      = (26)

      = 151 ; = 51 . Under both polices, itcan be

      And

      =

      =0

      (17)

      seen that the higher the value of , the lower isthe availability for any particular value of .

      = 1 (18)

      The expected sojourn time in state is given by:

      = (19)

    3. Policy II

    =0

    =0

    Av

    = 0.15

    = 0.35

    = 0.55

    = 0.85

    • maximum

    ail

    In this case, we need to distinguish between and > , as policy II assumes that preventivemaintenance will be initiated if and only if the buer isempty after time has elapsed. Similar to policy I, onstep transition probability is computed by solvingthe system ofODEs at = and is

    abi

    lit

    y

    given as:

    Then

    =

    =0

    (20)

    Figure 5.Availability under policy I

    = 1 = 0 (21) The mean sojourn time in state A is now given by:

    = +

    (22)

    Av

    = 0.15

    = 0.35

    = 0.55

    = 0.85

    • maximum

    ail

    =0

    =0

    =

    =1

    abi

    = 0 +

    (23)

    lit

    y

    =0

    =0

    =1

  5. RELIABILITY, AVAILABILITY ND SAFETY

    EVALUATIONBY SIMULATION EXPREMENTATION

    The models are solved for multiple values of and optimum value is determined. Using programming solution tool inMatlab, we can estimate Chapman-Kolmogrov equations,

    thereby simulating the variability of Ass, Ploss and the upper bound of response time Tres with system parameters.

    Model parameters: = 0.85(h); = 6.0(p);

    = 50; = 240(h)Where h = hours.

    1. Simulation experiment I

      In this experiment, is varied to ascertain the eect on the measures and on optimal . Service rate and failure rate are assumed to be functions of real time, i.e., =

      and = , where () = 1 , which is the hazard function of Weilbull distribution. isfixed at 1.5 and

      is calculated from and the as:

      Figure 6.Availability under policy II

      Figure 7and Figure 8show that safety will decrease when increasing the value of parameter , while safety increases when raisingthe value of parameter . Since then, we can commentthat under the policy I, the sooner the preventive maintenance will be conducted, the safer the system will be.

      = 0.55

      Safe

      1

      ty

      1 +

      =

      (24)

      And() is defined as:

      1

      Where

      = 1 if

      if >

      (25)

      Figure 7.Safety under policy I where = 0.35

      = 0.55

      Sa

      fet

      y

      = 0.15 = 0.35

      = 0.55 = 0.85

      Re lia bil ity

      Figure 8.Safety under policy I where varies

      The safety of system under policy II will increase with thedecrease of (Figure 9). However the dependency is relatively small. In addition, the safety will reduce rapidlyalong with the increase of to a threshold (marked onthe drawings) and then will be almost unchanged. Fromtheoretical calculation, we can see the average reliability of the system does not depend on . Therefore, we fixthe value = 0.55and survey the influence of the reliability on the time to wait to perform the preventivemaintenance . The average reliability of system underpolicy I rises with the decrease of parameter (Figure 10).Clearly, under policies I, the sooner the preventive maintenance is conducted, the higher the level of reliability ofsystem is kept. Under policy II, the reliability is calculated throughout the time domain. It can see that thereliability will increase along with the increase of to athreshold and then be kept stable.

    2. Simulation experiment II.

      Re lia bil ity

      Av ail abi

      = 0.55

      Figure 10.Reliability under policy I and II where = 0.35

      In this experiment, is xed at 0.15; is an assigned value

      of 1.0,1.5 and 2.0, respectively.

      lit

      Policy I

      Policy II

      = 1.0

      = 1.5

      = 2.0

      = 1.0

      = 1.5

      = 2.0

      = 0.15

      = 0.35

      = 0.55

      = 0.85

      Sa

      fet

      y

      Figure 9.Safety under policy II where varies

      For = 1,the time to failure has an exponential distribution, which,because of its no-memory property, contradicts aging. Itis better not to perform Rejuvenation in this case if theobjective is to maximize availability. For other two values of , however, rejuvenation maximizes availability atcertain

      . For a specic policy, the bigger the failure density, i.e., higher the value of , the higher is the maximumsteady state availability. Also, with higher values , thismaxima occurs at lower values of .

      Figure 11.Availability under two policies where varies

      Figure 12and Figure 13show that the higher the failure density is, the higherthe value of safety will be in the low value domain of.

      On the other hand, when increases, the ability in whichsystem in the state will decreases, so the reliability ofsystem will be improved. In addition, the average reliability of system under policy II will increase a threshold(marked on the drawings) along with the increase of ,and then stops and be kept relatively stable. Meanwhile, the value of parameter does not inuence the reliabilityof the system any great deal.

    3. Experimental results

      In this section we will show some experimental result of the proposed method in a real application such as Online Judge system.

      Sa

      fet

      y

      = 0.1

      = 0.15

      = 0.2

      Policy I

      1. Online Judge

        An online judge is an automated judge which checks a submitted solution for an existed problem and generates the output. It checks if the generated output was correct with respect to the output set that is saved as a full proof judge output set for that particular problem thus generate the result for the user such as Accepted, Wrong Answer, Runtime Error, etc.

        Sa fet y

        = 0.1

        = 0.15

        = 0.2

        Policy II

        An online judge is in general a server, which contains descriptions of problems from different contests, as well as data sets to judge whether a particular solution solves any of theseproblems. A user from anywhere in the world can register himself (or herself) with an online judge for free and solve as many problems as he likes. He can send as many solutions as he want till receiving satisfactory information, not only about the verdict, but also about the time that the code takes to run after improving the program and/or the algorithm used to solve the selected challenge. One of the main distinctive trait of the online judges is that they allow the users this self-competitive behavior to learn informatics, not only algorithms but also programming.

        There are several existing popular online judges all over the World Wide Web. Here mentioned some of them:UVA Online Judge, Sphere Online Judge, and BKOJ Online Judge.

      2. BKOJ Online Judge

        BKOJ is an online judge of Ha Noi uninversity of Science and Technology. It was built with the primary purpose of

        Figure 12.Safety under two policies where varies

        being used as a training tool for ACM /ICPC teams of the university.

        Re lia bil ity

        = 0.1

        = 0.15

        = 0.2

        Policy I

        Re lia bil ity

        = 0.1

        = 0.15

        = 0.2

        Policy II

        Figure 14.Block diagram of BKOJ

        BKOJ system consists of two major parts: Web (as Frontend) and Core (as Backend). The Web part plays as distributed information management system, managing information such as user registration, problem submission, solution submission, problem modification,etc.

        The Core part as kernel of BKOJ system provides a method to judge all solutions, which were submitted by any users. In this paper, we only focus on the functions of the Core part.

        BKOJ system has been installed in BKCloud platform as a software service running from 2012. Figure 15 shows its deployment in the platform.

        The Core part as kernel of the system provides a method

        Figure 13.Reliability under two policies where varies

        to judge all solutions, which were submitted by any users. In

        this disscussion, we only focus on the functions of the Core part.

        Figure 15. BKOJ deployment in the BKCloud system

      3. BKOJ – Core working process

        Figure 166.The judgment process of BKOJ – Core

        Figure 177.The probabilites of checking outputs

      4. The experimental results in detail

    In this expreriment, we only consider applying the Policy I to BKOJ. We createdavirtualcontestwiththe simultaneous participation of 500 virtual contestants during 8hours. The contestusingthe data,collectedfrommany private contest of our university HUST – from 2012 to 2013.

    Somebasicinfrmation about the contest is showninthe following table:

    Server Software

    Apache/2.2.14

    Document Path

    /JudgeOnline/status.php

    Concurrency Level

    500

    Time taken for tests

    32,239 seconds

    Complete requests

    1000

    Failed requests

    207

    (Connect: 0, Receive: 0, Length: 207,

    Exceptions: 0)

    Total transferred

    7262294 bytes

    HTML transferred

    6956196 bytes

    Requests per second

    31.02[#/sec](mean)

    Time per request

    16119.251[ms](mean)

    Time per request

    32.239[ms]

    (mean, across all concurrent requests)

    Connection Times(ms)

    min

    median

    max

    Connect

    1

    229

    3000

    Processing

    296

    2009

    32200

    Waiting

    0

    1002

    15725

    Basedon theinformation mentionedinthe above tablewedrew thecontinuos theoreticalcurvesof the Availability, the Safety and the Reliability of BKOJasin Figures 18,19 and

    20respectively. In contrast, the black discretepoints representthe actual valueofthe relating properties ofBKOJ. Model parameters: = 0.5 (h); = 0.5(h); = 6.0(p);

    = 30 ; = 240(h) ; = and = are the same as (24)(25)(26); is fixed at 1.0; is a assigned values of 96(h), 144(h), 192(h), 384(h) respectively (where h

    = hours)

    Figure 188. Availability of BKOJ under policy I

  6. CONCLUSIONS

    Applying the theory of mathematical Markov model, the theory of rejuvenation, we have built a model to evaluate the software attributes of rejuvenation systems. The proposed approach used two Markov chain models with corresponding policy I and II. We showed expanding math calculation of model of this method by using the Matlab. The experiments with BKOJ SaaS on BkCloud system are conrmed its worth. In the future, based on the relationship between these software attributes and fault tolerance techniques on cloud environment, the research will be further developed. From the evaluation of the software attributes of fault-tolerant software in cloud environments, we can deliver the construction cost in rejuvenation-applied software. In addition, we can use the obtained results in the evaluation of the software attributes of the rejuvenation systems to study about client-server systems

    with K queues (K> 1) and distributed fault-tolerant software.

    ACKNOWLEDGMENT

    This research is sponsored by the research Grant KC.01.01/10-15 by Ministry of Science and Technology Vietnam

    Figure 199. Safety of BKOJ under policy I

    Figure 20. Reliability of BKOJ under policy II

    The experimental resultsshow thatthe theoreticalcurvesfitquitewell withlimitedpracticalvalue, which confirmedthe practical valueofthe method for evaluating the quality properties in a rejuvenation system using Markov model.

  7. REFERENCES

  1. M. Larsson, "Predicting quality attributes in a component-based software systems". Malardalen University Press, 2004, ISBN: 91-88834-33-6; ISSN: 1651-4238.

  2. X. CHENGJIE, Availability and reliability analysis of computer software systems considering maintenance and security issues, PhD Thesis 2011. http://scholarbank.nus.edu.sg/handle/10635/25829

  3. R. Roshandel, Calculating architectural reliability via modeling and analysis, in Proceedings of the 26thInternational Conference on Software Engineering, pp. 6971, IEEE Computer Society, 2004.

  4. H. Pham, System software reliability. Springer, 2006.

  5. M. Grottke, L. Li, K. Vaidyanathan, and K. S. Trivedi,Analysis of software aging in a web server, Reliability, IEEE Transactions on, vol. 55, no. 3, pp. 411420, 2006.

  6. T. Dohi, K. Goseva-Popstojanova, K. Vaidyanathan, K. S. Trivedi, and S. Osaki, Software rejuvenation: modeling and applications, in Handbook of ReliabilityEngineering, pp. 245263, Springer, 2003.

  7. Michael R. Lyu. Software Reliability Engineering: A Roadmap. International Conference on Software Engineering 2007 Future of Software Engineering, pp. 153-170, ISBN:0-7695-2829-5

  8. Pham Thanh Trung, Huynh Quyet Thang. Building the Reliability Prediction Model of Component-Based Software Architectures. International Journal of Information Technology, Volume 5, No. 1, 2009, pp. 17-25.

Leave a Reply