- Authors : Sanjeev Kumar Jha, Shivlok Singh
- Paper ID : IJERTV1IS6516
- Volume & Issue : Volume 01, Issue 06 (August 2012)
- Published (First Online): 30-08-2012
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Identification of Appropriate Distribution in Reliability Analysis
Sanjeev Kumar Jha, Ph.D. Research Scholar,
School of Computer Science and IT, Singhania University, Rajasthan
Shivlok Singh, Ph.D. Research Scholar,
School of Computer Science and IT, Singhania University, Rajasthan
Dr. Amod Tiwari, Dean, PSIT Kanpur, India
Abstract: Distribution fitting is the process of selecting, from a list of candidate distributions, the distribution that best fits a given set of data. The main aim of distribution fitting is to predict the probability, or to forecast the frequency of occurrence, of the magnitude of a phenomenon in a certain interval. Among the many probability distributions, some can be fitted more closely to the observed frequency of the data than others, depending on the characteristics of the phenomenon and of the distribution, and the distribution giving a close fit is expected to lead to good predictions. In distribution fitting, therefore, one needs to select a distribution that suits the data well. In this paper we discuss an optimal technique for selecting distributions, using goodness-of-fit tests for the identification of the distribution.
Keywords: Goodness of fit; Open source software (OSS); Apache; Kolmogorov-Smirnov; Software reliability model; Software architecture; Reliability growth model
INTRODUCTION
The aim of distribution fitting is to predict the probability, or to forecast the frequency of occurrence, of the magnitude of a phenomenon in a certain interval. In this paper, failure data collected for the Apache web server is used for this purpose.
There are different life data distributions which can be used for reliability analysis. A single distribution cannot be used for reliability modelling of all samples under study; the best-fitting distribution therefore has to be selected for each sample. After collecting the data and converting it into an appropriate format for analysis, a goodness-of-fit test is used to select the best distribution. Several goodness-of-fit techniques are available. Since maximum likelihood estimates (MLEs) are minimum-variance unbiased estimators (MVUE), the likelihood ratio is used in this research for the goodness-of-fit test, which gives maximum accuracy and minimum error. Several software packages are available for statistical distribution fitting and analysis; in this research Microsoft Excel 2007 together with EasyFit 5.5 is used for the statistical calculations.
In distribution fitting, therefore, one needs to select a distribution that suits the data well. The goodness of fit [1] of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Such measures can be used in statistical hypothesis testing, e.g. to test for normality of residuals, to test whether two samples are drawn from identical distributions, or whether outcome frequencies follow a specified distribution. There are various methods used for goodness-of-fit testing. The most important among them are:
- Kolmogorov-Smirnov
- Anderson-Darling
- Chi-Squared
In this research all three methods for distribution identification are discussed and, finally, on the basis of the collected samples, the distributions are identified.
KOLMOGOROV-SMIRNOV TEST [4]
This test is used to decide whether a sample comes from a hypothesized continuous distribution. It is based on the empirical cumulative distribution function (ECDF). Assume that we have a random sample x1, x2, ..., xn from some distribution with CDF F(x).
The empirical CDF is denoted by
Fn(x) = (1/n) · [number of observations ≤ x]   (1)
The Kolmogorov-Smirnov statistic (D) is based on the largest vertical difference between the theoretical and the empirical cumulative distribution function:
D = max over 1 ≤ i ≤ n of [ max( F(xi) − (i−1)/n , i/n − F(xi) ) ]   (2)
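To make equations (1) and (2) concrete, the following Python sketch computes the empirical CDF and the D statistic for a small sample against a hypothesized distribution. The sample values and the exponential candidate are assumptions made purely for illustration; they are not the Apache failure data.

```python
import numpy as np
from scipy import stats

# Hypothetical time-to-failure sample (weeks); the values are illustrative only.
x = np.array([1.2, 2.5, 3.1, 4.8, 5.0, 6.3, 7.7, 9.4, 11.0, 14.2])
n = len(x)
x_sorted = np.sort(x)

# Hypothesized continuous distribution F(x); an exponential fit is assumed here.
loc, scale = stats.expon.fit(x_sorted)
F = stats.expon(loc=loc, scale=scale).cdf

# Equation (2): D is the largest vertical gap between the ECDF of equation (1)
# and the theoretical CDF, checked just below and just above each data point.
i = np.arange(1, n + 1)
d_plus = np.max(i / n - F(x_sorted))         # ECDF lies above the model CDF
d_minus = np.max(F(x_sorted) - (i - 1) / n)  # model CDF lies above the ECDF
D = max(d_plus, d_minus)
print(f"Kolmogorov-Smirnov statistic D = {D:.5f}")
```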
Hypothesis Testing
The null and the alternative hypotheses are:
- H0: the data follow the specified distribution;
- HA: the data do not follow the specified distribution.
The hypothesis regarding the distributional form is rejected at the chosen significance level (α) if the test statistic, D, is greater than the critical value obtained from a table. Fixed values of α (0.01, 0.05, etc.) are generally used to evaluate the null hypothesis (H0) at various significance levels. A value of 0.05 is typically used for most applications; however, in some critical industries a lower α value may be applied. The standard tables of critical values used for this test are only valid when testing whether a data set comes from a completely specified distribution. If one or more distribution parameters are estimated from the data, the results will be conservative: the actual significance level will be smaller than that given by the standard tables, and the probability that the fit will be rejected in error will be lower.
P-Value
The P-value, in contrast to fixed α values, is calculated from the test statistic and denotes the threshold value of the significance level, in the sense that the null hypothesis (H0) is accepted for all values of α less than the P-value. For example, if P = 0.025, the null hypothesis is accepted at all significance levels less than P (i.e. 0.01 and 0.02) and rejected at higher levels, including 0.05 and 0.1. The P-value is useful, in particular, when the null hypothesis is rejected at all predefined significance levels and one needs to know at which level it could be accepted.
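A minimal sketch of the same decision rule with SciPy is shown below: scipy.stats.kstest returns both D and the P-value, which can then be compared with the fixed significance levels discussed above. The sample and the gamma candidate are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

# Illustrative sample only; the paper's real data come from the Apache bug repository.
x = np.array([1.2, 2.5, 3.1, 4.8, 5.0, 6.3, 7.7, 9.4, 11.0, 14.2])

# Fit a candidate distribution and run the K-S test against the fitted CDF.
# Note: estimating the parameters from the same data makes the standard
# P-value conservative, as discussed above.
params = stats.gamma.fit(x)
res = stats.kstest(x, "gamma", args=params)
D, p_value = res.statistic, res.pvalue

print(f"D = {D:.5f}, P-value = {p_value:.5f}")
for alpha in (0.2, 0.1, 0.05, 0.02, 0.01):
    # H0 is rejected whenever the P-value falls below the chosen significance level.
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(f"alpha = {alpha}: {decision}")
```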
III. ANDERSON-DARLING TEST [5]
The Anderson-Darling procedure is a general test to compare the fit of an observed cumulative distribution function to an expected cumulative distribution function. This test gives more weight to the tails than the Kolmogorov-Smirnov test.
The Anderson-Darling statistic (A²) [6] is defined as
A² = −n − (1/n) · Σ from i=1 to n of (2i − 1) · [ ln F(xi) + ln(1 − F(x(n−i+1))) ]   (3)
where the sample x1, x2, ..., xn is arranged in ascending order.
HYPOTHESIS TESTING [6]
The null and the alternative hypotheses are:
- H0: the data follow the specified distribution;
- HA: the data do not follow the specified distribution.
The hypothesis regarding the distributional form is rejected at the chosen significance level (α) if the test statistic, A², is greater than the critical value obtained from a table. Fixed values of α (0.01, 0.05, etc.) are generally used to evaluate the null hypothesis (H0) at various significance levels. A value of 0.05 is typically used for most applications; however, in some critical industries a lower value may be applied. In general, critical values of the Anderson-Darling test statistic depend on the specific distribution being tested, but tables of critical values for many distributions (except several of the most widely used ones) are not easy to find. The Anderson-Darling test implemented in EasyFit uses the same critical values for all distributions. These values are calculated from an approximation formula and depend on the sample size only. This kind of test (compared to the "original" A-D test) is less likely to reject a good fit, and can be successfully used to compare the goodness of fit of several fitted distributions.
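As an illustration of equation (3), the sketch below evaluates A² directly for a distribution fitted to a small sample; the data and the two-parameter Weibull candidate are assumptions for the example, not the paper's actual computation (which is done in EasyFit).

```python
import numpy as np
from scipy import stats

# Illustrative sample only (hypothetical time-to-failure values, in weeks).
x = np.sort(np.array([1.2, 2.5, 3.1, 4.8, 5.0, 6.3, 7.7, 9.4, 11.0, 14.2]))
n = len(x)

# Candidate distribution whose fit is being assessed (a 2-parameter Weibull here).
c, loc, scale = stats.weibull_min.fit(x, floc=0)
F = stats.weibull_min(c, loc=loc, scale=scale).cdf

# Equation (3): A^2 = -n - (1/n) * sum_{i=1..n} (2i-1) * [ln F(x_i) + ln(1 - F(x_{n-i+1}))]
i = np.arange(1, n + 1)
A2 = -n - np.sum((2 * i - 1) * (np.log(F(x)) + np.log(1.0 - F(x[::-1])))) / n
print(f"Anderson-Darling statistic A^2 = {A2:.5f}")
```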
CHI-SQUARED TEST [7]
The Chi-Squared test is used to determine whether a sample comes from a population with a specific distribution. It is applied to binned data, so the value of the test statistic depends on how the data are binned, and it is used for continuous sample data only. Although there is no optimal choice for the number of bins (k), several formulas can be used to calculate this number from the sample size (N); for example, EasyFit employs the empirical formula k = 1 + log2 N. The data can be grouped into intervals of equal probability or equal width. Each bin should contain at least 5 data points, so adjacent bins sometimes need to be joined together for this condition to be satisfied.
Definition: The Chi-Squared statistic is defined as
χ² = Σ from i=1 to k of (Oi − Ei)² / Ei   (4)
where Oi is the observed frequency for bin i, and Ei is the expected frequency for bin i, calculated as Ei = F(x2) − F(x1), where F is the CDF of the probability distribution being tested and x1, x2 are the limits of bin i.
HYPOTHESIS TESTING
The null and alternative hypotheses are:
- H0: the data follow the specified distribution;
- HA: the data do not follow the specified distribution.
The hypothesis regarding the distributional form is rejected at the chosen significance level (α) if the test statistic is greater than the critical value, defined as the Chi-Squared inverse CDF with k − 1 degrees of freedom evaluated at significance level α. Although the number of degrees of freedom can be calculated as k − c − 1 (where c is the number of estimated parameters), EasyFit calculates it as k − 1, since this kind of test is least likely to reject the fit in error. Fixed values of α (0.01, 0.05, etc.) are generally used to evaluate the null hypothesis (H0) at various significance levels. A value of 0.05 is typically used for most applications; however, in some critical industries a lower value may be applied.
P-Value
The P-value, in contrast to fixed α values, is calculated from the test statistic and denotes the threshold value of the significance level, in the sense that the null hypothesis (H0) is accepted for all values of α less than the P-value. For example, if P = 0.025, the null hypothesis is accepted at all significance levels less than P (i.e. 0.01 and 0.02) and rejected at higher levels, including 0.05 and 0.1. The P-value is useful, in particular, when the null hypothesis is rejected at all predefined significance levels and one needs to know at which level it could be accepted. EasyFit displays the P-values based on the Chi-Squared test statistics (χ²) calculated for each fitted distribution.
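The sketch below mirrors the procedure just described: the number of bins is taken as k = 1 + log2(N), observed and expected bin counts are formed from a fitted CDF, the χ² statistic of equation (4) is computed, and the result is compared with the critical value for k − 1 degrees of freedom. The generated sample and the gamma candidate are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

# Generated illustrative sample; N would be 112 or 389 for the Apache data sets.
x = np.random.default_rng(1).gamma(shape=2.0, scale=5.0, size=120)
N = len(x)

# Number of bins from the empirical formula k = 1 + log2(N) used by EasyFit.
k = int(1 + np.log2(N))

# Candidate distribution whose fit is being tested.
params = stats.gamma.fit(x)
F = stats.gamma(*params).cdf

# Observed counts O_i in k equal-width bins, and expected counts
# N * (F(x2) - F(x1)) computed from the fitted CDF over the same bin limits.
observed, edges = np.histogram(x, bins=k)
expected = N * (F(edges[1:]) - F(edges[:-1]))

chi_sq = np.sum((observed - expected) ** 2 / expected)

alpha = 0.05
critical = stats.chi2.ppf(1 - alpha, df=k - 1)  # inverse Chi-Squared CDF, k-1 d.o.f.
print(f"chi-squared = {chi_sq:.4f}, critical value = {critical:.4f}")
print("reject H0" if chi_sq > critical else "do not reject H0")
```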
DATA COLLECTION AND PREPROCESSING
FAILURE COUNT
Data collection here is concerned with gathering data related to failures of Apache. In the bug-collection step, online bug-repository systems are used to collect the failure data; for this purpose the web site http://www.bugzilla.apache.org is used. Data is collected for Apache2, the latest available stable version of Apache, which came into existence in the year 2000. Bugs are collected from 26/8/2002 to May 2011 and extracted directly from the web site. Reported bugs might be duplicates, provide incomplete information, or not represent real defects; therefore, during bug preprocessing such noise is removed from the bugs gathered in the first step. Finally, in the third step, the preprocessed data is stored in a MySQL database. The data was initially in CSV (comma-separated value) format. MySQL is an open-source database system that is freely available and very secure. A total of 1,250 records were extracted from the repository, and after preprocessing 501 records were finally stored in a MySQL table. During preprocessing the following records were deleted (a sketch of this filtering step follows the list below):
- Versions other than Apache2: because only Apache2 with its major subversions is considered here, records for other versions are not considered.
- Records whose status was "Need Info": these records may or may not represent real failures.
- Records having low severity.
- Irrelevant records, i.e. records whose open date was earlier than the release date; all such records were deleted.
- Records without any version information; all such records were deleted.
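The filtering rules above could be applied, for example, with a short pandas script such as the sketch below. The file name and column names (version, status, severity, open_date) are hypothetical, since the paper does not show the actual Bugzilla export schema, and the single reference release date is likewise an assumption.

```python
import pandas as pd

# Hypothetical file and column names; the paper's Bugzilla export schema is not shown.
bugs = pd.read_csv("apache_bugs.csv", parse_dates=["open_date"])
release_date = pd.Timestamp("2002-08-26")  # assumed single reference date for illustration

# Keep only Apache2 and its major subversions.
bugs = bugs[bugs["version"].str.startswith("2.", na=False)]
# Drop records whose status is "Need Info" (may or may not be real failures).
bugs = bugs[bugs["status"] != "Need Info"]
# Drop low-severity records.
bugs = bugs[bugs["severity"] != "low"]
# Drop irrelevant records whose open date precedes the release date.
bugs = bugs[bugs["open_date"] >= release_date]
# Drop records without any version information.
bugs = bugs.dropna(subset=["version"])

bugs.to_csv("apache_bugs_clean.csv", index=False)
print(f"{len(bugs)} records kept after preprocessing")
```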
RESEARCH METHODOLOGY USED AND IDENTIFICATION OF DISTRIBUTION
Before applying the goodness-of-fit tests to the data collected for the major subversions of the Apache2 series, the bug frequency corresponding to time to failure (in weeks) is plotted, as shown in Table 1. Nothing can be concluded from the plots in Table 1 regarding the distribution or the trend of the data, so the goodness-of-fit tests are applied to all the samples and the best distribution is identified. These data are tested with the three tests against 23 life-data distributions.
Table 1: Failure Count of Apache 2.0 and Apache 2.2
[Two plots of failure count versus month, one for Apache 2.0 and one for Apache 2.2.]
APACHE 2.0
From the MySQL table, the data related to the Apache 2.0 version is extracted on the basis of version and stored in a separate table. Using an appropriate SQL query, the time to failure in terms of weeks is calculated and stored. For Apache 2.0 we have a total of 112 preprocessed failure records.
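The paper does not show the SQL used; a possible sketch of this step is given below, with hypothetical table and column names and an assumed reference date, computing the time to failure in weeks for each preprocessed Apache 2.0 record.

```python
import mysql.connector  # MySQL Connector/Python; connection details below are placeholders

# Database, table and column names are hypothetical; the paper does not show its schema.
conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="apache_bugs")
cur = conn.cursor()

# Time to failure in weeks, measured from an assumed Apache 2.0 reference date
# to the open date of each preprocessed failure record.
cur.execute("""
    SELECT bug_id,
           TIMESTAMPDIFF(WEEK, '2002-08-26', open_date) AS ttf_weeks
    FROM   apache20_failures
    ORDER  BY open_date
""")
ttf_weeks = [row[1] for row in cur.fetchall()]
print(len(ttf_weeks), "time-to-failure values extracted")

cur.close()
conn.close()
```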
Goodness of Fit Test
The goodness-of-fit test is performed using EasyFit and the result is stored in Table 2.
On the basis of the goodness-of-fit results in Table 2, the following distributions are found suitable for the time-to-failure data of the Apache 2.0 sample:
- Gen. Extreme Value distribution,
- Rayleigh (2P) distribution, and
- Weibull (3P) distribution.
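The ranking in Table 2 is produced with EasyFit. As a rough equivalent, the sketch below fits a few candidate distributions by maximum likelihood with SciPy and orders them by their Kolmogorov-Smirnov statistic; the generated data and the short candidate list are illustrative stand-ins for the 112 Apache 2.0 records and EasyFit's 23-distribution catalogue.

```python
import numpy as np
from scipy import stats

# Generated stand-in for the 112 Apache 2.0 time-to-failure values (weeks).
ttf = np.random.default_rng(7).weibull(1.5, size=112) * 40

# A small set of candidate life-data distributions (EasyFit tests 23 of them).
candidates = {
    "Gen. Extreme Value": stats.genextreme,
    "Weibull (3P)": stats.weibull_min,
    "Gamma (3P)": stats.gamma,
    "Lognormal (3P)": stats.lognorm,
}

results = []
for name, dist in candidates.items():
    params = dist.fit(ttf)                           # maximum likelihood estimates
    res = stats.kstest(ttf, dist.name, args=params)  # K-S test against the fitted CDF
    results.append((name, res.statistic, res.pvalue))

# Rank by the K-S statistic, smallest (closest fit) first, as in Table 2.
for rank, (name, D, p) in enumerate(sorted(results, key=lambda r: r[1]), start=1):
    print(f"{rank:2d}  {name:20s} D = {D:.5f}  p = {p:.4f}")
```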
Table 2: Goodness of Fit Summary
| Distribution | Kolmogorov-Smirnov Statistic | Rank | Anderson-Darling Statistic | Rank | Chi-Squared Statistic | Rank |
| Gen. Extreme Value | 0.04783 | 1 | 0.46327 | 2 | 2.5651 | 1 |
| Rayleigh (2P) | 0.04862 | 2 | 0.51576 | 3 | 3.4363 | 2 |
| Weibull (3P) | 0.05254 | 3 | 0.45083 | 1 | 4.9185 | 3 |
| Gamma (3P) | 0.05935 | 4 | 0.54081 | 4 | 5.4208 | 4 |
| Lognormal (3P) | 0.05945 | 5 | 0.55453 | 6 | 6.2789 | 7 |
| Fatigue Life (3P) | 0.05992 | 6 | 0.54319 | 5 | 5.9276 | 5 |
| Log-Logistic (3P) | 0.06327 | 7 | 0.6779 | 8 | 6.1952 | 6 |
| Beta | 0.07135 | 8 | 0.59535 | 7 | 8.2549 | 13 |
| Gamma | 0.07428 | 9 | 1.9004 | 15 | 7.2914 | 10 |
| Gumbel Max | 0.07618 | 10 | 1.151 | 10 | 7.764 | 12 |
| Rayleigh | 0.07632 | 11 | 2.1334 | 16 | 7.0116 | 9 |
| Gen. Gamma (4P) | 0.07646 | 12 | 4.4752 | 19 | N/A | |
| Normal | 0.08124 | 13 | 1.0723 | 9 | 6.7778 | 8 |
| Logistic | 0.08258 | 14 | 1.6037 | 13 | 8.6466 | 15 |
| Weibull | 0.09517 | 15 | 1.8444 | 14 | 7.6194 | 11 |
| Gen. Gamma | 0.10421 | 16 | 1.4631 | 12 | 9.1317 | 16 |
| Kumaraswamy | 0.11334 | 17 | 1.3697 | 11 | 8.375 | 14 |
| Lognormal | 0.14404 | 18 | 3.7415 | 17 | 11.774 | 17 |
| Log-Logistic | 0.1484 | 19 | 4.1091 | 18 | 17.211 | 19 |
| Gumbel Min | 0.15183 | 20 | 5.7874 | 20 | 16.446 | 18 |
| Frechet | 0.2143 | 21 | 9.8675 | 21 | 28.736 | 20 |
| Frechet (3P) | 0.21834 | 22 | 11.323 | 22 | N/A | |
| Fatigue Life | 0.3087 | 23 | 18.386 | 23 | 49.855 | 21 |
Among these distributions, on the basis of their test-statistic rankings and the detailed goodness-of-fit results, the Gen. Extreme Value distribution is identified as the best distribution to fit. The detailed goodness-of-fit results for this distribution are shown in Table 3.
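Since the Gen. Extreme Value distribution is identified as the best fit for Apache 2.0, its parameters can subsequently be estimated, as the conclusion later suggests. A minimal SciPy sketch of that step is shown below, using generated data as a stand-in for the actual 112 time-to-failure values.

```python
import numpy as np
from scipy import stats

# Generated stand-in for the 112 Apache 2.0 time-to-failure values (weeks).
ttf = np.random.default_rng(3).gamma(shape=2.0, scale=20.0, size=112)

# Maximum likelihood estimates of the GEV shape, location and scale parameters.
shape, loc, scale = stats.genextreme.fit(ttf)
print(f"shape = {shape:.4f}, location = {loc:.4f}, scale = {scale:.4f}")

# The fitted distribution can then be reused, e.g. for the reliability R(t) = 1 - F(t).
t = 52  # one year, in weeks
reliability = stats.genextreme.sf(t, shape, loc=loc, scale=scale)
print(f"Estimated probability of surviving beyond {t} weeks: {reliability:.4f}")
```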
APACHE 2.2
After the study of Apache 2.0, the time to failure in terms of weeks is extracted for Apache 2.2 and stored in an Excel sheet. For this sample we have a total of 389 preprocessed records.
Goodness of Fit Test
The goodness-of-fit test is applied to the collected sample data and the result is shown in Table 4.
The following distributions are found suitable for the time-to-failure data of the Apache 2.2 sample:
- Gen. Gamma distribution, and
- Gen. Extreme Value distribution.
Among these two distributions, on the basis of their test-statistic rankings and the detailed goodness-of-fit results, the Gen. Gamma distribution is identified as the best distribution to fit. The goodness-of-fit result for the Gen. Gamma distribution is given in Table 5.
Table 3: Goodness of Fit Detail for Gen. Extreme Value Distribution (Apache 2.0)

Kolmogorov-Smirnov
| Sample Size | Statistic | P-Value | Rank |
| 112 | 0.04783 | 0.94905 | 1 |
| α | 0.2 | 0.1 | 0.05 | 0.02 | 0.01 |
| Critical Value | 0.10139 | 0.11556 | 0.12832 | 0.14344 | 0.15393 |
| Reject? | No | No | No | No | No |

Anderson-Darling
| Sample Size | Statistic | Rank |
| 112 | 0.46327 | 2 |
| α | 0.2 | 0.1 | 0.05 | 0.02 | 0.01 |
| Critical Value | 1.3749 | 1.9286 | 2.5018 | 3.2892 | 3.9074 |
| Reject? | No | No | No | No | No |

Chi-Squared
| Deg. of Freedom | Statistic | P-Value | Rank |
| 6 | 2.5651 | 0.86111 | 1 |
| α | 0.2 | 0.1 | 0.05 | 0.02 | 0.01 |
| Critical Value | 8.5581 | 10.645 | 12.592 | 15.033 | 16.812 |
| Reject? | No | No | No | No | No |
Table 4: Goodness of Fit Summary

| Distribution | Kolmogorov-Smirnov Statistic | Rank | Anderson-Darling Statistic | Rank | Chi-Squared Statistic | Rank |
| Gen. Gamma (4P) | 0.02916 | 1 | 0.44158 | 1 | 5.5903 | 2 |
| Beta | 0.03503 | 2 | 6.1941 | 9 | 3.3941 | 1 |
| Kumaraswamy | 0.04667 | 3 | 12.519 | 15 | N/A | |
| Gen. Extreme Value | 0.05595 | 4 | 2.3652 | 2 | 14.406 | 3 |
| Normal | 0.06901 | 5 | 4.2274 | 4 | 26.363 | 8 |
| Weibull (3P) | 0.06976 | 6 | 3.7826 | 3 | 27.392 | 10 |
| Lognormal (3P) | 0.06989 | 7 | 4.5999 | 5 | 26.797 | 9 |
| Gamma (3P) | 0.07182 | 8 | 4.8279 | 8 | 25.696 | 6 |
| Fatigue Life (3P) | 0.07351 | 9 | 4.6869 | 6 | 29.321 | 12 |
| Log-Logistic (3P) | 0.0757 | 10 | 4.7048 | 7 | 20.61 | 4 |
| Gumbel Min | 0.08959 | 11 | 7.5283 | 13 | 47.409 | 15 |
| Rayleigh | 0.09013 | 12 | 6.883 | 10 | 22.102 | 5 |
| Logistic | 0.09133 | 13 | 7.04 | 11 | 39.203 | 13 |
| Rayleigh (2P) | 0.10238 | 14 | 7.4245 | 12 | 25.998 | 7 |
| Gamma | 0.10834 | 15 | 17.049 | 17 | 57.49 | 16 |
| Gumbel Max | 0.11456 | 16 | 17.389 | 18 | 66.098 | 19 |
| Gen. Gamma | 0.13441 | 17 | 12.999 | 16 | 44.27 | 14 |
| Weibull | 0.16383 | 18 | 10.855 | 14 | 29.052 | 11 |
| Lognormal | 0.16932 | 19 | 19.802 | 19 | 61.869 | 18 |
| Log-Logistic | 0.17935 | 20 | 20.025 | 20 | 59.687 | 17 |
| Frechet | 0.22712 | 21 | 40.092 | 21 | 118.62 | 20 |
| Frechet (3P) | 0.26366 | 22 | 40.538 | 22 | N/A | |
| Fatigue Life | 0.27565 | 23 | 56.017 | 23 | 149.8 | 21 |
Table 5: Goodness of Fit Detail of Apache 2.2 TTF Data for Gen. Gamma Distribution

Gen. Gamma (4P)

Kolmogorov-Smirnov
| Sample Size | Statistic | P-Value | Rank |
| 389 | 0.02916 | 0.88569 | 1 |
| α | 0.2 | 0.1 | 0.05 | 0.02 | 0.01 |
| Critical Value | 0.0544 | 0.06201 | 0.06885 | 0.07697 | 0.08259 |
| Reject? | No | No | No | No | No |

Anderson-Darling
| Sample Size | Statistic | Rank |
| 389 | 0.44158 | 1 |
| α | 0.2 | 0.1 | 0.05 | 0.02 | 0.01 |
| Critical Value | 1.3749 | 1.9286 | 2.5018 | 3.2892 | 3.9074 |
| Reject? | No | No | No | No | No |

Chi-Squared
| Deg. of Freedom | Statistic | P-Value | Rank |
| 8 | 5.5903 | 0.69301 | 2 |
| α | 0.2 | 0.1 | 0.05 | 0.02 | 0.01 |
| Critical Value | 11.03 | 13.362 | 15.507 | 18.168 | 20.09 |
| Reject? | No | No | No | No | No |
This distribution is accepted by all three tests and at all levels of significance.
CONCLUSION:
On the basis of the data collected for both versions, appropriate distributions have been identified. From these distributions the parameters may be estimated and reliability models may then be constructed. This paper may be useful for researchers in any field.
References
[1] Bev Littlewood, "The Problems of Assessing Software Reliability When You Really Need to Depend on It," Centre for Software Reliability, City University, London, UK.
[2] J. Clerk Maxwell, A Treatise on Electricity and Magnetism, 3rd ed., vol. 2. Oxford: Clarendon, 1892, pp. 68-73.
[3] I. S. Jacobs and C. P. Bean, "Fine particles, thin films and exchange anisotropy," in Magnetism, vol. III, G. T. Rado and H. Suhl, Eds. New York: Academic, 1963, pp. 271-350.
[4] http://www.mathwave.com
[5] F. W. Scholz and M. A. Stephens, "K-Sample Anderson-Darling Tests of Fit, for Continuous and Discrete Cases," 1986.
[6] H. Guo, S. Honecker, A. Mettas, and D. Ogden, "Reliability Estimation for One-Shot Systems with Zero Component Test Failures," 2010 Annual Reliability and Maintainability Symposium.
[7] John D. Musa, "Software Reliability Data," Technical Report, Data & Analysis Center for Software, January 1980. http://www.dacs.dtic.mil/databases/sled/swrel.shtml [27 January 2003].
[8] Roshanak Roshandel, Nenad Medvidovic, and Leana Golubchik, "A Bayesian Model for Predicting Reliability of Software Systems at the Architectural Level"; Y. Yorozu, M. Hirano, K. Oka, and Y. Tagawa, "Electron spectroscopy studies on magneto-optical media and plastic substrate interface," IEEE Transl. J. Magn. Japan, vol. 2, pp. 740-741, August 1987 [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982].
[9] R. Allen and D. Garlan, "A Formal Basis for Architecture Connection," ACM Transactions on Software Engineering and Methodology, 6(3): pp. 213-249, 1997.
[10] A. M. Abd-Elfattah and A. M. Omima, "Estimation of the Unknown Parameters of the Generalized Frechet Distribution."
[11] A. Mockus, T. R. Fielding, and J. D. Herbsleb, "Two Case Studies of Open Source Software Development: Apache and Mozilla," ACM Transactions on Software Engineering and Methodology, vol. 11, no. 3, July 2002.
[12] T. R. Moss, The Reliability Data Handbook, ASME Press, 2005.
[13] J. D. Musa and K. Okumoto, "A Logarithmic Poisson Execution Time Model for Software Reliability Measurement," 7th Intl. Conference on Software Engineering (ICSE), 1984.
[14] Cobra Rahmani, Harvey Siy, and Azad Azadmanesh, "An Experimental Analysis of Open Source Software Reliability."
[15] Online bug repository, Bugzilla: http://www.bugzilla.org.
[16] "X distribution," Lifetime Data Analysis, 7: 187-200; H. Goto, Y. Hasegawa, and M. Tanaka, "Efficient Scheduling Focusing on the Duality of MPL Representatives," Proc. IEEE Symp. Computational Intelligence in Scheduling (SCIS 07), IEEE Press, Dec. 2007, pp. 57-64, doi:10.1109/SCIS.2007.357670.
Biography:
Sanjeev Kumar Jha, Senior Systems Analyst, NIELIT Chandigarh, Ministry of C&IT, Government of India. He holds Bachelor's and Master's degrees in Statistics (Honours) from Patna University and is presently pursuing his PhD in Computer Science from Singhania University, Rajasthan. He has attended 2 international conferences. His areas of research interest are reliability and open source software.
Shivlok Singh, Programmer, NIELIT, is presently pursuing his PhD in Computer Science from Singhania University, Rajasthan. He has attended 2 international conferences. His areas of research interest are reliability and open source software.
Dr. Amod Tiwari, Associate Professor, PSIT Kanpur, holds a PhD from IIT Kanpur. He has attended many international and national conferences and published many papers in international and national journals. His areas of research interest are image processing, reliability, and open source software.