- Open Access
- Authors : Anwesh Kumar Mahanta, Smruti Rekha Pradhan, Biswajeet Sahoo, Debasish Pradhan
- Paper ID : IJERTV13IS010032
- Volume & Issue : Volume 13, Issue 1 (January 2024)
- Published (First Online): 17-01-2024
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
An Automated PCA-LDA Based Software Fault Prediction Model Using Machine Learning Classifier
Anwesh Kumar Mahanta
Department of M.Sc. (CS)
NIIS Institute of Information Science and Management Bhubaneswar, India
Smruti Rekha Pradhan
Department of M.Sc. (CS)
NIIS Institute of Information Science and Management Bhubaneswar, India
Biswajeet Sahoo
Department of M.Sc. (CS)
NIIS Institute of Information Science and Management Bhubaneswar, India
Debasish Pradhan
Department of CSE
NIIS Institute of Business Administration Bhubaneswar, India
Abstract: Testing is a crucial element of the software development process, aiming to identify and rectify faults or errors introduced by developers. Addressing such issues later in the software development lifecycle can amplify their impact. To mitigate this, early detection of problems is essential, as is optimizing the utilization of testing resources. During defect prediction, software modules are categorized as either defect-prone or non-defect-prone. This study focuses on enhancing the automation and accuracy of predicting faulty software modules through a hybrid approach. The methodology involves preprocessing, feature dimensionality reduction, and classification. A publicly available dataset from NASA is employed to evaluate the model. Principal Component Analysis (PCA) combined with Linear Discriminant Analysis (LDA), known as PCA-LDA, is used to reduce the dimension of the feature vector. The AdaBoost boosting technique is applied to a random forest (ADBRF) to determine the prediction rate. Various performance metrics are examined to validate the proposed model, including accuracy, sensitivity, specificity, F1 score, and MCC. For the MC1 dataset, the PCA+ADBRF method yields an average accuracy of 0.9838. The experimental results indicate that the suggested model surpasses existing models in defect prediction accuracy.
Keywords: Software Defect Prediction, Principal Component Analysis, Random Forest, AdaBoost, ADBRF, NASA Dataset
INTRODUCTION
In the realm of contemporary software development, where the stakes of system reliability and user satisfaction are higher than ever, the integration of advanced technologies has become imperative. One such groundbreaking approach that has gained significant traction is Software Fault Prediction using Machine Learning. This transformative methodology stands as a beacon of innovation, proactively addressing potential software glitches through the harnessing of powerful machine learning algorithms. This introduction sets the stage to unravel the intricacies and merits of this cutting-edge paradigm, shedding light on its importance, methodologies, and the pivotal role it plays in shaping the future of software engineering.
As software grows larger, its complexity grows as well, and the probability of bugs is highest when a system becomes complex. Complex systems are also far more difficult to analyze. We want our software to work without faults, but bugs or defects can still slip through. They must be found and fixed, and software fault prediction is one mechanism that supports this goal [11].
Software fault prediction is a technique for predicting which parts of the software contain faults. By predicting faults, we can improve the value and reliability of the software. If the software has faults, defects, or errors, it might behave or terminate unexpectedly and fail to meet the customer's requirements. The presence of bugs decreases the software's quality and increases the cost of the development life cycle.
There are three main ways to predict software faults:
- Machine learning
- Data mining
- Deep learning
To foresee software faults and bugs, machine learning studies past data from software development projects using statistical models and algorithms. It seeks trends and connections between parameters such as developer experience, code complexity, and project size on the one hand and the appearance of flaws or problems in the software on the other.
A machine learning model trained on this data learns to estimate the likelihood of flaws in new or ongoing projects. This is useful in quality assurance, because it lets teams deploy resources more effectively and prioritize testing on the regions of the codebase that are more likely to contain faults [12].
The introduction would be incomplete without addressing the inherent challenges and limitations in implementing fault prediction models. While machine learning presents a promising avenue, it is not devoid of obstacles, ranging from data quality issues to the interpretability of complex models. Acknowledging these challenges becomes pivotal as we envision a future where fault prediction seamlessly integrates into the software development workflow [14]. Thus, we dissect the hurdles, providing a roadmap for developers to navigate these impediments and optimize the effectiveness of their fault prediction endeavors.
The journey through Software Fault Prediction using Machine Learning concludes with a glimpse into the future, exploring emerging trends and innovations that promise to reshape the landscape. From the infusion of explainable AI to the fusion of machine learning with DevOps practices, the evolving landscape opens avenues for further refinement and sophistication. As the horizon expands, so does the potential for harnessing machine learning not just as a tool but as a guiding force in crafting resilient and reliable software systems[11].
This introduction sets the tone for a comprehensive exploration of Software Fault Prediction using Machine Learning, inviting developers, researchers, and enthusiasts to embark on a journey that transcends traditional paradigms and propels software engineering into a future where errors are not just fixed but predicted and prevented[14].
LITERATURE REVIEW
In 2018, Hammouri et al. [1] examined the application of machine learning algorithms for predicting software bugs. They applied Naïve Bayes (NB), Decision Tree (DT), and Artificial Neural Network (ANN) classifiers to the datasets DS1, DS2, and DS3 and reported an accuracy of 0.93.
In 2021, Mustaqeem et al. [2] presented a hybrid approach that combines two promising algorithms for optimization and feature selection. They used PCA with an SVM classifier on the CM1 and KC1 datasets and obtained an accuracy of 0.9520.
In 2023, Das et al. [3] applied the AB, RF, NB, J48, MLP, and ADRF machine learning algorithms to the JM1, PC5, PC4, MC1, and KC1 datasets to develop a method known as PCA+ADRF for identifying software flaws in specific modules, and obtained an accuracy of nearly 0.985.
In 2020, Rathore et al. [4] published an empirical study of ensemble techniques for software fault prediction. Their findings may help the research community develop precise fault prediction models by selecting suitable ensemble techniques. They applied Naïve Bayes, logistic regression, and J48 to the Ant, Camel, Jedit, Lucene, Poi, Prop, Tomcat, Xalan, and Xerces datasets and reported an accuracy of 0.8848.
In 2018, Hammouri et al. [5] published a paper on software bug prediction using a machine learning approach. The experimental results showed that the ML approach outperformed other approaches, such as linear AR and POWM models, in terms of prediction model performance. Three supervised machine learning algorithms, Naïve Bayes (NB), Decision Tree (DT), and Artificial Neural Networks (ANN), were employed to forecast future software defects from historical data. These algorithms were applied to the datasets DS1, DS2, and DS3 with an accuracy of 0.93.
In 2019, Iqbal et al. [6] published a performance analysis of machine learning techniques for software defect prediction using NASA datasets. The research results can serve as a baseline for future studies, allowing for easy comparison and verification of proposed techniques, models, or frameworks. The classifiers included NB, MLP, RBF, SVM, KNN, kStar, OneR, DT, and RF, and were applied to the PC2 dataset with an accuracy of 0.976959.
In 2017, Li et al. [7] published a paper on software defect prediction via a convolutional neural network. The paper focuses on predicting code defects in software implementation to reduce the workload of software maintenance and improve reliability. They applied a CNN classifier to the Xerces dataset with an accuracy of 0.845.
In 2019, Tian et al. [8] published a paper on software defect prediction based on machine learning algorithms. The paper begins by outlining the concept of software defect prediction, with a subsequent emphasis on machine learning. They applied Naïve Bayes, ensemble learners, neural networks, and SVM classifiers to the JM1 dataset.
In 2019, Yalciner et al. [9] published a paper on software defect estimation using machine learning algorithms. The authors assessed the performance of machine learning algorithms in predicting software defects and identified the top-performing category by evaluating seven different algorithms on four NASA datasets. They applied Bayesian learners, ensemble learners, SVM, and neural network classifiers to the PC1, CM1, KC1, and KC2 datasets with an accuracy of 0.94.
In 2019, Wang et al. [10] published a paper on a cluster-based hybrid feature selection method for defect prediction. The authors introduced a feature selection approach that combines filter and wrapper methods in a hybrid manner. The method defines a feature quality coefficient using spectral clustering and uses sequential forward selection to derive the final feature subset. They applied K-Nearest Neighbor, Decision Tree, and Random Forest classifiers to the Camel, Jedit, Lucene, Synapse, and Xerces datasets.
PROPOSED WORK
The block diagram of the proposed work is shown in Fig. 1. It consists of six distinct units: NASA dataset, data pre-processing, dimensionality reduction, Random Forest, AdaBoost, and performance evaluation.
Fig 1. Block diagram for the proposed model
A. Dataset
In this research work we used the NASA datasets from the PROMISE repository, which comprises 14 different datasets [11].
Of these 14 datasets, we used 5 for our experiments. The details of the datasets are given in Table I.
TABLE I. DATASET DETAILS
| Dataset | Total instances | Non-defective instances | Defective instances |
| --- | --- | --- | --- |
| JM1 | 7782 | 6110 | 1672 |
| MC1 | 1988 | 1942 | 46 |
| PC3 | 1077 | 943 | 134 |
| PC4 | 1287 | 1110 | 177 |
| PC5 | 1711 | 1240 | 471 |
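To make the class balance in Table I concrete, the following Python sketch computes, for each dataset, the share of defective instances and the accuracy that a trivial rule predicting "non-defective" for every module would reach. The counts are copied from Table I; the function name `class_balance` and the idea of a majority-class reference point are illustrative assumptions, not part of the original experiments.

```python
# Instance counts copied from Table I: (total, non-defective, defective).
DATASETS = {
    "JM1": (7782, 6110, 1672),
    "MC1": (1988, 1942, 46),
    "PC3": (1077, 943, 134),
    "PC4": (1287, 1110, 177),
    "PC5": (1711, 1240, 471),
}


def class_balance(total, non_defective, defective):
    """Return (defective share, accuracy of always predicting 'non-defective')."""
    # Sanity check: the two classes must add up to the total.
    assert non_defective + defective == total
    return defective / total, non_defective / total


for name, counts in DATASETS.items():
    share, baseline = class_balance(*counts)
    print(f"{name}: defective share = {share:.4f}, "
          f"majority-class accuracy = {baseline:.4f}")
```

Such a reference point can help when reading accuracy figures on strongly imbalanced datasets such as MC1, where metrics like specificity and MCC carry additional information.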
B. Feature normalization
Feature normalization is a preliminary procedure in machine learning in which the input features of a dataset are scaled and standardized [13]. The objective is to guarantee a consistent scale or distribution for all features, a crucial factor for the optimal performance of certain machine learning algorithms. As a result, the algorithm is more robust and training converges more smoothly. This process also helps prevent features with larger scales from dominating those with smaller scales.
In this research work Min-Max Scaling (Normalization) is used. This approach adjusts the features to a designated range, typically spanning from 0 to 1. The Min-Max scaling is calculated using the following formula:
$$A' = \frac{A - A_{\min}}{A_{\max} - A_{\min}} \qquad (1)$$
where $A$ is the original feature value, $A_{\min}$ is the smallest value of the feature within the dataset, and $A_{\max}$ is the highest value of the feature within the dataset [15].
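As a small illustration of Min-Max scaling, the Python sketch below rescales one feature column to the range 0 to 1. The function name `min_max_scale`, the sample values, and the handling of a constant feature (mapped to 0.0 to avoid division by zero) are assumptions made for this example.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] with Min-Max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the formula would divide by zero.
        # Mapping to 0.0 is an assumed convention for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of a single software metric (for example, lines of code).
loc = [10, 20, 40, 110]
print(min_max_scale(loc))  # [0.0, 0.1, 0.3, 1.0]
```

The smallest value maps to 0 and the largest to 1, so features measured on very different scales become directly comparable.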
C. Dimensionality Reduction
In machine learning, dimensionality reduction is a method for decreasing the number of input variables or features in a dataset. The objective is to streamline the dataset while preserving its crucial information and patterns [16]. Datasets with a large number of features present challenges such as heightened computational complexity, the potential for overfitting, and difficulty in visualizing the data.
1) PCA
Principal Component Analysis (PCA) is a widely employed dimensionality reduction technique in the realm of machine learning. It operates by transforming the original features of a dataset into a new set of uncorrelated variables, known as principal components, with the aim of capturing the maximum variance in the data. This method not only facilitates a more concise representation of the information but also aids in identifying the key patterns and structures within the data, thus enhancing the efficiency of subsequent machine learning models.[17]
$$S = W W^{T} \qquad (2)$$
PCA plays a crucial role in the field of software fault prediction, contributing to the improvement of software reliability and maintenance. By applying PCA to the features extracted from software metrics, it enables the identification of the most significant factors influencing the occurrence of faults [18]. This reduction in dimensionality not only simplifies the complexity of the dataset but also enhances the interpretability of the underlying patterns. Consequently, PCA assists in creating more efficient and accurate fault prediction models, enabling software developers to proactively address potential issues and improve the overall quality of the software product.
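The idea of capturing maximum variance can be sketched in a few lines of Python. The example below builds the sample covariance matrix of a tiny two-feature dataset and approximates its leading principal component by power iteration. Power iteration is chosen here only for brevity; the paper does not state how its PCA was computed, and the data values and function names are hypothetical.

```python
def covariance_matrix(rows):
    """Return (sample covariance matrix, mean-centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # keep the vector at unit length
    return v


# Hypothetical data: two strongly correlated software metrics.
rows = [[2.0, 1.9], [4.0, 4.1], [6.0, 6.2], [8.0, 7.8]]
cov, centered = covariance_matrix(rows)
pc1 = first_principal_component(cov)

# Variance captured along the first component versus total variance.
variance_along_pc1 = sum(pc1[a] * cov[a][b] * pc1[b]
                         for a in range(2) for b in range(2))
total_variance = cov[0][0] + cov[1][1]

# One-dimensional representation of each row (its score on the first component).
scores = [sum(c[j] * pc1[j] for j in range(2)) for c in centered]
print(variance_along_pc1 / total_variance)
```

Because the two features are almost perfectly correlated, a single component retains nearly all of the variance, which is the situation in which PCA reduces dimensionality with little loss of information.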
2) PCA-LDA
PCA-LDA, a combined approach of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), is a powerful technique in pattern recognition and machine learning [19]. While PCA focuses on maximizing the variance in the entire dataset, LDA aims to maximize the variance between different classes relative to the variance within each class. In PCA-LDA, the two methods are integrated to capitalize on their complementary strengths.
Z=XW (3)
Y=XW (4)
PCA is initially applied to reduce the dimensionality of the dataset, capturing the most prominent features and patterns. Subsequently, LDA is employed on the reduced-dimensional space to enhance class separability. This two-stage process not only helps preserve essential information but also ensures that the subsequent analysis is optimized for classification tasks [20]. By combining the variance-capturing capability of PCA with the class discrimination of LDA, PCA-LDA is a robust methodology for feature extraction and classification, particularly in scenarios where both dimensionality reduction and class separability are critical, such as facial recognition or medical diagnosis.
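The second stage can be illustrated with a two-class Fisher discriminant, the simplest form of LDA. The Python sketch below assumes the data have already been reduced to two components by PCA and computes the projection direction from the within-class scatter matrix and the class means. All data values and function names are hypothetical, and the sketch is limited to two dimensions so that the matrix inverse can be written out by hand.

```python
def mean_vector(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, mean):
    """2x2 scatter matrix of one class around its mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fisher_direction(class0, class1):
    """Direction proportional to Sw^-1 (m1 - m0) for two classes in 2-D."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[a][b] + s1[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # must be non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


# Hypothetical PCA-reduced modules: class0 = non-defective, class1 = defective.
class0 = [[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0]]
class1 = [[5.0, 2.0], [6.0, 1.0], [6.0, 3.0], [7.0, 2.0]]

w = fisher_direction(class0, class1)
proj0 = [p[0] * w[0] + p[1] * w[1] for p in class0]
proj1 = [p[0] * w[0] + p[1] * w[1] for p in class1]
print(w, proj0, proj1)
```

After projection onto this single direction the two classes no longer overlap in the toy data, which is the class-separability effect that LDA contributes after PCA.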
D. Machine Learning Classification
Machine learning classification is a pivotal domain within the broader landscape of machine learning, focusing on the development of algorithms adept at autonomously categorizing input data into predefined classes or categories [21]. The core objective is to empower systems with the ability to discern patterns and relationships within data, thereby facilitating predictions or decisions for novel, unseen instances.
In a typical classification scenario, algorithms undergo training on labelled datasets, where each data point is linked to a known class.[22] During the learning phase, these algorithms extract features from input data, identifying patterns that differentiate between various classes. Once trained, the model generalizes these learned patterns to classify new, unseen data accurately.
Diverse classification algorithms exist, ranging from traditional statistical methods to more sophisticated machine learning techniques like decision trees, support vector machines, k-nearest neighbours, and neural networks. The selection of a specific algorithm depends on data characteristics and problem requirements.
Applications of classification span a wide spectrum, including spam detection in emails, sentiment analysis in social media, disease diagnosis in healthcare, and image recognition in computer vision. The efficacy of a classification model is typically evaluated using metrics like accuracy, precision, recall, and F1 score, gauging its ability to correctly categorize instances.[23]
E. Random Forest
Random Forest (RF) is useful for both regression and classification tasks and is well known in the machine learning community. It functions as an ensemble learning method that uses the combined strength of several decision trees to produce predictions. Random samples are drawn from the dataset with replacement to create multiple training sets, so the same data point can be selected more than once [24]. Distinct decision trees are then built on the various training sets using randomly selected subsets of features. A final prediction is produced by combining the predictions made by each tree.
Compared to single decision trees, RF has a number of advantages. Because it employs several trees rather than just one, it is less likely to overfit. It can also manage outliers and missing data, and it is less sensitive to the selection of hyperparameters [25]. Robust algorithms like RF are useful in many fields, including bioinformatics, image classification, and fraud detection. However, RF might not always be the best algorithm. It is crucial to evaluate different models carefully and choose the one that best fits the particular needs of the problem.
F. AdaBoost
AdaBoost, also referred to as Adaptive Boosting, is a noteworthy machine learning algorithm within the family of ensemble methods. It combines multiple weak models to form a robust model. The core concept behind AdaBoost is to iteratively train a sequence of weak models on the same dataset. Each subsequent model places more emphasis on the instances misclassified in the preceding round [26]. Through this iterative process, AdaBoost aims to construct a strong model capable of accurately classifying intricate datasets.
The weak models employed in AdaBoost typically consist of straightforward classifiers like decision trees or neural networks with a limited number of hidden layers. In each iteration, weights are assigned to each instance in the dataset, with higher weights allocated to misclassified instances than to correctly classified ones [27]. The weighted dataset is then used to train the next weak model, and so on, until the classification error is minimized or a predetermined number of models has been trained.
Once all weak models have been trained, the final prediction is derived by computing a weighted average of their individual predictions. The weight assigned to each weak model is determined by its accuracy on the training dataset, with more accurate models receiving higher weights. Extensive research has shown the efficacy of AdaBoost across diverse applications, including object recognition, face detection, and speech recognition. Nevertheless, as with other machine learning algorithms, the success of AdaBoost depends on factors such as the quality and size of the training dataset, the selection of appropriate models, and the number of iterations carried out [11].
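A single round of this reweighting can be worked through numerically. The Python sketch below follows the standard AdaBoost update: it computes the weighted error of a weak model, derives the model's weight, and renormalizes the instance weights. The function name `adaboost_round` and the five-instance example are hypothetical, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math


def adaboost_round(weights, correct):
    """One AdaBoost round: return (error, alpha, updated normalized weights).

    weights -- current instance weights, summing to 1
    correct -- True where the weak model classified the instance correctly
    """
    # Weighted error: total weight of the misclassified instances.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Model weight: larger for more accurate weak models.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct instances, grow those of mistakes.
    raw = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    z = sum(raw)  # normalization constant
    return error, alpha, [r / z for r in raw]


# Five instances with equal weights; the weak model gets the last one wrong.
weights = [0.2] * 5
correct = [True, True, True, True, False]
error, alpha, new_weights = adaboost_round(weights, correct)
print(error, alpha, new_weights)
```

In this example the single misclassified instance ends up carrying half of the total weight, so the next weak model is pushed to classify it correctly.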
G. AdaBoost + Random Forest
ADBRF denotes the fusion of the AdaBoost and random forest algorithms for software fault prediction (SFP). It aims to improve precision and consistency and to address overfitting concerns. AdaBoost, a pioneering boosting technique, is designed to raise the precision of a given learning algorithm [28]. This technique, rooted in ensemble learning, combines multiple classifiers or weak hypotheses, each characterized by a high error rate, to craft a final hypothesis with a substantially lower training error rate. The approach is simple, fast, and user-friendly. Moreover, it is nonparametric, adept at identifying outliers, and requires no prior knowledge of the weak learner. Consequently, this algorithm has been applied successfully to various prediction problems, including software fault prediction. In this implementation, the AdaBoost algorithm is used for software fault prediction [29].
Random Forests (RF) are recognized as a highly efficient and potent ensemble machine learning method that often outperforms other algorithms in terms of accuracy. Functioning as a bagging technique, RF manages numerous input variables without requiring variable elimination. It performs well on large datasets and estimates which features are important for making predictions. Additionally, RF handles missing data well, is easy to parallelize, and is straightforward to implement [11]. RF is an ensemble of tree-based classifiers that is particularly robust to noise and outliers. Each tree in the forest is built using a randomly sampled vector whose values are drawn independently and are identically distributed across all trees. The following requirements must be satisfied to create each tree from the training dataset:
- When constructing trees, samples of size S are drawn at random from the training set, where S represents the number of samples [30]. Each chosen sample is returned to the original set before the next draw, forming what are commonly referred to as bootstrap samples.
- In datasets containing a total of F features, a much smaller value f (where f is significantly less than F) is defined. At each node within the tree, a random set of f features is selected from the pool of F features [11]. The node is split using the optimal split determined from the chosen features. Throughout the entire forest creation process, the value of f remains constant.
- Each tree is grown to its maximum extent, without pruning.
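The first two requirements can be sketched with Python's standard `random` module. The example draws a bootstrap sample of row indices with replacement and selects f distinct feature indices out of F. The function names and the concrete sizes (10 rows, F = 37, f = 6) are hypothetical values chosen for illustration.

```python
import random


def bootstrap_sample(n_rows, sample_size, rng):
    """Draw sample_size row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(sample_size)]


def random_feature_subset(n_features, f, rng):
    """Pick f distinct feature indices out of n_features, with f << n_features."""
    return rng.sample(range(n_features), f)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sizes: 10 training rows, F = 37 features, f = 6 per node.
rows = bootstrap_sample(n_rows=10, sample_size=10, rng=rng)
features = random_feature_subset(n_features=37, f=6, rng=rng)

print("bootstrap rows:", rows)
print("candidate features at one node:", features)
```

Because rows are drawn with replacement, some indices typically repeat while others are left out, whereas the feature indices selected for a node are always distinct.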
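Algorithm for ADBRF classifier
Require: A: a training dataset consisting of N instances, $A = \{(x_i, y_i)\}$ for $i = 1, 2, \dots, N$, with $x_i \in X$ and labels $y_i \in \{0, 1\}$
RF is used as the base learning model
P: total number of iterations
T: total number of trees
$\Theta$: a random vector used to construct a tree in the forest
Ensure: the final classifier $H$
1: initialize the weights $w_1(i) = 1/N$ for all $i$
2: for $p = 1$ to $P$ do
3: for $t = 1$ to $T$ do
4: create a vector $\Theta_t$ using the weights $w_p(i)$
5: $A_t \leftarrow$ bootstrapSamples($A$)
6: $C_t \leftarrow$ buildTreeClassifier($A_t$; $\Theta_t$)
7: create a hypothesis by majority voting over the trees
8: end for
9: obtain a weak hypothesis $h_p(x_i) \rightarrow y_i \in \{0, 1\}$
10: calculate the error $\varepsilon_p = \Pr_{i \sim w_p}[h_p(x_i) \neq y_i] = \sum_{i:\, h_p(x_i) \neq y_i} w_p(i)$
11: choose the parameter $\alpha_p = 0.5 \ln((1 - \varepsilon_p)/\varepsilon_p) > 0$
12: update the weights $w_{p+1}(i) = \frac{w_p(i)}{Z_p} e^{-\alpha_p}$ if $h_p(x_i) = y_i$ and $w_{p+1}(i) = \frac{w_p(i)}{Z_p} e^{\alpha_p}$ otherwise, where $Z_p$ acts as a constant for normalization
13: end for
14: $H(X) = \mathrm{sign}\left(\sum_{p=1}^{P} \alpha_p h_p(X)\right)$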
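The overall structure of the algorithm, boosting rounds wrapped around a small bagged ensemble, can be sketched in plain Python. This is a minimal illustration under several assumptions and not the authors' MATLAB implementation: one-level decision stumps stand in for full trees, the boosting weights are applied by weighted bootstrap resampling, labels are mapped from {0, 1} to {-1, +1} so that the final sign rule applies directly, and all names and data are hypothetical.

```python
import math
import random


def train_stump(X, y, features):
    """Best one-feature threshold rule on the given features (unweighted error)."""
    best = None
    for j in features:
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] > thr else -polarity for row in X]
                err = sum(p != t for p, t in zip(preds, y))
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best[1:]


def stump_predict(stump, row):
    j, thr, polarity = stump
    return polarity if row[j] > thr else -polarity


def train_forest(X, y, weights, n_trees, n_feat, rng):
    """Inner loop: a small 'forest' of stumps on weighted bootstrap samples."""
    forest = []
    for _ in range(n_trees):
        sample = rng.choices(range(len(X)), weights=weights, k=len(X))
        feats = rng.sample(range(len(X[0])), n_feat)
        forest.append(train_stump([X[i] for i in sample],
                                  [y[i] for i in sample], feats))
    return forest


def forest_predict(forest, row):
    """Majority vote of the stumps in one forest."""
    votes = sum(stump_predict(s, row) for s in forest)
    return 1 if votes >= 0 else -1


def train_adbrf(X, y, rounds=5, n_trees=5, n_feat=2, seed=0):
    """Outer loop: AdaBoost with the bagged forest as the weak hypothesis."""
    rng = random.Random(seed)
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        forest = train_forest(X, y, weights, n_trees, n_feat, rng)
        preds = [forest_predict(forest, row) for row in X]
        err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, forest))
        # Raise the weights of misclassified instances, then normalize.
        weights = [w * math.exp(-alpha * t * p)
                   for w, p, t in zip(weights, preds, y)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def adbrf_predict(ensemble, row):
    """Sign of the alpha-weighted sum of forest votes."""
    score = sum(alpha * forest_predict(f, row) for alpha, f in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical modules with two metrics; +1 = defective, -1 = non-defective.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [6.0, 4.0], [7.0, 9.0], [8.0, 2.0]]
y = [-1, -1, -1, 1, 1, 1]
model = train_adbrf(X, y)
predictions = [adbrf_predict(model, row) for row in X]
print(predictions)
```

A production implementation would more likely rely on an established library for the trees and the boosting loop; the sketch is only meant to show how the two loops of the algorithm nest.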
EXPERIMENTAL DISCUSSION
To confirm the suggested configuration, MATLAB 2017a was used to carry out a simulation on a Windows 11 OS-powered PC.
The computer was equipped with 8 GB of random access memory and a 3.40 GHz Core i7 processor. The system's efficacy was then evaluated by comparing it with other competent techniques in terms of accuracy, sensitivity, and specificity. These metrics served as evaluation benchmarks and are described below.
Sensitivity, also known as the true positive rate, is the number of correctly identified defective modules divided by the total number of defective modules.
$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (5)$$
Specificity, also known as the true negative rate, is the number of correctly classified non-defective modules divided by the total number of non-defective modules.
$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (6)$$
Accuracy is the proportion of all modules, defective and non-defective, that are classified correctly.
$$\text{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$
Precision (P) is the ratio of correctly predicted positive observations to all predicted positives.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (8)$$
The F1 score is the harmonic mean of precision and sensitivity.
$$F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \qquad (9)$$
Here TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, with a defective module counted as a positive.
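The metrics above can all be computed from the four cells of a confusion matrix. The Python sketch below does so for a hypothetical confusion matrix. It also includes the Matthews correlation coefficient (MCC) in its standard textbook form, since MCC is reported in the result tables without a formula; that this is the exact form used in the experiments is an assumption.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Standard MCC definition (assumed; not spelled out in the text).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts: 40 defective and 60 non-defective modules.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes that none of the denominators is zero; a classifier that never predicts the positive class, for example, would need special handling for precision and F1.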
We used six classifiers (NB, MLP, AB, RF, DT, and ADBRF) and five datasets (JM1, MC1, PC3, PC4, and PC5). The experiment was carried out in three phases: in the first phase we calculated the accuracy without feature selection, in the second phase with PCA, and in the third phase with PCA-LDA.
Without feature selection, the ADBRF classifier on the MC1 dataset with 5-fold cross validation achieved an accuracy of 0.9789, the highest among all classifiers, as shown in Table II.
With PCA, the PCA+ADBRF classifier on the MC1 dataset with 5-fold cross validation achieved an accuracy of 0.9838, the highest among all classifiers, as shown in Table III.
With PCA-LDA, the PCA-LDA+ADBRF classifier on the MC1 dataset with 5-fold cross validation achieved an accuracy of 0.9888, the highest among all classifiers, as shown in Table IV.
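For readers unfamiliar with k-fold cross validation, the Python sketch below shows how a dataset can be partitioned into k folds so that every instance is used for testing exactly once. The function name `k_fold_indices` is hypothetical, and the sketch uses contiguous folds without shuffling or stratification; whether the original experiments shuffled or stratified the data is not stated.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Example: 5 folds over the 1988 instances of the MC1 dataset.
folds = list(k_fold_indices(1988, 5))
for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: {len(train)} training instances, {len(test)} test instances")
```

A classifier is trained on the training part of each fold and evaluated on the held-out part, and the reported score is typically the average over the k folds.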
TABLE II. PERFORMANCE EVALUATION WITHOUT FEATURE SELECTION
| Classifier | Dataset | Fold | Sensitivity | Specificity | Precision | F1 score | MCC | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NB | JM1 | 5 | 0.8096 | 0.486 | 0.9458 | 0.8724 | 0.1983 | 0.7828 |
| MLP | JM1 | 10 | 0.7982 | 0.5623 | 0.9799 | 0.8797 | 0.1637 | 0.7896 |
| AB | JM1 | 5 | 0.7851 | 0 | 1 | 0.8796 | 0 | 0.7851 |
| RF | JM1 | 9 | 0.8104 | 0.5695 | 0.9635 | 0.8804 | 0.2306 | 0.7934 |
| DT | JM1 | 5 | 0.7999 | 0.5833 | 0.9795 | 0.8807 | 0.1796 | 0.7916 |
| ADBRF | JM1 | 5 | 0.8097 | 0.572 | 0.965 | 0.8805 | 0.2279 | 0.7944 |
| NB | MC1 | 9 | 0.9827 | 0.0781 | 0.9089 | 0.9444 | 0.1196 | 0.8954 |
| MLP | MC1 | 9 | 0.9783 | 0.3 | 0.9964 | 0.9872 | 0.1309 | 0.9748 |
| AB | MC1 | 6 | 0.9774 | 1 | 1 | 0.9885 | 0.1458 | 0.9774 |
| RF | MC1 | 5 | 0.9793 | 0.625 | 0.9985 | 0.9888 | 0.2545 | 0.9779 |
| DT | MC1 | 8 | 0.9769 | 0 | 1 | 0.9883 | 0 | 0.9769 |
| ADBRF | MC1 | 10 | 0.9803 | 0.7 | 0.9985 | 0.9893 | 0.3201 | 0.9789 |
| NB | PC3 | 6 | 0.9559 | 0.1547 | 0.299 | 0.4556 | 0.1495 | 0.8742 |
| MLP | PC3 | 9 | 0.898 | 0.3793 | 0.9427 | 0.9198 | 0.2289 | 0.8561 |
| AB | PC3 | 5 | 0.8756 | 0 | 1 | 0.9337 | 0 | 0.8756 |
| RF | PC3 | 5 | 0.8872 | 0.5161 | 0.9841 | 0.9331 | 0.2043 | 0.8765 |
| DT | PC3 | 5 | 0.8755 | 0 | 0.9989 | 0.9331 | 0.0115 | 0.8747 |
| ADBRF | PC3 | 9 | 0.8907 | 0.5882 | 0.9852 | 0.9355 | 0.2537 | 0.8812 |
| NB | PC4 | 5 | 0.9053 | 0.5317 | 0.9468 | 0.9256 | 0.3771 | 0.8687 |
| MLP | PC4 | 7 | 0.9264 | 0.637 | 0.9523 | 0.9391 | 0.5188 | 0.8936 |
| AB | PC4 | 8 | 0.8999 | 0.6477 | 0.9721 | 0.9346 | 0.4031 | 0.8827 |
| RF | PC4 | 9 | 0.9074 | 0.75 | 0.9802 | 0.9424 | 0.4818 | 0.8967 |
| DT | PC4 | 6 | 0.8913 | 0.6984 | 0.9829 | 0.9349 | 0.3695 | 0.8819 |
| ADBRF | PC4 | 7 | 0.9119 | 0.7579 | 0.9793 | 0.9444 | 0.5085 | 0.9005 |
| NB | PC5 | 6 | 0.7609 | 0.6071 | 0.9468 | 0.8437 | 0.2452 | 0.7458 |
| MLP | PC5 | 8 | 0.7821 | 0.543 | 0.8887 | 0.832 | 0.2775 | 0.7399 |
| AB | PC5 | 5 | 0.7744 | 0.5872 | 0.9218 | 0.8471 | 0.2787 | 0.7487 |
| RF | PC5 | 9 | 0.7983 | 0.6547 | 0.9226 | 0.856 | 0.3741 | 0.765 |
| DT | PC5 | 9 | 0.7776 | 0.6211 | 0.9306 | 0.8473 | 0.3029 | 0.7569 |
| ADBRF | PC5 | 8 | 0.7969 | 0.6475 | 0.921 | 0.8545 | 0.367 | 0.7726 |
TABLE III. PERFORMANCE EVALUATION WITH PCA
| Classifier | Dataset | CV | No. of features | Sensitivity | Specificity | Precision | F1 score | MCC | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PCA+NB | JM1 | 5 | 30 | 0.8296 | 0.506 | 0.9658 | 0.8924 | 0.2183 | 0.8028 |
| PCA+MLP | JM1 | 10 | 30 | 0.8182 | 0.5823 | 0.9999 | 0.8997 | 0.1837 | 0.8096 |
| PCA+AB | JM1 | 5 | 30 | 0.8051 | 0 | 1 | 0.8996 | 0 | 0.8051 |
| PCA+RF | JM1 | 9 | 30 | 0.8304 | 0.5895 | 0.9565 | 0.9004 | 0.2506 | 0.8141 |
| PCA+DT | JM1 | 5 | 30 | 0.8199 | 0.6033 | 0.9795 | 0.9995 | 0.1996 | 0.8116 |
| PCA+ADBRF | JM1 | 5 | 30 | 0.8297 | 0.592 | 0.985 | 0.9005 | 0.2479 | 0.8144 |
| PCA+NB | MC1 | 9 | 30 | 0.9827 | 0.1281 | 0.9589 | 0.9944 | 0.1696 | 0.9454 |
| PCA+MLP | MC1 | 9 | 30 | 0.9783 | 0.35 | 0.9964 | 0.9872 | 0.1809 | 0.9784 |
| PCA+AB | MC1 | 6 | 30 | 0.9774 | 1 | 1 | 0.9885 | 0.1958 | 0.9828 |
| PCA+RF | MC1 | 5 | 30 | 0.9793 | 0.675 | 0.9985 | 0.9888 | 0.3045 | 0.9791 |
| PCA+DT | MC1 | 8 | 30 | 0.9769 | 0 | 1 | 0.9883 | 0 | 0.9788 |
| PCA+ADBRF | MC1 | 10 | 30 | 0.9803 | 0.7 | 0.9985 | 0.9893 | 0.3701 | 0.9838 |
| PCA+NB | PC3 | 6 | 30 | 0.9859 | 0.1847 | 0.329 | 0.4856 | 0.1795 | 0.9042 |
| PCA+MLP | PC3 | 9 | 30 | 0.928 | 0.4093 | 0.9727 | 0.9498 | 0.2589 | 0.8861 |
| PCA+AB | PC3 | 5 | 30 | 0.9056 | 0 | 1 | 0.9637 | 0 | 0.9056 |
| PCA+RF | PC3 | 5 | 30 | 0.9172 | 0.5461 | 0.9841 | 0.9631 | 0.2343 | 0.9065 |
| PCA+DT | PC3 | 5 | 30 | 0.9055 | 0 | 0.9989 | 0.9631 | 0.0185 | 0.9047 |
| PCA+ADBRF | PC3 | 9 | 30 | 0.9207 | 0.6182 | 0.9852 | 0.9655 | 0.2837 | 0.9112 |
| PCA+NB | PC4 | 5 | 30 | 0.9053 | 0.7817 | 0.9468 | 0.9256 | 0.6271 | 0.8697 |
| PCA+MLP | PC4 | 7 | 30 | 0.9264 | 0.887 | 0.9523 | 0.9391 | 0.7688 | 0.8946 |
| PCA+AB | PC4 | 8 | 30 | 0.8999 | 0.8977 | 0.9721 | 0.9346 | 0.6531 | 0.8838 |
| PCA+RF | PC4 | 9 | 30 | 0.9074 | 0.75 | 0.9802 | 0.9424 | 0.7318 | 0.8917 |
| PCA+DT | PC4 | 6 | 30 | 0.8913 | 0.9484 | 0.9829 | 0.9349 | 0.6195 | 0.8831 |
| PCA+ADBRF | PC4 | 7 | 30 | 0.9119 | 0.7579 | 0.9793 | 0.9444 | 0.7585 | 0.9032 |
| PCA+NB | PC5 | 6 | 30 | 0.9409 | 0.7871 | 0.9468 | 0.8437 | 0.4252 | 0.9258 |
| PCA+MLP | PC5 | 8 | 30 | 0.9621 | 0.723 | 0.8887 | 0.832 | 0.4575 | 0.9199 |
| PCA+AB | PC5 | 5 | 30 | 0.9544 | 0.7672 | 0.9218 | 0.8471 | 0.4587 | 0.9287 |
| PCA+RF | PC5 | 9 | 30 | 0.9783 | 0.8347 | 0.9226 | 0.856 | 0.5541 | 0.955 |
| PCA+DT | PC5 | 9 | 30 | 0.9576 | 0.8011 | 0.9306 | 0.8473 | 0.4829 | 0.9369 |
| PCA+ADBRF | PC5 | 8 | 30 | 0.9769 | 0.8275 | 0.921 | 0.8545 | 0.547 | 0.9526 |
TABLE IV. PERFORMANCE EVALUATION WITH PCA-LDA
| Classifier | Dataset | CV | No. of features | Sensitivity | Specificity | Precision | F1 score | MCC | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PCA-LDA+NB | JM1 | 5 | 30 | 0.8289 | 0.508 | 0.9678 | 0.8944 | 0.2203 | 0.8048 |
| PCA-LDA+MLP | JM1 | 10 | 30 | 0.8202 | 0.5843 | 0.9999 | 0.9017 | 0.1857 | 0.8116 |
| PCA-LDA+AB | JM1 | 5 | 30 | 0.8071 | 0 | 1 | 0.9016 | 0 | 0.8071 |
| PCA-LDA+RF | JM1 | 9 | 30 | 0.8304 | 0.5915 | 0.9565 | 0.9585 | 0.2526 | 0.8161 |
| PCA-LDA+DT | JM1 | 5 | 30 | 0.8219 | 0.6053 | 0.9795 | 0.9995 | 0.2016 | 0.8136 |
| PCA-LDA+ADBRF | JM1 | 5 | 30 | 0.8317 | 0.594 | 0.987 | 0.9025 | 0.2499 | 0.8164 |
| PCA-LDA+NB | MC1 | 9 | 30 | 0.9877 | 0.1331 | 0.9639 | 0.9994 | 0.1746 | 0.9504 |
| PCA-LDA+MLP | MC1 | 9 | 30 | 0.9833 | 0.355 | 0.9964 | 0.9922 | 0.1859 | 0.9834 |
| PCA-LDA+AB | MC1 | 6 | 30 | 0.9824 | 1 | 1 | 0.9935 | 0.2008 | 0.9878 |
| PCA-LDA+RF | MC1 | 5 | 30 | 0.9843 | 0.68 | 0.9985 | 0.9938 | 0.3095 | 0.9841 |
| PCA-LDA+DT | MC1 | 8 | 30 | 0.9819 | 0 | 1 | 0.9933 | 0 | 0.9838 |
| PCA-LDA+ADBRF | MC1 | 10 | 30 | 0.9853 | 0.705 | 0.9985 | 0.9943 | 0.3751 | 0.9888 |
| PCA-LDA+NB | PC3 | 6 | 30 | 0.9869 | 0.1857 | 0.33 | 0.4866 | 0.1805 | 0.9052 |
| PCA-LDA+MLP | PC3 | 9 | 30 | 0.929 | 0.4103 | 0.9737 | 0.9508 | 0.2599 | 0.8871 |
| PCA-LDA+AB | PC3 | 5 | 30 | 0.9066 | 0 | 1 | 0.9647 | 0 | 0.9066 |
| PCA-LDA+RF | PC3 | 5 | 30 | 0.9182 | 0.5471 | 0.9851 | 0.9641 | 0.2353 | 0.9075 |
| PCA-LDA+DT | PC3 | 5 | 30 | 0.9055 | 0 | 0.9999 | 0.9641 | 0.0195 | 0.9057 |
| PCA-LDA+ADBRF | PC3 | 9 | 30 | 0.9271 | 0.6192 | 0.9852 | 0.9665 | 0.2847 | 0.9122 |
| PCA-LDA+NB | PC4 | 5 | 30 | 0.9073 | 0.7837 | 0.9488 | 0.9276 | 0.6291 | 0.8717 |
| PCA-LDA+MLP | PC4 | 7 | 30 | 0.9284 | 0.889 | 0.9543 | 0.9411 | 0.7708 | 0.8966 |
| PCA-LDA+AB | PC4 | 8 | 30 | 0.9019 | 0.8997 | 0.9741 | 0.9366 | 0.6551 | 0.8858 |
| PCA-LDA+RF | PC4 | 9 | 30 | 0.9094 | 0.752 | 0.9822 | 0.9444 | 0.7338 | 0.8937 |
| PCA-LDA+DT | PC4 | 6 | 30 | 0.8933 | 0.9504 | 0.9849 | 0.9369 | 0.6215 | 0.8851 |
| PCA-LDA+ADBRF | PC4 | 7 | 30 | 0.9139 | 0.7599 | 0.9813 | 0.9464 | 0.7605 | 0.9154 |
| PCA-LDA+NB | PC5 | 6 | 30 | 0.9419 | 0.7881 | 0.9496 | 0.8447 | 0.4262 | 0.9268 |
| PCA-LDA+MLP | PC5 | 8 | 30 | 0.9631 | 0.724 | 0.8897 | 0.833 | 0.4585 | 0.9209 |
| PCA-LDA+AB | PC5 | 5 | 30 | 0.9554 | 0.7682 | 0.9228 | 0.8481 | 0.4597 | 0.9297 |
| PCA-LDA+RF | PC5 | 9 | 30 | 0.9793 | 0.8357 | 0.9236 | 0.857 | 0.5551 | 0.946 |
| PCA-LDA+DT | PC5 | 9 | 30 | 0.9586 | 0.8021 | 0.9316 | 0.8483 | 0.4839 | 0.9379 |
| PCA-LDA+ADBRF | PC5 | 8 | 30 | 0.9779 | 0.8285 | 0.922 | 0.8555 | 0.548 | 0.9536 |
Fig. 2 and Fig. 3 show the accuracy on the JM1 dataset without and with feature selection. ADBRF and PCA-LDA+ADBRF produced the highest accuracies, 0.7944 and 0.8164 respectively.
Fig 2. JM1 dataset accuracy without feature selection
Fig 3. JM1 dataset accuracy with feature selection
Fig. 4 and Fig. 5 show the accuracy on the MC1 dataset without and with feature selection. ADBRF and PCA-LDA+ADBRF produced the highest accuracies, 0.9789 and 0.9888 respectively.
Fig 4. MC1 dataset accuracy without feature selection
Fig 5. MC1 dataset accuracy with feature selection
Fig. 6 and Fig. 7 show the accuracy on the PC3 dataset without and with feature selection. ADBRF and PCA-LDA+ADBRF produced the highest accuracies, 0.8812 and 0.9122 respectively.
Fig 6. PC3 dataset accuracy without feature selection
Fig 7. PC3 dataset accuracy with feature selection
Fig. 8 and Fig. 9 show the accuracy on the PC4 dataset without and with feature selection. ADBRF and PCA-LDA+ADBRF produced the highest accuracies, 0.9005 and 0.9154 respectively.
Fig 8. PC4 dataset accuracy without feature selection
Fig 9. PC4 dataset accuracy with feature selection
Fig. 10 and Fig. 11 show the accuracy on the PC5 dataset without and with feature selection, respectively. ADBRF and PCA-LDA+ADBRF achieve the highest accuracy in the two figures, 0.7726 and 0.9536 respectively.
Fig 10. PC5 dataset accuracy without feature selection
Fig 11. PC5 dataset accuracy with feature selection
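The comparison across Figs. 2 to 11 can be summarised as the absolute accuracy gain that feature selection brings to the ADBRF classifier on each dataset. The short Python sketch below only restates the best accuracies quoted in the text above; the variable names are illustrative and not part of the original study.

```python
# Best accuracies quoted in the text for each dataset:
# (ADBRF without feature selection, PCA-LDA+ADBRF with feature selection).
best_accuracy = {
    "JM1": (0.7944, 0.8164),
    "MC1": (0.9789, 0.9888),
    "PC3": (0.8812, 0.9122),
    "PC4": (0.9005, 0.9154),
    "PC5": (0.7726, 0.9536),
}

# Absolute gain in accuracy from adding PCA-LDA, rounded to four decimals
# to match the precision of the reported values.
gains = {
    name: round(with_fs - without_fs, 4)
    for name, (without_fs, with_fs) in best_accuracy.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.4f}")

# Dataset on which feature selection helps the most.
largest_gain = max(gains, key=gains.get)
print("Largest gain:", largest_gain)
```

On these figures the gain is positive for every dataset, and it is largest on PC5 and smallest on MC1, where the accuracy without feature selection is already close to 1.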
CONCLUSION & FUTURE WORK
In this work, we have devised an accurate PCA-LDA+ADBRF method to detect faults in individual software modules. The technique uses PCA combined with LDA to reduce the dimensionality of the feature vector. An automated classifier is then built by combining two algorithms, AdaBoost (AB) and random forest (RF), into a single ensemble known as ADBRF. Empirical results show that, on six different datasets, the PCA-LDA+ADBRF method performs better than the other schemes it was compared with. In particular, on the MC1 dataset, the method achieves an accuracy of 0.9888. This approach enables early fault detection, which saves time and consequently lowers software project costs. The need for more precise and effective defect prediction techniques will continue to interest researchers, and the development of other hybrid models that predict software faults more accurately is an area for further research.
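As an illustration of the kind of pipeline summarised above, the following Python sketch chains scaling, PCA, LDA and an AdaBoost ensemble whose base learner is a random forest, using scikit-learn. It is one possible reading of PCA-LDA+ADBRF and not the authors' implementation: the synthetic data from `make_classification` stands in for the NASA datasets, and the number of principal components, the tree depth, and the ensemble sizes are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced two-class data standing in for a defect dataset
# (assumption: the real study uses NASA datasets such as MC1 or PC5).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Random forest used as the base learner that AdaBoost re-weights and combines.
# Shallow trees keep each base learner weak (illustrative settings).
base_forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)

model = make_pipeline(
    StandardScaler(),              # put all metrics on a common scale
    PCA(n_components=8),           # unsupervised reduction (8 is an assumption)
    LinearDiscriminantAnalysis(),  # supervised projection; 1 axis for 2 classes
    AdaBoostClassifier(base_forest, n_estimators=10, random_state=0),
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
mcc = matthews_corrcoef(y_test, y_pred)
print(f"accuracy={accuracy:.4f}  MCC={mcc:.4f}")
```

With two classes, scikit-learn's LDA projects onto a single discriminant axis, so in this sketch the ensemble sees one feature. A study that retains several features after reduction may combine PCA and LDA differently, which is why the sketch should be read as a starting point only.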
REFERENCES
[1] Hammouri, Awni, et al. "Software bug prediction using machine learning approach." International journal of advanced computer science and applications 9.2 (2018).
[2] Mustaqeem, Mohd, and Mohd Saqib. "Principal component based support vector machine (PC-SVM): a hybrid technique for software defect detection." Cluster Computing 24.3 (2021): 2581-2595.
[3] Das M, Pradhan D, Mohapatra S, "A PCA BASED SOFTWARE FAULT PREDICTION MODEL USING ADRF", International Journal of Emerging Technologies and Innovative Research (www.jetir.org | UGC and issn Approved), ISSN:2349-5162, Vol.10, Issue 6, pp. j189-j199, June-2023.
[4] Rathore, Santosh S., and Sandeep Kumar. "An empirical study of ensemble techniques for software fault prediction." Applied Intelligence 51 (2021): 3615-3644.
[5] Meenakshi, Dr, and Satwinder Singh. "Software Bug Prediction using Machine Learning Approach." (2019).
[6] Iqbal, Ahmed, et al. "Performance analysis of machine learning techniques on software defect prediction using NASA datasets." International Journal of Advanced Computer Science and Applications 10.5 (2019).
[7] Li, J., He, P., Zhu, J., & Lyu, M. R. (2017, July). Software defect prediction via convolutional neural network. In 2017 IEEE international conference on software quality, reliability and security (QRS) (pp. 318-328). IEEE.
[8] Tian, Z., Xiang, J., Zhenxiao, S., Yi, Z., & Yunqiang, Y. (2019, December). Software defect prediction based on machine learning algorithms. In 2019 IEEE 5th International Conference on Computer and Communications (ICCC) (pp. 520-525). IEEE.
[9] Yalçıner, Burcu, and Merve Özdeş. "Software defect estimation using machine learning algorithms." 2019 4th International Conference on Computer Science and Engineering (UBMK). IEEE, 2019.
[10] Wang, Fei, Jun Ai, and Zhuoliang Zou. "A cluster-based hybrid feature selection method for defect prediction." 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2019.
[11] D. Pradhan and D. Muduli, "Software Defect Prediction Model Using AdaBoost based Random Forest Technique," 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 2023, pp. 1-6, doi: 10.1109/ICCCNT56998.2023.10308208.
[12] Sharma, S. K., Muduli, D., Pradhan, D., & Nanda, S. K. Automated glaucoma detection model based on 2-D discrete wavelet transform with ensemble learning approach, volume 13, number 3, September 2023.
[13] Pradhan, Debasish, Sasank Sekhar Dalai, and Mandakini Priyadarsini Behera. "A comparative study on evolutionary model for software development." Int J Eng Res Technol 8.1 (2020): 1-3.
[14] Sahu, Padma Charan, Bibhu Prasad, Ratnakar Dash, Debendra Muduli, Santosh Kumar Sharma, and Debasish Pradhan. "Automated Modulation Classification in Wireless Communication: A Deep Convolution Neural Network based Approach.", volume 13, number 3, September 2023.
[15] Sharma, Deepak, and Pravin Chandra. "Software fault prediction using machine-learning techniques." Smart Computing and Informatics: Proceedings of the First International Conference on SCI 2016, Volume 2. Springer Singapore, 2018.
[16] Malhotra, Ruchika. "A systematic review of machine learning techniques for software fault prediction." Applied Soft Computing 27 (2015): 504-518.
[17] Pandey, Sushant Kumar, Ravi Bhushan Mishra, and Anil Kumar Tripathi. "Machine learning based methods for software fault prediction: A survey." Expert Systems with Applications 172 (2021): 114595.
[18] Catal, Cagatay, and Banu Diri. "A systematic review of software fault prediction studies." Expert systems with applications 36.4 (2009): 7346-7354.
[19] Batool, Iqra, and Tamim Ahmed Khan. "Software fault prediction using data mining, machine learning and deep learning techniques: A systematic literature review." Computers and Electrical Engineering 100 (2022): 107886.
[20] Hall, Tracy, and David Bowes. "The state of machine learning methodology in software fault prediction." 2012 11th international conference on machine learning and applications. Vol. 2. IEEE, 2012.
[21] Prabha, C. Lakshmi, and N. Shivakumar. "Software defect prediction using machine learning techniques." 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184). IEEE, 2020.
[22] Satapathy, Suresh Chandra, et al. "A novel approach of software fault prediction using deep learning technique." Automated Software Engineering: A Deep Learning-Based Approach (2020): 73-91.
[23] Gondra, Iker. "Applying machine learning to software fault-proneness prediction." Journal of Systems and Software 81.2 (2008): 186-195.
[24] Singh, Praman Deep, and Anuradha Chug. "Software defect prediction analysis using machine learning algorithms." 2017 7th international conference on cloud computing, data science & engineering-confluence. IEEE, 2017.
[25] Bhandari, Guru Prasad, and Ratneshwer Gupta. "Machine learning based software fault prediction utilizing source code metrics." 2018 IEEE 3rd International Conference on Computing, Communication and Security (ICCCS). IEEE, 2018.
[26] Ha, Thi Minh Phuong, et al. "Experimental study on software fault prediction using machine learning model." 2019 11th International conference on knowledge and systems engineering (KSE). IEEE, 2019.
[27] Pavana, M. S., M. N. Pushpalatha, and A. Parkavi. "Software Fault Prediction Using Machine Learning Algorithms." International Conference on Advances in Electrical and Computer Technologies. Singapore: Springer Nature Singapore, 2021.
[28] Ceylan, Evren, F. Onur Kutlubay, and Ayse B. Bener. "Software defect identification using machine learning techniques." 32nd EUROMICRO Conference on Software Engineering and Advanced Applications (EUROMICRO'06). IEEE, 2006.
[29] Kumar, Amod, and Ashwni Bansal. "Software fault proneness prediction using genetic based machine learning techniques." 2019 4th International Conference on Internet of Things: Smart Innovation and Usages (IoT-SIU). IEEE, 2019.
[30] Erturk, Ezgi, and Ebru Akcapinar Sezer. "A comparison of some soft computing methods for software fault prediction." Expert systems with applications 42.4 (2015): 1872-1879.