Data Normalization And Identification Of Differentially Expressed Genes By Multiple Hypothesis Testing Procedures

DOI : 10.17577/IJERTV2IS3268


  1. Research Scholar, Department of Statistics, National Institute for Research in Tuberculosis, Indian Council of Medical Research, Chennai, India

  2. Department of Statistics, National Institute for Research in Tuberculosis, Indian Council of Medical Research, Chennai, India

  1. INTRODUCTION: In recent years, the advent of microarray technology has greatly benefited biology by enabling the analysis of gene expression data. DNA microarray analysis allows highly parallel, simultaneous monitoring of the whole genome (Brown and Botstein, 1999). Microarray technology has been increasingly used to detect differentially expressed genes (Spellman et al., 1998). The key objective in analyzing such experiments is to compare gene expression levels under varying conditions and to identify differentially expressed genes. The regular t-test and permutation tests are commonly applied for this purpose. The t-statistic can be regarded as the ratio of between-class to within-class variability of the gene expression data. Efron et al. (2001), Tusher et al. (2001) and Chu et al. (2000) developed a different strategy in which a variance component s0 is introduced to improve the reliability of the test statistic.

    1. Normalization: Normalization is an important part of data processing. It adjusts the individual intensities so that comparisons can be made both within and between the arrays of an experiment. Raw data are normalized mainly to remove bias that arises from variation in the microarray technology rather than from biological differences between the RNA samples or the printed probes. Such variation arises, for example, from differences in print quality or in ambient conditions when the plates are processed. Normalization procedures differ in the kind of average used and in the sources of variability taken into account (Yang et al., 2002). In microarray studies, gene expression values often lack desirable statistical properties such as normality or constant variance. The values are therefore transformed to allow better inference from the data. The commonly applied transformations are:

      1. Logarithmic transformation:

        The most commonly applied transformation is the logarithmic transformation. The logarithm is a monotonic function, so it preserves the ordering of the expression values. Let $x_{ij}$ represent the expression value of the $i$th gene in the $j$th sample. Then

        $y_{ij} = \log(x_{ij})$. (1)

        The logarithm may be taken to base 2, base 10, or base $e$ (the natural logarithm). The logarithmic transformation is applied to microarray data because it tends to yield values that are approximately normally distributed. Figures (ii), (iii) and (iv) show box plots of the log-transformed data for the three arrays.
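As a minimal sketch of equation (1), assuming base 2 and strictly positive intensities, the transformation can be written as follows. The function name `log2_transform` and the sample values are hypothetical and serve only as illustration.

```python
import math


def log2_transform(intensities):
    """Apply y = log2(x) to each raw intensity; all values must be positive."""
    if any(x <= 0 for x in intensities):
        raise ValueError("log transformation requires strictly positive intensities")
    return [math.log2(x) for x in intensities]


# Hypothetical raw intensities for one gene across four samples.
raw = [256.0, 512.0, 1024.0, 4096.0]
logged = log2_transform(raw)
print(logged)
```

On the base-2 scale, each doubling of the raw intensity becomes a step of one unit, which is one reason this base is convenient for expression ratios.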

      2. Box-Cox transformation:

        The Box-Cox transformation defines

        $y_{ij} = \dfrac{x_{ij}^{\lambda} - 1}{\lambda}, \quad i = 1, 2, \ldots, \quad j = 1, 2, \ldots$ (2)

        The square root transformation corresponds to the parameter $\lambda$ being ½.
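The Box-Cox family can be sketched in a few lines, assuming the standard form $(x^{\lambda} - 1)/\lambda$ with the natural logarithm as the limiting case when $\lambda = 0$ (a standard convention, not stated in the text above). The function name `box_cox` is hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox transformation requires x > 0")
    if lam == 0:
        # Limiting case of (x**lam - 1) / lam as lam tends to 0.
        return math.log(x)
    return (x ** lam - 1.0) / lam


# With lam = 0.5 the result is 2 * sqrt(x) - 2, an affine function of sqrt(x).
print(box_cox(4.0, 0.5))
```

Because the case $\lambda = ½$ differs from $\sqrt{x}$ only by a shift and a scale factor, it leads to the same inference as the plain square root transformation.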

        Rocke and Lorenzato (1995) and Durbin et al. (2002) propose a two-component error model for gene expression data in microarray analysis. Let $X$ denote the raw expression value, $\mu$ the mean expression level and $b$ the background noise. The model is then given by

        $X = b + \mu e^{\eta} + \varepsilon, \quad \varepsilon \sim N(0, \sigma^{2})$, (3)

        where the random variables $\eta$ and $\varepsilon$ are taken to be independent.

        Box plots of the three groups of data after log transformation are shown in Fig. (i), (iii), (iiii).

      3. Square root transformation:

        In microarray studies, the intensity readings are proportional to the number of occurrences of fundamental molecular events such as hybridizations. The constant of proportionality is the quantum of fluorescence or radiation produced by a single fundamental event. The fundamental events fall into two categories: true gene expression and noise, where the noise consists of events that are not of scientific interest. Counts of such events are modeled by the Poisson distribution, so the gene expression value $x_{ij}$ is proportional to a Poisson variable. In this case

        $y_{ij} = \sqrt{x_{ij}}, \quad i = 1, 2, \ldots, \quad j = 1, 2, \ldots$

        is the variance-stabilizing transformation.
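Why the square root stabilizes the variance can be seen from a first-order (delta method) approximation. The sketch below assumes $x = cN$ with $N$ a Poisson variable of mean $\theta$; the symbols $c$, $N$ and $\theta$ are introduced here for illustration only.

```latex
\begin{align*}
E(x) &= c\,\theta, \qquad \operatorname{Var}(x) = c^{2}\theta, \\
\operatorname{Var}\!\left(\sqrt{x}\right)
  &\approx \left( \frac{d\sqrt{u}}{du}\Big|_{u = c\theta} \right)^{2} \operatorname{Var}(x)
   = \frac{1}{4c\theta}\, c^{2}\theta
   = \frac{c}{4}.
\end{align*}
```

To this order of approximation, the variance of $\sqrt{x}$ no longer depends on the mean $\theta$, which is the sense in which the transformation is variance stabilizing.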

  2. 3.1. t-test:

    Comparison of two groups of samples:

    The data analyzed are assumed to be normalized and divided into two subgroups. Normalization is based on the log ratios of the gene intensities. In microarray data analysis, where the number of features $p$ far exceeds the number of samples $N$ ($p \gg N$), the main goal is to assess the significance of individual features. The feature assessment problem leads to a multiple hypothesis testing problem. Suppose we have two groups of samples, the normal group and the infected group. To identify the informative, or significant, genes, a two-sample t-statistic is computed for each gene.

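    $t_j = \dfrac{\bar{x}_{2j} - \bar{x}_{1j}}{s_j}, \quad j = 1, 2, \ldots$ (4)

    where $\bar{x}_{1j}$ and $\bar{x}_{2j}$ are the mean expression values of gene $j$ in the two groups and $s_j$ is the pooled standard error for gene $j$:

    $s_j = \hat{\sigma}_j \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}, \quad \hat{\sigma}_j^{2} = \dfrac{1}{n_1 + n_2 - 2} \left( \sum_{k \in C_1} (x_{kj} - \bar{x}_{1j})^{2} + \sum_{k \in C_2} (x_{kj} - \bar{x}_{2j})^{2} \right)$

    Here $x_{kj}$ is the expression value of gene $j$ in sample $k$, and $C_1$ and $C_2$ are the sets of samples in the two groups, of sizes $n_1$ and $n_2$.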

    A histogram can be constructed for the t-statistics. From the histogram, cut-off values $t_0$ and $t_1$ (the left and right critical values) are determined. If the $t_j$ are approximately normally distributed, any value greater than two in absolute value is considered significant. Because this procedure tests all genes simultaneously, it gives rise to the multiple testing problem. In the multiple testing problem, a p-value is calculated for each gene from the theoretical probabilities under the normal distribution.
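The two-sample statistic of equation (4) can be sketched for a single gene as follows, assuming the usual pooled-variance form with $n_1 + n_2 - 2$ degrees of freedom. The function name `pooled_t` and the expression values are hypothetical.

```python
import math


def pooled_t(group1, group2):
    """Return (t, se) for one gene: pooled two-sample t-statistic and its standard error."""
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    # Within-group sums of squares, pooled over both groups.
    ss1 = sum((x - mean1) ** 2 for x in group1)
    ss2 = sum((x - mean2) ** 2 for x in group2)
    sigma2 = (ss1 + ss2) / (n1 + n2 - 2)
    se = math.sqrt(sigma2 * (1.0 / n1 + 1.0 / n2))
    return (mean2 - mean1) / se, se


# Hypothetical log intensities of one gene in four normal and four infected samples.
normal = [1.0, 2.0, 3.0, 4.0]
infected = [3.0, 4.0, 5.0, 6.0]
t_value, std_err = pooled_t(normal, infected)
print(t_value, std_err)
```

Applying the function gene by gene gives the collection of t-values whose histogram is used to choose the cut-offs.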

    In computing t-statistics for microarray data sets, the standard error $s_j$ is not very reliable because the number of samples in each group is very small. Some $s_j$ values are therefore underestimated, so genes with small variation give rise to extreme values of $t$ and appear as false positives. To overcome this problem, Tusher et al. (2001) suggested a new statistic,

    $t_j^{*} = \dfrac{\bar{x}_{2j} - \bar{x}_{1j}}{s_j + s_0}$, (5)

    where $s_0$ is the median of the standard errors of all the genes. Efron et al. (2001) suggested taking the fudge factor $s_0$ as a particular percentile of the standard errors:

    $t_j^{*} = \dfrac{\bar{x}_{2j} - \bar{x}_{1j}}{s_j + s_0}$. (6)

    In both cases, if $s_0 = 0$ the statistic reduces to the ordinary t-statistic (4), which for two groups is the signed square root of the F-statistic.
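The effect of the fudge factor can be illustrated with a short sketch. It assumes the denominator $s_j + s_0$ of Tusher et al. (2001) and takes $s_0$ as the median of the standard errors unless another value is supplied; the function name `moderated_t` and the numbers are hypothetical.

```python
import statistics


def moderated_t(mean_diffs, std_errors, s0=None):
    """Divide each mean difference by (s_j + s0); s0 defaults to the median s_j."""
    if s0 is None:
        s0 = statistics.median(std_errors)
    return [d / (s + s0) for d, s in zip(mean_diffs, std_errors)]


# Hypothetical mean differences and standard errors for three genes.
mean_diffs = [0.1, 2.0, 1.0]
std_errors = [0.01, 0.5, 1.0]

ordinary = moderated_t(mean_diffs, std_errors, s0=0.0)  # reduces to the plain t
shrunken = moderated_t(mean_diffs, std_errors)          # s0 = median = 0.5
print(ordinary)
print(shrunken)
```

The first gene has a tiny standard error and therefore the largest ordinary t-value despite its small mean difference; with the fudge factor it drops to the bottom of the ranking.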

    3.2. Permutation tests in microarray data:

    The t- and F-tests rest on a normality assumption, so applying them to detect differentially expressed genes may be unreliable when that distributional assumption does not hold. In such situations, a family of tests called permutation tests, or randomization tests, offers an alternative. Consider a testing situation in which $n$ experimental units are randomly divided into two groups of $n_1$ and $n_2$ units, where $n = n_1 + n_2$; the $n_1$ units correspond to the control condition and the $n_2$ units to the treatment condition. The response is denoted $y_{ij}$ for unit $j = 1, 2, \ldots, n_i$ in group $i = 1, 2$. Let $H_0$ be the null hypothesis that the response pattern does not differ between units under the treatment and control conditions. When $H_0$ is true, the random assignment of experimental units to the two conditions implies that all possible assignments of the $n$ observations to groups of $n_1$ and $n_2$ cases are equally probable. Each arrangement may be viewed as a permutation of the $n$ responses in which the first $n_1$ values are assigned to group I and the remainder to group II. The number of distinct arrangements is

    $R = \dfrac{(n_1 + n_2)!}{n_1! \, n_2!}$. (7)

    To test whether the response patterns under the treatment and control conditions share a common feature or differ on it, the test statistic is the difference in mean response between the two groups,

    $d = \dfrac{1}{n_2} \sum_{j=1}^{n_2} y_{2j} - \dfrac{1}{n_1} \sum_{j=1}^{n_1} y_{1j} = \bar{y}_2 - \bar{y}_1$, (8)

    where $\bar{y}_i$ is the mean of the $i$th group. $H_0$ is rejected if $|d|$ is too large and accepted otherwise. Under $H_0$, each permutation $q$ of the $n$ responses can be considered a realization of the experimental study, and each leads to a calculated difference in mean response, denoted $d_q$. The permutation procedure yields a total of $R$ such differences $d_q$, where $R$ is defined in equation (7). In statistical hypothesis testing, a p-value measures how consistent the data are with the null hypothesis. In a permutation test, the p-value is the fraction of the $R$ calculated differences $d_q$ that are greater than or equal to the observed difference $d^{*}$ in absolute value, that is,

    $p = \dfrac{\#\{\, q : |d_q| \ge |d^{*}| \,\}}{R}$. (10)
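The permutation procedure can be sketched as an exact enumeration of all arrangements, which is feasible for small groups such as four against four. The sketch assumes the difference in group means as the test statistic and a two-sided p-value; the function name `permutation_p_value` and the data are hypothetical.

```python
from itertools import combinations
from math import comb


def permutation_p_value(control, treatment):
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = list(control) + list(treatment)
    n1, n = len(control), len(pooled)
    n2 = n - n1
    total = sum(pooled)
    d_obs = sum(treatment) / n2 - sum(control) / n1
    count = 0
    # Each choice of n1 indices for group I is one equally probable arrangement.
    for idx in combinations(range(n), n1):
        s1 = sum(pooled[i] for i in idx)
        d_q = (total - s1) / n2 - s1 / n1
        # Small tolerance guards against floating-point ties being missed.
        if abs(d_q) >= abs(d_obs) - 1e-12:
            count += 1
    return count / comb(n, n1)


# Hypothetical responses for four control and four treatment units.
control = [1.0, 2.0, 3.0, 4.0]
treatment = [5.0, 6.0, 7.0, 8.0]
n_arrangements = comb(len(control) + len(treatment), len(control))
p_value = permutation_p_value(control, treatment)
print(n_arrangements, p_value)
```

With four units per group there are only 70 arrangements, so the smallest attainable two-sided p-value is 2/70, roughly 0.029. This limits how strongly any single gene can be declared significant when the groups are this small.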

  3. Application to microarray data:

    The data studied were obtained from the NCBI Gene Expression Omnibus. They comprise the log intensities of 12607 genes in a breast cancer data set. Data filtering was carried out by the stem-and-leaf method. The data consist of ten samples in three groups: Group I (control), 4 samples; Group II (treatment), 4 samples; and Group III (normal tissue), 2 samples. Gene filtering yielded a total of 103 outliers from the three groups, which we numbered 1 to 103. Data normalization of the three groups is shown by box plots in Fig. (i), (ii) and (iii). The t-test was applied between Group I and Group II. The t-values with fudge factor $s_0$ were calculated taking $s_0$ as the median of the $s_j$ (Tusher et al., 2001) and as percentiles of the $s_j$ (Efron et al., 2001), giving t1 (median), t2 (45th percentile), t3 (50th percentile) and t4 (55th percentile). The t-values show that around 4 to 8 genes account for the maximum variation, that is, they are differentially expressed. A comparison of the statistics shows that the usual t-statistic and the statistic with the fudge factor of Tusher et al. (2001) give similar results. A histogram was constructed for the t-values obtained from equation (4). A cut-off value of ±0.5 identifies the differentially expressed genes.

  1. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, B 57, 289-300.

  2. Dudoit, S., Yang, Y. H., Callow, J. and Speed, T. P. (2000). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578.

  3. Dudoit, S., Shaffer, J. P. and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, 18, 71-103.

  4. Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96, 1151-1160.

  5. Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. London: Chapman and Hall.

  6. Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer.

  7. Kerr, M. K., Martin, M. and Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology, 7, 819-837.

  8. Lambert, D. (1990). Robust two-sample permutation tests. Annals of Statistics, 13, 606-625.

  9. Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genome wide expression. Proceedings of the National Academy of Sciences, USA, 100, 9440-9445.

  10. Tusher, V. G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, USA, 98(9), 5116-5121.

  11. Yang, Y. H. and Speed, T. (2002). Design issues for cDNA microarray experiments. Nature Reviews Genetics, 3, 579-588.

  12. Zhao, Y. and Pan, W. (2003). Modified non-parametric approaches to detecting differentially expressed genes in replicated microarray experiments. Bioinformatics, 19(9), 1046-1054.
