- Authors : R. Rakesh Kumar, A. Bhanu Prasad
- Paper ID : IJERTV2IS2316
- Volume & Issue : Volume 02, Issue 02 (February 2013)
- Published (First Online): 28-02-2013
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Horizontal Aggregations in SQL by Using Clustering as a Mixture of Gaussians
R. Rakesh Kumar, M.Tech Software Engineering, Vardhaman College of Engineering, Hyderabad, India.
A. Bhanu Prasad, Associate Professor, Department of IT, Vardhaman College of Engineering, Hyderabad, India.
Abstract
To analyze data efficiently, data mining systems widely use data sets whose columns are in a horizontal tabular layout. Preparing such a data set is one of the more complex tasks in a data mining project: it requires many SQL queries, joins across tables, and aggregation of columns. Conventional RDBMSs usually manage tables in vertical form. An aggregation in a horizontal tabular layout returns a set of numbers per group, instead of one number per row. The system uses one parent table and several child tables; operations are then performed on the data loaded from these tables. The PIVOT operator offered by the RDBMS is used to compute the aggregations; it is fast and scales well. Partitioning the large data set produced by horizontal aggregation into homogeneous clusters is an important task in this system, and a mixture-of-Gaussians algorithm implemented in SQL is well suited to it.
Keywords: Aggregation, Structured Query Language (SQL), PIVOT, Gaussians algorithm.
Introduction
Horizontal aggregations are a new class of functions that return aggregated columns in a horizontal layout. Most data mining algorithms require data sets in this layout as input, with several records and one variable or dimension per column. Managing large data sets without DBMS support can be difficult, and trying different subsets of data points and dimensions is more flexible, faster, and easier inside a relational database with SQL queries than outside it with alternative tools. Horizontal aggregations can be performed with an operator that can easily be implemented inside a query processor, much like select, project, and join. The PIVOT operator transposes rows and columns of tabular data, enabling data analysis and data presentation. There are many existing functions and operators for aggregation in SQL: the most commonly used is the sum of a column, and other operators return the average, maximum, minimum, or row count over groups of rows. All of these aggregations have limitations when building large data sets for data mining purposes. Database schemas are also highly normalized for On-Line Transaction Processing (OLTP) systems, where data sets are stored in a relational database or data warehouse; but data mining, statistical, and machine learning algorithms generally require aggregated data in summarized form. Since these algorithms need input in cross-tabular (horizontal) form, significant effort is required to compute the aggregations, owing to the amount and complexity of SQL code that must be written, optimized, and tested. Data aggregation is a process in which information is gathered and expressed in summary form, for purposes such as statistical analysis; a common aggregation goal is to learn more about particular groups based on specific variables such as age, name, phone number, address, profession, or income. The horizontal layout is the form used by models such as clustering, classification, regression, and PCA; the dimensions used by these data mining techniques are point dimensions. Horizontal aggregations offer several advantages. First, they represent a template for generating SQL code, which reduces manual work in the data preparation phase of a data mining project. Second, the automatically generated code is more efficient than SQL written by an end user, so data sets for data mining projects can be created in less time. Third, the data sets can be created entirely inside the DBMS.
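To make the contrast concrete, consider the following minimal sketch (our own illustration, not code from any cited system), assuming a hypothetical fact table sales(storeId, month, amount):

-- Vertical aggregation: one row per (storeId, month) group.
SELECT storeId, month, SUM(amount) AS totalAmount
FROM sales
GROUP BY storeId, month;

-- Horizontal aggregation: one row per store, one column per month.
-- The CASE form is portable; RDBMSs with a PIVOT operator can state
-- the same transposition more compactly.
SELECT storeId,
  SUM(CASE WHEN month = 1 THEN amount ELSE 0 END) AS amountJan,
  SUM(CASE WHEN month = 2 THEN amount ELSE 0 END) AS amountFeb,
  SUM(CASE WHEN month = 3 THEN amount ELSE 0 END) AS amountMar
FROM sales
GROUP BY storeId;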
Mixture-of-Gaussians clustering is then used to cluster the attributes produced by the horizontal aggregation. The rest of the paper is organized as follows: the next sections present the existing methods for aggregation, the clustering of the aggregated data set, and the conclusion.
Related Work
SQL extensions have been proposed to define aggregate functions for association rule mining; their optimizations aim to avoid joins when expressing cell formulas, but they are not optimized to perform a partial transposition for each group of result rows. C. Cunningham et al. [1] proposed optimization and execution strategies in an RDBMS for two operators: PIVOT, which operates on tabular data and exchanges rows and columns, enabling data transformations useful in data modeling, data analysis, and data presentation, and its inverse UNPIVOT. These operators can quite easily be implemented inside a query processor, much like select, project, and join, and such a design provides opportunities for better performance during both query optimization and query execution. PIVOT is an extension of GROUP BY with unique restrictions and optimization opportunities, which makes it easy to introduce incrementally on top of existing grouping implementations. H. Wang and C. Zaniolo [2] proposed ATLaS, a small but complete SQL extension for data mining and data streams. It is a powerful database language and system that enables users to develop complete data-intensive applications in SQL by writing new aggregates and table functions in SQL, rather than in procedural languages as in current object-relational systems. The ATLaS system includes applications such as various data mining functions coded in ATLaS SQL, which execute with a modest (20-40%) performance overhead relative to the same applications written in C/C++; the system can also handle continuous queries using the schema and queries in its Query Repository. S. Sarawagi, S. Thomas, and R. Agrawal [3] studied integrating association rule mining with relational database systems, considering several architectures: loose coupling through a SQL cursor interface; encapsulation of a mining algorithm in a stored procedure; caching the data to a file system on the fly; and tight coupling, using primarily user-defined functions and
SQL implementations that process the data inside the DBMS. For the loose-coupling and stored-procedure architectures, an implementation of the Apriori algorithm can be used to find association rules. C. Ordonez [4] proposed an integration of K-means clustering with a relational DBMS using SQL, consisting of three SQL implementations: a straightforward translation of the K-means computations into SQL, which serves as a framework; an optimized version built on it, with efficient indexing, sufficient statistics, and rewritten queries, giving superior performance; and an incremental version that uses the optimized version as a building block to achieve fast convergence and automated reseeding. G. Graefe, U. Fayyad, and S. Chaudhuri [5] introduced efficient gathering of sufficient statistics for classification from large SQL databases. This technique uses a SQL operator (UNPIVOT) that enables efficient gathering of statistics with minimal changes to the SQL backend. What is needed is a set of counts of the co-occurrences of each attribute value with each class label; since the number of attribute values in classification is not large (in the hundreds), the counts table is fairly small, and continuous-valued attributes are discretized into a set of intervals. The most familiar selection measures used in classification do not require the entire data set, only sufficient statistics of the data, yet a straightforward implementation for deriving these statistics on a SQL database yields unacceptably poor performance. The problem of optimizing queries with outer joins is not new: optimizing joins by reordering operations and using transformation rules has been studied, but that work does not consider optimizing a complex query that contains several outer joins on primary keys only, which is fundamental for preparing data sets for data mining. Traditional query optimizers use hyper-graphs to provide a more comprehensive set of potential plans.
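To make the sufficient-statistics gathering of [5] concrete, a minimal sketch follows (our illustration, with a hypothetical table records(age, income, class) whose attributes are already discretized). The co-occurrence counts that classification measures need come from plain grouping, and an UNPIVOT-style rewrite lets one grouped scan cover every attribute:

-- Counts of each attribute value together with each class label.
SELECT age, class, COUNT(*) AS cnt
FROM records
GROUP BY age, class;

-- UNPIVOT-style stacking: one scan yields counts for all attributes,
-- assuming the attribute values share a common (discretized) type.
SELECT attr, val, class, COUNT(*) AS cnt
FROM (SELECT 'age' AS attr, age AS val, class FROM records
      UNION ALL
      SELECT 'income' AS attr, income AS val, class FROM records) u
GROUP BY attr, val, class;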
J. Gray, A. Bosworth, A. Layman, and H. Pirahesh [6] proposed a relational aggregation operator generalizing GROUP BY, cross-tab, and sub-totals. The CUBE operator generalizes the histogram, cross-tabulation, roll-up, drill-down, and sub-total constructs, and can be embedded in more complex non-procedural data analysis programs and data mining. It treats each of the N aggregation attributes as a dimension of an N-space; the aggregate of a particular set of attribute values is a point in this space, and the set of points forms an N-dimensional cube. Super-aggregates are computed by aggregating the N-cube to lower-dimensional spaces. Creating a data cube requires generating the power set (the set of all subsets) of the aggregation columns; since CUBE is an aggregation operation, it makes sense to externalize it by overloading the SQL GROUP BY operator. G. Luo, J.F. Naughton, C.J. Ellmann, and M. Watzke [7] observed that immediate materialized view maintenance introduces many lock conflicts and deadlocks, leaving the system with a low level of concurrency and a high level of deadlocks. To solve this materialized view update problem, V-locks (view locks) are augmented with a value-based latch pool. Direct propagate updates push updates on base relations directly to the materialized view without recomputing the join. Locks have interesting properties with respect to granularity and concurrency: finer-granularity locking yields higher concurrency. In the no-lock locking protocol, as in the V-locking protocol, updaters of the materialized view must acquire X locks on the tuples of the base relations they update and S locks on the tuples of the other base relations mentioned in the view. Xiang Lian and Lei Chen [9] analyzed cost models for evaluating dimensionality reduction in high-dimensional spaces. They give general cost models for evaluating query performance over data sets reduced by GDR, LDR, and ADR for range queries, in light of which they introduce a novel partitioning-based (A)LDR method built on randomized search (RANS/PRANS), derived from the cost models, that achieves good query performance in terms of pruning power. Extensive experiments verified the correctness of the cost models and indicated that, compared to the existing LDR method, the approach yields partitions with a lower query cost. C. Ordonez [10] introduced techniques to efficiently compute
fundamental statistical models inside a DBMS by exploiting user-defined functions (UDFs). Two summary matrices on the data set are mathematically shown to be essential for all the models: the linear sum of points and the quadratic sum of cross-products of points. Efficient SQL queries are introduced to compute the summary matrices and to score the data set; based on this SQL framework, UDFs that work in a single table scan are introduced, with aggregate UDFs computing the summary matrices for all models and a set of primitive scalar UDFs scoring the data sets. C. Ordonez [11] proposed two SQL aggregate functions to compute percentages, addressing many limitations: the first returns one row for each percentage in vertical form, and the second returns each set of percentages adding up to 100% on the same row in horizontal form. These novel aggregate functions are used to introduce the concept of percentage queries and to generate efficient SQL code in data mining work. Two practical issues were identified when computing vertical percentage queries: missing rows and division by zero.
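As an illustration of such a percentage query (our own sketch with a hypothetical table salaries(deptId, empId, salary), not the authors' exact functions), a vertical percentage aggregation can be written against the group totals, with a CASE guard for the division-by-zero issue just noted:

-- Each employee's share of the department total, as a percentage.
SELECT s.deptId, s.empId,
  CASE WHEN t.total = 0 THEN NULL
       ELSE 100.0 * s.salary / t.total END AS pctOfDept
FROM salaries s
JOIN (SELECT deptId, SUM(salary) AS total
      FROM salaries GROUP BY deptId) t
  ON s.deptId = t.deptId;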
Integrating the Gaussians Algorithm with Horizontal Aggregations
Clustering methods partition a set of objects into clusters such that objects in the same cluster are more similar to each other than to objects in different clusters, according to some defined criteria. Data mining applications frequently involve categorical data. The biggest advantage of these clustering algorithms is that they scale to very large data sets. Although the existing system computes the values for different attributes, it has some drawbacks: in horizontal aggregation research, existing systems are not well defined for the different fact tables, which need better indexing and extraction. Multiple fact tables: to construct new data sets within the range of a discrete set of known data points, we need different attributes from different fact tables. In many applications one has a number of data values, obtained by experimentation, stored in a limited number of databases, and it is often necessary to extract the particular useful attributes from the different fact tables and perform
aggregation. Gaussians: the Gaussians algorithm is initialized from some random or approximate solution. Each step assigns each point to its nearest cluster and then averages the points belonging to the same cluster to obtain new cluster centroids; successive steps improve the centroids until they are stable. This is the standard version of the K-means technique. Optimized K-means computes all Euclidean distances for one point in one I/O, exploits sufficient statistics, and stores the clustering model in a single table; experiments evaluate performance on large data sets, focusing on elapsed time per iteration. The main issue addressed here is how to index the results of horizontal aggregation efficiently. Initially an aggregation operation is performed and the horizontal layout is created using the PIVOT operator; the Gaussians algorithm is then applied to the resulting data set in horizontal layout.
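A minimal sketch of one such iteration in SQL follows (our illustration; all table and column names are assumptions). Let the output of the horizontal aggregation be points(pid, x1, x2) and the current model centroids(cid, x1, x2). The first statement assigns every point to its nearest centroid by squared Euclidean distance; the second averages each cluster's members into updated centroids. The ORDER BY ... LIMIT 1 idiom works in, e.g., PostgreSQL; other systems phrase the nearest-centroid choice differently.

-- Assignment step: nearest centroid per point.
CREATE TABLE assignment AS
SELECT p.pid,
  (SELECT c.cid FROM centroids c
   ORDER BY (p.x1 - c.x1) * (p.x1 - c.x1)
          + (p.x2 - c.x2) * (p.x2 - c.x2)
   LIMIT 1) AS cid
FROM points p;

-- Update step: new centroids as the per-cluster means.
SELECT a.cid, AVG(p.x1) AS x1, AVG(p.x2) AS x2
FROM assignment a
JOIN points p ON p.pid = a.pid
GROUP BY a.cid;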
References
[1] C. Cunningham, G. Graefe, and C.A. Galindo-Legaria. PIVOT and UNPIVOT: Optimization and execution strategies in an RDBMS. In Proc. VLDB Conference, pages 998-1009, 2004.
[2] H. Wang, C. Zaniolo, and C.R. Luo. ATLaS: A small but complete SQL extension for data mining and data streams. In Proc. VLDB Conference, pages 1113-1116, 2003.
[3] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: alternatives and implications. In Proc. ACM SIGMOD Conference, pages 343-354, 1998.
[4] C. Ordonez. Integrating K-means clustering with a relational DBMS using SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(2):188-201, 2006.
[5] G. Graefe, U. Fayyad, and S. Chaudhuri. On the efficient gathering of sufficient statistics for classification from large SQL databases. In Proc. ACM KDD Conference, pages 204-208, 1998.
[6] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-total. In Proc. ICDE Conference, pages 152-159, 1996.
[7] G. Luo, J.F. Naughton, C.J. Ellmann, and M. Watzke. Locking protocols for materialized aggregate join views. IEEE Transactions on Knowledge and Data Engineering (TKDE), 17(6):796-807, 2005.
[8] C. Ordonez and S. Pitchaimalai. Bayesian classifiers programmed in SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(1):139-144, 2010.
[9] X. Lian and L. Chen. General cost models for evaluating dimensionality reduction in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2009.
[10] C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22, 2010.
[11] C. Ordonez. Vertical and horizontal percentage aggregations. In Proc. ACM SIGMOD Conference, pages 866-871, 2004.
[12] C. Ordonez and Z. Chen. Horizontal aggregations in SQL to prepare data sets for data mining analysis. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2011.