Methods to Solve the Problems Related to Datasets

Requirement

Literature review: Developing an efficient data clustering method by (name of student)

Solution

Introduction

This literature review discusses the clustering problem. The authors of the reviewed papers have tried to develop methods for solving problems that arise with large datasets, and each adopts a different approach to forming clusters so that the data can be managed properly. The four papers discussed below show how different researchers have approached the problem.

Paper 1

Zhang T., Ramakrishnan R. and Livny M. wrote the paper ‘BIRCH: An Efficient Data Clustering Method for Very Large Databases’ [1]. Their motivation was that large datasets contain useful patterns, and the researchers wanted to identify the clusters present in multi-dimensional datasets. The paper therefore examines data clustering, which is a type of data mining problem. Given a large set of multi-dimensional data points, the data space is usually not uniformly occupied. Data clustering identifies the sparse and the crowded regions and hence reveals the overall distribution patterns of the dataset. The researchers present a clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and try to demonstrate that it is especially suitable for very large databases. In support of this, they state that “BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources” [1]. They also argue that BIRCH can typically find a good clustering with a single scan of the data and can improve the quality further with a few additional scans. According to the authors, BIRCH was the first clustering algorithm proposed in the database area to handle noise effectively. Through several experiments, the authors evaluate the time and space efficiency of BIRCH, its sensitivity to the order of the input data, and the quality of the resulting clusters. The method rests on two central concepts, the clustering feature (CF) and the CF tree.
They also compared the performance of BIRCH with CLARANS, another clustering method that had been proposed recently at the time. The comparison showed that BIRCH was consistently superior. The researchers found BIRCH well suited to large datasets because it makes a large clustering problem tractable by concentrating on the densely occupied portions of the data and representing them with a compact summary. They also found that “BIRCH can work with any given amount of memory, and the I/O complexity is a little more than one scan of data” [1].
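
The compact summary mentioned above can be illustrated with a small sketch of a clustering feature. It assumes the usual definition of a CF as a triple holding the number of points, their linear sum, and their sum of squared norms; the class and method names below are hypothetical and are not taken from the paper, and the CF tree itself is not shown.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Summary of a subcluster: point count, linear sum, sum of squared norms."""
    n: int
    linear_sum: List[float]
    square_sum: float

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusteringFeature":
        # A single point is a subcluster of size one.
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CFs are additive, so subclusters merge without revisiting raw points.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        # Root mean squared distance of the member points from the centroid.
        c = self.centroid()
        mean_sq = self.square_sum / self.n - sum(x * x for x in c)
        return sqrt(max(mean_sq, 0.0))  # clamp tiny negative rounding errors


# Two points summarised in one CF; the raw points are no longer needed.
cf = ClusteringFeature.from_point([0.0, 0.0]).merge(
    ClusteringFeature.from_point([2.0, 0.0])
)
print(cf.n, cf.centroid(), cf.radius())
```

Because the summary is additive, an incoming point can be absorbed into an existing subcluster in constant time per dimension, which is what allows a single scan over the data.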

Paper 2

The work of Aggarwal C., Han J., Wang J. and Yu P. S. is titled ‘A Framework for Clustering Evolving Data Streams’ [2]. According to these researchers, clustering is a particularly difficult problem in the data stream domain, because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. This setting differs from that of the previous paper, whose authors did not face this problem. The researchers chose this topic because they observed that, in recent years, a few one-pass clustering algorithms had been developed for the data stream problem. Such methods address the scalability of clustering, but they generally pay no attention to the evolution of the data. As a result, the quality of the clusters can be poor when the data changes considerably over time, and these methods do not meet the need for a data stream clustering algorithm with greater functionality, one that can discover and explore clusters over different portions of the stream. The researchers therefore explore the stream over different time windows so that users can gain a much deeper understanding of how the clusters evolve. At the same time, it is not possible to perform dynamic clustering simultaneously over all possible time horizons for a data stream of even moderately large volume. The paper therefore proposes a fundamentally different philosophy for data stream clustering, guided by application-centered requirements. The idea is to divide the clustering process into an online component, which periodically stores detailed summary statistics, and an offline component, which uses only these summary statistics.
According to the authors, the offline component is used by the analyst, who can supply a variety of inputs (such as the time horizon or the number of clusters) to gain a quick understanding of the broad clusters in the data stream. Choosing, storing and using the summary statistics efficiently is tricky for a fast data stream. To address this, the authors use a pyramidal time frame in conjunction with a micro-clustering approach [2]. Their performance experiments over a number of real and synthetic datasets illustrate the effectiveness, efficiency and insights provided by the approach [2]. The resulting method, ‘CluStream’, is found to have advantages over other recent techniques; in particular, it offers a variety of functionality for characterizing data stream clusters over different time horizons in an evolving environment. More generally, the authors note that a wide spectrum of clustering methods has been developed in data mining, statistics and machine learning, with many applications [2].
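
The pyramidal time frame can be sketched as a rule for deciding which snapshots of the summary statistics to keep. The simplified version below assumes that a snapshot taken at clock time t has order i when alpha^i is the largest power of alpha dividing t, and that only the most recent alpha^exponent + 1 snapshots of each order are retained. The names `PyramidalTimeFrame`, `alpha` and `exponent` are hypothetical, and only snapshot times are stored here, not the micro-cluster statistics themselves.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


def snapshot_order(t: int, alpha: int) -> int:
    """Largest i such that alpha**i divides t (requires t >= 1, alpha >= 2)."""
    if t < 1 or alpha < 2:
        raise ValueError("t must be >= 1 and alpha must be >= 2")
    order = 0
    while t % alpha == 0:
        t //= alpha
        order += 1
    return order


class PyramidalTimeFrame:
    """Keeps recent snapshots densely and older snapshots sparsely."""

    def __init__(self, alpha: int = 2, exponent: int = 1) -> None:
        self.alpha = alpha
        self.capacity = alpha ** exponent + 1  # snapshots kept per order
        self.frames: Dict[int, Deque[int]] = defaultdict(
            lambda: deque(maxlen=self.capacity)
        )

    def record(self, t: int) -> None:
        # Appending to a full deque silently drops the oldest snapshot
        # of that order.
        self.frames[snapshot_order(t, self.alpha)].append(t)

    def stored_times(self) -> List[int]:
        return sorted(t for frame in self.frames.values() for t in frame)


frame = PyramidalTimeFrame(alpha=2, exponent=1)
for tick in range(1, 17):
    frame.record(tick)
print(frame.stored_times())
```

The number of stored snapshots grows only logarithmically with the length of the stream, while a snapshot close to any requested time horizon remains available to the offline component.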

Paper 3

Domingos P. and Hulten G. wrote the paper ‘Mining High-speed Data Streams’ [3]. According to the authors, many companies today have very large databases that grow at a very high rate. This observation is similar to that of the researchers in the previous paper. Mining these continuous data streams presents companies with both opportunities and challenges. This motivated the researchers to develop and evaluate VFDT, a system that builds decision trees using constant memory and constant time per example. It can incorporate thousands of examples per second using off-the-shelf hardware. Unlike the methods in the first two papers, VFDT is a classification method rather than a clustering method, but it addresses the same difficulty of learning from data that is too large, or arrives too quickly, to be scanned repeatedly [3].
The system uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of a conventional learner. The authors study the properties of VFDT and demonstrate its utility through an extensive set of experiments on synthetic data. They also apply the system to mining the continuous stream of Web access data from the campus of the University of Washington. The empirical results suggest that VFDT is effective at taking advantage of massive numbers of examples.
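
The role of the Hoeffding bound can be made concrete with a short sketch. It uses the standard form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the quantity being estimated, delta is the allowed probability of error and n is the number of examples seen. The function names and the simplified split test are assumptions for illustration; the full system includes refinements that are not shown here.

```python
from math import log, sqrt


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """With probability 1 - delta, the true mean lies within this margin
    of the mean observed over n independent examples."""
    return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * n))


def can_split(best_gain: float, second_gain: float,
              value_range: float, delta: float, n: int) -> bool:
    """Split once the observed lead of the best attribute over the
    runner-up exceeds the Hoeffding margin."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)


# The margin shrinks as more examples arrive, so a fixed lead of 0.15
# is not trusted after 100 examples but is trusted after 1000.
print(can_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100))
print(can_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
```

Because the decision at each node needs only running statistics and a count of examples, each example can be processed once and then discarded.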

Paper 4

Ankerst M., Breunig M., Kriegel H. and Sander J. wrote the paper ‘OPTICS: Ordering Points to Identify the Clustering Structure’ [4]. According to the authors, cluster analysis is a primary method for database mining. It is used to gain insight into the distribution of a dataset. Almost all well-known clustering algorithms require input parameters that are hard to determine but have a significant influence on the clustering result. Furthermore, for many real datasets there is no global parameter setting that describes the intrinsic clustering structure accurately. The authors therefore introduce a new algorithm for cluster analysis that does not produce a clustering of the dataset explicitly. Instead, it creates an augmented ordering of the database that represents its density-based clustering structure. This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings, and it is a versatile basis for both automatic and interactive cluster analysis [4]. The authors show how to extract traditional clustering information, as well as the intrinsic clustering structure, automatically and efficiently. For medium-sized datasets, the cluster ordering can be represented graphically, and for very large datasets the authors introduce an appropriate visualization technique. Both were found suitable for interactive exploration of the intrinsic clustering structure, offering additional insights into the distribution and correlation of the data. The authors thus propose a method of cluster analysis based on the OPTICS algorithm and demonstrate in the paper how it can be applied. Two techniques are used to represent the cluster ordering graphically, and the choice between them depends on the size of the database [4].
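
The augmented ordering is built from two per-point quantities, usually called the core-distance and the reachability-distance. The sketch below computes both by brute force for a small point set. It assumes Euclidean distance and counts a point as a member of its own neighborhood, which is one common convention; the parameter names `eps` and `min_pts` are illustrative, and the ordering procedure itself is not shown.

```python
from math import dist, inf
from typing import List, Sequence

Point = Sequence[float]


def core_distance(points: List[Point], i: int, eps: float, min_pts: int) -> float:
    """Distance from points[i] to its min_pts-th nearest neighbor (itself
    included), or inf if fewer than min_pts points lie within eps."""
    distances = sorted(dist(points[i], q) for q in points)
    if len(distances) < min_pts or distances[min_pts - 1] > eps:
        return inf
    return distances[min_pts - 1]


def reachability_distance(points: List[Point], o: int, p: int,
                          eps: float, min_pts: int) -> float:
    """Reachability of points[p] from points[o]: never smaller than the
    core-distance of o, and inf if o is not a core point."""
    core = core_distance(points, o, eps, min_pts)
    if core == inf:
        return inf
    return max(core, dist(points[o], points[p]))


data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(core_distance(data, 1, eps=3.0, min_pts=3))
print(reachability_distance(data, 0, 1, eps=3.0, min_pts=3))
print(core_distance(data, 3, eps=3.0, min_pts=3))
```

Plotting the reachability values in the order in which the points are visited gives the graphical representation described above, where dense clusters appear as valleys.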

Conclusion

Cluster analysis is a method used for database mining that helps in gaining insight into the distribution of a dataset, although the well-known clustering algorithms require input parameters that are difficult to determine. In conclusion, the BIRCH method is well suited to large datasets because it makes a large clustering problem tractable by focusing on the densely occupied portions of the data and using a compact summary. For evolving data streams, the ‘CluStream’ method is found to have advantages over other recent techniques, since it offers a variety of functionality for characterizing data stream clusters over different time horizons. Overall, a wide spectrum of clustering methods has been developed in data mining, statistics and machine learning, with many applications.

References:

  • [1] T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD Record, vol. 25, no. 2, pp. 103-114, 1996.

  • [2] W. Fan, T. Watanabe and K. Asakura, "A framework for flexible clustering of multiple evolving data streams", International Journal of Advanced Intelligence Paradigms, vol. 1, no. 2, p. 178, 2008.

  • [3] R. Fok, A. An and X. Wang, "Mining Evolving Data Streams with Particle Filters", Computational Intelligence, 2015.

  • [4] M. Ankerst, M. Breunig, H. Kriegel and J. Sander, "OPTICS: Ordering Points to Identify the Clustering Structure", ACM SIGMOD Record, vol. 28, no. 2, pp. 49-60, 1999.
