Thursday, May 7, 2009

DMDW FAQs with solutions for JNTU B.Tech and MCA students: Unit VII

1. Explain cluster analysis.
Cluster analysis or clustering is the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. Often similarity is assessed according to a distance measure. Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Besides the term data clustering (or just clustering), there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis.
2. Discuss different types of clustering techniques.
Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in hierarchical clustering. Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.
Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously.
Another important distinction is whether the clustering uses symmetric or asymmetric distances. A property of Euclidean space is that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case.
Many clustering algorithms require specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem for which a number of techniques have been developed.
3. Explain the Partitioning Around Medoids (PAM) algorithm.
Compared to the well-known k-means algorithm, PAM has the following features:
It operates on the dissimilarity matrix of the given data set; when it is presented with an n × p data matrix, the algorithm first computes a dissimilarity matrix.
It is more robust, because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.
It provides a novel graphical display, the silhouette plot, which allows the user to select the optimal number of clusters.
In many clustering problems, one is interested in the characterization of the clusters by means of typical objects, which represent the various structural features of the objects under investigation. The algorithm PAM first computes k representative objects, called medoids. A medoid can be defined as that object of a cluster whose average dissimilarity to all the objects in the cluster is minimal. In the classification literature, such representative objects are called centrotypes. After finding the set of medoids, each object of the data set is assigned to the nearest medoid. That is, object i is put into cluster v_i when medoid m_vi is nearer than any other medoid m_w:
d(i, m_vi) ≤ d(i, m_w) for all w = 1, ..., k.
The k representative objects should minimize the objective function, which is the sum of the dissimilarities of all objects to their nearest medoid:
Objective function = Σ d(i, m_vi), where the sum is taken over all objects i in the data set.
The algorithm proceeds in two steps:
BUILD-step: This step sequentially selects k "centrally located" objects, to be used as initial medoids
SWAP-step: If the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, then the swap is carried out. This is continued till the objective function can no longer be decreased.
Isolated clusters: The program considers two types of isolated clusters, L-clusters and L*-clusters. Cluster C is an L-cluster if, for each object i belonging to C,
max_{j ∈ C} d_ij < min_{h ∉ C} d_ih
Cluster C is an L*-cluster if
max_{i, j ∈ C} d_ij < min_{l ∈ C, h ∉ C} d_lh
It is clear that L* - clusters are also L-clusters. It should be noted that the property of being isolated depends on the internal structure of a cluster as well as on its position relative to other clusters.
Diameter of a cluster
The diameter of Cluster C is defined as the largest dissimilarity between objects belonging to the Cluster C.
Diameter(C) = max_{i, j ∈ C} d_ij
Separation of a cluster
The separation of Cluster C is defined as the smallest dissimilarity between two objects, one of which belongs to Cluster C while the other does not.
Separation(C) = min_{l ∈ C, h ∉ C} d_lh
Average distance to a medoid
If j is the medoid of Cluster C, the average distance of all objects of C to j is calculated as follows:
Average distance_j = (1/N) Σ_{i ∈ C, i ≠ j} d_ij
where N is the number of objects in C other than j.
Maximum distance to medoid
If object j is the medoid of cluster C, the maximum distance of objects of C to j is calculated as follows:
Maximum distance_j = max_{i ∈ C} d_ij
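As a small sketch (not part of the original notes), these diagnostics can be computed from a precomputed dissimilarity matrix as follows; the function and argument names are illustrative assumptions.

```python
import numpy as np

def cluster_diagnostics(D, labels, cluster, medoid):
    """Diameter, separation, and average/maximum distance to the medoid of one cluster,
    computed from a precomputed dissimilarity matrix D."""
    labels = np.asarray(labels)
    members = np.where(labels == cluster)[0]           # objects belonging to cluster C
    others = np.where(labels != cluster)[0]            # objects outside C
    diameter = D[np.ix_(members, members)].max()       # largest within-cluster dissimilarity
    separation = D[np.ix_(members, others)].min()      # smallest dissimilarity to the outside
    non_medoid = members[members != medoid]
    avg_to_medoid = D[non_medoid, medoid].mean()       # average distance of C's objects to the medoid
    max_to_medoid = D[members, medoid].max()           # maximum distance of C's objects to the medoid
    return diameter, separation, avg_to_medoid, max_to_medoid
```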
Graphical display
Two of the most difficult tasks in cluster analysis are deciding on the appropriate number of clusters and distinguishing a bad cluster from a good one. The module Clusfind computes a set of values called silhouettes that provide key information for both tasks. First, we will explain how silhouettes are calculated, and then we will show how they are used.
Each cluster is represented by one silhouette, showing which objects lie within the cluster and which objects merely hold an intermediate position. The entire clustering is displayed by plotting all silhouettes into a single diagram, from which the quality of the clusters can be compared.
Silhouettes are constructed in the following way. Consider any object i of the data set, and let A denote the cluster to which it is assigned, and then calculate
a(i) = average dissimilarity of i to all other objects of A
Now consider any cluster C different from A and define
d(i, C) = average dissimilarity of i to all objects of C
Compute d(i, C) for all clusters C ≠ A, and then select the smallest of these:
b(i) = min d(i, C), C ≠ A
Let B denote the cluster which attains this minimum, i.e. d(i, B) = b(i); B is called the neighbour of object i. The value s(i) can now be defined as:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
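As a rough sketch (not from the original text), the silhouette values can be computed directly from a dissimilarity matrix and cluster labels; the function name and the convention that singleton clusters get s(i) = 0 are illustrative choices.

```python
import numpy as np

def silhouette_values(D, labels):
    """Compute s(i) for every object from a dissimilarity matrix D and cluster labels."""
    labels = np.asarray(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = (labels == labels[i])
        own[i] = False                               # exclude i itself from its own cluster A
        if not own.any():                            # singleton cluster: s(i) taken as 0
            continue
        a = D[i, own].mean()                         # a(i): average dissimilarity within A
        b = min(D[i, labels == c].mean()             # b(i): dissimilarity to the neighbour B
                for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)                   # s(i) = (b(i) - a(i)) / max(a(i), b(i))
    return s
```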
Outlier Analysis:
Working definition: an outlier x_k is an element of a data sequence S that is inconsistent with our expectations, based on the majority of the other elements of S.
Sources of outliers:
– Measurement errors
– Other uninteresting anomalous data: valid data observations made under anomalous conditions
– Surprising observations that may be important
Identifying Outliers
The 3σ edit rule:
– If x ~ N(μ, σ²), then P(|x − μ| > 3σ) ≈ 0.003
– Estimate the mean x̄
– Estimate the standard deviation s′
– Compute z_k = (x_k − x̄) / s′
– If |z_k| > t, then x_k is an outlier
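A minimal sketch of the 3σ edit rule described above, written with NumPy; the sample data and the default threshold t = 3 are illustrative assumptions.

```python
import numpy as np

def three_sigma_outliers(x, t=3.0):
    """Flag elements whose z-score magnitude exceeds t (the 3-sigma edit rule)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)      # z_k = (x_k - estimated mean) / estimated std. dev.
    return np.abs(z) > t

data = [10, 9, 11, 10, 12, 9, 10, 11, 10, 9, 11, 10, 12, 9, 10, 11, 10, 9, 11, 50]
print(three_sigma_outliers(data))           # only the value 50 is flagged
```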
4. Explain outlier analysis with illustration.
Rare, unusual, or just plain infrequent events are of interest in data mining in many contexts including fraud in income tax, insurance, and online banking, as well as for marketing. We classify analyses that focus on the discovery of such data items as outlier analysis. () captures the concept of an outlier as:
an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.
Outlier detection algorithms often fall into one of the categories of distance-based methods, density-based methods, projection-based methods, and distribution-based methods.
A general approach to identifying outliers is to assume a known distribution for the data and to examine the deviation of individuals from that distribution. Such approaches are common in statistics (, ) but they do not scale well.
Distance-based methods are common in data mining, where the measure of an entity's outlierness is based on its distance to nearby entities. The number of nearby entities and the minimum distance are the two parameters (see Knorr and Ng, 1998, VLDB 24).
Density-based approaches stem from Breunig, Kriegel, Ng and Sander (2000, SIGMOD), who introduced LOF, the local outlier factor. See also Jin, Tung and Han (KDD 2001).
The early work on outliers was carried out from a statistical view point where outliers are data points that deviate significantly from the identified underlying distribution of the data. () is a good textbook on outliers. () is another overview of statistical outliers.
Distance-based approaches have been developed by (), () and (). Such approaches usually explore some neighbourhood and do not rely on underlying distributions.
() identify outliers by counting the number of neighbours within a specified radius of a data point. The radius and the threshold number of points are the only two parameters of the approach. The approach is simple but is inadequate for data distributed with uneven density, where the radius and the threshold might need to vary to cope with the changes. () have a similar approach whereby data points are ranked by the sum of their distances to their nearest neighbours.
() and then () introduce a density-based approach to score data points with a local outlier factor (LOF). () introduce a heuristic to more efficiently identify the top outliers using the LOF.
() build mixture models as data becomes available and identify outliers as those data items causing the most perturbation to the model.
() explore the issue of outliers in high-dimensional space, where data tends to be sparse and consequently all data points tend to be equidistant from one another (, ), and suggest an algorithm in which the high-dimensional space is projected to a lower-dimensional space having unusually low density. An evolutionary algorithm is proposed to generate candidate subspaces in which outliers are to be searched for.
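To make the distance-based idea concrete, here is a brief sketch (an illustration under assumed parameter names, not any cited author's exact algorithm) that flags points having too few neighbours within a chosen radius.

```python
import numpy as np

def distance_based_outliers(X, radius, min_neighbours):
    """Flag points with fewer than `min_neighbours` other points within `radius`."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise Euclidean distances
    neighbours = (dists <= radius).sum(axis=1) - 1                   # do not count the point itself
    return neighbours < min_neighbours

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(distance_based_outliers(X, radius=1.0, min_neighbours=2))      # only the isolated point is flagged
```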
5. Explain the k-means and k-medoids clustering algorithms.
The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.
Given an initial set of k means m1(1),…,mk(1), which may be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps:
Assignment step: Assign each observation to the cluster with the closest mean (i.e. partition the observations according to the Voronoi diagram generated by the means).

S_i^(t) = { x_j : ||x_j − m_i^(t)|| ≤ ||x_j − m_i*^(t)|| for all i* = 1, ..., k }

Update step: Calculate the new means to be the centroid of the observations in the cluster.

m_i^(t+1) = (1 / |S_i^(t)|) Σ_{x_j ∈ S_i^(t)} x_j
The algorithm is deemed to have converged when the assignments no longer change.
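The two alternating steps can be sketched in a few lines of Python with NumPy (an illustrative implementation, with random initialisation as one possible heuristic):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Lloyd-style k-means: alternate assignment and update steps until assignments stop changing."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    means = X[rng.choice(len(X), size=k, replace=False)].copy()   # initial means picked at random
    labels = None
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)
        new_labels = d.argmin(axis=1)                             # assignment step: nearest mean
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                 # converged: assignments unchanged
        labels = new_labels
        for i in range(k):                                        # update step: recompute centroids
            if np.any(labels == i):
                means[i] = X[labels == i].mean(axis=0)
    return labels, means
```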
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoidshift algorithm. Both the k-means and k-medoids algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize squared error, the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars).
k-medoid is a classical partitioning technique of clustering that clusters the data set of n objects into k clusters known a priori. A useful tool for determining k is the silhouette.
It is more robust to noise and outliers as compared to k-means.
A medoid can be defined as that object of a cluster, whose average dissimilarity to all the objects in the cluster is minimal i.e. it is a most centrally located point in the given data set.
The most common realization of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm, which proceeds as follows:
1. Arbitrarily select k of the n data points as the initial medoids (n > k).
2. Associate each data object in the given data set with its most similar medoid; similarity here is defined using a distance measure such as the Euclidean, Manhattan or Minkowski distance.
3. Randomly select a non-medoid object O′.
4. Compute the total cost S of swapping the initial medoid object with O′.
5. If S < 0, swap the initial medoid with the new one (so that there is a new set of medoids).
6. Repeat steps 2 to 5 until there is no change in the medoids.
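A compact sketch of this procedure on a precomputed dissimilarity matrix (random initialisation and greedy swaps; the names and the simplified initialisation are assumptions rather than the canonical BUILD step):

```python
import numpy as np

def pam(D, k, seed=0):
    """k-medoids in the PAM style: keep any medoid/non-medoid swap that lowers the total cost."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))
    cost = lambda meds: D[:, meds].min(axis=1).sum()       # sum of dissimilarities to nearest medoid
    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for m_idx in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[m_idx] = o                           # try swapping one medoid with non-medoid o
                c = cost(trial)
                if c < best:                               # keep the swap only if the cost S decreases
                    medoids, best, improved = trial, c, True
    labels = D[:, medoids].argmin(axis=1)                  # assign each object to its nearest medoid
    return medoids, labels
```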


6. Explain grid-based methods.
Grid-based methods use a grid structure formed by quantizing the object space into a finite number of cells; the data objects are represented on a multi-resolution grid data structure, and all clustering operations are performed on this grid. The benefit of grid-based methods is their fast processing time, which depends only on the number of cells in each dimension of the quantized space and not on the number of data objects.
Typical grid-based approaches are:
a) WaveCluster: a wavelet transform is used to cluster objects.
b) STING: the grid cells contain statistical information that STING uses for clustering.
c) CLIQUE: used for clustering in high-dimensional data.
7. Discuss model-based clustering methods.
Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to that model. Model-based algorithms locate clusters by constructing a density function that reflects the spatial distribution of the data points. Based on standard statistics that take noise and outliers into account, model-based algorithms can automatically determine the number of clusters, which makes them robust clustering methods. We select a clustering algorithm on the basis of the particular purpose of the application and the data available.
Model-based clustering methods include:
a) COBWEB
b) Expectation maximization (EM)
c) SOM
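As a brief illustration of item (b), a Gaussian mixture fitted by expectation-maximisation can be used for model-based clustering; the sketch below assumes scikit-learn and NumPy are available and uses synthetic data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),           # two synthetic, well-separated
               rng.normal(5.0, 1.0, size=(100, 2))])          # Gaussian blobs

gm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM fit of a 2-component mixture
labels = gm.predict(X)                                         # hard cluster assignments
print(gm.means_)                                               # estimated component means
```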
8. Discuss briefly agglomerative and divisive analysis.
Agglomerative clustering: Initially every object forms a separate group; the objects or groups that are closest to one another are then successively combined until all objects are in a single group or a termination condition is satisfied. This method is also referred to as the ‘bottom-up’ approach.
Divisive clustering
Initially all objects are placed in a single cluster; on successive iterations a cluster is broken down into smaller clusters until each cluster contains a single object or a termination condition is satisfied. This method is also referred to as the ‘top-down’ approach.
9. Explain partitional methods.
k-means clustering
The k-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.
The algorithm steps are:
Choose the number of clusters, k.
Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
Assign each point to the nearest cluster center.
Recompute the new cluster centers.
Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result is a global minimum of variance. Another disadvantage is the requirement for the concept of a mean to be definable, which is not always the case. For such datasets the k-medoids variant is appropriate.
Fuzzy c-means clustering
In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of the cluster. For each point x we have a coefficient u_k(x) giving the degree of its belonging to the k-th cluster. Usually, the sum of those coefficients for any given x is defined to be 1:
Σ_k u_k(x) = 1
With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:
center_k = Σ_x u_k(x)^m x / Σ_x u_k(x)^m
The degree of belonging is related to the inverse of the distance to the cluster center:
u_k(x) = 1 / d(center_k, x)
then the coefficients are normalized and fuzzified with a real parameter m > 1 so that their sum is 1:
u_k(x) = 1 / Σ_j ( d(center_k, x) / d(center_j, x) )^(2/(m − 1))
For m equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1. When m is close to 1, then cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
The fuzzy c-means algorithm is very similar to the k-means algorithm:
Choose a number of clusters.
Assign randomly to each point coefficients for being in the clusters.
Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
Compute the centroid for each cluster, using the formula above.
For each point, compute its coefficients of being in the clusters, using the formula above.
The algorithm minimizes intra-cluster variance as well, but it has the same problems as k-means: the minimum found is a local minimum, and the results depend on the initial choice of weights. The expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas, such as partial membership in classes. It has better convergence properties and is in general preferred to fuzzy c-means.
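A short sketch of the fuzzy c-means loop just described (an illustrative implementation; the small constant added to the distances to avoid division by zero is an assumption):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, eps=1e-4, seed=0):
    """Alternate centroid and membership updates until the memberships change by less than eps."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)                    # memberships of each point sum to 1
    for _ in range(n_iter):
        w = u ** m                                       # fuzzified weights
        centers = (w.T @ X) / w.sum(axis=0)[:, None]     # weighted means of all points
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + 1e-12
        new_u = d ** (-2.0 / (m - 1.0))                  # inverse-distance-based membership
        new_u /= new_u.sum(axis=1, keepdims=True)        # normalise so each row sums to 1
        converged = np.abs(new_u - u).max() < eps
        u = new_u
        if converged:
            break
    return centers, u
```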
QT clustering algorithm
QT (quality threshold) clustering (Heyer, Kruglyak, Yooseph, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times.
The algorithm is:
The user chooses a maximum diameter for clusters.
Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold.
Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration. (If more than one candidate cluster has the maximum number of points, the tie can be broken arbitrarily.)
Recurse with the reduced set of points.
The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters).
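A straightforward (and deliberately unoptimised) sketch of the QT procedure described above, breaking ties by taking the first candidate with the maximum size; names and parameters are illustrative.

```python
import numpy as np

def qt_clustering(X, max_diameter):
    """Quality-threshold clustering: repeatedly extract the largest candidate cluster
    whose diameter stays within the chosen threshold."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    remaining = list(range(len(X)))
    clusters = []
    while remaining:
        best = None
        for seed in remaining:                           # build one candidate cluster per point
            cand = [seed]
            others = [p for p in remaining if p != seed]
            while others:
                # complete linkage: distance of a point to the cluster = max distance to any member
                dists = [max(D[p, q] for q in cand) for p in others]
                i = int(np.argmin(dists))
                if dists[i] > max_diameter:              # adding the closest point would exceed the diameter
                    break
                cand.append(others.pop(i))
            if best is None or len(cand) > len(best):    # keep the candidate with the most points
                best = cand
        clusters.append(best)
        remaining = [p for p in remaining if p not in best]
    return clusters
```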
Locality-sensitive hashing
Locality-sensitive hashing can be used for clustering. Feature space vectors are sets, and the metric used is the Jaccard distance. The feature space can be considered high-dimensional. The min-wise independent permutations LSH scheme (sometimes MinHash) is then used to put similar items into buckets. With just one set of hashing methods, there are only clusters of very similar elements. By seeding the hash functions several times (e.g. 20), it is possible to get bigger clusters.
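A simplified sketch of the idea (one MinHash value per seed, and items merged whenever they share a bucket); this is an illustration of the scheme rather than a tuned LSH implementation, and it relies on Python's built-in hash.

```python
import random
from collections import defaultdict

def minhash(item_set, seed):
    """One min-wise hash of a set: the smallest hashed value over its elements."""
    return min(hash((seed, x)) for x in item_set)

def lsh_clusters(named_sets, n_hashes=20, seed=0):
    """Bucket items by each MinHash value; items sharing any bucket are merged into one
    cluster, so more hash seeds tend to yield bigger clusters."""
    rng = random.Random(seed)
    seeds = [rng.randrange(2**31) for _ in range(n_hashes)]
    parent = {name: name for name in named_sets}          # union-find over item names
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for s in seeds:
        buckets = defaultdict(list)
        for name, items in named_sets.items():
            buckets[minhash(items, s)].append(name)
        for names in buckets.values():                    # merge everything sharing a bucket
            for other in names[1:]:
                parent[find(other)] = find(names[0])
    clusters = defaultdict(list)
    for name in named_sets:
        clusters[find(name)].append(name)
    return list(clusters.values())

# Toy usage: the two highly similar sets end up together, the dissimilar one stays apart.
print(lsh_clusters({"a": {1, 2, 3, 4}, "b": {1, 2, 3, 4, 5}, "c": {7, 8, 9}}))
```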
Graph-theoretic methods
Formal concept analysis is a technique for generating clusters of objects and attributes, given a bipartite graph representing the relations between the objects and attributes. Other methods for generating overlapping clusters (a cover rather than a partition) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970).
Spectral clustering
Given a set of data points A, the similarity matrix may be defined as a matrix S where S_ij represents a measure of the similarity between points i and j. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions.
One such technique is the Shi-Malik algorithm, commonly used for image segmentation. It partitions points into two sets (S1,S2) based on the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix
L = I − D^(−1/2) S D^(−1/2)
of S, where D is the diagonal matrix
D_ii = Σ_j S_ij
This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S1, and the rest in S2. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.
A related algorithm is the Meila-Shi algorithm, which takes the eigenvectors corresponding to the k largest eigenvalues of the matrix P = S D^(−1) for some k, and then invokes another clustering algorithm (e.g. k-means) to cluster points by their respective k components in these eigenvectors.
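A minimal sketch of the Shi-Malik bipartition step on a small similarity matrix (assuming NumPy; the Gaussian-kernel similarity and the median split are illustrative choices):

```python
import numpy as np

def shi_malik_bipartition(S):
    """Split points into two sets using the eigenvector of L = I - D^(-1/2) S D^(-1/2)
    associated with its second-smallest eigenvalue."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)            # eigh returns eigenvalues in ascending order
    v = eigvecs[:, 1]                               # eigenvector for the second-smallest eigenvalue
    return v > np.median(v)                         # threshold at the median of its components

# Toy usage: similarity via a Gaussian kernel on points forming two groups on a line.
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
S = np.exp(-(X[:, None] - X[None, :]) ** 2)
print(shi_malik_bipartition(S))                     # expected to separate the two groups
```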

Agglomerative hierarchical clustering
This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).
Suppose we have merged the two closest elements b and c; we now have the clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need the distance between {a} and {b, c}, and therefore must define the distance between two clusters. Usually the distance between two clusters A and B is one of the following:
The maximum distance between elements of each cluster (also called complete linkage clustering):
max { d(x, y) : x ∈ A, y ∈ B }
The minimum distance between elements of each cluster (also called single-linkage clustering):
min { d(x, y) : x ∈ A, y ∈ B }
The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA):
(1 / (|A| · |B|)) Σ_{x ∈ A} Σ_{y ∈ B} d(x, y)
The sum of all intra-cluster variance.
The increase in variance for the cluster being merged (Ward's criterion).
The probability that candidate clusters spawn from the same distribution function (V-linkage).
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
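For reference, a tiny sketch of agglomerative clustering using SciPy's hierarchical-clustering routines (assuming SciPy is available); `method` selects the linkage, and the merge tree is cut at a chosen distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],    # six points forming two groups,
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])   # standing in for {a}, ..., {f}

Z = linkage(X, method='complete')                     # complete-linkage merge tree
labels = fcluster(Z, t=2.0, criterion='distance')     # cut the tree at distance 2.0
print(labels)                                         # two clusters of three points each
```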
