Cluster analysis classification based on the constructed model. Application of cluster analysis in Microsoft Excel. Tasks solved by Data Mining methods

30.03.2020

One of the tools for solving economic problems is cluster analysis. With its help, the objects of a data array are classified into groups (clusters). This technique can also be applied in Excel. Let's see how this is done in practice.

Cluster analysis makes it possible to partition a sample according to the characteristic under study. Its main task is to split a multidimensional array into homogeneous groups. As the grouping criterion, either a paired correlation coefficient or the Euclidean distance between objects on a given parameter is used. The values closest to each other are grouped together.

Although this type of analysis is most often used in economics, it can also be applied in biology (to classify animals), psychology, medicine, and many other areas of human activity. Cluster analysis can be carried out using the standard set of Excel tools.

Usage example

We have five objects, each characterized by two studied parameters, x and y.
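The same exercise can be reproduced outside Excel. Below is a minimal Python sketch; the coordinates are hypothetical, since the article's actual table is not reproduced here. It computes the pairwise Euclidean distances named above and then merges the closest objects.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical coordinates for the five objects (the article's own numbers
# are not reproduced here).
objects = np.array([
    [1.0, 2.0],   # object 1: (x, y)
    [1.5, 1.8],   # object 2
    [5.0, 8.0],   # object 3
    [8.0, 8.0],   # object 4
    [1.2, 0.9],   # object 5
])

# Pairwise Euclidean distances, the grouping criterion named above.
print(pdist(objects, metric="euclidean"))

# Agglomerative clustering: the closest objects are merged first.
Z = linkage(objects, method="average", metric="euclidean")
print(fcluster(Z, t=2, criterion="maxclust"))  # two-group split
```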

Cluster analysis

Most researchers are inclined to believe that the term "cluster analysis" (from the English cluster: bunch, clot, cluster) was first proposed by the mathematician R. Tryon. A number of terms later arose that are now considered synonymous with cluster analysis: automatic classification, botryology.

Cluster analysis is a multidimensional statistical procedure that takes data containing information about a sample of objects and then arranges the objects into relatively homogeneous groups (clusters) (Q-clustering, or Q-technique; cluster analysis proper). A cluster is a group of elements characterized by a common property, and the main goal of cluster analysis is to find groups of similar objects in a sample. The range of applications of cluster analysis is very wide: it is used in archeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, marketing, sociology, and other disciplines. However, this universality has led to a large number of incompatible terms, methods, and approaches that make it difficult to use cluster analysis unambiguously and interpret it consistently. A. I. Orlov, for example, proposes a way of distinguishing between them.

Tasks and conditions

Cluster analysis performs the following main tasks:

  • Development of a typology or classification.
  • Exploring useful conceptual schemes for grouping objects.
  • Generation of hypotheses based on data exploration.
  • Hypothesis testing or research to determine whether types (groups) identified in one way or another are actually present in the available data.

Regardless of the subject of study, the use of cluster analysis involves the following steps (a code sketch of the whole sequence follows the list):

  • Sampling for clustering. It is understood that it makes sense to cluster only quantitative data.
  • Definition of a set of variables by which objects in the sample will be evaluated, that is, a feature space.
  • Calculation of the values of one or another measure of similarity (or difference) between objects.
  • Application of the cluster analysis method to create groups of similar objects.
  • Validation of the results of the cluster solution.
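A hedged Python sketch of these steps end to end, on synthetic data; average linkage and Euclidean distance are assumptions of the sketch, matching the measures discussed above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

def cluster_pipeline(X, n_clusters):
    """Feature matrix -> standardized features -> linkage -> cluster labels."""
    X = np.asarray(X, dtype=float)
    Xz = zscore(X, axis=0, ddof=1)        # put features on comparable scales
    Z = linkage(Xz, method="average", metric="euclidean")  # similarity step
    return fcluster(Z, t=n_clusters, criterion="maxclust")

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])
labels = cluster_pipeline(X, n_clusters=2)
print(np.bincount(labels)[1:])  # cluster sizes, a first validation check
```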

Cluster analysis imposes the following requirements on the data:

  1. indicators should not correlate with each other;
  2. indicators should not contradict the theory of measurements;
  3. the distribution of indicators should be close to normal;
  4. indicators must meet the requirement of "stability", meaning that their values are not influenced by random factors;
  5. the sample should be homogeneous, not contain "outliers".

One can also find a description of two fundamental requirements for the data, homogeneity and completeness:

Homogeneity requires that all entities represented in a table be of the same nature. The completeness requirement means that the sets I and J provide a complete description of the manifestations of the phenomenon under consideration. If we consider a table in which I is a collection of objects and J is the set of variables describing this collection, then I should be a representative sample from the studied population, and the system of characteristics J should give a satisfactory vector representation of the individuals i from the researcher's point of view.

If cluster analysis is preceded by factor analysis, the sample does not need to be "repaired": the stated requirements are met automatically by the factor modeling procedure itself (there is one more advantage, z-standardization without negative consequences for the sample; if z-standardization is carried out directly for cluster analysis, it can reduce the clarity of the separation of groups). Otherwise, the sample must be adjusted.

Typology of clustering problems

Input Types

In modern science, several algorithms for processing input data are used. Analysis by comparing objects on the basis of features (most common in the biological sciences) is called Q-type analysis, while comparison of features on the basis of objects is called R-type analysis. There are attempts to use hybrid types of analysis (for example, RQ analysis), but this methodology has not yet been properly developed.

Goals of clustering

  • Understanding data by identifying cluster structure. Dividing the sample into groups of similar objects makes it possible to simplify further data processing and decision-making by applying its own analysis method to each cluster (the “divide and conquer” strategy).
  • Data compression. If the initial sample is excessively large, then it can be reduced, leaving one of the most typical representatives from each cluster.
  • Novelty detection: selection of atypical objects that cannot be attached to any of the clusters.

In the first case, they try to make the number of clusters smaller. In the second case, it is more important to ensure a high degree of similarity of objects within each cluster, and there can be any number of clusters. In the third case, individual objects that do not fit into any of the clusters are of greatest interest.

In all these cases, hierarchical clustering can be applied: large clusters are split into smaller ones, which in turn are split into smaller ones still, and so on. Such tasks are called taxonomy tasks. The result of a taxonomy is a tree-like hierarchical structure. In addition, each object is characterized by an enumeration of all the clusters to which it belongs, usually from large to small.
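The taxonomy tree itself can be drawn directly; a minimal sketch on random data (assumed, for illustration only) using scipy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 2))          # twelve objects, two features

Z = linkage(X, method="ward")         # agglomerative merge history
dendrogram(Z)                         # the taxonomy tree described above
plt.xlabel("object index")
plt.ylabel("merge distance")
plt.show()
```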

Clustering methods

There is no generally accepted classification of clustering methods, but the solid attempt by V. S. Berikov and G. S. Lbov can be noted. Summarizing the various classifications of clustering methods, a number of groups can be distinguished (some methods can be assigned to several groups at once, so this typification should be considered an approximation to a real classification of clustering methods):

  1. Probabilistic approach. It is assumed that each object under consideration belongs to one of the k classes. Some authors (for example, A. I. Orlov) believe that this group does not belong to clustering at all and oppose it under the name "discrimination", that is, the choice of assigning objects to one of the known groups (training samples).
  2. Approaches based on artificial intelligence systems. A very conditional group, since there are a lot of AI methods and methodically they are very different.
  3. Logical approach. The construction of a dendrogram is carried out using a decision tree.
  4. Graph-theoretic approach.
    • Graph clustering algorithms
  5. Hierarchical approach. The presence of nested groups (clusters of different orders) is assumed. Algorithms, in turn, are divided into agglomerative (unifying) and divisive (separating). According to the number of features, monothetic and polythetic methods of classification are sometimes distinguished.
    • Hierarchical divisional clustering or taxonomy. Clustering problems are considered in quantitative taxonomy.
  6. Other Methods. Not included in the previous groups.
    • Statistical clustering algorithms
    • Ensemble of clusterers
    • Algorithms of the KRAB family
    • Algorithm based on the sifting method
    • DBSCAN etc.

Approaches 4 and 5 are sometimes combined under the name of the structural or geometric approach, which has a more formalized concept of proximity. Despite the significant differences between the listed methods, they all rely on the original "compactness hypothesis": in object space, all close objects must belong to the same cluster, and all distinct objects, respectively, must be in different clusters.

Formal Statement of the Clustering Problem

Let X be a set of objects and Y a set of cluster numbers (names, labels). A distance function ρ(x, x′) between objects is given, along with a finite training sample X^m = {x_1, …, x_m} ⊂ X. It is required to split the sample into non-overlapping subsets, called clusters, so that each cluster consists of objects that are close in the metric ρ, while objects from different clusters differ significantly. Each object x_i ∈ X^m is then assigned a cluster number y_i.

A clustering algorithm is a function a: X → Y that assigns a cluster number y ∈ Y to any object x ∈ X. In some cases the set Y is known in advance, but more often the task is to determine the optimal number of clusters from the point of view of one or another quality criterion of clustering.

Clustering (unsupervised learning) differs from classification (supervised learning) in that the labels y_i of the original objects are not given initially, and even the set Y itself may be unknown.

The solution of the clustering problem is fundamentally ambiguous, and there are several reasons for this (according to a number of authors):

  • there is no uniquely best criterion for the quality of clustering. A number of heuristic criteria are known, as well as a number of algorithms that do not have a clearly defined criterion, but carry out a fairly reasonable clustering “by construction”. All of them can give different results. Therefore, to determine the quality of clustering, an expert in the subject area is required, who could assess the meaningfulness of the selection of clusters.
  • the number of clusters is usually unknown in advance and is set according to some subjective criterion. This is true only for discrimination methods, since in clustering methods, clusters are selected using a formalized approach based on proximity measures.
  • the clustering result significantly depends on the metric, the choice of which, as a rule, is also subjective and is determined by an expert. But it is worth noting that there are a number of recommendations for choosing proximity measures for various tasks.

Application

In biology

In biology, clustering has many applications in a wide variety of fields. For example, in bioinformatics, it is used to analyze complex networks of interacting genes, sometimes consisting of hundreds or even thousands of elements. Cluster analysis allows you to identify subnets, bottlenecks, hubs and other hidden properties of the system under study, which ultimately allows you to find out the contribution of each gene to the formation of the phenomenon under study.

In the field of ecology, it is widely used to identify spatially homogeneous groups of organisms, communities, etc. Less commonly, cluster analysis methods are used to study communities over time. The heterogeneity of the structure of communities leads to the emergence of non-trivial methods of cluster analysis (for example, the Czekanowski method).

In general, it is worth noting that historically, similarity measures are more often used as proximity measures in biology, rather than difference (distance) measures.

In sociology

When analyzing the results of sociological research, it is recommended to use methods of the hierarchical agglomerative family, namely Ward's method, which minimizes the within-cluster variance and, as a result, creates clusters of approximately equal size. Ward's method is the most successful for analyzing sociological data. As the measure of difference, the squared Euclidean distance works better, since it increases the contrast between clusters. The main result of hierarchical cluster analysis is a dendrogram, or "icicle diagram". When interpreting it, researchers face a problem of the same kind as with interpreting the results of factor analysis: the lack of unambiguous criteria for identifying clusters. Two methods are recommended as the main ones: visual analysis of the dendrogram and comparison of the results of clustering performed by different methods.

Visual analysis of the dendrogram involves "cutting" the tree at the optimal level of similarity of the sample elements. The "vine branch" (the terminology of M. S. Oldenderfer and R. K. Blashfield) should be "cut off" at the 5 mark on the Rescaled Distance Cluster Combine scale, thus achieving an 80% similarity level. If selecting clusters at this mark is difficult (several small clusters merge into one large one there), another mark can be chosen. This technique is proposed by Oldenderfer and Blashfield.

Now the question of the stability of the adopted cluster solution arises. In essence, checking the stability of a clustering comes down to checking its reliability. There is a rule of thumb here: a stable typology is preserved when the clustering method changes. The results of hierarchical cluster analysis can be verified by iterative k-means cluster analysis. If the compared classifications of the groups of respondents coincide in more than 70% of cases (more than 2/3 of matches), the cluster solution is accepted.
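This rule of thumb can be sketched as follows; the data here are synthetic, and the simple two-cluster label alignment is an assumption of the sketch:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 7)), rng.normal(4, 1, (50, 7))])

h = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
k = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cluster numbers are arbitrary, so for two clusters compare both possible
# alignments of the labels and keep the better one.
share = max(np.mean(h == k), np.mean(h == 1 - k))
print(f"share of coincidences: {share:.0%}")  # > 70% -> accept the solution
```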

It is impossible to check the adequacy of a cluster solution without resorting to another type of analysis; at least theoretically, this problem has not been solved. Oldenderfer and Blashfield's classic Cluster Analysis discusses, and ultimately rejects, five additional robustness-testing methods.

In computer science

  • Clustering of search results: used for "intelligent" grouping of results when searching files, websites, or other objects, allowing the user to navigate quickly, select an obviously more relevant subset, and exclude an obviously less relevant one, which can improve the usability of the interface compared with output as a simple list sorted by relevance.
    • Clusty - Vivísimo's clustering search engine
    • Nigma - Russian search engine with automatic results clustering
    • Quintura - visual clustering in the form of a cloud of keywords
  • Image segmentation: clustering can be used to partition a digital image into distinct regions for edge detection or object recognition.
  • Data mining: clustering in Data Mining becomes valuable when it acts as one of the stages of analysis in building a complete analytical solution. It is often easier for an analyst to identify groups of similar objects, study their features, and build a separate model for each group than to create one general model for all the data. This technique is constantly used in marketing to identify groups of customers, buyers, or goods and to develop a separate strategy for each of them.


There are two main types of cluster analysis in statistics (both represented in SPSS): hierarchical and k-means. In the first case, the automated statistical procedure independently determines the optimal number of clusters and a number of other parameters required for the analysis. The second type of analysis has significant limitations in practical applicability: for it, one must independently specify the exact number of clusters to allocate, the initial values of the centers of each cluster (the centroids), and some other statistics. When analyzing with the k-means method, these problems are solved by first conducting a hierarchical cluster analysis and then, based on its results, calculating the cluster model by the k-means method, which in most cases not only does not simplify but, on the contrary, complicates the researcher's work (especially for an unprepared researcher).

In general, it can be said that, because hierarchical cluster analysis is very demanding on computer hardware, k-means cluster analysis was introduced into SPSS to process very large data sets of many thousands of observations (respondents) under conditions of insufficient computing capacity. Sample sizes used in marketing research in most cases do not exceed four thousand respondents. The practice of marketing research shows that it is the first type of cluster analysis, hierarchical, that is recommended for use in all cases as the most relevant, universal, and accurate. At the same time, the selection of relevant variables is important when conducting cluster analysis: the inclusion of several or even one irrelevant variable can lead to the failure of the entire statistical procedure.

We will describe the methodology for conducting cluster analysis using the following example from the practice of marketing research.

Initial data:

During the study, 745 air passengers flying with one of 22 Russian and foreign airlines were interviewed. The air passengers were asked to rate, on a five-point scale from 1 (very poor) to 5 (excellent), seven aspects of the airline ground staff's performance during check-in: courtesy, professionalism, promptness, helpfulness, queue management, appearance, and the work of the staff in general.

Required:

Segment the studied airlines according to the level of quality of work of ground personnel perceived by air passengers.

So, we have a data file consisting of seven interval variables denoting the performance ratings of the ground personnel of the various airlines (q13-q19), presented on a single five-point scale. The data file also contains the variable q4, indicating the airline chosen by each respondent (22 in total). Let's carry out a cluster analysis and determine which target groups the airlines can be divided into.

Hierarchical cluster analysis is carried out in two stages. The result of the first stage is the number of clusters (target segments) into which the sample of respondents should be divided. The cluster analysis procedure as such cannot independently determine the optimal number of clusters; it can only suggest the desired number. Since the task of determining the optimal number of segments is key, it is usually solved at a separate stage of the analysis. At the second stage, the actual clustering of observations is performed according to the number of clusters determined at the first stage. Now let's look at these cluster analysis steps in order.

The cluster analysis procedure is launched via the Analyze > Classify > Hierarchical Cluster menu. In the dialog box that opens, from the left-hand list of all variables available in the data file, select the variables that serve as segmentation criteria. In our case there are seven of them, denoting the estimates of the parameters of the work of ground personnel, q13-q19 (Fig. 5.44). In principle, specifying the set of segmentation criteria is quite enough to perform the first stage of cluster analysis.

Fig. 5.44.

By default, in addition to the table with the results of cluster formation, on the basis of which we will determine their optimal number, SPSS also displays a special inverted icicle histogram which, according to the intent of the program's creators, helps to determine the optimal number of clusters; the diagrams are displayed using the Plots button (Fig. 5.45). However, if we leave this option set, we will spend a lot of time processing even a relatively small data file. In addition to the icicle, a faster Dendrogram chart can be selected in the Plots window. It consists of horizontal bars reflecting the process of cluster formation. Theoretically, with a small number of respondents (up to 50-100), this diagram really does help choose the optimal number of clusters. However, in almost all marketing research examples the sample size exceeds this value, and the dendrogram becomes completely useless: even with a relatively small number of observations it is a very long sequence of row numbers of the original data file connected by horizontal and vertical lines. Most SPSS textbooks contain examples of cluster analysis on just such artificial, small samples. In this tutorial, we show how to get the most out of SPSS in a practical setting, using real marketing research examples.

Fig. 5.45.

As we have established, neither the Icicle nor the Dendrogram is suitable for practical purposes. Therefore, in the main Hierarchical Cluster Analysis dialog box, it is recommended not to display charts: deselect the default Plots option in the Display area, as shown in Fig. 5.44. Now everything is ready to perform the first stage of cluster analysis. Start the procedure by clicking the OK button.

After a while, the results will appear in the SPSS Viewer window. As mentioned above, the only result of the first stage of the analysis that matters to us is the Average Linkage (Between Groups) table, shown in Fig. 5.46. Based on this table, we must determine the optimal number of clusters. Note that there is no single universal method for determining the optimal number of clusters; in each case, the researcher must determine it himself.

Based on experience, the author proposes the following scheme for this process. First of all, let's try the most common standard method for determining the number of clusters. Using the Average Linkage (Between Groups) table, determine at which step of the cluster formation process (the Stage column) the first relatively large jump in the agglomeration coefficient occurs (the Coefficients column). This jump means that up to that step, observations lying at sufficiently small distances from each other were being combined into clusters (in our case, respondents with a similar level of ratings on the analyzed parameters), while from this stage on, more distant observations are being combined.

In our case, the coefficients increase smoothly from 0 to 7.452; that is, the difference between the coefficients at steps from the first to the 728th is small (for example, between steps 728 and 727 it is 0.534). Starting from step 729, the first significant jump in the coefficient occurs: from 7.452 to 10.364 (by 2.912). The step at which the coefficient first jumps is thus 729. To determine the optimal number of clusters, subtract this value from the total number of observations (the sample size). The total sample size in our case is 745 people, so the optimal number of clusters is 745 - 729 = 16. (A code sketch of this jump heuristic follows Fig. 5.46.)


Fig. 5.46.
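A minimal sketch of the jump heuristic, assuming scipy's merge heights play the role of SPSS's agglomeration coefficients, with synthetic two-group data standing in for the survey file. One deliberate choice to flag: the sketch keeps the clusters that existed just before the first large jump, whereas the text's arithmetic subtracts the jump step itself, a one-step difference of convention.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
# Synthetic stand-in for the survey data: two well-separated groups.
X = np.vstack([rng.normal(0, 1, (30, 7)), rng.normal(4, 1, (30, 7))])
n = X.shape[0]

Z = linkage(X, method="average")
coef = Z[:, 2]           # one agglomeration coefficient per merge step
jumps = np.diff(coef)    # growth of the coefficient from step to step

# 1-based number of the step where the big jump lands; with well-separated
# groups the first large jump is also the largest one.
step = int(np.argmax(jumps)) + 2
# Keep the clusters that existed just before that merge.
print("suggested number of clusters:", n - step + 1)   # -> 2
```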

We got a fairly large number of clusters, which will be difficult to interpret later. Therefore, we now need to examine the resulting clusters and determine which of them are significant and which we should try to merge. This problem is solved at the second stage of the cluster analysis.

Open the main dialog box of the cluster analysis procedure (menu Analyze > Classify > Hierarchical Cluster). The seven analyzed variables are already in place. Click the Save button. The dialog box that opens (Fig. 5.47) allows you to create a new variable in the source data file that distributes the respondents into target groups. Select the Single Solution option and specify the required number of clusters in the corresponding field: 16 (as determined at the first stage of the cluster analysis). Clicking the Continue button returns you to the main dialog box, where you can click OK to start the cluster analysis procedure.

Before continuing the description of the cluster analysis process, a short description of the other options is in order. Among them are both useful features and ones that are actually superfluous from the point of view of practical marketing research. For example, the main Hierarchical Cluster Analysis dialog box contains a Label Cases by field, in which you can optionally place a text variable identifying the respondents. In our case, the q4 variable, which encodes the airlines chosen by the respondents, could serve this purpose. In practice it is difficult to come up with a rational reason to use the Label Cases by field, so you can safely leave it empty.

Fig. 5.47.

The Statistics dialog box, called by the button of the same name in the main dialog box, is used infrequently in cluster analysis. It allows you to display the Cluster Membership table in the SPSS Viewer window, in which each respondent in the source data file is mapped to a cluster number. With a sufficiently large number of respondents (as in almost all marketing research examples), this table is completely useless, since it is a long sequence of pairs of values "respondent number / cluster number" that cannot be interpreted in this form. The technical goal of cluster analysis is always to create an additional variable in the data file reflecting the division of respondents into target groups (by clicking the Save button in the main cluster analysis dialog box). This variable, together with the respondent numbers, is precisely the Cluster Membership table. The only practical option in the Statistics window is displaying the Average Linkage (Between Groups) table, but it is already set by default. Thus, using the Statistics button to display a separate Cluster Membership table in the SPSS Viewer window is not practical.

The Plots button has already been mentioned above: it should be deactivated by deselecting the Plots parameter in the main cluster analysis dialog box.

In addition to these rarely used features, the cluster analysis procedure in SPSS also offers some very useful options. Among them, first of all, is the Save button, which allows you to create a new variable in the source data file distributing respondents into clusters. The main dialog box also contains an area for selecting the object of clustering: respondents or variables. This possibility was discussed above in section 5.4. In the first case, cluster analysis is mainly used to segment respondents according to certain criteria; in the second, the purpose of cluster analysis is similar to that of factor analysis: classification (reduction of the number) of variables.

As can be seen from Fig. 5.44, the only cluster analysis option not yet considered is the Method button for selecting the method of the statistical procedure. Experimenting with this parameter allows you to achieve greater accuracy in determining the optimal number of clusters. The general form of this dialog box with default settings is shown in Fig. 5.48.

Fig. 5.48.

The first thing set in this window is the method of forming clusters (that is, of combining observations). Among all the options of statistical methods offered by SPSS, you should choose either the default Between-groups linkage method or Ward's method. The first method is used more often due to its versatility and the relative simplicity of the statistical procedure on which it is based: the distance between clusters is calculated as the average of the distances between all theoretically possible pairs of observations, one observation taken from one cluster and the other from the other. Ward's method is more difficult to understand and less commonly used; it consists of many stages and is based on averaging the values of all variables for each observation and then summing the squared distances from the calculated averages to each observation. For the practical purposes of marketing research, we recommend always using the default Between-groups linkage method.

After selecting the statistical clustering procedure, select a method for calculating the distances between observations (the Measure area in the Method dialog box). Various methods of determining distances exist for the three types of variables that can take part in cluster analysis (segmentation criteria). These variables can have an interval (Interval), nominal (Counts), or dichotomous (Binary) scale. The dichotomous scale (Binary) covers only variables reflecting the occurrence/non-occurrence of an event (bought/not bought, yes/no, etc.). Other types of dichotomous variables (for example, male/female) should be treated and analyzed as nominal (Counts).

The most commonly used method for determining distances for interval variables is the default Squared Euclidean Distance. It is this method that has proven itself in marketing research as the most accurate and universal. However, it is not suitable for dichotomous variables, where observations are represented by only two values (for example, 0 and 1), because it takes into account only interactions between observations of the type X = 1, Y = 0 and X = 0, Y = 1 (where X and Y are variables) and ignores other types of interactions. The most comprehensive distance measure, taking into account all important types of interactions between two dichotomous variables, is the Lambda method. We recommend using this method due to its versatility; however, there are other methods, such as Shape, Hamann, or Anderberg's D.
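SPSS's Lambda measure has no direct equivalent in common Python libraries, so the sketch below (an assumption, for illustration only) contrasts squared Euclidean distance with Jaccard, a binary measure that likewise looks at co-occurrence patterns rather than treating 0/1 as plain numbers:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two hypothetical dichotomous variables over five observations (1 = event).
B = np.array([[1, 0, 1, 1, 0],
              [1, 1, 0, 1, 0]], dtype=bool)

# Squared Euclidean treats 0/1 as numbers; a binary measure such as Jaccard
# counts co-occurrence patterns instead. (SPSS's Lambda measure has no direct
# scipy equivalent, so Jaccard stands in for the binary family here.)
print(pdist(B.astype(float), metric="sqeuclidean"))
print(pdist(B, metric="jaccard"))
```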

When specifying the method for determining distances for dichotomous variables, indicate in the corresponding fields the specific values that the studied dichotomous variables can take: in the Present field, the encoding of the answer Yes, and in the Absent field, No. The field names Present and Absent reflect the fact that the Binary group of methods is meant only for dichotomous variables describing the occurrence/non-occurrence of an event. For the Interval and Binary variable types, there are several methods for determining distance. For variables with a nominal scale, SPSS offers only two methods: the Chi-square measure and the Phi-square measure. We recommend using the first as the more common.

The Method dialog box has a Transform Values area containing a Standardize field. It is used when variables with different scale types (for example, interval and nominal) take part in the cluster analysis. To use such variables in cluster analysis, they must be standardized, bringing them to a single, interval scale type. The most common method of standardization is z-standardization (Zscores): all variables are reduced to a single range of values from -3 to +3 and after the transformation are interval.
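A minimal sketch of z-standardization itself; SPSS's Zscores option corresponds to a column-wise (x - mean) / sd transform, available in scipy:

```python
import numpy as np
from scipy.stats import zscore

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])   # two variables on very different scales

Xz = zscore(X, axis=0, ddof=1)  # each column: mean 0, standard deviation 1
print(Xz)                       # roughly normal data lands within about +/-3
```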

Since all the optimal methods (of clustering and of determining distances) are set by default, it makes sense to use the Method dialog box only to specify the type of the analyzed variables and to request z-standardization when it is needed.

So, we have described all the main cluster analysis features provided by SPSS. Let us return to the cluster analysis carried out for the purpose of segmenting airlines. Recall that we settled on a sixteen-cluster solution and created a new variable clu16_1 in the original data file, distributing all the analyzed airlines into clusters.

To check how correctly we determined the optimal number of clusters, let's build a linear distribution of the clu16_1 variable (menu Analyze > Descriptive Statistics > Frequencies). As seen in Fig. 5.49, in clusters 5-16 the number of respondents ranges from 1 to 7. Along with the universal method for determining the optimal number of clusters described above (the difference between the total number of respondents and the step of the first jump in the agglomeration coefficient), there is an additional recommendation: the size of each cluster should be statistically and practically significant. With our sample size, such a critical value can be set at the level of at least 10 respondents. We see that only clusters 1-4 meet this condition. Therefore, the cluster analysis procedure must now be recalculated with a four-cluster solution (a new variable clu4_1 will be created).


Fig. 5.49.

Having built a linear distribution of the newly created variable clu4_1, we see that only two clusters (1 and 2) contain a practically significant number of respondents. We need to rebuild the cluster model once more, now for a two-cluster solution, and then construct the distribution of the variable clu2_1 (Fig. 5.50). As the table shows, the two-cluster solution has a statistically and practically significant number of respondents in each of the two formed clusters: 695 respondents in cluster 1 and 40 in cluster 2. Thus, we have determined the optimal number of clusters for our task and carried out the actual segmentation of the respondents according to the seven selected criteria. We can now consider the main goal achieved and proceed to the final stage of cluster analysis: the interpretation of the obtained target groups (segments). (A sketch of this stepwise size check follows Fig. 5.50.)


Fig. 5.50.
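The descent from a 16-cluster to a 4-cluster to a 2-cluster solution, with the size check after each step, can be sketched like this; the data are synthetic, and the threshold of at least 10 respondents is the one suggested above:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# Synthetic stand-in: one large and one small group of "respondents".
X = np.vstack([rng.normal(0, 1, (95, 7)), rng.normal(4, 1, (25, 7))])
Z = linkage(X, method="average")

for k in (16, 4, 2):  # mimic the descent 16 -> 4 -> 2 described above
    labels = fcluster(Z, t=k, criterion="maxclust")
    sizes = pd.Series(labels).value_counts().sort_index()
    # Keep only solutions in which every cluster holds >= 10 respondents.
    print(f"k={k}:", sizes.to_dict())
```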

The resulting solution differs somewhat from those you may have seen in SPSS textbooks. Even the most practically oriented textbooks give artificial examples in which clustering produces ideal target groups of respondents. In some cases (5) the authors even point directly to the artificial origin of the examples. In this tutorial, we use as an illustration a real example from practical marketing research, which is not characterized by ideal proportions. This allows us to show the most common difficulties in conducting cluster analysis, as well as the best methods for eliminating them.

Before proceeding with the interpretation of the resulting clusters, let's summarize. We have the following scheme for determining the optimal number of clusters.

¦ At stage 1, we determine the number of clusters using the mathematical method based on the agglomeration coefficient.

¦ At stage 2, we cluster the respondents into the obtained number of clusters and then build a linear distribution of the newly formed variable (clu16_1). Here we also determine how many clusters consist of a statistically significant number of respondents. In general, the minimum significant cluster size should be set at the level of at least 10 respondents.

¦ If all clusters satisfy this criterion, we proceed to the final stage of cluster analysis, the interpretation of clusters. If there are clusters with an insignificant number of observations, we determine how many clusters do consist of a significant number of respondents.

¦ We recalculate the cluster analysis procedure, specifying in the Save dialog box the number of clusters consisting of a significant number of observations.

¦ We build a linear distribution of the new variable.

This sequence of actions is repeated until a solution is found in which all clusters consist of a statistically significant number of respondents. After that, you can proceed to the final stage of cluster analysis, the interpretation of clusters.

It should be specially noted that the criterion of practical and statistical significance of cluster sizes is not the only one by which the optimal number of clusters can be determined. The researcher may, based on his own experience, propose the number of clusters himself (the significance condition must still be satisfied). Another fairly common situation is when, for the purposes of the study, the condition of segmenting respondents into a given number of target groups is set in advance. In that case, you simply run the hierarchical cluster analysis once, saving the required number of clusters, and then try to interpret what results.

To describe the resulting target segments, one should use the procedure for comparing the mean values of the studied variables (the cluster centroids). We will compare the mean values of the seven segmentation criteria in each of the two resulting clusters.
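Outside SPSS, this centroid comparison is a one-liner; a sketch on hypothetical ratings, where the column names q13-q19 and clu2_1 mirror the text but the values are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Hypothetical ratings q13..q19 and a two-cluster membership variable.
df = pd.DataFrame(rng.integers(1, 6, size=(100, 7)),
                  columns=[f"q{i}" for i in range(13, 20)])
df["clu2_1"] = rng.integers(1, 3, size=100)

# Mean of every segmentation criterion inside each cluster (the centroids).
print(df.groupby("clu2_1").mean().round(2))
```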

The procedure for comparing means is called using the Analyze > Compare Means > Means menu. In the dialog box that opens (Fig. 5.51), select the seven variables chosen as segmentation criteria (q13-q19) from the left list and transfer them to the Dependent List field. Then move the variable clu2_1, which reflects the division of respondents into clusters in the final (two-cluster) solution, from the left list to the Independent List field. Then click the Options button.

Fig. 5.51.

In the Options dialog box that opens, select the statistics needed for comparing the clusters (Fig. 5.52). To do this, in the Cell Statistics field, leave only the Mean values, removing the other default statistics. Close the Options dialog box with the Continue button. Finally, from the main Means dialog box, start the mean-comparison procedure (the OK button).

Fig. 5.52.

The results of the statistical procedure for comparing means will appear in the SPSS Viewer window. We are interested in the Report table (Fig. 5.53). From it you can see on what basis SPSS divided the respondents into two clusters. In our case, the criterion is the level of the ratings on the analyzed parameters. Cluster 1 consists of respondents whose mean scores on all segmentation criteria are at a relatively high level (4.40 points and above). Cluster 2 includes respondents who rated the segmentation criteria quite low (3.35 points and below). Thus, we can conclude that the 93.3% of respondents who formed cluster 1 rated the analyzed airlines as generally good in all respects; 5.4% rated them quite low; and 1.3% found it difficult to answer (see Fig. 5.50). From Fig. 5.53 one can also conclude which level of ratings is high and which is low for each parameter considered separately (a conclusion drawn from the respondents' own assessments, which allows a high classification accuracy to be achieved). From the Report table you can see that for the queue management variable a mean score of 4.40 is considered high, while for the appearance parameter the high level is 4.72.


Fig. 5.53.

It may turn out that in a similar case 4.5 is considered a high score for parameter X and only 3.9 for parameter Y. This is not a clustering error; on the contrary, it makes it possible to draw an important conclusion about the significance of the parameters for the respondents: for parameter Y a score of 3.9 is already a good rating, while for parameter X the respondents impose more stringent requirements.

We have identified two significant clusters differing in the level of mean scores on the segmentation criteria. Now we can assign labels to the clusters: cluster 1, airlines that meet the respondents' requirements (on the seven analyzed criteria); cluster 2, airlines that do not meet the respondents' requirements. Now we can see which particular airlines (coded in the q4 variable) meet the respondents' requirements and which do not. To do this, build a cross-distribution of the variable q4 (the analyzed airlines) against the clustering variable clu2_1. The results of such a cross-tabulation are presented in Fig. 5.54.
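A sketch of the same cross-distribution in pandas, with a handful of hypothetical rows; the airline names and cluster labels below are made up for illustration:

```python
import pandas as pd

# Hypothetical airline codes (q4) and final cluster labels (clu2_1).
df = pd.DataFrame({
    "q4":     ["Transaero", "Pulkovo", "Transaero", "Siberia", "Transaero"],
    "clu2_1": [1, 2, 1, 2, 2],
})
# Rows: airlines; columns: clusters; cells: numbers of respondents.
print(pd.crosstab(df["q4"], df["clu2_1"]))
```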

Based on this table, the following conclusions can be drawn regarding the membership of the studied airlines in the selected target segments.


Fig. 5.54.

1. Airlines that fully meet the requirements of all their customers in terms of the work of ground personnel (falling only into the first cluster):

¦ Vnukovo Airlines;

¦ American Airlines;

¦ Delta Airlines;

¦ Austrian Airlines;

¦ British Airways;

¦ Korean Airlines;

¦ Japan Airlines.

2. Airlines that meet the requirements of most of their customers in terms of the work of ground personnel (most of the respondents flying with these airlines are satisfied with the work of ground personnel):

¦ Transaero.

3. Airlines that do not meet the requirements of the majority of their customers in terms of the work of ground personnel (most of the respondents flying with these airlines are not satisfied with the work of ground personnel):

¦ Domodedovo Airlines;

¦ Pulkovo;

¦ Siberia;

¦ Ural Airlines;

¦ Samara Airlines.

Thus, three target segments of airlines were obtained by the level of average ratings, characterized by varying degrees of satisfaction of respondents with the work of ground personnel:

  • 1. the most attractive airlines for passengers in terms of the level of work of ground personnel (14);
  • 2. rather attractive airlines (1);
  • 3. rather unattractive airlines (7).

We have successfully completed all stages of cluster analysis and segmented airlines according to seven selected criteria.

Now let us describe the methodology of cluster analysis paired with factor analysis. We use the problem statement from section 5.2.1 (factor analysis). As already mentioned, in segmentation problems with a large number of variables it is advisable to precede cluster analysis with factor analysis, in order to reduce the segmentation criteria to the most significant ones. In our case, the original data file contains 24 variables; factor analysis allowed us to reduce their number to 5. This number of factors can be used effectively for cluster analysis, with the factors themselves serving as segmentation criteria.

If we are faced with the task of segmenting respondents according to their assessment of various aspects of the current competitive position of airline X, we can conduct a hierarchical cluster analysis on the five criteria identified (variables nfac1_1-nfac5_1). In our case, the variables were evaluated on different scales. For example, a score of 1 for the statement "I would not want the airline to change" and the same score for the statement "Changes in the airline will be a positive moment" are diametrically opposed in meaning: in the first case, 1 point (strongly disagree) means that the respondent welcomes changes in the airline; in the second case, a score of 1 indicates that the respondent rejects them. When interpreting the clusters we would inevitably run into difficulties, since variables that are so opposite in meaning can fall into the same factor. Thus, for the purposes of segmentation, it is recommended first to align the scales of the variables under study and then to recalculate the factor model, and only after that to carry out cluster analysis on the factor variables obtained from the factor analysis. We will not describe the factor and cluster analysis procedures in detail again (this was done above in the relevant sections). We only note that with this technique we ended up with three target groups of air passengers, differing in the level of their assessments of the selected factors (that is, of the groups of variables): the lowest, the average, and the highest.

A very useful application of cluster analysis is the grouping of frequency tables. Suppose we have a linear distribution of answers to the question "What brands of antivirus are installed in your organization?". To draw conclusions from this distribution, the antivirus brands need to be divided into several groups (usually 2-3). To divide all the brands into three groups (the most popular brands, brands of average popularity, and unpopular brands), it is best to use cluster analysis, although researchers typically separate the elements of frequency tables by eye, based on subjective considerations. In contrast to that approach, cluster analysis makes it possible to justify the grouping scientifically. To do this, enter the values of each parameter into SPSS (it is advisable to express them as percentages) and then perform a cluster analysis on these data. By saving the cluster solution for the required number of groups (3 in our case) as a new variable, we get a statistically valid grouping.
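A sketch of this grouping on hypothetical brand shares (the percentages below are invented): the one-dimensional "table" is clustered into three groups exactly as described.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical shares (%) of antivirus brands from a frequency table.
shares = np.array([42.0, 35.0, 11.0, 6.0, 3.0, 2.0, 1.0]).reshape(-1, 1)

Z = linkage(shares, method="average")
groups = fcluster(Z, t=3, criterion="maxclust")
print(groups)  # e.g. popular / medium-popularity / unpopular brands
```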

We devote the final part of this section to the use of cluster analysis for classifying variables and to comparing its results with the results of the factor analysis carried out in section 5.2.1. To do this, we again use the problem of assessing the current position of airline X in the air transportation market. The methodology almost completely repeats the one described above (when the respondents were segmented).

So, in the original data file we have 24 variables describing the respondents' attitudes toward various aspects of the current competitive position of airline X. Open the main Hierarchical Cluster Analysis dialog box and place the 24 variables (q1-q24) in the Variable(s) field (Fig. 5.55). In the Cluster area, indicate that variables are being classified (check the Variables option). You will see that the Save button has become unavailable: unlike factor analysis, cluster analysis cannot save the group scores for all respondents. Disable plotting by deactivating the Plots option. At the first step no other options are needed, so simply click OK to start the cluster analysis procedure.

The Agglomeration Schedule table appears in the SPSS Viewer window, from which we determine the optimal number of clusters using the method described above (Fig. 5.56). The first jump in the agglomeration coefficient is observed at step 20 (from 18834.000 to 21980.967). Given the total number of analyzed variables, 24, the optimal number of clusters is 24 - 20 = 4.
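A sketch of clustering variables rather than respondents: transpose the data matrix so that the columns (q1-q24) become the objects. The ratings below are random stand-ins for the real survey file.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
# Random stand-in for the survey file: 200 respondents, variables q1..q24.
X = rng.integers(1, 6, size=(200, 24)).astype(float)

# To classify variables instead of respondents, cluster the transposed matrix.
Z = linkage(X.T, method="average")
labels = fcluster(Z, t=4, criterion="maxclust")
for var, lab in zip([f"q{i}" for i in range(1, 25)], labels):
    print(var, "-> cluster", lab)
```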

Fig. 5.55.


Fig. 5.56.

When classifying variables, a cluster consisting of only one variable is practically and statistically significant. Therefore, since we have obtained an acceptable number of clusters by the mathematical method, no further checks are required. Instead, open the main cluster analysis dialog box again (all the data used in the previous step is preserved) and click the Statistics button to display the classification table. You will see a dialog box of the same name, in which you must specify the number of clusters into which the 24 variables are to be divided (Fig. 5.57). To do this, select the Single solution option and specify the required number of clusters in the corresponding field: 4. Now close the Statistics dialog box with the Continue button and run the procedure from the main cluster analysis window.

As a result, the Cluster Membership table will appear in the SPSS Viewer window, distributing the analyzed variables into four clusters (Fig. 5.58).

Fig. 5.58.

According to this table, each variable under consideration can be assigned to a specific cluster as follows.

Cluster 1

q1. Airline X has a reputation for excellent passenger service.

q2. Airline X can compete with the best airlines in the world.

q3. I believe that Airline X has a promising future in global aviation.

q5. I am proud to work for Airline X.

q9. We have a long way to go before we can claim to be a world class airline.

q10. Airline X really cares about passengers.

q13. I love how Airline X is presenting itself visually to the general public (in terms of colors and branding).

q14. Airline X is the face of Russia.

q16. Airline X service is consistent and recognizable throughout

q18. Airline X needs to change in order to exploit its full potential.

q19. I think Airline X needs to present itself visually in a more modern way.

q20. Changes in airline X will be a positive thing.

q21. Airline X is an efficient airline.

q22. I would like to see the image of airline X improve in terms of foreign passengers.

q23. Airline X is better than most people think.

q24. It is important that people all over the world know that we are a Russian airline.

Cluster 2

q4. I know what the future strategy of Airline X will be.

q6. Airline X has good communication between departments.

q7. Every employee of the airline makes every effort to ensure its success.

q8. Now Airline X is improving rapidly.

q11. There is a high degree of job satisfaction among airline employees.

q12. I believe that senior managers do their best to achieve the success of the airline.

Cluster 3

q15. We look like "yesterday" compared to other airlines.

Cluster 4

q17. I would not want airline X to change.

If you compare the results of the factor analysis (section 5.2.1) and the cluster analysis, you will see that they differ significantly. Cluster analysis not only provides far fewer possibilities for clustering variables (for example, it cannot save group scores) than factor analysis, but also produces much less visual results. In our case, while clusters 2, 3, and 4 can still be interpreted logically, cluster 1 contains statements that are quite different in meaning. In this situation, you can either try to describe cluster 1 as it is, or rebuild the statistical model with a different number of clusters. In the latter case, to find an optimal number of clusters that can be described logically, use the Range of solutions parameter in the Statistics dialog box (see Fig. 5.57), specifying the minimum and maximum number of clusters in the corresponding fields (in our case, 4 and 6, respectively). In this situation, SPSS rebuilds the Cluster Membership table for each number of clusters, and the analyst's task is to try to choose a classification model in which all clusters are interpreted unambiguously. To demonstrate the capabilities of the procedure for clustering variables, we will not rebuild the cluster model but will limit ourselves to what has been said above.

It should be noted that, despite the apparent simplicity of cluster analysis compared to factor analysis, in almost all cases of marketing research factor analysis proves faster and more effective than cluster analysis. Therefore, for the classification (reduction) of variables we strongly recommend factor analysis, leaving cluster analysis for the classification of respondents.

Classification analysis is perhaps one of the most complex statistical tools from the point of view of an unprepared user, which explains its very low prevalence in marketing companies. Yet this group of statistical methods is also one of the most useful for practitioners in the field of marketing research.

Cluster analysis is...

Good day. I have a lot of respect for people who are passionate about their work.

My friend Maxim belongs to this category. He constantly works with figures, analyzes them, and prepares the relevant reports.

Yesterday we had lunch together, and for almost half an hour he told me about cluster analysis: what it is and in what cases its application is reasonable and expedient. Well, what about me?

I have a good memory, so I will pass all of this on to you in its original and most informative form.

Cluster analysis is designed to divide a set of objects into homogeneous groups (clusters or classes). This is a task of multivariate data classification.

There are about 100 different clustering algorithms; however, the most commonly used are hierarchical cluster analysis and k-means clustering.

Where is cluster analysis applied? In marketing, this is the segmentation of competitors and consumers.

In management: dividing personnel into groups with different levels of motivation, classifying suppliers, and identifying similar production situations in which defects occur.

In medicine, the classification of symptoms, patients, drugs. In sociology, the division of respondents into homogeneous groups. In fact, cluster analysis has proven itself well in all spheres of human life.

The beauty of this method is that it works even when there is little data and the requirements for the normality of distributions of random variables and other requirements of classical methods of statistical analysis are not met.

Let us explain the essence of cluster analysis without resorting to strict terminology:
Let's say you conducted a survey of employees and want to determine how you can most effectively manage your staff.

That is, you want to divide employees into groups and select the most effective control levers for each of them. At the same time, the differences between groups should be obvious, and within the group, the respondents should be as similar as possible.

To solve the problem, it is proposed to use hierarchical cluster analysis.

As a result, we will get a tree, looking at which we must decide how many classes (clusters) we want to split the staff into.

Suppose we decide to divide the staff into three groups; then, studying the respondents who fell into each cluster, we get a table with the following content:


Let us explain how the above table is formed. The first column contains the number of the cluster — the group whose data is reflected in the row.

For example, the first cluster is 80% male. 90% of the first cluster fall into the age group from 30 to 50 years old, and 12% of respondents believe that benefits are very important. Etc.

Let's try to make portraits of respondents of each cluster:

  1. The first group consists mostly of middle-aged men holding leadership positions. The social package (MED, LGOTI, TIME-free time) does not interest them; they prefer a good salary to help from the employer.
  2. The second group, on the contrary, prefers the social package. It consists mainly of older people occupying low positions. Salary is certainly important to them, but there are other priorities.
  3. The third group is the "youngest". Unlike the previous two, it shows an obvious interest in learning and in professional growth opportunities. This category of employees has a good chance of joining the first group soon.

Thus, when planning a campaign to introduce effective personnel management methods, it is obvious that in our situation the social package for the second group can be increased at the expense of, for example, wages.

If we talk about which specialists should be sent for training, then we can definitely recommend paying attention to the third group.

Source: http://www.nickart.spb.ru/analysis/cluster.php

Features of cluster analysis

In trading, a cluster is the price of an asset during a certain period of time in which transactions were made. The resulting volume of purchases and sales is indicated by a number within the cluster.

The bar of any timeframe (TF) contains, as a rule, several clusters. This allows you to see in detail the volumes of purchases and sales and their balance in each individual bar, at each price level.


A change in the price of one asset inevitably entails a chain of price movements on other instruments as well.

Attention!

In most cases, the understanding of the trend movement occurs already at the moment when it is developing rapidly, and entering the market along the trend is fraught with falling into a corrective wave.

For successful trades, it is necessary to understand the current situation and be able to anticipate future price movements. This can be learned by analyzing the cluster graph.

With the help of cluster analysis, you can see the activity of market participants inside even the smallest price bar. This is the most accurate and detailed analysis, as it shows the point distribution of transaction volumes for each asset price level.

In the market there is a constant confrontation between the interests of sellers and buyers. And every smallest price movement (a tick) is a move toward a compromise - a price level - that suits both parties at that moment.

But the market is dynamic, the number of sellers and buyers is constantly changing. If at one point in time the market was dominated by sellers, then the next moment, most likely, there will be buyers.

The number of completed transactions at neighboring price levels is also not the same. And yet the market situation is reflected first in the total volume of transactions, and only then in the price.

If you see the actions of the dominant market participants (sellers or buyers), then you can predict the price movement itself.

To successfully apply cluster analysis, you first need to understand what a cluster and a delta are.


A cluster is a price movement divided into levels at which transactions of known volume were made. The delta shows the difference between the buying and the selling occurring in each cluster.

Each cluster, or group of deltas, allows you to figure out whether buyers or sellers dominate the market at a given time.

It is enough to calculate the total delta as the difference between purchases and sales. If the delta is negative, the market is oversold and there is an excess of sell transactions. When the delta is positive, buyers clearly dominate the market.

The delta itself can take on a normal or critical value. The value of the delta volume above the normal value in the cluster is highlighted in red.

If the delta is moderate, then this characterizes a flat state in the market. With a normal delta value, a trend movement is observed in the market, but a critical value is always a harbinger of a price reversal.
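To make this bookkeeping concrete, here is a minimal sketch of the delta logic in Python. The threshold values separating moderate, normal, and critical deltas are hypothetical; in practice they would be calibrated to the instrument's typical volumes.

    # Sketch of per-cluster delta classification; thresholds are hypothetical.
    def classify_delta(buy_volume, sell_volume, normal=100, critical=500):
        delta = buy_volume - sell_volume   # positive: buyers dominate
        if abs(delta) >= critical:
            state = "critical: possible price reversal ahead"
        elif abs(delta) >= normal:
            state = "normal: trend movement"
        else:
            state = "moderate: flat market"
        return delta, state

    print(classify_delta(buy_volume=620, sell_volume=80))
    # (540, 'critical: possible price reversal ahead')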

Forex trading with cluster analysis

To get the maximum profit, you need to be able to determine the transition of the delta from a moderate level to a normal one. Indeed, in this case, you can notice the very beginning of the transition from a flat to a trend movement and be able to get the most profit.

The cluster chart is more visual: on it you can see significant levels of accumulation and distribution of volumes and build support and resistance levels. This allows the trader to find a precise entry into the trade.

Using the delta, one can judge the predominance of sales or purchases in the market. Cluster analysis allows you to observe transactions and track their volumes inside the bar of any TF.

This is especially important when approaching significant support or resistance levels. Reading clusters is the key to understanding the market.

Source: http://orderflowtrading.ru/analitika-rynka/obemy/klasternyy-analiz/

Areas and features of application of cluster analysis

The term cluster analysis (first introduced by Tryon, 1939) actually includes a set of different classification algorithms.

A general question asked by researchers in many fields is how to organize observed data into visual structures, i.e., to develop taxonomies.

In accordance with the modern system accepted in biology, humans belong to the primates, mammals, amniotes, vertebrates, and animals.

Note that in this classification, the higher the level of aggregation, the less similarity between members in the corresponding class.

Humans have more in common with other primates (i.e., apes) than with "distant" members of the class of mammals (e.g., dogs), and so on.

Note that the previous discussion refers to clustering algorithms, but does not mention anything about testing for statistical significance.

In fact, cluster analysis is not so much an ordinary statistical method as a “set” of various algorithms for “distributing objects into clusters”.

There is a point of view that, unlike many other statistical procedures, cluster analysis methods are used in most cases when you do not have any a priori hypotheses about the classes, but are still in the descriptive stage of the study.

Attention!

It should be understood that cluster analysis finds the "most likely meaningful solution".

Therefore, testing for statistical significance is not really applicable here, even in cases where p-levels are known (as in the K-means method, for example).

The clustering technique is used in a wide variety of fields. Hartigan (1975) has provided an excellent overview of many published studies containing results obtained by cluster analysis methods.

For example, in the field of medicine, the clustering of diseases, treatment of diseases, or symptoms of diseases leads to widely used taxonomies.

In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is critical to successful therapy. In archeology, using cluster analysis, researchers are trying to establish taxonomies of stone tools, funeral objects, etc.

There are wide applications of cluster analysis in marketing research. In general, whenever it is necessary to classify "mountains" of information into groups suitable for further processing, cluster analysis turns out to be very useful and effective.

Tree Clustering

The example in the Primary Purpose section explains the purpose of the join (tree clustering) algorithm.

The purpose of this algorithm is to combine objects (for example, animals) into sufficiently large clusters using some measure of similarity or distance between objects. A typical result of such clustering is a hierarchical tree.

Consider a horizontal tree diagram. The diagram begins with each object in its own class (on the left side of the diagram).

Now imagine that gradually (in very small steps) you "weaken" your criterion for what objects are unique and what are not.

In other words, you lower the threshold related to the decision to combine two or more objects into one cluster.

As a result, you link more and more objects together and aggregate (combine) more and more clusters of increasingly different elements.

Finally, in the last step, all objects are merged together. In these charts, the horizontal axes represent the pooling distance (in vertical dendrograms, the vertical axes represent the pooling distance).

So, for each node in the graph (where a new cluster is formed) you can see the distance at which the corresponding elements were linked into a new single cluster.

When the data has a clear "structure" in terms of clusters of objects that are similar to each other, then this structure is likely to be reflected in the hierarchical tree by various branches.

As a result of successful analysis by the join method, it becomes possible to detect clusters (branches) and interpret them.

The joining or tree clustering method is used to form clusters based on measures of dissimilarity or distance between objects. These distances can be defined in one-dimensional or multidimensional space.

For example, if you have to cluster the types of food in a cafe, you can take into account the number of calories they contain, the price, a subjective assessment of taste, etc.

The most direct way to calculate distances between objects in a multidimensional space is to calculate Euclidean distances.

If you have a 2D or 3D space, then this measure is the actual geometric distance between objects in space (as if the distances between objects were measured with a tape measure).

However, the joining algorithm does not "care" whether the distances provided to it are real geometric distances or some other derived distance measure that is more meaningful to the researcher; the challenge for researchers is to select the right measure for their specific application.

Euclidean distance. This seems to be the most common type of distance. It is simply the geometric distance in multidimensional space and is calculated as follows:

$d(x, y) = \left\{ \sum_i (x_i - y_i)^2 \right\}^{1/2}$

Note that the Euclidean distance (and its square) is calculated from the original data, not from the standardized data.

This is the usual way of calculating it, which has certain advantages (for example, the distance between two objects does not change when a new object is introduced into the analysis, which may turn out to be an outlier).

Attention!

However, distances can be greatly affected by differences between the axes from which the distances are calculated. For example, if one of the axes is measured in centimeters, and then you convert it to millimeters (by multiplying the values ​​by 10), then the final Euclidean distance (or the square of the Euclidean distance) calculated from the coordinates will change dramatically, and, as a result, the results of the cluster analysis can be very different from the previous ones.

The square of the Euclidean distance. Sometimes you may want to square the standard Euclidean distance to give more weight to more distant objects.

This distance is calculated as follows:

$d(x, y) = \sum_i (x_i - y_i)^2$

City-block distance (Manhattan distance). This distance is simply the sum of the absolute differences over the coordinates.

In most cases, this distance measure leads to the same results as the ordinary Euclidean distance.

However, note that for this measure the influence of individual large differences (outliers) decreases (because they are not squared). The Manhattan distance is calculated using the formula:

$d(x, y) = \sum_i |x_i - y_i|$

Chebyshev distance. This distance can be useful when one wishes to define two objects as "different" if they differ in any one coordinate (any one dimension). The Chebyshev distance is calculated by the formula:

$d(x, y) = \max_i |x_i - y_i|$

Power distance. It is sometimes desired to progressively increase or decrease the weight related to a dimension for which the corresponding objects are very different.

This can be achieved using the power distance, which is calculated by the formula:

$d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/r}$

where r and p are user-defined parameters. A few examples of calculations can show how this measure "works".

The p parameter is responsible for the gradual weighting of differences in individual coordinates, the r parameter is responsible for the progressive weighting of large distances between objects. If both parameters - r and p, are equal to two, then this distance coincides with the Euclidean distance.

The percentage of disagreement. This measure is used when the data are categorical. The distance is calculated by the formula:

$d(x, y) = \dfrac{\text{number of coordinates with } x_i \neq y_i}{n}$

where $n$ is the number of coordinates.
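All of these measures are available off the shelf. Here is a minimal sketch using SciPy; the percent disagreement corresponds to SciPy's Hamming distance, and the power distance in the special case r = p corresponds to the Minkowski distance. The vectors are illustrative.

    import numpy as np
    from scipy.spatial import distance

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])

    print(distance.euclidean(x, y))       # sqrt(1 + 4 + 9) ~= 3.742
    print(distance.sqeuclidean(x, y))     # 14.0
    print(distance.cityblock(x, y))       # 1 + 2 + 3 = 6 (Manhattan)
    print(distance.chebyshev(x, y))       # max(1, 2, 3) = 3
    print(distance.minkowski(x, y, p=4))  # power distance with r = p = 4

    # Percent disagreement on categorical data coded as integers:
    a = np.array([0, 1, 0])   # e.g. 0 = "red", 1 = "blue"
    b = np.array([0, 2, 0])
    print(distance.hamming(a, b))         # 1 of 3 coordinates differ ~= 0.333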

Amalgamation or linkage rules

At the first step, when each object is a separate cluster, the distances between these objects are determined by the chosen measure.

However, when several objects are linked together, the question arises, how should the distances between clusters be determined?

In other words, you need a join or link rule for two clusters. There are various possibilities here: for example, you can link two clusters together when any two objects in the two clusters are closer to each other than the corresponding link distance.

In other words, you use the "nearest neighbor rule" to determine the distance between clusters; this method is called the single link method.

This rule builds "fibrous" clusters, i.e. clusters "linked together" only by individual elements that happen to be closer to each other than the others.

Alternatively, you can define the distance using the pair of objects in the two clusters that are farthest from each other. This method is called the complete link method.

There are also many other methods for joining clusters, similar to those that have been discussed.

Single connection (nearest neighbor method). As described above, in this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters.

This rule must, in a sense, string objects together to form clusters, and the resulting clusters tend to be represented by long "strings".

Full connection (method of the most distant neighbors). In this method, the distances between clusters are defined as the largest distance between any two objects in different clusters (i.e. "most distant neighbors").

Unweighted pairwise mean. In this method, the distance between two different clusters is calculated as the average distance between all pairs of objects in them.

The method is effective when objects actually form distinct "clumps," but it works equally well with extended ("chain"-type) clusters.

Note that in their book Sneath and Sokal (1973) introduce the abbreviation UPGMA to refer to this method as the unweighted pair-group method using arithmetic averages.

Weighted pairwise mean. The method is identical to the unweighted pairwise average method, except that the size of the respective clusters (i.e., the number of objects they contain) is used as a weighting factor in the calculations.

Therefore, the proposed method should be used (rather than the previous one) when unequal cluster sizes are assumed.

Sneath and Sokal (1973) introduce the abbreviation WPGMA to refer to this method as the weighted pair-group method using arithmetic averages.

Unweighted centroid method. In this method, the distance between two clusters is defined as the distance between their centers of gravity.

Attention!

Sneath and Sokal (1973) use the acronym UPGMC to refer to this method as the unweighted pair-group method using the centroid average.

Weighted centroid method (median). This method is identical to the previous one, except that weights are used in the calculations to take into account the difference between cluster sizes (i.e., the number of objects in them).

Therefore, if there are (or are suspected) significant differences in cluster sizes, this method is preferable to the previous one.

Sneath and Sokal (1973) used the abbreviation WPGMC to refer to it as the weighted pair-group method using the centroid average.

Ward method. This method is different from all other methods because it uses ANOVA methods to estimate distances between clusters.

The method minimizes the sum of squares (SS) for any two (hypothetical) clusters that can be formed at each step.

Details can be found in Ward (1963). In general, the method seems to be very efficient, but it tends to create small clusters.
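For reference, all of the joining rules above have direct counterparts in scipy.cluster.hierarchy. A small sketch on illustrative data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 1, size=(5, 2)),    # one "clump"
                      rng.normal(5, 1, size=(5, 2))])   # another "clump"

    # single = nearest neighbor, complete = farthest neighbor,
    # average = UPGMA, weighted = WPGMA, centroid = UPGMC,
    # median = WPGMC, ward = Ward's minimum-variance method.
    for method in ["single", "complete", "average", "weighted",
                   "centroid", "median", "ward"]:
        tree = linkage(data, method=method)
        # tree[-1, 2] is the distance at which the last two clusters merge.
        print(method, round(float(tree[-1, 2]), 2))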

Earlier this method was discussed in terms of "objects" that should be clustered. In all other types of analysis, the question of interest to the researcher is usually expressed in terms of observations or variables.

It turns out that clustering, both by observations and by variables, can lead to quite interesting results.

For example, imagine that a medical researcher is collecting data on various characteristics (variables) of patients' conditions (observations) with heart disease.

The investigator may wish to cluster observations (of patients) to identify clusters of patients with similar symptoms.

At the same time, the researcher may wish to cluster variables to identify clusters of variables that are associated with a similar physical state.

After this discussion regarding whether to cluster observations or variables, one might ask, why not cluster in both directions?

The Cluster Analysis module contains an efficient two-way join procedure to do just that.

However, two-way pooling is used relatively rarely, in circumstances where both observations and variables are expected to contribute simultaneously to the discovery of meaningful clusters.

So, returning to the previous example, we can assume that a medical researcher needs to identify clusters of patients that are similar in relation to certain clusters of physical condition characteristics.

The difficulty in interpreting the results obtained arises from the fact that the similarities between different clusters may come from (or be the cause of) some difference in the subsets of variables.

Therefore, the resulting clusters are inherently heterogeneous. Perhaps it seems a bit hazy at first; indeed, compared to other cluster analysis methods described, two-way pooling is probably the least commonly used method.

However, some researchers believe that it offers a powerful tool for exploratory data analysis (for more information, see Hartigan's description of this method (Hartigan, 1975)).

K-means method

This clustering method differs significantly from agglomerative methods such as Union (tree clustering) and Two-Way Union. Suppose you already have hypotheses about the number of clusters (by observation or by variable).

You can tell the system to form exactly three clusters so that they are as different as possible.

This is exactly the type of problem that the K-Means algorithm solves. In general, the K-means method builds exactly K distinct clusters spaced as far apart as possible.

In the physical condition example, a medical researcher may have a “hunch” from their clinical experience that their patients generally fall into three different categories.

Attention!

If so, then the means of the various measures of physical parameters for each cluster would provide a quantitative way of representing the investigator's hypotheses (e.g., patients in cluster 1 have a high value of parameter 1, a lower value of parameter 2, etc.).

From a computational point of view, you can think of this method as an analysis of variance "in reverse". The program starts with K randomly selected clusters, and then changes the belonging of objects to them in order to:

  1. minimize variability within clusters,
  2. maximize variability between clusters.

This method is similar to reverse analysis of variance (ANOVA) in that the significance test in ANOVA compares between-group versus within-group variability in testing the hypothesis that group means are different from each other.

In K-means clustering, the program moves objects (i.e., observations) from one group (cluster) to another in order to obtain the most significant result when performing analysis of variance (ANOVA).

Typically, once the results of a K-means cluster analysis are obtained, one can calculate the means for each cluster for each dimension to assess how the clusters differ from each other.

Ideally, you should get very different means for most, if not all, of the measurements used in the analysis.
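Here is a minimal sketch of this workflow with scikit-learn. The data are synthetic: three loose "categories" of patients are simulated, then recovered with K-means and summarized by per-cluster means.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    # 30 synthetic "patients" x 4 physical parameters, in 3 loose categories.
    X = rng.normal(size=(30, 4)) + rng.integers(0, 3, size=(30, 1)) * 3.0

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Per-cluster means of each measurement: the quantitative "portrait"
    # used to judge how the clusters differ from each other.
    for k in range(3):
        print("cluster", k, X[km.labels_ == k].mean(axis=0).round(2))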

Source: http://www.biometrica.tomsk.ru/textbook/modules/stcluan.html

Classification of objects according to their characteristics

Cluster analysis is a set of multidimensional statistical methods for classifying objects according to their characteristics: dividing a collection of objects into homogeneous groups that are close in terms of the defining criteria, and selecting the objects of a certain group.

A cluster is a group of objects identified as a result of cluster analysis based on a given measure of similarity or difference between objects.

Objects are the specific subjects of study that need to be classified. The objects in a classification are, as a rule, observations: for example, consumers of products, countries or regions, products, etc.

However, it is also possible to carry out cluster analysis on variables. Classification of objects in multidimensional cluster analysis occurs according to several criteria simultaneously.

These can be both quantitative and categorical variables, depending on the method of cluster analysis. So, the main goal of cluster analysis is to find groups of similar objects in the sample.

The set of multivariate statistical methods of cluster analysis can be divided into hierarchical methods (agglomerative and divisive) and non-hierarchical (k-means method, two-stage cluster analysis).

However, there is no generally accepted classification of methods, and sometimes cluster analysis methods also include methods for constructing decision trees, neural networks, discriminant analysis, and logistic regression.

The scope of cluster analysis, due to its versatility, is very wide. Cluster analysis is used in economics, marketing, archeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, sociology and other areas.

Here are some examples of applying cluster analysis:

  • medicine - classification of diseases, their symptoms, methods of treatment, classification of patient groups;
  • marketing - the tasks of optimizing the company's product line, segmenting the market by groups of goods or consumers, identifying a potential consumer;
  • sociology - division of respondents into homogeneous groups;
  • psychiatry - correct diagnosis of symptom groups is crucial for successful therapy;
  • biology - classification of organisms by group;
  • economy - classification of subjects of the Russian Federation by investment attractiveness.

Source: http://www.statmethods.ru/konsalting/statistics-methody/121-klasternyj-analiz.html

General information about cluster analysis

Cluster analysis includes a set of different classification algorithms. A common question asked by researchers in many fields is how to organize observed data into visual structures.

For example, biologists aim to break down animals into different species in order to meaningfully describe the differences between them.

The task of cluster analysis is to divide the initial set of objects into groups of similar, close objects. These groups are called clusters.

In other words, cluster analysis is one of the ways to classify objects according to their characteristics. It is desirable that the classification results have a meaningful interpretation.

The results obtained by cluster analysis methods are used in various fields. In marketing, it is the segmentation of competitors and consumers.

In psychiatry, the correct diagnosis of symptoms such as paranoia, schizophrenia, etc. is crucial for successful therapy.

In management, the classification of suppliers is important, as is the identification of similar production situations in which defects occur. In sociology, the division of respondents into homogeneous groups. In portfolio investment, it is important to group securities by the similarity of their return trends in order to compile, based on the information obtained about the stock market, an optimal investment portfolio that maximizes return on investment for a given degree of risk.

In general, whenever it is necessary to classify a large amount of information of this kind and present it in a form suitable for further processing, cluster analysis turns out to be very useful and effective.

Cluster analysis allows considering a fairly large amount of information and greatly compressing large arrays of socio-economic information, making them compact and visual.

Attention!

Cluster analysis is of great importance for sets of time series characterizing economic development (for example, general economic and commodity market conditions).

Here it is possible to single out the periods when the values ​​of the corresponding indicators were quite close, as well as to determine the groups of time series, the dynamics of which are most similar.

In the problems of socio-economic forecasting, it is very promising to combine cluster analysis with other quantitative methods (for example, with regression analysis).

Advantages and disadvantages

Cluster analysis allows for an objective classification of any objects that are characterized by a number of features. There are a number of benefits to be derived from this:

  1. The resulting clusters can be interpreted, that is, one can describe what kind of groups actually exist.
  2. Individual clusters can be culled. This is useful in cases where certain errors were made in the data set, as a result of which the values ​​of indicators for individual objects deviate sharply. When applying cluster analysis, such objects fall into a separate cluster.
  3. For further analysis, only those clusters that have the characteristics of interest can be selected.

Like any other method, cluster analysis has certain disadvantages and limitations. In particular, the composition and number of clusters depends on the selected partitioning criteria.

When reducing the initial data array to a more compact form, certain distortions may occur, and the individual features of individual objects may also be lost due to their replacement by the characteristics of the generalized values ​​of the cluster parameters.

Methods

Currently, more than a hundred different clustering algorithms are known. Their diversity is explained not only by different computational methods, but also by different concepts underlying clustering.

The Statistica package implements the following clustering methods.

  • Hierarchical algorithms - tree clustering. Hierarchical algorithms are based on the idea of ​​sequential clustering. At the initial step, each object is considered as a separate cluster. At the next step, some of the clusters closest to each other will be combined into a separate cluster.
  • K-means method. This method is the most commonly used. It belongs to the group of so-called reference methods of cluster analysis. The number of clusters K is set by the user.
  • Two way association. When using this method, clustering is carried out simultaneously both by variables (columns) and by observation results (rows).

The two-way join procedure is performed when it can be expected that simultaneous clustering on variables and observations will provide meaningful results.

The results of the procedure are descriptive statistics on variables and cases, as well as a two-dimensional color chart on which data values ​​are color-coded.

By the distribution of color, you can get an idea of homogeneous groups.

Normalization of variables

The division of the initial set of objects into clusters is associated with the calculation of distances between objects and the choice of objects, the distance between which is the smallest of all possible.

The most commonly used is the Euclidean (geometric) distance familiar to all of us. This metric corresponds to intuitive ideas about the proximity of objects in space (as if the distances between objects were measured with a tape measure).

But for a given metric, the distance between objects can be strongly affected by changes in scales (units of measurement). For example, if one of the features is measured in millimeters and then its value is converted to centimeters, the Euclidean distance between objects will change dramatically. This will lead to the fact that the results of cluster analysis may differ significantly from the previous ones.

If the variables are measured in different units of measurement, then their preliminary normalization is required, that is, the transformation of the initial data, which converts them into dimensionless quantities.

Note, however, that normalization strongly distorts the geometry of the original feature space, which can change the results of clustering.

In the Statistica package, any variable x is normalized according to the formula:

$z = \dfrac{x - \bar{x}}{s_x}$

where $\bar{x}$ is the mean and $s_x$ is the standard deviation of $x$.

To do this, right-click on the variable name and select the sequence of commands from the menu that opens: Fill / Standardize Block / Standardize Columns. The mean of the normalized variable becomes equal to zero, and the variance becomes equal to one.
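Outside Statistica, the same standardization is a one-liner. A minimal sketch (using the sample standard deviation, which is what makes each column's variance exactly one; the data are illustrative):

    import numpy as np

    X = np.array([[170.0, 65.0],
                  [180.0, 80.0],
                  [160.0, 55.0]])   # e.g. height in cm, weight in kg

    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # column-wise z-scores
    print(Z.mean(axis=0).round(10))   # ~ [0. 0.]
    print(Z.var(axis=0, ddof=1))      # [1. 1.]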

K-means method in Statistica

The K-means method splits a set of objects into a given number K of different clusters located at the greatest possible distance from each other.

Typically, once the results of a K-means cluster analysis are obtained, one can calculate the averages for each cluster for each dimension to assess how the clusters differ from each other.

Ideally, you should get very different means for most of the measurements used in the analysis.

The F-statistic values ​​obtained for each dimension are another indicator of how well the corresponding dimension discriminates between clusters.

As an example, consider the results of a survey of 17 employees of an enterprise on satisfaction with career quality indicators. The table contains the answers to the questionnaire questions on a ten-point scale (1 is the minimum score, 10 is the maximum).

The variable names correspond to the answers to the following questions:

  1. SLT - a combination of personal goals and the goals of the organization;
  2. OSO - a sense of fairness in wages;
  3. TBD - territorial proximity to the house;
  4. PEW - a sense of economic well-being;
  5. CR - career growth;
  6. ZhSR - the desire to change jobs;
  7. OSB is a sense of social well-being.

Using this data, it is necessary to divide the employees into groups and select the most effective control levers for each of them.

At the same time, the differences between groups should be obvious, and within the group, the respondents should be as similar as possible.

To date, most sociological surveys give only a percentage of votes: the main number of positive answers is considered, or the percentage of those who are dissatisfied, but this issue is not systematically considered.

Most often, the survey does not show trends in the situation. In some cases, it is necessary to count not the number of people who are “for” or “against”, but the distance, or the measure of similarity, that is, to determine groups of people who think about the same.

Cluster analysis procedures can be used to identify, on the basis of survey data, some really existing relationships of features and generate their typology on this basis.

Attention!

The presence of any a priori hypotheses of a sociologist when working with cluster analysis procedures is not a necessary condition.

In the Statistica program, cluster analysis is performed as follows.

When choosing the number of clusters, be guided by the following: the number of clusters, if possible, should not be too large.

The distance at which the objects of a given cluster were joined should, if possible, be much less than the distance at which something else joins this cluster.

When choosing the number of clusters, most often there are several correct solutions at the same time.

We are interested, for example, in how the answers to the questionnaire differ between ordinary employees and the management of the enterprise, so we choose K=2. For further segmentation, the number of clusters can be increased. The program offers three ways to choose the initial cluster centers:

  1. select observations with the maximum distance between cluster centers;
  2. sort distances and select observations at regular intervals (default setting);
  3. take the first observation centers and attach the rest of the objects to them.

Option 1 is suitable for our purposes.

Many clustering algorithms often “impose” a structure that is not inherent in the data and disorient the researcher. Therefore, it is extremely necessary to apply several cluster analysis algorithms and draw conclusions based on a general assessment of the results of the algorithms.

The results of the analysis can be viewed in the dialog box that appears:

If you select the Graph of means tab, a graph of the coordinates of the cluster centers will be plotted:


Each broken line on this graph corresponds to one of the clusters. Each division of the horizontal axis of the graph corresponds to one of the variables included in the analysis.

The vertical axis corresponds to the average values ​​of the variables for the objects included in each of the clusters.

It can be noted that there are significant differences in the attitude of the two groups of people to a service career on almost all issues. Only in one issue is there complete unanimity - in the sense of social well-being (OSB), or rather, the lack of it (2.5 points out of 10).

It can be assumed that cluster 1 represents workers and cluster 2 management. Managers are more satisfied with career growth (CR) and with the combination of personal goals and the goals of the organization (SLT).

They have a higher sense of economic well-being (PEW) and a greater sense of fairness in wages (OSO).

They are less concerned about territorial proximity to the house than workers, probably because of fewer transportation problems. Managers also have less desire to change jobs (ZhSR).

Despite the fact that workers are divided into two categories, they give relatively the same answers to most questions. In other words, if something does not suit the general group of employees, the same does not suit senior management, and vice versa.

The similar shape of the two profiles allows us to conclude that the well-being of one group is reflected in the well-being of the other.

Cluster 1 is not satisfied with the territorial proximity to the house. This group is the main part of the workers who mainly come to the enterprise from different parts of the city.

Therefore, it is possible to offer the top management to allocate part of the profits to the construction of housing for the employees of the enterprise.

Significant differences are seen in the attitude of the two groups of people to a service career. Those employees who are satisfied with career growth, who have a high coincidence of personal goals and the goals of the organization, do not have a desire to change jobs and feel satisfaction with the results of their work.

Conversely, employees who want to change jobs and are dissatisfied with the results of their work are not satisfied with the above indicators. Senior management should pay special attention to the current situation.

The results of the analysis of variance for each attribute are displayed by pressing the Analysis of variance button.

The sums of squares of deviations of objects from cluster centers (SS Within), the sums of squares of deviations between cluster centers (SS Between), the F-statistic values, and the p significance levels are displayed.
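The same per-variable ANOVA can be reproduced outside Statistica. A sketch assuming the survey answers sit in an array X with one column per question and cluster labels from the K-means run; the data below are random stand-ins, not the actual questionnaire responses.

    import numpy as np
    from scipy.stats import f_oneway

    def cluster_anova(X, labels, names):
        """Print the F-statistic and p-value of each variable across clusters."""
        for j, name in enumerate(names):
            groups = [X[labels == k, j] for k in np.unique(labels)]
            f, p = f_oneway(*groups)
            print(f"{name}: F = {f:.2f}, p = {p:.4f}")

    # Illustrative call with stand-in data for the 17 respondents:
    rng = np.random.default_rng(2)
    X = rng.integers(1, 11, size=(17, 7)).astype(float)
    labels = np.array([0] * 9 + [1] * 8)   # stand-in cluster assignment
    cluster_anova(X, labels, ["SLT", "OSO", "TBD", "PEW", "CR", "ZhSR", "OSB"])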

Attention!

For our example, the significance levels for the two variables are quite large, which is explained by the small number of observations. In the full version of the study, which can be found in the work, the hypotheses about the equality of the means for the cluster centers are rejected at significance levels less than 0.01.

The Save classifications and distances button displays the numbers of objects included in each cluster and the distances of objects to the center of each cluster.

The table shows the case numbers (CASE_NO) that make up the clusters with CLUSTER numbers and the distances from the center of each cluster (DISTANCE).

Information about objects belonging to clusters can be written to a file and used in further analysis. In this example, a comparison of the results obtained with the questionnaires showed that cluster 1 consists mainly of ordinary workers, and cluster 2 - of managers.

Thus, it can be seen that when processing the results of the survey, cluster analysis turned out to be a powerful method that allows drawing conclusions that cannot be reached by constructing a histogram of averages or by calculating the percentage of those satisfied with various indicators of the quality of working life.

Tree clustering is an example of a hierarchical algorithm, the principle of which is to sequentially cluster first the closest, and then more and more distant elements from each other into a cluster.

Most of these algorithms start from a matrix of similarity (distances), and each individual element is considered at first as a separate cluster.

After loading the cluster analysis module and selecting Joining (tree clustering), you can change the following parameters in the clustering parameters entry window:

  • Initial data (Input). They can be in the form of a matrix of the studied data (Raw data) and in the form of a matrix of distances (Distance matrix).
  • Clustering (Cluster) observations (Cases (raw)) or variables (Variable (columns)), describing the state of the object.
  • Distance measures. Here you can select the following measures: Euclidean distances, squared Euclidean distances, city-block (Manhattan) distance, Chebychev distance metric, power distance (Power...), and the percentage of disagreement (Percent disagreement).
  • Clustering method (Amalgamation (linkage) rule). The following options are possible here: Single Linkage, Complete Linkage, Unweighted pair-group average, Weighted pair-group average, Unweighted pair-group centroid, Weighted pair-group centroid (median), and Ward's method.

As a result of clustering, a horizontal or vertical dendrogram is built - a graph on which the distances between objects and clusters are determined when they are sequentially combined.

The tree structure of the graph allows you to define clusters depending on the selected threshold - a given distance between clusters.
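Cutting the tree at a chosen threshold can be sketched with SciPy's fcluster; the data and the 2.5 threshold here are illustrative.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    data = rng.normal(size=(8, 3))
    tree = linkage(data, method="average")

    # Objects whose merge distance is below the threshold share a cluster.
    labels = fcluster(tree, t=2.5, criterion="distance")
    print(labels)   # cluster number assigned to each of the 8 objects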

In addition, the matrix of distances between the original objects (Distance matrix) is displayed, along with the means and standard deviations for each source object (Descriptive statistics).

For the considered example, we will carry out a cluster analysis of variables with default settings. The resulting dendrogram is shown in the figure.


The vertical axis of the dendrogram plots the distances between objects and between objects and clusters. So, the distance between the variables PEW and OSO is equal to five. These variables are combined into one cluster at the first step.

The horizontal segments of the dendrogram are drawn at levels corresponding to the threshold distances selected for a given clustering step.

It can be seen from the graph that the question "desire to change jobs" (ZhSR) forms a separate cluster: the desire to leave is felt by everyone to roughly the same degree. The next separate cluster is the question of territorial proximity to home (TBD).

In terms of importance, it is in second place, which confirms the conclusion about the need for housing construction, made according to the results of the study using the K-means method.

The sense of economic well-being (PEW) and the sense of fairness in wages (OSO) are combined - this is a block of economic issues. Career growth (CR) and the combination of personal goals and the goals of the organization (SLT) are also combined.

Other clustering methods, as well as the choice of other types of distances, do not lead to a significant change in the dendrogram.

Results:

  1. Cluster analysis is a powerful tool for exploratory data analysis and statistical research in any subject area.
  2. The Statistica program implements both hierarchical and structural methods of cluster analysis. The advantages of this statistical package lie in its graphical capabilities: two-dimensional and three-dimensional representations of the obtained clusters in the space of the studied variables are provided, as well as the results of the hierarchical procedure for grouping objects.
  3. It is necessary to apply several cluster analysis algorithms and draw conclusions based on a general assessment of the results of the algorithms.
  4. Cluster analysis can be considered successful if it is performed in different ways, the results are compared and common patterns are found, and stable clusters are found regardless of the clustering method.
  5. Cluster analysis allows you to identify problem situations and outline ways to solve them. Therefore, this method of non-parametric statistics can be considered as an integral part of system analysis.

Cluster analysis is the name given to various formalized procedures for constructing classifications of objects. The leading science in the development of cluster analysis was biology. The subject of cluster analysis (from the English "cluster" - bunch, clot, group) was formulated in 1939 by the psychologist Robert Tryon. The classics of cluster analysis are the taxonomists Robert Sokal and Peter Sneath. One of their most important achievements in this area is the book "Principles of Numerical Taxonomy", published in 1963. In accordance with the authors' main idea, a classification should be based not on a mixture of poorly formalized judgments about the similarity and relationship of objects, but on the results of formalized processing of mathematically calculated similarities/differences between the objects being classified. To accomplish this task, appropriate procedures were needed, and the authors undertook their development.

The main stages of cluster analysis are as follows:
1. selection of comparable objects;
2. selection of a set of features to be compared, and a description of objects according to these features;
3. calculation of a measure of similarity between objects (or a measure of difference between objects) in accordance with the chosen metric;
4. grouping objects into clusters using one or another merging procedure;
5. checking the applicability of the resulting cluster solution.

So, the most important characteristics of the clustering procedure are the choice of a metric (a significant number of different metrics are used in different situations) and the choice of a joining procedure (here, too, there are many different options). Different metrics and joining procedures are more suitable for different situations, but to a certain extent the choice between them is a matter of taste and tradition. As explained in more detail in the article "Clusters, hoards and the chimera of objectivity", the hope that cluster analysis will lead to the construction of a classification that in no way depends on the arbitrariness of the researcher turns out to be unattainable. Of the five stages of a study using cluster analysis, only stage 4 is not associated with making a more or less arbitrary decision that affects the final result. The choice of objects, the choice of features, and the choice of metric, together with the merging procedure, all significantly affect the final result. This choice may depend on many circumstances, including the explicit and implicit preferences and expectations of the researcher. Alas, this circumstance affects not only the results of cluster analysis: all "objective" methods face similar problems, including all cladistic methods.

Is there a single correct solution to be found by choosing a set of objects, a set of features, a metric type, and a joining procedure? No. To prove this, we present a fragment of the article referenced in the previous paragraph.

"In fact, we cannot always even firmly answer the question of which objects are more similar to each other and which are more different. Alas, there are simply no generally accepted (let alone “objective”) criteria for choosing a metric of similarities and differences between classified objects.

Which object is more similar to object A: B or C? If we use distance as the similarity metric, then C is more similar: |AC| < |AB|. But if we rely on the correlation between the features shown in the figure (which can be described as the angle between the vector going from the origin to the object and the abscissa axis), then B is more similar. Which is the right way, then? There is no single correct answer. On the one hand, an adult toad looks more like an adult frog (both are adults); on the other hand, it looks more like a young toad (both are toads)! The correct answer depends on what we consider more important."

Cluster analysis has found the widest application in modern science. Unfortunately, in a large part of the cases where it is used, it would be better to use other methods. In any case, specialist biologists need to clearly understand the basic logic of cluster analysis, and only in this case they will be able to apply it in those cases where it is adequate, and not apply it when the choice of a different method is optimal.

8.2. A simple worked example of cluster analysis

To explain the typical logic of cluster analysis, consider an illustrative example: a set of 6 objects (denoted by letters) characterized by 6 features of the simplest, binary type, each taking one of two values: characteristic (+) or uncharacteristic (-). The description of objects according to the accepted features is called a "rectangular" matrix (an objects/features matrix). In our case, we are talking about a 6×6 matrix, so it happens to be "square", but in the general case the number of objects in the analysis need not equal the number of features, and the "rectangular" matrix may have a different number of rows and columns. So, let us set the "rectangular" matrix:

The choice of objects and their description according to a certain set of features correspond to the first two stages of cluster analysis. The next stage is the construction of a matrix of similarities or differences (a "square" matrix, an objects/objects matrix). To do this, we need to choose a metric. Since our example is conditional, it makes sense to choose the simplest one. What is the easiest way to determine the distance between objects A and B? Count the number of features in which they differ. As can be seen from the matrix, objects A and B differ in features 3 and 5, so the distance between these two objects is two units.

Using this metric, we construct the "square" matrix (the objects/objects matrix). It is easy to see that such a matrix consists of two symmetrical halves, so only one of the halves needs to be filled:

In this case, we have built a difference matrix. The similarity matrix would look the same, except that each position would contain the difference between the maximum possible distance (6 units) and the distance between the objects. For the pair A and B, naturally, the similarity would be 4 units.

Which two objects are closest to each other? B and F: they differ in only one feature. The essence of cluster analysis is to combine similar objects into a cluster, so we combine objects B and F into the cluster (BF). Let's show it on the diagram. As you can see, the objects are combined at the level that corresponds to the distance between them.

Fig. 8.2.1. The first step of clustering a conditional set of 6 objects

Now we have not six objects but five. We reconstruct the "square" matrix. To do this, we need to determine the distance from each object to the cluster. The distance from A to B was 2 units, and from A to F, 3 units. What, then, is the distance from A to (BF)? There is no single correct answer here; look at how these three objects are located relative to each other.

Fig. 8.2.2. The relative position of the three objects

Maybe the distance from the object to the group is the distance from the object to the closest object in the group, i.e., |A(BF)| = |AB|? This logic corresponds to joining by maximum similarity.

Or maybe the distance from the object to the group is the distance from the object to the object in the group farthest from it, i.e., |A(BF)| = |AF|? This logic corresponds to joining by minimum similarity.

It can also be considered that the distance from an object to a group is the arithmetic mean of the distances from this object to each of the objects in the group, i.e., |A(BF)| = (|AB| + |AF|)/2. This solution is called joining by mean similarity.

All three of these solutions, and a significant number of others not described here, are legitimate. Our task is to choose the solution most appropriate for the kind of data we have. Joining by maximum similarity ultimately leads to long, "ribbon-like" clusters; joining by minimum similarity, to fragmentation of groups. When choosing among the three described options, biologists more often use joining by mean similarity, and we will use it here as well. After the first clustering step, the "square" matrix will look like this.
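The same walk-through can be checked mechanically. The book's own 6×6 feature matrix is not reproduced above (it was a figure), so the binary matrix below is hypothetical, chosen to be consistent with the distances quoted in the text (|AB| = 2 with differences in features 3 and 5, |BF| = 1, |AF| = 3, |DE| = 2, and the tie at level 2.5); the metric is the count of differing features, and joining is by mean similarity.

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import pdist

    objects = list("ABCDEF")
    # Hypothetical +/- patterns (1 = "+", 0 = "-"), NOT the book's matrix.
    X = np.array([[1, 0, 1, 0, 1, 0],   # A
                  [1, 0, 0, 0, 0, 0],   # B
                  [1, 1, 1, 1, 1, 1],   # C
                  [0, 1, 0, 0, 0, 1],   # D
                  [0, 0, 0, 1, 0, 1],   # E
                  [1, 0, 0, 0, 0, 1]])  # F

    d = pdist(X, metric="hamming") * X.shape[1]   # number of differing features
    tree = linkage(d, method="average")           # joining by mean similarity
    # Each row of `tree`: the two merged clusters, the merge distance, and the
    # size of the new cluster. Note that SciPy may break the tie at level 2.5
    # differently from the text.
    print(tree)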

Now the closest pair of objects are D and E. Let's merge them too.

Fig. 8.2.3. The second step of clustering a conditional set of 6 objects

Let's rebuild the "square" matrix for four objects.

We see that there are two possibilities for joining at level 2.5: joining A to (BF), and joining (BF) to (DE). Which one should we choose?

We have various options for how to make this choice. It can be made randomly. We can adopt some formal rule that settles the choice. Or we can see which of the solutions gives the better clustering. Let's use the last approach and implement the first possibility first.

Fig. 8.2.4. The first version of the third step of clustering a conditional set of 6 objects

Choosing this option, we would build the following "square" 3×3 matrix.

If we had chosen the second option of the third step, we would have the following picture.

Fig. 8.2.5. The second variant of the third step of clustering a conditional set of 6 objects

It corresponds to the following 3×3 matrix:

The resulting 3×3 matrices can be compared to make sure that a more compact grouping of objects is achieved in the second variant. When constructing a classification of objects using cluster analysis, we should strive to identify groups that combine similar objects; the higher the similarity of objects within groups, the better the classification. Therefore, we choose the second option for the third clustering step. Of course, we could take the following steps (and divide the first option into two more sub-options), but in the end we would be convinced that the best option for the third clustering step is exactly the one shown in Fig. 8.2.5. We stop on it.

In this case, the next step is to merge the objects A and C, as shown in Fig. 8.2.6.

Fig. 8.2.6. The fourth step of clustering

We build a 2×2 matrix:

Now there is nothing left to choose: we merge the two remaining clusters at the required level. In accordance with the accepted style of building cluster "trees", let's add a "trunk" that stretches to the level of the maximum possible distance between objects with the given set of features (here, 6 units).

Fig. 8.2.7. The fifth and final clustering step

The resulting picture is a tree graph (a collection of vertices and the connections between them). This graph was constructed in such a way that the lines forming it intersect (we have shown these intersections as "bridges"). Without changing the nature of the relationships between objects, the graph can be rebuilt so that there are no intersections in it. This is done in Fig. 8.2.8.

Fig. 8.2.8. The final view of the tree graph obtained as a result of clustering

The cluster analysis of our conditional example is finished. We just need to understand what we got.

8.3. Fundamental Limitations and Disadvantages of Cluster Analysis

How should the graph shown in Fig. 8.2.8 be interpreted? There is no single answer. To answer this question, you need to understand what data we clustered and for what purpose. "On the surface" lies the conclusion that the original set of 6 objects consists of three pairs. Looking at the resulting graph, it is difficult to doubt this. However, is this conclusion correct?

Go back to the very first "square" 6×6 matrix and make sure that object E was two units away from both object D and object F. The similarity of E and D is reflected in the final "tree", but the fact that object E was just as close to object F is lost without a trace! How can this be explained?

In the result of the clustering shown in Fig. 8.2.8, there is no information at all about the distance |EF|; there is only information about the distances |DE| and |(BF)(DE)|!

Each "rectangular" matrix, in the case when a certain metric and method of attachment is chosen, corresponds to a single "square" matrix. However, each "square" matrix can correspond to many "rectangular" matrices. After each step of the analysis, each previous "square" matrix corresponds to the next one, but, based on the next one, we could not restore the previous one. This means that at each step of the cluster analysis, some part of the information about the diversity of the original set of objects is irreversibly lost.

This circumstance is one of the serious drawbacks of cluster analysis.

Another of the insidious shortcomings of cluster analysis is mentioned in the article
