(last rev. 28 April 2019)
Hierarchical cluster analysis is a good first choice when asking questions about
the similarity of groups of documents. Hierarchical clustering does not require
you to choose the number of clusters to begin with. The most commonly used
method (agglomerative hierarchical clustering) begins with every document as its
own cluster and then proceeds to assign these items to “super-clusters” based on
the selected distance metric and linkage criteria.
The clusters that result from hierarchical clustering are typically visualized
with a two-dimensional tree diagram called a dendrogram. For more information
and screenshots about the construction and interpretation of dendrograms in this
method, see “How to Read a Dendogram” and the video at https://youtu.be/Lir0ck691yM.
Since the resulting tree technically contains clusters at multiple levels, the
result of the cluster analysis is obtained by “cutting” the tree at the desired
level. Each connected component then forms a cluster for interpretation.
The results of hierarchical clustering and the topography of the resulting
dendrogram may vary depending on distance metric, linkage method used to form
Hierarchical clustering presents the user with three main challenges:
- Which distance metric to use.
What type of linkage criterion to select.
Where to cut the tree.
Each of these challenges will be considered in turn.
Selecting a Distance Metric
The distance metric is the measure used for defining what constitutes document
similarity, how “far” one document is from another. This can be measured by
comparing the each document’s word vectors. If we visualize these vectors as
lines in a triangle, the hypotenuse between these lines can be used as a measure
of the distance between the two documents. This method of measuring how far
apart two documents are is known as Euclidean distance. Other measures, such as
cosine similarity, which relates the distance between the two documents to the
angle between their two vectors, are also commonly used to measure document
similarity. No one metric is superior to another, although cosine distance may
perform better when the documents are longer and when there is a larger
Selecting a Linkage Method
At each stage of the clustering process a choice must be made about whether two
clusters should be joined. An intuitive means for doing this is to join the
cluster containing a point closest to the current cluster. This is known as
single linkage, which joins clusters based on only a single point. Single
linkage does not take into account the rest of the points in the cluster, and
the resulting dendrograms tend to have spread out clusters (this process is
called “chaining”. Complete linkage uses the opposite approach. It takes the two
points furthest apart between the current cluster and the others. The cluster
with the shortest distance to the current cluster is joined to it. Complete
linkage thus takes into account all the points on the vector that come before
the one with the maximum distance. It tends to produce compact, evenly
distributed clusters in the resulting dendrograms. Average linkage is a
compromise between single and complete linkage. It takes the average distance of
all the points in each cluster and uses the shortest average distance for
deciding which cluster should be joined to the current one. Another commonly
used form of linkage is Ward’s criterion, which attempts to minimize the
differences in cluster size as the dendrogram is built. It may not be
appropriate for use with documents of variable size (c.f.
Which linkage criterion you choose depends greatly on the variability of your
data and your expectations of its likely cluster structure. The fact that it is
very difficult to predict this in advance may explain why the “compromise” of
average linkage is a good place to start.
Cutting the Dendrogram
Once the dendrogram has been generated, every document leaf will form its own
cluster and all documents will belong to a single cluster at the root. In
between, there may be any number of clusters formed at differing levels of the
hierarchy. Not all of these clusters will necessarily be meaningful. In
practice, we have to draw a line on the dendrogram at a particular branch height
below which we will not consider clusters significant. This is known as cutting
the dendrogram. Where to draw the line can be an impressionistic exercise,
depending greatly on our expectations of our data.