Hierarchical classification needs a metric

Assume you want to build a system that helps physicians to diagnose diseases. It would ask for symptoms and vital signs (temperature, blood pressure, etc.) and return the assessment for different diseases. The standard way to model such a classification task is through a function that returns probabilities (and finally, if desired, decisions) across two or more classes. We do this, because it’s convenient: tools and theory are well established, the subtleties are well known. But can everything we want to model be pushed into a flat set of classes?

The problem here is, that the mistakes an algorithm makes can be very artificial and surprising to humans. We will analyze these mistakes after a short review where hierarchical classification is used.

In the example of disease prediction, a classifier puts out two completely different diseases without giving a clue how the decision comes about. Thus it would be natural to not only let the computer predict the most specific and detailed disease but rather go from a broader range, like types of diseases (infectious disease or cancer) to then produce sub-results for each of these groups (e.g. common cold vs. influenza). This has two advantages: First the system has to check a more general concept first (is it a cancer?) before concluding a detail, second the expert can follow the path of decisions and therefore gain more insights, as in the end it’s her who has to take a decision.

In this article we therefore focus on the question how to classify things into a tree hierarchy. Additionally to the introductory use-case, examples are everywhere:

  • Classifying texts into their genre, sub-genre, and so on. For example social networks like facebook or twitter (since recently, with the so called algorithmic timeline) will try to find out what type of links are posted/tweeted/shared: Is it news or a blog post? Is it about politics, sports, global or local events? What kind of politics or event?
  • Detecting the type of product based on image data. Especially marketplace platforms, like eBay or Amazon, are interested in the type of product a customer is a seller offering? Is the image appropriate and fitting to the tag?
  • Protein function prediction: Based on the molecular structure of the protein, it can be predicted how it works. This knowledge would be used to develop drugs, by exploiting general properties (size, is the molecule hydrophilic or hydrophobic?, is it electrically charged and how?) as well as very special properties (is it a protein that can cut a corresponding molecule?).
  • The ImageNet Competition where millions of images have to be classified into the WordNet hierarchy (a tree-type word database). The research that was born through this competition enabled a huge leap in the field of computer vision. Mainly based on the success of convolutional neural networks, which are now running in a lot of places to do face detection, video content analysis, and, to come back to medicine, the detection of anomalies in fMRI, x-ray or ultrasonic images.

For hierarchical classification it is important to note, that the hierarchy is predefined and the classifier should predict a node in the tree. There are several design decision to be made, for example if the classifier is allowed to stop in an intermediary node: “I know it’s a 2-wheeled vehicle, but not sure whether a bicycle, motorbike or segway”. Discussing this might be the topic of another article (or consider Silla/Freitas). The question we will be answering is occurring on a higher level: How can such a classifier be evaluated?


Very popular evaluation metrics are precision and recall. They are based on evaluating a binary decision but can be used to evaluate multi-class classifiers as well through averaging. We could now, to simplify the problem, evaluate each node in the hierarchy as an independent class.

Taking an example out of the online shopping domain, assume we have 1000 fashion images and the hierarchy “fashion” with subclasses “dress” and “shoe”, where “dress” has the two subclasses “summer dress” and “ballroom dress” and “shoe” the two subclasses “sneaker” and “slipper”. Ignoring the dependencies, we ignore the superclasses and only consider the leafs. Say, the images are evenly distributed along the four leafs “sneaker”, “slipper”, “summer dress” and “ballroom dress”.

Say the classifier, just to illustrate types of mistakes, predicts 100 “slipper” correctly and 150 as “sneaker”. Additionally assume, 50 other images were wrongly classified as “slipper”.

Computing one-vs-all precision and recall for class “slipper” we get: 100150 = 23 precision and 100250 = 25 recall.

The problem in this evaluation is, that the mistakes are not weighed in a natural way: For example, even if the classifier was roughly correct by saying “slipper” instead of “sneaker” the penalty is the same as if it would have classified it as a “summer dress” (only 25 recall).

hierarchical Precision/Recall

When evaluating, the classifier should be credited for the number of correct and wrong decisions that are explicitly or implicitly taken when classifying a datapoint. For example, when classifying an image as “slipper”, it is implicitly classified as “shoe” and “fashion”. Consider the following definition:


In the following two images these formulas are illustrated. We see a hierarchy of the abstract classes A to G, the green “ground truth” and two possible predictions in gray and blue.


Hierarchical recall can be seen as “ratio of correct steps that were hit”, missed correct steps are punished. In the example above, there are three correct steps possible. The blue path has a recall of 13, because blue only predicted B correctly. The gray path has a better recall with 23, because it additionally predicted C correctly. Note that the number of mistakes (two in grey and one in the blue path) does not matter.


Hierarchical precision can be seen as “the ratio of correct steps in the prediction”, punishing incorrect steps. In the example above both prediction correctly predicted two classes. The gray path has a precision with 24 = 12, because it predicted F and G wrongly. The blue path has a better precision of 23, because blue only predicted F wrongly.


In fact hP and hR are a quite natural extension of P and R, as can be seen in this illustration (which, we believe, speaks better for itself):



Possible solutions to the hierarchical classification problem need to be compared with a metric. Therefore a hierarchical metric was presented which uses hierarchical information that is already available anyway in a lot of domains. The number of correct and wrong decision are evaluated, which leads to an error metric that is a more intuitive and gives experts the ability to interpret results and fine tune the algorithm at hand. Especially in a critical field like computer aided diagnostics such a metric would be inevitable, because only when we understand it, a certain trust can be developed into automatic systems.


Leave a Comment

Your email address will not be published. Required fields are marked *