A Modern Visualization of Group Comparisons

More and more companies, governmental institutions and researchers employ data-centric methods to derive insights from data. Visualizing complex patterns in a simple yet intuitively comprehensible way is more important than ever.

The Importance of Visualization

Nowadays, almost every domain of contemporary life produces an accelerating stream of data, adding to the vast amount that already exists. This holds in every sector, be it healthcare and medical research, social media, industry, finance or retail. More and more companies, governmental institutions and researchers employ specialized data-centric methods to derive insights from this ever-growing body of data. As data analysis gains popularity among decision makers, adequate visual communication of results becomes more important than ever.

But the majority of people involved in data-driven decision-making do not have a scientific background in math, statistics, computer science or data science. Thus, visualizing complex patterns in a simple yet intuitively comprehensible way is essential. Aggregated displays of data, or of relations within a dataset, require careful balancing between too little information and information overload, that is, between simplicity and completeness. Depending on the specific question or task at hand, a holistic visual summary of the data might easily fill many slides and could become confusing very quickly. On the other hand, an overly reductionist visualization puts one at risk of missing the major insights.

Basic Requirements for Modern Visualization – The RDI Scheme

For starters, there’s a useful rule of thumb in data visualization: the three pillars of “RDI”, which stands for raw data, descriptives and inference. A basic visualization should provide some view of the underlying raw data, since a direct, not yet aggregated perspective on the data might already reveal important trends that would be hard to spot otherwise.

In the next step, aggregated measures like the mean or median are necessary to give decision makers an overview that makes groups or different sets of data comparable, since guesstimating group differences from nothing but raw data is somewhere between unscientific and impossible. Even if differences are clearly visible, it’s usually impossible to tell whether the discovered pattern is statistically meaningful or merely random.

That’s where inference comes into play. Inferential measures can improve judgements by estimating how precisely different parameters are known. Inferential statistics are required whenever information can only be drawn from data samples that do not comprise the entirety of a variable’s existing values. This is the case if the available body of data is too large to be examined directly, so that investigation is limited to smaller samples, or if only a limited amount of data is accessible in the first place, for example if only a few thousand people have been interviewed on the phone in a voter survey. Inferential methods are then used to estimate relevant parameters (e.g. voter opinions) in the entire population, based on the available samples. Since this is often the most interesting type of information that can be drawn from a useful visualization, some inferential measure should always be part of even basic visualizations. Once all the necessary information is represented in your visualization, an appealing theme is worth opting for as well, since aesthetics can increase comprehensibility and make a visualization more pleasing to the observer.
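To make the voter-survey example a little more tangible, here is a minimal sketch in Python of one common inferential measure: the approximate 95% margin of error for a surveyed proportion, using the normal approximation. The function name `margin_of_error` and the 52% figure are assumptions made up for illustration, not values from any real survey.

```python
import math

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), which is
    reasonable for large samples and proportions not too close to 0 or 1.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Hypothetical survey result: 52% of respondents favour a candidate.
for n in (100, 1000, 4000):
    moe = margin_of_error(0.52, n)
    print(f"n={n:>4}: 52% +/- {100 * moe:.1f} percentage points")
```

The larger the sample, the smaller the margin, which is exactly the kind of precision information a visualization should convey alongside the descriptive value.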

Boxplot vs. Pirate Plot

To illustrate these points with an example, let’s have a look at this classic boxplot, showing a comparison between IQ scores in men and women.

The classic boxplot provides you with basic information about the range and distribution of the data. The thick middle line represents the median, which fulfils our demand for a descriptive measure. The lower and upper bounds of the boxes represent the 25th and 75th percentiles of the distribution. The lines (whiskers) above and below the boxes extend to the most extreme values that still lie within 1.5*IQR of the box, that is, within 1.5 times the interquartile range (the 75th minus the 25th percentile). Every value that lies further away from the box than 1.5*IQR is considered an outlier. Outliers are usually not numerous, so their actual raw values are shown as dots in the plot. And that’s as far as raw data representation in a boxplot goes. A look at the whole distribution of the raw data is not included, which is a relevant flaw of the classic boxplot.

Comparing men’s and women’s medians here (the middle lines within the boxes), it seems that in this sample women have a slight advantage over men. It is not possible, though, to draw any conclusion about the significance of the difference, i.e. whether it is just random noise in the data or whether it can be assumed to exist in the whole population. This is a clear lack of inferential information in the classic boxplot.
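The quantities a boxplot draws can be computed by hand. The following is a minimal sketch in Python, assuming the common 1.5*IQR convention described above; the helper name `boxplot_stats` and the scores are made up for illustration, and plotting libraries may compute quartiles slightly differently.

```python
import statistics

def boxplot_stats(values):
    """Return the numbers a classic boxplot is built from."""
    data = sorted(values)
    # Quartiles; "inclusive" treats the data as the full range of values
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points inside the fences
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    # Everything beyond the fences is drawn as an individual outlier dot
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up IQ-like scores, for illustration only
scores = [85, 90, 95, 98, 100, 102, 105, 110, 115, 160]
stats = boxplot_stats(scores)
print(stats)
```

Note that only the outliers survive as raw values; every other observation is reduced to five summary numbers.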

Now let’s compare this with the same groups visualized within a pirate plot.

So what can be seen here? At first glance, one can see the cloud of black dots representing every observation in the raw data, with the curved, bean-shaped outline indicating how many values are to be found at a certain IQ level. It is quickly visible that our sample of men on the left contains a lot more subjects than the sample of women on the right. This is a very important piece of information. Judging from the boxplot alone, one could come to the conclusion that the IQ range of men is vastly larger in both directions than that of women. Although there actually are more men than women at both extreme ends of the IQ scale, this difference in range is overestimated on the basis of a boxplot. Because we drew a much smaller female sample, there just aren’t as many extreme values to be found in it, which can be clearly seen in the pirate plot. This is a huge plus for the pirate plot when it comes to representing raw data.

The horizontal middle lines again represent the median in each group, with the white lines showing the 25th and 75th percentiles. Another big bonus is the red bar around the median, most visible in the female group. It shows a range within which the actual median of the whole population is likely to be found, based on Bayesian statistical estimation. The bar is wider for the women because their sample is so much smaller than the men’s and therefore carries more estimation uncertainty. The male median has such a bar too; it is just too narrow to be seen properly, indicating a quite precise estimate. We can further see that the red bar reaches all the way down to the median of the male group, and because of this overlap, an actual group difference cannot be safely assumed to exist here. With that, the RDI scheme is complete, which makes a pirate plot so much more useful than a boxplot. And after all, which one looks more appealing to you?
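The effect of sample size on the observed range can be checked with a small simulation. This is a sketch on purely synthetic data (a normal distribution with mean 100 and standard deviation 15 is assumed here), where the small group is simply the first 50 values of the large group, so both come from exactly the same distribution.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic scores, not real IQ data
large_group = [rng.gauss(100, 15) for _ in range(1000)]
# The small group is a subset of the large one: same distribution, fewer draws
small_group = large_group[:50]

def value_range(values):
    """Distance between the smallest and the largest observation."""
    return max(values) - min(values)

print(f"n={len(large_group)}: range = {value_range(large_group):.1f}")
print(f"n={len(small_group)}: range = {value_range(small_group):.1f}")
```

Because the small group is a subset, its range can never exceed that of the large group, even though nothing about the underlying distribution differs. A plot that hides the group sizes invites exactly this misreading.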
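To get a feeling for why the band is wider for the smaller group, here is a rough sketch in Python. Note that it swaps in a different technique: a percentile bootstrap interval for the median instead of the Bayesian estimation mentioned above. The function names, group sizes and synthetic scores are all assumptions for illustration.

```python
import random
import statistics

def bootstrap_median_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the median (not a Bayesian interval)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    low = medians[int(tail * n_resamples)]
    high = medians[int((1.0 - tail) * n_resamples) - 1]
    return low, high

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Synthetic groups of very different size, not real IQ data
rng = random.Random(1)
men = [rng.gauss(100, 15) for _ in range(500)]
women = [rng.gauss(102, 15) for _ in range(40)]

ci_men = bootstrap_median_interval(men)
ci_women = bootstrap_median_interval(women)
print("men:  ", ci_men, "width", ci_men[1] - ci_men[0])
print("women:", ci_women, "width", ci_women[1] - ci_women[0])
print("overlap:", intervals_overlap(ci_men, ci_women))
```

Checking for overlap is only a rough visual heuristic, not a formal test, but it mirrors how the red bars are read in the plot.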
