Data Storytelling for survey data in R

Why you should say goodbye to the classic bar chart and start to make more meaningful plots

Hannah Roos
Towards Data Science


Photo by Cristian Escobar on Unsplash

There is a huge variety of graphs to choose from for data visualization (R Graph Gallery). This variety makes it possible to visualize almost any data in a breathtaking way, but it also bears the risk of losing sight of the bigger picture too quickly. This is probably what happened to me. If there were only two types of data analysts in the world, I would definitely be the one who loses herself in the aesthetic and fancy aspects of graphs rather than the one who sticks to an all-time favourite.

Even though it has become so easy to create amazing graphs with popular programming languages and statistical environments like R and Python, we should not forget that a data visualization is always a communication tool. The challenge is to find the type of visualization that not only suits the data but also communicates the story you want to tell, making the content accessible to the reader at first sight. Whether your graph clearly illustrates the patterns in the data, however, depends not only on your audience but also on the way certain graphs convey the data on a psychological level. Therefore, let me tell you a little story from a psychologist with a passion for data science and what she has learnt along the way.

The challenge

One of my former employers once asked me to visualize psychometric test results of leaders who were going to attend one of our workshops. Proud of my statistical knowledge, I created a boxplot, which he printed out. He then came to me, wondering: “Hannah, thanks for creating this, but what is it? Is this an intelligence test?”

Excuse Me Reaction GIF By Mashable

And we both laughed. I realized that my plot was meaningless to our audience without a proper explanation. It is our job to reduce complexity and present the results in an intuitive manner.

Another time, I worked in a project team that explicitly asked me to produce bar plots for survey data, and they had very specific ideas about all the details. As I was working under a lot of time pressure and did not yet have the right words, I just did what was asked, feeling very unhappy about the outcome even though I had taken the “simple” way. But what was wrong? And what is a better way to present Likert data in a concise and comprehensible manner?

In the following sections, I have sketched my train of thought for you with a little case study. Imagine you were interested in your clients’ personal attitudes towards Christmas. Accordingly, you run a quick survey with 50 participants. The items you came up with are the following:

  • Christmas has become only another occasion for excessive consumption.
  • Being with my family is the thing I enjoy most about Christmas.
  • Christmas does not really matter to me.
  • If I needed to choose, Christmas would be my absolute favourite holiday.
  • The amount of stress around Christmas exceeds the amount of joy to me.
  • I could not imagine abolishing Christmas as a national holiday.
  • Christmas has religious meaning to me.

Respondents answer each item on a 5-point Likert scale, ranging from “Strongly disagree” to “Strongly agree”. After a while, you have collected enough data and need to visualize the results. How would you approach this?

Disclaimer: This is just how I would go about this task. I hope that you can benefit from my thoughts, but other professionals (you included) may approach it in different ways.

Now I have prepared some fake-survey data for you to illustrate what the data could possibly look like.

Image by Author
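The data shown in the screenshot are not reproduced here, so the following is only a minimal sketch of how comparable fake data could be simulated. The short column names, the seed, and the response probabilities are assumptions made for illustration, not the values behind the figures in this article.

```r
# Simulate fake Likert responses (1 = "Strongly disagree", ..., 5 = "Strongly agree").
# Column names and probabilities below are hypothetical.
set.seed(123)
n_participants <- 50

# One vector of response probabilities per item, in scale order 1..5
item_probs <- list(
  consumption = c(0.05, 0.15, 0.20, 0.35, 0.25),
  family      = c(0.02, 0.08, 0.15, 0.35, 0.40),
  indifferent = c(0.35, 0.30, 0.15, 0.15, 0.05),
  favourite   = c(0.10, 0.15, 0.25, 0.25, 0.25),
  stress      = c(0.25, 0.35, 0.20, 0.15, 0.05),
  holiday     = c(0.05, 0.05, 0.15, 0.30, 0.45),
  religious   = c(0.30, 0.20, 0.20, 0.15, 0.15)
)

# Wide format: one row per participant, one column per item
survey <- data.frame(id = seq_len(n_participants))
for (item in names(item_probs)) {
  survey[[item]] <- sample(1:5, n_participants, replace = TRUE,
                           prob = item_probs[[item]])
}

head(survey)
```

Storing the responses as integers keeps the simulation simple; the labels can be attached later by converting the columns to factors.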

To prepare the raw data for analysis, we first need to transform the dataframe to a long format.

Image by Author
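One possible way to do this reshaping is `pivot_longer()` from the tidyr package. The sketch below uses a small hypothetical data frame with three made-up item columns; the names `item` and `response` are my own choice and may differ from the ones used for the figures.

```r
library(tidyr)

# Hypothetical wide data: one row per participant, one column per item
set.seed(123)
survey <- data.frame(
  id          = 1:50,
  consumption = sample(1:5, 50, replace = TRUE),
  family      = sample(1:5, 50, replace = TRUE),
  stress      = sample(1:5, 50, replace = TRUE)
)

# Long format: one row per participant-item combination
survey_long <- pivot_longer(
  survey,
  cols      = -id,        # reshape every column except the participant id
  names_to  = "item",     # former column names go here
  values_to = "response"  # former cell values go here
)

head(survey_long)
```

In the long format, every plotting layer can map `item` and `response` directly, which is what ggplot2 expects.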

The simple way? A horizontal bar plot with error bars

For that one presentation I mentioned above, we used endless slides with various bar plots to show the average response for each item as well as the extent to which responses vary across participants. For this purpose, I created horizontal bar plots that display the mean plus error bars representing ±1 SD around the mean. This is how it can be done:

Image by Author
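A bar plot of this kind could be built roughly as follows. This is a sketch under assumptions: the data are simulated, the item names are placeholders, and the colours and labels are not necessarily those of the original figure.

```r
library(ggplot2)

# Hypothetical long-format data: 50 simulated responses for each of three items
set.seed(123)
survey_long <- data.frame(
  item     = rep(c("consumption", "family", "stress"), each = 50),
  response = sample(1:5, 150, replace = TRUE)
)

# Mean and standard deviation per item
item_means <- tapply(survey_long$response, survey_long$item, mean)
item_sds   <- tapply(survey_long$response, survey_long$item, sd)
item_summary <- data.frame(
  item = names(item_means),
  mean = as.numeric(item_means),
  sd   = as.numeric(item_sds)
)

# Bars for the mean, error bars for +/- 1 SD, flipped to run horizontally
p <- ggplot(item_summary, aes(x = item, y = mean)) +
  geom_col(fill = "steelblue") +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2) +
  coord_flip() +
  labs(x = NULL, y = "Mean response (1 = strongly disagree, 5 = strongly agree)")

p
```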

The bar plot gives us a rough impression of how participants responded. However, it tells us nothing about how the responses are distributed across the response options. For example, it seems like a majority of respondents endorsed the statement “Being with my family is the thing I enjoy most about Christmas”, right? In fact, we do not know whether the high average is the result of a few extremely positive responses that shifted the mean to the right while most responses are scattered across the whole scale. The thing is: we do not know!

In addition, the filled bar suggests that the responses to an item range from zero to the end of the error bar, leaving out the upper response options (e.g., “Strongly agree”). This gives the impression that participants only used the lower part of the scale and did not really agree with the statements. This reading is formally incorrect, but the graph still naturally conveys a biased impression to the reader.

Strictly speaking, mean and standard deviation are not even applicable to categorical data like response categories (strongly disagree, disagree, neutral, etc.). This might be confusing at first sight because response categories are often represented by numbers (1–2–3–4–5), suggesting that we have numeric intervals. However, the response options do not fully map onto a continuous scale. The reason lies in the psychological nature of responding: the distance between two response options (e.g., 4 vs. 5) is a subjective interpretation that may differ from person to person. Thus, interpreting an average of 2.1 for “The amount of stress around Christmas exceeds the amount of joy to me.” is actually nonsense and tells us very little about the respective response patterns.
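To make the first point concrete, here is a small constructed example (the two items and their responses are invented purely for illustration). Both items have exactly the same mean and standard deviation, so their bars and error bars would look identical, yet the response patterns are quite different.

```r
response_levels <- c("Strongly disagree", "Disagree", "Neutral",
                     "Agree", "Strongly agree")

# Two hypothetical items with eight responses each
item_a <- c(2, 2, 2, 2, 4, 4, 4, 4)  # split between "Disagree" and "Agree"
item_b <- c(1, 3, 3, 3, 3, 3, 3, 5)  # mostly "Neutral", two extreme responses

# Identical summary statistics ...
c(mean_a = mean(item_a), mean_b = mean(item_b))
c(sd_a = sd(item_a), sd_b = sd(item_b))

# ... but different frequency tables
freq_a <- table(factor(item_a, levels = 1:5, labels = response_levels))
freq_b <- table(factor(item_b, levels = 1:5, labels = response_levels))
freq_a
freq_b
```

The frequency tables show at a glance what the mean and SD hide: nobody chose “Neutral” for the first item, while almost everybody did for the second.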

Making it fancy — a custom-made plot

I decided to change my strategy and create something entirely new. If ggplot2 offers so many options for customization, why not make use of them? Thus, I started to write my own functions and came up with a plot that shows the interquartile range, the total range, the median and the individual data points for each item all at once. This time, it should be more informative!

Image by Author
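The functions behind this figure are not reproduced here. As a rough approximation, a plot with the same ingredients (total range, interquartile range, median, jittered points and a neutral reference line) could be assembled from standard ggplot2 layers. Everything in this sketch, including the simulated data, names, colours and jitter widths, is an assumption rather than the original code.

```r
library(ggplot2)

response_levels <- c("Strongly disagree", "Disagree", "Neutral",
                     "Agree", "Strongly agree")

# Hypothetical long-format data: 50 simulated responses for each of three items
set.seed(123)
survey_long <- data.frame(
  item     = rep(c("consumption", "family", "stress"), each = 50),
  response = sample(1:5, 150, replace = TRUE)
)

# Range, quartiles and median per item
by_item <- split(survey_long$response, survey_long$item)
item_stats <- data.frame(
  item   = names(by_item),
  min    = sapply(by_item, min),
  q1     = sapply(by_item, quantile, probs = 0.25, names = FALSE),
  median = sapply(by_item, median),
  q3     = sapply(by_item, quantile, probs = 0.75, names = FALSE),
  max    = sapply(by_item, max)
)

p <- ggplot() +
  # Reference line at the neutral midpoint of the scale
  geom_hline(yintercept = 3, linetype = "dashed", colour = "grey50") +
  # Thin line: total range; thick line: interquartile range
  geom_linerange(data = item_stats, aes(x = item, ymin = min, ymax = max),
                 colour = "grey60") +
  geom_linerange(data = item_stats, aes(x = item, ymin = q1, ymax = q3),
                 linewidth = 3, colour = "steelblue") +  # `size` before ggplot2 3.4.0
  # Individual responses with a little random noise
  geom_jitter(data = survey_long, aes(x = item, y = response),
              width = 0.15, height = 0.15, alpha = 0.4) +
  # Median on top
  geom_point(data = item_stats, aes(x = item, y = median), size = 3) +
  scale_y_continuous(breaks = 1:5, labels = response_levels,
                     limits = c(0.5, 5.5)) +
  coord_flip() +
  labs(x = NULL, y = NULL)

p
```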

Way better, isn’t it? We get a much better sense of how strongly responses spread across the response categories. In addition, we have a reference line that reminds us of where neutral responses lie on the scale. A big plus is the sense of achievement you get when you create your own custom-made graph from scratch and learn a lot of new techniques along the way.

But I am still not entirely happy with the graph, for the following reason: it still suggests numeric continuity where there is none. The data are still categorical, which is why we needed to add jitter (i.e., random noise) to the points explicitly. It simply gives you a wrong impression of the data’s actual nature. This is what the graph would look like without jitter:

There is just one point for each response option that occurred for the respective item, and the many individual responses are layered on top of each other, so we cannot see all the data points. But at least they are not scattered in a way that does not actually exist (e.g., what is the response category 2.3?). Conversely, if we removed the individual data points completely, we would not know which option participants endorsed most. So, we are not finished yet.
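The extent of this overplotting is easy to quantify. In the following sketch (simulated data, hypothetical item names), 100 responses collapse onto at most ten distinct plotting positions, and a simple count table recovers what the stacked points hide.

```r
# Hypothetical long-format data: 50 simulated responses for each of two items
set.seed(7)
survey_long <- data.frame(
  item     = rep(c("family", "stress"), each = 50),
  response = sample(1:5, 100, replace = TRUE)
)

n_responses <- nrow(survey_long)          # number of points that are drawn
n_positions <- nrow(unique(survey_long))  # number of points that remain visible

# How many responses sit on top of each other at every position
counts <- as.data.frame(table(item = survey_long$item,
                              response = survey_long$response))
counts
```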

Use a predefined package — back-to-back divergent bar-chart

Sometimes it might be worth researching a bit to find out whether there is already a smart tool for your specific use case, which you can then adjust to your needs. This is what I did. I stumbled upon the likert package from Jason Bryer and Kimberly Speerschneider, which you can install by calling

install.packages("likert")

in your R console. Honestly, it took me a while to make it work with my data, but I’ll now share all the tips and tricks I have used to make it run more efficiently for you.

In particular, there was one issue that cost me a lot of time: an unequal number of factor levels. This means that your data may include items for which respondents did not make use of the whole scale (e.g., only “Strongly disagree” to “Agree”). As a result, the factor levels of each item correspond only to the response options that actually occur, which is what causes all the trouble. This is what people also refer to as implicitly missing data. The likert function then stops with the following error:

Error in likert::likert(d) :
  All items (columns) must have the same number of levels.

Image by Author
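One way to avoid this error is to convert every item to a factor with the full set of levels spelled out explicitly, so that unused response options are kept. The sketch below uses a small hypothetical data set in which one item never receives the two lowest options; the column names are made up, and the plot is drawn with the package defaults rather than the settings of the figure shown here.

```r
response_levels <- c("Strongly disagree", "Disagree", "Neutral",
                     "Agree", "Strongly agree")

# Hypothetical wide data: "family" only uses the options 3 to 5
survey <- data.frame(
  consumption = rep(1:5, times = 10),
  family      = rep(3:5, length.out = 50)
)

# factor() without explicit levels keeps only the observed values,
# so the two items end up with a different number of levels
naive <- lapply(survey, factor)
sapply(naive, nlevels)

# Spelling out levels and labels gives every item the same five levels
fixed <- as.data.frame(
  lapply(survey, factor, levels = 1:5, labels = response_levels)
)
sapply(fixed, nlevels)

# Build and plot the likert object (only if the package is installed)
if (requireNamespace("likert", quietly = TRUE)) {
  lik <- likert::likert(fixed)
  likert_plot <- plot(lik)
  print(likert_plot)
}
```

Because the columns are passed as a plain data frame of factors, the column names end up as item labels in the plot, so it can be worth renaming them to the full item wording first.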

The resulting graph is aesthetically appealing and gives you an intuitive sense of the extent to which respondents endorsed or rejected specific statements. It even provides you with the percentage of agreement (green) vs. disagreement (red) for each item, giving you a good impression of the general response pattern without averaging the data. In the end, we have both: information about the central response tendency and about the variability in the data.

Learnings on a meta-level

It is definitely worth engaging in a learning process and getting to a solution step by step: in the end, you have extended your methodological toolkit, which will serve you well in the future. Whenever you create data visualizations for conferences, presentations, articles, etc., keep in mind that the person you present the graph to may not have the same level of statistical know-how, may not have seen the data before, or may simply have a head full of other things, making it difficult to focus on complex content. Still, he or she probably wants to understand the data just as well as you do. This is why you will be a real hero if you are able to convey these valuable data-driven insights right into their brains, making it easy for them to concentrate on the real underlying challenges. At the end of the day, all of this starts with an intuitive but precise data visualization.
