Data Visualization 101

The art of Data Visualization is overrated. Data science gurus tend to keep the man in the street away from it by saying it’s a matter for experts only.
Well, at qunb we think it’s not. The art of data visualization and data storytelling should be accessible to anyone, including non-experts. That’s our mission.

Data Visualization is the art of conveying simple messages, one at a time, with simple charts. As we already wrote, good ol’ Excel is enough in 99% of cases.

But for most people, the trickiest part of a Data Visualization job is to first define the message you want to convey, then pick the appropriate type of representation.

This is the purpose of this post: to browse the most common types of messages, and propose a simple type of chart that does the job, plus a few tips.

“Here is the history of this series”

That should be an easy one, right? You just have to pick a 2D chart, with time on the horizontal axis. But should you display bars or a line? Well, to keep it simple, here is a rule of thumb.

If the measure is cumulative over time (e.g. revenue, production, …)

… then it’s best represented with a bar chart:

Screenshot_11_8_13__5_39_PM

If the measure is not cumulative over time (e.g. stock price, temperature, …)

… then represent it as a line chart:

Screenshot_11_8_13__5_34_PM

A few tips:

  • Determining whether a series is cumulative or not is pretty simple: ask yourself “If I look at this monthly data for a whole year, what makes sense? A sum or an average?” Sum => Cumulative => Bar. Average => Non-cumulative => Line.
  • Another way to answer the same question: “Was this value measured over a period of time or at a specific moment?” Period => Cumulative => Bar. Specific moment (timestamp) => Non-cumulative => Line.
  • Oh, by the way, always represent the data at an aggregated level if possible.
  • If you have more than 10 bars to represent, switch from bars to a line (or an area) to reduce the number of displayed objects.
  • Never represent more than one data series on a chart unless you want to show a specific relationship between the series.
  • Line smoothing is really recommended.
  • Try to add a trend line to your chart, and check whether it’s better with or without it.
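
For the spreadsheet-averse, the bar-or-line rules above fit in a few lines of code. This is just a sketch in Python (the charts in this post were all made with Excel); the function name `pick_time_series_chart` and its parameters are made up for illustration.

```python
def pick_time_series_chart(yearly_rollup, n_points, max_bars=10):
    """Suggest a chart type for a single time series.

    yearly_rollup: "sum" if adding up the monthly values over a year makes
                   sense (cumulative measure), "average" otherwise.
    n_points:      number of values to display.
    max_bars:      above this many bars, fall back to a line
                   (10 is the threshold given in the tips above).
    """
    if yearly_rollup not in ("sum", "average"):
        raise ValueError("yearly_rollup must be 'sum' or 'average'")
    if yearly_rollup == "sum" and n_points <= max_bars:
        return "bar"
    return "line"


print(pick_time_series_chart("sum", 8))      # 8 months of revenue -> bar
print(pick_time_series_chart("average", 8))  # 8 months of temperature -> line
print(pick_time_series_chart("sum", 36))     # 36 months of revenue -> line
```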

“This given item outdoes the others in the group”

Example: “In 2012, rice production was more than twice that of any other vegetable”

This one is a classic as well. You want your reader to see side by side the values you’re talking about. And that’s exactly what the bar chart is meant for.

Screenshot_11_7_13__5_56_PM

A few tips:

  • Always highlight with a vivid color the specific bar you’re talking about.
  • In this dummy example, highlight the “eggplants” bar instead, and you’ll see that the visual message changes.
  • Disclaimer: as I’m showing dummy examples, I removed everything like units, measure definition, source, … Remember that you should always put them on your charts.

“The Top N represents X quarter/third/half of the global value”

Example: “In 2012, rice production represented more than one third of global vegetable production”

That’s a classic: use a pie chart or a donut. We know it’s very controversial (data visualization gurus tend to yell at anyone using a pie chart) but it’s actually pretty useful in a business environment.

Values are the same as in the prior example, but this type of visualization does not convey the same meaning. Bar charts emphasize the relative values of a group of measures, while pie charts emphasize the contribution of the Top N to the global value.

Screenshot_11_7_13__6_02_PM

A few tips:

  • Highlight the Top N contributor(s) you want to talk about with a specific color
  • Aggregate the long tail into an “other” segment to suppress all useless information in your chart
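
If your data lives outside Excel, folding the long tail into an “other” segment could look like the sketch below. The helper `top_n_with_other` is a hypothetical name, and the production figures are dummy values invented for this example.

```python
def top_n_with_other(values, n, other_label="other"):
    """Keep the n largest items and lump the rest into one segment.

    values: dict mapping a label to a non-negative number.
    Returns a list of (label, share) pairs, each share being a
    fraction of the total.
    """
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    segments = ranked[:n]
    tail = sum(value for _, value in ranked[n:])
    if tail > 0:
        segments.append((other_label, tail))
    return [(label, value / total) for label, value in segments]


# Dummy figures, for illustration only.
production = {"rice": 40, "potatoes": 15, "tomatoes": 15, "onions": 10,
              "carrots": 10, "eggplants": 5, "leeks": 5}
for label, share in top_n_with_other(production, n=1):
    print(f"{label}: {share:.0%}")
```

With `n=1` the pie has only two slices, which is usually all you need to back a “more than one third” message.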

“Here is how the proportion of X evolves over time”

Example: “The proportion of rice in vegetable consumption is slightly increasing over time”

You want a 100% stacked area (or 100% stacked bars). But be careful, as any stacked chart actually conveys way too much information. So help the reader focus on what matters most by applying a vivid color to the only segment you want to highlight.

Screenshot_11_7_13__6_28_PM
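
Behind a 100% stacked chart there is just one computation: each value divided by the total of its period. Here is a minimal sketch, assuming the data comes as one dictionary per period; `to_shares` is a made-up name and the consumption figures are dummy values.

```python
def to_shares(rows):
    """Turn absolute values per period into shares that sum to 1.

    rows: dict mapping a period to a dict {category: value}.
    """
    shares = {}
    for period, values in rows.items():
        total = sum(values.values())
        if total == 0:
            raise ValueError(f"no data for period {period!r}")
        shares[period] = {cat: v / total for cat, v in values.items()}
    return shares


# Dummy figures, for illustration only.
consumption = {
    2010: {"rice": 30, "potatoes": 40, "others": 30},
    2011: {"rice": 33, "potatoes": 37, "others": 30},
    2012: {"rice": 36, "potatoes": 34, "others": 30},
}
shares = to_shares(consumption)
# The only segment worth highlighting for this message is rice.
rice_trend = [round(shares[year]["rice"], 2) for year in sorted(shares)]
print(rice_trend)
```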

“Let’s compare the N top contributors in 2 groups”

Example: “Among all vegetables, Europeans consume more rice than Americans”

Wanna put two pie charts side by side? You’re almost right. But to allow a visual comparison of the two sets of values, you need to unroll your pie/donut charts and pile the values vertically. That’s a 100% stacked bar chart.

Screenshot_11_7_13__6_41_PM

A few tips:

  • Add dotted lines between the two bars: their slope eases the visual comparison between the two sets of values.
  • Remember to always highlight the segment covered by your message with a vivid color.

“Here is the average value in this group”

Example: “The average revenue for 3-year-old startups is X”

Well, this one is not that easy. An average is just one single value. The problem is that this single value is not enough on its own: you should also show the reader how the measures are distributed around the mean. In a word, use a histogram.

Screenshot_11_8_13__3_10_PM

A few tips:

  • Histograms are pretty complex to create with Excel. If using Windows, you need the Analysis ToolPak add-in. If using Mac, you need to install third-party software.
  • Highlight the bar containing the mean value of this distribution with a vivid color.
  • Even if it brings back good old memories like the normal distribution, refrain from adding any piece of information like variance or standard deviation. It’s gibberish to 99.99% of your readers.
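
If Excel gets in the way, the binning itself is easy to do by hand. The sketch below counts values per fixed-width bin and tells you which bar holds the mean, so you know which one to color. The function name `histogram`, the bin width and the revenue figures are all assumptions made for this example.

```python
def histogram(values, bin_width):
    """Count values per fixed-width bin.

    Returns (bins, mean_bin): bins is a list of (bin_start, count) pairs
    covering every bin from the smallest to the largest value (empty bins
    included), and mean_bin is the start of the bin containing the mean.
    """
    if not values or bin_width <= 0:
        raise ValueError("need at least one value and a positive bin width")

    def index(v):
        return int(v // bin_width)

    first, last = index(min(values)), index(max(values))
    counts = [0] * (last - first + 1)
    for v in values:
        counts[index(v) - first] += 1
    bins = [((first + i) * bin_width, c) for i, c in enumerate(counts)]
    mean = sum(values) / len(values)
    return bins, index(mean) * bin_width


# Dummy revenues, for illustration only.
revenues = [50, 120, 130, 180, 210, 220, 240, 260, 310, 480]
bins, mean_bin = histogram(revenues, bin_width=100)
for start, count in bins:
    marker = "  <- highlight (contains the mean)" if start == mean_bin else ""
    print(f"{start:>4}-{start + 100:<4} {'#' * count}{marker}")
```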

“Series A and B have a correlated behaviour over time”

Example: “The windier it gets, the higher the sales of roofing products”

Look no further than a line chart, with two lines representing your two data series. There are formal methods to compute a correlation between two series, but there’s nothing more efficient than showing two lines side by side and letting the user feel that they move together.

Screenshot_11_8_13__4_04_PM

A few tips:

  • Try to aggregate the data at different time levels (minute > hour > day > week > month > …)
  • Careful though, always remember that correlation and causality are two very different things.
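
For the curious, one of those formal methods is the Pearson correlation coefficient, which ranges from -1 (the series move in opposite directions) to 1 (they move together). Below is a plain-Python sketch; the `pearson` helper and the wind and sales figures are invented for illustration, and a high value still says nothing about causality.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("a constant series has no defined correlation")
    return covariance / norm


# Dummy weekly figures, for illustration only.
wind_speed = [10, 20, 30, 40, 50]
roof_sales = [12, 19, 33, 41, 48]
print(f"correlation: {pearson(wind_speed, roof_sales):.2f}")
```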

“The greater the A, the greater the B”

Example : “The greater the fundraising round size, the greater the valuation of the startup”

Say you have a list of fundraising deals, and for each deal you have the amount raised and the pre-money valuation of the company. The easiest way to see, literally, whether there’s a correlation between amount raised and valuation is to draw a scatter plot.

Screenshot_11_8_13__5_25_PM

A few tips:

  • You can change the scale type on the two axes if necessary (here for instance we used a log scale)
  • You can add a trend line to the scatter plot: it helps the reader understand the general behaviour of your collection
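
Excel draws the trend line for you, but here is roughly what happens when both axes use a log scale: a straight line is fitted by least squares through the logarithms of the values. This is a sketch under that assumption, not a description of Excel’s internals; `log_log_trend` is a hypothetical helper and the deals are dummy data.

```python
from math import log10


def log_log_trend(xs, ys):
    """Least-squares line through the points (log10 x, log10 y).

    Returns (slope, intercept) such that
    log10(y) is approximately slope * log10(x) + intercept.
    All values must be strictly positive, otherwise log10 raises ValueError.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    lx = [log10(x) for x in xs]
    ly = [log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Dummy deals (amount raised, pre-money valuation), in millions.
raised = [1, 10, 100]
valuation = [4, 40, 400]
slope, intercept = log_log_trend(raised, valuation)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A slope close to 1 on a log-log plot means the valuation grows roughly in proportion to the amount raised.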

“This group can be split into N different segments”

Well, segmentation and clustering are a pretty tough topic that deserves a whole post in itself. Let’s wait until we find the time to write about it.

Stay tuned!

In a word, anyone can create simple, straightforward data visualizations, provided you first think about what you wanna say with your chart, then select the proper type of chart to convey your message.

Today, Excel does the job pretty well (every chart in this post was made using Excel only), and soon you’ll be able to tell stories with any data using qunb. Be patient!

You can already use qunb to enhance your Google Analytics data and tell compelling visual stories on-the-fly. That’s our 1-click web traffic report. Check it out and tell us what you think.

 

I want to check qunb in action and yell at its creators