Kinds of data – Erica Gunn

This post is part of a larger series focused on exploring the fundamental principles of data visualization. Eventually, the collection may grow into something larger and more coherent. For now, each post simply picks up and plays with one idea related to how we represent data visually. Other posts in this series can be found using the Form to Data tag.

There are three main types of data: values, categories, and connections. Understanding these kinds of data helps you choose encodings that support a particular analysis, and create charts that are appropriate to the structure of the data itself.

Values are simply numbers, and are usually the result of a measurement or calculation. Because they quantify some aspect of an object, they are sometimes called quantitative data, or measures. A person’s age, for example, is quantitative information: a measurement of the number of years that they have been alive.

Values can be either continuous or discrete. A continuous number can be measured from one value to the next and at all the places in between. If you draw a line on a piece of paper from value 1 to 2, you can put a dot at 1.1, 1.5, or 1.79. This is an example of a continuous measurement.

Discrete measurements behave a little differently. You can have one value or another, but there is nothing in between. You can think of these like steps or boxes. An item can be on step 1 or 2, but it doesn’t make sense to say that it’s on step 1.1, or in box 2.3. We often round continuous values to discrete measurements to make life easier: you probably give your age in years rather than days, hours, minutes, and seconds, and you probably say you’ve gotten one whole year older on your birthday, even though you’ve really gotten older one day at a time. Some kinds of data can only be discrete; others are continuous values that we round to discrete ones when we don’t need the extra precision.
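As a rough illustration of rounding a continuous value into a discrete one, here is a minimal Python sketch. The specific age is a made-up number, and truncating to whole years is just one possible rounding rule.

```python
import math

# Hypothetical continuous measurement: an age in fractional years.
age_continuous = 32.79

# Reporting age in whole years discards everything after the decimal point,
# so the reported value only steps up once per birthday.
age_in_years = math.floor(age_continuous)

print(age_in_years)  # 32
```

The continuous value can sit anywhere between 32 and 33, but the discrete version can only ever be one or the other.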

Ordinal data are (usually) discrete numbers that go in a particular order. Numbers will always keep a particular sense of order (1 goes before 2, 37 comes after 20). Ordinal data use this as a way of putting things in sequence: first, second, third. There might be three people in a group who have ages 50, 28, and 32. Those quantitative values don’t tell you that the 50-year-old is first in line at the grocery store, but they do tell you that the 28-year-old is the youngest (first in age order), and the 50-year-old is the oldest (third in age order). Age can be used as ordinal data in this case, but it is not necessarily the same as the order in line. The same number may be ordinal for one measurement and not ordinal for another.
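The grocery line example can be sketched in a few lines of Python. The order of the other two people in line is an assumption made for illustration; the text only says that the 50-year-old is first.

```python
# Ages listed in grocery-line order (the 28 and 32 positions are assumed).
ages_in_line = [50, 28, 32]

# Rank each person by age: 1 = youngest, 3 = oldest.
sorted_ages = sorted(ages_in_line)
age_rank = {age: sorted_ages.index(age) + 1 for age in ages_in_line}

# Position in line is a different ordering of the same three people.
line_position = {age: i + 1 for i, age in enumerate(ages_in_line)}

print(age_rank)       # {50: 3, 28: 1, 32: 2}
print(line_position)  # {50: 1, 28: 2, 32: 3}
```

The same three numbers produce two different orderings, depending on which question is being asked.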

When we use numbers to rank things (first, second, third), those numbers can become a way to look things up in a list. An index tells you where to look to find a particular thing in a list: it is the first, tenth, or seventeenth item. Indices are usually discrete ordinal numbers, but there is usually only one copy of each value. The group 3, 5, 5, 2, 9, 7 is a set of discrete ordinal numbers that can be put into the correct sequence: 2, 3, 5, 5, 7, 9. These numbers are not good indices, though, because they describe the values rather than positions in the list (the value 2 is the first item, not the second), and there is a duplicate (two 5’s). Giving this list of numbers an index lets you talk about which value exists at each position in the list. Here, each pair is written as index: value. 1: 2, 2: 3, 3: 5, 4: 5, 5: 7, 6: 9. Now, we can say that we want the 5th item in the list, and that its value is 7.
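The same indexing step can be written as a short Python sketch. Counting positions from 1 follows the example in the text; many programming languages, Python included, count from 0 by default.

```python
values = [3, 5, 5, 2, 9, 7]

# Put the values into sequence, then attach a 1-based index to each position.
ordered = sorted(values)
indexed = dict(enumerate(ordered, start=1))

print(indexed)     # {1: 2, 2: 3, 3: 5, 4: 5, 5: 7, 6: 9}
print(indexed[5])  # 7
```

Each index appears exactly once, even though the value 5 appears twice.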

Another interesting thing happened in the grocery line example above: we used a number (a quantitative value) as a way to identify a person, when I typed “the 50-year-old”. This is an example of how we often use numbers as categories, rather than as quantitative values. Instead of being a measure of a person’s age, the number 50 just became a part of their identity: a category that I placed them into in order to distinguish them from the rest of the group.

Binary categories are groups that contain only two values, which can be either numbers or category names. In computer code, the binary values are 0 and 1, which tell a programmer whether a switch is on or off. This is another example of using numbers as categories. Other binary data values include true/false, plus/minus, on/off, yes/no, A/B, etc. Ternary data works the same way, except that it introduces a third value: -1, 0, 1, yes/no/maybe, etc.

In binary systems, we group all data into just two categories, but there is no reason that we have to stop at just two. Just like we can use an index to give items an order in a list, we can use bins to sort items into different categories. The bins can be the same size (usually called width), or they can be different. Let’s say I have a group of students taking a class. They all get different scores on their first exam (a quantitative measure of their performance). I might bin those scores into categories, as follows: A: 92, 93, 98. B: 85, 89, 81. C: 73, 78, 75. And so on. In this case, the bins are all the same width: 10 points each. I could do the same thing with more bins to increase the resolution and further divide students into groups of A+, A, and A-.
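Here is one way the exam-score binning might look in Python. The exact cutoffs (90-99 for A, 80-89 for B, 70-79 for C) are an assumption; the text only says that each bin is 10 points wide.

```python
from collections import defaultdict

scores = [92, 93, 98, 85, 89, 81, 73, 78, 75]

# Assumed bins: integer-dividing by 10 gives the tens digit, which picks the letter.
labels = {9: "A", 8: "B", 7: "C"}

bins = defaultdict(list)
for score in scores:
    bins[labels[score // 10]].append(score)

grades = dict(bins)
print(grades)  # {'A': [92, 93, 98], 'B': [85, 89, 81], 'C': [73, 78, 75]}
```

Because every bin has the same width, a single division is enough to find the right bin. A score outside 70-99 would need more labels than this sketch defines.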

Sometimes it is more helpful to have uneven bin widths. If I am considering the age range of a population, I could bin everyone into groups by decade: 0-9, 10-19, 20-29, etc. This is useful if age is all I care about, but if I am interested in information that’s related to social structure, it might be more useful to bin the data into groups of children and adults, or to separate it out by life stages: 0-2, 3-10, 11-18, 19-21, 22-35, 36-55, 56-70, 71+. How we choose our bins reflects the questions that we want to ask, and it affects what we will see in the data.
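With uneven bins, a single division no longer works, so the bin has to be looked up from a list of edges. The sketch below uses Python's standard `bisect` module, assuming ages are whole numbers, each range includes both of its endpoints, and a 55-year-old belongs to the 36-55 group. The function name `life_stage` is made up for this example.

```python
from bisect import bisect_right

# Lower edge of each life-stage bin, in increasing order.
lower_edges = [0, 3, 11, 19, 22, 36, 56, 71]
labels = ["0-2", "3-10", "11-18", "19-21", "22-35", "36-55", "56-70", "71+"]

def life_stage(age):
    """Return the label of the bin that contains a non-negative age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Count how many lower edges are <= age; the last of them starts the bin.
    return labels[bisect_right(lower_edges, age) - 1]

for age in [50, 28, 32]:
    print(age, life_stage(age))
```

The 28-year-old and the 32-year-old land in the same life-stage bin here, while binning by decade would have separated them. That is the sense in which the choice of bins affects what we see in the data.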

Categorical data includes any method of turning individuals into groups based on shared identity, whether it uses a number (age) or a named property (sex) to do so. Categorical data is also often called nominal data because it names features or properties of an object/item/person so that we can compare them with others. Almost anything can be a category. I am an adult, white, female, designer, married, and living in an apartment. All of those descriptions are categories that describe something about who I am, and that could be used to compare me to others. In the grocery store, we could split things up into fruits, vegetables, grains, and meat. Specific kinds of fruit could be their own set of categories: apples, oranges, bananas, etc.

Categories can also be divided into other categories, as I did above for fruit. This creates nested relationships, where one group of things contains another group. In this case, fruit is the larger category, and it contains several different types. We often talk about nested groups in terms of a hierarchy, which gives them a defined order and helps us to understand relationships between the groups. Fruit is the parent group, and the different kinds of fruit are the child categories.
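A nested category structure like this one can be sketched as a Python dictionary, where each parent category maps to its list of children. Only the fruit group has children named in the text, so the other lists are left empty here.

```python
# Parent categories map to their child categories.
groceries = {
    "fruit": ["apples", "oranges", "bananas"],
    "vegetables": [],
    "grains": [],
    "meat": [],
}

# Invert the structure so each child can look up its parent.
parent_of = {
    child: parent
    for parent, children in groceries.items()
    for child in children
}

print(parent_of["apples"])  # fruit
```

Storing the hierarchy in both directions makes it easy to go from a group to its members, or from a member back to its group.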

The parent-child relationship is an example of the last data type: connections (or relationships) between categories. These relationships help us keep track of groups, so that we can see how they relate to each other. Some of the most interesting analyses come out of seeing changes between categories over time. We can look for shared categories between individuals that end up in a similar place (social status and educational background of children who go on to college), or we can track changes in a category over time (size of demographic groups, or changes in population distribution). We can also look for ways to understand how individuals are connected to each other in a social network, and how that affects their success in life. The key feature of connection data is that there are two points involved: either two different people/categories, or, for time-based data, a beginning and an end. Relationships can be hierarchical (parent-child) like you see in a family tree, or they can be non-hierarchical, like in a network of peers.
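Since every connection involves two points, one simple way to store connection data is as a list of pairs. The sketch below builds a small non-hierarchical peer network in Python; the names and friendships are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical peer network: each pair is one connection between two people.
friendships = [("Ana", "Ben"), ("Ben", "Cy"), ("Ana", "Cy"), ("Cy", "Dee")]

# Peer connections have no direction, so record each one for both people.
neighbors = defaultdict(set)
for a, b in friendships:
    neighbors[a].add(b)
    neighbors[b].add(a)

# Count how many connections each person has.
connection_counts = {person: len(others) for person, others in neighbors.items()}

print(connection_counts)  # {'Ana': 2, 'Ben': 2, 'Cy': 3, 'Dee': 1}
```

A hierarchical relationship could be stored the same way, except that the order within each pair would matter: the first item would be the parent and the second the child.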
