Coursera · Getting and Cleaning Data · R

Tidy Data, wide and long

As Hadley Wickham’s “Tidy Data” is being discussed at the moment, I thought I would put keyboard to screen and make a few points that have been ticking around in my mind for a while. On the assumption that people are familiar with the original article, here they are:

Atomic members of sets

It is fairly easy to see that, under the one-row-per-observation rule, each row is a unique and complete member of the set of observations (unique in that it is a different observation, even if its characteristics are the same as another entry’s). If any reordering or subsetting of rows takes place, the information in each remaining row is preserved. The table contains a set of observations, and each observation holds one value per variable, drawn from the set of possible values for that variable.
It is not quite as easy to see that the same is true for columns, since they are categories, but given the one-variable-per-column rule, clearly something along those lines is going on. If any reordering or subsetting of columns takes place, the information in each remaining column is preserved. The table contains a set of variables describing the items in the table, and each variable holds one value per row, drawn from the set of possible values for that variable.

Possible values

This shared “set of possible values” leads, I think, to a common case of data that is technically not tidy but still useful: the NA. NA (not available) marks a reading that is not present, but it is also used when a reading cannot be present (effectively, not applicable). Data that needs NA to code for “does not have a value, never had a value, and never will have a value” contains rows that fall outside the set of possible values for the variable. The rows include things the variable applies to and things it does not apply to. By extension, the table includes things described by the variables and things not described by them; in other words, the table contains more than one kind of observation.
It is actually fairly common to design data collection that is both untidy in this manner and completely useful. The classic follow-on question in surveys, “If you bought this product, how strongly did you feel about it?”, almost always winds up represented in an untidy form that needs a bit of extra work to process (assuming it is constructed in a way that does not violate the one-variable-per-column rule). So in practice it doesn’t make much difference.
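To make the two meanings of NA concrete, here is a minimal sketch in plain Python, with lists of dictionaries standing in for a data frame and `None` standing in for NA. The column names (`respondent`, `bought`, `feeling`) and the three rows are invented for illustration. Splitting the table by kind of observation is one possible way to do the "extra work", not the only one.

```python
# Hypothetical survey rows: `feeling` only applies when `bought` is True.
responses = [
    {"respondent": 1, "bought": True, "feeling": 4},
    {"respondent": 2, "bought": False, "feeling": None},  # not applicable
    {"respondent": 3, "bought": True, "feeling": None},   # applicable, but not available
]

# One table per kind of observation: all respondents, then buyers only.
respondents = [
    {"respondent": r["respondent"], "bought": r["bought"]} for r in responses
]
buyer_feelings = [
    {"respondent": r["respondent"], "feeling": r["feeling"]}
    for r in responses
    if r["bought"]
]

# In the buyers-only table, None can only mean "not available".
not_applicable = [r["respondent"] for r in responses if not r["bought"]]
not_available = [r["respondent"] for r in buyer_feelings if r["feeling"] is None]
```

In the combined table the two `None` values look identical; after the split, the remaining `None` is unambiguous because the variable applies to every row of `buyer_feelings`.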

Conversion from Wide form to Long form (or Long to Wide)

One place this idea can make a difference is in thinking about converting members of the set of possible values for a variable into new variables (conversion from long to wide form), or the reverse. While the original article favours the long form (for efficiency-of-storage reasons, drawing inspiration from databases), I would argue there is no difference if you can convert from one to the other with no loss or gain of information. Pragmatically, this has been done for a very long time in preparing data for logistic regression, and if that is going awry due to underlying problems, the world has bigger issues. So, let’s try to describe the process.
A conversion from wide to long with no change in information entropy would mean that each observation in the long form is represented in a wide column (in logistic regression, this often takes the form of “x is” in the long version becoming “is x?” in the wide). For this to be the case, the possible values that are transformed into variables must form subsets of a broader category (the column in the long form). Going from wide to long, each variable is a subset of the category it becomes part of in the long form. Practically, the variables are at least going to have to share the same units (and the possible values cannot be mutually exclusive).
The “one weird trick” that allows it all to work is that what the table is “about” is defined by the columns (and the set of possible values within the columns). So in going from wide to long (or the reverse), the definition of the table’s “thing” changes: rather than a table of survey question answers (long), it is a table of respondents to a survey (wide). That subsetting (or supersetting) of categories changes the table’s organisation and, because the two are linked, its description. But it does not affect the smallest atomic pieces: each value.
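The idea of a conversion that neither loses nor gains information can be sketched as a round trip. The following is a minimal illustration in plain Python, with lists of dictionaries standing in for a data frame. The helper names `wide_to_long` and `long_to_wide`, the column names, and the survey values are all assumptions made up for this example.

```python
def wide_to_long(rows, id_col, value_cols, var_name="question", value_name="answer"):
    """Emit one row per (id, column) pair; every value is carried over unchanged."""
    return [
        {id_col: row[id_col], var_name: col, value_name: row[col]}
        for row in rows
        for col in value_cols
    ]


def long_to_wide(rows, id_col, var_name="question", value_name="answer"):
    """Emit one row per id, turning each distinct variable name into a column."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[id_col], {id_col: row[id_col]})
        record[row[var_name]] = row[value_name]
    return list(wide.values())


# Hypothetical table of respondents (wide): one row per respondent.
wide = [
    {"respondent": 1, "q1": 5, "q2": 3},
    {"respondent": 2, "q1": 2, "q2": 4},
]

# Table of survey question answers (long): one row per answer.
long = wide_to_long(wide, "respondent", ["q1", "q2"])

# Converting back recovers the original table exactly.
round_trip = long_to_wide(long, "respondent")
```

Here the wide table has two rows (respondents) and the long table has four (answers), so what the table is "about" changes, while the four atomic values pass through untouched and `round_trip == wide` holds.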
As a practical, rule-of-thumb extension and summation of all that: if your data transformation process is introducing NAs, then there is likely a change in information entropy, and you should check very carefully what you have done, as the result most likely will not be tidy, so it had better be useful.
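That rule of thumb can be turned into a simple check: count missing values before and after a transformation and flag any increase. The sketch below uses plain Python, with `None` standing in for NA. The helpers `count_missing` and `pivot_wider` and the sample rows are hypothetical, written only for this illustration.

```python
def count_missing(rows):
    """Number of None cells in a list of dict rows."""
    return sum(value is None for row in rows for value in row.values())


def pivot_wider(rows, id_col, var_col, value_col):
    """Long to wide, filling any absent (id, variable) pair with None."""
    variables = sorted({row[var_col] for row in rows})
    wide = {}
    for row in rows:
        record = wide.setdefault(
            row[id_col], {id_col: row[id_col], **dict.fromkeys(variables)}
        )
        record[row[var_col]] = row[value_col]
    return list(wide.values())


# Hypothetical long table: respondent 2 has no row for q2.
long_rows = [
    {"respondent": 1, "question": "q1", "answer": 5},
    {"respondent": 1, "question": "q2", "answer": 3},
    {"respondent": 2, "question": "q1", "answer": 2},
]

wide_rows = pivot_wider(long_rows, "respondent", "question", "answer")

# Compare missing-value counts before and after the transformation.
introduced = count_missing(wide_rows) - count_missing(long_rows)
if introduced > 0:
    print(f"transformation introduced {introduced} missing value(s); check the result")
```

The long table has no missing values, but the wide table has to invent a cell for respondent 2 and `q2`, so the count goes up by one. That increase is the signal to stop and look at what the reshaped table now claims to describe.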

