A blog on programming and data analysis with R
Graphic packages like lattice
and ggplot2
work best when data are presented in long format, as opposed to the wide format in which data are typically imported and in which it is usually convenient to perform calculations. For example, morley
is a data frame in long format, which allows functions like lattice::xyplot
to receive simple formulas and yet produce a complex trellis display.
library(lattice)
data(morley)
Take e.g. the formula Speed ~ Run | Expt
, in which the speed of light measured by Michelson in 1879 plays the role of a response variable, the measurement run that of a predictor and the experimental replicate acts as a conditioning variable:
morley$Expt <- as.factor(paste("Expt", morley$Expt)) # adjustment for xyplot
xyplot(Speed ~ Run | Expt, data = morley)
So, how is a given data set to be restructured between the wide and long formats? How do those structural properties reflect the underlying experimental design, such as a longitudinal or a factorial one? And how should stats::reshape
, R’s principal restructuring tool, be used with a given structure/design?
To grasp the difference between the long and wide format, consider experimental units (a.k.a. records, observations, …) on which several variables have been observed. We will distinguish among the units using ids, and among the variables using a set of labels/classes. It is helpful to regard such labels as generalized indices, and any such set of labels as an index set for variables (and not for units). In the morley
data set the ids are run numbers (1 to 20), whereas the variable indices are experiment numbers (1 to 5). For most real-world data sets, however, neither type of index is an integer (consider patient ids, for instance).
Our data set may be written as the statistical table
It is this table that is modeled in R as a data frame in wide format. The long format, on the other hand, corresponds to the following table:
Thus, the set of variables (in columns) has been concatenated into a single variable in a single column, and a new column was introduced to label each datum with the appropriate generalized index.
From a purely syntactic viewpoint any data set in wide format may be converted into long format and vice versa. The R functions utils::stack
and stats::reshape
do exactly that so they are invaluable time savers when preparing data for lattice
for example. But semantically it is only sensible to convert wide format into the long one shown above when the index set models some meaningful relation among all the variables. For instance, the index set may be a set of genes in a genomic experiment, indexing a measure of gene expression. Other examples of index sets include the set of floral parts {sepal, petal}, a set of directions along which floral parts have been measured, or a set of Iris species such as {setosa, versicolor, virginica}. Those familiar with R’s datasets
package may have instantly identified the latter example as the classic Fisher–Anderson data set residing in datasets::iris
.
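As a minimal sketch of this purely syntactic round trip, the lines below take a fresh copy of morley, spread it into a wide data frame with one column per experiment using utils::unstack, and then concatenate those columns again with utils::stack. The names m, wide and long are hypothetical and serve only this illustration; the wide column names are whatever unstack generates from the integer experiment numbers.

```r
# Fresh copy, so changes made to morley elsewhere do not affect this sketch
m <- datasets::morley

# Wide format: one row per run, one column per experiment
wide <- unstack(m, Speed ~ Expt)
dim(wide)

# Long format again: 'values' holds the speeds, 'ind' labels the experiment
long <- stack(wide)
str(long)

# The concatenated values are the original measurements in the original order
all.equal(long$values, m$Speed)
```

Here the runs play the role of units and the experiment number is the generalized index, so stack alone is sufficient: there are no further variables to carry along.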
In general not all variables are necessarily related to each other in a way that justifies indexing them with a single index set. Take longitudinal studies as an example. In such experiments there are several time points at which some quantity is measured (on each of the units), so that the sequence of the corresponding random variables yields a time series. Those variables are called time-varying in the help documentation of reshape
and their relation is time. The remaining variables are called time-constant in reshape
’s help but I will use the term time-invariant here. Examples of time-invariant variables are the race and gender of study subjects.
More generally, the index set need not have a temporal interpretation and need not even form an ordered set. That applies to our earlier example of a genomic experiment, in which the index set is a set of genes and gene expression might be called “gene-varying”. If some variables are not indexed by genes, those remaining variables are “gene-invariant”. For instance, they may represent metabolic (and hence non-genetic) properties. These variables may even be related to each other in a systematic way that justifies indexing them with a single index set of their own, distinct from the set of genes. E.g. besides genome-wide gene expression, the concentration of the same metabolite may be measured in different organs, so that the second index set is a set of organs.
So, given an index set $T$ of size $m$, it might be appropriate to call the first $m$ variables “$T$-varying” and the remaining variables “$T$-invariant”; the latter may in turn be indexed by a second set, say $U$. These are illustrated by the second and third column, respectively, of the table below.
index set | $T$ | $U$ |
---|---|---|
property of variable | $T$-varying | $T$-invariant |
example I | time | gender |
example II | gene identity | organ |
The important points to realize about the data sets of our current focus:
In the wide format we have
To arrive at a long format the reshape
function lets the index set guide the concatenation of the varying variables as before. But now reshape
must also take care of the remaining, invariant variables. It does so by replicating them and storing them in the same number of columns as they occupy in the wide format. So, the long format looks like
Note that the simple-to-use stack
is of little help in this case so that only reshape
can be used. Also note that the idvar
argument of reshape
corresponds to the column of unit ids in the long format and the timevar
argument to the column that holds the index set of the varying variables. We will revisit the syntax and semantics of reshape
arguments later in this article.
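The following is a small sketch of that longitudinal case, assuming a made-up data frame (the names wide, long, subject, gender, weight.1, weight.2 and visit are all hypothetical). It shows the time-invariant column being replicated while the time-varying columns are concatenated, and where idvar and timevar end up.

```r
# Hypothetical wide-format longitudinal data: 3 subjects, weight at 2 visits
wide <- data.frame(
  subject  = c("a", "b", "c"),
  gender   = c("F", "M", "F"),      # time-invariant variable
  weight.1 = c(60.1, 82.4, 55.0),   # time-varying variable, visit 1
  weight.2 = c(61.0, 81.9, 55.7)    # time-varying variable, visit 2
)

long <- reshape(wide, direction = "long",
                varying = c("weight.1", "weight.2"),  # columns to concatenate
                v.names = "weight",                   # name of the concatenated column
                timevar = "visit", times = 1:2,       # column labelling the index
                idvar = "subject")                    # column identifying the units
long
```

Each subject now occupies two rows, one per visit, and gender is simply repeated on both.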
There is at least one more major advantage of reshape
over stack
but this feature is rather vaguely documented, so I attempt to elucidate both its usefulness and the way it may have been meant to be used in R.
The Iris data set, despite its minuscule size by today’s standards, illustrates a particularly important data structure: the one that results from a factorial (crossed) experimental design. Here we have multiple index sets, one for each of the factors species (S), floral part (F), and measurement direction (D). These are fully crossed factors that collectively specify a set of variables reporting on the size of floral parts. To set a practical goal we will sequentially reshape iris
into a form which will allow us to write xyplot(Size ~ X | Direction * Floral.Part, data = iris.long.format, groups = Species)
to produce a nested trellis display (i.e. one with multiple conditioning variables). Note that we will augment the original iris
data with a variable X, which we will treat as a covariate (i.e. a continuous predictor) of size; more about it soon.
In the preceding section we dealt with variables indexed by multiple sets and now we see that three sets (species, floral part and direction) are used to index the iris
data as well. Yet iris
is also fundamentally different:
Let’s see the special meaning of these points in case of the Iris data while we work our way towards the desired plot!
First we generate a covariate based on sepal length with
set.seed(13)
iris$X <- iris$Sepal.Length[seq_len(nrow(iris) / 3)] + rep(rnorm(nrow(iris) / 3), 3)
X will not be combined, or “crossed”, with any other variables; rather, it will play a role analogous to that of the time-invariant variables discussed in the case of longitudinal experiments.
With head(iris)
or str(iris)
we see that the present format of our data is
To connect this specific notation to the general one introduced above, note that each size variable, such as Sepal.Length, is one of the varying variables, here indexed jointly by floral part and direction. The present shape of the data may be called wide, and our goal motivates us to reshape it to long format. Notice the following points
“Meaningful way” means that although the two long format data frames may differ in e.g. the order of their components/columns, such differences must not have any impact on the way functions like xyplot
extract information from them.
This requirement is safely fulfilled as long as data frame components are extracted by name (as xyplot
does) rather than by integer or logical indices, or if precautions are taken when indexing by integers or logicals.
So, we are to generate the following long format (up to any “meaningless” differences, of course):
where the large gap in the middle represents all possible combinations of species, floral part and direction beyond the ones shown. The specific permutation (order) of these combinations depends on whether we take the F.D or the D.F sequence, but that difference carries no information for xyplot
, just like the order of components in the data frame. (What does matter, though, is the order of levels of components Species
, Floral.Part
and Direction
when those are represented by ordered factors.)
The prerequisite of our goal (the trellis plot of our extended version of the Iris data) is to reshape the data into the completely long format, taking either the F.D or the D.F sequence of two reshapes. The next two code blocks represent the F.D sequence. The first reshape is according to floral part and the result is assigned to iris.F
.
iris.F <-
reshape(iris, direction = "long",
varying = list(c("Sepal.Length", "Petal.Length"), c("Sepal.Width", "Petal.Width")),
v.names = c("Length", "Width"),
timevar = "Floral.Part", times = c("Sepal", "Petal"),
idvar = "id1")
Note how the varying
argument is a list of equal-length vectors whose length is the number of levels (in this case two) of the factor (floral part) according to which we perform the reshape. That the length of the list is also two is merely a coincidence: in general the list may be either shorter or longer than its vector components. The concluding section will discuss how that is related to the crossed factors of the experiment.
The second reshape in the F.D sequence is according to direction, yielding a data frame assigned to the name iris.F.D
.
iris.F.D <-
reshape(iris.F, direction = "long",
varying = c("Length", "Width"), v.names = "Size",
timevar = "Direction", times = c("Length", "Width"),
drop = "id1")
As we see, iris.F.D
structures data in the desired long-format (compare with the most recent table):
head(iris.F.D, n = 2)
## Species X Floral.Part Direction Size id
## 1.Length setosa 5.654327 Sepal Length 5.1 1
## 2.Length setosa 4.619728 Sepal Length 4.9 2
tail(iris.F.D, n = 2)
## Species X Floral.Part Direction Size id
## 299.Width virginica 5.577147 Petal Width 2.3 299
## 300.Width virginica 6.408354 Petal Width 1.8 300
The next code block performs the reverse sequential reshape, D.F, i.e. first according to direction and then according to floral part.
iris.D <-
reshape(iris, direction = "long",
varying = list(c("Sepal.Length", "Sepal.Width"), c("Petal.Length", "Petal.Width")),
v.names = c("Sepal", "Petal"),
timevar = "Direction", times = c("Length", "Width"),
idvar = "id1")
iris.D.F <-
reshape(iris.D, direction = "long",
varying = c("Sepal", "Petal"), v.names = "Size",
timevar = "Floral.Part", times = c("Sepal", "Petal"),
drop = "id1")
The “head” and “tail” of the new data frame iris.D.F
are the same as those of iris.F.D
(up to the order of the Floral.Part
and Direction
components).
head(iris.D.F, n = 2)
## Species X Direction Floral.Part Size id
## 1.Sepal setosa 5.654327 Length Sepal 5.1 1
## 2.Sepal setosa 4.619728 Length Sepal 4.9 2
iris.D.F[201, ]
## Species X Direction Floral.Part Size id
## 201.Sepal versicolor 5.654327 Width Sepal 3.2 201
tail(iris.D.F, n = 2)
## Species X Direction Floral.Part Size id
## 299.Petal virginica 5.577147 Width Petal 2.3 299
## 300.Petal virginica 6.408354 Width Petal 1.8 300
But the middle portions of the two data frames differ due to the different permutation of the crossed factors. That difference, however, has no statistical meaning, just like the order of columns. Both types of difference are implementation details that well-designed software tools should be able to ignore.
iris.F.D[201, ]
## Species X Floral.Part Direction Size id
## 201.Length versicolor 5.654327 Petal Length 4.7 201
iris.D.F[201, ]
## Species X Direction Floral.Part Size id
## 201.Sepal versicolor 5.654327 Width Sepal 3.2 201
Having sequentially reshaped the Iris data (supplemented with our covariate X) into two long-format data frames, we are ready to produce the trellis display. Starting with iris.F.D
(resulting from the “forward” sequence)
xyplot(Size ~ X | Direction * Floral.Part, data = iris.F.D, groups = Species, auto.key = list(columns = 3))
Using iris.D.F
(“reverse” sequence) yields
xyplot(Size ~ X | Direction * Floral.Part, data = iris.D.F, groups = Species, auto.key = list(columns = 3))
The two plots are identical since the differences between iris.F.D
and iris.D.F
—or equivalently between the corresponding sequences of reshapes—are immaterial from the viewpoint of xyplot
, as discussed earlier.
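One way to convince ourselves of this without looking at plots is to bring both results into a common row and column order and compare them. The sketch below does so on a fresh copy of the built-in data; the helper canon and the extra column plant are hypothetical names introduced only for this check, and the covariate X is left out for brevity.

```r
d <- datasets::iris
d$plant <- seq_len(nrow(d))  # hypothetical unit id, used only for sorting below

# F.D sequence: floral part first, then direction
d.F <- reshape(d, direction = "long",
               varying = list(c("Sepal.Length", "Petal.Length"),
                              c("Sepal.Width", "Petal.Width")),
               v.names = c("Length", "Width"),
               timevar = "Floral.Part", times = c("Sepal", "Petal"),
               idvar = "id1")
d.F.D <- reshape(d.F, direction = "long",
                 varying = c("Length", "Width"), v.names = "Size",
                 timevar = "Direction", times = c("Length", "Width"),
                 drop = "id1")

# D.F sequence: direction first, then floral part
d.D <- reshape(d, direction = "long",
               varying = list(c("Sepal.Length", "Sepal.Width"),
                              c("Petal.Length", "Petal.Width")),
               v.names = c("Sepal", "Petal"),
               timevar = "Direction", times = c("Length", "Width"),
               idvar = "id1")
d.D.F <- reshape(d.D, direction = "long",
                 varying = c("Sepal", "Petal"), v.names = "Size",
                 timevar = "Floral.Part", times = c("Sepal", "Petal"),
                 drop = "id1")

# Select columns by name and sort rows, removing the "meaningless" differences
canon <- function(x) {
  cols <- c("plant", "Species", "Floral.Part", "Direction", "Size")
  x <- x[order(x$plant, x$Floral.Part, x$Direction), cols]
  row.names(x) <- NULL
  x
}
all.equal(canon(d.F.D), canon(d.D.F), check.attributes = FALSE)
```

Once rows and columns are put in the same order, the two sequences of reshapes hold the same data.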
How to use reshape and why so?

The crucial but poorly documented feature of reshape
is how “time”-varying variables must be specified for the varying
argument in factorial experiments like the Iris study. Suppose there are $k$ factors $F_1, \dots, F_k$ whose level counts are $m_1, \dots, m_k$ (as we saw, these factors are equivalent to $k$ index sets). Because these factors are fully crossed, the number of varying variables is the product $m = m_1 m_2 \cdots m_k$. The remaining variables, if any, are not crossed with the previous factors and consequently play a passive role in any of the reshapes. In our extended Iris example $m = 3 \times 2 \times 2 = 12$, and there is a single uncrossed variable: the covariate X.
To illustrate the sequential use of reshape
we will

- reshape iris into a completely wide format, iris.wide
- reshape iris.wide back according to species to obtain iris.2
- check whether iris.2 is any different from iris (it should not be, apart from implementation details)

Reshaping iris to a completely wide format

It will be convenient to assign the value of the varying
argument to reshape to a name, say varying.v
.
(varying.v <-
lapply(c(paste0(".Sepal", c(".Length", ".Width")),
paste0(".Petal", c(".Length", ".Width"))),
function(x) paste0(c("Setosa", "Versicolor", "Virginica"), x)))
## [[1]]
## [1] "Setosa.Sepal.Length" "Versicolor.Sepal.Length"
## [3] "Virginica.Sepal.Length"
##
## [[2]]
## [1] "Setosa.Sepal.Width" "Versicolor.Sepal.Width"
## [3] "Virginica.Sepal.Width"
##
## [[3]]
## [1] "Setosa.Petal.Length" "Versicolor.Petal.Length"
## [3] "Virginica.Petal.Length"
##
## [[4]]
## [1] "Setosa.Petal.Width" "Versicolor.Petal.Width"
## [3] "Virginica.Petal.Width"
That varying.v
is indeed the correct argument will be shown and discussed shortly.
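For readers who prefer to see the pattern behind such a list, here is an alternative sketch that generates the same structure from the factor levels with expand.grid. The names species, parts, directions, rest and varying.alt are hypothetical; the capitalized species labels simply mirror the column names chosen above.

```r
species    <- c("Setosa", "Versicolor", "Virginica")  # the factor we reshape by
parts      <- c("Sepal", "Petal")                     # remaining factor 1
directions <- c("Length", "Width")                    # remaining factor 2

# All combinations of the remaining factors; the first argument varies fastest
rest <- expand.grid(direction = directions, part = parts,
                    stringsAsFactors = FALSE)

# One character vector per combination, each running over all species
varying.alt <- lapply(seq_len(nrow(rest)), function(i)
  paste(species, rest$part[i], rest$direction[i], sep = "."))

length(varying.alt)   # number of combinations of the remaining factors
lengths(varying.alt)  # number of species levels in each vector
```

The list has one component per combination of floral part and direction, and every component runs over the species.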
Additionally, we create an id
variable that numbers the 50 observations within each species, so that the same sequence of ids is repeated for every level of species.
iris$id <- rep(seq_len(nrow(iris) / 3), 3)
Now, the syntax of the desired reshape is this:
iris.wide <-
reshape(iris, direction = "wide",
varying = varying.v,
v.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
timevar = "Species", times = c("setosa", "versicolor", "virginica"),
idvar = "id")
The structure of the new data format looks as expected
str(iris.wide)
## 'data.frame': 50 obs. of 14 variables:
## $ X : num 5.65 4.62 6.48 4.79 6.14 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Setosa.Sepal.Length : num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Setosa.Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Setosa.Petal.Length : num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Setosa.Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Versicolor.Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
## $ Versicolor.Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
## $ Versicolor.Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
## $ Versicolor.Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
## $ Virginica.Sepal.Length : num 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
## $ Virginica.Sepal.Width : num 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
## $ Virginica.Petal.Length : num 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
## $ Virginica.Petal.Width : num 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
## - attr(*, "reshapeWide")=List of 5
## ..$ v.names: chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## ..$ timevar: chr "Species"
## ..$ idvar : chr "id"
## ..$ times : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
## ..$ varying: chr [1:4, 1:3] "Setosa.Sepal.Length" "Setosa.Sepal.Width" "Setosa.Petal.Length" "Setosa.Petal.Width" ...
Reshaping iris.wide to iris.2

Next, the reverse reshape operation is carried out. The syntax is conveniently the same as that of the forward reshape except for the direction
argument, which has now the value "long"
.
iris.2 <-
reshape(iris.wide, direction = "long",
varying = varying.v,
v.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
timevar = "Species", times = c("setosa", "versicolor", "virginica"),
idvar = "id")
Comparing iris and iris.2

In fact, iris
and iris.2
differ in their attributes, the order of their components, and in the fact that Species
is a factor in iris
but a character vector in iris.2
:
str(iris)
## 'data.frame': 150 obs. of 7 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ X : num 5.65 4.62 6.48 4.79 6.14 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
str(iris.2)
## 'data.frame': 150 obs. of 7 variables:
## $ X : num 5.65 4.62 6.48 4.79 6.14 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## - attr(*, "reshapeLong")=List of 4
## ..$ varying:List of 4
## .. ..$ : chr "Setosa.Sepal.Length" "Versicolor.Sepal.Length" "Virginica.Sepal.Length"
## .. ..$ : chr "Setosa.Sepal.Width" "Versicolor.Sepal.Width" "Virginica.Sepal.Width"
## .. ..$ : chr "Setosa.Petal.Length" "Versicolor.Petal.Length" "Virginica.Petal.Length"
## .. ..$ : chr "Setosa.Petal.Width" "Versicolor.Petal.Width" "Virginica.Petal.Width"
## ..$ v.names: chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## ..$ idvar : chr "id"
## ..$ timevar: chr "Species"
But do they differ in any meaningful way? Reordering iris.2
and converting its Species
component into factor shows
iris.2$Species <- as.factor(iris.2$Species)
all.equal(iris, iris.2[names(iris)])
## [1] "Attributes: < Component \"row.names\": Modes: numeric, character >"
## [2] "Attributes: < Component \"row.names\": target is numeric, current is character >"
Thus, the answer is: iris.2
does not have any meaningful differences from iris
, demonstrating the reversibility of reshape operations with the reshape
function. This means that we can perform a sequence of three reshapes from iris.wide
and reach essentially the same iris.F.D
or iris.D.F
as we did with two reshapes from iris
.
reshape and its varying argument

When our data are in a completely wide format, a sequence of reshapes must be used to reach the (completely) long format. Clearly, the sequence of reshapes follows a chosen sequence of factors $F_1, F_2, \dots, F_k$, with $m_1, m_2, \dots, m_k$ levels, respectively.
First we want to reshape according to factor $F_1$ (equivalently, its index set). With what arguments should we call reshape
? In particular, what should be the value of the varying
argument? The above examples show that varying
must be a list of character vectors, each of length $m_1$ (see e.g. varying.v
in the previous section).
This is because all levels of factor $F_1$ must appear in each vector. The number of such vectors, that is the length of the list itself, is $m_2 \times \dots \times m_k$ because we need to consider all combinations of the remaining factors. In the case of varying.v
the first factor is species, so $m_1 = 3$, and this is why each component of the list varying.v
is a vector of length 3. On the other hand, $m_2 = m_3 = 2$ because each of floral part and direction has two levels, so $m_2 \times m_3 = 4$, and this is why the length of varying.v
itself is 4.
In the second reshape $F_1$ plays no further role, so it is omitted from varying
. Now the list has $m_3 \times \dots \times m_k$ vector components and each vector is $m_2$ long. The sequence continues with $F_3, \dots, F_k$.
If we follow this rule then the last, $k$-th, reshape is special in the sense that no more factors remain and hence we arrive at an empty product (one with zero terms), which by convention equals 1. Then varying
is a list with a single component, an $m_k$-length vector. Such a trivial list may be unlist
ed: replaced by its only component (the $m_k$-length vector) without any loss of information. Notice how we could write varying = c("Length", "Width")
instead of (the also correct) varying = list(c("Length", "Width"))
when we reshaped iris.F
into iris.F.D
!
In fact, this “final” reshape is probably by far the most frequent way in which reshape
is called, since it was designed for longitudinal experiments where $k = 1$, with only one “factor” present: time.
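The interchangeability of the plain vector and the one-component list in the final reshape can be checked directly. The sketch below repeats the F.D sequence on a fresh copy of the built-in data (without the covariate X) and spells the last call both ways; d, d.F, by.vector and by.list are hypothetical names. Attributes are ignored in the comparison because reshape records the varying argument, as it was processed, in an attribute of the result.

```r
d <- datasets::iris  # fresh copy, so the sketch is self-contained

# First reshape, according to floral part
d.F <- reshape(d, direction = "long",
               varying = list(c("Sepal.Length", "Petal.Length"),
                              c("Sepal.Width", "Petal.Width")),
               v.names = c("Length", "Width"),
               timevar = "Floral.Part", times = c("Sepal", "Petal"),
               idvar = "id1")

# Final reshape with 'varying' given as a plain character vector ...
by.vector <- reshape(d.F, direction = "long",
                     varying = c("Length", "Width"), v.names = "Size",
                     timevar = "Direction", times = c("Length", "Width"),
                     drop = "id1")

# ... and as a list with that vector as its single component
by.list <- reshape(d.F, direction = "long",
                   varying = list(c("Length", "Width")), v.names = "Size",
                   timevar = "Direction", times = c("Length", "Width"),
                   drop = "id1")

all.equal(by.vector, by.list, check.attributes = FALSE)
dim(by.vector)
```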
The multiplicity of data representations is not only computationally convenient but also facilitates the clarification and expression of certain semantic relationships between variables. Unlike stack
the reshape
function allows the representation of various experimental designs by partitioning variables into related sets and indexing those separately. We only briefly mentioned how longitudinal experiments can be modeled with reshape
and how this can be generalized to cases where instead of time points genes or other objects serve as indices to a set of related variables. reshape
’s documentation provides details and examples on—mostly non-generalized—longitudinal applications.
In contrast, we examined more deeply how reshape
is also capable of dealing with the equally important but more complex case of factorial experiments, where several combined index sets correspond to crossed factors. In the Iris data set the factors are species, floral part, and measurement direction, so there are three of them. Moreover, we supplemented iris
with a covariate X, intended to relate to the original variables as time-invariant variables relate to time-varying ones in longitudinal studies, in the sense that data reshaping is governed only by the latter. This demonstrated how flexibly reshape
can be adapted to various setups.
We also saw that as many reshapes can be performed as there are factors. Therefore various sequences of reshapes may transform a completely wide-format data frame into a completely long-format one, and intermediate formats exist whenever we have more than one factor. Which sequence of reshapes we choose is immaterial, or at least should be, from the viewpoint of well-designed software such as the trellis plotting function xyplot
.