A blog on programming and data analysis with R

Graphic packages like `lattice`

and `ggplot2`

work best when data are presented in **long format** as opposed to **wide format** into which data are typically imported and in which it is usually convenient to perform calculations. For example `morley`

is a data frame in long format, which allows functions like `lattice::xyplot`

to receive simple formulas and yet produce a complex trellis display.

```
library(lattice)
data(morley)
```

Take e.g. the formula `Speed ~ Run | Expt`

, in which the speed of light measured by Michelson in 1879 plays the role of a response variable, the measurement run that of a predictor and the experimental replicate acts as a conditioning variable:

```
morley$Expt <- as.factor(paste("Expt", morley$Expt)) # adjustment for xyplot
xyplot(Speed ~ Run | Expt, data = morley)
```

So, how is a given data set to be restructured between the wide and long format? How do those structural properties reflect the underlying experimental design such as a longitudinal or a factorial one? And how to use `stats::reshape`

, R’s principal restructuring tool with a given structure/design?

To grasp the difference between the long and wide format, consider experimental units (a.k.a. records, observations, …) on which variables have been observed. We will distinguish among the units using **ids** , and among variables with a set of labels/classes where . It is helpful to regard such labels as generalized indices and any set like a **index set** for variables (and not for units). In the `morley`

data set ids are run numbers so that , whereas variable indices are experiment numbers. However for most real-world data sets neither type of indices are integers (consider patient ids, for instance).

Our data set may be written as the statistical table

It is this table that is modeled in R as a data frame in wide format. The long format, in the other hand, corresponds to the following table:

Thus, the set of variables (in columns) have been concatenated into a single variable in a single column named and a new column was introduced to label each datum with the appropriate generalized index.

From a purely syntactic viewpoint any data set in wide format may be converted into long format and vice versa. The R functions `utils::stack`

and `stats::reshape`

do exactly that so they are invaluable time savers when preparing data for `lattice`

for example. But semantically it is only sensible to convert wide format into the long one shown above when the index set models some meaningful relation among all variables to each other. For instance, may be a set of genes in a genomic experiment indexing a measure of gene expression. Other examples for index set include the set of *floral parts* such as , a set of *directions* along which *floral parts* have been measured, or a set of Iris *species* such as . Those familiar with R’s `datasets`

package may have instantly identified the latter example as the classic Fisher–Anderson data set residing in `datasets::iris`

.

In general not all variables are necessarily related to each other in a way that justifies indexing them with a single index set. Take longitudinal studies as an example. In such experiments there are say time points at which some quantity is measured (on each of the units) so that the sequence of random variables yields a time series. Those variables are called **time-varying** in the help documentation of `reshape`

and their relation is time. The remaining variables are called time-constant in `reshape`

’s help but instead I will use here the term **time-invariant**. Time-invariant variables are race and gender of study subjects, etc.

More generally, the index set need not have a temporal interpretation and need not even form an ordered set. That applies to our earlier example of a genomic experiment in which is a set of genes and gene expression might be called “gene-varying”. If then the remaining variables are “gene-invariant”. For instance these may represent metabolic (and hence non-genetic) properties. These variables may even be related to each other in a systematic way that justifies indexing them from a single index set that is distinct from . E.g. besides genome-wide gene expression the concentration of the same metabolite may be measured in different organs so that .

So given a -sized index set it might be appropriate to call the first variables “-varying” and the remaining variables “-invariant”. These are illustrated by the second and third column, respectively, of the table below.

index set | ||
---|---|---|

property of variable | -varying | -invariant |

example I | time | gender |

example II | gene identity | organ |

The important points to realize about the data sets of our current focus:

- we have multiple index sets (and possibly more) that together index a total of variables
- for each variable the index belongs to one and only one index set
- among all index sets only provides a rule to reshape the data from wide to long format (see below).

In the wide format we have

To arrive at a long format the `reshape`

function lets index set guide the concatenation of the first variables as before. But now `reshape`

also must take care of the remaining variables indexed by . It does so by replicating those and storing them in the same number of columns as the corresponding columns of the wide format. So, the long format looks like

Note that the simple-to-use `stack`

is of little help in this case so that only `reshape`

can be used. Also note that the `idvar`

argument of `reshape`

corresponds to in the long format and the `timevar`

argument to the index set . We will revisit the syntax and semantics of `reshape`

arguments later in this article.

There is at least another major advantage of `reshape`

over `stack`

but this feature is rather vaguely documented so I attempt to elucidate both its usefulness and the way it may have been meant to be used in R.

The *Iris* data set, despite its miniscule size in today’s standards, illustrates a particularly important structure of data that is the result of factorial (crossed) experimental design. Here we have multiple index sets say , and for the factors *species* (), *floral part* (), and measurement *direction* (), respectively. These are fully **crossed factors** that collectively specify a set of variables reporting on the *size* of floral parts. To set a practical goal we will sequentially reshape `iris`

into a form which will allow us to write `xyplot(Size ~ X | Direction * Floral.Part, data = iris.long.format, groups = Species)`

to produce a nested trellis display (i.e. one with multiple conditioning variables). Note that we will augment the original `iris`

data with a variable , which we will consider as a covariate (i.e. continuous predictor) of size—more about it soon.

In the preceding section we dealt with variables indexed by multiple sets and now we see that three sets (, and ) are used to index the `iris`

data as well. Yet `iris`

is also fundamentally different:

- we
*still*have multiple index sets indexing a total of variables - however, for each variable the index is a combination of all types of indices , which correspond to all levels of the crossed factors
- data may be sensibly reshaped according to any of the index sets (i.e. any of the factors)
- moreover, a
**sequence of reshapes**may be performed if an initial reshape according to is followed by one according to another index set

Let’s see the special meaning of these points in case of the *Iris* data while we work our way towards the desired plot!

First we generate a covariate based on sepal length with

```
set.seed(13)
iris$X <- iris$Sepal.Length[seq_len(nrow(iris) / 3)] + rep(rnorm(nrow(iris) / 3), 3)
```

will not be combined, or “crossed”, with any other variables; rather, it will play a role analogous to that of time invariant variables discussed in the case of longitudinal experiments.

With `head(iris)`

or `str(iris)`

we see that the present format of our data is

To connect this specific notation to the general one introduced above note that a size variable like conforms to the general form . The present shape of the data may be called wide and our goal motivates us to reshape it to long format. Notice the following points

- the present form is in fact intermediate because we can reshape the data in both towards a “completely” long or wide format
- a single reshape is sufficient to reach that wide format
- a sequence of two reshapes are needed for the long format
- two such sequences exist:
- sequence
*F.D*: reshape first according to*floral part*and then according to*direction* - sequence
*D.F*: the*F.D*in reverse order

- sequence
- the long formats that result from
*F.D*and*D.F*do not differ in any meaningful way and so software tools should not treat them differently either

“Meaningful way” means that although the two long format data frames may differ in e.g. the order of their components/columns, such differences must not have any impact on the way functions like `xyplot`

extract information from them.
This requirement is safely fulfilled as long as data frame components are extracted by names—as `xyplot`

does it—rather than integer or boolean indices or if precautions are taken in integer and boolean indexing.

So, we are to generate the following long format (up to any “meaningless” differences, of course):

where the large gap in the middle represents all possible combinations of *species*, *floral part* and *direction* beyond the ones shown. The specific permutation (order) of these combinations depends on whether we take the *F.D* or the *D.F* sequence but that difference carries no information for `xyplot`

similarly to the order of components in the data frame. (What does matter, though, is the order of *levels* of components `Species`

, `Floral.Part`

and `Direction`

when those are represented by ordered factors.)

The prerequisite of our goal (the trellis plot of our extended version of the *Iris* data) is to reshape the data into the completely long format either taking the *F.D* or the *D.F* sequence of two reshapes. The next two code blocks represent the *F.D* sequence. The first reshape is according to *floral part* and the result is assigned to `iris.F`

.

```
iris.F <-
reshape(iris, direction = "long",
varying = list(c("Sepal.Length", "Petal.Length"), c("Sepal.Width", "Petal.Width")),
v.names = c("Length", "Width"),
timevar = "Floral.Part", times = c("Sepal", "Petal"),
idvar = "id1")
```

Note how the `varying`

argument is a list of equal-length vectors whose length is the number of levels (in this case two) of the factor (*floral part*) according to which we perform the reshape. That the length of the list is also two is merely a coincidence. In general the list may be both shorter or longer than each of its own vector components. The concluding section will discuss how that is related to the crossed factors of the experiments.

The second reshape in the *F.D* sequence is according to *direction* yielding data frame assigned to the name `iris.F.D`

.

```
iris.F.D <-
reshape(iris.F, direction = "long",
varying = c("Length", "Width"), v.names = "Size",
timevar = "Direction", times = c("Length", "Width"),
drop = "id1")
```

As we see, `iris.F.D`

structures data in the desired long-format (compare with the most recent table):

```
head(iris.F.D, n = 2)
```

```
## Species X Floral.Part Direction Size id
## 1.Length setosa 5.654327 Sepal Length 5.1 1
## 2.Length setosa 4.619728 Sepal Length 4.9 2
```

```
tail(iris.F.D, n = 2)
```

```
## Species X Floral.Part Direction Size id
## 299.Width virginica 5.577147 Petal Width 2.3 299
## 300.Width virginica 6.408354 Petal Width 1.8 300
```

The next code block performs the reverse sequential reshape *D.F* i.e. first according to *direction* and then according to *floral part*.

```
iris.D <-
reshape(iris, direction = "long",
varying = list(c("Sepal.Length", "Sepal.Width"), c("Petal.Length", "Petal.Width")),
v.names = c("Sepal", "Petal"),
timevar = "Direction", times = c("Length", "Width"),
idvar = "id1")
iris.D.F <-
reshape(iris.D, direction = "long",
varying = c("Sepal", "Petal"), v.names = "Size",
timevar = "Floral.Part", times = c("Sepal", "Petal"),
drop = "id1")
```

The “head” and “tail” of the new data frame `iris.D.F`

are the same as those of `iris.D.F`

(up to the order of the `Floral.Part`

and `Direction`

components).

```
head(iris.D.F, n = 2)
```

```
## Species X Direction Floral.Part Size id
## 1.Sepal setosa 5.654327 Length Sepal 5.1 1
## 2.Sepal setosa 4.619728 Length Sepal 4.9 2
```

```
iris.D.F[201, ]
```

```
## Species X Direction Floral.Part Size id
## 201.Sepal versicolor 5.654327 Width Sepal 3.2 201
```

```
tail(iris.D.F, n = 2)
```

```
## Species X Direction Floral.Part Size id
## 299.Petal virginica 5.577147 Width Petal 2.3 299
## 300.Petal virginica 6.408354 Width Petal 1.8 300
```

But the middle portion of the two data frames differ due to the different permutation of crossed factors. That difference, however, has no statistical meaning similarly to the order of columns. Both types of difference are implementation details that well designed software tools should be able to ignore.

```
iris.F.D[201, ]
```

```
## Species X Floral.Part Direction Size id
## 201.Length versicolor 5.654327 Petal Length 4.7 201
```

```
iris.D.F[201, ]
```

```
## Species X Direction Floral.Part Size id
## 201.Sepal versicolor 5.654327 Width Sepal 3.2 201
```

Having sequentially shaped the *Iris* data—supplemented with our covariate —into two long-format data frames, we are ready to produce the trellis display. Starting with `iris.F.D`

(resulting from the “forward” sequence)

```
xyplot(Size ~ X | Direction * Floral.Part, data = iris.F.D, groups = Species, auto.key = list(columns = 3))
```

Using `iris.D.F`

(“reverse” sequence) yields

```
xyplot(Size ~ X | Direction * Floral.Part, data = iris.D.F, groups = Species, auto.key = list(columns = 3))
```

The two plots are identical since the differences between `iris.F.D`

and `iris.D.F`

—or equivalently between the corresponding sequences of reshapes—are immaterial from the viewpoint of `xyplot`

, as discussed earlier.

`reshape`

and why so?The crucial but poorly documented feature of `reshape`

is how “time”-varying variables must be specified for the `varying`

argument in factorial experiments like the *Iris* study. Suppose there are factors , whose levels count (as we saw these factors are equivalent to index sets ). With the notation we introduced for longitudinal setups because these factors are fully crossed. The remaining variables, if any, are not crossed with the previous factors and consequently play a passive role in any of the reshapes. In our extended *Iris* example and because there is a single uncrossed variable: covariate .

To illustrate the sequential use of `reshape`

we will

- reshape our extended
`iris`

into a completely wide format`iris.wide`

- then we will reshape
`iris.wide`

according to*species*to obtain`iris.2`

- we will check if
`iris.2`

is any different from`iris`

(it should not be apart from implementation details) - discuss the semantics behind the syntax of calling
`reshape`

`iris`

to a completely wide formatIt will be convenient to assign the value of the `varying`

argument to reshape to a name, say `varying.v`

.

```
(varying.v <-
lapply(c(paste0(".Sepal", c(".Length", ".Width")),
paste0(".Petal", c(".Length", ".Width"))),
function(x) paste0(c("Setosa", "Versicolor", "Virginica"), x)))
```

```
## [[1]]
## [1] "Setosa.Sepal.Length" "Versicolor.Sepal.Length"
## [3] "Virginica.Sepal.Length"
##
## [[2]]
## [1] "Setosa.Sepal.Width" "Versicolor.Sepal.Width"
## [3] "Virginica.Sepal.Width"
##
## [[3]]
## [1] "Setosa.Petal.Length" "Versicolor.Petal.Length"
## [3] "Virginica.Petal.Length"
##
## [[4]]
## [1] "Setosa.Petal.Width" "Versicolor.Petal.Width"
## [3] "Virginica.Petal.Width"
```

That `varying.v`

is indeed the correct argument will be shown and discussed shortly.

Additionally, we create an `id`

variable, whose role is to identify all observations in the sequence repeated for all levels of *species*.

```
iris$id <- rep(seq_len(nrow(iris) / 3), 3)
```

Now, the syntax of the desired reshape is this:

```
iris.wide <-
reshape(iris, direction = "wide",
varying = varying.v,
v.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
timevar = "Species", times = c("setosa", "versicolor", "virginica"),
idvar = "id")
```

The structure of the new data format looks as expected

```
str(iris.wide)
```

```
## 'data.frame': 50 obs. of 14 variables:
## $ X : num 5.65 4.62 6.48 4.79 6.14 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Setosa.Sepal.Length : num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Setosa.Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Setosa.Petal.Length : num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Setosa.Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Versicolor.Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
## $ Versicolor.Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
## $ Versicolor.Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
## $ Versicolor.Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
## $ Virginica.Sepal.Length : num 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
## $ Virginica.Sepal.Width : num 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
## $ Virginica.Petal.Length : num 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
## $ Virginica.Petal.Width : num 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
## - attr(*, "reshapeWide")=List of 5
## ..$ v.names: chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## ..$ timevar: chr "Species"
## ..$ idvar : chr "id"
## ..$ times : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
## ..$ varying: chr [1:4, 1:3] "Setosa.Sepal.Length" "Setosa.Sepal.Width" "Setosa.Petal.Length" "Setosa.Petal.Width" ...
```

`iris.wide`

to `iris.2`

Next, the reverse reshape operation is carried out. The syntax is conveniently the same as the forward equivalent except for the `direction`

argument, which has now the value `"long"`

.

```
iris.2 <-
reshape(iris.wide, direction = "long",
varying = varying.v,
v.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
timevar = "Species", times = c("setosa", "versicolor", "virginica"),
idvar = "id")
```

`iris`

and `iris.2`

In fact, `iris`

and `iris.2`

differ in their attributes, the order of their components, and in the fact that `Species`

is a factor in `iris`

but a character vector in `iris.2`

:

```
str(iris)
```

```
## 'data.frame': 150 obs. of 7 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ X : num 5.65 4.62 6.48 4.79 6.14 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
```

```
str(iris.2)
```

```
## 'data.frame': 150 obs. of 7 variables:
## $ X : num 5.65 4.62 6.48 4.79 6.14 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## - attr(*, "reshapeLong")=List of 4
## ..$ varying:List of 4
## .. ..$ : chr "Setosa.Sepal.Length" "Versicolor.Sepal.Length" "Virginica.Sepal.Length"
## .. ..$ : chr "Setosa.Sepal.Width" "Versicolor.Sepal.Width" "Virginica.Sepal.Width"
## .. ..$ : chr "Setosa.Petal.Length" "Versicolor.Petal.Length" "Virginica.Petal.Length"
## .. ..$ : chr "Setosa.Petal.Width" "Versicolor.Petal.Width" "Virginica.Petal.Width"
## ..$ v.names: chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## ..$ idvar : chr "id"
## ..$ timevar: chr "Species"
```

But do they differ in any meaningful way? Reordering `iris.2`

and converting its `Species`

component into factor shows

```
iris.2$Species <- as.factor(iris.2$Species)
all.equal(iris, iris.2[names(iris)])
```

```
## [1] "Attributes: < Component \"row.names\": Modes: numeric, character >"
## [2] "Attributes: < Component \"row.names\": target is numeric, current is character >"
```

Thus, the answer is: `iris.2`

does not have any meaningful differences from `iris`

, demonstrating the reversibility of reshape operations with the `reshape`

function. This means that we can perform a sequence of three reshapes from `iris.wide`

and reach essentially the same `iris.F.D`

or `iris.D.F`

as we did with two reshapes from `iris`

.

`reshape`

and its `varying`

argumentWhen our data are in a completely wide format, a sequence of reshapes must be used to reach the (completely) long format. Clearly, the sequence of reshapes follows a chosen sequence of factors .

First we want to reshape according to factor (equivalently index set . With what arguments should we call `reshape`

? In particular what should be the value of the `varying`

argument? The above examples show that `varying`

must be a list of character vectors each of length (see e.g. `varying.v`

in the previous section).

This is because all levels of factor must appear in each vector. The number of such vectors—the length of the list itself—is because we need to consider all combinations of the *remaining* factors. In case of `varying.v`

and corresponding to , this is why each component of the list `varying.v`

is a vector of length . On the other hand, and so because each of *floral part* and *direction* has levels, and this is why the length of `varying.v`

itself is .

In the second reshape plays no more role so it is omitted from `varying`

. Now the list has vector components and each vector is long. The sequence continues with .

If we follow this rule then the last, -th, reshape is special in the sense that no more factors remain and hence we arrive at the nonsense product (nonsense since it has zero terms). Then `varying`

is a list with a single component, a -length vector. Such a trivial list may be `unlist`

ed: replaced by its only component (the -length vector) without any loss of information. Notice how we could write `varying = c("Length", "Width")`

instead of (the also correct) `varying = list("Length", "Width")`

when we reshaped `iris.F`

into `iris.F.D`

!

In fact, this “final” reshape is probably by far the most frequent way in which `reshape`

is called since it was designed for longitudinal experiments where with only “factor” present: time.

The multiplicity of data representations is not only computationally convenient but also facilitates the clarification and expression of certain semantic relationships between variables. Unlike `stack`

the `reshape`

function allows representation of various experimental design by partitioning variables into related sets and indexing those separately. We only briefly mentioned how longitudinal experiments can be modeled with `reshape`

and how this can be generalized to cases where instead of time points genes or other objects serve as indices to a set of related variables. `reshape`

’s documentation provides details and examples on—mostly non-generalized—longitudinal applications.

In contrast, we examined more deeply how `reshape`

is also capable of dealing with the equally important but more complex case of factorial experiments, where number of combined index sets correspond to crossed factors. In the *Iris* data set the factors are *species*, *floral part*, and measurement *direction* so . Moreover, we supplemented `iris`

with a covariate with the intention of having a relation to the original variables akin to time-invariant variables relate time-varying ones in longitudinal studies in a sense that data reshaping is only governed by the latter. This demonstrated how flexibly `reshape`

can be adapted to various setups.

We also saw that as many reshapes can be performed as many factors are present. Therefore various *sequences* of reshapes may transform a completely wide-format data frame to a completely long-format one. Thus, intermediate formats exist whenever we have factors. Which sequence of reshapes we chose is immaterial, or at least should be, from the viewpoint of well-designed software such as the trellis display plotting `xyplot`

.