Data Frames in R

Arrays generalize the dimensional aspect of a matrix and assume only one data mode.  Data frames in R generalize the mode of a matrix and allow mode mixing: each column can hold a different mode.  Data frames are the most widely used data objects in R.

Creating Data Frames in R

You can create data frames in R several ways:

  • read.table() and its wrappers, such as read.csv(), read data from an external file as a data.frame.
  • data.frame() binds together R objects of various kinds.
  • as.data.frame() coerces objects of a particular type to objects of class data.frame.

The data.frame() function will create a data frame from existing objects if all columns have a name and equal length:
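The original example is not reproduced here, so the following is a minimal sketch with made-up values; the object names (id, score, m, df) are illustrative assumptions. It combines two vectors and a two-column matrix that all have three rows.

```r
# Hypothetical inputs: two vectors and a matrix, all with three rows
id    <- c("a", "b", "c")
score <- c(90, 85, 77)
m     <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("lo", "hi")))

# Unnamed arguments: column names are taken from the object names
df <- data.frame(id, score, m)

names(df)  # "id" "score" "lo" "hi": the matrix columns keep their own names
dim(df)    # 3 rows, 4 columns
```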

The names of the input objects are used for the names in the data frame, but the matrix input reverts to the matrix column names since multiple columns are supplied. Row names for the data frame are taken from the first input object that has names, dimnames, or row names.

The attributes of the data objects are not lost when they are combined in a data frame.  However, character vectors are converted to factors when stringsAsFactors = TRUE, which was the default before R 4.0.0; since R 4.0.0 the default is FALSE and character vectors are kept as they are.  Logical vectors are not converted.  To prevent coercion of a particular vector, pass it to data.frame() in a call to the I() function, which returns the vector unchanged but with the added class “AsIs”.
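A small sketch of the effect of I(), using a made-up character vector (grp, d1, and d2 are hypothetical names). stringsAsFactors = TRUE is set explicitly so that the result does not depend on the R version in use.

```r
# Hypothetical character vector
grp <- c("x", "y", "x")

# Request factor conversion explicitly
d1 <- data.frame(grp, stringsAsFactors = TRUE)

# Protect the vector with I() so it is stored as-is
d2 <- data.frame(grp = I(grp), stringsAsFactors = TRUE)

class(d1$grp)  # "factor"
class(d2$grp)  # "AsIs"
```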

It is also possible to supply matrices and lists when creating data frames.  If a matrix is submitted to data.frame(), it is the same as if the columns were supplied as individual objects.  If a list is supplied, it is treated as if its components had been supplied individually.  In both cases, suitable names are concocted if none are supplied.
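As an illustration of matrix and list inputs, here is a short sketch with invented values (m, l, dm, and dl are hypothetical names). The matrix has no column names, so names are generated for it.

```r
m <- matrix(1:4, nrow = 2)            # matrix without column names
l <- list(a = 1:2, b = c("u", "v"))   # list with named components

dm <- data.frame(m)
dl <- data.frame(l)

names(dm)  # "X1" "X2": names generated for the unnamed matrix columns
names(dl)  # "a" "b": each list component becomes a column
```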

Expanding a Data Grid in R

A unique way to create a data frame in R is to create a data grid with expand.grid().  A data grid contains one row for every combination of the input values.
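A minimal sketch using expand.grid() from base R; the variable names dose and sex and their values are invented for illustration.

```r
# All combinations of three doses and two sexes
grid <- expand.grid(dose = c(1, 2, 3), sex = c("F", "M"))

nrow(grid)  # 6 rows: 3 doses x 2 sexes
grid$dose   # 1 2 3 1 2 3: the first variable varies fastest
```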

Naming Data Frames and Using Names

Column names are defined when the data is passed to data.frame().  The row.names argument to data.frame() sets the row names, provided the supplied vector has one entry per row.  The attach() function can also be used to make the columns of a data frame visible by name, by placing the data frame on the search path.  The detach() function subsequently removes it from the search path.
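The following sketch puts these pieces together with invented values (df, bmi, and the column names are hypothetical).

```r
# Column names come from the argument names; row names are set explicitly
df <- data.frame(height = c(150, 160), weight = c(50, 60),
                 row.names = c("ann", "bob"))

names(df)      # "height" "weight"
row.names(df)  # "ann" "bob"

attach(df)                        # columns are now visible by name
bmi <- weight / (height / 100)^2  # uses the attached columns
detach(df)                        # remove df from the search path again
```

For short computations, with(df, weight / (height / 100)^2) gives the same result without modifying the search path.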

Subscripting R Data Frames

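Extracting a single column with $, [[ ]], or matrix-style indexing such as x[, 1] returns a plain vector (class numeric for a numeric column).  Single-bracket extraction by name, as in x["name"], keeps the data.frame class.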
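A short sketch of how the common extraction operators behave, using a made-up data frame df:

```r
df <- data.frame(x = c(1.5, 2.5), y = c("a", "b"))

class(df$x)                     # "numeric": $ returns the column as a vector
class(df[["x"]])                # "numeric"
class(df[, "x"])                # "numeric": matrix-style indexing drops to a vector
class(df["x"])                  # "data.frame": single brackets keep the class
class(df[, "x", drop = FALSE])  # "data.frame": drop = FALSE prevents the drop
```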

Sorting Data Frames in R

The traditional sort() function takes a vector and returns a vector of sorted values, and rev() reverses the order of a vector.  To sort larger data structures, where several variables must be rearranged in parallel and ties in one column are broken by another, use the order() and sort.list() functions.

The sort.list() function produces a positive integer index vector that will arrange its argument in increasing order.  To put a numeric vector x in decreasing order, index it with sort.list(-x) or sort.list(x, decreasing = TRUE).
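A minimal sketch with a made-up numeric vector x, showing the index vector and how it is used:

```r
x <- c(30, 10, 20)

sort.list(x)      # 2 3 1: positions of the values in increasing order
x[sort.list(x)]   # 10 20 30
x[sort.list(-x)]  # 30 20 10: negating the values reverses the order
```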

The function order() generalizes sort.list() to an arbitrary number of arguments.  The function also breaks ties across columns.  The following example sorts painters by composition count (descending as indicated by the negative) and then by school (ascending):
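The painters data is not loaded here. The sketch below uses a small stand-in data frame with invented values (painters_demo is a hypothetical name); only the column names Composition and School mirror the description above.

```r
# Stand-in for the painters data, with invented values
painters_demo <- data.frame(
  Composition = c(10, 15, 10, 15),
  School      = c("B", "B", "A", "A"),
  row.names   = c("p1", "p2", "p3", "p4")
)

# Descending Composition (note the minus sign), then ascending School
ord <- order(-painters_demo$Composition, painters_demo$School)
sorted <- painters_demo[ord, ]

row.names(sorted)  # "p4" "p2" "p3" "p1"
```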

All these functions have an argument na.last that determines the handling of missing values.  With na.last=NA (the default for sort()), missing values are deleted; with na.last=TRUE (the default for order()), they are put last.
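The effect of na.last can be seen on a made-up vector with one missing value:

```r
x <- c(3, NA, 1)

sort(x)                  # 1 3: the NA is dropped
sort(x, na.last = TRUE)  # 1 3 NA
order(x)                 # 3 1 2: the position of the NA comes last
order(x, na.last = NA)   # 3 1: the NA is removed from the index
```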

Combining and Modifying Data Frames in R

You can use data.frame() to combine one or more data frames, or use cbind(), rbind() or merge().  In practice, use rbind() only when you have complete data frames. Do not use it in a loop to add one row at a time to a data frame – this is inefficient. 
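One common alternative to growing a data frame row by row is to collect the pieces in a list and bind them once at the end. The sketch below assumes each piece is a one-row data frame with the same columns; the names pieces, result, and wide are hypothetical.

```r
# Build the pieces first (here: three one-row data frames)
pieces <- lapply(1:3, function(i) data.frame(id = i, sq = i^2))

# Bind all rows in a single call instead of inside the loop
result <- do.call(rbind, pieces)

# cbind() adds columns to a complete data frame
wide <- cbind(result, cube = result$id^3)

dim(result)  # 3 rows, 2 columns
names(wide)  # "id" "sq" "cube"
```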

Merging R Data Frames and Redundant Data

The merge() function joins two data frames whose rows overlap, matching them on shared columns.  Use the by argument when the key columns have the same name in both data frames, and by.x and by.y when the names differ.
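A minimal sketch of merge() with differently named key columns; the data frames authors and books and their contents are invented for illustration.

```r
authors <- data.frame(name = c("ann", "bob", "cy"),
                      dept = c("A", "B", "A"))
books   <- data.frame(author = c("ann", "ann", "bob"),
                      title  = c("t1", "t2", "t3"))

# Inner join: only authors that appear in both data frames
inner <- merge(authors, books, by.x = "name", by.y = "author")

# Keep every row of authors, filling missing titles with NA
left <- merge(authors, books, by.x = "name", by.y = "author", all.x = TRUE)

nrow(inner)  # 3
nrow(left)   # 4: "cy" has no book, so title is NA
```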

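The following table summarizes some of the basic rules for combining objects into data frames.  Note that in R itself, logical vectors are kept as logical, and character vectors become factors only when stringsAsFactors = TRUE.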

| Data Type | Sub Type | Combination Rule(s) |
|---|---|---|
| vector | numeric, complex, factor, ordered, rte, its, cts | 1. Combine a single variable as is |
| character | character, logical, category | 1. Convert to a factor data type; 2. Contribute a single variable |
| array | matrix, array | 1. Each column creates a separate variable; 2. Column names used for variable names |
| list | list | 1. Each component creates one or more unique variables; 2. Variable names assigned as usual for each component |
| model.matrix | model.matrix | 1. Object becomes a single variable in result |
| data.frame | data.frame | 1. Each variable becomes a variable in result design; 2. Variable names used for variable names |

Splitting and Analyzing Data Frames in R

Splitting data frames is a common manipulation.  The split() function takes the data to be divided (a vector or a data frame) and a grouping factor, or a list of factors, and returns a list with one component per group.

A common use for split() is to create a list that can be passed directly to boxplot().
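A small sketch of both uses, with an invented data frame df. boxplot() is called with plot = FALSE here so that only the summary statistics are computed and no graphics device is needed; drop that argument to draw the plot.

```r
df <- data.frame(value = c(1, 2, 3, 4, 5),
                 group = c("a", "b", "a", "b", "a"))

# Split one column by group: a list of vectors
parts <- split(df$value, df$group)
parts$a  # 1 3 5

# Split the whole data frame: a list of data frames
by_group <- split(df, df$group)
nrow(by_group$b)  # 2

# The list of vectors is a valid input for boxplot()
b <- boxplot(parts, plot = FALSE)
b$names  # "a" "b"
```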

Analyzing R Data Frames with by()

It is often more convenient to split a data frame using the by() function.  by() takes a data frame and splits it by rows into subsets defined by the values of one or more factors, supplied through the INDICES argument as a single factor or a list of factors.  The function FUN is then applied to each subset in turn.  The result has class “by”, which prints one labelled block per group and can be manipulated further if a different layout is needed.
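A minimal sketch of by() on an invented data frame df; the function passed as FUN simply averages one column of each subset.

```r
df <- data.frame(value = c(1, 2, 3, 4),
                 group = c("a", "a", "b", "b"))

# INDICES is given as a named list of grouping vectors
res <- by(df, list(group = df$group), function(d) mean(d$value))

class(res)       # "by"
as.numeric(res)  # 1.5 3.5: group means for "a" and "b"
```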

Analyzing Data Frames in R with aggregate()

The aggregate() function also allows you to partition a data frame or a matrix by one or more grouping vectors, and then apply a function to the resulting columns that returns a single value (e.g. sum() or mean()).

aggregate() returns a data frame with one column for each grouping variable, holding the group labels, followed by one column for each summarized variable, holding the values obtained by applying the specified function to each subgroup.
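A short sketch of aggregate() on an invented data frame df, computing one mean per group:

```r
df <- data.frame(value = c(1, 2, 3, 4),
                 group = c("a", "a", "b", "b"))

# by must be a list; naming its element names the grouping column
agg <- aggregate(df["value"], by = list(group = df$group), FUN = mean)

names(agg)  # "group" "value"
agg$value   # 1.5 3.5
```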

The following list of functions can be used to assess a data frame.

| Function | Arguments | Input Object | Output Object | Comment |
|---|---|---|---|---|
| aggregate() | (x, by=, FUN=, ...) | data.frame | data.frame | FUN should return a scalar |
| apply() | (x, MARGIN=, FUN=, ...) | data.frame, matrix, array | vector, array | n/a |
| by() | (x, INDICES=, FUN=, ...) | data.frame | by | Indices should be entered as a list |
| lapply() | (x, FUN=, ...) | any object | list | n/a |
| sapply() | (x, FUN=, ..., simplify = TRUE) | any object | vector, matrix, list | n/a |
| sweep() | (x, MARGIN=, STATS=, FUN=, ...) | data.frame, matrix, array | matrix, array | n/a |
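As a rough illustration of the functions in the table that have not been shown yet, the sketch below applies apply(), lapply(), sapply(), and sweep() to a small invented data frame df.

```r
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

row_sums  <- apply(df, 1, sum)   # MARGIN = 1 works over rows: 5 7 9
col_means <- sapply(df, mean)    # one value per column: a = 2, b = 5
ranges    <- lapply(df, range)   # a list with one range per column

# Subtract each column mean from its column (MARGIN = 2 works over columns)
centered <- sweep(df, 2, colMeans(df), FUN = "-")
```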

These functions ship with base R.  Additional functions from open-source packages will be introduced in the chapter on large data objects.
