R Data Import

Introduction

Quantitative analysis depends on the ability to load and manage many different types of data and file formats.  R offers many data import functions: some ship with base R, while others are found in contributed packages.

Data Available in R

R ships with many data sets in the datasets package, which is included in the base distribution and loaded automatically when R starts.  A list of all data sets in the package is obtained using the following command:
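The original command is not shown; a standard way to list the data sets bundled with the datasets package is:

```r
# List all data sets shipped with the base "datasets" package
data(package = "datasets")
```

The command `library(help = "datasets")` prints similar package-level information.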

Other packages also ship with data sets.  Package data is loaded into the active workspace with the data() function, which, called without arguments, lists all data sets currently available from installed packages:
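The stripped command is presumably the no-argument form:

```r
# Called without arguments, data() lists the data sets available
# from every installed package that provides them
data()
```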

One data set is sunspot.month. The following lines will load, display and graph the data:
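The original lines are not shown; a minimal version of the load/display/graph sequence is:

```r
data(sunspot.month)      # load the data set into the workspace
sunspot.month            # print the monthly time series
plot(sunspot.month,      # graph the series
     main = "Monthly Sunspot Numbers",
     ylab = "Mean sunspot number")
```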

Help documentation confirms that the data set records the number of sunspots observed each month from 1749 to 1997 by the Royal Observatory of Belgium.  The data() function can also load multiple data sets simultaneously, as follows:
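A sketch of the multiple-load call, using related sunspot series from the same package:

```r
# Load several data sets with a single call
data(sunspot.month, sunspot.year, sunspots)
```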

Loading ASCII Flat Files into R

Flat files are delimited text files and are commonly used for small- and medium-scale projects.  Flat files are the easiest file format to import into R.  The function read.table() is the principal means of reading rectangular flat-file data.  Its primary arguments are the file name and header:
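The original call is not shown; a minimal sketch follows.  The file name and its contents are hypothetical stand-ins for a real flat file on disk:

```r
# Create a small sample file (stands in for a real flat file on disk)
writeLines(c("name height weight",
             "Al   180    75",
             "Bo   165    62"), "mydata.txt")

# header = TRUE treats the first line as column names
mydata <- read.table("mydata.txt", header = TRUE)
mydata
```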

The function will quickly determine the number of columns and the data object class of each variable.  The number of data columns is determined by looking at the first five lines of input, or from the length of the optionally specified col.names argument.  Row names are assumed to be in the first column if the number of columns exceeds the number of column names.  Alternatively, you can specify row names with the optional row.names argument; if row.names is missing and no row-name column is detected, rows are simply numbered.

You need to give the full path of the target file if it is not in your working directory.  Use forward slashes (Unix/Linux and OS X) or double backslashes (Windows) between directories.

The default data delimiter is whitespace (e.g. a space, tab, or newline entry). The optional function argument sep is used for comma delimited files as follows:
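The stripped example presumably resembled the following; the file name and contents are hypothetical:

```r
# Comma-delimited version of a small sample file
writeLines(c("name,height,weight",
             "Al,180,75",
             "Bo,165,62"), "mydata.csv")

# sep = "," tells read.table() the fields are comma-separated
mydata <- read.table("mydata.csv", header = TRUE, sep = ",")
mydata
```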

Comma-delimited files can also be loaded with read.csv(); its counterpart read.csv2() handles semicolon-delimited files that use a comma as the decimal separator.  Similarly, the functions read.delim() and read.delim2() are used for tab-delimited files.

Finally, we can download and open flat files from any internet URL.  The example below downloads a compressed file from the data archive of the  Journal of Econometrics, and uses the na.strings argument to pre-identify missing values:
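The original script is not shown, and the journal's actual file address is not preserved here; the sketch below uses a placeholder URL and file name to illustrate the same steps:

```r
# Hypothetical URL, split into parts for readability; the real archive
# address from the journal site would be substituted here
part1 <- "http://example.com/"
part2 <- "journal-archive/"
part3 <- "dataset.zip"
url   <- paste0(part1, part2, part3)

tmp <- tempfile(fileext = ".zip")
download.file(url, tmp)                    # fetch the compressed file

# unz() opens a single named file inside a zip archive
con <- unz(tmp, "dataset.txt")             # internal file name assumed
dat <- read.table(con, header = TRUE,
                  na.strings = ".")        # "." marks missing values
head(dat)                                  # first rows of imported data
```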

The file address is long, so it is first broken into parts and pasted together.  The file is then downloaded using the function download.file().  Since the file is compressed, the unz() function is used to open a single file inside the zip archive (see gzfile() for gzip-compressed files).  The data table is then read into memory, and the function head() prints the first rows of the imported data (six by default).

Importing MS-Excel Files into R

The transition to R is greatly facilitated by overcoming the question “How do I import and export data between Excel spreadsheets and R?” The R Core team has a simple answer: Avoid doing so if possible!  The logic here is simple: If Excel data is saved in tab-delimited or comma-separated form, then there is no need to interface with Excel.  The functions read.delim() and read.csv() can be used to import the data into R.

This solution is simple, but it is plagued with practical problems.  First, it is a multi-step process.  Second, it involves leaving R altogether!  Third, the process quickly becomes overwhelming for large Excel workbooks with many tabs, since each tab must be saved as a separate flat file.

In practice, it is best to open a target Excel file without leaving R, navigate to a specific worksheet, and import only the portion of the data needed, using standard row and column references.  For this purpose, the R package XLConnect is well suited.

Assuming an active internet connection, the following code will download the XLConnect package, load the package into R’s active library, download an Excel workbook from the internet, read data from a pre-defined worksheet and cell region, and display the results.
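The original code is not shown.  The sketch below follows the steps described; the workbook URL, sheet name, and cell region are assumptions, not the author's actual example:

```r
# Sketch only: the URL, sheet name, and cell region are hypothetical
install.packages("XLConnect")    # download the package from CRAN
library(XLConnect)               # load it into the active library

url  <- "http://example.com/demo-workbook.xlsx"
file <- tempfile(fileext = ".xlsx")
download.file(url, file, mode = "wb")     # binary mode matters on Windows

wb  <- loadWorkbook(file)                 # open the workbook
dat <- readWorksheet(wb, sheet = "Sheet1",       # worksheet tab
                     startRow = 2, endRow = 20,  # cell region
                     startCol = 1, endCol = 5)
dat                                       # display the results
```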

A detailed user manual for the XLConnect package can be found in its CRAN documentation.

Text Connections in R

It is not uncommon to find valuable data sets embedded in text documents.  Text connections provide a means to import data sets from text sources and create standard R objects.

In the following example, email text is copied into a new script file and all text – except the data table – is quickly commented out (#).  The textConnection() function is then wrapped around the email.  Next, the read.table() function is used to import the data into R.  The text connection is subsequently closed and the resulting data is printed to confirm a successful import.
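The original email and script are not shown; the sketch below reproduces the described workflow with invented email text:

```r
# Hypothetical email text pasted into the script; everything except the
# data table is commented out with '#'
email <- "
# Hi -- here are the figures you asked for:
quarter sales
Q1      210
Q2      340
Q3      295
# Let me know if you need anything else.
"

con <- textConnection(email)            # treat the string as a file
dat <- read.table(con, header = TRUE,
                  comment.char = "#")   # skip the commented lines
close(con)                              # close the text connection
dat                                     # confirm a successful import
```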

While the example table is small, the simple process is still faster than manual data entry and can be applied to much larger data objects, where efficiency gains can be significant.

Reading Data from PDF Files into R

The ability to import tables and data sets from PDF files opens access to unlimited data downloads in standard business and academic articles while avoiding manual data entry completely.  A large data table in a PDF document is located here.  The example below downloads the internet file and uses "text mining" tools available in R. The simple script confirms that PDF data imports are within easy reach:
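The original script is not shown.  One route uses the pdftools package (an assumption, not necessarily the tool the author used); the URL and the line span of the table are placeholders:

```r
# Assumption: pdftools stands in for the unspecified "text mining"
# tools; the URL and table location are hypothetical
library(pdftools)

url <- "http://example.com/report-with-table.pdf"
tmp <- tempfile(fileext = ".pdf")
download.file(url, tmp, mode = "wb")

pages <- pdf_text(tmp)                  # one character string per page
lines <- strsplit(pages[1], "\n")[[1]]  # split the first page into lines

# Locate the table rows by eye, then parse them (span assumed)
tab <- read.table(text = lines[10:40], header = FALSE)
head(tab)
```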

Importing HTML Internet Tables into R

The internet offers many tables for data analysis, where basic "web scraping" can be used to import data sets into R.  The following example is straightforward and easily accommodates a table where the data class or type varies by column:
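The original example is not shown.  A common approach of that era uses readHTMLTable() from the XML package; the URL and column handling below are assumptions:

```r
# Sketch using XML::readHTMLTable(); the URL is a hypothetical placeholder
library(XML)

url    <- "http://example.com/stats-table.html"
tables <- readHTMLTable(url, stringsAsFactors = FALSE)
dat    <- tables[[1]]            # first table on the page

# Coerce columns whose type varies: keep the first column as text,
# convert the rest to numeric (layout assumed)
dat[-1] <- lapply(dat[-1], as.numeric)
str(dat)
```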

This is a very simple example of "web scraping."  An entire chapter is dedicated to internet interfaces and data downloads, given their ability to positively impact business analytics.

R Database Access

Relational database management systems (RDBMS) offer several advantages over flat file storage hierarchies:

  • Provide centralized server storage to a wide range of clients;
  • Store data in more organized ways than simple rectangular grid models;
  • Ensure concurrent data access while enforcing security constraints as needed;
  • Provide fast access to selected parts of a large data set;
  • Summarize and cross-tabulate the data.

There are several large and expensive RDBMSs for commercial use, including Informix, Oracle, Sybase, IBM's DB2, and Microsoft SQL Server on Windows.  There are also several very capable RDBMSs for academic and small-system use, such as MySQL, PostgreSQL, Microsoft Access, and others.

The RODBC package on CRAN provides an interface to database sources supporting ODBC.  ODBC ("open database connectivity") is supported by almost every database system.  Hence, it is possible to use one package and only a few lines of code in a single R script to access data across multiple platforms.

RODBC runs on Unix/Linux, Windows, and OS X. On Windows, ODBC support is part of the OS.  On Unix/Linux, it is necessary to download and install an ODBC driver manager such as unixODBC or iODBC (the latter is pre-installed on OS X), plus a driver for your target database system.

The primary functions in the RODBC package are listed below:

Function                                                    Description
odbcConnect(dsn, uid = "", pwd = "")                        Open a connection to an ODBC database
sqlTables(channel)                                          List all tables in an ODBC database
sqlFetch(channel, sqtable)                                  Read a table from an ODBC database into an R data.frame
sqlQuery(channel, query)                                    Submit a query to an ODBC database and return the results
sqlSave(channel, mydf, tablename = sqtable, append = FALSE) Write or update (append = TRUE) an R data.frame to a table in an ODBC database
sqlDrop(channel, sqtable)                                   Remove a table from the ODBC database
close(channel)                                              Close the connection to an ODBC database

Here is an example:
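The original example is not shown.  The sketch below exercises the functions from the table above against a hypothetical data source; the DSN, credentials, and table name are all placeholders:

```r
# Hypothetical DSN ("mydsn"), credentials, and table name
library(RODBC)

channel <- odbcConnect("mydsn", uid = "analyst", pwd = "secret")

sqlSave(channel, USArrests, rownames = "State")  # push a sample table
sqlTables(channel)                               # list available tables
arrests <- sqlFetch(channel, "USArrests")        # whole table -> data.frame
hot <- sqlQuery(channel,                         # run an SQL query
                "SELECT * FROM USArrests WHERE Murder > 10")
sqlDrop(channel, "USArrests")                    # remove the table
close(channel)                                   # close the connection
```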

Additional package functions exist to give a finer level of data control.  More details on the RODBC package can be found in its CRAN documentation.

R Data Import and Binary netCDF Files

Large scientific data sets (including vendor-specific SCADA data) present three problems: large size, complex structure, and proprietary binary formats.

  • Large size means that efficient data compression is important for file transfer and storage. The good news is that R is capable of handling compression and supporting very large data objects;
  • Complex structure implies a lot of "fiddly" data work will precede analysis simply to extract the relevant sample data. The str() function is invaluable for interrogating large data objects and metadata definitions, and for finding the layers where your sample data is hiding;
  • Proprietary or binary data formats are probably the most difficult problem to overcome.  Typically, you need a vendor-supplied "can opener" just to access the data.  Fortunately, R is used by many researchers and is extensible, so tools are available for reading and writing many data formats, including vendor formats.  The common solution is the same regardless of format: identify the appropriate package, load the library, point the tool to the data file, and create an R data object.

netCDF has become one of the most common binary data formats for earth and atmospheric science applications.  NOAA/NCAR, NASA, and many European satellite data sources with spatial, temporal, and attribute dimensions provide data in netCDF format.  Even vendor sensor instruments that produce large data cubes now use this common format.  The result: importing complex binary data into R can be as easy as importing standard CSV files.

In order to access netCDF files, it is necessary to first download and install the netCDF framework and drivers from Unidata.  The download section presents source-code choices in different languages (Java, C++, Fortran, etc.); the underlying software is the same regardless.

Extracting data from netCDF files requires several simple steps: loading the ncdf library, loading the netCDF file, examining the structure of the ncdf data, and extracting values to variables. The following example is from the NCAR guide and uses their example.nc file.
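The original code is not shown.  A sketch of the first two steps follows, using the ncdf package named in the text (since archived on CRAN); the variable name "temperature" is an assumption about example.nc:

```r
# Sketch only: "temperature" is an assumed variable name in example.nc,
# and the (now archived) ncdf package is the one named in the text
library(ncdf)

ex   <- open.ncdf("example.nc")          # load the netCDF file
temp <- get.var.ncdf(ex, "temperature")  # extract values to a variable
```

On current R installations, the ncdf4 package provides the equivalents nc_open() and ncvar_get().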

The print() method for an ncdf object is class specific and shows only the header information. When the author of the ncdf package defined the ncdf class, they defined a print.ncdf method to do something sensible: print the header information instead of the entire data hypercube. Similarly, they defined a summary.ncdf method to provide complementary information about the variables in the ncdf object:
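A sketch of those two calls, assuming an ncdf object (called ex here, a name chosen for illustration) has been opened as described above:

```r
print(ex)     # class-specific method: header information only
summary(ex)   # complementary information about the variables
```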

The last set of commands interrogates the example data with the str() function and basic graphs:
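A sketch of that interrogation, again using illustrative object names (ex for the opened file, temp for an extracted variable):

```r
str(ex)                 # interrogate the full object structure
str(temp)               # and the extracted variable
plot(temp, type = "l")  # quick graphical check of the series
```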


