Category Archives: Data Science

Plotting Forecast Data Objects Using ggplot

Robert Hyndman is the author of the forecast package in R. I’ve been using the package for long-term time series forecasts. The package comes with built-in methods for plotting forecast data objects, which I’ve wanted to customize for improved clarity and presentation. The following article achieves that goal and shares two scripts for plotting forecast data objects using ggplot.

Posted in Data Science, ggplot2, Modeling, R Programming

From Least Squares to k-Nearest Neighbor (kNN)

The linear model is one of the most widely used and most important tools in data science. Another basic tool is the k-nearest neighbor method (kNN). Both can be used for classification as well as prediction. Machines use feature classes to recognize faces within a crowd, to “read” road signs by distinguishing one letter from another, and to set voter registration districts by separating population groups. This article applies and compares both classification methods.
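As a rough illustration of the comparison, the sketch below classifies simulated two-class data both ways. Everything in it is an assumption made for illustration and is not taken from the article: the simulated data, the 0.5 threshold on the least squares fit, and the choice of k = 15. It relies on `knn()` from the class package, which is normally installed with R.

```r
library(class)  # provides knn(); a recommended package bundled with R

set.seed(1)
n <- 100  # observations per class (illustrative choice)

# Two simulated classes: class 0 centered at (0, 0), class 1 at (2, 2)
x <- rbind(matrix(rnorm(2 * n, mean = 0), ncol = 2),
           matrix(rnorm(2 * n, mean = 2), ncol = 2))
train <- data.frame(x1 = x[, 1], x2 = x[, 2], y = rep(c(0, 1), each = n))

# Least squares: regress the 0/1 label on the features,
# then assign class 1 wherever the fitted value exceeds 0.5
fit <- lm(y ~ x1 + x2, data = train)
ls_class <- as.integer(fitted(fit) > 0.5)

# kNN: each point takes the majority class of its 15 nearest training points
features <- train[, c("x1", "x2")]
knn_class <- knn(features, features, cl = factor(train$y), k = 15)

# Training error rates for both classifiers
ls_error <- mean(ls_class != train$y)
knn_error <- mean(as.character(knn_class) != as.character(train$y))
c(least_squares = ls_error, knn = knn_error)
```

Both error rates here are measured on the training data, so they flatter kNN in particular; a fair comparison would use held-out test data.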

Posted in Data Science, Modeling, R Programming, Website

R Functions for Best Subset Regression

Best subset regression is a technique for model building and variable selection. The method looks at all combinations of independent predictor variables for use in a multiple regression model. Model developers and analysts often struggle with variable selection, especially when the number of predictors is high. Ideally, each set of predictors is run and the best set is selected using a criterion for model performance. The following article provides custom functions for best subset selection that are fast and easy to use.
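To make the idea concrete, here is a minimal base R sketch of an exhaustive search. It is not the article's custom functions: the helper name `best_subset`, the use of BIC as the selection criterion, and the `mtcars` example are all assumptions chosen for illustration.

```r
# Hypothetical helper: fit every non-empty subset of `predictors`
# and rank the resulting models by BIC (lower is better).
best_subset <- function(data, response, predictors) {
  results <- list()
  for (k in seq_along(predictors)) {
    # All combinations of k predictors
    combos <- combn(predictors, k, simplify = FALSE)
    for (vars in combos) {
      fit <- lm(reformulate(vars, response = response), data = data)
      results[[length(results) + 1]] <- data.frame(
        predictors = paste(vars, collapse = " + "),
        size = k,
        bic = BIC(fit),
        stringsAsFactors = FALSE
      )
    }
  }
  out <- do.call(rbind, results)
  out[order(out$bic), ]
}

# Example on the built-in mtcars data: 4 candidates give 2^4 - 1 = 15 models
ranked <- best_subset(mtcars, "mpg", c("wt", "hp", "disp", "qsec"))
head(ranked, 3)
```

The number of models doubles with each added predictor, so a brute-force loop like this is only practical for a modest number of candidates.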

Posted in Data Science, Faster R, Modeling

Popularity of R Programming Language

The popularity of R is rapidly increasing, and it is well on its way to being a top 10 programming language. The TIOBE index is a widely cited indicator of programming language popularity. It confirms that a subset of languages (those for computational statistics and data analysis) is gaining increased attention. The clear winner of the pack is the open source programming language R.

Posted in Data Science, R Programming

Binary Data In R

There are many reasons to work with binary data in R.  Solar resource data, solar PV performance data, and real-time grid monitoring data are typically stored and transmitted in binary data formats.  

In practice, accessing binary data in R is not possible without a vendor- or format-specific “can opener” and a properly configured scientific programming environment. As a result, many business applications bypass binary data altogether or rely instead on secondary sources and summary statistics, with no ability to validate data integrity and accuracy.
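When the byte layout of a file is known, the “can opener” can be as small as base R's `readBin()`. The sketch below writes a tiny file and reads it back. The layout (a 4-byte little-endian count followed by 8-byte doubles) and the values are made up for illustration and do not correspond to any real vendor format.

```r
# Hypothetical layout: one 4-byte integer count, then that many 8-byte doubles
path <- tempfile(fileext = ".bin")
values <- c(812.5, 640.25, 0, 455.75)  # made-up measurements

# Write the file so the example is self-contained
con <- file(path, "wb")
writeBin(length(values), con, size = 4L, endian = "little")
writeBin(values, con, size = 8L, endian = "little")
close(con)

# Read it back: the header says how many doubles follow
con <- file(path, "rb")
n <- readBin(con, what = "integer", n = 1L, size = 4L, endian = "little")
restored <- readBin(con, what = "double", n = n, size = 8L, endian = "little")
close(con)

restored
```

Real formats usually add headers, mixed field types, and compression, which is where format-specific libraries come in.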

Posted in Data, Data Science, GDAL, R Data Import

Correlation Plots in R

The standard function for correlation plots in R is pairs(), which generates a matrix of scatter plots based on all pairwise combinations of variables in a data object. The figure (plot13) shows the standard graph after a little color enhancement.

The code behind this plot is simple:
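The post's own code is not part of this excerpt. A minimal sketch of a comparable `pairs()` call, assuming the built-in `iris` data and an arbitrary color per species as stand-ins for the original data and styling, might look like this:

```r
# Stand-in data: the built-in iris measurements (not the post's data set)
data(iris)

# One illustrative color per species, expanded to one color per row
species_cols <- c("steelblue", "darkorange", "forestgreen")[as.integer(iris$Species)]

# Scatter plot matrix of all pairwise combinations of the four numeric columns
pairs(iris[, 1:4],
      col = species_cols,
      pch = 19,
      main = "Pairwise scatter plots of iris measurements")
```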

Posted in Data Science, ggplot2, R Graphics, R Programming