This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, December 15, 2012

Data Science, Data Analysis, R and Python

The October 2012 issue of Harvard Business Review prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles:

  1. “Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61 – 68;
  2. “Data Scientist: The Sexiest Job of the 21st Century,” by Thomas H. Davenport and D.J. Patil, pages 70 – 76;
  3. “Making Advanced Analytics Work For You,” by Dominic Barton and David Court, pages 79 – 83.

All three provide food for thought; this post presents a brief summary of some of those thoughts.

One point made in the first article is that the “size” of a dataset – i.e., what constitutes “Big Data” – can be measured in at least three very different ways: volume, velocity, and variety.  Each of these aspects complicates the task of characterizing Big Data, but in a different way:

·        For very large data volumes, one fundamental issue is the incomprehensibility of the raw data itself.  Even if you could display a data table with several million, billion, or trillion rows and hundreds or thousands of columns, making any sense of this display would be a hopeless task. 
·        For high velocity datasets – e.g., real-time, Internet-based data sources – the data volume is determined by the observation time: at a fixed rate, the longer you observe, the more you collect.  If you are attempting to generate a real-time characterization that keeps up with this input data rate, you face a fundamental trade-off between exploiting richer datasets acquired over longer observation periods, and the longer computation times required to process those datasets, making you less likely to keep up with the input data rate. 
·        For high-variety datasets, a key challenge lies in finding useful ways to combine very different data sources into something amenable to a common analysis (e.g., combining images, text, and numerical data into a single joint analysis framework).
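
The velocity trade-off can be made concrete with a small Python sketch. This is only an illustration under simple assumptions: the `RollingMean` class and the example stream are hypothetical, and a windowed mean stands in for whatever real-time characterization is actually needed. Bounding the window keeps the cost of each update constant, at the price of ignoring older observations.

```python
from collections import deque


class RollingMean:
    """Running mean over the most recent `window` observations."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self.values = deque()
        self.total = 0.0

    def update(self, x):
        """Add one observation and return the current windowed mean."""
        self.values.append(x)
        self.total += x
        if len(self.values) > self.window:
            # Drop the oldest observation so the cost per update stays constant.
            self.total -= self.values.popleft()
        return self.total / len(self.values)


# Hypothetical input stream containing one unusual value (50.0).
stream = [10.0, 12.0, 11.0, 50.0, 13.0, 12.0]

narrow = RollingMean(window=2)  # reacts quickly, but uses little history
wide = RollingMean(window=5)    # uses more history, but forgets slowly

short_means = [narrow.update(x) for x in stream]
long_means = [wide.update(x) for x in stream]
```

In this example, the two-observation window has recovered from the unusual value by the end of the stream, while the five-observation window still reflects it: a longer observation period gives a richer summary, but one that responds more slowly to changes in the input.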

One practical corollary to these observations is the need for a computer-based data reduction process or “data funnel” that matches the volume, velocity, and/or variety of the original data sources with the ultimate needs of the organization.  In large organizations, this data funnel generally involves a mix of different technologies and people.  While it is not a complete characterization, some of these differences are evident from the primary software platforms used in the different stages of this data funnel: languages like HTML for dealing with web-based data sources; typically, some variant of SQL for dealing with large databases; a package like R for complex quantitative analysis; and often something like Microsoft Word, Excel, or PowerPoint for delivering the final results.  In addition, to help coordinate some of these tasks, there are likely to be scripts, written either for an operating system shell, as in UNIX, or in a platform-independent scripting language like Perl or Python.

An important point omitted from all three articles is that there are at least two distinct application areas for Big Data:

  1. The class of “production applications,” which were discussed in these articles and illustrated with examples like the unnamed U.S. airline described by McAfee and Brynjolfsson that adopted a vendor-supplied procedure to obtain better estimates of flight arrival times, improving its ability to schedule ground crews and saving several million dollars per year at each airport.  Similarly, the article by Barton and Court described a shipping company (again, unnamed) that used real-time weather forecast data and shipping port status data, developing an automated system to improve the on-time performance of its fleet.  Examples like these describe automated systems put in place to continuously exploit a large but fixed data source.
  2. The exploitation of Big Data for “one-off” analyses: a question is posed, and the data science team scrambles to find an answer.  This use is not represented by any of the examples described in these articles.  In fact, this second type of application overlaps substantially with the development process required to create a production application, although the end results are very different.  In particular, the end result of a one-off analysis is a single set of results, ultimately summarized to address the question originally posed.  In contrast, a production application requires continuing support and often has to meet challenging interface requirements between the IT systems that collect and preprocess the Big Data sources and those that are already in use by the end-users of the tool (e.g., a Hadoop cluster running in a UNIX environment versus periodic reports generated either automatically or on demand from a Microsoft Access database of summary information).

A key point of Davenport and Patil’s article is that data science involves more than just the analysis of data: it is also necessary to identify data sources, acquire what is needed from them, re-structure the results into a form amenable to analysis, clean them up, and in the end, present the analytical results in a usable form.  In fact, the subtitle of their article is “Meet the people who can coax treasure out of messy, unstructured data,” and this statement forms the core of the article’s working definition for the term “data scientist.” (The authors indicate that the term was coined in 2008 by D.J. Patil, who holds a position with that title at Greylock Partners.)  Two particularly interesting tidbits from this article were the authors’ suggestion that a good place to find data scientists is at R User Groups, and their description of R as “an open-source statistical tool favored by data scientists.”

Davenport and Patil emphasize the difference between structured and unstructured data, a distinction that is especially relevant to the R community because most of R’s procedures are designed to work with the structured data types discussed in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine: continuous, integer, nominal, ordinal, and binary.  More specifically, note that these variable types can all be included in dataframes, the data object type that is best supported by R’s vast and expanding collection of add-on packages.  Certainly, there is some support for other data types, and the level of this support is growing – the tm package and a variety of other related packages support the analysis of text data, the twitteR package provides support for analyzing Twitter tweets, and the scrapeR package supports web scraping – but the acquisition and reformatting of unstructured data sources is not R’s primary strength.  Yet it is a key component of data science, as Davenport and Patil emphasize:

“A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed.  A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data – and also not at actually analyzing the data.”


To better understand the distinction between the quantitative analyst and the data scientist implied by this quote, consider mathematician George Polya’s book, How To Solve It.  Originally published in 1945 and most recently re-issued in 2009, 24 years after the author’s death, this book is a very useful guide to solving math problems.  Polya’s basic approach consists of these four steps:

  1. Understand the problem;
  2. Formulate a plan for solving the problem;
  3. Carry out this plan;
  4. Check the results.

It is important to note what is not included in the scope of Polya’s four steps: Step 1 assumes a problem has been stated precisely, and Step 4 assumes the final result is well-defined, verifiable, and requires no further explanation.  While quantitative analysis problems are generally neither as precisely formulated as Polya’s method assumes, nor as clear in their ultimate objective, the class of “quantitative analyst” problems that Davenport and Patil assume in the previous quote corresponds very roughly to problems of this type.  They begin with something like an R dataframe and a reasonably clear idea of what analytical results are desired; they end by summarizing the problem and presenting the results.  In contrast, the class of “data scientist” problems implied in Davenport and Patil’s quote comprises an expanded set of steps:

  1. Formulate the analytical problem: decide what kinds of questions could and should be asked in a way that is likely to yield useful, quantitative answers;
  2. Identify and evaluate potential data sources: what is available in-house, from the Internet, from vendors?  How complete are these data sources?  What would it cost to use them?  Are there significant constraints on how they can be used?  Are some of these data sources strongly incompatible?  If so, does it make sense to try to merge them approximately, or is it more reasonable to omit some of them?
  3. Acquire the data and transform it into a form that is useful for analysis; note that for sufficiently large data collections, part of this data will almost certainly be stored in some form of relational database, probably administered by others, and extracting what is needed for analysis will likely involve writing SQL queries against this database;
  4. Once the relevant collection of data has been acquired and prepared, examine the results carefully to make sure they meet analytical expectations: do the formats look right?  Are the ranges consistent with expectations?  Do the relationships seen between key variables seem to make sense?
  5. Do the analysis: by lumping all of the steps of data analysis into this simple statement, I am not attempting to minimize the effort involved, but rather emphasizing the other aspects of the Big Data analysis problem;
  6. After the analysis is complete, develop a concise summary of the results that clearly and succinctly states the motivating problem, highlights what has been assumed, what has been neglected and why, and gives the simplest useful summary of the data analysis results.  (Note that this will often involve several different summaries, with different levels of detail and/or emphases, intended for different audiences.)
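
Steps 3 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, not a recipe: the `flights` table, its columns, the example records, and the plausible range for delays are all hypothetical, and an in-memory SQLite database stands in for a relational database administered by others.

```python
import sqlite3

# Hypothetical table and records, created here only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (flight_id TEXT, delay_min REAL, airport TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [
        ("A1", 12.0, "BOS"),
        ("A2", -3.0, "BOS"),
        ("A3", None, "JFK"),    # missing value
        ("A4", 9999.0, "JFK"),  # implausibly large delay
    ],
)

# Step 3: extract only the columns needed for the analysis.
rows = conn.execute(
    "SELECT flight_id, delay_min, airport FROM flights ORDER BY flight_id"
).fetchall()
conn.close()


def check_rows(rows, low=-60.0, high=1440.0):
    """Step 4: return (flight_id, problem) pairs for records failing basic checks."""
    problems = []
    for flight_id, delay, airport in rows:
        if delay is None:
            problems.append((flight_id, "missing delay"))
        elif not (low <= delay <= high):
            problems.append((flight_id, "delay out of range"))
        # Format check: assume airports are recorded as three-letter codes.
        if not (isinstance(airport, str) and len(airport) == 3):
            problems.append((flight_id, "unexpected airport code"))
    return problems


problems = check_rows(rows)
```

Checks like these do not replace a careful graphical look at the data, but they catch missing values and out-of-range entries before any analysis begins.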

Here, Steps 1 and 6 necessarily involve close interaction with the end users of the data analysis results, and they lie mostly outside the domain of R.  (Conversely, knowing what is available in R can be extremely useful in formulating analytical problems that are reasonable to solve, and the graphical procedures available in R can be extremely useful in putting together meaningful summaries of the results.)  The primary domain of R is Step 5: given a dataframe containing what are believed to be the relevant variables, we generate, validate, and refine the analytical results that will form the basis for the summary in Step 6.  Part of Step 4 also lies clearly within the domain of R: examining the data once it has been acquired to make sure it meets expectations.  In particular, once we have a dataset or a collection of datasets that can be converted easily into one or more R dataframes (e.g., csv files or possibly relational databases), a preliminary look at the data is greatly facilitated by the vast array of R procedures available for graphical characterizations (e.g., nonparametric density estimates, quantile-quantile plots, boxplots and variants like beanplots or bagplots, and much more); for constructing simple descriptive statistics (e.g., means, medians, and quantiles for numerical variables, tabulations of level counts for categorical variables, etc.); and for preliminary multivariate characterizations (e.g., scatter plots, classical and robust covariance ellipses, classical and robust principal component plots, etc.).   

The rest of this post discusses those parts of Steps 2, 3, and 4 above that fall outside the domain of R.  First, however, I have two observations.  My first observation is that because R is evolving fairly rapidly, some tasks which are “outside the domain of R” today may very well move “inside the domain of R” in the near future.  The packages twitteR and scrapeR, mentioned earlier, are cases in point, as are the continued improvements in packages that simplify the use of R with databases.  My second observation is that the mere fact that something is possible within a particular software environment doesn’t make it a good idea.  A number of years ago, I attended a student talk given at an industry/university consortium.  The speaker set up and solved a simple linear program (i.e., he implemented the simplex algorithm to solve a simple linear optimization problem with linear constraints) using an industrial programmable controller.  At the time, programming those controllers was done via relay ladder logic, a diagrammatic approach used by electricians to configure complicated electrical wiring systems.  I left the talk impressed by the student’s skill, creativity and persistence, but I felt his efforts were extremely misguided.

Although it does not address every aspect of the “extra-R” components of Steps 2, 3, and 4 defined above – indeed, some of these aspects are so application-specific that no single book possibly could – Paul Murrell’s book Introduction to Data Technologies provides an excellent introduction to many of them.  (This book is also available as a free PDF file under a Creative Commons license.)  A point made in the book’s preface mirrors one in Davenport and Patil’s article:

“Data sets never pop into existence in a fully mature and reliable state; they must be cleaned and massaged into an appropriate form.  Just getting the data ready for analysis often represents a significant component of a research project.”

Since Murrell is the developer of R’s grid graphics system that I have discussed in previous posts, it is no surprise that his book has an R-centric data analysis focus, but the book’s main emphasis is on the tasks of getting data from the outside world – specifically, from the Internet – into a dataframe suitable for analysis in R.  Murrell therefore gives detailed treatments of topics like HTML and Cascading Style Sheets (CSS) for working with Internet web pages; XML for storing and sharing data; and relational databases and their associated query language SQL for efficiently organizing data collections with complex structures.  Murrell states in his preface that these are things researchers – the target audience of the book – typically aren’t taught, but pick up in bits and pieces as they go along.  He adds:

“A great deal of information on these topics already exists in books and on the internet; the value of this book is in collecting only the important subset of this information that is necessary to begin applying these technologies within a research setting.”
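
As a rough illustration of the kind of task these technologies support, the following Python sketch flattens a small XML document into CSV text of the sort that is easily read into a dataframe. It is not taken from Murrell’s book: the XML fragment, its element and attribute names, and the helper functions `xml_to_rows` and `rows_to_csv` are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; the element and attribute names are invented.
XML_TEXT = """
<stations>
  <station id="S1"><name>North</name><temp units="C">4.5</temp></station>
  <station id="S2"><name>South</name><temp units="C">11.0</temp></station>
</stations>
"""


def xml_to_rows(xml_text):
    """Turn each <station> element into one flat record."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for station in root.findall("station"):
        rows.append(
            {
                "id": station.get("id"),
                "name": station.findtext("name"),
                "temp_c": float(station.findtext("temp")),
            }
        )
    return rows


def rows_to_csv(rows):
    """Write the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "name", "temp_c"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(XML_TEXT)
csv_text = rows_to_csv(rows)
```

The resulting text has one row per station and one column per variable, which is the rectangular structure that a dataframe expects.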

My one quibble with Murrell’s book is that he gives Python only a passing mention.  While I greatly prefer R to Python for data analysis, I have found Python to be more suitable than R for a variety of extra-analytical tasks, including preliminary explorations of the contents of weakly structured data sources, as well as certain important reformatting and preprocessing tasks.  Like R, Python is an open-source language, freely available for a wide variety of computing environments.  Also like R, Python has numerous add-on packages that support an enormous variety of computational tasks (over 25,000 at this writing).  In my day job in a SAS-centric environment, I commonly face tasks like the following: I need to create several nearly-identical SAS batch jobs, each to read a different SAS dataset that is selected on the basis of information contained in the file name; submit these jobs, each of which creates a CSV file; harvest and merge the resulting CSV files; and run an R batch job to read this combined CSV file and perform computations on its contents.  I can do all of these things with a Python script, which also provides a detailed recipe of what I have done, so when I have to modify the procedure slightly and run it again six months later, I can quickly reconstruct what I did before.  I have found Python to be better suited than R to tasks that involve a combination of automatically generating simple programs in another language, data file management, text processing, simple data manipulation, and batch job scheduling.
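
A simplified sketch of such a script might look like the following. Everything specific here is an assumption made for illustration: the file naming convention, the SAS job template, and the helper functions `write_jobs` and `merge_csvs` are hypothetical, and the job submission step is omitted because it depends entirely on the local environment. In the demonstration at the end, small hand-written CSV files stand in for the output the SAS jobs would produce.

```python
import csv
import re
import tempfile
from pathlib import Path
from string import Template

# Hypothetical SAS job template; the export options shown are an assumption.
JOB_TEMPLATE = Template(
    'libname src "$data_dir";\n'
    "proc export data=src.$dataset\n"
    '    outfile="$out_csv" dbms=csv replace;\n'
    "run;\n"
)

# Assumed naming convention: sales_<region>_<year>.sas7bdat
NAME_PATTERN = re.compile(r"^sales_(?P<region>[a-z]+)_(?P<year>\d{4})\.sas7bdat$")


def write_jobs(data_dir, job_dir, year):
    """Write one SAS job file per dataset whose file name matches `year`."""
    jobs = []
    for path in sorted(Path(data_dir).iterdir()):
        match = NAME_PATTERN.match(path.name)
        if match is None or match.group("year") != year:
            continue
        dataset = path.stem
        job_path = Path(job_dir) / f"{dataset}.sas"
        job_path.write_text(
            JOB_TEMPLATE.substitute(
                data_dir=data_dir,
                dataset=dataset,
                out_csv=Path(job_dir) / f"{dataset}.csv",
            )
        )
        jobs.append(job_path)
    return jobs


def merge_csvs(csv_paths, merged_path):
    """Concatenate CSV files that share a header, adding a `source` column."""
    with open(merged_path, "w", newline="") as out:
        writer = None
        for csv_path in csv_paths:
            with open(csv_path, newline="") as src:
                reader = csv.DictReader(src)
                if writer is None:
                    writer = csv.DictWriter(
                        out,
                        fieldnames=["source"] + reader.fieldnames,
                        lineterminator="\n",
                    )
                    writer.writeheader()
                for row in reader:
                    writer.writerow({"source": Path(csv_path).stem, **row})
    return Path(merged_path)


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    job_dir = Path(tmp) / "jobs"
    data_dir.mkdir()
    job_dir.mkdir()
    for name in [
        "sales_east_2012.sas7bdat",
        "sales_west_2012.sas7bdat",
        "sales_east_2011.sas7bdat",
    ]:
        (data_dir / name).touch()  # empty placeholder files

    jobs = write_jobs(data_dir, job_dir, "2012")
    job_names = [job.name for job in jobs]

    # Stand-in for the CSV files that the submitted SAS jobs would create.
    csv_paths = []
    for job in jobs:
        csv_path = job.with_suffix(".csv")
        csv_path.write_text("units\n1\n2\n")
        csv_paths.append(csv_path)

    merged_text = merge_csvs(csv_paths, job_dir / "merged.csv").read_text()
```

Because the selection rule, the job template, and the merge step all live in one file, the script doubles as a record of exactly what was done, which is the property that makes it easy to rerun the procedure with small changes months later.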

Despite my Python quibble, Murrell’s book represents an excellent first step toward filling the knowledge gap that Davenport and Patil note between quantitative analysts and data scientists; in fact, it is the only book I know addressing this gap.  If you are an R aficionado interested in positioning yourself for “the sexiest job of the 21st century,” Murrell’s book is an excellent place to start.