All three provide food for thought; this post presents a brief summary of some of those thoughts.
- For very large data volumes, one fundamental issue is the incomprehensibility of the raw data itself. Even if you could display a data table with several million, billion, or trillion rows and hundreds or thousands of columns, making any sense of this display would be a hopeless task.
- For high-velocity datasets – e.g., real-time, Internet-based data sources – the data volume is determined by the observation time: at a fixed rate, the longer you observe, the more you collect. If you are attempting to generate a real-time characterization that keeps up with this input data rate, you face a fundamental trade-off: richer datasets acquired over longer observation periods require longer computation times, making you less likely to keep up with the input data rate.
- For high-variety datasets, a key challenge lies in finding useful ways to combine very different data sources into something amenable to a common analysis (e.g., combining images, text, and numerical data into a single joint analysis framework).
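The "keeping up" trade-off for high-velocity data can be made concrete with a toy throughput calculation. This is only a sketch, and the rate and per-record cost below are invented numbers for illustration:

```python
# Minimal sketch of the real-time constraint: with a single worker,
# a streaming characterization keeps up with its input only if the
# per-record processing time is below the inter-arrival time.
# Both numbers below are assumptions chosen for illustration.

arrival_rate = 5000.0        # records per second (assumed)
per_record_seconds = 150e-6  # processing cost per record (assumed)

# Fraction of capacity consumed; above 1.0, the backlog grows without bound.
utilization = arrival_rate * per_record_seconds
keeps_up = utilization < 1.0

print(utilization, keeps_up)
```

Richer per-record analysis raises `per_record_seconds`, pushing `utilization` toward (and past) 1.0, which is exactly the trade-off described above.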
1. Production applications, which were discussed in these articles and illustrated with examples like the unnamed U.S. airline described by McAfee and Brynjolfsson, which adopted a vendor-supplied procedure to obtain better estimates of flight arrival times, improving its ability to schedule ground crews and saving several million dollars per year at each airport. Similarly, the article by Barton and Court described a shipping company (again, unnamed) that used real-time weather forecast data and shipping port status data to develop an automated system that improved the on-time performance of its fleet. Examples like these describe automated systems put in place to continuously exploit a large but fixed data source.

2. The exploitation of Big Data for “one-off” analyses: a question is posed, and the data science team scrambles to find an answer. This use is not represented by any of the examples described in these articles. In fact, this second type of application overlaps substantially with the development process required to create a production application, although the end results are very different. In particular, the end result of a one-off analysis is a single set of results, ultimately summarized to address the question originally posed. In contrast, a production application requires continuing support and often must meet challenging interface requirements between the IT systems that collect and preprocess the Big Data sources and those already in use by the end users of the tool (e.g., a Hadoop cluster running in a UNIX environment versus periodic reports generated, either automatically or on demand, from a Microsoft Access database of summary information).
“A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed. A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data – and also not at actually analyzing the data.”
- Understand the problem;
- Formulate a plan for solving the problem;
- Carry out this plan;
- Check the results.
- Formulate the analytical problem: decide what kinds of questions could and should be asked in a way that is likely to yield useful, quantitative answers;
- Identify and evaluate potential data sources: what is available in-house, from the Internet, from vendors? How complete are these data sources? What would it cost to use them? Are there significant constraints on how they can be used? Are some of these data sources strongly incompatible? If so, does it make sense to try to merge them approximately, or is it more reasonable to omit some of them?
- Acquire the data and transform it into a form that is useful for analysis; note that for sufficiently large data collections, part of this data will almost certainly be stored in some form of relational database, probably administered by others, and extracting what is needed for analysis will likely involve writing SQL queries against this database;
- Once the relevant collection of data has been acquired and prepared, examine it carefully to make sure it meets analytical expectations: do the formats look right? Are the ranges consistent with expectations? Do the relationships between key variables make sense?
- Do the analysis: by lumping all of the steps of data analysis into this simple statement, I am not attempting to minimize the effort involved, but rather to emphasize the other aspects of the Big Data analysis problem;
- After the analysis is complete, develop a concise summary of the results that clearly and succinctly states the motivating problem, highlights what has been assumed, what has been neglected and why, and gives the simplest useful summary of the data analysis results. (Note that this will often involve several different summaries, with different levels of detail and/or emphases, intended for different audiences.)
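The middle steps above – extracting what is needed via SQL, sanity-checking the result, and producing the simplest useful summary – can be sketched in a few lines of Python. This is a toy illustration: the table, column names, and flight records below are all invented, and an in-memory SQLite database stands in for the (much larger) relational database a real project would query:

```python
import sqlite3

# Hypothetical stand-in for a database administered by others;
# all table and column names here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (flight_id TEXT, sched_arrival REAL, actual_arrival REAL)"
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("AA100", 10.0, 10.25), ("AA101", 11.5, 11.4), ("AA102", 14.0, 14.9)],
)

# Extract only what the analysis needs with a SQL query.
rows = conn.execute(
    "SELECT flight_id, actual_arrival - sched_arrival AS delay_hours FROM flights"
).fetchall()

# Sanity-check the extracted data before any modeling:
# do the formats look right? Are the ranges plausible?
assert all(isinstance(fid, str) for fid, _ in rows)
assert all(-24.0 < delay < 24.0 for _, delay in rows)

# The simplest useful summary: the average arrival delay in hours.
mean_delay = sum(delay for _, delay in rows) / len(rows)
print(round(mean_delay, 3))
```

In a real project the checks would be far more extensive, but the shape is the same: query, verify, then summarize.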
“Data sets never pop into existence in a fully mature and reliable state; they must be cleaned and massaged into an appropriate form. Just getting the data ready for analysis often represents a significant component of a research project.”
“A great deal of information on these topics already exists in books and on the internet; the value of this book is in collecting only the important subset of this information that is necessary to begin applying these technologies within a research setting.”