This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, October 27, 2012

Characterizing a new dataset

In my last post, I promised a further examination of the spacing measures I described there, and I still promise to do that, but I am changing the order of topics slightly.  So, instead of spacing measures, today’s post is about the DataframeSummary procedure to be included in the ExploringData package, which I also mentioned in my last post and promised to describe later.  My next post will be a special one on Big Data and Data Science, followed by another one about the DataframeSummary procedure (additional features of the procedure and the code used to implement it), after which I will come back to the spacing measures I discussed last time.

A task that arises frequently in exploratory data analysis is the initial characterization of a new dataset.  Ideally, everything we could want to know about a dataset should come from the accompanying metadata, but this is rarely the case.  As I discuss in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine, metadata is the available “data about data” that (usually) accompanies a data source.  In practice, however, the available metadata is almost never as complete as we would like, and it is sometimes wrong in important respects.  This is particularly the case when numeric codes are used for missing data, without accompanying notes describing the coding.  An example illustrating the consequent problem of disguised missing data is described in my paper The Problem of Disguised Missing Data.  (It should be noted that the original source of one of the problems described there has since been corrected: a comment in the UCI Machine Learning Repository header file for the Pima Indians diabetes dataset stated that there were no missing data records.)

Once we have converted our data source into an R data frame (e.g., via the read.csv function for an external csv file), there are a number of useful tools to help us begin this characterization process.  Probably the most general is the str command, applicable to essentially any R object.  Applied to a dataframe, this command first tells us that the object is a dataframe, second, gives us the dimensions of the dataframe, and third, presents a brief summary of its contents, including the variable names, their type (specifically, the results of R’s class function), and the values of their first few observations.  As a specific example, if we apply this command to the rent dataset from the gamlss package, we obtain the following summary:

> str(rent)
'data.frame':   1969 obs. of  9 variables:
 $ R  : num  693 422 737 732 1295 ...
 $ Fl : num  50 54 70 50 55 59 46 94 93 65 ...
 $ A  : num  1972 1972 1972 1972 1893 ...
 $ Sp : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Sm : num  0 0 0 0 0 0 0 0 0 0 ...
 $ B  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ H  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
 $ L  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ loc: Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
This dataset summarizes a 1993 random sample of housing rental prices in Munich, including a number of important characteristics about each one (e.g., year of construction, floor space in square meters, etc.).  A more detailed description can be obtained via the command “help(rent)”.

The head command provides similar information to the str command, in slightly less detail (e.g., it doesn’t give us the variable types), but in a format that some will find more natural:

> head(rent)
       R Fl    A Sp Sm B H L loc
1  693.3 50 1972  0  0 0 0 0   2
2  422.0 54 1972  0  0 0 0 0   2
3  736.6 70 1972  0  0 0 0 0   2
4  732.2 50 1972  0  0 0 0 0   2
5 1295.1 55 1893  0  0 0 0 0   2
6 1195.9 59 1893  0  0 0 0 0   2
(An important difference between these representations is that str characterizes factor variables by their level number and not their level value.  Thus, the first few observations of the factor B assume the first level of the factor, which is the value 0.  It may appear that str is telling us that the first few records list the value 1 for the variable B while head is indicating a zero, but the two displays agree: str shows the level number 1, and head shows the corresponding level value 0.  This is one reason that data analysts may prefer the head characterization.)
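
The distinction between level numbers and level values can be checked directly.  The following is a minimal sketch using a small made-up factor (not the rent data) with the same 0/1 coding as B:

```r
# Toy factor (hypothetical data) coded like B in the rent dataframe
B <- factor(c(0, 0, 1, 0, 0))

levels(B)                      # the level values: "0" "1"
as.integer(B)                  # level numbers, as str displays them: 1 1 2 1 1
as.character(B)                # level values, as head displays them
as.numeric(as.character(B))    # recovers the original numeric coding
```

Note that applying as.numeric directly to a factor returns the level numbers, not the level values, which is why the conversion goes through as.character first.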

While the R data type of each variable can be useful to know (particularly in cases where it isn’t what we expect it to be, as when integers are coded as factors), this characterization doesn’t really tell us the whole story.  In particular, note that R has commands like “as.character” and “as.factor” that can easily convert numeric variables to character or factor data types.  Even beyond this, the range of inherent behaviors that numerically-coded data can exhibit cannot be fully described by a simple data type designation.  As a specific example, one of the variables in the rent dataframe is “A,” described in the metadata available from the help command as “year of construction.”  While this variable is coded as type “numeric,” in fact it takes integer values from 1890 to 1988, with some values in this range repeated many times and others absent.  This point is important, since analysis tools designed for continuous variables, especially outlier-resistant ones like medians and other rank-based methods, sometimes perform poorly in the face of data sequences with many repeated values (i.e., “ties,” which have zero probability for continuous data distributions).  In extreme cases, these techniques may fail completely, as in the case of the MADM scale estimate, discussed in Chapter 7 of Exploring Data.  This scale estimate implodes if more than 50% of the data values are the same, returning the useless value zero in this case, independent of the values of all of the other data points.
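
The implosion of the MADM scale estimate is easy to reproduce.  The sketch below uses a made-up sequence in which six of ten values are tied, together with R’s built-in mad function (the median absolute deviation from the median, scaled by 1.4826 by default):

```r
# Hypothetical data: 6 of the 10 values are tied at zero (more than 50%)
x <- c(0, 0, 0, 0, 0, 0, 1, 2, 3, 100)

length(unique(x))   # number of distinct values: 5
mad(x)              # MADM scale estimate: 0, regardless of the other values
sd(x)               # the standard deviation, by contrast, is nonzero
```

Replacing the value 100 with any other number leaves mad(x) at zero, since the median of the absolute deviations is determined entirely by the tied values.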

These observations motivate the DataframeSummary procedure described here, to be included in the ExploringData package.  This function is called with the name of the dataframe to be characterized and an optional parameter Option, which can take any one of the following four values:

  1. “Brief” (the default value)
  2. “NumericOnly”
  3. “FactorOnly”
  4. “AllAsFactor”

In all cases, this function returns a summary dataframe with one row for each column in the dataframe to be characterized.  Like the str command, these results include the name of each variable and its type.  Under the default option “Brief,” this function also returns the following characteristics for each variable:

  • Levels = the number of distinct values the variable exhibits;
  • AvgFreq = the average number of records listing each value;
  • TopLevel = the most frequently occurring value;
  • TopFreq = the number of records listing this most frequent value;
  • TopPct = the percentage of records listing this most frequent value;
  • MissFreq = the number of missing or blank records;
  • MissPct = the percentage of missing or blank records.

For the rent dataframe, this function (under the default “Brief” option) gives the following summary:

> DataframeSummary(rent)
  Variable    Type Levels AvgFreq TopLevel TopFreq TopPct MissFreq MissPct
3        A numeric     73   26.97     1957     551  27.98        0       0
6        B  factor      2  984.50        0    1925  97.77        0       0
2       Fl numeric     91   21.64       60      71   3.61        0       0
7        H  factor      2  984.50        0    1580  80.24        0       0
8        L  factor      2  984.50        0    1808  91.82        0       0
9      loc  factor      3  656.33        2    1247  63.33        0       0
1        R numeric   1762    1.12      900       7   0.36        0       0
5       Sm numeric      2  984.50        0    1797  91.26        0       0
4       Sp numeric      2  984.50        0    1419  72.07        0       0

The variable names and types appear essentially as they do in the results obtained with the str function.  Because the variable names are listed alphabetically for convenience, the numbers at the far left give the column number of each variable in the dataframe rent.  The “Levels” column of this summary dataframe gives the number of unique values for each variable, and it is clear that this can vary widely even within a given data type.  For example, the variable “R” (monthly rent in DM) exhibits 1,762 unique values in 1,969 data observations, so it is almost unique, while the variables “Sm” and “Sp” exhibit only two possible values, even though all three of these variables are of type “numeric.”

The AvgFreq column gives the average number of times each level should appear, assuming a uniform distribution over all possible values.  This number is included as a reference value for assessing the other frequencies (i.e., TopFreq for the most frequently occurring value and MissFreq for missing data values).  Thus, for the first variable, “A,” AvgFreq is 26.97, meaning that if all 73 possible values for this variable were equally represented, each one should occur about 27 times in the dataset.  The most frequently occurring level (TopLevel) is “1957,” which occurs 551 times, suggesting a highly nonuniform distribution of values for this variable.  In contrast, for the variable “R,” AvgFreq is 1.12, meaning that each value of this variable is almost unique.

The TopPct column gives the percentage of records in the dataset exhibiting the most frequent value for each variable, which varies from 0.36% for the numeric variable “R” to 97.77% for the factor variable “B.”  It is interesting to note that “B” is of type “factor” but is coded as 0 or 1, while the variables “Sm” and “Sp” are also binary, coded as 0 or 1, but are of type “numeric.”  This illustrates the point noted above that the R data type is not always as informative as we might like it to be.  (This is not a criticism of R, but rather a caution about the fact that, in preparing data, we are free to choose many different representations, and the original logic behind the choice may not be obvious to all ultimate users of the data.)

In addition, comparing the data for the variable “B” with the available metadata illustrates the point about metadata errors noted earlier: of the 1,969 data records, 1,925 have the value “0” (97.77%), while 44 have the value “1” (2.23%), but the information returned by the help command indicates exactly the opposite proportion of values: 1,925 should have the value “1” (indicating the presence of a bathroom), while 44 should have the value “0” (indicating the absence of a bathroom).  Since the interpretation of the variables that enter any analysis is important in explaining our final analytical results, it is useful to detect this type of mismatch between the data and the available metadata as early as possible.  Here, comparing the average rents for records with B = 1 (DM 424.95) against those with B = 0 (DM 820.72) suggests that the levels have been reversed relative to the metadata: the relatively few housing units without bathrooms are represented by B = 1, renting for less than the majority of those units, which have bathrooms and are represented by B = 0.

Finally, the last two columns of the above summary give the number of records with missing or blank values (MissFreq) and the corresponding percentage (MissPct); here, all records are complete so these numbers are zero.
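
For readers who want to experiment before the actual code is presented, the columns of the “Brief” summary can be approximated in a few lines of base R.  The following is only a rough sketch under stated assumptions, not the DataframeSummary implementation: the function name brief_summary_sketch and the toy dataframe are hypothetical, NA values and empty strings are treated as “missing or blank,” and AvgFreq is computed from the non-missing records.

```r
# Hypothetical helper: NOT the ExploringData implementation, just a sketch
brief_summary_sketch <- function(df) {
  n <- nrow(df)
  rows <- lapply(names(df), function(v) {
    x <- df[[v]]
    # Assumption: NA values and empty strings count as missing or blank
    miss <- is.na(x) | (as.character(x) %in% "")
    tab <- table(as.character(x[!miss]))   # frequency of each distinct value
    has_vals <- length(tab) > 0
    top_freq <- if (has_vals) as.integer(max(tab)) else NA_integer_
    data.frame(
      Variable = v,
      Type     = class(x)[1],
      Levels   = length(tab),
      # Assumption: average frequency over the non-missing records
      AvgFreq  = if (has_vals) round(sum(tab) / length(tab), 2) else NA_real_,
      TopLevel = if (has_vals) names(tab)[which.max(tab)] else NA_character_,
      TopFreq  = top_freq,
      TopPct   = round(100 * top_freq / n, 2),
      MissFreq = sum(miss),
      MissPct  = round(100 * sum(miss) / n, 2),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  # List variables alphabetically; row names keep the original column numbers
  out[order(out$Variable), ]
}

# Toy dataframe (made-up values) with one missing rent
toy <- data.frame(
  rent = c(500, 620, 500, 710, 500, NA),
  bath = factor(c("0", "0", "1", "0", "0", "0"))
)
res <- brief_summary_sketch(toy)
print(res)
```

A group-wise comparison like the one used above to check the coding of B can be made with base R’s tapply function, e.g., tapply(rent$R, rent$B, mean), assuming the rent dataframe has been loaded.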

In my next post on this topic, I will present results for the other three options of the DataframeSummary procedure, along with the code that implements it.  In all cases, the results include those generated by the “Brief” option just presented; the other options differ first in what additional characterizations are included, and second in which subset of variables is included in the summary.  Specifically, for the rent dataframe, we obtain:

  • Under the “NumericOnly” option, a summary of the five numeric variables R, Fl, A, Sp, and Sm, giving characteristics that are appropriate to numeric data types, like the spacing measures described in my last post;
  • Under the “FactorOnly” option, a summary of the four factor variables B, H, L, and loc results, giving measures that are appropriate to categorical data types, like the normalized Shannon entropy measure discussed in several previous posts;
  • Under the “AllAsFactor” option, all variables in the dataframe are first converted to factors and then characterized using the same measures as in the “FactorOnly” option.

The advantage of the “AllAsFactor” option is that it characterizes all variables in the dataframe, but as I discussed in my last post, the characterization of numerical variables with measures like Shannon entropy is not always terribly useful.
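
The conversion step behind an “AllAsFactor”-style characterization can be sketched in base R as follows.  This is an illustration under stated assumptions, not the procedure’s own code: the toy dataframe and the helper norm_entropy are hypothetical, and the normalization used here (Shannon entropy divided by the log of the number of observed levels) is one common convention that may differ from the measure used in the earlier posts, for example by being its complement.

```r
# Toy dataframe (made-up values) with one numeric and one factor column
toy <- data.frame(
  year = c(1957, 1957, 1972, 1893),
  bath = factor(c("0", "0", "1", "0"))
)

# Convert every column to a factor while keeping the dataframe structure
toy_f <- toy
toy_f[] <- lapply(toy_f, as.factor)
sapply(toy_f, class)

# Hypothetical helper; assumed normalization is H / log(M), so that
# 1 means uniformly distributed levels and 0 means a single level
norm_entropy <- function(f) {
  p <- prop.table(table(f))   # relative frequencies, NA values excluded
  p <- p[p > 0]               # drop unused levels
  m <- length(p)
  if (m < 2) return(0)
  -sum(p * log(p)) / log(m)
}

sapply(toy_f, norm_entropy)
```

For a numeric variable like R in the rent dataframe, where almost every value is distinct, a measure of this kind says little beyond the fact that the values are nearly unique, which is the limitation noted above.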

