This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Wednesday, March 23, 2011

The Many Uses of Q-Q Plots

My last four posts have dealt with boxplots and some useful variations on that theme.  Just after I finished the series, Tal Galili, who maintains the R-bloggers website, pointed me to a variant I hadn’t seen before.  It's called a beeswarm plot, and it's produced by the beeswarm package in R.  I haven’t played with this package a lot yet, but it does appear to be useful for datasets that aren’t too large and that you want to examine across a moderate number of different segments.  The plot shown below provides a typical illustration: it shows the beeswarm plot comparing the potassium content of different cereals, broken down by manufacturer, from the UScereal dataset included in the MASS package in R.  I discussed this data example in my first couple of boxplot posts, and I think this is a case where the beeswarm plot gives you a more useful picture of how the data points are distributed than the boxplots do.  For more information about the beeswarm package, I recommend Tal's post.  More generally, anyone interested in learning more about what you can do with the R software package should find the R-bloggers website extremely useful.

Besides boxplots, one of the other useful graphical data characterizations I discuss in Exploring Data in Engineering, the Sciences, and Medicine is the quantile-quantile (Q-Q) plot.  The most common form of this characterization is the normal Q-Q plot, which represents an informal graphical test of the hypothesis that a data sequence is normally distributed.  That is, if the points on a normal Q-Q plot are reasonably well approximated by a straight line, the popular Gaussian data hypothesis is plausible, while marked deviations from linearity provide evidence against this hypothesis.  The utility of normal Q-Q plots goes well beyond this informal hypothesis test, however, which is the main point of this post.  In particular, the shape of a normal Q-Q plot can be extremely useful in highlighting distributional asymmetry, heavy tails, outliers, multi-modality, or other data anomalies.  The specific objective of this post is to illustrate some of these ideas, expanding on the discussion presented in Exploring Data.
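To make the construction concrete, here is a minimal sketch in base R that builds a normal Q-Q plot by hand: the sorted data values are plotted against standard normal quantiles evaluated at a set of plotting positions. The two simulated samples and the seed are illustrative assumptions, not data from the book. The Gaussian sample should track the reference line, while the asymmetric sample should bend away from it.

```r
set.seed(1)                     # arbitrary seed, for reproducibility only
x_norm <- rnorm(200)            # simulated Gaussian sample
x_skew <- rexp(200)             # simulated asymmetric sample

# Theoretical quantiles: standard normal quantiles at the plotting positions
n <- length(x_norm)
theo <- qnorm(ppoints(n))

par(mfrow = c(1, 2))
plot(theo, sort(x_norm), main = "Gaussian sample",
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(x_norm)                  # reference line through the quartiles
plot(theo, sort(x_skew), main = "Asymmetric sample",
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(x_skew)

# qqnorm() computes the same coordinates; plot.it = FALSE just returns them
qq <- qqnorm(x_norm, plot.it = FALSE)
same_x <- all.equal(sort(qq$x), theo)
same_y <- all.equal(sort(qq$y), sort(x_norm))
```

In everyday use, a call to qqnorm followed by qqline does all of this in two lines.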

The above figure shows four different normal Q-Q plots that illustrate some of the different data characteristics these plots can emphasize.  The upper left plot demonstrates that normal Q-Q plots can be extremely effective in highlighting glaring outliers in a data sequence.  This plot shows the annual number of traffic deaths per ten thousand drivers over an unspecified time period, for 25 of the 50 states in the U.S., plus the District of Columbia.  This plot was constructed from the road dataset included in the MASS package in R, which gives the numbers of deaths, the numbers of drivers (in tens of thousands), and several other characteristics for each of these regions.  Based on the interpretation of normal Q-Q plots offered above, the normal distribution hypothesis appears fairly reasonable for this data sequence, in all cases except the point in the extreme upper right.  This point corresponds to the state of Maine, which exhibited 26 deaths per ten thousand drivers, well above the average of approximately 5 for all other regions considered.  
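A plot along these lines can be sketched in a few lines of R. The variable name rate is my own choice, and the exact plotting options used for the original figure are not known, so treat this as an approximation rather than the code behind the figure.

```r
library(MASS)                          # provides the road data frame

# drivers is recorded in tens of thousands, so this ratio is
# deaths per ten thousand drivers
rate <- road$deaths / road$drivers
names(rate) <- rownames(road)

qqnorm(rate, main = "Traffic deaths per 10,000 drivers")
qqline(rate)

# Identify the region behind the extreme upper-right point
worst <- names(which.max(rate))
worst
round(max(rate))                       # its rate, to the nearest whole number
```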

It is not clear why the reported traffic death rate is so high for Maine.  The scatterplot above shows the reported traffic deaths for each state or district against the number of drivers, in tens of thousands.  The dashed line in the plot corresponds to the average traffic death rate for all regions except Maine, and it is clear that this line fits most of the data points reasonably well, with Maine (the solid point) representing the most glaring exception.  Although it still leaves us wanting to know more, this plot suggests that the number of deaths for Maine is unusually high, rather than the number of drivers being unusually low, which might be a more tempting explanation.
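The scatterplot can be approximated as follows. This sketch assumes the dashed line passes through the origin with slope equal to the mean of the per-region death rates, excluding Maine; the original figure may have computed the average differently (for example, total deaths divided by total drivers).

```r
library(MASS)                               # provides the road data frame

rate     <- road$deaths / road$drivers      # deaths per 10,000 drivers
is_maine <- rownames(road) == "Maine"
avg_rate <- mean(rate[!is_maine])           # average rate without Maine

# Open circles for all regions, a solid point for Maine
plot(road$drivers, road$deaths,
     pch  = ifelse(is_maine, 16, 1),
     xlab = "Drivers (tens of thousands)",
     ylab = "Traffic deaths")

# Dashed line: deaths expected at the average rate for a given driver count
abline(a = 0, b = avg_rate, lty = 2)
```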

The Q-Q plot for this denominator variable (i.e., for the number of drivers) is shown as the upper right plot in the original set of four shown above.  There, the fact that both tails of the distribution lie above the reference line is suggestive of distributional asymmetry, a point examined further below using Q-Q plots for other reference distributions.  Also, note that both of the upper Q-Q plots shown above are based on only 26 data values, which is right at the lower limit on sample size that various authors have suggested for normal Q-Q plots to be useful (see the discussion of normal Q-Q plots in Section 6.3.3 of Exploring Data for details).  The tricky issue of separating outliers, asymmetry, and other potentially interesting data characteristics in samples this small is greatly simplified by the Q-Q plot confidence intervals discussed below.

The lower left Q-Q plot in the above sequence is that for the Old Faithful geyser dataset faithful included with the base R package.  As I have discussed previously, the eruption duration data exhibits a pronounced bimodal distribution, which may be seen clearly in nonparametric density estimates computed from these data values.  Normal Q-Q plots constructed from bimodal data typically exhibit a “kink” like the one seen in this plot.  A crude way of explaining this behavior is the following: the lower portion of the Q-Q plot is very roughly linear, suggesting a very approximate Gaussian distribution, corresponding to the first mode of the eruption data distribution (i.e., the durations of the shorter group of eruptions).  Similarly, the upper portion of the Q-Q plot is again very roughly linear, but with a much different intercept that corresponds to the larger mean of the second peak in the distribution (i.e., the durations of the longer group of eruptions).  To connect these two “roughly linear” local segments, the curve must exhibit a “kink” or rapid transition region between them.  By the same reasoning, more general multi-modal distributions will exhibit more than one such “kink” in their Q-Q plots.

Finally, the lower right Q-Q plot in the collection above was constructed from the Pima Indians diabetes dataset available from the UCI Machine Learning Repository.  This dataset includes a number of clinical measurements for 768 female members of the Pima tribe of Native Americans, including their diastolic blood pressure.  The lower right Q-Q plot was constructed from this blood pressure data, and its most obvious feature is the prominent lower tail anomaly.  In fact, careful examination of this plot reveals that these points correspond to the value zero, which is not realistic for any living person.  What has happened here is that zero has been used to code missing values, both for this variable and several others in this dataset.  This observation is important because the metadata associated with this dataset indicates that there is no missing data, and a number of studies in the classification literature have proceeded under the assumption that this is true.  Unfortunately, this assumption can lead to badly biased results, a point discussed in detail in a paper I published in SIGKDD Explorations (Disguised Missing Data paper PDF).  The point of the example presented here is to show that normal Q-Q plots can be extremely effective in highlighting this kind of data anomaly.
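Both of these lower-panel behaviors are easy to reproduce. The sketch below uses the faithful data frame for the bimodal “kink”, and a purely synthetic stand-in for the blood pressure variable: the mean, standard deviation, and the number of zero-coded values are made-up illustrative choices, not values taken from the Pima dataset.

```r
par(mfrow = c(1, 2))

# Bimodal data: Old Faithful eruption durations (shipped with R)
qqnorm(faithful$eruptions, main = "Old Faithful eruption durations")
qqline(faithful$eruptions)

# Disguised missing data: SYNTHETIC values, not the real Pima measurements
set.seed(42)                                  # arbitrary seed
bp <- round(rnorm(768, mean = 72, sd = 12))   # hypothetical blood pressures
bp[sample(768, 35)] <- 0                      # hypothetical: 35 values coded as zero
qqnorm(bp, main = "Synthetic data, zero-coded missing values")
qqline(bp)

# Once recognized, the zero codes can be converted to explicit missing values
n_zero   <- sum(bp == 0)
bp_clean <- replace(bp, bp == 0, NA)
```

The zero-coded values appear as a flat run of points at the bottom left of the second plot, well separated from the rest of the data.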

The normal Q-Q plots considered so far were constructed using the qqnorm procedure available in base R, and the reference lines shown in these plots were constructed using the qqline command.  It is not difficult to construct Q-Q plots for other reference distributions using procedures in base R, but a much simpler alternative is to use the qqPlot command in the optional car package.  This R add-on package was developed in association with the book An R Companion to Applied Regression, by Fox and Weisberg, and it includes a number of very useful procedures.  The default options of the qqPlot procedure automatically generate a reference line, along with upper and lower 95% confidence intervals for the plot, which are particularly useful for small samples like the road dataset.  The figure below shows a normal Q-Q plot for the number of traffic deaths per 10,000 drivers generated using the qqPlot command.   The fact that all of the points but the one obvious outlier fall within the 95% confidence limits suggests that the scatter around the reference line seen for these 25 observations is small enough to be consistent with a normal reference distribution.  Further, these confidence limits also emphasize how much the outlying result for the state of Maine violates this normality assumption.
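A minimal sketch of such a call is shown here. Because car is an optional package, the call is guarded so that the script still runs where car is not installed; the axis label is my own choice, and details of the output may differ between versions of car.

```r
library(MASS)                          # provides the road data frame
rate <- road$deaths / road$drivers     # deaths per 10,000 drivers

# car is an add-on package: install.packages("car") if it is missing
if (requireNamespace("car", quietly = TRUE)) {
  # Defaults: normal reference distribution, a reference line,
  # and a 95% confidence envelope around that line
  car::qqPlot(rate, ylab = "Deaths per 10,000 drivers")
}
```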

Another advantage of the qqPlot command is that it provides the basis for very easy generation of Q-Q plots for essentially any reference distribution that is available in R, including those available in add-on packages like gamlss.dist, which supports an extremely wide range of distributions (generalized inverse Gaussian distributions, anyone?).  This capability is illustrated in the four Q-Q plots shown below, all generated with the qqPlot command for non-Gaussian distributions.  In all of these plots, the data corresponds to the driver counts for the 26 states and districts summarized in the road dataset.  The motivation for the specific Q-Q plots shown here is that the four distributions represented by these plots are all better suited to capturing the asymmetry seen in the normal Q-Q plot for this data sequence than the symmetric Gaussian distribution is.  The upper left plot shows the results obtained for the exponential distribution which, like the Gaussian distribution, does not require the specification of a shape parameter.  Comparing this plot with the normal Q-Q plot shown above for this data sequence, it is clear that the exponential distribution is more consistent with the driver data than the Gaussian distribution is.  The data point in the extreme upper right does fall just barely outside the 95% confidence limits shown on this plot, and careful inspection reveals that the points in the lower left fall slightly below these confidence limits, which become quite narrow at this end of the plot.
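The comparison between the normal and exponential reference distributions can be sketched as follows. The distribution argument of qqPlot takes the root name of a distribution, so "exp" refers to the qexp and dexp functions in base R. The call is guarded because car is optional, and the plot titles are my own.

```r
library(MASS)                          # provides the road data frame
drivers <- road$drivers                # driver counts, in tens of thousands

# A mean well above the median is a quick numerical hint of asymmetry
c(mean = mean(drivers), median = median(drivers))

if (requireNamespace("car", quietly = TRUE)) {
  par(mfrow = c(1, 2))
  car::qqPlot(drivers, distribution = "norm", main = "Normal")
  car::qqPlot(drivers, distribution = "exp",  main = "Exponential")
}
```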

The exponential distribution represents a special case of the gamma distribution, with a shape parameter equal to 1.  In fact, the exponential distribution exhibits a J-shaped density, decaying from a maximum value at zero, and it corresponds to a “dividing line” within the gamma family: members with shape parameters larger than 1 exhibit unimodal densities with a single maximum at some positive value, while gamma distributions with shape parameters less than 1 are J-shaped like the exponential distribution.  To construct Q-Q plots for general members of the gamma family, it is necessary to specify a particular value for this shape parameter, and the other three Q-Q plots shown above have done this using the qqPlot command.  Comparing these plots, it appears that increasing the shape parameter causes the points in the upper tail to fall farther outside the 95% confidence limits, while decreasing the shape parameter better accommodates these upper tail points.  Conversely, decreasing the shape parameter causes the cluster of points in the lower tail to fall farther outside the confidence limits.  It is not obvious that any of the plots shown here suggest a better fit than the exponential distribution, but the point of this example was to show the flexibility of the qqPlot procedure in being able to pose the question and examine the results graphically.  Alternatively, the Weibull distribution, which also includes the exponential distribution as a special case, might describe these data values better than any member of the gamma distribution family, and these plots can also be easily generated using the qqPlot command (just specify dist = "weibull" instead of dist = "gamma", along with shape = a for some positive value of a other than 1).
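The sketch below first checks the two special-case relationships numerically in base R, and then generates gamma and Weibull Q-Q plots for the driver counts. The shape values 0.5 and 2 are illustrative choices of mine, not necessarily the values behind the figures, and the qqPlot calls are guarded because car is an optional package.

```r
# Gamma and Weibull densities with shape 1 both reduce to the exponential
x <- seq(0.1, 5, by = 0.1)
gamma_is_exp   <- all.equal(dgamma(x, shape = 1),   dexp(x))
weibull_is_exp <- all.equal(dweibull(x, shape = 1), dexp(x))

library(MASS)                          # provides the road data frame
drivers <- road$drivers

if (requireNamespace("car", quietly = TRUE)) {
  par(mfrow = c(1, 3))
  # Extra arguments such as shape are passed on to the quantile function
  car::qqPlot(drivers, distribution = "gamma",   shape = 0.5,
              main = "Gamma, shape = 0.5")
  car::qqPlot(drivers, distribution = "gamma",   shape = 2,
              main = "Gamma, shape = 2")
  car::qqPlot(drivers, distribution = "weibull", shape = 2,
              main = "Weibull, shape = 2")
}
```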

Finally, one cautionary note is important here for those working with very large datasets.  Q-Q plots are based on sorting data, something that can be done quite efficiently, but which can still take a very long time for a really huge dataset.  As a consequence, while you can attempt to construct Q-Q plots for sequences of hundreds of thousands of points or more, you may have to wait a long time to get your plot.  Further, it is often true that plots made up of a very large number of points reduce to ugly-looking dark blobs that can use up a lot of toner if you make the further mistake of trying to print them.  So, if you are working with really enormous datasets, my suggestion is to construct Q-Q plots from a representative random sample of a few hundred or a few thousand points, not hundreds of thousands or millions of points.  It will make your life a lot easier.
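As a rough illustration of this suggestion, the following sketch draws a simple random sample from a simulated sequence of one million points and builds the Q-Q plot from the sample only. The simulated data, the seed, and the sample size of 2000 are all assumptions made for the example; for real data, a simple random sample is only one of several reasonable ways to obtain a representative subset.

```r
set.seed(123)                    # arbitrary seed, for reproducibility only
big <- rexp(1e6)                 # simulated stand-in for a very large dataset

# Simple random sample without replacement
sub <- sample(big, size = 2000)

qqnorm(sub, main = "Normal Q-Q plot, random sample of 2000 points")
qqline(sub)
```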


62 comments:

  1. Hi Ron,
    very nice post but I'm wondering about the actual code. Would it be possible to know it so I can improve my R skills?
    Many thanks in advance,
    Ruben

  2. Hi, Ruben - Thanks for the note. I'm glad you like the blog, and thanks for the suggestion: I will try to include more R code going forward, as I have done a bit with my recent post on interestingness measures. Meanwhile, you might be interested in The R Book by Michael J. Crawley as a good source for information on the practical mechanics of working with R.

  3. Hello Ron,

    While purchasing books is well and good for those who have the resources, if you are serious about contributing to the open source community, you should always show your code. Remember, there are people in places other than the first world who need this information as well.

  4. >It is not clear why the reported traffic death rate is so high for Maine.

    non-statistical reply: That doesn't surprise me. It's a very rural state, with spread out population, and aside from I-95 and I-495, most of the highways are 2-lane roads. That and the fact that there are moose which cause numerous collisions.
