(My) Background
I want to talk a bit about where I've started from, because I think it might be useful to understand my perspective, and why I'm interested in doing these things.
Undergraduate in Psychology
Psychophysics: illusory contours in 3D
Phil Grove
Christina Lee
If every psychologist in the world delivered gold standard smoking cessation therapy, the rate of smoking would still increase. You need to change policy to make change. To make effective policy, you need to have good data, and do good statistics.
I discovered an interest in public health and statistics.
I started a PhD in statistics at QUT, under (now distinguished) Professor Kerrie Mengersen, looking at people's health over time.
Focus on building a bridge across a river. Less focus on how it is built, and the tools used.
My research:
Design and improve tools for (exploratory) data analysis
...EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. (Wikipedia)
John Tukey, Frederick Mosteller, Bill Cleveland, Dianne Cook, Heike Hofmann, Rob Hyndman, Hadley Wickham
visdat::vis_dat(airquality)
visdat::vis_miss(airquality)
naniar
Tierney, NJ. Cook, D. "Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations." [Pre-print]
naniar::gg_miss_var(airquality)
naniar::gg_miss_upset(airquality)
Current work:
How to explore longitudinal data effectively
Something observed sequentially over time
## # A tsibble: 4 x 4 [!]
## # Key:       country [1]
##   country    year height_cm continent
##   <chr>     <dbl>     <dbl> <chr>
## 1 Australia  1910      173. Oceania
## 2 Australia  1920      173. Oceania
## 3 Australia  1960      176. Oceania
## 4 Australia  1970      178. Oceania
Problem #1: How do I look at some of the data?
Problem #2: How do I find interesting observations?
Problem #3: How do I understand my statistical model?
brolgar: brolgar.njtierney.com
Anything that is observed sequentially over time is a time series
heights <- as_tsibble(heights, index = year, key = country, regular = FALSE)
Together, the index (year) and the key (country) determine distinct rows in a tsibble.
(From Earo Wang's talk: Melt the clock)
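The "distinct rows" rule can be checked by hand. Here is a minimal base R sketch, using a small made-up data frame rather than the real heights data; `is_distinct()` is a hypothetical helper, not a tsibble function. A key and index are only usable together if no (key, index) pair is repeated.

```r
# Toy data, invented for illustration; not the real `heights` data.
toy <- data.frame(
  country   = c("A", "A", "A", "B", "B"),
  year      = c(1910, 1920, 1960, 1910, 1920),
  height_cm = c(170, 171, 174, 165, 166)
)

# Hypothetical helper: TRUE when every (key, index) pair identifies one row.
is_distinct <- function(data, key, index) {
  !any(duplicated(data[, c(key, index)]))
}

ok <- is_distinct(toy, key = "country", index = "year")

# Repeating a (country, year) pair breaks the rule.
toy_dup <- rbind(toy, toy[1, ])
ok_dup <- is_distinct(toy_dup, key = "country", index = "year")

print(c(original = ok, with_duplicate = ok_dup))
```

Note that the years do not need to be evenly spaced for this to hold, which matches the `regular = FALSE` setting used for heights.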
Record important time series information once, and use it many times in other places
## # A tsibble: 1,499 x 4 [!]
## # Key:       country [153]
##   country      year height_cm continent
##   <chr>       <dbl>     <dbl> <chr>
## 1 Afghanistan  1870      168. Asia
## 2 Afghanistan  1880      166. Asia
## 3 Afghanistan  1930      167. Asia
## 4 Afghanistan  1990      167. Asia
## 5 Afghanistan  2000      161. Asia
## 6 Albania      1880      170. Europe
## # … with 1,493 more rows
Remember:
key = variable(s) defining individual groups (or series)
Look at only a sample of the data:
n rows with sample_n()
heights %>% sample_n(5)
## # A tsibble: 5 x 4 [!]
## # Key:       country [5]
##   country        year height_cm continent
##   <chr>         <dbl>     <dbl> <chr>
## 1 Cote d'Ivoire  1890      169. Africa
## 2 Argentina      1810      169. Americas
## 3 United States  1920      173. Americas
## 4 Netherlands    1810      166  Europe
## 5 Nigeria        1870      166. Africa
## # A tsibble: 5 x 4 [!]
## # Key:       country [5]
##   country      year height_cm continent
##   <chr>       <dbl>     <dbl> <chr>
## 1 Malawi       1940      166. Africa
## 2 Italy        1960      173  Europe
## 3 Kazakhstan   1850      165. Asia
## 4 Colombia     1980      171. Americas
## 5 Switzerland  1910      172. Europe
... sampling needs to select not random rows of the data, but random keys: the countries.
sample_n_keys() to sample ... keys
sample_n_keys(heights, 5)
## # A tsibble: 35 x 4 [!]
## # Key:       country [5]
##   country     year height_cm continent
##   <chr>      <dbl>     <dbl> <chr>
## 1 Azerbaijan  1850      170. Asia
## 2 Azerbaijan  1860      171. Asia
## 3 Azerbaijan  1950      171. Asia
## 4 Azerbaijan  1960      172. Asia
## 5 Azerbaijan  1970      172. Asia
## 6 Azerbaijan  1980      172. Asia
## # … with 29 more rows
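The difference between sampling rows and sampling keys can be sketched in base R. This is an illustration of the idea on made-up data, not brolgar's implementation; `sample_keys()` is a hypothetical helper. The keys are sampled first, and then every row belonging to a chosen key is kept.

```r
# Toy data, invented for illustration; not the real `heights` data.
toy <- data.frame(
  country   = rep(c("A", "B", "C", "D"), each = 3),
  year      = rep(c(1900, 1950, 2000), times = 4),
  height_cm = c(165, 168, 171, 160, 162, 166, 170, 172, 175, 158, 161, 163)
)

# Hypothetical helper: sample whole keys, then keep all of their rows.
sample_keys <- function(data, key, n) {
  keys   <- unique(data[[key]])
  chosen <- sample(keys, size = n)
  data[data[[key]] %in% chosen, , drop = FALSE]
}

set.seed(1)
sub <- sample_keys(toy, key = "country", n = 2)

# Every sampled country keeps its full series of 3 observations.
print(table(sub$country))
```

Sampling rows instead would typically return one observation from each of several countries, which is why the earlier sample_n() output shows five different countries with a single year each.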
Look at subsamples
Look at many subsamples
(Something I made up)
If you have to solve 3+ substantial smaller problems in order to solve a larger problem, your focus shifts from the current goal to something else. You are distracted.
I want to look at many subsamples of the data
How many keys are there?
How many facets do I want to look at?
How many keys per facet should I look at?
How do I ensure there are the same number of keys per plot?
What is rep, rep.int, and rep_len?
Do I want length.out or times?
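For reference, the rep questions above have short answers in base R. The sketch below uses a made-up vector of seven keys and shows one possible way to spread them over three facets; it is not how brolgar does it internally.

```r
# Seven made-up keys, for illustration only.
keys <- c("A", "B", "C", "D", "E", "F", "G")

# times: repeat the whole vector a fixed number of times.
by_times <- rep(1:3, times = 2)

# length.out: recycle the vector until the result has exactly this length.
by_length <- rep(1:3, length.out = 7)

# rep_len() is a faster, simplified form of rep(x, length.out = n).
by_rep_len <- rep_len(1:3, 7)

# rep.int() is a faster, simplified form of rep(x, times = n).
by_rep_int <- rep.int(1:3, 2)

# Assign each of the 7 keys to one of 3 facets, as evenly as 7 / 3 allows.
facet_id <- sort(rep_len(1:3, length(keys)))
facets   <- split(keys, facet_id)
print(facets)
```

None of this is hard, but each step is one more small problem standing between the analyst and the plot.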
We can blame ourselves when we are distracted for not being better.
It's not that we should be better, rather with better tools we could be more efficient.
We need to make things as easy as reasonable, with the least amount of distraction.
How many plots do I want to look at?
gg_heights(heights) + facet_sample( n_per_facet = 3, n_facets = 9 )
facet_sample(): See more individuals
ggplot(heights, aes(x = year, y = height_cm, group = country)) + geom_line()
ggplot(heights, aes(x = year, y = height_cm, group = country)) + geom_line() + facet_sample()
facet_strata(): See all individuals
ggplot(heights, aes(x = year, y = height_cm, group = country)) + geom_line() + facet_strata()
"How many lines per facet?"
"How many facets?"
ggplot + facet_sample(n_per_facet = 10, n_facets = 12)
"How many facets to shove all the data in?"
ggplot + facet_strata(n_strata = 10)
In asking these questions we can solve something else interesting
facet_strata(along = -year): see all individuals along some variable
ggplot(heights, aes(x = year, y = height_cm, group = country)) + geom_line() + facet_strata(along = -year)
facet_strata() & facet_sample()
Under the hood: sample_n_keys() and stratify_keys()
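A rough sketch of what stratifying keys might look like, in base R with made-up data. This illustrates the idea only and is not brolgar's implementation of stratify_keys(): `assign_strata()` is a hypothetical helper, and using the per-key mean to order keys "along" a variable is an assumption made for this example.

```r
# Toy data, invented for illustration; not the real `heights` data.
toy <- data.frame(
  country   = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  year      = rep(c(1900, 2000), times = 6),
  height_cm = c(160, 164, 170, 176, 155, 158, 166, 169, 172, 180, 163, 165)
)

# Hypothetical helper: give every key a stratum number from 1 to n_strata.
# If `along` is given, keys are first ordered by their mean of that column
# (an assumption for this sketch), so similar keys share a stratum.
# Otherwise the keys are shuffled at random.
assign_strata <- function(data, key, n_strata, along = NULL) {
  if (is.null(along)) {
    keys <- sample(unique(data[[key]]))
  } else {
    key_means <- tapply(data[[along]], data[[key]], mean)
    keys <- names(sort(key_means))
  }
  strata <- sort(rep_len(seq_len(n_strata), length(keys)))
  data.frame(key = keys, strata = strata, stringsAsFactors = FALSE)
}

strata_along <- assign_strata(toy, key = "country", n_strata = 3,
                              along = "height_cm")
print(strata_along)
```

Because every key receives a stratum, faceting on the stratum shows all of the data, whereas sampling keys shows only part of it.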
as_tsibble(): Store useful information
sample_n_keys(): View subsamples of data
facet_sample(): View many subsamples
facet_strata(): View all subsamples
Define interesting?
Let's see that one more time, but with the data
## # A tsibble: 1,499 x 4 [!]
## # Key:       country [153]
##    country      year height_cm continent
##    <chr>       <dbl>     <dbl> <chr>
##  1 Afghanistan  1870      168. Asia
##  2 Afghanistan  1880      166. Asia
##  3 Afghanistan  1930      167. Asia
##  4 Afghanistan  1990      167. Asia
##  5 Afghanistan  2000      161. Asia
##  6 Albania      1880      170. Europe
##  7 Albania      1890      170. Europe
##  8 Albania      1900      169. Europe
##  9 Albania      2000      168. Europe
## 10 Algeria      1910      169. Africa
## # … with 1,489 more rows
## # A tibble: 153 x 6
##    country       min   q25   med   q75   max
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  161.  164.  167.  168.  168.
##  2 Albania      168.  168.  170.  170.  170.
##  3 Algeria      166.  168.  169   170.  171.
##  4 Angola       159.  160.  167.  168.  169.
##  5 Argentina    167.  168.  168.  170.  174.
##  6 Armenia      164.  166.  169.  172.  172.
##  7 Australia    170   171.  172.  173.  178.
##  8 Austria      162.  164.  167.  169.  179.
##  9 Azerbaijan   170.  171.  172.  172.  172.
## 10 Bahrain      161.  161.  164.  164.  164
## # … with 143 more rows
heights_five %>% filter(max == max(max) | max == min(max))
## # A tibble: 2 x 6
##   country            min   q25   med   q75   max
##   <chr>            <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark           165.  168.  170.  178.  183.
## 2 Papua New Guinea  152.  152.  156.  160.  161.
heights_five %>% filter(max == max(max) | max == min(max)) %>% left_join(heights, by = "country")
## # A tibble: 21 x 9
##    country   min   q25   med   q75   max  year height_cm continent
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <chr>
##  1 Denmark  165.  168.  170.  178.  183.  1820      167. Europe
##  2 Denmark  165.  168.  170.  178.  183.  1830      165. Europe
##  3 Denmark  165.  168.  170.  178.  183.  1850      167. Europe
##  4 Denmark  165.  168.  170.  178.  183.  1860      168. Europe
##  5 Denmark  165.  168.  170.  178.  183.  1870      168. Europe
##  6 Denmark  165.  168.  170.  178.  183.  1880      170. Europe
##  7 Denmark  165.  168.  170.  178.  183.  1890      169. Europe
##  8 Denmark  165.  168.  170.  178.  183.  1900      170. Europe
##  9 Denmark  165.  168.  170.  178.  183.  1910      170  Europe
## 10 Denmark  165.  168.  170.  178.  183.  1920      174. Europe
## # … with 11 more rows
heights %>% features(height_cm, feat_five_num)
## # A tibble: 153 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 147 more rows
heights %>%
  features(height_cm,      # variable we want to summarise
           feat_five_num)  # feature to calculate
feat_five_num?
feat_five_num
## function(x, ...) {
##   list(
##     min = b_min(x, ...),
##     q25 = b_q25(x, ...),
##     med = b_median(x, ...),
##     q75 = b_q75(x, ...),
##     max = b_max(x, ...)
##   )
## }
## <bytecode: 0x7fb6bc8fcba0>
## <environment: namespace:brolgar>
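The same "one row of summaries per key" idea can be sketched in base R on made-up data. This is not how features() is implemented; `five_num()` is a hypothetical stand-in for feat_five_num that returns the same five names.

```r
# Toy data, invented for illustration; not the real `heights` data.
toy <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  year      = rep(c(1900, 1950, 2000), times = 3),
  height_cm = c(165, 168, 171, 160, 162, 166, 170, 172, 175)
)

# Hypothetical stand-in for feat_five_num: five named summaries of a vector.
five_num <- function(x) {
  c(
    min = min(x),
    q25 = quantile(x, 0.25, names = FALSE),
    med = median(x),
    q75 = quantile(x, 0.75, names = FALSE),
    max = max(x)
  )
}

# Split the variable by key, summarise each piece, and bind the rows.
summaries <- do.call(
  rbind,
  lapply(split(toy$height_cm, toy$country), five_num)
)
print(summaries)

# "Interesting" keys: the ones with the largest and smallest maximum.
tallest  <- rownames(summaries)[which.max(summaries[, "max"])]
shortest <- rownames(summaries)[which.min(summaries[, "max"])]
```

Once the summaries sit in one row per key, finding interesting keys becomes an ordinary filtering problem.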
features() in brolgar
feat_ranges
heights %>% features(height_cm, feat_ranges)
## # A tibble: 153 x 5
##    country       min   max range_diff   iqr
##    <chr>       <dbl> <dbl>      <dbl> <dbl>
##  1 Afghanistan  161.  168.       7     3.27
##  2 Albania      168.  170.       2.20  1.53
##  3 Algeria      166.  171.       5.06  2.15
##  4 Angola       159.  169.      10.5   7.87
##  5 Argentina    167.  174.       7     2.21
##  6 Armenia      164.  172.       8.82  5.30
##  7 Australia    170   178.       8.4   2.58
##  8 Austria      162.  179.      17.2   5.35
##  9 Azerbaijan   170.  172.       1.97  1.12
## 10 Bahrain      161.  164        3.3   2.75
## # … with 143 more rows
feat_monotonic
heights %>% features(height_cm, feat_monotonic)
## # A tibble: 153 x 5
##    country     increase decrease unvary monotonic
##    <chr>       <lgl>    <lgl>    <lgl>  <lgl>
##  1 Afghanistan FALSE    FALSE    FALSE  FALSE
##  2 Albania     FALSE    TRUE     FALSE  TRUE
##  3 Algeria     FALSE    FALSE    FALSE  FALSE
##  4 Angola      FALSE    FALSE    FALSE  FALSE
##  5 Argentina   FALSE    FALSE    FALSE  FALSE
##  6 Armenia     FALSE    FALSE    FALSE  FALSE
##  7 Australia   FALSE    FALSE    FALSE  FALSE
##  8 Austria     FALSE    FALSE    FALSE  FALSE
##  9 Azerbaijan  FALSE    FALSE    FALSE  FALSE
## 10 Bahrain     TRUE     FALSE    FALSE  TRUE
## # … with 143 more rows
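One plausible reading of these four columns, sketched in base R. The definitions below are assumptions inferred from the column names and the output above, not brolgar's source code, and `monotonic_flags()` is a hypothetical helper.

```r
# Hypothetical helper; the definitions are assumptions, not brolgar's source.
monotonic_flags <- function(x) {
  d <- diff(x)                 # change between consecutive observations
  increase <- all(d > 0)       # every step goes up
  decrease <- all(d < 0)       # every step goes down
  unvary   <- all(d == 0)      # no step changes at all
  c(
    increase  = increase,
    decrease  = decrease,
    unvary    = unvary,
    monotonic = increase || decrease || unvary
  )
}

# Made-up series, for illustration only.
rising <- monotonic_flags(c(160, 162, 166))
mixed  <- monotonic_flags(c(170, 169, 171))
print(rbind(rising, mixed))
```

Under these assumed definitions a series that rises and then falls is flagged FALSE in every column, which is the pattern most countries show above.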
feat_spread
heights %>% features(height_cm, feat_spread)
## # A tibble: 153 x 5
##    country        var    sd   mad   iqr
##    <chr>        <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  7.20  2.68  1.65   3.27
##  2 Albania      0.950 0.975 0.667  1.53
##  3 Algeria      3.30  1.82  0.741  2.15
##  4 Angola      16.9   4.12  3.11   7.87
##  5 Argentina    2.89  1.70  1.36   2.21
##  6 Armenia     10.6   3.26  3.60   5.30
##  7 Australia    7.63  2.76  1.66   2.58
##  8 Austria     26.6   5.16  3.93   5.35
##  9 Azerbaijan   0.516 0.718 0.621  1.12
## 10 Bahrain      3.42  1.85  0.297  2.75
## # … with 143 more rows
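The four column names match standard base R functions. A minimal sketch on made-up values, assuming (from the column names only) that the spread features are the variance, standard deviation, median absolute deviation and interquartile range:

```r
# Values invented for illustration; one key's worth of heights.
x <- c(160, 162, 166, 171, 175)

spread <- c(
  var = var(x),   # sample variance
  sd  = sd(x),    # sample standard deviation
  mad = mad(x),   # median absolute deviation (scaled by 1.4826 by default)
  iqr = IQR(x)    # interquartile range
)
print(spread)
```

Any function that takes a vector and returns named values like this could be applied per key in the same way.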
feasts
Such as:
feat_acf: autocorrelation-based features
feat_stl: STL (Seasonal, Trend, and Remainder by LOESS) decomposition
Let's fit a simple mixed effects model to the data
Fixed effect of year + Random intercept for country
library(lme4)    # lmer()
library(modelr)  # add_predictions(), add_residuals()
heights_fit <- lmer(height_cm ~ year + (1|country), heights)
heights_aug <- heights %>%
  add_predictions(heights_fit, var = "pred") %>%
  add_residuals(heights_fit, var = "res")
## # A tsibble: 1,499 x 6 [!]
## # Key:       country [153]
##    country      year height_cm continent  pred    res
##    <chr>       <dbl>     <dbl> <chr>     <dbl>  <dbl>
##  1 Afghanistan  1870      168. Asia       164.  4.58
##  2 Afghanistan  1880      166. Asia       164.  1.51
##  3 Afghanistan  1930      167. Asia       166.  0.815
##  4 Afghanistan  1990      167. Asia       168. -1.05
##  5 Afghanistan  2000      161. Asia       169. -7.11
##  6 Albania      1880      170. Europe     168.  2.40
##  7 Albania      1890      170. Europe     168.  1.74
##  8 Albania      1900      169. Europe     168.  0.775
##  9 Albania      2000      168. Europe     172. -4.13
## 10 Algeria      1910      169. Africa     168.  1.28
## # … with 1,489 more rows
gg_heights_fit + facet_sample()
gg_heights_fit + facet_strata()
gg_heights_fit + facet_strata(along = -res)
set.seed(2019-11-13)  # evaluated as arithmetic, so the seed is 1995
heights_sample <- heights_aug %>%
  sample_n_keys(size = 9) %>%  # sample the data
  ggplot(aes(x = year, y = pred, group = country)) +
  geom_line() +
  facet_wrap(~country)
heights_sample
heights_sample +
  geom_point(aes(y = height_cm))  # add the original data
summary(heights_aug$res)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  -8.167  -1.614  -0.157   0.000   1.352  12.174
Which countries are nearest to these statistics?
keys_near()
heights_aug %>% keys_near(key = country, var = res)
## # A tibble: 5 x 5
##   country       res    stat  stat_value  stat_diff
##   <chr>       <dbl>    <fct>      <dbl>      <dbl>
## 1 Ireland     -8.17    min       -8.17  0
## 2 Vietnam     -1.61    q_25      -1.61  0.00000278
## 3 Mauritania  -0.157   med       -0.157 0
## 4 Sudan        1.35    q_75       1.35  0.000174
## 5 Poland      12.2     max       12.2   0
This shows us the keys that closely match the five number summary.
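The matching step can be sketched in base R on made-up residuals. This illustrates the idea only and is not brolgar's implementation of keys_near(); `nearest_to_stats()` is a hypothetical helper whose output columns loosely mirror the table above.

```r
# Toy residuals, invented for illustration; not the real model output.
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 3),
  res     = c(-8, 1, -2, 0.5, -0.2, 3, 1.4, -1, 12)
)

# Hypothetical helper: for each five number summary of `var`,
# find the single row whose value lies closest to that statistic.
nearest_to_stats <- function(data, key, var) {
  x <- data[[var]]
  stats <- c(
    min  = min(x),
    q_25 = quantile(x, 0.25, names = FALSE),
    med  = median(x),
    q_75 = quantile(x, 0.75, names = FALSE),
    max  = max(x)
  )
  rows <- vapply(stats, function(s) which.min(abs(x - s)), integer(1))
  data.frame(
    key        = as.character(data[[key]][rows]),
    value      = x[rows],
    stat       = names(stats),
    stat_value = unname(stats),
    stat_diff  = unname(abs(x[rows] - stats)),
    row.names  = NULL,
    stringsAsFactors = FALSE
  )
}

near <- nearest_to_stats(toy, key = "country", var = "res")
print(near)
```

The same key can appear more than once, as country "C" does here, because each statistic is matched independently.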
left_join(heights_near, heights_aug, by = "country") %>% ggplot(aes(x = year, y = pred, group = country, colour = stat)) + geom_line() + geom_point(aes(y = height_cm)) + facet_wrap(~country)
facet_sample() / facet_strata() to look at data
features to find interesting observations
End.
Now let's go through these same principles:
Which are most similar to which stats?
Anscombe's quartet