Data Wrangling Part 1: Basic to Advanced Ways to Select Columns
I went through the entire dplyr
documentation for a talk last week about
pipes, which resulted in a few “aha!” moments. I discovered and
re-discovered a few useful functions, which I wanted to collect in a few
blog posts so I can share them with others.
This first post will cover ordering, naming and selecting columns, it covers the basics of selecting columns and more advanced functions like select_all()
, select_if()
and shortcuts like everything()
.
Other posts in this series:
- Part 2: Transforming your columns into the right shape, which covers functions such as
mutate_all()
,mutate_if()
andmutate_at()
, but also ways to code into discrete columns withcase_when()
. - Part 3: Filtering rows, which includes tricks like the nifty
near()
functions and thefilter_all
/fitler_if
/filter_at
family. - Part 4: Summarising your data, which includes the
add_count()
shortcut, and thesummarise_all
/summarise_if
/summarise_at
family.
Content Part 1:
To note: All code will be presented as part of pipe even though hardly any
of them are a full pipe. In some cases I added a glimpse()
statement to allow you to
see the columns selected in the output tibble without printing all the
data every time.
The dataset
library(tidyverse)
#built-in R dataset
glimpse(msleep)
## Observations: 83
## Variables: 11
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
Selecting columns
Selecting columns: the basics
To select a few columns just add their names in the select statement. The order in which you add them, will determine the order in which they appear in the output.
msleep %>%
select(name, genus, sleep_total, awake) %>%
glimpse()
## Observations: 83
## Variables: 4
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Great...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, ...
If you want to add a lot of columns, it can save you some typing to have a good look at your data and see whether you can’t get to your selection by using chunks, deselecting or even deselect a column and re-add it straight after.
To add a chunk of columns use the start_col:end_col
syntax:
msleep %>%
select(name:order, sleep_total:sleep_cycle) %>%
glimpse
## Observations: 83
## Variables: 7
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Great...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos...
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi"...
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6...
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.383...
An alternative is to deselect columns by adding a minus sign in front of the column name. You can also deselect chunks of columns.
msleep %>%
select(-conservation, -(sleep_total:awake)) %>%
glimpse
## Observations: 83
## Variables: 6
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "...
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c...
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000...
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0...
It’s even possible to deselect a whole chunk, and then re-add a column
again.
The below sample code deselects the whole chunk from ID to pledged, but
re-adds column ‘conservation’, even though it was part of the deselected
chunk. This only works if you re-add it in the same select()
statement.
msleep %>%
select(-(name:awake), conservation) %>%
glimpse
## Observations: 83
## Variables: 3
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
`
Selecting columns based on partial column names
If you have a lot of columns with a similar structure you can use
partial matching by adding starts_with()
, ends_with()
or
contains()
in your select statement.
msleep %>%
select(name, starts_with("sleep")) %>%
glimpse
## Observations: 83
## Variables: 4
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Great...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6...
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.383...
msleep %>%
select(contains("eep"), ends_with("wt")) %>%
glimpse
## Observations: 83
## Variables: 5
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6...
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.383...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.0...
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.49...
Selecting columns based on regex
The previous helper functions work with exact pattern matches. If you
have similar patterns that are not entirely the same you can use any
regular expression inside matches()
.
The below sample code will add any column that contains an ‘o’, followed
by one or more other letters, and ‘er’.
#selecting based on regex
msleep %>%
select(matches("o.+er")) %>%
glimpse
## Observations: 83
## Variables: 2
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
Selecting columns based pre-identified columns
There is another option which avoids the continuous retyping of columns
names: one_of()
. You can set up column names upfront, and then refer
to them inside a select()
statement by either wrapping them inside
one_of()
or by using the !!
operator.
classification <- c("name", "genus", "vore", "order", "conservation")
msleep %>%
select(!!classification)
## # A tibble: 83 x 5
## name genus vore order conservation
## <chr> <chr> <chr> <chr> <chr>
## 1 Cheetah Acinonyx carni Carnivora lc
## 2 Owl monkey Aotus omni Primates <NA>
## 3 Mountain beaver Aplodontia herbi Rodentia nt
## 4 Greater short-tailed shrew Blarina omni Soricomorpha lc
## 5 Cow Bos herbi Artiodactyla domesticated
## 6 Three-toed sloth Bradypus herbi Pilosa <NA>
## 7 Northern fur seal Callorhinus carni Carnivora vu
## 8 Vesper mouse Calomys <NA> Rodentia <NA>
## 9 Dog Canis carni Carnivora domesticated
## 10 Roe deer Capreolus herbi Artiodactyla lc
## # ... with 73 more rows
`
Selecting columns by their data type
The select_if
function allows you to pass functions which return
logical statements. For instance you can select all string columns by
using select_if(is.character)
. Similarly, you can add is.numeric
,
is.integer
, is.double
, is.logical
, is.factor
.
If you have date columns, you can load the lubridate
package, and use
is.POSIXt
or is.Date
.
msleep %>%
select_if(is.numeric) %>%
glimpse
## Observations: 83
## Variables: 6
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1,...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6...
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.383...
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, ...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.0...
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.49...
You can also select the negation but in this case you will need to add a
tilde to ensure that you still pass a function to select_if
. The
select_all/if/at functions require that a function is passed as an
argument. If you have to add any negation or arguments, you will have to
wrap your function inside funs()
or add a tilde before to remake it a
function.
msleep %>%
select_if(~!is.numeric(.)) %>%
glimpse
## Observations: 83
## Variables: 5
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
Selecting columns by logical expressions
In fact, select_if
allows you to select based on any logical function,
not just based on data type. It is possible to select all columns with
an average above 500 for instance. To avoid errors you do have to also
select numeric columns only, which you can do either upfront for easier
syntax, or in the same line.
Similarly mean > 500
is not a function in itself, so you will need to
add a tilde upfront, or wrap it insie funs()
to turn the statement
into a function.
msleep %>%
select_if(is.numeric) %>%
select_if(~mean(., na.rm=TRUE) > 10)
or shorter:
msleep %>%
select_if(~is.numeric(.) & mean(., na.rm=TRUE) > 10)
## # A tibble: 83 x 3
## sleep_total awake bodywt
## <dbl> <dbl> <dbl>
## 1 12.1 11.9 50.0
## 2 17.0 7.00 0.480
## 3 14.4 9.60 1.35
## 4 14.9 9.10 0.0190
## 5 4.00 20.0 600
## 6 14.4 9.60 3.85
## 7 8.70 15.3 20.5
## 8 7.00 17.0 0.0450
## 9 10.1 13.9 14.0
## 10 3.00 21.0 14.8
## # ... with 73 more rows
Another useful function for select_if
is n_distinct()
, which counts
the amount of distinct values that can be found in a column.
To return the columns that have less than 20 distinct answers for
instance you pass ~n_distinct(.) < 20
within the select_if statement.
Given that n_distinct(.) < 20
is not a function, you will need to put
a tilde in front.
msleep %>%
select_if(~n_distinct(.) < 10)
## # A tibble: 83 x 2
## vore conservation
## <chr> <chr>
## 1 carni lc
## 2 omni <NA>
## 3 herbi nt
## 4 omni lc
## 5 herbi domesticated
## 6 herbi <NA>
## 7 carni vu
## 8 <NA> <NA>
## 9 carni domesticated
## 10 herbi lc
## # ... with 73 more rows
Re-ordering columns
You can use the select()
function (see below) to re-order columns. The
order in which you select them will determine the final order.
msleep %>%
select(conservation, sleep_total, name) %>%
glimpse
## Observations: 83
## Variables: 3
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
If you are just moving a few columns to the front, you can use
everything()
afterwards which will add all the remaining columns and
save a lot of typing.
msleep %>%
select(conservation, sleep_total, everything()) %>%
glimpse
## Observations: 83
## Variables: 11
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
Column names
Sometimes changes are necessary to column names in itself:
Renaming columns
If you will be using a select()
statement, you can rename straight in
the select
function.
msleep %>%
select(animal = name, sleep_total, extinction_threat = conservation) %>%
glimpse
## Observations: 83
## Variables: 3
## $ animal <chr> "Cheetah", "Owl monkey", "Mountain beaver", ...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0,...
## $ extinction_threat <chr> "lc", NA, "nt", "lc", "domesticated", NA, "v...
If you want to retain all columns and therefore have no select()
statement, you can rename by adding a rename()
statement.
msleep %>%
rename(animal = name, extinction_threat = conservation) %>%
glimpse
## Observations: 83
## Variables: 11
## $ animal <chr> "Cheetah", "Owl monkey", "Mountain beaver", ...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina"...
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "...
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Sorico...
## $ extinction_threat <chr> "lc", NA, "nt", "lc", "domesticated", NA, "v...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0,...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, N...
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667,...
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, ...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, N...
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850,...
Reformatting all column names
The select_all()
function allows changes to all columns, and takes a
function as an argument.
To get all column names in uppercase, you can use toupper()
, similarly
you could use tolower()
.
msleep %>%
select_all(toupper)
## # A tibble: 83 x 11
## NAME GENUS VORE ORDER CONSERVATION SLEEP_TOTAL SLEEP_REM SLEEP_CYCLE
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Cheet~ Acin~ carni Carn~ lc 12.1 NA NA
## 2 Owl m~ Aotus omni Prim~ <NA> 17.0 1.80 NA
## 3 Mount~ Aplo~ herbi Rode~ nt 14.4 2.40 NA
## 4 Great~ Blar~ omni Sori~ lc 14.9 2.30 0.133
## 5 Cow Bos herbi Arti~ domesticated 4.00 0.700 0.667
## 6 Three~ Brad~ herbi Pilo~ <NA> 14.4 2.20 0.767
## 7 North~ Call~ carni Carn~ vu 8.70 1.40 0.383
## 8 Vespe~ Calo~ <NA> Rode~ <NA> 7.00 NA NA
## 9 Dog Canis carni Carn~ domesticated 10.1 2.90 0.333
## 10 Roe d~ Capr~ herbi Arti~ lc 3.00 NA NA
## # ... with 73 more rows, and 3 more variables: AWAKE <dbl>, BRAINWT <dbl>,
## # BODYWT <dbl>
You can go further than that by creating functions on the fly: if you have messy column names coming from excel for instance you can replace all white spaces with an underscore.
#making an unclean database:
msleep2 <- select(msleep, name, sleep_total, brainwt)
colnames(msleep2) <- c("name", "sleep total", "brain weight")
msleep2 %>%
select_all(~str_replace(., " ", "_"))
## # A tibble: 83 x 3
## name sleep_total brain_weight
## <chr> <dbl> <dbl>
## 1 Cheetah 12.1 NA
## 2 Owl monkey 17.0 0.0155
## 3 Mountain beaver 14.4 NA
## 4 Greater short-tailed shrew 14.9 0.000290
## 5 Cow 4.00 0.423
## 6 Three-toed sloth 14.4 NA
## 7 Northern fur seal 8.70 NA
## 8 Vesper mouse 7.00 NA
## 9 Dog 10.1 0.0700
## 10 Roe deer 3.00 0.0982
## # ... with 73 more rows
Or in case your columns contain other meta-data like question numbers:
#making an unclean database:
msleep2 <- select(msleep, name, sleep_total, brainwt)
colnames(msleep2) <- c("Q1 name", "Q2 sleep total", "Q3 brain weight")
msleep2[1:3,]
## # A tibble: 3 x 3
## `Q1 name` `Q2 sleep total` `Q3 brain weight`
## <chr> <dbl> <dbl>
## 1 Cheetah 12.1 NA
## 2 Owl monkey 17.0 0.0155
## 3 Mountain beaver 14.4 NA
You can use select_all
in combination with str_replace
to get rid of
the extra characters.
msleep2 %>%
select_all(~str_replace(., "Q[0-9]+", "")) %>%
select_all(~str_replace(., " ", "_"))
## # A tibble: 83 x 3
## `_name` `_sleep total` `_brain weight`
## <chr> <dbl> <dbl>
## 1 Cheetah 12.1 NA
## 2 Owl monkey 17.0 0.0155
## 3 Mountain beaver 14.4 NA
## 4 Greater short-tailed shrew 14.9 0.000290
## 5 Cow 4.00 0.423
## 6 Three-toed sloth 14.4 NA
## 7 Northern fur seal 8.70 NA
## 8 Vesper mouse 7.00 NA
## 9 Dog 10.1 0.0700
## 10 Roe deer 3.00 0.0982
## # ... with 73 more rows
Row names to column
Some dataframes have rownames that are not actually a column in itself, like the mtcars dataset:
mtcars %>%
head
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
If you want this column to be an actual column, you can use the
rownames_to_column()
function, and specify a new column name.
mtcars %>%
tibble::rownames_to_column("car_model") %>%
head
## car_model mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1