The Trouble with Tibbles
Published: January 8, 2018
Let’s get something straight: there isn’t really any trouble with tibbles. I’m hoping you’ve noticed this is a play on the 1967 Star Trek episode, “The Trouble with Tribbles”. I’ve recently got myself a job as a Data Scientist here at Jumping Rivers. Having never come across tibbles until this point, I now find myself using them in nearly every R script I compose, be that your timeless standard R script, your friendly Shiny app or an analytical Markdown document.
What are tibbles?
Presumably this is why you came here, right?
Tibbles are a modern take on data frames, but crucially they are still data frames. Well, what’s the difference then? There’s a quote I found somewhere on the internet that describes the difference quite well:
“keeping what time has proven to be effective, and throwing out what is not”.
Basically, some clever people took the classic data.frame(), shook it until the ineffective parts fell out, then added some new, more appropriate features.
Precursors
# The easiest way to get access is to install the tibble package.
install.packages("tibble")
# Alternatively, tibbles are a part of the tidyverse and hence
# installing the whole tidyverse will give you access.
install.packages("tidyverse")
# I am just going to use tibble.
library("tibble")
Tribblemaking
There are three ways to form a tibble, and each pretty much acts as your friendly old pal data.frame() does. Just like standard data frames, we can create tibbles, coerce objects into tibbles and import data sets into R as tibbles. Below is a table of the traditional data.frame() commands and their respective {tidyverse} commands.
Formation Type | Data Frame Commands | Tibble Commands |
---|---|---|
Creation | data.frame() | data_frame(), tibble(), tribble() |
Coercion | as.data.frame() | as_data_frame(), as_tibble() |
Importing | read.*() | read_delim(), read_csv(), read_csv2(), read_tsv() |
Let’s take a closer look…
1) Creation.
Just as data.frame() creates data frames, tibble(), data_frame() and tribble() all create tibbles.
Standard data frame.
data.frame(a = 1:5, b = letters[1:5])
## a b
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
A tibble using tibble() (identical to using data_frame()).
tibble(a = 1:5, b = letters[1:5])
## # A tibble: 5 x 2
## a b
## <int> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
A tibble using tribble().
tribble( ~a, ~b,
#---|----
1, "a",
2, "b")
## # A tibble: 2 x 2
## a b
## <dbl> <chr>
## 1 1.00 a
## 2 2.00 b
Notice the odd one out? tribble() is different. It’s a way of laying out small amounts of data in an easy-to-read form. I’m not too keen on it, as even writing out that simple 2 x 2 tribble got tedious.
2) Coercion.
Just as as.data.frame() coerces objects into data frames, as_data_frame() and as_tibble() coerce objects into tibbles.
df = data.frame(a = 1:5, b = letters[1:5])
as_data_frame(df)
## # A tibble: 5 x 2
## a b
## <int> <fct>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
as_tibble(df)
## # A tibble: 5 x 2
## a b
## <int> <fct>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
You can coerce more than just data frames, too. Objects such as lists, matrices, vectors and single instances of class are convertible.
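As a rough sketch of what that looks like, the snippet below coerces a named list and a matrix. The object names (my_list, my_matrix and friends) are made up for illustration, and the matrix is given column names up front on the assumption that newer versions of {tibble} may complain about unnamed columns.

```r
library("tibble")

# A named list: each element becomes a column with the same name.
my_list = list(a = 1:3, b = letters[1:3])
list_tib = as_tibble(my_list)

# A matrix with column names: each matrix column becomes a tibble column.
my_matrix = matrix(1:6, ncol = 2, dimnames = list(NULL, c("x", "y")))
matrix_tib = as_tibble(my_matrix)

# Both results are tibbles with 3 rows and 2 columns.
dim(list_tib)
dim(matrix_tib)
```

Either way you end up with a regular tibble, so everything below about printing and subsetting applies to the result.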
3) Importing.
There are a few options for reading in data files within the {tidyverse}, so we’ll just compare read_csv() and its data.frame() pal, read.csv(). Let’s take a look at them. I have here an example data set that I’ve created in MS Excel. You can download/look at this data here. To get access to read_csv() you’ll need the {readr} package. Again, this is part of the {tidyverse}, so either will do.
library("readr")
url = "https://gist.githubusercontent.com/theoroe3/8bc989b644adc24117bc66f50c292fc8/raw/f677a2ad811a9854c9d174178b0585a87569af60/tibbles_data.csv"
tib = read_csv(url)
## Parsed with column specification:
## cols(
## `<-` = col_integer(),
## `8` = col_integer(),
## `%` = col_double(),
## name = col_character()
## )
tib
## # A tibble: 4 x 4
## `<-` `8` `%` name
## <int> <int> <dbl> <chr>
## 1 1 2 0.250 t
## 2 2 4 0.250 h
## 3 3 6 0.250 e
## 4 4 8 0.250 o
df = read.csv(url)
df
## X.. X8 X. name
## 1 1 2 0.25 t
## 2 2 4 0.25 h
## 3 3 6 0.25 e
## 4 4 8 0.25 o
Not only does read_csv() return a pretty tibble, it is also much faster. For proof, check out this article by Erwin Kalvelagen. The keen-eyed amongst you will have noticed something odd about the variable names… we’ll get on to that soon.
Tibbles vs Data Frames
Did you notice a key difference in the tibble()s and data.frame()s above? Take a look again.
tibble(a = 1:26, b = letters)
## # A tibble: 26 x 2
## a b
## <int> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## # ... with 21 more rows
The first thing you should notice is the pretty print process. The class of each column is now displayed above it and the dimensions of the tibble are shown at the top. By default, a tibble only displays its first 10 rows if it has more than 20 rows (I’ve changed mine to display 5 rows). Neat. Alongside that, we now only view the columns that fit on the screen. This is already looking quite the part. The row settings can be changed via
options(tibble.print_max = 3, tibble.print_min = 1)
So now, if there are more than 3 rows, we print only 1 row. Tibbles with 3 and 4 rows would now print as so.
tibble(1:3)
## # A tibble: 3 x 1
## `1:3`
## <int>
## 1 1
## 2 2
## 3 3
tibble(1:4)
## # A tibble: 4 x 1
## `1:4`
## <int>
## 1 1
## # ... with 3 more rows
Yes, OK, you could do this with the traditional data frame. But it would be a lot more work, right?
As well as the fancy printing, tibbles don’t drop the variable type, don’t partial match and allow non-syntactic column names when importing data. We’re going to use the data from before. Again, it is available here. Notice it has 3 non-syntactic column names and one column of characters. Reading this in as a tibble and as a data frame we get
tib
## # A tibble: 4 x 4
## `<-` `8` `%` name
## <int> <int> <dbl> <chr>
## 1 1 2 0.250 t
## 2 2 4 0.250 h
## 3 3 6 0.250 e
## 4 4 8 0.250 o
df
## X.. X8 X. name
## 1 1 2 0.25 t
## 2 2 4 0.25 h
## 3 3 6 0.25 e
## 4 4 8 0.25 o
We see already that in the read.csv() process we’ve lost the original column names. Let’s try some partial matching…
tib$n
## Warning: Unknown or uninitialised column: 'n'.
## NULL
df$n
## [1] t h e o
## Levels: e h o t
With the tibble we get a warning and NULL, yet with the data frame it leads us straight to our name variable. To read more about why partial matching is bad, check out this thread.
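If you don’t have the csv to hand, the same behaviour can be sketched with a small made-up data set. The names demo_df and demo_tib are just placeholders for this example.

```r
library("tibble")

# Hypothetical stand-in for the imported data: a single column called name.
demo_df = data.frame(name = c("t", "h", "e", "o"), stringsAsFactors = FALSE)
demo_tib = as_tibble(demo_df)

# The data frame partially matches n to name and returns the column.
demo_df$n

# The tibble refuses to guess: it warns and returns NULL.
demo_tib$n

# Spelling the column name out in full works for both.
demo_tib$name
```

The point is that a typo or a half-typed column name fails loudly with a tibble, rather than quietly handing back whichever column happens to match.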
What about subsetting? Let’s try it out using the data from our csv file.
tib[,2]
## # A tibble: 4 x 1
## `8`
## <int>
## 1 2
## 2 4
## 3 6
## 4 8
tib[2]
## # A tibble: 4 x 1
## `8`
## <int>
## 1 2
## 2 4
## 3 6
## 4 8
df[,2]
## [1] 2 4 6 8
df[2]
## X8
## 1 2
## 2 4
## 3 6
## 4 8
Using a normal data frame, single square brackets give us a vector from df[,2] and a data frame from df[2]. Using tibbles, single square brackets, [, will always return another tibble. Much neater. Now for double brackets.
tib[[1]]
## [1] 1 2 3 4
tib$name
## [1] "t" "h" "e" "o"
df[[1]]
## [1] 1 2 3 4
df$name
## [1] t h e o
## Levels: e h o t
Double square brackets, [[, and the traditional dollar, $, are ways to access individual columns as vectors. Now, with tibbles, we have separate operations for data frame operations and single column operations, so we don’t have to use that pesky drop = FALSE. Note, these are actually quicker than the [[ and $ of the data.frame(), as shown in the documentation for the tibble package.
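To make the drop = FALSE point concrete, here is a minimal sketch using a small made-up data set (demo_df and demo_tib are placeholder names, not objects from earlier in the post).

```r
library("tibble")

demo_df = data.frame(x = 1:3, y = c("a", "b", "c"))
demo_tib = as_tibble(demo_df)

# Data frame: selecting one column with [, j] drops down to a vector...
is.data.frame(demo_df[, 1])
# ...unless we remember to add drop = FALSE.
is.data.frame(demo_df[, 1, drop = FALSE])

# Tibble: [ gives back a tibble, while [[ gives back the column as a vector.
inherits(demo_tib[, 1], "tbl_df")
is.data.frame(demo_tib[[1]])
```

With the tibble, the kind of object you get back depends only on which operator you used, not on how many columns you happened to select.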
At last, no more strings as factors! Upon reading the data in, tibbles recognise strings as strings, not factors. For example, with the name column in our data set.
class(df$name)
## [1] "factor"
class(tib$name)
## [1] "character"
I quite like this, it’s much easier to turn a vector of characters into factors than vice versa, so why not give me everything as strings? Now I can choose whether or not to convert to factors.
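Going from strings to factors when you do want them is a one-liner. A quick sketch, again with a made-up tibble (demo_tib is a placeholder name):

```r
library("tibble")

demo_tib = tibble(name = c("t", "h", "e", "o"))

# The column starts life as a character vector.
class(demo_tib$name)

# Convert it to a factor only when a factor is actually needed.
demo_tib$name = factor(demo_tib$name)
class(demo_tib$name)
levels(demo_tib$name)
```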
Disadvantages
This won’t be long, there’s only one. Some older packages don’t work with tibbles because of their alternative subsetting method. They expect tib[, 1] to return a vector, when in fact it will now return another tibble. Until this functionality is added you must convert your tibble back to a data frame using as.data.frame(), the base R counterpart of the coercion functions discussed previously. Whilst adding this functionality will give users the chance to use packages with both tibbles and normal data frames, it of course puts extra work on the shoulders of package writers, who now have to change every package to be compatible with tibbles. For more on this discussion, see this thread.
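Here is a sketch of how that problem shows up and how converting with base R’s as.data.frame() gets around it. The function legacy_first_col_mean() is hypothetical, standing in for code in an older package that assumes d[, 1] is a plain vector.

```r
library("tibble")

# Hypothetical "legacy" function: it assumes d[, 1] returns a vector.
legacy_first_col_mean = function(d) {
  mean(d[, 1])
}

demo_tib = tibble(x = c(1, 2, 3), y = c("a", "b", "c"))

# With a tibble, d[, 1] is still a tibble, so mean() warns and returns NA.
suppressWarnings(legacy_first_col_mean(demo_tib))

# Converting back to a base data frame first restores the old behaviour.
legacy_first_col_mean(as.data.frame(demo_tib))
```

The conversion is cheap, so wrapping the tibble at the point where it is handed to the older function is usually enough.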
To summarise…
So, most of the things you can accomplish with tibbles, you can accomplish with data.frame(), but it’s a bit of a pain. Simple things like checking the dimensions of your data or converting strings to factors are small jobs. Small jobs that take time. With tibbles they take no time. Tibbles force you to look at your data earlier and confront the problems earlier, ultimately leading to cleaner code.
Thanks for chatting!