Looking into a 'remote future' with R12 Jun 2018 —
Instead of using the term “computer nerd”, I always preferred to describe myself as a “computer jock” growing up1. After all, if there had been a programming team in high school, I felt like I probably would have made varsity.
But sometimes, athletes are limited by the equipment they have on hand. Cyclists can’t ride the Tour de France with tricycles, and my lab can’t run hundreds of thousands of simulations on our 2014 Macbooks. We’ve been feeling the crunch as of late, which is why we’ve begun exploring the various computing clusters available to us on campus.
This post is meant for two purposes: the first is to document an example of using remote clusters with R, and the second is to serve as instructions/reference for my lab members at Rochester.
Here’s what you’ll need, and what I’ll be covering.
- a remote computing cluster
- Here, I’ll be using the University of Rochester’s CS cluster
- a non-remote computer (duh)
- Since everybody in our lab uses Macs (yay UNIX!), this walkthrough will focus on the travails of using OS X
- R, and the packages
parallel, and if you don’t have
tidyversealready installed, shame on you
Step 1: Get
ssh-askpass working on your Mac
Long story short, when you try to connect to a remote computer via SSH, Mac OS X does not prompt for a password outside of Terminal (for security reasons). But you often need a password to log in, so in those cases that you do, you’ll be screwed. This is a way around that.
First, download and install ssh-askpass. If you don’t already have
brew installed, you probably should. It’s super useful. After you install them, open up a Terminal and type the following:
sudo brew services start theseal/ssh-askpass/ssh-askpass
After that, restart your computer.
Get some keys
You want to be able to authenticate yourself when you connect to a remote computer, and you need the magic of modern cryptology to do that. Make sure you have a place to store your keys first:
If there are keys, they are probably listed as
id_rsa_<something>. Whatever the case, a key should generally have two files: a file without an extension and another file with the same name but with an extension
.pub. If there isn’t a directory, create one with
mkdir ~/.ssh and move into it with
If you don’t have any keys, generate one with:
ssh-keygen -t rsa -b 2048
If this is your first key, don’t give it a name, just leave it as the default (which should be
id_rsa). Give it a password to protect it in case someone else gets their mitts on it. If there’s already a file called
id_rsa.pub, you’re fine.
This is where we started running into some problems with
ssh-askpass on some of our Macs. Run the command below to add the default key to your authentication agent. Add
-c as an argument if you want to be prompted for the password to it each time you use it.
If doing this gives you an error about connecting to an agent, first the following and then attempt it again:
eval `ssh-agent -s`
I don’t know why this works, but it does and stackexchange said to.
If you don’t want to keep getting prompted for your password every time you log in, after completing Step 2 below, follow these instructions.
Step 2: Log in to the remote cluster
For this, you’ll need access to the cluster, i.e., an account that can log in. @Labmates: get your PI to talk with the CS folks to get you an account. After that’s done, go pick up your login password from Jim Roche in the Wegmans building on campus.
After you have this, open up that Terminal again and connect with the following command:
For me, it would be
ssh firstname.lastname@example.org. If the username on your computer is the same as the cluster account username, I don’t think you need to add the
Type in your password when prompted, and log in. Make sure you say “yes” when it asks if you want to add the computer to the list of known hosts. @Labmates: I believe you have to be connected to the school’s network to log in to the cluster. Either connect via school wifi or use the VPN. Nope, it looks like you don’t!
Note: while you’re here, make sure the
future package is installed on whatever remote computer you’re trying to log into.
Step 3: Setting up R and
Now open up some RStudio or R in the Terminal. If you’re using R in the Terminal, you can probably skip the next subsection. If you’re using the “R Console” version of R (not in an actual terminal), welllllp… for some reason it always crashes on my computer when I try to make a remote connection, so good luck.
The exact workings of all this quickly went beyond my need to know, so I’ll just tell you what we did to get this to work with RStudio. Do the following chunk, making changes where needed.
Did that last line prompt you for your password? If so, hooray! It wasn’t that easy for me!
When I first tried this, I got the following error message:
ssh_askpass: exec(/usr/libexec/ssh-askpass): No such file or directory Permission denied, please try again. ssh_askpass: exec(/usr/libexec/ssh-askpass): No such file or directory Permission denied, please try again. ssh_askpass: exec(/usr/libexec/ssh-askpass): No such file or directory Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
After some research, the version of
ssh-askpass we installed seemed like it was supposed to deal with this problem. I’m running pretty old versions of OS X, R, and RStudio, however, so who knows. After much frustration, we found the following solution:
Hopefully that will get things working! You’ll probably have to set these paths every time you start a new R session with RStudio.
Remote processing with
When you’re running a lot of jobs, you want to have as many operate in parallel as possible. Also, when you’re actively interacting with data in R, you don’t want to have to wait until the last command is finished running. You can have things complete asynchronously with the
This thing is huge (both in terms of functionality and how much it can help) so read more about it here. In an incredibly small nutshell, instead of using
<-, you can make something a future (something that will hold a value in the “future”) by using
Let’s try it out:
Now let’s see what the value of
Doing that, we notice two things:
First, if we tried to access the value of
x right after we assigned it, R made us wait a while. That’s because making
x a future means that its value isn’t guaranteed to have been made yet. Our
foo function sleeps for 10 seconds before returning a value; so when we try to access the value of
x before that time, R makes us wait until it exists. But if we don’t care about accessing it immediately, we can go on and do other things, so long as those things don’t rely on knowing the value of
Second, you’ll notice that the value of
x is not your home directory on your local computer. If everything went right, it should be R’s default directory on your remote machine. This is because we did
plan(remote, workers = ...) earlier. Specifying
plan as “remote” tells R to make these futures be processed remotely on whatever computer we defined.
If you do it this way, you can run computationally intensive models and simulations on a beefier remote server, while you use your own personal computer to orchestrate everything. Wooo!
But you can also harness multiple remote computers to do your dirty work for you. If you’re running a bunch of simulations, you can make each run a certain subset. Remote, parallel computing!
Remote AND parallel!
This is where the
parellel package comes in. There’s a lot there, so look over the docs and do a bit of Googling to get a gist of how to use it. Theoretically you could do the following and you’d basically be done with the setup:
Unfortunately, this isn’t going to fly for what my labmates and I have to work with. First, the “cycle” hostnames are basically just gateways to the cluster and shouldn’t be used for intensive jobs, and secondly, we call these commands from our local computers, and–long story short–this sort of remote clustering only really works well when the remote clusters can connect back to you (you have to set up a socket connection)2.
Fortunately, we’re not screwed. @Labmates: there are two options you can do here. You can either
ssh into the cycle hosts, boot up R from the Terminal on these machines, and then run the cluster, or you can call:
But that’s kind of a whole ‘nother deal, so check that shizz out in my next blog post about it! (Gotta get those extra page-clicks or whatever.)
Also, special thanks to my advisor Florian Jaeger, who introduced me to the
future package, got the ball rolling, and got us connected to the cluster.