IMBI Freiburg
News:
  • 2009-08-03: Extensive snowfall tutorial + materials
  • 2009-June: R-Journal article about snowfall/sfCluster
  • 2008-12-23: snowfall 1.70 released
Parallel computing in R with sfCluster/snowfall
Overview
Although there are many working cluster solutions for R, all of them require the user to set up a cluster, connect to servers or perform other tasks that are not specific to R. sfCluster/snowfall makes it easier to run parallel R programs on MPI clusters: users can concentrate on their R code and are not forced to manage environments for parallel computing.
snowfall is an R package based on snow. It adds no further technical abstraction layer, but enriches the functionality to increase usability. It offers all snow cluster techniques, a sequential mode for "run anywhere", extended error checking and some convenience functions. It is also the connection to sfCluster, but can be used without it as well.
sfCluster is a Unix tool for managing and observing clusters. It automatically sets up clusters for the user and shuts them down after the run has finished. If something goes wrong during execution, sfCluster will notice and react. sfCluster is based on LAM/MPI (with a port to OpenMPI coming in the near future).
MPI, PVM, Socket, what?
There are several techniques to bind computers together for parallel computing; snow, and therefore snowfall, can use four of them:
  • Socket connection: the easiest, where everything is managed inside R. Socket connections run over direct TCP/IP connections and can therefore be used on virtually any machine. If you just want to use parallelization on one computer (laptop or workstation) or on very few machines, this is all you need. The biggest advantage is that you do not have to install additional software to use this kind of cluster.
  • MPI: Message Passing Interface. Basically a definition of a networking protocol. There are several different implementations today, of which openMPI is the most common and widely used. sfCluster uses the somewhat older LAM, but will support openMPI in the future as well. Open MPI home.
  • PVM: Parallel Virtual Machine. Most Unix distributions offer packages for PVM. PVM home.
  • NetWorkSpaces/NWS is a framework for coordinating programs written in scripting languages. It supports parallel computing with its Sleight mode. NetWorkSpaces for R.
If you are not using Socket clusters, you have to install and configure the chosen cluster solution first; please consult your local administrator. If you do not know what this is all about, you are most likely fine starting with Socket clusters.
Get started
SSH access without password

Most of the cluster techniques need secure shell (SSH) connections. As these normally require a password to be typed, you should set up access without a password on the machines you want to use (even if it is only your local machine).

A description of how to set this up can be found here.

Installing snowfall

First, install snow and snowfall on any machine you want to use in your cluster:

jo@biom7:~$ R --no-save
R version 2.7.2 (2008-08-25)
[...]
> install.packages( c( "snow", "snowfall" ) )

If you want to use more than one machine or a cluster installation that may already exist in your institute (like MPI or PVM), the required R packages are most likely installed already (if you are not sure about the cluster techniques available, please consult your administrators). If the needed R packages are not present, they have to be installed as well.

Additionally, you probably want to install rlecuyer and/or rsprng for network-enabled random number generators.

Now you are ready for a first test. We use the Socket cluster type, as this should run anywhere:

jo@biom7:~$ R --no-save
R version 2.7.2 (2008-08-25)
[...]
> library( snowfall )
Loading required package: snow
> sfInit( parallel=FALSE )
Forced to sequential mode.
snowfall 1.60 initialized (parallel=FALSE, CPUs=1)
> sfLapply( 1:4, exp )
[[1]]
[1] 2.718282
[[2]]
[1] 7.389056
[[3]]
[1] 20.08554
[[4]]
[1] 54.59815
> sfStop()
Stopping cluster
> sfInit( parallel=TRUE, cpus=2, type="SOCK" )
Forced parallel. Using session: XXXXXXXXR_jo_135451_103008
JOB STARTED AT Thu Oct 30 13:54:52 2008 ON biom9 (OSLinux) 2.6.18-6-686-bigmem
R Version: R version 2.7.2 (2008-08-25)
snowfall 1.60 initialized (parallel=TRUE, CPUs=2)
> sfLapply( 1:4, exp )
[[1]]
[1] 2.718282
[[2]]
[1] 7.389056
[[3]]
[1] 20.08554
[[4]]
[1] 54.59815
> sfStop()
Stopping cluster
> q()

In the first call, we are running in sequential mode, which means the program runs on just one CPU (like any "normal" R program). This demonstrates that you do not have to change your snowfall-using programs even if they are run on a machine without parallel computing capabilities. The call sfLapply is the parallel equivalent of the R function lapply.

We stop snowfall afterwards (sfStop()) and reinitialise it to run in parallel on 2 CPUs using the Socket cluster type (sfInit( parallel=TRUE, cpus=2, type="SOCK" )). Afterwards, the list function runs on two CPUs, with each CPU calculating two of the four list elements.
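To see which worker handles which list index, each call can report the process ID of the R instance it runs on. This is only a minimal sketch, assuming snow and snowfall are installed and a local Socket cluster can be started; the variable name pids is purely illustrative.

```r
library(snowfall)

# Start two local workers using the Socket cluster type.
sfInit(parallel = TRUE, cpus = 2, type = "SOCK")

# Every element is mapped to the process ID of the worker that handled it.
pids <- sfSapply(1:4, function(i) Sys.getpid())

# Group the list indices by worker to see how the work was divided.
print(split(1:4, pids))

sfStop()
```

The exact process IDs differ from run to run; what matters is how the four indices are grouped.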

The complete list of functions and options is given in the snowfall help files (help( snowfall )) and in the more detailed help files, e.g. for initialisation (help( sfInit )), parallel calculations (help( sfLapply )) or tools (help( sfLibrary )).
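Real programs usually call their own functions and use their own data, which the workers do not know about until they are told. The following sketch shows one way to do this with sfExport and sfLibrary. The names scale_factor and scaled_exp are hypothetical and only serve the illustration, and the stats package merely stands in for whatever package your code needs.

```r
library(snowfall)

# Hypothetical example objects: a function that depends on a global variable.
scale_factor <- 10
scaled_exp <- function(x) scale_factor * exp(x)

sfInit(parallel = TRUE, cpus = 2, type = "SOCK")

# Make the variable and the function known on every worker.
sfExport("scale_factor", "scaled_exp")

# Load a package on every worker (stats is only a placeholder here).
sfLibrary(stats)

# Parallel counterpart of sapply(); returns a numeric vector.
result <- sfSapply(1:4, scaled_exp)
print(result)

sfStop()
```

With sfInit( parallel=FALSE ) the same script should run unchanged in sequential mode, which is convenient for debugging.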

sfCluster
The installation of sfCluster is a straightforward Unix installer. A short description can be found here. The Debian package and the latest version of sfCluster are coming very soon...
FAQ (frequently asked questions)
FAQ subsections: [General] [snowfall] [sfCluster]
General
  1. I just want to run parallel programs on my multicore laptop/workstation/PC. What do I need?
    You only need to install the R packages snow and snowfall. No further software is needed. sfCluster or any other workload/management solution is not needed.
  2. Using Socket or MPI clusters, I am asked for SSH passwords (on Unix). How do I get rid of these?
    Secure Shell (SSH) can be configured not to ask for passwords in most installations. This document describes how to get rid of the passwords on login.
  3. I am using snow. Is there a reason to change to snowfall?
    As snowfall adds no technical abstraction above snow and mostly provides convenience functions, you gain no new technical capabilities. But thanks to better error control, some convenience functions and a cleaner API, snowfall may be easier for people without deeper knowledge of computer networks.
snowfall
  1. Does snowfall run on Microsoft Windows?
    Yes. But there was an error in snowfall before 1.70 that prevented it from working correctly (shame on me!). If you want to run snowfall on Windows, be sure to use version 1.70 or later.
  2. Are the results from sequential and parallel runs absolutely identical?
    Yes. But beware if your parallel functions use random numbers. Use sfClusterSetupRNG() for reproducible results when running on the same number of CPUs. If you really want identical results for sequential execution and for parallel runs on different numbers of CPUs, you have to use some tricks (more documentation to come).
  3. Are worker/slave processes spawned on every sfClusterApply/sfLapply etc. call?
    No. Worker R instances are only created on the sfInit() call and reused on every parallel call. New workers are spawned only if you call sfStop() and afterwards reinitialise with sfInit().
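The advice on random numbers in question 2 can be tried out with a short sketch. It assumes that the rlecuyer package is installed on the master and on all workers, and that sfClusterSetupRNG() passes the seed argument through to snow's RNGstream setup; the helper names draw and run_once are hypothetical.

```r
library(snowfall)

# Hypothetical helper: draws n normally distributed random numbers.
draw <- function(n) rnorm(n)

# Hypothetical helper: one complete parallel run with a fixed seed.
run_once <- function() {
  sfInit(parallel = TRUE, cpus = 2, type = "SOCK")
  # Assumption: rlecuyer is available; the seed is a vector of six integers.
  sfClusterSetupRNG(type = "RNGstream", seed = rep(12345, 6))
  out <- sfLapply(rep(3, 4), draw)
  sfStop()
  out
}

first  <- run_once()
second <- run_once()

# Both runs use the same seed and the same number of CPUs.
print(identical(first, second))
```

Changing cpus between the two runs changes how the streams are assigned to the work, so the results are then no longer expected to match.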
sfCluster
  1. sfCluster only runs on LAM/MPI. Why?
    Well, this is a "historical" decision. When we started, we only had an internal solution in mind, but it grew so fast that we released the program. We decided in favour of LAM, as it features very handy resource management, which is possible with openMPI only with additional programs. Support for openMPI is coming in the near future.
  2. sfCluster requires root (administrator) rights to install. Why?
    This is true, but it is possible to install it without root rights. The reason is not sfCluster itself, but the installation of the required Perl libraries and the R additions (like R CMD par), as these need root rights.
    How to install sfCluster without root rights:
    The latter you can get rid of using the --disable-rwrapper option on the configure call; the former is a bit harder to manage. Sadly, CPAN (the Perl counterpart of CRAN) cannot easily be modified to install packages in local directories. You can work around that by changing CPAN's installation path: Perlmonks article. Once that is done, sfCluster can be installed without root rights. By default, however, the Perl problem remains, and therefore the installation needs root rights.
Download
  • R package snowfall (1.70, 2008-12-23): Unix package (tar.gz), Windows package (Zip)
    snowfall is also available through CRAN: snowfall on CRAN. That page also gives direct access to the help files and vignette of snowfall.
    Information about the latest changes.
  • sfCluster sfCluster.tar.gz (0.4beta, 2008-06-23)
    Small installation documentation: README.txt
  • sfCluster installation packages for Debian distribution, additional documentation and tutorials are coming soon...
Documentation
For snowfall, please consult the vignette and the help files. For sfCluster there is some help inside the package, and you can call sfCluster --help. A tutorial will be available soon.
Packages using snowfall
  • The package peperr by Christine Porzelius is designed for prediction error estimation through resampling techniques and uses snowfall for automatically enhancing speed if a working cluster is accessible.
Talks, tutorials and articles
This year Jochen Knaus and Christine Porzelius held a tutorial about parallel computing using snowfall at the workshop in Günzburg. The tutorial contains exercises and all additional information needed, and can be used for teaching or self-learning.
Passwordless Secure Shell (SSH) login

Login using Secure Shell (SSH) is common on *nix systems. Normally you need to type your password on login, which is quite inconvenient for cluster computing. This section describes how to get rid of the password typing on SSH logins.

Please note: we only cover OpenSSH version 2 in this description. The very old version 1 is not described. You can find brief information about it e.g. on this site.

All SSH data for your account are stored in the directory ~/.ssh ("~" means: your home directory, most likely something like /home/yourlogin).

For login without password, the SSH program needs something to identify you. This is done via a pair of encrypted files: a private and a public key. First you have to create this pair of files with the command:

ssh-keygen -t rsa

The suggested location is most likely correct; the command should offer your SSH data directory (i.e. the directory ~/.ssh, see above). If asked for a passphrase, do not enter one! (If you did, you would have to type the passphrase on each login, which would most likely not be a real improvement over typing your password.)

The command will create two files (id_rsa and id_rsa.pub) in the SSH data directory. The first file contains your private key, the second file contains your public key. These files identify you even without a password. You should adjust the permissions of these two files; first change into the SSH data directory:

cd ~/.ssh

As your private key is the "opener" for the public key, access to it has to be restricted to yourself:

chmod go-rwx id_rsa     # restrict access to private key
chmod a+r id_rsa.pub    # everybody may access your public key

The public key file should now be appended to the file authorized_keys, which contains the keys that are allowed to connect without a password:

cat id_rsa.pub >> authorized_keys

Please use two ">", as this will append the content of the file id_rsa.pub to the file authorized_keys. One ">" would overwrite it.

Now, you can connect without password to your local machine:

ssh localhost

The generated public key has to be appended to the file authorized_keys on every machine you want to connect to without a password.

History/Changes snowfall
  • snowfall 1.70 (2008-12-23):
    Due to many bugfixes an update is strongly recommended!
    • Windows initialisation error fixed: snowfall now works correctly and creates any temporary directories it needs in the user's home (for example on Windows Server 2003: Documents and Settings/USER/My Documents/.sfCluster).
    • NWS startup works correctly now (thanks to M. Schmidtberger for the little patch).
    • sfSapply is now working as intended.
    • New default is no logging of slave/worker output. This can be changed via the new argument slaveOutfile on the sfInit call or the command line argument --tmpdir. If sfCluster is used, everything is always set correctly.
    • Restore function of sfClusterApplySR can now be globally set using the new argument restore on sfInit call. This is equal to the command line argument --restoreSR, which can now also be simply written as --restore.
    • Bugfix for sfSource() in sequential mode.
    • Changing CPU amount during runtime (with multiple sfInit() calls with different settings in a single program) is now possible using socket and NWS clusters.
    • Many messages reworked and with changed behavior (many sfCluster depending warnings are now not displayed if used without it). Some typos are corrected.
    • sfCat() is now using argument master, too.
    • Package vignette is slightly corrected and extended.
    • Several code improvements without user relevance.
    Sadly, the reworking of the internal snowfall configuration (moving from global to namespace) is not finished yet. This causes several warnings (but no errors!) during the package check.
  • snowfall 1.60:
    • snowfall now supports all snow featured cluster types: Socket, MPI, PVM and nws. Type can be changed during initialisation (on sfInit() call).
      Please note: in parallel mode, the default is now the Socket cluster (previously MPI)!
    • Larger extensions to the help and the vignette.
    • snowfall's settings can be changed on the command line, even without sfCluster. This worked in the previous version as well, but is extended now:
      jo@biom9:~$ R --no-save --args --cpus=4 --parallel --type=SOCK
      > library(snowfall)
      Loading required package: snow
      > sfInit()
      [...]
      snowfall 1.60 initialized (parallel=TRUE, CPUs=4)
      > sfType()
      [1] "SOCK"
      > sfCpus()
      [1] 4
      > sfParallel()
      [1] TRUE
  • snowfall 1.53 (2008-08-22):
    • Only one change: if a snowfall function is called without having a call to sfInit() first, sfInit() will be called without parameters. Useful with sfCluster, where sfInit() is always called without parameters:
      jo@biom9:~$ sfCluster -i --cpus=4
      Session-ID : 0T3NyYGN_R
      [...]
      ASSIGNED 4 cpus on 1 machines (4 requested).
      -- /usr/local/bin/sfCluster: START R-interactive session --
      Rechner: biom9
      > library(snowfall)
      > sum(unlist(sfLapply(1:10, exp)))
      4 slaves are spawned successfully. 0 failed.
      [...]
      R Version: R version 2.5.1 (2007-06-27)
      snowfall 1.53 initialized (parallel=TRUE, CPUs=4)
      Warning message:
      Calling to snowfall function without calling sfInit. Calling sfInit() now. in: sfCheck()
      [1] 34843.77
      > q()
Contact
Feel free to ask questions or send comments to Jochen Knaus: jo[at]imbi.uni-freiburg.de
IMBI Freiburg, Germany
2008/12/16, jo.