Saturday, December 17, 2011

Use kramdown + sed + syntaxhighlighter on Blogger.com

If you use kramdown and Blogger.com,
you may want to use the following shell script to get
syntaxhighlighter working with your
kramdown output:


#!/bin/sh
kramdown "$1" | sed -e 's/^\(<pre class="brush:.*.">\)\(<code>\)/\1\n/g' | sed -e 's/^\(<\/code>\)\(<\/pre>\)/\2/g'
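To see what the two sed stages do without running kramdown, here is a minimal sketch that feeds them a hand-written sample. The function name strip_code_tags and the sample input are made up for illustration, and real kramdown output may look different. The "\n" in the replacement assumes GNU sed.

```shell
#!/bin/sh
# strip_code_tags: the two sed stages from the script above, reading
# HTML on stdin. The "\n" in the replacement assumes GNU sed.
strip_code_tags() {
    sed -e 's/^\(<pre class="brush:.*.">\)\(<code>\)/\1\n/g' |
        sed -e 's/^\(<\/code>\)\(<\/pre>\)/\2/g'
}

# Hand-written sample input (an assumption, not real kramdown output).
sample='<pre class="brush: r;"><code>x = 1
</code></pre>'

result=$(printf '%s\n' "$sample" | strip_code_tags)
printf '%s\n' "$result"
```

With GNU sed, the opening pre tag should end up on its own line and both code tags should be gone, which is the layout syntaxhighlighter expects.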

R + condor + git (1)

For computational statisticians, getting code onto a cluster can be a frustrating task.


In this situation, if you are a UNIX user, these tools might save you some time:


Here we use an R task on a Condor cluster as an example.


Rscript file test.R:


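#!/usr/bin/env Rscript
## The last command-line argument is the condor process number.
args = commandArgs()
node = args[length(args)]
load("funs.RData")
## load() returns the names of the restored objects, not the objects,
## so fetch the object with get(). We assume the file stores one list.
## node is a character string, so [[node]] looks up an element by name.
data.name = load("data_file.RData")
data.pros = get(data.name[1])[[node]]
data.result = funct1(data.pros)
## sep="" avoids a space inside the output file name.
save(data.result, file=paste("output_RData_of_node_", node, sep=""))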

The condor file test.condor:


executable = test.R
universe = vanilla
Requirements = ParallelSchedulingGroup == "stats group"
 
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

arguments = $(Process) 
output    = cluster1-$(Process).Rout
error     = cluster1-$(Process).err
log       = cluster1.log

transfer_input_files = funs.RData, data_file.RData

Queue 100
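A wrong name in transfer_input_files is easy to miss until the jobs fail on the nodes. Here is a rough sketch of a check to run before submitting. The function name check_inputs is made up, and it assumes a single lowercase transfer_input_files line and file names without spaces.

```shell
#!/bin/sh
# check_inputs SUBMIT_FILE: verify that every file listed on the
# transfer_input_files line exists relative to the current directory.
# Assumes one lowercase "transfer_input_files = a, b" line and
# file names without whitespace.
check_inputs() {
    submit_file=$1
    missing=0
    # Keep only the text after "=", then turn commas into spaces.
    files=$(sed -n 's/^[[:space:]]*transfer_input_files[[:space:]]*=//p' \
                "$submit_file" | tr ',' ' ')
    for f in $files; do
        if [ ! -e "$f" ]; then
            echo "missing input file: $f" >&2
            missing=1
        fi
    done
    return $missing
}
```

You could then run something like: check_inputs test.condor && condor_submit test.condor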

In the old days, we had to go through endless write->upload->test->modify->upload->test loops. Now, with git, we can have a different fate.


First, confirm that all of the following conditions are satisfied:


  • The cluster has ssh installed on it;
  • The cluster has git installed on it;
  • The cluster has condor installed on it;
  • Your account has sufficient disk quota on the cluster.

Second, create a folder on the cluster, which will be used as the working directory of the condor task, and a directory for the git repository:


ssh your_account@cluster "mkdir -p git_repos/$TASK_NAME.git"
ssh your_account@cluster "mkdir -p workspace/$TASK_NAME"
ssh your_account@cluster "cd git_repos/$TASK_NAME.git; git init --bare"
ssh your_account@cluster "cd git_repos/$TASK_NAME.git; git-init --bare" ## if you use an old version of git

On the local machine, add the following three lines to your $TASK_NAME/.git/config:


[remote "cluster"]
 url = ssh://cluster/home/your_account/git_repos/$TASK_NAME.git
 fetch = +refs/heads/*:refs/remotes/cluster/*

On the cluster, edit git_repos/$TASK_NAME.git/hooks/post-receive (and make sure the file is executable):


export GIT_WORK_TREE=/home/your_account/workspace/$TASK_NAME
git checkout -f 
git-checkout -f ## use this instead if the cluster uses an older version of git

Then simply run:
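Git silently skips a hook file that is not executable, so it can help to create the hook from a script. Here is a minimal sketch; the function name write_hook and its two arguments are made up for illustration, and it assumes a git new enough to provide "git checkout".

```shell
#!/bin/sh
# write_hook HOOK_PATH WORK_TREE: create a post-receive hook that checks
# the pushed files out into WORK_TREE.
write_hook() {
    hook_path=$1
    work_tree=$2
    # The heredoc is unquoted, so $work_tree is expanded now and the
    # hook file contains the literal path.
    cat > "$hook_path" <<EOF
#!/bin/sh
export GIT_WORK_TREE="$work_tree"
git checkout -f
EOF
    # git ignores hooks that are not executable.
    chmod +x "$hook_path"
}
```

On the cluster this might be called as: write_hook "$HOME/git_repos/$TASK_NAME.git/hooks/post-receive" "$HOME/workspace/$TASK_NAME"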


git push cluster master
git-push cluster master ## if you use an older version of git
ssh your_account@cluster "cd workspace/$TASK_NAME/; condor_submit test.condor"

Then check your outputs.

Tuesday, November 08, 2011

MKL for Debian

After testing the Intel MKL, I found a way to make it work on 64-bit Debian-based Linux distributions:



If your system is 32-bit:



Then you may follow these steps:



Now you can use MKL with most Debian numerical packages on Debian-based systems without too many problems.

Friday, September 09, 2011

Condor and R (1)

It is quite useful to combine condor and Rscript to run heavy R tasks.


However, condor does not provide an easy mechanism to
transfer files between nodes.
For example, suppose you have a .RData
file which contains the data frames and variables you need for
computing tasks like the following setup:
\[
\boldsymbol y = \mathbf X_i + \boldsymbol\varepsilon_i
\]
where you have a moving window over $\mathbf X$ with size $w$.
For this purpose, you may want to use the following R code to
run it sequentially:



It's not a big problem if you run it sequentially or in
parallel on your local machine:



The problem is doing it with condor:

with the condor file:

You will get a condor error message like:

So how should we fix it?

Update: Just use:

Thursday, September 08, 2011

RcppArmadillo's matrix allocation cost

Let's look at a simple example of SVD with RcppArmadillo.




Then we can have a very simple comparison:



It seems that RcppArmadillo does not always outperform vanilla R. We can see that matrix allocation in Armadillo adds overhead in both memory (because of the call-by-value mechanism) and CPU time. I didn't ask RcppArmadillo to return U and V in arma.code.2, but the benchmark results show that RcppArmadillo spent more time on the same SVD task than vanilla R.


P.S.: Thanks to Yihui Xie for the R brush for Syntaxhighlighter.

Tuesday, September 06, 2011

My .Emacs -- 2011 Sep 06

Here are my dot Emacs files. You can extract them and put them under your home directory; just don't forget to rename dot.emacs(.d) to .emacs(.d).

Monday, February 21, 2011

Using Revolution R 4.2 in Ubuntu 10.10

This is a quick note for those who want to use Revolution Analytics's Enterprise R 4.2 in Ubuntu 10.10.

Of course, the first thing you have to do is get a copy of the software. You can purchase it by contacting Revolution Analytics or download the academic version.

The biggest issue is that Enterprise R 4.2 only supports RHEL 5.5 or earlier. I have confirmed that the "aliened" rpms (rpms converted with alien) do not work in Ubuntu or Debian (you can't find the executables, only the libraries), so a working RHEL installation is required.

Installing RHEL (or CentOS) is not a big deal. The big deal is that you have to use the older software shipped with RHEL, without "optimized usability", which I think is definitely a nightmare for spoiled Ubuntu users.

Thanks to KVM, we can get what we want: use Ubuntu as the host and CentOS as the guest OS running in a virtual machine, and communicate between them via ssh.

The following steps are what I did to make Enterprise R 4.2 work for me:

0. Download CentOS 5.5 installation CD image and Enterprise R's tar package

1. In host Ubuntu:
      sudo apt-get install libvirt-bin qemu-kvm
      sudo service qemu-kvm start
      sudo service libvirt-bin start 

 2. From the main menu, choose "System Tools-> Virtual Machine Manager" and add a new virtual machine using the CentOS image as the virtual cdrom, then install CentOS in the virtual machine. I suggest not installing the desktop environment.

 3. In the installed guest CentOS, run the command "/sbin/ifconfig" to see the IP address of the guest CentOS, then use sftp to "upload" the Enterprise R tar package to the guest OS

 4. In guest OS, run yum install unixODBC  then untar the tar package, then cd to the uncompressed Revo folder and execute  ./install.py  
  
 5. Now you can use the Enterprise R 4.2 in Ubuntu, with CentOS as the guest OS.
     ESS can help you process the output of the R process running on the guest OS.
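 Steps 3 and 4 can be sketched as a small script run on the host. This is only a sketch under assumptions: the guest address and tarball name are placeholders, scp is used in place of sftp, and yum is assumed to run as root on the guest. By default it only prints the commands. The last part of step 4 (cd into the unpacked Revo folder and run ./install.py) is left for you to do by hand.

```shell
#!/bin/sh
# Placeholders: replace with your own guest address and file name.
GUEST=${GUEST:-root@192.168.122.10}          # hypothetical guest address
TARBALL=${TARBALL:-revo-enterprise.tar.gz}   # hypothetical file name

# run CMD...: print the command unless DRY_RUN=0, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the tarball to the guest's home directory, then install
# unixODBC and unpack the tarball there.
install_on_guest() {
    run scp "$TARBALL" "$GUEST:"
    run ssh "$GUEST" "yum -y install unixODBC && tar xf $(basename "$TARBALL")"
}
```

 Call install_on_guest to preview the commands, and set DRY_RUN=0 to actually run them.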

 There are several things we need to keep in mind:
 A. There is some performance penalty on the guest OS
 B. You need to upload the data files to the guest OS (or give R a path it can access)
 C. You cannot use other virtual machines (like VirtualBox) at the same time
 D. Enterprise R for RHEL does not provide the fancy IDE available on Win32 (or Win64)