September 29, 2014

Drawing a 95% confidence interval in R

Posted on August 5, 2013 by Nathan Lemoine

I’m writing a post on how to draw a 95% confidence interval in R by hand. I spent an hour or so trying to figure this out, and most message threads point people to the ellipse() function. However, I wanted to know how it works.

The basic problem was this. Imagine two random variables with a bivariate normal distribution, stored in y, an n x 2 matrix with n rows and 2 columns. The random variables are described by a mean vector mu and covariance matrix S. The equation for an ellipse is:

(y - mu) S^-1 (y - mu)' = c^2

The number c^2 controls the size of the ellipse; to get a 95% confidence region we set c^2 to the 0.95 quantile of a chi-square distribution with 2 degrees of freedom. The ellipse has two axes, one for each variable. The axes have half-lengths equal to c times the square root of the corresponding eigenvalues, with the largest eigenvalue giving the major axis. A further description of this can be found in any multivariate statistics book (or online).

To calculate the ellipse, we need to do a few things: 1) generate the coordinates of a unit circle (polar coordinates), 2) stretch the circle along each axis by the appropriate half-lengths (using the eigenvalues and the critical value), 3) rotate the coordinates based on the variances and covariances (using the eigenvectors), and 4) move the new coordinates back to the original means. This will make more sense when we do it by hand.

First, generate some data, plot it, and use the ellipse() function to make the 95% confidence interval. This is the target interval (I use it to check myself. If my calculations match, hooray. If not, I screwed up).

library(mvtnorm) # References rmvnorm()
library(ellipse) # References ellipse()
set.seed(17)
 
# Set the covariance matrix
sigma2 <- matrix(c(5, 2, 2, 5), ncol=2)
 
# Set the means
mu <- c(5,5)
 
# Get the correlation matrix (for reference; the ellipse below uses the covariance matrix)
P <- cov2cor(sigma2)
 
# Generate the data (rmvnorm() expects the covariance matrix itself, not its square root)
p <- rmvnorm(n=50, mean=mu, sigma=sigma2)
 
# Plot the data
plot(p)
 
# Plot the 95% ellipse from the covariance matrix
lines( ellipse( sigma2, centre = mu) , col='red')

Second, get the eigenvalues and eigenvectors of the covariance matrix.

evals <- eigen(sigma2)$values
evecs <- eigen(sigma2)$vectors

Third, make a vector of coordinates for a full circle, from 0 to 2*pi and get the critical value (c^2).

# Angles of a circle
a <- seq(0, 2*pi, len=100)
 
# Get critical value
c2 <- qchisq(0.95, 2)
c <- sqrt(c2)

The vector a above contains angles that describe a unit circle. The coordinates of a unit circle are found by x = cos(a) and y = sin(a) (use the trigonometry of a right triangle to get this, where the hypotenuse = 1). We need to stretch the unit circle by the appropriate half-lengths based on the eigenvalues, and then further by the critical value.

# Scale the unit circle: each half-length is c * sqrt(eigenvalue)
xT <- c * sqrt(evals[1]) * cos(a)
yT <- c * sqrt(evals[2]) * sin(a)
 
M <- cbind(xT, yT)

If you plot M, you’ll get an ellipse with the appropriate axis lengths, but centered on 0 and unrotated. Rotate the ellipse using the eigenvectors, which describe the relationships between the variables (more precisely, they give the directions of the major axes of variation). Use the equation u %*% M', where u is the matrix of eigenvectors (write this out to see why this works).

# Rotate the coordinates using the eigenvectors
transM <- evecs %*% t(M)
transM <- t(transM)
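
As an optional sanity check, note that the eigendecomposition rebuilds the covariance matrix; this is why stretching a unit circle by c times the square roots of the eigenvalues and then rotating by the eigenvectors reproduces the ellipse of that matrix:

# Optional check (not needed for the plot): the spectral decomposition
# evecs %*% diag(evals) %*% t(evecs) should recover sigma2
all.equal(evecs %*% diag(evals) %*% t(evecs), sigma2, check.attributes = FALSE)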

The final step is to move the rotated ellipse back to the original scale (centered around the original means) and plot the data.

# Shift the rotated ellipse back to the original means
# (mu is recycled here; this works because both means equal 5)
lines(transM + mu)

This gives the following plot, with the red line being the output from the ellipse() function.
And that’s that! Hopefully this helps someone like me who spent hours looking but couldn’t find anything.

September 23, 2014

Basic Probability Distributions in R

We look at some of the basic operations associated with probability distributions. There are a large number of probability distributions available, but we only look at a few. If you would like to know what distributions are available you can do a search using the command help.search("distribution").

Here we give details about the commands associated with the normal distribution and briefly mention the commands for other distributions. The functions for the different distributions are very similar; the differences are noted below.

For this chapter it is assumed that you know how to enter data which is covered in the previous chapters.

To get a full list of the distributions available in R you can use the following command:
help(Distributions)

For every distribution there are four commands. The commands for each distribution are prefixed with a letter to indicate the functionality (a short normal-distribution example follows this list):

"d" returns the height of the probability density function
"p" returns the cumulative distribution function
"q" returns the inverse cumulative distribution function (quantiles)
"r" returns randomly generated numbers
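
As a quick illustration with the normal distribution (covered in detail below), the four prefixes fit together like this:

dnorm(1.96)    # "d": height of the density at 1.96
pnorm(1.96)    # "p": P(X <= 1.96), about 0.975 for the standard normal
qnorm(0.975)   # "q": the inverse of pnorm, about 1.96
rnorm(5)       # "r": five random draws from the standard normal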

 

The Normal Distribution

There are four functions that can be used to generate the values associated with the normal distribution. You can get a full list of them and their options using the help command:
> help(Normal)
The first function we look at is dnorm. Given a set of values it returns the height of the probability density at each point. If you only give the points it assumes you want to use a mean of zero and standard deviation of one. There are options to use different values for the mean and standard deviation, though:

> dnorm(0)
[1] 0.3989423
> dnorm(0)*sqrt(2*pi)
[1] 1
> dnorm(0,mean=4)
[1] 0.0001338302
> dnorm(0,mean=4,sd=10)
[1] 0.03682701
> v <- c(0,1,2)
> dnorm(v)
[1] 0.39894228 0.24197072 0.05399097
> x <- seq(-20,20,by=.1)
> y <- dnorm(x)
> plot(x,y)
> y <- dnorm(x,mean=2.5,sd=0.1)
> plot(x,y)
The second function we examine is pnorm. Given a number or a list it computes the probability that a normally distributed random number will be less than that number. This function also goes by the rather ominous title of the “Cumulative Distribution Function.” It accepts the same options as dnorm:
> pnorm(0)
[1] 0.5
> pnorm(1)
[1] 0.8413447
> pnorm(0,mean=2)
[1] 0.02275013
> pnorm(0,mean=2,sd=3)
[1] 0.2524925
> v <- c(0,1,2)
> pnorm(v)
[1] 0.5000000 0.8413447 0.9772499
> x <- seq(-20,20,by=.1)
> y <- pnorm(x)
> plot(x,y)
> y <- pnorm(x,mean=3,sd=4)
> plot(x,y)
If you wish to find the probability that a number is larger than the given number you can use the lower.tail option:
> pnorm(0,lower.tail=FALSE)
[1] 0.5
> pnorm(1,lower.tail=FALSE)
[1] 0.1586553
> pnorm(0,mean=2,lower.tail=FALSE)
[1] 0.9772499
The next function we look at is qnorm which is the inverse of pnorm. The idea behind qnorm is that you give it a probability, and it returns the number whose cumulative distribution matches the probability. For example, if you have a normally distributed random variable with mean zero and standard deviation one, then if you give the function a probability it returns the associated Z-score:
> qnorm(0.5)
[1] 0
> qnorm(0.5,mean=1)
[1] 1
> qnorm(0.5,mean=1,sd=2)
[1] 1
> qnorm(0.5,mean=2,sd=2)
[1] 2
> qnorm(0.5,mean=2,sd=4)
[1] 2
> qnorm(0.25,mean=2,sd=2)
[1] 0.6510205
> qnorm(0.333)
[1] -0.4316442
> qnorm(0.333,sd=3)
[1] -1.294933
> qnorm(0.75,mean=5,sd=2)
[1] 6.34898
> v = c(0.1,0.3,0.75)
> qnorm(v)
[1] -1.2815516 -0.5244005  0.6744898
> x <- seq(0,1,by=.05)
> y <- qnorm(x)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=2)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=0.1)
> plot(x,y)
The last function we examine is the rnorm function which can generate random numbers whose distribution is normal. The argument that you give it is the number of random numbers that you want, and it has optional arguments to specify the mean and standard deviation:

> rnorm(4)
[1]  1.2387271 -0.2323259 -1.2003081 -1.6718483
> rnorm(4,mean=3)
[1] 2.633080 3.617486 2.038861 2.601933
> rnorm(4,mean=3,sd=3)
[1] 4.580556 2.974903 4.756097 6.395894
> rnorm(4,mean=3,sd=3)
[1]  3.000852  3.714180 10.032021  3.295667
> y <- rnorm(200)
> hist(y)
> y <- rnorm(200,mean=-2)
> hist(y)
> y <- rnorm(200,mean=-2,sd=4)
> hist(y)
> qqnorm(y)
> qqline(y)

 

The Chi-Squared Distribution

There are four functions that can be used to generate the values associated with the Chi-Squared distribution. You can get a full list of them and their options using the help command:
> help(Chisquare)
These commands work just like the commands for the normal distribution. The first difference is that it is assumed that the values have been normalized, so no mean can be specified. The other difference is that you have to specify the number of degrees of freedom. The commands follow the same kind of naming convention, and the names of the commands are dchisq, pchisq, qchisq, and rchisq.
A few examples are given below to show how to use the different commands. First we have the probability density function, dchisq:
> x <- seq(-20,20,by=.5)
> y <- dchisq(x,df=10)
> plot(x,y)
> y <- dchisq(x,df=12)
> plot(x,y)
Next we have the cumulative probability distribution function:
> pchisq(2,df=10)
[1] 0.003659847
> pchisq(3,df=10)
[1] 0.01857594
> 1-pchisq(3,df=10)
[1] 0.981424
> pchisq(3,df=20)
[1] 4.097501e-06
> x = c(2,4,5,6)
> pchisq(x,df=20)
[1] 1.114255e-07 4.649808e-05 2.773521e-04 1.102488e-03
Next we have the inverse cumulative probability distribution function:
> qchisq(0.05,df=10)
[1] 3.940299
> qchisq(0.95,df=10)
[1] 18.30704
> qchisq(0.05,df=20)
[1] 10.85081
> qchisq(0.95,df=20)
[1] 31.41043
> v <- c(0.005,.025,.05)
> qchisq(v,df=253)
[1] 198.8161 210.8355 217.1713
> qchisq(v,df=25)
[1] 10.51965 13.11972 14.61141
Finally, random numbers can be generated according to the Chi-Squared distribution:
> rchisq(3,df=10)
[1] 16.80075 20.28412 12.39099
> rchisq(3,df=20)
[1] 17.838878  8.591936 17.486372
> rchisq(3,df=20)
[1] 11.19279 23.86907 24.81251


September 18, 2014

Severance Pay

April 16, 16:28
politics, No. 97

Philosopher Christopher Wellman explains where the right to secede comes from, and

Esquire offers readers the chance to support the best-known separatist organizations and get a new map of the world.

Twenty-five years ago, almost no philosopher would even have considered the question of a separatists' right to independence. If you had asked a philosopher whether there was any theory on the subject, he would most likely have answered: "You cannot secede." You might have objected: "Fine, but what if I live on a territory where I am denied basic rights, and the only way to save myself would be to secede and form my own state?" The answer would have been: "Very well, you may do that, but only as a last resort, when nothing else helps."

But over the past quarter century so many countries have fallen apart that attitudes toward the problem have changed considerably. In addition, in 1991 the philosopher Allen Buchanan wrote an influential book, Secession, which forced his colleagues to discuss the possibility. And attitudes shifted. Today the overwhelming majority of philosophers believe that a state forfeits its right to a territory if it does not respect the rights of the people who live there. The debate now is over whether a territory has a right to independence even when its inhabitants are not being mistreated at all: can they secede simply because they want to form their own state?

Imagine a group of people who do not identify with their fellow citizens, or who are unlike them, or who simply want to determine their own policies. It is increasingly argued that their right to secession should be recognized even in this case, provided two basic qualifying conditions are met.

The first condition: the seceding territory must be able to perform basic political functions on its own. If, upon gaining independence, it would slide into total anarchy and begin violating the rights of its own minorities, then the state has every right not to let it go, because the state is obliged to protect all of its citizens from ending up in such conditions.

The second condition concerns cases where a territory wants to secede because it is wealthy or holds large resources. Here, as in a divorce, a perfectly reasonable question arises about who owes what to whom: "We both invested in your career; I worked as a waitress so that you could finish medical school, and ten years later you are a fashionable doctor with piles of money... By all means file for divorce, but I need some compensation." If the separatists have a power plant that supplies the whole country, and the whole country invested in it, then independence can be granted, but the electricity must continue to be shared.
[Map of world separatism: an interactive feature where readers select the separatist organizations they support and view the resulting map.]

In real life, of course, the first condition is met less often. Imagine a hypothetical Chechnya, and imagine that Russia violates the rights of Chechens, very brutally. In that case you might think that Chechnya has an automatic right to independence. But we do not actually know how Chechnya would treat its own residents. If we assume that the new authorities would treat minorities no better, then they probably have no right to secede. Only insofar as the new independent authority shows the desire and the will to treat its residents better than Russia does is it hard to dispute its right to independence.

The strongest argument against a universal right to self-determination goes like this: if we make separatism the norm, how do we stop it from spreading endlessly? Large states will break up into small ones, small ones into tiny ones, and so on until the system becomes politically unstable. Groups that are too small simply cannot perform political functions. There is nothing wrong with regional self-determination, but there are principled reasons not to let the process snowball, because people invented the state in the first place to fight political instability.

Practically everyone supports the right to secede, but almost no one wants to encourage it. If people of different nationalities realize that independence is an easily available option, they will begin to fight for it very aggressively, especially because among them there will be political leaders who would rather be president of a small independent country than mayor of a city or head of a municipality in a large one. It is a very complicated system of incentives.

A critic might say that it is easy to support separatism from a philosopher's armchair. In the real world there will be many secessions that fail to satisfy my two conditions, and telling the one kind from the other will not be easy at all. Moreover, real states cling to their territorial integrity and will aggressively suppress any separatist stirrings. Therefore, the critic would say, one should not even mention such things, let alone call for them or support them. Roughly the position of President Putin.

In Russia, and not only there, hopelessly outdated notions of territorial integrity and state sovereignty as an absolute category remain widespread. On this view, everything that happens inside a country is its internal affair and nothing more, and separatists may be crushed however one likes, without regard for anyone. But today even international law, written, let us note, by sovereign states, paints a different picture. As people take human rights more and more seriously, they inevitably respect sovereignty less.

In Canada, for example, the situation is entirely different. The Canadian Supreme Court recently ruled that if the residents of Quebec vote for independence, all citizens of the country will be obliged to take part in serious, good-faith negotiations over the terms of the exit. The court decided that the country cannot assume it has a right to territorial integrity. Quebec, understandably, is not prepared to stage a bloodbath for the sake of its independence. Following the Canadian philosopher Will Kymlicka, many argue that separatism in itself is not the problem: what matters is the political culture at the higher level. A separatist group inside a state that respects human rights will respect those rights itself. And conversely, inside a state that disregards the rights of its citizens, one should expect separatists who will not respect them either.

I hope that questions of separatism will lose their urgency in the near future, because the world is globalizing. Take Scotland as an example. There are separatists there who enjoy fairly broad support. But now that the United Kingdom has become part of the European Union, British sovereignty matters far less. And some Scots may say: why should we leave the country at all? Right now we belong to a federation that in turn belongs to the EU; tomorrow we would belong to the EU, which is itself, in essence, a federation. So as the economy becomes ever more global and international law gains force, preventing states from doing whatever they like on their own territory, separatism gradually loses its point.

That is my hope: the idea of sovereignty may go the way of the idea of monarchy in England. The monarchy still exists, but so little depends on it that, by and large, nobody cares. There are opponents and supporters of the monarchy, but no one will seriously go and die for the queen, no one will kill for her, no one will send their children into battle. As globalization and international law advance, separatism will mean less and less, and so it will generate less and less violence. I would like blood to stop being shed for the integrity of sovereign states as well.

Least squares fitting in Excel

Set up four parallel columns in the spreadsheet:

* X has the x-values.
* Y has the y-values.
* Fit computes the Gaussian values (based on the x-values and three parameters).
* Residual is the difference between the y-values and the fits.

In order to compute the fit, you need to create three cells holding the three Gaussian parameters. The formula for the fit must be identical to that used by the other software so that you can compare your results directly with its output. The example below names the three parameters kappa0, kappa1, and kappa2--just as in the documentation. The formula in the second row, where the x-value is in cell A2, is

=kappa0 * EXP(-1*(A2 - kappa1)^2 / kappa2)

It is copied down to all the other rows.

This is enough to check the software's results simply by plugging in its reported values of kappa0, kappa1, and kappa2. To see whether they are correct, compute the sum of squared residuals (SSR). A formula for this uses the SUMSQ function ("=SUMSQ(D2:D32)" in the example). If you think a better combination of the parameters will work, plug in that new combination and see whether SSR decreases: if it goes down, the new values are better.
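
If you would rather script the same check, a minimal R sketch of the SSR computation could look like the following (the x, y, and kappa values are placeholders, not the spreadsheet's actual numbers):

# Placeholder data and reported parameters; substitute your own values
x <- c(1, 2, 3, 4, 5)
y <- c(0.1, 0.4, 1.0, 0.5, 0.2)
kappa0 <- 1.0; kappa1 <- 3.0; kappa2 <- 2.0

fit <- kappa0 * exp(-1 * (x - kappa1)^2 / kappa2)  # same formula as the Fit column
residual <- y - fit                                # same as the Residual column
SSR <- sum(residual^2)                             # same as =SUMSQ(...) over the residuals
SSR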

[Spreadsheet screenshot]

You can automate this using Excel's "Solver" tool. Specify that you want to minimize the SSR by varying the three kappas. Start with the solution given by the other software; Solver will try systematically to improve on it.

[Solver dialog screenshot]
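
To mimic the Solver step outside Excel, a minimal R sketch (reusing the placeholder x, y, and kappa values from the SSR check above) might be:

ssr <- function(par) {
  fit <- par[1] * exp(-1 * (x - par[2])^2 / par[3])  # Gaussian with kappa0, kappa1, kappa2
  sum((y - fit)^2)                                   # objective: sum of squared residuals
}
# Start from the reported kappas and let the optimiser try to improve them
optim(c(kappa0, kappa1, kappa2), ssr)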

The same method--suitably adapted--works well for least squares, maximum likelihood, and other optimization procedures in statistics, provided the objective function is well-behaved (i.e., differentiable and convex) and you can obtain an excellent starting value. Otherwise, Solver is perfectly capable of reporting inferior solutions or failing altogether: it is wise not to use it as the sole method to solve a problem.

answered Jun 4 '11 at 14:50
whuber♦

September 9, 2014

September 3, 2014

How to fit data that looks like a gaussian?

I am quite new to statistics, so please forgive me for using probably the wrong vocabulary.
I have some data that looks (to me) like a gaussian when plotted.
The data is an extract from a jpeg image. It's a vertical line taken from the image, and only the Red data is used (from RGB).
Here is the full data (27 data points):
> r
 [1] 0.003921569 0.031372549 0.023529412 0.015686275 0.003921569 0.027450980
 [7] 0.003921569 0.015686275 0.031372549 0.105882353 0.305882353 0.490196078
[13] 0.560784314 0.615686275 0.592156863 0.505882353 0.364705882 0.227450980
[19] 0.050980392 0.031372549 0.019607843 0.054901961 0.031372549 0.015686275
[25] 0.027450980 0.003921569 0.011764706

> dput(r)
c(0.00392156862745098, 0.0313725490196078, 0.0235294117647059, 
0.0156862745098039, 0.00392156862745098, 0.0274509803921569, 
0.00392156862745098, 0.0156862745098039, 0.0313725490196078, 
0.105882352941176, 0.305882352941176, 0.490196078431373, 0.56078431372549, 
0.615686274509804, 0.592156862745098, 0.505882352941176, 0.364705882352941, 
0.227450980392157, 0.0509803921568627, 0.0313725490196078, 0.0196078431372549, 
0.0549019607843137, 0.0313725490196078, 0.0156862745098039, 0.0274509803921569, 
0.00392156862745098, 0.0117647058823529)
plot(r)
 
-------------------
 
Fitting a distribution is, roughly speaking, what you'd do if you made a histogram of your data, and tried to see what sort of shape it had. What you're doing, instead, is simply plotting a curve. That curve happens to have a hump in the middle, like what you get by plotting a gaussian density function.
To get what you want, you can use something like optim to fit the curve to your data. The following code uses nonlinear least squares to find the three parameters giving the best-fitting Gaussian curve: m is the Gaussian mean, sd is the standard deviation, and k is an arbitrary scaling parameter (needed because the Gaussian density is constrained to integrate to 1, whereas your data isn't).
x <- seq_along(r)

f <- function(par)
{
    m <- par[1]
    sd <- par[2]
    k <- par[3]
    rhat <- k * exp(-0.5 * ((x - m)/sd)^2)
    sum((r - rhat)^2)
}

optim(c(15, 2, 1), f, method="BFGS", control=list(reltol=1e-9))
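
The fitted parameters come back in $par; a small follow-up sketch (storing the optim() call above in fit) shows one way to overlay the resulting curve on the data:

fit <- optim(c(15, 2, 1), f, method="BFGS", control=list(reltol=1e-9))
plot(x, r)
curve(fit$par[3] * exp(-0.5 * ((x - fit$par[1]) / fit$par[2])^2), add=TRUE, col=2)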
 

I propose to use non-linear least squares for this analysis.
# First present the data in a data-frame
tab <- data.frame(x=seq_along(r), r=r)
#Apply function nls
(res <- nls( r ~ k*exp(-1/2*(x-mu)^2/sigma^2), start=c(mu=15,sigma=5,k=1) , data = tab))
And from the output, I was able to obtain the following fitted "Gaussian curve":
v <- summary(res)$parameters[,"Estimate"]
plot(r~x, data=tab)
plot(function(x) v[3]*exp(-1/2*(x-v[1])^2/v[2]^2),col=2,add=T,xlim=range(tab$x) )

R: Robust fitting of data points to a Gaussian function

To fit a Gaussian curve to the data, the principle is to minimise the sum of squared differences between the fitted curve and the data, so we define an objective function f and run optim on it:
 
fitG =
function(x,y,mu,sig,scale){

  f = function(p){
    d = p[3]*dnorm(x,mean=p[1],sd=p[2])
    sum((d-y)^2)
  }

  optim(c(mu,sig,scale),f)
 }

Now, extend this to two Gaussians:
 
fit2G <- function(x,y,mu1,sig1,scale1,mu2,sig2,scale2,...){

  f = function(p){
    d = p[3]*dnorm(x,mean=p[1],sd=p[2]) + p[6]*dnorm(x,mean=p[4],sd=p[5])
    sum((d-y)^2)
  }
  optim(c(mu1,sig1,scale1,mu2,sig2,scale2),f,...)
}

Fit with initial params from the first fit, and an eyeballed guess of the second peak. Need to increase the max iterations:
 
> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000))
Warning messages:
1: In dnorm(x, mean = p[1], sd = p[2]) : NaNs produced
2: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
3: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
> fit2P
$par
[1] 6.035610393 0.653149616 0.023744876 8.317215066 0.107767881 0.002055287

What does this all look like?
 
> plot(data$V3,data$V6)
> p = fit2P$par
> lines(data$V3,p[3]*dnorm(data$V3,p[1],p[2]))
> lines(data$V3,p[6]*dnorm(data$V3,p[4],p[5]),col=2)




However, I would be wary about drawing statistical inferences about your function parameters from this fit...
The warning messages produced are probably due to the sd parameter going negative. You can fix this and also get a quicker convergence by using L-BFGS-B and setting a lower bound:
 
> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000),method="L-BFGS-B",lower=c(0,0,0,0,0,0))
> fit2P
$par
[1] 6.03564202 0.65302676 0.02374196 8.31424025 0.11117534 0.00208724

As pointed out, sensitivity to initial values is always a problem with curve-fitting exercises like this.
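
For instance, here is a small synthetic illustration of that sensitivity (the x, y, and starting values below are made up, not the data used above): rerun fit2G from two different starting points and compare the achieved sums of squares in $value.

set.seed(1)
x <- seq(0, 12, by=0.05)
y <- 0.02*dnorm(x, 6, 0.6) + 0.002*dnorm(x, 8.3, 0.1) + rnorm(length(x), sd=2e-4)

# Start near the peaks, then start far away, and compare the fits
near <- fit2G(x, y, 6, 0.6, 0.02, 8.3, 0.10, 0.002, control=list(maxit=10000))
far  <- fit2G(x, y, 2, 1.0, 0.01, 11,  1.00, 0.010, control=list(maxit=10000))
near$value  # achieved sum of squared residuals
far$value   # may be noticeably worse, or a different local minimum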


Appendix L: TORQUE Quick Start Guide

L.1 Initial Installation

  • Download the TORQUE distribution file from http://clusterresources.com/downloads/torque
  • Extract and build the distribution on the machine that will act as the "TORQUE server" - the machine that will monitor and control all compute nodes by running the pbs_server daemon. See the example below:
    > tar -xzvf torque.tar.gz
    > cd torque
    > ./configure
    > make
    > make install
    
    Note OSX 10.4 users need to change the #define __TDARWIN in src/include/pbs_config.h to #define __TDARWIN_8.
  • Note After installation, verify that you have PATH environment variables configured for /usr/local/bin/ and /usr/local/sbin/. Client commands are installed to /usr/local/bin and server binaries are installed to /usr/local/sbin.
    Note In this document TORQUE_HOME corresponds to where TORQUE stores its configuration files. The default is /var/spool/torque.

L.2 Initialize/Configure TORQUE on the Server (pbs_server)

  • Once installation on the TORQUE server is complete, configure the pbs_server daemon by executing the command torque.setup <USER> found packaged with the distribution source code, where <USER> is a username that will act as the TORQUE admin. This script will set up a basic batch queue to get you started. If you experience problems, make sure that the most recent TORQUE executables are being executed, or that the executables are in your current PATH.
  • If doing this step manually, be certain to run the command 'pbs_server -t create' to create the new batch database. If this step is not taken, the pbs_server daemon will be unable to start.
  • Proper server configuration can be verified by following the steps listed in Section 1.4 Testing

L.3 Install TORQUE on the Compute Nodes

To configure a compute node do the following on each machine (see page 19, Section 3.2.1 of PBS Administrator's Manual for full details):
  • Create the self-extracting, distributable packages with make packages (see the INSTALL file for additional options and features of the distributable packages) and use the parallel shell command from your cluster management suite to copy and execute the package on all nodes (i.e., xCAT users might do prcp torque-package-linux-i686.sh main:/tmp/; psh main /tmp/torque-package-linux-i686.sh --install). Optionally, distribute and install the clients package.

L.4 Configure TORQUE on the Compute Nodes

  • For each compute host, the MOM daemon must be configured to trust the pbs_server daemon. In TORQUE 2.0.0p4 and earlier, this is done by creating the TORQUE_HOME/mom_priv/config file and setting the $pbsserver parameter. In TORQUE 2.0.0p5 and later, this can also be done by creating the TORQUE_HOME/server_name file and placing the server hostname inside.
  • Additional config parameters may be added to TORQUE_HOME/mom_priv/config (See the MOM Config page for details.)

L.5 Configure Data Management on the Compute Nodes

Data management allows a job's data to be staged in and out, that is, copied to and from the server and the compute nodes.
  • For shared filesystems (i.e., NFS, DFS, AFS, etc.) use the $usecp parameter in the mom_priv/config files to specify how to map a user's home directory.
    (Example: $usecp gridmaster.tmx.com:/home /home)
  • For local, non-shared filesystems, rcp or scp must be configured to allow direct copy without prompting for passwords (key authentication, etc.)

L.6 Update TORQUE Server Configuration

  • On the TORQUE server, append the list of newly configured compute nodes to the TORQUE_HOME/server_priv/nodes file:
    server_priv/nodes
    computenode001.cluster.org
    computenode002.cluster.org
    computenode003.cluster.org
    

L.7 Start the pbs_mom Daemons on Compute Nodes

  • Next start the pbs_mom daemon on each compute node by running the pbs_mom executable.

L.8 Verifying Correct TORQUE Installation

The pbs_server daemon was started on the TORQUE server when the torque.setup file was executed or when it was manually configured. It must now be restarted so it can reload the updated configuration changes.
# shutdown server
> qterm

# start server
> pbs_server

# verify all queues are properly configured
> qstat -q

# view additional server configuration
>  qmgr -c 'p s'

# verify all nodes are correctly reporting
> pbsnodes -a 

# submit a basic job
> echo "sleep 30" | qsub

# verify jobs display
> qstat
At this point, the job will not start because there is no scheduler running. The scheduler is enabled in the next step below.

L.9 Enabling the Scheduler

Selecting the cluster scheduler is an important decision and significantly affects cluster utilization, responsiveness, availability, and intelligence. The default TORQUE scheduler, pbs_sched, is very basic and will provide poor utilization of your cluster's resources. Other options, such as Maui Scheduler or Moab Workload Manager are highly recommended. If using Maui/Moab, refer to the Moab-PBS Integration Guide. If using pbs_sched, start this daemon now.
Note If you are installing ClusterSuite, TORQUE and Moab were configured at installation for interoperability and no further action is required.

L.10 Startup/Shutdown Service Script for TORQUE/Moab (OPTIONAL)

Optional startup/shutdown service scripts are provided as an example of how to run TORQUE as an OS service that starts at bootup. The scripts are located in the contrib/init.d/ directory of the TORQUE tarball you downloaded. In order to use the script you must:
  • Determine which init.d script suits your platform the best.
  • Modify the script to point to TORQUE's install location. This should only be necessary if you used a non-default install location for TORQUE (by using the --prefix option of ./configure).
  • Place the script in the /etc/init.d/ directory.
  • Use a tool like chkconfig to activate the start-up scripts or make symbolic links (S99moab and K15moab, for example) in the desired runlevels (/etc/rc.d/rc3.d/ on Redhat, etc.).



Torque configuration notes on Ubuntu 12.04

Client:
$ sudo apt-get install torque-common torque-client torque-mom

$ sudo nano /var/spool/torque/server_name:
server_name

$ sudo nano /var/spool/torque/mom_priv/config
$pbsserver      server_name          # note: this is the hostname of the headnode

start pbs_mom:
$ pbs_mom

or reboot


Server:
$ sudo nano /var/spool/torque/server_priv/nodes
node_name np=4

Torque configuration on Archlinux


TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. Basically, one can setup a home or small office Linux cluster and queue jobs with this software. A cluster consists of one head node and many compute nodes. The head node runs the torque-server daemon and the compute nodes run the torque-client daemon. The head node also runs a scheduler daemon.

Installation

Note: Although TORQUE is a very powerful queuing system, if the goal of the cluster is solely to increase compilation throughput, distcc is a much easier and elegant solution.
Install the torque package found in the AUR.

Must haves

/etc/hosts

Make sure that /etc/hosts on all of the boxes in the cluster contains the hostnames of every PC in the cluster. For example, a cluster consisting of 3 PCs: mars, phobos, and deimos.
192.168.0.20   mars
192.168.0.21   phobos
192.168.0.22   deimos

Firewall configuration (if installed)

Be sure to open the required TCP and UDP ports for all machines using TORQUE.
The pbs_server (server) and pbs_mom (client) by default use TCP and UDP ports 15001-15004. pbs_mom (client) also uses UDP ports 1023 and below if privileged ports are configured (the default).

NFS

Technically, one does not need to use NFS but doing so simplifies the whole process. An NFS share either on the server or another machine is highly recommended to simplify the process of sharing common build disk space.

Setup

Server (head node) configuration

Follow these steps on the head node/scheduler.
Edit /var/spool/torque/server_name to name the head node. It is recommended to match the hostname in /etc/hostname for simplicity's sake.
Create and configure the torque server:
# pbs_server -t create
PBS_Server localhost.localdomain: Create mode and server database exists,
do you wish to continue y/(n)?y
A minimal set of options is provided here. Adjust the first line, substituting "mars" with the hostname entered in /var/spool/torque/server_name:
qmgr -c "set server acl_hosts = mars"
qmgr -c "set server scheduling=true"
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch started=true"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch resources_default.nodes=1"
qmgr -c "set queue batch resources_default.walltime=3600"
qmgr -c "set server default_queue=batch"
It may be of interest to keep finished jobs in the queue for a period of time.
qmgr -c "set server keep_completed = 86400"
Here, 86400 seconds = 24 hours, after which point the job will be automatically removed from the queue. One can see the full log of jobs removed from the queue with the -f switch on qstat:
qstat -f
Verify the server config with this command:
# qmgr -c 'p s'
Edit /var/spool/torque/server_priv/nodes adding all compute nodes. Again, it is recommended to match the hostname(s) of the machines on the LAN. The syntax is HOSTNAME np=x gpus=y properties
  • HOSTNAME=the hostname of the machine
  • np=number of processors
  • gpus=number of gpus
  • properties=comments
Only the hostname is required, all other fields are optional.
Example:
mars np=4
phobos np=2
deimos np=2
Note:
  • One can run both the server and client on the same box.
  • Re-running pbs_server -t create may delete this nodes file.
Restart the server and the new options are sourced.

Client (compute node) configuration

Follow these steps on each compute node in the cluster.
Note: If running both the server and client on the same box, be sure to complete these steps for that machine as well as for the other pure clients on the cluster.
Edit /var/spool/torque/mom_priv/config to contain some basic info identifying the server:
$pbsserver      mars          # note: this is the hostname of the headnode
$logevent       255           # bitmap of which events to log

Restart the server and client(s)

That should be it. Restart the server and the client: torque-server torque-node.

Verifying cluster status

To check the status of the cluster, issue the following:
$ pbsnodes -a
Each node, if up, should indicate that it is ready to receive jobs by reporting a state of free. If a node is not working, it will report a state of down.
Example output:
mars
     state = free
     np = 4
     ntype = cluster
     status = rectime=1308479899,varattr=,jobs=0.localhost.localdomain,state=free,netload=1638547057,
gres=,loadave=2.69,ncpus=4,physmem=8195892kb,availmem=7172508kb,totmem=8195892kb,
idletime=24772,nusers=1,nsessions=5,sessions=1333 1349 1353 1388 9095,
uname=Linux mars 2.6.39-ck #1 SMP PREEMPT Sat Jun 18 14:19:01 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 2

phobos
     state = free
     np = 2
     ntype = cluster
     status = rectime=1308479933,varattr=,jobs=,state=free,netload=1085755815,
gres=,loadave=2.84,ncpus=2,physmem=4019704kb,availmem=5753552kb,totmem=6116852kb,
idletime=7324,nusers=2,nsessions=6,sessions=1565 1562 1691 1716 1737 1851,
uname=Linux phobos 2.6.37-ck #1 SMP PREEMPT Sun Apr 3 17:16:35 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

deimos
     state = free
     np = 2
     ntype = cluster
     status = rectime=1308479890,varattr=,jobs=2.localhost.localdomain,state=free,netload=527239670,
gres=,loadave=0.52,ncpus=2,physmem=4057808kb,availmem=3955624kb,totmem=4057808kb,
idletime=644,nusers=1,nsessions=1,sessions=865,
uname=Linux deimos 2.6.39-ck #1 SMP PREEMPT Sat Jun 11 12:36:21 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

Queuing jobs

Queuing to the cluster is accomplished via the qsub command.
A trivial test is to simply run sleep:
$ echo "sleep 30" | qsub
Check the status of the queue via the qstat command described below. At this point, the work will have a status of "Q" which means queued. To start it, run the scheduler:
# pbs_sched
One can modify the torque-server systemd daemon to activate pbs_sched at boot.
Another usage of qsub is to name a job and queue a script:
$ qsub -N x264 /home/facade/bin/x264_HQ.sh
Note: STDOUT and STDERR for a queued job will be logged by default in the form of text files corresponding to the respective outputs (pid.o and pid.e) and will be written to the path from which the qsub command was issued.
Another example could use a wrapper script to create and queue work en masse automatically.

Checking job status

qstat is used to check work status.
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
13.localhost               generic-i686.pbs facade         00:05:06 R batch          
14.localhost               atom-i686.pbs    facade         00:03:09 R batch          
15.localhost               core2-i686.pbs   facade         00:01:02 R batch          
16.localhost               k7-i686.pbs      facade                0 Q batch          
17.localhost               k8-i686.pbs      facade                0 Q batch          
18.localhost               k10-i686.pbs     facade                0 Q batch          
19.localhost               p4-i686.pbs      facade                0 Q batch          
20.localhost               pentm-i686.pbs   facade                0 Q batch          
21.localhost               ...ic-x86_64.pbs facade                0 Q batch          
22.localhost               atom-x86_64.pbs  facade                0 Q batch          
23.localhost               core2-x86_64.pbs facade                0 Q batch          
24.localhost               k8-x86_64.pbs    facade                0 Q batch          
25.localhost               k10-x86_64.pbs   facade                0 Q batch          
Append the -n switch to see which nodes are doing which jobs.
$ qstat -n
localhost.localdomain:
405.localhost.lo     facade  batch    i686-generic       3035     1   0    --  01:00 C 00:12
   mars/3+mars/2+mars/1+mars/0
406.localhost.lo     facade  batch    i686-atom          5768     1   0    --  01:00 C 00:46
   phobos/1+phobos/0
407.localhost.lo     facade  batch    i686-core2        22941     1   0    --  01:00 C 00:12
   mars/3+mars/2+mars/1+mars/0
408.localhost.lo     facade  batch    i686-k7           10152     1   0    --  01:00 C 00:12
   mars/3+mars/2+mars/1+mars/0
409.localhost.lo     facade  batch    i686-k8           29657     1   0    --  01:00 C 00:12
   mars/3+mars/2+mars/1+mars/0
410.localhost.lo     facade  batch    i686-k10          16838     1   0    --  01:00 C 00:12
   mars/3+mars/2+mars/1+mars/0
411.localhost.lo     facade  batch    i686-p4           25340     1   0    --  01:00 C 00:46
   deimos/1+deimos/0
412.localhost.lo     facade  batch    i686-pentm        12544     1   0    --  01:00 R 00:20
   phobos/1+phobos/0
413.localhost.lo     facade  batch    x86_64-generic     4024     1   0    --  01:00 C 00:13
   mars/3+mars/2+mars/1+mars/0
414.localhost.lo     facade  batch    x86_64-atom       19330     1   0    --  01:00 C 00:13
   mars/3+mars/2+mars/1+mars/0
415.localhost.lo     facade  batch    x86_64-core2       2146     1   0    --  01:00 C 00:13
   mars/3+mars/2+mars/1+mars/0
416.localhost.lo     facade  batch    x86_64-k8         17234     1   0    --  01:00 R 00:11
   mars/3+mars/2+mars/1+mars/0
417.localhost.lo     facade  batch    x86_64-k10          --      1   0    --  01:00 Q   -- 
    -- 



bash script, erase previous line?

Q: In lots of Linux programs, like curl, wget, and anything with a progress meter, the bottom line constantly updates at regular intervals. How do I do that in a bash script?


A:

{
  for pc in $(seq 1 100); do
    echo -ne "$pc%\033[0K\r"
    usleep 100000   # 0.1 s pause; use "sleep 0.1" if usleep is not available
  done
  echo
}

The "\033[0K" will delete to the end of the line - in case your progress line gets shorter at some point, although this may not be necessary for your purposes.

The "\r" will move the cursor to the beginning of the current line

The -n on echo will prevent the cursor from advancing to the next line

September 2, 2014

Multivariate Computations

This tutorial deals with a few multivariate techniques including clustering and principal components. We begin with a short introduction to generating multivariate normal random vectors.

Multivariate normal distributions

We'll start off by generating some multivariate normal random vectors. There are packages that do this automatically, such as the mvtnorm package available from CRAN, but it is easy and instructive to do from first principles.

Let's generate from a bivariate normal distribution in which the standard deviations of the components are 2 and 3 and the correlation between the components is -1/2. For simplicity, let the mean of the vectors be the origin. We need to figure out what the covariance matrix looks like.

The diagonal elements of the covariance matrix are the marginal variances, namely 4 and 9. The off-diagonal element is the covariance, which equals the correlation times the product of the marginal standard deviations, or -3:
sigma <- matrix(c(4,-3,-3,9),2,2)
   sigma
We now seek to find a matrix M such that M times its transpose equals sigma. There are many matrices that do this; one of them is the transpose of the Cholesky square root:
M <- t(chol(sigma))
   M %*% t(M)
We now recall that if Z is a random vector and M is a matrix, then the covariance matrix of MZ equals M cov(Z) M'. It is very easy to simulate normal random vectors whose covariance matrix is the identity matrix; this is accomplished whenever the vector components are independent standard normals. Thus, we obtain a multivariate normal random vector with covariance matrix sigma if we first generate a standard normal vector and then multiply by the matrix M above. Let us create a dataset with 200 such vectors:
Z <- matrix(rnorm(400),2,200) # 2 rows, 200 columns
   X <- t(M %*% Z)
The transpose above is taken so that X becomes a 200x2 matrix, since R prefers to have the columns as the vector components rather than the rows. Let us now plot the randomly generated normals and find the sample mean and covariance.
plot(X)
   Xbar <- apply(X,2,mean)
   S <- cov(X)
We can compare the S matrix with the sigma matrix, but it is also nice to plot an ellipse to see what shape these matrices correspond to. The car package, which we used in the EDA and regression tutorial, has the capability to plot ellipses. You might not need to run the install.packages function below since this package may already have been installed in the previous tutorial. However, the library function is necessary.
install.packages("car",lib="V:/")
   library(car,lib.loc="V:/")
To use the ellipse function in the car package, we need the center (mean), shape (covariance), and the radius. The radius is the radius of a circle that represents the "ellipse" for a standard bivariate normal distribution. To understand how to provide a radius, it is helpful to know that if we sum the squares of k independent standard normal random variables, the result is (by definition) a chi-squared random variable on k degrees of freedom. Thus, for a standard bivariate normal vector, the squares of the radii should be determined by the quantiles of the chi-squared distribution on 2 degrees of freedom. Let us then construct an ellipse with radius based on the median of the chi-squared distribution. Thus, this ellipse should contain roughly half of the points generated. We'll also produce a second ellipse, based on the true mean and covariance matrix, for purposes of comparison.
ellipse(Xbar, S, sqrt(qchisq(.5,2)))
   ellipse(c(0,0), sigma, 
      sqrt(qchisq(.5,2)), col=3, lty=2)
 