September 2, 2014

Multivariate Computations

This tutorial deals with a few multivariate techniques, including clustering and principal components. We begin with a short introduction to generating multivariate normal random vectors.

Multivariate normal distributions

We'll start off by generating some multivariate normal random vectors. There are packages that do this automatically, such as the mvtnorm package available from CRAN, but it is easy and instructive to do it from first principles.
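For comparison, here is a minimal sketch using mvtnorm (assuming the package is installed); its rmvnorm function draws rows directly from a multivariate normal with a given mean vector and covariance matrix, here the same matrix we derive below:

library(mvtnorm)
# draw 200 rows from a bivariate normal with the covariance
# matrix constructed in the next paragraph
Y <- rmvnorm(200, mean = c(0,0), sigma = matrix(c(4,-3,-3,9),2,2))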

Let's generate from a bivariate normal distribution in which the standard deviations of the components are 2 and 3 and the correlation between the components is -1/2. For simplicity, let the mean vector be the origin. We need to figure out what the covariance matrix looks like.

The diagonal elements of the covariance matrix are the marginal variances, namely 4 and 9. The off-diagonal element is the covariance, which equals the correlation times the product of the marginal standard deviations: (-1/2)(2)(3) = -3.
sigma <- matrix(c(4,-3,-3,9),2,2)
sigma
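As a quick sanity check, the cov2cor function in base R converts a covariance matrix back to a correlation matrix, so the off-diagonal entries should come out as -1/2:

cov2cor(sigma)  # off-diagonal entries should equal -0.5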
We now seek a matrix M such that M times its transpose equals sigma. There are many matrices that do this; one of them is the transpose of the Cholesky square root. (R's chol function returns an upper-triangular matrix U with Uᵀ U = sigma, so M = Uᵀ works.)
M <- t(chol(sigma))
M %*% t(M)
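To illustrate that the Cholesky factor is only one of many choices, the symmetric square root built from the eigendecomposition of sigma also satisfies M Mᵀ = sigma:

e <- eigen(sigma)
# symmetric square root: V diag(sqrt(lambda)) Vt
M2 <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
M2 %*% t(M2)  # also reproduces sigma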
We now recall that if Z is a random vector and M is a matrix, then the covariance matrix of MZ equals M cov(Z) Mᵀ. It is very easy to simulate normal random vectors whose covariance matrix is the identity matrix: this happens whenever the vector components are independent standard normals. Thus, we obtain a multivariate normal random vector with covariance matrix sigma if we first generate a standard normal vector and then multiply by the matrix M above. Let us create a dataset with 200 such vectors:
Z <- matrix(rnorm(400),2,200) # 2 rows, 200 columns
X <- t(M %*% Z)
The transpose above is taken so that X becomes a 200 x 2 matrix, since R's convention is to store the vector components in the columns rather than the rows. Let us now plot the randomly generated normals and find the sample mean and covariance.
plot(X)
Xbar <- apply(X,2,mean)
S <- cov(X)
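Printing the sample quantities next to their targets gives a quick numerical comparison; with 200 points the agreement should be rough but recognizable:

S
sigma
cor(X)[1,2]  # sample correlation; should be near -0.5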
Beyond comparing the numbers, it is also nice to plot an ellipse to see what shape these matrices correspond to. The car package, which we used in the EDA and regression tutorial, can plot ellipses. You might not need to run the install.packages function below, since this package may already have been installed in the previous tutorial; the library function, however, is necessary.
install.packages("car",lib="V:/")
library(car,lib.loc="V:/")
To use the ellipse function in the car package, we need the center (mean), shape (covariance), and radius. The radius is the radius of the circle that plays the role of the "ellipse" for a standard bivariate normal distribution. To see how to choose it, recall that the sum of squares of k independent standard normal random variables is, by definition, a chi-squared random variable on k degrees of freedom. Thus, for a standard bivariate normal vector, the squared radii should be quantiles of the chi-squared distribution on 2 degrees of freedom. Let us then construct an ellipse whose radius is based on the median of that distribution, so the ellipse should contain roughly half of the points generated. We'll also produce a second ellipse, based on the true mean and covariance matrix, for purposes of comparison.
ellipse(Xbar, S, sqrt(qchisq(.5,2)))
ellipse(c(0,0), sigma, sqrt(qchisq(.5,2)), col=3, lty=2)
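As a check on the chi-squared reasoning, base R's mahalanobis function computes squared Mahalanobis distances, so we can count the proportion of simulated points falling inside the true median ellipse; it should be near one half:

d2 <- mahalanobis(X, center = c(0,0), cov = sigma)
mean(d2 <= qchisq(.5,2))  # proportion inside the true ellipse; expect about 0.5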
 