My code:
from numpy import *

def pca(orig_data):
    data = array(orig_data)
    # standardize each feature: zero mean, unit variance
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print s  # should be s**2 instead!
    print v

def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        # keep the four numeric features, drop the trailing class label
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)

if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as what I got, but transposed, so I guess that's okay.
Also, what's with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should get 4 eigenvalues, not the 150 that eig gives me. Am I doing something wrong?
Edit: I've noticed that the values differ by a factor of 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to add up to the number of dimensions, in this case 4. What I don't understand is why this difference is happening. If I simply divide the eigenvalues by len(data) I get the result I want, but I don't understand why. Either way, the proportions of the eigenvalues aren't altered, but they are important to me, so I'd like to understand what's going on.
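For reference, a minimal sketch that checks this scaling (it reuses load_iris and the iris.data file from the code above, and assumes the same standardization by the population standard deviation as in pca):

import numpy as np

data = load_iris('iris.data')                           # load_iris as defined above
data = (data - data.mean(axis=0)) / data.std(axis=0)    # same standardization as pca()

u, s, v = np.linalg.svd(data)

# dividing the squared singular values by the number of samples gives the
# eigenvalues of the correlation matrix of the standardized data
print(s ** 2 / len(data))    # roughly [2.9108  0.9212  0.1474  0.0206]

# cross-check against an explicit eigendecomposition of the correlation matrix
print(np.linalg.eigvalsh(np.corrcoef(data, rowvar=0))[::-1])   # same values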
4 Answers
#1
10
You decomposed the wrong matrix.
Principal Component Analysis requires manipulating the eigenvectors/eigenvalues of the covariance matrix, not the data itself. The covariance matrix built from an m x n data matrix (m observations, n features) is an n x n matrix; its correlation-matrix counterpart has ones along the main diagonal.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef:
import numpy as NP
import numpy.linalg as LA

# a simulated data set with 8 data points, each point having five features
# (cast to float so the in-place mean-centering below works)
data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)

# usually a good idea to mean-center your data first
data -= NP.mean(data, axis=0)

# calculate the correlation matrix (rowvar=0: columns are the variables);
# returns an n x n matrix, here 5 x 5
C = NP.corrcoef(data, rowvar=0)

# now get the eigenvalues/eigenvectors of C
evals, evecs = LA.eig(C)
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) LA module--it is a little easier to work with than svd: the return values are the eigenvectors and eigenvalues themselves, and nothing else. By contrast, as you know, svd doesn't return these directly.
Granted, the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however, when doing PCA you'll always have a square matrix to decompose, regardless of the form your data is in. This is because the matrix you decompose in PCA is a covariance (or correlation) matrix, which by definition is always square: its rows and columns both correspond to the features of the original data, and each cell holds the covariance of a pair of features. The ones down the main diagonal of the correlation matrix simply say that each feature is perfectly correlated with itself.
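As a quick illustration of that last point (a sketch, not part of the original answer, reusing the simulated data from above): for the square, symmetric, positive semi-definite matrix C, svd and eig agree, because the singular values of such a matrix are its eigenvalues.

import numpy as NP
import numpy.linalg as LA

data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)
data -= NP.mean(data, axis=0)
C = NP.corrcoef(data, rowvar=0)

# eig returns the eigenvalues/eigenvectors of C directly
evals, evecs = LA.eig(C)

# svd of the square, symmetric, positive semi-definite C: the singular
# values coincide with the eigenvalues (up to ordering)
u, s, vt = LA.svd(C)

print(NP.sort(evals.real)[::-1])   # eigenvalues, sorted descending
print(s)                           # same values; svd already sorts them descending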
#2
3
The left singular vectors returned by SVD(A) are the eigenvectors of AA^T.
The covariance matrix of a mean-centered dataset A (observations as columns) is: 1/(N-1) * AA^T
Now, when you do PCA by using the SVD, you have to divide each entry of your A matrix by sqrt(N-1) (or, equivalently, divide the squared singular values by N-1) so that you get the eigenvalues of the covariance matrix at the correct scale.
In your case, N=150 and you haven't done this division, hence the discrepancy.
This is explained in detail here
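A minimal numerical check of this scaling (a sketch with hypothetical random data standing in for the iris matrix; here the observations are rows, which gives the same singular values):

import numpy as np

N, d = 150, 4
A = np.random.randn(N, d)          # hypothetical data: 150 samples, 4 features
A -= A.mean(axis=0)                # mean-center

u, s, vt = np.linalg.svd(A, full_matrices=False)

# eigenvalues of the sample covariance matrix, sorted descending
cov_evals = np.sort(np.linalg.eigvalsh(np.cov(A, rowvar=0)))[::-1]

# they equal the squared singular values divided by (N - 1)
print(np.allclose(s ** 2 / (N - 1), cov_evals))   # True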
#3
2
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
- You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors (see the sketch after this list).
- Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the squares of the singular values of X. To see all this, let X have the SVD X = Q S V^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
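A short sketch tying both points together (hypothetical random data in place of the iris matrix): transposing before cov gives the 4-by-4 matrix, and the eigenvalues of X X^T are the squares of the singular values of X.

import numpy as np

# hypothetical observation matrix: 4 features, one observation per column
X = np.random.randn(4, 150)
X -= X.mean(axis=1, keepdims=True)           # mean-center each feature

# point 1: with features along the rows, cov(X) is already 4-by-4;
# if your data is 150-by-4 instead, pass data.T (or rowvar=0) to cov
cov_mat = np.cov(X)
print(cov_mat.shape)                         # (4, 4)
vals, vecs = np.linalg.eig(cov_mat)
print(vals.shape)                            # (4,) -- four eigenvalues

# point 2: the eigenvalues of X X^T equal the squared singular values of X
Q, s, Vt = np.linalg.svd(X, full_matrices=False)
evals = np.sort(np.linalg.eigvalsh(np.dot(X, X.T)))[::-1]
print(np.allclose(evals, s ** 2))            # True

# and the left singular vectors of X are eigenvectors of X X^T (up to sign)
evecs = np.linalg.eigh(np.dot(X, X.T))[1][:, ::-1]    # reorder to descending
print(np.allclose(np.abs(np.sum(Q * evecs, axis=0)), 1.0))   # True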