Python循序渐进主成分分析-码农场

简介
主成分分析vs多重判别分析
什么是“好的”子空间
推导PCA
PCA过程
生成一些三维数据
为什么用三维数据
去掉类别
计算散布矩阵
计算协方差矩阵
计算特征向量和对应的特征值
检查特征向量与特征值
可视化特征向量
排序特征向量
数据变换
Reference

译自《Implementing a Principal Component Analysis (PCA)– in Python, step by step》，一步步地实现了PCA，验证了散布矩阵和协方差矩阵可以得到同样的子空间，并友好地可视化出来，读完后对Python的爱又加深了一层。

简介

PCA的主要目的是发现数据中的模式，并利用这些模式在尽量不损失信息的情况下降低数据的维度。

主成分分析vs多重判别分析

多重判别分析MDA和主成分分析PCA都是互相紧密联系的线性变换方法。在PCA中，我们希望找到一些方向（成分）使得数据集的方差最大。在MDA中，我们还需要考虑数据点的类别，让数据最大程度上变得可分，而PCA则完全忽略类别。

换句话说，在PCA中我们将数据集映射到一个子空间中；在MDA中，我们尝试找到一个子空间来区分不同的类别。或者说，在PCA中我们想找到新的坐标系，使得数据的投影尽量长，而在MDA中还需额外考虑不同类别的差异性。

在典型的模式识别问题中，通常先运行MDA后运行PCA。

什么是“好的”子空间

假设我们想把d维的数据映射到k维的子空间（k<d），我们如何选择k，如何判断某个子空间可以“很好地”表示我们的数据呢？

之后，我们将会计算特征向量（成分）然后将其放到一个所谓的“散布矩阵”中。每个特征向量都有一个特征值，告诉我们数据集在该方向投影的长度。如果我们发现所有的特征值大小都非常相似，那么这就说明这个子空间很好。如果有些特征值比另一些大很多，那说明它们含有很多信息。相反，那些特征值接近0的向量则没有多少信息，可以去掉这一维度。

推导PCA

参考斯坦福大学cs229的讲义，我们想找到一个单位向量u，对于给定的点x，其在u方向的投影是，代表子空间中的一个点，我们希望所有点到原点的距离之和最远（假设我们将数据集的均值处理为0，则这等效于方差最大），也就是说我们想要找到一个单位向量u，使得：

hankcs.com 2016-11-09 下午4.55.33.png

最大化。由于是单位向量，所以我们有约束条件，这个最大化问题的解是如下矩阵的特征向量：

这个解是如何得到的呢？可以参考stdcoutzyx的推导：

引入拉格朗日方程

hankcs.com 2016-11-09 下午5.00.30.png

对u求导

hankcs.com 2016-11-09 下午5.00.55.png

令导数为0，恰好发现u就是Σ的特征向量。

由于Σ是实对称矩阵，所以n*n的Σ矩阵一定有n个正交的特征向量。

在降维的时候，顺序选取特征值最大的k个特征向量，做如下变换即可：

hankcs.com 2016-11-09 下午5.06.22.png

PCA过程

有了上述理论依据，PCA过程可分为6步：

将数据集的标签去掉，仅保留d维特征
计算d维均值向量（每个维度上的均值构成）
计算散布矩阵
计算特征向量及其特征值
选取特征值最大的k个特征向量，构成d*k的矩阵W
对每个数据点，执行如下降维变换：

生成一些三维数据

利用2个多元高斯分布生成40个3三维数据点，这两个高斯分布的均值和协方差矩阵分别为：

hankcs.com 2016-11-09 下午5.16.47.png

为什么用三维数据

三维数据可以可视化，降维到二维依然可以，这就是原因。

数据的生成代码如下：

# -*- coding:utf-8 -*-
# Filename: main.py
# Author：hankcs
# Date: 2016-11-09 PM5:18
import numpy as np

np.random.seed(234234782384239784) # random seed for consistency, you can remove this if needed

mu_vec1 = np.array([0,0,0])
cov_mat1 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, 20).T
assert class1_sample.shape == (3,20), "The matrix has not the dimensions 3x20"

mu_vec2 = np.array([1,1,1])
cov_mat2 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, 20).T
assert class1_sample.shape == (3,20), "The matrix has not the dimensions 3x20"

通过上述代码，我们生成了两个3*20的数据集，每个数据点都是三维列向量 hankcs.com 2016-11-09 下午5.21.28.png ，数据集则是一个大矩阵 hankcs.com 2016-11-09 下午5.21.58.png 。

可视化代码如下：

from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
plt.rcParams['legend.fontsize'] = 10
ax.plot(class1_sample[0,:], class1_sample[1,:], class1_sample[2,:], 'o', markersize=8, color='blue', alpha=0.5, label='class1')
ax.plot(class2_sample[0,:], class2_sample[1,:], class2_sample[2,:], '^', markersize=8, alpha=0.5, color='red', label='class2')

plt.title('Samples for class 1 and class 2')
ax.legend(loc='upper right')

plt.show()

得到：

去掉类别

忽略数据集的类别：

all_samples = np.concatenate((class1_sample, class2_sample), axis=1)assert all_samples.shape == (3,40), "The matrix has not the dimensions 3x40"

计算d维均值矩阵

mean_x = np.mean(all_samples[0,:])
mean_y = np.mean(all_samples[1,:])
mean_z = np.mean(all_samples[2,:])

mean_vector = np.array([[mean_x],[mean_y],[mean_z]])

print('Mean Vector:\n', mean_vector)

得到

('Mean Vector:\n', array([[ 0.50576644],
       [ 0.30186591],
       [ 0.76459177]]))

计算散布矩阵

散布矩阵定义如下：

hankcs.com 2016-11-09 下午5.26.02.png

其中，m是均值向量

hankcs.com 2016-11-09 下午5.26.25.png

实现如下：

scatter_matrix = np.zeros((3,3))
for i in range(all_samples.shape[1]):
    scatter_matrix += (all_samples[:,i].reshape(3,1) - mean_vector).dot((all_samples[:,i].reshape(3,1) - mean_vector).T)
print('Scatter Matrix:\n', scatter_matrix)

得到

('Scatter Matrix:\n', array([[ 48.91593255,   7.11744916,   7.20810281],
       [  7.11744916,  37.92902984,   2.7370493 ],
       [  7.20810281,   2.7370493 ,  35.6363759 ]]))

计算协方差矩阵

作为散布矩阵的替代品，也可以利用numpy内置的cov直接计算协方差矩阵。协方差矩阵和散布矩阵的公式非常像，只不过协方差矩阵是散布矩阵的1/(N-1)倍，N代表样本总数。但是无论如何，它们的特征空间是一样的（特征向量相同，只不过特征值被缩放了一个常量倍数）。

hankcs.com 2016-11-09 下午5.34.50.png

实现如下

cov_mat = np.cov([all_samples[0,:],all_samples[1,:],all_samples[2,:]])
print('Covariance Matrix:\n', cov_mat)

得到

('Covariance Matrix:\n', array([[ 1.25425468,  0.1824987 ,  0.18482315],
       [ 0.1824987 ,  0.97253923,  0.07018075],
       [ 0.18482315,  0.07018075,  0.91375323]]))

计算特征向量和对应的特征值

为了验证“对散布矩阵和协方差矩阵而言其特征向量都是相同的”，我们用assert来试验一下，我们还可以将这个常数倍数（40-1=39）算出来：

# eigenvectors and eigenvalues for the from the scatter matrix
eig_val_sc, eig_vec_sc = np.linalg.eig(scatter_matrix)

# eigenvectors and eigenvalues for the from the covariance matrix
eig_val_cov, eig_vec_cov = np.linalg.eig(cov_mat)

for i in range(len(eig_val_sc)):
    eigvec_sc = eig_vec_sc[:,i].reshape(1,3).T
    eigvec_cov = eig_vec_cov[:,i].reshape(1,3).T
    assert eigvec_sc.all() == eigvec_cov.all(), 'Eigenvectors are not identical'

    print('Eigenvector {}: \n{}'.format(i+1, eigvec_sc))
    print('Eigenvalue {} from scatter matrix: {}'.format(i+1, eig_val_sc[i]))
    print('Eigenvalue {} from covariance matrix: {}'.format(i+1, eig_val_cov[i]))
    print('Scaling factor: ', eig_val_sc[i]/eig_val_cov[i])
    print(40 * '-')

输出：

Eigenvector 1: 
[[-0.84190486]
 [-0.39978877]
 [-0.36244329]]
Eigenvalue 1 from scatter matrix: 55.3988559573
Eigenvalue 1 from covariance matrix: 1.42048348608
('Scaling factor: ', 38.999999999999972)
----------------------------------------
Eigenvector 2: 
[[-0.44565232]
 [ 0.13637858]
 [ 0.88475697]]
Eigenvalue 2 from scatter matrix: 32.4275480129
Eigenvalue 2 from covariance matrix: 0.831475590075
('Scaling factor: ', 38.999999999999986)
----------------------------------------
Eigenvector 3: 
[[ 0.30428639]
 [-0.90640489]
 [ 0.29298458]]
Eigenvalue 3 from scatter matrix: 34.6549343281
Eigenvalue 3 from covariance matrix: 0.888588059694
('Scaling factor: ', 39.0)
----------------------------------------

检查特征向量与特征值

让我们快速验证一下特征向量和特征值的计算是正确的，也就是满足

这里Σ是协方差矩阵，v是特征向量，lambda是特征值。

实现如下

for i in range(len(eig_val_sc)):
    eigv = eig_vec_sc[:,i].reshape(1,3).T
    np.testing.assert_array_almost_equal(scatter_matrix.dot(eigv), eig_val_sc[i] * eigv,
                                         decimal=6, err_msg='', verbose=True)

这里numpy的assert_array_almost_equal真是人性化啊，连精度都给考虑了。

可视化特征向量

在进行下一步之前，不如满足一下我们的好奇心，看看特征向量到底长什么样。

from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
from matplotlib.patches import FancyArrowPatch


class Arrow3D(FancyArrowPatch):
    def __init__(self, xs, ys, zs, *args, **kwargs):
        FancyArrowPatch.__init__(self, (0,0), (0,0), *args, **kwargs)
        self._verts3d = xs, ys, zs

    def draw(self, renderer):
        xs3d, ys3d, zs3d = self._verts3d
        xs, ys, zs = proj3d.proj_transform(xs3d, ys3d, zs3d, renderer.M)
        self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
        FancyArrowPatch.draw(self, renderer)

fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111, projection='3d')

ax.plot(all_samples[0,:], all_samples[1,:], all_samples[2,:], 'o', markersize=8, color='green', alpha=0.2)
ax.plot([mean_x], [mean_y], [mean_z], 'o', markersize=10, color='red', alpha=0.5)
for v in eig_vec_sc.T:
    a = Arrow3D([mean_x, v[0]], [mean_y, v[1]], [mean_z, v[2]], mutation_scale=20, lw=3, arrowstyle="-|>", color="r")
    ax.add_artist(a)
ax.set_xlabel('x_values')
ax.set_ylabel('y_values')
ax.set_zlabel('z_values')

plt.title('Eigenvectors')

plt.show()

得到

排序特征向量

排序前可以先验证一下特征向量的确长度为1：

for ev in eig_vec_sc:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
    # instead of 'assert' because of rounding errors

我们想去掉一个特征值最小的特征维度：

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_val_sc[i]), eig_vec_sc[:,i]) for i in range(len(eig_val_sc))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
for i in eig_pairs:
    print(i[0])

得到

55.3988559573
34.6549343281
32.4275480129

没错，就要丢掉那个32了。

选取k个特征值最大的特征向量

三维降低到二维，所以选取前2个特征向量，它们构成3*2的矩阵W：

matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1), eig_pairs[1][1].reshape(3,1)))
print('Matrix W:\n', matrix_w)

得到

    Matrix W:
     [[-0.84190486  0.30428639]
     [-0.39978877 -0.90640489]
     [-0.36244329  0.29298458]]

数据变换

利用将三维数据变换为二维数据：

transformed = matrix_w.T.dot(all_samples)
assert transformed.shape == (2,40), "The matrix is not 2x40 dimensional."

plt.plot(transformed[0,0:20], transformed[1,0:20], 'o', markersize=7, color='blue', alpha=0.5, label='class1')
plt.plot(transformed[0,20:40], transformed[1,20:40], '^', markersize=7, color='red', alpha=0.5, label='class2')
plt.xlim([-4,4])
plt.ylim([-4,4])
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.title('Transformed samples with class labels')

plt.show()

得到

Reference

《Implementing a Principal Component Analysis (PCA)– in Python, step by step》

知识共享署名-非商业性使用-相同方式共享：码农场 » Python循序渐进主成分分析

Python循序渐进主成分分析

简介

主成分分析vs多重判别分析

什么是“好的”子空间

推导PCA

PCA过程

生成一些三维数据

为什么用三维数据

去掉类别

计算散布矩阵

计算协方差矩阵

计算特征向量和对应的特征值

检查特征向量与特征值

可视化特征向量

排序特征向量

数据变换

Reference

评论欢迎留言

我的作品

简介

主成分分析vs多重判别分析

什么是“好的”子空间

推导PCA

PCA过程

生成一些三维数据

为什么用三维数据

去掉类别

计算散布矩阵

计算协方差矩阵

计算特征向量和对应的特征值

检查特征向量与特征值

可视化特征向量

排序特征向量

数据变换

Reference

评论 欢迎留言

我的作品

评论欢迎留言