Dimensionality Reduction

1. The Curse of Dimensionality

  • The main motivations for dimensionality reduction
    1. To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better)
    2. To visualize the data and gain insights on the most important features
    3. Simply to save space (compression)
  • The main drawbacks
    1. Some information is lost, possibly degrading the performance of subsequent training algorithms
    2. It can be computationally intensive
    3. It adds some complexity to your Machine Learning pipelines
    4. Transformed features are often hard to interpret

2. Main Approaches for Dimensionality Reduction

  • Projection
    Many features are almost constant or highly correlated, so the training instances lie close to a much lower-dimensional subspace; projection maps the high-dimensional data onto that lower-dimensional subspace
  • Manifold Learning
    A d-dimensional manifold is rolled up inside an n-dimensional space and can be unrolled back to d dimensions; the manifold assumption is that the high-dimensional data was generated by transforming lower-dimensional data (see the sketch after this list)
    if you reduce the dimensionality of your training set before training a model, it will definitely speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset
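
To see the difference, here is a minimal sketch (an illustrative assumption, not part of the original notes): it generates a Swiss-roll dataset with scikit-learn's make_swiss_roll and contrasts a naive projection, which squashes the layers of the roll together, with the unrolled manifold obtained from the roll parameter t. The variable names (X_swiss, t) are chosen for this sketch only.

import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll

# X_swiss has shape (1000, 3); t parameterizes the position along the roll
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Naive projection: drop the third axis, so different layers of the roll overlap
axes[0].scatter(X_swiss[:, 0], X_swiss[:, 1], c=t, s=5)
axes[0].set_title("Projection onto two axes")
# Manifold view: plot against the roll parameter t, which "unrolls" the manifold
axes[1].scatter(t, X_swiss[:, 1], c=t, s=5)
axes[1].set_title("Unrolled manifold")
plt.show()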

3. PCA

  • Main idea
    First it identifies the hyperplane that lies closest to the data (the one that preserves the maximum amount of variance), then it projects the data onto that hyperplane
  • Principal Components
    The unit vector that defines the ith axis is called the ith principal component (PC); the PCs are the unit axis vectors of the projection hyperplane, an n-dimensional dataset has n principal components, and the direction (sign) of a PC is not important; what matters is the plane the PCs define
# Singular Value Decomposition (SVD) to obtain the principal components of a matrix
import numpy as np

X_centered = X - X.mean(axis=0)  # PCA assumes the data is centered around the origin
U, s, Vt = np.linalg.svd(X_centered)  # np.linalg.svd returns V already transposed
c1 = Vt.T[:, 0]   # first principal component
c2 = Vt.T[:, 1]   # second principal component
W2 = Vt.T[:, :2]  # W_d: matrix with the first two PCs as columns
X2D = X_centered.dot(W2)  # project the training set onto the 2D plane

V^T = \begin{pmatrix} \mid & \mid & & \mid \\ c_1 & c_2 & \cdots & c_n \\ \mid & \mid & & \mid \end{pmatrix}
X_{d\text{-}proj} = X \cdot W_d

  • Projecting Down to d Dimensions (m, d) = (m, n) · (n, d)
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)
# automatically takes care of centering the data
c1 = pca.components_.T[:, 0]  # the first principal component
  • Choosing the Right Number of Dimensions: pick the reduced dimension d from the cumulative explained variance ratio of the PCs
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print(pca.explained_variance_ratio_)
# example output: [0.84248607, 0.14631839]

Alternatively, set n_components to the fraction of the variance we want to preserve:

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

Or plot the cumulative explained variance as a function of the number of dimensions and pick the elbow of the curve (see the sketch below).
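
A minimal sketch of that elbow plot, assuming cumsum has been computed as in the snippet above:

import matplotlib.pyplot as plt

plt.plot(range(1, len(cumsum) + 1), cumsum)  # cumulative explained variance per dimension
plt.axhline(y=0.95, linestyle="--")          # e.g. a 95% variance threshold
plt.xlabel("Dimensions")
plt.ylabel("Cumulative explained variance")
plt.show()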

  • PCA inverse transform: from (m, d) = (m, n) · (n, d) back to (m, n) = (m, d) · (d, n)
# decompress the reduced dataset back to the original number of dimensions
pca = PCA(n_components = 154)
X_mnist_reduced = pca.fit_transform(X_mnist)
X_mnist_recovered = pca.inverse_transform(X_mnist_reduced)

The reconstruction error is the mean squared distance between the original data and the reconstructed data:

X_{recovered} = X_{d\text{-}proj} \cdot W_d^T
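
A minimal sketch of computing that reconstruction error for the MNIST snippet above (the variable names follow the previous block; plain NumPy array operations are used here rather than any particular helper):

# mean squared distance between the original and the reconstructed instances
reconstruction_error = ((X_mnist - X_mnist_recovered) ** 2).sum(axis=1).mean()
print(reconstruction_error)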

  • Incremental PCA (online PCA)
    Incremental PCA (IPCA) supports incremental (online) fitting on mini-batches, which makes it suitable for very large datasets and online learning
from sklearn.decomposition import IncrementalPCA
n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_mnist, n_batches):
    inc_pca.partial_fit(X_batch)
X_mnist_reduced = inc_pca.transform(X_mnist)

Alternatively, use np.memmap, which stores the data in binary form on disk and loads it into memory only as needed:

X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)
  • Randomized PCA: use when you want to reduce the dimensionality considerably and quickly
    a stochastic algorithm that quickly finds an approximation of the first d principal components; its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for the full SVD approach
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_mnist)
  • Kernel PCA: mainly for nonlinear data
    Applying the kernel trick to PCA yields Kernel PCA (kPCA)
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

Selecting a Kernel and Tuning Hyperparameters
With labels: build a pipeline ending in a classifier and select the kPCA hyperparameters by comparing classification accuracy (e.g. with grid search)

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"],
}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)

Without labels: use the original data as the target, train the inverse transform (a regression back to the input space), and select the hyperparameters that minimize the reconstruction pre-image error

from sklearn.metrics import mean_squared_error

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)
mean_squared_error(X, X_preimage)

4. LLE (Locally Linear Embedding)

  • a Manifold Learning technique

first, measure how each training instance linearly relates to its closest neighbors (c.n.)
\hat{W} = \underset{W}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| x^{(i)} - \sum_{j=1}^{m} w_{i,j}\, x^{(j)} \right\|^2 \quad \text{subject to} \quad \begin{cases} w_{i,j} = 0 & \text{if } x^{(j)} \text{ is not one of the } k \text{ c.n. of } x^{(i)} \\ \sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \cdots, m \end{cases}
then looking for a low-dimensional representation of the training set
LLE is especially good at unrolling twisted manifolds such as the Swiss roll
\hat{Z} = \underset{Z}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| z^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\, z^{(j)} \right\|^2

from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)
  • computational complexity: LLE scales poorly to very large datasets
    1. O(m log(m) n log(k)) for finding the k nearest neighbors
    2. O(m n k³) for optimizing the weights
    3. O(d m²) for constructing the low-dimensional representations

5. Other Dimensionality Reduction Techniques

  • Multidimensional Scaling (MDS)
    reduces dimensionality while trying to preserve the distances between the instances
  • Isomap
    trying to preserve the geodesic distances between the instances
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    trying to keep similar instances close and dissimilar instances apart; mostly used for visualization
  • Linear Discriminant Analysis (LDA)
    is actually a classification algorithm, but the projection it learns keeps the classes as far apart as possible (see the sketch after this list)
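
A rough sketch of how these techniques are invoked in scikit-learn; the input X (plus labels y for LDA) and the choice of 2 output dimensions are illustrative assumptions, not from the original notes:

from sklearn.manifold import MDS, Isomap, TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_mds = MDS(n_components=2).fit_transform(X)      # tries to preserve pairwise distances
X_iso = Isomap(n_components=2).fit_transform(X)   # tries to preserve geodesic distances
X_tsne = TSNE(n_components=2).fit_transform(X)    # keeps similar instances close; for visualization
# LDA is supervised: it needs the labels y, and n_components must be smaller than the number of classes
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)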
