Dimensionality Reduction

1. The Curse of Dimensionality

  • The main motivations for dimensionality reduction
    1. To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better)
    2. To visualize the data and gain insights on the most important features
    3. Simply to save space (compression)
  • The main drawbacks
    1. Some information is lost, possibly degrading the performance of subsequent training algorithms
    2. It can be computationally intensive
    3. It adds some complexity to your Machine Learning pipelines
    4. Transformed features are often hard to interpret

2. Main Approaches for Dimensionality Reduction

  • Projection
    In many real-world problems, many features are almost constant or highly correlated, so the training instances lie close to a much lower-dimensional subspace; projection maps the high-dimensional data onto that subspace.
  • Manifold Learning
    A d-dimensional manifold is rolled up in an n-dimensional space and can be unrolled back to d dimensions; the assumption is that the high-dimensional data was generated from lower-dimensional data (see the Swiss-roll sketch below).
    Note that if you reduce the dimensionality of your training set before training a model, it will definitely speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.
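
A minimal sketch of a manifold dataset (assuming scikit-learn is available; not part of the original notes): make_swiss_roll generates the classic 2D manifold rolled up in 3D space.

from sklearn.datasets import make_swiss_roll

# 1,000 instances living in 3D, but all lying on a rolled-up 2D manifold
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
print(X_swiss.shape)  # (1000, 3)
# Projecting onto any 2 of the 3 axes squashes the layers of the roll together;
# Manifold Learning algorithms (e.g. LLE in section 4) unroll it instead.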

3. PCA

  • Main idea
    First it identifies the hyperplane that lies closest to the data (the one that preserves the maximum amount of variance), then it projects the data onto that hyperplane.
  • Principal Components
    The unit vector that defines the ith axis is called the ith principal component (PC). The PCs are the unit axis vectors of the projection subspace, and n-dimensional data has n of them. The direction of an individual PC is not important; what matters is the plane the PCs define.
# Use Singular Value Decomposition (SVD) to obtain the principal components
import numpy as np

X_centered = X - X.mean(axis=0)   # PCA assumes the data is centered on the origin
U, s, V = np.linalg.svd(X_centered)
c1 = V.T[:, 0]            # first principal component
c2 = V.T[:, 1]            # second principal component
W2 = V.T[:, :2]           # matrix containing the first two PCs
X2D = X_centered.dot(W2)  # project the data onto the first two PCs

$$V^T = \begin{pmatrix} \mid & \mid & & \mid \\ c_1 & c_2 & \cdots & c_n \\ \mid & \mid & & \mid \end{pmatrix}$$

$$X_{d\text{-}proj} = X \cdot W_d$$

  • Projecting Down to d Dimensions (m, d) = (m, n) · (n, d)
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)
# automatically takes care of centering the data
pca.components_.T[:, 0]  # unit vector of the first principal component (c1)
  • Choosing the Right Number of Dimensions: choose the target dimension d from the cumulative explained variance ratio of the PCs
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print(pca.explained_variance_ratio_)
# array([0.84248607, 0.14631839])

Alternatively, set n_components to the fraction of variance you want to preserve:

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

Or plot the cumulative explained variance as a function of the number of dimensions and pick the elbow, as in the sketch below.
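
A minimal plotting sketch (assuming X is the training set and matplotlib is available; not part of the original notes):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X)                                  # keep all components
cumsum = np.cumsum(pca.explained_variance_ratio_)   # cumulative variance curve
plt.plot(range(1, len(cumsum) + 1), cumsum)
plt.xlabel("Number of dimensions d")
plt.ylabel("Cumulative explained variance")
plt.show()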

  • PCA inverse transform: from (m, d) = (m, n) · (n, d) back to (m, n) = (m, d) · (d, n)
# decompress the reduced data back to the original number of dimensions
pca = PCA(n_components = 154)
X_mnist_reduced = pca.fit_transform(X_mnist)
X_mnist_recovered = pca.inverse_transform(X_mnist_reduced)

The reconstruction error is the mean squared distance between the original data and the reconstructed data:

$$X_{recovered} = X_{d\text{-}proj} \cdot W_d^T$$
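
A minimal sketch of computing it (assuming X_mnist and X_mnist_recovered from the block above; not part of the original notes):

import numpy as np

# mean squared distance between each original instance and its reconstruction
reconstruction_error = np.mean(np.sum((X_mnist - X_mnist_recovered) ** 2, axis=1))
print(reconstruction_error)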

  • Incremental PCA (IPCA)
    Supports incremental (online) fitting on mini-batches, designed for datasets that are too large to fit in memory and for online learning.
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_mnist, n_batches):
    inc_pca.partial_fit(X_batch)
X_mnist_reduced = inc_pca.transform(X_mnist)

Alternatively, use NumPy's memmap class, which stores the data in binary form on disk and loads it into memory only as needed:

X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)
  • Randomized PCA
    Useful when you want to drastically reduce dimensionality quickly: a stochastic algorithm that quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for full SVD.
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_mnist)
  • Kernel PCA (kPCA)
    Applies the kernel trick to PCA, which makes it possible to perform complex nonlinear projections; mainly used for nonlinear data.
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

Selecting a Kernel and Tuning Hyperparameters
With labels: train a classifier on the reduced data and use grid search to pick the kPCA kernel and hyperparameters that give the best accuracy:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

clf = Pipeline([("kpca", KernelPCA(n_components=2)),
                ("log_reg", LogisticRegression())])
param_grid = [{"kpca__gamma": np.linspace(0.03, 0.05, 10),
               "kpca__kernel": ["rbf", "sigmoid"]}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)

Without labels: use the original data as the target, enable the inverse transform, and pick the hyperparameters that minimize the reconstruction pre-image error:

from sklearn.metrics import mean_squared_error

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)
mean_squared_error(X, X_preimage)

4. LLE

  • Locally Linear Embedding (LLE) is a Manifold Learning technique

First it measures how each training instance linearly relates to its closest neighbors (c.n.):
$$\hat{W} = \underset{w}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| x^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\, x^{(j)} \right\|^2 \quad \text{subject to} \quad \begin{cases} w_{i,j} = 0 & \text{if } x^{(j)} \text{ is not one of the } k \text{ c.n. of } x^{(i)} \\ \sum_{j=1}^{m} \hat{w}_{i,j} = 1 & \text{for } i = 1, 2, \cdots, m \end{cases}$$
Then it looks for a low-dimensional representation of the training set in which these local relationships are preserved. LLE is especially good at unrolling rolled-up (Swiss-roll-like) data:
$$\hat{Z} = \underset{z}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| z^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\, z^{(j)} \right\|^2$$

from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)
  • Computational complexity (LLE scales poorly to very large datasets)
    1. O(m log(m)n log(k)) for finding the k nearest neighbors
    2. O(mnk^3) for optimizing the weights
    3. O(dm^2) for constructing the low-dimensional representations

5. Other Dimensionality Reduction Techniques

  • Multidimensional Scaling (MDS)
    reduces dimensionality while trying to preserve the distances between the instances
  • Isomap
    trying to preserve the geodesic distances between the instances
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    trying to keep similar instances close and dissimilar instances apart; mostly used for visualization
  • Linear Discriminant Analysis (LDA)
    is actually a classification algorithm
    the projection will keep classes as far apart as possible (scikit-learn usage for these techniques is sketched below)
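
A minimal usage sketch (assuming X and, for LDA, labels y with at least three classes are loaded; not part of the original notes). The scikit-learn estimators for the techniques above follow the same fit_transform interface as PCA:

from sklearn.manifold import MDS, Isomap, TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_mds = MDS(n_components=2).fit_transform(X)      # preserve pairwise distances
X_iso = Isomap(n_components=2).fit_transform(X)   # preserve geodesic distances
X_tsne = TSNE(n_components=2).fit_transform(X)    # keep similar instances close
# LDA is supervised: it needs the labels y and allows at most n_classes - 1 components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)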

