Dimensionality Reduction

1. The Curse of Dimensionality

  • The main motivations for dimensionality reduction
    1. To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better)
    2. To visualize the data and gain insights on the most important features
    3. Simply to save space (compression)
  • The main drawbacks
    1. Some information is lost, possibly degrading the performance of subsequent training algorithms
    2. It can be computationally intensive
    3. It adds some complexity to your Machine Learning pipelines
    4. Transformed features are often hard to interpret

2. Main Approaches for Dimensionality Reduction

  • Projection
    Many features are almost constant or highly correlated, so the training instances actually lie close to a much lower-dimensional subspace; projection maps the high-dimensional data onto that subspace.
  • Manifold Learning
    A d-dimensional manifold rolled up inside an n-dimensional space can be unrolled back to d dimensions; the assumption is that the high-dimensional data was generated by transforming an underlying lower-dimensional dataset (see the Swiss-roll sketch after this list).
    If you reduce the dimensionality of your training set before training a model, it will definitely speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.
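
A minimal, assumed setup (not part of the original notes): the later PCA and kPCA snippets operate on an unspecified array X, and scikit-learn's Swiss-roll generator is one way to produce such a rolled-up 2D manifold to experiment with.

import numpy as np
from sklearn.datasets import make_swiss_roll

# X is a 2D manifold rolled up in 3D space; t is the position along the roll
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
print(X.shape)  # (1000, 3)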

3. PCA

  • Main idea
    First it identifies the hyperplane that lies closest to the data (the one that preserves the maximum amount of variance), then it projects the data onto it.
  • Principal Components
    The unit vector that defines the ith axis is called the ith principal component (PC). The PCs are the unit axis vectors of the projection subspace, and an n-dimensional dataset has n of them. The direction (sign) of a PC is not important; what matters is the axis it defines.
# Singular Value Decomposition (SVD) yields the principal components of the matrix
import numpy as np

X_centered = X - X.mean(axis=0)      # PCA assumes the data is centered on the origin
U, s, V = np.linalg.svd(X_centered)  # note: np.linalg.svd returns V already transposed
c1 = V.T[:, 0]                       # first principal component
c2 = V.T[:, 1]                       # second principal component
W2 = V.T[:, :2]                      # W_d: matrix whose columns are the first two PCs
X2D = X_centered.dot(W2)             # project the training set onto the 2D plane

$$V^T = \begin{pmatrix} \mid & \mid & & \mid \\ c_1 & c_2 & \cdots & c_n \\ \mid & \mid & & \mid \end{pmatrix}$$

$$X_{d\text{-}proj} = X \cdot W_d$$

  • Projecting Down to d Dimensions (m, d) = (m, n) · (n, d)
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)     # automatically takes care of centering the data
c1 = pca.components_.T[:, 0]   # first principal component
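
As a quick sanity check (a sketch that assumes X_centered and W2 from the SVD block above; X2D_svd is a name introduced here for illustration), the sklearn projection should match the manual one up to the sign of each axis, since a PC's direction is not unique:

# compare the manual SVD projection with sklearn's result (signs may flip per axis)
X2D_svd = X_centered.dot(W2)
print(np.allclose(np.abs(X2D_svd), np.abs(X2D)))  # True: same axes up to sign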
  • Choosing the Right Number of Dimensions: pick the target dimension d from the explained-variance ratio of the principal components
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1  # smallest d that preserves at least 95% of the variance
print(pca.explained_variance_ratio_)
# e.g. for the earlier 2-component fit: array([0.84248607, 0.14631839])

Alternatively, set n_components to the fraction of variance you want to preserve:

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

Or plot the cumulative explained variance as a function of the number of dimensions and pick the elbow of the curve, as sketched below.
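
A minimal plotting sketch, assuming the cumsum array computed above (matplotlib import added here):

import matplotlib.pyplot as plt

plt.plot(np.arange(1, len(cumsum) + 1), cumsum)  # dimensions vs. preserved variance
plt.axhline(y=0.95, linestyle="--")              # the 95% threshold used above
plt.xlabel("Number of dimensions d")
plt.ylabel("Cumulative explained variance")
plt.show()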

  • PCA inverse transform: from (m, d) = (m, n) · (n, d) back to (m, n) = (m, d) · (d, n)
# decompress the reduced data back to the original space
pca = PCA(n_components=154)
X_mnist_reduced = pca.fit_transform(X_mnist)
X_mnist_recovered = pca.inverse_transform(X_mnist_reduced)

The reconstruction error is the mean squared distance between the original data and the reconstructed (compressed and then decompressed) data.

$$X_{recovered} = X_{d\text{-}proj} \cdot W_d^T$$
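
A short sketch of computing that error, assuming X_mnist and X_mnist_recovered from the snippet above:

# mean squared distance between original and reconstructed instances
reconstruction_error = np.mean(np.sum((X_mnist - X_mnist_recovered) ** 2, axis=1))
print(reconstruction_error)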

  • Incremental PCA (online PCA)
    Incremental PCA (IPCA) supports incremental (partial_fit) training; it is designed for training sets that are too large to fit in memory and for online learning.
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_mnist, n_batches):
    inc_pca.partial_fit(X_batch)   # feed the mini-batches one at a time
X_mnist_reduced = inc_pca.transform(X_mnist)

Alternatively, use NumPy's memmap class, which stores the data in binary form on disk and loads it into memory only as needed:

X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)
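
The snippet above assumes the data has already been written to disk; as a hedged sketch (the filename is hypothetical), one way to create that file is:

# write the data to a disk-backed array first (hypothetical path)
filename = "mnist_memmap.dat"
m, n = X_mnist.shape
X_mm_write = np.memmap(filename, dtype="float32", mode="w+", shape=(m, n))
X_mm_write[:] = X_mnist   # copy the data into the memory-mapped file
X_mm_write.flush()        # make sure everything is flushed to disk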
  • Randomized PCA: use it when you need to quickly and drastically reduce dimensionality
    A stochastic algorithm that quickly finds an approximation of the first d principal components; its computational complexity is O(m × d²) + O(d³) instead of O(m × n²) + O(n³).
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_mnist)
  • Kernel PCA: mainly used for nonlinear data
    Applying the kernel trick to PCA gives Kernel PCA (kPCA), which can perform complex nonlinear projections.
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

Selecting a Kernel and Tuning Hyperparameters
With labels: train a classifier on top of the kPCA output and grid-search the kPCA hyperparameters for the best classification accuracy.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

clf = Pipeline([("kpca", KernelPCA(n_components=2)),
                ("log_reg", LogisticRegression())])
param_grid = [{"kpca__gamma": np.linspace(0.03, 0.05, 10),
               "kpca__kernel": ["rbf", "sigmoid"]}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)

Without labels: select the kernel and hyperparameters that yield the lowest reconstruction (pre-image) error; with fit_inverse_transform=True, kPCA learns the inverse mapping by training a regression model whose targets are the original instances.

from sklearn.metrics import mean_squared_error

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)
print(mean_squared_error(X, X_preimage))
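
A minimal sketch of that label-free selection (the gamma grid here is an arbitrary assumption, not from the original notes):

# pick the gamma with the lowest pre-image reconstruction error
best_gamma, best_error = None, float("inf")
for gamma in np.linspace(0.03, 0.05, 10):
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    X_preimage = kpca.inverse_transform(kpca.fit_transform(X))
    error = mean_squared_error(X, X_preimage)
    if error < best_error:
        best_gamma, best_error = gamma, error
print(best_gamma, best_error)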

4. LLE (Locally Linear Embedding)

  • a Manifold Learning technique

first measuring how each training instance linearly relates to its closest neighbors (c.n.):

$$\hat{W} = \underset{W}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| x^{(i)} - \sum_{j=1}^{m} w_{i,j}\, x^{(j)} \right\|^2 \quad \text{subject to} \quad \begin{cases} w_{i,j} = 0 & \text{if } x^{(j)} \text{ is not one of the } k \text{ c.n. of } x^{(i)} \\ \sum_{j=1}^{m} \hat{w}_{i,j} = 1 & \text{for } i = 1, 2, \cdots, m \end{cases}$$
then looking for a low-dimensional representation of the training set in which these local relationships are best preserved. LLE is particularly good at unrolling rolled-up (Swiss-roll-like) manifolds.

$$\hat{Z} = \underset{Z}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| z^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\, z^{(j)} \right\|^2$$

from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)
  • Computational complexity: scales poorly to very large datasets
    1. O(m log(m) n log(k)) for finding the k nearest neighbors
    2. O(m n k³) for optimizing the weights
    3. O(d m²) for constructing the low-dimensional representations

5. Other Dimensionality Reduction Techniques

  • Multidimensional Scaling (MDS)
    reduces dimensionality while trying to preserve the distances between the instances
  • Isomap
    trying to preserve the geodesic distances between the instances
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    trying to keep similar instances close and dissimilar instances apart; mostly used for visualization
  • Linear Discriminant Analysis (LDA)
    is actually a classification algorithm, but during training it learns a projection that keeps the classes as far apart as possible, so it is a good way to reduce dimensionality before running another classifier (see the sketch below)
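
All four techniques are available in scikit-learn; a hedged sketch (parameters are illustrative, and LDA is supervised, needing class labels y with at least three classes for a 2D projection):

from sklearn.manifold import MDS, Isomap, TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_mds = MDS(n_components=2).fit_transform(X)      # preserve pairwise distances
X_iso = Isomap(n_components=2).fit_transform(X)   # preserve geodesic distances
X_tsne = TSNE(n_components=2).fit_transform(X)    # keep similar instances close
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised: y needs >= 3 classes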

