Dimensionality Reduction

1. The Curse of Dimensionality

  • The main motivations for dimensionality reduction
    1. To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better)
    2. To visualize the data and gain insights on the most important features
    3. Simply to save space (compression)
  • The main drawbacks
    1. Some information is lost, possibly degrading the performance of subsequent training algorithms
    2. It can be computationally intensive
    3. It adds some complexity to your Machine Learning pipelines
    4. Transformed features are often hard to interpret

2. Main Approaches for Dimensionality Reduction

  • Projection
    Many features are almost constant or highly correlated, so the high-dimensional data can be projected onto a much lower-dimensional subspace
  • Manifold Learning
    A d-dimensional manifold is rolled up (embedded) in an n-dimensional space and can be unrolled back to d dimensions; the manifold assumption is that high-dimensional data actually lies on a much lower-dimensional manifold (see the Swiss-roll sketch after this list)
    If you reduce the dimensionality of your training set before training a model, it will definitely speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset
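For intuition, here is a minimal sketch of the Swiss-roll dataset commonly used to illustrate manifold learning, assuming scikit-learn's make_swiss_roll helper (sample size and noise level are illustrative):

from sklearn.datasets import make_swiss_roll

# Generate a 3D Swiss roll: the points actually lie on a rolled-up 2D manifold
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
print(X_swiss.shape)  # (1000, 3)
# A plain projection onto two axes would squash different layers of the roll
# together; a manifold method instead tries to "unroll" it back to 2D.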

3. PCA

  • Main idea
    First it identifies the hyperplane that lies closest to the data (the one that preserves the maximum amount of variance), then it projects the data onto it
  • Principal Components
    The unit vector that defines the ith axis is called the ith principal component (PC). The PCs are the unit axis vectors of the projection hyperplane; an n-dimensional dataset has n principal components. The sign/direction of each PC is not important; what matters is the axes (the plane) they define
# Singular Value Decomposition (SVD): compute the principal components of the matrix
import numpy as np

X_centered = X - X.mean(axis=0)       # PCA requires the data to be centered on the origin
U, s, Vt = np.linalg.svd(X_centered)  # np.linalg.svd returns V already transposed
c1 = Vt.T[:, 0]                       # first principal component
c2 = Vt.T[:, 1]                       # second principal component
W2 = Vt.T[:, :2]                      # W_d: matrix of the first two PCs
X2D = X_centered.dot(W2)              # project the training set onto the plane

V^T = \begin{pmatrix} | & | & & | \\ c_1 & c_2 & \cdots & c_n \\ | & | & & | \end{pmatrix}

X_{d\text{-}proj} = X \cdot W_d

  • Projecting Down to d Dimensions (m, d) = (m, n) · (n, d)
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)
# automatically takes care of centering the data
pca.components_.T[:, 0]  # the first principal component
  • Choosing the Right Number of Dimensions: pick the reduced dimensionality d from the cumulative explained variance ratio of the PCs
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print(pca.explained_variance_ratio_)
# e.g. array([0.84248607, 0.14631839]) for the earlier 2D projection

Alternatively, directly set the total fraction of variance we want to preserve:

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

Or plot the cumulative explained variance as a function of the number of dimensions and pick the elbow of the curve, as sketched below.
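A minimal sketch of that elbow plot, assuming matplotlib is available and reusing the cumsum array computed above:

import matplotlib.pyplot as plt

plt.plot(range(1, len(cumsum) + 1), cumsum)  # cumulative explained variance vs. d
plt.axhline(y=0.95, linestyle="--")          # e.g. mark a 95% variance threshold
plt.xlabel("Number of dimensions d")
plt.ylabel("Cumulative explained variance")
plt.show()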

  • PCA inverse transform: from (m, d) = (m, n) · (n, d) back to (m, n) = (m, d) · (d, n)
# decompress the reduced dataset back to the original number of dimensions
pca = PCA(n_components = 154)
X_mnist_reduced = pca.fit_transform(X_mnist)
X_mnist_recovered = pca.inverse_transform(X_mnist_reduced)

The reconstruction error is the mean squared distance between the original data and the reconstructed data (see the sketch below):

X_{recovered} = X_{d\text{-}proj} \cdot W_d^T
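A minimal sketch of that reconstruction error, reusing X_mnist and X_mnist_recovered from the block above:

import numpy as np

# mean squared distance between the original and the reconstructed instances
reconstruction_error = np.mean(np.sum((X_mnist - X_mnist_recovered) ** 2, axis=1))
print(reconstruction_error)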

  • Incremental PCA (IPCA): online PCA
    Supports incremental fitting via partial_fit, designed for datasets too large to fit in memory and for online learning
from sklearn.decomposition import IncrementalPCA
import numpy as np

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_mnist, n_batches):
    inc_pca.partial_fit(X_batch)
X_mnist_reduced = inc_pca.transform(X_mnist)

Alternatively, use NumPy's memmap class, which keeps the data in binary form on disk and loads it into memory only as needed:

# filename refers to a binary file on disk holding the (m, n) training set
X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)
  • Randomized PCA: for when you want to drastically reduce dimensionality, fast
    A stochastic algorithm that quickly finds an approximation of the first d principal components; its complexity is O(m × d²) + O(d³) instead of O(m × n²) + O(n³)
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_mnist)
  • Kernel PCA (kPCA): mainly used for nonlinear data
    Applying the kernel trick to PCA yields kPCA, which can perform complex nonlinear projections
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

Selecting a Kernel and Tuning Hyperparameters
With labels: train a classifier on the reduced data and compare accuracies to select the kPCA hyperparameters (e.g. via grid search)

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

clf = Pipeline([("kpca", KernelPCA(n_components=2)),
                ("log_reg", LogisticRegression())])
param_grid = [{"kpca__gamma": np.linspace(0.03, 0.05, 10),
               "kpca__kernel": ["rbf", "sigmoid"]}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)

Without labels: use the original instances as the targets, reconstruct them from the reduced data, and select the hyperparameters with the lowest reconstruction (pre-image) error

from sklearn.metrics import mean_squared_error

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)
print(mean_squared_error(X, X_preimage))

4. LLE

  • A Manifold Learning technique (LLE = Locally Linear Embedding)

First, measure how each training instance linearly relates to its closest neighbors (c.n.):

\hat{W} = \underset{w}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| x^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j} x^{(j)} \right\|^2 \quad \text{subject to} \quad \begin{cases} w_{i,j} = 0 & \text{if } x^{(j)} \text{ is not one of the } k \text{ c.n. of } x^{(i)} \\ \sum_{j=1}^{m} \hat{w}_{i,j} = 1 & \text{for } i = 1, 2, \cdots, m \end{cases}

Then, look for a low-dimensional representation of the training set that preserves these local relationships; LLE is particularly good at unrolling rolled-up (Swiss-roll-like) data:

\hat{Z} = \underset{z}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| z^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j} z^{(j)} \right\|^2

from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)
  • Computational complexity: LLE scales poorly to very large datasets
    1. O(m log(m) n log(k)) for finding the k nearest neighbors
    2. O(m n k³) for optimizing the weights
    3. O(d m²) for constructing the low-dimensional representations

5. Other Dimensionality Reduction Techniques

  • Multidimensional Scaling (MDS)
    reduces dimensionality while trying to preserve the distances between the instances
  • Isomap
    trying to preserve the geodesic distances between the instances
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    trying to keep similar instances close and dissimilar instances apart; mostly used for visualization
  • Linear Discriminant Analysis (LDA)
    is actually a classification algorithm, but the projection it learns keeps classes as far apart as possible, so it is also useful as a dimensionality reduction step (all four techniques are sketched below)
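A minimal sketch of how these techniques can be called in scikit-learn; the n_components values are illustrative, and X, y stand for the dataset and labels used throughout this section:

from sklearn.manifold import MDS, Isomap, TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_mds = MDS(n_components=2).fit_transform(X)
X_iso = Isomap(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2).fit_transform(X)
# LDA is supervised: it needs the labels y, with at most (n_classes - 1) components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)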
