Introduction
Random Projections have emerged as a powerful method for dimensionality reduction. Theoretical results indicate that it preserves distances quite nicely but empirical results are sparse. It is often employed in dimensionality reduction in both noisy and noiseless data especially image and text data. Results of projecting on random lowerdimensional subspace yields results comparable to conventional methods like PCA etc but using it is computationally less expensive than the traditional alternatives.
Curse of Dimensionality
 High dimensional data restricts the choice of data processing methods.
 A statistically optimal way of dimensionality reduction is to project the data onto a lowerdimensional orthogonal subspace that captures as much of the variations of the data as possible. The most widely used method of this sort is PCA (Principal Component Analysis).
 Drawback of PCA is however that it is computationally expensive to calculate as the dimensions of data increases.
JohnsonLindenstrauss Lemma
If points in vector space are projected onto a randomly selected subspace of suitably high dimensions, then the distances between the points are approximately preserved.
Random Projection (RP)
 In RP, a higher dimensional data is projected onto a lowerdimensional subspace using a random matrix whose columns have unit length.
 RP is computationally efficient, yet accurate enough for this purpose as it does not introduce a significant distortion in the data.

It is not sensitive to impulse noise. So RP is promising alternative to some existing methods in noise reduction (like mean filtering) too.
 The original ddimensional data is projected to a kdimensional \((k \lt \lt d)\) through the origin, using a random \(k * d\) matrix R whose columns have unit lengths. It is given by
 Where
 \(X_{k*N}^{RP}\) is the kdimensional random projection
 \(R_{k*d}\) is the random matrix used for transformation
 \(X_{d*N}\) are the original set of N ddimensional observations

The key idea of random mapping arises from JohnsonLindenstrauss Lemma
 Complexity
 Forming a random matrix R and projecting d * N data matrix X into k dimensions is of the order O(dkN)
 If X is a sparse matrix with c nonzero values per column, then the complexity is O(ckN)

Theoretically, equation (1) is not a projection because R is generally not orthogonal. A linear mapping like (1) can cause significant distortion in data if R is not orthogonal. Orthogonalizing R is computationally expensive.

Instead of orthogonalizing, RP relies on the result presented by HechtNeilsen i.e. In a high dimensional space, there exists a much larger number of almost orthogonal than orthogonal directions. Thus the vectors with random directions might be sufficiently close to orthogonal, and equivalently \(R^TR\) would approximate an identity matrix.

Experimental results show the mean squared difference between \(R^TR\) and identity matrix is around 1/k per element.
 Euclidean distances between \(x_1\) and \(x_2\) in the original largedimensional space is given by
 After RP, this distance can be approximated by scaled Euclidean distance of these vectors in reduced spaces:

Where d is the original and k is the reduced dimensionality of the data set and scaling factor \(\sqrt{d \over k}\) takes into account the decrease in dimensionality of the data set.

According to JohnsonLindenstrauss Lemma, the expected norm of a projection of unit vector onto a random subspace through origin is \(\sqrt{k \over d}\).

The choice of R matrix is generally such that \(r_{ij}\) of R are often Gaussian Distributed although many other choices are available. There is a peculiar result which says one can use the following distribution which is much simpler.

Practically, all zero mean, unit variance distributions of \(r_{ij}\) would give a mapping that satisfies the JohnsonLindenstrauss Lemma.

Equation (2) helps reduce the computational expense even further.
Principal Component Analysis (PCA)
 PCA is eigenvalue decomposition of the data covariance matrix. It is computed as
 Where
 columns of E are the eigenvectors of the data covariance matrix \(E{XX^T}\)
 \(\Lambda\) is a the diagonal matrix containing the respective eigenvalues.
 Dimensionality reduction of data set is obtained by projection on a subspace spanned by most important eigenvectors, given by
 Where
 The d * k matrix \(E_k\) contains the k eigenvectors corresponding to the k larges eigenvalues.

PCA is the optimal way to project data in the meansquare sense, i.e. the squared error introduced in the projection is minimized over all projections onto a kdimensional space.

Eigen value decomposition of the data covariance matrix is very expensive.
 Computational Complexity : \(O(d^2N) + O(d^3)\)
Singular Value Decomposition (SVD)
 Closely related to PCA, singular value decomposition is given by
 Where
 Orthogonal matrices U and V contain the left and right singular vectors of X respectively
 The diagonal matrix S contains the singular values of X.
 Using SVD, the dimensionality of data can be reduced by projecting data onto the space spanned by the left singular vectors corresponding to the k largest singular values, given by

Where \(U_k\) is of size d * k and contains k singular vectors.

Like PCA, SVD is also expensive to compute.

For sparse data matrix \(X_{d*N}\) with about c nonzero entries per column, the computational complexity is of the order O(dcN).
Latent Semantic Indexing (LSI)

It is a dimensionality reduction method for text document data.

Using LSI, the document data is represented in a lowerdimensional “topic” space: the documents are characterized by some underlying (latent, hidden) concepts referred to by the terms.

LSI can be computed either by PCA or SVD of the data matrix of N ddimensional document vectors.
Discrete Cosine Transform (DCT)

Widely used for image compression and can be used for dimensionality reduction of image data.

Computational more efficient than PCA and has performance approachable to PCA.

DCT is optimal for human eye: the distortions introduced occur at the highest frequencies only, neglected by human eye as noise.

DCT can be performed by simple matrix operations: Image is first transformed to DCT space and dimensionality reduction is achieved during inverse transform by discarding the transform coefficients corresponding to highest frequencies.

Computing DCT is not datadependent, unlike PCA that needs the eigenvalue decomposition of data covariance matrix, which is why DCT is orders of magnitude cheaper to compute than PCA.

Computational Complexity : \(O(dN\,log_2(dN))\) for data matrix of size d * N.
Notes About Random Projection

RP does not distort the data significantly more than PCA. Also at smaller dimensions PCA seems to distort data more than RP because of removal of significantly important eigenvectors by elimination.

Computational complexity of RP is significantly lesser than other methods like PCA, but more than DCT.

So it can be inferred that RP are a better choice given the trade off of DCT accuracy for reduced complexity.

At smaller dimension RP outperforms DCT both on accuracy and complexity.

Use case can often be a factor in considering the most optimal way of dimensionality reduction. Say, even though RP outperforms DCT, it cannot be used for purposes where aim is to transmit the minimized dataset and reobtain original data on the other end for human viewing.

Psuedoinverse computation of R is expensive to compute but because R is almost orthogonal, \(R^T\) would be a good approximation of psuedoinverse. So the image can be computed from RP as,

Where \(X_{k*N}^{RP}\) is the result of random projection. But the obtained image is visually worse than a DCT compressed image, to a human eye.

So RP would serve very well in applications where distance or similarity between data vectors should be preseved under dimensionality reduction as well as possible, but where data is not intended to be visualized for the human eye, eg. machine vision.

In case of text dataset, the error in dimensionality reduction is calculated by calculating the inner product among randomly chosen data pairs before and after transform. It is observed that RP is not as accurate as SVD but the error may be neglectable in various use cases. The possible cause could be that the JohnsonLindenstrauss makes statement about Euclidean distance, but Inner product is a different metric even if Euclidean distance is maintained well.
REFERENCES
Random projection in dimensionality reduction: Applications to image and text data
Various Embeddings on Digits Data  Python Sklearn