DiffRed: Dimensionality Reduction guided by stable rank

This is the official repository containing the code for the experiments of our AISTATS 2024 paper, DiffRed: Dimensionality Reduction guided by stable rank.

Setup

The DiffRed package may be installed either from PyPI or from source.

PyPI Installation:

  pip install diffred

Installation from source:

git clone https://github.com/S3-Lab-IIT/DiffRed.git

cd DiffRed

pip install -r requirements.txt

pip install -e .

Example Usage

from DiffRed import DiffRed
import numpy as np

n = 100  # number of samples
D = 50   # original dimensionality
data = np.random.normal(size=(n, D))
dr = DiffRed(k1=5, k2=5)
embeddings = dr.fit_transform(data)
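
For intuition about the two parameters, the sketch below follows the description in the paper: the data is projected along its first k1 principal components, and the residual is projected along k2 Gaussian random vectors. It is not the package's implementation. The function name `diffred_sketch`, the centering step, and the 1/sqrt(k2) scaling are assumptions made for illustration, and the package may differ in all of them.

```python
import numpy as np


def diffred_sketch(data, k1, k2, seed=0):
    """Illustrative only: k1 principal components plus a k2-dim random projection of the residual."""
    rng = np.random.default_rng(seed)
    centered = data - data.mean(axis=0)  # assumption: data is centered first

    # The top-k1 right singular vectors are the leading principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k1].T                      # shape (D, k1)
    pca_part = centered @ top            # shape (n, k1)

    # Residual left after subtracting the rank-k1 approximation.
    residual = centered - pca_part @ top.T

    # Gaussian random map, scaled so squared norms are preserved in expectation.
    gaussian = rng.normal(size=(data.shape[1], k2)) / np.sqrt(k2)
    random_part = residual @ gaussian    # shape (n, k2)

    return np.hstack([pca_part, random_part])


data = np.random.default_rng(1).normal(size=(100, 50))
embedding = diffred_sketch(data, k1=5, k2=5)
print(embedding.shape)  # (100, 10)
```

Under this reading, the embedding has k1 + k2 columns.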

Using the parallel stress package

Currently, the parallel stress implementation can only be used by installing from source.

git clone https://github.com/S3-Lab-IIT/DiffRed.git

cd DiffRed

After cloning, parallel stress can be imported like a regular package.

Example Usage

from parallel_stress import stress as pstress
from share_array.share_array import get_shared_array, make_shared_array
import numpy as np
from Experiments.dimensionality_reduction_metrics.metrics import distance_matrix
from DiffRed import DiffRed

n = 100  # number of samples
D = 50   # original dimensionality

data = np.random.normal(size=(n, D))
dist_matrix = distance_matrix(data, None, None, None)
dr = DiffRed(k1=5, k2=5)
embedding_matrix = dr.fit_transform(data)

# Register both arrays as shared arrays under the names passed to pstress below
make_shared_array(dist_matrix, name='dist_matrix')
make_shared_array(embedding_matrix, name='embedding_matrix')

stress = pstress('dist_matrix', 'embedding_matrix')

print(f"Stress: {stress}")
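
To make the computed quantity concrete, here is a small single-process NumPy sketch of stress. It uses one common definition: the square root of the summed squared differences between original and embedded pairwise distances, divided by the summed squared original distances. Whether parallel_stress uses exactly this normalization is an assumption, and the helper names `pairwise_distances` and `stress` are illustrative, not part of the package.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(original, embedded):
    """sqrt( sum (d_ij - d'_ij)^2 / sum d_ij^2 ) over all pairs (i, j)."""
    d_orig = pairwise_distances(original)
    d_emb = pairwise_distances(embedded)
    return np.sqrt(((d_orig - d_emb) ** 2).sum() / (d_orig ** 2).sum())


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 50))
print(stress(data, data))          # identical distances give 0.0
print(stress(data, data[:, :10]))  # dropping coordinates shrinks distances, so > 0
```

This version materializes an n x n x D array, so it is only suitable for small inputs.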

Reproducing Experiment Results

For reproducing our experiment results, refer to the scripts in the Experiments/ directory.

Low and Moderate Dimensional Datasets

For datasets of low to moderate dimensionality, the code can be run with relatively little memory and GPU capacity. The datasets from our experiments that fall under this category are:

Dataset	$\mathbf{D}$	$\mathbf{n}$	Stable Rank	Domain
Bank	17	45K	1.48	Finance
hatespeech	100	3.2K	11.00	NLP
FMNIST	784	60K	2.68	Image
Cifar10	3072	50K	6.13	Image
geneRNASeq	20.53K	801	1.12	Biology
Reuters30k	30.92K	10.7K	14.50	NLP
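
For reference, the stable rank of a matrix is its squared Frobenius norm divided by its squared spectral norm. The sketch below computes it with NumPy on synthetic data. The function name `stable_rank` and the choice to center the data first are assumptions, so it may not reproduce the values in the table exactly.

```python
import numpy as np


def stable_rank(matrix, center=True):
    """Squared Frobenius norm divided by squared spectral norm."""
    if center:
        matrix = matrix - matrix.mean(axis=0)  # assumption: center columns first
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return (singular_values ** 2).sum() / singular_values[0] ** 2


rng = np.random.default_rng(0)
isotropic = rng.normal(size=(200, 20))
# Scale the first column so that a single direction dominates the spectrum.
dominated = isotropic * np.array([50.0] + [1.0] * 19)

print(stable_rank(isotropic))
print(stable_rank(dominated))  # close to 1
```

A stable rank close to 1 means that most of the energy lies along a single direction.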

The experiments related to these datasets can be run from the scripts available in the Experiments/dimensionality_reduction_metrics directory.

Grid Search Experiments

Note: Before proceeding, make sure that all bash scripts have executable permission. Use the following command:

chmod u+x script-name.sh

To run the grid search experiments for Stress and M1 metrics, follow these steps:

Create the required subdirectories by running:

./create_subdirectories

Now download and preprocess the dataset:

./prepare_datasets dataset

Here, a list of datasets may also be provided as CLI arguments (dataset names separated by spaces). Ensure that the dataset names match the names in the table above (case-sensitive).

Next, compute the distance matrices of the datasets:

./compute_distance_matrix dataset

Now, compute the DiffRed embeddings:

./compute_embeddings [dataset-name] [list of k1 values separated by space] [list of k2 values separated by space] 100

Now, compute Stress and M1 distortion using:

./compute_stress [dataset-name] [save-as] [list of k1 values separated by space] [list of k2 values separated by space] 100

./compute_m1 [dataset-name] [save-as] [list of k1 values separated by space] [list of k2 values separated by space] 100

To use the same $k_1$ and $k_2$ values that we used in the paper, refer to the Excel sheets in the results/M1_results/ and the results/Stress_results directories.
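
The embedding, stress, and M1 commands above share the same argument layout, so they can be assembled programmatically. The following is a minimal sketch that only builds the argument lists and does not run anything. The helper name `grid_commands`, the save-as name `run1`, and the k1/k2 values are placeholders. The trailing `100` is copied verbatim from the commands above.

```python
def grid_commands(dataset, save_as, k1_values, k2_values, last_arg="100"):
    """Build argv lists for the three grid-search scripts without running them."""
    k1 = [str(k) for k in k1_values]
    k2 = [str(k) for k in k2_values]
    return [
        ["./compute_embeddings", dataset, *k1, *k2, last_arg],
        ["./compute_stress", dataset, save_as, *k1, *k2, last_arg],
        ["./compute_m1", dataset, save_as, *k1, *k2, last_arg],
    ]


# Placeholder values; take the real k1/k2 lists from the result sheets.
commands = grid_commands("Bank", "run1", k1_values=[2, 4], k2_values=[6, 8])
for command in commands:
    print(" ".join(command))
```

Each list can then be passed to `subprocess.run(command, check=True)` from the directory that contains the scripts.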

Comparison with other dimensionality reduction algorithms

The scripts to compute the stress using other dimensionality reduction algorithms (PCA, RMap, Kernel PCA, Sparse PCA, t-SNE, UMap) are in the Experiments/dimensionality_reduction_metrics/other_dr_techniques directory.

The compile_results script compiles the best values of the grid search results into an Excel file for all dimensionality reduction techniques (including DiffRed).

Running custom experiments/Extending Research

The repository is designed so that new datasets and dimensionality reduction algorithms can be added, and so that our experiments can be customized and extended. To run experiments with custom datasets or settings, pass them to the Python scripts as CLI arguments. To see what a particular script does and which arguments it accepts, use its help option on the command line:

python3 <script-name>.py --help
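
As an illustration of what the help option reports, the sketch below defines a small argparse-based command line. It assumes the scripts are built on argparse. The description and the flags `--dataset`, `--k1`, and `--k2` are hypothetical and may not match any script in the repository.

```python
import argparse


def build_parser():
    """Hypothetical CLI; running the script with --help would print these help strings."""
    parser = argparse.ArgumentParser(
        description="Hypothetical example script: compute embeddings for a dataset."
    )
    parser.add_argument("--dataset", required=True, help="dataset name, e.g. Bank")
    parser.add_argument("--k1", type=int, nargs="+", default=[5], help="k1 values to try")
    parser.add_argument("--k2", type=int, nargs="+", default=[5], help="k2 values to try")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(["--dataset", "Bank", "--k1", "2", "4"])
print(args.dataset, args.k1, args.k2)  # Bank [2, 4] [5]
```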

Adding a new dataset

A new dataset may be added to the repository by adding a corresponding data class (inherited from Datasets.Dataset) to the Datasets.py file. Then, the get_datasets.py file needs to be updated by adding the download URL and the data class object to the url and objs dictionaries.
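
The two steps can be sketched as follows. Everything here is a stand-in: the real Datasets.Dataset base class may require different methods, and the `preprocess` method, the `MyDataset` class, the dictionary keys, and the URL are hypothetical. Check Datasets.py and get_datasets.py for the actual interface before copying this pattern.

```python
class Dataset:
    """Stand-in for Datasets.Dataset; the real base class may look different."""

    def __init__(self, name):
        self.name = name

    def preprocess(self, raw_rows):
        raise NotImplementedError


class MyDataset(Dataset):
    """Hypothetical dataset whose rows are comma-separated numbers."""

    def __init__(self):
        super().__init__("MyDataset")

    def preprocess(self, raw_rows):
        return [[float(value) for value in row.split(",")] for row in raw_rows]


# Stand-ins for the url and objs dictionaries in get_datasets.py.
url = {}
objs = {}

url["MyDataset"] = "https://example.com/my_dataset.csv"  # placeholder URL
objs["MyDataset"] = MyDataset()

print(objs["MyDataset"].preprocess(["1,2", "3.5,4"]))  # [[1.0, 2.0], [3.5, 4.0]]
```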

Adding a new dimensionality reduction algorithm

A new dimensionality reduction algorithm can be added to the repository by implementing it as a function in other_dr_techniques/dr_techniques.py and adding the initial values of hyperparameters to other_dr_techniques.settings.SETTINGS.
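
As a rough sketch of this pattern, the example below defines a technique as a plain function and registers its initial hyperparameters in a settings dictionary. The function name `gaussian_random_projection`, its signature, and the layout of `SETTINGS` are assumptions for illustration. The actual conventions in dr_techniques.py and settings.py may differ.

```python
import numpy as np


def gaussian_random_projection(data, target_dim, seed=0):
    """Hypothetical technique: project the rows of `data` with a Gaussian random matrix."""
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(data.shape[1], target_dim)) / np.sqrt(target_dim)
    return data @ projection


# Stand-in for other_dr_techniques.settings.SETTINGS; the real layout may differ.
SETTINGS = {
    "gaussian_random_projection": {"seed": 0},
}

data = np.random.default_rng(1).normal(size=(100, 50))
params = SETTINGS["gaussian_random_projection"]
embedding = gaussian_random_projection(data, target_dim=10, **params)
print(embedding.shape)  # (100, 10)
```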

High and Very High Dimensionality Datasets

For datasets of high and very high dimensionality, more memory and GPU capacity may be required. For such datasets, we used a shared commodity cluster. The following datasets from our paper fall in this category:

Dataset	$\mathbf{D}$	$\mathbf{n}$	Stable Rank	Domain
APTOS 2019	509K	13K	1.32	Healthcare
DIV2K	6.6M	800	8.39	High Res Image

The experiment scripts for these datasets can be found at Experiments/high_dimensionality_datasets/. Slurm job scripts have been provided to facilitate usage in HPC environments. The usage is similar to what is described above for low dimensionality datasets.

Reproducing Plots

To reproduce the plots provided in the paper and the supplementary material, use the make_plots.ipynb at Experiments/dimensionality_reduction_metrics. For the stable rank plots (Figure 11, Figure 13) and the spectral value plots (Figure 12), use the plot_stable_rank.py and the compute_spectral_plot.py scripts.

Experiment Results

The results of our experiments may be obtained from the Experiments/dimensionality_reduction_metrics/results and Experiments/high_dimensionality_datasets/results/ directories. For the low and moderate dimensionality datasets, refer to the subdirectories Stress_results and M1_results for the full grid search results. For the high dimensionality datasets, refer to the Excel files aptos2019_M1_results.xlsx, aptos2019_Stress_results.xlsx, DIV2k_M1_results.xlsx, and DIV2k_Stress_results.xlsx for the full grid search results. For both, refer to the subdirectory other_dr_techniques for results of other dimensionality reduction algorithms, and to the compiled_results subdirectory for a comparative summary of the grid search experiments.

Citations

Please cite the paper and star this repository if you use DiffRed and/or find it interesting. Queries regarding the paper or the code may be directed to [email protected]. Alternatively, you may also open an issue on this repository.

@misc{shukla2024diffred,
      title={DiffRed: Dimensionality Reduction guided by stable rank}, 
      author={Prarabdh Shukla and Gagan Raj Gupta and Kunal Dutta},
      year={2024},
      eprint={2403.05882},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
