Rahee S.

Below is a very high-level overview of the project I am currently working on.

Compositional data, i.e., measurements constrained to a simplex $\mathcal{S}^p \subset \mathbb{R}^p$ with $x_j > 0$ and $\sum_j x_j = c$, are the native output of metabolomics, microbiome sequencing, and geochemistry. Because of the constant-sum constraint, only relative information is statistically identifiable, and standard multivariate methods applied naively produce artifacts such as spurious correlations between parts. Log-ratio transformations (Aitchison, 1982) address this by mapping the simplex into a linear subspace of $\mathbb{R}^p$ where Euclidean geometry applies, but existing implementations of sparse multivariate methods do not fully respect the geometric structure these transformations impose. This project develops constrained optimization frameworks for sparse dimensionality reduction and supervised classification on compositional data. The methods enforce the algebraic constraints arising from the compositional geometry exactly at every step of the optimization.

The Centered Log-Ratio (CLR) transformation maps compositions into the hyperplane $\mathbf{1}^\perp = \{\mathbf{z} \in \mathbb{R}^p : \mathbf{1}^\top \mathbf{z} = 0\}$:

$$\text{CLR}(\mathbf{x})_j = \log \frac{x_j}{g(\mathbf{x})}, \qquad g(\mathbf{x}) = \biggl(\prod_{k=1}^p x_k\biggr)^{1/p}$$
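As a concrete reading of the CLR definition above, here is a minimal NumPy sketch. The function name `clr` and the example composition are illustrative choices, not part of the project's code. It uses the identity $\log(x_j / g(\mathbf{x})) = \log x_j - \frac{1}{p}\sum_k \log x_k$, which avoids forming the product explicitly.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of a composition (or of each row of a matrix)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("CLR requires strictly positive parts")
    log_x = np.log(x)
    # Subtracting the mean of the logs is the same as dividing by the geometric mean g(x).
    return log_x - log_x.mean(axis=-1, keepdims=True)

x = np.array([0.2, 0.3, 0.5])   # hypothetical 3-part composition, sums to 1
z = clr(x)

print(z.sum())                      # zero up to floating-point error: z lies in the hyperplane 1-perp
print(np.allclose(clr(10 * x), z))  # rescaling the composition leaves the CLR coordinates unchanged
```

The two checks at the end mirror the two properties the text relies on: the image lies in $\mathbf{1}^\perp$, and the constant $c$ in the sum constraint carries no information.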

This subspace carries specific algebraic structure that interacts nontrivially with penalized estimation. Sparsity-inducing penalties ($\ell_1$, $\ell_0$ relaxations) can produce solutions that violate the constraints inherited from the compositional geometry, even when the input data satisfy them. The resulting estimators lose the interpretive properties that motivate log-ratio methods in the first place. Our work identifies where and why this failure occurs in widely used pipelines and develops optimization formulations that correct it. The approach draws on the structure of projection operators on $\mathbf{1}^\perp$, convex relaxations of combinatorial sparsity constraints, and connections between penalized regression and eigenvalue problems on singular covariance matrices.
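To make the failure mode tangible, the toy example below applies a generic $\ell_1$ proximal step (soft-thresholding) to a vector that lies in $\mathbf{1}^\perp$. This is only an illustration under assumed numbers, not the pipeline analysis or the formulation developed in this project; the names `soft_threshold` and `project_zero_sum` and the vector `beta` are hypothetical.

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise l1 proximal operator: shrink toward zero by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def project_zero_sum(v):
    """Orthogonal projection onto 1-perp, i.e. (I - 11^T / p) v."""
    return v - v.mean()

beta = np.array([1.0, -0.6, -0.3, -0.1])   # assumed loading vector, sums to zero
sparse = soft_threshold(beta, 0.2)          # roughly [0.8, -0.4, -0.1, 0.0]
repaired = project_zero_sum(sparse)         # back in 1-perp, but every entry is now nonzero

print(beta.sum(), sparse.sum(), repaired.sum())
print(np.count_nonzero(sparse), np.count_nonzero(repaired))
```

Thresholding alone leaves a vector whose entries sum to about 0.3, so it no longer lies in $\mathbf{1}^\perp$. Projecting afterwards restores the zero sum but removes the exact zero. Applying the two operations in sequence therefore gives up either the constraint or the sparsity, which is one way to see why the constraint has to be handled inside the optimization.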

This is part of a broader (and much larger) effort in our group on log-contrast-constrained inference for compositional data. The constraint $\sum_j \beta_j = 0$, which ensures that linear combinations of log-transformed features define functions of ratios alone, connects dimensionality reduction, supervised classification, and penalized regression through a shared geometric foundation. Related work in the group includes LASSO-penalized logistic regression under the sum-to-zero constraint for disease classification from metagenomic and metabolomic panels, benchmarked against the DiCoVar framework (Hinton & Mucha, 2021) on publicly available sequencing data and clinical case studies.
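The role of the sum-to-zero constraint can be checked numerically. Rescaling $\mathbf{x}$ by a factor $s$ changes $\sum_j \beta_j \log x_j$ by $(\sum_j \beta_j)\log s$, which vanishes exactly when the coefficients sum to zero. The sketch below uses made-up coefficients and measurements; `log_contrast` is a hypothetical helper name.

```python
import numpy as np

def log_contrast(beta, x):
    """Linear combination of log-transformed parts: sum_j beta_j * log(x_j)."""
    return float(np.dot(beta, np.log(x)))

x = np.array([2.0, 3.0, 5.0])          # assumed raw measurements (not yet closed)
beta = np.array([1.0, -0.5, -0.5])     # sums to zero: a valid log-contrast
bad = np.array([1.0, -0.5, 0.0])       # sums to 0.5: not a log-contrast

# With a zero-sum beta the value depends on ratios only: log(x1 / sqrt(x2 * x3)).
print(np.isclose(log_contrast(beta, x), np.log(2.0 / np.sqrt(3.0 * 5.0))))
print(np.isclose(log_contrast(beta, 7 * x), log_contrast(beta, x)))

# Without the constraint, rescaling shifts the value by sum(bad) * log(7).
shift = log_contrast(bad, 7 * x) - log_contrast(bad, x)
print(np.isclose(shift, bad.sum() * np.log(7.0)))
```

The same argument shows that, for a zero-sum $\boldsymbol{\beta}$, applying it to raw logs or to CLR coordinates gives the same value, since the two differ only by a constant added to every coordinate.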

Note: Details of the methodology, analysis, and benchmarking against existing approaches on both targeted and untargeted metabolomics datasets are in preparation.

References

Hinton, A. L. & Mucha, P. J. (2021). Differential Compositional Variation Feature Selection: A Machine Learning Framework with Log Ratios for Compositional Metagenomic Data. bioRxiv.

Sparse Log-Contrast Methods for Compositional Data
