Overview

Diagram

../_images/deltapd_overview.png

Explanation

Model

Estimator

Regression models (\(y=mx+c\)) are fit using the Theil-Sen estimator. The intercept is fixed at \(c=0\), because points near the origin (\(y \approx x \approx 0\)) are expected to be very good representations of the correlation between the query and reference trees. With no intercept to estimate, the estimator simplifies greatly:

Let \(TS_i\) be the Theil-Sen estimate for each data point \(i\):

\begin{eqnarray} TS_i &= \begin{cases} 0& \text{if $x_i = 0$},\\ \frac{y_i}{x_i}& \text{otherwise}. \end{cases} \end{eqnarray}

The Theil-Sen estimate of the gradient is then \(m = \text{median}(TS)\) for the set \(TS\).
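
The zero-intercept estimator above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the DeltaPD implementation; the function name `theil_sen_gradient` and the sample data are assumptions made for the example.

```python
from statistics import median


def theil_sen_gradient(x, y):
    """Gradient m of y = m*x using the simplified estimator (c = 0).

    Hypothetical helper for illustration: each point contributes
    TS_i = y_i / x_i, or 0 when x_i == 0, and m is the median of TS.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    ts = [0.0 if xi == 0 else yi / xi for xi, yi in zip(x, y)]
    return median(ts)


# Made-up data: the per-point slopes are [0, 2, 2, 2.5].
m = theil_sen_gradient([0, 1, 2, 4], [0, 2, 4, 10])
```

Because the gradient is a median of per-point slopes, a single outlying point (here the slope of 2.5) does not shift the estimate.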

Coefficient of determination

The coefficient of determination \(R^2\) is calculated as follows, where \(\hat{y}_i\) is the estimate for point \(i\), and \(\bar{y}\) is the mean of \(y\):

\begin{eqnarray} R^2 & = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2} \end{eqnarray}
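
As a rough sketch of the \(R^2\) formula above, the following Python computes the residual and total sums of squares directly. The function name `r_squared` is hypothetical, and raising an error when \(y\) is constant (zero denominator) is a choice made for this example rather than documented behaviour.

```python
from statistics import mean


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y) != len(y_hat) or len(y) == 0:
        raise ValueError("y and y_hat must be non-empty and of equal length")
    y_bar = mean(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if ss_tot == 0:
        # Assumption for this sketch: R^2 is undefined when y has no variance.
        raise ValueError("R^2 is undefined when all y values are equal")
    return 1 - ss_res / ss_tot
```

For a zero-intercept model the estimates would be \(\hat{y}_i = m x_i\). Note that \(R^2\) defined this way can be negative when the model fits worse than the mean of \(y\).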

Mean squared error

The mean squared error (MSE) is typically reported as the normalised mean squared error (nMSE). The nMSE is normalised by the standard deviation of \(y\) as follows, where \(\sigma\) denotes the standard deviation:

\begin{eqnarray} \text{nMSE} & = \sqrt{\frac{\text{MSE}(y, \hat{y})}{\sigma(y)}} \end{eqnarray}
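
A minimal Python sketch of the nMSE formula above follows. The source does not say whether \(\sigma\) is the population or the sample standard deviation; this example assumes the population form (`statistics.pstdev`), and the function name `normalised_mse` is hypothetical.

```python
from math import sqrt
from statistics import pstdev


def normalised_mse(y, y_hat):
    """nMSE = sqrt(MSE(y, y_hat) / sigma(y)), as in the formula above."""
    if len(y) != len(y_hat) or len(y) == 0:
        raise ValueError("y and y_hat must be non-empty and of equal length")
    mse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / len(y)
    sigma = pstdev(y)  # Assumption: population standard deviation.
    if sigma == 0:
        raise ValueError("nMSE is undefined when all y values are equal")
    return sqrt(mse / sigma)
```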

Relative influence function

The relative influence \(u_i\) of jackknifing query taxon \(i\) from the full set of \(n\) taxa is calculated as follows:

\begin{eqnarray} u_i & = \frac{(n-1)\left(\bar{x}-\text{MSE}_i\right)}{\sqrt{\frac{\sum_j \left((n-1)\left(\bar{x} - \text{MSE}_j\right)\right)^2}{n-1}}} \end{eqnarray}
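
The relative influence calculation can be sketched as below. Two points are assumptions, not statements from the source: that \(\text{MSE}_i\) is the MSE obtained with taxon \(i\) removed, and that \(\bar{x}\) is the mean of those jackknifed MSE values. The function name `relative_influence` is hypothetical.

```python
from math import sqrt
from statistics import mean


def relative_influence(jackknife_mse):
    """Relative influence u_i for each jackknifed MSE value.

    jackknife_mse[i] is assumed to be the MSE with taxon i removed,
    and x_bar is assumed to be the mean of those values.
    """
    n = len(jackknife_mse)
    if n < 2:
        raise ValueError("at least two taxa are required")
    x_bar = mean(jackknife_mse)
    # Numerator for every taxon: (n - 1) * (x_bar - MSE_i).
    scaled = [(n - 1) * (x_bar - v) for v in jackknife_mse]
    # Denominator: root of the summed squared numerators over (n - 1).
    denom = sqrt(sum(s ** 2 for s in scaled) / (n - 1))
    if denom == 0:
        raise ValueError("all jackknifed MSE values are equal")
    return [s / denom for s in scaled]
```

Under these assumptions, a taxon whose removal lowers the MSE below the mean gets a positive \(u_i\), and one whose removal raises it gets a negative \(u_i\).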