---
title: "Calculations performed by the unusualprofile package"
author: "W. Joel Schneider & Feng Ji"
date: "`r Sys.Date()`"
as_is: true
output: 
  rmarkdown::html_vignette: default
font-import: https://fonts.googleapis.com/css?family=Titillium+ Web:100&amp;subset=latin-ext
font-family: 'Titillium Web'
vignette: >
  %\VignetteIndexEntry{Calculations performed by the unusualprofile package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{css, echo = FALSE}

@import url('http://fonts.googleapis.com/css?family=Titillium+Web');
body {font-family: 'Titillium Web', 'Open Sans', 'sans-serif'; 
      font-size:18px; 
      line-height:1.25;}

.caption {text-align:right; font-size:10pt}
table caption {text-align:left; font-size:10pt}

img {border: none}
```

```{r setup, include = FALSE}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "ragg_png",
  out.height = "100%",
  out.width = "100%",
  fig.width = 7.2916667,
  fig.height = 7.2916667,
  message = FALSE,
  warning = FALSE,
  cache = FALSE
)

library(dplyr)
library(stringr)
library(ggnormalviolin)
library(ggplot2)
library(purrr)
library(simstandard)
library(unusualprofile)
```

```{r more-setup, echo=FALSE}
myfont <- "Titillium Web"

# font_add_google("Titillium Web", "Titillium Web")
theme_set(theme_minimal(16, base_family = myfont))
update_geom_defaults(geom = "text",
                     new = list(family = myfont))
update_geom_defaults(geom = "label",
                     new = list(family = myfont))
update_geom_defaults(geom = "point",
                     new = list(pch = 16))
#Rounds proportions to significant digits both near 0 and 1
PropRound <- function(p, maxDigits = 10) {
  d <- rep(2, length(p))
  pp <- rep(0, length(p))
  for (i in seq(1, length(p))) {
    if (p[i] > 0.99 || p[i] < 0.01) {
      d[i] <-
        min(maxDigits, 1 - 1 - floor(log10(abs(
          ifelse(p[i] < 0.5, p[i], 1 - p[i])
        ))))
    }
    pp[i] <-
      formatC(round(p[i], digits = d[i]), digits = d[i], flag = "")
    if (round(p[i], digits = maxDigits) == 0) {
      pp[i] <- 0
    }
    if (round(p[i], digits = maxDigits) == 1) {
      pp[i] <- 1
    }
    gsub(" ", "", pp[i])
  }
  return(gsub(" ", "", pp))
}

numText <- function(x, digits = 2) {
  str_replace(formatC(x, digits = digits, format = "f"), pattern = "\\-", "−")
}

bmatrix <- function(x, digits = 2) {
  x_formatted <- apply(x, 1, formatC, digits = digits, format = "f")
  x_formatted <- apply(x_formatted, 1, str_remove, pattern = "^0")
  x_formatted[x == 0] <- "0"
  x_formatted[x == 1] <- "1"
  paste0("\\begin{bmatrix}\n",
         paste0(
           apply(
             x_formatted,
             MARGIN = 1,
             FUN = paste0,
             collapse = " & "
           ),
           collapse = "\\\\\n"
         ),
         "\n\\end{bmatrix}")
}


```


```{r worked-model, echo = FALSE}
model <- "X =~ 0.95 * X_1 + 
               0.90 * X_2 + 
               0.85 * X_3 + 
               0.60 * X_4"
```

```{r worked-data, echo = FALSE}
# Create data.frame, and add the composite score
d <- data.frame(
  X_1 = 2,
  X_2 = 3,
  X_3 = 1,
  X_4 = 2
) %>%
  simstandard::add_composite_scores(m = model)

```

```{r standardized-scores, echo=FALSE}
# Standardized observed scores
X <- d[1, -5] %>% t %>% as.vector %>% `names<-`(colnames(d)[-5])

```


Suppose that there are a set of variables $X$ that are related to each other as seen in the model below:

```{r one-dimensional, echo = FALSE, fig.align='center', fig.cap="A simple model with standardized loadings", out.width=300}
knitr::include_graphics("One_dimensional.svg")
```

Suppose that there is a profile of scores in $X$ such that:

$$X=\{X_1,X_2, X_3, X_4\} = \{`r paste0(X, collapse = ",")`\}. $$


As seen in Figure 1, this profile of scores is summarized by a composite score of `r formatC(d$X_Composite, 2, format = "f")`.

```{r example-profile, echo = F, fig.cap="Figure 1. Example profile in a standard multivariate normal distribution."}
tibble(
  Variable = paste0("italic(X)[", 1:4, "]"),
  Score = X,
  vjust = c(1.5, -0.5, 1.5, 1.5)
) %>%
  ggplot(aes(Variable, Score)) +
  geom_normalviolin(aes(mu = 0, sigma = 1), fill = "gray", alpha = .4) +
  geom_line(aes(group = 1), color = "firebrick") +
  geom_point(pch = 16, color = "firebrick", size = 2) +
  geom_text(aes(label = Score, vjust = vjust), color = "firebrick") +
  geom_hline(yintercept = d$X_Composite) +
  scale_x_discrete(
    NULL,
    labels = function(l) {
      parse(text = l)
      }
  ) +
  scale_y_continuous() +
  annotate(
    geom = "text",
    x = 3.5,
    y = d$X_Composite,
    label = paste0("Composite = ", formatC(d$X_Composite, 2, format = "f")),
    vjust = -.6,
    size = 4.5
  )
```

How can we calculate the Mahalanobis distance for profiles that all have the same elevation (i.e., composite score)? For your reference, we will do everything "by hand" using matrix algebra.

In r, we can make a named vector of scores like so:

```{r X-vector}
X <- c(
  X_1 = 2,
  X_2 = 3,
  X_3 = 1,
  X_4 = 2
)

X
```

We will need to store variable names:

```{r variable-names}
v_observed <- names(X)
v_latent <- "X"
v_composite <- "X_Composite"
v_names <- c(v_observed, v_composite)
```

We can create a matrix of factor loadings:

```{r matrix-loadings}
lambda <- c(0.95, 0.90, 0.85, 0.60) %>%
  matrix() %>%
  `rownames<-`(v_observed) %>%
  `colnames<-`(v_latent)

lambda
```


Now we calculate the model-implied correlations among the observed variables:

```{r R-X}
# Observed Correlations
R_X <- lambda %*% t(lambda) %>%
  `diag<-`(1)

R_X
```

Presented formally, the model-implied correlations are:

```{r implied-corelations, results='asis', echo = FALSE}
cat(paste0("$$R_{X} \\approx ", bmatrix(R_X), "$$"))
```

We need to use this matrix to create a new 5 × 5 correlation matrix that includes the correlations among the four variables and also each variable's correlation with the general composite score (i.e., the standardized sum of four variables). Fortunately, such a matrix can be calculated with only a few steps.

We will need a "weight" matrix that will select each variable individually and also the sum of the four variables.

$$w=\begin{bmatrix}
1 & 0 & 0 & 0 & 1\\
0 & 1 & 0 & 0 & 1\\
0 & 0 & 1 & 0 & 1\\
0 & 0 & 0 & 1 & 1
\end{bmatrix}$$

Notice that the first column of this matrix has a 1 in first position and zeroes elsewhere. It selects the first variable, *X*~1~. The second column selects *X*~2~, and so on to the fourth column. The last column is all ones, which will select all four variables and add them up.

We can construct this matrix with the `diag` function, which creates an identity matrix. This matrix is appended to a column of ones:

```{r weight-matrix}
w <-  cbind(diag(4),
            rep(1, 4)) %>%
  `rownames<-`(v_observed) %>%
  `colnames<-`(v_names)
w
```

Now we can use the weight matrix *w* to calculate the covariance matrix:

$$\Sigma = w'R_{X}w$$

```{r sigma-calculations, include= FALSE}
Sigma <- (t(w) %*% R_X %*% w)
Sigma
```

```{r Sigma-matrix, results='asis', echo = FALSE}
cat(paste0("$$\\Sigma \\approx ", bmatrix(Sigma), "$$"))
```

```{r, ref.label="sigma-calculations", echo = TRUE}

```


Now we need to convert the covariance matrix to a correlation matrix. With matrix equations, we would need to create a matrix of with a vector of variances on the diagonal:

$$D = \text{diag}(\Sigma)$$ Then we would take the square root, invert this matrix, and then pre-multiply it and post-multiply it by the covariance matrix.

$$R_{All} = D^{-0.5}\Sigma D^{-0.5}$$

```{r R-matrix, results='asis', echo = FALSE}
cat(paste0("$$R_{All} \\approx ",
           bmatrix(round(cov2cor(Sigma), 2)),
           "$$"))
```


If we really want to use pure matrix algebra functions, we could do this:

```{r matrix-purist, echo = TRUE}
D_root_inverted <- Sigma %>%
  diag() %>%
  sqrt() %>%
  diag() %>%
  solve() %>%
  `rownames<-`(v_names) %>%
  `colnames<-`(v_names)

R_all <- D_root_inverted %*% Sigma %*% D_root_inverted

R_all
```

However, it is probably best to sidestep all this complication of converting covariances to correlations with the `cov2cor` function:

```{r cor2cov}
# Convert covariance matrix to correlations
R_all <- cov2cor(Sigma)

R_all
```


## Calculate composite scores

To calculate the standardized composite score $z_C$, add each variable's deviation from its own mean and divide by the square root of the sum of the observed score covariance matrix.

$$z_C=\frac{1'(X-\mu_X)}{\sqrt{1'\Sigma_X1}}$$

Where

> $z_C$ is a standardized composite score.\
> $X$ is a vector of observed scores.\
> $\mu_X$ is the vector of means for the $X$ variables.\
> $\Sigma_X$ is the covariance matrix of the $X$ variables.\
> $1$ is a vector of ones compatible with $\Sigma_X$.

The composite score is:

```{r composite-calculations}
# Population means of observed variables
mu_X <- rep_along(X, 0)

# Population standard deviations of observed variables
sd_X <- rep_along(X, 1)

# Covariance Matrix
Sigma_X <- diag(sd_X) %*% R_X %*% diag(sd_X)

# Vector of ones
ones <- rep_along(X, 1)

# Standardized composite score
z_C <- c(ones %*% (X - mu_X) / sqrt(ones %*% (Sigma_X) %*% ones))


```

## Estimate expected test scores conditioned on a composite score

Given a particular composite score, we need to calculate a predicted score. That is, if the composite score is 1.5 standard deviations above the mean, what are the expected subtest scores?

$$\hat{X}=\sigma_Xz_Cr_{XX_C}+\mu_X$$

Where

> $\hat{X}$ is the vector of expected subtest scores\
> $\sigma_X$ is the vector of standard deviations for $X$\
> $z_C$ is the composite score\
> $r_{XX_C}$ is a vector of correlations of each variable in $X$ with the composite score $z_C$\
> $\mu_X$ is the vector of means for $X$\

Thus,

```{r expected-X}
# Predicted value of X, given composite score
X_hat <- sd_X * z_C * R_all[v_observed, v_composite] + mu_X
```

## Calculating the Conditional Mahalanobis Distance

$$d_{M_C}=\sqrt{\left(X-\hat{X}\right)'\Sigma_{X}^{-1}\left(X-\hat{X}\right)}$$

Where

> $d_{M_C}$ is the Conditional Mahalanobis Distance\
> $X$ is a vector of subtest scores\
> $\hat{X}$ is the vector of expected subtest scores\
> $\Sigma_{X}$ is the covariance matrix of the subtest scores

```{r}
d_mc <- c(sqrt(t(X - X_hat) %*% solve(Sigma_X) %*% (X - X_hat)))
d_mc
```

Suppose there are *k* outcome scores, and *j* composite scores used to calculate the expected scores $\hat{X}$. If multivariate normality of the subtest scores can be assumed, then the Conditional Mahalanobis Distance squared has a *χ*^2^ distribution with *k* − *j* degrees of freedom.

$$d_{M_C}^{2} \sim\chi^{2}(k-j)$$

```{r cm-p}
# Number of observed variables
k <- length(v_observed)

# Number of composite variables
j <- length(v_composite)

# Cumulative distribution function
p <- pchisq(d_mc ^ 2, df = k - j)
p
```

If we can assume that the observed variables in *X* are multivariate normal, a profile of *X* = {`r paste(X, collapse = ",")`} is more unusual than `r round(p * 100)`% of profiles that also have a composite score of *z~C~* = `r round(z_C, 2)`.