This document provides background information on the R code used in the survival analysis part of the Biostatistics Methods II course.

We start by loading the packages that we will require during these illustrations. These are packages **survival**, **splines**, **lattice** and **JM**. The first three packages are recommended packages and exist by default in all R installations. Package **JM** though is an optional package. If you do not have it already installed, you will need to install it. The command to do this is `install.packages("JM")`

and needs to be performed only once.

Assuming that all packages are installed, the following commands load them into the active R session:

```
library("survival")
library("splines")
library("lattice")
library("JM")
```

Next, we load the datasets that we will need for our illustrations, namely the AIDS, PBC, Lung and Stanford datasets introduced in the main slides of the course. These datasets are available as objects `aids.id`

, `pbc2.id`

, `lung`

and `stanford2`

, respectively. The first two from package **JM** and the last two from package **survival**. To load the datasets from the packages we use the `data()`

function, i.e.,

```
data("aids", package = "JM")
data("aids.id", package = "JM")
data("pbc2.id", package = "JM")
data("lung", package = "survival")
data("stanford2", package = "survival")
```

For the `lung`

dataset we also need to perform two data management steps, in particular, to specify variable `sex`

as a factor, and to remove the missing values from the variables we will later use. The relevant commands are:

```
lung$sex <- factor(lung$sex, levels = 1:2, labels = c("male", "female"))
lung <- with(lung, lung[complete.cases(time, status, sex, age, ph.karno), ])
```

A key function for the analysis of survival data in R is function `Surv()`

. This is used to specify the type of survival data that we have, namely, right censored, left censored, interval censored. For our illustrations, we will only consider right censored data. In this case, function `Surv()`

accepts as first argument the observed survival times, and as second the event indicator. Function `Surv()`

is used by many other functions we will see later.

The function that is used to calculate the Kaplan-Meier estimate is `survfit()`

. Its first argument is an R `formula`

. The left-hand side of this formula specifies the survival times information using function `Surv()`

, and the right-hand side is used to specify grouping variables (e.g., if we would like to obtain separate estimates of the survival function per treatment group) â€“ if we put â€˜1â€™ as below, then one survival curve is estimated based on all the data. Argument `data`

specifies the data frame that contains the variables of interest. For the Stanford data, we obtain the Kaplan-Meier estimator using the command:

```
KM_fit <- survfit(Surv(time, status) ~ 1, data = stanford2)
KM_fit
```

```
## Call: survfit(formula = Surv(time, status) ~ 1, data = stanford2)
##
## n events median 0.95LCL 0.95UCL
## 184 113 631 328 1232
```

The basic output of function `survfit()`

gives the number of subjects, the number of events, the median survival time and its 95% confidence interval. Using the `summary()`

method and its `times`

argument we can obtain the survival probabilities at specific follow-up times. For example, the following code computes the survival probabilities at 1000 and 2000 days:

`summary(KM_fit, times = c(1000, 2000))`

```
## Call: survfit(formula = Surv(time, status) ~ 1, data = stanford2)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1000 52 94 0.443 0.0396 0.372 0.528
## 2000 15 15 0.274 0.0440 0.200 0.375
```

The `quantile()`

method computes the corresponding follow-up times at which the survival probability takes a specific value. For example, we want to find at how many days the survival probability equals 0.7 and at how many days it equals 0.6 â€“ the code is:

`quantile(KM_fit, probs = 1 - c(0.7, 0.6))`

```
## $quantile
## 30 40
## 136 274
##
## $lower
## 30 40
## 65 148
##
## $upper
## 30 40
## 227 623
```

Note that in the `probs`

argument of `quantile()`

we have to specify one minus our target survival probabilities; this is because the function works under the cumulative distribution function (CDF) convention, and the CDF equals one minus survival probability. Moreover, also note that we obtain the lower and upper limits of the 95% confidence intervals for the quantiles we have asked.

The `plot()`

method produces the figure of the estimated curve; by default, the 95% confidence interval is included when we have only one curve:

```
plot(KM_fit, xlab = "Time to Death (days)", ylab = "Survival Probability",
main = "Kaplan-Meier Estimate of S(t) for the Stanford Data")
```

To calculate the Breslow estimator, we use again function `survfit()`

with the same syntax as for the Kaplan-Meier estimator â€“ the only difference is that we now need to set argument `type`

to `"fleming-harrington"`

. For the Stanford data we have:

```
Br_fit <- survfit(Surv(time, status) ~ 1, data = stanford2,
type = "fleming-harrington")
Br_fit
```

```
## Call: survfit(formula = Surv(time, status) ~ 1, data = stanford2, type = "fleming-harrington")
##
## n events median 0.95LCL 0.95UCL
## 184 113 729 328 1232
```

```
plot(Br_fit, mark.time = FALSE, xlab = "Time to Death (days)",
ylab = "Survival Probability",
main = "Breslow Estimate of S(t) for the Stanford Data")
```

The interpretation of the output, as well as functions `summary()`

and `quantile()`

work in the same manner as in the examples we have seen above for the Kaplan-Meier estimator.

We have two standard statistical tests for testing simple hypothesis with one grouping variable (i.e., to test the hypothesis whether the survival functions of different groups of subjects differ statistically significantly). The first and most widely used test is the log-rank test. This test is performed in R using function `survdiff()`

. For example, to test with the log-rank test whether there are differences in the survival rates in the Lung dataset between males and females, we use the code:

```
logrank <- survdiff(Surv(time, status == 2) ~ sex, data = lung)
logrank
```

```
## Call:
## survdiff(formula = Surv(time, status == 2) ~ sex, data = lung)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=male 137 111 90.9 4.42 10
## sex=female 90 53 73.1 5.51 10
##
## Chisq= 10 on 1 degrees of freedom, p= 0.002
```

We have found a strong indication of a difference between the two groups. To test the same hypothesis but with the Peto & Peto Gehan-Wilcoxon test, we use the `survdiff()`

function again, but now we set argument `rho`

to 1:

```
peto_peto <- survdiff(Surv(time, status == 2) ~ sex, data = lung, rho = 1)
peto_peto
```

```
## Call:
## survdiff(formula = Surv(time, status == 2) ~ sex, data = lung,
## rho = 1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=male 137 69.7 55.2 3.83 12.3
## sex=female 90 28.8 43.4 4.88 12.3
##
## Chisq= 12.3 on 1 degrees of freedom, p= 4e-04
```

The conclusion remains the same with a p-value of a smaller magnitude.

The log-rank test is the most powerful test when the proportional hazards (PH) assumption is satisfied. To check this assumption, we can plot the cumulative hazard functions for the two groups; when PH is satisfied the two curves will be proportional to each other (i.e., the steadily grow away of each other). The cumulative hazards plot can be produced by the `plot()`

method for `survfit()`

objects using the `fun`

argument, e.g.,

```
KM_fit <- survfit(Surv(time, status == 2) ~ sex, data = lung)
plot(KM_fit, fun = "cumhaz")
```

The function that fits AFT models from the **survival** package is `survreg()`

. Its first argument is a formula and has a similar syntax as for function `survfit()`

. The `dist`

argument specifies the distribution for the survival times (**Note:** the `dist`

argument specifies the distribution of the survival times *not* the log survival times). The default distribution (i.e., if you do not specify the `dist`

argument yourself) is the Weibull distribution. As with other model fitting functions in R, the `summary()`

function returns a detailed output of the fitted model. As an example, with fit an AFT model assuming the Weibull distribution for the PBC dataset. We control for `drug`

, `sex`

and `age`

. The code is:

```
fit_weibull <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id)
summary(fit_weibull)
```

```
##
## Call:
## survreg(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id)
## Value Std. Error z p
## (Intercept) 4.13862 0.48260 8.58 <2e-16
## drugD-penicil 0.13046 0.15576 0.84 0.402
## sexfemale 0.44472 0.20147 2.21 0.027
## age -0.03874 0.00793 -4.89 1e-06
## Log(scale) -0.10223 0.07423 -1.38 0.168
##
## Scale= 0.903
##
## Weibull distribution
## Loglik(model)= -494.7 Loglik(intercept only)= -511.8
## Chisq= 34.23 on 3 degrees of freedom, p= 1.8e-07
## Number of Newton-Raphson Iterations: 5
## n= 312
```

To fit the same model but with the exponential distribution, the code is:

```
fit_exp <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id,
dist = "exponential")
summary(fit_exp)
```

```
##
## Call:
## survreg(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id,
## dist = "exponential")
## Value Std. Error z p
## (Intercept) 4.3282 0.5159 8.39 < 2e-16
## drugD-penicil 0.1438 0.1721 0.84 0.40
## sexfemale 0.4816 0.2217 2.17 0.03
## age -0.0420 0.0084 -5.00 5.7e-07
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -495.6 Loglik(intercept only)= -512.3
## Chisq= 33.35 on 3 degrees of freedom, p= 2.7e-07
## Number of Newton-Raphson Iterations: 4
## n= 312
```

To fit the same model but with the log-normal distribution, the code is:

```
fit_lnorm <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id,
dist = "lognormal")
summary(fit_lnorm)
```

```
##
## Call:
## survreg(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id,
## dist = "lognormal")
## Value Std. Error z p
## (Intercept) 4.41534 0.59602 7.41 1.3e-13
## drugD-penicil 0.21196 0.18775 1.13 0.26
## sexfemale 0.30242 0.27761 1.09 0.28
## age -0.04841 0.00944 -5.13 2.9e-07
## Log(scale) 0.34711 0.06473 5.36 8.2e-08
##
## Scale= 1.41
##
## Log Normal distribution
## Loglik(model)= -498.9 Loglik(intercept only)= -515.2
## Chisq= 32.72 on 3 degrees of freedom, p= 3.7e-07
## Number of Newton-Raphson Iterations: 3
## n= 312
```

To fit the same model but with the log-logistic distribution, the code is:

```
fit_llogis <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id,
dist = "loglogistic")
summary(fit_llogis)
```

```
##
## Call:
## survreg(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id,
## dist = "loglogistic")
## Value Std. Error z p
## (Intercept) 4.00019 0.54926 7.28 3.3e-13
## drugD-penicil 0.14532 0.17387 0.84 0.40325
## sexfemale 0.38446 0.24161 1.59 0.11155
## age -0.04151 0.00884 -4.69 2.7e-06
## Log(scale) -0.26129 0.07326 -3.57 0.00036
##
## Scale= 0.77
##
## Log logistic distribution
## Loglik(model)= -497.4 Loglik(intercept only)= -513
## Chisq= 31.08 on 3 degrees of freedom, p= 8.2e-07
## Number of Newton-Raphson Iterations: 4
## n= 312
```

In cases we consider statistical models with complex terms (interactions and nonlinear terms), the regression coefficients we obtain in the output do not have a straightforward interpretation. In these settings, we can use effect plots to communicate the results of the model. As an example, we fit a model in the PBC dataset that contains the effect of `drug`

, the effect of `sex`

, the linear effect of `age`

, the quadratic effect of `age`

, and the interaction effects between `sex`

and the linear and quadratic effects of `age`

, and between `drug`

and the linear and quadratic effects of `age`

. We assume that the survival times follow the Weibull distribution. The code to fit the model is:

```
fit <- survreg(Surv(years, status2) ~ (drug + sex) * (age + I(age^2)), data = pbc2.id)
summary(fit)
```

```
##
## Call:
## survreg(formula = Surv(years, status2) ~ (drug + sex) * (age +
## I(age^2)), data = pbc2.id)
## Value Std. Error z p
## (Intercept) 8.342101 6.266023 1.33 0.18
## drugD-penicil 3.489874 3.353680 1.04 0.30
## sexfemale -6.523710 6.189094 -1.05 0.29
## age -0.161596 0.217081 -0.74 0.46
## I(age^2) 0.000822 0.001843 0.45 0.66
## drugD-penicil:age -0.148887 0.127901 -1.16 0.24
## drugD-penicil:I(age^2) 0.001549 0.001197 1.29 0.20
## sexfemale:age 0.240257 0.215642 1.11 0.27
## sexfemale:I(age^2) -0.002015 0.001847 -1.09 0.28
## Log(scale) -0.113198 0.074516 -1.52 0.13
##
## Scale= 0.893
##
## Weibull distribution
## Loglik(model)= -492.4 Loglik(intercept only)= -511.8
## Chisq= 38.81 on 8 degrees of freedom, p= 5.3e-06
## Number of Newton-Raphson Iterations: 5
## n= 312
```

As mentioned above, in this particular example, the linear and quadratic effects of age, and the corresponding interactions effects cannot be easily interpreted. To visualize the model we will use an effects plot. The first step is to specify a dataset that contains combinations of values for the covariates of the model based on which we will create the plot. In our example, we can create a single figure with all possible combinations of the covariates, i.e., both treatment groups, both sexes and increasing age. To create the dataset that contains all possible combinations of covariates we use function `expand.grid()`

. The code is:

```
ND <- with(pbc2.id, expand.grid(
age = seq(30, 65, length.out = 25),
sex = levels(sex),
drug = levels(drug)
))
head(ND, 10)
```

```
## age sex drug
## 1 30.00000 male placebo
## 2 31.45833 male placebo
## 3 32.91667 male placebo
## 4 34.37500 male placebo
## 5 35.83333 male placebo
## 6 37.29167 male placebo
## 7 38.75000 male placebo
## 8 40.20833 male placebo
## 9 41.66667 male placebo
## 10 43.12500 male placebo
```

Function `seq()`

creates a sequence of numbers, in our example from 30 to 65 years old, with the length of the sequence been 25 numbers. Function `levels()`

extracts from the `pbc2.id`

database the possible values of the `sex`

and `drug`

variables. Function `head()`

prints the first ten rows of this dataset.

Next, based on this dataset we compute the predictions from the model using function `predict()`

:

```
prs <- predict(fit, ND, se.fit = TRUE, type = "lp")
ND$pred <- prs[[1]]
ND$se <- prs[[2]]
ND$lo <- ND$pred - 1.96 * ND$se
ND$up <- ND$pred + 1.96 * ND$se
head(ND, 10)
```

```
## age sex drug pred se lo up
## 1 30.00000 male placebo 4.234429 1.4533787 1.385807 7.083051
## 2 31.45833 male placebo 4.072483 1.3055301 1.513644 6.631322
## 3 32.91667 male placebo 3.914035 1.1659479 1.628777 6.199293
## 4 34.37500 male placebo 3.759085 1.0347438 1.730987 5.787183
## 5 35.83333 male placebo 3.607634 0.9120677 1.819981 5.395287
## 6 37.29167 male placebo 3.459681 0.7981221 1.895362 5.024000
## 7 38.75000 male placebo 3.315227 0.6931838 1.956586 4.673867
## 8 40.20833 male placebo 3.174270 0.5976336 2.002909 4.345632
## 9 41.66667 male placebo 3.036812 0.5119939 2.033304 4.040321
## 10 43.12500 male placebo 2.902853 0.4369676 2.046396 3.759309
```

The output of the `predict()`

function as specified above is the predictions (in the log survival times scale) and the corresponding standard errors. The second and third lines of the code include as extra columns in the dataset `ND`

these predictions and standard errors. The fourth and fifth lines calculate the lower and upper limits of a 95% confidence interval for the log survival times. Finally, we create the plot using function `xyplot()`

from package **lattice**:

```
xyplot(pred + lo + up ~ age | drug*sex, data = ND, type = "l",
lty = c(1,2,2), col = "black", lwd = 2, xlab = "Age",
ylab = "Log Survival Time")
```

To create the plot for only one level of a categorical covariate, we use function `factor()`

in the definition of the dataset `ND`

in order to denote that this variable has two levels but only one of them is used. As an example, we recreate the plot from above but only for females. The specification of `ND`

becomes:

```
ND <- with(pbc2.id, expand.grid(
age = seq(30, 65, length.out = 25),
sex = factor("female", levels = levels(sex)),
drug = levels(drug)
))
head(ND, 10)
```

```
## age sex drug
## 1 30.00000 female placebo
## 2 31.45833 female placebo
## 3 32.91667 female placebo
## 4 34.37500 female placebo
## 5 35.83333 female placebo
## 6 37.29167 female placebo
## 7 38.75000 female placebo
## 8 40.20833 female placebo
## 9 41.66667 female placebo
## 10 43.12500 female placebo
```

The rest of the code remains the same; the only extra difference is that we now do the plot in the original timescale and not for the log survival times - note the use of function `exp()`

in the code below, i.e.,

```
prs <- predict(fit, ND, se.fit = TRUE, type = "lp")
ND$pred <- prs[[1]]
ND$se <- prs[[2]]
ND$lo <- exp(ND$pred - 1.96 * ND$se)
ND$up <- exp(ND$pred + 1.96 * ND$se)
ND$pred <- exp(ND$pred)
xyplot(pred + lo + up ~ age | drug, data = ND, type = "l",
lty = c(1,2,2), col = "black", lwd = 2, xlab = "Age",
ylab = "Survival Time")
```

A statistical comparison between two nested models can be performed with a likelihood ratio test calculated by function `anova()`

. As an example, we use again the complex model that we used for the effect plot in the previous section, i.e.,

`fit_alt <- survreg(Surv(years, status2) ~ (drug + sex) * (age + I(age^2)), data = pbc2.id)`

Say that we want to test whether we can drop all the complex terms, that is, the interaction and nonlinear terms. Then, the model above is the model under the alternative hypothesis (i.e., the full model). The model under the null hypothesis is the one that does not contain any of these terms (i.e., the reduced model):

`fit_null <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id)`

We compare the two models using `anova()`

, i.e.,

`anova(fit_null, fit_alt)`

```
## Terms Resid. Df -2*LL Test Df Deviance
## 1 drug + sex + age 307 989.4673 NA NA
## 2 (drug + sex) * (age + I(age^2)) 302 984.8867 = 5 4.580586
## Pr(>Chi)
## 1 NA
## 2 0.4691742
```

**Note** that the user is responsible to supply appropriately nested AFT models such that the LRT to be valid. The result suggests that we could use the simplified model.

In order to check whether the assumed distribution of survival times fits the observed data sufficiently well, we can use the residuals of the model. However, because the residuals are calculated based on the observed event times, they will also be censored. Hence, to investigate whether they follow a particular distribution, we will need to employ a graphical procedure accounting for censoring. One way to achieve this is by using the Kaplan-Meier estimator of the residuals. As an example, we would like to investigate whether the Weibull distribution is an appropriate distribution for the survival times in the AIDS dataset. We first fit the model of interest:

`fit_weib <- survreg(Surv(Time, death) ~ drug * gender, data = aids.id)`

Next, we construct the residuals based on their definition

```
fitted_values <- fit_weib$linear.predictors
resids <- (log(fit_weib$y[, 1]) - fitted_values) / fit_weib$scale
```

We then compute the Kaplan-Meier estimate of these residuals, and we plot it. Lines 3-5 of the code below are used to superimpose in the graph the assumed Weibull distribution.

```
resKM <- survfit(Surv(resids, death) ~ 1, data = aids.id)
plot(resKM, mark.time = FALSE, xlab = "AFT Residuals", ylab = "Survival Probability")
xx <- seq(min(resids), max(resids), length.out = 35)
yy <- exp(- exp(xx))
lines(xx, yy, col = "red", lwd = 2)
legend("bottomleft", c("KM estimate", "95% CI KM estimate",
"Survival function of Extreme Value distribution"),
lty = c(1,2,1), col = c(1,1,2), bty = "n")
```

In this example, we observe that the Weibull distribution provides a good fit to the data. For models that assume other distributions of the survival times, the only thing that we need to change is the line of the code where the `yy`

values are defined. In particular, for AFT models assuming the log-normal or the Gaussian distribution, we use `yy <- pnorm(xx, lower.tail = FALSE)`

, and for AFT models assuming the log-logistic or the logistic distribution, you should use `yy <- plogis(xx, lower.tail = FALSE)`

.

The function that fits Cox models from the **survival** package is `coxph()`

. It has similar syntax to `survreg()`

that we saw in the previous section, with only exception that it does not have the `dist`

argument (i.e., the Cox model is a semi-parametric model that does not assume a particular distribution for the survival times). As in the majority of the model fitting functions in R, the `summary()`

method returns a detailed output of the fitted model. As an example, we fit a Cox model to the PBC dataset in which we include as covariates the `drug`

, `sex`

and `age`

; the code is:

```
fit <- coxph(Surv(years, status2) ~ drug + sex + age, data = pbc2.id)
summary(fit)
```

```
## Call:
## coxph(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id)
##
## n= 312, number of events= 140
##
## coef exp(coef) se(coef) z Pr(>|z|)
## drugD-penicil -0.146013 0.864146 0.172143 -0.848 0.3963
## sexfemale -0.470905 0.624437 0.221785 -2.123 0.0337 *
## age 0.042842 1.043773 0.008505 5.037 4.72e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## drugD-penicil 0.8641 1.1572 0.6167 1.2109
## sexfemale 0.6244 1.6014 0.4043 0.9644
## age 1.0438 0.9581 1.0265 1.0613
##
## Concordance= 0.629 (se = 0.024 )
## Rsquare= 0.101 (max possible= 0.991 )
## Likelihood ratio test= 33.25 on 3 df, p=3e-07
## Wald test = 34.87 on 3 df, p=1e-07
## Score (logrank) test = 35.31 on 3 df, p=1e-07
```

The column `coef`

in the output denotes the log hazard ratios, and `exp(coef)`

the corresponding hazard ratios.

Similarly to AFT models, when we consider a more complex model that includes interactions and nonlinear terms, it is more useful to communicate the results of the model using an effects plot. The procedure to construct such a plot is the same as in AFT models. For example, we want to produce an effect plot for the following Cox model for the Lung dataset that contains the `sex`

the nonlinear effect of `age`

and its interaction with `sex`

, and the nonlinear effect of `ph.karno`

:

`fit <- coxph(Surv(time, status) ~ ns(age, 3) * sex + ns(ph.karno, 3), data = lung)`

We first construct the dataset that gives the combinations of covariate values based on which the plot will be constructed. This done with function `expand.grid()`

:

```
ND <- with(lung, expand.grid(
age = seq(45, 75, length.out = 25),
sex = levels(sex),
ph.karno = quantile(ph.karno, c(0.25, 0.5, 0.75))
))
head(ND)
```

```
## age sex ph.karno
## 1 45.00 male 75
## 2 46.25 male 75
## 3 47.50 male 75
## 4 48.75 male 75
## 5 50.00 male 75
## 6 51.25 male 75
```

As seen above, we will do the plot, for increasing age from 45 to 75 years old, both males and females, and for `ph.karno`

increasing from the 1st quartile, to the median, to the 3rd quartile. Next, we obtain the predictions from the model (these are in the log-hazard scale), and we calculate the 95% confidence interval:

```
prs <- predict(fit, newdata = ND, type = "lp", se.fit = TRUE)
ND$pred <- prs[[1]]
ND$se <- prs[[2]]
ND$lo <- ND$pred - 1.96 * ND$se
ND$up <- ND$pred + 1.96 * ND$se
ND$ph.karno <- factor(ND$ph.karno, labels = paste("PH Karno =", sort(unique(ND$ph.karno))))
```

The last line of the code transforms the `ph.karno`

variable to a factor in order to have a more visible label in the plot that is produced with the following code:

```
xyplot(pred + lo + up ~ age | ph.karno * sex, data = ND,
type = "l", col = "black", lwd = 2, lty = c(1, 2, 2),
abline = list(h = 0, lty = 2, lwd = 2, col = "red"),
xlab = "Age (years)", ylab = "log Hazard Ratio")
```

When a single categorical covariate is included in the Cox model, the `summary()`

method returns the Wald, score (in this case equivalent to the log-rank test), and likelihood ratio tests. For example, for the PBC data these three tests for the `sex`

variable are:

```
fit <- coxph(Surv(years, status2) ~ sex, data = pbc2.id)
summary(fit)
```

```
## Call:
## coxph(formula = Surv(years, status2) ~ sex, data = pbc2.id)
##
## n= 312, number of events= 140
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexfemale -0.6502 0.5219 0.2183 -2.979 0.0029 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexfemale 0.5219 1.916 0.3403 0.8006
##
## Concordance= 0.537 (se = 0.016 )
## Rsquare= 0.024 (max possible= 0.991 )
## Likelihood ratio test= 7.73 on 1 df, p=0.005
## Wald test = 8.87 on 1 df, p=0.003
## Score (logrank) test = 9.18 on 1 df, p=0.002
```

For a more complex hypothesis, we can use the likelihood ratio test by comparing the models under the null and alternative hypothesis. This is done using the `anova()`

function. For example, in the PBC dataset, and for the Cox model that contains `drug`

, `age`

and their interaction, we would like to test for the effect of `drug`

. Hence, the model under the null hypothesis is the one that does not contain `drug`

. The relevant code is:

```
fit_null <- coxph(Surv(years, status2) ~ age, data = pbc2.id)
fit_alt <- coxph(Surv(years, status2) ~ drug * age, data = pbc2.id)
anova(fit_null, fit_alt)
```

```
## Analysis of Deviance Table
## Cox model: response is Surv(years, status2)
## Model 1: ~ age
## Model 2: ~ drug * age
## loglik Chisq Df P(>|Chi|)
## 1 -712.41
## 2 -711.29 2.2259 2 0.3286
```

As an additional example, we illustrate how we can test for non-linearity using natural cubic splines. We refit the above model by now allowing the effect of `age`

to be nonlinear using natural cubic splines with 3 degrees of freedom.

`fit_alt <- coxph(Surv(years, status2) ~ drug + ns(age, 3), data = pbc2.id)`

To test if the effect of `age`

is nonlinear in the log hazard scale, we fit the model under the null hypothesis that assumes a linear effect for age, and we compare the two models using the likelihood ratio test:

```
fit_null <- coxph(Surv(years, status2) ~ drug + age, data = pbc2.id)
anova(fit_null, fit_alt)
```

```
## Analysis of Deviance Table
## Cox model: response is Surv(years, status2)
## Model 1: ~ drug + age
## Model 2: ~ drug + ns(age, 3)
## loglik Chisq Df P(>|Chi|)
## 1 -711.97
## 2 -711.90 0.1233 2 0.9402
```

The result shows that there does not seem to be any indication that the effect of `age`

is nonlinear (i.e., the two models do not differ statistically). Hence, we could proceed with the linear model.

For categorical covariates, we can check the proportional hazards assumption by appropriately transforming the Kaplan-Meier estimate in the log-log scale. In this scale we would expect to see parallel lines. We illustrate this procedure for the `sex`

variable from the PBC dataset.

```
KM_fit <- survfit(Surv(years, status2) ~ sex, data = pbc2.id)
par(mfrow = c(1, 2))
plot(KM_fit, xlab = "Years", ylab = "Survival", col = 1:2)
legend("topright", levels(pbc2.id$sex), lty = 1, col = 1:2, bty = "n")
plot(KM_fit, fun = function (s) -log(-log(s)), xlab = "Years",
ylab = "-log(- log(Survival))", col = 1:2)
```

The plot in the left panel of the figure is the classical Kaplan-Meier estimator (i.e., on the y-axis we have survival probabilities). The plot in the right panel has on the y-axis the \(-\log[-\log\{S(t)\}]\) transformation of the survival function \(S(t)\). We observe some strong indications of non-parallel lines. As a possible solution, we could stratify by `sex`

.

For continuous covariates, we can test the proportional hazards assumption using the Schoenfeld residuals. We should note that this procedure also can be used for categorical covariates. We illustrate their use in a Cox model for the PBC dataset that contains `sex`

, `age`

and their interaction. The model is fitted with the code:

`fit <- coxph(Surv(years, status2) ~ sex * age, data = pbc2.id)`

The function that calculates the Schoenfeld residuals is `cox.zph()`

. The two primary arguments of this function are the fitted Cox model and the transformation of time to be used. The code below does the calculations for the KM scale.

```
check_PH <- cox.zph(fit, transform = "km")
check_PH
```

```
## rho chisq p
## sexfemale -0.0674 0.503 0.478
## age -0.0511 0.270 0.604
## sexfemale:age 0.0408 0.186 0.667
## GLOBAL NA 2.343 0.504
```

The output of the function reports a test for non-proportionality (i.e., the null hypothesis is that the corresponding variable satisfies PH). However, it is more useful to inspect the PH assumption by plotting the Schoenfeld residuals graphically. This figure is produced with the `plot()`

method. We produce the plot separately for each predictor of the model in order also to include the red line corresponding to PH.

```
plot(check_PH, var = 1)
abline(h = coef(fit)[1], col = "red", lwd = 2)
```

```
plot(check_PH, var = 2)
abline(h = coef(fit)[2], col = "red", lwd = 2)
```

```
plot(check_PH, var = 3)
abline(h = coef(fit)[3], col = "red", lwd = 2)
```

We observe that for all variables PH seems to hold.

**Notes**

There is a contradiction with regard to whether the

`sex`

variable satisfies PH between the first method we have seen based on the transformation of the Kaplan-Meier estimator and the second method based on the Schoenfeld residuals. From the two methods, most trustworthy is the latter.From the plots, we observe that the residuals are quite evenly spaced along the x-axis and therefore there is no need to consider other transformations of the timescale.

As we have previously explained, the Cox model does not estimate the baseline hazard, and therefore we cannot directly obtain survival probabilities from it. To achieve that we need to combine it with a non-parametric estimator of the baseline hazard function. The most popular method to do that is to use the Breslow estimator. For a fitted Cox model from package **survival** these probabilities are calculated by function `survfit()`

. As an illustration, we would like to derive survival probabilities from the following Cox model for the AIDS dataset:

`fit <- coxph(Surv(Time, death) ~ CD4 + drug + gender, data = aids.id)`

Since this model contains three covariates, we need to specify for which combinations of values of these covariates we want to derive the probabilities. These combinations of values need to be specified in a dataset. As an example, we derive survival probabilities for both males and females, who receive both treatments and have square root CD4 equal to 3. The code to create this data uses the `expand.grid()`

function that we have seen before:

```
ND <- with(aids.id, expand.grid(CD4 = 3, drug = levels(drug), gender = levels(gender)))
ND
```

```
## CD4 drug gender
## 1 3 ddC female
## 2 3 ddI female
## 3 3 ddC male
## 4 3 ddI male
```

Then we give the fitted model and the data defined above as arguments to `survfit()`

:

```
surv_probs_Cox <- survfit(fit, newdata = ND)
surv_probs_Cox
```

```
## Call: survfit(formula = fit, newdata = ND)
##
## n events median 0.95LCL 0.95UCL
## 1 467 188 12.2 9.8 NA
## 2 467 188 10.6 8.0 16.6
## 3 467 188 14.1 11.9 17.8
## 4 467 188 11.6 10.4 14.1
```

The output gives the median survival times and the corresponding 95% confidence intervals for these groups of patients. Similarly to how we worked in Section 2.1, we can derive the survival probabilities at specific follow-up times using the `summary()`

method and the `times`

argument. For example, the following code extracts the survival probabilities for the groups above at eight months of follow-up:

`summary(surv_probs_Cox, times = 8)`

```
## Call: survfit(formula = fit, newdata = ND)
##
## time n.risk n.event survival1 survival2 survival3 survival4
## 8 379 86 0.7 0.627 0.747 0.683
```

To obtain the corresponding lower and upper limits of the corresponding 95% confidence intervals, we need to manually extract them from the output of `summary()`

, i.e.,

```
summ <- summary(surv_probs_Cox, times = 8)
out <- t(rbind(summ$surv, summ$lower, summ$upper))
colnames(out) <- c("Surv_Prob", "Lower", "Upper")
out
```

```
## Surv_Prob Lower Upper
## 1 0.6998629 0.5850566 0.8371977
## 2 0.6274689 0.4966114 0.7928075
## 3 0.7471514 0.6894090 0.8097300
## 4 0.6834015 0.6182431 0.7554271
```

Finally, the `plot()`

method produces the plot with all four survival curves

```
plot(surv_probs_Cox, col = c("black", "red", "blue", "green"),
xlab = "Follow-Up Time (months)", ylab = "Survival Probabilities")
legend("bottomleft", paste("drug =", ND$drug, ", gender = ", ND$gender),
lty = 1, col = c("black", "red", "blue", "green"), bty = "n")
```