1 Introduction and Set-Up

This document provides background information on the R code used in the survival analysis part of the Biostatistics Methods II course.

We start by loading the packages that we will require for these illustrations. These are packages survival, splines, lattice and JM. The first three are recommended packages and exist by default in all R installations. Package JM, however, is an optional package; if you do not already have it installed, you will need to install it. The command to do this is install.packages("JM"), and it needs to be run only once.
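For instance, a common (optional) way to install the package only when it is not already available is:

# install JM only if it is not already installed (optional convenience check)
if (!requireNamespace("JM", quietly = TRUE)) install.packages("JM")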

Assuming that all packages are installed, the following commands load them into the active R session:

library("survival")
library("splines")
library("lattice")
library("JM")

Next, we load the datasets that we will need for our illustrations, namely the AIDS, PBC, Lung and Stanford datasets introduced in the main slides of the course. These datasets are available as objects aids.id, pbc2.id, lung and stanford2, respectively; the first two come from package JM and the last two from package survival. To load the datasets from the packages we use the data() function, i.e.,

data("aids", package = "JM")
data("aids.id", package = "JM")
data("pbc2.id", package = "JM")
data("lung", package = "survival")
data("stanford2", package = "survival")

For the lung dataset we also need to perform two data management steps: to specify variable sex as a factor, and to remove the missing values from the variables we will use later. The relevant commands are:

lung$sex <- factor(lung$sex, levels = 1:2, labels = c("male", "female"))
lung <- with(lung, lung[complete.cases(time, status, sex, age, ph.karno), ])

2 Basic Survival Analysis

2.1 Estimators of the Survival Function

A key function for the analysis of survival data in R is Surv(). It is used to specify the type of survival data that we have, namely right censored, left censored, or interval censored. For our illustrations, we will only consider right-censored data. In this case, Surv() accepts as its first argument the observed survival times and as its second argument the event indicator. Function Surv() is used by many of the other functions we will see later.
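For instance, to get a feel for what a survival object looks like for the Stanford data, one could run the following (output not shown; censored observations are printed with a '+' suffix):

# first few elements of the survival object for the Stanford data
with(stanford2, Surv(time, status))[1:5]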

The function that is used to calculate the Kaplan-Meier estimate is survfit(). Its first argument is an R formula. The left-hand side of this formula specifies the survival times information using function Surv(), and the right-hand side is used to specify grouping variables (e.g., if we would like to obtain separate estimates of the survival function per treatment group) – if we put ‘1’ as below, then one survival curve is estimated based on all the data. Argument data specifies the data frame that contains the variables of interest. For the Stanford data, we obtain the Kaplan-Meier estimator using the command:

KM_fit <- survfit(Surv(time, status) ~ 1, data = stanford2)
KM_fit
## Call: survfit(formula = Surv(time, status) ~ 1, data = stanford2)
## 
##       n  events  median 0.95LCL 0.95UCL 
##     184     113     631     328    1232

The basic output of function survfit() gives the number of subjects, the number of events, the median survival time and its 95% confidence interval. Using the summary() method and its times argument we can obtain the survival probabilities at specific follow-up times. For example, the following code computes the survival probabilities at 1000 and 2000 days:

summary(KM_fit, times = c(1000, 2000))
## Call: survfit(formula = Surv(time, status) ~ 1, data = stanford2)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##  1000     52      94    0.443  0.0396        0.372        0.528
##  2000     15      15    0.274  0.0440        0.200        0.375

The quantile() method computes the follow-up times at which the survival probability takes a specific value. For example, to find after how many days the survival probability equals 0.7 and after how many days it equals 0.6, the code is:

quantile(KM_fit, probs = 1 - c(0.7, 0.6))
## $quantile
##  30  40 
## 136 274 
## 
## $lower
##  30  40 
##  65 148 
## 
## $upper
##  30  40 
## 227 623

Note that in the probs argument of quantile() we have to specify one minus our target survival probabilities; this is because the function works under the cumulative distribution function (CDF) convention, and the CDF equals one minus the survival probability. Moreover, note that we also obtain the lower and upper limits of the 95% confidence intervals for the quantiles we asked for.
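For instance, under this convention probs = 0.5 corresponds to the median survival time and should reproduce the median reported by survfit() above:

quantile(KM_fit, probs = 0.5)  # median survival time with 95% CI limits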

The plot() method produces the figure of the estimated curve; by default, the 95% confidence interval is included when we have only one curve:

plot(KM_fit, xlab = "Time to Death (days)", ylab = "Survival Probability", 
    main = "Kaplan-Meier Estimate of S(t) for the Stanford Data")

To calculate the Breslow estimator, we use again function survfit() with the same syntax as for the Kaplan-Meier estimator – the only difference is that we now need to set argument type to "fleming-harrington". For the Stanford data we have:

Br_fit <- survfit(Surv(time, status) ~ 1, data = stanford2, 
    type = "fleming-harrington")
Br_fit
## Call: survfit(formula = Surv(time, status) ~ 1, data = stanford2, type = "fleming-harrington")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     184     113     729     328    1232
plot(Br_fit, mark.time = FALSE, xlab = "Time to Death (days)", 
    ylab = "Survival Probability", 
    main = "Breslow Estimate of S(t) for the Stanford Data")

The interpretation of the output, as well as the use of the summary() and quantile() methods, is the same as in the Kaplan-Meier examples we have seen above.
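For instance, analogously to before, we could run (output omitted here):

summary(Br_fit, times = c(1000, 2000))     # survival probabilities at 1000 and 2000 days
quantile(Br_fit, probs = 1 - c(0.7, 0.6))  # times at which S(t) equals 0.7 and 0.6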


2.2 Statistical Tests

We have two standard statistical tests for testing simple hypotheses with one grouping variable (i.e., for testing whether the survival functions of different groups of subjects differ statistically significantly). The first and most widely used test is the log-rank test. This test is performed in R using function survdiff(). For example, to use the log-rank test to examine whether the survival rates in the Lung dataset differ between males and females, we use the code:

logrank <- survdiff(Surv(time, status == 2) ~ sex, data = lung)
logrank
## Call:
## survdiff(formula = Surv(time, status == 2) ~ sex, data = lung)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=male   137      111     90.9      4.42        10
## sex=female  90       53     73.1      5.51        10
## 
##  Chisq= 10  on 1 degrees of freedom, p= 0.002

We have found a strong indication of a difference between the two groups. To test the same hypothesis with the Peto & Peto modification of the Gehan-Wilcoxon test, we use the survdiff() function again, but now we set argument rho to 1:

peto_peto <- survdiff(Surv(time, status == 2) ~ sex, data = lung, rho = 1)
peto_peto
## Call:
## survdiff(formula = Surv(time, status == 2) ~ sex, data = lung, 
##     rho = 1)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=male   137     69.7     55.2      3.83      12.3
## sex=female  90     28.8     43.4      4.88      12.3
## 
##  Chisq= 12.3  on 1 degrees of freedom, p= 4e-04

The conclusion remains the same, with an even smaller p-value.

The log-rank test is the most powerful test when the proportional hazards (PH) assumption is satisfied. To check this assumption, we can plot the cumulative hazard functions of the two groups; when PH is satisfied, the two curves are proportional to each other (i.e., they steadily grow apart from each other). The cumulative hazards plot can be produced by the plot() method for survfit objects using the fun argument, e.g.,

KM_fit <- survfit(Surv(time, status == 2) ~ sex, data = lung)
plot(KM_fit, fun = "cumhaz")
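To make the two curves easier to tell apart, we could optionally add colours and a legend, e.g.:

plot(KM_fit, fun = "cumhaz", col = 1:2, xlab = "Time (days)",
    ylab = "Cumulative Hazard")
legend("topleft", levels(lung$sex), lty = 1, col = 1:2, bty = "n")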


3 Accelerated Failure Time Models

3.1 Model Fitting

The function that fits AFT models from the survival package is survreg(). Its first argument is a formula with a similar syntax to that of survfit(). The dist argument specifies the distribution for the survival times (note: the distribution of the survival times themselves, not of the log survival times). The default distribution (i.e., if you do not specify the dist argument yourself) is the Weibull distribution. As with other model-fitting functions in R, the summary() function returns a detailed output of the fitted model. As an example, we fit an AFT model assuming the Weibull distribution for the PBC dataset, controlling for drug, sex and age. The code is:

fit_weibull <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id)

summary(fit_weibull)
## 
## Call:
## survreg(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id)
##                  Value Std. Error     z      p
## (Intercept)    4.13862    0.48260  8.58 <2e-16
## drugD-penicil  0.13046    0.15576  0.84  0.402
## sexfemale      0.44472    0.20147  2.21  0.027
## age           -0.03874    0.00793 -4.89  1e-06
## Log(scale)    -0.10223    0.07423 -1.38  0.168
## 
## Scale= 0.903 
## 
## Weibull distribution
## Loglik(model)= -494.7   Loglik(intercept only)= -511.8
##  Chisq= 34.23 on 3 degrees of freedom, p= 1.8e-07 
## Number of Newton-Raphson Iterations: 5 
## n= 312

To fit the same model but with the exponential distribution, the code is:

fit_exp <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id,
                   dist = "exponential")

summary(fit_exp)
## 
## Call:
## survreg(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id, 
##     dist = "exponential")
##                 Value Std. Error     z       p
## (Intercept)    4.3282     0.5159  8.39 < 2e-16
## drugD-penicil  0.1438     0.1721  0.84    0.40
## sexfemale      0.4816     0.2217  2.17    0.03
## age           -0.0420     0.0084 -5.00 5.7e-07
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -495.6   Loglik(intercept only)= -512.3
##  Chisq= 33.35 on 3 degrees of freedom, p= 2.7e-07 
## Number of Newton-Raphson Iterations: 4 
## n= 312

To fit the same model but with the log-normal distribution, the code is:

fit_lnorm <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id,
                   dist = "lognormal")

summary(fit_lnorm)
## 
## Call:
## survreg(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id, 
##     dist = "lognormal")
##                  Value Std. Error     z       p
## (Intercept)    4.41534    0.59602  7.41 1.3e-13
## drugD-penicil  0.21196    0.18775  1.13    0.26
## sexfemale      0.30242    0.27761  1.09    0.28
## age           -0.04841    0.00944 -5.13 2.9e-07
## Log(scale)     0.34711    0.06473  5.36 8.2e-08
## 
## Scale= 1.41 
## 
## Log Normal distribution
## Loglik(model)= -498.9   Loglik(intercept only)= -515.2
##  Chisq= 32.72 on 3 degrees of freedom, p= 3.7e-07 
## Number of Newton-Raphson Iterations: 3 
## n= 312

To fit the same model but with the log-logistic distribution, the code is:

fit_llogis <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id,
                   dist = "loglogistic")

summary(fit_llogis)
## 
## Call:
## survreg(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id, 
##     dist = "loglogistic")
##                  Value Std. Error     z       p
## (Intercept)    4.00019    0.54926  7.28 3.3e-13
## drugD-penicil  0.14532    0.17387  0.84 0.40325
## sexfemale      0.38446    0.24161  1.59 0.11155
## age           -0.04151    0.00884 -4.69 2.7e-06
## Log(scale)    -0.26129    0.07326 -3.57 0.00036
## 
## Scale= 0.77 
## 
## Log logistic distribution
## Loglik(model)= -497.4   Loglik(intercept only)= -513
##  Chisq= 31.08 on 3 degrees of freedom, p= 8.2e-07 
## Number of Newton-Raphson Iterations: 4 
## n= 312

3.2 Effect Plots

In cases where we consider statistical models with complex terms (interactions and nonlinear terms), the regression coefficients we obtain in the output do not have a straightforward interpretation. In these settings, we can use effect plots to communicate the results of the model. As an example, we fit a model to the PBC dataset that contains the effect of drug, the effect of sex, the linear and quadratic effects of age, and the interactions between sex and the linear and quadratic effects of age, and between drug and the linear and quadratic effects of age. We assume that the survival times follow the Weibull distribution. The code to fit the model is:

fit <- survreg(Surv(years, status2) ~ (drug + sex) * (age + I(age^2)), data = pbc2.id)

summary(fit)
## 
## Call:
## survreg(formula = Surv(years, status2) ~ (drug + sex) * (age + 
##     I(age^2)), data = pbc2.id)
##                            Value Std. Error     z    p
## (Intercept)             8.342101   6.266023  1.33 0.18
## drugD-penicil           3.489874   3.353680  1.04 0.30
## sexfemale              -6.523710   6.189094 -1.05 0.29
## age                    -0.161596   0.217081 -0.74 0.46
## I(age^2)                0.000822   0.001843  0.45 0.66
## drugD-penicil:age      -0.148887   0.127901 -1.16 0.24
## drugD-penicil:I(age^2)  0.001549   0.001197  1.29 0.20
## sexfemale:age           0.240257   0.215642  1.11 0.27
## sexfemale:I(age^2)     -0.002015   0.001847 -1.09 0.28
## Log(scale)             -0.113198   0.074516 -1.52 0.13
## 
## Scale= 0.893 
## 
## Weibull distribution
## Loglik(model)= -492.4   Loglik(intercept only)= -511.8
##  Chisq= 38.81 on 8 degrees of freedom, p= 5.3e-06 
## Number of Newton-Raphson Iterations: 5 
## n= 312

As mentioned above, in this particular example the linear and quadratic effects of age and the corresponding interaction effects cannot be easily interpreted. To visualize the model we will use an effect plot. The first step is to specify a dataset that contains the combinations of covariate values based on which we will create the plot. In our example, we can create a single figure with all possible combinations of the covariates, i.e., both treatment groups, both sexes and increasing age. To create the dataset that contains all possible combinations of the covariates we use function expand.grid(). The code is:

ND <- with(pbc2.id, expand.grid(
    age = seq(30, 65, length.out = 25),
    sex = levels(sex),
    drug = levels(drug)
))

head(ND, 10)
##         age  sex    drug
## 1  30.00000 male placebo
## 2  31.45833 male placebo
## 3  32.91667 male placebo
## 4  34.37500 male placebo
## 5  35.83333 male placebo
## 6  37.29167 male placebo
## 7  38.75000 male placebo
## 8  40.20833 male placebo
## 9  41.66667 male placebo
## 10 43.12500 male placebo

Function seq() creates a sequence of numbers, in our example from 30 to 65 years old, with the sequence being 25 numbers long. Function levels() extracts from the pbc2.id dataset the possible values of the sex and drug variables. Function head() prints the first ten rows of the resulting dataset.

Next, based on this dataset we compute the predictions from the model using function predict():

prs <- predict(fit, ND, se.fit = TRUE, type = "lp")
ND$pred <- prs[[1]]
ND$se <- prs[[2]]
ND$lo <- ND$pred - 1.96 * ND$se
ND$up <- ND$pred + 1.96 * ND$se

head(ND, 10)
##         age  sex    drug     pred        se       lo       up
## 1  30.00000 male placebo 4.234429 1.4533787 1.385807 7.083051
## 2  31.45833 male placebo 4.072483 1.3055301 1.513644 6.631322
## 3  32.91667 male placebo 3.914035 1.1659479 1.628777 6.199293
## 4  34.37500 male placebo 3.759085 1.0347438 1.730987 5.787183
## 5  35.83333 male placebo 3.607634 0.9120677 1.819981 5.395287
## 6  37.29167 male placebo 3.459681 0.7981221 1.895362 5.024000
## 7  38.75000 male placebo 3.315227 0.6931838 1.956586 4.673867
## 8  40.20833 male placebo 3.174270 0.5976336 2.002909 4.345632
## 9  41.66667 male placebo 3.036812 0.5119939 2.033304 4.040321
## 10 43.12500 male placebo 2.902853 0.4369676 2.046396 3.759309

The output of predict() as specified above consists of the predictions (on the log survival time scale) and the corresponding standard errors. The second and third lines of the code add these predictions and standard errors as extra columns to the dataset ND. The fourth and fifth lines calculate the lower and upper limits of a 95% confidence interval for the log survival times. Finally, we create the plot using function xyplot() from package lattice:

xyplot(pred + lo + up ~ age | drug*sex, data = ND, type = "l", 
    lty = c(1,2,2), col = "black", lwd = 2, xlab = "Age", 
    ylab = "Log Survival Time")

To create the plot for only one level of a categorical covariate, we use function factor() in the definition of the dataset ND in order to indicate that this variable has two levels even though only one of them is used. As an example, we recreate the plot from above, but only for females. The specification of ND becomes:

ND <- with(pbc2.id, expand.grid(
    age = seq(30, 65, length.out = 25),
    sex = factor("female", levels = levels(sex)),
    drug = levels(drug)
))

head(ND, 10)
##         age    sex    drug
## 1  30.00000 female placebo
## 2  31.45833 female placebo
## 3  32.91667 female placebo
## 4  34.37500 female placebo
## 5  35.83333 female placebo
## 6  37.29167 female placebo
## 7  38.75000 female placebo
## 8  40.20833 female placebo
## 9  41.66667 female placebo
## 10 43.12500 female placebo

The rest of the code remains the same; the only additional difference is that we now produce the plot on the original time scale rather than for the log survival times. Note the use of function exp() in the code below, i.e.,

prs <- predict(fit, ND, se.fit = TRUE, type = "lp")
ND$pred <- prs[[1]]
ND$se <- prs[[2]]
ND$lo <- exp(ND$pred - 1.96 * ND$se)
ND$up <- exp(ND$pred + 1.96 * ND$se)
ND$pred <- exp(ND$pred)

xyplot(pred + lo + up ~ age | drug, data = ND, type = "l", 
    lty = c(1,2,2), col = "black", lwd = 2, xlab = "Age", 
    ylab = "Survival Time")


3.3 Hypothesis Testing

A statistical comparison between two nested models can be performed with a likelihood ratio test calculated by function anova(). As an example, we use again the complex model that we used for the effect plot in the previous section, i.e.,

fit_alt <- survreg(Surv(years, status2) ~ (drug + sex) * (age + I(age^2)), data = pbc2.id)

Say that we want to test whether we can drop all the complex terms, that is, the interaction and nonlinear terms. Then, the model above is the model under the alternative hypothesis (i.e., the full model). The model under the null hypothesis is the one that does not contain any of these terms (i.e., the reduced model):

fit_null <- survreg(Surv(years, status2) ~ drug + sex + age, data = pbc2.id)

We compare the two models using anova(), i.e.,

anova(fit_null, fit_alt)
##                             Terms Resid. Df    -2*LL Test Df Deviance
## 1                drug + sex + age       307 989.4673      NA       NA
## 2 (drug + sex) * (age + I(age^2))       302 984.8867    =  5 4.580586
##    Pr(>Chi)
## 1        NA
## 2 0.4691742

Note that the user is responsible for supplying appropriately nested AFT models for the likelihood ratio test (LRT) to be valid. The result suggests that we could use the simplified model.
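As an optional sanity check, the likelihood ratio statistic reported by anova() can also be reproduced by hand from the two log-likelihoods (a small sketch):

LRT <- 2 * (as.numeric(logLik(fit_alt)) - as.numeric(logLik(fit_null)))
df <- attr(logLik(fit_alt), "df") - attr(logLik(fit_null), "df")
# should match the Deviance and Pr(>Chi) reported by anova() above
c(LRT = LRT, df = df, p_value = pchisq(LRT, df = df, lower.tail = FALSE))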


3.4 Residuals

In order to check whether the assumed distribution of survival times fits the observed data sufficiently well, we can use the residuals of the model. However, because the residuals are calculated based on the observed event times, they will also be censored. Hence, to investigate whether they follow a particular distribution, we will need to employ a graphical procedure accounting for censoring. One way to achieve this is by using the Kaplan-Meier estimator of the residuals. As an example, we would like to investigate whether the Weibull distribution is an appropriate distribution for the survival times in the AIDS dataset. We first fit the model of interest:

fit_weib <- survreg(Surv(Time, death) ~ drug * gender, data = aids.id)

Next, we construct the standardized residuals based on their definition, i.e., \(r_i = \{\log(t_i) - x_i^\top \hat{\beta}\} / \hat{\sigma}\):

fitted_values <- fit_weib$linear.predictors
resids <- (log(fit_weib$y[, 1]) - fitted_values) / fit_weib$scale

We then compute the Kaplan-Meier estimate of these residuals and plot it. Lines 3-5 of the code below superimpose on the graph the survival function of the standard extreme value distribution, which is the residual distribution implied by the assumed Weibull model.

resKM <- survfit(Surv(resids, death) ~ 1, data = aids.id)
plot(resKM, mark.time = FALSE, xlab = "AFT Residuals", ylab = "Survival Probability")
xx <- seq(min(resids), max(resids), length.out = 35)
yy <- exp(- exp(xx))
lines(xx, yy, col = "red", lwd = 2)
legend("bottomleft", c("KM estimate", "95% CI KM estimate", 
    "Survival function of Extreme Value distribution"), 
    lty = c(1,2,1), col = c(1,1,2), bty = "n")

In this example, we observe that the Weibull distribution provides a good fit to the data. For models that assume other distributions for the survival times, the only thing we need to change is the line of code where the yy values are defined. In particular, for AFT models assuming the log-normal or the Gaussian distribution we use yy <- pnorm(xx, lower.tail = FALSE), and for AFT models assuming the log-logistic or the logistic distribution we use yy <- plogis(xx, lower.tail = FALSE).
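As an illustration of this change only (a sketch; the object names fit_lnorm_aids, resids_ln and resKM_ln are ours), checking a log-normal AFT model for the same AIDS data would look like:

# refit the AIDS model assuming a log-normal distribution
fit_lnorm_aids <- survreg(Surv(Time, death) ~ drug * gender, data = aids.id,
    dist = "lognormal")
# standardized residuals, as before
resids_ln <- (log(fit_lnorm_aids$y[, 1]) - fit_lnorm_aids$linear.predictors) /
    fit_lnorm_aids$scale
resKM_ln <- survfit(Surv(resids_ln, death) ~ 1, data = aids.id)
plot(resKM_ln, mark.time = FALSE, xlab = "AFT Residuals",
    ylab = "Survival Probability")
xx <- seq(min(resids_ln), max(resids_ln), length.out = 35)
lines(xx, pnorm(xx, lower.tail = FALSE), col = "red", lwd = 2)  # standard normal survival function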


4 Cox Proportional Hazards Models

4.1 Model Fitting

The function that fits Cox models from the survival package is coxph(). It has similar syntax to survreg(), which we saw in the previous section, with the only exception that it does not have a dist argument (the Cox model is semi-parametric and does not assume a particular distribution for the survival times). As with the majority of model-fitting functions in R, the summary() method returns a detailed output of the fitted model. As an example, we fit a Cox model to the PBC dataset in which we include drug, sex and age as covariates; the code is:

fit <- coxph(Surv(years, status2) ~ drug + sex + age, data = pbc2.id)
summary(fit)
## Call:
## coxph(formula = Surv(years, status2) ~ drug + sex + age, data = pbc2.id)
## 
##   n= 312, number of events= 140 
## 
##                    coef exp(coef)  se(coef)      z Pr(>|z|)    
## drugD-penicil -0.146013  0.864146  0.172143 -0.848   0.3963    
## sexfemale     -0.470905  0.624437  0.221785 -2.123   0.0337 *  
## age            0.042842  1.043773  0.008505  5.037 4.72e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##               exp(coef) exp(-coef) lower .95 upper .95
## drugD-penicil    0.8641     1.1572    0.6167    1.2109
## sexfemale        0.6244     1.6014    0.4043    0.9644
## age              1.0438     0.9581    1.0265    1.0613
## 
## Concordance= 0.629  (se = 0.024 )
## Rsquare= 0.101   (max possible= 0.991 )
## Likelihood ratio test= 33.25  on 3 df,   p=3e-07
## Wald test            = 34.87  on 3 df,   p=1e-07
## Score (logrank) test = 35.31  on 3 df,   p=1e-07

The column coef in the output contains the log hazard ratios, and exp(coef) the corresponding hazard ratios.
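For example, the hazard ratios and their 95% confidence intervals shown in the output can also be extracted programmatically:

exp(coef(fit))     # hazard ratios
exp(confint(fit))  # corresponding 95% confidence intervals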


4.2 Effect Plots

Similarly to AFT models, when we consider a more complex model that includes interactions and nonlinear terms, it is more useful to communicate the results of the model using an effect plot. The procedure to construct such a plot is the same as for AFT models. For example, we want to produce an effect plot for the following Cox model for the Lung dataset, which contains sex, the nonlinear effect of age and its interaction with sex, and the nonlinear effect of ph.karno:

fit <- coxph(Surv(time, status) ~ ns(age, 3) * sex + ns(ph.karno, 3), data = lung)

We first construct the dataset that gives the combinations of covariate values based on which the plot will be constructed. This is done with function expand.grid():

ND <- with(lung, expand.grid(
    age = seq(45, 75, length.out = 25),
    sex = levels(sex),
    ph.karno = quantile(ph.karno, c(0.25, 0.5, 0.75))
))
head(ND)
##     age  sex ph.karno
## 1 45.00 male       75
## 2 46.25 male       75
## 3 47.50 male       75
## 4 48.75 male       75
## 5 50.00 male       75
## 6 51.25 male       75

As seen above, we will produce the plot for increasing age from 45 to 75 years, for both males and females, and for ph.karno increasing from the 1st quartile to the median to the 3rd quartile. Next, we obtain the predictions from the model (these are on the log-hazard scale) and calculate the 95% confidence intervals:

prs <- predict(fit, newdata = ND, type = "lp", se.fit = TRUE)
ND$pred <- prs[[1]]
ND$se <- prs[[2]]
ND$lo <- ND$pred - 1.96 * ND$se
ND$up <- ND$pred + 1.96 * ND$se
ND$ph.karno <- factor(ND$ph.karno, labels = paste("PH Karno =", sort(unique(ND$ph.karno))))

The last line of the code transforms the ph.karno variable into a factor in order to obtain more informative panel labels in the plot, which is produced with the following code:

xyplot(pred + lo + up ~ age | ph.karno * sex, data = ND, 
       type = "l", col = "black", lwd = 2, lty = c(1, 2, 2),
       abline = list(h = 0, lty = 2, lwd = 2, col = "red"),
       xlab = "Age (years)", ylab = "log Hazard Ratio")


4.3 Hypothesis Testing

When a single categorical covariate is included in the Cox model, the summary() method returns the Wald, score (in this case equivalent to the log-rank test), and likelihood ratio tests. For example, for the PBC data these three tests for the sex variable are:

fit <- coxph(Surv(years, status2) ~ sex, data = pbc2.id)
summary(fit)
## Call:
## coxph(formula = Surv(years, status2) ~ sex, data = pbc2.id)
## 
##   n= 312, number of events= 140 
## 
##              coef exp(coef) se(coef)      z Pr(>|z|)   
## sexfemale -0.6502    0.5219   0.2183 -2.979   0.0029 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## sexfemale    0.5219      1.916    0.3403    0.8006
## 
## Concordance= 0.537  (se = 0.016 )
## Rsquare= 0.024   (max possible= 0.991 )
## Likelihood ratio test= 7.73  on 1 df,   p=0.005
## Wald test            = 8.87  on 1 df,   p=0.003
## Score (logrank) test = 9.18  on 1 df,   p=0.002

For a more complex hypothesis, we can use the likelihood ratio test by comparing the models under the null and alternative hypotheses. This is done using the anova() function. For example, in the PBC dataset, and for the Cox model that contains drug, age and their interaction, we would like to test for the effect of drug. Hence, the model under the null hypothesis is the one that does not contain drug. The relevant code is:

fit_null <- coxph(Surv(years, status2) ~ age, data = pbc2.id)
fit_alt <- coxph(Surv(years, status2) ~ drug * age, data = pbc2.id)

anova(fit_null, fit_alt)
## Analysis of Deviance Table
##  Cox model: response is  Surv(years, status2)
##  Model 1: ~ age
##  Model 2: ~ drug * age
##    loglik  Chisq Df P(>|Chi|)
## 1 -712.41                    
## 2 -711.29 2.2259  2    0.3286

As an additional example, we illustrate how we can test for non-linearity using natural cubic splines. We refit the above model by now allowing the effect of age to be nonlinear using natural cubic splines with 3 degrees of freedom.

fit_alt <- coxph(Surv(years, status2) ~ drug + ns(age, 3), data = pbc2.id)

To test if the effect of age is nonlinear in the log hazard scale, we fit the model under the null hypothesis that assumes a linear effect for age, and we compare the two models using the likelihood ratio test:

fit_null <- coxph(Surv(years, status2) ~ drug + age, data = pbc2.id)

anova(fit_null, fit_alt)
## Analysis of Deviance Table
##  Cox model: response is  Surv(years, status2)
##  Model 1: ~ drug + age
##  Model 2: ~ drug + ns(age, 3)
##    loglik  Chisq Df P(>|Chi|)
## 1 -711.97                    
## 2 -711.90 0.1233  2    0.9402

The result shows no indication that the effect of age is nonlinear (i.e., the two models do not differ statistically significantly). Hence, we could proceed with the model that assumes a linear effect of age.


4.4 Proportional Hazards Assumption

For categorical covariates, we can check the proportional hazards assumption by appropriately transforming the Kaplan-Meier estimate to the log-log scale. If PH holds, we would expect to see (roughly) parallel lines on this scale. We illustrate this procedure for the sex variable from the PBC dataset.

KM_fit <- survfit(Surv(years, status2) ~ sex, data = pbc2.id)

par(mfrow = c(1, 2))

plot(KM_fit, xlab = "Years", ylab = "Survival", col = 1:2)
legend("topright", levels(pbc2.id$sex), lty = 1, col = 1:2, bty = "n")

plot(KM_fit, fun = function (s) -log(-log(s)), xlab = "Years", 
    ylab = "-log(- log(Survival))", col = 1:2)

The plot in the left panel of the figure is the classical Kaplan-Meier estimator (i.e., on the y-axis we have survival probabilities). The plot in the right panel has on the y-axis the \(-\log[-\log\{S(t)\}]\) transformation of the survival function \(S(t)\). We observe some strong indications of non-parallel lines. As a possible solution, we could stratify by sex.
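A minimal sketch of such a stratified model (each sex then gets its own baseline hazard, so no PH assumption is made for sex; the object name fit_strat is ours):

fit_strat <- coxph(Surv(years, status2) ~ age + strata(sex), data = pbc2.id)
summary(fit_strat)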

For continuous covariates, we can test the proportional hazards assumption using the Schoenfeld residuals. Note that this procedure can also be used for categorical covariates. We illustrate their use in a Cox model for the PBC dataset that contains sex, age and their interaction. The model is fitted with the code:

fit <- coxph(Surv(years, status2) ~ sex * age, data = pbc2.id)

The function that calculates the Schoenfeld residuals is cox.zph(). Its two primary arguments are the fitted Cox model and the transformation of time to be used. The code below does the calculations using the Kaplan-Meier transformation of the time scale:

check_PH <- cox.zph(fit, transform = "km")
check_PH
##                   rho chisq     p
## sexfemale     -0.0674 0.503 0.478
## age           -0.0511 0.270 0.604
## sexfemale:age  0.0408 0.186 0.667
## GLOBAL             NA 2.343 0.504

The output of the function reports a test for non-proportionality (i.e., the null hypothesis is that the corresponding variable satisfies PH). However, it is more informative to inspect the PH assumption graphically by plotting the Schoenfeld residuals. These figures are produced with the plot() method; we produce the plot separately for each predictor of the model so that we can also add the red horizontal line (the estimated coefficient from the Cox model) corresponding to PH.

plot(check_PH, var = 1)
abline(h = coef(fit)[1], col = "red", lwd = 2)

plot(check_PH, var = 2)
abline(h = coef(fit)[2], col = "red", lwd = 2)

plot(check_PH, var = 3)
abline(h = coef(fit)[3], col = "red", lwd = 2)

We observe that for all variables PH seems to hold.

Notes

  • The two methods disagree on whether the sex variable satisfies PH: the transformation of the Kaplan-Meier estimator suggested a violation, whereas the Schoenfeld residuals did not. Of the two methods, the latter is the more trustworthy.

  • From the plots, we observe that the residuals are quite evenly spaced along the x-axis, and therefore there is no need to consider other transformations of the time scale (the sketch below shows how an alternative transformation could nonetheless be examined).
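If we nevertheless wanted to examine a different time scale, a sketch would look like the following (e.g., using the identity transformation; the object name check_PH_id is ours):

check_PH_id <- cox.zph(fit, transform = "identity")
check_PH_id
plot(check_PH_id, var = 2)  # Schoenfeld residuals for age on the untransformed time scale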


5 Extensions of the Cox Model

5.1 Survival Probabilities from Cox Models

As we have previously explained, the Cox model does not estimate the baseline hazard, and therefore we cannot directly obtain survival probabilities from it. To achieve this, we need to combine it with a non-parametric estimator of the baseline hazard function; the most popular choice is the Breslow estimator. For a fitted Cox model from package survival, these probabilities are calculated by function survfit(). As an illustration, we would like to derive survival probabilities from the following Cox model for the AIDS dataset:

fit <- coxph(Surv(Time, death) ~ CD4 + drug + gender, data = aids.id)

Since this model contains three covariates, we need to specify the combinations of covariate values for which we want to derive the probabilities; these combinations need to be supplied in a dataset. As an example, we derive survival probabilities for both males and females, under either treatment, with square root CD4 equal to 3. The code to create this dataset uses the expand.grid() function that we have seen before:

ND <- with(aids.id, expand.grid(CD4 = 3, drug = levels(drug), gender = levels(gender)))
ND
##   CD4 drug gender
## 1   3  ddC female
## 2   3  ddI female
## 3   3  ddC   male
## 4   3  ddI   male

Then we give the fitted model and the data defined above as arguments to survfit():

surv_probs_Cox <- survfit(fit, newdata = ND)
surv_probs_Cox
## Call: survfit(formula = fit, newdata = ND)
## 
##     n events median 0.95LCL 0.95UCL
## 1 467    188   12.2     9.8      NA
## 2 467    188   10.6     8.0    16.6
## 3 467    188   14.1    11.9    17.8
## 4 467    188   11.6    10.4    14.1

The output gives the median survival times and the corresponding 95% confidence intervals for these groups of patients. Similarly to how we worked in Section 2.1, we can derive the survival probabilities at specific follow-up times using the summary() method and the times argument. For example, the following code extracts the survival probabilities for the groups above at eight months of follow-up:

summary(surv_probs_Cox, times = 8)
## Call: survfit(formula = fit, newdata = ND)
## 
##  time n.risk n.event survival1 survival2 survival3 survival4
##     8    379      86       0.7     0.627     0.747     0.683

To obtain the lower and upper limits of the corresponding 95% confidence intervals, we need to extract them manually from the output of summary(), i.e.,

summ <- summary(surv_probs_Cox, times = 8)
out <- t(rbind(summ$surv, summ$lower, summ$upper))
colnames(out) <- c("Surv_Prob", "Lower", "Upper")
out
##   Surv_Prob     Lower     Upper
## 1 0.6998629 0.5850566 0.8371977
## 2 0.6274689 0.4966114 0.7928075
## 3 0.7471514 0.6894090 0.8097300
## 4 0.6834015 0.6182431 0.7554271

Finally, the plot() method produces a figure with all four survival curves:

plot(surv_probs_Cox, col = c("black", "red", "blue", "green"),
     xlab = "Follow-Up Time (months)", ylab = "Survival Probabilities")
legend("bottomleft", paste("drug =", ND$drug, ", gender = ", ND$gender), 
       lty = 1, col = c("black", "red", "blue", "green"), bty = "n")