# Two-sample randomized trial

#### Emily C. Zabor

#### Last updated: 2022-08-08

Source:`vignettes/two_sample_randomized.Rmd`

`two_sample_randomized.Rmd`

## Setup

To execute the code in this vignette, you first need to install and load the {ppseq} package. You will also need the {future} package to parallelize your code to improve speed.

```
install.packages("ppseq")
install.packages("future")
```

## Introduction

Efforts to develop biomarker-targeted anti-cancer therapies have progressed rapidly in recent years. With the aim of expediting regulatory reviews of promising therapies, an increasing number of targeted cancer therapies are being granted accelerated approval on the basis of evidence acquired in single-arm phase II clinical trials. The historical control rates used to design and evaluate emerging targeted therapies in single-arm trials often arise as population averages, lacking specificity to the biomarker-targeted subpopulation of interest. Thus, historical trial results are inherently limited for inferring the potential “comparative efficacy” of novel targeted therapies. Randomization may be the best option in this setting, and is not out of the question given the increasingly large sample sizes being used in phase II trials. We propose a design for two-arm randomized phase II trials based on sequential predictive probability monitoring that allows an investigator to identify an optimal clinical trial design within constraints of traditional type I error and power, and with futility stopping rules that preserve valuable human and financial resources (cite two-sample paper once it’s available).

## Case study background

Atezolizumab is a programmed death-ligand 1 (PD-L1) blocking monoclonal antibody that was given accelerated approval by the U.S. Food and Drug Administration in May 2016 for the treatment of patients with locally advanced or metastatic urothelial carcinoma who had disease progression following platinum-containing chemotherapy. The approval was based on the results of a single-arm phase II study in 310 patients (Rosenberg et al. 2016). The phase II study used a hierarchical fixed-sequence testing procedure to test increasingly broad subgroups of patients based on PD-L1 status, and found overall response rates of 26% (95% CI: 18-36), 18% (95% CI: 13-24), and 15% (95% CI 11-19) in patients with ≥5% PD-L1-positive immune cells (IC2/3 subgroup), in patients with ≥1% PD-L1-positive immune cells (IC1/2/3 subgroup), and in all patients, respectively (Rosenberg et al. 2016). All three rates exceeded the historical control rate of 10%. Then, in March 2021, the approval in this indication was voluntarily withdrawn by the sponsor following negative results from a randomized phase III study (Powles et al. 2018). In the phase III study, 931 patients were randomly assigned to receive atezolizumab or chemotherapy in a 1:1 ratio, and the same hierarchical fixed-sequence testing procedure as in the phase II study was used. The phase III study found that overall survival did not differ significantly between the atezolizumab and chemotherapy groups of the IC2/3 subgroup (median survival 11.1 months [95% CI: 8.6-15.5] versus 10.6 months [95% CI: 8.4-12.2]), so no further testing was conducted for the primary endpoint (3). Further analyses revealed that while the response rates to atezolizumab were comparable to those seen in the phase II study, the response rates to chemotherapy were much higher than the historical control rate of 10%. The overall response rates to chemotherapy were 21.6% (95% CI: 14.5-30.2), 14.7% (95% CI: 10.9-19.2), and 13.4% (95% CI: 10.5-16.9) for the IC2/3 subgroup, IC1/2/3 subgroup, and all patients, respectively. The overall response rates to atezolizumab were 23% (95% CI: 15.6-31.9), 14.1% (95% CI: 10.4-18.5), and 13.4% (95% CI: 10.5-16.9) for the IC2/3 subgroup, IC1/2/3 subgroup, and all patients, respectively. These results indicate that PD-L1 status is a predictive biomarker for both standard of care chemotherapies that comprised the control arm as well as atezolizumab in this patient population.

## Re-design of case study

We will demonstrate the use of the functions in {ppseq} to re-design the phase II trial of atezolizumab using a two-arm randomized design with sequential predictive probability monitoring. We focus here on the main biomarker subgroup of interest, the IC2/3 subgroup. We design the study with a null response rate of 0.1 in both arms, and an alternative response rate of 0.25 in the atezolizumab arm. We plan the study with 100 participants, assuming that the total sample size available is similar to the 310 used in the actual single-arm phase II trial, and that a third of that patient population fall into our desired biomarker subgroup. We will check for futility after every 10 patients are enrolled on each arm. To design the study, we need to calibrate the design over a range of posterior probability thresholds and predictive probability thresholds.

The posterior probability is calculated from the posterior distribution based on the specified priors and the data observed so far, and represents the probability of success based only on the data accrued so far. A posterior probability threshold would be set during the design stage and if, at the end of the trial, the posterior probability exceeded the pre-specified threshold, the trial would be declared a success. The posterior predictive probability is the probability that, at a given interim monitoring point, the treatment will be declared efficacious at the end of the trial when full enrollment is reached, conditional on the currently observed data and the specified priors. Predictive probability provides an intuitive monitoring framework that tells an investigator what the chances are of declaring the treatment efficacious at the end of the trial if we were to continue enrolling to the maximum planned sample size, given the data observed so far in the trial. If this probability drops below a certain threshold, which is pre-specified during the design stage, the trial would be stopped early. Predictive probability thresholds closer to 0 lead to less frequent stopping for futility, whereas thresholds near 1 lead to frequent stopping unless there is almost certain probability of success.

We consider a grid of thresholds so that we have a range of possible designs from which to select a design with optimal operating characteristics such as type I error, power, and sample size under the null and alternative. In this example we will consider posterior thresholds of 0.7, 0.74, 0.78, 0.82, 0.86, 0.9, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, and 0.99, and predictive thresholds of 0.05, 0.1, 0.15, and 0.2.

## Using `calibrate_thresholds()`

to obtain design
options

To conduct the case study re-design, we use the
`calibrate_thresholds()`

function from the {ppseq} package.
This function is written using the `future`

and
`furrr`

packages, but the user will have to set up a call to
`future::plan`

that is appropriate for their operating
environment and their simulation setup prior to running the function. In
this example, we used the following code on a Unix server with 192
cores, with the goal of utilizing 56 cores since our grid of thresholds
to consider was 14 posterior thresholds by 4 predictive thresholds.

The `calibrate_thresholds()`

function will simulate
`nsim`

datasets under the null hypothesis and
`nsim`

datasets under the alternative hypothesis. For each
simulated dataset, every combination of posterior and predictive
thresholds will be considered and the final sample size and whether or
not the trial was positive will be saved. Then across all simulated
datasets, `calibrate_thresholds()`

will return the average
sample size under the null, the average sample size under the
alternative, the proportion of positive trials under the null (i.e. the
type I error), and the proportion of positive trials under the
alternative (i.e. the power).

Because `calibrate_thresholds()`

randomly generates
simulated datasets, you will want to set a seed before running the
function in order to ensure your results are reproducible.

The inputs to `calibrate_thresholds()`

for our case study
re-design include `p_null = c(0.1, 0.1)`

as the null response
rate, `p_alt = c(0.1, 0.25)`

as the alternative response
rate, `n = cbind(seq(10, 50, 10), seq(10, 50, 10))`

indicates
interim looks after every 10 patients up to a total of 50 in each arm,
`N = c(50, 50)`

indicates the final total sample size of 50
in each arm, `direction = "greater"`

specifies that interest
is in whether the response rate in the experimental arm exceeds the
response rate in the control arm, `delta = 0`

is the default
for the clinically meaningful difference in the two-sample case,
`prior = c(0.5, 0.5)`

specifies that both hyperparameters of
the prior beta distribution be set to 0.5, `S = 5000`

specifies that 5000 posterior samples will be drawn to calculate the
posterior and predictive probabilities, and `nsim = 1000`

specifies that we will generate 1000 simulated datasets under both the
null and the alternative. `pp_threshold`

is a vector of the
posterior thresholds of interest and `ppp_threshold`

is a
vector of predictive thresholds of interest. Note that due to the
computational time involved, the object produced from the below example
code `two_sample_cal_tbl`

is available as a dataset in the
{ppseq} package.

```
set.seed(123)
future::plan(future::multicore(workers = 56))
two_sample_cal_tbl <-
calibrate_thresholds(p_null = c(0.1, 0.1),
p_alt = c(0.1, 0.25),
n = cbind(seq(10, 50, 10), seq(10, 50, 10)),
N = c(50, 50),
pp_threshold = c(0.7, 0.74, 0.78, 0.82, 0.86, 0.9,
0.92, 0.93, 0.94, 0.95, 0.96, 0.97,
0.98, 0.99),
ppp_threshold = seq(0.05, 0.2, 0.05),
direction = "greater",
delta = 0,
prior = c(0.5, 0.5),
S = 5000,
nsim = 1000
)
```

## Results

We will limit our consideration to designs with type I error between 0.05 and 0.1, and a minimum power of 0.7.

###
`print()`

When you pass the results of `calibrate_thresholds()`

to
`print()`

you will get back a table of the resulting design
options that satisfy the desired range of type I error, specified as a
vector of minimum and maximum passed to the `type1_range`

argument, and minimal power, specified as a numeric value between 0 and
1 passed to the argument `minimum_power`

.

pp_threshold | ppp_threshold | mean_n0_null | mean_n1_null | prop_pos_null | prop_stopped_null | mean_n0_alt | mean_n1_alt | prop_pos_alt | prop_stopped_alt |
---|---|---|---|---|---|---|---|---|---|

0.86 | 0.10 | 28.56 | 28.56 | 0.100 | 0 | 45.09 | 45.09 | 0.751 | 0 |

0.86 | 0.15 | 26.80 | 26.80 | 0.095 | 0 | 43.43 | 43.43 | 0.722 | 0 |

0.90 | 0.05 | 29.88 | 29.88 | 0.082 | 0 | 46.09 | 46.09 | 0.737 | 0 |

0.90 | 0.10 | 27.30 | 27.30 | 0.078 | 0 | 43.83 | 43.83 | 0.704 | 0 |

0.92 | 0.05 | 28.61 | 28.61 | 0.070 | 0 | 45.49 | 45.49 | 0.701 | 0 |

We find that 5 of the 56 considered combinations of posterior and
predictive thresholds result in a design within our acceptable range of
type I error and minimal power. The column labeled
`prop_pos_null`

contains the proportion of positive trials
under the null, which represents the type I error rate, and the column
labeled `prop_pos_alt`

contains the proportion of positive
trials under the alternative, which represents the power. The column
labeled `mean_n1_null`

contains the average sample size under
the null whereas the column labeled `mean_n1_alt`

contains
the average sample size under the alternative.

###
`optimize_design()`

Finally, we can pass the results from a call to
`calibrate_thresholds()`

to the
`optimize_design()`

function to list the optimal design with
respect to type I error and power, termed the “optimal accuracy” design,
and the optimal design with respect to the average sample sizes under
the null and alternative, termed the “optimal efficiency” design, within
the specified range of acceptable type I error and minimum power.

```
optimize_design(two_sample_cal_tbl,
type1_range = c(0.05, 0.1),
minimum_power = 0.7)
#> $`Optimal accuracy design:`
#> # A tibble: 1 × 6
#> pp_threshold ppp_threshold `Type I error` Power `Average N under the null`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.86 0.1 0.1 0.751 28.6
#> `Average N under the alternative`
#> <dbl>
#> 1 45.1
#>
#> $`Optimal efficiency design:`
#> # A tibble: 1 × 6
#> pp_threshold ppp_threshold `Type I error` Power `Average N under the null`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.92 0.05 0.07 0.701 28.6
#> `Average N under the alternative`
#> <dbl>
#> 1 45.5
```

The optimal accuracy design is the one with posterior threshold 0.86 and predictive threshold 0.1. It has a type I error of 0.1, power of 0.751, average sample size under the null of 29 per arm, and average sample size under the alternative of 45 per arm.

The optimal efficiency design is the one with posterior threshold of 0.92 and predictive threshold of 0.05. It has a type I error of 0.07, power of 0.701, average sample size under the null of 29 per arm, and average sample size under the alternative of 46 per arm.

We find that either the optimal accuracy or the optimal efficiency sequential predictive probability design reasonable type I error and power to detect the effect of interest, and results in a much lower average total sample size under both the null and alternative than the 310 actually used in the phase II study of atezolizumab. Additionally, this randomized design would allow direct estimation of the response rate to both the experimental treatment and to the standard of care treatment in the biomarker subgroup of interest, thus avoiding the issues with historical control rates that arose in the original atezolizumab study. While this design would not prevent the problem of the targeted biomarker being prognostic rather than predictive, it uses fewer patients and avoids continuation to phase III when it is unwarranted.

###
`calc_decision_rules()`

Once we have selected the design of interest, we need to obtain the
decision rules at each futility monitoring time for easy implementation
of our trial. We can do so by passing the parameters of our selected
design to `calc_decision_rules()`

. The input parameters
`n`

, `direction`

, `p0`

,
`delta`

, `prior`

, `S`

, and
`N`

are the same as in `calibrate_thresholds()`

.
The input parameter `theta = 0.92`

specifies that we are
interested in a posterior probability threshold of 0.92 and the input
parameter `ppp = 0.1`

specifies that we are interested in a
predictive probability threshold of 0.1, as determined by the optimal
efficiency design above. Note that due to the computational time
involved, the object produced from the below example code
`one_sample_decision_tbl`

is available as a dataset in the
{ppseq} package.

```
set.seed(123)
two_sample_decision_tbl <-
calc_decision_rules(
n = cbind(seq(10, 50, 10), seq(10, 50, 10)),
N = c(50, 50),
theta = 0.92,
ppp = 0.05,
p0 = NULL,
direction = "greater",
delta = 0,
prior = c(0.5, 0.5),
S = 5000
)
```

`two_sample_decision_tbl`

n0 | n1 | r0 | r1 | ppp |
---|---|---|---|---|

10 | 10 | 0 | NA | NA |

10 | 10 | 1 | 0 | 0.0394 |

10 | 10 | 2 | 0 | 0.0096 |

10 | 10 | 3 | 1 | 0.0256 |

10 | 10 | 4 | 2 | 0.0382 |

10 | 10 | 5 | 3 | 0.0384 |

10 | 10 | 6 | 4 | 0.0456 |

10 | 10 | 7 | 5 | 0.0440 |

10 | 10 | 8 | 6 | 0.0330 |

10 | 10 | 9 | 7 | 0.0258 |

10 | 10 | 10 | 9 | 0.0350 |

Above are the first 11 rows of the resulting table, which contains
all possible combinations of number of responses in the control and
experimental arms that would stop the trial at each interim look. In the
results table, `n0`

indicates the number of enrolled patients
in the control arm and `n1`

indicates the number of enrolled
patients in the treatment arm at each look for futility; `r0`

indicates the number of responses in the control arm and `r1`

indicates the number of responses in the treatment arm for which we
would stop the trial at a given interim look if the number of observed
responses is <=r1 for a given fixed value of r0. At the end of the
trial the treatment would be considered promising if the number of
observed responses is >r1 for a given r0. For example, in this case
we see that if we had 4 responses in the control arm out of the first 10
control arm patients, we would stop the trial if we had 2 or fewer
responses in the experimental arm out of the first 10 experimental arm
patients.

###
`plot()`

Passing the results of `calibrate_thresholds()`

to
`plot()`

returns two plots, one of type I error by power, and
one of average sample size under the null by average sample size under
the alternative. The arguments `type1_range`

and
`power`

are used in the same way as for the
`print()`

function. The argument `plotly`

defaults
to FALSE, which results in a pair of side-by-side ggplots being
returned. If set to TRUE, two individual interactive plotly plots will
be returned, as demonstrated below.

Note that when points were tied based on optimization criteria, the point with the highest posterior and predictive threshold is selected for plotting.

The diamond-shaped point represents the optimal design based on each
criteria, which is discussed in more detail below. Using the
`plotly = TRUE`

option to `plot()`

we can hover
over each point to see the x-axis and y-axis values, along with the
distance to the top left corner on each plot, the posterior and
predictive thresholds associated with each point, and the average sample
sizes under the null and alternative for each arm.

Passing the results of `calc_decision_rules()`

to
`plot()`

returns a faceted plot according to the interim
look. One each plot, the x-axis indicates the number of responses in the
control arm and the y-axis indicates the number of responses in the
experimental arm. The color denotes whether the trial should proceed
(green) or stop (red). Hovering over each box will indicate the
combination of sample size and responses and the decision at that
time.

`plot(two_sample_decision_tbl)`