12 Pitfalls in the Design, Conduct and Analysis of Randomised Clinical Trials
Richard Stephens
R. Stephens, Research Scientist, Cancer Division, MRC Clinical Trials Unit, 222 Euston Road, London NW1 2DA, UK
CONTENTS
12.1 Introduction
12.2 Trial Design
12.3 Choice of Control Therapy
12.3.1 Eligibility
12.3.2 Choice of Endpoints
12.4 Trial Conduct
12.4.1 Monitoring
12.4.2 Follow-up
12.5 Trial Analysis
12.5.1 Patient Population
12.5.2 Pre-treatment Patient Characteristics
12.6 Conclusions
References
12.1 Introduction
We are in the era of evidence-based medicine, and the building blocks for this evidence are randomised clinical trials. The importance of high-quality randomised trials therefore cannot be overstated. In theory randomised clinical trials are very simple. Half of the patients receive the standard treatment, half receive the new treatment, and the two groups are compared in terms of efficacy. What could go wrong? Well, in practice, many things! The design, conduct and analysis of randomised clinical trials can actually be very complex. This chapter aims to highlight some of the common pitfalls, giving examples from recent publications, and to suggest ways of avoiding them.
It is first important to describe what we are trying to do in a randomised trial, as without this understanding the implications of the pitfalls discussed cannot be fully appreciated.

Classically a randomised trial compares a new experimental therapy with the current standard therapy, in an attempt to find out whether the new treatment is better and, if so, to estimate how much better. Usually, in cancer, the primary endpoint of interest is survival, but in addition response, toxicity, quality of life and cost-effectiveness may also be important factors in deciding whether the new treatment is better.
If we had access to every patient with the disease under scrutiny and could randomise them all, we could obtain a fairly accurate measure of whether the new treatment is better than the standard, and if so by how much. However, of course, we don't. We only have access to a sample of these patients, and all the results of our randomised trial can do is give an estimation of the true difference. It stands to reason, therefore, that the larger the number of patients we study, the better the estimation.
The beauty of randomisation is that it ensures that a sample of patients is divided into groups that are as comparable as possible. Given sufficient patients, the groups will not only be automatically matched on obvious characteristics (for example, age and sex), but, most importantly, in every other aspect. It is the latter point that makes the act of randomisation so crucial, and the use of historical controls so risky, as we are still unable to predict with any great accuracy which patients will do well, which badly, and what factors influence outcome. Randomisation thus ensures that the only difference between the groups will be the treatment they receive. Nevertheless, it is also important to remember that the sample of patients we are studying may be drawn from anywhere within the full population, and thus groups of patients receiving the same treatment in different trials may have different outcomes.
A number of statistical terms are used to describe how close the estimated result from a trial is likely to be to the true result:
The ‘power’ of a trial relates to the chance of identifying a difference if it exists. Trials that are underpowered (i.e. do not include enough patients to reliably detect the difference) may therefore produce a false-negative result (also referred to as a type II error). Generally trials are powered at 90%, but this still means that 10 out of every 100 trials so powered will be falsely negative (i.e. although a difference exists between the groups, the trial suggests no difference). Unfortunately, of course, we never know which ‘negative’ results are false-negatives!
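To make this concrete, the short Python sketch below first computes a standard two-proportion sample size and then simulates trials in which the targeted difference genuinely exists; the 20% and 30% response rates, the two-sided 5% significance level and the 90% power are illustrative assumptions, not figures from this chapter.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative design: control response rate 20%, new treatment 30%,
# two-sided alpha = 0.05, power = 90% (all figures assumed).
p0, p1, alpha, power = 0.20, 0.30, 0.05, 0.90
z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
n = int(np.ceil((z_a + z_b) ** 2 * (p0 * (1 - p0) + p1 * (1 - p1))
                / (p1 - p0) ** 2))
print(f"patients per arm: {n}")

def significant(x0, x1, n):
    # Two-proportion z-test with pooled standard error.
    pooled = (x0 + x1) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    return abs((x1 / n - x0 / n) / se) > z_a

# Simulate trials in which the targeted 10% difference truly exists:
# roughly 10 in every 100 such trials still come out 'negative'.
hits = [significant(rng.binomial(n, p0), rng.binomial(n, p1), n)
        for _ in range(10_000)]
print(f"false-negative rate: {1 - np.mean(hits):.2f}")   # ~0.10
```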
The p value indicates how likely it is that an observed difference has been found purely by chance. Thus a p value of 0.05 indicates that this result would have occurred by chance 5 times in every 100. It is generally considered that a difference with a p value of ≤0.05 is a true and ‘positive’ result. However, it is vital to remember that this actually means that 5 out of every 100 ‘positive’ results will be false-positives (also referred to as a type I error), found purely by chance. Again, the trouble is we never know which!
Whilst we need to be aware that a proportion of positive trial results may in fact be false-positives (and a proportion of negatives false-negatives), the problem of type I and type II errors also affects analyses within a trial, as the more tests that are performed, the more likely it is that these will be contaminated with false results. To reduce this risk, the number of statistical tests performed in a trial should be limited. A good way of doing this is to consider that within a trial there is only a certain amount of p value spending. So, if one test is performed and the result is p≤0.05, then the result can be considered significant. If two tests are performed then perhaps they should only be considered significant if p≤0.025 or, as is often done to accommodate interim analyses, the first is only considered significant if p≤0.001 so that the second can be considered significant if p≤0.049. Consider, for example, one table relating to the assessment of quality of life in the paper by Sundstrom et al. (2002) in which 84 p values were calculated, although the authors recognised the problem and indicated that only p<0.01 would be considered significant.
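The arithmetic behind this rule is easily sketched; the illustrative snippet below shows how the family-wise false-positive risk grows with the number of independent tests, and the simplest (Bonferroni-type) way of spending the 5% budget.

```python
# Chance of at least one false-positive among k independent tests,
# each judged at the 5% level, when no true differences exist.
for k in (1, 2, 5, 10, 84):      # 84 echoes the Sundstrom et al. table
    print(f"{k:3d} tests: P(at least one false-positive) = {1 - 0.95 ** k:.2f}")

# The simplest p value spending rule (Bonferroni): judge each of the
# k tests against 0.05 / k so the overall 5% 'budget' is preserved.
k = 2
print(f"per-test threshold for {k} tests: {0.05 / k:.3f}")
```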
The hazard ratio (HR) is usually used to indicate the overall survival difference, with, conventionally, a value of <1 indicating that the new treatment is better, and >1 indicating that the new treatment is worse. Thus an HR for a survival difference of 0.85 indicates that the new treatment results in a 15% better survival, and an HR of 1.02 indicates that the new treatment is actually 2% worse. A ballpark method of converting the HR into real time is that the HR is approximately equal to the median survival of patients on the standard treatment divided by the median survival of patients on the new treatment. In addition, the HR is approximately equal to the natural log of the proportion of patients surviving at a particular timepoint on the new treatment, divided by the natural log of the proportion of patients surviving at the same timepoint on the standard treatment. Thus, for example, if the median and 1-year survival of patients on a standard treatment are 9 months and 20% respectively, and the HR from a trial is 0.85, the estimated median and 1-year survival for patients on the new treatment are approximately 10.6 months and 25.5% respectively.
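These ballpark conversions are easily scripted; the minimal sketch below reproduces the worked example, assuming proportional hazards (roughly, exponential survival), which is what makes the approximations valid.

```python
def median_on_new(median_std, hr):
    # HR ~ (median on standard) / (median on new), so rearrange:
    return median_std / hr

def survival_on_new(surv_std, hr):
    # HR ~ ln(S_new) / ln(S_std), so S_new = S_std ** HR
    return surv_std ** hr

print(median_on_new(9, 0.85))       # ~10.6 months
print(survival_on_new(0.20, 0.85))  # ~0.255, i.e. 25.5% at 1 year
```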
However, probably the most important statistical term is the 95% confidence interval (CI). This indicates the range in which we are 95% sure that the true value lies. Thus, for example, in a survival comparison, an HR of 0.85 with a 95% CI of 0.65-1.05 indicates that our best estimate of the survival difference is that the new treatment is 15% better, but we are 95% confident that it is somewhere between 35% better and 5% worse. This surprisingly wide range is, however, the sort of range commonly obtained from randomised trials with a sample size of about 250 patients. Thousands of patients are required to obtain confidence intervals of only about 5% around the HR. Even in a trial of more than 1,000 patients, comparing surgery with or without adjuvant chemotherapy, Scagliotti et al. (2003) reported an HR of 0.96 with a 95% CI of 0.81-1.31, indicating that, compared with the median survival of 48 months with surgery alone, adjuvant chemotherapy could have resulted in a detriment of 5.5 months or a benefit of 11 months.
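The relationship between trial size and interval width can be sketched with the standard large-sample approximation se(log HR) ≈ √(4/d) for d deaths split roughly equally between the arms; the event counts below are illustrative, not taken from the trials cited.

```python
import math

def hr_95ci(hr, deaths):
    # se(log HR) ~ sqrt(4 / deaths) for a 1:1 randomised comparison.
    se = math.sqrt(4 / deaths)
    return hr * math.exp(-1.96 * se), hr * math.exp(1.96 * se)

for deaths in (250, 1000, 4000):    # illustrative numbers of deaths
    lo, hi = hr_95ci(0.85, deaths)
    print(f"{deaths:5d} deaths: 95% CI {lo:.2f}-{hi:.2f}")
# A couple of hundred deaths gives an interval of roughly the width
# quoted above; only thousands of deaths shrink it to about +/-5%.
```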
There are numerous pitfalls that can occur in a randomised trial, although sometimes ‘pitfalls’ is the wrong word, as trials can, of course, be deliberately designed, or analyses deliberately performed, to tip the scales in favour of one treatment or another. Nevertheless, the aim of this chapter is to alert readers to the major deficiencies that can occur in trial design and trial reporting which may prevent the trial from being a true and unbiased comparison of the treatments.
12.2 Trial Design
Whilst most randomised trials are designed to test a new treatment against a standard treatment, trials may also be designed to assess whether a new treatment is equivalent to a standard treatment (for example, the new treatment may have preferable attributes, such as being given orally rather than intravenously, or be less costly) or to establish which of two standard treatments is better.
What should guide trial design is equipoise, or the uncertainty principle, which perhaps might be judged by the willingness of clinicians to be enrolled themselves should they develop the condition. Unfortunately, the trials that are easiest to accrue to (for example, chemotherapy A vs chemotherapy B) are often the ones least likely to change practice, whereas the opposite applies to ‘difficult’ trials (for example, surgery vs no surgery).
Trials should also always aim to answer only one clear question. Thus a logical trial design in chemotherapy would be to add or replace one drug in the standard treatment combination. Results from trials that change two drugs (or schedules or doses) often leave unanswered the question of the relative value of each changed factor. For example, Kelly et al. (2001) compared paclitaxel and carboplatin given in 4-weekly cycles with vinorelbine and cisplatin given in 3-weekly cycles, and Souquet et al. (2002) compared vinorelbine 30 mg/m2 on days 1, 8 and 15 plus cisplatin 80 mg/m2 on day 1 with vinorelbine 25 mg/m2 on days 1 and 8, cisplatin 75 mg/m2 on day 1 and ifosfamide 3 g/m2 on day 1.
It is important that all the decisions regarding design issues are clearly stated and justified in the protocol, and also that a detailed analysis plan is written.
12.3 Choice of Control Therapy
In a randomised trial the choice of control treatment is paramount. Logically it should always be the current best standard treatment for the condition, although knowing what is acknowledged as ‘best’ is often difficult. Indeed, there may be situations where the local, national and international ‘best’ are all different because of, for example, differences in facilities, expertise or access to drugs. The choice of control treatment will depend on several factors, including whether the trial result is aimed at affecting local, national or international practice, how pragmatic the trial is (for example, if the question is ‘does the addition of drug A to chemotherapy improve survival?’, the chemotherapy used may not need to be stated) and how a non-local control treatment will affect accrual. It is not difficult to see that the choice of the control treatment can significantly influence the way the trial result is interpreted, as unfortunately much more attention is paid to trials with a ‘positive’ result.

Thus, in order to increase the chances of seeing a ‘positive’ outcome, trials can be designed to compare the new treatment with a poor or inappropriate control. A common trick is to compare the new treatment alone with the new treatment in combination with a standard treatment. Thus in lung cancer there are examples of trials comparing a new drug versus the new drug plus cisplatin; for example, Splinter et al. (1996) compared teniposide with or without cisplatin in advanced NSCLC. Cisplatin is a very effective drug, and thus the chances are that the combination will appear effective and can be claimed as an effective standard treatment, irrespective of whether the new drug actually has any useful effect or not. Because of the difficulty, due to the huge numbers of patients required, of showing that a new treatment is equivalent to a standard treatment, a course of action sometimes taken is to show that the new treatment is better than a previous standard to the same degree as the current standard. Thus if treatment B is 5% better than treatment A, the options for new treatment C are either to try to show that C is equivalent to B, or that C is also 5% better than treatment A. However, it could be argued that the latter is unethical, as patients are not being offered the current standard of care. Nevertheless, this is a commonly used strategy. For example, given that in the NSCLC meta-analysis (Non-small Cell Lung Cancer Collaborative Group 1995) the survival benefit seen with cisplatin-based chemotherapy in the supportive care setting was highly significant (p<0.0001), should Anderson et al. (2000) and Roszkowski et al. (2000) have compared gemcitabine and docetaxel respectively against supportive care or against cisplatin-based chemotherapy?
12.3.1 Eligibility
The results of trials will influence the way future patients are treated. It is therefore important that the eligibility criteria reflect this population of patients, as it is unlikely that all the eligibility criteria will be remembered and adhered to outwith the trial. Thus, results from trials with strict eligibility criteria are often not reproducible when the treatment in question is adopted in general practice.
12.3.2 Choice of Endpoints
Usually the choice of endpoints will be straightforward, commonly survival, response, toxicity and quality of life, but the detail of each will be all-important and must be defined.
Survival. Treatments need to be compared on their overall survival, as choosing a landmark timepoint, be it median or 1-year survival, may bias the results. For instance, in a trial of surgery versus a non-surgical intervention, the expectation may well be that the surgery group is likely to experience high early post-operative mortality but better longer-term survival. Thus comparing survival at, say, 1 month or 5 years might give an inaccurate picture of the true between-treatment difference. Although the expected median survival or proportion of patients surviving at key timepoints is often quoted in protocols, these are simply snapshots of the likely survivals and the likely survival difference, and are also used to calculate a sample size. For example, the survival curves in the trials reported by Fossella et al. (2000) and Takada et al. (2002) overlap for a considerable time before splitting.
All too often sample sizes are based on what is feasible rather than what is realistic. For instance, we know that, in lung cancer, the addition of a new modality, be it radiotherapy or chemotherapy, to surgery (or supportive care) will probably improve survival by only about 5% (Non-small Cell Lung Cancer Collaborative Group 1995). It is therefore unrealistic to expect that, as a result of tinkering with the drugs, dosages or schedules, we are suddenly going to see advantages of a further 10% or 15%. Yet the vast majority of lung cancer trials are based on seeing differences of about 15%, which will generally require around 400 patients. Some even aim for larger effects. For example, Ranson et al. (2000) powered their trial to look for a 100% improvement (from 20% survival at 1 year with supportive care to 40% with paclitaxel), and Sculier et al. (2001), in a three-arm trial, considered that a 75% increase might be possible with the addition of G-CSF or antibiotics to standard chemotherapy. The sort of target accrual resulting from such over-optimistic expectations is considered feasible, whereas aiming for around 1,500 patients to see a 10% difference, or 4,000 patients to see a 5% difference, which is probably the sort of target most trials should now be aiming at, is simply considered an impossible task. Maybe this explains why progress in lung cancer has been so slow, as we have had to wait for meta-analyses to combine data from a number of trials in order to accumulate the thousands of patients required to confirm these small differences. A question then arises as to whether it is ethical to run any trial of fewer than perhaps 1,000 patients, given the high probability of an inconclusive result.

An even greater dilemma occurs with equivalence trials. Taking the same example, that the addition of a modality (chemotherapy) improves survival over surgery alone or supportive care by about 5%, what happens when we want to show that a new chemotherapy treatment is as effective as standard? If we compare the new chemotherapy with standard chemotherapy in a trial of 400 patients, we may finish up with an HR of around 1.00 but with a 95% CI of about ±15%. So all we could conclude is that the new treatment is somewhere between 15% better and 15% worse than standard, and thus could actually be 10% worse than no chemotherapy. Nevertheless, some papers, for example Gatzemeier et al. (2000), claim survival is comparable even though a 20% benefit or detriment cannot be ruled out.
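To see where patient numbers of this order come from, the sketch below uses Schoenfeld's approximation for the number of deaths required by a 1:1 log-rank comparison; the 50% baseline survival and the conversion of absolute gains into hazard ratios are illustrative assumptions, not figures from this chapter.

```python
import math
from scipy.stats import norm

def deaths_needed(s_std, s_new, alpha=0.05, power=0.90):
    # Schoenfeld: d = 4 * (z_{1-alpha/2} + z_power)^2 / (ln HR)^2 for 1:1.
    hr = math.log(s_new) / math.log(s_std)   # exponential-survival approximation
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return hr, math.ceil(4 * z ** 2 / math.log(hr) ** 2)

for gain in (0.15, 0.10, 0.05):              # absolute survival gains
    hr, d = deaths_needed(0.50, 0.50 + gain)
    print(f"+{gain:.0%} gain: HR ~{hr:.2f}, ~{d} deaths required")
# Patient totals are larger again, since not every patient will have
# died at the time of analysis - hence targets in the thousands.
```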
Response. To compare tumour response and/or progression, it is of course important that patients in each group undergo the same investigations, undertaken (as far as possible) by the same staff, using the same equipment, at baseline and at the same predefined timepoints throughout the trial, measured in relation to the time from randomisation (the one common timepoint for all patients). It is important to choose equivalent follow-up timepoints because if patients in one group are assessed more often, progression will be picked up earlier in that group, and any analysis of progression-free survival will be biased. Complications also arise when patients have non-protocol or second-line treatment. Great care must be taken to define whether the response rates reported are purely those related to the protocol treatment or are the result of the policy of giving a particular regimen.
Toxicity. The same considerations (consistency of investigations and follow-up) need to be applied to the assessment of toxicity. In addition, in cancer the side-effects of treatment can sometimes be very difficult to distinguish from the symptoms of the disease (for example, anorexia and breathlessness). It is perhaps unrealistic, therefore, to ask clinicians to distinguish between these and report just on treatment-related toxicity. Thus it is always preferable to collect information on all symptoms irrespective of the cause and assume that any differences seen will be due to the difference in treatment.
Quality of Life. Numerous issues surround the design of the assessment of quality of life (QL). Few trials actually estimate the number of patients required for the QL aspects, and consequently many trials include only a small subset of patients. This seldom provides sufficient data. For example, a calculation of the number of patients required to show a 10% difference in, say, shortness of breath at 3 months yields a sample size of about 400 patients. A recent review (Stephens et al. 2004) indicates that only five trials in NSCLC have collected QL data on 200-300 patients at follow-up, and only one on more than 300. The solution to many of the QL design issues is to pre-define the primary and secondary QL endpoints. This may involve discussing with doctors and patients how the standard and new treatments are likely to impact on QL, and when. Such information will certainly guide the choice of QL questionnaire, the timing of administration and the calculation of sample size, and in addition will focus the analyses. However, very few trials have so far fully embraced this way of working.
12.4 Trial Conduct

12.4.1 Monitoring
To ensure patient safety it is imperative that the accumulating data are reviewed at regular intervals throughout the trial. Whether ‘regular’ means annually, when accrual reaches certain targets or when certain numbers of events have occurred will depend on the trial. It is also important that the interim data are reviewed completely independently, by clinicians and a statistician not involved with any other aspect of the trial. Rules for when the trial should close early must also be agreed, and there are a number of options, from fixed p values to Bayesian statements such as ‘the evidence must convince sceptics’. It is important that among the Data Monitoring and Ethics Committee (DMEC) members there is knowledge of the disease and treatments and previous DMEC experience, as DMECs will often be called upon to make very difficult decisions. There are numerous examples where trials have stopped early but the results have been unconvincing, and new trials have had to be set up to clarify the situation. For example, two trials of neo-adjuvant chemotherapy for NSCLC (Rosell et al. 1994; Roth et al. 1994) both stopped early after accruing 60 patients, but subsequently several large trials have been set up to clarify whether any benefit exists.
12.4.2 Follow-up
A major consequence of needing to review the interim data and make important decisions is that the data must always be as up to date as possible, as it is vitally important that DMECs make decisions based on all the available data. However, follow-up may be difficult if different modalities are being compared, especially if this requires the patients to be seen at different times by different clinicians (for example, when chemotherapy is being compared with radiotherapy, or surgery with best supportive care). Whenever possible, follow-up should revert to a common time schedule, and within each participating centre patients should be assessed by the same clinical team.
To ensure an unbiased comparison of survival, the duration of follow-up in the groups must be similar. If follow-up differs, this can subtly affect the Kaplan-Meier curves, as surviving patients are assumed to follow the same survival patterns as those known to have died. A ‘reverse’ Kaplan-Meier plot, nominating the ‘time last seen’ for those alive as the event and censoring at the date of death, is a good way of comparing follow-up between the groups, and the resulting p value of the log-rank test can be quoted.
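The flip is trivial once any Kaplan-Meier routine is to hand: make ‘alive at last follow-up’ the event and censor the deaths. A minimal plain-Python sketch on invented data:

```python
import numpy as np

def km_curve(times, events):
    # Basic Kaplan-Meier estimate (distinct event times assumed, for brevity).
    order = np.argsort(times)
    t, e = np.asarray(times)[order], np.asarray(events)[order]
    s, curve = 1.0, []
    for i, (ti, ei) in enumerate(zip(t, e)):
        at_risk = len(t) - i
        if ei:
            s *= (at_risk - 1) / at_risk
        curve.append((ti, s))
    return curve

months = [3, 5, 7, 9, 12, 15, 18, 24]   # time from randomisation (invented)
died   = [1, 0, 1, 0, 1, 0, 1, 0]       # 1 = died, 0 = alive at last visit

survival = km_curve(months, died)
followup = km_curve(months, [1 - d for d in died])   # the 'reverse' KM
# Compare the 'followup' curves (and their log-rank p value) between the
# randomised groups to check that follow-up maturity is similar.
```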
Some papers report the median follow-up of survivors, although this is rarely split by group, and other papers simply report median follow-up, though it is far from obvious what this latter figure actually represents.
12.5 Trial Analysis
A good policy is to account for every patient in every analysis. Thus including categories such as ‘not assessed’ or ‘died’ in tables, and reporting the numbers of patients (not just the proportions), makes all analyses completely transparent to the reader.
12.5.1 Patient Population
The easiest and most logical group to analyse is everyone who has been randomised. This is the strict definition of ‘intent to treat’. At the time of randomisation all patients should have been considered suitable for the treatments being studied, and they thus reflect the population who are likely to be offered the treatment after the trial. Papers often list subgroups of patients who are excluded from analyses, such as those shown to be ineligible by post-randomisation investigations or independent review, those who do not receive any or all of their protocol treatment, or those not assessed for an endpoint. However, removing patients for any of these reasons has the potential to bias the analysis sample. For example, although the primary endpoint of the trial was response, Georgoulias et al. (2001) excluded 35 of the 441 patients randomised, and all analyses (which were claimed to be ‘intention-to-treat’) were then performed on the remaining 406 patients; similarly, Schiller et al. (2002) excluded 52 patients who were found to be ineligible post-randomisation in their trial of four chemotherapy regimens.
12.5.2 Pre-treatment Patient Characteristics
It is, of course, logical to list the pre-treatment characteristics and to highlight balance (or imbalance) between groups. However, it is illogical to apply statistical tests to show balance or imbalance. Statistical tests are used to estimate the likelihood that an observed difference has not occurred by chance; in a randomised trial, however, differences in pre-treatment characteristics can only have occurred by chance, and it is thus an inappropriate use of a statistical test and a wasteful use of p value spending. If imbalances in pre-treatment characteristics are observed, the analysis of the key endpoints should be adjusted accordingly. Recent examples of this unnecessary testing can be found in papers by Tada et al. (2004) and Langendijk et al. (2001).
Survival. Survival analyses should always include all patients randomised, be calculated from the date of randomisation and include all causes of death. Survival should be measured by constructing Kaplan-Meier curves and comparing them using the log-rank test, and the overall survival difference should be reported using the hazard ratio and its 95% confidence interval. Taking the start date as anything other than randomisation (the one common timepoint for all patients) has the potential to bias the result. For example, the date of diagnosis may not be accurate for all patients, the date of start of treatment may include different delays for different groups, and what do you do with patients who don’t start treatment?
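A sketch of this recommended analysis using the Python lifelines package (assuming it is available; the eight-patient data frame is invented purely to show the calls):

```python
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import logrank_test

# One row per randomised patient: time from randomisation (months),
# death from any cause (1/0) and allocated arm (1 = new treatment).
df = pd.DataFrame({
    "months": [9, 14, 7, 22, 11, 30, 5, 18],
    "died":   [1, 1,  1, 0,  1,  0,  1, 1],
    "new":    [0, 1,  0, 1,  0,  1,  0, 1],
})

std, new = df[df.new == 0], df[df.new == 1]
lr = logrank_test(std.months, new.months,
                  event_observed_A=std.died, event_observed_B=new.died)
print(f"log-rank p = {lr.p_value:.3f}")

cph = CoxPHFitter().fit(df, duration_col="months", event_col="died")
print(cph.summary)   # exp(coef) for 'new' is the HR, with its 95% CI
```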
Although the cause of death may be of interest to the trialists, to indicate how the treatment is working, in a sense this may be much less important to the patient. Thus survival analyses that only report deaths from cancer may be interesting but very misleading. For example, a treatment that causes many early treatment-related deaths may, in a cancer-specific survival analysis, appear to be the better treatment. Sundstrom et al. (2002) reported disease-specific survival rates in their trial of chemotherapy regimens, and Shepherd et al. (2002) censored patients who died from causes unrelated to disease or treatment in their analysis of progression-free survival.
Subgroup Analysis. Subgroup analyses are only reliable if they are predefined, which will usually mean they are hypothesis driven, and take account of sample size and multiple statistical testing. Unless these rules are respected, subgroup analyses should always be considered with caution and treated as only hypothesis generating. All too often, when clear overall results are not seen, the data are trawled for interesting subgroup results and, when found, hypotheses are built around them. Reporting such findings as definitive results is irresponsible.
It is, of course, often interesting to explore whether any overall survival difference observed is consistent across all subgroups, and analyses stratified for pre-treatment characteristics are therefore useful; whilst Sause et al. (2000) did just that, their subgroup analyses did not appear to have been pre-defined, accounted for in the sample size or considered only as exploratory or hypothesis generating. Whilst exploratory analyses are acceptable, analyses by post-randomisation factors (such as treatment received, or response) are totally unacceptable, as the groups being compared may be defined by the outcome being tested. Thus, for example, comparing the survival of responders versus non-responders is flawed because the responders have to survive long enough to respond. Therefore analyses such as those presented by Fukuoka et al. (2003), comparing survival by response, and Socinski et al. (2002), showing survival by number of cycles of chemotherapy received, must be viewed with great caution. Prognostic factor analyses are sometimes run to try to identify the factors most related to survival, but usually there are far too few patients in a single trial to draw any firm conclusions. For example, in the trial reported by Pujol et al. (2001), multivariate analyses were performed on only 226 patients.
Response. Although the RECIST criteria (Therasse et al. 2000) are now the standard method of assessing response, there are still complications. For example, it is unclear what to do with multiple lesions, disease present but not measurable, or measurement schedules that are not every 4 weeks. It is important to report the response rate as the proportion of patients who achieve complete or partial response out of the total number of patients in the group. Quoting the response rate as just the proportion of patients who have been assessed at a certain timepoint may mask the fact that patients may have had to stop the treatment because of toxicity or death.
Many papers purport to show differences between treatments in terms of time to progression with the use of a Kaplan-Meier plot, taking progression as the event and censoring those alive (or dead) without progression. This sort of analysis can be very misleading, as patients who fail from a competing risk (for example, an early treatment-related death) that precludes the possibility of achieving the event are treated the same as censored patients who still have the potential for progression. Recent examples of this can be found in papers by Sundstrom et al. (2002), Ranson et al. (2000) and Pujol et al. (2001). Progression-free survival, which takes into account deaths without progression, should always be the preferred analysis.
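The bias is easy to demonstrate on invented data: run the same Kaplan-Meier routine twice, once censoring deaths without progression (the misleading time-to-progression analysis) and once counting them as events (progression-free survival).

```python
import numpy as np

# status: 0 = alive and progression-free, 1 = progressed,
#         2 = died without progression (the competing risk)
months = np.array([2, 3, 4, 5, 6, 8, 10, 12])
status = np.array([1, 2, 1, 0, 2, 1, 0, 1])

def km(times, events):
    # Basic Kaplan-Meier estimate (distinct times assumed, for brevity).
    order = np.argsort(times)
    e = np.asarray(events)[order]
    s, out = 1.0, []
    for i, ei in enumerate(e):
        at_risk = len(e) - i
        if ei:
            s *= (at_risk - 1) / at_risk
        out.append(round(s, 3))
    return out

print(km(months, status == 1))  # time to progression: deaths censored
print(km(months, status > 0))   # progression-free survival: deaths count
# The first curve sits higher because early deaths simply disappear
# from it, flattering the treatment; the second is the honest analysis.
```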
Toxicity. Standard definitions of toxicity, such as the Common Terminology Criteria for Adverse Events developed by the NCI Cancer Therapy Evaluation Programme (2003), should always be used, but there are a number of ways of reporting toxicity. Perhaps the most logical and widely used method is to report the proportion of patients with grade 3 or 4 toxicity for each key symptom within a defined time period from randomisation. Such an analysis will inevitably include some noise, as patients will have had symptoms pre-treatment and some patients will have toxicity as a result of non-protocol treatment, but understanding and applying the concept of ‘intent to treat’ is important, as the trial should be trying to record the experiences of a group of patients chosen to receive a certain treatment. If some patients don’t actually receive the protocol treatment and have to receive a different treatment, perhaps with different side-effects, that is a key message. In virtually all analyses it is much better to report the proportion of patients with a good or bad experience rather than the mean or median score, as the mean or median can mask or dilute the fact that a small proportion of patients had good or bad experiences.
Quality of Life. Patient self-assessed quality of life (QL) data are especially difficult to report reliably because the data are multidimensional and longitudinal, and inevitably much is missing. There are therefore no agreed methods of presenting QL results, and care must be taken to be conservative in making strong claims. Major problems can arise from starting with inadequate sample sizes, multiple statistical testing, imputing missing data, comparing the treatments at timepoints that favour one group and/or summarising the data inappropriately. Non-standard analyses, such as those used by Ranson et al. (2000), estimating separate slopes for dropouts and completers, or Sandler et al. (2000), calculating the change in score from baseline to last observation, should be avoided.

Many of these problems can be mitigated by pre-defining QL hypotheses, which have the effect of guiding the choice of questionnaire, the choice of administration timepoints, the sample size calculation and the analyses to be performed. However, there are few examples of this actually being carried out in practice, and consequently the results from the QL aspects of trials are often disregarded and distrusted by clinicians and patients.

Daily diary cards can be very useful in highlighting transient changes. Plots of the proportions of patients reporting dyspnoea after radiotherapy, for instance, can be very illuminating, but are potentially misleading unless it is made clear how many patients are contributing to the curves at each timepoint.
Interpretation. Trials are rarely islands. Results need to be presented and discussed in the context of the totality of previous work. However, Clarke and Chalmers (1998) reviewed the discussion sections of reports of trials published in five major journals during one month in 1997 and found that only two (of 26) placed their results in the context of an up-to-date systematic review. Repeating this exercise in 2001, they reported no improvement, with only three (of 30) trials doing so (Clarke et al. 2002). Such findings are disappointing and suggest that there is a general lack of awareness that individual trials are only part of the whole picture. We must never lose sight of the fact that lung cancer is a global problem, and without global collaboration progress will continue to be painfully slow.
12.6 Conclusions
There are numerous pitfalls in the design, conduct and analysis of randomised trials. Some are subtle, some less so. What in particular should a trialist try to ensure, and what should cause a reader to cast doubt on the results in a publication?
– What is the control treatment? Is it a widely used standard treatment given in an acceptable schedule?
– Has the trial been designed to answer a clear, unconfounded question?
– Are there pre-defined hypotheses for all key endpoints?
– Do the eligibility criteria cover all the patients who are likely to be treated this way outwith the trial?
– Is the sample size based on information that is sensible and feasible?
– Are the details of the interim analyses and stopping rules clearly laid out?
– Are all randomised patients included and accounted for in all analyses?
– Is the number of statistical tests limited, and if not, have the significance levels been adjusted accordingly?
– Have the hazard ratio and, especially, the 95% confidence interval of the primary endpoint been given?
– Has the result been put into the context of previous work in the area?
All trials and all trial results are important as they all in some way advance the progress of human knowledge. Our ultimate aim as trialists is to improve the treatment of future patients and it is therefore important that we are as rigorous and honest in our work as we can be.
References
Anderson H, Hopwood P, Stephens RJ, Thatcher N, Cottier B, et al. (2000) Gemcitabine plus best supportive care (BSC) vs BSC in inoperable non-small cell lung cancer – a randomized trial with quality of life as the primary outcome. Br J Cancer 83:447-453
Cancer Therapy Evaluation Programme (2003) Common Terminology Criteria for Adverse Events, Version 3.0. DCTD, NCI, NIH, DHHS, March 31 2003. http://ctep.cancer.gov/
Clarke M, Chalmers I (1998) Discussion sections in reports of controlled trials published in general medical journals: islands in search of continents? JAMA 280:280-282
Clarke M, Alderson P, Chalmers I (2002) Discussion sections in reports of controlled trials published in general medical journals. JAMA 287:2799-2801
Fossella FV, DeVore R, Kerr RN, Crawford J, Natale RR, et al. (2000) Randomized phase III trial of docetaxel versus vinorelbine or ifosfamide in patients with advanced non-small cell lung cancer previously treated with platinum-containing chemotherapy regimens. J Clin Oncol 18:2354-2362
Fukuoka M, Yano S, Giaccone G, Tamura T, Nakagawa K, et al. (2003) Multi-institutional randomized phase II trial of gefitinib for previously treated patients with advanced non-small cell lung cancer. J Clin Oncol 21:2237-2246
Gatzemeier U, von Pawel J, Gottfried M, ten Velde GPM, Mattson K, et al. (2000) Phase III comparative study of high-dose cisplatin versus a combination of paclitaxel and cisplatin in patients with advanced non-small cell lung cancer. J Clin Oncol 18:3390-3399
Georgoulias V, Papadakis E, Alexopoulos A, Tsiafaki X, Rapti A, et al. (2001) Platinum-based and non-platinum-based chemotherapy in advanced non-small cell lung cancer: a randomised multicentre trial. Lancet 357:1478-1484
Kelly K, Crowley J, Bunn PA Jr, Presant CA, Grevstad PK, et al. (2001) Randomized phase III trial of paclitaxel plus carboplatin versus vinorelbine plus cisplatin in the treatment of patients with advanced non-small cell lung cancer: a Southwest Oncology Group trial. J Clin Oncol 19:3210-3218
Langendijk H, de Jong J, Tjwa M, Muller M, ten Velde G, et al. (2001) External irradiation versus external irradiation plus endobronchial brachytherapy in inoperable non-small cell lung cancer: a prospective randomized study. Radiother Oncol 58:257-268
Non-small Cell Lung Cancer Collaborative Group (1995) Chemotherapy in non-small cell lung cancer: a meta-analysis using updated data on individual patients from 52 randomised clinical trials. BMJ 311:899-909
Pujol J-L, Daures J-P, Riviere A, Quoix E, Westell V, et al. (2001) Etoposide plus cisplatin with or without the combination of 4'-epidoxorubicin plus cyclophosphamide in treatment of extensive small cell lung cancer: a French Federation of Cancer Institutes multicenter phase III randomized study. J Natl Cancer Inst 93:300-308
Ranson M, Davidson N, Nicolson M, Falk S, Carmichael J, et al. (2000) Randomized trial of paclitaxel plus supportive care versus supportive care for patients with advanced non-small cell lung cancer. J Natl Cancer Inst 92:1074-1080
Rosell R, Gomez-Codina J, Camps C, Maestre J, Padilla J, et al. (1994) A randomized trial comparing preoperative chemotherapy plus surgery with surgery alone in patients with non-small cell lung cancer. N Engl J Med 330:153-158
Roszkowski K, Pluzanska A, Krzakowski M, Smith AP, Saigi E, et al. (2000) A multicenter, randomized, phase III study of docetaxel plus best supportive care versus best supportive care in chemotherapy-naive patients with metastatic or non-resectable localized non-small cell lung cancer (NSCLC). Lung Cancer 27:145-157
Roth JA, Fossella F, Komaki R, Ryan MB, Putnam JB, et al. (1994) A randomized trial comparing perioperative chemotherapy and surgery with surgery alone in resectable stage IIIa non-small cell lung cancer. J Natl Cancer Inst 86:673-680
Sandler AB, Nemunaitis J, Denham C, von Pawel J, Cormier Y, et al. (2000) Phase III trial of gemcitabine plus cisplatin versus cisplatin alone in patients with locally advanced or metastatic non-small cell lung cancer. J Clin Oncol 18:122-130
Sause W, Kolesar P, Taylor S IV, Johnson D, Livingston R, et al. (2000) Final results of phase III trial in regionally advanced unresectable non-small cell lung cancer. Chest 117:358-364
Scagliotti GV, Fossati R, Torri V, Crino L, Giaccone G, et al. (2003) Randomized study of adjuvant chemotherapy for completely resected stage I, II or IIIa non-small cell lung cancer. J Natl Cancer Inst 95:1453-1461
Schiller JH, Harrington D, Belani CP, Langer C, Sandler A, et al. (2002) Comparison of four chemotherapy regimens for advanced non-small cell lung cancer. N Engl J Med 346:92-98
Sculier JP, Paesmans M, Lecomte J, van Cutsem O, Lafitte JJ, et al. (2001) A three-arm phase III randomised trial assessing, in patients with extensive disease small cell lung cancer, accelerated chemotherapy with support of haematological growth factor or oral antibiotics. Br J Cancer 85:1444-1451
Shepherd FA, Giaccone G, Seymour L, Debruyne C, Bezjak A, et al. (2002) Prospective, randomized, double-blind, placebo-controlled trial of marimastat after response to first-line chemotherapy in patients with small cell lung cancer: a trial of the National Cancer Institute of Canada Clinical Trials Group and the European Organization for Research and Treatment of Cancer. J Clin Oncol 20:4434-4439
Socinski MA, Schell MJ, Peterman A, Bakri K, Yates S, et al. (2002) Phase III trial comparing a defined duration of therapy versus continuous therapy followed by second-line therapy in advanced stage IIIb/IV non-small cell lung cancer. J Clin Oncol 20:1335-1343
Souquet PJ, Tan EH, Rodrigues Pereira J, van Klaveren R, Price A, et al. (2002) GLOB-1: a prospective randomised clinical phase III trial comparing vinorelbine-cisplatin with vinorelbine-ifosfamide-cisplatin in metastatic non-small cell lung cancer patients. Ann Oncol 13:1853-1861
Splinter TA, Sahmoud T, Festen J, van Zandwijk N, Sorenson S, et al. (1996) Two schedules of teniposide with or without cisplatin in advanced non-small cell lung cancer: a randomized study of the European Organization for Research and Treatment of Cancer Lung Cancer Cooperative Group. J Clin Oncol 14:127-134
Stephens R (2004) Quality of life issues in non-small cell lung cancer. Expert Rev Pharmacoeconomics Outcomes Res 4:89-100
Sundstrom S, Bremnes RM, Kaasa S, Aasebo U, Hatlevoll R, et al. (2002) Cisplatin and etoposide regimen is superior to cyclophosphamide, epirubicin and vincristine regimen in small cell lung cancer: results from a randomized phase III trial with 5 years' follow-up. J Clin Oncol 20:4665-4672
Tada H, Tsuchiya R, Ichinose Y, Koike T, Nishizawa N, Nagai K, Kato H (2004) A randomized trial comparing adjuvant chemotherapy versus surgery alone for completely resected pN2 non-small cell lung cancer (JCOG9304). Lung Cancer 43:167-173
Takada M, Fukuoka M, Kawahara M, Sugiura T, Yokoyama A, et al. (2002) Phase III study of concurrent versus sequential thoracic radiotherapy in combination with cisplatin and etoposide for limited stage small cell lung cancer: results of the Japan Clinical Oncology Group Study 9104. J Clin Oncol 20:3054-3060
Therasse P, Arbuck SG, Eisenhauer EA, Wanders J, Kaplan RS, et al. (2000) New guidelines to evaluate the response to treatment in solid tumours. J Natl Cancer Inst 92:205-216