54. Critically Reviewing the Literature for Improving Clinical Practice
Clifford Y. Ko and Robin McLeod
As the science of surgery continues to advance, it is important for the practicing surgeon to remain up to date on current issues in the field. Many surgeons stay current by reading the literature. Although this is an excellent way to remain informed, it is paramount that the reader understands how to read the literature critically and evaluate the importance, relevance, and validity of published work.
This chapter is written to assist the reader in critically evaluating the literature. It is organized in a building-block manner, with fundamental issues discussed first and more complex issues addressed later. Specifically, we begin with study designs for clinical research, with most of the section devoted to important issues surrounding randomized, controlled trials (RCTs). We then discuss how study design dictates the level and grading of evidence; of the several grading systems, two are presented. The third section addresses the notion of best evidence and highlights the use of metaanalysis and practice guidelines. The fourth and final section discusses critical evaluation of the literature and covers statistics, risk adjustment, and quality of life (QOL) studies. For the interested reader, further readings are available in the references.
Study Designs: Case Series, Case Control, Cohort, and RCTs
Providing the Evidence
Various hierarchies have been proposed for classifying study design.1,2 In simplest terms, studies can be classified as case series, case control studies, cohort studies, and RCTs. The case series is the weakest design and the RCT the strongest for determining the effectiveness of treatment (Table 54-1).
Case Series
Case reports (arbitrarily defined as 10 or fewer subjects) and case series are the typical surgical studies performed. There is no concurrent control group, although there may be a historical control group. Patients may be entered from the same inception point and followed prospectively, not for the purpose of the study but in the normal clinical course of the disease. Typically, data from patient charts or clinical databases are reviewed retrospectively; thus, the outcome of interest is already present when the study is initiated. Despite the limitations of this study design, the importance of results from case series should not be minimized: it is through careful observation that innovations in surgical practice and technique have been and continue to be made. However, results from case series should be likened to observations made in the laboratory. Just as those observations should lead to generation of a hypothesis and performance of an experiment to test it, an RCT should be performed to confirm the observations reported in a case series. Case series are plagued by biases such as selection and referral bias, and because data are not collected specifically for the study, they are often incomplete or even inaccurate. Therefore, incorrect conclusions about the efficacy of a treatment are common, and surgeons should not rely solely on evidence from case series.
Case Control Studies
The case control study is the design used most frequently to study risk factors or causation. There are typically two groups of patients: the case group, composed of subjects in whom the outcome of interest is present, and the control group, in whom it is not. Because controls are selected by the investigator rather than by random allocation, there is a real likelihood of introducing bias and thus a risk of erroneous conclusions. Generally, the controls are matched to the cases with respect to important prognostic variables other than the factor being studied. Although matching is important to avoid an incorrect conclusion about the significance of the factor being studied, it is equally important not to overmatch the controls, lest a true difference be obscured. In case control studies, as in case series, data are collected retrospectively; thus, the outcome is present at the start of the study. As an example, Selby and colleagues3 performed a case control study to make inferences about the effectiveness of flexible sigmoidoscopy in preventing rectal cancer. The cases were HMO patients who had been receiving regular yearly examinations and developed rectal cancer (the outcome of interest). The controls were individuals from the same cohort of patients who had not developed rectal cancer. They were matched to the cases with respect to age, sex, and date of entry into the health plan. Selby and colleagues found that cases were less likely than controls to have had a flexible sigmoidoscopy in the preceding 10 years (8.8% of cases versus 24.2% of controls).
Cohort Studies
Cohort studies may be retrospective or prospective. There are two or more groups, but subjects are not randomly allocated to them. One group receives the treatment or exposure of interest whereas the other groups receive another treatment or no treatment or exposure. The inception point may not be defined by the study, and the intervention and follow-up may be ad hoc. However, the outcome is not present at the time the inception cohort is assembled. There is less possibility of bias than in a case control study because cases are not selected and the outcome is not present at the initiation of the study. However, the likelihood of bias is still high because subjects are not randomly allocated to groups. Instead, some selection process, by either the subject or the clinician, allocates them to groups. For instance, subjects may be allocated to groups by where they live (when the effect of an environmental toxin is being studied), by choice (when a lifestyle factor such as dietary intake is being studied), or by physician (when a nonrandomized study of a treatment intervention is being performed). Retrospective cohort studies differ from prospective cohort studies in that data analysis, and possibly data collection, are performed retrospectively, but there is an identifiable time point that can be used to define the inception cohort, such as the date of birth or the date of first attendance at a hospital.
Cohort studies typically are performed by epidemiologists studying risk factors where randomization of patients is unethical. An example of a cohort study would be the use of a database to follow patients who had an anal mucosectomy versus no mucosectomy as part of restorative proctocolectomy, to determine the effect of the mucosectomy on long-term outcome.
Randomized, Controlled Trials
The RCT is accepted as the best trial design for establishing treatment effectiveness. There are several essential components of the RCT. First, subjects are randomly allocated to two groups: a treatment group (in which the new treatment is being tested) and a control group (in which the standard therapy or placebo is administered). Thus, the control group is concurrent and subjects are randomly allocated to the two groups. Second, the interventions and follow-up are standardized and performed prospectively. Thus, it is hoped that both groups are similar in all respects except for the interventions being studied. Not only does this guard against differences in factors known to be important, it also ensures that there are no differences as a result of unknown or unidentified factors.
This latter point is especially important. Statistical techniques such as multivariate analysis can be used to adjust for known prognostic variables, but they obviously cannot adjust for unknown prognostic variables. There are multiple examples of studies showing differences between groups that cannot be accounted for by the known prognostic variables.4
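To make the allocation mechanics concrete, here is a minimal sketch of one common scheme, permuted-block randomization with a separate sequence per stratum (e.g., per center or surgeon, as discussed later in this section). The function name, block size, and center labels are illustrative assumptions, not a prescription from any particular trial.

```python
import random

def permuted_block_randomization(n_patients, block_size=4, seed=0):
    """Allocate patients to treatment or control in randomly permuted blocks,
    so that group sizes stay balanced throughout accrual."""
    rng = random.Random(seed)  # fixed seed only so the example is reproducible
    allocations = []
    while len(allocations) < n_patients:
        # each block contains equal numbers of each arm, shuffled
        block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)
        allocations.extend(block)
    return allocations[:n_patients]

# One independent sequence per stratum (hypothetical centers)
schedule = {center: permuted_block_randomization(20, seed=i)
            for i, center in enumerate(["center_A", "center_B"])}
print(schedule["center_A"][:8])
```

Blocking keeps the two arms balanced as accrual proceeds, and running a separate sequence per stratum is one way to implement the stratification by surgeon or center described below.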
Where differences in treatment effect are small, the RCT may minimize the chance of reaching an incorrect conclusion about the effectiveness of treatment. There are, however, some limitations to RCTs. First, RCTs tend to take a long time to complete because of the time required for planning, accruing, and following patients and finally analyzing results.
As a consequence, results may not be available for many years. Second, clinical trials are expensive to perform, although their cost may be recouped if ineffective treatments are abandoned and only effective treatments are implemented.5 Third, the results may not be generalizable or applicable to all patients with the disease because of the strict inclusion and exclusion criteria and the inherent differences in patients who volunteer for trials. In addition, not all patients will respond similarly to treatment. Fourth, in situations in which the disease or outcome is rare or occurs only after a long period of follow-up, RCTs are generally not feasible.
Finally, the ethics of performing RCTs is controversial, and some clinicians may be uncomfortable with randomizing their patients when they believe one treatment to be superior, even if that belief is based only on anecdotal evidence.
There are elements common to all RCTs. The first and perhaps the most important issue in designing an RCT is to enunciate clearly the research question. Most RCTs are based on observations or experimental evidence from the laboratory.
RCTs should always make biologic sense, have clinical relevance, and be feasible to perform. The research question will determine who will be included, what the intervention will be, and what will be measured. Frequently, a sequence of RCTs will be performed to evaluate a particular intervention.
Initially, a rather small trial that is highly controlled, using a physiologic or surrogate endpoint, may be performed. This trial would provide evidence that the intervention is effective in the optimal situation (efficacy trial). However, it might lack clinical relevance, especially if the endpoint were a physiologic measure. If it were positive, it would then lead to another trial, with more patients and a more clinically relevant outcome measure. If this too were positive, a very large trial might be indicated to assess the effectiveness of the intervention in normal practice (effectiveness trial). An example would be studying the effect of a chemoprevention agent in colon cancer. Initially, the agent might be prescribed to a group of individuals at high risk for polyp formation (e.g., patients with familial polyposis coli) for a short time, with the outcome measure being a rectal biopsy looking for proliferative changes. A subsequent trial might look at polyp regression in this same cohort of patients, with later trials aimed at the prevention of significant polyps in average-risk individuals followed for several years. As one can see, the selection of subjects, the intervention, the duration of the trial, and the choice of outcome measure may vary depending on the research question. Ultimately, however, investigators wish to generalize the results to clinical practice, so the outcome measures should be clinically relevant. For this reason, QOL measures are often included.

TABLE 54-1. Types of study designs

Study design         Control group   Prospective follow-up   Random allocation of subjects
Case series          No              No                      No
Case control study   Yes             No                      No
Cohort study         Yes             Yes                     No
RCT                  Yes             Yes                     Yes
Although there are elements common to all RCTs, there are issues of special concern in surgical trials.6 Standardization of the procedure is of major importance in surgical trials. Standardization is difficult because surgeons vary in their experience with, and their ability to perform, a surgical technique. There may be individual preferences in performing the procedure, and technical modifications may occur as the procedure evolves. Moreover, differences in perioperative and postoperative care may also impact on the outcome. There are two issues related to standardization of the procedure. First, there is the issue of who should perform the procedure: only experts, or surgeons of varying ability? Implicit in this is the definition of an "expert." Second, there is the issue of standardizing the procedure so that it is performed similarly by all surgical participants and can be duplicated by others following publication of the trial results. The implications of these two issues are different, and the strategies to address them differ.
The first issue is analogous to assessing compliance in a medical trial. If the procedure is performed by experts only, in a very controlled manner, the trial is analogous to an "efficacy trial." The advantage of such a trial is that if the procedure is truly superior to the other intervention, this design has the greatest likelihood of detecting a difference. The disadvantage, obviously, is that the results are less generalizable. Like most issues in clinical trials, there is no right or wrong answer. If the procedure is usually performed by experts, then it probably is desirable to have only experts involved in the trial. However, if a wide spectrum of surgeons perform the procedure, then it would be appropriate not to limit surgical participation.
Regardless of the number of surgeons involved in the trial and their desire to mimic routine practice, there must be a certain amount of standardization so that readers of the trial results can understand what was done and can duplicate the procedure in their own practice. There are several strategies to ensure a minimum standard. First, all surgeons should agree on the performance of the critical aspects of the procedure. It may not be necessary to agree on every technical detail, but there should be consensus on those deemed important. Furthermore, if there are aspects of perioperative and postoperative care that impact on outcome (e.g., postoperative adjuvant therapy), they should be standardized. Teaching sessions may be held preoperatively, and feedback given to surgeons on their performance during the trial. As well, obtaining documentation that the procedure has been performed satisfactorily (e.g., postoperative angiograms to document vessel patency, or pathology specimens to document resection margins and lymph node excision) may help ensure that the surgery is being performed adequately. Finally, patients are usually stratified according to surgeon or center to ensure balance in case there are differences in surgical technique among centers or surgeons.
Blinding is often a difficult issue in surgical trials. It may not be an issue if two surgical procedures are being compared, but it is a major issue if a surgical procedure is being compared with a medical therapy. There is often a placebo effect of surgery. The classic example was observed in a series of 18 patients in which 13 patients underwent ligation of the internal mammary artery for coronary artery disease and five patients underwent a sham operation.7 All of the patients in the latter group reported subjective improvement in their symptoms. It would now be ethically difficult to perform a sham operation, so it might be impossible to conceal which treatment the patient received. The lack of blinding is especially worrisome if the primary outcome is a change in symptoms or QOL rather than a "hard" outcome measure such as mortality or morbidity. In these situations, if a hard outcome measure is also assessed and it correlates with the patient's own assessment, there is less concern about the possibility of bias.
Assessments may also be performed by an independent assessor who is unaware of the patient's treatment group. Finally, if the criteria used to define an outcome are explicitly specified a priori (e.g., criteria to diagnose an intraabdominal abscess), bias may be minimized or eliminated. Investigators may also choose in this situation to have a blinded panel review the results of tests to ensure that they meet the criteria.
The issue of the timing of trials is difficult. Chalmers8 has argued that the first patient in whom a procedure is performed should be randomized. Most surgeons would argue, however, that a learning curve exists for any procedure and that modifications to the technique are made frequently at its inception.
By including these early patients, one would almost certainly bias the results against the new procedure. The introduction of laparoscopic cholecystectomy, with its initially high rate of common bile duct injuries, and the laparoscopic versus open inguinal hernia trial are good examples of this.9 However, it may be difficult to initiate a trial once the procedure is widely accepted by both patients and the surgical community.
The paucity of RCTs testing surgical therapies supports this latter contention. This dilemma arises because, unlike the release of medical therapies, there is no regulating body in surgery that restricts performance of a procedure or requires proof of its efficacy. RCTs should probably be performed early, before new procedures become accepted into practice, recognizing that future trials may be necessary as the procedure evolves and surgical experience increases. This is analogous to medical oncologic trials, in which new trials are being planned as one is being completed. However, a surgical procedure must first be established adequately to avoid investing a large amount of money and time in a valueless trial.
Finally, patient issues may be of greater concern in surgical trials. In a medical trial, patients may be randomized to either treatment arm with the possibility that, at the conclusion of the trial, they can receive the more efficacious treatment if the disease is not progressive and the treatment is reversible. Surgical procedures, however, are almost always permanent. This may be of particular concern if a medical therapy is being compared with a surgical procedure, or if the two surgical procedures differ in their magnitude or invasiveness. Patients may have a preference for one or the other treatment and therefore refuse to participate in the trial. There also tends to be more emotion involved with surgery, and patients may be less willing to leave the decision as to which procedure will be performed to chance. Surgeons themselves may be uncomfortable discussing the uncertainty of randomization with patients requiring surgery.10 Thus, accruing patients for surgical trials may be more difficult than for medical trials. In a survey of subjects who had already participated in a trial of maintenance therapy for Crohn's disease, Kennedy et al.11 found that 91% would agree to participate in a trial again if it involved a comparison of two medical treatments, but only 44% would agree if it included a surgical arm. Although accrual may be more difficult, there are notable examples of important surgical trials that have been performed.12–14 Thus, such trials can be done, although they may require a larger pool of eligible patients from which to sample.
Levels of Evidence: Grading the Evidence
Levels of Evidence
There are several grading systems for assessing the level of evidence.1,15–18 The first was developed by the Canadian Task Force on the Periodic Health Examination in the 1970s (Table 54-2) and has been adopted by the United States Task Force.
Although differing in some respects, most systems consider both the a priori design of the study and the actual quality of its execution. Studies in which there has been blinded random allocation of subjects are given the highest weighting because the risk of bias is minimized. Thus, an RCT will yield Level I evidence provided it is well executed with respect to the issues discussed earlier in this chapter.
Although this system is of value because of its simplicity, difficulties may arise when readers wish to pool results from several studies, either informally during their reading or when performing systematic reviews or developing guidelines.
Decisions must be made on whether studies should be included or excluded depending on the quality of the study.19 As well, the systems are not sensitive to the relevance of the findings of studies. For instance, neither the clinical relevance of the outcome measures, the baseline risk of the effect, nor the actual results of the studies (e.g., study results that are not consistent with results from other RCTs) are considered in any system.
In 2003, the Standards Committee of the American Society of Colon and Rectal Surgeons (ASCRS) adopted the grading system shown in Table 54-3.16,18 This system identifies the level of evidence based on the available literature. Moreover, it also provides a grade for the recommendation that depends on both the level of evidence and the consistency of the results from the different studies.

TABLE 54-2. Canadian Task Force levels of evidence

Level   Type of evidence
I       Evidence obtained from at least one properly randomized controlled trial
II-1    Evidence obtained from well-designed controlled trials without randomization
II-2    Evidence obtained from well-designed cohort or case-control analytic studies, preferably from more than one center or research group
II-3    Evidence obtained from comparisons between times or places with or without the intervention; dramatic results in uncontrolled experiments (such as the results of treatment with penicillin in the 1940s) could also be included in this category
III     Opinions of respected authorities, based on clinical experience, descriptive studies, or reports of expert committees

TABLE 54-3. Levels of evidence and grades of recommendation used by the ASCRS Standards Committee

Level   Type of evidence
I       Evidence obtained from metaanalysis of multiple, well-designed, controlled studies; randomized trials with low false-positive and low false-negative errors (high power)
II      Evidence obtained from at least one well-designed experimental study; randomized trials with high false-positive and/or false-negative errors (low power)
III     Evidence obtained from well-designed, quasi-experimental studies such as nonrandomized, controlled, single-group, pre-post, cohort, time, or matched case-control series
IV      Evidence from well-designed, nonexperimental studies such as comparative and correlational descriptive and case studies
V       Evidence from case reports and clinical examples

Grade   Grade of recommendation
A       There is evidence of Type I or consistent findings from multiple studies of Type II, III, or IV
B       There is evidence of Type II, III, or IV and findings are generally consistent
C       There is evidence of Type II, III, or IV but findings are inconsistent
D       There is little or no systematic empirical evidence
Assessing the Best Evidence
What Is the Quality of Evidence Evaluating Surgical Practice?
There is certainly a perception that surgeons are not adequately assessing surgical procedures. In a 1996 editorial in the Lancet entitled "Surgical Research or Comic Opera: Questions but Few Answers," Richard Horton criticized surgeons for their heavy reliance on case studies and stated that if surgeons wished to retain their academic reputations, they must find imaginative ways to collaborate with epidemiologists to improve the design of case series and to plan randomized trials.20 He also quoted the medical statistician Major Greenwood, who stated, "I should like to shame surgeons out of the comic opera performances which they suppose are statistics of operations."20 This quote dates back to 1923. In a similar condemnation, Spodick21 complained of the "repeated reporting of biased data from uncontrolled or poorly controlled trials, giving an illusion of success due to sheer quantity," and stated that "a thousand zeros look impressive on paper, but they still amount to zero."
So what is the evidence of the evidence? As one would predict, repeated studies have shown a predominance of case studies and a relative paucity of RCTs in the published literature. Solomon and McLeod2 reviewed three surgical journals (British Journal of Surgery, Surgery, and Diseases of the Colon and Rectum) over two time periods, 1980 and 1990. They found that only 7% of all published clinical articles were RCTs, despite the fact that almost half of the articles addressed issues of treatment effectiveness. Furthermore, the proportion differed neither between 1980 and 1990 nor among the three journals. Another examination of Diseases of the Colon and Rectum showed that the numbers of RCTs published were 5 in 1990, 13 in 1995, and 17 in 2000.22 Similarly, Barnes23 noted that only 5% of abstracts accepted at the annual joint meetings of the Society for Vascular Surgery and the International Society for Cardiovascular Surgery dealt with RCTs. Haines24 reported that only 5% of articles in the Journal of Neurosurgery between 1973 and 1977 were controlled clinical trials. More recently, Horton20 noted that 7% of articles published in nine surgical journals were reports of RCTs.
What clinical trials are being performed by surgeons?
Solomon et al.25 were able to identify 204 RCTs published in the literature in 1990 that were published by surgeons, were from a surgical department, or contained at least one surgical arm. They estimated that their search retrieved approximately half of the surgical RCTs published that year. Of these trials, the majority (75%) compared two medical therapies, whereas trials comparing two surgical therapies comprised only 18% and trials comparing a medical with a surgical therapy comprised only 5%. Thus, trials comparing antibiotic prophylactic regimens and adjuvant chemotherapy regimens were not uncommon, whereas trials comparing two different operative procedures were infrequent. Furthermore, the published trials tended to be small: almost two-thirds were single-center trials, and in half no significant difference was detected, probably because the sample size was small and the trial lacked adequate power. Unfortunately, surgeons were the primary authors in only a small proportion of studies, even those comparing two surgical procedures and in areas almost exclusively surgical in nature (e.g., trauma). The quality of the trials tended to be poor, especially if they contained one or two surgical arms or were published in surgical journals. Hall and colleagues26 reviewed the published surgical trials in 10 journals between 1988 and 1994. They also found that the trials tended to be of poor quality.
Given the relative paucity of RCTs reported in the literature, Solomon and McLeod27 then wished to determine whether it should be possible to perform RCTs in more instances, or whether, as some have suggested, it is not. To address this issue, they identified a sample of 260 questions in the surgical literature relating to the efficacy of general surgical procedures. From this analysis, it was estimated that an RCT could be performed to answer approximately 40% of the questions. In contrast, only 4.6% of the articles reviewed reported results of RCTs, and more than 50% of the articles were case reports or case studies. Although methodologic issues unique to surgical trials are frequently cited as the reason an RCT cannot be done, the authors believed that methodologic issues would preclude an RCT only 1% of the time. The most common issues precluding an RCT would be strong patient preferences for one or the other treatment, or the infrequency of the condition. With respect to the former, however, this was an assessment made by clinicians; trials such as those comparing mastectomy with lumpectomy, or carotid endarterectomy with medical therapy, illustrate that it is possible to do trials even when the alternative treatments differ greatly in magnitude.
Although one cannot deny that surgeons rely on case series rather than RCTs to evaluate new surgical techniques, it is also important to point out that some noteworthy, high-impact surgical trials have been performed: the mastectomy versus lumpectomy trials, the carotid endarterectomy and ECIC bypass trials for stroke prevention, and the laparoscopic versus open colorectal cancer trial.12,13,28,29 Furthermore, we must not forget the pioneering work of John Goligher,30 who performed a series of trials assessing the surgical management of peptic ulcer disease long before RCTs were in vogue. However, although internists may criticize surgeons for not performing more trials, it is also important to realize that perhaps the greatest impetus for medical trials is the requirement by regulating agencies of evidence from clinical trials before release of a new medication and, therefore, the availability of funding from industry to test them.
Beyond the performance of RCTs, it is important that the reader be able to critically evaluate the literature, which means that certain essential information must be included in the manuscript. A recent article examined the quality of reporting of RCTs in Diseases of the Colon and Rectum. The authors found that 77% of 11 basic elements were reported appropriately. The best-reported items were eligibility criteria, discussion of statistical tests, and accounting for all patients lost to follow-up. The worst-reported item was the power calculation: only 11% of trials reported it appropriately. For the critical reader, the reporting of appropriate methods, limitations, and data is important. To this end, standards have been recommended for the publication of RCTs (CONSORT, the Consolidated Standards of Reporting Trials), comprising 22 items (Table 54-4).31
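Because the power calculation is so often missing, a worked example may help. The following is a minimal sketch of the standard normal-approximation sample-size formula for comparing two proportions; the event rates, alpha, and power chosen here are hypothetical, and the function name is ours.

```python
import math
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients per arm needed to detect a difference between
    two proportions (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # quantile for the Type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile for the Type II error
    p_bar = (p1 + p2) / 2                          # pooled event rate under the null
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# e.g., to detect a drop in a complication rate from 10% to 5%:
print(n_per_arm(0.10, 0.05))  # about 435 patients per arm
```

A trial reporting "no significant difference" with far fewer patients than such a calculation suggests is likely underpowered, which is exactly the weakness noted in the reviewed trials above.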
The Best Evidence
Practicing evidence-based medicine can be a daunting task for the clinician who has a busy clinical practice, must look after the administrative and financial aspects of that practice, and then must try to keep current with the latest information. It is physically impossible for clinicians to read all published medical journals, even in their own specialty, much less stay abreast of information distributed on the Internet and in non-peer-reviewed sources. Thus, the busy clinician must learn ways to access the best information and be able to critically appraise it to determine its worth and relevance to his or her practice. There are two scenarios in which clinicians wish to obtain information: for specific patient problems encountered daily, and for general maintenance or updating of knowledge. Although clinicians will need the skills to retrieve information and critically appraise it, several information sources may be of particular help, including systematic reviews and evidence-based practice guidelines.
Systematic Reviews or Metaanalyses
The terms systematic review and metaanalysis are often used interchangeably. Strictly, however, a systematic review or overview is a qualitative synthesis, whereas in a metaanalysis statistical methods are used to combine and summarize the results of several studies.32 In both, there is a specific scientific approach to the identification, critical appraisal, and synthesis of all relevant studies on a specific topic. They differ from the usual clinical review in that an explicit, specific question is addressed, the methodology is explicit, and there is a conscientious effort to retrieve and review all studies on the topic without preconceived prejudice. The value of metaanalysis is that study results are combined so that conclusions can be drawn about therapeutic effectiveness or, if there is no conclusive answer, so that new studies can be planned.33 Metaanalyses are especially useful when results from several studies disagree with regard to the magnitude or direction of effect, when individual studies are too small to detect an effect and therefore report it as statistically nonsignificant, or when a large trial is too costly or time consuming to perform. For the clinician, metaanalyses are useful because the results of individual trials are combined, sparing him or her from retrieving, evaluating, and synthesizing all studies on the topic. Thus, they may increase the efficiency with which the clinician keeps abreast of recent advances.
Metaanalysis is a relatively new method for synthesizing information from multiple studies. Thus, the methodology is constantly evolving and, as with other types of studies, the rigor of individual published metaanalyses may be quite variable. There have been calls for standardization of the methodology.34,35 Because quality varies, the clinician should have some knowledge of metaanalysis methodology and be able to critically appraise published metaanalyses. Published guidelines are available (Table 54-5).36

TABLE 54-4. CONSORT checklist for reporting RCTs

1. Title and abstract: how participants were allocated to interventions

Introduction
2. Background: scientific background and explanation of rationale

Methods
3. Participants: eligibility criteria, settings, and locations of data collection
4. Interventions: details of interventions for each group
5. Objectives: specific aims and hypotheses
6. Outcomes: defined primary and secondary outcomes
7. Sample size: how sample size was determined, interim analyses, stopping rules
8. Randomization, sequence generation: method used to generate the randomization sequence
9. Randomization, allocation concealment: method used to implement randomization
10. Randomization, implementation: who generated the allocation sequence, who enrolled participants
11. Blinding: whether or not blinding was performed (subjects, researchers, etc.)
12. Statistical methods: methods used to compare groups

Results
13. Participant flow: flow of subjects through each stage (a diagram is strongly recommended), such as numbers of subjects randomly assigned, receiving intended treatment, completing the protocol, etc.
14. Recruitment: dates defining the periods of recruitment and follow-up
15. Baseline data: baseline demographic and clinical characteristics of each group
16. Numbers analyzed: "denominator" of each group and whether analysis was performed by "intention to treat"
17. Outcomes and estimation: summary of results for each primary and secondary outcome for each group
18. Ancillary analyses: added analyses and whether they were prespecified or exploratory
19. Adverse events: all important adverse events or side effects in each group

Discussion
20. Interpretation: interpretation of results, discussing hypotheses, bias, limitations
21. Generalizability: external validity of the trial findings
22. Overall evidence: general interpretation of the results in the context of current evidence
There are some basic steps in performing a metaanalysis. First, the metaanalysis should address a specific healthcare question. Second, various strategies should be used to ensure that all relevant studies (RCTs) on the topic are retrieved. These include searching databases such as MEDLINE and EMBASE; in addition, proceedings of meetings and reference lists should be checked, and content experts and clinical researchers consulted, to ensure that all published and unpublished trials are identified. Reliance on MEDLINE searches alone will result in incomplete retrieval of published studies.27 Third, as in other studies, the inclusion criteria determining which studies will be included should be set a priori. Fourth, data from the individual studies should be extracted by two blinded investigators to ensure that this is done accurately; these investigators should also assess the quality of the individual studies. Fifth, the data should be combined using appropriate statistical techniques. Before doing so, statistical tests of the "sameness" or "homogeneity" of the individual studies should be performed.
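As an illustration of the combining step, the sketch below pools invented trial results on the log risk-ratio scale using inverse-variance (fixed-effect) weights and computes Cochran's Q as a test of homogeneity. A real metaanalysis would use dedicated software and prespecified methods; this only shows the arithmetic.

```python
import math

# Hypothetical trials: (events_trt, n_trt, events_ctl, n_ctl); invented numbers
trials = [(12, 100, 20, 100), (8, 80, 15, 82), (30, 250, 45, 248)]

effects, weights = [], []
for et, nt, ec, nc in trials:
    log_rr = math.log((et / nt) / (ec / nc))   # log risk ratio for one trial
    var = 1 / et - 1 / nt + 1 / ec - 1 / nc    # approximate variance of the log RR
    effects.append(log_rr)
    weights.append(1 / var)                    # inverse-variance weight

pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
se = math.sqrt(1 / sum(weights))
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))  # Cochran's Q

print(f"pooled RR {math.exp(pooled):.2f}, "
      f"95% CI {math.exp(pooled - 1.96 * se):.2f} to {math.exp(pooled + 1.96 * se):.2f}, "
      f"Q = {q:.2f} on {len(trials) - 1} df")
```

A Q value that is large relative to its degrees of freedom signals heterogeneity, in which case pooling may be inappropriate, a caution discussed below.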
Whereas some have embraced metaanalysis as a systematic approach to synthesizing published information from individual trials, others have cautioned about its results, and some have been completely skeptical of the technique.37 LeLorier et al.38 compared the results of 19 metaanalyses with the results of 12 large trials published subsequently. Had the subsequent trials not been performed, an ineffective treatment would have been adopted in 32% of cases and a useful treatment rejected in 33%. Others have pointed out that metaanalyses of the same clinical question have led to different conclusions.39 Some of these discrepancies are attributable to methodologic problems. Failure to use a broad enough search strategy may result in exclusion of relevant studies. Unpublished studies are usually excluded, and these are more likely to be "negative trials" (so-called publication bias).40 As well, there is evidence that omission of trials not published in English-language journals may bias the results.41 Finally, there is a strong association between statistically positive conclusions of metaanalyses and their quality (i.e., the lower the quality of the included studies, the more likely that the metaanalysis reached a positive conclusion).42 One of the values of metaanalysis is that the generalizability of the results is increased by combining the results of several trials. However, if there is great variation among studies, including patient inclusion criteria, dosage and mode of administration of medication, and length of follow-up (so-called heterogeneity), it may be inappropriate to combine results, and doing so may produce invalid results. Other reasons for discrepancies may be the use of different statistical tests and failure to update the metaanalysis. Finally, metaanalysis has generally been restricted to combining the results of RCTs, even though there is also a need to combine data from nonrandomized or observational studies.
In response to the problems in disseminating the results of individual RCTs, the Cochrane Collaboration was established43 to prepare, maintain, and disseminate systematic reviews of RCTs of healthcare interventions. It was named after Archie Cochrane, an eminent epidemiologist in the United Kingdom. The Cochrane Collaboration is a voluntary international organization that encourages the participation of interested individuals. Cochrane groups are organized by area of interest (e.g., upper gastrointestinal, inflammatory bowel disease, colorectal cancer, hepatobiliary). In addition to preparing reviews, the groups hand search journals and maintain a database of all published RCTs. Systematic reviews are constantly being updated. The Cochrane Library is issued quarterly on CD-ROM (The Cochrane Library, Update Software Inc., 936 La Rueda, Vista, CA 92084) and includes several databases, among them the Cochrane Database of Systematic Reviews. This is a valuable source of high-level information for practicing clinicians. Unfortunately, it is of somewhat more limited use to surgeons because of the paucity of published surgical RCTs and metaanalyses.
Practice Guidelines
Practice guidelines have been defined by the Institute of Medicine as "systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances."44 Guidelines are not standards that set rigid rules of care for patients. Rather, guidelines should be flexible, so that individual patient characteristics, the preferences of surgeons and patients, and local circumstances can be accommodated.45
Guideline development has occurred for several reasons.46 First, as discussed earlier, there is growing evidence of substantial unexplained and inappropriate variation in clinical practice patterns, which is probably attributable in part to physician uncertainty. Second, there is evidence that the traditional methods for delivering continuing medical education are ineffective and that clinicians have difficulty in assimilating the rapidly evolving scientific evidence. Third, there is concern that as healthcare resources become more limited, there will be inadequate funds to deliver high-quality care if current technology and treatments are used inappropriately or ineffectively.

TABLE 54-5. Guidelines for using a review

1. Did the overview address a focused clinical question?
2. Were the criteria used to select articles for inclusion appropriate?
3. Is it unlikely that important, relevant studies were missed?
4. Was the validity of the included studies appraised?
5. Were the assessments of the studies reproducible?
6. Were the results similar from study to study?
7. What are the overall results of the review?
8. How precise were the results?
9. Can the results be applied to my patient care?
10. Were all the clinically important outcomes considered?
11. Are the benefits worth the harms and costs?
Practice guidelines have been promoted as one strategy to assist clinical decision making, increase effectiveness, and decrease the unnecessary costs of delivering healthcare services.46 Many clinicians are wary of guidelines, believing them to be simply a means to limit resources and inhibit clinical decision making and individual preferences. Guidelines have also been criticized for being too idealistic and for failing to take into account the realities of day-to-day practice. The argument is that patients differ in their clinical manifestations, associated diseases, and preferences for treatment; thus, guidelines may be either too restrictive or irrelevant. Clinicians may also be confused by conflicting guidelines. Finally, guideline development may be inhibited by a lack of evidence upon which to base guidelines.
Many groups and organizations have begun to develop practice guidelines, using different methods.47 Guidelines can be developed by informal consensus, in which the criteria upon which decisions are made are often poorly described and there is no systematic approach to reviewing the evidence; more often than not, such guidelines are based on the opinion of experts. Readers are unable to judge their validity because, even if a systematic approach was followed, the process is not documented. In many instances, guidelines are self-serving, used to promote a certain specialty or expertise. The National Institutes of Health and others have produced guidelines based on a formal consensus approach. Although this approach tends to be more structured than informal consensus, it shares the same potential flaws: the review of evidence is still relatively unstructured, and the conclusions remain susceptible to the biases of the experts.
Evidence-based guidelines are the most rigorously developed guidelines.15,46,48 There should be a focused clinical question, and a systematic approach to the retrieval, quality assessment, and synthesis of evidence should be followed. Guideline development should also be a dynamic process, with constant updating as more evidence becomes available. In addition to assessment of the literature, there is usually an interpretation of the evidence by experts, and the evidence may be modulated by current or local circumstances (e.g., the cost and availability of technology).
Whereas much attention has been given to the preparation of guidelines, there has been less emphasis on their dissemination and on evaluation of their impact. Unfortunately, there is some indication that evidence-based guidelines may not have as much impact as hoped, either on changing physician behavior or on improving outcomes.
Because there are many guidelines available, including some with conflicting recommendations, clinicians require some skills to evaluate the guidelines and determine their validity and applicability15,48 (Table 54-6).
Critically Evaluating the Literature (How to . . .)
Critically Appraising the Literature
Critical appraisal skills must be mastered before evidence-based practice can be implemented successfully.49 Critical appraisal skills are those that enable application of certain rules of evidence and laws of logic to clinical, investigative, and published data and information in order to evaluate their validity, reliability, credibility, and utility. Clinicians need critical appraisal skills because of the constant appearance of new knowledge and the short half-life of current knowledge.
Clinicians cannot rely on facts learned in medical school.
Instead, they must have the necessary skills to assess the validity and relevance of new knowledge in order to provide the best care to their patients.
Critical appraisal requires the clinician to have some knowledge of clinical epidemiology, biostatistics, epidemiology, decision analysis, and economics. Critical appraisal skills also improve with practice, and clinicians are encouraged simply to begin using the skills they already have and build on them. A variety of articles and books have been written on the topic. The McMaster Evidence Based Medicine Group has published a series of articles in the Journal of the American Medical Association.15,36,48,50–64 Sackett and colleagues49 have consolidated much of this information into a book entitled "Evidence Based Medicine." Interested readers are encouraged to seek further information from these and other sources.

To make decisions about a patient, clinicians generally need to know something about the cause of the disease, its risk factors, its natural history or prognosis, how to quantify aspects of the disease (measurement issues), diagnostic tests and the diagnosis of the disease, and the effectiveness of treatment. In addition, clinicians now need some knowledge of economic analysis, health services research, practice guidelines, systematic reviews, and decision analysis to fully appreciate the literature and make use of all sources of information.
TABLE 54-6. Guidelines for assessing practice guidelines

1. Were all the important options and outcomes clearly specified?
2. Was an explicit and sensible process used to identify, select, and combine evidence?
3. Was an explicit and sensible process used to consider the relative value of different outcomes?
4. Is the guideline likely to account for important recent developments?
5. Has the guideline been subject to peer review and testing?
6. Are practical, clinically important recommendations made?
7. How strong are the recommendations?
8. What is the impact of uncertainty associated with the evidence and values used in the guidelines?
9. Is the primary objective of the guideline consistent with your objective?
10. Are the recommendations applicable to your patients?
Many clinicians believe that critical appraisal requires only a knowledge of statistics. As stated previously, an array of skills is required. Furthermore, in making decisions about the internal validity of a study (i.e., How good is the study, and how confident am I that the results or conclusions are correct?), it is critical that the clinician be able to assess the study design and how well the study was actually performed. The statistical analysis, although important, is only one component of study design.
Generally, clinicians read articles so they can generalize the results of the study and apply them to their own patients.
There are two potential sources of error that may lead to incorrect conclusions about the validity of study results: systematic error (bias) and random error. Bias is defined as "any effect at any stage of investigation or inference tending to produce results that depart systematically from the true values."65 For example, the term "biased sample" is often used to mean that the sample of patients is not typical or representative of patients with that condition. Many kinds of bias may be present, not just those related to patient selection, and it may be difficult for the reader to discern the presence of bias and its magnitude. For instance, suppose two different treatments are compared in two groups of patients from two different hospitals. Although the authors could provide basic demographic information on the patient groups, one could not be certain that there were no differences in the patients, the severity of disease, ancillary care, and so on at the two hospitals, and that such differences, rather than the treatment, led to an improved outcome. The risk of an error as a result of bias decreases as the rigor of the trial design increases (see the discussion of risk adjustment later in this section). Because of the random allocation of patients, as well as its other attributes, the RCT is considered the best design for minimizing bias. In observational studies, including outcomes research (where patients have not been randomized), various statistical techniques (e.g., multivariate analysis) are frequently used to adjust for differences in prognostic factors between the two groups of patients. However, it is important to realize that it is possible to adjust only for known or measurable factors; there may be other unknown and possibly important prognostic factors that cannot be adjusted for. Again, only if patients are randomly allocated can one be confident that the two groups are similar with respect to all known and unknown prognostic variables.
The other type of error is random error, which occurs by chance when the result obtained in the sample of patients studied differs from the result that would be obtained if the entire population were studied.65 Statistical testing can be performed to determine the likelihood of a random error. The type of statistical test used varies with the type of data; some of the more common tests are shown in Table 54-7. There are two types of random error: Type I and Type II. Stating that there is a difference between two treatments when really there is none is a Type I error; in the theory of hypothesis testing, rejecting a null hypothesis that is actually true is called a Type I error. By convention, if the risk of the result occurring by chance is less than 5% (a P value less than .05), the difference in the results of treatment is considered statistically significant, and it is concluded that there really is a difference in the effectiveness of the two treatments. A Type II error, conversely, is failing to detect a difference that truly exists, typically because the sample size is too small and the trial is underpowered.
One issue regarding Type I errors is that of multiple comparisons: the more comparisons performed on a given set of data, the higher the likelihood of a Type I error (finding a difference when one truly does not exist). Under these circumstances, the authors should apply a correction for multiple comparisons (e.g., a Bonferroni correction).
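As a simple illustration with invented P values, a Bonferroni correction tests each comparison against alpha divided by the number of comparisons:

```python
# Invented P values from five comparisons made on the same data set
p_values = [0.004, 0.030, 0.041, 0.250, 0.620]
alpha = 0.05
threshold = alpha / len(p_values)  # Bonferroni-corrected significance level

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < threshold else "not significant"
    print(f"comparison {i}: P = {p:.3f} -> {verdict} at corrected alpha = {threshold:.3f}")
```

Here comparisons 2 and 3, nominally significant at the conventional .05 level, fail the corrected threshold of .01.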
Although a result may be statistically significant, the clinician must determine whether it is clinically relevant or important.49 Typically, treatment effects can be expressed as an absolute risk reduction (ARR) or a relative risk reduction (RRR). The ARR is simply the difference in event rates between the control group and the experimental group, whereas the RRR is a proportional risk reduction, calculated by dividing the ARR by the control group's risk. The advantage of the ARR is that the baseline event rate is considered. For instance, the RRR would be the same in two different studies in which the rates in the control and experimental groups were 50% and 25%, and 0.5% and 0.25%, respectively. In other words, whereas the ARR would be 25% in the first study and 0.25% in the second, the RRR for both studies would be 50%. Although the RRR is the same in both studies, the treatment benefit in the second scenario may be trivial.
Cook and Sackett66 have coined the term "number needed to treat" (NNT), which may be more intuitive for clinicians than thinking in terms of ARR and RRR. It is calculated as 1 divided by the ARR. Thus, in the example mentioned above, four patients would have to be treated to prevent one bad outcome in the first study (the NNT is 4), whereas 400 would have to be treated to prevent one bad outcome in the second (the NNT is 400). It is up to the judgment of the clinician to decide whether a treatment benefit is clinically significant; the statistician can only determine whether it is statistically significant. Whether the effect is clinically significant will depend on the NNT, the frequency and severity of side effects (sometimes expressed as the number needed to harm, NNH), and the cost, feasibility, and acceptability of the treatment.
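The two scenarios from the text can be reproduced directly; this minimal sketch computes ARR, RRR, and NNT from event rates expressed as proportions (the helper function is ours).

```python
def effect_measures(control_rate, treatment_rate):
    """Return (ARR, RRR, NNT) from event rates expressed as proportions."""
    arr = control_rate - treatment_rate  # absolute risk reduction
    rrr = arr / control_rate             # relative risk reduction
    nnt = 1 / arr                        # number needed to treat
    return arr, rrr, nnt

# The two scenarios from the text: 50% vs. 25%, and 0.5% vs. 0.25%
for control, treated in [(0.50, 0.25), (0.005, 0.0025)]:
    arr, rrr, nnt = effect_measures(control, treated)
    print(f"ARR = {arr:.2%}, RRR = {rrr:.0%}, NNT = {nnt:.0f}")
# prints: ARR = 25.00%, RRR = 50%, NNT = 4
#         ARR = 0.25%, RRR = 50%, NNT = 400
```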
TABLE 54-7. Types of statistical tests

Data type                          Test (no adjustment for risk factors)   Test (with adjustment for risk factors)
Binary (dichotomous)               Fisher exact test or chi-square         Logistic regression
Ordered discrete                   Mann-Whitney U test
Continuous (normal distribution)   Student's t test                        Analysis of covariance
Time to event (censored data)      Log-rank or Wilcoxon test               Log-rank (Cox proportional hazards)
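As a sketch of how the unadjusted tests in Table 54-7 might be run, the example below applies them to invented data using the scipy.stats module (assuming SciPy is available); the adjusted analyses in the right-hand column, such as logistic regression or Cox regression, would require additional packages.

```python
from scipy import stats  # assumes SciPy is installed

# Binary outcome: 2 x 2 table of complication counts (invented data)
_, p_binary = stats.fisher_exact([[12, 88], [20, 80]])

# Ordered discrete outcome (e.g., pain scores in two groups)
_, p_ordinal = stats.mannwhitneyu([2, 3, 3, 4, 5], [3, 4, 4, 5, 5])

# Continuous, normally distributed outcome (e.g., operative time in minutes)
_, p_continuous = stats.ttest_ind([95, 102, 110, 99, 105], [120, 115, 108, 125, 118])

print(p_binary, p_ordinal, p_continuous)
```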