
Healthcare outcomes assessed with observational study designs compared with those assessed in randomized trials



Abstract

Background

Researchers and organizations often use evidence from randomized controlled trials (RCTs) to determine the efficacy of a treatment or intervention under ideal conditions, while observational study designs are often used to measure the effectiveness of an intervention in 'real world' conditions. Numerous study designs and modified designs, including randomized trials and observational studies, are used for comparative effectiveness research in an attempt to obtain an unbiased estimate of whether one treatment is more effective or safer than another for a given population.

A systematic analysis of study design features, risk of bias, parameter interpretation, and effect size for all types of randomized and observational (non-experimental) studies is needed to identify specific differences in design and potential biases. This review summarizes the results of methodological reviews that compare the outcomes of observational studies with those of randomized trials addressing the same question, as well as methodological reviews that compare the outcomes of different types of observational studies.

Objectives

To assess the impact of study design (including RCT versus observational study designs) on the effect measures estimated.

To explore methodological variables that might explain any differences identified.

To identify gaps in the existing research comparing study designs.

Search methods

We searched seven electronic databases from January 1990 to December 2013.

Along with relevant MeSH terms and keywords, we used the sensitivity- and specificity-balanced version of a validated search strategy to identify reviews in PubMed, augmented with one additional term ("review" in article titles) to better target reviews. No language restrictions were applied.

Selection criteria

We examined systematic reviews that were designed as methodological reviews to compare quantitative effect size estimates measuring the efficacy or effectiveness of interventions tested in trials with those tested in observational studies. Comparisons included RCTs versus observational studies (including retrospective and prospective cohorts, case-control, and cross-sectional studies). Reviews were not eligible if they compared randomized trials with other studies that had used some form of concurrent allocation.

Data collection and analysis

In general, outcome measures included relative risks or rate ratios (RR), odds ratios (OR), and hazard ratios (HR). Using results from observational studies as the reference group, we examined the published estimates to see whether the ratios of odds ratios (RORs) indicated relatively larger or smaller effects.

Within each identified review, if an estimate comparing results from observational studies with those from RCTs was not provided, we pooled the estimates from the observational studies and from the RCTs. We then estimated the ratio of ratios (risk ratio or odds ratio) for each identified review, using observational studies as the reference category. Across all reviews, we synthesized these ratios to obtain a pooled ratio of odds ratios (ROR) comparing results from RCTs with results from observational studies.

Main results

Our initial search yielded 4406 unique references. Fifteen reviews met our inclusion criteria; 14 of these were included in the quantitative analysis.

The included reviews analyzed data from 1583 meta-analyses covering 228 different medical conditions. The mean number of included studies per paper was 178 (range 19 to 530).

Eleven (73%) reviews had a low risk of bias for explicit criteria for study selection, nine (60%) had a low risk of bias for investigators' agreement on study selection, five (33%) included a complete sample of studies, and seven (47%) assessed the risk of bias of their included studies.

Seven (47%) reviews controlled for methodological differences between studies.

Eight (53%) reviews controlled for heterogeneity among studies, nine (60%) analyzed similar outcome measures, and four (27%) were judged to be at low risk of reporting bias.

Our primary quantitative analysis, including 14 reviews, showed that the pooled ROR comparing effects from RCTs with effects from observational studies was 1.08 (95% confidence interval (CI) 0.96 to 1.22). Of the 14 reviews included in this analysis, 11 (79%) found no significant difference between observational studies and RCTs. One review suggested that observational studies had larger effects, and two reviews suggested that observational studies had smaller effects.

Similar to the effect across all included reviews, the pooled ROR for reviews comparing RCTs with cohort studies was 1.04 (95% CI 0.89 to 1.21), with substantial heterogeneity (I² = 68%). Three reviews compared the effects of RCTs and case-control studies (pooled ROR 1.11, 95% CI 0.91 to 1.35).

No significant differences in point estimates were noted across subgroups defined by heterogeneity, inclusion of pharmacological interventions, or propensity score adjustment. No reviews had compared RCTs with observational studies that used two of the most common causal inference methods, namely instrumental variables and marginal structural models.

Authors' conclusions

Our results across all reviews (pooled ROR of 1.08) are very similar to results reported by similarly conducted reviews. As such, we have reached similar conclusions: on average, there is little evidence for significant effect estimate differences between observational studies and RCTs, regardless of specific observational study design, heterogeneity, or inclusion of studies of pharmacological interventions. Factors other than study design per se need to be considered when exploring reasons for disagreement between the results of RCTs and observational studies. Our results underscore that it is important for review authors to consider not only study design, but also the level of heterogeneity in meta-analyses of RCTs or observational studies. A better understanding of how these factors influence study effects might yield estimates that reflect true effectiveness.

Plain language summary

Comparing effect estimates from randomized controlled trials and observational studies

Researchers and organizations often refer to evidence from randomized controlled trials (RCTs) to determine the efficacy of a treatment or intervention under ideal conditions, while observational studies are used to measure the effectiveness of an intervention in non-experimental, 'real world' settings. Sometimes, RCTs and observational studies addressing the same question may yield different results. This review seeks to establish whether such differences in results are related to the study design itself or to other characteristics of the studies.

This review summarizes the results of methodological reviews comparing the outcomes of observational studies with those of randomized trials addressing the same questions, as well as methodological reviews comparing the outcomes of different types of observational studies.

The main objectives of the review are to assess the impact of study design (the inclusion of RCTs versus observational studies such as cohort and case-control studies) on the effect measures estimated, and to explore methodological variables that might explain any differences.

We searched several electronic databases and the reference lists of relevant articles to identify systematic reviews that were designed as methodological reviews to compare quantitative effect size estimates measuring the efficacy or effectiveness of interventions tested in trials with those tested in observational studies, or across different observational study designs. We assessed the risk of bias of the included reviews.

Our results provide little evidence for significant effect estimate differences between observational studies and RCTs, regardless of specific observational study design, heterogeneity, inclusion of pharmacological studies, or propensity score adjustment. Factors other than study design per se need to be considered when exploring reasons for disagreement between the results of RCTs and observational studies.

Authors' conclusions

Implications for methodological research

In order to understand why RCTs and observational studies addressing the same question sometimes have conflicting results, methodological researchers must look for explanations other than the study design per se. Confounding is the greatest source of bias in observational studies compared with RCTs, and methods for accounting for confounding in meta‐analyses of observational studies should be developed (Reeves 2013). The Patient‐Centered Outcomes Research Institute is finalizing methodological standards and calling for more research on measuring confounding in observational studies (PCORI 2012). PCORI has also called for empirical data to support the construction of propensity scores and the validity of instrumental variables, two methods used to control for confounding in observational studies.

Background

Researchers and organizations often use evidence from randomized controlled trials (RCTs) to determine the efficacy of a treatment or intervention under ideal conditions. Studies of observational design are used to measure the effectiveness of an intervention in non‐experimental, 'real world' scenarios at the population level. The Institute of Medicine defines comparative effectiveness research (CER) as: “the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat, and monitor a clinical condition or to improve the delivery of care. The purpose of CER is to assist consumers, clinicians, purchasers, and policy makers to make informed decisions that will improve health care at both the individual and population levels" (Institute of Medicine 2009). Comparative effectiveness research has also been called "comparative clinical effectiveness research" and "patient centered outcomes research" (Kamerow 2011). Regardless of what this type of research is called, it should give an unbiased estimate of whether one treatment is more effective or safer than another for a particular population. Debate about the validity of observational studies versus randomized trials for estimating effectiveness of interventions has continued for decades.

Numerous study designs and modifications of existing designs, both randomized and observational, are used for comparative effectiveness research. These include, but are not limited to, head‐to‐head randomized trials, cluster‐randomized trials, adaptive designs, practice/pragmatic/explanatory trials, PBE‐CPI "practice based evidence for clinical practice improvement," natural experiments, observational or cross‐sectional studies of registries and databases including electronic medical records, meta‐analysis, network meta‐analysis, modeling and simulation. Modifications can often include newer observational study analysis approaches employing so‐called causal inference techniques, which can include instrumental variables, marginal structural models, and propensity scores, among others. Non‐randomized experimental designs (e.g. non‐randomized trials) also play a role in comparative effectiveness research, but this review focuses on comparing randomized trials with non‐experimental observational designs. As noted in the Cochrane Handbook for Systematic Reviews of Interventions, potential biases for all non‐randomized studies are likely to be greater than for randomized trials (Higgins 2011). A systematic analysis of study design features, risk of bias, and effect size for all the types of studies used for comparative effectiveness research is needed to identify specific differences in design types and potential biases.

This review summarizes the results of methodological reviews that compare the outcomes of observational studies with randomized trials addressing the same question, as well as methodological reviews that compare the outcomes of different types of observational studies. A number of reviews comparing the effect sizes and/or biases in RCTs and observational studies (or non‐randomized controlled trials) have been conducted (Benson 2000; Britton 1998; Concato 2000; Deeks 2003; Ioannidis 2001; Kunz 1998; Kunz 2002; MacLehose 2000; Odgaard‐Jensen 2011; Oliver 2010; Sacks 1982; Wilson 2001). These reviews examined whether certain types of study designs report smaller or larger treatment effects, or change the direction of effects. Some reviews found that a lack of randomization or inadequate randomization is associated with selection bias, larger treatment effects, smaller treatment effects, or reversed direction of treatment effects (Deeks 2003; Ioannidis 2001; Kunz 1998; Odgaard‐Jensen 2011), while others found little to no difference in treatment effect sizes between study designs (Benson 2000; Britton 1998; Concato 2000; MacLehose 2000; Oliver 2010). However, there has been no systematic review of comparisons of all study designs currently being used for comparative effectiveness research. Reviews that compared RCTs with observational studies most often limited the comparison to cohort studies, or the types of observational designs included were not specified. In addition, most of the reviews were published between 1982 and 2003 and the methodology for observational studies has evolved since that time. One Cochrane review, first published in 2002 (Kunz 2002), has been archived and superseded by later versions. The most recent version of that review, published in 2011, compared random allocation versus non‐random allocation or adequate versus inadequate/unclear concealment of allocation in randomized trials (Odgaard‐Jensen 2011). This review included comparisons of randomized trials ("randomized controlled trials" or "RCTs"), non‐randomized trials with concurrent controls, and non‐equivalent control group designs. The review excluded comparisons of studies using historical controls (patients treated earlier than those who received the intervention being evaluated, frequently called "historically controlled trials" or "HCTs"); classical observational studies, including cohort studies, cross‐sectional studies, case‐control studies, and 'outcomes studies' (evaluations using large administrative or clinical databases). Another recent review assessing the relationship between randomized study designs and estimates of effect has focused only on policy interventions (Oliver 2010).

Why it is important to do this review

Despite the need for rigorous comparative effectiveness research, there has been no systematic comparison of effect measure estimates among all the types of randomized and non‐experimental observational study designs that are being used to assess effectiveness of interventions. The findings of this review will inform the design of future comparative effectiveness research and help prioritize the types of context‐specific study designs that should be used to minimize bias.

Objectives

To assess the impact of study design (RCTs versus observational study designs) on the effect measures estimated.

To explore methodological variables that might explain any differences identified. Effect size estimates may be related to the underlying risk of bias (i.e., methodological variables) of the studies, and not to the design per se. A flawed RCT may have larger effect estimates than a rigorous cohort study, for example. Where the methodological reviews we included assessed the risk of bias of the study designs they included, we examined whether differences in risk of bias explained any differences in effect size estimates.

To identify gaps in the existing research comparing study designs. 

Methods

Criteria for considering studies for this review

Types of studies

We examined systematic reviews that were designed as methodological reviews to compare quantitative effect size estimates measuring efficacy or effectiveness of interventions tested in trials with those tested in observational studies. For the purposes of this review, a methodological review is defined as a review that is designed to compare outcomes of studies that vary by a particular methodological factor (in this case, study design) and not to compare the clinical effect of an intervention to no intervention or a comparator. Comparisons included RCTs and observational studies (including retrospective cohorts, prospective cohorts, case‐controls, and cross‐sectional designs) that compared effect measures from different study designs or analyses. For this review, the only non‐experimental studies we analyzed were observational in design. Therefore, we use the term "observational" in presenting the findings of our review. However, it should be noted that the terminology used in the literature to describe study designs is not consistent and can lead to confusion.

We included methodological reviews comparing studies described in the review as head to head randomized trials, cluster randomized trials, adaptive designs, practice/pragmatic/explanatory trials, PBE‐CPI "practice based evidence for clinical practice improvement," natural experiments, prospective and retrospective cohort studies, case‐control studies, observational or cross‐sectional studies of registries and databases including electronic medical records, or observational studies employing so‐called causal inference techniques (briefly, analytical techniques that attempt to estimate a true causal relationship from observational data), which could include instrumental variables, marginal structural models, or propensity scores. Specifically, we included comparisons of estimates from RCTs with any of the above types of observational studies.

Our focus is on reviews of effectiveness or harms of health‐related interventions. We included two types of reviews: a) systematic reviews of primary studies in which the review's main objective was pre‐defined to include a comparison of study designs and not to answer one specific clinical research question; and b) methodological reviews of reviews that included existing reviews or meta‐analyses that compared RCTs with observational designs. We excluded comparisons of study designs where the included studies were measuring the effects of putative harmful substances that are not health‐related interventions, such as environmental chemicals, or diagnostic tests, as well as studies measuring risk factors or exposures to potential hazards. We excluded studies that compared randomized trials to non‐randomized trials. For example, we excluded studies that compared studies with random allocation to those with non‐random allocation or trials with adequate versus inadequate/unclear concealment of allocation. We also excluded studies that compared the results of meta‐analyses with the results of single trials or single observational studies. Lastly, we excluded meta‐analyses of the effects of an intervention that included both randomized trials and observational studies with an incidental comparison of the results.

Types of data

It was our intention to select reviews that quantitatively compared the efficacy or effectiveness of alternative interventions to prevent or treat a clinical condition or to improve the delivery of care. Specifically, our study sample included reviews that have effect estimates from RCTs or cluster‐randomized trials and observational studies, which included, but were not limited to, cohort studies, case‐control studies, and cross‐sectional studies.

Types of methods

We identified reviews comparing effect measures between trials and observational studies, or between different types of observational studies, including the following.

  • RCTs/cluster‐randomized trials versus prospective/retrospective cohorts

  • RCTs/cluster‐randomized trials versus case‐control studies

  • RCTs/cluster‐randomized trials versus cross‐sectional studies

  • RCTs/cluster‐randomized trials versus other observational design

  • RCTs/cluster‐randomized trials versus observational studies employing so‐called causal inference analytical methods

Types of outcome measures

The direction and magnitude of effect estimates (e.g. odds ratios, relative risks, risk difference) varied across meta‐analyses included in this review. Where possible, we used odds ratios as the outcome measure in order to conduct a pooled odds ratio analysis.
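
For reference, the odds ratio for a single study, and the standard error used to build its 95% confidence interval, can be derived from a 2×2 table as in the short sketch below (Python; the counts are hypothetical and this is illustrative only, not part of the review's analysis code).

```python
import math

def odds_ratio(a, b, c, d):
    """Odds ratio with a 95% CI from a 2x2 table:
    a = events in group 1, b = non-events in group 1,
    c = events in group 2, d = non-events in group 2.
    """
    or_est = (a * d) / (b * c)
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of the log odds ratio
    low = math.exp(math.log(or_est) - 1.96 * se_log_or)
    high = math.exp(math.log(or_est) + 1.96 * se_log_or)
    return or_est, low, high

print(odds_ratio(20, 80, 10, 90))  # hypothetical counts
```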

Search methods for identification of studies

Electronic searches

To identify relevant methodological reviews we searched the following electronic databases, in the period from 01 January 1990 to 06 December 2013.

  • Cochrane Methodology Register

  • Cochrane Database of Systematic Reviews

  • MEDLINE (via PubMed)

  • EMBASE (via EMBASE.com)

  • Literatura Latinoamericana y del Caribe en Ciencias de la Salud (LILACS)

  • PsycINFO

  • Web of Science/Web of Social Science

Along with MeSH terms and a wide range of relevant keywords, we used the sensitivity‐specificity balanced version of a validated strategy to identify reviews in PubMed (Montori 2004), augmented with one term ("review" in article titles) so that it better targeted reviews. We anticipated that this strategy would retrieve all relevant reviews. See Appendix 1 for our PubMed search strategy, which was modified as appropriate for use in the other databases.

The search strategy was iterative, in that references of included reviews were searched for additional references. We used the "similar articles" and "citing articles" features of several of the databases to identify additional relevant articles. All languages were included.

Prior to executing the electronic searches, the search strategy was peer reviewed by a second information specialist, according to the Peer Review of Electronic Search Strategies (PRESS) guidance (Sampson 2009).

Data collection and analysis

The methodology for data collection and analysis was based on the guidance of the Cochrane Handbook for Systematic Reviews of Interventions (Higgins 2011).

Selection of studies

After removing duplicate references, one review author (THH) screened the results, excluding those that were clearly irrelevant (e.g. animal studies, editorials, case studies).

Two review authors (AA and LB) then independently selected potentially relevant reviews by scanning the titles, abstracts, and descriptor terms of the remaining references and applying the inclusion criteria. Irrelevant reports were discarded, and the full article (or abstract if from a conference proceeding) was obtained for all potentially relevant or uncertain reports. The two review authors independently applied the inclusion criteria. Reviews were assessed for relevance based on study design, types of methods employed, and a comparison of effects based on different methodologies or designs. THH adjudicated any disagreements that could not be resolved by discussion.

Data extraction and management

After an initial search and article screening, two review authors independently double‐coded and entered information from each selected study onto standardized data extraction forms. Extracted information included the following. 

  • Study details: citation, start and end dates, location, eligibility criteria (inclusion and exclusion), study designs compared, interventions compared.

  • Comparison of methods details: effect estimates from each study design within each publication.

  • Outcome details: primary outcomes identified in each study.

Assessment of risk of bias in included studies

Because we included systematic reviews rather than individual studies, The Cochrane Collaboration tool for assessing the risk of bias of individual studies does not apply. We used the following criteria to appraise the risk of bias of included reviews, which are similar to those used in the methodology review by Odgaard‐Jensen and colleagues (Odgaard‐Jensen 2011).

  • Were explicit criteria used to select the studies?

  • Did two or more investigators agree regarding the selection of studies?

  • Was there a consecutive or complete sample of studies?

  • Was the risk of bias of the included studies assessed?

  • Did the review control for methodological differences of included studies (for example, with a sensitivity analysis)?

  • Did the review control for heterogeneity in the participants and interventions in the included studies?

  • Were similar outcome measures used in the included studies?

  • Is there an absence of risk of selective reporting?

  • Is there an absence of evidence of bias from other sources?

Each criterion was rated as yes, no or unclear.

We summarized the overall risk of bias of each study as: low risk of bias, unclear risk of bias or high risk of bias.

Measures of the effect of the methods

In general, outcome measures included relative risks or rate ratios (RR), odds ratios (OR), and hazard ratios (HR).

Dealing with missing data

This review is a secondary data analysis and did not incur the missing data issues seen in most systematic reviews. However, for a small number of reviews we needed more information regarding methods or other details; in these cases, we contacted the corresponding authors.

Assessment of heterogeneity

We synthesized data from multiple reviews to compare effects from RCTs with observational studies. The wide variety of outcomes and interventions synthesized increased the amount of heterogeneity between reviews. We assessed heterogeneity using the χ² statistic, with a significance level of 0.10, and the I² statistic. Together with the magnitude and direction of the effect, we interpreted an I² estimate between 30% and 60% as indicating moderate heterogeneity, 50% to 90% as substantial heterogeneity, and 75% to 100% as a high level of heterogeneity. Furthermore, if an included study was, in fact, a review article that already assessed heterogeneity, we reported the authors' original assessment of heterogeneity.
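
As a concrete illustration of these statistics, the sketch below (Python) computes Cochran's Q, its χ² P value, and I² from a set of log-transformed effect estimates and standard errors. The function and the example numbers are hypothetical; they simply mirror the thresholds described above.

```python
import numpy as np
from scipy.stats import chi2

def heterogeneity(log_effects, ses):
    """Cochran's Q, its chi-squared P value, and I² for a set of study estimates.

    log_effects: log-transformed effect estimates (e.g. log odds ratios)
    ses: standard errors of those log estimates
    """
    log_effects, ses = np.asarray(log_effects), np.asarray(ses)
    w = 1.0 / ses**2                              # inverse-variance weights
    pooled = np.sum(w * log_effects) / np.sum(w)  # fixed-effect pooled log estimate
    q = np.sum(w * (log_effects - pooled)**2)     # Cochran's Q
    df = len(log_effects) - 1
    p_value = chi2.sf(q, df)                      # compared against the 0.10 level
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, p_value, i_squared

# Three hypothetical odds ratios (1.2, 0.9, 1.5) with standard errors of their logs
q, p, i2 = heterogeneity(np.log([1.2, 0.9, 1.5]), [0.15, 0.20, 0.25])
print(f"Q = {q:.2f}, P = {p:.3f}, I² = {i2:.0f}%")
```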

Assessment of reporting biases

We attempted to minimize the potential for publication bias by our comprehensive search strategy that included evaluating published and unpublished literature. In cases where we were missing specific information or data, we contacted authors and requested additional data.

Data synthesis

We examined the relationship between study design type and the associated estimates. Using results from observational studies as the reference group, we examined the published estimates to see whether there was a relatively smaller or larger effect. We explored whether the RCT comparators showed about the same effects, larger treatment effects, or smaller treatment effects compared with the observational study reference group. Furthermore, in the text we qualitatively described the reported results from each included review. Within each identified review, if an estimate comparing results from RCTs with observational studies was not provided, we pooled the estimates for observational studies and RCTs. Then, using methods described by Altman (Altman 2003), we estimated the ratio of ratios (hazard ratio, risk ratio, or odds ratio) for each included review, using observational studies as the reference group. Across all reviews, we synthesized these ratios to get a pooled ratio of odds ratios (ROR) comparing results from RCTs with results from observational studies. Our results varied considerably by comparison groups, outcomes, interventions, and study design, which contributed greatly to heterogeneity. To avoid overlap of data between included studies, we did not include data previously included in another included review.
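
As a rough sketch of this synthesis, the code below follows the ratio-of-ratios logic described above: each review contributes a pooled odds ratio (with 95% CI) from its RCTs and one from its observational studies, the standard error of the log ROR combines the two CI-derived standard errors (Altman 2003), and the review-level RORs are then pooled with a DerSimonian-Laird random-effects model. The pooling model and all input numbers are assumptions for illustration, not the exact code or data used in this review.

```python
import numpy as np

def log_or_and_se(or_est, ci_low, ci_high):
    """Recover the log odds ratio and its SE from a point estimate and 95% CI."""
    return np.log(or_est), (np.log(ci_high) - np.log(ci_low)) / (2 * 1.96)

def log_ror(rct, obs):
    """Log ratio of odds ratios (RCT vs observational) and its SE (Altman 2003)."""
    l_rct, se_rct = log_or_and_se(*rct)
    l_obs, se_obs = log_or_and_se(*obs)
    return l_rct - l_obs, np.sqrt(se_rct**2 + se_obs**2)

def pool_random_effects(log_estimates, ses):
    """DerSimonian-Laird random-effects pooling of review-level log RORs."""
    y, se = np.asarray(log_estimates), np.asarray(ses)
    w = 1.0 / se**2
    q = np.sum(w * (y - np.sum(w * y) / np.sum(w))**2)
    tau2 = max(0.0, (q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_star = 1.0 / (se**2 + tau2)
    pooled = np.sum(w_star * y) / np.sum(w_star)
    se_pooled = np.sqrt(1.0 / np.sum(w_star))
    return tuple(np.exp([pooled, pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled]))

# Hypothetical reviews: (OR, 95% CI lower, upper) from RCTs and from observational studies
reviews = [((0.80, 0.70, 0.91), (0.75, 0.65, 0.87)),
           ((1.10, 0.95, 1.27), (0.95, 0.80, 1.13))]
log_rors, ses = zip(*(log_ror(rct, obs) for rct, obs in reviews))
ror, low, high = pool_random_effects(log_rors, ses)
print(f"Pooled ROR {ror:.2f} (95% CI {low:.2f} to {high:.2f})")
```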

Subgroup analysis and investigation of heterogeneity

Reducing bias in comparative effectiveness research is particularly important for studies comparing pharmacological interventions with their implications for clinical care and health care purchasing. Since a number of the studies comparing study designs used for comparative effectiveness research focused on pharmacological comparisons, we decided, a priori, to conduct a subgroup analysis of these pharmacological studies. Specifically, we hypothesized that studies of pharmacological comparisons in a randomized design may have smaller effect estimates than studies of pharmacological comparisons in an observational study.

Additionally, we performed a subgroup analysis by heterogeneity of the included methodological reviews to compare the differences between RCTs and observational studies from the subgroup of methodological reviews with high heterogeneity (as measured in their respective meta‐analysis) to those with moderate‐low heterogeneity. As such, we stratified the reviews by the heterogeneity within each methodology review.

Results

Description of studies

See Characteristics of included studies; Characteristics of excluded studies.

Results of the search

Our initial search yielded 4406 unique references. An additional five references were identified from checking the reference lists of included publications. We selected 59 full‐text articles for further review, of which 44 were excluded because they did not meet our inclusion criteria. Fifteen reviews met our inclusion criteria for this review; 14 of these reviews were included in the quantitative analysis. See Figure 1 for study selection chart.


Figure 1. Flow chart depicting screening process

Included studies

See Characteristics of included studies. Fifteen reviews, published between 01 January 1990 and 06 December 2013, met the inclusion criteria for this review. Fourteen papers compared RCTs with observational designs; two reviews focused exclusively on pharmacological interventions (Beynon 2008; Naudet 2011), while four focused on pharmacological and other interventions, but provided data on drugs that could be analyzed separately (Benson 2000; Concato 2000; Golder 2011; Ioannidis 2001).

The included reviews analyzed data from 1583 meta‐analyses that covered 228 different medical conditions. The mean number of included studies per paper was 178 (range 19 to 530).

Of the 15 reviews, 14 were included in the quantitative analysis; these reviews either reported data that allowed us to calculate RORs or we were able to obtain such data from the authors. One study (Papanikolauo 2006) was included in a previously published review (Golder 2011); therefore, we have described it but did not include it in the meta‐analysis.

Benson 2000 et al searched the Abridged Index Medicus and Cochrane databases for observational studies published between 1985 and 1998 that compared two or more treatments. To identify RCTs and observational studies comparing the same treatment, the researchers searched MEDLINE and Cochrane databases. One hundred and thirty‐six publications were identified that covered 19 different treatments. Benson 2000 et al found little evidence that treatment effect estimates obtained from observational studies were consistently larger than estimates from RCTs.

Beynon 2008 et al attempted to identify all observational and randomized studies with all‐cause mortality as the outcome for a sample of topics selected at random from the medical literature. One hundred and fourteen RCTs and 19 observational studies on 19 topics were included. The ratio of RRs for RCTs compared to observational studies was 0.88 (0.8 to 0.97), suggesting that observational studies had larger treatment effects by 12% on average.

Bhandari 2004 et al conducted a MEDLINE search for both observational and randomized studies comparing internal fixation and arthroplasty in patients with femoral neck fractures in publications between 1969 and 2002. The authors found 27 studies that met the criteria. Bhandari 2004 et al found that observational studies underestimated the relative benefit of arthroplasty by 19.5%.

Concato 2000 et al searched MEDLINE for meta‐analyses of RCTs and observational studies of the same intervention published in five major journals between 1991 and 1995. Across 99 reports on five clinical topics, results from observational studies were, on average, similar to those from RCTs. The authors concluded that well‐designed observational studies generally do not have larger effects of treatment when compared to results of RCTs.

Edwards 2012 et al performed a systematic review and meta‐analysis comparing effect estimates evaluating the effects of surgical procedures for breast cancer in both RCTs and observational studies. A search of MEDLINE, EMBASE, and Cochrane Databases (2003 to 2008) yielded 12 RCTs covering 10 disparate outcomes. In two of 10 outcomes the pooled estimates from RCTs and observational studies differed, though none significantly. The authors conclude that RCTs comparing breast surgery procedures may yield different estimates in 20% to 40% of cases compared with estimates from observational studies.

Furlan 2008 et al searched for comparative studies of low‐back pain interventions published in MEDLINE, EMBASE, or The Cochrane Library through May 2005 and included interventions with the highest numbers of non‐randomised studies. Seventeen observational studies and eight RCTs were identified and, in general, results from observational studies either agreed with results from RCTs or underestimated the effects when compared to RCTs.

Golder 2011 et al performed a meta‐analysis of meta‐analyses comparing estimates of harm derived from meta‐analyses of RCTs with meta‐analyses of observational studies. Fifty‐eight meta‐analyses were identified. Pooled relative measures of adverse effects (odds ratio (OR) or risk ratio (RR)) suggested no difference in effect between study types (OR = 1.03; 95% confidence interval (CI) 0.93 to 1.15). The authors conclude that, on average, there is no evidence of a difference in the effect estimates of adverse effects of interventions between meta‐analyses of RCTs and meta‐analyses of observational studies.

Ioannidis 2001 et al performed an analysis of meta‐analyses comparing effect estimates evaluating medical interventions from meta‐analyses of RCTs with meta‐analyses of observational studies. A search of MEDLINE (1966 to 2000) and The Cochrane Library (2000, Issue 3) and major journals yielded 45 diverse topics from 240 RCTs and 168 observational studies. Observational studies tended to show larger treatment effects (P = 0.009). The authors conclude that despite good correlation between RCTs and observational studies, differences in effect sizes are present.

Kuss 2011 et al performed a systematic review and meta‐analysis comparing effect estimates from RCTs with observational studies employing propensity scores. The included studies examined the effects of off‐pump versus on‐pump surgery in similar populations. A MEDLINE search yielded 29 RCTs and 10 propensity score analyses covering 10 different outcomes. For all outcomes, no differences were noted between RCTs and propensity score analyses.

The authors conclude that RCTs and propensity score analyses will likely yield similar results and propensity score analyses may have only a small remaining bias compared to RCTs.

Lonjon 2013 et al performed a systematic review and meta‐analysis comparing effect estimates from RCTs with observational studies employing propensity scores that studied the effects of surgery and addressed the same clinical questions. A MEDLINE search yielded 94 RCTs and 70 propensity score analyses covering 31 clinical questions. For all‐cause mortality, the authors noted no differences between RCTs and propensity score analyses (ROR = 1.07; 95% CI 0.87 to 1.33).

The authors conclude that RCTs and propensity score analyses will likely yield similar results in surgery studies.

Müeller 2010 et al searched PubMed for RCTs and observational studies comparing laparoscopic versus open cholecystectomy. A total of 162 studies were identified for inclusion (136 observational and 26 RCTs). Among the 15 outcomes of interest, three yielded significant discrepancies in effect sizes between study designs. As such, the authors conclude that the results from observational studies and RCTs differ significantly in at least 20% of outcome variables.

Naudet 2011 et al identified published and unpublished studies from 1989 to 2009 that examined fluoxetine and venlafaxine as first line treatment for major depressive disorder. The authors identified 12 observational studies and 109 RCTs and produced meta‐regression estimates for outcomes of interest. The standardized treatment response in RCTs was greater by a magnitude of 4.59 compared to observational studies and the authors conclude that the response to antidepressants is greater in RCTs than in observational studies.

Oliver 2010 et al identified systematic reviews that compared results of policy interventions, stratifying estimates by observational study and RCT study design published between 1999 and 2004. A total of 16 systematic reviews were identified, with a median of 11.5 RCTs and 14.5 observational studies in each systematic review. Observational studies published in systematic reviews were pooled separately from RCTs published in the same systematic reviews. Results that were stratified by study design were heterogeneous with no clear differences in magnitude of effects; the authors found no evidence for clear systematic differences in terms of results between RCTs and observational studies.

Shikata 2006 et al identified all meta‐analyses of RCTs of digestive surgery published between 1966 and 2004. Fifty‐two outcomes for 18 disparate topics were identified from 276 articles (96 RCTs and 180 observational studies). Pooled odds ratios and relative risks were extracted for each outcome, using the same indicator that had been used in the meta‐analysis of interest; approximately 25% of all outcomes of interest yielded different results between observational studies and RCTs.

Papanikolauo 2006 et al compared evidence from RCTs with observational studies that explored the effects of interventions on the risk of harm. Harms of interest were identified from RCTs with more than 4000 patients. Observational studies of more than 4000 patients were also included for comparison. Fifteen harms of interest were identified and relative risks were extracted for 13 topics. Data from 25 observational studies were compared with results from RCTs. Relative risks for each outcome/harm were calculated for both study types. The estimated increase in RR differed by more than two‐fold between observational studies and RCTs for 54% of the topics studied. The authors conclude that observational studies usually under‐estimate the absolute risk of harms. These data were included in Golder 2011 and consequently were not re‐analyzed in the current quantitative analysis.

Excluded studies

See Characteristics of excluded studies. Following full‐text screening, 44 studies were excluded from this review. The main reasons for exclusion included the following: the studies were meta‐analyses that did an incidental comparison of RCTs and observational studies, but were not designed for such a comparison (n = 14); the studies were methodological or statistical papers that did not conduct a full systematic review of the literature (n = 28); or the studies included quasi‐ or pseudo‐randomized studies, or provided no numerical data that would allow a quantitative comparison of effect estimates (n = 7).

Risk of bias in included studies

Eleven reviews had low risk of bias for explicit criteria for study selection (Benson 2000; Beynon 2008; Bhandari 2004; Edwards 2012; Furlan 2008; Ioannidis 2001; Kuss 2011; Müeller 2010; Naudet 2011; Oliver 2010; Papanikolauo 2006); nine (60%) had low risk of bias for investigators' agreement for study selection (Bhandari 2004; Concato 2000; Edwards 2012; Golder 2011; Kuss 2011; Naudet 2011; Oliver 2010; Papanikolauo 2006; Shikata 2006); five (33%) included a complete sample of studies (Bhandari 2004; Müeller 2010; Naudet 2011; Oliver 2010; Shikata 2006); seven (47%) assessed the risk of bias of their included studies (Bhandari 2004; Furlan 2008; Golder 2011; Lonjon 2013; Müeller 2010; Naudet 2011; Oliver 2010); seven (47%) controlled for methodological differences between studies (Furlan 2008; Ioannidis 2001; Kuss 2011; Lonjon 2013; Müeller 2010; Naudet 2011; Oliver 2010); eight (53%) controlled for heterogeneity among studies (Beynon 2008; Edwards 2012; Furlan 2008; Ioannidis 2001; Lonjon 2013; Müeller 2010; Naudet 2011; Oliver 2010); nine (60%) analyzed similar outcome measures (Benson 2000; Beynon 2008; Bhandari 2004; Edwards 2012; Ioannidis 2001; Lonjon 2013; Müeller 2010; Oliver 2010; Shikata 2006); and only four (27%) were judged to be at low risk of reporting bias (Bhandari 2004; Furlan 2008; Ioannidis 2001; Naudet 2011).

We rated reviews that were coded as adequate for explicit criteria for study selection, complete sample of studies, and controlling for methodological differences and heterogeneity as having a low risk of bias and all others as having a high risk of bias. Two reviews, Müeller 2010 and Naudet 2011, met all four of these criteria and, thus, had an overall low risk of bias.

See Figure 2; Figure 3.


Figure 2. 'Risk of bias' graph: review authors' judgements about each risk of bias item presented as percentages across all included studies.


Figure 3. 'Risk of bias' summary: review authors' judgements about each risk of bias item for each included study.

Effect of methods

Our primary quantitative analysis (Analysis 1.1), including 14 reviews, showed that the pooled ratio of odds ratios (ROR) comparing effects from RCTs with effects from observational studies was 1.08 (95% confidence interval (CI) 0.96 to 1.22) (see Figure 4). There was substantial heterogeneity for this estimate (I² = 73%). Of the 14 reviews included in this analysis, 11 (79%) found no significant difference between observational studies and RCTs. However, one review suggested observational studies have larger effects of interest (Bhandari 2004), while two other reviews suggested observational studies have smaller effects of interest (Müeller 2010; Naudet 2011).


Figure 4. Forest plot of comparison: 1 RCT vs Observational, outcome: 1.2 Pooled Ratio of Odds Ratios‐‐Study Design.

When possible or known, we isolated our results to reviews that specifically compared cohort studies and RCTs. Nine reviews either provided adequate data or performed these analyses in their publication (Benson 2000; Bhandari 2004; Concato 2000; Edwards 2012; Golder 2011; Ioannidis 2001; Kuss 2011; Lonjon 2013; Naudet 2011). Similar to the effect across all included reviews, the pooled ROR for effects from RCTs compared with cohort studies was 1.04 (95% CI 0.89 to 1.21), with substantial heterogeneity (I² = 68%) (Analysis 1.1.2). In lieu of a sensitivity analysis removing case‐control studies, we performed a subgroup analysis of reviews that compared the effects of case‐control studies versus RCTs (Concato 2000; Golder 2011; Ioannidis 2001). The pooled ROR comparing RCTs with case‐control studies was 1.11 (95% CI 0.91 to 1.35), with minor heterogeneity (I² = 24%). There was no significant difference between observational study design subgroups (P value = 0.61).

We also performed a subgroup analysis of all reviews stratified by levels of heterogeneity of the pooled RORs from the respective reviews (Analysis 1.2). No significant differences in point estimates across heterogeneity subgroups were noted (see Figure 5). Specifically, comparing RCTs with observational studies in the low heterogeneity subgroup yielded a pooled ROR of 1.00 (95% CI 0.72 to 1.39). The pooled ROR comparing RCTs with observational studies in the moderate heterogeneity group was also not significantly different (ROR = 1.11; 95% CI 0.95 to 1.30). Similarly, the pooled ROR comparing RCTs with observational studies in the significant heterogeneity group was 1.08 (95% CI 0.87 to 1.34).


Figure 5. Forest plot of comparison: 1 RCT vs Observational, outcome: 1.3 Pooled Ratio of Odds Ratios‐‐Heterogeneity Subgroups.

Additionally, we performed a subgroup analysis of all included reviews stratified by whether or not they compared pharmacological studies (Analysis 1.3). Though the pooled ROR for comparisons of pharmacological studies was higher than the pooled ROR for reviews of non‐pharmacological studies, this difference was not significant (P value = 0.34; see Figure 6). Namely, the pooled ROR comparing RCTs with observational studies in the pharmacological studies subgroup of six reviews was 1.17 (95% CI 0.95 to 1.43), with substantial heterogeneity (I² = 81%). The pooled ROR comparing RCTs with observational studies in the non‐pharmacological studies subgroup of 11 reviews was 1.03 (95% CI 0.87 to 1.21), with substantial heterogeneity (I² = 74%).


Figure 6. Forest plot of comparison: 1 RCT vs Observational, outcome: 1.4 Pooled Ratio of Odds Ratios‐‐Pharmacological Studies Subgroups.

Lastly, we performed an analysis of all included reviews that compared RCTs and observational studies that employed propensity score adjustments (Analysis 1.4). The pooled ROR comparing estimates from RCTs with estimates from observational studies using propensity scores showed no significant difference. Namely, the pooled ROR comparing RCTs with observational studies with propensity scores (two reviews) was 0.98 (95% CI 0.85 to 1.12), with no heterogeneity (I² = 0%). There was no difference between the pooled ROR of RCTs versus observational studies with propensity score adjustment and the pooled ROR of RCTs versus observational studies without propensity score adjustment (P value = 0.22).

Discussion

Summary of main results

Our results showed that, on average, there is little difference between the results obtained from RCTs and observational studies. In addition, despite several subgroup analyses, no significant differences between effects of study designs were noted. However, due to high statistical heterogeneity, there may be important differences between subgroups of reviews that we were unable to identify. Our primary quantitative analysis showed that the pooled ROR comparing effects from RCTs with effects from observational studies was 1.08 (95% CI 0.96 to 1.22). The pooled ROR for effects from RCTs compared with cohort studies only was 1.04 (95% CI 0.89 to 1.21), while the pooled ROR comparing RCTs with only case‐control studies was 1.11 (95% CI 0.91 to 1.35).

Though not significant, the point estimates suggest that observational studies may have smaller effects than those obtained in RCTs, regardless of observational study design. Furthermore, it is possible that the difference between effects obtained from RCTs and observational studies has been somewhat attenuated in more recent years due to researchers' improved understanding of how to handle adjustments in observational studies. In the present study, it was not always clear from the included reviews which observational studies included adjusted estimates and which did not. Bhandari et al reported that no observational study adjusted for all nine confounders the authors felt were important (Bhandari 2004). In fact, they adjusted for as few as two and as many as six. Müeller et al reported that of the 136 non‐RCTs included in their review, 19 population‐based studies and 22 other studies adjusted their results for baseline imbalances (Müeller 2010). Two reviews included only observational studies with propensity score adjustments (Kuss 2011; Lonjon 2013). Other included reviews note the importance of adjustment in the estimates from observational studies, but do not specifically list the studies with and without adjusted estimates. Our results suggest that although observational designs may be more biased than RCTs, this does not consistently result in larger or smaller intervention effects.

We also found that the effect estimate differences between observational studies and RCTs were potentially influenced by the heterogeneity within meta‐analyses. Though subgroup analyses comparing heterogeneity groups were not statistically significant, meta‐analyses comparing RCTs and observational studies may be particularly influenced by heterogeneity and researchers should consider this when designing such comparisons. However, with so few reviews, spurious effects between heterogeneity subgroups cannot be ruled out.

The risks of bias in the included reviews were generally high. In particular, two‐thirds of all included reviews either did not include a complete sample or there was not enough information provided to make a determination, and more than half of the reviews did not assess the risk of bias of their included studies. Furthermore, nearly three‐quarters of the included reviews were judged to be at high or unclear risk of reporting bias.

We note that our results may be influenced by the different comparison arms in all the studies included in the reviews. Often the specific types of comparison arms in the meta‐analyses were not identified in the review. However, among included reviews with reported details about comparison arms in the RCTs in the meta‐analyses (n = 519 meta‐analyses), 84% (n = 454) compared one intervention (e.g., drug or surgery) with another intervention (drug or surgery), 11% (n = 55) used a placebo or sham, 3% (n = 13) used an unspecified control arm, and 2% (n = 15) compared one intervention with no intervention or treatment.

Lastly, though not statistically significant, there appears to be a difference in effect comparing RCTs and observational studies when considering studies with pharmacological‐only interventions or studies without pharmacological interventions. More specifically, the difference in point estimates between pharmacological RCTs and observational pharmacological studies is greater than the difference in point estimates from non‐pharmacological studies. Perhaps this is a reflection of the difficulties in removing all potential confounding in observational pharmacological studies; or, perhaps this is an artifact of industry or selective reporting bias in pharmacological RCTs. The most recent study quantifying pharmaceutical industry support for drug trials found that the pharmaceutical industry funded 58% of drug trials in 2007 and this was the largest source of funding for these trials (Dorsey 2010). This is not surprising as RCTs must be submitted to regulatory agencies to obtain regulatory approval of drugs, whereas observational studies of drugs are conducted after drug approval. Funding and selective reporting bias have been well documented in industry‐sponsored RCTs (Lundh 2012) and less is known about the extent of these biases in observational studies.

Potential biases in the review process

We reduced the likelihood for bias in our review process by having no language limits for our search and having two review authors independently screen abstracts and articles for selection. Nevertheless, we acknowledge the potential for introduction of unknown bias in our methods as we collected a myriad of data from 14 reviews (1583 meta‐analyses covering 228 unique outcomes).

Agreements and disagreements with other studies or reviews

Our results across all reviews (pooled ROR 1.08; 95% CI 0.96 to 1.22) are very similar to results reported by Concato 2000 and Golder 2011. As such, we have reached similar conclusions‐‐there is little evidence for significant effect estimate differences between observational studies and RCTs, regardless of specific observational study design, heterogeneity, or inclusion of drug studies.

Golder 2011 (and consequently Papanikolauo 2006) and Edwards 2012 were the only reviews that focused on harm outcomes. Golder's findings do not support the notion that observational studies are more likely to detect harm than randomized controlled trials, as no differences between RCTs and observational studies were detected. However, this finding may be related to the short‐term nature of the adverse events studied, where one would expect shorter‐term trials to be as likely to detect harm as longer‐term observational studies.


Comparison 1. RCT vs Observational

| Outcome or subgroup title | No. of studies | No. of participants | Statistical method | Effect size |
| --- | --- | --- | --- | --- |
| 1 Summary Ratios of Ratios: RCTs vs Observational Studies | 14 | | Odds Ratio (Random, 95% CI) | Subtotals only |
| 1.1 RCT vs All Observational | 14 | | Odds Ratio (Random, 95% CI) | 1.08 [0.96, 1.22] |
| 1.2 RCT vs Cohort | 9 | | Odds Ratio (Random, 95% CI) | 1.04 [0.89, 1.21] |
| 1.3 RCT vs Case Control | 3 | | Odds Ratio (Random, 95% CI) | 1.11 [0.91, 1.35] |
| 2 Summary Ratios of Ratios: RCTs vs Observational Studies (Heterogeneity Subgroups) | 14 | | Odds Ratio (Random, 95% CI) | Subtotals only |
| 2.1 Low Heterogeneity (I²: 0% to 30%) | 4 | | Odds Ratio (Random, 95% CI) | 1.00 [0.72, 1.39] |
| 2.2 Moderate Heterogeneity (I²: 31% to 60%) | 8 | | Odds Ratio (Random, 95% CI) | 1.11 [0.95, 1.30] |
| 2.3 Significant Heterogeneity (I²: 61% to 100%) | 2 | | Odds Ratio (Random, 95% CI) | 1.08 [0.87, 1.34] |
| 3 Summary Ratios of Ratios: RCTs vs Observational Studies (Pharmacological Studies vs non‐Pharmacological Studies) | 13 | | Odds Ratio (Random, 95% CI) | Subtotals only |
| 3.1 Pharmacological Studies | 6 | | Odds Ratio (Random, 95% CI) | 1.17 [0.95, 1.43] |
| 3.2 Non‐Pharmacological Studies | 11 | | Odds Ratio (Random, 95% CI) | 1.03 [0.87, 1.21] |
| 4 Summary Ratios of Ratios: RCTs vs Observational Studies (Propensity Scores) | 14 | | Odds Ratio (Random, 95% CI) | Subtotals only |
| 4.1 RCTs vs Observational Studies (propensity score adjustment) | 2 | | Odds Ratio (Random, 95% CI) | 0.98 [0.85, 1.12] |
| 4.2 RCTs vs Observational Studies (no propensity score adjustment) | 12 | | Odds Ratio (Random, 95% CI) | 1.10 [0.96, 1.27] |