➢ Significance reporting in clinical trials can be misleading because of factors such as overpowering, choosing clinically insignificant outcomes, and inappropriate subgroup analysis.
➢ Clinical significance generally refers to a change that is important to a patient, although clinical significance can also be reported through the viewpoints of health-care providers or health-care policy-makers.
➢ Clinical significance can be measured in a number of ways, including minimally important difference, responder analysis, patient acceptable symptom state, and substantial clinical benefit.
➢ Orthopaedic cost-effectiveness and value analyses should include a measure of clinical significance to ensure relevance to patients.
The primary goal of clinical orthopaedic research is to evaluate the safety and efficacy of orthopaedic interventions, with a focus on outcomes that are important to patients. On March 23, 2010, the Patient Protection and Affordable Care Act (PPACA) established the Patient-Centered Outcomes Research Institute, emphasizing the importance of patient-centered care1. Patient-centered outcomes research helps to answer the following questions:
(1) “Given my personal characteristics, conditions and preferences, what should I expect will happen to me?”
(2) “What are my options and what are the potential benefits and harms of those options?”
(3) “What can I do to improve the outcomes that are most important to me?”
(4) “How can clinicians and the care delivery systems they work in help me make the best decisions about my health and healthcare?”2
Patient-centered care has created a trend toward complementing clinician-based measures with patient-reported outcome measures3. Both disease-specific patient-reported outcome measures and generic health-related quality-of-life measures enhance the measurement of value of orthopaedic care.
Orthopaedic clinical outcomes research is aimed at determining the best treatment for individual patients. To ensure that an observed treatment effect is not due to chance alone, measures of significance were developed and have become the hallmark of research reporting4. Statistical significance expresses the likelihood that a difference in results is not due to chance and has conventionally been set at a 95% confidence level (p < 0.05). A p value of <0.05 means that, if the two treatments were in fact equivalent, a difference at least as large as the one observed would be expected by chance in fewer than one of every twenty repetitions of the experiment; a 95% confidence interval (CI) that excludes zero difference supports the same conclusion. It is also important to assess the clinical or practical relevance of the effect.
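This interpretation of the significance level can be demonstrated with a simple simulation. The sketch below (plain Python, with hypothetical trial parameters: 200 subjects per arm and a 30% event rate in both arms) repeatedly runs a two-proportion z-test on trials in which the two treatments are truly identical; roughly one in twenty still reaches p < 0.05:

```python
import random
from math import erf, sqrt

def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p value from a pooled two-proportion z-test
    (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    # Convert |z| to a two-sided p value via the normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

random.seed(1)
n, trials = 200, 4000
false_positives = 0
for _ in range(trials):
    # Both arms have the same true 30% event rate: no real difference.
    x1 = sum(random.random() < 0.30 for _ in range(n))
    x2 = sum(random.random() < 0.30 for _ in range(n))
    if two_proportion_p(x1, n, x2, n) < 0.05:
        false_positives += 1
rate = false_positives / trials
print(rate)  # close to 0.05
```

The observed false-positive rate hovers near the nominal 5%, which is exactly what the significance threshold promises when no true difference exists.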
The term clinical significance was first described in behavioral medicine addressing psychotherapy5 and generally refers to a change that is important to a patient. Clinical significance most appropriately describes “patient-important” outcomes rather than clinically measured outcomes or effect sizes (e.g., length of stay) that historically have been of interest only to surgeons and/or administrators6. The present review discusses clinical significance, the assessment of various measures of clinical significance, the relationship between outcomes and clinical significance, and the application of clinical significance in economic evaluation of orthopaedic care delivery.
Potential Pitfalls of Improperly Applied Significance
Statistical overpowering is the enrollment of more subjects than are needed to answer the clinical question, which increases the likelihood of finding a statistically significant (but not necessarily clinically significant) difference between two treatments7. Although overpowered clinical studies are less common than underpowered ones, any difference between two treatments will become significant if the sample size of a clinical trial is increased sufficiently. For example, outcome rates of 10% and 8% will reach significance if there are ≥3209 subjects in each study arm, and outcome rates of 10% and 6% will be significant if there are ≥720 subjects in each arm. An orthopaedically relevant example of overpowering was demonstrated in a study assessing intra-articular hyaluronic acid injections8. The two study arms in that trial, active treatment and placebo, consisted of 124 and 129 patients, respectively. The study demonstrated a notably small, but significant, difference of 0.14 on the visual analog scale (VAS) pain score (p = 0.047). Hence, a clinical trial can be designed to find a statistically significant difference even if the difference is not clinically significant.
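The per-arm sample sizes quoted above can be reproduced with the standard normal-approximation formula for comparing two proportions (α = 0.05, 80% power assumed; small differences from the cited figures reflect rounding and formula variants):

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Per-arm sample size for detecting a difference between two
    proportions, using the standard normal-approximation formula."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(n_per_arm(0.10, 0.08))  # 3213 -- close to the cited 3209
print(n_per_arm(0.10, 0.06))  # 721  -- close to the cited 720
```

The quadratic dependence on the difference (p1 − p2) in the denominator is why halving the detectable difference roughly quadruples the required enrollment, and why a sufficiently large trial can make any difference, however trivial, statistically significant.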
Another way to obtain statistical significance without clinical significance is by selecting a clinically unimportant outcome measure to power a clinical trial. Occasionally, a surrogate or substitute measure of outcome is used; such measures could include laboratory, physiological, physical, or radiographic findings. The appropriateness of a surrogate depends on how strongly (or not) it is linked to patient-important outcomes9. Brown performed a pooled analysis of randomized controlled studies comparing different methods of prophylaxis against venous thromboembolism in which asymptomatic, venographically positive deep-vein thromboses were used as the primary outcome measure10. Although the individual clinical trials demonstrated significant differences in terms of this primary outcome measure, the pooled analysis of clinically significant outcomes (symptomatic deep-vein thromboses, pulmonary emboli, fatal pulmonary emboli, and operative site bleeding complications) demonstrated that low-molecular-weight heparin, warfarin, and pentasaccharides were associated with rates of venous thromboembolic events similar to those associated with aspirin. Moreover, aspirin was associated with clinically significantly lower rates of operative site bleeding than the more aggressive anticoagulants.
Randomized clinical trials are powered for the primary outcome. Because of the expense of such trials, it is uncommon to have sufficient power to perform subgroup analyses. However, in the face of a negative primary outcome, it is common to perform post hoc subgroup analyses to salvage the clinical trial by finding a statistically significant positive outcome11. The level of significance is the likelihood of a single positive result occurring due to chance; such a finding is also known as a false-positive result or a type-I error. The same level of significance (α = 0.05) cannot be used for multiple subgroup comparisons without increasing the risk of false-positive results12. In addition to the statistical considerations of subgroup analyses, any significant differences must be assessed with regard to their clinical significance.
Differences Between Clinical and Statistical Significance
In practice, the need to differentiate between clinical and statistical significance arises frequently in published orthopaedic clinical research. The literature includes many examples that emphasize the importance of clinical significance when reporting outcomes, whether they are patient-centered or clinically measured. de Verteuil et al., in a study comparing minimal-incision and standard approaches for total hip arthroplasty, used estimated blood loss and operative time as primary outcomes13. The reported differences in estimated blood loss and operative time were 57.7 mL and 3.70 minutes, respectively, both of which were statistically significant (p < 0.01). Similarly, Vavken et al., in a study evaluating anterior total hip arthroplasty, reported an estimated blood loss difference of 52 mL (p < 0.001)14. While these primary outcomes were statistically significant, their clinical relevance was minimal, if any, because very small differences in blood loss or operative time are unlikely to be important to patients. The outcomes that matter most to patients are much more difficult to measure and require instruments based on patient symptoms or perceptions. We may intuitively understand what it means for knee motion to improve by 5° as opposed to 50°, but what does it mean for a pain score to improve by 2 points or a functional outcome to improve by 4 points? This difference is termed the change score, and clinical significance helps us interpret the change score15.
Chan et al., in a study evaluating single-injection and continuous peripheral nerve blocks as well as patient-controlled analgesia following total knee arthroplasty, reported a significant reduction in the VAS pain score when continuous nerve block was compared with a single injection (p < 0.05)16. However, the difference in the VAS score was 0.57. Understanding the VAS score requires knowledge of the specific instrument, and 0.57 has questionable clinical importance.
Minimally Important Difference
One method of addressing clinical importance is the minimally important difference, defined as “the smallest difference in score in the outcome of interest that informed patients or informed proxies perceive as important, either beneficial or harmful, and which would lead the patient or clinician to consider a change in the management.”17 The term represents an evolution from the earlier terms minimum clinically important improvement and minimum clinically important difference. This evolution is conceptually appealing because if a single minimally important difference can be determined, a study needs only to define whether or not a patient cohort met or exceeded the minimally important difference. The word clinically was dropped from the term because it may represent a determination of outcome by someone other than the patient.
Two concerns are often expressed about the minimally important difference. First, does a change score mean the same thing throughout an entire scale? Second, can individual change scores be combined into population change scores18,19? Both of these questions are really asking whether the scales are mathematically linear. If a scale is linear, a change from 1 to 4 on a 10-point scale means the same as a change from 6 to 9, and, like range of motion, the numbers can be averaged. In general, it is not correct to assume linearity for most functional outcome scores, whether formulated on a continuous scale (e.g., a VAS) or on an ordinal scale. In practice, however, many scales behave sufficiently linearly within similar patient groups to be treated as linear. Nonlinearity also arises because of regression to the mean. Because patients come to physicians when their symptoms are most bothersome, and because symptoms usually wax and wane, doing nothing or even something ineffective will often be associated with an improvement of symptoms. Because symptoms or perceptions tend to go back to a baseline (regression to the mean), changes in the middle of a scale are usually less impressive than those at the extremes, regardless of treatment, and patients who are worse before intervention are likely to show more change than those who start closer to the middle. This means that for most patient-reported outcome measures, there is probably not a single minimally important difference across all diseases and patient cohorts. Researchers who incorporate minimally important differences into their methods or evidence users who incorporate minimally important differences into their judgments should be careful to select a minimally important difference that was derived from a similar cohort of patients who underwent a similar intervention.
Studies comparing functional outcome change scores between patients starting at similar points are likely to have more comparable changes than those starting at different levels.
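Regression to the mean can be demonstrated with a simple simulation. In the hypothetical sketch below, each patient has a stable underlying score plus day-to-day measurement noise; enrolling only the patients who look worst at baseline produces an apparent mean "improvement" at follow-up even though no treatment is given (all parameters are illustrative assumptions):

```python
import random

random.seed(2)

def observed(true_score):
    """A single noisy measurement of a stable underlying score
    (hypothetical day-to-day noise, SD = 8 points)."""
    return true_score + random.gauss(0, 8)

# Hypothetical population: stable true scores ~ Normal(50, 10).
true_scores = [random.gauss(50, 10) for _ in range(2000)]

# Enroll only patients who look worst (observed score < 40) at baseline.
enrolled = []
for t in true_scores:
    o = observed(t)
    if o < 40:
        enrolled.append((t, o))

# Remeasure the enrolled patients later, with no treatment at all.
baseline = [o for _, o in enrolled]
followup = [observed(t) for t, _ in enrolled]

mean = lambda xs: sum(xs) / len(xs)
improvement = mean(followup) - mean(baseline)
print(round(improvement, 1))  # a positive, purely artifactual "gain"
```

The untreated cohort "improves" by several points simply because patients selected at their worst drift back toward their stable baseline, which is why change scores in cohorts enrolled at symptom peaks must be interpreted cautiously.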
In patient cohorts with negligible mean changes in outcomes, it is likely that the distributions will show individual patients for whom the improvement or decline is clinically relevant. The distribution occurs because of both individual random measurement variation and real differences between individual patients. Highly reliable patient-reported outcome measures have small measurement errors that are similar to most other typical clinical measurements20. Inferences at the group level may be informative with comparative effectiveness studies of treatments or decisions regarding public health policy. Interpretation of outcomes at the individual patient level is most important for patient-specific clinical treatment decisions. Because of the likely difference between individual and overall group responses, the meaningful differences for a patient may be different from those for the group as a whole21.
Several approaches for determining minimally important difference in orthopaedic clinical research have been described21. The anchor-based approach compares clinically relevant measures, such as clinical measurements or patient-reported outcome measures, with use of either cross-sectional or longitudinal methods22. Cross-sectional methodology involves comparing groups that are different in terms of some disease-related criterion at a single point in time. With the patient-question format, a change in status is determined through the use of questions such as “Are you a lot worse, worse, a little worse, the same, a little better, or much better?” Patients who are “a little worse” or “a little better” determine the value of the minimally important difference. The shortfall of cross-sectional studies is that the groups are likely to have confounding differences between them in other unmeasured variables. The longitudinal approach measures global ratings of change that represent measurable clinical or patient-reported parameters both before and after treatment23. This approach is more commonly seen in surgical populations because it lends itself to a “before and after” approach to measuring outcome.
A general criticism of anchor-based methodology is that global ratings of change rely on patient recall with its potential bias, which can be problematic, especially over long periods of time24. Another concern regarding global ratings is the reliability and validity of their use in research25. Global ratings associated with health-related quality-of-life instruments may only account for some of the variance in scores and may not include the measurement precision of the tool26,27. The potential for a nonlinear relationship between the anchor and health-related quality of life could also cloud interpretation of the results28.
Distribution-based methodologies for determining clinical significance use the statistical properties of the sample populations to describe relevance5,21. Distribution-based measures utilizing significance evaluate change in relation to the probability that this change occurred by random variation. These measures are based on standard errors of a sampling distribution and are therefore influenced by the sample size. The characteristic feature of these measures is that, all else being equal, the index increases as a function of the sample size. The paired t-statistic and growth curve analysis are examples based on significance29,30. A second type of distribution-based measure evaluates change in relation to sample variation. These approaches include baseline variation of the sample (effect size), variation of change scores (standardized response mean), and variation of change scores in a stable group (responsiveness statistic)19,31,32. In contrast to measures based on significance, these indices are independent of sample size. Variation is expressed as an average variation (per subject) about a mean value. The third type of distribution-based measure is based on the measurement precision of the instrument30. These measures, which include the standard error of measurement (SEM) and the responsiveness statistic, evaluate change in relation to variation of the measurement instrument as opposed to variation of the sample.
A benefit of distribution-based measures of minimally important difference is that these methods can evaluate change beyond some level of random variation. Distribution-based methods are also relatively constant across sample populations. Because relatively few benchmarks for establishing clinical significance have been described, appropriate categorization of results cannot be rigorously defined. However, without the ability to compare these distribution-based results with some clinical measure of importance, these methodologies do not provide an output that is alone clinically meaningful.
Several studies have addressed the correspondence between anchor-based and distribution-based methodologies for measuring clinically meaningful differences and have demonstrated that an effect size of approximately 0.50 (from a range of 0.05 to 0.75) corresponds closely with both methods of global rating33,34. Cohen recommended using 0.2 as a small effect size, 0.5 as a moderate effect size, and 0.80 as a large effect size35. Kolotkin et al. demonstrated that different conclusions could occur dependent on methodology at the extremes of the ranges of effect36. Clinically significant change measurement was comparable between anchor-based and distribution-based methods for patients with moderate levels of impairment but markedly different for those with mild and severe impairment. Generally, changes of one-half standard deviation correspond closely with minimally important differences obtained by means of either anchor or distribution methods. Yost et al. recommended using both methods to triangulate on the minimally important difference for specific populations, and, when necessary, using a Delphi approach to hone the answer37. The Delphi method utilizes multiple-round expert consensus, with the results of previous rounds informing the experts and with the experts being allowed to revise their previous votes37.
The minimally important difference has been reported for a few clinical outcome scales in orthopaedics. Parker et al. reported a minimally important difference of 14.9 for the Oswestry Disability Index (ODI) with use of a minimum detectable change approach for degenerative lumbar spondylolisthesis fusion surgery38. Carreon et al. reported a minimally important difference of 11.8 for the ODI with use of a minimum detectable change method for primary and revision lumbar fusion procedures39. Dick and Brown reported a minimally important difference of 10.2 for the ODI with use of a distribution method40,41. Clement et al. reported a minimally important difference of 5.0 for the Oxford Knee Score (OKS) with use of an anchor method42. Brown reported a minimally important difference of 4.5 for the OKS with use of a distribution method43. van Kampen et al. reported minimally important differences of 12.4 and 13.4 for the Disabilities of the Arm, Shoulder and Hand (DASH) and QuickDASH when assessing outcomes of shoulder interventions44-46.
When evaluating the minimally important difference, it is also important to note potential confounding factors. First, is it relevant to utilize the patient’s baseline status or level of impairment to evaluate clinically significant change19,47? In other words, should patients with greater impairment or worse baseline status require more change to achieve a clinically significant result? Part of the answer may be explained by regression to the mean; failure to account for it in the baseline analysis can lead to erroneous conclusions regarding clinical significance. Another issue associated with the minimally important difference is the application of health-related quality-of-life measure changes to all conditions, patients, or circumstances. Differences in tools (disease-specific versus generic measurement tools), patient populations (obese versus non-obese), and condition (total knee arthroplasty versus no total knee arthroplasty) may impact the results of the measured changes, but the magnitude of that impact is not known in every situation.
Patient Acceptable Symptom State
In some cases, patients simply wish to return to normal status, whereas in other cases, particularly those involving painful or risky procedures, they wish to be substantially better postoperatively. Tubach et al. described the patient acceptable symptom state (PASS) as the value beyond which patients consider their symptoms acceptable48. The acceptable symptom or health state may be less than, equal to, or even greater than the minimally important change from a patient’s current state. The PASS is strictly a patient-reported measure. While measures of minimally important difference measure a patient getting better (or worse), PASS measures whether a patient is well. Unlike some of the other measures of clinical significance, the PASS needs to be addressed with duration of time spent in that health state. For example, would this symptom state continue to be acceptable for a week, a month, a year, etc.? It is also unknown whether the PASS is treatment-specific. Patient expectations of treatment are unknown but could affect the PASS if different for each treatment. Paulsen et al. reported on the PASS scores for patients undergoing hip arthroplasty who were evaluated with the Hip Dysfunction and Osteoarthritis Outcome Score (HOOS)49. Like minimally important difference anchor-based methods, the PASS can be used as an alternative method for anchoring outcome measurement tools.
Responder Analysis
Another method of measuring clinical significance is a responder analysis. The Food and Drug Administration (FDA) has issued the following guidance: “There may be situations where it is more reasonable to characterize the meaningfulness of an individual’s response to treatment than a group’s response, and there may be interest in characterizing an individual patient as a responder to treatment, based upon pre-specified criteria backed by empirically derived evidence supporting the responder definition as a measure benefit.”50 This represents the individual patient-reported outcome measure score change over a predetermined time period that should be interpreted as a treatment benefit. Examples include categorizing a patient as a responder on the basis of a prespecified change from baseline, a change in score of a certain size or greater (e.g., a 0.5-point change on a 10-point VAS), or a percentage change from baseline. Arden et al. used responder analysis to evaluate the efficacy of intra-articular hyaluronic acid in patients with knee osteoarthritis51.
A disadvantage of responder analysis is reduced power relative to an analysis on the original scale52,53. This means that the sample size required to detect small differences would be significantly greater54. Another consideration when using responder analysis is the dichotomous nature of a response (i.e., “yes” or “no”). If the a priori definition of a response in weight loss is 10.2 lb (4.63 kg), does that imply that all patients with weight loss of 10.1 lb (4.58 kg) are not responders? As with the previous examples, responder analysis is subject to other sources of error, including regression to the mean, measurement imprecision, the natural history of the disease, and confounding treatment factors. Responder analysis is also poorly suited to comparing two effective treatments: if both treatments result in improvement but one results in substantially more improvement than the other, that difference cannot be identified well with responder analysis. As such, current responder analysis techniques do not lend themselves to comparative effectiveness research.
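A responder analysis amounts to dichotomizing each patient's change score against a prespecified threshold. The sketch below uses the same hypothetical 0-to-100 scores as before and an assumed minimally important difference of 10 points (both are illustrative assumptions, not values from the cited studies):

```python
def responder_rate(baseline, followup, mid):
    """Fraction of patients whose individual change score meets or
    exceeds the prespecified minimally important difference (MID)."""
    changes = [f - b for b, f in zip(baseline, followup)]
    responders = sum(1 for c in changes if c >= mid)
    return responders / len(changes)

# Hypothetical pre- and post-treatment scores for eight patients.
baseline = [35, 42, 28, 50, 39, 31, 44, 37]
followup = [50, 44, 30, 62, 48, 35, 58, 41]

print(responder_rate(baseline, followup, mid=10))  # 0.375 (3 of 8)
```

Note that one patient in this hypothetical cohort improved by 9 points and is counted as a non-responder, illustrating the arbitrariness of the cutoff discussed above (the 10.1-lb versus 10.2-lb problem): the dichotomy discards information that a continuous analysis would retain.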
Substantial Clinical Benefit
For treatments such as major surgery that are associated with substantial risks and morbidity, it has been suggested that the definition of clinical success be greater than achieving a minimally important difference. A substantial clinical benefit is the minimum acceptable outcome of improved function, pain relief, and/or improved quality of life. Since no single clinically measurable outcome—for example, radiographic or laboratory data—can clearly reflect the outcome of an intervention, patient-reported measures are preferred. Substantial clinical benefit has been calculated with anchor-based methods, with patient responses such as “much better” or “mostly satisfied” being used to define improved outcomes. Carragee and Cheng administered standard questionnaires to patients and defined clinical success following lumbar spinal fusion as a pain rating of ≤3 of 10, an improvement of ≥20 points on the ODI, discontinuation of opioid medications, and return to some occupational activity55. Glassman et al. defined substantial clinical benefit following lumbar spinal fusion56, and Solberg et al. defined substantial clinical benefit following lumbar discectomy57. Michener et al. defined substantial clinical benefit following physical therapy for the treatment of shoulder impingement syndrome as a 40% improvement in the DASH score58. Limitations of substantial clinical benefit include the lack of understanding of the effect of baseline patient severity on outcome and the use of a single follow-up time period measurement.
Clinical Significance and Value
Value in health care has been defined as health outcomes achieved per dollar spent59. Employers, payers, and governmental agencies are interested in the value of care delivered in orthopaedic surgery60. This interest is in large part due to variation in measures within orthopaedics, the high volume of orthopaedic procedures performed per annum, and the lack of consensus regarding which outcomes should be measured and what constitutes a meaningful improvement in musculoskeletal outcomes61.
Cost-effectiveness analysis is a useful tool for assessing the value of an intervention by identifying the procedures that provide the greatest improvement in outcome at the lowest cost. Cost-utility analysis is a form of cost-effectiveness analysis that seeks to evaluate the economics of a health-care intervention by translating the benefits of those interventions into a quality-adjusted life-year (QALY) gained62,63. Patient-derived measures of health allow for comparison of the value of an intervention both within a specialty and across heterogeneous fields within medicine. As such, cost-utility analysis is the preferred methodology for cost-effectiveness analysis in medicine. The most commonly used patient-reported outcome measure for QALY calculations is the EuroQol-5D (EQ-5D)64. Multiple international joint replacement registries have adopted the EQ-5D (Swedish Hip Arthroplasty Register; Norwegian Arthroplasty Register; National Joint Registry for England, Wales and Northern Ireland)65-67. Cost-benefit analysis converts health outcomes into a monetary value, and both the costs and outcomes of a procedure can be compared in monetary terms. Health-care consumers (patients or payers) are asked how much they are willing to pay for the intervention or to achieve a certain health state.
To date, the evaluation of cost-effectiveness and its variants has been performed with use of continuous variable evaluation and significance measures. Incorporation of clinical significance in value measurement provides the opportunity to standardize the reporting of outcome measures and to define a so-called floor for outcomes that are incorporated into the value equation61. Parker and McGirt defined the minimum cost-effectiveness difference as “the smallest improvement in an outcome instrument that is associated with a cost-effective response to surgery.”68 The minimum cost-effectiveness difference was determined from receiver operating characteristic curve analysis with a cost-effectiveness anchor of <$50,000 per QALY69.
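The arithmetic behind a cost-utility comparison is a simple incremental ratio. The sketch below computes an incremental cost-effectiveness ratio (ICER) and checks it against the $50,000-per-QALY anchor mentioned above; the costs and QALY values are hypothetical:

```python
def icer(cost_new, cost_old, qaly_new, qaly_old):
    """Incremental cost-effectiveness ratio: additional dollars spent
    per additional quality-adjusted life-year (QALY) gained."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)

# Hypothetical comparison: an $18,000 procedure yielding 6.2 QALYs
# versus a $12,000 alternative yielding 6.0 QALYs.
ratio = icer(18_000, 12_000, 6.2, 6.0)
print(round(ratio))    # dollars per QALY gained
print(ratio < 50_000)  # within the common $50,000/QALY anchor
```

Because the denominator is a difference in utility-weighted outcomes, a clinically insignificant gain in the outcome instrument can still produce a seemingly acceptable ICER, which is precisely why the authors argue that a measure of clinical significance should gate entry into the value calculation.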
We agree that outcomes that have achieved a measure of clinical significance should be incorporated into value assessments. Investigators who have used patient-reported outcome measures in cost-effectiveness analysis may have had difficulty in identifying differences in outcomes, instead relying on cost differences as the primary source of variation70. The inherent problem with cost-benefit analysis is that most current methodologies, which lack measures of clinical significance, are confusing to patients. Patients, physicians, and payers may not have the ability to translate changes in utility scales to relevant improvement in health state.
In summary, the concept of clinical significance is ultimately of much greater importance to patients and society than the concept of statistical significance is. While the minimally important difference is critical, the details of identifying a minimally important difference remain challenging, continue to evolve, and depend on the patient-reported outcome measure instrument, the severity of the disease being measured, and the distribution of that severity. Assuming that a study has demonstrated statistical significance, the best methods for reporting clinical improvement for many disorders are a combination of the minimally important difference and responder analysis. Further development of these tools will be critical for determining appropriate value for orthopaedic care.
Source of Funding: No external funds were received in support of this report.
Investigation performed at Intermountain Healthcare, St. George, Utah
Disclosure: None of the authors received payments or services, either directly or indirectly (i.e., via his or her institution), from a third party in support of any aspect of this work. One or more of the authors, or his or her institution, has had a financial relationship, in the thirty-six months prior to submission of this work, with an entity in the biomedical arena that could be perceived to influence or have the potential to influence what is written in this work. No author has had any other relationships, or has engaged in any other activities, that could be perceived to influence or have the potential to influence what is written in this work. The complete Disclosures of Potential Conflicts of Interest submitted by authors are always provided with the online version of the article.
- Copyright © 2015 by The Journal of Bone and Joint Surgery, Incorporated