Wednesday, March 22, 2017

Your quality measure is no longer useful!

I came across an interesting concept the other day called "Goodhart's Law," which states, "When a measure becomes a target, it ceases to be a good measure."  The economist Charles Goodhart first described the concept in 1975, stating more technically that "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."  While originally applied to monetary policy, Goodhart's Law has since been applied in a number of different contexts.  Let me give a few examples.

The "quality" of an academic journal today depends upon its so-called Impact Factor, which is essentially the number of citations received in a given year by articles the journal published during the previous two years, divided by the total number of articles the journal published during those two years.  In other words, more citations for articles published in the journal lead to a higher Impact Factor, which increases the journal's credibility and importance.  Investigators who publish their scientific articles in journals with high Impact Factors are more successful at getting promoted and have a greater chance of receiving grant funding.  A journal's Impact Factor has therefore become a quality target.  Given the importance that academic institutions and grant funding agencies place on journal Impact Factors, journals with higher Impact Factors often receive a greater number of high-quality manuscript submissions from prestigious investigators.  Publishers therefore are motivated to do what they can to increase their journal's Impact Factor.  I have heard of some journals that encourage (in some cases, require) investigators to cite other articles from the journal when they submit a manuscript.  Investigators often cite their own articles, or encourage their colleagues to cite them, to increase their citation counts.  The journal Impact Factor, which was originally designed to serve as a measure of the quality of a journal's publications, has become a target (the publisher's goal is to increase the journal's Impact Factor).  Once the quality measure becomes a target, investigators and the journal itself can try to "game the system" to "help" increase the journal's Impact Factor.  As a result, the Impact Factor as a measure of quality has become useless and irrelevant.
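To make the arithmetic concrete, here is a rough sketch in Python of how encouraged citations move the ratio. It assumes the usual two-year Impact Factor window, and every number in it (article counts, citation counts, the number of manuscripts nudged to cite the journal) is made up purely for illustration and does not describe any real journal.

```python
def impact_factor(citations: int, articles: int) -> float:
    """Citations received this year to articles from the prior two years,
    divided by the number of articles published in those two years."""
    if articles <= 0:
        raise ValueError("articles must be positive")
    return citations / articles


# Hypothetical journal: 200 articles published over the two-year window.
articles = 200

# Hypothetical "organic" citations those articles received this year.
organic_citations = 400
baseline = impact_factor(organic_citations, articles)

# Hypothetical gaming: 150 submitted manuscripts are each asked to cite
# 2 recent articles from the journal.
encouraged_citations = 150 * 2
inflated = impact_factor(organic_citations + encouraged_citations, articles)

print(f"baseline Impact Factor: {baseline:.2f}")
print(f"inflated Impact Factor: {inflated:.2f}")
```

In this toy case the ratio rises from 2.0 to 3.5 even though nothing about the articles themselves has changed, which is the distortion Goodhart's Law warns about.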

High school seniors who want to go to college are required to take either the SAT or ACT during the admissions process.  More prestigious colleges generally require higher SAT or ACT scores.  Similarly, colleges with higher average SAT or ACT scores for their admitted students are generally perceived by high school students as "better colleges" to apply to.  The original intent was to use the SAT or ACT score as a measure of the "quality" of the student.  The quality measure, however, has become a target.  There is an entire industry of ACT and SAT preparation classes, books, practice examinations, and private tutors that high school seniors use to increase their chances of getting a higher score.  In addition, colleges often use a method called "superscoring" to report the average SAT and ACT scores of their admitted freshman class.  "Superscoring" essentially means that the college uses the best section scores of the SAT or ACT (for example, the Verbal and Math scores of the SAT), even if the scores were from different testing dates (most high school seniors take the test multiple times to get their best score).  Again, due to gaming of the system, SAT and ACT scores have become essentially useless as a measure of quality.
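As a small illustration of superscoring, the Python sketch below takes the best score for each section across test dates and compares the total with the student's best single sitting. The section names and scores are hypothetical, and actual college policies may differ in their details.

```python
def superscore(attempts: list[dict[str, int]]) -> int:
    """Sum the best score for each section across all test dates."""
    if not attempts:
        raise ValueError("need at least one test attempt")
    sections = attempts[0].keys()
    return sum(max(attempt[section] for attempt in attempts)
               for section in sections)


# Hypothetical student who sat the test twice.
attempts = [
    {"verbal": 650, "math": 600},  # first test date
    {"verbal": 610, "math": 680},  # second test date
]

# Best total achieved on any single test date.
best_single_sitting = max(sum(attempt.values()) for attempt in attempts)

# Total built from the best section scores, regardless of date.
superscored_total = superscore(attempts)

print(f"best single sitting: {best_single_sitting}")
print(f"superscore:          {superscored_total}")
```

Here the superscore (1330) is higher than anything the student achieved on a single day (1290), so an average built from superscores will look better than the underlying test performance.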

As a last example, think of customer satisfaction scores in any of a number of different industries.  How many times have you been told by a retail salesperson or customer service representative that "We strive for perfect tens on our customer satisfaction surveys, and if I get all tens, I get a bonus!"?  Customer satisfaction scores were designed to be used as measures of the quality of customer service.  However, companies use higher customer satisfaction scores as targets or goals.  Again, once something has become a target, it ceases to be relevant as a measure of quality.

How often do we see examples of Goodhart's Law in health care?  Several years ago, the rate of ventilator-associated pneumonia (VAP) was used by a number of organizations as a measure of the quality of care received in the Intensive Care Unit (ICU).  Insurance organizations started lowering reimbursement to (i.e. penalizing) hospitals with higher than average rates of VAP, while increasing bonus payments to hospitals with lower than average rates of VAP in their ICUs.  Once the VAP rate became a target, hospitals were encouraged to work on lowering the rate of VAP in the ICU through quality improvement techniques and safety initiatives.  I know of some ICU Medical Directors who would "argue" to try to downgrade the classification of a VAP to something that didn't quite meet the definition of a VAP.  In my own institution, we observed a significant decrease (near-elimination) of VAP with a concomitant increase in the incidence of ventilator-associated tracheobronchitis, or VAT (widely considered a precursor condition of VAP - these patients do not meet all of the defining criteria for VAP).  I suspect that the work directed at VAP created conditions in which patients who previously would have developed a VAP now developed tracheobronchitis (notably the risk factors and pathogens were exactly the same between the two conditions).  However, I can't say with 100% confidence that the way we classified cases as either VAP or VAT (the defining criteria are notoriously poor and highly subjective) was always right either.  Our quality metric became a performance goal, and from that moment ceased to be relevant as a marker of quality.  Incidentally, several organizations have dropped VAP as a quality measure altogether.

Goodhart's Law states simply that once a quality measure becomes a performance goal, especially if there are incentives for reaching that performance goal (money, prestige, or promotion), the measure is no longer relevant as a quality measure.  There are a number of regulatory agencies and government organizations that would like to use market forces to improve health care delivery.  As an example, so-called "pay for performance" (P4P) programs are being tested to determine whether monetary incentives can lead to improvement in specific performance metrics.  Charles Goodhart would tell us to be very careful here.  Perhaps quality measures should just be left alone and used for their original intended purpose - as a measure of quality and not as a performance goal or objective.

5 comments:

  1. One of my favorite posts of yours so far! I'm going to share a couple questions, and if you have any thoughts on the topic and time to share, I'd be very appreciative to hear what you think.

    Can effective use of small collections of measures avoid this pitfall of myopic "one key metric" focus? I know the IHI's model for improvement uses the concept of grouping outcome, process, and balance measures into an improvement initiative. However, I'm not personally experienced in whether that does mitigate the problem and how they're actually used to achieve focus with balance.

    Is it possible that not all quality metrics are made equal (i.e. some are more predisposed to this issue) or does it seem to always come down to the broader context of the org/people/culture/etc to avoid the issue? Put another way, have you experienced use of a quality metric that at least anecdotally seemed robust against this problem regardless of the broader org context/conditions? Intuitively, it seems like some quality measures you shared seemed more predisposed to the problem, so I'm curious about your experiences and any outliers that come to mind.

    Thanks!

    1. Daniel-

      These are two very good points that you raise above. I will address your second point first. I agree with you that not all quality metrics are created equal; however, I think the issue that is relevant to the discussion above is that whenever we tie incentives (monetary or otherwise) to a quality measure, we encourage the kinds of behaviors (even completely unintentional and entirely subconscious ones) that support Goodhart's Law (i.e. gaming of the system and at times, outright cheating). Incidentally, a social scientist named Donald Campbell came up with something very similar, now known as "Campbell's Law," which states, "The more any quantitative social indicator is used for social decisionmaking, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." In my opinion, extrinsic motivation (doing something for any kind of incentive, be it monetary or otherwise) crowds out intrinsic motivation (doing good for its own sake or for the mere fact that it makes us feel better about ourselves). Several studies, particularly in the P4P ("pay for performance") literature, would support my point. The example here is that the so-called "Big Hairy Audacious Goal" (the classically cited example is President Kennedy's challenge to land a man on the moon) works because it is tied to intrinsic motivation and not extrinsic motivation. The men and women working in our space program during the 1960s were motivated by the prospect of being the first country to send a man to the moon (one could argue that there is some extrinsic motivation here), and not because they were being paid extra to do so or because it increased their rankings in some magazine.

      As to your first point, I do think that breaking up large goals ("BHAGs") into smaller goals is the right way to go - whether it mitigates the issues raised by Goodhart and Campbell, I do not know for sure. I do know, and the literature on goal-setting would support me, that if an organization's strategic goals and priorities ("BHAG" or otherwise) are not consistent with the organization's mission, vision, and core values, there is a disconnect ("cognitive dissonance") that leads to goal failure.

      Great points - even better to hear from you!

      Derek

    2. Much appreciated. The seeming importance of measures' ties to incentives is a great point I wasn't quite grasping as you intended. President Kennedy's challenge is an excellent example.

      FWIW - I found that Ashlee Vance's biography of Elon Musk, particularly the parts about SpaceX, perhaps provides a modern-day example of strategy aligning with mission to create huge intrinsic motivators for organizations.

      Thanks for sharing some thoughts!

    3. I came across a very interesting blog post from Michael Roberto:

      http://michael-roberto.blogspot.com/

      The post was entitled, "What's the best incentive compensation strategy for salespeople?" I do think it is relevant to the discussion above, even though it discusses a field study performed at a Swedish electronics retailer. Basically, a group of researchers at the Harvard Business School compared two compensation strategies for salespeople - compensation based on achieving a monthly quota versus a daily quota. Sales productivity increased by almost 5% when they switched from a monthly quota to a daily quota. More impressively, sales increased by over 18% in the salespeople with the worst performance at baseline. The researchers suggested that breaking the quota into smaller chunks allowed the salespeople with the worst baseline performance to start over each day - if they met their quota, great! If they failed to meet their daily quota, they could start all over the next day.

      I think this study is relevant to the suggestion above that breaking BHAGs into smaller, perhaps more readily achievable goals is more likely to be successful. Thank you again for your feedback and for your comments!
