Medical News & Perspectives
AI in Medicine
November 15, 2023

Clinical AI Tools Must Be Fed the Right Data, Stanford Health Care's Chief Data Scientist Says

JAMA. 2023;330(22):2137-2139. doi:10.1001/jama.2023.19297

This conversation is part of a series of interviews in which JAMA Editor in Chief Kirsten Bibbins-Domingo, PhD, MD, MAS, and expert guests explore issues surrounding the rapidly evolving intersection of artificial intelligence (AI) and medicine.

Just as the human body needs a healthful diet to function properly, so does AI. The technology requires a diet of appropriate content to operate effectively and with as little bias as possible. Feeding AI models data that they will learn from also requires ensuring security of the source clinical information and patient privacy—and a willingness from patients to share their deidentified data for the greater good.

“I would argue that we all want the learning health system…and the core tenet of that is to learn from the prior experience of similar patients. But if everybody in the country keeps their data private and doesn’t share it with anybody, how do we get there?” says Nigam Shah, MBBS, PhD (Video), professor of medicine at Stanford University and chief data scientist at Stanford Health Care. Shah has written about how AI can advance the understanding of disease as well as improve clinical practice and the delivery of health care.

Video. AI and Clinical Practice—the Learning Health System and AI

In a recent conversation with JAMA Editor in Chief Kirsten Bibbins-Domingo, PhD, MD, MAS, Shah discussed data security, patient privacy, the key elements of evaluating AI in health care, and other critical issues as this new technology enters medical practice. The following is an edited version of that conversation.

Dr Bibbins-Domingo: You wrote a recent Special Communication in JAMA that people have described as a wonderful primer on large language models in health care. One of the striking things about the piece is that you issued a call to action to the clinical community that we shouldn’t sit on the sidelines but rather get involved in this new technology. Why was this a message that you wanted to send to our readers and listeners at JAMA?

Dr Shah: ChatGPT, the web application that has taken the world by storm, is powered by 2 of these language models. When anything is adopted at such a massive scale, it’s not a question of if it’ll affect our lives, it’s a question of how. When it comes to technology, I think doctors in general tend to be a little bit conservative because we’re dealing with caring for human lives. But in this situation, I think we can’t afford that conservatism. We have to be a little more proactive in shaping how these things enter the world of medicine and health care. That was a primary motivation to write about it.

Dr Bibbins-Domingo: You wrote that the medical profession stood more on the sidelines during the development of information technology (IT) systems in health care. In the US, the EHR [electronic health record] is now pervasive in clinical practice, but it really is a technology that is not designed for the things we need it to do in the care of patients. It’s designed for other purposes, and many people have written about this and have talked about how much EHR work contributes to physician burnout and clinician dissatisfaction.

You point out in your article that rather than us just taking the off-the-shelf large language models and seeing how well they do what we need them to do in health care, we should think about what we need in health care and train our large language models to do the tasks that we need. Tell me a little bit more about what the difference is.

Dr Shah: EHRs started out predominantly as a billing solution. Then part of the EHR was about managing complex devices like MRI [magnetic resonance imaging] machines and bedside monitors and so on. And then came the bookkeeping task of whether we provided the right care, and it all got merged together into one beast that we refer to as the EHR. And it’s one of some 1300 IT systems that any large health care system has to manage.

In some sense, it grew by what a computer scientist would call “big ball of mud” programming; things just got layered on top of each other. I don’t think we can afford that with something that’s new and moving as fast as the large language models are.

The core message is that computer scientists, engineers, and tech companies are training these things using content from the internet. If we really want these things to work, we better train them on MEDLINE, on our textbooks, on UpToDate, ClinicalKey, whatever trusted sources—maybe guidelines from professional societies—so that the output coming from these things can be trusted.

Dr Bibbins-Domingo: You make 2 big points in your piece—that how we train these models and how we evaluate them should both be compatible with our goals for health care. If the default is to use the whole universe of the internet, it makes sense that we’d want to train these models with medical content that we trust. But when it comes to patient data, we’ve also heard people talk about training models on what we’ve done in the past in health care, not all of which has been great and which can build in biases that we’ve had and may want to change in the way we practice medicine. How do we avoid those types of things?

Dr Shah: These things are going to learn what we feed them, and sometimes the patterns in the data are not the ones that we would like to believe are true. So I like to break it down into 2 parts. One is the creation of the model itself, and to the extent possible, we feed it content—imagine it as a diet for the model—that keeps it as unbiased as possible. It’s not perfect; there are patterns of care that are just hardwired in. And then second is the policies that govern what happens when a model produces a certain output. We can be intentional about those policies, and for areas where we know that our care practices are not ideal, we say, “We will not trust the model output,” and we intentionally create the diet that we want to feed to these models.

Take a simple example of producing a patient instruction after discharge or after an office visit. Historically it’s like 5 pages in English that a lot of people, including doctors, might have trouble following—“What exactly was I told?” Now if we train the model to produce exactly that, we help nobody. But we can be intentional and say, “Produce a 1-page version. Produce it at the 10th-grade reading level. Produce it in the language of my choice,” and then we put it in an evaluation loop to say, “When that was produced, did it help? Did the patient follow the instruction better? Did they read it completely more often than not?” And then we use these feedback loops to steer us in directions that work.
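As a purely illustrative sketch (not from the article or any specific system), the constraints Shah describes could be stated explicitly and checked automatically: a prompt that names the target length, reading level, and language, and a crude readability estimate standing in for one arm of the evaluation loop. The function names below are invented for illustration, and a real evaluation would also track downstream outcomes, not just text metrics.

import re

def build_instruction_prompt(visit_summary: str, language: str = "English") -> str:
    # Hypothetical prompt that makes the intended constraints explicit up front.
    return (
        "Rewrite the following discharge instructions as a one-page summary "
        f"at roughly a 10th-grade reading level, in {language}. "
        "Keep all medication names and doses unchanged.\n\n"
        f"{visit_summary}"
    )

def estimate_grade_level(text: str) -> float:
    # Rough Flesch-Kincaid grade estimate using vowel-group syllable counting;
    # a stand-in for the automated half of the evaluation loop.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text) or ["x"]
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w))) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

if __name__ == "__main__":
    draft = "Take the medicine twice a day. Call the clinic if the swelling gets worse."
    print(build_instruction_prompt(draft))
    print(f"Estimated grade level: {estimate_grade_level(draft):.1f}")

In this sketch, the prompt encodes the intent and the metric flags drafts that miss the reading-level target; whether patients actually read and follow the instructions would still have to be measured in practice.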

Dr Bibbins-Domingo: What do we do for more complex situations where it’s biased, where there are groups of patients who have not been referred routinely for certain types of care, not because their clinical context would have warranted this but because clinicians were just not referring? How can we learn and understand and avoid those types of biases?

Dr Shah: Those, unfortunately, I don’t think can be avoided if we’re just learning from the data. I would flip it around and say that kind of learning makes those biases obvious, which makes us uncomfortable, and I would use that as a motivation to do something about them. I’m a huge fan of the work that routinely holds a mirror up to us; we don’t like what we see, and hopefully that will create the momentum to do something about those problems. But at the same time, I would like to believe that those kinds of issues are, at least, the minority—10%, 20%, not 80%. We should not throw out the 80% out of fear of getting some things wrong. So we absolutely need guardrails to prevent that, but I don’t think fear should be the driving factor in evaluating other uses of that technology.

Dr Bibbins-Domingo: One of the other things that I’ve learned in the course of having these conversations, especially those focused on using patient data to train these models, is the issue of the privacy of these data. Because you work within a large health system, how do you talk about the goal that a system like Stanford or other health systems have to protect patient data when it is patient data that is used to help these models learn? And, as I think another person I’ve interviewed said, these models can sometimes remember from one to the other. So how do we put guardrails in place to protect patient privacy?

Dr Shah: I think we often mix up privacy and security. Patient data definitely need to be kept secure in the sense that patients don’t want their medical record out on the internet. Privacy is a slightly more nuanced notion, like a patient saying, “Yes, I do want to make sure that people who are not authorized don’t find out about the medical conditions or the care I’m seeking. But at the same time, if my doctor needs to talk about my situation with 3 other clinicians, I don’t want that prohibited.”

So if there are 500 other people like me, I would want my doctor to learn from their experience. That is why we go to a doctor: because implicitly, we are relying on the neural network between their 2 ears to have remembered the patterns of care that were delivered to previous people like me and what happened to them. In that case, we’re explicitly asking for my data, my record, to be used by the clinician to care for another human being, and I want the same. If we use that as the guiding principle, then aggregate use of deidentified information should be our moral duty.

Now again, this is a very unusual position, but I would argue that we all want the learning health system. We all want personalization in our care. And the core tenet of that is to learn from the prior experience of similar patients. But if everybody in the country keeps their data private and doesn’t share it with anybody, how do we get there?

Dr Bibbins-Domingo: You would argue, then, that for patients it’s a moral duty to society. But would you argue, even for the individual patient, that contributing their data to the training of these technologies that are going to help us in the future helps them, or that it helps other patients like them? How would you extend that?

Dr Shah: I would obviously want it to help my care, but enough data and enough learning have to accumulate before that happens. So there’s a little bit of sharing on faith that has to happen, and I’ll use a simple analogy. If we want a spellchecker to work perfectly, a lot of us have to share our documents and the edits that got them to the right phrasing in order for that to work. Autocomplete in Google Docs and web searches works because millions of other people’s completions have been analyzed. No human reads them.

So when we come to privacy, I think what we do not want is disclosure of confidential information to people who we do not want it disclosed to. But I don’t view privacy as the argument to not learn from aggregate data, because otherwise how do we get to learning health systems?

Dr Bibbins-Domingo: You’ve thought a lot about this, and in the context of Stanford Health Care, what types of things on the horizon are you excited about? Are there areas in medicine where you see AI technologies transforming care more quickly? What are you most excited by?

Dr Shah: Right now, the thing I’m most excited about is that a lot of people care. Really. Has there ever been a time when, for one single piece of technology, everybody from the CEO [chief executive officer] to the chief medical officer to the chief nursing officer and the pharmacist cared? Never before has that happened. So that’s super exciting. But at the same time, I think the hype is a little bit out of hand. And sometimes it’s like we’re riding the hype curve so fast, we might reach escape velocity and never come back.

I think what we have to do is be intentional about what the goal is here. We have to crisply articulate the desired outcome and then verify whether we’re getting it. If we want these language models to answer patient queries instead of a human answering them, what’s the goal? Is it to reduce the physician burden or is it to make sure the patient gets the right answer? We could accomplish a reduction in physician burden at the expense of giving random answers to patients, so the goal has to be articulated clearly up front. A lot of these pilots that I see happening today are trying something and seeing what happens, as opposed to running an intentional experiment.

Dr Bibbins-Domingo: You write about that in your piece, and you talk about the ways to evaluate whether this technology is actually useful to us. JAMA recently put out a call for papers on AI in medicine. What types of studies would you like to see, or think would be most useful, for giving us assurance that something is working toward a goal that we have?

Dr Shah: Imagine a triangle where one of the vertices is building the model. There are many things we can do to build a proper model in terms of the data diet that is fed to it—have we instruction-tuned it, and so on. Another vertex is what our presumed benefit is and how we are going to verify it. And the third is deploying it at scale in the health care system. That’s typically overlooked, but scale and deployment matter because we also want to make sure the use of these technologies doesn’t drive up the cost of care.

But the experiment that I would like to see is something that spans this whole triangle: we create a model where we have preidentified the benefit that we desire, we have a verification mechanism, and we verify that via a broad deployment in the health care system so we know we can sustain it. And then we complete the loop by swapping in a different model. Maybe we replace GPT-4 with something that Google provides or Amazon provides or something we build in-house. I love the example of Civica Rx, a company that health systems came together to build so that they would have more control over their generic drug supply. Why can we not do that for language models or other kinds of technology? Then we don’t compete on who uses the better language model; we compete on who provides better care.

Dr Bibbins-Domingo: I like the triangle analogy because it also reminds me that a lot of the hype around these new technologies makes you believe that we would throw out everything we’ve done before. But what you’re describing is our standard way of evaluating whether something new works. We want to see whether it works on its own. Does it address a need, a prespecified need? And can you assess that need in the context of its actual deployment in the real world, doing what you want it to do? That seems to me the same approach that we have for clinical research writ large, across drugs or devices.

You’ve given a call to action for physicians and clinicians in a health care setting, but we are not computer scientists. What should we be doing to try to stay abreast of things and how could we get involved?

Dr Shah: There’s often talk about AI augmenting humans. And it makes for a nice story that we’re not going to replace the human; we’re going to augment the human. But we have to be careful in defining the anatomy of the augmentation. Because if I’m getting assistance from something and I have to check its output, that’s a cognitive burden on me. And if we don’t set up that loop properly, we might increase the burden on whomever we’re trying to support—the physician, the nurse, the pharmacist, the physiotherapist. Augmenting humans is a great soundbite, but how you augment them is important. And the analogy I would use there is that our phones try to augment us by notifying us. And every app on our phone has the right to ding when it pleases. That’s not augmentation, that’s distraction.

When we get to augmentation, we’ve got to know exactly how we’re doing it, how we divide the work. And then we have to pay attention to whatever it is we’re doing. Is it leading to an efficiency gain or is it going to lead to an actual productivity gain?

That is crucial because often we think we’ll make the doctor’s life easier. Things that used to take 40 minutes are now going to take 10. Or a slide that took a pathologist 20 minutes to read will now be done in 14. And it’s great, you save 6 minutes. How are you going to turn those saved 6 minutes into seeing more slides or serving more patients at large? We often confuse efficiency gains with productivity gains. Efficiency is necessary, because we’ve created quite inefficient systems that ask doctors to chart at 10 pm. Let’s say there is somebody who’s working between 8 pm and 10 pm, and now there’s this magical AI and the 8-to-10 work is gone. Granted, that’s great for the physician and for the health system, but there’s no benefit to the patient. We’re going to see the exact same number of patients. It’s an efficiency gain and not a productivity gain. So we have to be really careful in how we prioritize, design, and evaluate AI augmentation of humans in the workflow.

Dr Bibbins-Domingo: As you talk about productivity, something I think about is automation. Sometimes automation is good; sometimes it’s not so good. How do you think about what we can expect with AI in terms of automation?

Dr Shah: Even if we’ve figured out efficiency vs productivity, we have to double-click on productivity. And I’ll start with an analogy that’s very well known in public health. There’s a story about a person fishing who sees someone drowning, so they jump in and save the drowning person. But then they see another person drowning and jump in again. It happens again and again and again. And then they’re so exhausted, it’s like, “What is happening?” So they walk upstream, and there’s a broken bridge from which people are falling.

Now, when we automate a task, we have to ask whether the task being automated addresses the root cause of the problem. In that public health analogy, we could automate by creating a robot to save drowning people. And that would be a fine automation, but a completely useless one.

When we come to administrative burden in health care, we have to ask: are we automating something that is sensible, such as writing a discharge summary or an end-of-shift summary? Or are we automating something that should not have existed in the first place, where a misguided policy is creating that work burden? If we automate that, we’re just going to do the bad thing faster. So there’s an automation trap that is a little bit separate from efficiency vs productivity, and in our pursuit to enhance productivity, we might find ourselves developing the equivalent of a robot to save drowning people instead of fixing the bridge.

Article Information

Published Online: November 15, 2023. doi:10.1001/jama.2023.19297

Conflict of Interest Disclosures: Dr Shah reported being a cofounder of Prealize Health (a predictive analytics company) and Atropos Health (an on-demand evidence generation company), as well as serving on the medical advisory board of Curai Health.

Note: Source references are available through embedded hyperlinks in the article text online.
