Red Teaming Large Language Models for Healthcare
Workshop at Machine Learning for Healthcare (MLHC), 2024
August 15, 2024, 1:00PM — 5:00PM
Room 1190, Bahen Centre for Information Technology, University of Toronto, Toronto, Ontario
Register Here
Workshop Background
We employ the question bank from Feffer et al. (2024) to help guide our thinking for the red-teaming exercise. These guiding questions, and our responses, are provided in the table below.Pre-Activity:
What is the artifact under evaluation through the proposed red-teaming activity? | |
| Our workshop evaluates a collection of closed-source (GPT-4o) and open-source (Phi3-14B, Llama3.1-8B) large language models. |
| Each model under evaluation has undergone fine-tuning with the intent of |
| Models are evaluated after public release but prior to widespread clinical deployment. |
| The conditions of release for each model can be found here: GPT-4o, Phi3-14B, Llama3.1-8B. |
What is the threat model the red-teaming activity probes? | |
| No; the activity will be a more open-ended test of a broad range of possible vulnerabilities drawn from the clinical experience of the participants. |
| Yes. |
| No. |
What is the specific vulnerability the red-teaming activity aims to find? | |
| Target vulnerabilities will be identified in the first stage of the workshop wherein clinician team members will share their clinical workflows and decision-making tasks with their teams, and will brainstorm with their teams use cases for LLMs within these workflows. Having identified applications of LLMs in the clinical workflow, teams will then brainstorm possible vulnerabilities specific to the defined workflow (e.g., inaccurate information retrieval). |
| We prioritize vulnerabilities associated with the workflows specific to the participants in our workshop as those individuals will be best positioned to evaluate the harm associated with the responses provided by the LLMs. |
| This is dependent on the specific clinical workflow being studied. |
What are the criteria for assessing the success of the red-teaming activity? | |
| The success of an individual vulnerability will be gauged by the extent to which (a) it surfaces in a realistic clinical use case[1] (as adjudicated by clinician workshop participants), and (b) may plausibly lead to downstream clinical harm. The success of the workshop will be gauged based on the number and severity of vulnerabilities that are uncovered. |
| Workshop materials will be publicly released so that our workshop can be reproduced; however, due to the time-varying nature of some of the closed-source language models that we evaluate (GPT-4o), it is uncertain whether future replications of our workshop will identify similar vulnerabilities. |
Team composition and who will be part of the red team? | |
| Workshop members were selected on the basis of interest and domain expertise; participants in the 2024 Machine Learning for Healthcare (MLHC) Conference — typically comprising those with clinical expertise, machine learning expertise, or both — were invited to sign up for a pre-conference workshop that would include our red-teaming activity. |
| Each team will be comprised of both healthcare practitioners and computer scientists. Subject matter expertise is the most relevant demographic characteristic for this workshop. |
| Each team is comprised only of external members, in the sense that (to our knowledge) none of our participants are affiliated with the groups developing the models being red-teamed. |
| A range of clinicians from different clinicial sub-specialties will be present at our workshop. Similarly, a range of computer scientists with different machine learning specializations will take part. |
| It is possible that our participants do not comprehensively represent the distribution of clinical sub-specialties in practice. To appropriately acknowledge limitations of our exercise, we will gather information from workshop registrants on their respective clinical specialties. |
| Workshop participants are those who have proactively elected to sign up for the activity. We hypothesize that our participants are largely driven by a desire to improve clinical processes and out of a sense of interest in the subject domain. After the workshop, we intend to acknowledge all participants in our summary report; this accredidation may be a source of extrinsic incentive to contribute to the activity. |
[1] It is a safe assumption that, unlike a standard red teaming exercise designed to circumvent model safeguards under the assumption of a malicious user (e.g., one who wishes to generate racist/sexist content), clinical users of LLMs desire to do right by their patient. Therefore, prompts in our workshop will reflect realistic clinical use cases as opposed to those from a malignant user.