Red Teaming Large Language Models

Red Teaming Large Language Models for Healthcare

Workshop at Machine Learning for Healthcare (MLHC), 2024

August 15, 2024, 1:00PM — 5:00PM

Room 1190, Bahen Centre for Information Technology, University of Toronto, Toronto, Ontario


Register Here



Workshop Background

We employ the question bank from Feffer et al. (2024) to help guide our thinking for the red-teaming exercise. These guiding questions, and our responses, are provided below.

Pre-Activity:

What is the artifact under evaluation through the proposed red-teaming activity?
  • What version of the model (including fine-tuning details) is to be evaluated?
Our workshop evaluates a collection of closed-source (GPT-4o) and open-source (Phi3-14B, Llama3.1-8B) large language models; a minimal sketch of how these models might be queried follows this question block.
  • What safety and security guardrails are already in place for this artifact?
Each model under evaluation has undergone safety-oriented fine-tuning intended to align its outputs with the developer's usage guidelines and to reduce harmful responses.
  • At what stage of the AI lifecycle will the evaluation be conducted?
Models are evaluated after public release but prior to widespread clinical deployment.
  • If the model has already been released, specify the conditions of release.
The conditions of release for each model can be found here: GPT-4o, Phi3-14B, Llama3.1-8B.
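For participants who wish to experiment before the workshop, the sketch below shows one way the models above might be queried. It is a minimal illustration under stated assumptions, not the workshop's official harness: it assumes the `openai` and `transformers` Python packages are installed, an OpenAI API key is available in the environment, and the Hugging Face model identifier shown (which may require access approval) is correct for Llama3.1-8B.

```python
# Minimal sketch (not the official workshop harness) of querying one closed-source
# and one open-source model under evaluation with the same clinical prompt.
from openai import OpenAI
from transformers import pipeline

CLINICAL_PROMPT = "Summarize the discharge instructions for a patient started on warfarin."

# Closed-source model (GPT-4o) via the OpenAI API; reads OPENAI_API_KEY from the environment.
client = OpenAI()
closed_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": CLINICAL_PROMPT}],
)
print(closed_response.choices[0].message.content)

# Open-source model (Llama3.1-8B) via Hugging Face transformers; the model identifier
# below is an assumption and may require accepting the model's license on the Hub.
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")
open_response = generator(CLINICAL_PROMPT, max_new_tokens=256)
print(open_response[0]["generated_text"])
```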
What is the threat model the red-teaming activity probes?
  • Is the activity meant to illustrate a handful of possible vulnerabilities? (e.g., spelling errors in prompt leading to unpredictable model behavior)
No; the activity will be a more open-ended test of a broad range of possible vulnerabilities drawn from the clinical experience of the participants.
  • Is the activity meant to identify a broad range of potential vulnerabilities? (e.g., biased behavior)
Yes.
  • Is the activity meant to assess the risk of a specific vulnerability? (e.g., divulging recipe for explosives)
No.
What is the specific vulnerability the red-teaming activity aims to find?
  • How was this vulnerability identified as the target of this evaluation?
Target vulnerabilities will be identified in the first stage of the workshop, in which clinician team members will share their clinical workflows and decision-making tasks, and each team will brainstorm use cases for LLMs within those workflows. Having identified applications of LLMs in the clinical workflow, teams will then brainstorm possible vulnerabilities specific to the defined workflow (e.g., inaccurate information retrieval).
  • Why was the above vulnerability prioritized over other potential vulnerabilities?
We prioritize vulnerabilities associated with the workflows specific to the participants in our workshop as those individuals will be best positioned to evaluate the harm associated with the responses provided by the LLMs.
  • What is the threshold of acceptable risk for finding this vulnerability?
This is dependent on the specific clinical workflow being studied.
What are the criteria for assessing the success of the red-teaming activity?
  • What are the benchmarks of comparison for success?
The significance of an individual vulnerability will be gauged by the extent to which (a) it surfaces in a realistic clinical use case[1] (as adjudicated by clinician workshop participants), and (b) it may plausibly lead to downstream clinical harm. The success of the workshop as a whole will be gauged by the number and severity of the vulnerabilities uncovered.
  • Can the activity be reconstructed or reproduced?
Workshop materials will be publicly released so that our workshop can be reproduced; however, due to the time-varying nature of some of the closed-source language models that we evaluate (GPT-4o), it is uncertain whether future replications of our workshop will identify similar vulnerabilities.
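One way to make replication attempts easier to interpret, sketched below rather than promised as part of the workshop, is to pin a dated model snapshot where the provider exposes one and to log every prompt/response pair. The snapshot name, file path, and helper function here are illustrative assumptions; the snapshot actually queried should be recorded at the time of the exercise.

```python
# Sketch: pin a dated model snapshot and log prompt/response pairs so that a later
# replication can document exactly what was queried, and when.
import json
from datetime import datetime, timezone
from openai import OpenAI

MODEL_SNAPSHOT = "gpt-4o-2024-05-13"  # illustrative snapshot name; record the version actually used
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_and_log(prompt: str, log_path: str = "red_team_log.jsonl") -> str:
    """Send a prompt to the pinned snapshot and append the exchange to a JSONL log."""
    response = client.chat.completions.create(
        model=MODEL_SNAPSHOT,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling variability between runs
    )
    answer = response.choices[0].message.content
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": MODEL_SNAPSHOT,
        "prompt": prompt,
        "response": answer,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return answer
```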
Team composition: who will be part of the red team?
  • What were the criteria for inclusion/exclusion of members, and why?
Workshop members were selected on the basis of interest and domain expertise; participants in the 2024 Machine Learning for Healthcare (MLHC) Conference — typically comprising those with clinical expertise, machine learning expertise, or both — were invited to sign up for a pre-conference workshop that would include our red-teaming activity.
  • How diverse/homogeneous is the team across relevant demographic characteristics?
Each team will be composed of both healthcare practitioners and computer scientists. Subject matter expertise is the most relevant demographic characteristic for this workshop.
  • How many internal versus external members belong to the team?
Each team is composed entirely of external members, in the sense that (to our knowledge) none of our participants are affiliated with the groups developing the models being red-teamed.
  • What is the distribution of subject-matter expertise among members?
A range of clinicians from different clinical sub-specialties will be present at our workshop. Similarly, a range of computer scientists with different machine learning specializations will take part.
  • What are possible biases or blindspots the current team composition may exhibit?
It is possible that our participants do not comprehensively represent the distribution of clinical sub-specialties in practice. To appropriately acknowledge limitations of our exercise, we will gather information from workshop registrants on their respective clinical specialties.
  • What incentives/disincentives do participants have to contribute to the activity?
Workshop participants are those who have proactively elected to sign up for the activity. We hypothesize that our participants are driven largely by a desire to improve clinical processes and by interest in the subject domain. After the workshop, we intend to acknowledge all participants in our summary report; this acknowledgment may serve as an extrinsic incentive to contribute to the activity.




[1] Unlike a standard red-teaming exercise designed to circumvent model safeguards under the assumption of a malicious user (e.g., one who wishes to generate racist or sexist content), it is safe to assume that clinical users of LLMs wish to do right by their patients. Therefore, prompts in our workshop will reflect realistic clinical use cases rather than those of a malicious user.
