Researchers from the Georgia Institute of Technology have proposed a simple defense against harmful content generation by large language models (LLMs): have an LLM filter its own responses. Their results suggest that even a model that has not been fine-tuned for alignment with human values can be prevented from presenting harmful content to users if its output is first validated by a language model.
LLMs have been shown to generate harmful content in response to user prompts. Efforts to mitigate this risk have focused on methods such as aligning models with human values through reinforcement learning. However, even aligned language models remain susceptible to adversarial attacks that bypass their restrictions on generating harmful text. This is where the newly proposed method comes in.
The approach feeds the output of a model into an independent LLM, which judges whether or not the content is harmful. Because the validator sees only the content generated by the user-prompted LLM, and not the prompt itself, it is potentially harder for an adversarial prompt to influence the validation model.
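The filtering flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the `LLM` type, the function names `is_harmful` and `guarded_generate`, the wording of the validation question, and the refusal message are all assumptions made for the example. Each model is represented as a plain function from a prompt string to a completion string, so any real API client could be wrapped to fit.

```python
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

# Hypothetical refusal message; the article does not specify one.
REFUSAL = "Sorry, I can't help with that."


def is_harmful(text: str, validator: LLM) -> bool:
    """Ask the validator model whether `text` is harmful.

    The question wording is an assumption, not taken from the paper.
    """
    question = (
        "Does the following text contain harmful content? "
        "Answer with 'yes' or 'no'.\n\n" + text
    )
    answer = validator(question).strip().lower()
    # Treat any answer that begins with "yes" as a harmful verdict.
    return answer.startswith("yes")


def guarded_generate(user_prompt: str, generator: LLM, validator: LLM) -> str:
    """Generate a response, then release it only if the validator clears it."""
    response = generator(user_prompt)
    # Only the generated response is passed to the validator,
    # never the user's prompt.
    if is_harmful(response, validator):
        return REFUSAL
    return response
```

Because `guarded_generate` hands the validator only the generated text, an adversarial suffix in the user's prompt reaches the validator only to the extent that the generator repeats it in its output.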
First, the researchers conducted preliminary experiments to test the ability of the approach to detect harmful LLM-generated content. They randomly sampled 20 harmful prompts and 20 harmless prompts and used an uncensored variant of the Vicuna model to produce a response to each. The researchers manually verified that the LLM-generated responses were indeed relevant to the prompts, meaning that harmful prompts produced harmful content and harmless prompts produced harmless content.
They then instantiated their harm filter using several widely used large language models: GPT 3.5, Bard, Claude, and Llama-2 7B. They presented the Vicuna-generated content to each of the LLM harm filters, which then produced a “yes” or “no” response. These responses were treated as classifier outputs and used to compute various quantitative evaluation metrics.
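To make the evaluation step concrete, the sketch below shows one way "yes"/"no" answers could be scored as classifier outputs. The function name `evaluate`, the choice of metrics (accuracy, true positive rate, false positive rate), and the toy data are illustrative assumptions; the article does not list which metrics the paper reports, and the numbers here are not the study's results.

```python
from typing import Dict, List


def evaluate(labels: List[bool], answers: List[str]) -> Dict[str, float]:
    """Score yes/no filter answers against ground-truth harm labels.

    labels[i] is True if response i is actually harmful.
    answers[i] is the filter's raw text answer for response i.
    """
    # Any answer beginning with "yes" counts as a "harmful" prediction.
    preds = [a.strip().lower().startswith("yes") for a in answers]

    tp = sum(p and y for p, y in zip(preds, labels))          # harmful, flagged
    tn = sum(not p and not y for p, y in zip(preds, labels))  # harmless, passed
    fp = sum(p and not y for p, y in zip(preds, labels))      # harmless, flagged
    fn = sum(not p and y for p, y in zip(preds, labels))      # harmful, passed

    return {
        "accuracy": (tp + tn) / len(labels) if labels else 0.0,
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy data for illustration only (not from the paper).
toy_labels = [True, True, True, False, False]
toy_answers = ["yes", "Yes.", "no", "no", "No"]
metrics = evaluate(toy_labels, toy_answers)
```

In this toy run the filter misses one of three harmful responses and flags no harmless ones, so four of the five verdicts are correct.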
According to the experimental results, Claude, Bard, and GPT 3.5 performed similarly well at identifying and flagging harmful content, reaching 97.5%, 95%, and 97.5% accuracy respectively. Llama 2 had the lowest performance on the sampled data, with an accuracy of 80.9%. According to the paper, this approach has the potential to offer strong robustness against attacks on LLMs.