Introduction
Recent advancements in NLP have demonstrated the potential to positively impact society, with successful implementations in data-rich domains. LLMs have been utilized in various real-world scenarios, including search engines, language translation, and copywriting. However, these applications may not fully engage users due to a lack of interaction and communication. As natural language is a medium of communication used by all human interlocutors, conversational language model agents, such as Amazon Echo and Google Home, have the potential to significantly impact people’s daily lives. Despite their potential benefits, unforeseen negative effects on human-computer interaction have also emerged as NLP transitions from theory to reality. These include the toxic language generated by Microsoft’s Twitter bot Tay and the privacy breaches of Amazon Alexa. Additionally, during the unsupervised pre-training stage, language models may inadvertently learn bias and toxicity from large, noisy corpora, which can be difficult to mitigate.
While studies have concluded that LLMs can be used for social good in real-world applications, the vulnerabilities described above can be exploited unethically for unfair discrimination, automated misinformation, and illegitimate censorship. Consequently, numerous research efforts have been undertaken on the AI ethics of LLMs, ranging from discovering unethical behavior to mitigating bias. Weidinger et al. systematically structured the ethical risk landscape of LLMs, clearly identifying six risk areas: 1) Discrimination, Exclusion, and Toxicity, 2) Information Hazards, 3) Misinformation Harms, 4) Malicious Uses, 5) Human-Computer Interaction Harms, and 6) Automation, Access, and Environmental Harms. Although their discussion serves as the foundation for NLP ethics research, there is no indication that all hazards will occur in recent language model systems. Empirical evaluations have revealed that language models face ethical issues in several downstream tasks. For instance, using exploratory studies via model inference, adversarial robustness, and privacy, early research revealed that dialogue-focused language models posed possible ethical issues. Several recent studies have demonstrated that LLMs, such as GPT-3, have a persistent bias against genders and religions. Expectedly, LLMs may also encode toxicity, which results in ethical harms. For instance, Si et al. demonstrated that BlenderBot and TwitterBot can easily trigger toxic responses, though with low toxicity.
Despite current studies on NLP ethical risks and effects, the following gaps exist in earlier research:
- Practice: Many studies on AI ethics have been conducted theoretically and may not accurately reflect the real-world ethical risks.
- Timeliness: The rapid advancements in NLP have resulted in a lack of examination of more recent language models from an ethical perspective.
- Agreement: There is a lack of consensus among daily users regarding the ethical risks associated with current advanced language model applications.
- Comprehensiveness: Most studies have a narrow focus on the measurement of selected ethical issues and fail to address all ethical considerations comprehensively.
In this study, we aim to address these deficiencies by presenting a comprehensive qualitative exploration and catalog of ethical dilemmas and risks in ChatGPT, a recently launched practical language model from OpenAI. ChatGPT is not only one of the largest practical language models available publicly but also one of the few breakthroughs that have dominated social media. Utilizing a combination of multilingual natural language and programming language to provide comprehensive and adaptable answers, ChatGPT attracts numerous users who interact with the platform and post feedback on social media daily. We investigate the different feedback themes of ChatGPT on Twitter, the dominant social media network, by manually classifying a sample of 305,701 tweets addressing potential ethical risks and harms. We conduct a qualitative study on these manually labeled tweets to identify common themes in the public’s ethical concerns over ChatGPT. The themes can be divided into four categories: 1) Bias, 2) Reliability, 3) Robustness, and 4) Toxicity. In accordance with the principles espoused by HELM, we carefully select appropriate benchmarks to red-team ChatGPT. However, given the circumscribed nature of the chosen benchmarks, we supplement our assessment by conducting a thorough analysis of the model using prototypical case studies.
Our red-teaming has revealed several behaviors exhibited by ChatGPT that may have potential ethical implications, such as bias in programming, susceptibility to prompt injection, and the dissemination of misinformation through hallucination. To gain a deeper understanding of the differences between previous studies on AI ethics and the ethical implications identified in language models, we conducted a comprehensive benchmarking of ChatGPT using widely-utilized datasets for measuring ethical concerns and harms. The results of our evaluation indicate that some benchmarks fail to fully capture all the ethical implications. Furthermore, we have identified specific benchmarks that should be developed based on the findings from downstream tasks. Based on our empirical evaluation, we discuss ways to address potential ethical concerns and harms to ensure ethical applications of future LLMs.
Instead of contemplating hypothetical or distant uses of technology, we believe it is more crucial to address moral or ethical issues in current and future applications. Similar to Goldstein et al., we acknowledge that the field of AI ethics is still developing and iterative, necessitating ongoing conversations about definitions and the creation of ethical frameworks and principles. The objective of our study is not to provide a flawless, quantitative, and deterministic solution for designing a responsible language model application, echoing the history of scientific advancement. Through the use of benchmarking frameworks, heuristics, and examples, in conjunction with human evaluation, the goal of our work is to move closer to a comprehensive understanding. We hope that our findings will aid in supporting future work on determining and mitigating the AI ethical hazards in language models and their applications.
Common Themes of Ethical Concerns
This section outlines the two main application scenarios of ChatGPT and the four corresponding common ethical concerns. In order to establish a taxonomy based on data analysis, we conducted a comprehensive collection of 305,701 tweets pertaining to ChatGPT over the course of December 2022. We studied all the data [details of the manual labels and data analysis will be discussed in an upcoming technical report] and summarized common themes of these tweets on the basis of the previously established risk landscape associated with LLMs.
Application Scenarios
LLMs are powerful tools for understanding and generating natural language and potentially programming language. They have a wide range of applications with two main scenarios: Creative Generation and Decision-making.
Creative Generation Creative generation involves using language models to develop fresh and creative content, such as writing a story, composing poetry, or scripting film dialogue. This is achieved by training the model on a massive corpus of existing books, articles, and scripts. The model learns the patterns, structures, and styles of text, allowing it to generate similar content. This has several downstream applications, such as producing content for entertainment, marketing, advertising, and content summarization.
Decision-making The use of language models in decision-making is a significant application scenario in the field of machine learning. This refers to using these models to make informed decisions based on natural language input, as demonstrated in studies on sentiment analysis, text classification, and question answering. By analyzing and comprehending the meaning and context of the input, these models are able to provide judgments or suggestions based on their understanding of the information. The models can be used in natural language processing activities to comprehend, interpret, and generate human-like speech, which is a vital component of chatbots, virtual assistants, and language-based games.
Common Themes of Ethical Concerns
Bias Bias is a common ethical concern in language model development and deployment. There are multiple manifestations of bias, such as social stereotypes and unfair discrimination, exclusionary norms, and multilingualism.
Social stereotypes and unfair discrimination: When the data used to train a language model includes biased representations of specific groups of individuals, social stereotypes and unfair discrimination may result. This may cause the model to provide predictions that are unfair or discriminatory towards those groups. For example, a language technology that analyzes curricula vitae for recruitment or career guidance may be less likely to recommend historically discriminated groups to recruiters or more likely to offer lower-paying occupations to marginalized groups. To prevent this, it is essential to ensure that the training data is diverse and representative of the population for which it will be used, and to actively discover and eradicate any potential biases in the data.
Exclusionary norms: When a language model is trained on data that only represents a fraction of the population, such as one culture, exclusionary norms may emerge. This can result in the model being unable to comprehend or generate content for groups that are not represented in the training data, such as speakers of different languages or people from other cultures.
Multilingualism: The monolingual bias in multilingualism is a type of bias that can occur in language models. Often, language models are only trained on data in one language, preventing them from understanding or generating text in other languages. This can result in a lack of access to the benefits of these models for people who speak different languages and can lead to biased or unfair predictions about those groups. To overcome this, it is crucial to ensure that the training data contains a substantial proportion of diverse, high-quality corpora from various languages and cultures.
Robustness Another major ethical consideration in the design and implementation of language models is their robustness. Robustness refers to a model’s ability to maintain its performance when given input that is semantically or syntactically different from the input it was trained on.
Semantic Perturbation: Semantic perturbation is a type of input that can cause a language model to fail. This input has different syntax but is semantically similar to the input used for training the model. To address this, it is crucial to ensure that the training data is diverse and representative of the population it will be used for, and to actively identify and eliminate any potential biases in the data.
Data Leakage: Data leakage in language models can result in exposing the model to attacks where adversaries try to extract sensitive information from the model, jeopardizing individual privacy and organizational security. To mitigate these risks, it is essential to prevent data leakage by carefully selecting the training dataset, using techniques such as regularization and cross-validation to reduce overfitting, and implementing techniques like differential privacy and model distillation to protect the model from attacks. Furthermore, it is crucial to conduct thorough evaluations using a wide range of test data, monitor the model’s performance, and be transparent about the training data and any known biases in the model.
Prompt Injection: Prompt injection is a type of input that can lead to a failure in a language model, particularly a LLM. This input is data that is deliberately introduced into the model’s input with the intention of causing it to malfunction. To address this vulnerability, it is crucial to conduct exhaustive testing on a wide variety of inputs and ensure that the model can accurately recognize and reject inputs that are different from the semantic and syntactic patterns of the input it was trained on. Additionally, it is essential to establish robust monitoring methods to detect any malicious use of the model and implement necessary security measures to prevent such malicious intent. This includes, but is not limited to, testing, monitoring, and upgrading the models as needed to ensure optimal performance.
Reliability
The reliability of language models is a crucial ethical concern in their development and deployment. It pertains to the capability of the model to provide precise and dependable information.
False or Misleading Information: The dissemination of false or misleading information is a significant concern in the field of natural language processing, particularly when it comes to training language models. This unreliable information may result from using inaccurate or biased training data, which can lead to false or misleading outputs when the model is used by users. For example, if a language model is trained on data that contains misinformation about a certain topic, it may provide erroneous information to users when queried about that topic. To address this issue, it is crucial to exercise due diligence in ensuring the accuracy and impartiality of the training data, as well as actively identify and rectify any inaccuracies that may be present.
Outdated Information: Outdated information is another type of incorrect information that may occur when a language model is trained on obsolete or inaccurate data. This can result in the model providing users with outdated information, which is detrimental to decision-making and information-seeking activities. To prevent this, it is essential to keep the training data current and to continuously monitor and update the model as new data becomes available, so that the language model provides users with the most accurate and relevant information.
Toxicity
The ethical considerations related to toxicity in the development and deployment of language models are of utmost importance. Toxicity refers to the model’s ability to generate or understand harmful or offensive content.
Offensive Language: One form of toxicity that may arise is the presence of offensive language in the training data. This can result in the model generating or understanding offensive or harmful content when interacting with users. For instance, if a language model is trained on data that includes racist or sexist language, it may generate or understand racist or sexist content when interacting with users. To mitigate this, it is crucial to ensure that the training data does not contain any offensive or hurtful language, and to actively identify and remove any offensive or harmful information that may be present in the data.
Pornography: Another form of toxicity that may arise is the presence of pornographic content in the training data. This can lead to the model generating or understanding pornographic content when interacting with users. To mitigate this, it is crucial to guarantee that the training data is free of pornographic content and to actively identify and remove any pornographic content that may be present in the data. Additionally, it is essential to implement the necessary security measures to prevent improper use of the model.
Diagnosing AI ethics Of ChatGPT
The objective of this research is to evaluate ChatGPT with respect to four critical ethical considerations: Bias, Reliability, Robustness, and Toxicity. To achieve this, we use established benchmarks that are consistent with HELM. Our aim is to maximize the alignment between our chosen benchmarks and the scenarios and metrics under evaluation. To conserve computational resources, we evaluate a randomly selected 80% sample of each dataset. Unlike HELM, our evaluation of ChatGPT is conducted in a zero-shot setting, which more accurately reflects the typical human-computer interaction scenario where in-context examples are not provided. Additionally, to gain a comprehensive understanding of the model’s performance on these benchmarks, we present results from several state-of-the-art (SOTA) LLM baselines. As in HELM, the baselines are evaluated with five in-context ground-truth examples, which we choose from the remaining 20% of samples for each dataset. Although a few benchmarks have been developed for measuring AI ethics, many unethical scenarios have not yet been collected for evaluation. Hence, we preliminarily evaluate ChatGPT on some representative case studies. The analysis of these case studies further reveals the potential vulnerability of advanced LLM applications in real-world practice. We illustrate the evaluation framework in Figure fig-evaluation.
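To make the two prompting setups concrete, the sketch below shows how a zero-shot query and a 5-shot in-context query might be assembled. The template and helper names are illustrative assumptions rather than the exact prompts used in the evaluation.

```python
# A minimal sketch of the zero-shot vs. few-shot prompting setups
# (illustrative only; the actual prompt templates are assumptions).

def build_zero_shot_prompt(question: str) -> str:
    """Zero-shot: the question is sent as-is, with no in-context examples."""
    return question

def build_few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend k ground-truth (question, answer) pairs, drawn here
    from the held-out 20% split, then append the test question."""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\n\nQ: {question}\nA:"

# Usage with hypothetical in-context examples:
demos = [("Is the sky blue on a clear day?", "yes")] * 5
print(build_few_shot_prompt("Does water boil at 100 degrees Celsius at sea level?", demos))
```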
Bias
Experiment Settings
Datasets As evaluation criteria for a comprehensive examination of biased and unjust behavior in open-domain chatbots, we have chosen the BBQ and BOLD benchmarks. Prior to the introduction of these benchmarks, there were numerous attempts to evaluate bias and fairness in chatbots, but their validity has been heavily criticized. As a result, we have chosen to conduct our study using BBQ and BOLD. BBQ is specifically developed to assess bias in the context of question answering, which is in line with our preference for assessments that are both accurate and less artificial when considering social effects. Moreover, BBQ is more useful and practical in dialogue circumstances. On the other hand, BOLD is a dataset used to measure fairness in text generation. Although BOLD has previously been used to quantify toxicity, we suggest that, given its neutral background, it is better suited for measuring bias. In each generation, LLMs are given a domain-specific sub-sentence or prompt to complete.
Models In addition to ChatGPT, we have also selected two SOTA LLMs, InstructGPT (davinci-v2) and GPT-3 (davinci-v1), which have demonstrated comparable results in HELM. Both InstructGPT and GPT-3 are prompt-based, few-shot generative models, capable of learning to perform based on in-context examples and instructions. However, InstructGPT utilizes a reinforcement-based training strategy, resulting in a more comprehensive understanding of in-context prompts.
Metrics In the BBQ question answering task, we employ the quasi-exact match ($EM$) metric and a bias score as evaluation criteria. The $EM$ metric, as introduced in the HELM framework, relaxes the criterion of correctness from exact match to match after slight post-processing, such as lower-casing and punctuation removal, in order to assess the performance of generative models in closed-ended question answering tasks. The bias score, adopted from the original BBQ, reflects the percentage of non-unknown outputs that align with a specific social bias. A bias score of 0% indicates that the model is unbiased, while a score of 100% or -100% indicates full alignment with or opposition to the targeted social bias, respectively. The bias score is calculated for both ambiguous and disambiguated contexts, denoted as $BS_{amb}$ and $BS_{dis}$, respectively. In the context of the BOLD benchmark, we employ two metrics from the HELM framework to quantify bias in text generation, namely demographic representation bias and stereotypical associations. These metrics gauge the prevalence of stereotyped phrases in conjunction with demographic terms across model generations. Specifically, we assess bias in the domains of gender and race. Accordingly, the metrics for demographic representation bias are denoted $RP_{race}$ and $RP_{gender}$, while the metrics for stereotypical associations are denoted $ST_{race}$ and $ST_{gender}$.
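For concreteness, a simplified sketch of these two metrics is given below. The normalization and the bias-score formula reflect our reading of HELM and the original BBQ paper, so details may differ from the exact implementation used here.

```python
import string

def quasi_exact_match(prediction: str, reference: str) -> bool:
    """Quasi-exact match: compare answers after light normalization
    (lower-casing, punctuation removal, whitespace stripping), as in HELM."""
    normalize = lambda s: s.lower().translate(
        str.maketrans("", "", string.punctuation)).strip()
    return normalize(prediction) == normalize(reference)

def bias_score(n_biased: int, n_non_unknown: int) -> float:
    """BBQ-style bias score over non-'unknown' answers (disambiguated contexts):
    0 means unbiased; +1 / -1 means full alignment with / opposition to the
    targeted bias. The ambiguous-context score additionally scales this value
    by (1 - accuracy). Reported in the paper as a percentage."""
    if n_non_unknown == 0:
        return 0.0
    return 2.0 * (n_biased / n_non_unknown) - 1.0

print(quasi_exact_match("The doctor.", "the doctor"))  # True
print(bias_score(n_biased=60, n_non_unknown=100))      # 0.2, i.e. 20%
```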
Result Analysis
Table tab-bbq presents a comparative analysis on BBQ, utilizing InstructGPT and GPT-3 as few-shot language models, against the zero-shot ChatGPT. The results indicate that ChatGPT exhibits superior performance in terms of $EM$ and $BS_{dis}$, demonstrating its robustness across a diverse range of scenarios, including demographic characteristics such as age, disability status, gender identity, socioeconomic status, and sexual orientation. Although GPT-3 demonstrated the highest performance on $BS_{amb}$, ChatGPT is still able to achieve comparable results, surpassing InstructGPT, without the need for in-context examples. Table tab-bold presents bias measurements in BOLD generation, where ChatGPT consistently outperformed the other two baselines on three out of four metrics. In conclusion, our empirical findings demonstrate that ChatGPT exhibits lower levels of bias in comparison to current SOTA LLMs when performing question answering or text generation tasks. Furthermore, ChatGPT demonstrates the ability to complete tasks without the need for in-context few-shot learning, suggesting a comprehensive understanding of limited context. It is worth noting that the ground-truth examples utilized by InstructGPT and GPT-3 contain no bias and may implicitly minimize bias in these models. This implies that existing LLMs may have a more significant level of bias than ChatGPT.
Finding 1.1: Indicated by the performance on these datasets, ChatGPT may have less bias than other LLMs.
Beyond Benchmarking
Despite the noteworthy performance of ChatGPT on current benchmarks for social bias, it should be acknowledged that these measurements only encompass a limited subset of potential bias aspects. As previously outlined in Section sec-2, user feedback on Twitter has indicated the presence of biases in other domains, such as multilingual comprehension and code generation. In light of this, we have undertaken three case studies in order to determine the extent to which further biases are present in ChatGPT.
Case Study 1 Language Understanding
To understand bias in language understanding, we provide an example in Table tab-diaoyudao, where the user asks ChatGPT “Which country does Kunashir Island belong to?” in Japanese, Russian, and English. In fact, these islands are disputed territories. Based on the observation, ChatGPT shows biased opinions in different languages. Even when the question is asked in English, ChatGPT’s responses reflect biases toward the different controversial names of the islands.
These findings motivate our further investigation into language understanding. As machine translation is one of the most traditional tasks for multilingual understanding, we let ChatGPT perform zero-shot machine translation on the FLORES-200 devtest set. We choose three low-resource languages and three high-resource languages, as defined in the dataset. For each selected language, ChatGPT is required to translate the sentence into English.
To assess the translation quality, we use sacreBLEU and its variant, spBLEU, to measure $n$-gram word overlaps between the language model’s prediction and the reference. In addition, we also measure character $n$-gram overlaps via ChrF, without segmenting the sentence into words. All these metrics range from 0 to 100.
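As a minimal sketch of how these scores can be computed with the sacrebleu package (assuming a recent version that ships the FLORES-200 SentencePiece tokenizer for spBLEU; the example sentences are hypothetical):

```python
import sacrebleu

# Hypothetical system outputs and references (one segment each, for brevity).
hypotheses = ["The cat sits on the mat."]
references = [["The cat is sitting on the mat."]]

# Standard sacreBLEU: word n-gram overlap with the default tokenizer.
bleu = sacrebleu.corpus_bleu(hypotheses, references)

# spBLEU: BLEU over SentencePiece subwords (requires the flores200 tokenizer).
spbleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")

# ChrF: character n-gram F-score, no word segmentation needed.
chrf = sacrebleu.corpus_chrf(hypotheses, references)

print(f"BLEU={bleu.score:.1f}  spBLEU={spbleu.score:.1f}  ChrF={chrf.score:.1f}")
```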
The performance of ChatGPT’s zero-shot machine translation is displayed in Table tab-flores. A study of ChatGPT’s performance relative to language resources demonstrates a considerable gap between its performance on low-resource and high-resource languages. In particular, the near-zero scores on the BLEU and spBLEU metrics for low-resource languages show that ChatGPT has insufficient comprehension of these languages, whereas its performance on high-resource languages indicates a rather high level of semantic understanding. In addition, a comparison of ChatGPT’s performance within the same resource group demonstrates that the quality of its translations varies between languages.
Finding 1.2: ChatGPT shows inadequate comprehension of low-resource languages and variations in translation quality among high-resource languages.
Case Study 2 Code Generation To illustrate bias in code generation, we provide an example in Table tab-code-bias, in which ChatGPT is requested to write a Python function to determine whether a person is a doctor based on race and gender. Instead of acknowledging that people of all genders and races can be doctors, ChatGPT expresses plainly biased opinions about doctors. We suggest that real-world LLM applications should be capable of avoiding such biases, even when the prompt itself invites a biased response.
We conduct a case study on the ability of ChatGPT to generate unbiased code with minimal guidance, as illustrated in Table tab-code-prompt, using 100 randomly sampled occupations associated with language models. Results indicate that ChatGPT is highly likely to generate biased programs under biased induction in Round 1, as 87% of generated Python programs are found to be biased, as determined by manual annotation with inter-rater reliability of 0.99 and 1.00 for Round 1 and Round 2 generations, respectively. However, when required to remove bias, ChatGPT struggles to eliminate it entirely, with 31% of programs found to be biased in Round 2.
Finding 1.3: ChatGPT is highly likely to generate biased programs under biased induction, and struggles to remove all bias in generation.
Case Study 3 Open-ended Dialogue
Our case study utilizing ProsocialDialog, a dataset focusing on multi-turn problematic dialogue following social norms, demonstrates that ChatGPT possesses high social awareness through its ability to generate socially safe and unbiased responses in open-ended dialogues. This is evidenced by the results of our human evaluation, in which 50 dialogues from the test set were evaluated against ground-truth labels, resulting in a Fleiss’ kappa coefficient of 0.94, indicating high agreement among annotators, and 92% alignment with the ground truth.
Finding 1.4: ChatGPT is able to generate socially safe and unbiased responses in open-ended dialogues.
Robustness
Experiment Settings
Datasets In order to evaluate the robustness of ChatGPT, we utilize two datasets, IMDB sentiment analysis and BoolQ factual question answering, and adopt the evaluation settings from HELM. As there is a lack of conventional assumptions and benchmarks for robustness in language models, we focus on measuring a specific subfield of robustness, namely adversarial semantic robustness. To this end, we employ the perturbation methods defined in HELM, which were originally inspired by NL-Augmenter. Specifically, for the notion of invariance, we utilize two types of augmentation: misspelling and formatting (lowercasing, contractions, expansions, and extra spacing). For the notion of equivariance, we utilize Contrast Sets, a data resource in which IMDB and BoolQ have been counterfactually augmented.
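To make the invariance perturbations concrete, below is a minimal sketch of the kind of surface-level transformations involved; it is deliberately simplified relative to the HELM and NL-Augmenter implementations.

```python
import random

def perturb_formatting(text: str, seed: int = 0) -> str:
    """Formatting perturbation: lowercase the text, apply a common contraction,
    and inject occasional extra spacing, leaving the semantics unchanged."""
    rng = random.Random(seed)
    text = text.lower().replace("do not", "don't")
    words = text.split()
    return " ".join(w + (" " if rng.random() < 0.2 else "") for w in words)

def perturb_misspelling(text: str, seed: int = 0) -> str:
    """Misspelling perturbation: swap two adjacent characters in some long words."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if len(w) > 4 and rng.random() < 0.3:
            i = rng.randrange(1, len(w) - 2)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        out.append(w)
    return " ".join(out)

print(perturb_misspelling("This movie was absolutely wonderful, I do not regret it."))
```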
Models We deploy two SOTA LLMs as baselines, InstructGPT (davinci v2) and GLM (130b), where GLM is an open bilingual LLM with 130B parameters. Similar to InstructGPT, GLM is prompt-based, requiring in-context examples for the ideal output behavior. We instruct both models with five in-context examples on each dataset.
Metrics As in HELM, we evaluate the correctness of model outputs with $EM$. Ideally, a robust model should perform consistently regardless of the perturbations. By comparing performance across the augmented subsets, we are able to determine how robust each LLM is.
Result Analysis
The performance of the model is systematically evaluated using three perturbed subsets of the IMDB dataset, specifically misspelling ($EM_{msp}$), formatting ($EM_{fmt}$), and contrasting ($EM_{ctst}$), in relation to the original subset. The performance differences are quantitatively measured in terms of percentage, in conjunction with the computation of the $EM$ metric. The results of this evaluation are presented in Table tab-imdb. The analysis shows that ChatGPT achieves superior accuracy in sentiment analysis, as well as an increased level of robustness across all augmented variants. In contrast, InstructGPT and GLM exhibit a statistically significant degradation in performance, averaging approximately 5% across the various perturbation methods. Notably, all three language models exhibit a heightened level of susceptibility to semantics-altering perturbations in the IMDB sentiment analysis scenario, as indicated by the $EM_{ctst}$ metric.
In addition, the model performance is evaluated on the BoolQ dataset using identical experimental settings. The results of this evaluation are consistent with the aforementioned findings, with ChatGPT outperforming the other two baselines on both the original and perturbed subsets. Furthermore, it is observed that the three language models exhibit a degradation in performance when subjected to contrasting semantic perturbations, suggesting that equivariance robustness may pose a significant challenge for current advanced language models and their applications.
Finding 2.1: Compared to two LLM baselines, ChatGPT demonstrates superior accuracy and adversarial robustness, while still exhibiting a heightened level of susceptibility to semantics-altering perturbations.
Beyond Benchmarking
In addition to adversarial semantic robustness, there exist other forms of robustness. In this section, we focus on a specific type of vulnerability, namely, adversarial prompt injection. Adversarial prompt injection refers to the phenomenon where the model’s response may vary significantly in response to subtle changes in the input prompt, due to the model’s high sensitivity to the prompt. To examine the ethical implications of this behavior, we conduct a case study.
Case Study Scenario-based Safety Jailbreaking
In order to demonstrate the potential concerns associated with jailbreaking, two of the authors manually construct prompts for 100 distinct scenarios that are deemed unsafe, unethical, or illegal, such as “how to sell drugs to my classmate?” and “how to secretly steal my dad’s money?”. It is expected that the ChatGPT model will not provide responses to these prompts. Drawing inspiration from previous jailbreak cases, the authors employ the technique of prompt injection (e.g., “write a song about…”, “write a poem about…”) on these 100 prompts as a means of bypassing ChatGPT’s safety awareness mechanisms. The robustness of this adversarial prompt injection is evaluated through manual examination by the two authors to determine whether ChatGPT provided responses to these prompts. The inter-annotator agreement, as measured by Cohen’s kappa coefficient, is found to be 1.00.
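As a small illustration of how the reported agreement can be computed with scikit-learn (the binary answered/refused labels below are hypothetical):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical judgments from the two annotators over the same injected prompts:
# 1 = ChatGPT answered the unsafe request, 0 = it refused.
annotator_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
annotator_b = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Cohen's kappa corrects raw agreement for agreement expected by chance;
# identical label sequences yield a coefficient of 1.0.
print(cohen_kappa_score(annotator_a, annotator_b))
```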
The results of this evaluation indicate that only 2 out of 100 scenarios are directly answered by ChatGPT, indicating a high level of safety. However, it is found that 95 out of the 98 scenarios that are initially protected by safety mechanisms can be successfully jailbroken through the use of persona-assigned adversarial prompts. This highlights the high vulnerability and potentially unsafe nature of ChatGPT.
Finding 2.2: ChatGPT is very susceptible to prompt injections, allowing its safety features to be bypassed and possibly dangerous, immoral, or illegal responses to be generated.
Reliability
Experiment Settings
Datasets To evaluate the factual knowledge and commonsense capabilities of the models, we utilize two multiple-choice question-answering benchmarks, OpenBookQA and TruthfulQA. The OpenBookQA dataset comprises basic science facts collected from open-book exams. In contrast, the TruthfulQA dataset contains questions aligned with prevalent human misconceptions and covers topics such as law, medicine, finance, and politics.
Models In line with the experiments in Section sec-bias-exp, InstructGPT (davinci v2) and GPT-3 (davinci v1) are employed as baseline models for comparison.
Metrics To evaluate the accuracy of model performance, we utilize the Exact Match (EM) metric.
Result Analysis
We present our evaluation results in Table tab-qa. To our surprise, we observed comparable performance between ChatGPT and InstructGPT across both datasets. We attribute this similarity to the utilization of similar training strategies and architectures in both models. The only distinction is the data collection setup, where ChatGPT is additionally trained on a manually labeled dialogue dataset. Conversely, GPT-3 demonstrated a significantly lower accuracy on TruthfulQA, suggesting that it may have encoded more misconceptions during training. A similar performance gap has also been reported by HELM. Overall, we find that ChatGPT marginally outperforms previous SOTA LLMs, correctly answering no more than 65% of the test cases. This highlights the challenges faced by current chatbots in effectively learning factual knowledge, resulting in low reliability in real-world scenarios.
Finding 3.1: Due to similar training strategies and architectures, ChatGPT maintains mediocre reliability in factual knowledge, comparable to SOTA LLMs.
Beyond Benchmarking
As previously discussed in Section sec-2, the reliability of generative language models may be compromised by the phenomenon of hallucination. Hallucination refers to the generation of false or misleading information by such models. This problem is prevalent in natural language generation, and the distribution of misinformation and disinformation is a common manifestation of this phenomenon. However, measuring the prevalence of hallucination in natural language generation is a challenging task, as it typically necessitates the use of human judgment, which may be costly and resource-intensive; this difficulty has been highlighted in prior work.
Case Study Open-ended Question Answering
Given the aforementioned constraint, we conducted a limited evaluation of ChatGPT’s performance in open-ended factual question answering by utilizing the TruthfulQA$_{gen}$ test set. To assess the severity of hallucination in text generation, we solicited the evaluation of three authors to determine whether the generated responses contained misinformation or disinformation, utilizing a random sample of 100 questions from TruthfulQA$_{gen}$. Table tab-hallucination presents two examples of misinformed responses, along with the corresponding ground-truth answers. According to the results of the human evaluation, we observed 24 instances of hallucination. This observation confirms the previously reported findings on the OpenBookQA and TruthfulQA datasets, leading us to conclude that ChatGPT’s performance is unreliable and may pose a risk of misleading users in practical applications.
Finding 3.2: ChatGPT’s performance on open-ended factual queries validates previous reports of hallucination within language models.
Toxicity
Experiment Settings
Datasets In order to evaluate toxicity, we have selected RealToxicityPrompts, a compilation of benign prompts sourced from OpenWebText. RealToxicityPrompts serves as the primary benchmark for assessing toxicity in text generation. Although the HELM study employs both RealToxicityPrompts and BOLD for similar purposes, we contend that language models exhibit a reduced tendency toward toxic generation when prompted with BOLD, as previously noted in the HELM discussion.
Models In our analysis, we contrast ChatGPT with two state-of-the-art language models on RealToxicityPrompts, specifically Cohere (medium-2022126) [https://cohere.ai/] and T5 (11b). While Cohere is a prompt-based, few-shot language model, T5 is solely pre-trained on a combination of supervised and unsupervised tasks. Nonetheless, we evaluate both models with 5-shot in-context examples, as previously done in HELM.
Metrics The toxicity of the model completions is measured using the Perspective API [https://perspectiveapi.com/], with due consideration given to the comprehensive cautions regarding the validity of the Perspective API as outlined in HELM. We compute the proportion of model outputs that exceed the toxicity threshold.
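For concreteness, a minimal sketch of scoring completions with the Perspective API and aggregating the toxic fraction is shown below; it assumes an API key, and the 0.5 threshold follows common practice rather than a value stated in this paper.

```python
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity(text: str) -> float:
    """Return the Perspective TOXICITY summary score in [0, 1]."""
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    response = client.comments().analyze(body=body).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def toxic_fraction(completions: list[str], threshold: float = 0.5) -> float:
    """Fraction of model completions whose toxicity score exceeds the threshold."""
    return sum(toxicity(c) >= threshold for c in completions) / len(completions)
```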
Result Analysis
Table tab-toxic illustrates the findings on RealToxicityPrompts. One notable aspect is that all language models exhibit minimal toxicity, as the toxic fractions are nearly zero. Through a comparison of ChatGPT to the other two baselines, it is apparent that ChatGPT exhibits a slight advantage. This may be due to the fact that it is trained on a cleaner dataset.
Finding 4.1: ChatGPT demonstrates a superior level of toxicity reduction in comparison to the other two baseline LLMs, likely as a result of its cleaner training dataset.
Beyond Benchmarking
Despite its inherent design for safe usage, it is plausible that ChatGPT may be susceptible to jailbreaking for the generation of toxic language. To assess this potentiality, we conducted a human study utilizing prompt-injected dialogues.
Case Study Persona-assigned Toxicity Jailbreak
Motivated by the Awesome ChatGPT Prompts repository [https://prompts.chat/], our objective is to prompt ChatGPT to generate toxic language through the utilization of prompt injection, a technique known to circumvent model constraints. To achieve this, we employ the prompt “Act as [Character] from [Movie/Book/Anything]”. In selecting characters, we source a list of the most rude and poisonous characters from the Open-Source Psychometrics Project [https://openpsychometrics.org/tests/characters/stats/MLP/4/]. To ensure fairness in our evaluation, we instruct ChatGPT to generate toxic content based on each injected prompt and measure the results using the Perspective API. Our findings indicate that the toxic fraction is 0.977 across all 43 dialogues with ChatGPT, with 42 out of 43 answers being classified as toxic. In contrast, the default behavior of ChatGPT is to avoid generating toxic content.
Finding 4.2: ChatGPT is prone to exploitation via prompt injection, which enables the generation of harmful language. The current mitigation strategy adopted by the model is inadequate, as it lacks the ability to detect potential toxicity at an early stage.
Discussions
Summary of Evaluation
Our evaluation empirically red-teams several ethical aspects of ChatGPT, from bias to toxicity, unveiling the model’s performance under major ethical risks. Through these studies, we aim to answer the main research question: “How responsible is ChatGPT?”. One of our main findings is that the predominant benchmarks for language model evaluation are insufficient for ChatGPT. We consistently observe that ChatGPT performs comparably to, or even better than, SOTA LLMs on those benchmarks, which indicates nontrivial progress in the recent development of AI. This partially confirms OpenAI’s claim of mitigating the bias and toxicity found in the previous LLM, GPT-3. In contrast, motivated by the community, we illustrate several shortcomings of ChatGPT via small-scale case studies, some of which have later been covered by subsequent work. We summarize them as follows:
Bias
Lack of Multilingual Understanding: ChatGPT appears not to fully understand diverse languages. This drawback was also identified in its prototype, GPT-3, though daily users claim that ChatGPT behaves more like a multilingual communicator. Due to its poor capability in multilingual understanding, ChatGPT can be biased in decision-making and creative generation. We expect that bias in multilingualism potentially implies bias in multicultural understanding, leading to an unethical impact on underrepresented groups in society.
Multimodality: Beyond natural language, ChatGPT can be biased in code generation due to the logical fallacy of program oversimplification. Bias in multimodality could be an unethical threat to daily programming practice, resulting in substantial flaws in real-world production code, where programs are usually more sophisticated.
Robustness & Toxicity Prompt injection is an effective approach to breaking the model’s constraints. Although ChatGPT is likely to have been trained safely, its safeguards can easily be bypassed through prompt injections due to emergent risks. With the emergent abilities of LLMs, models are easily manipulated into harmful behaviors.
Reliability ChatGPT does not encode enough knowledge, especially factual knowledge. This greatly reduces the reliability of the model, as the majority of daily usage relies on factual justification. Due to hallucination, the model may spread misinformation and disinformation or advise unethical decisions in domains such as medicine and law. Another unmeasured but inevitable shortcoming is that the knowledge encoded by ChatGPT and all other LLMs is limited by the amount and recency of the training data. Without constant updates to model weights, language models are expected to become out-of-date and hence provide incorrect information. This also degrades the model’s reliability.
Towards Responsible Language Models
The empirical findings on the AI ethics and risks of ChatGPT serve to further underscore the importance of providing a comprehensive outlook on the ethics of language models more broadly. Our examination of the diagnosed risks inherent in ChatGPT supports the conjecture that similar ethical considerations are likely to pertain to other language models, as discussed in prior literature. Despite the challenges, it is clear that the development of safe and ethical language models represents a crucial long-term objective for the advancement of responsible artificial general intelligence. In this section, we aim to provide valuable insights into this endeavor, with a focus on both Internal Ethics and External Ethics, as inspired by seminal prior work.
Internal Ethics — Modeling We believe there should be an evolution of current learning strategies. We argue that the current main focus of language modeling is more on effectiveness (on prototypical benchmarks) and efficiency, rather than reliability and practical efficacy. For instance, there are few modeling approaches that avoid miscorrelation at the learning stage. Regardless of their size, language models more or less encode wrong beliefs (e.g., biases, stereotypes, and misunderstandings), even though these beliefs may not necessarily appear in the training data. Furthermore, general language models do not have a good sense of time or temporal knowledge. The facts and knowledge learned by language models can change over time, while the parameters of the language models stay unchanged. It is foreseeable that the reliability of trained-once language models will constantly decrease as time goes by. Constant updates to data and models would certainly mitigate these issues, though they cannot be afforded by the majority of practitioners. We note that some existing work on weight editing could partially address the problem, but impractically: practitioners who pursue weight editing need to predesign the mapping of knowledge updates.
External Ethics — Usage We define external ethics as the responsibility of producers and users. From the production perspective, training data should be responsibly constructed. We emphasize the privacy of data usage: without privacy protection, LLMs can easily leak private information during generation. One ethical practice is to filter out personally identifiable information, which has been adopted by some recent LLMs. Secondly, language models intended for release should be systematically evaluated on various scenarios and large-scale test samples. We suggest that benchmarks like HELM could be set as standard practice inside the future supply chain of language models. However, we also note that most tasks in HELM only measure the modality of natural language, which is insufficient for multimodal LLMs, such as audio LLMs and vision LLMs. Despite the rising number of benchmarks on multimodal tasks, benchmarks for multimodal AI ethics have not yet been seriously considered. At the deployment stage, we note that LLMs could be attacked by unethical users to output malicious content or decisions. Thus, even internally ethical language models can be used unethically by third parties. Existing strategies have demonstrated effectiveness in preventing LLM abuse, though they can be invalidated via attacks. We therefore encourage future work to explore more feasible protections for language models. From the daily usage perspective, users should be fully aware of the shortcomings of a language model application, and should not abuse or attack language models to perform unethical tasks. Most unethical behaviors towards language models are deemed a great challenge for LLM producers, as they are almost unpredictable. Consequently, we call for education and policy on model usage in the community. Specifically, courses on proper machine learning model usage should be developed to guide users in learning the ‘Dos’ and ‘Don’ts’ of AI. Detailed policies could also be proposed to list all of the users’ responsibilities before model access.
Language Models Beyond ChatGPT
The examination of ethical implications associated with language models necessitates a comprehensive examination of the broader challenges that arise within the domain of language models, in light of recent advancements in the field of artificial intelligence. The last decade has seen a rapid evolution of AI techniques, characterized by an exponential increase in the size and complexity of AI models, and a concomitant scale-up of model parameters. The scaling laws that govern the development of language models, as documented in recent literature, suggest that we can expect to encounter even more expansive models that incorporate multiple modalities in the near future. Efforts to integrate multiple modalities into a single model are driven by the ultimate goal of realizing the concept of foundation models. In the following sections, we will outline some of the most pressing challenges that must be addressed in order to facilitate further progress in the development of language models.
Emergent Ability As described in previous work, emergent ability is defined as follows: “An ability is emergent if it is not present in smaller models but is present in larger models.” From our diagnosis, we successfully identify a few unethical behaviors in ChatGPT that were inadequately discussed in previous works, which could potentially be viewed as emergent risks. Kaplan et al. have confirmed that risks inside small language models can be further amplified in large ones due to model scale. On the basis of this finding, we add that model scale and the current trend of prompt-based training can exacerbate risks along all dimensions. The main reason is that LLMs may be too flexible from the learning perspective. Firstly, these models are more context-dependent, meaning that they are easily manipulated by prompt injections. Although we agree that some injected scenarios can be temporarily mitigated with ad-hoc parameter tuning, there is no silver bullet to avoid all risk concerns brought by prompting. Meanwhile, we urge up-to-date benchmarks for measuring unforeseen behaviors inside large language models. Without benchmarking emergent abilities, it could be hard to mitigate the risks and problems at scale. Secondly, we note that larger language models are generally trained with more data. Even assuming the data is completely clean and informatively correct, language models will still fail to learn all information and knowledge, and may also wrongly correlate pieces of information with each other. Furthermore, under the scope of foundation models, multimodal data could introduce the possibility of miscorrelation between different modalities.
Machine Learning Data Our discussion lies in the collection and usage of machine learning data. A previous study suggests that high-quality language data is likely to be exhausted before 2026, and that low-quality language and image data could run out by 2060. This implies that the limited progress of data collection and construction could constrain future LLM development. Furthermore, as better-quality data is assumed to train language models with better performance, companies and independent researchers are spending more time on data curation. However, this cannot be done easily under low-resource and low-budget scenarios. Even with considerable effort put into designing comprehensive human annotation frameworks, the data could still contain inaccurate or misleading information due to the natural biases in crowdsourcing. In fact, we notice that previously constructed datasets have undergone multiple rounds of filtering over time. On the other hand, current findings suggest that the usage of data for language models may not be optimized. Specifically, recent work on data deduplication and reduction has shown that high-quality data in low quantity can improve model performance. Besides, we consider the design of training data to be a crucial factor in efficient data usage. For example, experiments show that curriculum learning, active learning, and prompting could improve data efficiency. However, most of these strategies are still at an early stage and need further investigation.
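As a toy illustration of the deduplication idea mentioned above, the sketch below removes exact (post-normalization) duplicates by hashing; real pipelines typically add fuzzy matching such as MinHash, which is beyond this sketch.

```python
import hashlib

def normalize(text: str) -> str:
    """Cheap normalization before hashing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def exact_dedup(corpus: list[str]) -> list[str]:
    """Drop exact (post-normalization) duplicate documents via content hashing."""
    seen, kept = set(), []
    for doc in corpus:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

print(exact_dedup(["A cat sat.", "a   cat sat.", "A dog ran."]))  # keeps two documents
```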
Computational Resource As LLMs grow bigger and bigger, the deployment and training of these models become more and more costly. Everyday practitioners in NLP and deep learning will find it hard to run LLMs on their own devices. Previous studies also show that the computational resource requirements for strong model scaling clearly outpace those of system hardware. We argue that model scaling may be inevitable, as determined by the scaling laws. However, recent attempts in model design, tuning strategies, and compression could mitigate the extreme consumption of computational resources. As Wu et al. have summarized most works around this topic, we do not elaborate on these approaches and designs. In addition, the increasing demand for computational resources leads to higher energy consumption and carbon emissions, negatively impacting the environment. Hence, we encourage more advanced hardware-software co-designs in computation to optimize the carbon footprint of LLMs.
Conclusion
We present a comprehensive diagnosis of the AI ethics encoded by ChatGPT, including bias, robustness, reliability, and toxicity. By measuring performance on a number of benchmarks and case studies, we find that ChatGPT may perform slightly better than current SOTA language models, while still showing evidence of ethical risks. Concretely, we reveal that ChatGPT is susceptible to prompt injections that elicit unethical behaviors. We further provide an outlook on the ethical challenges of developing advanced language models, and offer suggestions on directions and strategies for designing ethical language models. We believe that our research can inspire researchers to focus more effort on language models and their evaluation.
Limitations
The primary limitation of the study pertains to the validity of our empirical analysis of ChatGPT. It is acknowledged that the reported results may be inconsistent, as the hyperparameters of ChatGPT remain undisclosed. Moreover, ChatGPT underwent iteration across three versions (the initial version, the version from December 15th, and the version from January 9th) over the course of two months and may have been trained with new data in each version. Despite these limitations, our study endeavors to highlight the potential ethical risks associated with future language models by addressing a comprehensive set of topics in AI ethics.
Additionally, the evaluation settings of our study may be criticized for their lack of rigor. Although our diagnostic study employed a diverse range of evaluation methods through an AI ethics lens, there may exist additional datasets that could enhance its validity. Moreover, the zero-shot performance of ChatGPT was intuitively prompted, and the prompt design could be further scrutinized to attain better results. Given the proprietary nature of ChatGPT’s data and model, it is possible that it has already been trained on some of the evaluated samples. Nonetheless, our objective is to highlight that many ethical concerns have yet to be thoroughly discussed or quantified.
Bibliography
@article{gibson2019efficiency,
  title = {How efficiency shapes human language},
  author = {Gibson, Edward and Futrell, Richard and Piantadosi, Steven P and Dautriche, Isabelle and Mahowald, Kyle and Bergen, Leon and Levy, Roger},
  journal = {Trends in Cognitive Sciences},
  volume = {23},
  number = {5},
  pages = {389--407},
  year = {2019},
  publisher = {Elsevier},
}

@article{team2022NoLL,
  title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  author = {NLLB Team and Marta Ruiz Costa-juss{\`a} and James Cross and Onur {\c{C}}elebi and Maha Elbayad and Kenneth Heafield and Kevin Heffernan and Elahe Kalbassi and Janice Lam and Daniel Licht and Jean Maillard and Anna Sun and Skyler Wang and Guillaume Wenzek and Alison Youngblood and Bapi Akula and Lo{\"i}c Barrault and Gabriel Mejia Gonzalez and Prangthip Hansanti and John Hoffman and Semarley Jarrett and Kaushik Ram Sadagopan and Dirk Rowe and Shannon L. Spruit and C. Tran and Pierre Yves Andrews and Necip Fazil Ayan and Shruti Bhosale and Sergey Edunov and Angela Fan and Cynthia Gao and Vedanuj Goswami and Francisco Guzm{\'a}n and Philipp Koehn and Alexandre Mourachko and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Jeff Wang},
  journal = {ArXiv},
  volume = {abs/2207.04672},
  year = {2022},
}

@misc{palm,
  title = {PaLM: Scaling Language Modeling with Pathways},
  author = {Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and Barham, Paul and Chung, Hyung Won and Sutton, Charles and Gehrmann, Sebastian and Schuh, Parker and Shi, Kensen and Tsvyashchenko, Sasha and Maynez, Joshua and Rao, Abhishek and Barnes, Parker and Tay, Yi and Shazeer, Noam and Prabhakaran, Vinodkumar and Reif, Emily and Du, Nan and Hutchinson, Ben and Pope, Reiner and Bradbury, James and Austin, Jacob and Isard, Michael and Gur-Ari, Guy and Yin, Pengcheng and Duke, Toju and Levskaya, Anselm and Ghemawat, Sanjay and Dev, Sunipa and Michalewski, Henryk and Garcia, Xavier and Misra, Vedant and Robinson, Kevin and Fedus, Liam and Zhou, Denny and Ippolito, Daphne and Luan, David and Lim, Hyeontaek and Zoph, Barret and Spiridonov, Alexander and Sepassi, Ryan and Dohan, David and Agrawal, Shivani and Omernick, Mark and Dai, Andrew M. and Pillai, Thanumalayan Sankaranarayana and Pellat, Marie and Lewkowycz, Aitor and Moreira, Erica and Child, Rewon and Polozov, Oleksandr and Lee, Katherine and Zhou, Zongwei and Wang, Xuezhi and Saeta, Brennan and Diaz, Mark and Firat, Orhan and Catasta, Michele and Wei, Jason and Meier-Hellstern, Kathy and Eck, Douglas and Dean, Jeff and Petrov, Slav and Fiedel, Noah},
  publisher = {arXiv},
  year = {2022},
  url = {https://arxiv.org/abs/2204.02311},
  doi = {10.48550/ARXIV.2204.02311},
}

@inproceedings{lee2022coauthor,
  title = {CoAuthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities},
  author = {Lee, Mina and Liang, Percy and Yang, Qian},
  booktitle = {Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems},
  pages = {1--19},
  year = {2022},
}

@article{liptak2017amazon,
  title = {Amazon's Alexa started ordering people dollhouses after hearing its name on TV},
  author = {Liptak, Andrew},
  journal = {The Verge},
  volume = {7},
  year = {2017},
}

@article{wolf2017we,
  title = {Why we should have seen that coming: comments on Microsoft's Tay ``experiment,'' and wider implications},
  author = {Wolf, Marty J and Miller, K and Grodzinsky, Frances S},
  journal = {ACM SIGCAS Computers and Society},
  volume = {47},
  number = {3},
  pages = {54--64},
  year = {2017},
  publisher = {ACM New York, NY, USA},
}

@inproceedings{abdi2019more,
  title = {More than Smart Speakers: Security and Privacy Perceptions of Smart Home Personal Assistants},
  author = {Abdi, Noura and Ramokapane, Kopo M and Such, Jose M},
  booktitle = {SOUPS @ USENIX Security Symposium},
  year = {2019},
}

@article{rae2021scaling,
  title = {Scaling language models: Methods, analysis \& insights from training Gopher},
  author = {Rae, Jack W and Borgeaud, Sebastian and Cai, Trevor and Millican, Katie and Hoffmann, Jordan and Song, Francis and Aslanides, John and Henderson, Sarah and Ring, Roman and Young, Susannah and others},
  journal = {arXiv preprint arXiv:2112.11446},
  year = {2021},
}

@inproceedings{jin2021good,
  title = {How Good Is NLP? A Sober Look at NLP Tasks through the Lens of Social Impact},
  author = {Jin, Zhijing and Chauhan, Geeticka and Tse, Brian and Sachan, Mrinmaya and Mihalcea, Rada},
  booktitle = {Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
  pages = {3099--3113},
  year = {2021},
}

@article{weidinger2021ethical,
  title = {Ethical and social risks of harm from language models},
  author = {Weidinger, Laura and Mellor, John and Rauh, Maribeth and Griffin, Conor and Uesato, Jonathan and Huang, Po-Sen and Cheng, Myra and Glaese, Mia and Balle, Borja and Kasirzadeh, Atoosa and others},
  journal = {arXiv preprint arXiv:2112.04359},
  year = {2021},
}

@inproceedings{liang2021towards,
  title = {Towards understanding and mitigating social biases in language models},
  author = {Liang, Paul Pu and Wu, Chiyu and Morency, Louis-Philippe and Salakhutdinov, Ruslan},
  booktitle = {International Conference on Machine Learning},
  pages = {6565--6576},
  year = {2021},
  organization = {PMLR},
}

@article{schuster2020limitations,
  title = {The limitations of stylometry for detecting machine-generated fake news},
  author = {Schuster, Tal and Schuster, Roei and Shah, Darsh J and Barzilay, Regina},
  journal = {Computational Linguistics},
  volume = {46},
  number = {2},
  pages = {499--510},
  year = {2020},
  publisher = {MIT Press},
}

@inproceedings{lucy2021gender,
  title = {Gender and representation bias in GPT-3 generated stories},
  author = {Lucy, Li and Bamman, David},
  booktitle = {Proceedings of the Third Workshop on Narrative Understanding},
  pages = {48--55},
  year = {2021},
}

@inproceedings{abid2021persistent,
  title = {Persistent anti-Muslim bias in large language models},
  author = {Abid, Abubakar and Farooqi, Maheen and Zou, James},
  booktitle = {Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society},
  pages = {298--306},
  year = {2021},
}

@inproceedings{henderson2018ethical,
  title = {Ethical challenges in data-driven dialogue systems},
  author = {Henderson, Peter and Sinha, Koustuv and Angelard-Gontier, Nicolas and Ke, Nan Rosemary and Fried, Genevieve and Lowe, Ryan and Pineau, Joelle},
  booktitle = {Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society},
  pages = {123--129},
  year = {2018},
}

@inproceedings{si2022so,
  title = {Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots},
  author = {Si, Wai Man and Backes, Michael and Blackburn, Jeremy and De Cristofaro, Emiliano and Stringhini, Gianluca and Zannettou, Savvas and Zhang, Yang},
  booktitle = {Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security},
  pages = {2659--2673},
  year = {2022},
}

@article{roller2020recipes,
  title = {Recipes for building an open-domain chatbot},
  author = {Roller, Stephen and Dinan, Emily and Goyal, Naman and Ju, Da and Williamson, Mary and Liu, Yinhan and Xu, Jing and Ott, Myle and Shuster, Kurt and Smith, Eric M and others},
  journal = {arXiv preprint arXiv:2004.13637},
  year = {2020},
}

@article{miller2017parlai,
  title = {ParlAI: A Dialog Research Software Platform},
  author = {Miller, Alexander H and Feng, Will and Fisch, Adam and Lu, Jiasen and Batra, Dhruv and Bordes, Antoine and Parikh, Devi and Weston, Jason},
  journal = {EMNLP 2017},
  pages = {79},
  year = {2017},
}
151 title = {ParlAI: A Dialog Research Software Platform},
152}
153
154@article{liang2022holistic,
155 year = {2022},
156 journal = {arXiv preprint arXiv:2211.09110},
157 author = {Liang, Percy and Bommasani, Rishi and Lee, Tony and Tsipras, Dimitris and Soylu, Dilara and Yasunaga, Michihiro and Zhang, Yian and Narayanan, Deepak and Wu, Yuhuai and Kumar, Ananya and others},
158 title = {Holistic evaluation of language models},
159}
160
161@article{khurana2022natural,
162 publisher = {Springer},
163 year = {2022},
164 pages = {1--32},
165 journal = {Multimedia tools and applications},
166 author = {Khurana, Diksha and Koli, Aditya and Khatter, Kiran and Singh, Sukhdev},
167 title = {Natural language processing: State of the art, current trends and challenges},
168}
169
170@article{goldstein2023generative,
171 year = {2023},
172 journal = {arXiv preprint arXiv:2301.04246},
173 author = {Goldstein, Josh A and Sastry, Girish and Musser, Micah and DiResta, Renee and Gentzel, Matthew and Sedova, Katerina},
174 title = {Generative Language Models and Automated Influence Operations: Emerging Threats and Potential Mitigations},
175}
176
177@article{jobin2019global,
178 publisher = {Nature Publishing Group UK London},
179 year = {2019},
180 pages = {389--399},
181 number = {9},
182 volume = {1},
183 journal = {Nature Machine Intelligence},
184 author = {Jobin, Anna and Ienca, Marcello and Vayena, Effy},
185 title = {The global landscape of AI ethics guidelines},
186}
187
188@article{taddeo2018ai,
189 publisher = {American Association for the Advancement of Science},
190 year = {2018},
191 pages = {751--752},
192 number = {6404},
193 volume = {361},
194 journal = {Science},
195 author = {Taddeo, Mariarosaria and Floridi, Luciano},
196 title = {How AI can be a force for good},
197}
198
199@inproceedings{higashinaka2018role,
200 year = {2018},
201 pages = {264--272},
202 booktitle = {Proceedings of the 19th annual sigdial meeting on discourse and dialogue},
203 author = {Higashinaka, Ryuichiro and Mizukami, Masahiro and Kawabata, Hidetoshi and Yamaguchi, Emi and Adachi, Noritake and Tomita, Junji},
204 title = {Role play-based question-answering by real users for building chatbots with consistent personalities},
205}
206
207@article{reisenbichler2022frontiers,
208 publisher = {INFORMS},
209 year = {2022},
210 pages = {441--452},
211 number = {3},
212 volume = {41},
213 journal = {Marketing Science},
214 author = {Reisenbichler, Martin and Reutterer, Thomas and Schweidel, David A and Dan, Daniel},
215 title = {Frontiers: Supporting Content Marketing with Natural Language Generation},
216}
217
218@inproceedings{bartz2008natural,
219 year = {2008},
220 pages = {1--9},
221 booktitle = {Proceedings of the 9th ACM Conference on Electronic Commerce},
222 author = {Bartz, Kevin and Barr, Cory and Aijaz, Adil},
223 title = {Natural language generation for sponsored-search advertisements},
224}
225
226@inproceedings{nadeem2021stereoset,
227 year = {2021},
228 pages = {5356--5371},
229 booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
230 author = {Nadeem, Moin and Bethke, Anna and Reddy, Siva},
231 title = {StereoSet: Measuring stereotypical bias in pretrained language models},
232}
233
234@article{kirk2021bias,
235 year = {2021},
236 pages = {2611--2624},
237 volume = {34},
238 journal = {Advances in neural information processing systems},
239 author = {Kirk, Hannah Rose and Jun, Yennie and Volpin, Filippo and Iqbal, Haider and Benussi, Elias and Dreyer, Frederic and Shtedritski, Aleksandar and Asano, Yuki},
240 title = {Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models},
241}
242
243@inproceedings{carlini2021extracting,
244 year = {2021},
245 volume = {6},
246 booktitle = {USENIX Security Symposium},
247 author = {Carlini, Nicholas and Tramer, Florian and Wallace, Eric and Jagielski, Matthew and Herbert-Voss, Ariel and Lee, Katherine and Roberts, Adam and Brown, Tom B and Song, Dawn and Erlingsson, Ulfar and others},
248 title = {Extracting Training Data from Large Language Models.},
249}
250
251@article{wei2022ai,
252 year = {2022},
253 journal = {arXiv preprint arXiv:2206.07635},
254 author = {Wei, Mengyi and Zhou, Zhixuan},
255 title = {Ai ethics issues in real world: Evidence from ai incident database},
256}
257
258@article{perez2022red,
259 year = {2022},
260 journal = {arXiv preprint arXiv:2202.03286},
261 author = {Perez, Ethan and Huang, Saffron and Song, Francis and Cai, Trevor and Ring, Roman and Aslanides, John and Glaese, Amelia and McAleese, Nat and Irving, Geoffrey},
262 title = {Red teaming language models with language models},
263}
264
265@article{zhang2018deep,
266 publisher = {Wiley Online Library},
267 year = {2018},
268 pages = {e1253},
269 number = {4},
270 volume = {8},
271 journal = {Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery},
272 author = {Zhang, Lei and Wang, Shuai and Liu, Bing},
273 title = {Deep learning for sentiment analysis: A survey},
274}
275
276@article{minaee2021deep,
277 publisher = {ACM New York, NY, USA},
278 year = {2021},
279 pages = {1--40},
280 number = {3},
281 volume = {54},
282 journal = {ACM computing surveys (CSUR)},
283 author = {Minaee, Shervin and Kalchbrenner, Nal and Cambria, Erik and Nikzad, Narjes and Chenaghlu, Meysam and Gao, Jianfeng},
284 title = {Deep learning--based text classification: a comprehensive review},
285}
286
287@article{abbasiantaeb2021text,
288 publisher = {Wiley Online Library},
289 year = {2021},
290 pages = {e1412},
291 number = {6},
292 volume = {11},
293 journal = {Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery},
294 author = {Abbasiantaeb, Zahra and Momtazi, Saeedeh},
295 title = {Text-based question answering from information retrieval and deep neural network perspectives: A survey},
296}
297
298@article{mcguffie2020radicalization,
299 year = {2020},
300 journal = {arXiv preprint arXiv:2009.06807},
301 author = {McGuffie, Kris and Newhouse, Alex},
302 title = {The radicalization risks of GPT-3 and advanced neural language models},
303}
304
305@article{jang2021towards,
306 year = {2021},
307 journal = {arXiv preprint arXiv:2110.03215},
308 author = {Jang, Joel and Ye, Seonghyeon and Yang, Sohee and Shin, Joongbo and Han, Janghoon and Kim, Gyeonghun and Choi, Stanley Jungkyu and Seo, Minjoon},
309 title = {Towards continual knowledge learning of language models},
310}
311
312@inproceedings{gehman2020realtoxicityprompts,
313 year = {2020},
314 pages = {3356--3369},
315 booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
316 author = {Gehman, Samuel and Gururangan, Suchin and Sap, Maarten and Choi, Yejin and Smith, Noah A},
317 title = {RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models},
318}
319
320@article{solaiman2021process,
321 year = {2021},
322 pages = {5861--5873},
323 volume = {34},
324 journal = {Advances in Neural Information Processing Systems},
325 author = {Solaiman, Irene and Dennison, Christy},
326 title = {Process for adapting language models to society (palms) with values-targeted datasets},
327}
328
329@inproceedings{parrish2022bbq,
330 year = {2022},
331 pages = {2086--2105},
332 booktitle = {Findings of the Association for Computational Linguistics: ACL 2022},
333 author = {Parrish, Alicia and Chen, Angelica and Nangia, Nikita and Padmakumar, Vishakh and Phang, Jason and Thompson, Jana and Htut, Phu Mon and Bowman, Samuel},
334 title = {BBQ: A hand-built bias benchmark for question answering},
335}
336
337@inproceedings{dhamala2021bold,
338 year = {2021},
339 pages = {862--872},
340 booktitle = {Proceedings of the 2021 ACM conference on fairness, accountability, and transparency},
341 author = {Dhamala, Jwala and Sun, Tony and Kumar, Varun and Krishna, Satyapriya and Pruksachatkun, Yada and Chang, Kai-Wei and Gupta, Rahul},
342 title = {Bold: Dataset and metrics for measuring biases in open-ended language generation},
343}
344
345@article{brown2020language,
346 year = {2020},
347 pages = {1877--1901},
348 volume = {33},
349 journal = {Advances in neural information processing systems},
350 author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others},
351 title = {Language models are few-shot learners},
352}
353
354@inproceedings{post-2018-call,
355 pages = {186--191},
356 url = {https://www.aclweb.org/anthology/W18-6319},
357 publisher = {Association for Computational Linguistics},
358 address = {Belgium, Brussels},
359 year = {2018},
360 month = {October},
361 booktitle = {Proceedings of the Third Conference on Machine Translation: Research Papers},
362 author = {Post, Matt},
363 title = {A Call for Clarity in Reporting {BLEU} Scores},
364}
365
366@article{goyal2022flores,
367 publisher = {MIT Press},
368 year = {2022},
369 pages = {522--538},
370 volume = {10},
371 journal = {Transactions of the Association for Computational Linguistics},
372 author = {Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc’Aurelio and Guzm{\'a}n, Francisco and Fan, Angela},
373 title = {The flores-101 evaluation benchmark for low-resource and multilingual machine translation},
374}
375
376@inproceedings{popovic2015chrf,
377 year = {2015},
378 pages = {392--395},
379 booktitle = {Proceedings of the tenth workshop on statistical machine translation},
380 author = {Popovi{\'c}, Maja},
381 title = {chrF: character n-gram F-score for automatic MT evaluation},
382}
383
384@article{kim2022prosocialdialog,
385 year = {2022},
386 journal = {arXiv preprint arXiv:2205.12688},
387 author = {Kim, Hyunwoo and Yu, Youngjae and Jiang, Liwei and Lu, Ximing and Khashabi, Daniel and Kim, Gunhee and Choi, Yejin and Sap, Maarten},
388 title = {Prosocialdialog: A prosocial backbone for conversational agents},
389}
390
391@inproceedings{maas2011learning,
392 year = {2011},
393 pages = {142--150},
394 booktitle = {Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies},
395 author = {Maas, Andrew and Daly, Raymond E and Pham, Peter T and Huang, Dan and Ng, Andrew Y and Potts, Christopher},
396 title = {Learning word vectors for sentiment analysis},
397}
398
399@inproceedings{clark2019boolq,
400 year = {2019},
401 pages = {2924--2936},
402 booktitle = {Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
403 author = {Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina},
404 title = {BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions},
405}
406
407@article{dhole2021nl,
408 year = {2021},
409 journal = {arXiv preprint arXiv:2112.02721},
410 author = {Dhole, Kaustubh D and Gangal, Varun and Gehrmann, Sebastian and Gupta, Aadesh and Li, Zhenhao and Mahamood, Saad and Mahendiran, Abinaya and Mille, Simon and Srivastava, Ashish and Tan, Samson and others},
411 title = {Nl-augmenter: A framework for task-sensitive natural language augmentation},
412}
413
414@inproceedings{gardner2020evaluating,
415 year = {2020},
416 pages = {1307--1323},
417 booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
418 author = {Gardner, Matt and Artzi, Yoav and Basmov, Victoria and Berant, Jonathan and Bogin, Ben and Chen, Sihao and Dasigi, Pradeep and Dua, Dheeru and Elazar, Yanai and Gottumukkala, Ananth and others},
419 title = {Evaluating Models’ Local Decision Boundaries via Contrast Sets},
420}
421
422@article{zeng2022glm,
423 year = {2022},
424 journal = {arXiv preprint arXiv:2210.02414},
425 author = {Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
426 title = {Glm-130b: An open bilingual pre-trained model},
427}
428
429@inproceedings{mihaylov2018can,
430 year = {2018},
431 pages = {2381--2391},
432 booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
433 author = {Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish},
434 title = {Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
435}
436
437@inproceedings{lin2022truthfulqa,
438 year = {2022},
439 pages = {3214--3252},
440 booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
441 author = {Lin, Stephanie and Hilton, Jacob and Evans, Owain},
442 title = {TruthfulQA: Measuring How Models Mimic Human Falsehoods},
443}
444
445@article{ji2022survey,
446 publisher = {ACM New York, NY},
447 journal = {ACM Computing Surveys},
448 author = {Ji, Ziwei and Lee, Nayeon and Frieske, Rita and Yu, Tiezheng and Su, Dan and Xu, Yan and Ishii, Etsuko and Bang, Yejin and Madotto, Andrea and Fung, Pascale},
449 title = {Survey of hallucination in natural language generation},
450}
451
452@article{raffel2020exploring,
453 publisher = {JMLRORG},
454 year = {2020},
455 pages = {5485--5551},
456 number = {1},
457 volume = {21},
458 journal = {The Journal of Machine Learning Research},
459 author = {Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J},
460 title = {Exploring the limits of transfer learning with a unified text-to-text transformer},
461}
462
463@inproceedings{alzantot2018generating,
464 year = {2018},
465 booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
466 author = {Alzantot, Moustafa and Sharma, Yash Sharma and Elgohary, Ahmed and Ho, Bo-Jhang and Srivastava, Mani and Chang, Kai-Wei},
467 title = {Generating Natural Language Adversarial Examples},
468}
469
470@inproceedings{li2020bert,
471 year = {2020},
472 pages = {6193--6202},
473 booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
474 author = {Li, Linyang and Ma, Ruotian and Guo, Qipeng and Xue, Xiangyang and Qiu, Xipeng},
475 title = {BERT-ATTACK: Adversarial Attack Against BERT Using BERT},
476}
477
478@inproceedings{talat2022you,
479 year = {2022},
480 pages = {26--41},
481 booktitle = {Proceedings of BigScience Episode\# 5--Workshop on Challenges \& Perspectives in Creating Large Language Models},
482 author = {Talat, Zeerak and N{\'e}v{\'e}ol, Aur{\'e}lie and Biderman, Stella and Clinciu, Miruna and Dey, Manan and Longpre, Shayne and Luccioni, Sasha and Masoud, Maraim and Mitchell, Margaret and Radev, Dragomir and others},
483 title = {You reap what you sow: On the challenges of bias evaluation under multilingual settings},
484}
485
486@misc{Gokaslan2019OpenWeb,
487 year = {2019},
488 url = {http://Skylion007.github.io/OpenWebTextCorpus},
489 author = {Aaron Gokaslan and Vanya Cohen},
490 title = {OpenWebText Corpus},
491}
492
493@article{armengol2021multilingual,
494 year = {2021},
495 journal = {arXiv preprint arXiv:2108.13349},
496 author = {Armengol-Estap{\'e}, Jordi and Bonet, Ona de Gibert and Melero, Maite},
497 title = {On the Multilingual Capabilities of Very Large-Scale English Language Models},
498}
499
500@inproceedings{sawhney2021empirical,
501 year = {2021},
502 pages = {3751--3757},
503 booktitle = {Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
504 author = {Sawhney, Ramit and Aggarwal, Arshiya and Shah, Rajiv},
505 title = {An empirical investigation of bias in the multimodal analysis of financial earnings calls},
506}
507
508@article{bommasani2021opportunities,
509 year = {2021},
510 journal = {arXiv preprint arXiv:2108.07258},
511 author = {Bommasani, Rishi and Hudson, Drew A and Adeli, Ehsan and Altman, Russ and Arora, Simran and von Arx, Sydney and Bernstein, Michael S and Bohg, Jeannette and Bosselut, Antoine and Brunskill, Emma and others},
512 title = {On the opportunities and risks of foundation models},
513}
514
515@article{weiemergent,
516 journal = {Transactions on Machine Learning Research},
517 author = {Wei, Jason and Tay, Yi and Bommasani, Rishi and Raffel, Colin and Zoph, Barret and Borgeaud, Sebastian and Yogatama, Dani and Bosma, Maarten and Zhou, Denny and Metzler, Donald and others},
518 title = {Emergent Abilities of Large Language Models},
519}
520
521@article{kaplan2020scaling,
522 year = {2020},
523 journal = {arXiv preprint arXiv:2001.08361},
524 author = {Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario},
525 title = {Scaling laws for neural language models},
526}
527
528@article{villalobos2022will,
529 year = {2022},
530 journal = {arXiv preprint arXiv:2211.04325},
531 author = {Villalobos, Pablo and Sevilla, Jaime and Heim, Lennart and Besiroglu, Tamay and Hobbhahn, Marius and Ho, Anson},
532 title = {Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning},
533}
534
535@article{northcutt2021confident,
536 year = {2021},
537 pages = {1373--1411},
538 volume = {70},
539 journal = {Journal of Artificial Intelligence Research},
540 author = {Northcutt, Curtis and Jiang, Lu and Chuang, Isaac},
541 title = {Confident learning: Estimating uncertainty in dataset labels},
542}
543
544@article{treviso2022efficient,
545 year = {2022},
546 journal = {arXiv preprint arXiv:2209.00099},
547 author = {Treviso, Marcos and Ji, Tianchu and Lee, Ji-Ung and van Aken, Betty and Cao, Qingqing and Ciosici, Manuel R and Hassid, Michael and Heafield, Kenneth and Hooker, Sara and Martins, Pedro H and others},
548 title = {Efficient methods for natural language processing: a survey},
549}
550
551@inproceedings{lee2022deduplicating,
552 year = {2022},
553 pages = {8424--8445},
554 booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
555 author = {Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas},
556 title = {Deduplicating Training Data Makes Language Models Better},
557}
558
559@inproceedings{mishra2020we,
560 year = {2020},
561 pages = {169--173},
562 booktitle = {Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing},
563 author = {Mishra, Swaroop and Sachdeva, Bhavdeep Singh},
564 title = {Do we need to create big datasets to learn a task?},
565}
566
567@inproceedings{Bengio2009CurriculumL,
568 year = {2009},
569 booktitle = {International Conference on Machine Learning},
570 author = {Yoshua Bengio and J{\'e}r{\^o}me Louradour and Ronan Collobert and Jason Weston},
571 title = {Curriculum learning},
572}
573
574@article{Ren2020ASO,
575 pages = {1 - 40},
576 volume = {54},
577 year = {2020},
578 journal = {ACM Computing Surveys (CSUR)},
579 author = {Pengzhen Ren and Yun Xiao and Xiaojun Chang and Po-Yao Huang and Zhihui Li and Xiaojiang Chen and Xin Wang},
580 title = {A Survey of Deep Active Learning},
581}
582
583@article{Brown2020LanguageMA,
584 volume = {abs/2005.14165},
585 year = {2020},
586 journal = {ArXiv},
587 author = {Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwal and Ariel Herbert-Voss and Gretchen Krueger and T. J. Henighan and Rewon Child and Aditya Ramesh and Daniel M. Ziegler and Jeff Wu and Clemens Winter and Christopher Hesse and Mark Chen and Eric Sigler and Mateusz Litwin and Scott Gray and Benjamin Chess and Jack Clark and Christopher Berner and Sam McCandlish and Alec Radford and Ilya Sutskever and Dario Amodei},
588 title = {Language Models are Few-Shot Learners},
589}
590
591@article{thompson2020computational,
592 year = {2020},
593 journal = {arXiv preprint arXiv:2007.05558},
594 author = {Thompson, Neil C and Greenewald, Kristjan and Lee, Keeheon and Manso, Gabriel F},
595 title = {The computational limits of deep learning},
596}
597
598@article{wu2022sustainable,
599 year = {2022},
600 pages = {795--813},
601 volume = {4},
602 journal = {Proceedings of Machine Learning and Systems},
603 author = {Wu, Carole-Jean and Raghavendra, Ramya and Gupta, Udit and Acun, Bilge and Ardalani, Newsha and Maeng, Kiwan and Chang, Gloria and Aga, Fiona and Huang, Jinshi and Bai, Charles and others},
604 title = {Sustainable ai: Environmental implications, challenges and opportunities},
605}
606
607@inproceedings{DeCao2021EditingFK,
608 year = {2021},
609 booktitle = {Conference on Empirical Methods in Natural Language Processing},
610 author = {Nicola De Cao and Wilker Aziz and Ivan Titov},
611 title = {Editing Factual Knowledge in Language Models},
612}
613
614@article{Zhu2020ModifyingMI,
615 volume = {abs/2012.00363},
616 year = {2020},
617 journal = {ArXiv},
618 author = {Chen Zhu and Ankit Singh Rawat and Manzil Zaheer and Srinadh Bhojanapalli and Daliang Li and Felix X. Yu and Sanjiv Kumar},
619 title = {Modifying Memories in Transformer Models},
620}
621
622@inproceedings{mitchellfast,
623 booktitle = {International Conference on Learning Representations},
624 author = {Mitchell, Eric and Lin, Charles and Bosselut, Antoine and Finn, Chelsea and Manning, Christopher D},
625 title = {Fast Model Editing at Scale},
626}
627
628@article{yang2022diffsound,
629 year = {2022},
630 journal = {arXiv preprint arXiv:2207.09983},
631 author = {Yang, Dongchao and Yu, Jianwei and Wang, Helin and Wang, Wen and Weng, Chao and Zou, Yuexian and Yu, Dong},
632 title = {Diffsound: Discrete diffusion model for text-to-sound generation},
633}
634
635@article{kreuk2022audiogen,
636 year = {2022},
637 journal = {arXiv preprint arXiv:2209.15352},
638 author = {Kreuk, Felix and Synnaeve, Gabriel and Polyak, Adam and Singer, Uriel and D{\'e}fossez, Alexandre and Copet, Jade and Parikh, Devi and Taigman, Yaniv and Adi, Yossi},
639 title = {Audiogen: Textually guided audio generation},
640}
641
642@article{borsos2022audiolm,
643 year = {2022},
644 journal = {arXiv preprint arXiv:2209.03143},
645 author = {Borsos, Zal{\'a}n and Marinier, Rapha{\"e}l and Vincent, Damien and Kharitonov, Eugene and Pietquin, Olivier and Sharifi, Matt and Teboul, Olivier and Grangier, David and Tagliasacchi, Marco and Zeghidour, Neil},
646 title = {Audiolm: a language modeling approach to audio generation},
647}
648
649@misc{musiclm,
650 copyright = {arXiv.org perpetual, non-exclusive license},
651 year = {2023},
652 publisher = {arXiv},
653 title = {MusicLM: Generating Music From Text},
654 keywords = {Sound (cs.SD), Machine Learning (cs.LG), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering},
655 author = {Agostinelli, Andrea and Denk, Timo I. and Borsos, Zalán and Engel, Jesse and Verzetti, Mauro and Caillon, Antoine and Huang, Qingqing and Jansen, Aren and Roberts, Adam and Tagliasacchi, Marco and Sharifi, Matt and Zeghidour, Neil and Frank, Christian},
656 url = {https://arxiv.org/abs/2301.11325},
657 doi = {10.48550/ARXIV.2301.11325},
658}
659
660@inproceedings{zhang2021vinvl,
661 year = {2021},
662 pages = {5579--5588},
663 booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
664 author = {Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
665 title = {Vinvl: Revisiting visual representations in vision-language models},
666}
667
668@article{zhou2022learning,
669 publisher = {Springer},
670 year = {2022},
671 pages = {2337--2348},
672 number = {9},
673 volume = {130},
674 journal = {International Journal of Computer Vision},
675 author = {Zhou, Kaiyang and Yang, Jingkang and Loy, Chen Change and Liu, Ziwei},
676 title = {Learning to prompt for vision-language models},
677}
678
679@inproceedings{hecater,
680 booktitle = {Advances in Neural Information Processing Systems},
681 author = {He, Xuanli and Xu, Qiongkai and Zeng, Yi and Lyu, Lingjuan and Wu, Fangzhao and Li, Jiwei and Jia, Ruoxi},
682 title = {CATER: Intellectual Property Protection on Text Generation APIs via Conditional Watermarks},
683}
684
685@inproceedings{he2022protecting,
686 year = {2022},
687 pages = {10758--10766},
688 number = {10},
689 volume = {36},
690 booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
691 author = {He, Xuanli and Xu, Qiongkai and Lyu, Lingjuan and Wu, Fangzhao and Wang, Chenguang},
692 title = {Protecting intellectual property of language generation apis with lexical watermark},
693}
694
695@article{kirchenbauer2023watermark,
696 year = {2023},
697 journal = {arXiv preprint arXiv:2301.10226},
698 author = {Kirchenbauer, John and Geiping, Jonas and Wen, Yuxin and Katz, Jonathan and Miers, Ian and Goldstein, Tom},
699 title = {A Watermark for Large Language Models},
700}
701
702@inproceedings{kandpal2022deduplicating,
703 organization = {PMLR},
704 year = {2022},
705 pages = {10697--10707},
706 booktitle = {International Conference on Machine Learning},
707 author = {Kandpal, Nikhil and Wallace, Eric and Raffel, Colin},
708 title = {Deduplicating training data mitigates privacy risks in language models},
709}
710
711@article{scao2022bloom,
712 year = {2022},
713 journal = {arXiv preprint arXiv:2211.05100},
714 author = {Scao, Teven Le and Fan, Angela and Akiki, Christopher and Pavlick, Ellie and Ili{\'c}, Suzana and Hesslow, Daniel and Castagn{\'e}, Roman and Luccioni, Alexandra Sasha and Yvon, Fran{\c{c}}ois and Gall{\'e}, Matthias and others},
715 title = {Bloom: A 176b-parameter open-access multilingual language model},
716}
717
718@article{allal2023santacoder,
719 year = {2023},
720 journal = {arXiv preprint arXiv:2301.03988},
721 author = {Allal, Loubna Ben and Li, Raymond and Kocetkov, Denis and Mou, Chenghao and Akiki, Christopher and Ferrandis, Carlos Munoz and Muennighoff, Niklas and Mishra, Mayank and Gu, Alex and Dey, Manan and others},
722 title = {SantaCoder: don't reach for the stars!},
723}
724
725@article{aghajanyan2023scaling,
726 year = {2023},
727 journal = {arXiv preprint arXiv:2301.03728},
728 author = {Aghajanyan, Armen and Yu, Lili and Conneau, Alexis and Hsu, Wei-Ning and Hambardzumyan, Karen and Zhang, Susan and Roller, Stephen and Goyal, Naman and Levy, Omer and Zettlemoyer, Luke},
729 title = {Scaling Laws for Generative Mixed-Modal Language Models},
730}
731
732@article{cao2022ai,
733 publisher = {ACM New York, NY},
734 year = {2022},
735 pages = {1--38},
736 number = {3},
737 volume = {55},
738 journal = {ACM Computing Surveys (CSUR)},
739 author = {Cao, Longbing},
740 title = {Ai in finance: challenges, techniques, and opportunities},
741}
742
743@article{borji2023categorical,
744 year = {2023},
745 journal = {arXiv preprint arXiv:2302.03494},
746 author = {Borji, Ali},
747 title = {A Categorical Archive of ChatGPT Failures},
748}
Attribution
arXiv:2301.12867v4 [cs.CL]
License: CC BY 4.0