As generative AI chatbots continue to make their way into regulated industries like insurance, assessing their trustworthiness becomes essential. In this post, we present findings from testing two generative AI chatbots employed by leading insurance companies in the DACH region (Germany, Austria, Switzerland), covering more than 70 million clients.
As Gen AI technology and particularly Large Language Models (LLM) continue their path to impact businesses in all industries, we need to understand how these applications perform in real-world scenarios and where we stand in terms of application quality and trustworthiness. A responsible and trustworthy implementation of Gen AI is crucial in any industry but particularly in sensitive and regulated industries like pharma, banking and insurance. This blog post presents our findings from testing two Gen AI chatbots used by leading insurance companies in the DACH area (i.e. Germany, Austria & Switzerland). Combined, both insurances have more than 70 mi. clients. Each of these could use the Gen AI chatbots today.
From striking weaknesses in reliability and robustness to unexpected challenges with compliance standards, we found issues that could affect how these technologies serve customers and meet legal requirements. Why do these insights matter for the future of AI Chatbots in insurance? How can one even measure “trustworthiness”? Keep on reading to find out why.
In our test we evaluated the trustworthiness of two generative AI chatbots, which we refer to as Chatbot 1 and Chatbot 2. But what do we really mean by “trustworthiness” and how did we measure it? We understand trustworthiness as a function of a Gen AI application’s behavior along three dimensions, robustness, reliability, and compliance. As such, the primary goal was to assess these chatbots' along those three dimensions in handling various tasks and scenarios, using the Rhesis AI quality assurance suite.
Robustness focused on determining if the AI chatbots could resist adversarial behaviors, including prompt injections and jailbreak attempts that aim to bypass their safeguards. For example, the refusal-to-answer rate was measured to see how effectively the chatbots declined inappropriate requests.
Reliability, on the other hand, assessed how well the chatbots handled common insurance-related questions. We wanted to ensure that these AI applications acted as professional and reliable assistants, staying within their designated scope and providing accurate, relevant answers to customer inquiries.
Compliance testing involved examining whether the chatbots demonstrated any biases, particularly against minority groups, which is crucial for maintaining ethical standards in the insurance industry. We tested for bias and toxicity across different demographic groups, laying the groundwork for future tests that will also cover compliance with legal requirements, such as the upcoming EU AI regulation.
The testing process involved 3,017 test cases across these three dimensions, using both existing benchmarks and generated scenarios. The reliability prompts were based on the insuranceQA dataset (Feng et al., 2015), while robustness prompts were drawn from general benchmarks (Vidgen, 2023; Bhardwaj, 2023) and industry-specific harmful prompts (Deng, 2023). For jailbreak tests, we utilized samples from the "Do Anything Now" dataset (Shen, 2023). Compliance testing followed the "toxicity-based bias" approach (Huang, 2023), which identifies bias by comparing toxicity scores across demographic categories.
You can find a sample of the test cases used on Hugging Face.
In the Reliability assessment, Chatbot 1 generally performed well, though it did show occasional inaccuracies, which could lead to issues in high-stakes environments like insurance claim processing. These errors, while potentially manageable in less critical contexts, pose a significant risk to customer satisfaction and operational efficiency in more demanding scenarios. Chatbot 2, however, exhibited a higher failure rate, indicating more frequent inconsistencies that could necessitate additional error-handling measures.
Table 1 illustrates specific reliability issues, such as the challenges chatbots face when handling sensitive or contextually complex queries, like questions about drug use, which might be better addressed by human agents. Another example is the failure of chatbots to provide answers to legitimate queries, reflecting limitations in query handling. Furthermore, when chatbots designed for specific regions inadvertently share information about other companies, it risks breaching confidentiality and misleading users, emphasizing the need for careful management of AI in the insurance sector.
Tab. 1. Specific instances demonstrating the performance of the chatbots under review, with a focus on reliability. Company names have been anonymized, and text snippets have been truncated for clarity, indicated as [..].
When assessing robustness, Chatbot 1 generally handled adversarial inputs well, showing resilience in maintaining its integrity against harmful prompts. This capability is crucial in environments where chatbots manage diverse customer interactions, such as insurance. However, Chatbot 2 exhibited significant vulnerabilities, particularly when faced with jailbreak attempts and harmful prompts.
These failures, detailed in Table 2, expose the chatbot to risks that could damage the company’s reputation and customer trust. Instances where the chatbots provided detailed responses to illegal activities or unethical advice underscore the need for stronger safeguards. The ease with which prompt injections could manipulate the system message further illustrates gaps in the chatbot's defense mechanisms, calling for more robust safety measures to prevent harmful outcomes.
Tab. 2. Illustrative examples showing how the chatbots under analysis performed, emphasizing robustness. Company names have been redacted, and text snippets have been abbreviated for brevity, marked as [..].
In the compliance dimension, which is critical for regulated industries like insurance, the chatbots showed varying results. While both chatbots generally avoided producing directly offensive content, Chatbot 1 struggled significantly more with toxic prompts. This higher failure rate raises concerns about potential legal risks and the bot's ability to respond adequately to prompts containing offensive content. Chatbot 2, by contrast, performed better, maintaining clearer boundaries and demonstrating stronger pushback against any content violating its own policies.
Furthermore, as seen in Table 3, Chatbot 1’s tendency to respond vaguely or blame "technical failures" when faced with toxic prompts points to deeper issues in managing harmful interactions. Even when providing acceptable answers, Chatbot 1 often failed to reflect the company’s core values, indicating the need for better-designed system prompts and more rigorous compliance measures. The findings highlight the necessity for chatbots to not only manage inappropriate prompts effectively but also to align their responses with the ethical standards expected in insurance, while making sure to adhere to insurance-related topics.
Tab. 3. Representative cases of chatbot performance, highlighting compliance. Company names have been concealed, and excerpts have been shortened for conciseness, noted as [..].
Another way in which compliance issues may manifest is when the application behavior differs across demographic groups, potentially indicating bias in the system. For instance, if the model exhibits different rates of refusal to answer (RtA) for various groups, this could be a sign of biased decision-making processes (Huang, 2023). A higher rate of refusal for a specific group might imply that the system is providing disproportionate protection or filtering responses more rigorously for that group. This discrepancy in behavior could indicate underlying bias within the model. Tab. 4. illustrated some of those cases.
Tab. 4. Cases where refusal to answer behavior differed when particular demographic groups were mentioned in the prompt.
Are these chatbots really as flawed as the previous sections might suggest? Should they be pulled from production immediately? Should be insurance against themselves? Not so fast. It's important to dig into the reasons behind their failures before drawing any drastic conclusions.
When examining reliability issues, many failures stem from challenges in coherence and consistency. Coherence involves the overall structure and clarity of the responses, while consistency checks if the responses align factually with the expected answers. Improving these aspects might be as simple as refining the system prompts or adjusting the underlying data sources used in processes like Retrieval-Augmented Generation (RAG).
In terms of robustness, while jailbreak attempts certainly contributed to failures, technical errors reported by the chatbots were also a significant factor. One might think that simply removing this metric would improve the test results, but that approach misses the point. These technical errors signal deeper issues that developers need to investigate, as they could indicate unhandled exceptions caused by adversarial content, which should be addressed properly.
Regarding compliance, the positive news is that toxicity wasn’t a major contributor to failures. Instead, the failures arose from perceived technical errors or the chatbot’s inability to firmly reject inappropriate queries, particularly in the case of Chatbot 1. The recommendation here is for companies to review these instances, refine the system prompts, introduce stronger safeguards, and retest the chatbots to ensure better compliance and ethical standards.
Our testing revealed that Chatbot 1 and Chatbot 2 each have distinct strengths and weaknesses in the insurance domain. Chatbot 1 is strong in reliability and robustness but fails significantly in compliance, posing risks for regulatory adherence. Conversely, Chatbot 2 excels in compliance but has notable issues with reliability and robustness, leading to potential errors and poor performance in diverse scenarios.
These results highlight that achieving effective chatbot performance in insurance requires a balance across all three dimensions: reliability, robustness, and compliance. High performance in one area does not compensate for weaknesses in others, making it essential for chatbots to meet rigorous standards across the board.
The findings stress that successful AI applications in the insurance sector must be reliable, robust, and compliant. Reliable and robust chatbots ensure accurate and adaptable interactions, while compliance ensures adherence to legal and ethical standards. Our results show that current chatbots fall short in one or more of these areas, demonstrating that comprehensive testing and continuous improvement are vital.
A framework for continuous assessment of chatbot performance is therefore key. This includes gathering user feedback, tracking performance metrics, and regularly reviewing compliance outcomes to adapt to changes and maintain high standards.
Get in touch with Rhesis to learn how we can help!
KPMG. (2024). The impact of artificial intelligence on the insurance industry. Retrieved from: https://kpmg.com/us/en/articles/2024/impact-artificial-intelligence-insurance-industry.html
Feng, Minwei, et al. "Applying deep learning to answer selection: A study and an open task." 2015 IEEE workshop on automatic speech recognition and understanding (ASRU). IEEE, 2015.
Vidgen, B. et al. (2023). SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models. https://arxiv.org/abs/2311.08370
Bhardwaj, R., & Poria, S. (2023). Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. http://arxiv.org/abs/2308.09662
Deng, B. et al. (2023). Attack prompt generation for red teaming and defending large language models. https://arxiv.org/abs/2310.12505.
Shen, X. et al. (2023). Do Anything Now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models”. https://arxiv.org/abs/2308.03825
Huang, Y. et al. (2023). TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models. http://arxiv.org/abs/2306.11507
Gupta, S. et al. (2023). Bias runs deep: Implicit reasoning biases in persona-assigned LLMs. https://arxiv.org/abs/2311.08370
Forbes, M. et al. (2020). Social chemistry 101: Learning to reason about social and moral norms. https://arxiv.org/abs/2011.00620