In my previous post, I introduced the quality model depicted in Figure 1. This flower model depicts the AI-related quality properties of AI-enabled systems, on top of the quality properties for software systems as specified in the ISO 25000 series.
With the advent of foundation models and generative AI, especially the recent explosion in Large Language Models (LLMs), we see our students and the companies around us build a whole new type of AI-enabled system: LLM-based systems, or LLM systems in short. The most well-known example of an LLM system is the chatbot ChatGPT. Inspired by ChatGPT and its possibilities, many developers want to build their own chatbots, trained on their own set of documents, e.g. as an intelligent search engine. For this specific text generation task they have to 1) select the most appropriate LLM and sometimes fine-tune it, 2) engineer the document retrieval step (Retrieval Augmented Generation, RAG), 3) engineer the prompt, and 4) engineer a user interface that hides the complexity of prompts and answers from end users.
Prompt engineering in particular is a new activity introduced by LLM systems. It is intrinsically hard: the possibilities are endless, prompts are hard to test or compare, the results might vary across LLMs or model versions, prompts are difficult to debug, and you need domain expertise (and language skills!) to engineer fitting prompts for the task at hand. For LLMs, however, prompt engineering is the main way the models can be adapted to support specific tasks.
So, where in my previous post I concluded that AI-enabled systems are data + model + code, for LLM systems we must conclude that they are data + model + prompt + code. Note also that with LLM systems the model is usually provided by an external party and is thus hard or impossible for the developer to control, other than by engineering prompts. The external party might, however, frequently update its LLM, which may necessitate an update of the LLM system as well.
In this post, I analyze the quality characteristics of LLM systems and discuss the challenges for engineering LLM systems. I also present the solutions I have found until now to address the quality characteristics and the challenges. In future work we will engineer LLM systems together with our students and industry partners, thereby adding to the body of knowledge on LLM engineering.
Figure 1: quality model for trustworthy AI
Quality Requirements for Trustworthy LLM Systems
Let us imagine an LLM system that can do retrieval augmented generation on all on-boarding documentation of company X. The LLM system has a Q&A interface in which (new) employees of company X can ask questions. The LLM system retrieves the answer from the on-boarding documentation and presents it to the employee. The idea is that this Q&A system provides a better user experience than a plain search engine on the same documents.
Now we will analyze the LLM system based on the quality properties depicted in Figure 1.
Model Correctness
As said, an external party provides the model. However, we need to define how we are going to evaluate that model for correctness. With text generation this is far more difficult, because multiple different answers might be correct, as long as they contain the information the user was looking for or serve the purpose the user was generating text for. As noted above, we are in fact not just checking the model correctness, but the combination of prompt and model. And in the case of Retrieval Augmented Generation, we are also checking the correctness of the query that retrieves documents (see RAG triad). For each task, one should define task-specific evaluation metrics for correctness.
In our example LLM system we could imagine defining a set of typical questions and answers for which a developer can write automated tests (e.g., “What are the opening hours of the help desk?”), or which a developer can easily verify by hand. We could also include questions for which the answers are not in the documents, to see if we get a proper “I do not know” from the system. Then we would require the LLM system to answer 100% of the sample questions correctly, as sketched below.
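A minimal sketch of such an automated check, assuming a hypothetical ask_llm_system() wrapper around the RAG pipeline; the questions and the keywords a correct answer must contain are illustrative assumptions:

```python
# Minimal correctness check for the example Q&A system.
# ask_llm_system() is a hypothetical wrapper around the RAG pipeline.

GOLDEN_QA = [
    # (question, keywords that a correct answer must contain)
    ("What are the opening hours of the help desk?", ["9:00", "17:00"]),
    ("Who do I contact to get a laptop?", ["IT service desk"]),
    # a question whose answer is NOT in the documents:
    ("What is the capital of France?", ["I do not know"]),
]

def evaluate_correctness(ask_llm_system):
    failures = []
    for question, keywords in GOLDEN_QA:
        answer = ask_llm_system(question)
        if not all(kw.lower() in answer.lower() for kw in keywords):
            failures.append((question, answer))
    # We require 100% of the sample questions to be answered correctly.
    return len(failures) == 0, failures
```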
Model Robustness
In the case of an in-house LLM system one might not expect many adversarial attacks on the system. However, we could imagine that employees without the proper access rights might try to trick the LLM system into disclosing sensitive information by asking “attack questions”. Again, one should analyze the possible attacks or unwanted disclosures for the set of documents upfront and design the system in such a way as to maximize prevention.
In our example LLM system we could require that the system never discloses, e.g., actual usernames and passwords (“What is the administrator password for the AWS server?”). Thus we need to program a question-and-answer checker that removes that kind of content, or engineer the prompt so that it is never included in the answer.
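A simple post-processing checker along these lines could look as follows; the regular expressions are assumptions and would need to be tailored to the documents of company X:

```python
import re

# Patterns for credential-like content; these patterns are illustrative
# assumptions and certainly not exhaustive.
SENSITIVE_PATTERNS = [
    re.compile(r"password\s*[:=]\s*\S+", re.IGNORECASE),
    re.compile(r"api[_-]?key\s*[:=]\s*\S+", re.IGNORECASE),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # typical AWS access key id prefix
]

def check_answer(answer: str) -> str:
    """Block answers that appear to contain credentials."""
    for pattern in SENSITIVE_PATTERNS:
        if pattern.search(answer):
            return "I cannot share credentials. Please contact the IT department."
    return answer
```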
Reproducibility
Reproducibility and traceability of results are extremely hard for LLM systems. Even with the exact same prompt and the exact same model version, the output of the LLM might differ. The LLM itself is in essence a black box, which hinders traceability from input to output. In the case of RAG we would also like to see which documents or document chunks have been retrieved to generate answers from.
In our example LLM system we would certainly want to provide a link to the location in the document where the answer was found, such that the employee can verify the correctness of the answer and can easily see important contextual information related to their question. Reproducing the exact same answer is not so relevant for our system, as long as the required information is still present in the answer. For auditing and debugging purposes it might be a good idea to log questions and answers, as well as the prompt and model versions used.
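A minimal logging sketch for such an audit trail, assuming the retrieved chunk identifiers and the prompt and model versions are available at answer time (the field names are assumptions):

```python
import json
from datetime import datetime, timezone

def log_interaction(logfile, question, answer, chunk_ids, prompt_version, model_version):
    """Append one Q&A interaction to a JSON-lines audit log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "retrieved_chunks": chunk_ids,     # which document chunks were used
        "prompt_version": prompt_version,  # version of the prompt template
        "model_version": model_version,    # e.g. the LLM (API) model identifier
    }
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```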
Explainability
The Q&A interface of most LLM systems provides a natural way for users to ask the LLM system for explanations in their own language. The developer could also include such a request for explanation in the prompt itself, and/or expose intermediate steps to the user (e.g., show which documents were retrieved or intermediate prompt results). For software engineers, tools are being developed for what is called LLM observability, with which the black-box LLM models expose some of their inner workings.
In our example LLM system the previously mentioned link to the source document might provide enough explainability for end users. If the wrong document is retrieved, they might get a hint from it on how to rephrase their question. If they still do not succeed, they should report their question to the developers. The developers could then use observability tools to find out why the LLM system is not capable of answering the question.
Controllability
As said, LLM systems are controlled through the prompt that is sent to them (or, at an earlier stage, the questions the user asks in the Q&A interface). We have seen examples of systems where the level of detail the user needs to enter differs per user type. E.g., a business user asks simple questions in natural language or even has one single button per action (“Summarize this document” or “Come up with a good title for this article”), while a technical user can write detailed prompts for the LLM. The level of human control needed might vary per task and per user, but control-related requirements should be included in system design and interface design.
In our example LLM system the control of the system through a Q&A interface is enough, since the system will not be used to make decisions or execute tasks directly. It is just a lookup of information with a human-in-the-loop that needs to take this information as a basis for executing other tasks (e.g. call the helpdesk within opening hours).
Collaboration Effectiveness
For an LLM system to effectively support users in their tasks, its user interface should be designed in such a way that users can achieve the desired outcome with as few clicks, as little text input, and as little waiting time as possible. Developers should analyze the intended use of the system, but also observe the actual use of the system to achieve this. The standard Q&A interface of chatbots can be improved upon by adding, e.g., shortcuts, buttons, and templates. Ideally, the system provides a way for the user to give feedback on the correctness and usefulness of the generated text.
In our example LLM system we could think of shortcuts for FAQs or templates for certain question types (“Who is the project manager of project <X>?”), as sketched below.
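A sketch of what such shortcut templates could look like in code; the template names and placeholders are purely illustrative assumptions:

```python
# Hypothetical shortcut templates for the Q&A interface of the example system.
QUESTION_TEMPLATES = {
    "project_manager": "Who is the project manager of project {project}?",
    "helpdesk_hours": "What are the opening hours of the help desk?",
    "leave_policy": "How many vacation days do I get as a {contract_type} employee?",
}

def build_question(template_id: str, **kwargs) -> str:
    """Fill in a template so the user only has to pick a shortcut and a value."""
    return QUESTION_TEMPLATES[template_id].format(**kwargs)

# Example: build_question("project_manager", project="Apollo")
```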
Human Autonomy
One of the biggest worries around ChatGPT is that similar tools will take over our jobs or tasks, thereby threatening human autonomy. Our take on this is that we should develop LLM systems where the human is always in the loop, in control, and supported, instead of replaced. That way we can outsource boring or easy tasks to LLM systems and concentrate on the overarching goals and results we want to achieve. LLM systems are our smart assistants, which also sometimes make mistakes for us to correct.
In our example LLM system the activity of on-boarding new employees becomes more attractive because the LLM system provides the dry information and the HR employee can focus on the interpersonal aspects of on-boarding, e.g., making the new employee feel at home.
Fairness
It is very difficult to detect and prevent bias in LLM output, because it is usually not known on what exact corpus the models have been trained and what was added afterwards through, e.g., reinforcement learning from human feedback (RLHF). When the task at hand is sensitive to discrimination (of individuals, teams, sexes, departments, ethnic groups, roles, positions, locations, languages, countries, etc.), the developers should design careful test cases for the LLM system and keep monitoring the LLM system in production as well.
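One way to design such test cases is to ask the same question for different groups and compare the answers. A minimal sketch, with assumed group names and the same hypothetical ask_llm_system() wrapper as before:

```python
# Paired-prompt bias probe: the same question is asked for different groups
# and the answers are compared. The groups and question are assumptions.
GROUPS = ["team A", "team B"]
QUESTION = "What training budget is available for members of {group}?"

def probe_bias(ask_llm_system):
    answers = {g: ask_llm_system(QUESTION.format(group=g)) for g in GROUPS}
    # This simple sketch only flags differing answers for manual review;
    # a human still judges whether a difference is actually unfair.
    identical = len(set(answers.values())) == 1
    return identical, answers
```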
In our example LLM system we hope not to suffer too much from bias, since the on-boarding documentation is written in neutral style and for the entire company. Our LLM system is simply disclosing the facts that are already in there. The LLM system is not taking any actions by itself or making up information, thus diminishing the risk for unintended discrimination by the system.
Privacy
As LLMs have no understanding of privacy, they might output personal or sensitive information. The developer could explicitly ask in the prompt not to disclose, e.g., email addresses, phone numbers, or birth dates, or check for such content after the answer comes back from the LLM. Of course, a first step is to have the source documents checked for sensitive information by domain experts. But it could also pay off to brainstorm with them about how the set of documents could disclose patterns that harm the privacy of the people involved.
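Complementary to the credential checker sketched earlier, the prompt itself can instruct the model not to disclose personal data, and a simple post-check can redact common patterns. The instruction text and patterns below are assumptions and not exhaustive:

```python
import re

# System prompt instructing the LLM not to disclose personal data (assumed wording).
SYSTEM_PROMPT = (
    "Answer questions using only the provided on-boarding documents. "
    "Never include email addresses, phone numbers or birth dates in your answer."
)

# Post-check: redact common PII patterns that slip through anyway.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d \-]{8,}\d"),
}

def redact_pii(answer: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        answer = pattern.sub(f"[{label} removed]", answer)
    return answer
```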
In our example LLM system, a combination of salary scales, role descriptions, project documentation, and team positions might for example disclose that members of project team A are in the same roles but in higher positions, and thus higher salary scales, than members of project team B. A solution to this might be to remove detailed project documentation from the system and only include a list of projects with their project leader’s contact information and a short description of each project.
LLM engineering or LLMOps
In my analysis of quality requirements for trustworthy LLM systems, I already highlighted several challenges for engineering trustworthy LLM systems. In this section, I list the most important challenges and discuss the solutions I have found until now.
Challenge 1: Prompt Engineering
Prompt engineering plays a crucial role in creating task-specific LLM systems. From a software engineering perspective, this means that the prompt itself becomes an important asset in the software development process. We will need proper means for prompt versioning (like DVC for dataset versioning) and prompt experimentation (like MLflow for model experimentation). Some prompt engineering tools and IDEs are being developed, but nothing comprehensive exists yet.
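Until dedicated tooling matures, a pragmatic approach is to keep prompts as versioned, git-tracked files with some metadata, so that the prompt version can be logged with every answer. A minimal sketch; the file layout and fields are assumptions:

```python
import json

# prompts/qa_prompt_v3.json -- a git-tracked prompt file (layout is an assumption):
# {
#   "id": "qa_prompt",
#   "version": 3,
#   "template": "Answer the question using only the context below.\nContext: {context}\nQuestion: {question}"
# }

def load_prompt(path: str) -> dict:
    """Load a versioned prompt template so its id/version can be logged per answer."""
    with open(path, encoding="utf-8") as f:
        prompt = json.load(f)
    assert {"id", "version", "template"}.issubset(prompt)
    return prompt

def render_prompt(prompt: dict, **kwargs) -> str:
    return prompt["template"].format(**kwargs)
```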
Challenge 2: Prompt Testing
As stated above, it is crucial to develop good test cases for your LLM system, based on the task that the user wants to achieve (model correctness) and the risks of bias (fairness), adversarial attacks (model robustness), and privacy violations. For testing LLMs, a number of frameworks are being proposed, but here too we do not see standardization yet. The aforementioned observability tooling could help with debugging from the developer side as well.
Challenge 3: Document Retrieval and Storage
RAG systems that work by retrieving information from proprietary documents, such as our example LLM system, are limited by what is called the context window of the LLM. As a result, the retrieval step retrieves so-called document chunks instead of full documents. The way a document is divided into chunks greatly influences the understandability of the LLM results, because usually we want to present the relevant chunks to the user as well. So we need chunks that make sense to an end user. In the same way, the way the chunks are stored and indexed determines the usefulness of the results. In a vector database, chunks are usually represented as embedding vectors and retrieved based on their cosine similarity in an n-dimensional vector space, whereas in a graph database, chunks could also have other types of relations to each other. Future work for us is to validate whether the concept of knowledge graphs can be practically applied to improve the results of retrieval augmented generation tasks.
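The sketch below illustrates the core of this retrieval step: documents split into paragraph-sized chunks and retrieved by cosine similarity of their embeddings. The embedding step itself is left out; chunk_vecs and query_vec are assumed to come from whatever embedding model is used:

```python
import numpy as np

def chunk_document(text: str, max_chars: int = 1000) -> list[str]:
    """Split a document on blank lines into chunks that still make sense to a reader."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray, chunks: list[str], k: int = 3):
    """Return the k chunks whose embeddings have the highest cosine similarity to the query."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top = np.argsort(sims)[::-1][:k]
    return [(chunks[i], float(sims[i])) for i in top]
```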
Challenge 4: Third-Party Models
LLM systems are based on third-party pre-trained LLMs for which often no specific information is available on how and with what data they have been trained. This introduces additional risks of toxic content, copyright infringement, discrimination, privacy violations, misinformation, etc. The developer using a third-party LLM cannot shift all responsibility for this to the LLM provider and needs to design additional tests and monitoring to mitigate these risks where applicable for the task at hand; see also the earlier paragraph on Explainability.
Challenge 5: Frequent Model Updates
Through techniques like the aforementioned RLHF, LLMs are constantly being updated. This means developers constantly have to consider updating the in-house retrained LLM or the external LLM API to the newest version of the LLM. Often they have no choice, because newer versions correct known problems with toxicity, bias, or hallucinations. This means that the LLM system should be set up in such a way that new LLM versions can be integrated as quickly as possible. A proper regression test set should be run to immediately detect regressions on existing prompts.
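Such a regression check can reuse the golden question set from the Model Correctness section and run it against both the current and the candidate model version. A minimal sketch, again assuming a hypothetical ask_llm_system(question, model=...) wrapper:

```python
# Regression check when a new LLM version becomes available.
# ask_llm_system(question, model=...) is a hypothetical wrapper; golden_qa is
# the (question, keywords) set from the Model Correctness section.

def regression_check(ask_llm_system, golden_qa, old_model: str, new_model: str):
    regressions = []
    for question, keywords in golden_qa:
        old_ok = all(kw.lower() in ask_llm_system(question, model=old_model).lower()
                     for kw in keywords)
        new_ok = all(kw.lower() in ask_llm_system(question, model=new_model).lower()
                     for kw in keywords)
        if old_ok and not new_ok:
            regressions.append(question)  # the new model broke a previously correct answer
    return regressions
```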
Challenge 6: Language Engineering
Working with natural language in your software system is not the same as working with programming languages. Natural language is non-deterministic, ambiguous, vague, and may contain hidden messages such as sentiment. To create the perfect prompt, the developer needs to speak the language of the domain or experiment together with domain experts. Because of this, prompt engineering is an activity that not all developers may be comfortable with. More and more techniques are being developed where other LLMs are used to evaluate the output of an LLM (“is the answer correct” or “does the answer contain the intended information”).
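A minimal sketch of this LLM-as-a-judge idea, where llm_call() is a placeholder for whatever LLM API is used and the judge prompt wording is an assumption:

```python
# LLM-as-a-judge: a second LLM call evaluates whether the generated answer
# contains the intended information. llm_call() is a hypothetical wrapper
# around the LLM API in use.

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Generated answer: {generated}\n"
    "Does the generated answer contain the information of the reference answer? "
    "Reply with only YES or NO."
)

def judge_answer(llm_call, question: str, reference: str, generated: str) -> bool:
    verdict = llm_call(JUDGE_PROMPT.format(
        question=question, reference=reference, generated=generated))
    return verdict.strip().upper().startswith("YES")
```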
Challenge 7: Resource Utilization
Running an in-house LLM is often very compute-intensive, even when making use of pre-trained LLMs. However, most companies want to work with in-house LLMs because they do not trust sending their proprietary data to third-party cloud APIs. There are several developments on small (also search for baby, tiny, or mini) language models, especially domain-specific ones, that achieve results similar to LLMs on specific tasks.
Conclusion
In this post we have analyzed the characteristics of LLM-based systems with the goal of providing an overview of what is needed to build such systems. Future work remains to complement this overview with ongoing developments. We are also working on a reference architecture for trustworthy LLM systems that addresses all the challenges above and ensures the desired quality properties.
Related Work
- https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm
- https://www.insightpartners.com/ideas/llmops-mlops-what-you-need-to-know/
- https://content.dataiku.com/responsible-generative-ai
- https://www.acm.org/binaries/content/assets/public-policy/ustpc-approved-generative-ai-principles
- https://arxiv.org/abs/2303.18223 A Survey of Large Language Models, Zhao et al., 2023
- https://www.ecde.nl/chatgpt Ethiek van ChatGPT, ECDE, 2023 (in Dutch)
- https://www.government.nl/documents/parliamentary-documents/2024/01/17/government-wide-vision-on-generative-ai-of-the-netherlands
- https://www.who.int/publications/i/item/9789240084759 Ethics and governance of artificial intelligence for health: Guidance on large multi-modal models, WHO, 2024
- https://start.me/p/m6vba8/adversarial-ai Many links on security for LLMs
About Petra Heck
Petra has worked in ICT since 2002, starting as a software engineer, then as a quality consultant, and now as a lecturer in Software Engineering. She holds a PhD (on the quality of agile requirements) and has been doing research on Applied Data Science and Software Engineering since February 2019. Petra regularly gives talks and is the author of various publications, including the book "Succes met de requirements".