How Does Context Change the Way We Evaluate AI Answers?

Authors:
(1) Clemencia Siro, University of Amsterdam, Amsterdam, Netherlands;
(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, Netherlands;
(3) Maarten de Rijke, University of Amsterdam, Amsterdam, Netherlands.
Table of Links
Abstract and 1 Introduction
2 Methodology and 2.1 Experimental Data and Tasks
2.2 Automatic Generation of Diverse Dialogue Contexts
2.3 Crowdsourcing Experiments
2.4 Experimental Conditions
2.5 Participants
3 Results and Analysis and 3.1 Data Statistics
3.2 RQ1: Effect of Varying the Amount of Dialogue Context
3.3 RQ2: Effect of Automatically Generated Dialogue Context
4 Discussion and Implications
5 Related Work
6 Conclusion, Limitations, and Future Directions
7 Acknowledgments and References
A. Appendix
2.3 Crowdsourcing Experiments
Following best practices in crowdsourcing (Kazai, 2011; Kazai et al., 2013; Roitero et al., 2020), we design Human Intelligence Task (HIT) templates to collect relevance and usefulness labels. We distribute the HITs under varying conditions to understand how contextual information affects annotators' decisions. Our study has two phases: in Phase 1 we vary the amount of contextual information; in Phase 2 we vary the type of contextual information. In every phase and condition, annotators were paid the same amount, since this study does not focus on understanding how incentives affect the quality of crowdsourced labels. Following Kazai et al. (2013), in both phases we avoid disclosing the research angle to the annotators; this helps prevent potential biases during HIT completion.
Phase 1. The focus in Phase 1 is to understand how the amount of dialogue context affects the quality and consistency of relevance and usefulness labels. To address RQ1, we vary the length of the dialogue context. We therefore design our experiment with three variations: C0, C3, and C7 (see Section 2.4). The HIT consists of a general task statement, instructions, examples, and the main task section. For each variation, we collect labels for the two main dimensions (relevance and usefulness) and add an open-ended question asking annotators for feedback. Each dimension is assessed in a separate HIT over the same system responses, with each response judged by 3 annotators. This ensures a consistent assessment process for both relevance and usefulness.
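To make the context-amount variations concrete, below is a minimal sketch, assuming C0, C3, and C7 correspond to showing zero, the last three, and the last seven prior turns (an assumption based on the condition names). The function, field names, and the example dialogue are illustrative, not the authors' code or data.

```python
# Hypothetical sketch: assemble one HIT per condition and dimension by
# truncating the prior dialogue according to the context-amount condition.

def build_hit(prior_turns, response, condition, dimension):
    """Return the material shown to annotators for one HIT."""
    cutoff = {"C0": 0, "C3": 3, "C7": 7}[condition]   # assumed turn counts
    visible = prior_turns[-cutoff:] if cutoff else []
    return {
        "context": visible,            # prior turns the annotator can see
        "response": response,          # the system response being judged
        "question": dimension,         # "relevance" or "usefulness", one per HIT
        "judgements_required": 3,      # each item labelled by 3 annotators
    }

# Made-up example dialogue for illustration only.
prior_turns = [
    ("user", "Can you recommend a sci-fi movie?"),
    ("system", "Do you prefer classics or recent releases?"),
    ("user", "Recent releases, please."),
]
response = "You might enjoy Dune: Part Two."

hits = [
    build_hit(prior_turns, response, c, d)
    for c in ("C0", "C3", "C7")
    for d in ("relevance", "usefulness")
]
```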
Phase 2. In Phase 2, the focus shifts to the type of contextual information, in order to answer RQ2. We follow a machine-in-the-loop approach to crowdsourcing. We limit our experiments to the experimental variation C0 (defined below), in which annotators see no dialogue context. Beyond the response being evaluated, we aim to improve the quality of the crowdsourced labels for C0 by adding extra contextual information. Our hypothesis is that, without the prior context, annotators may struggle to provide accurate and consistent labels. We expect the accuracy of the assessments to increase when additional context, such as the user's information need or a dialogue summary, is provided. In this way, we aim to approach the level of performance annotators can reach with the whole dialogue context, while minimizing the additional annotation effort required. As detailed in Section 2.2, we augment 40 dialogues from Phase 1 with either the user's information need or a dialogue summary. Hence, we have three experimental setups in Phase 2: C0-LLM, C0-HEU, and C0-SUM. Table 3 in Appendix A.1 summarizes the setups.
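A minimal sketch of the Phase 2 setups follows, assuming the generated artefacts from Section 2.2 are available per dialogue; the function, key names, and example strings are hypothetical, not the authors' pipeline.

```python
# Hypothetical sketch: attach one type of generated context to a context-free (C0) HIT.

def augment_c0_hit(response, setup, generated):
    """Build a C0 HIT plus one piece of generated context.

    setup: "C0-HEU" (heuristic information need),
           "C0-LLM" (LLM-generated information need),
           "C0-SUM" (dialogue summary).
    generated: per-dialogue artefacts from Section 2.2.
    """
    extra = {
        "C0-HEU": generated["heuristic_need"],
        "C0-LLM": generated["llm_need"],
        "C0-SUM": generated["summary"],
    }[setup]
    return {"context": [], "extra_context": extra,
            "response": response, "condition": setup}

# Made-up artefacts for illustration only.
generated = {
    "heuristic_need": "The user wants a recent sci-fi movie.",
    "llm_need": "The user is looking for a recommendation for a recently released science-fiction film.",
    "summary": "The user asked for a sci-fi movie and narrowed the request to recent releases.",
}
hit = augment_c0_hit("You might enjoy Dune: Part Two.", "C0-SUM", generated)
```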
The HIT design closely mirrors Phase 1. The main task remains unchanged, except for the inclusion of the user's information need or a dialogue summary. Annotators answer the two questions on relevance and usefulness in separate HITs. While they are not required to rely on the additional information provided, annotators are encouraged to use it when they perceive that the current response alone lacks sufficient information for an informed decision.
2.4 Experimental Conditions
We focus on two main characteristics: the amount and the type of dialogue context. For both characteristics, we explore three different settings, resulting in 6 variations for both relevance and usefulness, each applied to the same 40 dialogues:
• Context amount. We investigate three cut-off strategies, designed to vary how much of the prior dialogue context annotators can access: no prior dialogue context (C0), partial dialogue context (C3), and the whole prior dialogue context (C7).
• Context type. Using the contexts generated in Section 2.2, we experiment with three context-type variations: a heuristically derived information need (C0-HEU), an LLM-generated information need (C0-LLM), and a dialogue summary (C0-SUM).
Table 3 in Appendix A.1 summarizes the experimental conditions.
2.5 Participants
To ensure proficient language understanding, we recruited master workers on Amazon Mechanical Turk (MTurk) (Amazon Mechanical Turk, 2023). Annotators were screened based on platform qualifications, requiring a minimum approval rate of 97% across 5,000 HITs. To reduce learning effects, each annotator was limited to completing 10 HITs per batch and to participating in at most 3 experimental conditions. A total of 78 unique annotators participated in Phases 1 and 2; each worker was paid $0.4 per HIT, averaging $14 per hour. The average age range was 35-44 years. The gender distribution was 46% female and 54% male. The majority held a four-year bachelor's degree (48%), followed by two-year and postgraduate degrees (15% and 14%, respectively).
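The recruitment and workload constraints above can be summarized in a short sketch; the worker fields and helper names are assumptions for illustration, not the actual MTurk configuration.

```python
# Illustrative check of the screening and load constraints described above.

def eligible(worker):
    """Master workers with >= 97% approval across 5,000 HITs (as stated above)."""
    return (worker["is_master"]
            and worker["approval_rate"] >= 0.97
            and worker["hits_approved"] >= 5000)

def may_assign(worker_id, condition, hits_in_batch, conditions_seen):
    """At most 10 HITs per batch and at most 3 experimental conditions per annotator."""
    within_batch = hits_in_batch.get(worker_id, 0) < 10
    seen = conditions_seen.get(worker_id, set())
    return within_batch and (condition in seen or len(seen) < 3)

# Pay rate: $0.40 per HIT at roughly $14/hour implies about
# 14 / 0.40 = 35 HITs per hour, i.e. ~1.7 minutes per HIT on average.
```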
As described in Appendix A.2, we perform quality control on the crowdsourced labels to ensure their reliability.