How to develop crowdsourced labels for dialogue systems

Authors:
(1) Clemencia Siro, University of Amsterdam, Amsterdam, Netherlands;
(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, Netherlands;
(3) Maarten de Rijke, University of Amsterdam, Amsterdam, Netherlands.
Table of Links
Abstract and 1 Introduction
2 Methodology and 2.1 Experimental Data and Tasks
2.2 Automatic Generation of Diverse Dialogue Contexts
2.3 Crowdsourcing Experiments
2.4 Experimental Conditions
2.5 Participants
3 Results and Analysis and 3.1 Data Statistics
3.2 RQ1: Effect of Varying the Amount of Dialogue Context
3.3 RQ2: Effect of Automatically Generated Dialogue Context
4 Discussion and Implications
5 Related Work
6 Conclusion, Limitations, and Future Work
7 Acknowledgements and References
A Appendix
In this appendix, we provide supplementary materials supporting the main article. These include: the experimental conditions in Section A.1; the quality control measures used to ensure high-quality crowdsourced labels in Section A.2; the prompts used to generate the additional contexts in Section A.3; the annotation instructions and screenshots of our annotation tasks in Section A.4; and a sample of the complementary context generated by GPT-4 in Section A.5.
A.1 Experimental Conditions
We list the experimental conditions used in our crowdsourcing experiments in Table 3.
A.2 Data Quality Control
Generated user information need and summary. To address the potential for hallucination by LLMs (Chang et al., 2023), we applied a quality control process to the generated user information needs and summaries, ensuring consistency and factual accuracy. We automatically cross-checked the movies mentioned in the input dialogues against those mentioned in the summaries. To be considered valid, a summary must contain at least two-thirds of the movies mentioned in the input dialogue. If this criterion is not met, the summary is discarded and a new one is generated following the specified prompt requirements. In total, we regenerated summaries for 15 dialogues. To further ensure consistency, we randomly sampled 30% of the generated summaries and information needs; the authors reviewed them to verify their consistency and agreement with the information in the input dialogue. This increased the quality and reliability of the generated content.
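The two-thirds cross-check described above can be sketched as a small validation routine. This is a minimal illustration, not the authors' implementation: the catalogue of movie titles, the substring-based mention detection, and the function names are all assumptions for the sketch.

```python
def extract_movies(text, known_movies):
    """Return the set of catalogue titles mentioned in the text.

    `known_movies` is a hypothetical catalogue of titles; the paper does
    not specify how mentions are detected, so simple case-insensitive
    substring matching is assumed here.
    """
    lowered = text.lower()
    return {m for m in known_movies if m.lower() in lowered}


def summary_is_valid(dialogue, summary, known_movies, threshold=2 / 3):
    """Apply the paper's criterion: a summary is valid if it mentions
    at least two-thirds of the movies found in the input dialogue."""
    dialogue_movies = extract_movies(dialogue, known_movies)
    if not dialogue_movies:
        return True  # nothing to cross-check against
    summary_movies = extract_movies(summary, known_movies)
    covered = len(dialogue_movies & summary_movies)
    return covered / len(dialogue_movies) >= threshold
```

A summary failing this check would be discarded and regenerated, mirroring the loop described in the text.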
Crowdsourced labels. To ensure the high quality of the collected data, we included attention-check questions in the task. Annotators had to specify the number of utterances in the dialogues they evaluated and identify the last movie mentioned in the system response. 10% of the HITs were rejected and republished to collect new labels. In total, we collected 1,440 data samples from the crowdsourcing task, covering six variations for both relevance and usefulness. We used majority voting to derive the final relevance and usefulness label for each dialogue.
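The majority-voting aggregation can be illustrated with a few lines of Python. This is a generic sketch: the input format (one list of per-annotator ratings per dialogue) and the tie-breaking behavior are assumptions, as the paper does not specify how ties are resolved.

```python
from collections import Counter


def majority_label(labels):
    """Aggregate per-annotator labels for one dialogue by majority vote.

    `labels` is a hypothetical list of ratings from different annotators.
    Counter.most_common breaks ties by first-encountered label, which is
    only one possible tie-breaking policy.
    """
    return Counter(labels).most_common(1)[0][0]
```

For example, three annotators rating a response's relevance as 2, 2, and 1 would yield a final label of 2.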
A.3 Prompts
In Table 4, we show the prompts used to generate the user information need and the dialogue summary with GPT-4.
A.4 Annotation Instructions and Screenshots
Table 5 details the annotation instructions for the relevance and usefulness assessments. Figures 5 and 6 show the annotation interface used for phase 1 and phase 2, respectively.
A.5 Sample Generated Context
In Table 6, we show a sample user information need and summary generated by GPT-4.