Abstract
Addressee detection is a fundamental task for seamless dialogue management and turn-taking in human-agent interaction. Although the addressee is implicit in dyadic interaction, detecting it becomes challenging when more than two participants are involved. This article proposes multiple addressee detection models based on smart feature selection and focus encoding schemes. The models are trained using different machine learning and deep learning algorithms, and they improve on existing baseline accuracies for addressee prediction on two datasets. In addition, the article explores the impact of different focus encoding schemes in several addressee detection cases. Finally, a strategy for implementing an addressee detection model in real time is discussed.
Notes
The annotation is available at: https://doi.org/10.6084/m9.figshare.13297775.
Additional information
This work was supported by the DAISI project, cofunded by the European Union with the European Regional Development Fund (ERDF), by the French Agence Nationale de la Recherche and by the Regional Council of Normandie.
Appendix: Classifiers and parameters for experimentation
Classifier | AMI parameters | MULTISIMO parameters |
---|---|---|
XGB | learning_rate=0.1, n_estimators=140, max_depth=5, min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, objective='multi:softmax', nthread=4, scale_pos_weight=1 | learning_rate=0.1, n_estimators=130, max_depth=3, min_child_weight=1, gamma=0, subsample=0.6, colsample_bytree=0.5, objective='multi:softmax', nthread=4, scale_pos_weight=1 |
ET | bootstrap=True, criterion='gini', max_features='sqrt', n_estimators=1000 | bootstrap=True, criterion='entropy', max_features='sqrt', n_estimators=200 |
ADB | base_estimator=DecisionTree, max_features=30, n_estimators=800 | base_estimator=DecisionTree, max_features=30, n_estimators=800 |
MLP | activation='tanh', alpha=0.05, hidden_layer_sizes=(100,), learning_rate='adaptive', solver='adam' | activation='tanh', alpha=0.0001, hidden_layer_sizes=(50, 100, 50), learning_rate='constant', solver='sgd', max_iter=100 |
RF | bootstrap=False, criterion='gini', max_features='auto', n_estimators=200 | bootstrap=True, criterion='gini', max_features='sqrt', n_estimators=100 |
LR | penalty='l2', C=100 | penalty='l2', C=0.1 |
SVM | C=100, gamma=0.01 | C=10, gamma=0.01 |
NB | No parameters | No parameters |
KNN | n_neighbors=8 | n_neighbors=9 |
LSTM | hidden layer neurons = (100, 50), dropout = 0.5, hidden_activation = relu, final_activation = softmax, loss = categorical_crossentropy, optimizer = adam, batch_size = 4, epochs = 100, callbacks = early stopping, patience = 20 | hidden layer neurons = (50, 25), dropout = 0.2, hidden_activation = relu, final_activation = softmax, loss = categorical_crossentropy, optimizer = adam, batch_size = 1, epochs = 100, callbacks = early stopping, patience = 20 |
Bi-LSTM | hidden layer neurons = (100, 50), dropout = 0.5, hidden_activation = relu, final_activation = softmax, loss = categorical_crossentropy, optimizer = adam, batch_size = 4, epochs = 100, callbacks = early stopping, patience = 20 | hidden layer neurons = (50, 25), dropout = 0.2, hidden_activation = relu, final_activation = softmax, loss = categorical_crossentropy, optimizer = adam, batch_size = 1, epochs = 100, callbacks = early stopping, patience = 20 |
1D-CNN | hidden layer neurons = (100, 50), kernel_size = (3, 3), dropout = 0.5, hidden_activation = relu, final_activation = softmax, loss = categorical_crossentropy, optimizer = adam, batch_size = 4, epochs = 100, callbacks = early stopping, patience = 20 | hidden layer neurons = (50, 25), kernel_size = (3, 3), dropout = 0.2, hidden_activation = relu, final_activation = softmax, loss = categorical_crossentropy, optimizer = adam, batch_size = 1, epochs = 100, callbacks = early stopping, patience = 20 |
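As an illustration of how the settings in the table above might be applied, the following minimal sketch stores a subset of them (ET, RF, LR, SVM, KNN) as plain dictionaries and passes them to scikit-learn constructors. The use of scikit-learn here is an assumption based on the parameter names; the names `PARAMS`, `get_params` and `build_classifier` are hypothetical helpers introduced for this example and are not part of the article.

```python
from typing import Any, Dict

# Hyperparameters transcribed from the appendix table (subset of classifiers).
PARAMS: Dict[str, Dict[str, Dict[str, Any]]] = {
    "ET": {
        "AMI": {"bootstrap": True, "criterion": "gini",
                "max_features": "sqrt", "n_estimators": 1000},
        "MULTISIMO": {"bootstrap": True, "criterion": "entropy",
                      "max_features": "sqrt", "n_estimators": 200},
    },
    "RF": {
        "AMI": {"bootstrap": False, "criterion": "gini",
                "max_features": "auto", "n_estimators": 200},
        "MULTISIMO": {"bootstrap": True, "criterion": "gini",
                      "max_features": "sqrt", "n_estimators": 100},
    },
    "LR": {
        "AMI": {"penalty": "l2", "C": 100},
        "MULTISIMO": {"penalty": "l2", "C": 0.1},
    },
    "SVM": {
        "AMI": {"C": 100, "gamma": 0.01},
        "MULTISIMO": {"C": 10, "gamma": 0.01},
    },
    "KNN": {
        "AMI": {"n_neighbors": 8},
        "MULTISIMO": {"n_neighbors": 9},
    },
}


def get_params(classifier: str, dataset: str) -> Dict[str, Any]:
    """Return a copy of the tabulated parameters for one classifier and dataset."""
    return dict(PARAMS[classifier][dataset])


def build_classifier(classifier: str, dataset: str):
    """Instantiate a scikit-learn estimator (assumed library) from the table."""
    # Imported lazily so the parameter table is usable without scikit-learn.
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    constructors = {
        "ET": ExtraTreesClassifier,
        "RF": RandomForestClassifier,
        "LR": LogisticRegression,
        "SVM": SVC,
        "KNN": KNeighborsClassifier,
    }
    params = get_params(classifier, dataset)
    # max_features='auto' was removed in scikit-learn 1.3; for classifiers
    # it was equivalent to 'sqrt', so it is mapped here for newer versions.
    if params.get("max_features") == "auto":
        params["max_features"] = "sqrt"
    return constructors[classifier](**params)
```

For example, `build_classifier("KNN", "MULTISIMO")` would return a `KNeighborsClassifier` with `n_neighbors=9`, ready to be fitted on whatever feature matrix is in use.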
Cite this article
Malik, U., Barange, M., Saunier, J. et al. A novel focus encoding scheme for addressee detection in multiparty interaction using machine learning algorithms. J Multimodal User Interfaces 15, 175–188 (2021). https://doi.org/10.1007/s12193-020-00361-9
DOI: https://doi.org/10.1007/s12193-020-00361-9