Feature selection for classifying multi-labeled past events

Sumikawa, Yasunobu; Ikejiri, Ryohei

doi:10.1007/s00799-020-00293-5

Published: 08 September 2020

Volume 22, pages 63–83, (2021)

International Journal on Digital Libraries

Abstract

Studying and analyzing past events can provide numerous benefits. Event categorization has been studied before, but prior work usually assigned only one category to each event. In this study, we focus on multi-label classification of past events, a more general and challenging problem than those addressed in previous studies. We categorize events into thirteen types using a diverse range of features and classifiers trained on a dataset containing at least 50 labeled news articles per category. We confirm that training classifiers on all the features yields statistically significant improvements in micro- and macro-averaged \(F_1\), multi-label accuracy, average precision@5, area under the receiver operating characteristic curve, and example-based loss functions.
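To make some of the evaluation measures named in the abstract concrete, the following is a minimal sketch of how micro- and macro-averaged \(F_1\), multi-label accuracy, and one example-based loss could be computed on binary label-indicator matrices. It is not the authors' implementation. The function names, the toy data, the Jaccard-style definition of multi-label accuracy, and the choice of Hamming loss as the example-based loss are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[int]]  # rows = events, columns = categories (0/1)


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined here as 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true: Matrix, y_pred: Matrix) -> List[Tuple[int, int, int]]:
    """Per-category (true positive, false positive, false negative) counts."""
    counts = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts over all categories, then compute a single F1."""
    counts = label_counts(y_true, y_pred)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return _f1(tp, fp, fn)


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per category, then take the unweighted mean."""
    counts = label_counts(y_true, y_pred)
    return sum(_f1(*c) for c in counts) / len(counts)


def multilabel_accuracy(y_true: Matrix, y_pred: Matrix) -> float:
    """Assumed definition: mean per-event Jaccard overlap of label sets."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(t, p) if a == 1 or b == 1)
        total += inter / union if union else 1.0  # both empty counts as a match
    return total / len(y_true)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of (event, category) cells that are predicted incorrectly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


# Hypothetical toy data: 4 events, 3 categories (the paper uses 13).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

if __name__ == "__main__":
    print("micro F1:", micro_f1(y_true, y_pred))
    print("macro F1:", macro_f1(y_true, y_pred))
    print("multi-label accuracy:", multilabel_accuracy(y_true, y_pred))
    print("Hamming loss:", hamming_loss(y_true, y_pred))
```

Micro-averaging pools counts across categories, so frequent categories dominate the score, whereas macro-averaging weights every category equally. That difference matters when some categories have only about 50 labeled articles.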

Notes

https://en.wikipedia.org/wiki/West_African_Ebola_virus_epidemic.
Usually, only very popular or important events have their own names.
https://en.wikipedia.org/wiki/Portal:Current_events.
We use Japanese news articles to evaluate classification in this paper, as described in Sect. 5. Although we did not use the listed example events in the evaluation, we show them to help readers understand what kinds of events can be assigned to each of the 13 categories.
Some articles are stored in the CD-Mainichi Newspapers 2012 data, Nichigai Associates, Inc., 2012 (in Japanese). The others were collected by Web crawling.
https://doi.org/10.5281/zenodo.3258150. This open dataset excludes the article texts to respect copyright law. However, the texts can still be obtained because the dataset includes the event IDs defined in the Mainichi Newspapers 2012 data or the URLs used for Web crawling. Thus, after buying the Mainichi Newspapers 2012 data or recrawling the URLs with the Wayback Machine (accessed on 18 June 2019), the corresponding texts can be retrieved.
https://www3.nhk.or.jp/news/html/20181122/k10011720261000.html accessed on 22 Nov. 2018.
https://www3.nhk.or.jp/news/html/20181117/k10011714161000.html accessed on 17 Nov. 2018.
In Japanese, this term can be represented as a word.
https://radimrehurek.com/gensim/models/ldamodel.html, https://radimrehurek.com/gensim/models/lsimodel.html, https://radimrehurek.com/gensim/models/doc2vec.html and https://radimrehurek.com/gensim/models/word2vec.html.

Acknowledgements

This work was supported in part by MEXT Grants-in-Aid (#17K12792, #19K20631 and #26750076). We express our gratitude to all the reviewers for their thoughtful comments.

Author information

Authors and Affiliations

Tokyo Metropolitan University, Tokyo, Japan
Yasunobu Sumikawa
The University of Tokyo, Tokyo, Japan
Ryohei Ikejiri

Corresponding author

Correspondence to Yasunobu Sumikawa.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sumikawa, Y., Ikejiri, R. Feature selection for classifying multi-labeled past events. Int J Digit Libr 22, 63–83 (2021). https://doi.org/10.1007/s00799-020-00293-5

Received: 23 September 2018
Revised: 24 January 2020
Accepted: 13 July 2020
Published: 08 September 2020
Issue Date: March 2021
DOI: https://doi.org/10.1007/s00799-020-00293-5
