• arXiv.cs.IR Pub Date : 2020-01-18
Hengyi Cai; Hongshen Chen; Cheng Zhang; Yonghao Song; Xiaofang Zhao; Dawei Yin

Neural conversation systems generate responses based on the sequence-to-sequence (SEQ2SEQ) paradigm. Typically, the model is equipped with a single set of learned parameters to generate responses for given input contexts. When confronting diverse conversations, its adaptability is rather limited and the model is hence prone to generate generic responses. In this work, we propose an {\bf Ada}ptive {\bf N}eural {\bf D}ialogue generation model, \textsc{AdaND}, which manages various conversations with conversation-specific parameterization. For each conversation, the model generates parameters of the encoder-decoder by referring to the input context. In particular, we propose two adaptive parameterization mechanisms: a context-aware and a topic-aware parameterization mechanism. The context-aware parameterization directly generates the parameters by capturing local semantics of the given context. The topic-aware parameterization enables parameter sharing among conversations with similar topics by first inferring the latent topics of the given context and then generating the parameters with respect to the distributional topics. Extensive experiments conducted on a large-scale real-world conversational dataset show that our model achieves superior performance in terms of both quantitative metrics and human evaluations.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-18
Anubha Pandey; Ashish Mishra; Vinay Kumar Verma; Anurag Mittal; Hema A. Murthy

Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that the data of all the classes are available during training. The assumption may not always be practical since the data of a few classes may be unavailable, or the classes may not appear at the time of training. Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to handle previously unseen classes during the test. This paper proposes a generative approach based on the Stacked Adversarial Network (SAN) and the advantage of Siamese Network (SN) for ZS-SBIR. While SAN generates a high-quality sample, SN learns a better distance metric compared to that of the nearest neighbor search. The capability of the generative model to synthesize image features based on the sketch reduces the SBIR problem to that of an image-to-image retrieval problem. We evaluate the efficacy of our proposed approach on TU-Berlin, and Sketchy database in both standard ZSL and generalized ZSL setting. The proposed method yields a significant improvement in standard ZSL as well as in a more challenging generalized ZSL setting (GZSL) for SBIR.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-18
Sean MacAvaney; Arman Cohan; Nazli Goharian; Ross Filice

Medical errors are a major public health concern and a leading cause of death worldwide. Many healthcare centers and hospitals use reporting systems where medical practitioners write a preliminary medical report and the report is later reviewed, revised, and finalized by a more experienced physician. The revisions range from stylistic to corrections of critical errors or misinterpretations of the case. Due to the large quantity of reports written daily, it is often difficult to manually and thoroughly review all the finalized reports to find such errors and learn from them. To address this challenge, we propose a novel ranking approach, consisting of textual and ontological overlaps between the preliminary and final versions of reports. The approach learns to rank the reports based on the degree of discrepancy between the versions. This allows medical practitioners to easily identify and learn from the reports in which their interpretation most substantially differed from that of the attending physician (who finalized the report). This is a crucial step towards uncovering potential errors and helping medical practitioners to learn from such errors, thus improving patient-care in the long run. We evaluate our model on a dataset of radiology reports and show that our approach outperforms both previously-proposed approaches and more recent language models by 4.5% to 15.4%.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-19
Amit Kumar Jaiswal; Haiming Liu; Ingo Frommholz

User implicit feedback plays an important role in recommender systems. However, finding implicit features is a tedious task. This paper aims to identify users' preferences through implicit behavioural signals for image recommendation based on the Information Scent Model of Information Foraging Theory. In the first part, we hypothesise that the users' perception is improved with visual cues in the images as behavioural signals that provide users' information scent during information seeking. We designed a content-based image recommendation system to explore which image attributes (i.e., visual cues or bookmarks) help users find their desired image. We found that users prefer recommendations predicated by visual cues and therefore consider the visual cues as good information scent for their information seeking. In the second part, we investigated if visual cues in the images together with the images itself can be better perceived by the users than each of them on its own. We evaluated the information scent artifacts in image recommendation on the Pinterest image collection and the WikiArt dataset. We find our proposed image recommendation system supports the implicit signals through Information Foraging explanation of the information scent model.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-19
David Morris; Eric Müller-Budack; Ralph Ewerth

In the past few years, convolutional neural networks (CNNs) have achieved impressive results in computer vision tasks, which however mainly focus on photos with natural scene content. Besides, non-sensor derived images such as illustrations, data visualizations, figures, etc. are typically used to convey complex information or to explore large datasets. However, this kind of images has received little attention in computer vision. CNNs and similar techniques use large volumes of training data. Currently, many document analysis systems are trained in part on scene images due to the lack of large datasets of educational image data. In this paper, we address this issue and present SlideImages, a dataset for the task of classifying educational illustrations. SlideImages contains training data collected from various sources, e.g., Wikimedia Commons and the AI2D dataset, and test data collected from educational slides. We have reserved all the actual educational images as a test dataset in order to ensure that the approaches using this dataset generalize well to new educational images, and potentially other domains. Furthermore, we present a baseline system using a standard deep neural architecture and discuss dealing with the challenge of limited training data.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-19
Jaya Prakash Champati; Ramana R. Avula; Tobias J. Oechtering; James Gross

There is a growing interest in analysing the freshness of data in networked systems. Age of Information (AoI) has emerged as a popular metric to quantify this freshness at a given destination. There has been a significant research effort in optimizing this metric in communication and networking systems under different settings. In contrast to previous works, we are interested in a fundamental question, what is the minimum achievable AoI in any single-server-single-source queuing system for a given service-time distribution? To address this question, we study a problem of optimizing AoI under service preemptions. Our main result is on the characterization of the minimum achievable average peak AoI (PAoI). We obtain this result by showing that a fixed-threshold policy is optimal in the set of all randomized-threshold causal policies. We use the characterization to provide necessary and sufficient conditions for the service-time distributions under which preemptions are beneficial.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-19
Krisztian Balog; Lucie Flekova; Matthias Hagen; Rosie Jones; Martin Potthast; Filip Radlinski; Mark Sanderson; Svitlana Vakulenko; Hamed Zamani

This paper discusses the potential for creating academic resources (tools, data, and evaluation approaches) to support research in conversational search, by focusing on realistic information needs and conversational interactions. Specifically, we propose to develop and operate a prototype conversational search system for scholarly activities. This Scholarly Conversational Assistant would serve as a useful tool, a means to create datasets, and a platform for running evaluation challenges by groups across the community. This article results from discussions of a working group at Dagstuhl Seminar 19461 on Conversational Search.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-20
Sagar Uprety; Prayag Tiwari; Shahram Dehdashti; Lauren Fell; Dawei Song; Peter Bruza; Massimo Melucci

A large number of studies in cognitive science have revealed that probabilistic outcomes of certain human decisions do not agree with the axioms of classical probability theory. The field of Quantum Cognition provides an alternative probabilistic model to explain such paradoxical findings. It posits that cognitive systems have an underlying quantum-like structure, especially in decision-making under uncertainty. In this paper, we hypothesise that relevance judgement, being a multidimensional, cognitive concept, can be used to probe the quantum-like structure for modelling users' cognitive states in information seeking. Extending from an experiment protocol inspired by the Stern-Gerlach experiment in Quantum Physics, we design a crowd-sourced user study to show violation of the Kolmogorovian probability axioms as a proof of the quantum-like structure, and provide a comparison between a quantum probabilistic model and a Bayesian model for predictions of relevance.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-20
Carlos-Emiliano González-Gallardo; Romain Deveaud; Eric SanJuan; Juan-Manuel Torres

The automatic summarization of multimedia sources is an important task that facilitates the understanding of an individual by condensing the source while maintaining relevant information. In this paper we focus on audio summarization based on audio features and the probability of distribution divergence. Our method, based on an extractive summarization approach, aims to select the most relevant segments until a time threshold is reached. It takes into account the segment's length, position and informativeness value. Informativeness of each segment is obtained by mapping a set of audio features issued from its Mel-frequency Cepstral Coefficients and their corresponding Jensen-Shannon divergence score. Results over a multi-evaluator scheme shows that our approach provides understandable and informative summaries.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-20
Diana Sousa; Francisco M. Couto

Successful biomedical relation extraction can provide evidence to researchers and clinicians about possible unknown associations between biomedical entities, advancing the current knowledge we have about those entities and their inherent mechanisms. Most biomedical relation extraction systems do not resort to external sources of knowledge, such as domain-specific ontologies. However, using deep learning methods, along with biomedical ontologies, has been recently shown to effectively advance the biomedical relation extraction field. To perform relation extraction, our deep learning system, BiOnt, employs four types of biomedical ontologies, namely, the Gene Ontology, the Human Phenotype Ontology, the Human Disease Ontology, and the Chemical Entities of Biological Interest, regarding gene-products, phenotypes, diseases, and chemical compounds, respectively. We tested our system with three data sets that represent three different types of relations of biomedical entities. BiOnt achieved, in F-score, an improvement of 4.93 percentage points for drug-drug interactions (DDI corpus), 4.99 percentage points for phenotype-gene relations (PGR corpus), and 2.21 percentage points for chemical-induced disease relations (BC5CDR corpus), relatively to the state-of-the-art. The code supporting this system is available at https://github.com/lasigeBioTM/BiONT.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-20
Suhas Thejaswi; Aristides Gionis

In this paper we study a family of pattern-detection problems in vertex-colored temporal graphs. In particular, given a vertex-colored temporal graph and a multi-set of colors as a query, we search for temporal paths in the graph that contain the colors specified in the query. These types of problems have several interesting applications, for example, recommending tours for tourists, or searching for abnormal behavior in a network of financial transactions. For the family of pattern-detection problems we define, we establish complexity results and design an algebraic-algorithmic framework based on constrained multilinear sieving. We demonstrate that our solution can scale to massive graphs with up to hundred million edges, despite the problems being NP-hard. Our implementation, which is publicly available, exhibits practical edge-linear scalability and highly optimized. For example, in a real-world graph dataset with more than six million edges and a multi-set query with ten colors, we can extract an optimal solution in less than eight minutes on a haswell desktop with four cores.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-21
Takafumi J. Suzuki

Document networks are found in various collections of real-world data, such as citation networks, hyperlinked web pages, and online social networks. A large number of generative models have been proposed because they offer intuitive and useful pictures for analyzing document networks. Prominent examples are relational topic models, where documents are linked according to their topic similarities. However, existing generative models do not make full use of network structures because they are largely dependent on topic modeling of documents. In particular, centrality of graph nodes is missing in generative processes of previous models. In this paper, we propose a novel generative model for document networks by introducing random walkers on networks to integrate the node centrality into link generation processes. The developed method is evaluated in semi-supervised classification tasks with real-world citation networks. We show that the proposed model outperforms existing probabilistic approaches especially in detecting communities in connected networks.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-21
Marcia Barros; André Moitinho; Francisco M. Couto

Recommending Chemical Compounds of interest to a particular researcher is a poorly explored field. The few existent datasets with information about the preferences of the researchers use implicit feedback. The lack of Recommender Systems in this particular field presents a challenge for the development of new recommendations models. In this work, we propose a Hybrid recommender model for recommending Chemical Compounds. The model integrates collaborative-filtering algorithms for implicit feedback (Alternating Least Squares (ALS) and Bayesian Personalized Ranking(BPR)) and semantic similarity between the Chemical Compounds in the ChEBI ontology (ONTO). We evaluated the model in an implicit dataset of Chemical Compounds, CheRM. The Hybrid model was able to improve the results of state-of-the-art collaborative-filtering algorithms, especially for Mean Reciprocal Rank, with an increase of 6.7% when comparing the collaborative-filtering ALS and the Hybrid ALS_ONTO.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2019-01-22
Robin Haunschild; Werner Marx

Bibliometric information retrieval in databases can employ different strategies. Com-monly, queries are performed by searching in title, abstract and/or author keywords (author vocabulary). More advanced queries employ database keywords to search in a controlled vo-cabulary. Queries based on search terms can be augmented with their citing papers if a re-search field cannot be curtailed by the search query alone. Here, we present another strategy to discover the most important papers of a research field. A marker paper is used to reveal the most important works for the relevant community. All papers co-cited with the marker paper are analyzed using reference publication year spectroscopy (RPYS). For demonstration of the marker paper approach, density functional theory (DFT) is used as a research field. Compari-sons between a prior RPYS on a publication set compiled using a keyword-based search in a controlled vocabulary and three different co-citation RPYS (RPYS-CO) analyses show very similar results. Similarities and differences are discussed.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2019-05-23
Leo Torres; Kevin S Chan; Tina Eliassi-Rad

Graph embedding seeks to build a low-dimensional representation of a graph G. This low-dimensional representation is then used for various downstream tasks. One popular approach is Laplacian Eigenmaps, which constructs a graph embedding based on the spectral properties of the Laplacian matrix of G. The intuition behind it, and many other embedding techniques, is that the embedding of a graph must respect node similarity: similar nodes must have embeddings that are close to one another. Here, we dispose of this distance-minimization assumption. Instead, we use the Laplacian matrix to find an embedding with geometric properties instead of spectral ones, by leveraging the so-called simplex geometry of G. We introduce a new approach, Geometric Laplacian Eigenmap Embedding (or GLEE for short), and demonstrate that it outperforms various other techniques (including Laplacian Eigenmaps) in the tasks of graph reconstruction and link prediction.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2019-11-06
Ahmed M. Elmisery; Mirela Sertovic

The recent proliferation of smart home environments offers new and transformative circumstances for various domains with a commitment to enhancing the quality of life and experience. Most of these environments combine different gadgets offered by multiple stakeholders in a dynamic and decentralized manner, which in turn presents new challenges from the perspective of digital investigation. In addition, a plentiful amount of data records got generated because of the day to day interactions between these gadgets and homeowners, which poses difficulty in managing and analyzing such data. The analysts should endorse new digital investigation approaches to tackle the current limitations in traditional approaches when used in these environments. The digital evidence in such environments can be found inside the records of logfiles that store the historical events occurred inside the smart home. Threat hunting can leverage the collective nature of these gadgets to gain deeper insights into the best way for responding to new threats, which in turn can be valuable in reducing the impact of breaches. Nevertheless, this approach depends mainly on the readiness of smart homeowners to share their own personal usage logs that have been extracted from their smart home environments. However, they might disincline to employ such service due to the sensitive nature of the information logged by their personal gateways. In this paper, we presented an approach to enable smart homeowners to share their usage logs in a privacy preserving manner. A distributed threat hunting approach has been developed to permit the composition of diverse threat classes without revealing the logged records to other involved parties. Furthermore, a scenario was proposed to depict a proactive threat Intelligence sharing for the detection of potential threats in smart home environments with some experimental results.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2019-11-10
Xueying Bai; Jian Guan; Hongning Wang

Reinforcement learning is well suited for optimizing policies of recommender systems. Current solutions mostly focus on model-free approaches, which require frequent interactions with the real environment, and thus are expensive in model learning. Offline evaluation methods, such as importance sampling, can alleviate such limitations, but usually request a large amount of logged data and do not work well when the action space is large. In this work, we propose a model-based reinforcement learning solution which models user-agent interaction for offline policy learning via a generative adversarial network. To reduce bias in the learned model and policy, we use a discriminator to evaluate the quality of generated data and scale the generated rewards. Our theoretical analysis and empirical evaluations demonstrate the effectiveness of our solution in learning policies from the offline and generated data.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2019-11-21
Ming Zhu; Busra Celikkaya; Parminder Bhatia; Chandan K. Reddy

Entity linking is the task of linking mentions of named entities in natural language text, to entities in a curated knowledge-base. This is of significant importance in the biomedical domain, where it could be used to semantically annotate a large volume of clinical records and biomedical literature, to standardized concepts described in an ontology such as Unified Medical Language System (UMLS). We observe that with precise type information, entity disambiguation becomes a straightforward task. However, fine-grained type information is usually not available in biomedical domain. Thus, we propose LATTE, a LATent Type Entity Linking model, that improves entity linking by modeling the latent fine-grained type information about mentions and entities. Unlike previous methods that perform entity linking directly between the mentions and the entities, LATTE jointly does entity disambiguation, and latent fine-grained type learning, without direct supervision. We evaluate our model on two biomedical datasets: MedMentions, a large scale public dataset annotated with UMLS concepts, and a de-identified corpus of dictated doctor's notes that has been annotated with ICD concepts. Extensive experimental evaluation shows our model achieves significant performance improvements over several state-of-the-art techniques.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2019-12-28
Tieyun Qian; Yile Liang; Qing Li

Matrix completion is a classic problem underlying recommender systems. It is traditionally tackled with matrix factorization. Recently, deep learning based methods, especially graph neural networks, have made impressive progress on this problem. Despite their effectiveness, existing methods focus on modeling the user-item interaction graph. The inherent drawback of such methods is that their performance is bound to the density of the interactions, which is however usually of high sparsity. More importantly, for a cold start user/item that does not have any interactions, such methods are unable to learn the preference embedding of the user/item since there is no link to this user/item in the graph. In this work, we develop a novel framework Attribute Graph Neural Networks (AGNN) by exploiting the attribute graph rather than the commonly used interaction graph. This leads to the capability of learning embeddings for cold start users/items. Our AGNN can produce the preference embedding for a cold user/item by learning on the distribution of attributes with an extended variational auto-encoder structure. Moreover, we propose a new graph neural network variant, i.e., gated-GNN, to effectively aggregate various attributes of different modalities in a neighborhood. Empirical results on two real-world datasets demonstrate that our model yields significant improvements for cold start recommendations and outperforms or matches state-of-the-arts performance in the warm start scenario.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2019-11-08
Tan Yan; Heyan Huang; Xian-Ling Mao

We introduce a new scientific named entity recognizer called SEPT, which stands for Span Extractor with Pre-trained Transformers. In recent papers, span extractors have been demonstrated to be a powerful model compared with sequence labeling models. However, we discover that with the development of pre-trained language models, the performance of span extractors appears to become similar to sequence labeling models. To keep the advantages of span representation, we modified the model by under-sampling to balance the positive and negative samples and reduce the search space. Furthermore, we simplify the origin network architecture to combine the span extractor with BERT. Experiments demonstrate that even simplified architecture achieves the same performance and SEPT achieves a new state of the art result in scientific named entity recognition even without relation information involved.

更新日期：2020-01-22
• arXiv.cs.IR Pub Date : 2020-01-15
Anant Khandelwal; Niraj Kumar

Wide usage of social media platforms has increased the risk of aggression, which results in mental stress and affects the lives of people negatively like psychological agony, fighting behavior, and disrespect to others. Majority of such conversations contains code-mixed languages[28]. Additionally, the way used to express thought or communication style also changes from one social media plat-form to another platform (e.g., communication styles are different in twitter and Facebook). These all have increased the complexity of the problem. To solve these problems, we have introduced a unified and robust multi-modal deep learning architecture which works for English code-mixed dataset and uni-lingual English dataset both.The devised system, uses psycho-linguistic features and very ba-sic linguistic features. Our multi-modal deep learning architecture contains, Deep Pyramid CNN, Pooled BiLSTM, and Disconnected RNN(with Glove and FastText embedding, both). Finally, the system takes the decision based on model averaging. We evaluated our system on English Code-Mixed TRAC 2018 dataset and uni-lingual English dataset obtained from Kaggle. Experimental results show that our proposed system outperforms all the previous approaches on English code-mixed dataset and uni-lingual English dataset.

更新日期：2020-01-17
• arXiv.cs.IR Pub Date : 2020-01-16
Antoine Gourru; Adrien Guille; Julien Velcin; Julien Jacques

We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents (e.g. citation network) into a pretrained word embedding space. In addition to the textual content, we leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph). We first build a simple word vector average for each document, and we use the similarities to alter this average representation. The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering. We demonstrate that our approach outperforms or matches existing document network embedding methods on node classification and link prediction tasks. Furthermore, we show that it helps identifying relevant keywords to describe document classes.

更新日期：2020-01-17
• arXiv.cs.IR Pub Date : 2020-01-16
Matteo Allaix; Lukas Holzbaur; Tefjol Pllaha; Camilla Hollanti

In the classical private information retrieval (PIR) setup, a user wants to retrieve a file from a database or a distributed storage system (DSS) without revealing the file identity to the servers holding the data. In the quantum PIR (QPIR) setting, a user privately retrieves a classical file by downloading quantum systems from the servers. The QPIR problem has been treated by Song \emph{et al.} in the case of replicated servers, both without collusion and with all but one servers colluding. In this paper, the QPIR setting is extended to account for MDS-coded servers. The proposed protocol works for any [n,k]-MDS code and t-collusion with t = n - k. Similarly to the previous cases, the rates achieved are better than those known or conjectured in the classical counterparts.

更新日期：2020-01-17
• arXiv.cs.IR Pub Date : 2020-01-16
Tong Zeng; Longfeng Wu; Sarah Bratt; Daniel E. Acuna

A citation is a well-established mechanism for connecting scientific artifacts. Citation networks are used by citation analysis for a variety of reasons, prominently to give credit to scientists' work. However, because of current citation practices, scientists tend to cite only publications, leaving out other types of artifacts such as datasets. Datasets then do not get appropriate credit even though they are increasingly reused and experimented with. We develop a network flow measure, called DataRank, aimed at solving this gap. DataRank assigns a relative value to each node in the network based on how citations flow through the graph, differentiating publication and dataset flow rates. We evaluate the quality of DataRank by estimating its accuracy at predicting the usage of real datasets: web visits to GenBank and downloads of Figshare datasets. We show that DataRank is better at predicting this usage compared to alternatives while offering additional interpretable outcomes. We discuss improvements to citation behavior and algorithms to properly track and assign credit to datasets.

更新日期：2020-01-17
• arXiv.cs.IR Pub Date : 2020-01-16
Islam Samy; Mohamed A. Attia; Ravi Tandon; Loukas Lazos

In many applications, content accessed by users (movies, videos, news articles, etc.) can leak sensitive latent attributes, such as religious and political views, sexual orientation, ethnicity, gender, and others. To prevent such information leakage, the goal of classical PIR is to hide the identity of the content/message being accessed, which subsequently also hides the latent attributes. This solution, while private, can be too costly, particularly, when perfect (information-theoretic) privacy constraints are imposed. For instance, for a single database holding $K$ messages, privately retrieving one message is possible if and only if the user downloads the entire database of $K$ messages. Retrieving content privately, however, may not be necessary to perfectly hide the latent attributes. Motivated by the above, we formulate and study the problem of latent-variable private information retrieval (LV-PIR), which aims at allowing the user efficiently retrieve one out of $K$ messages (indexed by $\theta$) without revealing any information about the latent variable (modeled by $S$). We focus on the practically relevant setting of a single database and show that one can significantly reduce the download cost of LV-PIR (compared to the classical PIR) based on the correlation between $\theta$ and $S$. We present a general scheme for LV-PIR as a function of the statistical relationship between $\theta$ and $S$, and also provide new results on the capacity/download cost of LV-PIR. Several open problems and new directions are also discussed.

更新日期：2020-01-17
• arXiv.cs.IR Pub Date : 2019-07-24
Muhammad Umer Anwaar; Dmytro Rybalko; Martin Kleinsteuber

Improved search quality enhances users' satisfaction, which directly impacts sales growth of an E-Commerce (E-Com) platform. Traditional Learning to Rank (LTR) algorithms require relevance judgments on products. In E-Com, getting such judgments poses an immense challenge. In the literature, it is proposed to employ user feedback (add-to-basket (AtB) clicks, orders etc) to generate relevance judgments. It is done in two steps: first, query-product pair data are aggregated from the logs and then order rate etc are calculated for each pair in the logs. In this paper, we advocate counterfactual risk minimization (CRM) approach which circumvents the need of relevance judgements, data aggregation and is better suited for learning from logged data, i.e. contextual bandit feedback. Due to unavailability of public E-Com LTR dataset, we provide \textit{Commercial dataset} from our platform. It contains more than 10 million AtB click logs and 1 million order logs from a catalogue of about 3.5 million products associated with 3060 queries. To the best of our knowledge, this is the first work which examines effectiveness of CRM approach in learning ranking model from real-world logged data. Our empirical evaluation shows that CRM approach learns effectively from logged data and beats strong baseline ranker ($\lambda$-MART) by a huge margin. Our method outperforms full-information loss (e.g. cross-entropy) on various deep neural network models. These findings show that by adopting CRM approach, E-Com platforms can improve product search quality without artificially mending the data to fit in the traditional LTR algorithms.

更新日期：2020-01-17
• arXiv.cs.IR Pub Date : 2019-12-06
Shuqi Xu; Qianming Zhang; Linyuan Lv; Manuel Sebastian Mariani

Over the past decade, many startups have sprung up, which create a huge demand for financial support from venture investors. However, due to the information asymmetry between investors and companies, the financing process is usually challenging and time-consuming, especially for the startups that have not yet obtained any investment. Because of this, effective data-driven techniques to automatically match startups with potentially relevant investors would be highly desirable. Here, we analyze 34,469 valid investment events collected from www.itjuzi.com and consider the cold-start problem of recommending investors for new startups. We address this problem by constructing different tripartite network representations of the data where nodes represent investors, companies, and companies' domains. First, we find that investors have strong domain preferences when investing, which motivates us to introduce virtual links between investors and investment domains in the tripartite network construction. Our analysis of the recommendation performance of diffusion-based algorithms applied to various network representations indicates that prospective investors for new startups are effectively revealed by integrating network diffusion processes with investors' domain preference.

更新日期：2020-01-17
• arXiv.cs.IR Pub Date : 2020-01-14
Rahul Radhakrishnan Iyer; Rohan Kohli; Shrimai Prabhumoye

With the rapid growth of e-Commerce, online product search has emerged as a popular and effective paradigm for customers to find desired products and engage in online shopping. However, there is still a big gap between the products that customers really desire to purchase and relevance of products that are suggested in response to a query from the customer. In this paper, we propose a robust way of predicting relevance scores given a search query and a product, using techniques involving machine learning, natural language processing and information retrieval. We compare conventional information retrieval models such as BM25 and Indri with deep learning models such as word2vec, sentence2vec and paragraph2vec. We share some of our insights and findings from our experiments.

更新日期：2020-01-16
• arXiv.cs.IR Pub Date : 2020-01-15
Nilavra Bhattacharya; Somnath Rakshit; Jacek Gwizdka; Paul Kogut

We propose an image-classification method to predict the perceived-relevance of text documents from eye-movements. An eye-tracking study was conducted where participants read short news articles, and rated them as relevant or irrelevant for answering a trigger question. We encode participants' eye-movement scanpaths as images, and then train a convolutional neural network classifier using these scanpath images. The trained classifier is used to predict participants' perceived-relevance of news articles from the corresponding scanpath images. This method is content-independent, as the classifier does not require knowledge of the screen-content, or the user's information-task. Even with little data, the image classifier can predict perceived-relevance with up to 80% accuracy. When compared to similar eye-tracking studies from the literature, this scanpath image classification method outperforms previously reported metrics by appreciable margins. We also attempt to interpret how the image classifier differentiates between scanpaths on relevant and irrelevant documents.

更新日期：2020-01-16
• arXiv.cs.IR Pub Date : 2020-01-15
Alexander Schindler; Thomas Lidy; Sebastian Böck

Deep Learning has become state of the art in visual computing and continuously emerges into the Music Information Retrieval (MIR) and audio retrieval domain. In order to bring attention to this topic we propose an introductory tutorial on deep learning for MIR. Besides a general introduction to neural networks, the proposed tutorial covers a wide range of MIR relevant deep learning approaches. \textbf{Convolutional Neural Networks} are currently a de-facto standard for deep learning based audio retrieval. \textbf{Recurrent Neural Networks} have proven to be effective in onset detection tasks such as beat or audio-event detection. \textbf{Siamese Networks} have been shown effective in learning audio representations and distance functions specific for music similarity retrieval. We will incorporate both academic and industrial points of view into the tutorial. Accompanying the tutorial, we will create a Github repository for the content presented at the tutorial as well as references to state of the art work and literature for further reading. This repository will remain public after the conference.

更新日期：2020-01-16
• arXiv.cs.IR Pub Date : 2020-01-12
Xien Liu; Xinxin You; Xiao Zhang; Ji Wu; Ping Lv

Compared to sequential learning models, graph-based neural networks exhibit some excellent properties, such as ability capturing global information. In this paper, we investigate graph-based neural networks for text classification problem. A new framework TensorGCN (tensor graph convolutional networks), is presented for this task. A text graph tensor is firstly constructed to describe semantic, syntactic, and sequential contextual information. Then, two kinds of propagation learning perform on the text graph tensor. The first is intra-graph propagation used for aggregating information from neighborhood nodes in a single graph. The second is inter-graph propagation used for harmonizing heterogeneous information between graphs. Extensive experiments are conducted on benchmark datasets, and the results illustrate the effectiveness of our proposed framework. Our proposed TensorGCN presents an effective way to harmonize and integrate heterogeneous information from different kinds of graphs.

更新日期：2020-01-16
• arXiv.cs.IR Pub Date : 2020-01-15
Markus Zlabinger; Sebastian Hofstätter; Navid Rekabsaz; Allan Hanbury

The effective extraction of ranked disease-symptom relationships is a critical component in various medical tasks, including computer-assisted medical diagnosis or the discovery of unexpected associations between diseases. While existing disease-symptom relationship extraction methods are used as the foundation in the various medical tasks, no collection is available to systematically evaluate the performance of such methods. In this paper, we introduce the Disease-Symptom Relation collection (DSR-collection), created by five fully trained physicians as expert annotators. We provide graded symptom judgments for diseases by differentiating between "symptoms" and "primary symptoms". Further, we provide several strong baselines, based on the methods used in previous studies. The first method is based on word embeddings, and the second on co-occurrences of keywords in medical articles. For the co-occurrence method, we propose an adaption in which not only keywords are considered, but also the full text of medical articles. The evaluation on the DSR-collection shows the effectiveness of the proposed adaption in terms of nDCG, precision, and recall.

更新日期：2020-01-16
• arXiv.cs.IR Pub Date : 2020-01-15
Nick Ruest; Jimmy Lin; Ian Milligan; Samantha Fritz

The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building - all proceeding concurrently in mutually-reinforcing efforts. As we near the end of our initially-conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from about two hundred users, totaling over 280 terabytes of web archives.

更新日期：2020-01-16
• arXiv.cs.IR Pub Date : 2020-01-15
Shuqi Xu; Manuel Sebastian Mariani; Linyuan Lü; Matúš Medo

Despite the increasing use of citation-based metrics for research evaluation purposes, we do not know yet which metrics best deliver on their promise to gauge the significance of a scientific paper or a patent. We assess 17 network-based metrics by their ability to identify milestone papers and patents in three large citation datasets. We find that traditional information-retrieval evaluation metrics are strongly affected by the interplay between the age distribution of the milestone items and age biases of the evaluated metrics. Outcomes of these metrics are therefore not representative of the metrics' ranking ability. We argue in favor of a modified evaluation procedure that explicitly penalizes biased metrics and allows us to reveal metrics' performance patterns that are consistent across the datasets. PageRank and LeaderRank turn out to be the best-performing ranking metrics when their age bias is suppressed by a simple transformation of the scores that they produce, whereas other popular metrics, including citation count, HITS and Collective Influence, produce significantly worse ranking results.

更新日期：2020-01-16
• arXiv.cs.IR Pub Date : 2020-01-13
Luca Papariello; Alexandros Bampoulidis; Mihai Lupu

We replicate recent experiments attempting to demonstrate an attractive hypothesis about the use of the Fisher kernel framework and mixture models for aggregating word embeddings towards document representations and the use of these representations in document classification, clustering, and retrieval. Specifically, the hypothesis was that the use of a mixture model of von Mises-Fisher (VMF) distributions instead of Gaussian distributions would be beneficial because of the focus on cosine distances of both VMF and the vector space model traditionally used in information retrieval. Previous experiments had validated this hypothesis. Our replication was not able to validate it, despite a large parameter scan space.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2020-01-14
Lu Wang; Jie Yang

Due to the superiority in similarity computation and database storage for large-scale multiple modalities data, cross-modal hashing methods have attracted extensive attention in similarity retrieval across the heterogeneous modalities. However, there are still some limitations to be further taken into account: (1) most current CMH methods transform real-valued data points into discrete compact binary codes under the binary constraints, limiting the capability of representation for original data on account of abundant loss of information and producing suboptimal hash codes; (2) the discrete binary constraint learning model is hard to solve, where the retrieval performance may greatly reduce by relaxing the binary constraints for large quantization error; (3) handling the learning problem of CMH in a symmetric framework, leading to difficult and complex optimization objective. To address above challenges, in this paper, a novel Asymmetric Correlation Quantization Hashing (ACQH) method is proposed. Specifically, ACQH learns the projection matrixs of heterogeneous modalities data points for transforming query into a low-dimensional real-valued vector in latent semantic space and constructs the stacked compositional quantization embedding in a coarse-to-fine manner for indicating database points by a series of learnt real-valued codeword in the codebook with the help of pointwise label information regression simultaneously. Besides, the unified hash codes across modalities can be directly obtained by the discrete iterative optimization framework devised in the paper. Comprehensive experiments on diverse three benchmark datasets have shown the effectiveness and rationality of ACQH.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2020-01-14
Natalia Krizhanovsky; Andrew Krizhanovsky

The article describes a technique for using English Wiktionary inflection tables for generating word forms for Veps verbs and nominals in the Open corpus of Veps and Karelian languages. The information concerning Karelian and Veps Wiktionary entries with inflection tables is given. The operating principle of the Wiktionary static and dynamic templates is explained with the use of the jogi (river) dictionary entry as an example. The method of constructing the inflection table in the dictionary of the VepKar corpus according to the data of the dynamic template of the English Wiktionary is presented.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2020-01-07
Shahpar YakhchiMacquarie University- Sydney-Australia; Amin BeheshtiMacquarie University- Sydney-Australia; Seyed Mohssen GhafariMacquarie University- Sydney-Australia; Mehmet OrgunMacquarie University- Sydney-Australia

Existing Recommender Systems mainly focus on exploiting users' feedback, e.g., ratings, and reviews on common items to detect similar users. Thus, they might fail when there are no common items of interest among users. We call this problem the Data Sparsity With no Feedback on Common Items (DSW-n-FCI). Personality-based recommender systems have shown a great success to identify similar users based on their personality types. However, there are only a few personality-based recommender systems in the literature which either discover personality explicitly through filling a questionnaire that is a tedious task, or neglect the impact of users' personal interests and level of knowledge, as a key factor to increase recommendations' acceptance. Differently, we identifying users' personality type implicitly with no burden on users and incorporate it along with users' personal interests and their level of knowledge. Experimental results on a real-world dataset demonstrate the effectiveness of our model, especially in DSW-n-FCI situations.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2020-01-10
Kaushik Chakrabarti; Zhimin Chen; Siamak Shakeri; Guihong Cao; Surajit Chaudhuri

The web contains a vast corpus of HTML tables. They can be used to provide direct answers to many web queries. We focus on answering two classes of queries with those tables: those seeking lists of entities (e.g., cities in california') and those seeking superlative entities (e.g., largest city in california'). The main challenge is to achieve high precision with significant coverage. Existing approaches train machine learning models to select the answer from the candidates; they rely on textual match features between the query and the content of the table along with features capturing table quality/importance. These features alone are inadequate for achieving the above goals. Our main insight is that we can improve precision by (i) first extracting intent (structured information) from the query for the above query classes and (ii) then performing structure-aware matching (instead of just textual matching) between the extracted intent and the candidates to select the answer. We model (i) as a sequence tagging task. We leverage state-of-the-art deep neural network models with word embeddings. The model requires large scale training data which is expensive to obtain via manual labeling; we therefore develop a novel method to automatically generate the training data. For (ii), we develop novel features to compute structure-aware match and train a machine learning model. Our experiments on real-life web search queries show that (i) our intent extractor for list and superlative intent queries has significantly higher precision and coverage compared with baseline approaches and (ii) our table answer selector significantly outperforms the state-of-the-art baseline approach. This technology has been used in production by Microsoft's Bing search engine since 2016.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2019-12-28
Shoujin Wang; Liang Hu; Yan Wang; Longbing Cao; Quan Z. Sheng; Mehmet Orgun

The emerging topic of sequential recommender systems has attracted increasing attention in recent years.Different from the conventional recommender systems including collaborative filtering and content-based filtering, SRSs try to understand and model the sequential user behaviors, the interactions between users and items, and the evolution of users preferences and item popularity over time. SRSs involve the above aspects for more precise characterization of user contexts, intent and goals, and item consumption trend, leading to more accurate, customized and dynamic recommendations.In this paper, we provide a systematic review on SRSs.We first present the characteristics of SRSs, and then summarize and categorize the key challenges in this research area, followed by the corresponding research progress consisting of the most recent and representative developments on this topic.Finally, we discuss the important research directions in this vibrant area.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2019-12-29
Gabriel de Souza Pereira Moreira

Recommender Systems (RS) have became a popular research topic and, since 2016, Deep Learning methods and techniques have been increasingly explored in this area. News RS are aimed to personalize users experiences and help them discover relevant articles from a large and dynamic search space. The main contribution of this research was named CHAMELEON, a Deep Learning meta-architecture designed to tackle the specific challenges of news recommendation. It consists of a modular reference architecture which can be instantiated using different neural building blocks. As information about users' past interactions is scarce in the news domain, the user context can be leveraged to deal with the user cold-start problem. Articles' content is also important to tackle the item cold-start problem. Additionally, the temporal decay of items (articles) relevance is very accelerated in the news domain. Furthermore, external breaking events may temporally attract global readership attention, a phenomenon generally known as concept drift in machine learning. All those characteristics are explicitly modeled on this research by a contextual hybrid session-based recommendation approach using Recurrent Neural Networks. The task addressed by this research is session-based news recommendation, i.e., next-click prediction using only information available in the current user session. A method is proposed for a realistic temporal offline evaluation of such task, replaying the stream of user clicks and fresh articles being continuously published in a news portal. Experiments performed with two large datasets have shown the effectiveness of the CHAMELEON for news recommendation on many quality factors such as accuracy, item coverage, novelty, and reduced item cold-start problem, when compared to other traditional and state-of-the-art session-based recommendation algorithms.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2020-01-01
Sami Khenissi; Olfa Nasraoui

What we discover and see online, and consequently our opinions and decisions, are becoming increasingly affected by automated machine learned predictions. Similarly, the predictive accuracy of learning machines heavily depends on the feedback data that we provide them. This mutual influence can lead to closed-loop interactions that may cause unknown biases which can be exacerbated after several iterations of machine learning predictions and user feedback. Machine-caused biases risk leading to undesirable social effects ranging from polarization to unfairness and filter bubbles. In this paper, we study the bias inherent in widely used recommendation strategies such as matrix factorization. Then we model the exposure that is borne from the interaction between the user and the recommender system and propose new debiasing strategies for these systems. Finally, we try to mitigate the recommendation system bias by engineering solutions for several state of the art recommender system models. Our results show that recommender systems are biased and depend on the prior exposure of the user. We also show that the studied bias iteratively decreases diversity in the output recommendations. Our debiasing method demonstrates the need for alternative recommendation strategies that take into account the exposure process in order to reduce bias. Our research findings show the importance of understanding the nature of and dealing with bias in machine learning models such as recommender systems that interact directly with humans, and are thus causing an increasing influence on human discovery and decision making

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2020-01-14
Chaitanya K. Ryali; John J. Hopfield; Leopold Grinberg; Dmitry Krotov

The fruit fly Drosophila's olfactory circuit has inspired a new locality sensitive hashing (LSH) algorithm, FlyHash. In contrast with classical LSH algorithms that produce low dimensional hash codes, FlyHash produces sparse high-dimensional hash codes and has also been shown to have superior empirical performance compared to classical LSH algorithms in similarity search. However, FlyHash uses random projections and cannot learn from data. Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. We show that BioHash outperforms previously published benchmarks for various hashing methods. Since our learning algorithm is based on a local and biologically plausible synaptic plasticity rule, our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems. We also propose a convolutional variant BioConvHash that further improves performance. From the perspective of computer science, BioHash and BioConvHash are fast, scalable and yield compressed binary representations that are useful for similarity search.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2020-01-14
Jonathan Brophy; Daniel Lowd

Social networking websites face a constant barrage of spam, unwanted messages that distract, annoy, and even defraud honest users. These messages tend to be very short, making them difficult to identify in isolation. Furthermore, spammers disguise their messages to look legitimate, tricking users into clicking on links and tricking spam filters into tolerating their malicious behavior. Thus, some spam filters examine relational structure in the domain, such as connections among users and messages, to better identify deceptive content. However, even when it is used, relational structure is often exploited in an incomplete or ad hoc manner. In this paper, we present Extended Group-based Graphical models for Spam (EGGS), a general-purpose method for classifying spam in online social networks. Rather than labeling each message independently, we group related messages together when they have the same author, the same content, or other domain-specific connections. To reason about related messages, we combine two popular methods: stacked graphical learning (SGL) and probabilistic graphical models (PGM). Both methods capture the idea that messages are more likely to be spammy when related messages are also spammy, but they do so in different ways; SGL uses sequential classifier predictions and PGMs use probabilistic inference. We apply our method to four different social network domains. EGGS is more accurate than an independent model in most experimental settings, especially when the correct label is uncertain. For the PGM implementation, we compare Markov logic networks to probabilistic soft logic and find that both work well with neither one dominating, and the combination of SGL and PGMs usually performs better than either on its own.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2019-12-16
Ke Sun; Tieyun Qian

The context information such as product category plays a critical role in sequential recommendation. Recent years have witnessed a growing interest in context-aware sequential recommender systems. Existing studies often treat the contexts as auxiliary feature vectors without considering the sequential dependency in contexts. However, such a dependency provides valuable clues to predict the user's future behavior. For example, a user might buy electronic accessories after he/she buy an electronic product. In this paper, we propose a novel seq2seq translation architecture to highlight the importance of sequential dependency in contexts for sequential recommendation. Specifically, we first construct a collateral context sequence in addition to the main interaction sequence. We then generalize recent advancements in translation model from sequences of words in two languages to sequences of items and contexts in recommender systems. Taking the category information as an item's context, we develop a basic coupled and an extended tripled seq2seq translation models to encode the category-item and item-category-item relations between the item and context sequences. We conduct extensive experiments on three real world datasets. The results demonstrate the superior performance of the proposed model compared with the state-of-the-art baselines.

更新日期：2020-01-15
• arXiv.cs.IR Pub Date : 2020-01-12
Mohammad-Amin Charusaie; Stefano Rini; Arash Amini

There are several ways to measure the compressibility of a random measure; they include the general rate-distortion curve, as well as more specific notions such as Renyi information dimension (RID), and dimensional-rate bias (DRB). The RID parameter indicates the concentration of the measure around lower-dimensional subsets of the space while the DRB parameter specifies the compressibility of the distribution over these lower-dimensional subsets. While the evaluation of such compressibility parameters is well-studied for continuous and discrete measures (e.g., the DRB is closely related to the entropy and differential entropy in discrete and continuous cases, respectively), the case of discrete-continuous measures is quite subtle. In this paper, we focus on a class of multi-dimensional random measures that have singularities on affine lower-dimensional subsets. These cases are of interest when working with linear transformation of component-wise independent discrete-continuous random variables. Here, we evaluate the RID and DRB for such probability measures. We further provide an upper-bound for the RID of multi-dimensional random measures that are obtained by Lipschitz functions of component-wise independent discrete-continuous random variables ($\mathbf{x}$). The upper-bound is shown to be achievable when the Lipschitz function is $A \mathbf{x}$, where $A$ satisfies $\spark({A}) = \rank ({A})+1$ (e.g., Vandermonde matrices). The above results in the case of discrete-domain moving-average processes with non-Gaussian excitation noise allow us to evaluate the block-average RID and to find a relationship between this parameter and other existing compressibility measures.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2020-01-13
Béatrice MazoyerMICS; Nicolas HervéINA; Céline HudelotMICS; Julia CageECON

In this work, we evaluate the performance of recent text embeddings for the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem.Our experiments are conducted on a publicly available corpus of tweets in English and on a similar dataset in French annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising on many applications, are not very suitable for this task. We also experiment with different types of fine-tuning to improve these results on French data. Finally, we propose a detailed analysis of the results obtained, showing the superiority of tf-idf approaches for this task.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2020-01-13
Yohan Chalier; Simon Razniewski; Gerhard Weikum

Commonsense knowledge (CSK) supports a variety of AI applications, from visual understanding to chatbots. Prior works on acquiring CSK, such as ConceptNet, have compiled statements that associate concepts, like everyday objects or activities, with properties that hold for most or some instances of the concept. Each concept is treated in isolation from other concepts, and the only quantitative measure (or ranking) of properties is a confidence score that the statement is valid. This paper aims to overcome these limitations by introducing a multi-faceted model of CSK statements and methods for joint reasoning over sets of inter-related statements. Our model captures four different dimensions of CSK statements: plausibility, typicality, remarkability and salience, with scoring and ranking along each dimension. For example, hyenas drinking water is typical but not salient, whereas hyenas eating carcasses is salient. For reasoning and ranking, we develop a method with soft constraints, to couple the inference over concepts that are related in in a taxonomic hierarchy. The reasoning is cast into an integer linear programming (ILP), and we leverage the theory of reduction costs of a relaxed LP to compute informative rankings. This methodology is applied to several large CSK collections. Our evaluation shows that we can consolidate these inputs into much cleaner and more expressive knowledge. Results are available at https://dice.mpi-inf.mpg.de.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2020-01-13
Tianjun HouLGI; Bernard YannouLGI; Yann LeroyIRCCyN; Emilie PoirsonIRCCyN

This research set out to identify and structure from online reviews the words and expressions related to customers' likes and dislikes to guide product development. Previous methods were mainly focused on product features. However, reviewers express their preference not only on product features. In this paper, based on an extensive literature review in design science, the authors propose a summarization model containing multiples aspects of user preference, such as product affordances, emotions, usage conditions. Meanwhile, the linguistic patterns describing these aspects of preference are discovered and drafted as annotation guidelines. A case study demonstrates that with the proposed model and the annotation guidelines, human annotators can structure the online reviews with high inter-agreement. As high inter-agreement human annotation results are essential for automatizing the online review summarization process with the natural language processing, this study provides materials for the future study of automatization.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2020-01-13
Fajie Yuan; Xiangnan He; Alexandros Karatzoglou; Liguang Zhang

Inductive transfer learning has greatly impacted the computer vision and NLP domains, but existing approaches in recommender systems remain largely unexplored. Meanwhile, although there has been a large body of research making direct recommendations based on user behavior sequences, few of them attempt to represent and transfer these behaviors for downstream tasks. In this paper, we look in particular at the task of effectively learning a single user representation that can be applied to a diversity of tasks, from cross-domain recommendations to user profile predictions. Fine-tuning a large pre-trained network and adapting it to downstream tasks is an effective way to solve such an issue. However, fine-tuning is parameter inefficient considering that an entire model needs to be re-trained for every new task. To overcome this issue, we develop a parameter-efficient transfer learning architecture, termed as PeterRec, which can be configured on-the-fly to various downstream tasks. Specifically, PeterRec allows the pre-trained parameters unaltered during fine-tuning by injecting a series of re-learned neural networks, which are small but as expressive as learning the entire network. We perform extensive experimental ablation to show the effectiveness of learned user representation in five downstream tasks. Moreover, we show that PeterRec performs efficient transfer learning in multiple domains, where it achieves comparable or sometimes better performance relative to fine-tuning the entire model parameters.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2020-01-13
Olegs Verhodubs

Ontology merging is important, but not always effective. The main reason, why ontology merging is not effective, is that ontology merging is performed without considering goals. Goals define the way, in which ontologies to be merged more effectively. The paper illustrates ontology merging by means of rules, which are generated from these ontologies. This is necessary for further use in expert systems.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2019-12-17
Joy Bose

Boilerplate refers to unwanted and repeated parts of a webpage (such as ads or table of contents) that distracts the user from reading the core content of the webpage, such as a news article. Accurate detection and removal of boilerplate content from a webpage can enable the users to have a clutter free view of the webpage or news article. This can be useful in features like reader mode in web browsers. Current implementations of reader mode in web browsers such as Firefox, Chrome and Edge perform reasonably well for textual content in webpages. However, they are mostly heuristic based and not flexible when the webpage content is dynamic. Also they often do not perform well for removing boilerplate content in the form of images and multimedia in webpages. For detection of boilerplate images, one needs to have knowledge of the actual layout of the images in the webpage, which is only possible when the webpage is rendered. In this paper we discuss some of the issues in relevant image extraction. We also present the design of a testing framework to measure accuracy and a classifier to extract relevant images by leveraging a headless browser solution that gives the rendering information for images.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2019-12-23
Pegah Sagheb Haghighi; Olurotimi Seton; Olfa Nasraoui

Autoencoders are a common building block of Deep Learning architectures, where they are mainly used for representation learning. They have also been successfully used in Collaborative Filtering (CF) recommender systems to predict missing ratings. Unfortunately, like all black box machine learning models, they are unable to explain their outputs. Hence, while predictions from an Autoencoder-based recommender system might be accurate, it might not be clear to the user why a recommendation was generated. In this work, we design an explainable recommendation system using an Autoencoder model whose predictions can be explained using the neighborhood based explanation style. Our preliminary work can be considered to be the first step towards an explainable deep learning architecture based on Autoencoders.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2019-12-20
Mukul Kumar; Youna Hu; Will Headden; Rahul Goutam; Heran Lin; Bing Yin

Understanding search queries is critical for shopping search engines to deliver a satisfying customer experience. Popular shopping search engines receive billions of unique queries yearly, each of which can depict any of hundreds of user preferences or intents. In order to get the right results to customers it must be known queries like "inexpensive prom dresses" are intended to not only surface results of a certain product type but also products with a low price. Referred to as query intents, examples also include preferences for author, brand, age group, or simply a need for customer service. Recent works such as BERT have demonstrated the success of a large transformer encoder architecture with language model pre-training on a variety of NLP tasks. We adapt such an architecture to learn intents for search queries and describe methods to account for the noisiness and sparseness of search query data. We also describe cost effective ways of hosting transformer encoder models in context with low latency requirements. With the right domain-specific training we can build a shareable deep learning model whose internal representation can be reused for a variety of query understanding tasks including query intent identification. Model sharing allows for fewer large models needed to be served at inference time and provides a platform to quickly build and roll out new search query classifiers.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2019-12-18
Xin Dong; Jingchao Ni; Wei Cheng; Zhengzhang Chen; Bo Zong; Dongjin Song; Yanchi Liu; Haifeng Chen; Gerard de Melo

Recently, recommender systems have been able to emit substantially improved recommendations by leveraging user-provided reviews. Existing methods typically merge all reviews of a given user or item into a long document, and then process user and item documents in the same manner. In practice, however, these two sets of reviews are notably different: users' reviews reflect a variety of items that they have bought and are hence very heterogeneous in their topics, while an item's reviews pertain only to that single item and are thus topically homogeneous. In this work, we develop a novel neural network model that properly accounts for this important difference by means of asymmetric attentive modules. The user module learns to attend to only those signals that are relevant with respect to the target item, whereas the item module learns to extract the most salient contents with regard to properties of the item. Our multi-hierarchical paradigm accounts for the fact that neither are all reviews equally useful, nor are all sentences within each review equally pertinent. Extensive experimental results on a variety of real datasets demonstrate the effectiveness of our method.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2019-12-14
Christine Bauer; Eva Zangerle

In this paper, we focus on recommendation settings with multiple stakeholders with possibly varying goals and interests, and argue that a single evaluation method or measure is not able to evaluate all relevant aspects in such a complex setting. We reason that employing a multi-method evaluation, where multiple evaluation methods or measures are combined and integrated, allows for getting a richer picture and prevents blind spots in the evaluation outcome.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2019-12-11
Anupriya Gogna; Angshul Majumdar

Design of recommender systems aimed at achieving high prediction accuracy is a widely researched area. However, several studies have suggested the need for diversified recommendations, with acceptable level of accuracy, to avoid monotony and improve customers experience. However, increasing diversity comes with an associated reduction in recommendation accuracy; thereby necessitating an optimum tradeoff between the two. In this work, we attempt to achieve accuracy vs diversity balance, by exploiting available ratings and item metadata, through a single, joint optimization model built over the matrix completion framework. Most existing works, unlike our formulation, propose a 2 stage model, a heuristic item ranking scheme on top of an existing collaborative filtering technique. Experimental evaluation on a movie recommender system indicates that our model achieves higher diversity for a given drop in accuracy as compared to existing state of the art techniques.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2020-01-13
Liang Xu; Qianqian Dong; Cong Yu; Yin Tian; Weitang Liu; Lu Li; Xuanwei Zhang

In this paper, we introduce the NER dataset from CLUE organization (CLUENER2020), a well-defined fine-grained dataset for name entity recognition in Chinese. CLUENER2020 contains 10 categories. Apart from common labels like person, organization and location, it contains more diverse categories. It is more challenging than current other Chinese NER datasets and could better reflect real-world applications. For comparison, we implement several state-of-the-art baselines as sequence labelling tasks and report human performance, as well as its analysis. To facilitate future work on fine-grained NER for Chinese, we release our dataset, baselines and leader-board.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2020-01-13
Hiba Arnaout; Simon Razniewski; Gerhard Weikum

Knowledge bases (KBs), pragmatic collections of knowledge about notable entities, are an important asset in applications such as search, question answering and dialogue. Rooted in a long tradition in knowledge representation, all popular KBs only store positive information, while they abstain from taking any stance towards statements not contained in them. In this paper, we make the case for explicitly stating interesting statements which are not true. Negative statements would be important to overcome current limitations of question answering, yet due to their potential abundance, any effort towards compiling them needs a tight coupling with ranking. We introduce two approaches towards compiling negative statements. (i) In peer-based statistical inferences, we compare entities with highly related entities in order to derive potential negative statements, which we then rank using supervised and unsupervised features. (ii) In query-log-based text extraction, we use a pattern-based approach for harvesting search engine query logs. Experimental results show that both approaches hold promising and complementary potential. Along with this paper, we publish the first datasets on interesting negative information, containing over 1.1M statements for 100K popular Wikidata entities.

更新日期：2020-01-14
• arXiv.cs.IR Pub Date : 2019-07-05
Xavier BostLIA; Vincent LabatutLIA

A character network is a graph extracted from a narrative, in which vertices represent characters and edges correspond to interactions between them. A number of narrative-related problems can be addressed automatically through the analysis of character networks, such as summarization, classification, or role detection. Character networks are particularly relevant when considering works of fictions (e.g. novels, plays, movies, TV series), as their exploitation allows developing information retrieval and recommendation systems. However, works of fiction possess specific properties making these tasks harder. This survey aims at presenting and organizing the scientific literature related to the extraction of character networks from works of fiction, as well as their analysis. We first describe the extraction process in a generic way, and explain how its constituting steps are implemented in practice, depending on the medium of the narrative, the goal of the network analysis, and other factors. We then review the descriptive tools used to characterize character networks, with a focus on the way they are interpreted in this context. We illustrate the relevance of character networks by also providing a review of applications derived from their analysis. Finally, we identify the limitations of the existing approaches, and the most promising perspectives.

更新日期：2020-01-14
Contents have been reproduced by permission of the publishers.

down
wechat
bug