1 Introduction

With the advancement of computing technology, the GPU cluster, a typical heterogeneous cluster, has become one of the most significant computing infrastructures and is widely applied in scientific computing, information services and big data processing. Advances in hardware have driven improvements in various fields, including computer vision [6], social media and e-commerce platforms. The growth of social media and e-commerce platforms has in turn made online product reviews exceptionally popular, so massive numbers of online product reviews are generated. To process and analyze these abundant textual reviews, Natural Language Processing (NLP) has garnered significant attention in recent years. Aspect-based sentiment analysis, an important research topic in the NLP field, offers fine-grained tasks for mining aspect information from product reviews. Accurately mining this information involves three important subtasks: aspect term extraction (ATE), aspect category classification (ACC), and aspect-level sentiment classification (ASC).

The question–answering (Q&A)-style review, a novel form of product review, consists of questions and answers: potential consumers ask questions, and sellers or people who have purchased the product provide answers. Figure 1 shows an example of Q&A-style reviews with annotation information. Conventional reviews may include many useless messages (e.g. “The color is beautiful, the price is low and performance is great. A movie star is endorsing it and you’ll regret if you miss it.”). In contrast, Q&A-style reviews are conversation pairs: they are more targeted, and the topic of the answer text is confined to the topic of the question text. Hence Q&A-style reviews effectively reduce the number of fake reviews and make product information more credible. Aspect-based sentiment analysis is therefore particularly necessary and meaningful for mining the valuable information contained in Q&A-style reviews.

Recently, several studies have focused on the ATE and ACC tasks. However, most of them address traditional reviews rather than Q&A-style reviews, and they treat ATE and ACC as independent tasks and deal with them separately, even though the tasks are highly interrelated. Intuitively, extracted aspect term information assists aspect category prediction, and aspect category information helps distinguish aspect terms from other words unrelated to aspect information. In addition, to the best of our knowledge, no one has explored the ATE and ACC tasks on Q&A-style reviews using GPU clusters. One barrier to these two tasks on Q&A-style reviews is that corpora of Q&A-style reviews, especially Chinese corpora, are scarce. We resolved this difficulty in previous work by designing a set of elaborate annotation rules and building a high-quality annotated corpus [31, 32]. Beyond that, some other important problems still need to be addressed in the ATE and ACC tasks.

Fig. 1 An example of a question–answering (Q&A)-style review

On the one hand, most studies do not make full use of training resources. E-commerce platforms have by now accumulated massive numbers of online product reviews, including Q&A-style reviews. This allows us to obtain larger training sets, which can significantly improve the performance of our model. Apart from more training data, we can also exploit the greater computing power and speed of GPU clusters. This can greatly improve the quality of the model: it not only allows more data to be processed during the training stage, but also reduces the iteration time of experiments, so that researchers can try out new ideas and configurations. Faster training also enables networks to be deployed in applications whose models need to be updated frequently. In this paper, we employ data parallelism, which trains several mini-batches simultaneously on parallel GPUs. We allocate training data to multiple processors to compute gradient updates, then aggregate these partial updates to obtain the final result. To the best of our knowledge, we are the first to explore aspect-based sentiment analysis tasks on GPU clusters.

On the other hand, Q&A-style reviews pose some challenges of their own. First, because of the colloquial and informal nature of online reviews, existing word segmentation toolkits generate errors when dealing with the text of Q&A-style reviews, which degrades the performance of the subsequent model. As a solution, we adopt character-level rather than word-level embeddings to represent Q&A text. Second, the ACC task is more difficult for Q&A-style reviews than for conventional reviews because of occasional irrelevant aspect terms. With this in mind, it only makes sense to focus on the aspect term mentioned in both the question and answer context. Third, the ATE and ACC tasks are correlated: extracted aspect term information assists aspect category prediction, and aspect category information helps distinguish aspect terms from other words unrelated to aspect information. To exploit this correlation, we propose a novel multi-task neural learning framework that jointly addresses the two tasks.

In this paper, by analyzing all problems of Q&A reviews mentioned above, our research offers the following contributions:

  • To make full use of training resources and improve the training speed of models, we deploy our model and all baselines on GPU clusters and use a data parallel strategy for model training. We compare the performance and training time of our proposed model with those of the baselines on GPU clusters.

  • To contend with the colloquial and informal nature of online reviews, we avoid word segmentation errors by adopting character-level rather than word-level embeddings. In this way, our proposed model improves the performance of the ACC task and implements fine-grained extraction for the ATE task.

  • To address the occasional irrelevant aspect terms mentioned in the Q&A context, we introduce an attention mechanism that captures the most relevant aspect information mentioned by both the question and answer contexts, which improves the performance of the ACC task.

  • To further improve the performance of ACC and ATE, we leverage the correlation between ATE and ACC to jointly address the two tasks.

2 Related work

Fig. 2 The architecture of our proposed multi-task model with attention mechanism

2.1 Aspect category classification

The ACC task can be treated as a supervised classification task [20]. Given a list of predefined categories, the task is to identify the category of a specified aspect term. Traditional approaches mainly focused on manually designing a set of features, such as a bag-of-words or a lexicon, to train a classifier. Brychcin et al. [2] leveraged a set of binary Maximum Entropy (ME) classifiers for ACC. Kiritchenko et al. [13] used a set of binary Support Vector Machines (SVMs) with different types of n-grams and information from a specially designed lexicon. However, these approaches depend heavily on the quality of the features, and feature engineering is labor-intensive.

With the development of deep learning techniques, researchers have designed effective neural networks to address the ACC task. Toh et al. [28] extracted features from the words in every sentence and adopted a sigmoidal feedforward network to train a binary classifier. Xue et al. [34] proposed a multi-task neural network to jointly address ACC and ATE. (Our research differs from theirs in that their work was built on conventional reviews, whereas our proposed model is based on Q&A-style reviews, and we leverage conditional random fields to further improve the performance of ATE.) More recently, Wu et al. [33] proposed a 4-dimension textual representation model on Q&A-style reviews for the ACC task.

2.2 Aspect term extraction

The ATE task extracts aspect and opinion terms explicitly contained in the sentence [25]. Early work focused on rule-based methods. Hu and Liu [10] leveraged frequent nouns or noun phrases to extract aspect terms, and tried to identify opinion terms by exploiting the relationships and co-occurrence between aspect terms and opinion terms. However, rule-based approaches rely heavily on hard-coded rules and external language resources. Later, ATE was treated as a sequence tagging problem and addressed with supervised feature-based methods such as Hidden Markov Models (HMMs) [17] or Conditional Random Fields (CRFs) [27]. However, feature-based approaches rely greatly on the quality of the features, and again, feature engineering is both time-consuming and labor-intensive.

With the rapid development of neural networks, researchers proposed a neural language model that learns general high-level representations of words for extracting aspect terms [9]. Liu et al. [21] used pre-trained word embeddings as the input of a Recurrent Neural Network (RNN) for ATE. Yin et al. [35] proposed a hybrid method that first learns a distributed representation of words and dependency paths with an RNN and then feeds the learned results, along with some hand-crafted features, into a CRF [16] for extracting aspect terms. Wang et al. [29] proposed a joint model consisting of a Recursive Neural Network (ReNN) and a CRF layer for the ATE task. To reduce the influence of parsing errors, they further designed an RNN with coupled multilayer attention to exploit the relationship between aspect terms and opinion terms for co-extraction [30]. Recently, Li et al. [19] designed a framework that exploits the opinion summary and aspect detection history for tackling ATE. Ma et al. [22] built a gated unit network with an attention mechanism to adapt Seq2Seq learning to the ATE task. Li et al. [18] alleviated the data scarcity problem in the ATE task by proposing a masked sequence-to-sequence data augmentation method.

2.3 Multi-GPU parallel computing

Parallel computing solves computing problems by using multiple computing resources simultaneously, and is a useful means of enhancing the computational efficiency and processing ability of a computer system. The history of parallel computing can be traced back to the 1970s, when the first parallel computer, ILLIAC IV, was built. According to the underlying principle, parallel computing can be divided into Data Parallelism (DP) [3] and Model Parallelism (MP) [4]. In DP, each GPU uses the same model to train on a different subset of the training data and compute gradients, which need to be aggregated across the GPUs [15]. This strategy is widely used because of its simplicity and its effectiveness in reducing time cost [5].

With the growth of machine learning techniques in recent years, researchers have brought parallel computing to the domain. Meyer et al. [23] introduced DP to the Support Vector Machine, which reduced the computation time considerably with only a minor loss in accuracy. With regard to deep learning, Krizhevsky [14] observed that the convolutional layers of a Convolutional Neural Network (CNN) hold only about 5% of the parameters while accounting for about 95% of the total computation time. He leveraged DP in the convolutional layers and reduced the time cost effectively. Tencent’s Mariana [36] used DP and gained a \(2.67\times \) speedup with four GPUs. In recent years, more parallel deep learning methods have been proposed [11]. On the algorithmic side, several algorithms have been proposed to accelerate multi-GPU implementations or to make inference more accurate [1, 26] and faster [7, 12]. Moreover, research has been done on integrating DP and MP [8].

3 Proposed method

In this section, we describe the ATE and ACC tasks based on Q&A text pairs and our parallel strategy. On this basis, considering the characteristics of Q&A-style reviews, we propose a multi-task model to jointly address the two tasks. Intuitively, the question text tends to be more important, because the aspect term needing categorization tends to appear in the question first. The answer text, in turn, also involves information related to the aspect term mentioned in the question text. Thus, we need to better model the representation of the question text by harnessing the relevant aspect information contained in both the question and answer text. Specifically, our proposed model uses two Bidirectional Long Short-Term Memories (Bi-LSTMs) to generate hidden state representations of the question and answer text, respectively. For the ATE task, we use a fully connected layer and a CRF layer to extract the aspect term in the question text. For the ACC task, an attention mechanism is applied to capture the most relevant aspect information between the question and answer text, and to extend the representation of the question text by leveraging the relevant aspect information contained in the answer text. Finally, to make full use of training resources, we design a data parallel strategy for our proposed model.

3.1 Aspect term extraction and aspect category classification tasks

We tackle the ATE task as a sequence tagging problem, which extracts an explicit aspect term from the question text. Note that the extracted term could be a single word or a phrase. From the sequence tagging perspective, the word tokens related to the given aspect category should be tagged according to a predefined label scheme. We define the label scheme as {B,I,E,O,S}, where B indicates the beginning of an aspect term, I indicates the inside of an aspect term, E indicates the end of an aspect term, and O means others. In particular, if the aspect term is a single word, we label it as S. In this way, the question text “How about the color of the phone?” can be tagged as “How/O about/O the/B color/I of/I the/I phone/E ?/O”. Thus, we address the ATE task by training a sequence labeling model based on the combination of Bi-LSTM and CRF layers.
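As an illustration of the {B,I,E,O,S} scheme described above, the following minimal sketch converts a tag sequence back into aspect-term spans. The function name `decode_bioes` and the choice to join tokens with a space are assumptions made for this example (for character-level Chinese text one would join with an empty string); it is not taken from the authors' code.

```python
from typing import List, Tuple


def decode_bioes(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, text) for every well-formed aspect term."""
    spans = []
    start = None  # index of the currently open B tag, if any
    for i, tag in enumerate(tags):
        if tag == "S":  # single-token aspect term
            spans.append((i, i + 1, tokens[i]))
            start = None
        elif tag == "B":  # open a new multi-token term
            start = i
        elif tag == "I":  # stay inside the open term (ignored if none is open)
            continue
        elif tag == "E":  # close the open term, if there is one
            if start is not None:
                spans.append((start, i + 1, " ".join(tokens[start:i + 1])))
            start = None
        else:  # "O" (or an unknown tag) discards any unfinished term
            start = None
    return spans


tokens = ["How", "about", "the", "color", "of", "the", "phone", "?"]
tags = ["O", "O", "B", "I", "I", "I", "E", "O"]
print(decode_bioes(tokens, tags))  # [(2, 7, 'the color of the phone')]
```

Malformed sequences (for example an E without a preceding B) are silently skipped here; a stricter decoder could raise an error instead.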

Unlike ATE, the ACC task is treated as a general classification problem rather than a sequence labeling problem. Given the predefined categories, the task is to identify the aspect category of the specified aspect term. Thus, the proposed model uses two Bi-LSTM layers to model representations of the question text and answer text, and then adopts an attention mechanism to extend the representation of the question text, improving our model’s performance on the ACC task.

3.2 Multi-task model

Figure 2 shows the architecture of the multi-task learning framework. Given a Q&A-style review, assume that the question text \(Q=\{w_{1}, w_{2}, \ldots , w_{M}\}\) contains M single words (characters), where \(w_{i}\) represents the ith single word in the question text. Each single word is represented as \(q_{i}\in R^{d_{w}}\), which is obtained from an embedding matrix \(E\in R^{d_{w}\times |V|}\), where \(d_{w}\) is the embedding dimension and |V| is the vocabulary size. Thus, we represent the question text as a character-level embedding matrix \(S_Q=\{q_{1}, q_{2}, \ldots , q_{M}\}\). Similarly, we represent the answer text \(A=\{s_1, s_2, \ldots , s_{N}\}\) as a character-level embedding matrix \(S_A=\{a_{1} ,a_{2}, \ldots , a_{N}\}\), where \(a_{j}\in R^{d_{w}}\) denotes the embedding of the jth single word of the answer text and N is the number of single words in the answer text.

Next, we feed the character-level embedding matrix of question text \(S_Q\) into a Bi-LSTM layer shared by the ATE and ACC tasks to generate a hidden state matrix of question text \(H_Q=\{h_{q_1}, h_{q_2}, \ldots , h_{q_{M}}\}\), where we obtain the hidden state of each single word by averaging the forward and backward hidden state:

$$\begin{aligned}&\overrightarrow{H_{Q}}=\overrightarrow{L S T M}\left( S_{Q}\right) \end{aligned}$$
(1)
$$\begin{aligned}&\overleftarrow{H_{Q}}=\overleftarrow{L S T M}\left( S_{Q}\right) \end{aligned}$$
(2)
$$\begin{aligned}&H_{Q}=AVG(\overrightarrow{H_{Q}} ; \overleftarrow{H_{Q}}) \end{aligned}$$
(3)

where \(H_{Q}\in R^{d_{h}\times M}\), \(d_{h}\) is the dimension of hidden state and M is the number of single words in the question text.

3.2.1 Aspect term extraction

Given the hidden state matrix of question text \(H_Q\), the model transforms it into an output label space using an additional fully connected layer:

$$\begin{aligned} P = H_{Q}^{T}\cdot W_{ate} + b_{ate} \end{aligned}$$
(4)

where \(P\in R^{M\times N_{t}}\) is the output score matrix over \(N_{t}\) labels, and \(P_{ij}\) denotes the score of the jth tag for the ith single word in the question text. \(W_{ate}\in R^{d_{h}\times N_{t}}\) and \(b_{ate}\in R^{N_{t}}\) are the parameters of the fully connected layer. We then use a conditional random field (CRF) layer for tagging because it takes the labels of neighboring tokens into account, much as the bidirectional LSTM layer uses past and future input features. The CRF layer takes the sequence of score vectors P as input and returns a sequence of labels. For the given question text \(Q = \{w_1, w_2, w_3, \ldots, w_{M}\}\), we define the prediction score of each output tag sequence \(z=\{z_1, z_2, \ldots , z_{M}\}\), where \(z_i\) denotes the label of \(w_i\), as:

$$\begin{aligned} Score(Q,z) = \sum _{i=1}^{M-1} A_{z_i,z_{i+1}} + \sum _{i=1}^{M} P_{i,z_{i}} \end{aligned}$$
(5)

where \(A\in R^{N_{t}\times N_{t}}\) is a transition score matrix, and \(A_{ij}\) is the transition score from label i to label j. Furthermore, we apply a softmax function over all possible tag sequences to compute the posterior probability:

$$\begin{aligned} P(z|Q) = \frac{e^{Score(Q,z)}}{\sum _{\tilde{z}\in Z_{Q}} e^{Score(Q,\tilde{z})}} \end{aligned}$$
(6)

where \(Z_{Q}\) denotes the set of all possible tag sequences for Q. For CRF training, we use maximum conditional likelihood estimation. For a training set \(\left\{ \left( {\mathbf {Q}}_{i}, {\textit{\textbf{z}}}_{i}\right) \right\} \), the logarithm of the likelihood (a.k.a. the log-likelihood) is given by:

$$\begin{aligned} L(params)=\sum _{i} \log p({\textit{\textbf{z}}}_{i} | {\mathbf {Q}}_{i} ; params) \end{aligned}$$
(7)
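To make the score, posterior and log-likelihood above concrete, here is a minimal pure-Python sketch. It assumes that the emission scores P and transition scores A are already given as nested lists; the function names are hypothetical. The normalizer is computed with the standard forward algorithm, which the text does not mention explicitly, and it is checked against brute-force enumeration of all tag sequences.

```python
import math
from itertools import product
from typing import List, Sequence


def logsumexp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(P: List[List[float]], A: List[List[float]], z: Sequence[int]) -> float:
    """Score(Q, z): transition scores plus emission scores of tag sequence z."""
    emit = sum(P[i][z[i]] for i in range(len(z)))
    trans = sum(A[z[i]][z[i + 1]] for i in range(len(z) - 1))
    return emit + trans


def log_partition(P: List[List[float]], A: List[List[float]]) -> float:
    """log of the softmax denominator, via the forward algorithm."""
    n_tags = len(A)
    alpha = list(P[0])  # log-scores of length-1 prefixes ending in each tag
    for i in range(1, len(P)):
        alpha = [
            P[i][j] + logsumexp([alpha[k] + A[k][j] for k in range(n_tags)])
            for j in range(n_tags)
        ]
    return logsumexp(alpha)


def brute_force_log_partition(P: List[List[float]], A: List[List[float]]) -> float:
    """Same quantity by enumerating every tag sequence (only for tiny inputs)."""
    all_z = product(range(len(A)), repeat=len(P))
    return logsumexp([sequence_score(P, A, z) for z in all_z])


def log_likelihood(P: List[List[float]], A: List[List[float]], z: Sequence[int]) -> float:
    """log P(z | Q) for one sequence, i.e. one summand of the log-likelihood."""
    return sequence_score(P, A, z) - log_partition(P, A)


# Toy example: 3 tokens, 2 tags.
P = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]
A = [[0.5, -0.2], [0.1, 0.3]]
total = sum(math.exp(log_likelihood(P, A, z)) for z in product(range(2), repeat=3))
print(round(total, 6))  # probabilities over all tag sequences sum to 1.0
```

The forward recursion costs on the order of M times the square of the number of tags, whereas brute-force enumeration grows exponentially with the sequence length M.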

Maximum likelihood training chooses parameters such that the log-likelihood L(params) is maximized. While decoding, according to the principle of maximizing posterior probability, we select the tag sequence that maximizes the posterior probability as the optimal path \(z^*\) and then extract aspect terms according to the tag sequence:

$$\begin{aligned} z^* = \mathop {\arg \max _{\tilde{z}\in Z_{Q}}} Score(Q,\tilde{z}) \end{aligned}$$
(8)
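The arg max over all tag sequences is usually computed with Viterbi decoding rather than by enumeration. The text does not name the decoding algorithm, so the following is a sketch of that standard technique under the same assumptions as before (emission scores P and transition scores A given as nested lists, hypothetical function names), with a brute-force check for small inputs.

```python
from itertools import product
from typing import List, Sequence


def score(P: List[List[float]], A: List[List[float]], z: Sequence[int]) -> float:
    """Score(Q, z): sum of emission and transition scores."""
    emit = sum(P[i][z[i]] for i in range(len(z)))
    trans = sum(A[z[i]][z[i + 1]] for i in range(len(z) - 1))
    return emit + trans


def viterbi(P: List[List[float]], A: List[List[float]]) -> List[int]:
    """Return a highest-scoring tag sequence by dynamic programming."""
    n_tags = len(A)
    best = list(P[0])  # best[j]: best score of a prefix ending in tag j
    back = []          # back[i][j]: best previous tag when position i+1 has tag j
    for i in range(1, len(P)):
        new_best, pointers = [], []
        for j in range(n_tags):
            k = max(range(n_tags), key=lambda t: best[t] + A[t][j])
            new_best.append(best[k] + A[k][j] + P[i][j])
            pointers.append(k)
        best = new_best
        back.append(pointers)
    # Trace back from the best final tag.
    tag = max(range(n_tags), key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


def brute_force_best(P: List[List[float]], A: List[List[float]]) -> List[int]:
    """Reference decoder that enumerates every sequence (tiny inputs only)."""
    return list(max(product(range(len(A)), repeat=len(P)), key=lambda z: score(P, A, z)))


# Toy example: 4 tokens, 3 tags.
P = [[1.0, 0.5, 0.0], [0.2, 1.5, 0.3], [0.3, 0.1, 0.9], [0.0, 0.4, 0.2]]
A = [[0.5, -0.2, 0.0], [0.1, 0.3, -0.5], [0.2, 0.0, 0.4]]
path = viterbi(P, A)
print(path, score(P, A, path) == score(P, A, brute_force_best(P, A)))
```

A practical decoder might additionally forbid invalid transitions (such as O followed by E) by setting the corresponding entries of A to a large negative value; that refinement is not described in the text.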

3.2.2 Aspect category classification

Given the character-level embedding matrix of the answer text \(S_A\), the model uses another Bi-LSTM layer to obtain a hidden state matrix of the answer text \(H_A = \{h_{a_1}, h_{a_2}, \ldots , h_{a_{N}}\}\):

$$\begin{aligned} \overrightarrow{H_{A}}= & {} \overrightarrow{LSTM}(S_{A}) \end{aligned}$$
(9)
$$\begin{aligned} \overleftarrow{H_{A}}= & {} \overleftarrow{LSTM}(S_{A}) \end{aligned}$$
(10)
$$\begin{aligned} H_{A}= & {} AVG(\overrightarrow{H_{A}};\overleftarrow{H_{A}}) \end{aligned}$$
(11)

where \(H_{A}\in R^{d_{h}\times N}\) and N is the number of single words in the answer text. Noting that there may be irrelevant aspect terms mentioned in the Q&A text, we adopt an attention mechanism to capture the most relevant aspect information mentioned in both the question and answer context. Thus, the vector representation of question text could be enhanced by making full use of aspect information contained in the answer text. The attention layer calculates the attention representation of the question text according to the following formulas:

$$\begin{aligned} M= & {} \tanh (W_{c}\cdot (H_{A}^{T}\cdot H_Q)+ b_{c}) \end{aligned}$$
(12)
$$\begin{aligned} \alpha= & {} softmax(W_e^T\cdot M) \end{aligned}$$
(13)
$$\begin{aligned} r= & {} H_Q\cdot \alpha ^T \end{aligned}$$
(14)

where \(M\in R^{N\times M}\) is the attention matrix (in its dimensions, N and M are the lengths of the answer and question text), r is the attention representation of the question text, and \(W_{c}\in R^{N\times N}\), \(b_{c}\in R^{M}\), \(W_e\in R^{N}\) are parameters to be trained. The final vector representation of the question text is then calculated by non-linearly combining r with the final hidden state \(h_{q_{M}}\):

$$\begin{aligned} h^* = \tanh (W_{f}r + W_{x}h_{q_{M}}) \end{aligned}$$
(15)

where \(h^*\in R^{d_h}\), \(W_{f}\in R^{d_h\times d_h}\) and \(W_{x}\in R^{d_h\times d_h}\) are parameters to be trained. In the softmax layer, the final aspect category distribution of the given question text is predicted using the final vector representation of question text \(h^*\):

$$\begin{aligned} y = softmax(Wh^*+b) \end{aligned}$$
(16)

where \(W\in R^{K\times d_h}\), \(b\in R^{K}\) are parameters in the softmax layer and K is the number of predefined categories.
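The attention and classification steps in Eqs. (12)–(16) can be traced with a small pure-Python sketch. The helper names are hypothetical, the matrices are toy values, and one detail is an assumption: the bias \(b_c\in R^{M}\) is broadcast over the N rows of the score matrix, which the text does not specify.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X: Matrix) -> Matrix:
    return [list(col) for col in zip(*X)]


def softmax(v: List[float]) -> List[float]:
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def attend(H_Q: Matrix, H_A: Matrix, W_c: Matrix, b_c: List[float],
           W_e: List[float]) -> Tuple[List[float], List[float]]:
    """H_Q is d_h x M and H_A is d_h x N; returns (alpha, r) as in Eqs. (12)-(14)."""
    scores = matmul(W_c, matmul(transpose(H_A), H_Q))  # N x M
    # Assumption: b_c (length M) is added to every one of the N rows.
    M_att = [[math.tanh(s + b) for s, b in zip(row, b_c)] for row in scores]
    # W_e^T . M gives one logit per question position (length M).
    logits = [sum(w * M_att[n][m] for n, w in enumerate(W_e)) for m in range(len(b_c))]
    alpha = softmax(logits)
    r = [sum(h * a for h, a in zip(row, alpha)) for row in H_Q]  # H_Q . alpha^T
    return alpha, r


def classify(r: List[float], h_last: List[float], W_f: Matrix, W_x: Matrix,
             W: Matrix, b: List[float]) -> List[float]:
    """Eqs. (15)-(16): h* = tanh(W_f r + W_x h_last), y = softmax(W h* + b)."""
    h_star = [
        math.tanh(sum(f * x for f, x in zip(row_f, r)) +
                  sum(g * x for g, x in zip(row_x, h_last)))
        for row_f, row_x in zip(W_f, W_x)
    ]
    return softmax([sum(w * h for w, h in zip(row, h_star)) + bk for row, bk in zip(W, b)])


# Toy sizes: d_h = 2, M = 3 question positions, N = 2 answer positions, K = 3 categories.
H_Q = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
H_A = [[0.5, -0.2], [0.1, 0.3]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
alpha, r = attend(H_Q, H_A, I2, [0.0, 0.0, 0.0], [1.0, -0.5])
y = classify(r, [0.3, 0.4], I2, I2, [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
print(len(alpha), len(r), len(y))  # 3 2 3
```

Because \(W_c\) is N × N and \(b_c\) has length M, the formulation as written ties the parameter shapes to fixed question and answer lengths, so an implementation along these lines would presumably pad or truncate texts to fixed lengths.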

3.2.3 Data parallel for model

We design our parallelism strategy generally in the following way:

  • Divide the model’s inputs into multiple sub-batches.

  • Apply a model copy on each sub-batch. Every model copy is executed on a dedicated GPU.

  • Concatenate the results (on CPU) into one big batch.
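The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the splitting and concatenation logic: the function names are hypothetical, the "model" is an arbitrary callable, and the replicas run sequentially in a loop rather than on dedicated GPUs.

```python
from typing import Callable, List, Sequence


def split_batch(batch: Sequence, n_replicas: int) -> List[list]:
    """Split a batch into n_replicas contiguous sub-batches of near-equal size."""
    base, extra = divmod(len(batch), n_replicas)
    chunks, start = [], 0
    for r in range(n_replicas):
        size = base + (1 if r < extra else 0)  # first `extra` chunks get one more item
        chunks.append(list(batch[start:start + size]))
        start += size
    return chunks


def data_parallel_apply(model: Callable[[list], list], batch: Sequence,
                        n_replicas: int) -> list:
    """Apply one model copy per sub-batch and concatenate outputs in input order."""
    outputs = []
    for chunk in split_batch(batch, n_replicas):
        # In a real setup each call would run on its own GPU; here it is sequential.
        outputs.extend(model(chunk))
    return outputs


print(split_batch(list(range(10)), 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
print(data_parallel_apply(lambda xs: [2 * x for x in xs], list(range(10)), 4))
```

Keeping the sub-batches contiguous means that the concatenated outputs come back in the same order as the inputs.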

As mentioned above, gradients must be propagated backward to update the parameters of the network. In modern deep learning this is normally done with stochastic gradient descent, because the dataset is too big to fit into memory. For example, if the training dataset has 10K data points, we may only be able to use 16 data points at a time to estimate the gradients; otherwise the GPU may stop working because it runs out of memory.

The shortcoming of stochastic gradient descent is that the gradient estimate might not accurately represent the true gradient computed on the full dataset. Therefore, it may take much longer to converge.

A natural way to obtain a more accurate gradient estimate is to use larger batch sizes, or even the full dataset. To allow this, the gradients of small batches are calculated on each GPU, and the final gradient estimate is the weighted average of the gradients calculated from all the small batches.

Mathematically, data parallelism is valid because of

$$\begin{aligned} \begin{aligned} \frac{\partial {\text {Loss}}}{\partial w}&=\frac{\partial \left[ \frac{1}{n} \sum _{i=1}^{n} f\left( x_{i}, y_{i}\right) \right] }{\partial w} =\frac{1}{n} \sum _{i=1}^{n} \frac{\partial f\left( x_{i}, y_{i}\right) }{\partial w} \\&=\sum _{j=1}^{k}\frac{m_{j}}{n} \frac{\partial \left[ \frac{1}{m_{j}} \sum _{i=s_{j-1}+1}^{s_{j}} f\left( x_{i},y_{i}\right) \right] }{\partial w} \\&=\frac{m_{1}}{n} \frac{\partial l_{1}}{\partial w}+\frac{m_{2}}{n} \frac{\partial l_{2}}{\partial w}+\cdots +\frac{m_{k}}{n} \frac{\partial l_{k}}{\partial w} \end{aligned} \end{aligned}$$
(17)

where \(s_{j}=m_{1}+m_{2}+\cdots +m_{j}\) is the index of the last data point assigned to GPU j, with \(s_0=0\); w denotes the parameters of the model; \(\frac{\partial \mathrm {Loss}}{\partial w}\) is the true gradient of the big batch of size n; \(l_{j}\) is the mean loss of the small batch on GPU j and \(\frac{\partial l_{j}}{\partial w}\) is its gradient; \(x_i\) and \(y_i\) are the features and labels of data point i; \(f(x_i,y_i)\) is the loss for data point i calculated from the forward propagation; n is the total number of data points in the dataset; k is the total number of GPUs; \(m_j\) is the number of data points assigned to GPU j; and \(m_1+m_2+\cdots +m_k=n\). When \(m_{1}=m_{2}=\cdots =m_{k}=\frac{n}{k}\), we further have:

$$\begin{aligned} \frac{\partial \mathrm {Loss}}{\partial w}=\frac{1}{k}\left[ \frac{\partial l_{1}}{\partial w}+\frac{\partial l_{2}}{\partial w}+\cdots +\frac{\partial l_{k}}{\partial w}\right] \end{aligned}$$
(18)

Here, every GPU node uses the same model parameters for forward propagation. We send a different small batch of data to each node, compute the gradient as usual, and send the gradients back to the main node. This step is asynchronous because the speed of each GPU node is slightly different. Once all the gradients have arrived (this is the synchronization point), we calculate their (weighted) average and use it to update the model parameters. Then we move on to the next iteration.
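The identity behind data parallelism can be checked numerically. The sketch below uses a toy linear model with a squared-error loss, which is an assumption made purely for illustration (the paper's model is a Bi-LSTM network), and hypothetical function names. It verifies that the size-weighted average of per-shard gradients equals the gradient of the full batch, even when the shards have different sizes.

```python
from typing import List, Tuple

Point = Tuple[List[float], float]  # (features x_i, label y_i)


def mean_gradient(w: List[float], data: List[Point]) -> List[float]:
    """Gradient of the mean squared error (1/m) * sum (w.x - y)^2 with respect to w."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for d, xi in enumerate(x):
            grad[d] += 2.0 * err * xi / len(data)
    return grad


def aggregate(shard_grads: List[List[float]], shard_sizes: List[int]) -> List[float]:
    """Weighted average of per-shard gradients with weights m_j / n."""
    n = sum(shard_sizes)
    dim = len(shard_grads[0])
    return [sum(m * g[d] for g, m in zip(shard_grads, shard_sizes)) / n for d in range(dim)]


data = [([1.0, 2.0], 1.0), ([0.5, -1.0], 0.0), ([2.0, 0.0], 3.0),
        ([-1.0, 1.0], 2.0), ([0.0, 1.5], -1.0)]
w = [0.1, -0.2]

full = mean_gradient(w, data)                      # gradient of the big batch
shards = [data[:2], data[2:]]                      # two "GPUs" with m_1 = 2, m_2 = 3
per_shard = [mean_gradient(w, s) for s in shards]  # computed independently per shard
combined = aggregate(per_shard, [len(s) for s in shards])

print(all(abs(a - b) < 1e-12 for a, b in zip(full, combined)))  # True
```

With equal shard sizes the weights reduce to 1/k, which is the plain average of the per-GPU gradients.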

Table 1 Training data distribution

3.3 Model training

In the ACC task, the training data consist of triples \(S_{Q_t}\), \(S_{At}\), and \(y_t\), where \(S_{Q_t}\) is the tth question text, \(S_{At}\) is the tth answer text, and \(y_t\) is the ground-truth aspect category for the Q&A text pair (\(S_{Q_t}\), \(S_{At}\)). We use the cross entropy between y and \(y_t\) with L2 regularization as the loss function:

$$\begin{aligned} L_{acc} = -\sum _{t=1}^N \sum _{k=1}^K y_{t}^k\log y^k + \frac{l}{2}\parallel \theta \parallel ^2 \end{aligned}$$
(19)

where N is the size of training set, K is the number of predefined categories, l is the parameter for L2 regularization, and \(\theta \) is a parameter set.

In the ATE task, given a set of training data \(S_{Q_t}\), \(S_{At}\), and \(z_t\), where \(S_{Q_t}\) is the tth question text, \(S_{At}\) is the tth answer text, and \(z_t\) is the prediction-output tag sequence for the tth question text, assuming Score(\(S_{Q_t}\),\(z_t\)) is the score of tag sequence \(z_t\), we describe the log-likelihood function as:

$$\begin{aligned} L_{ate} = \sum _{t=1}^N \left[ Score(S_{Q_t},z_t)-\log \Big (\sum _{\tilde{z}\in Z_{Q_t}}e^{Score(S_{Q_t},\tilde{z})}\Big ) \right] \end{aligned}$$
(20)

where N is the size of the training set and \(Z_{Q_t}\) is the set of all possible tag sequences for the tth question text. To learn the parameters of the multi-task model, we define the loss function as a weighted linear combination:

$$\begin{aligned} L = \lambda L_{acc} + (1-\lambda )L_{ate} \end{aligned}$$
(21)

where \(\lambda \) is the weight parameter. Parameters are optimized with the Adam optimizer, and a dropout strategy is adopted to reduce overfitting.
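The weighted combination of the two objectives can be sketched as follows. Two points here are assumptions rather than statements from the text: the ATE term is entered as a negative log-likelihood so that minimizing the joint loss maximizes the CRF likelihood, and the ground-truth category is treated as one-hot so that the inner sum of the cross entropy reduces to a single log term. The function names are hypothetical; the default weight 0.55 is the value reported in the hyper-parameter settings.

```python
import math
from typing import List


def acc_loss(probs: List[List[float]], gold: List[int], params: List[float],
             l2: float) -> float:
    """Cross entropy over the training set plus (l/2) * ||theta||^2."""
    # With a one-hot ground truth, -sum_k y_t^k log y^k is just -log of the gold class.
    cross_entropy = -sum(math.log(p[k]) for p, k in zip(probs, gold))
    return cross_entropy + 0.5 * l2 * sum(t * t for t in params)


def joint_loss(l_acc: float, ate_log_likelihood: float, lam: float = 0.55) -> float:
    """Weighted combination; the ATE log-likelihood is negated (assumption)."""
    return lam * l_acc + (1.0 - lam) * (-ate_log_likelihood)


# Toy values: two examples, two categories, three parameters.
probs = [[0.7, 0.3], [0.2, 0.8]]
gold = [0, 1]
l_acc = acc_loss(probs, gold, params=[0.5, -0.5, 1.0], l2=0.01)
print(joint_loss(l_acc, ate_log_likelihood=-3.2))
```

Setting `lam` closer to 1 emphasizes category classification, while values closer to 0 emphasize term extraction.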

Table 2 Results of aspect category classification in luggage domain
Table 3 Results of aspect term extraction in luggage domain
Table 4 Results of aspect category classification in beauty domain

4 Experiments

4.1 Experimental setting

  • Data settings Our experiments use Q&A-style reviews from the digital, beauty and luggage domains as training data. Table 1 shows the distribution of the experimental data. Each line in the dataset denotes a Q&A pair and consists of three parts: the first part is the question text, which is tagged character by character; the second part is the untagged answer text; the third part is an (aspect term, aspect category, sentiment) triple. To mitigate the imbalanced distribution of the data, we discard the aspect categories that involve fewer than 50 question–answering text pairs.

  • Character-level representations Considering the informal nature of online reviews, we choose character-level embeddings instead of word-level embeddings to reduce word segmentation errors. Specifically, the character-level embeddings are trained on 320 thousand question–answering text pairs extracted from Taobao, using the skip-gram [24] model provided by the gensim toolkit.

  • Evaluation metrics For the aspect category classification task, the main evaluation metrics are Accuracy (\(A_{acc}\)) and F1-measure (\(F_{acc}\)), where \(F_{acc}\) is calculated as \(F_{acc} = \frac{2P_{acc}R_{acc}}{P_{acc}+R_{acc}}\). For the aspect term extraction task, \(F_{ate}\) is calculated by the formula \(F_{ate} = \frac{2P_{ate}R_{ate}}{P_{ate}+R_{ate}}\).

  • Hyper-parameters All out-of-vocabulary words are initialized by sampling from the uniform distribution U(\(- 0.01\), 0.01), and the dimensions of the character-level embeddings and hidden state vectors are set to 300. Other hyper-parameters are tuned on the development data. The model uses the Adam optimizer with a batch size of 32 and an initial learning rate of 0.002. The weight parameter \(\lambda \) of the multi-task model is set to 0.55, and the dropout rate is set to 0.25 to reduce overfitting.

  • Training data percentage Given the limited amount of training data, the results of the model may still improve if more training data become available. To test whether the model has converged with respect to the amount of data, we run each experiment with training data percentages from 0.2 to 1, keeping all other settings the same.

  • Experiment setup All experiments were conducted on a GPU cluster. The cluster is equipped with two 12-core Intel(R) Xeon(R) E5-2650 v4 @ 2.20GHz processors and eight NVIDIA GeForce 1080Ti GPUs with 10 GB of video memory each. It runs Ubuntu 16.04, CUDA 9.0.176 and cuDNN 7.
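The F1 formulas in the evaluation metrics item above can be computed as in the following sketch. The text does not say how predicted aspect terms are matched against the annotations, so exact matching on (question id, start, end) spans is an assumption here, and the function names are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, int]  # (question id, start index, end index exclusive)


def f1_score(precision: float, recall: float) -> float:
    """F = 2PR / (P + R), defined as 0 when both P and R are 0."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def span_prf(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over extracted aspect-term spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


predicted = {(0, 2, 7), (1, 0, 1), (2, 3, 5)}
gold = {(0, 2, 7), (1, 0, 2)}
p, r, f = span_prf(predicted, gold)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.333 0.5 0.4
```

For the classification task, the same `f1_score` helper could be applied per category and then averaged; whether the reported \(F_{acc}\) is macro- or micro-averaged is not stated in the text.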

Table 5 Results of aspect term extraction in beauty domain
Table 6 Results of aspect category classification in digital domain
Table 7 Results of aspect term extraction in digital domain

4.2 Baseline models

To comprehensively evaluate the performance of our multi-task model, we compare it with several popular baselines for aspect term extraction and aspect category classification on question–answering reviews, respectively.

Fig. 3 Time cost of multiple baselines and our model with different GPU utilization number

Fig. 4 F1-measure score of baselines trained with different proportions of training data in three domains

In ACC task based on question–answering reviews, we build the following baseline models:

  • LSTM(A) This model takes the answer text as input and uses an LSTM network to model it; the hidden state representations are then used for the aspect category classification task.

  • LSTM(Q) This model takes the question text as input and uses an LSTM network to model it; the hidden state representations are then used for the aspect category classification task.

  • LSTM(Q+A) This model takes both the question text and the answer text as input and uses LSTM networks to model the question and answer contexts; the final hidden state representation is obtained by concatenating the hidden state representations of the question text and the answer text.

  • Bi-LSTM This model takes the question text as input and uses a Bi-LSTM network to model it; the hidden state representations are then used for the aspect category classification task.

  • Multi-task This model is a variation of our proposed model for aspect category classification. Compared with ours, it ignores the relevant information between the question text and the answer text.

  • Multi-task+Attention (MTA) This is our proposed model, which addresses question–answering aspect category classification by constructing a multi-task learning framework. The aspect category classification task is based on Bi-LSTMs with an attention mechanism to better represent the question text.

In ATE task based on question–answering reviews, we build the following baseline models:

  • CRF This method uses conditional random fields to extract the aspect term from the question text. It uses character-level embeddings learned by the skip-gram model as input.

  • Bi-LSTM This method uses a Bi-LSTM to model the question text and then leverages a softmax layer for aspect term extraction.

  • Bi-LSTM+CRF This method uses a Bi-LSTM to model the question text and feeds the hidden states into a CRF layer for aspect term extraction.

  • Multi-task This model is a variation of our proposed model for aspect term extraction. Compared with ours, it ignores the relevant information between the question text and the answer text.

  • Multi-task+Attention (MTA) This is our proposed model, which addresses question–answering aspect term extraction by constructing a multi-task learning framework. The aspect term extraction task is conducted based on Bi-LSTM and CRF.

4.3 Experimental result and model comparison

Tables 2, 3, 4, 5, 6 and 7 compare the performance of the proposed model with that of the baseline models on the ACC and ATE tasks. Since the curve trends on different domains are similar, we only show and describe the figures of one domain. Experiments are conducted with the full training data and a single GPU to get an overview of all well-trained models. From the analysis, we can draw the following conclusions:

For the aspect category classification task, the performance of LSTM(Q) is clearly better than that of LSTM(A), which supports the idea that the question text tends to be more important than the answer text. Moreover, LSTM(Q+A) outperforms both LSTM(Q) and LSTM(A), which suggests that combining question text and answer text can improve the performance of aspect category classification. The multi-task model without attention mechanism achieves improvements of \(3.3\% (A_{acc})\) and \(2.0\% (F_{acc})\) in the digital domain, \(1.6\% (A_{acc})\) and \(2.0\% (F_{acc})\) in the beauty domain, and \(0.9\% (A_{acc})\) and \(1.1\% (F_{acc})\) in the luggage domain, which indicates that the multi-task model can improve aspect category classification with the help of extracted aspect information. Further, the Multi-task+Attention model achieves improvements of \(2.1\% (A_{acc})\) and \(5.5\% (F_{acc})\) in the digital domain, \(3.5\% (A_{acc})\) and \(3.7\% (F_{acc})\) in the beauty domain, and \(2.8\% (A_{acc})\) and \(1.8\% (F_{acc})\) in the luggage domain, which indicates that the attention mechanism can capture the most relevant aspect information between question context and answer context and enhance the representation of the question text.

For the aspect term extraction task, Bi-LSTM+CRF achieves improvements of \(3.2\%\) in the digital domain, \(0.8\%\) in the beauty domain, and \(1.5\%\) in the luggage domain compared with Bi-LSTM. This indicates that although Bi-LSTM can learn the context information of the question text, the softmax layer ignores the interactions within the tag sequence. The CRF layer introduces a state transition matrix, which makes use of sentence-level tag information to improve the performance of aspect term extraction. The multi-task model without attention mechanism outperforms Bi-LSTM+CRF by \(1.2\%\) in the digital domain, \(0.8\%\) in the beauty domain and \(1.4\%\) in the luggage domain, which confirms our intuition that aspect category information helps to distinguish aspect terms from other words unrelated to aspect information. Further, the Multi-task+Attention model achieves improvements of \(0.6\%\), \(0.8\%\) and \(2.4\%\) in the three domains, which indicates that improving aspect category classification can further enhance aspect term extraction.
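The difference between a softmax layer and a CRF layer can be illustrated with a small sketch. The code below is not the implementation used in our experiments; the BIO tag set, the emission scores and the transition scores are hypothetical values chosen only for illustration. It compares independent per-token decoding with Viterbi decoding over a transition matrix.

```python
TAGS = ["B", "I", "O"]  # hypothetical BIO tag set

# Hypothetical emission scores: one row per token, one column per tag.
emissions = [
    [0.2, 0.1, 1.0],  # token 0: prefers O
    [0.3, 1.0, 0.9],  # token 1: slightly prefers I when viewed in isolation
    [0.1, 0.2, 1.0],  # token 2: prefers O
]

# Hypothetical transition scores: transitions[i][j] is the score of tag i -> tag j.
# O -> I is strongly penalized because an I tag should follow B or I.
transitions = [
    [0.0, 0.5, 0.0],   # from B
    [0.0, 0.5, 0.0],   # from I
    [0.0, -5.0, 0.5],  # from O
]


def greedy_decode(emissions):
    """Pick the best tag for each token independently (softmax-style decoding)."""
    return [max(range(len(TAGS)), key=lambda t: row[t]) for row in emissions]


def viterbi_decode(emissions, transitions):
    """Find the tag sequence with the highest total emission + transition score."""
    n_tags = len(TAGS)
    score = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            best_prev = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + row[j])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    best = max(range(n_tags), key=lambda t: score[t])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


greedy_tags = [TAGS[t] for t in greedy_decode(emissions)]
viterbi_tags = [TAGS[t] for t in viterbi_decode(emissions, transitions)]
print("greedy: ", greedy_tags)
print("viterbi:", viterbi_tags)
```

In this toy setting, independent decoding produces an I tag directly after an O tag, whereas decoding with the transition scores avoids that invalid sequence. This is the kind of tag-sequence interaction that a softmax layer alone cannot take into account.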

Fig. 5 F1-measure score of our model with different GPU utilization numbers and different training data proportions

Fig. 6 Precision, recall, F1-measure score and accuracy of our model with an increasing number of GPUs

In addition, to evaluate the effect of data parallelism (DP), we conduct further experiments.

We first evaluate the efficiency of DP by varying the number of GPUs used. Results are shown in Fig. 3. The reported time includes training time, predicting time and evaluating time. Several conclusions can be drawn from this figure. First, for the fixed mini-batch size of 32, the total time cost decreases in general. For our model, the best result shortens the time by half in the digital domain of the ACC task; in the beauty and luggage domains, the time is shortened by 43% and 45%, respectively. This observation shows the clear benefit of DP. Next, we notice that the greatest time reduction occurs between no parallelism and 2-GPU DP in most cases. This is reasonable because introducing DP greatly benefits the model's efficiency. As the number of GPUs increases further, the additional time reduction diminishes; the total time reaches its optimum in the 6-GPU cases and then begins to rise. Aggregating gradient updates is a critical step in DP, and the rise can be explained by the communication overhead surpassing the time saved by distributing the computation.
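The trade-off described above can be sketched with a toy cost model. This is only an illustration under strong assumptions: the compute cost is assumed to split evenly across GPUs, the communication cost is assumed to grow linearly with the number of additional GPUs, and both constants are hypothetical units rather than measurements from our experiments.

```python
def epoch_time(n_gpus, compute_time=36.0, comm_cost_per_extra_gpu=1.0):
    """Toy cost model for data parallelism.

    compute_time and comm_cost_per_extra_gpu are hypothetical constants,
    not values measured in the experiments.
    """
    compute = compute_time / n_gpus                         # work is split evenly
    communication = comm_cost_per_extra_gpu * (n_gpus - 1)  # gradient aggregation overhead
    return compute + communication


times = {n: epoch_time(n) for n in range(1, 9)}
best_n = min(times, key=times.get)

for n, t in times.items():
    print(f"{n} GPU(s): {t:.2f} time units")
print("fastest configuration in this toy model:", best_n)
```

With these assumed constants, the largest drop occurs between 1 and 2 GPUs, the gains then shrink, and the total time starts to rise again once the communication term dominates. The location of the optimum depends entirely on the two constants.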

Comparisons of all baselines for the ACC task and the ATE task are shown in Fig. 4. We use the optimal GPU number of 6. We notice that our model reaches the highest F1-measure score in all three domains in both the ATE and ACC tasks when trained with 20% of the training data, with notable improvements of 7.4% in the digital domain, 3.6% in the beauty domain and 35.8% in the luggage domain for the ACC task, and 6.2% in the digital domain, 7.6% in the beauty domain and 6.0% in the luggage domain for the ATE task. Moreover, the curves of our proposed model are flatter than those of the baselines. These observations indicate the robustness of our model when the training data is insufficient. However, we also notice that the results still tend to improve when we use all of our training data, which may signal that the model has not yet reached its best performance.

We further conduct experiments on our model with different training set proportions across different numbers of GPUs. Several observations can be made from Fig. 5. First, the higher the proportion of training data, the better the result. The result with 20% of the training data fluctuates across different numbers of GPUs, which indicates that the model lacks training data in this setting. In contrast, the F1-measure scores for the other training data proportions remain rather stable across all numbers of GPUs. Finally, we compare how the precision, recall, F1-measure score and accuracy of our model differ with the number of GPUs used. Figure 6 shows the trend. The results vary little on the different evaluation metrics from 1 to 8 GPUs, and the performance suffers no loss from this change. These results are evidence of our model's robustness.
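For reference, the four evaluation metrics compared above can be computed as in the following minimal sketch. It assumes exact-match scoring of extracted aspect terms and single-label category predictions; the sample terms and all category names except “IO” are hypothetical, and the exact scoring protocol of our experiments may differ in detail.

```python
def precision_recall_f1(gold_terms, predicted_terms):
    """Exact-match precision, recall and F1 over sets of extracted terms."""
    gold, predicted = set(gold_terms), set(predicted_terms)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(gold_labels, predicted_labels):
    """Fraction of samples whose predicted category equals the gold category."""
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


# Hypothetical (sample id, aspect term) pairs for the ATE task.
gold_terms = [(0, "battery"), (1, "screen"), (2, "price"), (3, "strap")]
predicted_terms = [(0, "battery"), (1, "screen"), (2, "color")]
precision, recall, f1 = precision_recall_f1(gold_terms, predicted_terms)

# Hypothetical category labels for the ACC task.
gold_labels = ["IO", "battery", "IO", "appearance"]
predicted_labels = ["IO", "IO", "IO", "appearance"]
acc = accuracy(gold_labels, predicted_labels)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={acc:.3f}")
```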

The experimental results show that our proposed multi-task model can make full use of the correlation between aspect category classification and aspect term extraction so that the two tasks improve each other. In addition, the results confirm our two intuitive hypotheses about question–answering style reviews: first, that the question text is more important than the answer text for the aspect category classification task; and second, that the attention mechanism can capture the most relevant aspect information mentioned by both the question text and the answer text to improve the performance of aspect category classification. Finally, the GPU cluster substantially accelerates model training with few negative impacts in our experiments.

4.4 Error analysis

In order to understand the limitations of the proposed model, we carefully analyze the misclassified samples in the test set and identify the following factors that lead to errors. The first factor is the imbalanced data distribution, which makes the model tend to predict the aspect categories that contain more question–answering text. For example, in the digital domain, 22.95% of misclassified samples are predicted to be “IO”. Similarly, in the beauty domain and the luggage domain, misclassified samples tend to be predicted as “efficacy” and “quality”. The second factor is that, in order to reduce word segmentation errors, we choose character-level embeddings rather than word-level embeddings for aspect term extraction and aspect category classification. However, modeling only individual characters may leave our model unable to capture the real semantic and syntactic information of a clause or the whole sentence, which degrades the performance of the subsequent model. The third factor is that the semantic information of some aspect terms is ambiguous across different contexts, which makes aspect category classification difficult. As for data parallelism, we should be aware that the optimal number of GPUs changes with the size of the data. The relationship between data size and the optimal number of GPUs remains unclear.
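The first error factor above can be checked by measuring how the misclassified samples are distributed over the predicted categories. The sketch below is a minimal illustration; the function name and the sample labels are hypothetical and do not come from our data set.

```python
from collections import Counter


def misclassified_share(gold_labels, predicted_labels):
    """For each predicted label, return its share among the misclassified samples."""
    wrong = [p for g, p in zip(gold_labels, predicted_labels) if g != p]
    if not wrong:
        return {}
    counts = Counter(wrong)
    return {label: count / len(wrong) for label, count in counts.items()}


# Hypothetical gold and predicted categories for six test samples.
gold_labels = ["battery", "IO", "screen", "battery", "IO", "screen"]
predicted_labels = ["IO", "IO", "IO", "battery", "screen", "battery"]

share = misclassified_share(gold_labels, predicted_labels)
for label, fraction in sorted(share.items(), key=lambda item: -item[1]):
    print(f"{label}: {fraction:.2%} of misclassified samples")
```

A category whose share is far above the others, as “IO” is in this toy example, suggests that the model is biased toward that category.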

5 Conclusion

In this paper, we propose a multi-task neural learning framework based on question–answering text for addressing aspect category classification and aspect term extraction simultaneously and explore its performance on GPU clusters. The initial inspiration comes from our analysis of characteristics of question–answering style reviews and the correlation between aspect category classification and aspect term extraction, i.e., extracted aspect information can assist aspect category prediction and aspect category information is advantageous to distinguish aspect term from other words unrelated to aspect information. In addition, the motivation of using GPU clusters comes from making full use of advanced training resources, including abundant training data, faster computing speed and more computing power. Experimental results prove that our proposed multi-task model outperforms other baseline models on GPU clusters.

6 Future work

Q&A-style reviews, as a novel form of online review, have great research value. We have achieved some research results in this paper and in previous work [31, 32], but many aspects remain to be studied and improved. Our future work will focus on the following:

  • In order to better model the clause and the whole question sentence, we would like to introduce a more complex and powerful pretrained language model to model local contextual information and improve the performance of aspect category classification, because GPU clusters can process and train larger and more complex neural network models in an acceptable training time.

  • In question–answering text, in addition to the association between aspect term and aspect category, aspect category and aspect sentiment polarity are also related. Thus, we would like to make use of the correlation between aspect category classification and aspect sentiment classification to build a joint learning model.

  • Considering that more than one relevant aspect term may be mentioned by both the question context and the answer context, we would try to conduct aspect category classification for multiple aspect terms simultaneously.