Introduction

With the rapid development of Internet technology and the ease of use, low cost, and speed of social media, the number of Internet users worldwide has reached a staggering 4.95 billion [1]. Social networking platforms have become the main channel through which people obtain information, with millions of users browsing and disseminating content every second. However, the very ease and speed of social media also make it an ideal environment for the spread of fake news. To increase its credibility, fake news is now typically spread as text combined with images [2]. Compared with real news, fake news is characterized by unusual posting patterns, fast spreading speed, deep spreading depth, and wide spreading range [3, 4]. Fake news misleads readers with fabricated content, which not only distorts the facts but also induces negative emotions in readers. The harm is most pronounced during emergencies: emergent events trigger intense public concern and discussion, and in this highly sensitive environment fake news spreads faster and with greater impact, eroding netizens’ trust in social media, causing panic, misleading the public, and even threatening social stability within a short period of time. For example, during the COVID-19 epidemic, the spread of fake news led to an increase in infections and to public confusion. Truth and accuracy are core values that people relentlessly pursue, and the spread of fake news fundamentally challenges them. In the face of major emergencies, effective identification and exposure of fake news is essential for managing the public opinion environment, so fake news detection plays an irreplaceable role in building a stable and healthy network environment.

Faced with the threat of fake news, many efforts have been made to build news detection models. Initially, news detection was manual, with substantial human effort spent labeling news as true or false. Although this method is highly accurate, it can no longer keep pace with today’s volume of information. In recent years, increasingly capable detection models have been introduced. Early work relied on machine learning over large numbers of hand-crafted features, such as user behavior, user information, and news dissemination time, which were fed into trained SVM or decision tree classifiers for news detection [5,6,7,8]. This approach loses its advantage of feature diversity when faced with scarcely labeled news and sudden events. Nowadays, news detection increasingly relies on deep learning methods, leveraging neural networks to capture the semantic features of news [9,10,11]. Models extract text or image features to discern the authenticity of news [12,13,14,15], and multimodal models strengthen the correlation between modalities to fuse features and fully exploit cross-modal semantic information, thereby achieving accurate fake news detection [16]. One line of work extracts image pattern information, image semantic information, and textual information from the gathered news and feeds them into separate coarse classifiers, using these diverse perspectives to judge the authenticity of the whole article; this can be viewed as a task-based attention mechanism that assigns weights to modal information [15]. Another uses cross-modal ambiguity as a measure of the disparity between modalities and, when this ambiguity is weak, relies on single-modal information for detection. These methods, however, overlook the domain attributes of news, limiting the model to the historical events it has learned. In today’s era of millions of interactions per second on social media, as novel news continually emerges, such models may struggle to achieve satisfactory results on emerging subtopics or on significant events with scarce labels. With deepening research on neural networks, an increasing number of cross-disciplinary techniques are being applied to fake news detection, offering potential for improved performance [17]. Knowledge distillation can transfer knowledge from one single-modal feature extraction model to another to strengthen the correlation between modalities [18]. The study of the dynamic characteristics of pulse RDNNs plays a crucial role in adversarial attacks, making it possible to frequently vary fake news features to deceive the discriminator [19]. Capsule neural networks can capture visual features from news images [20]. Introducing dual ETM into the RDNNs model allows different modal information to be encoded simultaneously in multimodal news detection, offering the potential for better multimodal fusion [21]. Combining the event-triggering system of RDNNs with graph neural networks provides a direction for event correlation analysis, enabling a deeper understanding of the relationship between a specific event and others.

In summary, despite significant progress in previous research on news detection, it still faces the “domain challenge.” The domain attributes of news significantly impact the model’s transferability when detecting news, especially in news domains with a lack of labels, where the model may struggle to effectively extract distinguishing features of fake news in that specific domain [22]. For instance, a fake news detection model in the health domain relies on features related to “disease,” “medical,” and “medication” extracted from the training set. Similarly, models in the technology domain depend on features related to “artificial intelligence,” “cybersecurity,” and others. These features assist the models in accurately identifying false information in their respective domains. Real-world news detection models need to address real-time critical events without labeled data. Although historical events have annotated data, there exists a domain gap between past and new events, ultimately leading to poorer performance of detection models (Fig. 1).

Fig. 1

High-frequency lexical features of news texts in different fields: a high-frequency lexical features of texts in news about social life; b high-frequency lexical features of texts in news about politics

To address the aforementioned issues, we introduce Clip-GCN, a novel approach for emergent news detection. The model uses textual information as a supervisory signal to extract joint semantic features from the text and images of news. We employ a domain-adversarial neural network strategy to narrow the distribution gap between different domains. Simultaneously, we build a domain news graph so that the correlations among news within a domain compensate for the scarcity of training samples.

The main contributions of this paper are:

  1.

    In response to the domain gap between historical events and breaking events, as well as the lack of labels for breaking events, we propose using Domain-Adversarial Neural Networks, in which the feature extractor and the domain discriminator are trained against each other. This drives the feature extractor to capture domain-invariant features, achieving domain adaptation for news detection.

  2.

    A novel graph convolutional network (GCN) method is proposed, which establishes positive correlations among news articles within the same domain through similarity measures. The entire news domain is constructed into a graph network, with each news article serving as a node on the graph. This approach effectively leverages the inter-domain correlations in news, enhancing the discriminative features for detecting the authenticity of breaking news.

  3.

    We evaluate Clip-GCN on Weibo and Twitter datasets to assess the model’s performance in detecting breaking news. Experimental results demonstrate that the proposed approach outperforms existing methods, achieving superior detection performance for breaking fake news.

Related work

Fake news detection

Fake news is defined as news that is verifiably false [23]. The social media fake news detection task aims to assess the authenticity of emergent news events, which have no labeled data; in deep learning models the task is reduced to a binary classification problem (true or false) [23]. The main difficulty of the task is to correctly classify news based on domain-invariant features. In this paper, we group news detection approaches into three broad categories according to the features they use: machine learning-based detection, unimodal feature-based detection, and multimodal feature-based detection. The following paragraphs give an overview of each category.

Machine learning-based detection tasks: Early detection tasks required many labels that were manually set based on users’ social information. Based on these labels, different features are learned and fed into algorithms such as random forests and decision tree classifiers for true/false classification [5,6,7,8]. Rumors have been detected using user and linguistic features within a time window of rumor propagation [6]. Random forests and logistic regression models have been used to filter features into an integrated feature set, with numerous features then used to judge rumors [7]. These methods rely on a large number of labels and target historical events, resulting in poor generalizability; they cannot effectively solve the detection problem when faced with real-time news.

Amidst the ongoing progress and extensive exploration in fake news detection, scholars have increasingly adopted neural networks for the task. Within deep learning models, these approaches fall into two groups: one extracts single-modal information for news detection, while the other leverages multimodal information. Most early deep learning detection models extracted textual information for fake news detection [9,10,11]. For example, textual information is fed into an RNN to obtain textual features, and an LSTM connects them into a representation of the whole news item, which is finally used to discriminate the news [9]. Alternatively, image information is fed into a convolutional neural network to obtain frequency-domain and pixel-domain features, which are combined into the final image features used to discriminate the news [24]. With the development of communication and the popularization of 5G technology, the news content spread on social networks is no longer limited to a single modality; the information between modalities is complementary, and detecting fake news from a single modality ignores the other modalities, reducing applicability.

News detection tasks based on multimodal features: Nowadays, news content in social networks has become more diverse and compelling due to the integration of rich images and attention-grabbing text, and only detection models based on multimodal features can effectively sift through such news [15, 25,26,27,28,29,30,31,32,33]. A cross-modal reconstruction learning model uses a VAE to reconstruct text and images separately, exploiting the connection between modalities to obtain similar feature distributions for both, while a shared memory space provides the final multimodal features for news discrimination [28]. Considering that the semantic information of news is crucial for detection, text and images have been encoded with the large pre-trained models Bert and Vgg-19, respectively [25]. Considering the similarity relationship between image and text, the cosine function has been used to measure the similarity between the two [26]. An attention mechanism has also been used to combine text features at each level with the image information [27].

Although the above methods achieve good results in news detection, they do not realize true cross-modal interaction when addressing multimodal problems. They remain confined to feature extraction over single-modal information: at the very beginning, the model extracts semantic information from the image and the text separately, but there is no correlation between the two feature distributions. The cross-modal semantic feature extraction module of the model proposed in this paper uses an optimized CLIP model for joint feature extraction of image and text information. The model uses the textual information as a supervisory signal to obtain image features that are no longer limited to a specific classification but are closest to the essential information of the image, realizing semantic interaction between image and text features. The overall semantic features of the news can thus be obtained.

Domain adaptation

Because emergent events have few labels and high real-time requirements, domain adaptation techniques become a key approach. In social media news recognition, it is often difficult to obtain large amounts of accurately labeled data due to the timeliness and diversity of news reports. Traditional supervised learning methods rely on sufficient labeled data for training, but for emergent news such methods face challenges due to label scarcity. Domain adaptation techniques provide a solution [34,35,36]: they transfer knowledge and model parameters from the source domain to the target domain by using the labeled true and false news of historical events to adapt to the characteristics of emergent news, overcoming label scarcity and improving the performance and generalization ability of models on the social media fake news detection task. One approach finds the similarity between events and uses adversarial neural networks to extract their common features, which are ultimately used for news detection [37]. It improves the transferability of the model through adversarial training, but it does not effectively exploit the empirical knowledge of the source domain and therefore does not provide better detection [38]. Another line of work solves domain adaptation with graph neural networks, representing each domain as a graph and achieving cross-domain text classification by jointly training on knowledge from different domains [39]. Domain adaptation has also been tackled by reducing the feature distribution gap between domains: event-specific features are filtered out by an adversarial network, a VAE reconstructs the image-text feature distribution, and the final features are used for news detection [40]. This method takes the correlation between events into account, but it does not leverage source-domain knowledge extensively for the final news detection, resulting in suboptimal performance. A graph neural network has further been combined with an adversarial network, giving the model strong transfer capability while effectively utilizing source-domain knowledge through the graph neural network to discern the authenticity of news [41]. However, since the news features in the node variables consider only text information, image information also affects the detection results when the news is multimodal.

The aforementioned methods make significant contributions to detecting fake news on social media, but they still struggle to achieve satisfactory results when detecting emerging news with limited labels. The model proposed in this paper obtains the full image-text information of a news item through the cross-modal semantic feature extraction module. The obtained semantic features are used as node inputs to a graph neural network that connects the news within a domain into a graph, so that domain-level information is used to judge the authenticity of each news item. At the same time, to ensure that the semantic features filter out domain-specific information, a domain detection module is constructed, and the final input to the news detection module is obtained through adversarial training.

Clip-GCN-based multimodal emergent fake news detection model

In this section, we construct a scenario for detecting breaking news, ensuring that the domain of the events to be detected is distinct from the domain of historical events. For the detection of emergent news, we propose a multimodal domain-adaptive model called Clip-GCN. Figure 2 illustrates the proposed model, which consists primarily of a cross-modal feature extraction module based on Clip, a domain detection module based on adversarial neural networks, and a news detection module based on graph neural networks.

Fig. 2

Structure of the Clip-GCN model

The establishment of the emergent news detection scenario and the specific details of each branch module mentioned above will be elaborated in the following subsections.

Detection scene construction

This paper is dedicated to addressing a real-world challenge faced by social media, i.e., detecting emergent news in social media events. The goal is to effectively detect fake news in social media events in the absence of labeled data. Assume that each news item in the training data is multimodal (containing images and text). In this setup, \({\mathcal{D}}_{s} = \left\{ {p_{i} ,y_{i} } \right\}\) represents historical news (source events), with \(p_{i} = \left( {t_{i} ,v_{i} } \right)\), where \(t_{i}\) is the textual information and \(v_{i}\) is the image information of the news. \(y_{i} = \left( {y_{i} ,d_{i} } \right)\) is the labeling of the news, where \(d_{i}\) is the domain label and \(y_{i}\) is the true/false label. The emergent event (target event) is \({\mathcal{D}}_{T} = \left( {p_{i} } \right)\). Based on the difficulty of scarcely labeled, multimodal news in the social media fake news detection task described in the previous section, this paper simulates this difficulty in the data partitioning. In the Chinese dataset, the news items are divided into 8 categories according to the domain labels; seven of these categories are used as the training set, representing historical events \({\mathcal{D}}_{s}\), while the remaining category serves as the test set, denoting target events \({\mathcal{D}}_{T}\). In the English dataset, the news is divided into 7 categories based on events; six of these categories are used as the training set (\({\mathcal{D}}_{s}\)), while the remaining category serves as the test set (\({\mathcal{D}}_{T}\)). When dividing the events, it is ensured that news with the attributes of the target event \({\mathcal{D}}_{T}\) (the test set) has not appeared in the historical events \({\mathcal{D}}_{s}\) (the training set).
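For illustration, the leave-one-domain-out split described above can be expressed as a simple filter over the collected news items; this is a minimal sketch, and the field names used here are our own, not those of the released data.

```python
def leave_one_domain_out(news_items, target_domain):
    """Split news into historical events D_s and target events D_T.

    news_items: iterable of dicts with keys 'text', 'image', 'label', 'domain'.
    target_domain: the domain held out as the unlabeled target event.
    """
    d_s = [n for n in news_items if n["domain"] != target_domain]  # training set
    d_t = [n for n in news_items if n["domain"] == target_domain]  # test set (labels unused)
    return d_s, d_t
```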

The research in this paper focuses on training a cross-domain model (denoted \(p(y|p,\theta )\)) with parameters \(\theta\) using historical events \({\mathcal{D}}_{s}\). This model is able to extract shared cross-domain features to accurately classify the news of a target event \({\mathcal{D}}_{T}\) as true or false without any prior knowledge of its class labels.

Cross-modal feature extraction module

The text encoder \(E_{x}^{T}\) is a Transformer [42] that converts a text sequence into a feature representation, with [SOS] and [EOS] markers surrounding the beginning and end of the sequence. At the highest layer of the Transformer, the activation at the [EOS] marker is taken as the feature representation of the text. To project this representation into the multimodal embedding space, it is layer-normalized and then linearly projected. In addition, the encoder uses a masked self-attention mechanism to preserve the ability to initialize from a pre-trained language model.

The visual encoder \(E_{x}^{I}\) is the ViT (Vision Transformer) model [43], which encodes images with an attention-based method. Unlike traditional convolutional neural networks, ViT uses a Transformer architecture to extract the feature representation of the image. The encoder divides the input image into multiple patches and transforms each patch into a one-dimensional vector. These vectors are then processed by the Transformer, which learns the feature representation of the image using self-attention.

The original image \(X_{img}\) and the original text \(X_{text}\) are input into the visual encoder \(E_{x}^{I}\) and the text encoder \(E_{x}^{T}\), respectively, to obtain the embedded image features \(f_{img}\) and text features \(f_{text}\). These contain all the information of the image and text, but the correlation between them is weak and there is a semantic gap. Therefore, we use the Clip pre-training model [44] to embed the two features \(f_{text}\) and \(f_{img}\) into the same embedding space, as shown in the left half of Fig. 2, and measure the similarity of image and text by cosine similarity. This brings semantically similar image-text pairs closer in the space, improves the correlation between image and text, and bridges the semantic gap between them. In the embedding space, \(f_{text}\) is used as the supervisory signal: \(f_{text}\) and \(f_{img}\) are compared and matched, and the semantically relevant text feature \(f_{Clip - T}\) and image feature \(f_{Clip - I}\) are extracted. The textual feature representation \(f_{Clip - T}\) and the visual feature representation \(f_{Clip - I}\) are then fed into the CMA (Cross-Modal Attention) module, which computes the correlation weights between modalities using the attention mechanism of Eq. (1).

$$ A_{ti} = \frac{{\exp \left( {f\left( {T_{t} ,I_{i} } \right)} \right)}}{{{\sum }_{j = 1}^{N} \exp \left( {f\left( {T_{t} ,I_{j} } \right)} \right)}} $$
(1)

where \(T_{t}\) is the \(t\)-th element of the text feature, \(I_{i}\) is the \(i\)-th element of the image feature, and \(N\) is the dimension of the features. The function \(f\left( {T_{t} ,I_{i} } \right)\) is a dot product used to compute the attention score. After obtaining the weights, the image and text features are merged into the semantic feature \(f_{m}\) of the news using Eq. (2).

$$ f_{m} = \mathop \sum \limits_{i = 1}^{M} A_{ti} T_{i} I_{i} $$
(2)
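To make the fusion concrete, the following PyTorch sketch shows one possible reading of Eqs. (1)-(2), in which the attention score is the dot product between text and image feature elements and the fused semantic feature is the attention-weighted combination of both. The function and variable names are ours, and the exact granularity of the "elements" may differ from the released implementation; this is an illustrative sketch only.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(f_clip_t: torch.Tensor, f_clip_i: torch.Tensor) -> torch.Tensor:
    """Illustrative CMA fusion following Eqs. (1)-(2).

    f_clip_t, f_clip_i: 1-D CLIP text/image features of the same size (e.g. 512).
    Returns the fused news semantic feature f_m.
    """
    # Eq. (1): dot-product score between the t-th text element and the i-th image
    # element, normalised with a softmax over the image elements.
    scores = torch.outer(f_clip_t, f_clip_i)   # shape (N, N)
    attn = F.softmax(scores, dim=-1)           # A_ti

    # Eq. (2): attention-weighted combination of text and image elements.
    f_m = attn @ (f_clip_t * f_clip_i)         # shape (N,)
    return f_m
```

Note that the paper later describes a 1024-dimensional CMA output, so the actual module presumably also concatenates or stacks the two modalities; the sketch above only illustrates the attention weighting itself.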

Previous research typically explored coarse and fine features within a single modality. With the introduction of the Clip model, feature extraction across modalities is coupled, allowing more detailed semantic cues to be captured and news semantics to be analyzed from a finer-grained perspective. This effectively addresses the problem of cross-modal semantic interaction in news recognition.

Domain detection module

To enable the model to effectively learn inter-domain invariant features, a feature extractor is added after the CMA module. Its input is the semantic feature \(f_{m}\) of the news, and its purpose is to extract the cross-domain invariant feature \(f_{D}\) from \(f_{m}\), as in Eq. (3).

$$ f_{D} = G_{f} \left( {f_{m} ;\theta_{f} } \right) $$
(3)

where \(f_{m}\) are the semantic features of the news and \(\theta_{f}\) are the parameters to be learned by the feature extractor.

The role of the domain detection module is to enable the feature extractor to learn optimal parameters \(\theta_{f}\) so that it extracts features that are invariant between domains. This module mainly uses the idea of adversarial networks to continually challenge the feature extractor. To this end, a domain discriminator \(G_{D}\), consisting of two fully connected layers, is built. The output of \(G_{D}\) is \(\hat{y}^{d}\), a \(K\)-dimensional vector representing the probability that the news belongs to each of the \(K\) domains, as in Eq. (4).

$$ \hat{y}^{d} = G_{D} \left( {f_{D} ;\theta_{d} } \right) $$
(4)

where \(f_{D}\) is the feature extracted by the feature extractor, \(\theta_{d}\) is the learning parameter of the domain discriminator, and \(\hat{y}^{d}\) is the output of the domain discriminator.

The domain discrimination loss is calculated using the cross-entropy formula (5).

$$ \begin{aligned}&{\mathcal{L}}_{c} \left( {\theta_{f} ,\theta_{d} } \right) = - {\mathbb{E}}_{{\left( {f_{m} ,d_{i} } \right) \sim \left( {{\mathcal{F}}_{m} ,{\mathcal{D}}_{i} } \right)}}\\ &\quad\left[ {\mathop \sum \limits_{k = 1}^{K} 1_{{\left[ {k = d_{i} } \right]}} \log \left( {G_{D} \left( {G_{f} \left( {f_{m} ;\theta_{f} } \right);\theta_{d} } \right)} \right)} \right] \end{aligned}$$
(5)

where the learning parameter \(\theta_{d}\) of the domain discriminator \(G_{D}\) is estimated by minimizing the loss function \({\mathcal{L}}_{c}\), see Eq. (6).

$$ \hat{\theta }_{d} = \arg \mathop {\min }\limits_{{\theta_{d} }} {\mathcal{L}}_{c} \left( {\theta_{f} ,\theta_{d} } \right) $$
(6)

Meanwhile, \({\mathcal{L}}_{c} \left( {\theta_{f} ,\theta_{d} } \right)\) also serves as a criterion for estimating the difference in distribution between domains. The larger the loss \({\mathcal{L}}_{c}\), the more similar the distributions of the features extracted by the feature extractor across domains, meaning the extracted features capture commonalities between domains. Therefore, to filter out the domain attributes of the news, the feature extractor should be driven to maximize \({\mathcal{L}}_{c}\). To achieve this, a gradient reversal layer [45] is incorporated into the domain discriminator module. This layer leaves the inputs unchanged during forward propagation but reverses the gradient during backpropagation. In essence, a min–max game is established between the feature extractor and the domain discriminator, as shown in Eq. (7).

$$ \left( {\hat{\theta }_{f} ,\hat{\theta }_{d} } \right) = \arg \mathop {\min }\limits_{{\theta_{d} }} \;\mathop {\max }\limits_{{\theta_{f} }} \;{\mathcal{L}}_{c} \left( {\theta_{f} ,\theta_{d} } \right) $$
(7)

In the above formulation, the goal of the domain discriminator is to correctly classify the features extracted by the feature extractor into the domain, while the goal of the feature extractor is to extract features that prevent the domain discriminator from correctly classifying the news domain.
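The gradient reversal layer described above is commonly implemented as a custom autograd function that is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass. A minimal PyTorch sketch, with names of our own choosing, is:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lamb backwards."""

    @staticmethod
    def forward(ctx, x, lamb=1.0):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient flows back into the feature extractor, turning the
        # discriminator's minimisation of L_c into the extractor's maximisation (Eq. 7).
        return grad_output.neg() * ctx.lamb, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Usage sketch: domain_logits = domain_discriminator(grad_reverse(f_d))
```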

News detection module

In this paper, we hypothesize that news in different domains has domain-specific attributes, including differences in content, linguistic style, lexical choice, and visual presentation. By constructing a relational network among the news semantics of a domain, domain-specific patterns, themes, and semantic associations can be captured. Combining this domain relational network with the classification model introduces domain-related features and contextual information, improving the efficiency and discriminative ability of the fake news classifier. Taking the political domain as an example, we focus on specific vocabulary, event correlations, and semantic expressions in political news to more accurately identify fake news within that domain. Integrating domain relational networks with classification models has two advantages: first, the domain relational network captures the connections and features between news semantics within the same domain, providing richer feature representations; second, by focusing on domain-specific vocabulary, event correlations, and semantic representations, real and fake news can be better distinguished, especially when domain-specific patterns are present. Therefore, the approach in this paper is to construct a graph for each domain, as in Eq. (8):

$$ G^{(m)} = (V,E) $$
(8)

where \(m \in [1,M]\) indexes the domains. The variable \(V\) is the set of vertices of the graph; in this paper each news item is a vertex, so the graph models the semantic associations between news items. The vertex features are the final features \(f_{D}\) produced by the feature extractor. The variable \(E\) is the set of edges between vertices. To build the domain relational network, the connection between news semantics is established by measuring the cosine similarity between two vertices \(u\) and \(v\), as in Eq. (9):

$$ e(u,v) = \left\{ {\begin{array}{ll} {\cos (vec(u),vec(v))} &\quad {{\text{if }}\cos (vec(u),vec(v)) \ge \lambda } \\ 0 &\quad {\text{otherwise}} \\ \end{array} } \right. $$
(9)

where \(e(u,v)\) denotes the edge between nodes \(u\) and \(v\), \(\cos (vec(u),vec(v))\) is the cosine similarity between the semantics of news \(u\) and \(v\), and \(\lambda\) is a threshold set by us. By employing the graph data modeling as described above, a deeper understanding of the interrelationships between news within the same domain can be obtained.
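As a concrete illustration of Eq. (9), the sketch below builds the thresholded cosine-similarity adjacency matrix for one domain from the extracted node features; the function name and the default threshold value are ours.

```python
import torch
import torch.nn.functional as F

def build_domain_graph(node_feats: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Weighted adjacency matrix of Eq. (9) for the news of one domain.

    node_feats: (N, d) matrix of cross-domain invariant features f_D, one row per news item.
    lam: the threshold lambda; similarities below it are set to 0 (no edge).
    """
    normed = F.normalize(node_feats, dim=-1)           # unit-length rows
    sim = normed @ normed.t()                          # pairwise cosine similarity
    adj = torch.where(sim >= lam, sim, torch.zeros_like(sim))
    adj.fill_diagonal_(0)                              # self-loops are added later as A + I
    return adj
```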

After constructing the news of each domain into a graph, a GCN [46, 47] is used to process the graph structure. A GCN updates the feature representation of a node from its neighboring nodes; in fake news detection, this message-passing mechanism lets each news node absorb the representations of its neighbors, taking into account the similarities and differences of news within the domain and better distinguishing true from false news. The GCN is formulated as Eq. (10):

$$ {\mathbf{X}}^{(l + 1)} = G_{c} \left( {{\mathbf{X}}^{(l)} ;\theta_{{G_{c} }} } \right) $$
(10)

where \(\theta_{{G_{c} }}\) is the parameter of the GCN. In the GCN, the features of neighboring nodes are aggregated by weighted summation, as in Eq. (11):

$$ {\mathbf{X}}^{(l + 1)} = \sigma \left( {{\hat{\mathbf{D}}}^{{ - \frac{1}{2}}} {{\hat{\mathbf{A}}}}{\hat{\mathbf{D}}}^{{ - \frac{1}{2}}} {\mathbf{X}}^{(l)} {\mathbf{W}}^{(l)} } \right) $$
(11)

where \({\mathbf{X}}^{(l)} \in {\mathbb{R}}^{{N \times F^{(l)} }}\) is the node feature matrix of layer \(l\), \(N\) is the number of nodes, \(F^{(l)}\) is the feature dimension of layer \(l\), \({\mathbf{A}} \in {\mathbb{R}}^{N \times N}\) is the adjacency matrix representing the connectivity between nodes, \({\hat{\mathbf{A}}} = {\mathbf{A}} + {\mathbf{I}}\) is the adjacency matrix with added self-connections, \({\mathbf{I}}\) is the identity matrix, \({\hat{\mathbf{D}}}\) is the degree matrix \({\mathbf{D}}\) with 1 added to its diagonal, \({\mathbf{W}}^{(l)} \in {\mathbb{R}}^{{F^{(l)} \times F^{(l + 1)} }}\) is the weight matrix from layer \(l\) to layer \(l + 1\), and \(\sigma ( \cdot )\) is the ReLU activation function. Through Eq. (11), each node continuously updates its feature representation based on its correlation with neighboring nodes.
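A single propagation step of Eq. (11) can be written directly from the matrices defined above; the dense-matrix sketch below is our own illustration and ignores sparse-matrix optimizations.

```python
import torch

def gcn_layer(x: torch.Tensor, adj: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """One GCN layer, Eq. (11): ReLU(D_hat^-1/2 (A + I) D_hat^-1/2 X W)."""
    a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-connections
    d_hat_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))   # D_hat^{-1/2}
    norm_adj = d_hat_inv_sqrt @ a_hat @ d_hat_inv_sqrt        # symmetric normalization
    return torch.relu(norm_adj @ x @ weight)                  # propagate and activate
```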

The correlation-aware node feature representations obtained from the GCN are fed into the true/false news discriminator, which consists of a fully connected layer, as in Eq. (12).

$$ \hat{y}_{p} = P_{x}^{s} \left( {x^{(l + 1)} ;\theta_{{P_{x}^{s} }} } \right) $$
(12)

where \(\theta_{{P_{x}^{s} }}\) are the parameters of the label predictor and \(\hat{y}_{p}\) is its output; the prediction process is described in Eq. (13).

$$ \hat{y}_{p} = \sigma \left( {{\mathbf{W}}_{g} \cdot x^{(l + 1)} + {\mathbf{b}}_{g} } \right) $$
(13)

where \(x^{(l + 1)}\) is the updated node feature representation, \({\mathbf{W}}_{g}\) is the weight matrix of the true/false news discriminator, and \({\mathbf{b}}_{g}\) is the vector of bias terms. The loss of true/false news prediction is shown in Eq. (14).

$$\begin{aligned}& {\mathcal{L}}_{p} \left( {\theta _{{E_{x}^{s} }} ,\theta _{{G_{c} }} ,\theta _{f} } \right) = - {\mathbb{E}}_{{\left( {x^{{(l + 1)}} ,y_{i} } \right)\sim \left( {{\mathcal{X}}^{{(l + 1)}} ,{\mathcal{Y}}_{i} } \right)}}\\ & \quad \left[ {y_{i} \log \left( {\hat{y}_{p} } \right) + \left( {1 - y_{i} } \right)\log \left( {1 - \hat{y}_{p} } \right)} \right]\end{aligned} $$
(14)

where \(y_{i}\) is the true/false label of the news and \({\mathcal{L}}_{p}\) is the prediction loss of the true/false label of the news. In this paper, the optimal parameter \(\theta_{{E_{x}^{s} }}\) of the news true/false predictor and the optimal parameter \(\theta_{{G_{c} }}\) of the GCN network are obtained by minimizing the prediction loss of the true/false label.

The main idea of the training method is to iterate between domain-invariant feature extraction and news detection; Algorithm 1 illustrates the training procedure. In the first stage, the semantic features produced by the model are fed to the feature extractor and the domain discriminator computes the loss \({\mathcal{L}}_{c}\). The gradient reversal layer keeps the forward pass unchanged but reverses the gradient during backpropagation, so that maximizing and minimizing \({\mathcal{L}}_{c}\) yields the optimal parameters \(\theta_{f}\) and \(\theta_{d}\). In the second stage, the fake news detector predicts the news labels to obtain the loss \({\mathcal{L}}_{p}\); minimizing \({\mathcal{L}}_{p}\) yields the optimal parameters \(\theta_{{E_{x}^{s} }}\) and \(\theta_{{G_{c} }}\) of the news detector and the GCN. A minimal sketch of this two-stage loop is given after Algorithm 1.

Algorithm 1:

Model training
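The following is a minimal sketch of the two-stage alternating procedure, assuming PyTorch-style training and using hypothetical module, optimizer, and loader names (clip_encoder, cma, feature_extractor, domain_discriminator, gcn, news_classifier, opt_adversarial, opt_detector, train_loader) together with the helper functions sketched in the previous subsections; it is an illustration of the idea, not the released implementation.

```python
import torch
import torch.nn.functional as F

for epoch in range(num_epochs):
    for texts, images, y_domain, y_label in train_loader:
        # Shared forward pass: CLIP features fused by the CMA module.
        f_m = cma(*clip_encoder(texts, images))
        f_d = feature_extractor(f_m)

        # Stage 1: domain-adversarial step. The gradient reversal layer turns the
        # discriminator's minimisation of L_c into the extractor's maximisation (Eq. 7).
        loss_c = F.cross_entropy(domain_discriminator(grad_reverse(f_d)), y_domain)
        opt_adversarial.zero_grad()
        loss_c.backward()
        opt_adversarial.step()

        # Stage 2: build the in-batch domain graph, propagate with the GCN, and
        # minimise the true/false prediction loss L_p (Eq. 14).
        f_d = f_d.detach()
        adj = build_domain_graph(f_d, lam=0.5)
        logits = news_classifier(gcn(f_d, adj))
        loss_p = F.binary_cross_entropy_with_logits(logits.squeeze(-1), y_label.float())
        opt_detector.zero_grad()
        loss_p.backward()
        opt_detector.step()
```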

Experiments

In this section, the datasets and experimental setup are first introduced, and then the experimental results of the Clip-GCN model on both the Chinese and English datasets are given. A comparison experiment contrasts Clip-GCN with the listed baseline methods. To test whether each module of the model contributes, ablation experiments analyze the performance of each module. In addition, to investigate whether the threshold on the edges between vertices in each graph affects the results, a threshold experiment is set up. The comparison experiments address Question 1: can Clip-GCN show stronger performance than other models in cross-domain fake news detection? The ablation experiments address Question 2: are all three modules in the model effective for extracting cross-domain invariant features?

Experimental setup

This section provides a detailed description of the dataset used for the experiments and the setup of the Clip-GCN model on the dataset.

The following are the experimental details of the Clip-GCN model. First, the raw text is transformed into a 512-dimensional vector by the text encoder. The original image is resized to \(3 \times 224 \times 224\) before entering the visual encoder, and together with the embedded text features it enters the cross-modal feature extraction module; after feature extraction with the Clip pre-training model, both representations are 512-dimensional. The CMA module combines the image and text features into a 1024-dimensional news semantic feature. The output of the feature extractor is set to a 512-dimensional vector to better extract inter-domain invariant features, with a ReLU layer and a Dropout layer with a drop rate of 0.3. The domain discriminator consists of two fully connected layers, and its output is a K-dimensional (number of domains) vector. In the graph construction process, multiple graph datasets with different semantic relationships between domains are obtained by setting different thresholds. To obtain the optimal model parameters, the Adam optimizer is used, and the gradient reversal layer implements the min–max game for the feature extractor.
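Under the dimensions listed above, the feature extractor and domain discriminator might be configured as follows; this is only a sketch, and the hidden width of the discriminator is not specified in the text and is chosen arbitrarily here.

```python
import torch.nn as nn

K = 7  # number of source domains in one Chinese-dataset experiment (7 train + 1 held out)

# 1024-d fused CMA feature -> 512-d cross-domain invariant feature f_D.
feature_extractor = nn.Sequential(
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),
)

# Two fully connected layers mapping f_D to K domain logits.
domain_discriminator = nn.Sequential(
    nn.Linear(512, 256),   # hidden width assumed, not given in the paper
    nn.ReLU(),
    nn.Linear(256, K),
)
```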

To create the social media fake news detection scenario described in the previous section, the model is made to face the scarcely labeled, cross-domain challenge directly. The dataset is divided into historical events (training set) and target events (test set). In the Chinese dataset, news items are divided into 8 categories based on domain labels; seven of these categories serve as the training set of historical events, while the remaining category is the test set of target events. In the English dataset, news is categorized into 7 groups based on events; six of these groups are designated as the training set of historical events, while the remaining group serves as the test set of target events. When dividing the events, it is ensured that news with the attributes of the target events (test set) has not appeared in the historical events (training set). The model learning rate is set to 0.001, 100 training epochs are performed, and the model parameters that are optimal on the training set are used for testing.

Datasets

To fully evaluate the effectiveness of the model, this paper conducts experiments in two scenarios in Chinese and English. The English dataset and the Chinese dataset are real news content (including fake news) from Twitter and Weibo platforms, respectively.

The Chinese dataset was obtained from the “Internet Fake News Detection During the Epidemic” competition, which focused on detecting fake news on social media during the 2020 COVID-19 pandemic. The dataset contains news items from various domains and was verified by Weibo’s official disinformation system. Approximately half of the news items have both text and image data, while the other half contain only text; for the experiments in this paper, only news items with images were selected. The entire dataset is divided into eight distinct domains, each with its own topics and content, as summarized in Table 2.

Table 1 Model parameter settings

The English dataset [48] originates from Twitter and is used for validating the use of multimedia to detect fake content on social media. We selected news articles written in English that are accompanied by images, choosing seven event sets to form the English dataset, as presented in Table 3. To give a preliminary sense of the difficulty of transferring between events, each event is briefly described below:

  1.

    Hurricane Sandy: Hurricane Sandy was a major storm that occurred in 2012. This hurricane wreaked havoc on the east coast of the United States. It hit the Northeast, especially New York and New Jersey.

  2.

    Boston Marathon: The Boston bombing took place in 2013. Two explosions near the finish line of the Boston Marathon caused many deaths and injuries. This terrorist attack caused panic and confusion.

  3.

    Malaysia Airlines: The crash of Malaysia Airlines Flight 17 happened in 2014. The flight was traveling from Amsterdam to Kuala Lumpur when it was hit by a missile and crashed over eastern Ukraine.

  4.

    Nepal earthquake: The Nepal earthquake refers to a powerful earthquake that occurred in 2015. The epicenter of this earthquake was located near Kathmandu, the capital of Nepal, and had a magnitude of 7.8. The earthquake caused widespread destruction and casualties.

  5.

    Collage: This is a collage of “Samurai Ghosts”, “Porcupine Fish” and “Elephant Rocks”.

  6.

    Solar Eclipse: The March 20, 2015 equinox solar eclipse was called a “super eclipse.” The eclipse was widely visible in the northern hemisphere, including Europe, North America, and parts of Asia.

  7.

    Sochi Olympics: The Sochi Olympics refer to the 2014 Winter Olympics held in Sochi, Russia. It was the first time Russia hosted the Winter Olympics and the first Olympic Games to be held in Russia since the 1980 Summer Olympics in Moscow.

Table 2 Statistical data for the Chinese dataset

Experimental results

In this section, we first present the experimental results of Clip-GCN on the two datasets. To evaluate the model more comprehensively, the news of each domain is used as the target event once. For the Chinese dataset we therefore ran 8 sets of experiments, labeled A, B, C, D, E, F, G, and H according to the domain labels in Table 2. For the English dataset we ran 7 sets of experiments, labeled a, b, c, d, e, f, and g, corresponding to the event labels in Table 3. Throughout the experiments, news with the attributes of the target events does not appear in the historical events. The average results are used to verify the effectiveness of the model. Accuracy, AUC, precision, recall, and F1 score are used as the performance evaluation metrics; accuracy, precision, recall, and F1 are defined in formulas (15)-(18) [26].

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$
(15)
$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}} $$
(16)
$$ {\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} $$
(17)
$$ F1 = \frac{{2 \times {\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}} $$
(18)

where TP is true news with model-predicted label 1, TN is false news with model-predicted label 0, FP is false news with model-predicted label 1, and FN is true news with model-predicted label 0.
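These metrics can be computed directly with scikit-learn; a brief sketch (variable names are ours) is:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# y_true: ground-truth labels (1 = true news, 0 = false news), following the
# convention above; y_pred: predicted labels; y_score: predicted probabilities.
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # Eq. (15)
    "precision": precision_score(y_true, y_pred),  # Eq. (16)
    "recall": recall_score(y_true, y_pred),        # Eq. (17)
    "f1": f1_score(y_true, y_pred),                # Eq. (18)
    "auc": roc_auc_score(y_true, y_score),
}
```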

Table 3 Statistical data for the English dataset

Table 4 presents the experimental results of Clip-GCN on the Chinese dataset. News in different domains exhibits variations in content, keywords, and image focal points. The overall results show different detection outcomes for each domain, validating the impact of domain differences on the model’s fake news detection performance on the Chinese dataset. In the eight experimental scenarios, the AUC-ROC values are all above 0.925, indicating excellent performance in classifying true and false news. The model achieves a detection accuracy of over 85% for target events, with the lowest in the Military Affairs domain at 85.37% and the highest in the Medicine and Health domain at 89.24%; the average accuracy across the eight domains is 87.79%. This verifies the transferability of the model on the Chinese dataset: it achieves accurate fake news detection for unlabeled events, proving its effectiveness for news detection on Chinese social media.

Table 4 Experimental results of Clip-GCN on the Chinese dataset

Table 5 presents the experimental results of Clip-GCN on the English dataset. Across the seven news events, content, keywords, and image focus differ significantly, allowing us to estimate the transfer difficulty for each event. The results show that the model achieves a detection accuracy exceeding 80% for target events. The lowest accuracy is observed in the Solar Eclipse domain at 80.25%, while the highest is in the Boston Marathon domain at 88.99%; the average detection accuracy across the seven domains is 83.98%. This confirms the model’s capability to effectively detect fake news in unlabeled events within the English dataset, underscoring its efficacy for news detection on English social media.

Table 5 Experimental results of Clip-GCN on the English dataset

Comparing the results on the Chinese and English datasets, the overall performance on the Chinese dataset is better. There are two main reasons. First, the news in the Chinese dataset is classified by broad domain categories, so related knowledge of a domain can be fully utilized in the graph network; in the English dataset, the category labels are defined by events, so the knowledge available to the graph network consists only of different representations of the same event. Second, the semantic features of the news in the Chinese dataset are more complete: the textual information contains rich semantics, and the richer the semantic features, the more news nodes can be linked in the graph network, making the relationships within the graph richer. In the English dataset, the textual information is mostly short and not very descriptive, so the semantic features are sparser, which ultimately leads to fewer links in the graph.

The t-SNE plots presented on both the Chinese and English datasets [49] exhibit similar clustering effects. Here, we showcase the visualization results on the Chinese dataset. Each dot in Fig. 3 represents the news classification features of each news article in the target events after processing through the Clip-GCN model. The dots in the graph are color-coded as orange and blue based on the true/false labels corresponding to each news article, where orange represents fake news and blue represents true news. Upon observing Fig. 3, it becomes evident that the proposed model demonstrates excellent performance in news classification within target events, with a clear clustering effect based on the true/false labels. This confirms the model’s robust generalization ability, showcasing its effectiveness in extracting discriminative features for distinguishing between true and fake news even in novel domains. Simultaneously, the features extracted by the model exhibit accuracy and high precision in the task of true/false news detection, particularly when faced with challenges such as scarce training samples and cross-domain scenarios. This observation further underscores the model’s robustness and reliability across different languages and domains.

Fig. 3

Visualization of the features extracted from the Clip-GCN model using t-SNE in 8 sets of experiments on the Chinese dataset. (a), (b), (c), (d), (e), (f), (g), and (h) are the representations of the features in the eight sets of experiments A, B, C, D, E, F, G, and H, respectively

Comparative study

Baselines

VQA (Visual Question Answering) [50]: features are first extracted from the question, and feature extraction is then performed on the given image. The visual question answering model is designed for multi-class classification; since our experiment is a binary classification task, a 32-dimensional single-layer LSTM is used in VQA.

EANN [37]: the Event Adversarial Neural Network aims to improve multimodal news detection by using an event domain discriminator. Multiple convolutional layers of different granularities are used for text feature extraction, and the result is fused with image features.

att-RNN [51]: a multimodal fake news detection framework that fuses textual, visual, and social contextual features through an attention mechanism.

MVAE [28]: the Multimodal Variational Auto-Encoder is a multimodal fake news detection framework that mines deep information between images and text by reconstructing multimodal information with a VAE.

SpotFake [52]: this model uses Bert and Vgg-19 to extract text and image features and concatenates them to predict news.

DAGA-NN [41]: extracts inter-domain features using the idea of adversarial network games, while using a graph attention neural network for true/false news detection.

Comparison results

In this section, the average results of Clip-GCN are compared experimentally with the baseline models on the Chinese and English datasets. Detection accuracy, AUC, precision, recall, and F1 score are used as the performance evaluation metrics.

Table 6 shows the comparison of the average results of Clip-GCN and the baseline models over the 8 sets of experiments on the Chinese dataset. As can be seen from Table 6, the proposed Clip-GCN model outperforms the baseline methods in accuracy, recall, and F1 score, and its average accuracy on the Chinese dataset reaches 87.79%, an improvement of 2.91% over the highest accuracy among the baseline methods, answering the first question.

Table 6 Experimental effects of different methods on the Chinese dataset, model effects in this paper are shown in bold

Table 7 shows the comparison of the average results of Clip-GCN and the baseline models over the seven sets of experiments on the English dataset. The proposed Clip-GCN model outperforms the baseline methods in all five evaluation metrics: accuracy, AUC, precision, recall, and F1 score, and its average accuracy on the English dataset reaches 83.98%, an improvement of 2.41% over the highest accuracy among the baseline methods, answering the first question.

Table 7 Experimental effects of different methods on the English dataset, modeling effects in this paper are shown in bold

Through the analysis of the average experimental results on Chinese and English datasets, our model demonstrates robust cross-domain invariant feature extraction capabilities compared to models such as VQA, att-RNN, MVAE, and SpotFake. Consequently, it exhibits excellent model transferability, with a significantly higher average accuracy than these four models. In the context of cross-domain news detection, models like EANN, DAGA-NN, and Clip-GCN outperform others, showcasing potential in narrowing the gap between target events and historical event domains. In comparison to the EANN model, our proposed model achieves better results by ingeniously utilizing GCN networks for authenticating news knowledge within the same domain. This underscores the beneficial impact of news knowledge from the same domain on fake news detection. Compared to the DAGA-NN model, our model possesses greater depth in handling multi-modal feature extraction from news, comprehensively considering correlations between modalities. By fully leveraging the joint feature extraction of textual and visual information using the Clip model, we obtain more comprehensive multi-modal semantic features. Building upon this, by establishing a semantic association graph between news articles, we achieved an increased accuracy of 2.91% on the Chinese dataset and 2.41% on the English dataset.

Ablation study

In this section, ablation experiments on the Chinese and English datasets are used to evaluate whether each module of the model contributes. The model setup starts from the most basic configuration and modules are added gradually until the model is complete: Part 1 uses only the cross-modal semantic feature extraction module; Part 2 adds the GCN network but omits the domain detection module; Part 3 is the complete Clip-GCN model. Tables 8 and 9 present the ablation results of Clip-GCN on the Chinese and English datasets, respectively.

Table 8 Ablation experiments of the Clip-GCN model on the Chinese dataset
Table 9 Ablation experiments of the Clip-GCN model on the English dataset

The effectiveness of the cross-modal feature extraction module

Observing the experimental results in Part 1 of Tables 8 and 9, the model successfully achieves the authenticity detection of breaking news by extracting semantic features from news. This validates the effectiveness of the news semantic features extracted by the model in the detection of breaking news.

The effectiveness of utilizing domain knowledge for news detection

By comparing the experimental results of Part 1 and Part 2 on the Chinese and English datasets, we observe that Part 2 outperforms Part 1 in the detection performance metrics across various target events. This finding validates the effectiveness of the model in leveraging domain-specific news knowledge for the detection of breaking news.

The effectiveness of the domain detection module

In the overall model, the core lies in narrowing the feature distribution differences between target events and historical events. Therefore, by comparing the experimental results of the overall model Part3 with Part2, we observe an average accuracy improvement of 2.43% in the Chinese dataset and 2.04% in the English dataset. The experimental results validate that the introduction of the domain detection module enables the model to reduce the feature distribution differences between target events and historical events while learning news classification features, thereby enhancing the model’s transferability.

The experimental results in Tables 8 and 9 indicate that, after the cross-modal semantic feature extraction module generates semantic features, the domain detection module uses the idea of adversarial training to make the feature extractor capture invariant features between domains. Because the cross-modal semantic feature extraction module provides rich semantic information, the connectivity relationships in the graph network become richer; when the GCN detects fake news, the model can then comprehensively leverage domain-specific knowledge, contributing to the gradual improvement in performance. Overall, each module of Clip-GCN plays a positive role, answering the second question: each of the three modules is effective in extracting cross-domain invariant features.

The domain discriminator and the feature extractor are a pair of adversarially trained networks. We assume that if the domain discriminator can effectively classify the domain labels of the news, then the features extracted by the feature extractor should indeed be invariant across domains. Therefore, the features processed by the domain discriminator are visualized using t-SNE in Fig. 4.

Fig. 4

Visualization of the feature representation of domain discriminator learning on Chinese and English datasets using t-SNE. a is the feature representation of domain discriminator learning on the English dataset, b is the feature representation of domain discriminator learning on the Chinese dataset

Each dot in Fig. 4 represents the features of a news article in the historical events after being processed by the domain discriminator. The dots are color-coded by the domain label of each news article, with each domain in a distinct color. As can be seen in Fig. 4, the feature representations learned by the domain discriminator cluster well, indicating that the domain discriminator is able to classify the domain of the news correctly. Returning to the hypothesis above: the domain discriminator and the feature extractor are adversaries during training, so when the domain discriminator has a strong ability to discriminate domains, the feature extractor must also have a strong ability to deceive it, i.e., to extract features that are invariant across domains. Figure 4 thus shows that the domain discriminator plays an effective role in cross-domain news detection.

The learning ability of the model is demonstrated not only by the performance metrics and the clustering effect in t-SNE; when the modules of the model reinforce one another, the training losses also converge. We therefore plot the model’s test loss, domain discrimination loss, and training loss in Fig. 5. After a certain number of parameter iterations, the three loss curves stabilize, confirming that the model can effectively learn features for classifying the authenticity of breaking news and that each module plays a positive role.

Fig. 5

Loss curves of the model under each task

Threshold study

In this section, we study the parameters that influence the model’s performance. Among the three modules, variations in the size of the semantic feature space of the cross-modal feature extraction module and of the feature space of the domain discriminator have minimal impact on detection performance, so they are not discussed in detail here; the specific parameter settings are provided in the “Experimental setup” section. However, changes in the connectivity threshold of the graph network in the news detection module cause significant fluctuations in detection performance. Therefore, this section focuses on the impact of the threshold on the cross-domain fake news detection task.

The threshold \(\lambda\) affects the number of edges between nodes: the smaller \(\lambda\), the more edges; the larger \(\lambda\), the fewer edges. The experimental results in Table 10 show that it is not the case that a smaller \(\lambda\), and hence a more densely connected graph, always yields better cross-domain news detection; the task performs best at an intermediate value of \(\lambda\). The reason is that the graph enables the fake news detector to use knowledge within the domain, but when the relationships within the graph become too complex, the excess domain knowledge disturbs the GCN’s aggregation over the nodes. Conversely, when \(\lambda\) is set too high, the edges between nodes become few, the domain knowledge available through the sparse relationships in the graph becomes insufficient, and the cross-domain detection effect is weakened, indicating that domain knowledge is necessary for the news detection task.

Table 10 News detection accuracies of the Clip-GCN model with different thresholds \(\lambda\) in the Chinese dataset, with optimal results in bold

In Fig. 6, nodes of two different colors represent real and fake news, and the numbers within the nodes indicate the number of connections. From the figure, we can observe the node connectivity and graph complexity under the two threshold values. When the threshold is low (\(\lambda = 0.35\)), the connectivity between nodes is high and the graph has relatively many edges. This indicates that the semantic similarity between nodes is high and the distinction between fake and real news is not clear, so the cost of generating fake news is relatively low; in this scenario, fake news spreads more easily through the network and is hard to detect and identify effectively. On the other hand, when the threshold is high (\(\lambda = 0.75\)), the connectivity between nodes is low and the graph has relatively few edges. This indicates that the semantic similarity between nodes is low and the distinction between fake and real news is more pronounced, so the cost of generating fake news is higher. Therefore, a higher threshold value can improve the effectiveness of fake news detection.

Fig. 6

Visualization of connectivity relations for news semantics, a is the connectivity relation with threshold \(\lambda\) set to 0.75, b is the connectivity relation with threshold \(\lambda\) set to 0.35

Conclusion

In this paper, we study emergent fake news detection in social media, with the aim of detecting fake news in newly appearing events that have no labeled data. To address the scarcity of labeled data, a multimodal emergent news detection model, Clip-GCN, is proposed, consisting of three modules: a cross-modal feature extraction module, a domain detection module, and a news detection module. The cross-modal feature extraction module fully mines the semantic information between modalities to obtain more complete news semantic features; the domain detection module uses the min–max game idea to drive the feature extractor to extract features that are invariant between domains; and the news detection module uses knowledge shared within a domain to judge the authenticity of news. Extensive experiments on Chinese and English datasets from Weibo and Twitter show that Clip-GCN is better suited to emergent news detection in social media than existing methods, with the highest average accuracy reaching 87.79%.