Abstract

This work studies the application of graph neural networks (GNNs) to cross-border language planning (CBLP). Following a review of the concept of the GNN, it puts forward a research method for CBLP based on Internet of Things (IoT)-native data and studies the classification of language texts using different types of GNNs. Firstly, an isomorphic label-embedded graph convolutional network (GCN) is proposed. Then, a scalability-enhanced heterogeneous GCN is proposed. Subsequently, the two GCN models are fused into the research model, the heterogeneous InducGCN. Finally, the model performances are comparatively analyzed. The experimental findings suggest that the classification accuracy of the label-embedded GNN is higher than that of other methods, with the highest recognition accuracy of 97.37% on the R8 dataset. The classification accuracy of the proposed heterogeneous InducGCN fusion model is 0.09 percentage points higher than that of the label-embedded GNN, reaching 97.46%.

1. Introduction

Machine learning (ML) and deep learning (DL) have made great breakthroughs in natural language, speech, and image processing. However, speech, images, and text are relatively simple sequence or grid data, i.e., structured data, which DL handles well. Meanwhile, graph neural network (GNN) research has become a hotspot in DL [1-3]. GNNs' outstanding ability to process unstructured data has yielded breakthroughs in network data analysis, recommendation systems, physical modeling, natural language processing (NLP), and combinatorial optimization on graphs [4-6]. Not everything in the real world can be represented as a sequence or a grid, for example, social networks, knowledge graphs, and complex file systems. Such unstructured data make it necessary to study GNNs [7-9]. The advent of the Internet of Things (IoT) has sped up and enriched data generation and is challenging existing cloud technology architectures and data processing methods. Inevitably, it will shift the understanding from "all data are valuable" to "valuable data must be extracted." Data as value is a highly respected concept in the computer field. As an integral part of China's language ecosystem, cross-border languages unite cross-border ethnic groups and carry multiculturalism. However, at present, China lacks reasonable planning for cross-border languages in core areas, and some are endangered and need particular attention [10-12].

Many experts and scholars have studied the application of GNNs. Park et al. [13] proposed a learning framework based on GNNs and reinforcement learning (RL) for job-shop scheduling problems. They expressed the scheduling process as a sequential decision-making (DM) problem with a state-graph representation and derived an optimal scheduling strategy that mapped the features of embedded nodes to the optimal scheduling action. Zhao et al. [14] constructed a spider-web GNN to solve the multiview gait recognition problem. The single-view gait data were connected with the gait data of other views simultaneously to build an active graph convolutional network (GCN). Curdt-Christiansen and Gao [12] observed the interactions among family, school, community, and workplace to trace the sociolinguistic and political environment in which language change occurred. Sanden [15] reviewed the role of language policy in multilingual organizations and pointed out ten key language policy challenges in international business and management.

Bing [16] reviewed the implementation of Guangxi's foreign language policy and analyzed the multilingual opportunities and language challenges shared by Guangxi's multiethnic, multilingual communities and neighboring countries. The research revealed the sociological significance of foreign language policy beyond language itself. As a result, an "English + multilingual" policy guideline was put forward for engagement with the Association of Southeast Asian Nations (ASEAN) countries. Chang et al. [17] reasoned that immigrant parents recognized the value of English in the language market and had high expectations and aspirations for their children's English education. However, the linguistic ideology they had constructed historically limited their participation in their children's English learning and hindered their family language DM. Sung [18] investigated the language ideology of a group of international students at an English-medium university in multilingual Hong Kong. The results showed that participants' beliefs about English went beyond its role as a teaching medium and included its use as a universal language and a means of social inclusion. Participants also held complex but sometimes contradictory ideologies about the varieties of English used or accepted and about the monolingual and multilingual use of English in the context of English-medium universities. Borges et al. [19] sought to identify and analyze stakeholders' views on promoting cross-border cooperation. They found that cooperation through proactive participation, targeted policies, coordinated institutional structures, and socio-cultural proximity could inform decisions aimed at increasing stakeholder participation in cross-border regional participatory approaches.

As the above shows, there is abundant research on GNNs and cross-border languages, and this work can provide new ideas for the field. Through a literature review and a questionnaire survey (QS), this work studies the classification of IoT data, the composition of GNNs, and the graph convolutional network (GCN). The innovation lies in establishing a cross-border language planning (CBLP) research theory. An isomorphic label-embedded GCN model and a scalability-enhanced heterogeneous GCN model are proposed, and the text classification performance of the proposed GNN models is verified. This work is divided into five sections. Section 1 introduces the research background, motivation, and related work. Section 2 discusses the main research methods used. Section 3 proposes the research model. Section 4 carries out the experimental design and performance evaluation, and Section 5 summarizes the conclusions.

2. Materials and Methods

2.1. GNN in IoT Data

Everything, tangible or intangible, is connected, and a "node + relationship" graph representation can cover almost anything. For example, individuals act as nodes in human social networks, and the various relationships between people act as edges. A large amount of real-life business data can be represented by graphs. In e-commerce, users and commodities can likewise form a mapping network. The IoT, power grids, and biomolecules are natural node + relationship structures. Even physical objects can be abstracted into 3D point clouds and represented as graph data. Thus, the graph is a highly suitable data representation for such business scenarios. IoT data can be divided into static and dynamic data.

Static data refer to the descriptive information of the monitored equipment, such as location, asset attributes (equipment name and number), equipment-related labels, and equipment specifications. Static data are primarily stored in structured, relational databases. By comparison, dynamic data are time-series data: diagnostic signal data such as temperature, humidity, and pressure status, and equipment status data such as battery level. Every piece of data corresponds to a time point and is usually stored in a time-series database. Dynamic data also include unstructured data, such as pictures, text, voice, and video. The graph data representation of the IoT composition is shown in Figure 1.

The IoT system comprises four basic components: sensors/devices, data processing, connectivity, and a user interface. Sensors in a device collect data and transmit them to the cloud over an Internet connection. The software then processes the data and performs operations such as sending alarms and automatically adjusting the equipment. Finally, adjustments or required operations are made through the user interface. Big data storage is both a data repository and a data source. Adding more and more IoT-native devices complicates the artificial intelligence (AI) model and data acquisition, and the ability to perform big data operations depends on hardware that helps extract necessary and useful data insights. Therefore, investing in efficient hardware and an optimized infrastructure design is essential. One of the primary data sources in the IoT is IoT-native equipment. These devices have built-in sensors that collect ambient information; the valuable data collected are transmitted to the cloud through the Internet, where AI and ML generate useful insights. Remarkably, while conventional neural networks are commonly used for image feature extraction, the GNN is designed to process graph data. A GNN computes directly on the graph, and the whole calculation process follows the graph structure. The advantage of such processing is that it retains the structural information of the graph well; the ability to learn structural information is one prominent feature of GNNs. Figure 2 sketches the structure of a GNN.

The GNN generalizes traditional DL techniques to graph-structured data, so it is a deep representation model designed for graph structure. GNNs mainly follow the message propagation framework: first, messages are collected from a node's neighbors, and then a neural network updates the node representation, which is essentially an aggregation of neighbor node information. According to the domain of operation, GNNs are generally divided into spectral-domain GNNs and spatial-domain GNNs. The spectral-domain GNN is modeled from the perspective of traditional graph signal processing. By comparison, the spatial-domain GNN starts from the graph structure itself, a node and its neighbor nodes, and aggregates directly over the graph to learn node representations. In addition to graph convolution layers for node representation learning, the pooling layer from traditional vision has also been extended to graph data: various graph pooling operations have been proposed to streamline the graph and learn a representation of the whole graph through a differentiable compression model.
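The aggregate-then-update cycle of the message propagation framework can be sketched in a few lines of plain Python. This is a minimal toy illustration, not the paper's model: the mean aggregator, the mixing weights, and the 3-node path graph are all illustrative assumptions.

```python
# Toy spatial-domain message passing: each node aggregates the mean of its
# neighbors' features, then an update step mixes that message with its own
# feature. The weights w_self and w_msg are illustrative constants.
def message_passing_step(features, neighbors, w_self=0.5, w_msg=0.5):
    updated = []
    for node, feat in enumerate(features):
        nbrs = neighbors[node]
        # Aggregate: mean of neighbor features (0.0 for an isolated node).
        msg = sum(features[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        # Update: combine the node's own feature with the aggregated message.
        updated.append(w_self * feat + w_msg * msg)
    return updated

# A 3-node path graph 0 - 1 - 2 with one scalar feature per node.
feats = [1.0, 2.0, 3.0]
adj = {0: [1], 1: [0, 2], 2: [1]}
print(message_passing_step(feats, adj))  # [1.5, 2.0, 2.5]
```

Stacking several such steps lets information from high-order neighbors reach a node, which is what gives spatial-domain GNNs their structural learning ability.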

2.2. Cross-Border Language Planning (CBLP)

Language planning is closely related to language-ecological issues, such as language resources, diversity, and social functions. It is an essential national means of social governance and of promoting social harmony [20-22]. In terms of minority language planning, China has formulated a series of minority language policies in Xinjiang at different historical stages. These policies have played an inestimable role in protecting minority language rights, coordinating language relations, and safeguarding national unity. Still, some cross-border languages in the core area have become endangered [23]. Language diversity is a solid basis for a stable and robust language ecosystem and is related to the sustainable development of human civilization. Language planning is the practice of planning language, including the language work carried out by each planning subject at the macro, meso, and micro levels, as well as scientific research on language work and practical activities. In reality, the specific forms of language planning change with the planning subject, which includes institutions, groups, and individuals represented by the government and language planning departments. More generally, the series of activities to formulate, implement, and enforce language policies is the most critical and direct manifestation of language planning. The language planning process usually consists of five stages, as shown in Figure 3.

According to different perspectives, language planning can be divided into status, acquisition, reputation, and function. Generally, language planning involves dichotomy, trisection, and quartering methods. Based on the different characteristics and contents, the dichotomy divides language planning into “language status planning” and “language ontology planning.” On this basis, the trisection method subdivides the concept of acquisition planning. Then, based on previous studies, the quartering method integrates language planning into four basic types: language status planning, language ontology planning, language education planning, and language reputation planning. Figure 4 details their relationship.

The formulation and implementation of any language planning are not accidental but the result of language planning motivation, language ideology, and the language planning goal. These three constitute the driven-process theory of language planning, and its structural components are illustrated in Figure 5.

Cross-border language, a special language phenomenon, exists between China and its coterminous countries and across different regions within China. In China alone, the cross-border languages of the South and the North have distinct characteristics. Some major languages in the North, such as Uighur, Mongolian, and Korean, are spoken by large populations. They have a long history of traditional minority scripts with minor dialect differences and are widely used within minority groups, so there are basically no communication obstacles within the same cross-border language. However, most cross-border languages in the South lack written scripts. In the South, Han people live together with other ethnic groups, so the functional domains of these cross-border languages have gradually weakened. In terms of social function, cross-border languages can cross national boundaries and carry rich bilateral or even multilateral historical accumulation. They also carry multiculturalism, unite specific ethnic groups, and present a unique language form with significant economic value. Essentially, CBLP aims to protect, utilize, and develop the service functions of cross-border languages as resources, namely, the tool-use function, the humanistic construction function, the economic support function, and the safety maintenance function. The social functions of cross-border languages, classified by region of use, are shown in Figure 6.

Cross-border language embodies its instrumental social function in government affairs, school teaching, mass media, and other aspects. Its use in the school curriculum, academic research, and other fields can reflect its humanistic function. Its economic function can be reflected in the economic, scientific, and technological fields. Its safety function is mainly reflected in science, technology, and national defense. The cross-border language in the core area has a distinct international instrumental value and the safety function of maintaining national security in science, technology, and national defense. They are key languages in China. Therefore, their function as a regional international language should be highlighted and developed in combination with the national strategy.

2.3. Language Text Classification Based on GNN

The GCN is mainly divided into the spectral-domain-based GCN and the spatial-domain-based GCN. The former introduces filters from graph signal processing to define the graph convolution operation, whereas the latter defines graph convolution through information propagation. GNNs have achieved some success in text classification (TC). However, most studies only consider text information and ignore label information. This work proposes adding label nodes, on top of the text nodes and word nodes in the text graph, to form a text-label-text information transmission path. The supervised label information can then propagate more directly through the whole graph along this path, realizing label embedding and text embedding in the same semantic space through the graph convolution operation. In this way, an isomorphic label-embedded GNN model is implemented for text classification.

The GCN is a multilayer neural network acting on graph-structured data; it learns the features of a node based on the node's neighbors. Suppose $G = (V, E)$ represents a graph, where $V$ denotes a set of nodes, and $E$ indicates a set of edges. In the isomorphic GNN, $V$ contains three types of nodes: word nodes, text nodes, and label nodes. $A$ stands for the adjacency matrix; then (1) is obtained as

$A_{ij} = 1$ if $(v_i, v_j) \in E$, and $A_{ij} = 0$ otherwise. (1)

In (1), $A_{ij} = 0$ means no edge connection between node $v_i$ and node $v_j$; otherwise, there is an edge connection between node $v_i$ and node $v_j$. The feature matrix of the graph nodes is expressed as

$X \in \mathbb{R}^{n \times m}$ (2)

In (2), the dimension of the feature is represented by $m$, and the normalized symmetric adjacency matrix is represented as

$\tilde{A} = D^{-1/2} A D^{-1/2}$ (3)

In (3), $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$; then the propagation expression of the GCN is demonstrated by

$H^{(l+1)} = \rho(\tilde{A} H^{(l)} W^{(l)}), \quad H^{(0)} = X$ (4)

From (4), for the two-layer network used here, we get

$Z = \mathrm{softmax}(\tilde{A}\,\rho(\tilde{A} X W^{(0)})\,W^{(1)})$ (5)

$H^{(l)}$ represents the stacking of the $d$-dimensional hidden-layer vectors of all nodes at the $l$-th layer, $W^{(l)}$ denotes the trainable parameter matrix, and $\rho$ signifies the rectified linear unit (ReLU) activation function. Now, to build a text graph, this work considers three types of nodes: word nodes, text nodes, and label nodes. The word nodes represent all non-repeated words in the dictionary. The text nodes stand for all the texts in the text set, including the training and testing sets. Lastly, the label nodes are all the labels in the label set corresponding to the text set. Next, to encode the different nodes and edges, this work constructs four combined subgraphs to form the text graph inputted into the GCN. The word-word subgraph is outlined in Figure 7.
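The normalized symmetric adjacency matrix and the GCN propagation step described above can be checked numerically with a small pure-Python sketch. Adding self-loops to the adjacency matrix is a common GCN convention assumed here; the 3-node path graph, features, and identity weight matrix are illustrative.

```python
import math

def normalize_adjacency(A):
    """Compute D^{-1/2} (A + I) D^{-1/2} for a dense list-of-lists adjacency.
    Self-loops (the +I) let each node keep its own feature during propagation."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]                 # node degrees incl. self-loop
    d_inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    return [[d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]

def gcn_layer(A_tilde, H, W):
    """One propagation step: ReLU(A_tilde @ H @ W), written out by hand."""
    n, d_in, d_out = len(H), len(H[0]), len(W[0])
    AH = [[sum(A_tilde[i][k] * H[k][j] for k in range(n)) for j in range(d_in)]
          for i in range(n)]
    AHW = [[sum(AH[i][k] * W[k][j] for k in range(d_in)) for j in range(d_out)]
           for i in range(n)]
    return [[max(0.0, x) for x in row] for row in AHW]  # ReLU

# Path graph 0 - 1 - 2 with 2-d node features and a 2x2 identity weight.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
H1 = gcn_layer(normalize_adjacency(A), H0, W)
```

After normalization, each entry of the adjacency is scaled by the degrees of both endpoints, so high-degree nodes do not dominate the aggregation.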

The word-text subgraph representation is drawn in Figure 8.

The label-text subgraph is portrayed in Figure 9.

The label-word subgraph is charted in Figure 10.

The word-word subgraph captures local word cooccurrence at the text level; meanwhile, through graph convolution over high-order neighbor information, it also acquires a global property. The word-text subgraph captures text-level word cooccurrence. In this work, the label subgraphs are the way labels are introduced into the text graph network. Each is a bipartite graph: there are no connections among labels or among texts, and only labels and texts are connected. The subgraph connecting each text to its corresponding label is inspired by the label propagation algorithm. In this way, label information can be transmitted more directly through the text network, which is conducive to learning the feature representations of the text nodes.
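The bipartite structure of the label-text subgraph can be sketched as follows. This is an illustrative construction, not the paper's implementation: the node ids, the label assignments, and the uniform edge weight of 1.0 are all assumptions (the effect of this weight is analyzed later in the experiments).

```python
# Sketch of the label-text bipartite subgraph: edges run only between a
# training text node and its gold label node, never text-text or label-label.
def build_label_text_edges(text_labels, edge_weight=1.0):
    """text_labels maps a text id to its label id; returns a symmetric edge
    dict keyed by (node, node) pairs over the union of text and label ids."""
    edges = {}
    for text, label in text_labels.items():
        edges[(text, label)] = edge_weight
        edges[(label, text)] = edge_weight  # the graph is undirected
    return edges

# Three training texts, two classes (ids are illustrative).
labels_of = {"doc0": "lab_sport", "doc1": "lab_sport", "doc2": "lab_econ"}
edges = build_label_text_edges(labels_of)

# Bipartite property: no edge joins two texts or two labels.
assert ("doc0", "doc1") not in edges
assert ("lab_sport", "lab_econ") not in edges
```

Because "doc0" and "doc1" share the label node "lab_sport", a two-hop graph convolution lets them exchange information along the text-label-text path.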

3. Research Model

The TextGraph network model can be built by combining the above four subgraphs. That is the isomorphic GNN model based on label embedding (as shown in Figure 11).

As per Figure 11, the node set composed of words is shared among the subgraphs. The constructed TextGraph containing label information is then fed into a two-layer GCN to learn node representations and perform classification. The dimension of the second GCN layer is set to the number of categories, namely, the number of labels corresponding to the text set. Afterward, a normalized exponential function, the SoftMax layer, is added for classification. Finally, after the learned features are sent to the SoftMax layer, the index of the largest entry in the feature vector is the final predicted label.
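The final classification step, SoftMax over the second-layer output followed by an argmax, can be sketched in plain Python. The 8-dimensional logits below are illustrative values standing in for one text node's second-layer output (one dimension per category, as on R8).

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict_label(node_features):
    """node_features: second-layer GCN output for one text node, one
    dimension per category. Returns the index of the predicted label."""
    probs = softmax(node_features)
    return max(range(len(probs)), key=lambda i: probs[i])

# A text node whose 8-dim output (one entry per category) peaks at index 2.
logits = [0.1, 0.3, 2.5, 0.2, 0.0, 0.4, 0.1, 0.2]
print(predict_label(logits))  # 2
```

Since SoftMax is monotonic, the argmax of the probabilities equals the argmax of the raw features, which is exactly the "largest entry" rule stated above.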

4. Experimental Design and Performance Evaluation

4.1. Research Materials and Collection

An experiment is designed to verify the isomorphic GCN. Popular text classification methods are selected for comparison on standard text classification datasets. In particular, one-hot vectors are chosen to initialize the nodes, and GloVe provides the word vectors. The dimension of the GCN's first layer is set to 200, and the dimension of the second layer to the number of categories of the corresponding dataset; for example, R8 has eight categories, so the second dimension is set to eight. The initial learning rate is set to 0.02. Meanwhile, a dropout layer with a rate of 0.5 is placed after the first graph convolution layer. The datasets selected for the experiment are the classical text classification datasets OHSUMED, R8, and R52.

The isomorphic GCN model establishes a network containing text, word, and label nodes, and the graph convolution process is isomorphic: when the GCN learns the embedded representation of a node, there is no distinction among the neighbors of each node during aggregation, and all nodes are treated as the same type. This form of composition lacks flexibility, making it difficult to deal with newly added text nodes. Therefore, this work proposes a scalable, heterogeneous GNN model: InducGCN. InducGCN first uses different subgraphs for word learning, then fuses the word representations output by the different subgraphs, and finally classifies by using the learned word embeddings to learn the text embeddings. It splits the word-text subgraph in TextGCN into three parts: the word-word subgraph, the text-word subgraph, and the word-text subgraph. The structure of the word-word subgraph is shown in Figure 12.

The word-word subgraph is composed of the words in the text set. Each node is a non-repeated word in the dictionary, and the edge weights are the weights between word pairs. This subgraph obtains the embedded representation of words through the GCN, and the input word features are one-hot vectors without prior knowledge. The word-text subgraph is a subgraph based on the relationship between words and texts. Its structure is shown in Figure 13.

The final structure of the GNN is shown in Figure 14.

The word embedding representations obtained in the above two stages are used as the input, i.e., as the feature vectors of the text-word subgraph. The final text embedding is then obtained through the graph convolution operation, and classification is performed. In this model, the two parts of the word embedding are fused to obtain the final word embedding and then sent to the classification GCN. Finally, an experiment is carried out to verify the InducGCN model.
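The fusion of the two per-word embeddings before the classification GCN could take a form like the following. The paper does not specify the fusion operator, so the weighted element-wise average used here (and the `alpha` parameter, word ids, and embedding values) is an assumption for illustration only.

```python
def fuse_word_embeddings(emb_a, emb_b, alpha=0.5):
    """Fuse per-word embeddings from two subgraphs by a weighted average.
    alpha balances the word-word subgraph output (emb_a) against the
    text-word subgraph output (emb_b); alpha is an illustrative choice."""
    fused = {}
    for word in emb_a:
        fused[word] = [alpha * a + (1.0 - alpha) * b
                       for a, b in zip(emb_a[word], emb_b[word])]
    return fused

# Two-dimensional embeddings for two words (all values are illustrative).
from_word_word = {"river": [1.0, 0.0], "border": [0.0, 2.0]}
from_text_word = {"river": [0.0, 1.0], "border": [2.0, 0.0]}
print(fuse_word_embeddings(from_word_word, from_text_word))
# {'river': [0.5, 0.5], 'border': [1.0, 1.0]}
```

Concatenation would be an equally plausible fusion choice; averaging keeps the embedding dimension fixed, which matches feeding the result into a GCN layer of unchanged input size.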

4.2. Experimental Environment

The hardware and software environment settings of this experiment are as follows: Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz; 8 GB of Random Access Memory (RAM); Windows 10 operating system; Python as the development language; and Matlab as the experimental environment.

4.3. Performance Comparison of Isomorphic GNN Based on Label Embedding

This section compares different text classification methods. Specifically, the CNN uses convolution to extract sentence features and a fully connected layer to classify; CNN-non-static uses pretrained word vectors, whereas CNN-rand uses randomly initialized word vectors. Multitask long short-term memory (LSTM), a multitask framework, uses the last state of the LSTM to vectorize the whole sentence, which is then classified through a fully connected layer. BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language representation model. The performance of these models and of the label-embedding-based isomorphic GNN model reported here on the different datasets is compared in Figure 15.

According to Figure 15, the classification accuracy of the label-embedded GNN is higher than that of the other methods, with the highest recognition accuracy of 97.37% on the R8 dataset. Pretrained word vectors can greatly improve a neural model's prediction performance. Because texts sharing the same label are distributed very closely in the semantic space, it is reasonable for text nodes to exchange information through a central label node; in other words, information transmission along the text-label-text path is reasonable. By comparison, the word-label-word connection path can capture category-level word cooccurrence features, but its classification accuracy is not as good as that of the main label-embedded GNN model. Therefore, directly connecting texts with their corresponding labels learns a better embedded representation.

The complexity analysis results of the label-embedded GNN model are revealed in Figure 16.

As in Figure 16, the classification accuracy increases significantly with the edge weight between the label and text nodes. Nevertheless, when the weight exceeds 1, the accuracy begins to decline. This shows that too small a weight prevents full information exchange between text pairs sharing the same label, while an excessive weight causes overfitting.

4.4. Performance Comparison of Scalability-Enhanced GCN Model

Given that the previous label-embedded GNN model performs well on text classification, Figure 17 compares the classification and recognition rates of the different models.

As in Figure 17, the classification accuracy of InducGCN is 0.09 percentage points higher than that of the label-embedded GNN, reaching 97.46%. Thus, InducGCN makes full use of the heterogeneous graph information: different GCNs are used to represent the words of the three subgraphs, improving model performance.

The complexity analysis results of the InducGCN model are exhibited in Figure 18.

As in Figure 18, the InducGCN model divides large heterogeneous graphs into small graph networks to reduce computational complexity. The scalability analysis of the classification model is unveiled in Figure 19.

As Figure 19 suggests, the proposed heterogeneous InducGCN model does not need to retrain the whole network for newly added texts. Simply put, given a new text input, only part of the network structure needs to be retrained. Hence, InducGCN has good scalability.

5. Conclusion

Since cross-border languages entered the academic field, ontological research on domestic cross-border languages has been a hotly pursued topic. Targeting various cross-border languages, scholars worldwide have enthusiastically investigated language structures involving vocabulary, grammar, and pronunciation. Meanwhile, individual relationships in a social network, such as friendship, can be modeled through graphs, as can the relationships between stations in an urban transportation system. Pushes or recommendations over the Internet or mobile terminals are often based on pinpointing users' topics of interest. The GNN is quite similar to an image processing neural network: in graph-theoretic terms, a picture is a graph structure formed by the interconnection of each pixel with its adjacent points. The relationship between each pixel and its surrounding pixels is relatively fixed and can be expressed by the relative positions top, bottom, left, and right, whereas the positions and distances between the nodes of a general graph are relatively flexible. With the in-depth study of IoT-native data, applying GNNs to cross-border language research might improve the accuracy of language text classification. Consequently, this work proposes a CBLP model and studies language text classification by fusing isomorphic and heterogeneous GCNs. The proposed heterogeneous InducGCN-based CBLP model is verified through comparative analysis, and the numerical results corroborate that the heterogeneous InducGCN beats the other GCN models. However, there are also some deficiencies in the research. Research on cross-border languages in China has a relatively short history and is still in its infancy, and the theoretical system of language ecology is not yet fully mature; further analysis and research are needed. Additionally, GNN-based text classification has potential pitfalls, such as over-smoothing, and the model training process is not stable enough. It is hoped that a more robust model can be developed in the future.

Data Availability

The raw data supporting the conclusions of this article are available from the author upon request.

Informed consent was obtained from all individual participants included in the study.

Conflicts of Interest

The author declares that they have no conflicts of interest.

Acknowledgments

The author acknowledges the help from the university colleagues. This work was supported by the project of the Hunan Social Science Achievement Evaluation Committee (2021, Document no. 3, Project no. XSP22YBC531, Research on Strategies for Enhancing Communication Ability of Hunan Culture from the Perspective of Language Service).