Prediction of future gene expression profile by analyzing its past variation pattern

https://doi.org/10.1016/j.gep.2021.119166

Abstract

We consider a number of initial Hematopoietic Stem Cells (HSCs) in a container, each able to divide into HSCs or differentiate into various types of descendant cells. In this paper, a method is designed to predict an approximate gene expression profile (GEP) for the future descendant cells resulting from HSC division/differentiation. First, the GEP prediction problem is modeled as a multivariate time series prediction problem. A novel method called EHSCP (Extended Hematopoietic Stem Cell Prediction), an artificial neural machine, is then introduced to solve the problem. EHSCP accepts the initial sequence of measured GEPs as input and predicts the GEPs of future descendant cells. This prediction can be performed for multiple stages of cell division/differentiation. EHSCP treats the GEP sequence as a set of time series and computes the correlation between the input time series. Two novel artificial neural units called PLSTM (Parametric Long Short Term Memory) and MILSTM (Multi-Input LSTM) are designed. PLSTM enables EHSCP to take this correlation into account in its output prediction. Since GEP prediction involves thousands of time series, a hierarchical encoder is proposed that computes this correlation using 101 MILSTMs. EHSCP is trained on 155 datasets and evaluated on 39 test datasets. These evaluations show that EHSCP surpasses existing methods in terms of prediction accuracy and the number of correctly predicted division/differentiation stages. In these evaluations, EHSCP correctly predicted 128 stages when as many as 8 initial stages were given.

Introduction

Hematopoietic stem cells (HSCs) are multi-potent cells that can proliferate or differentiate into hematopoietic progenitor cells (HPCs). They give rise to their progeny within a hierarchical model (Seita and Weissman, 2010). The generation of blood cells from HSCs involves multiple downstream cell division/differentiation steps through which multi-potent HSCs differentiate into uni-/bi-potent progenitors and finally into well-differentiated peripheral blood cells (Scala and Aiuti, 2019). The differentiation path that HSCs choose is not yet completely understood (Scala and Aiuti, 2019). However, the cell lineage decision is a matter of switching between gene expression programs, which can be affected by extracellular signals (Barbosa et al., 2019).

The differentiation behavior of HSCs can be examined either by culturing them in vitro or by transplanting them in vivo. In either case, the overall quality and the type of the descendant cells of HSCs can be estimated by checking properties of the initial stem cells (we review such methods in Section 2). Such estimates, however, are based on a few previously observed cases of cell division/differentiation and do not cover as many cases as possible. In fact, there is no systematic method to predict the type and properties of the future descendant cells of HSCs. Our study tries to address this issue.

The status of the genes in a cell can be checked by measuring the GEP (Gene Expression Profile) of the cell. A GEP contains the expression level of every gene in the cell (Section 2.1). The type, behavior, and properties of a cell can be determined by looking at its GEP. If a stem/progenitor cell divides/differentiates into a number of new cells, we consider the stem/progenitor cell as the parent and the resulting cells as its children/descendants.

In this paper, we propose designing a learning machine that reads datasets of GEP matrices of parent HSCs and their descendant cells. In this way, the machine learns how GEP matrices change during each cell division/differentiation. After adequate rounds of training, the machine acquires the knowledge to analyze the GEP matrix of a parent HSC and predict the GEP matrices of its future descendant cells.
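As an illustration of how such training data could be framed, the sketch below slices a sequence of GEPs (one per division/differentiation stage) into (past window, next GEP) pairs. The flat list-of-floats layout, the `make_training_pairs` name, the window length, and the toy numbers are all assumptions made for this example; they are not taken from the paper.

```python
from typing import List, Tuple

# Assumed layout: a GEP is a list of expression levels, one entry per gene.
GEP = List[float]


def make_training_pairs(stages: List[GEP], window: int) -> List[Tuple[List[GEP], GEP]]:
    """Slice a stage-ordered GEP sequence into (past window, next GEP) pairs."""
    if window < 1:
        raise ValueError("window must be at least 1")
    pairs = []
    for t in range(window, len(stages)):
        # The `window` stages before t are the input; stage t is the target.
        pairs.append((stages[t - window:t], stages[t]))
    return pairs


# Toy sequence: 5 stages, 3 genes each (made-up numbers).
toy_stages = [
    [0.1, 0.5, 0.9],
    [0.2, 0.5, 0.8],
    [0.3, 0.4, 0.7],
    [0.4, 0.4, 0.6],
    [0.5, 0.3, 0.5],
]
pairs = make_training_pairs(toy_stages, window=2)
```

With a window of 2, the five toy stages yield three training pairs; a longer window gives the learner more history per example at the cost of fewer examples.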

To develop this idea further, we analyze the problem in Section 2 and model it as a standard multivariate time-series prediction problem. Various methods exist for such problems. We review the existing methods for multivariate time-series prediction in Section 3 and find that they cannot achieve high accuracy in GEP prediction. We also review existing prediction methods for stem cell differentiation and conclude that they do not cover most cases of differentiation. To design an efficient method for GEP prediction, we describe a deep recurrent neural machine called HSCP (Section 4) and extend it to an advanced machine called EHSCP in Section 5. For this extension, we design two novel artificial neural nodes, which enable EHSCP to consider a parameter that quantifies the correlation between input time series. Since a GEP comprises thousands of time series, we propose a hierarchical encoder that computes this correlation using 101 nodes. As input, EHSCP accepts the initial sequence of GEPs measured in a culturing container. As output, it predicts the GEP matrices of the descendant cells that will be generated in the container. In Section 6, we provide solutions to the issues that arise when implementing EHSCP in practice. We evaluate EHSCP in Section 7. According to our evaluations, when a single stage of division/differentiation is predicted, EHSCP slightly improves on existing methods in terms of prediction accuracy. When a large number of division/differentiation stages must be predicted, EHSCP noticeably improves on the prediction accuracy of existing methods. In our evaluations, EHSCP correctly predicted 128 stages when as many as 8 initial stages were given. Section 8 discusses the properties of EHSCP. Finally, Section 9 concludes the paper.
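To make the notion of correlation between input time series concrete, the following sketch computes a Pearson coefficient between one series and several others. This excerpt does not state which correlation measure EHSCP uses, so Pearson correlation is an assumption here, as are the names `pearson`, `main_series`, and `secondary`, and the toy values.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long series (0.0 if either is constant)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant series carries no linear relationship
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression levels of one gene over four stages ...
main_series = [1.0, 2.0, 3.0, 4.0]
# ... and of three other genes over the same stages.
secondary: Dict[str, List[float]] = {
    "gene_a": [2.0, 4.0, 6.0, 8.0],  # rises with the main series
    "gene_b": [4.0, 3.0, 2.0, 1.0],  # falls as the main series rises
    "gene_c": [1.0, 1.0, 1.0, 1.0],  # constant
}
correlations = {name: pearson(main_series, s) for name, s in secondary.items()}
```

With thousands of genes, computing such a coefficient for every pair becomes expensive, which is one plausible motivation for summarizing the series hierarchically as the text describes.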

Section snippets

Problem modeling and generic solution

In this section, we describe the prediction problem which we want to solve. Then, we describe a generic solution.

Related works

There exist many works in Biology, Bioinformatics, Artificial Intelligence, and Mathematics which are related to our research. In this section, we classify them into a number of categories and review each category in a separate subsection. Even though we have focused on HSCs in our scheme, prediction methods for all kinds of stem cells are related to our research.

Our basic method: HSCP

In this section, we design HSCP (HSC Prediction), a neural machine that predicts the GEPs of future descendants of HSCs. Although HSCP is built from an existing machine (a Deep LSTM network), we describe it so that we can compare it with our extended machine in the next sections.
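For readers unfamiliar with Deep LSTM networks, the sketch below shows one time step of a standard LSTM cell and how several such layers can be stacked. It follows the textbook LSTM gate equations, not the specific HSCP architecture, whose layer sizes and weights are not given in this excerpt. The helper names (`lstm_step`, `deep_lstm_step`, `zero_params`) and the all-zero weights are placeholders chosen for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]
Params = Dict[str, list]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(W: Matrix, v: Vector, b: Vector) -> Vector:
    """Compute W @ v + b with plain lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector, p: Params) -> Tuple[Vector, Vector]:
    """One step of a standard LSTM cell; returns the new (hidden, cell) state."""
    z = x + h_prev  # concatenate input and previous hidden state
    i = [sigmoid(a) for a in affine(p["Wi"], z, p["bi"])]    # input gate
    f = [sigmoid(a) for a in affine(p["Wf"], z, p["bf"])]    # forget gate
    o = [sigmoid(a) for a in affine(p["Wo"], z, p["bo"])]    # output gate
    g = [math.tanh(a) for a in affine(p["Wg"], z, p["bg"])]  # candidate values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def deep_lstm_step(x: Vector, states: List[Tuple[Vector, Vector]], layers: List[Params]):
    """Run one time step through stacked layers; each layer feeds the next."""
    new_states = []
    layer_input = x
    for (h_prev, c_prev), p in zip(states, layers):
        h, c = lstm_step(layer_input, h_prev, c_prev, p)
        new_states.append((h, c))
        layer_input = h
    return layer_input, new_states


def zero_params(n_in: int, n_hidden: int) -> Params:
    """Placeholder all-zero weights; a trained model would learn these."""
    p: Params = {}
    for gate in ("i", "f", "o", "g"):
        p["W" + gate] = [[0.0] * (n_in + n_hidden) for _ in range(n_hidden)]
        p["b" + gate] = [0.0] * n_hidden
    return p


# Two stacked layers: 3 genes in, 2 hidden units per layer.
layers = [zero_params(3, 2), zero_params(2, 2)]
states = [([0.0, 0.0], [1.0, -2.0]), ([0.0, 0.0], [0.0, 0.0])]
top_h, new_states = deep_lstm_step([0.1, 0.5, 0.9], states, layers)
```

With all-zero weights every gate evaluates to 0.5, so the first layer simply halves its cell state. The example only demonstrates the data flow through the stack, not a useful prediction.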

Our extended method: EHSCP

In this section, we extend HSCP to an advanced multivariate neural machine called EHSCP (Extended HSCP). Prediction and training in EHSCP are similar to HSCP.
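Predicting several future stages from a short initial sequence is often done by feeding each predicted element back in as input for the next step. The sketch below illustrates that general pattern only. Whether HSCP or EHSCP performs multi-stage prediction this way is not stated in this excerpt, and the `extrapolate` predictor is a deliberately trivial stand-in for a trained model.

```python
from typing import Callable, List

GEP = List[float]


def rollout(initial: List[GEP],
            predict_next: Callable[[List[GEP]], GEP],
            n_stages: int) -> List[GEP]:
    """Predict n_stages future GEPs, appending each prediction to the history."""
    history = [list(g) for g in initial]  # copy so the caller's data is untouched
    predicted = []
    for _ in range(n_stages):
        nxt = predict_next(history)
        predicted.append(nxt)
        history.append(nxt)
    return predicted


def extrapolate(history: List[GEP]) -> GEP:
    """Stand-in predictor (not the paper's model): linear extrapolation per gene."""
    last, prev = history[-1], history[-2]
    return [2 * a - b for a, b in zip(last, prev)]


# Two measured initial stages with two genes each (made-up numbers).
initial_stages = [[1.0, 10.0], [2.0, 9.0]]
future = rollout(initial_stages, extrapolate, n_stages=3)
```

In a loop of this kind, errors in early predictions propagate into later ones, which is why the number of correctly predicted stages is a natural evaluation metric for multi-stage prediction.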

Practical issues and solutions

In this section, we discuss a number of issues that may arise when we want to use HSCP or EHSCP in practice. In addition, we propose solutions to resolve each one of these issues.

Since many datasets are required to train the neural machine in our scheme, those datasets cannot be provided by a few biologists. Therefore, many independent biologists are involved and we cannot expect that they follow exactly the same procedure and conditions in their experiments. Therefore, we add a number of

Evaluations and results

In this section, we train EHSCP and evaluate it. As a baseline, we have chosen DA-RNN (Qin et al., 2017), which is one of the best existing learning machines for multivariate time-series prediction. So far, only one scheme has been proposed for next-element prediction of GEPs (Bhattacharjee and Vishwakarma, 2019). We call it Bhattacharjee2019 and evaluate it too. In addition, we compare our machine with HSCP to understand the advantages of considering a multivariate design compared to a

Discussion

EHSCP surpassed the existing prediction methods in our evaluations because of the following facts:

  • EHSCP considers the order and the pattern that exist among the elements of the main time series. LSTM1, LSTM2, and PLSTM3 memorize the previous values of their inputs.

  • EHSCP considers the dependencies between the main time series and the secondary time series.

  • EHSCP efficiently summarizes the secondary time series, reducing their number from thousands to about a hundred. Then, it encodes the hundred values into a

Conclusion

A few learning machines have been proposed in the literature for multivariate time series prediction (Section 3). They are not designed for HGEP. They cannot accept thousands of secondary time series. They are not designed to be trained using the extensive number of GES datasets which are available worldwide from various researchers. Therefore, they cannot achieve high prediction accuracy in HGEP. Much more research is required in the multivariate case to design application-specific machines for

Author statement

All four authors confirm that they contributed equally to this paper.

Declaration of competing interest

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References (34)

  • C.M. Barbosa. Extracellular annexin-A1 promotes myeloid/granulocytic differentiation of hematopoietic stem/progenitor cells via the Ca2+/MAPK signalling transduction pathway. Cell Death Discovery (2019)

  • A. Bhattacharjee et al. Time-course data prediction for repeatedly measured gene expression. Int. J. Biomath. (IJB) (2019)

  • F. Buggenthin. Prospective identification of hematopoietic lineage choice by deep learning. Nat. Methods (2017)

  • Y. Chen et al. Gene expression inference with deep learning. Bioinformatics (2016)

  • Z. Cui et al. Deep stacked bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction

  • H. Goel et al. R2N2: residual recurrent neural networks for multivariate time series forecasting (2017)

  • A. Graves et al. Speech recognition with deep recurrent neural networks
