1 Introduction

In December 2019, a new coronavirus, called SARS-CoV-2, was identified in the city of Wuhan, China, and spread quickly to other countries. In January 2020, the World Health Organization declared a Public Health Emergency of International Concern and, in March, a pandemic. By the beginning of September 2021 there were already more than 220.5 million confirmed cases and more than 4.5 million deaths from the disease [34].

After the virus infects the human body, there is a latency period, followed by an infectious period, during which the infected person can transmit the virus to others through coughing and sneezing. The virus mainly affects the respiratory tract, and the first symptoms appear after the incubation period. In the most common cases, symptoms include fever, cough and fatigue, which appear on average within 11 to 14 days of infection [25]. Other symptoms, such as mucus production, headache, hemoptysis, diarrhea, dyspnea and lymphopenia, can also appear. The main clinical diagnosis is pneumonia [3, 32, 41, 46]. Furthermore, the risk of symptomatic infection increases with age: older individuals are more likely to have symptomatic infection and worse outcomes [3].

Laboratory testing is an important tool for diagnosis, as well as for follow-up and for evaluating the evolution of the case. The recommended diagnostic test is the real-time reverse transcription polymerase chain reaction (RT-PCR) on nasal and oropharyngeal swab samples. Serological tests can also be used to detect immune responses, such as class M (IgM) and class G (IgG) immunoglobulins. However, it is important to use resources rationally when conducting diagnostic tests [49].

With regard to the rational use of resources for detecting the spread of infection, artificial intelligence techniques have been used to predict the diagnosis of COVID-19. These algorithms can predict the stage of the disease from features such as age, comorbidities, symptoms, diagnosis and outcome [33, 51].

A very useful approach in this context is Nowcasting, mainly because of the transmission dynamics of an epidemic or pandemic. It is a technique for predicting the present, that is, for estimating the current number of occurrences of an event [5]. Although it generally relies on time series, recent advances in machine learning techniques have diversified the possibilities [29, 48].

This paper is an expanded version of a previous article [19] and complements the proposal initially developed there. Here, the data from the public health system of the capital of Santa Catarina, Brazil, were extended with variables on weather, mobility and non-pharmacological government measures.

Thus, to create the predictive model, we investigated the main features that can determine whether an individual is infected with COVID-19. We conducted several experiments with 221 features to identify the 30 most important features associated with a high likelihood of COVID-19 infection.

1.1 Contributions

Among the contributions of our work, we can highlight:

  1. The verification of the high importance of the numbers of confirmed, discarded, and removed cases per health territory and sub-territory, as well as of the symptom features (fever, cough, and sore throat), along with the notification date.

  2. An intensive feature importance investigation, whose findings also highlight the importance of traffic load and mobility, which reflect the population's level of isolation.

  3. A model that achieved an average accuracy of 81.82% in determining whether an individual is infected with COVID-19.

The remainder of this paper is structured as follows: Section 2 describes the most relevant related work on determining COVID-19 infection; Section 3 introduces the methodology applied to feature engineering; Section 4 details the experimental assessments; Section 5 discusses the results; and, finally, Section 6 presents our final remarks and future work.

2 Related work

COVID-19 has had a significant impact on the life and economy of several countries [21]. In addition to collapsing economies, the moral values of nations have been strongly affected by the pandemic [43]. This economic and social impact motivated the Pan American Health Organization to seek a better understanding of the signs and symptoms of SARS-CoV-2 in order to disseminate this knowledge. The challenge of the pandemic is to find the best model to elucidate the initial growth trajectory and the epidemiological characteristics of the new coronavirus [40]. In this sense, predictive models have been useful for dealing with the dynamic behavior of this virus [44].

SARS-CoV-2 is a respiratory virus transmitted through droplets of saliva, sneezing or close contact. The study in [47] described 69 cases of COVID-19 in China, in which 15% of individuals had fever, cough and dyspnea. A survey conducted in the United States, however, showed that 50% of patients affected by this virus did not have a fever, whereas cough and dyspnea were reported by 88% of people with the virus [6]. In other studies, the reported symptoms were difficult to measure objectively, such as anosmia (loss of smell), hyposmia (decreased smell) and ageusia (loss of taste) [24].

In addition, some infected individuals may never develop symptoms, while others may have mild symptoms or develop moderate to severe disease [35]. Researchers around the world are trying to understand the behavior of the virus in order to identify the symptoms that best characterize the pandemic [24]. A group of researchers from Spain found five patterns of skin manifestation that may be associated with COVID-19. These patterns were repeated in patients with varying demographic characteristics, in different periods and with different severities of the disease. The patterns are maculopapular rashes (47% of cases), vesicles or pustules (19% of cases), hives (19% of cases), other vesicular rashes (9% of cases) and livedo or necrosis (6% of cases) [15].

A preliminary analysis by the World Health Organization (WHO) shows a relatively uniform distribution of infections between women and men (47% versus 51%, respectively); however, men seem to have a higher mortality rate (58%) than women [35]. Because of the need to understand the COVID-19 outbreak, some studies are also considering exogenous factors such as the social environment, climatic variables, pollution and population density [44]. Other studies point to the role of ambient temperature in the survival and transmission of viruses. According to the WHO, several environmental factors can influence the spread of communicable diseases that can cause epidemics. The underlying theory is that the number of cases and the spread of previous infectious viruses show seasonal patterns affected by the climate, and therefore COVID-19 is likely to be similar in this respect [30].

Therefore, the prediction of a pandemic can be based on several parameters, such as the impact of environmental factors, the incubation period, the impact of quarantine, age, sex and many more. The difficulty in predicting the number of cases of a pandemic lies in the fact that the number of cases available for study does not match the total infected population [37].

Considering the importance of understanding this difficult epidemiological scenario, Forecasting Models are an alternative for unraveling the impacts of the pandemic. This technique assesses past situations, which allows for better predictions about the situation that will occur in the future [43, 44]. Another alternative is Nowcasting Models, which provide a prediction for cases that have not yet been reported [31]. They are very useful in situations where there is a delay in the response of the control instruments.

Both techniques help managers to plan strategically and to make decisions as assertively as possible [30]. Accordingly, several models have been developed that allow governments not to rely only on basic methods such as personal judgment. Many use mathematical and statistical methods, as well as Artificial Intelligence techniques, to predict epidemic trajectories [11, 13, 26, 50] and to evaluate non-pharmacological interventions [4, 14, 22, 38], among other uses.

Another relevant theme in this context is the analysis of the importance of the features that contribute to predictive modeling, through the recognition of related variables. In one article, for example, the authors developed a new algorithm, called Variance Based Feature Weighting, which not only ranks the COVID-19 symptoms but also assigns each a quantitative importance measure [2]. The results indicated fever (75%), cough (39.8%), fatigue (16.5%), sore throat (10.8%) and shortness of breath (6.6%) as quantitative measures of relevance for disease detection.

In another study, researchers proposed a classification approach to predict the clinical severity of patients with COVID-19 [12]. The authors used 37 features, including basic patient information, a physical index, initial examination findings, clinical findings, comorbid diseases, and general blood test results at an early stage. The feature importance analysis was performed with AdaBoost, Random Forest and XGBoost, which selected the 20 most important features to be processed by the Deep Learning classifier. The results showed that age, lymphocyte level, platelet count and shortness of breath or dyspnea were the most relevant factors in predicting severity.

Finally, another study investigated the importance of various features in 2,787 US counties over the COVID-19 transmission trajectory [27]. The period covered the stages of outbreak, social distancing and reopening of activities. Using data-driven machine learning models, 23 features distributed into six categories were evaluated: social demographics of counties, population activities, mobility within the counties, movement across counties, disease attributes and social network structure. The results reveal that mobility features have a greater impact in counties with high population densities. In counties with low population density, the importance of the social network structure is smaller and that of the social distancing index is greater. This allows policymakers to adjust control measures and strategies according to different levels and different points in time.

3 Methodology

The main goal of our work is to analyze which features contribute most to the diagnosis of suspected cases of COVID-19, using classification with Machine Learning. The methodology steps, shown in Fig. 1, are: 1) data selection and extraction, 2) data pre-processing and feature engineering, 3) hyperparameterization and feature selection and 4) model validation. Each step is detailed in the following subsections.

Fig. 1 Methodology Flow [19]

3.1 Data selection and extraction

The model used four datasets for the classification task, representing climate, mobility, government policy measures and, above all, patient data.

The first dataset, on climate, was obtained from the National Institute of Meteorology (INMET) [23]. It contains meteorological data from several cities in Brazil.

The second was the Google Mobility Report. It provides daily data on the movement trends of people in more than 200 countries and their respective cities since February 2020 [28]. It is organized into six categories of places: retail and recreation, groceries and pharmacies, parks, public transit stations, workplaces and residential areas. In this article only the Percent Change from Baseline variables were used.

The third database concerns non-pharmacological policy actions to control COVID-19 transmission. For this purpose, the mitigation measures and the relaxation of restrictions on activities defined by the decrees of the city of Florianópolis (SC) between March and May 2020 were analyzed [10].

The activities of society were divided into several types, such as sports, gyms, beaches, schools, mandatory internships, open fairs, commerce, shopping, public transport, food services, cultural establishments, churches, hotels, public agencies, civil construction and essential services, among others. According to the scope of the decrees, each activity was classified daily into one of three levels: open, open with restrictions and closed [17].

The last dataset comprises 1,927 reported cases of COVID-19 in the period between 02/18/2020 and 05/25/2020, chosen according to data availability. It was extracted from the Health Department of Florianópolis, capital of the State of Santa Catarina in southern Brazil, and is available for analysis [18].

According to [16], the database comes from three sources: 1) anonymized data on suspected and confirmed cases resident in Florianópolis; 2) demographic data of the 49 health regions that make up the municipality; and 3) data on mobility, represented by the traffic flow in the municipality.

The database contains individual data on the diagnosis (confirmed or discarded), sex, age (in years), age group (under 10 years, 10 to under 20 years, 20 to under 40 years, 40 to under 60 years, 60 to under 80 years, 80 years or more), skin color (white and non-white), and date of onset of symptoms, in addition to the following clinical data on symptoms of the disease: sore throat, dyspnea, fever and cough.

There is also data on health regions in the city of Florianópolis. In total, there are 49 territories and 104 sub-territories that correspond to regional divisions of the city.

Furthermore, the database contains the following demographic data for the health territories: the total number of inhabitants, overall and by sex; the number of persons aged 1 year, 2 years and so on up to 100 years or more; the number of people by skin color (white, black, yellow, Brazilian, indigenous and ignored); the number of people by years of schooling (from 1 to 17 or more completed years, in addition to literate, non-literate, literate through youth and adult literacy programs, and with uninformed schooling); the total and average income per household, the total and average income of the heads of households, and the total and average income per person; and, as possible indicators of vulnerability, the proportions of males, of persons aged 60 years or more, of people with non-white skin and of people with 10 years or less of education.

Regarding mobility features, the database provides data on the average daily traffic on four major avenues in the city. The average is computed over a window that runs from the thirteenth day before symptom detection to the day of symptom detection, that is, a window that trails back in time.
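As a minimal sketch of the trailing traffic window described above, the function below averages the daily counts from thirteen days before symptom detection through the detection day itself. The function name, the dictionary layout and the traffic values are illustrative assumptions, not taken from the original dataset.

```python
from datetime import date, timedelta

def trailing_mean(daily_counts, onset, days_back=13):
    """Mean of daily traffic from `onset - days_back` through `onset`, inclusive."""
    window = [onset - timedelta(days=d) for d in range(days_back + 1)]
    # Days without a traffic record are skipped rather than treated as zero.
    values = [daily_counts[d] for d in window if d in daily_counts]
    return sum(values) / len(values) if values else None

# Hypothetical counts for one avenue: 1000 vehicles on the first day, +10 per day.
start = date(2020, 3, 1)
traffic = {start + timedelta(days=i): 1000 + 10 * i for i in range(30)}

print(trailing_mean(traffic, date(2020, 3, 20)))  # 1125.0
```

Skipping missing days is one possible design choice; the source does not state how gaps in the traffic series were handled.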

3.2 Data pre-processing and feature engineering

Initially, the patient dataset was processed. All records with the value ’Missing’ in the symptom attributes (Sore Throat, Dyspnea, Fever and Cough) or in Diagnosis were removed. Then, the categorical attributes were converted to numerical ones, using One-hot encoding for Race / Color, Age group, Screening Method and symptoms, and Feature Hashing [45] for Territory and Sub-territory. Table 1 details the result of the conversion process.

Table 1 Conversion of categorical features
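The two encodings named above can be illustrated with a small standard-library sketch. The age-group labels, the number of hash buckets and the helper names are assumptions made for the example, and the hashing function is a simplified unsigned variant of the hashing trick, not necessarily the exact scheme of [45].

```python
import hashlib

def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

def hash_feature(value, n_buckets=8):
    """Map a high-cardinality category to one of `n_buckets` columns."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    vector = [0] * n_buckets
    vector[int(digest, 16) % n_buckets] = 1
    return vector

# Hypothetical labels for the six age groups.
AGE_GROUPS = ["<10", "10-19", "20-39", "40-59", "60-79", "80+"]

print(one_hot("20-39", AGE_GROUPS))  # [0, 0, 1, 0, 0, 0]
print(hash_feature("Territory A"))   # exactly one of the 8 columns is set
```

One-hot encoding adds one column per category, which is why it suits short lists such as age groups, whereas hashing keeps the width fixed for the 49 territories and 104 sub-territories at the cost of possible collisions.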

Another procedure performed was the creation of new attributes. As suggested by [16], the number of infected people (with a positive diagnosis and up to 14 days after the onset of symptoms) in each health territory was calculated. Moreover, according to the principle of the SIR model of epidemiology [7], it was proposed to include the number of people discarded (with a negative diagnosis) and the number of people removed (with a positive diagnosis and more than 14 days after the onset of symptoms).

Furthermore, the rate of infected people relative to the number of inhabitants of the respective health territory was included, as well as the corresponding rates of discarded and removed people.

Following the idea of grouping the number of cases in each compartment of the SIR model by health territory, the process was repeated for the sub-territories, creating three new variables: the number of people infected in the last 14 days, the number of people discarded and the number of people removed, all per sub-territory.

In addition, the Google Mobility Report was joined to the initial base, adding the variables related to population movement. The join key between the datasets was the date, considering a three-day delay between the dynamics of daily activities and the onset of the patient’s first symptoms, in the event of contamination.

The process was repeated for the climate dataset. The data on temperature (mean, minimum and maximum), hourly humidity (mean, minimum and maximum), wind speed (mean, minimum and maximum), radiation (mean, minimum and maximum) and the sum of precipitation were consolidated into daily values and added to the main base.
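The compartment counts described above could be derived roughly as follows. This is a sketch under assumptions: the record layout, the territory names, the population figures and the choice of a single reference date are hypothetical; only the 14-day rule and the three compartments come from the text.

```python
from collections import Counter
from datetime import date

def compartment(diagnosis, onset, reference, infectious_days=14):
    """Classify one case as 'infected', 'removed' or 'discarded' on `reference`."""
    if diagnosis != "confirmed":
        return "discarded"
    return "infected" if (reference - onset).days <= infectious_days else "removed"

def counts_by_territory(cases, reference):
    """Count cases per (territory, compartment) with onset on or before `reference`."""
    counts = Counter()
    for territory, diagnosis, onset in cases:
        if onset <= reference:
            counts[(territory, compartment(diagnosis, onset, reference))] += 1
    return counts

# Hypothetical records: (territory, diagnosis, symptom onset date).
cases = [
    ("T1", "confirmed", date(2020, 4, 1)),
    ("T1", "confirmed", date(2020, 4, 20)),
    ("T1", "discarded", date(2020, 4, 18)),
    ("T2", "confirmed", date(2020, 4, 22)),
]
counts = counts_by_territory(cases, reference=date(2020, 4, 25))

# Hypothetical inhabitants per territory, used to turn counts into rates.
population = {"T1": 5000, "T2": 8000}
infected_rate = {t: counts[(t, "infected")] / n for t, n in population.items()}
print(counts, infected_rate)
```

The same functions apply to sub-territories by replacing the territory key with a sub-territory key.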
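A lagged date join of this kind might look like the sketch below, which attaches to each case the mobility values recorded three days before symptom onset. The field names and values are hypothetical.

```python
from datetime import date, timedelta

def attach_mobility(cases, mobility, lag_days=3):
    """Add to each case the mobility values observed `lag_days` before onset."""
    joined = []
    for case in cases:
        key = case["onset"] - timedelta(days=lag_days)
        # Cases with no mobility record on the lagged date are returned unchanged.
        joined.append({**case, **mobility.get(key, {})})
    return joined

# Hypothetical mobility rows keyed by date (percent change from baseline).
mobility = {
    date(2020, 4, 7): {"workplaces_pct": -35, "residential_pct": 18},
    date(2020, 4, 10): {"workplaces_pct": -30, "residential_pct": 15},
}
cases = [{"id": 1, "onset": date(2020, 4, 10)}]

print(attach_mobility(cases, mobility))
```

Here the case with onset on 10 April receives the values of 7 April, reflecting the assumed three-day delay between daily activities and first symptoms.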

Data on non-pharmacological government measures, obtained from municipal decrees, were also added. Beforehand, the categorical attribute for the level of openness of each activity was transformed into a numerical value on the following scale: -1 (open), 0 (open with restrictions) and 1 (closed). An index equal to the daily arithmetic mean over all activities was also created [17].
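The scale and the daily index can be written down directly. In the sketch below the activity names and their statuses are hypothetical; the numeric scale is the one given in the text.

```python
OPENNESS_SCALE = {"open": -1, "open with restrictions": 0, "closed": 1}

def daily_restriction_index(activity_levels):
    """Arithmetic mean of the numeric openness levels of all activities on one day."""
    scores = [OPENNESS_SCALE[level] for level in activity_levels.values()]
    return sum(scores) / len(scores)

# Hypothetical status of four activities on a single day.
day = {
    "schools": "closed",
    "gyms": "closed",
    "commerce": "open with restrictions",
    "essential services": "open",
}
print(daily_restriction_index(day))  # (1 + 1 + 0 - 1) / 4 = 0.25
```

An index close to 1 therefore indicates a day on which most activities were closed, and an index close to -1 a day on which most were open.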

Finally, the data were normalized, transforming them to values within the range [0, 1] and, thus, establishing a common scale.
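A minimal sketch of this min-max normalization, applied to one column at a time, is shown below. The handling of constant columns is an assumption added to avoid division by zero.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical mean temperatures in degrees Celsius.
print(min_max_scale([18.0, 21.5, 25.0]))  # [0.0, 0.5, 1.0]
```

The source does not say whether the minimum and maximum were taken from the whole base or from the training portion only.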

3.3 Hyperparameterization and feature selection

The database was divided into training and test sets, with 70% for training and 30% for testing. As there is an imbalance between the discarded and confirmed cases, the former being more numerous, the sample was balanced using the Undersampling technique.
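The split and the balancing step can be sketched with the standard library as follows. Applying the undersampling to the training portion only, the random seed and the toy rows are assumptions of this example; the text does not specify them.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into training and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def undersample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = [row for group in by_class.values() for row in rng.sample(group, n_min)]
    rng.shuffle(balanced)
    return balanced

# Hypothetical rows (features, label): more discarded (0) than confirmed (1) cases.
rows = [([i], 0) for i in range(70)] + [([i], 1) for i in range(30)]
train, test = train_test_split(rows)
balanced_train = undersample(train, label_of=lambda row: row[1])
print(len(train), len(test), len(balanced_train))
```

Undersampling discards data from the majority class, which keeps the procedure simple but reduces the amount of training data.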

In the training stage, cross-validation was adopted as a way to assess the model’s generalization. According to [39], the technique consists of dividing the database into k folds, one of which is selected at a time to be the test set and the other k-1 are used as a training set. The test is repeated until each of the k folds is used as a test set. In the end, the accuracy is given by the mean of the accuracy obtained for each of the k folds.
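The k-fold procedure described above can be sketched as follows. The majority-label model is only a stand-in to keep the example self-contained and is not the classifier used in the paper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) so that each fold is the test set once."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_val_accuracy(fit, predict, X, y, k=5):
    """Mean accuracy over k folds; `fit` returns a model, `predict` labels one sample."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

# Stand-in model (an assumption): always predict the most frequent training label.
def fit_majority(X, y):
    return max(set(y), key=y.count)

def predict_majority(model, x):
    return model

X = [[i] for i in range(10)]
y = [0] * 10
print(cross_val_accuracy(fit_majority, predict_majority, X, y, k=5))
```

In practice the rows would be shuffled before being split into folds, so that each fold contains a mix of both classes.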

Hyperparameterization was performed by randomly sampling combinations of parameters, with 10 iterations in each tuning process. Accuracy was chosen as the score to be maximized.
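A random search with 10 iterations can be sketched as below. The parameter names, their candidate values and the scoring function are placeholders chosen for illustration; the ranges actually used are those reported in Table 2, and the score would be the cross-validated accuracy.

```python
import random

def random_search(param_space, score, n_iter=10, seed=0):
    """Sample `n_iter` random parameter combinations and return the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Hypothetical search space (not the ranges from Table 2).
space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, 16],
    "min_samples_leaf": [1, 2, 4],
}

# Placeholder score (an assumption) standing in for cross-validated accuracy.
def toy_score(params):
    return -abs(params["max_depth"] - 8) - abs(params["n_estimators"] - 100) / 100

best, value = random_search(space, toy_score)
print(best, value)
```

Compared with an exhaustive grid, random sampling evaluates only a fixed number of combinations, which bounds the tuning cost regardless of the size of the search space.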

After defining the parameters of the algorithm, the feature selection was performed considering the values of permutation importance as a criterion for assessing the degree of importance [1]. The criterion used was to select only those features with a value greater than zero. In this way, the features with values above this threshold remained in the model and the rest were removed from the database.
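Permutation importance measures how much the accuracy drops when the values of one feature are shuffled. The sketch below implements that idea and applies the "greater than zero" selection rule; the toy classifier, the data and the number of repeats are assumptions made for the example.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled, one at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy classifier (an assumption): it uses feature 0 only, so feature 1 is irrelevant.
def predict(row):
    return int(row[0] > 0.5)

X = [[i / 19, (i * 7) % 5] for i in range(20)]
y = [predict(row) for row in X]

importances = permutation_importance(predict, X, y)
selected = [j for j, value in enumerate(importances) if value > 0]
print(importances, selected)
```

Shuffling the irrelevant column leaves the accuracy unchanged, so its importance is zero and it is dropped by the selection rule.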

3.4 Model validation

In the last step of the process, with the algorithm trained and configured with the best parameters that fit the model, the algorithm was validated with the test base to assess its prediction capacity.

Steps 3 and 4 were repeated 100 times and the results of each repetition were stored. The data were then used to calculate the mean and standard deviation of the evaluation metrics and of the permutation feature importances.

The equipment used to carry out the experiments had: i) an Intel (R) Xeon (R) Gold 6126 CPU @ 2.60GHz with 12 CPUs, ii) 32.0 GB of RAM, iii) 250 GB of hard disk and iv) Linux Ubuntu 16.04. The entire implementation was developed in the Python programming language, version 3.8.
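Aggregating the stored results amounts to computing a mean and a standard deviation per metric. The sketch below uses the sample standard deviation, which is an assumption, and three hypothetical runs in place of the 100 repetitions.

```python
import statistics

def summarize_runs(runs):
    """Mean and sample standard deviation of every metric over per-run result dicts."""
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        summary[name] = (statistics.mean(values), statistics.stdev(values))
    return summary

# Hypothetical results of three repetitions.
runs = [
    {"accuracy": 0.80, "sensitivity": 0.86},
    {"accuracy": 0.82, "sensitivity": 0.88},
    {"accuracy": 0.84, "sensitivity": 0.90},
]
for name, (mean, sd) in summarize_runs(runs).items():
    print(f"{name}: {mean:.4f} +/- {sd:.4f}")
```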

4 Experiment assessments

We carried out experiments to analyze the evaluation metrics that measure the model's performance and to ascertain which features contributed most to it.

The specific parameters of the Random Forest are presented in Table 2 as well as the possible value ranges. Through them, the best configuration is adjusted by means of a random search of hyperparameters.

Table 2 Random Forest Hyperparameters

The parameters described in Table 3 relate to the general settings of the environment.

Table 3 General Settings

In the experiments, the metrics used in the analysis of the proposed model to assess performance were accuracy, sensitivity and specificity. The data samples were obtained by running the algorithm repeatedly and they are presented in Table 4 as mean and standard deviation.

Table 4 Prediction Metrics
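For reference, the three metrics can be computed from the confusion-matrix counts as in the sketch below, where 1 denotes a confirmed case and 0 a discarded one. The label vectors are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical labels: 4 confirmed and 6 discarded cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(classification_metrics(y_true, y_pred))
```

In this example three of the four confirmed cases are detected (sensitivity 0.75) and four of the six discarded cases are correctly rejected (specificity about 0.67).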

The Random Forest algorithm had an accuracy of 0.82695 ± 0.02344 (mean ± standard deviation) on the training set and 0.81819 ± 0.02331 on the test set. These results corresponded to a mean AUC ROC of 0.90890 ± 0.02515 on the test data. Similarly, the precision-recall curve obtained an average result of 0.82344 ± 0.04823.

To compare with the results obtained by [19] and analyze whether the changes improved the metrics, the Mann-Whitney test was used. It is a non-parametric statistical test that compares two independent samples [20].

In the accuracy analysis, based on samples of 100 executions from each experiment, the initial article obtained a median of 0.79275 and an interquartile range of 0.03843, against 0.82383 and 0.03152 in this paper. Applying the aforementioned test at the 5% significance level indicates a difference between the two groups and, therefore, the alternative hypothesis was accepted (U = 2,053.5; p < 0.001).

The same was observed for sensitivity (U = 3,546.0; p < 0.001), with a median and interquartile range of 0.85194 and 0.06189 against 0.87452 and 0.06189, respectively, and for specificity (U = 3,170.0; p < 0.001), with 0.75603 and 0.5565 against 0.79089 and 0.05094 for the same measures. Therefore, the proposals presented here improved prediction in all three metrics.
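The Mann-Whitney comparison used above can be illustrated with a small standard-library sketch that computes the U statistic from midranks and a two-sided p-value from the normal approximation. It omits tie and continuity corrections, and the two samples are hypothetical; library routines such as SciPy's `scipy.stats.mannwhitneyu` would normally be preferred.

```python
import math

def mann_whitney_u(a, b):
    """U statistic of sample `a` and a two-sided p-value (normal approximation)."""
    pooled = sorted(a + b)
    # Assign midranks so that tied values share the average of their positions.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    n1, n2 = len(a), len(b)
    u1 = sum(rank_of[x] for x in a) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction in this sketch
    z = (u1 - mean_u) / sd_u
    return u1, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical accuracy samples from two experiments.
previous = [0.78, 0.79, 0.80, 0.77]
current = [0.82, 0.83, 0.81, 0.84]
u, p = mann_whitney_u(previous, current)
print(u, p)
```

A U value of zero means that every value in the first sample is smaller than every value in the second, the most extreme separation possible.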

The main features selected with their respective Permutation Importance percentages are shown in Table 5. The results presented below are in the form of mean and standard deviation for the set of executions.

Table 5 Features Permutation Importance of Accuracy Score

For a better visual understanding of the importance of the features, the 15 most important variables obtained by the model are shown in Fig. 2.

Fig. 2 Features Permutation Importance of Accuracy Score

Lastly, the run time of the algorithm averaged 11.74 ± 1.94 seconds, considering the training step, which involved hyperparameter tuning and feature selection, and the test step, which consisted of the model validation.

5 Discussion

Nearly two years after its discovery, COVID-19 is a disease that arouses much interest because of its great impact on humankind. Therefore, this work set out to investigate the relevance of a set of variables in the diagnostic prediction of the disease.

The investigation started with the acquisition of a preliminary database with 221 features, which after going through pre-processing, increased to 277 due to the techniques of coding categorical variables. Then, the model was processed and analyzed by the Permutation Feature Importance method to assess the impact of each feature on the accuracy metric.

The most important feature was the number of Confirmed per Sub-territory in the last 14 days. All symptom features appeared among the thirty most significant, with the exception of dyspnea. This finding corroborates studies of symptoms that indicate fever, cough and sore throat among the most common ones [8, 9, 36].

The weather features were well represented on the prediction importance scale. Among the top thirty are Mean Temperature, Maximum Temperature and Minimum Temperature. These results reinforce the strong association between climatic variables and the prediction of COVID-19, as other studies have pointed out [42, 44].

Mobility features were also meaningful. There were fourteen features representing traffic on the city’s four major avenues, and three of them were among the most important features. Similarly, of the six variables extracted from the Google Mobility Report, three appear in Table 5: Workplaces Percent, Grocery and Pharmacy Percent and Residential Percent.

The features related to non-pharmacological government measures had a more modest presence. Only Essential Services, Common Areas of Condominium and Non-Essential Public Agencies appeared, out of a total of twenty-seven.

In this paper, the feature engineering process carried out in the pre-processing step created new health sub-territory variables that contributed significantly to the model performance, in particular “Confirmed_subterritory_14days”, the most significant attribute, with a Permutation Importance value more than double that of the second.

Thus, it is clear that the “health region” factor is very relevant in the correct classification of the disease diagnosis, as highlighted in [19].

Finally, the results of the model were somewhat satisfactory. This is associated with the addition of new variables, such as climate, mobility and government action data and, above all, the confirmed and discarded cases per health sub-territory, which were not present in the previous article.

6 Final remarks and future work

The present work presents an investigation of feature importance in a diagnostic prediction model for cases of COVID-19, using classification with Machine Learning.

These classification approaches are fundamental for monitoring the reproduction of the virus and for making decisions in the face of the pandemic. Their advantage is that they produce quick responses at relatively low cost compared to laboratory diagnosis.

The methodology section emphasized the hyperparameterization and feature selection techniques, as the research aimed to investigate two aspects: the features that contributed most to the performance of the model and the hit rates obtained in the validation on the test set.

In the first investigation step, Random Forest with the Permutation Importance method was used to assess the impact of the features on the results. Among the 277 variables that make up the database, the most relevant are: the number of confirmed cases per sub-territory in the last 14 days, the dates of notification and of symptom onset, the fever and cough attributes, the rate and number of confirmed cases per territory in the last 14 days, the number of discarded cases per territory and the sore throat symptom. Others were also important, such as temperature, age, traffic flow and mobility.

Regarding the second stage of the investigation, the metrics showed consistent results. Accuracy had a mean of 81.82%, whereas sensitivity reached 87.52% and specificity 78.67%. All of them showed significant improvements compared to [19].

Therefore, the research has shown that there is a feasible alternative for addressing the underdiagnosis of COVID-19, considering the most relevant characteristics for determining infection. The limitation of the model is directly related to the dataset used, as the importance of the features can change according to the environment in which it is applied (algorithms, hyperparameters, database, etc.).

In the near future, we intend to evaluate the model in the context of other cities. Another possibility is to add vaccination data and, later, to analyze the behavior of the classifier model.