1 Introduction

The analysis of segregated communities, due to its important implications for the lives of citizens [1, 2] and for social cohesion [3], has held the attention of policy-makers and academics in the field of social and urban sciences for some time. Segregation can have many dimensions [4], and may take many different forms: spatial [5], economic [6], occupational [7], gender-based [8–10], religious [11, 12], ethnic [13], etc. Along with the quantification of segregation there is a long-standing debate on how to promote social cohesion in ethnically diverse environments, which often focuses solely on the effects of spatial segregation [14]. However, a broad perspective on the effects of housing and urban planning suggests that top-down public policies centred exclusively on spatial mixing are not fully effective at promoting social integration [15]. After the social unrest experienced in the US and Europe in the early 2000s, policies turned to the need to foster social and community cohesion [16] as well. Toward this objective, traditional census data face three main limitations: difficulties in reaching certain segregated groups that are reluctant to participate, the near absence of social data (necessary to assess non-spatial aspects of segregation), and a high economic cost for the administration. To overcome this situation, many countries have increasingly incorporated alternative sources of information from administrative registers or sample surveys [17]. Along these lines, the consolidation of ICT, mobile phones and social networks offers an outstanding opportunity to complement census data, or even to provide higher-quality data, in a variety of public policy areas, even though the use of Big Data also raises a number of methodological and moral concerns [18] related to privacy or to biases embedded [19, 20] in the information collected by social networks and machine learning algorithms, e.g. the use of satellite images to quantify poverty [21, 22].

Here, we contribute to the segregation analysis debate by defining, analysing and optimising integration as it relates to the communication patterns across groups. Specifically, we are interested in studying the variation in communication patterns between different segregated communities. It is our claim that, in order to integrate separate communities, it is not enough to simply bridge the gap between individual characteristics (e.g. social, economic and occupational) and spatial distributions; communities should also be similar in the way they interact with each other. In this regard, the main assumption we rely on is that, if two communities are equally distributed across a territory (that is, if there is no spatial segregation) and their behavioural patterns are similar, their calling patterns should also be similar, based on a sort of natural equilibrium derived from the geo- and socio-economic situation of the territory. We further assume that an increase in the similarity of the two groups’ average behaviour could have a positive impact on their level of interaction, following the principle of homophily in social interactions [23]. This approach to behavioural segregation (as opposed to purely static aspects such as residential, spatial segregation) is paralleled in studies of the mobility patterns of segregated groups (activity-space segregation) [24], and in other works focusing on levels of contact between groups [25].

For the experimental part, we work with Call Detail Records (CDRs) of Syrian refugees in Turkey, currently the largest refugee population of any country in the world [26]. The situation in Syria remains unstable, and with policies in force to prevent refugee out-migration to the European Union, most analysts agree that actions must rely on the assumption that refugees will remain in the country for the long term. Against this background, and within the framework of the Data 4 Refugees project, Turkey’s largest mobile phone service operator, Türk Telekom, has released a large collection of CDRs with information about the nationality of the caller (Turkish or Syrian refugee), as well as the origin and destination of the call. After a substantial data pre-processing effort (Sect. 3.1), we first quantify the current behavioural segregation in Istanbul on the basis of communication patterns. This analysis provides a picture of the current situation and the base scenario on which to improve. Subsequently, we develop a methodology to mitigate segregation based on house mixing strategies. Results show that behavioural segregation can be largely reduced. However, such a change could have side-effects on other social aspects, such as rent prices; this is analysed in the final sections of the paper. We conclude the paper with a summary and a discussion of the obtained results.

2 Quantifying existing levels of spatial segregation

The development and presence of enclaves is a common phenomenon within immigrant communities. An immigrant enclave is an expression of spatial segregation as defined by the Dissimilarity Index [28, 29] or by Louf et al. [30], who quantify segregation in terms of the deviation from a random distribution of populations in an area. As expected, refugees are not distributed equally across space in Turkey. Figure 1A plots the ratio of refugee to local population in Istanbul’s 39 districts, from census data. As in other plots in the paper, the x-axis indices correspond to Istanbul’s districts according to Table 1. Refugees are over-represented in districts with ratios above the horizontal red line (representing the average refugee-to-local ratio of the Istanbul province), and under-represented in districts with ratios below it. Additionally, we calculate the Dissimilarity Index from the same data. The value obtained (around 30%) is not small, but nor is it as large as one might expect in a spatially segregated community [31] (e.g. values between 0.5 and 0.6 were found for the geographic segregation of the black and white populations in U.S. cities in 2000). Panels B and D of the same figure map the distribution of the Turkish and refugee populations, respectively. The maps confirm that most refugee enclaves are concentrated in the west-central part of the province, while the eastern part hosts a comparatively smaller proportion of the refugee population. Panel C of Fig. 1 provides complementary information to the spatial segregation analysis and shows that Syrian refugees indeed tend to live in cheaper, and thus less favourable, neighbourhoods. We see that while there is practically no relation between district rent prices and Turkish population, there is a negative relation (slope \(\mbox{$p$-value} < 0.01\)) between rent prices and the refugee population.

Figure 1

Segregation analysis of the Istanbul province. Panel A illustrates the ratio of refugees to the local Turkish population across the 39 districts of the Istanbul province. The horizontal red line indicates the average ratio for the entire city. Deviations from this line indicate district-level segregation. Panel C shows the relationship between district rent prices and the Turkish and Syrian refugee populations. For the Turkish population, we obtained a slope of \(-7.059\mbox{e}{-}05\) with 95% confidence bounds of −0.0004516 and 0.0003104. For refugees, we obtained a negative slope of −0.0009222 with 95% confidence bounds of −0.001607 and −0.0002374. Data were obtained from www.endeksa.com [27]. Panels B and D map the distribution of the local Turkish and refugee populations, respectively, across the Istanbul province

Table 1 Correspondence between district names and id’s

The analysis shown in this section presents an initial picture of the extent to which Turkish and refugee citizens are segregated spatially within Istanbul province. The unexpectedly moderate results of the Dissimilarity Index open up the possibility of exploring other possible, non-spatial measures of segregation, such as one sensitive to behavioural differences. The development and implementation of such a measure of behavioural segregation, through the analysis and comparison of group communication patterns, will be the subject of the rest of the paper.

3 Measuring behavioural segregation through communication pattern analysis

Segregation is usually, with a few exceptions [32, 33], assessed in terms of the local demographic or socio-economic characteristics of each geographic area of interest. However, segregation does not only regard the physical or spatial distribution of communities around an area, but also the relative level of harmonisation between groups [34, 35]. Keeping in mind that behavioural and cultural adoption is not easily quantifiable, here we develop a framework based on mobile phone data records to assess the extent to which communities differ in their behaviour and cultural habits [36].

3.1 Communication network generation

The framework we propose is based on the analysis of communication patterns between various collectives of people in terms of their communication networks (CN). The CN will be represented, as is usual, as an adjacency matrix [37], O, where its entries \(o_{ij}\) correspond to the number of communication events originated at location i and with destination j. In particular, we analyse three different CNs, each one representing the communication patterns between pairs of our two study groups: \(O^{\mathrm{TT}}\), \(O^{\mathrm{RR}}\) and \(O^{\mathrm{RT}}\). For each dataset, the first letter of the superscript (T for Turkish and R for refugees) is the originating group and the second is the receiving group. We omit the \(O^{\mathrm{TR}}\) network due to scarcity of the data. To build each CN, we used real communication data provided for the Data 4 Refugees project [38, 39]. The raw data is made up of cellphone calls and SMS (which we will call communications for convenience), and is structured into 3 sub-datasets, DS1, DS2 and DS3, for anonymity purposes. As noted before, the key feature differentiating this dataset from other comparable Call Detail Record datasets [40] is that the users are each assigned a binary tag indicating their status as either refugee or non-refugee (we use the term Turkish to refer to non-refugees). Only DS1 and DS2 were used in this work. The process of constructing the CN is described below, and is illustrated in Fig. 2.

Figure 2

Illustration of the process to merge the information of DS1 and DS2. Top tables indicate the information contained in each individual dataset and bottom table the combined dataset. Colours of entries indicate the dataset from which the information is extracted

DS1 consists of aggregate communication counts between cell phone antennas on an hourly basis, indicating the total number of calls made by each group (Turkish or refugee) from each antenna i and directed to each destination antenna j. However, information about the group receiving the calls is absent. To estimate this, we made use of DS2, which contains information, for each origin antenna, about the destination group but not about the destination antenna. Combining DS1 and DS2, we estimated, for each origin and destination antenna, the total communications made by each group that were directed at each other group over the entire period of study. For example, the number of refugee-to-refugee calls from antenna i to antenna j is the total number of refugee-originated calls from i to j (DS1) multiplied by the proportion of refugee-originated calls from i directed at other refugees (DS2). Communication events originating from and received by the same district are also represented in our Communication Networks, as self-loops.
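The combination rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the processing code used for the paper: the dictionary layouts, the function name `estimate_group_to_group` and the toy numbers are hypothetical, and the sketch assumes that the DS2 share of an origin antenna applies equally to every destination antenna (DS2 carries no destination antenna).

```python
def estimate_group_to_group(ds1_calls, ds2_share):
    """Estimate calls from one group to a target group for each antenna pair.

    ds1_calls: {(origin, destination): calls made by the originating group}
               (DS1-like, hypothetical layout)
    ds2_share: {origin: fraction of that group's calls from `origin` that
               reach the target group} (DS2-like, hypothetical layout)
    """
    return {
        (i, j): count * ds2_share[i]
        for (i, j), count in ds1_calls.items()
    }


# Toy numbers, not real data.
ds1_refugee_calls = {("a1", "a1"): 40, ("a1", "a2"): 60, ("a2", "a1"): 10}
ds2_share_to_refugees = {"a1": 0.25, "a2": 0.5}

# Refugee-to-refugee estimate, and its complement (refugee-to-Turkish).
rr = estimate_group_to_group(ds1_refugee_calls, ds2_share_to_refugees)
rt = estimate_group_to_group(
    ds1_refugee_calls, {i: 1 - s for i, s in ds2_share_to_refugees.items()}
)
```

By construction, the two estimates add up to the DS1 total for every antenna pair.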

Finally, for convenience, and to reduce data noise, antenna-to-antenna data were aggregated into district-to-district data. We considered districts a better unit of measurement, as they have explicit administrative meaning (as opposed to a Voronoi tessellation of antenna locations, for example). The aggregation of large amounts of antenna data also lessens the risk that an uneven geographical distribution of antennas skews the interpretation of the data. Turkey is divided into 81 large administrative provinces, which are further subdivided into smaller districts, 923 in total. Our analysis focused on Istanbul, a province of Turkey containing 39 districts.
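As an illustration of this aggregation step, the following sketch sums antenna-level counts into district-level counts. The antenna identifiers, the `antenna_district` lookup and the function name are hypothetical; the actual assignment of antennas to districts is not reproduced here.

```python
from collections import defaultdict


def aggregate_to_districts(antenna_calls, antenna_district):
    """Sum antenna-to-antenna counts into district-to-district counts."""
    district_calls = defaultdict(float)
    for (a_from, a_to), count in antenna_calls.items():
        key = (antenna_district[a_from], antenna_district[a_to])
        district_calls[key] += count
    return dict(district_calls)


# Toy numbers: antennas a1 and a2 lie in district D1, a3 in district D2.
antenna_calls = {("a1", "a2"): 15, ("a1", "a3"): 5, ("a3", "a1"): 7, ("a2", "a1"): 3}
antenna_district = {"a1": "D1", "a2": "D1", "a3": "D2"}

district_calls = aggregate_to_districts(antenna_calls, antenna_district)
```

Calls between two antennas of the same district end up on the diagonal entry, here `("D1", "D1")`, which corresponds to a self-loop in the CN.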

3.2 Aggregate communication pattern analysis: province scale

Once the CNs have been assembled, we are in a position to analyse the communication patterns of both collectives. We start with a macro-analysis of call destination probability in each CN, independent of the individual district. This provides an initial overview of how different the communication patterns are between the two originating groups. A visual analysis of the results (see Fig. 3) suffices to show that both distributions have a similar shape, and it may seem that there is not much difference between the communication habits of the two collectives on average. However, a detailed comparison at the district level reveals a different situation (see Fig. 4). Panel A shows that, while there are districts where differences are small, in many others they are much larger than in the aggregated analysis. Panel B shows the differences in the distributions for three hand-picked districts. The differences are visually evident.

Figure 3

Aggregated analysis of communication patterns of local Turks and refugees. Panel A shows the aggregated distribution of call destination (i.e. independent of district). Panel B presents the same data on an Istanbul district map

Figure 4

District-level analysis of the difference between communication patterns of the Turkish and refugee populations. Panel A illustrates the mean squared deviation of the distribution of probabilities of communication (call and SMS) destination originating from each district (e.g., the chances that a call from district i will be directed at district j and not a third district k). That is, for district i this is obtained as \(\mbox{MSD}_{i} = \frac{1}{39} \sum_{j} (o_{ij}^{\text{TT}} - o_{ij}^{\text{R}\circ })^{2}\), where \(o_{ij}^{\text{R}\circ } = o_{ij}^{\text{RR}} + o_{ij}^{\text{RT}}\), \(\forall ij\). Panel B plots the distributions for the districts where the difference between the communication patterns of the two communities is largest
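The per-district deviation defined in the caption can be sketched as follows. The sketch assumes that raw outgoing counts are first converted into destination probabilities, as the caption's wording suggests, and that the refugee-originated counts are the sum of the refugee-to-refugee and refugee-to-Turkish counts; the function names and toy numbers are illustrative only, and the divisor is the number of destination districts (39 for Istanbul).

```python
def destination_shares(row):
    """Turn raw outgoing counts into destination probabilities."""
    total = sum(row)
    return [x / total for x in row]


def msd(row_tt, row_rr, row_rt):
    """Mean squared deviation between Turkish and refugee destination shares."""
    p_t = destination_shares(row_tt)
    # All refugee-originated communications, whatever the receiving group.
    p_r = destination_shares([a + b for a, b in zip(row_rr, row_rt)])
    return sum((t - r) ** 2 for t, r in zip(p_t, p_r)) / len(p_t)


# Toy example with 4 destination districts (not real data).
value = msd([50, 30, 10, 10], [5, 1, 2, 2], [5, 1, 2, 2])
```

In the toy example the shares are (0.5, 0.3, 0.1, 0.1) and (0.5, 0.1, 0.2, 0.2), which gives a deviation of 0.06 / 4 = 0.015.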

The difference between the results obtained from the aggregated analysis (Fig. 3) and the local analysis (Fig. 4) might be indicative of Simpson’s Paradox [41] in the different CNs. Within the aggregated whole of the province, each district has different proportions of refugee and local populations; additionally, social and economic factors vary by district and population. These considerations indicate the advantage of a local-scale analysis of the CN in the characterisation of the differences between local and refugee communication patterns. In the next section, we address the formal structure of this local-scale Communication Network analysis, which forms the basis of the rest of this work.

3.3 Fine-grained communication pattern analysis: district scale

Given the CNs of the different community pairs, Turkish–Turkish, Refugee–Refugee and Refugee–Turkish, respectively \(O^{\mathrm{TT}}\), \(O^{\mathrm{RR}}\) and \(O^{\mathrm{RT}}\), we define our behavioural segregation measure in terms of the \(\chi ^{2}\) test for homogeneity between the various outgoing communication distributions. Among the many other options we could use to perform this comparison (e.g. cosine similarity, mean square displacement, Pearson correlation, etc.), we have chosen the \(\chi ^{2}\) test since it works directly on the raw data we have and requires no further assumptions and no pre-processing of the data. Formally, the extent to which the two frequency counts are drawn from the same random variable is measured statistically by the p-value. In our case, the frequency counts correspond to the calls originating from district i and directed to each of the other districts j (represented as vector \(\mathbf{o}_{{i}}\)) for both the Turkish population, \(\mathbf{{o}}_{i}^{\mathrm{TT}}\), and the refugee population, \(\mathbf{o}_{i}^{\mathrm{RR}}\) (or \(\mathbf{o}_{i}^{\mathrm{RT}}\)). Thus,

$$ p\mbox{-value} \bigl(\mathbf{{o}}_{i}^{\mathrm{TT}}, \mathbf{o}_{i}^{ \mathrm{RT}} \bigr), $$
(1)

allows us to assess statistically whether the two communication patterns are distinguishable. If the test indicates that the two samples come from different distributions (the two call register samples differ significantly), we conclude that there is segregation in the area in terms of communication. If the test does not allow us to reject the null hypothesis (\(H_{0}\): both samples come from the same distribution), we cannot conclude that segregation exists in that area. Note that, while we are measuring behavioural segregation, we are not trying to measure the level of interaction between the groups. Rather, we solely want to assess to what degree the two groups behave similarly.
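To make the test in Eq. (1) concrete, the sketch below computes the \(\chi ^{2}\) homogeneity statistic and its p-value for two vectors of destination counts using only the Python standard library. It is a minimal illustration rather than the implementation used in this work: the function names are hypothetical, destinations that neither group calls are dropped (an assumption made here to avoid zero expected counts), and both groups are assumed to have at least one call.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable with `df` degrees of freedom.

    Uses Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1), starting from
    Q(1/2, x) = erfc(sqrt(x)) for odd df or Q(1, x) = exp(-x) for even df.
    """
    if stat <= 0:
        return 1.0
    x = stat / 2.0
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(x)) if df % 2 else math.exp(-x)
    while a < df / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def chi2_homogeneity(counts_a, counts_b):
    """Chi-square test of homogeneity for two vectors of destination counts."""
    # Drop destinations that neither group ever calls (expected count would be 0).
    pairs = [(a, b) for a, b in zip(counts_a, counts_b) if a + b > 0]
    n_a = sum(a for a, _ in pairs)
    n_b = sum(b for _, b in pairs)
    n = n_a + n_b
    stat = 0.0
    for a, b in pairs:
        col = a + b
        exp_a, exp_b = n_a * col / n, n_b * col / n
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    df = len(pairs) - 1
    return stat, df, chi2_sf(stat, df)


# Toy counts for two destination districts: stat = 8.0, df = 1, p is about 0.0047.
stat, df, p = chi2_homogeneity([60, 40], [40, 60])
```

With 39 destination districts and two groups, the test has 38 degrees of freedom. Where SciPy is available, `scipy.stats.chi2_contingency` applied to the 2 × n table of counts performs the same test (its continuity correction only applies when there is a single degree of freedom).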

In Eq. (1), we are measuring the patterns of outgoing calls from a particular district to all other districts. We have shown that the different districts have different refugee and local populations (see Sect. 2). Thus, with high probability, we are comparing two samples of different sizes. Considering there are more locals than refugees, we expect more calls originating from locals than from refugees. This does not affect our analysis, since the \(\chi ^{2}\) test already accounts for these differences in absolute counts. However, when comparing between different destination populations, a difference in spatial distribution between populations can have a significant effect on the shape of outgoing call patterns, since a larger population of one community may mean that its members tend to receive more calls than their counterparts, solely because they are more abundant. This problem is magnified if the ratios of local to refugee population differ from district to district. The issue presents itself only in the comparison of TT to RR since, when comparing TT to RT, the destination populations are identical. Thus, before conducting the \(\chi ^{2}\) test, we need to normalise the RR call patterns by the refugee and local populations of the destination districts. In particular, we adjust the outgoing call counts of \(O^{\mathrm{RR}}\) as \(o_{ij} \frac{|T_{j}|}{|R_{j}|}\), where \(|T_{j}|\) and \(|R_{j}|\) are the sizes of the local and refugee communities in district j.
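The adjustment of the RR counts amounts to a column-wise rescaling, as the following sketch shows. The population figures and the function name are made up for illustration.

```python
def normalise_rr(rr_counts, turkish_pop, refugee_pop):
    """Rescale refugee-to-refugee counts by |T_j| / |R_j| of the destination j."""
    return [
        [o_ij * turkish_pop[j] / refugee_pop[j] for j, o_ij in enumerate(row)]
        for row in rr_counts
    ]


# Toy 2-district example: rows are origin districts, columns are destinations.
rr = [[10, 4], [2, 8]]
adjusted = normalise_rr(rr, turkish_pop=[1000, 500], refugee_pop=[100, 25])
```

Destinations where refugees are scarce relative to locals (the second column in the toy example, with a ratio of 20 against 10) are weighted up accordingly.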

The results of conducting the \(\chi ^{2}\) test for each district of the Istanbul province show that refugee and Turkish calling patterns are significantly different in every district, for both comparison datasets (Refugee–Refugee and Refugee–Turkish).

4 Mitigating behavioural segregation through residential mixing policies

Politicians [11, 42], urban planners [13] and scholars [7, 43] have been debating solutions to segregation and the concentration of poverty in Europe and North America since the 1970s. One of the primary mechanisms developed, not without criticism [43], is residential and social mixing [5, 44]. Policies developed under this approach aim at incentivising the mobility of segregated communities to other neighbourhoods in order to increase spatial diversity. Rearranging the spatial distribution of each community would be in line with recent research suggesting that diversity within neighbourhoods can actually increase positive contacts among citizens belonging to different groups [45]. Other than maximising geographic proximity, a parallel approach for increasing the mutual exposure of communities is to make individuals from different groups more similar, relying on the effects of homophily. Homophily is the well-known sociological principle which states that the more similar individuals are, the more frequent their interactions are expected to be [23, 46].

Our work builds on these fundamental debates and hypotheses, and particularly relies on the principle of homophilic interactions. We assume that calling behaviour can be understood as one behavioural feature [46] of individuals. Thus, reducing differences between communities in this regard (i.e. reducing behavioural segregation) may increase exposure and, subsequently, interaction between communities. With this aim in mind, in the following section we estimate the specific volumes of residents that would need to move from their current district, as well as the districts they would need to move to, in order to reduce behavioural segregation as measured by variations in CNs.

4.1 Minimising segregation: residential mixing as an optimisation problem

As discussed, house or residential mixing aims at promoting the mobility of segregated communities into other, less segregated neighbourhoods. Framing this idea within our definition of behavioural segregation, the problem can be rephrased as obtaining a mobility matrix M, where each entry \(m_{ji}\) stands for the fraction of refugees living in district i that need to be relocated to district j, in order to maximise the p-value of the \(\chi ^{2}\) homogeneity test (see Footnote 1). Our interpretation of the problem, although applied to call patterns and not spatial distributions, is very similar to the definition of the Dissimilarity Index [28, 29], which is usually interpreted as the percentage of the minority population that would need to relocate in order to perfectly spatially integrate the residential distributions in a region.

The estimation of the best M can be formally defined as an optimisation problem. We begin with the case of “eliminating” differences between the RT and TT networks. The non-linear optimisation problem corresponds to

$$\begin{aligned} \mbox{maximize} &\quad \sum_{i} \mbox{ $p$-value} \bigl(\mathbf{{o}}_{i}^{ \mathrm{TT}},\hat{\mathbf{o}}_{i}^{\mathrm{RT}} \bigr) \end{aligned}$$
(2)
$$\begin{aligned} \mbox{s.t.} &\quad\sum_{j} m_{ji} = 1\quad \forall i \end{aligned}$$
(3)
$$\begin{aligned} & \quad \sum_{j} \hat{o}_{ij}^{\mathrm{RT}} \leq f_{i}\quad \forall i \end{aligned}$$
(4)
$$\begin{aligned} &\quad 0 \leq m_{ji} \leq 1 \end{aligned}$$
(5)
$$\begin{aligned} \quad \mbox{where } &\quad \hat{\mathbf{O}}^{\mathrm{RT}} = \mathbf{M} { \mathbf{{O}}}^{ \mathrm{RT}}, \end{aligned}$$
(6)

where O is the matrix of original communication records, \(\hat{\mathbf{{O}}}\) is the resulting matrix of communication records after the mobility matrix has been applied, and each \(m_{ji}\) is an unknown to be obtained. The restriction in Eq. (3) guarantees that the total number of communications is maintained: for each origin district i, the fractions \(m_{ji}\) assigned to all destination districts j sum to one, so every communication recorded in O is reassigned exactly once. The restriction in Eq. (4) requires that no district has more than \(f_{i}\) refugees. This restriction is important, as the definition of enclaves has to do with a high fraction of immigrants living in an area with respect to the total immigrant population in the region. In our case \(f_{i}\) is obtained such that the fraction of refugees living in a district never exceeds 10% of the total population. This percentage was chosen as a rounded upper bound based on the empirical observation that, under current conditions, the highest percentage of refugee population in a single district is 8%. The restriction in Eq. (5) simply ensures that the different \(m_{ji}\) are bounded in the range \([0,1]\).
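The effect of a mobility matrix and the restrictions in Eqs. (3)–(5) can be illustrated on a toy two-district example. The sketch reads Eq. (3) as requiring each column of M to sum to one (all refugees of origin district i are assigned to some district j), and expresses the caps \(f_{i}\) in units of communications; both are assumptions of this illustration, as are the helper names and the numbers.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
            for r in range(len(a))]


def is_feasible(m, o_hat, caps, tol=1e-9):
    """Check column sums of M, bounds on its entries and the per-district cap."""
    n = len(m)
    cols_ok = all(abs(sum(m[j][i] for j in range(n)) - 1.0) <= tol for i in range(n))
    bounds_ok = all(0.0 <= x <= 1.0 for row in m for x in row)
    caps_ok = all(sum(row) <= cap + tol for row, cap in zip(o_hat, caps))
    return cols_ok and bounds_ok and caps_ok


# Toy RT network: rows are origin districts, columns are destination districts.
o_rt = [[80.0, 20.0], [10.0, 30.0]]
# m[j][i]: fraction of refugees of district i relocated to district j.
# Here a quarter of district 0 moves to district 1; district 1 stays put.
m = [[0.75, 0.0], [0.25, 1.0]]

o_hat = matmul(m, o_rt)  # [[60.0, 15.0], [30.0, 35.0]]
```

The total number of communications (140 in the toy example) is the same before and after the relocation, which is what the column-sum restriction guarantees.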

Unlike in the comparison between the TT and RT networks, when comparing the TT and RR networks, the destination groups of the calls are different. This requires some modifications to the optimisation problem in Eq. (2) when applied to the RR case. First, as we explained in Sect. 3.3, it is necessary to normalise the call destination counts by the different volumes of the target populations of the two datasets when computing the p-value. Second, in order to account for the fact that the refugees being moved are the same ones receiving calls (refugees call refugees in the RR network), we need to apply an additional transformation to change the destination districts of the calls directed at the relocated refugees. This can be done by multiplying the result of \(\mathbf{MO}^{\mathrm{RR}}\) by \(\mathbf{M}'\) (the transpose of M). For the definition of the optimisation problem, this means replacing Eq. (6) with \(\hat{\mathbf{O}}^{\mathrm{RR}} = \mathbf{M}{\mathbf{{O}}}^{ \mathrm{RR}}{\mathbf{M}}'\). Figure 5 provides a simplified example of the optimisation problem we propose.
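Written element-wise, the two transformations expand as follows. This is a direct expansion of the matrix products, with j and l denoting the new origin and destination districts and i and k the original ones:

$$ \hat{o}^{\mathrm{RT}}_{jl} = \sum_{i} m_{ji}\, o^{\mathrm{RT}}_{il}, \qquad \hat{o}^{\mathrm{RR}}_{jl} = \sum_{i} \sum_{k} m_{ji}\, o^{\mathrm{RR}}_{ik}\, m_{lk}. $$

In the RT case only the origin of each call is redistributed, whereas in the RR case a call from i to k is spread over every pair (j, l) of districts to which the callers of i and the receivers of k are relocated.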

Figure 5

Description of the variables, structures and the process of the non-linear optimisation problem in Eq. (2). The process of minimising segregation considering the datasets TT and RT is exemplified within the orange square, where the effect of the application of the mobility matrix M is shown. The additional step of multiplying O by \(\mathbf{M}'\), carried out when reducing segregation considering datasets TT and RR, is shown within the gray square. For both processes, we show an example together with the resulting CN and mobility matrices

The high non-linearity of the problem, in both the RT and RR cases, does not allow us to obtain satisfactory results by directly optimising the problem in Eq. (2). The fundamental complication is the very low p-values obtained with the initial call densities, \(\mathbf{{o}}_{i}^{\mathrm{TT}}\) and \(\mathbf{{o}}_{i}^{\mathrm{RR}}\). From those values, we were unable to find initialisations for the unknowns \(m_{ji}\) that were close enough to a satisfactory mobility matrix solution. Instead, we developed a two-step procedure based on two similar optimisation problems. In the first step, we modified the objective function (keeping equivalent restrictions) to find the mobility matrix that minimises the mean squared difference between vectors \(\mathbf{{o}}_{i}^{\mathrm{TT}}\) and \(\hat{\mathbf{o}}_{i}^{\mathrm{RR}}\). In the second step, using the mobility matrix obtained in the first step as the initialisation, we minimised the sum of the \(\chi ^{2}\) values for the different vectors \(\mathbf{{o}}_{i}^{\mathrm{TT}}\) and \(\hat{\mathbf{o}}_{i}^{\mathrm{RR}}\). The solution to the optimisation problem was obtained using MATLAB R2017a, with the fmincon function configured to use the interior-point algorithm.
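The two surrogate objectives of this procedure can be sketched as plain functions of the mobility matrix, here for the RR case. The sketch is written in Python rather than MATLAB and deliberately contains no solver: the function names are hypothetical, the step-1 objective compares destination shares rather than raw counts (an assumption, since the two networks differ in volume), and the population normalisation of Sect. 3.3 is omitted for brevity.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relocate_rr(m, o_rr):
    """Apply the mobility matrix on both sides: M O M'."""
    m_t = [list(col) for col in zip(*m)]
    return matmul(matmul(m, o_rr), m_t)


def shares(row):
    """Turn raw outgoing counts into destination shares."""
    total = sum(row)
    return [x / total for x in row]


def chi2_stat(a, b):
    """Chi-square homogeneity statistic for two count vectors."""
    n_a, n_b = sum(a), sum(b)
    stat = 0.0
    for x, y in zip(a, b):
        if x + y == 0:
            continue  # destination unused by both groups
        e_x = n_a * (x + y) / (n_a + n_b)
        e_y = n_b * (x + y) / (n_a + n_b)
        stat += (x - e_x) ** 2 / e_x + (y - e_y) ** 2 / e_y
    return stat


def step1_objective(m, o_tt, o_rr):
    """Step 1: mean squared difference between TT and relocated RR shares."""
    o_hat = relocate_rr(m, o_rr)
    diffs = [(t - r) ** 2
             for row_t, row_r in zip(o_tt, o_hat)
             for t, r in zip(shares(row_t), shares(row_r))]
    return sum(diffs) / len(diffs)


def step2_objective(m, o_tt, o_rr):
    """Step 2: sum over origin districts of the chi-square statistic."""
    o_hat = relocate_rr(m, o_rr)
    return sum(chi2_stat(t, r) for t, r in zip(o_tt, o_hat))


# Toy data: the RR pattern is proportional to the TT pattern, so leaving
# everybody in place (identity) is already optimal, while swapping the two
# districts makes both objectives positive.
o_tt = [[60.0, 40.0], [30.0, 70.0]]
o_rr = [[6.0, 4.0], [3.0, 7.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
```

In a full pipeline, `step1_objective` would be minimised first by a constrained non-linear solver, and its solution would then serve as the starting point for minimising `step2_objective` under the same restrictions.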

This two-step process, similar in spirit to the original objective function in Eq. (2), gives very satisfactory results, as Fig. 6 shows. Note again that, under the initial conditions, all of the districts indicated segregation in both the Refugee–Refugee and Refugee–Turkish cases. Panel A of Fig. 6 shows the results of mitigating segregation considering the Refugee–Refugee network. We observe that, after the proposed mobility, segregation is reduced in 43% of the districts. When considering Refugee–Turkish communications, the results are also impressive (see Fig. 6B): after promoting mobility, segregation is reduced in 40% of the districts.

Figure 6

Maps of the results of the optimisation, for the Refugee–Refugee and Refugee–Turkish networks respectively. Districts are coloured in a gradient according to their relative change in population. Districts where Turkish and Refugee communication patterns were harmonised (p-value from \(\chi ^{2}\) test indicating no significant difference) have a dotted pattern

4.2 Optimising behavioural vs. spatial segregation: the potential trade-offs

In order to establish a baseline for the outcome of our method, we compared our results with a process directed to minimise the Dissimilarity Index (DI). That is, minimise

$$ \frac{1}{2} \sum_{i}^{n} \biggl\vert \frac{{{c^{T}_{i}}}}{\sum_{j}^{n} {{c^{T}_{j}}}}- \frac{{{c^{R}_{i}}}}{\sum_{j}^{n} {{c^{R}_{j}}}} \biggr\vert , $$

where n corresponds to the number of districts, and \(c^{T}_{j}\) and \(c^{R}_{j}\) are the sums of all outgoing calls made from district j by each group, serving as a proxy of population. As in Sect. 4.1, we performed a separate optimisation for each of the RR and RT networks. In each case, in order to have a fair comparison with the results of our method, we impose a constraint to limit the total number of citizens to be relocated under the optimisation, which is set to the number relocated using our behavioural segregation optimisation described above. After optimisation, we compared the results in terms of the change in the DI, and in terms of the number of districts in which refugee and local call patterns did not exhibit significant differences. Clearly, each optimisation will do its own job better than the other (when minimising the DI, we expect a better outcome for spatial segregation than when we minimise call pattern differences), but seeing how distinct the outcomes are can point us towards potential trade-offs.
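For reference, the Dissimilarity Index defined above is straightforward to compute from two vectors of per-district call volumes. The sketch below uses toy numbers and a hypothetical function name.

```python
def dissimilarity_index(counts_t, counts_r):
    """Half the summed absolute difference between the two groups' district shares."""
    total_t, total_r = sum(counts_t), sum(counts_r)
    return 0.5 * sum(
        abs(t / total_t - r / total_r) for t, r in zip(counts_t, counts_r)
    )


# Identical shares give 0, disjoint districts give 1.
di_equal = dissimilarity_index([10, 10], [5, 5])     # 0.0
di_disjoint = dissimilarity_index([10, 0], [0, 5])   # 1.0
di_mixed = dissimilarity_index([75, 25], [25, 75])   # 0.5
```

The index depends only on each group's shares across districts, not on the absolute volumes, which is why outgoing call counts can stand in for population figures.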

We note that the original DI calculated using our call volume-based population estimation for the RT network was 24%, while for the RR network it was 32%. Both are quite similar to the value calculated using official population data (around 30%). The optimal mobility matrices found in the optimisation of behavioural segregation increased the DI, to 53% in the RT case and 57% in the RR case. When minimising the DI, we reach 2.5% and 3% for the RT and RR cases respectively. With respect to behavioural segregation: in both cases, RT and RR, when optimising the DI, all of the districts remain significantly segregated (\(p\mbox{-value} < 0.01\)). This is in contrast to the optimisation minimising behavioural segregation, which reduced segregation in 40% of the districts. These results suggest that spatial segregation as measured by the Dissimilarity Index and behavioural segregation as we measure it here present somewhat different objective functions with different optima. The optimisation of both measures may be taken as being desirable, and studying their mutual effects on one another could be useful. An interesting prospect for future work would be to design a multi-objective function, in order to find points in the problem space where both spatial and behavioural segregation improve.

4.3 Economic incentives towards integration

From one perspective, social integration can be framed in terms of cost-benefit analysis [47]. In this conceptualisation, language acquisition, distance from family, and exposure to unfamiliar cultures can be considered costs, though they are difficult to quantify in economic terms. Housing costs, in contrast, are relatively easy to quantify. Aside from the characteristics of individual houses, this cost reflects a variety of factors including access to services, employment, and city resources [48–50]. As previously mentioned, rent prices are negatively related to refugee population, as Fig. 1C shows. This implies that some rent-reduction incentives might be effective in getting refugees to relocate out of enclaves. This could be an opportunity for public and private actors interested in increasing host-refugee integration in Turkey to adjust the cost-benefit analysis of refugee location choice by subsidising rent in targeted areas of the city, thereby encouraging refugees to live away from enclaves and making inter-group contact more frequent.

In support of the viability of using rental subsidies as a way to incentivise refugee location choice, we examined the overall change in rent payments that would occur under the new population distribution considering rental markets for the 2017 period [27]. The proposed optimisation problem in Eqs. (2)–(6) provides us with information about the volume of communications that need to be shifted from one district to another. The density of communications originating from an area is known to be related to the population density of the area [51–53], as Fig. 7, drawn from the real population and CDR data, confirms. We can thus use outgoing call volume as a proxy for the number of citizens for whom we need to incentivise mobility. Performing the optimisation considering the RR communication network, a total of 54,942 refugees are required to be relocated (12% of the refugee population). The resulting net increase in monthly rent cost is 11,709,295₺ (1,847,817€), which corresponds to 213₺ (34€) per person per month. Performed considering the RT communication network, the optimisation resulted in a relocation of 212,100 refugees (approx. 40% of the population). This corresponds to a net rent increase of 52,430,540₺ (8,273,946€), or 247₺ (39€) per person per month.
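The per-person figures follow directly from dividing the net monthly rent increase by the number of relocated refugees, for the RR and RT cases respectively:

$$ \frac{11{,}709{,}295}{54{,}942} \approx 213, \qquad \frac{52{,}430{,}540}{212{,}100} \approx 247, $$

and, for the amounts expressed in euros,

$$ \frac{1{,}847{,}817}{54{,}942} \approx 34, \qquad \frac{8{,}273{,}946}{212{,}100} \approx 39. $$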

Figure 7

Observed relationship between the number of communications originating in each district of Istanbul and the population living in the district. We confirm a linear relationship, as expected. Using this relationship to make an estimate indicates that there are about 565,000 refugees living in Istanbul, above the figure of 400,000 cited by public authorities. On the other hand, using the relationship to estimate the local population of the city returns a figure of 10,000,000 people, below the official count of 13,500,000.
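As a minimal sketch of how such a linear relationship can be turned into a population estimate, the snippet below fits an ordinary least-squares line to per-district pairs of call volume and population and then applies it to an unseen call volume. The district figures are invented for illustration, and the helper names `fit_line` and `estimate_population` are hypothetical; the actual fitting procedure behind Fig. 7 may differ.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_population(calls, slope, intercept):
    """Apply the fitted line to a call volume."""
    return slope * calls + intercept


# Hypothetical per-district data (NOT taken from the paper).
calls = [50_000, 100_000, 200_000]          # outgoing communications
population = [100_000, 200_000, 400_000]    # known residents

slope, intercept = fit_line(calls, population)
estimate = estimate_population(150_000, slope, intercept)
print(slope, intercept, estimate)
```

With real data the fit would not be exact, so the residual spread should be reported alongside any population estimate derived this way.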

As can be seen in Fig. 8A and C, the changes in rent payment approximate a normal distribution with a large variance, meaning that, under the adjusted population distribution, some refugees would save considerably on rent, while others would pay a higher price. The overall tendency, though, is an increase in rent costs. The distribution of these changes in rent cost over the districts of Istanbul at the level of the individual is provided in Fig. 9. These figures provide an individual (refugee) point of view in terms of the increase or reduction in cost of living. Panels B and D of Fig. 8, on the other hand, provide a governmental or organisational perspective. The maps indicate the total investment that would be required in each district in order to fully offset the increased rent payments of refugees. As we can see, the subsidies would be larger in the districts near the Bosphorus Strait. Surprisingly, these largest subsidies are not regularly distributed among adjacent districts, and they correspond to the densest areas of the province.
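The per-district totals mapped in Fig. 8 follow the formula given in its caption, \(v_{i} = \sum_{\forall j}m_{ij}(c_{i}-c_{j})\), and the per-refugee averages of Fig. 9 divide that total by the number of arrivals. The sketch below illustrates both quantities on made-up data; the district labels, rents and flows are assumptions, as are the function names.

```python
def rent_change_per_district(moves, rent):
    """Total monthly rent change v_i for each destination district i.

    moves[(j, i)] is the number of refugees moved from origin j to
    destination i; rent[d] is the monthly rent cost c_d in district d.
    """
    change = {district: 0.0 for district in rent}
    for (origin, dest), moved in moves.items():
        change[dest] += moved * (rent[dest] - rent[origin])
    return change


def average_change_per_arrival(moves, rent):
    """Average rent change per arriving refugee (districts with arrivals only)."""
    totals = rent_change_per_district(moves, rent)
    arrivals = {}
    for (_, dest), moved in moves.items():
        arrivals[dest] = arrivals.get(dest, 0) + moved
    return {d: totals[d] / n for d, n in arrivals.items() if n > 0}


# Hypothetical rents and relocation flows (NOT taken from the paper).
rent = {"A": 1000, "B": 1400, "C": 800}
moves = {("A", "B"): 10, ("C", "B"): 5, ("B", "C"): 2}

totals = rent_change_per_district(moves, rent)
averages = average_change_per_arrival(moves, rent)
print(totals)
print(averages)
```

A positive total marks a district where a subsidy would be needed to offset the increase, while a negative total marks one where arriving refugees would save on rent.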

Figure 8

The histograms at left illustrate the distribution of monthly rent changes after optimisation, for the Refugee–Refugee and Refugee–Turkish networks respectively. The height of the vertical bars indicates the number of relocated refugees whose rent payments increased or decreased by the value indicated on the horizontal axis. The maps at right indicate, for each respective network, the total monthly rent change per district. That is, for each destination district i, we sum over the origin districts j the product of the difference in rent cost between i and j and the number of refugees \(m_{ij}\) moved from j to i. Formally, the reported rent change per district i is obtained as \(v_{i} = \sum_{\forall j}m_{ij}(c_{i}-c_{j})\)

Figure 9

Maps, for each corresponding network, of the average monthly rent change per refugee arriving in each district. These maps complement the distributions of Fig. 8

5 Discussion

This work essentially makes two contributions. On the one hand, we perform a large-scale data analysis of behavioural segregation in Istanbul on the basis of call patterns. On the other, we provide a framework for reducing the level of segregation based on the normative assumption that lowering behavioural segregation can increase social integration.

The method we propose allows for the quantification and potential mitigation of refugee segregation within a geographical area. The method goes beyond the spatial dimension typically considered, and accounts for behavioural aspects of the different communities. From the combined analysis of communication data, the first step is to establish whether, and to what extent, the two groups of interest behave differently. Our analysis confirms that the communication patterns of the two groups were always significantly different (p-values lower than 0.01). The two plausible reasons accounting for these differences are the existence of strong cultural differences and of residential enclaves, and the combination of both factors is reflected in the segregation of refugees in specific areas. According to the classical assumptions of public policy, indistinguishable communication patterns between the Turkish population and Syrian refugees would reflect a situation of integration; that is, one in which their patterns of communication had reached a kind of natural equilibrium, considering the geo- and socio-economic situation of the city. From this approach we hypothesise that, by merging the differing communication patterns of the Turkish and Syrian refugee populations into a single one, we can increase the potential for inter-communication and integration between them, following the principle of homophily in social interactions.

Nevertheless, while the results reported here show promise, there are also a number of limitations that should be assessed in future studies, most of them with an interdisciplinary approach in mind. First, our model works within an idealised situation that does not address some well-known and important factors for integration, such as Syrian-Turkish cultural differences, which should be taken into account for a well-designed public policy aimed at improving social cohesion. These elements are important not only when a model such as the one presented here is used as an input for public planning, but also for further academic research. An example is the long-term effect that achieving similar communication patterns might have on cultural aspects at the individual or inter-group level. In addition, local dynamics of urban politics should be taken into account. In our case, this would include for instance the complex relationship between the Turkish state, local government, real estate businesses, and residents in the context of the trend of “urban transformation” [54–56]. Second, concerning the methods, further research could include variables that we have not addressed here. While our analysis is aggregated and anonymised, a similar procedure could be carried out with data tracking individuals over a period of time, to draw related but distinct conclusions. Additionally, the “quality” of communications could be taken into consideration. Here, SMS and phone calls are given the same value; call duration, for example, could provide some measure of communication quality. All in all, these elements would allow us to better address the interaction between the individual, group and contextual factors that determine spatial patterns of segregation.
Finally, the comparison we performed between our behavioural optimisation problem and an optimisation problem minimising the Dissimilarity Index opens up the interesting possibility of a multi-objective optimisation, aimed at finding an outcome that reduces both spatial and behavioural segregation. This could be especially relevant given that each optimisation was ineffective at reducing the segregation measure lowered by the other.
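To make the multi-objective idea concrete, the sketch below computes the Dissimilarity Index in its standard form, \(D = \frac{1}{2}\sum_{i}|a_{i}/A - b_{i}/B|\), where \(a_{i}\) and \(b_{i}\) are the counts of each group in district i and A and B are the group totals. This assumes the usual textbook definition; any definition given earlier in the paper takes precedence. The weighted sum in `combined_objective` is only one common way of scalarising two objectives and is a hypothetical illustration, not the authors' method.

```python
def dissimilarity_index(group_a, group_b):
    """Standard Dissimilarity Index between two groups' district counts."""
    total_a = sum(group_a)
    total_b = sum(group_b)
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(group_a, group_b)
    )


def combined_objective(behavioural, spatial, weight=0.5):
    """Hypothetical weighted-sum scalarisation of two segregation measures.

    weight = 1 considers only the behavioural measure, weight = 0 only
    the spatial one. Both inputs are assumed to lie on comparable scales.
    """
    return weight * behavioural + (1.0 - weight) * spatial


# Illustrative counts for two districts (NOT taken from the paper).
even = dissimilarity_index([10, 10], [50, 50])       # identical shares
separated = dissimilarity_index([10, 0], [0, 10])    # complete separation
partial = dissimilarity_index([30, 10], [10, 30])
print(even, separated, partial)
```

An index of 0 means the two groups are distributed identically across districts, and 1 means they share no district at all.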

While we admit that these unaddressed aspects, regarding other socio-cultural factors or different levels of detail of the Communication Network [57], should be considered in subsequent works, the estimations given here can be of practical use in several ways. First, the developed procedure provides estimations of the level of integration that can be achieved by using social and residential mixing strategies. Second, we provide a systematic method that can give consistent quantitative evidence about the volumes and destinations required if a group (in our case, Syrian refugees) were to be relocated in a particular urban or regional area. This evidence can be seen as a good starting point for governments and NGOs to analyse the situation, target their campaigns, and optimise their economic investment in the area. Lastly, the optimisation framework proposed here can be easily complemented with other parameters of interest. In this work, we applied only one restriction to mobility: the one limiting the proportion of refugees per district. That said, mobility can easily be restricted in other ways. For instance, assuming the availability of the data, a restriction could be applied using employment data or labour demand in each district in order to achieve more socially accurate results.

An illustrative example of this exercise is to consider how refugees’ choice of residential location influences their integration with the local community [58, 59]. Policy-driven incentives such as rent subsidies could help those who, for example, might choose to move away from an ethnic enclave if rent prices outside were lower [56]. In our case, we have estimated that the average rent paid among the relocating refugee population would not rise by more than 39€ per family per month. This is a barrier that could be too high for refugees who already face difficulties. However, it is also a barrier that governmental and NGO policy could reduce. Governments and NGOs have a range of options available to them to incentivise locational choice, which are outside the scope of the scientific work presented here, although several well-known approaches to the problem exist. In this sense, our method could be used as an input to design programmes involving a differential rent subsidy, or “voucher” [60], based on the relative rent price in target districts.
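One simple reading of a differential rent subsidy is a voucher that covers some share of the rent gap between the origin and the target district, and nothing when the target is cheaper. The sketch below follows that reading; the function name `voucher`, the `coverage` parameter and the rent values are hypothetical, and actual voucher schemes may be designed quite differently.

```python
def voucher(rent_origin, rent_target, coverage=1.0):
    """Monthly voucher covering a share of the rent gap.

    Assumption: the voucher is proportional to the positive difference
    between target and origin rent, and zero when the move is to a
    cheaper district. coverage is the fraction of the gap covered (0 to 1).
    """
    gap = max(0.0, rent_target - rent_origin)
    return coverage * gap


# Hypothetical monthly rents (NOT taken from the paper).
full = voucher(1000, 1250)                  # whole gap covered
half = voucher(1000, 1200, coverage=0.5)    # half of the gap covered
none = voucher(1200, 1000)                  # cheaper target, no voucher
print(full, half, none)
```

Combined with the relocation volumes produced by the optimisation, such a rule would give a first estimate of the budget needed per target district.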

As a final note, it is not our objective to advocate particular policies, but rather to provide methods to quantify, and give indications of, what could be expected from residential mixing policies. An optimal integration of the refugee and host populations should probably be considered an organic process, since integration here means connections between people, and connections are made voluntarily and maintained only by individual choice. In the event that governments and non-governmental entities decide to take a hands-off approach to integration policy, the proposed framework can still be useful for analysing how the situation evolves and for providing early warnings of recessive or problematic conditions.