Elsevier

Computer Communications

Volume 167, 1 February 2021, Pages 15-30
Flow length and size distributions in campus Internet traffic

https://doi.org/10.1016/j.comcom.2020.12.016

Abstract

The efficiency of flow-based networking mechanisms strongly depends on traffic characteristics and should thus be assessed using accurate flow models. For example, for algorithms based on the distinction between elephant and mice flows, it is extremely important to use realistic flow length and size distributions. Credible models or data are not available in the literature. Numerous works contain only plots roughly presenting empirical distributions of selected flow parameters, without providing distribution mixture models or any reusable numerical data. This paper aims to fill that gap and provide reusable models of flow length and size derived from real traffic traces. The traces were collected at the Internet-facing interface of a university campus network and comprise four billion layer-4 flows (275 TB). These models can be used to assess a variety of flow-oriented solutions under realistic conditions. Additionally, this paper provides a tutorial on constructing network flow models from traffic traces. The proposed methodology is universal and can be applied to traffic traces gathered in any network. We also provide an open source software framework to analyze flow traces and fit general mixture models to them.

Introduction

Flow-based switching and routing have been gaining the attention of researchers for quite some time [1]. They can be advantageous in comparison to per-packet switching, especially with regard to traffic engineering [2], quality of service (QoS) [3] or security [4]. For example, flow routing enables multipath and adaptive approaches, which are impossible to achieve in per-packet routing due to routing loops and route-flapping constraints, respectively [5].

The efficiency of numerous flow-based solutions strongly depends on traffic characteristics and should thus be assessed based on realistic and accurate flow models. Examples of such solutions are traffic engineering mechanisms exploiting the heavy-tailed nature of IP flows. To the best of our knowledge, the first paper exploring such a possibility is [6], in which the authors proposed a heuristic that differentiates traffic into elephant and mice flows. Then, assuming that elephant flows have a more significant impact on network performance, this type of traffic is routed adaptively according to the current network load, while flows classified as mice are handled using the shortest paths. Recently, the heavy-tailed nature of IP flows has been exploited to reduce management overheads in software-defined networking (SDN). For example, in [7], the authors employed a reinforcement learning approach to detect elephant flows in advance in order to limit the number of flow entries in forwarding tables.

To reliably evaluate such ideas, realistic distributions of flow length and size must be used. Unfortunately, such data are not available in the literature. For example, the authors of [6] used distributions extracted from their own traffic measurements, but they did not provide any reusable data and their trace was limited to only one week. By contrast, the authors of [7] assumed a constant 1:9 ratio between elephant and mice layer-4 flows and fixed flow sizes of 25.6 MB and 256 KB, respectively. Such assumptions are not only arbitrary, but they also often do not correspond to reality, as we show in this work.
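To make the consequence of such a fixed-size assumption concrete, the short calculation below works out what share of the traffic volume elephants would carry under the 1:9 ratio and the two sizes quoted above. It is only an illustrative sketch: the variable names are ours, and decimal units (1 MB = 1000 KB) are assumed.

```python
# Illustrative arithmetic for the fixed-size assumption quoted above.
# Assumption: decimal units, so 25.6 MB = 25,600 KB.
ELEPHANT_SIZE_KB = 25_600.0
MOUSE_SIZE_KB = 256.0
ELEPHANTS_PER_GROUP = 1   # the "1" in the 1:9 ratio
MICE_PER_GROUP = 9        # the "9" in the 1:9 ratio

elephant_kb = ELEPHANTS_PER_GROUP * ELEPHANT_SIZE_KB
mouse_kb = MICE_PER_GROUP * MOUSE_SIZE_KB

# Share of flows that are elephants, and share of volume they carry.
elephant_flow_share = ELEPHANTS_PER_GROUP / (ELEPHANTS_PER_GROUP + MICE_PER_GROUP)
elephant_volume_share = elephant_kb / (elephant_kb + mouse_kb)

print(f"elephants: {elephant_flow_share:.0%} of flows, "
      f"{elephant_volume_share:.1%} of volume")
```

Under these assumptions, one tenth of the flows carries roughly nine tenths of the volume, and this split is fixed by construction rather than measured.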

The lack of realistic models negatively impacts the credibility of results presented in numerous papers. Moreover, different and arbitrary assumptions in various works make it impossible to compare different solutions effectively. As we show in the review of related works, all the papers attempting to address this issue provide, at best, plots presenting empirical probability density functions (PDFs) or cumulative distribution functions (CDFs) of selected flow parameters. None of these papers provide distribution mixture models or even reusable numerical data of any kind.

This paper’s goal is to provide accurate flow statistics and reusable distribution mixture models of flow length and size for any researchers who may need such data. We believe that these models can be considered general models of typical Internet traffic and can thus be widely used in numerous applications, including AI and Big Data. Examples of such applications are summarized in Table 1. Furthermore, we provide a tutorial and software for building similar models based on data gathered in any network.
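As a rough illustration of what a reusable distribution mixture model offers, the sketch below draws synthetic flow sizes from a two-component mixture. The choice of log-normal components, the weights and the parameters are all hypothetical values picked for the example. They are not the models fitted in this paper.

```python
import random

# Hypothetical mixture parameters, chosen only for illustration.
# They are NOT the values fitted in the paper.
# Each component: (weight, mu, sigma) of a log-normal flow size in bytes.
COMPONENTS = [
    (0.90, 7.0, 1.0),   # many small "mice" flows
    (0.10, 14.0, 1.5),  # few large "elephant" flows
]

def sample_flow_size(rng):
    """Draw one flow size in bytes from the log-normal mixture."""
    weights = [weight for weight, _, _ in COMPONENTS]
    # Pick a component according to its weight, then sample from it.
    _, mu, sigma = rng.choices(COMPONENTS, weights=weights, k=1)[0]
    return max(1, round(rng.lognormvariate(mu, sigma)))

rng = random.Random(42)  # fixed seed so the run is repeatable
sizes = [sample_flow_size(rng) for _ in range(10_000)]

# Share of the total volume carried by the largest 10% of flows.
largest = sorted(sizes, reverse=True)[: len(sizes) // 10]
top_share = sum(largest) / sum(sizes)
print(f"largest 10% of flows carry {top_share:.1%} of the bytes")
```

A simulator can call `sample_flow_size` once per generated flow, so a published set of mixture parameters is enough to reproduce a heavy-tailed workload without access to the original traces.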

The structure of this paper is as follows. First, we present the methodology with a tutorial covering the following steps:

  • collecting flow records,

  • cleaning the data,

  • merging split flow records,

  • data binning and plotting,

  • fitting mixture models to data,

  • generating realistic traffic based on these models.

For each step, we provide tips and highlight caveats and possible pitfalls. In addition to the methodology and tutorial, we provide an open source software framework comprising tools for performing these steps. The framework is designed with big data analysis capabilities in mind. Specifically, it supports out-of-core computing, making it possible to analyze data that exceed the available memory. Moreover, most processing steps can be scaled horizontally using the well-established map-reduce technique. Therefore, the provided implementation is not limited in terms of the number of processed flow records. Together, the provided methodology and framework create an opportunity for any interested parties to extract traffic characteristics from their networks and validate any potential mechanisms before applying them in a production environment.
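The idea of combining out-of-core processing with map-reduce can be sketched as follows. Each chunk of flow sizes is binned independently (map), and the partial histograms are then summed (reduce). This is a minimal sketch and not the framework's actual code: the function names and the power-of-two binning scheme are assumptions made for this example.

```python
from collections import Counter
from functools import reduce

def log2_bin(value):
    """Index k of the power-of-two bin [2**k, 2**(k+1)) holding value >= 1."""
    return int(value).bit_length() - 1

def map_chunk(flow_sizes):
    """Map step: histogram one chunk of flow sizes that fits in memory."""
    return Counter(log2_bin(size) for size in flow_sizes)

def reduce_histograms(left, right):
    """Reduce step: merge two partial histograms by adding bin counts."""
    return left + right

def chunks(iterable, chunk_size):
    """Yield lists of at most chunk_size items, so the input can be a stream."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

flow_sizes = [40, 64, 100, 1500, 1500, 70_000, 5_000_000]  # toy data, bytes
partials = (map_chunk(c) for c in chunks(flow_sizes, chunk_size=3))
histogram = reduce(reduce_histograms, partials, Counter())
print(sorted(histogram.items()))
```

Because adding partial histograms is associative and commutative, the map step can run on separate files or machines in any order, and only the small per-bin counts need to be held in memory at once.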

Then, we use the framework to apply the methodology to real traffic traces in order to extract models of flow length and size. The traces cover a thirty-day period of layer-4 flows (four billion flows, defined by 5-tuple, 275 TB of transmitted data) and were collected on the Internet-facing interface of a large wired university network. This is orders of magnitude more than in previous analyses, which mostly comprised tens of millions of flows. The distributions of the number of flows and of the total number of packets and octets are extracted, analyzed and modeled as functions of both flow length (in packets) and flow size (in bytes). In previous works, only selected distributions were presented, without any models or reusable numerical parameters (see Section 2: Related work).
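For a sense of scale, the aggregate figures quoted above can be turned into simple averages. The calculation below is a back-of-the-envelope sketch: it treats "four billion" and "275 TB" as exact, assumes decimal terabytes, and says nothing about the shape of the distributions, which are heavy-tailed.

```python
# Back-of-the-envelope averages from the aggregate trace figures.
# Assumptions: exactly 4e9 flows, 275 decimal TB, 30 full days.
TOTAL_FLOWS = 4e9
TOTAL_BYTES = 275e12
PERIOD_SECONDS = 30 * 24 * 3600

mean_flow_size_bytes = TOTAL_BYTES / TOTAL_FLOWS
mean_flows_per_second = TOTAL_FLOWS / PERIOD_SECONDS
mean_throughput_bps = TOTAL_BYTES * 8 / PERIOD_SECONDS

print(f"mean flow size:  {mean_flow_size_bytes / 1e3:.2f} KB")
print(f"mean flow rate:  {mean_flows_per_second:.0f} flows/s")
print(f"mean throughput: {mean_throughput_bps / 1e6:.0f} Mbit/s")
```

The mean flow size comes out at roughly 69 KB. With heavy-tailed traffic the mean is driven by a small number of very large flows, so it should not be read as the size of a typical flow.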

Finally, along with the framework source code, we make the data publicly available. This makes our results reusable and fully reproducible, increasing the value of the tutorial part of this work:

https://github.com/piotrjurkiewicz/flow-models

Related works

To our knowledge, no other paper provides a tutorial-style methodology for extracting accurate flow characteristics. Furthermore, we are unaware of any software framework able to determine such characteristics, or of any previous work providing a reusable flow model reflecting general Internet traffic. Some works provide only selected traffic properties, without trying to fit accurate mixture models. Such works are briefly introduced below.

The contribution of paper [8] is the most similar

Methodology

This section covers steps aimed at collecting and analyzing flow traces from the network, as well as constructing flow models that accurately describe the traffic. For each stage, numerous tips and possible pitfalls are provided to reveal all the lessons learned during the research.

The overall data pipeline is as follows. First, all flow records have to be collected. Next, before any further processing, the data need to be cleaned and filtered. Since long lasting flows may be reported multiple

Campus traffic model

We applied the methodology described in the tutorial to the real traffic traces in order to extract models of flow lengths and sizes. We collected NetFlow records of all flows passing through the Internet-facing interface of the AGH University of Science and Technology wired network over 30 consecutive days. Flows traveling in both directions (upstream and downstream) were collected separately. We only collected dataplane traffic, without any control traffic (like OpenFlow messages between

Conclusion

The contribution of this paper is fourfold. Firstly, it provides a complete tutorial on methodology aimed at constructing network flow models from flow records.

Secondly, a ready-to-use and scalable framework implementing this methodology is published as open source software. Thanks to the use of big data techniques, it scales horizontally and can be used to process an unlimited number of flow records and fit distribution mixtures to them.

Thirdly, the paper presents an example of applying the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The research was carried out with the support of the project “Intelligent management of traffic in multi-layer Software-Defined Networks” funded by the Polish National Science Centre, Poland under project no. 2017/25/B/ST6/02186.

The authors would like to thank Bogusław Juza for providing NetFlow flow record dumps.

References (55)

  • Shin S. et al.

    Enhancing network security through software defined networking (SDN)

  • Shaikh A. et al.

    Load-sensitive routing of long-lived IP flows

    ACM SIGCOMM Comput. Commun. Rev.

    (1999)

  • Mu T.-Y. et al.

    SDN flow entry management using reinforcement learning

    ACM Trans. Auton. Adapt. Syst.

    (2018)

  • Pustisek M. et al.

    Empirical analysis and modeling of peer-to-peer traffic flows

  • B. Ryu, D. Cheney, H.-W. Braun, Internet flow characterization: Adaptive timeout strategy and statistical modeling,...

  • Fang W. et al.

    Inter-AS traffic patterns and their implications

  • Guan X. et al.

    Dynamic feature analysis and measurement for large-scale network traffic monitoring

    IEEE Trans. Inf. Forensics Secur.

    (2010)

  • Qian L. et al.

    A flow-based performance analysis of TCP and TCP applications

  • Zhang Y. et al.

    On the characteristics and origins of internet flow rates

    ACM SIGCOMM Comput. Commun. Rev.

    (2002)

  • Qian F. et al.

    TCP revisited: A fresh look at TCP in the wild

  • Brownlee N. et al.

    Understanding Internet traffic streams: dragonflies and tortoises

    IEEE Commun. Mag.

    (2002)

  • Papagiannaki K. et al.

    Impact of flow dynamics on traffic engineering design principles

  • Benson T. et al.

    Network traffic characteristics of data centers in the wild

  • Lan K.-C. et al.

    On the correlation of Internet flow characteristics

    (2003)

  • Lan K.-C. et al.

    A measurement study of correlations of Internet flow characteristics

    Comput. Netw.

    (2006)

  • Megyesi P. et al.

    Analysis of elephant users in broadband network traffic

  • Antunes N. et al.

    Estimation of flow distributions from sampled traffic

    ACM Trans. Model. Perform. Eval. Comput. Syst.

    (2016)