Flow length and size distributions in campus Internet traffic
Introduction
Flow-based switching and routing have been attracting the attention of researchers for quite some time [1]. They can be advantageous in comparison to per-packet switching, especially with regard to traffic engineering [2], quality of service (QoS) [3] or security [4]. For example, flow routing enables multipath and adaptive approaches, which are impossible to achieve in per-packet routing due to routing loops and route-flapping constraints, respectively [5].
The efficiency of numerous flow-based solutions strongly depends on traffic characteristics and should therefore be assessed using realistic and accurate flow models. Examples of such solutions are traffic engineering mechanisms that exploit the heavy-tailed nature of IP flows. To the best of our knowledge, the first paper exploring this possibility is [6], in which the authors proposed a heuristic that differentiates traffic into elephant and mice flows. Then, assuming that elephant flows have a more significant impact on network performance, this type of traffic is routed adaptively according to the current network load, while flows classified as mice are handled using the shortest paths. More recently, the heavy-tailed nature of IP flows has been exploited to reduce management overheads in software-defined networking (SDN). For example, the authors of [7] employed a reinforcement learning approach to detect elephant flows in advance and thus limit the number of flow entries in forwarding tables.
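As an illustration of the elephant/mice distinction, the sketch below labels a flow as an elephant once its cumulative byte count reaches a fixed threshold. This is a common simplification and not the heuristic proposed in [6]; the threshold value, the `FlowClassifier` class and the example 5-tuple are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical cut-off in bytes; not taken from [6] or [7].
ELEPHANT_THRESHOLD_BYTES = 1_000_000


class FlowClassifier:
    """Tracks per-flow byte counts and flags flows that reach the threshold."""

    def __init__(self, threshold=ELEPHANT_THRESHOLD_BYTES):
        self.threshold = threshold
        self.bytes_seen = defaultdict(int)

    def observe(self, flow_key, packet_bytes):
        """Account for one packet and return the flow's current class."""
        self.bytes_seen[flow_key] += packet_bytes
        if self.bytes_seen[flow_key] >= self.threshold:
            return "elephant"
        return "mouse"


# Usage: three 1500-byte packets of one flow, with a small threshold for clarity.
classifier = FlowClassifier(threshold=3000)
key = ("10.0.0.1", "10.0.0.2", 40000, 443, "tcp")  # illustrative 5-tuple
labels = [classifier.observe(key, 1500) for _ in range(3)]
```

A router using such a rule would keep mice on shortest paths and switch a flow to load-aware routing only after it has been relabeled, which is why the share of traffic carried by elephants matters for evaluating these mechanisms.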
To reliably evaluate such ideas, realistic distributions of flow length and size are required. Unfortunately, such data are not available in the literature. For example, the authors of [6] used distributions extracted from their own traffic measurements, but they did not provide any reusable data and their trace was limited to only one week. By contrast, the authors of [7] assumed a constant 1:9 ratio between elephant and mice layer-4 flows and fixed flow sizes of 25.6 MB and 256 KB, respectively. Such assumptions are not only arbitrary; they also often do not correspond to reality, as we show in this work.
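To make the consequences of the fixed-size assumption in [7] concrete, the following sketch works through the arithmetic. It assumes decimal units (1 MB = 1000 KB); the variable names are ours and do not come from [7].

```python
# 1:9 elephant-to-mice ratio means one flow in ten is an elephant.
ELEPHANT_SHARE = 1 / 10
ELEPHANT_SIZE_KB = 25_600  # 25.6 MB, assuming 1 MB = 1000 KB
MOUSE_SIZE_KB = 256

# Mean flow size implied by the two-point distribution.
mean_flow_size_kb = (
    ELEPHANT_SHARE * ELEPHANT_SIZE_KB + (1 - ELEPHANT_SHARE) * MOUSE_SIZE_KB
)

# Fraction of all bytes carried by elephant flows.
elephant_byte_share = ELEPHANT_SHARE * ELEPHANT_SIZE_KB / mean_flow_size_kb

print(f"mean flow size: {mean_flow_size_kb:.1f} KB")
print(f"bytes carried by elephants: {elephant_byte_share:.1%}")
```

Under these assumptions the mean flow size is about 2790 KB and elephants carry roughly 92% of all bytes. More importantly, every flow takes one of only two sizes, so the model has no tail at all.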
The lack of realistic models undermines the credibility of the results presented in numerous papers. Moreover, different and arbitrary assumptions in various works make it impossible to compare solutions effectively. As we show in the related works, the papers that attempt to address this issue provide, at best, plots of empirical probability density functions (PDFs) or cumulative distribution functions (CDFs) of selected flow parameters. None of these papers provides distribution mixture models or even reusable numerical data of any kind.
The goal of this paper is to provide accurate flow statistics and reusable distribution mixture models of flow length and size for any researchers who may need such data. We believe that these models can be considered general models of typical Internet traffic and can thus be widely used in numerous applications, including AI and Big Data. Examples of such applications are summarized in Table 1. Furthermore, we provide a tutorial and software for building similar models based on data gathered in any network.
The structure of this paper is as follows. First, we present the methodology with a tutorial covering the following steps:
- collecting flow records,
- cleaning the data,
- merging split flow records,
- data binning and plotting,
- fitting mixture models to data,
- generating realistic traffic based on these models.
For each step, we provide tips and highlight caveats and possible pitfalls. In addition to the methodology and tutorial, we provide an open-source software framework comprising tools for performing these steps. The framework is designed with big data analysis capabilities in mind. Specifically, it supports out-of-core computing, making it possible to analyze data that exceed the available memory. Moreover, most processing steps can be scaled horizontally using the well-established map-reduce technique. Therefore, the provided implementation is not limited in terms of the number of processed flow records. Together, the provided methodology and framework create an opportunity for any interested parties to extract traffic characteristics from their networks and validate any potential mechanisms before applying them in the production environment.
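The combination of binning and map-reduce mentioned above can be sketched as follows. Each chunk of flow sizes is histogrammed independently (map), and the partial histograms are merged by adding bin counts (reduce), so no step ever needs all records in memory. The power-of-two bin edges and all function names are assumptions made for this illustration and are not necessarily those used by the framework.

```python
from collections import Counter
from functools import reduce


def bin_index(value):
    """Return k such that 2**k <= value < 2**(k + 1), for a positive integer."""
    return value.bit_length() - 1


def map_chunk(flow_sizes):
    """Map step: histogram one chunk of flow sizes that fits in memory."""
    return Counter(bin_index(size) for size in flow_sizes)


def reduce_histograms(left, right):
    """Reduce step: merge two partial histograms by adding bin counts."""
    return left + right


# Two chunks standing in for files or partitions processed separately.
chunks = [[1, 2, 3, 1500], [64, 100, 1500, 70000]]
partial = [map_chunk(chunk) for chunk in chunks]
histogram = reduce(reduce_histograms, partial, Counter())
```

Because adding counters is associative and commutative, the chunks can be processed in any order or on different machines, and the merged histogram is the same as the one computed over the whole data set at once.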
Then, we use the framework to apply the methodology to real traffic traces in order to extract models of flow length and size. The traces cover a thirty-day period of layer-4 flows (four billion flows, defined by 5-tuple, 275 TB of transmitted data) and were collected on the Internet-facing interface of a large wired university network. This is several orders of magnitude more than in previous analyses, which mostly comprised tens of millions of flows. The distributions of the number of flows and of the total numbers of packets and octets are extracted, analyzed and modeled as functions of both flow length (in packets) and flow size (in bytes). In previous works, only selected distributions were presented, without any models or reusable numerical parameters (see Section 2: Related work).
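To illustrate how a distribution mixture model of flow size can be used to generate synthetic traffic, the sketch below draws flow sizes from a two-component mixture. The component families (lognormal and Pareto), their weights and their parameters are placeholders chosen for this example; they are not the fitted values reported in this work.

```python
import random

# Hypothetical mixture: (weight, sampler) pairs. All parameters are illustrative.
COMPONENTS = [
    (0.7, lambda rng: rng.lognormvariate(6.0, 1.0)),      # body of the distribution
    (0.3, lambda rng: 1000.0 * rng.paretovariate(1.2)),   # heavy tail, scale 1000 B
]


def sample_flow_size(rng):
    """Pick a component according to its weight, then draw a size in bytes."""
    weights = [weight for weight, _ in COMPONENTS]
    _, sampler = rng.choices(COMPONENTS, weights=weights, k=1)[0]
    return max(1, round(sampler(rng)))


rng = random.Random(42)  # fixed seed so that runs are repeatable
sizes = [sample_flow_size(rng) for _ in range(10_000)]
```

Sampling in two stages (component first, then value) is what distinguishes a mixture from a single fitted distribution: the body and the tail of the flow size distribution can be described by different families.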
Finally, along with the framework source code, we make the data publicly available. This makes our results reusable and fully reproducible, increasing the value of the tutorial part of this work.
Related works
To our knowledge, no other paper provides a tutorial-style methodology for extracting accurate flow characteristics. Furthermore, we are unaware of any software framework able to determine such characteristics or of any previous work providing a reusable flow model reflecting general Internet traffic. Some works provide only selected traffic properties, without trying to fit accurate mixture models. Such works are briefly introduced below.
The contribution of paper [8] is the most similar
Methodology
This section covers steps aimed at collecting and analyzing flow traces from the network, as well as constructing flow models that accurately describe the traffic. For each stage, numerous tips and possible pitfalls are provided to reveal all the lessons learned during the research.
The overall data pipeline is as follows. First, all flow records have to be collected. Next, before any further processing, the data need to be cleaned and filtered. Since long-lasting flows may be reported multiple
Campus traffic model
We applied the methodology described in the tutorial to the real traffic traces in order to extract models of flow lengths and sizes. We collected NetFlow records of all flows passing through the Internet-facing interface of the AGH University of Science and Technology wired network over 30 consecutive days. Flows traveling in both directions (upstream and downstream) were collected separately. We only collected dataplane traffic, without any control traffic (like OpenFlow messages between
Conclusion
The contribution of this paper is fourfold. Firstly, it provides a complete tutorial on a methodology for constructing network flow models from flow records.
Secondly, a ready-to-use and scalable framework implementing this methodology is published as open-source software. Because it applies big data techniques, it scales horizontally and can be used to process an unlimited number of flow records and fit distribution mixtures to them.
Thirdly, the paper presents an example of applying the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The research was carried out with the support of the project “Intelligent management of traffic in multi-layer Software-Defined Networks” funded by the Polish National Science Centre, Poland under project no. 2017/25/B/ST6/02186.
The authors would like to thank Bogusław Juza for providing NetFlow flow record dumps.
References (55)
- et al., A roadmap for traffic engineering in SDN-OpenFlow networks, Comput. Netw. (2014)
- et al., Testing implementation of FAMTAR: Adaptive multipath routing, Comput. Commun. (2020)
- et al., Characteristic analysis of internet traffic from the perspective of flows, Comput. Commun. (2006)
- et al., Variable heavy tails in internet traffic, Perform. Eval. (2004)
- Lognormal and Pareto distributions in the Internet, Comput. Commun. (2005)
- et al., Game traffic analysis: An MMORPG perspective, Comput. Netw. (2006)
- et al., Live streaming of user generated videos: Workload characterization and content delivery architectures, Comput. Netw. (2011)
- et al., Accurate modeling of VoIP traffic QoS parameters in current and future networks with multifractal and Markov models, Math. Comput. Modelling (2013)
- et al., Software-defined networking: A comprehensive survey, Proc. IEEE (2015)
- et al., Flow oriented approaches to QoS assurance, ACM Comput. Surv. (2012)
- Enhancing network security through software defined networking (SDN)
- Load-sensitive routing of long-lived IP flows, ACM SIGCOMM Comput. Commun. Rev.
- SDN flow entry management using reinforcement learning, ACM Trans. Auton. Adapt. Syst.
- Empirical analysis and modeling of peer-to-peer traffic flows
- Inter-AS traffic patterns and their implications
- Dynamic feature analysis and measurement for large-scale network traffic monitoring, IEEE Trans. Inf. Forensics Secur.
- A flow-based performance analysis of TCP and TCP applications
- On the characteristics and origins of internet flow rates, ACM SIGCOMM Comput. Commun. Rev.
- TCP revisited: A fresh look at TCP in the wild
- Understanding Internet traffic streams: dragonflies and tortoises, IEEE Commun. Mag.
- Impact of flow dynamics on traffic engineering design principles
- Network traffic characteristics of data centers in the wild
- On the correlation of Internet flow characteristics
- A measurement study of correlations of Internet flow characteristics, Comput. Netw.
- Analysis of elephant users in broadband network traffic
- Estimation of flow distributions from sampled traffic, ACM Trans. Model. Perform. Eval. Comput. Syst.