Abstract
We present a method for automatically learning, from a given codebase, an effective strategy for clustering variables for the Octagon analysis. The learned strategy works as a preprocessor for Octagon. Given a program to be analyzed, the strategy is first applied to the program and clusters the variables in it. We then run a partial variant of the Octagon analysis that tracks relationships among variables within the same cluster, but not across different clusters. The notable aspect of our learning method is that, although it is based on supervised learning, it does not require manually labeled data. The method does not ask a human to indicate which pairs of program variables in the given codebase should be tracked. Instead, it uses the impact pre-analysis for Octagon from our previous work and automatically labels variable pairs in the codebase as positive or negative. We implemented our method on top of a static buffer-overflow detector for C programs and tested it against open-source benchmarks. Our experiments show that the partial Octagon analysis with the learned strategy scales up to 100 KLOC and is 33\(\times\) faster than the one with the impact pre-analysis (which itself is significantly faster than the original Octagon analysis), while increasing false alarms by only 2%. The general idea behind our method is applicable to other types of static analyses as well. We demonstrate that our method is also effective for learning a strategy for context-sensitivity of interval analysis.
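The clustering step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the predicate `is_positive` stands in for the learned classifier, the variable names are made up, and grouping variables by the transitive closure of positively classified pairs (via union-find) is assumed here as one natural way to turn pairwise predictions into clusters.

```python
from collections import defaultdict
from itertools import combinations


def cluster_variables(variables, is_positive):
    """Group variables so that every positively classified pair shares a cluster.

    `is_positive(x, y)` is a hypothetical stand-in for the learned classifier.
    Clusters are the connected components of the "positive pair" graph.
    """
    parent = {v: v for v in variables}

    def find(v):
        # Follow parent links to the representative, halving paths on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y in combinations(variables, 2):
        if is_positive(x, y):
            parent[find(x)] = find(y)  # merge the two clusters

    groups = defaultdict(set)
    for v in variables:
        groups[find(v)].add(v)
    return sorted(sorted(g) for g in groups.values())


def tracked_pairs(clusters):
    """Pairs a partial Octagon analysis would relate: same-cluster pairs only."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}


# Hypothetical program variables and classifier output (assumptions).
variables = ["i", "n", "buf_size", "tmp", "k"]
positive = {("i", "n"), ("n", "buf_size")}


def is_positive(x, y):
    return (x, y) in positive or (y, x) in positive


clusters = cluster_variables(variables, is_positive)
pairs = tracked_pairs(clusters)
all_pairs = len(list(combinations(variables, 2)))
print(clusters)
print(f"tracking {len(pairs)} of {all_pairs} variable pairs")
```

In this toy input, `i`, `n`, and `buf_size` end up in one cluster while `tmp` and `k` stay in singleton clusters, so relationships are tracked for only a fraction of all variable pairs. That reduction is the source of the speedup the abstract reports for the partial analysis.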
Notes
Because the pre-analysis uses \(\bigstar \) cautiously, only a small portion of variable pairs is marked with \(\oplus \) in our experiments (that is, 5,864 of 258,165,546). Replacing “some” by “all” reduces this portion by more than half (to 2,230 of 258,165,546) and makes the learning task more difficult.
By nontrivial, we mean finite bounds that are neither \(\infty \) nor \(-\infty \).
In practice, eliminating these false alarms is extremely challenging in a sound yet non-domain-specific static analyzer for full C. The false alarms arise for a variety of reasons, e.g., recursive calls, unknown library calls, and complex loops.
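The proportions quoted in the first note above can be checked with a few lines of arithmetic. The sketch below uses only the three counts given in that note; the variable names are illustrative.

```python
# Counts taken from the first note; names are illustrative.
total_pairs = 258_165_546   # all variable pairs in the codebase
positive_some = 5_864       # pairs labeled positive under the "some" variant
positive_all = 2_230        # pairs labeled positive under the "all" variant

frac_some = positive_some / total_pairs
frac_all = positive_all / total_pairs
ratio = positive_all / positive_some  # share of positives that survive

print(f"positive fraction ('some'): {frac_some:.6%}")
print(f"positive fraction ('all'):  {frac_all:.6%}")
print(f"'all' keeps {ratio:.1%} of the 'some' positives")
```

Both fractions are far below one percent, which illustrates how imbalanced the automatically generated training labels are.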
Acknowledgements
This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Numbers SRFC-IT1701-09. This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2015-0-00565, Development of Vulnerability Discovery Technologies for IoT Software Security, No. 2017-0-00184, Self-Learning Cyber Immune Technology Development).
Additional information
This work was carried out while Heo was at Seoul National University and Yang was at the University of Oxford.
Cite this article
Heo, K., Oh, H. & Yang, H. Learning analysis strategies for octagon and context sensitivity from labeled data generated by static analyses. Form Methods Syst Des 53, 189–220 (2018). https://doi.org/10.1007/s10703-017-0306-7
DOI: https://doi.org/10.1007/s10703-017-0306-7