
Learning analysis strategies for octagon and context sensitivity from labeled data generated by static analyses

Formal Methods in System Design

Abstract

We present a method for automatically learning, from a given codebase, an effective strategy for clustering variables for the Octagon analysis. The learned strategy works as a preprocessor for Octagon: given a program to be analyzed, the strategy is first applied to the program and clusters its variables; we then run a partial variant of the Octagon analysis that tracks relationships among variables within the same cluster, but not across different clusters. The notable aspect of our learning method is that, although it is based on supervised learning, it does not require manually-labeled data. The method does not ask a human to indicate which pairs of program variables in the given codebase should be tracked. Instead, it uses the impact pre-analysis for Octagon from our previous work to automatically label variable pairs in the codebase as positive or negative. We implemented our method on top of a static buffer-overflow detector for C programs and tested it on open-source benchmarks. Our experiments show that the partial Octagon analysis with the learned strategy scales up to 100KLOC and is 33\(\times \) faster than the one with the impact pre-analysis (which is itself significantly faster than the original Octagon analysis), while increasing false alarms by only 2%. The general idea behind our method is applicable to other kinds of static analysis as well: we demonstrate that it is also effective for learning a strategy for the context-sensitivity of an interval analysis.
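To make the workflow described in the abstract concrete, the following Python sketch shows one possible shape of such a learned clustering strategy. The feature extraction, the choice of a random-forest classifier, and the union-find clustering below are illustrative assumptions, not the paper's actual implementation; only the high-level flow, classifying variable pairs and merging positively classified pairs into clusters that the partial Octagon analysis then respects, mirrors the description above.

    # Illustrative sketch only: learn a classifier over variable pairs and
    # merge pairs predicted positive into clusters for a partial Octagon run.
    from sklearn.ensemble import RandomForestClassifier

    def train_pair_classifier(pair_features, pair_labels):
        # pair_features: one numeric feature vector per variable pair.
        # pair_labels: 1 if the (automatically generated) label says the
        # pair is worth tracking relationally, 0 otherwise.
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(pair_features, pair_labels)
        return clf

    def cluster_variables(variables, pairs, pair_features, clf):
        # Union-find: merge the two variables of every pair that the
        # classifier predicts positive; the resulting clusters are handed
        # to the partial Octagon analysis, which tracks relations only
        # within a cluster, not across clusters.
        parent = {v: v for v in variables}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]  # path halving
                v = parent[v]
            return v

        for (x, y), keep in zip(pairs, clf.predict(pair_features)):
            if keep:
                parent[find(x)] = find(y)

        clusters = {}
        for v in variables:
            clusters.setdefault(find(v), set()).add(v)
        return list(clusters.values())

In this sketch the classifier plays the role of the learned strategy, and union-find is one simple way to turn pairwise decisions into clusters; the partial Octagon analysis would then track a relation between two variables only when they end up in the same cluster.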


Notes

  1. Because the pre-analysis uses \(\bigstar \) cautiously, only a small portion of variable pairs is marked with \(\oplus \) (5864 out of 258,165,546 in our experiments). Replacing “some” by “all” reduces this portion by half (2230 out of 258,165,546) and makes the learning task more difficult (a hypothetical sketch of this labeling rule appears after these notes).

  2. By nontrivial, we mean bounds that are finite, i.e., neither \(\infty \) nor \(-\infty \).

  3. In practice, eliminating these false alarms is extremely challenging for a sound yet non-domain-specific static analyzer for full C. The false alarms arise for a variety of reasons, e.g., recursive calls, unknown library calls, and complex loops.
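The “some” versus “all” choice in Note 1 can be made concrete with a small hypothetical sketch of the automatic labeling step. The data layout and names below are assumptions for illustration only: a pair is labeled positive when the impact pre-analysis reports \(\oplus \) for some (or, in the stricter variant, all) of the queries that the pair affects.

    # Hypothetical sketch of automatic labeling from pre-analysis verdicts.
    # pre_verdicts maps each variable pair to the list of verdicts ('plus'
    # for an oplus mark, 'star' otherwise) of the queries the pair affects.
    def label_pairs(pre_verdicts, require_all=False):
        labels = {}
        for pair, verdicts in pre_verdicts.items():
            hits = [v == 'plus' for v in verdicts]
            # "Some" quantification labels more pairs positive than "all",
            # which is why switching to "all" shrinks the positive set.
            labels[pair] = bool(hits) and (all(hits) if require_all else any(hits))
        return labels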


Acknowledgements

This work was supported by the Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1701-09, and by Institute for Information & Communications Technology Promotion (IITP) grants funded by the Korea government (MSIT) (No. 2015-0-00565, Development of Vulnerability Discovery Technologies for IoT Software Security; No. 2017-0-00184, Self-Learning Cyber Immune Technology Development).

Author information

Correspondence to Hakjoo Oh or Hongseok Yang.

Additional information

This work was carried out while Heo was at Seoul National University and Yang was at the University of Oxford.


About this article


Cite this article

Heo, K., Oh, H. & Yang, H. Learning analysis strategies for octagon and context sensitivity from labeled data generated by static analyses. Form Methods Syst Des 53, 189–220 (2018). https://doi.org/10.1007/s10703-017-0306-7

