Learning analysis strategies for octagon and context sensitivity from labeled data generated by static analyses

Heo, Kihong; Oh, Hakjoo; Yang, Hongseok

doi:10.1007/s10703-017-0306-7

Published: 21 November 2017

Volume 53, pages 189–220, (2018)
Formal Methods in System Design

Abstract

We present a method for automatically learning an effective strategy for clustering variables for the Octagon analysis from a given codebase. This learned strategy works as a preprocessor of Octagon. Given a program to be analyzed, the strategy is first applied to the program and clusters the variables in it. We then run a partial variant of the Octagon analysis that tracks relationships among variables within the same cluster, but not across different clusters. The notable aspect of our learning method is that, although it is based on supervised learning, it does not require manually labeled data. The method does not ask a human to indicate which pairs of program variables in the given codebase should be tracked. Instead, it uses the impact pre-analysis for Octagon from our previous work and automatically labels variable pairs in the codebase as positive or negative. We implemented our method on top of a static buffer-overflow detector for C programs and tested it against open-source benchmarks. Our experiments show that the partial Octagon analysis with the learned strategy scales up to 100 KLOC and is 33\(\times\) faster than the one with the impact pre-analysis (which itself is significantly faster than the original Octagon analysis), while increasing false alarms by only 2%. The general idea behind our method is applicable to other types of static analyses as well. We demonstrate that our method is also effective for learning a strategy for context-sensitivity of interval analysis.
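
The clustering step described in the abstract can be illustrated with a minimal Python sketch. It assumes that a learned classifier has already predicted which variable pairs are positive, and it assumes that clusters are formed as the transitive closure of those pairs; the abstract does not state how clusters are built from pair predictions, so this is only one plausible reading. The function names `cluster_variables` and `tracked_pairs` and the sample variable names are hypothetical and do not come from the paper or its implementation.

```python
from collections import defaultdict


def cluster_variables(variables, predicted_pairs):
    """Group variables so that every predicted-positive pair shares a cluster.

    Assumption: clusters are the connected components of the graph whose
    edges are the predicted-positive pairs (computed here with union-find).
    """
    parent = {v: v for v in variables}

    def find(v):
        # Follow parent links to the representative, halving paths on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in predicted_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for v in variables:
        groups[find(v)].add(v)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())


def tracked_pairs(clusters):
    """Pairs a partial relational analysis would relate: same cluster only."""
    return {
        (x, y)
        for cluster in clusters
        for i, x in enumerate(cluster)
        for y in cluster[i + 1:]
    }


if __name__ == "__main__":
    # Hypothetical program variables and hypothetical classifier output.
    variables = ["i", "n", "buf_len", "tmp"]
    predicted = [("i", "n"), ("n", "buf_len")]
    clusters = cluster_variables(variables, predicted)
    print(clusters)
    print(sorted(tracked_pairs(clusters)))
```

In this sketch `tmp` ends up in a singleton cluster, so no relation involving it would be tracked, which mirrors the idea of relating variables within a cluster but not across clusters.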

Notes

Because the pre-analysis uses \(\bigstar\) cautiously, only a small portion of variable pairs is marked with \(\oplus\) (that is, 5864 / 258,165,546) in our experiments. Replacing “some” by “all” reduces this portion by half (2230 / 258,165,546) and makes the learning task more difficult.
By nontrivial, we mean finite bounds that are neither \(\infty \) nor \(-\infty \).
In practice, eliminating these false alarms is extremely challenging in a sound yet non-domain-specific static analyzer for full C. The false alarms arise from a variety of reasons, e.g., recursive calls, unknown library calls, complex loops, etc.

References

Blanchet B, Cousot P, Cousot R, Feret J, Mauborgne L, Miné A, Monniaux D, Rival X (2003) A static analyzer for large safety-critical software. In: PLDI
Breiman L (2001) Random forests. Machine Learning 45:5–32
Cousot P, Halbwachs N (1978) Automatic discovery of linear restraints among variables of a program. In: POPL
Garg P, Neider D, Madhusudan P, Roth D (2016) Learning invariants using decision trees and implication counterexamples. In: POPL, pp 499–512
Grigore R, Yang H (2016) Abstraction refinement guided by a learnt probabilistic model. In: POPL
Heo K, Oh H, Yang H (2016) Learning a variable-clustering strategy for octagon from labeled data generated by a static analysis. In: SAS
Jeannet B, Miné A (2009) Apron: a library of numerical abstract domains for static analysis. In: CAV
Mangal R, Zhang X, Nori AV, Naik M (2015) A user-guided approach to program analysis. In: ESEC/FSE, pp 462–473
Miné A (2006) The octagon abstract domain. Higher-Order Symb Comput 19:31–100
Mitchell TM (1997) Machine learning. McGraw-Hill Inc, New York
Murphy KP (2012) Machine learning: a probabilistic perspective (adaptive computation and machine learning series). MIT Press
Nori AV, Sharma R (2013) Termination proofs from tests. In: FSE, pp 246–256
Octeau D, Jha S, Dering M, McDaniel P, Bartel A, Li L, Klein J, Le Traon Y (2016) Combining static analysis with probabilistic models to enable market-scale android inter-component analysis. In: POPL, pp 469–484
Oh H, Heo K, Lee W, Lee W, Park D, Kang J, Yi K (2014) Global sparse analysis framework. ACM Trans Program Lang Syst 36(3):8:1–8:44. https://doi.org/10.1145/2590811
Oh H, Heo K, Lee W, Lee W, Yi K (2012) Design and implementation of sparse global analyses for C-like languages. In: PLDI
Oh H, Lee W, Heo K, Yang H, Yi K (2014) Selective context-sensitivity guided by impact pre-analysis. In: PLDI
Oh H, Yang H, Yi K (2015) Learning a strategy for adapting a program analysis via bayesian optimisation. In: OOPSLA
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830
Raychev V, Bielik P, Vechev MT, Krause A (2016) Learning programs from noisy data. In: POPL, pp 761–774
Sankaranarayanan S, Chaudhuri S, Ivančić F, Gupta A (2008) Dynamic inference of likely data preconditions over predicates by tree learning. In: ISSTA, pp 295–306
Sankaranarayanan S, Ivančić F, Gupta A (2008) Mining library specifications using inductive logic programming. In: ICSE, pp 131–140
Sharir M, Pnueli A (1981) Two approaches to interprocedural data flow analysis. Program flow analysis: theory and applications. Prentice-Hall, Englewood Cliffs, pp 189–234
Sharma R, Gupta S, Hariharan B, Aiken A, Liang P, Nori AV (2013) A data driven approach for algebraic loop invariants. In: ESOP, pp 574–592. https://doi.org/10.1007/978-3-642-37036-6_31
Sharma R, Gupta S, Hariharan B, Aiken A, Nori AV (2013) Verification as learning geometric concepts. In: SAS, pp 388–411
Sharma R, Nori AV, Aiken A (2012) Interpolants as classifiers. In: CAV, pp 71–87
Singh G, Püschel M, Vechev M (2015) Making numerical program analysis fast. In: PLDI
Sparrow: http://ropas.snu.ac.kr/sparrow
Venet A, Brat G (2004) Precise and efficient static array bound checking for large embedded C programs. In: PLDI
Yi K, Choi H, Kim J, Kim Y (2007) An empirical study on classification methods for alarms from a bug-finding static C analyzer. Inf Process Lett 102(2–3):118–123

Acknowledgements

This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1701-09. This work was also supported by Institute for Information & Communications Technology Promotion (IITP) grants funded by the Korea government (MSIT) (No. 2015-0-00565, Development of Vulnerability Discovery Technologies for IoT Software Security; No. 2017-0-00184, Self-Learning Cyber Immune Technology Development).

Author information

Authors and Affiliations

University of Pennsylvania, Philadelphia, PA, USA
Kihong Heo
Korea University, Seoul, Korea
Hakjoo Oh
KAIST, Daejeon, Korea
Hongseok Yang

Authors

Kihong Heo
Hakjoo Oh
Hongseok Yang

Corresponding authors

Correspondence to Hakjoo Oh or Hongseok Yang.

Additional information

This work was carried out while Heo was at Seoul National University and Yang was at the University of Oxford.

About this article

Cite this article

Heo, K., Oh, H. & Yang, H. Learning analysis strategies for octagon and context sensitivity from labeled data generated by static analyses. Form Methods Syst Des 53, 189–220 (2018). https://doi.org/10.1007/s10703-017-0306-7

Published: 21 November 2017
Issue Date: October 2018
DOI: https://doi.org/10.1007/s10703-017-0306-7
