Looking for non-compliant documents using error messages from multiple parsers
arXiv - CS - Other Computer Science. Pub Date: 2020-12-15, DOI: arxiv-2012.10211
Michael Robinson

Whether a file is accepted by a single parser is not a reliable indication of whether a file complies with its stated format. Bugs within both the parser and the format specification mean that a compliant file may fail to parse, or that a non-compliant file might be read without any apparent trouble. The latter situation presents a significant security risk, and should be avoided. This article suggests that a better way to assess format specification compliance is to examine the set of error messages produced by a set of parsers rather than a single parser. If both a sample of compliant files and a sample of non-compliant files are available, then we show how a statistical test based on a pseudo-likelihood ratio can be very effective at determining a file's compliance. Our method is format agnostic, and does not directly rely upon a formal specification of the format. Although this article focuses upon the case of the PDF format (ISO 32000-2), we make no attempt to use any specific details of the format. Furthermore, we show how principal components analysis can be useful for a format specification designer to assess the quality and structure of these samples of files and parsers. While these tests are absolutely rudimentary, it appears that their use to measure file format variability and to identify non-compliant files is both novel and surprisingly effective.
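
As a rough illustration of the idea in the abstract, the following Python sketch scores a file with a naive pseudo-likelihood ratio over error messages. It assumes that each file is represented by the set of messages that a collection of parsers emitted for it, that messages are treated as if they were independent (the simplification that makes this a pseudo-likelihood rather than a true likelihood), and that add-one smoothing is acceptable. The function names, the parser names, and the toy messages are all hypothetical, and this is not necessarily the exact statistic used in the article.

```python
import math
from collections import Counter


def message_rates(samples, vocabulary):
    """Smoothed probability that each message appears for a file of one class.

    `samples` is a list of sets of message strings, one set per file.
    Add-one (Laplace) smoothing keeps every rate strictly between 0 and 1.
    """
    counts = Counter(msg for sample in samples for msg in set(sample))
    n = len(samples)
    return {msg: (counts[msg] + 1) / (n + 2) for msg in vocabulary}


def log_pseudo_likelihood_ratio(observed, rates_bad, rates_good):
    """Log ratio of P(messages | non-compliant) to P(messages | compliant).

    Each message is treated as an independent present/absent indicator.
    Messages outside the training vocabulary are ignored.
    """
    total = 0.0
    for msg, p_good in rates_good.items():
        p_bad = rates_bad[msg]
        if msg in observed:
            total += math.log(p_bad / p_good)
        else:
            total += math.log((1.0 - p_bad) / (1.0 - p_good))
    return total


# Hypothetical training samples: one set of "parser: message" strings per file.
compliant = [
    set(),
    {"parserA: warning: unused object"},
    set(),
]
non_compliant = [
    {"parserA: xref table damaged", "parserB: unexpected EOF"},
    {"parserB: unexpected EOF"},
    {"parserA: xref table damaged"},
]

vocabulary = set().union(*compliant, *non_compliant)
rates_good = message_rates(compliant, vocabulary)
rates_bad = message_rates(non_compliant, vocabulary)

# Positive scores favour "non-compliant"; negative scores favour "compliant".
score_suspect = log_pseudo_likelihood_ratio(
    {"parserB: unexpected EOF"}, rates_bad, rates_good
)
score_clean = log_pseudo_likelihood_ratio(set(), rates_bad, rates_good)
```

In practice the decision threshold on the score (zero in this sketch) would be tuned on held-out files, since the independence assumption tends to make the raw ratio overconfident.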

Updated: 2020-12-21