The Discriminativeness of Internal Syntactic Representations in Automatic Genre Classification
Journal of Quantitative Linguistics (IF 0.761), Pub Date: 2019-09-26, DOI: 10.1080/09296174.2019.1663655
Mingyu Wan 1 , Alex Chengyu Fang 2 , Chu-Ren Huang 1

ABSTRACT

Genre characterizes a document differently from subject, which has been the focus of most document retrieval and classification applications. This work hypothesizes a close interaction between syntactic variation and genre differentiation by examining stylistic cues in functional and structural aspects beyond the word level. It engineers 14 syntactic feature sets of internal representations for genre classification with machine learning methods. Experimental results show that fusing structural and lexical features is significantly superior for genre classification (F∆max. = 9.2%, sig. = 0.001), suggesting that incorporating syntactic cues is effective for genre discrimination. In addition, principal component analysis (PCA) reports noun phrases (NP) as the leading principal component (66%) for genre variation and prepositional phrases (PP) as the second. In particular, noun phrases whose dominant structures are prepositional complements and pronouns functioning as subjects are most effective for identifying printed texts of high formality, while prepositional phrases are useful for identifying speeches of low formality. Error analysis suggests that the phrasal features are particularly useful for classifying four groups of genre classes, i.e. unscripted speech, fiction, news reports, and academic writing. Each group has distinct structural characteristics, and together they show an increasing degree of formality along the continuum of language complexity.
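The abstract reports its headline result as a gain in F-score when structural features are fused with lexical ones. The following is a minimal sketch of how such a gain could be computed, not the authors' implementation. It assumes a macro-averaged F1 (the abstract does not say which averaging was used), and the function name `macro_f1`, the genre labels, and the toy predictions are all invented for illustration.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight.

    Assumption: the abstract does not specify the averaging scheme;
    macro averaging is used here only as an illustration.
    """
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical, not from the paper): four genre groups.
gold = ["speech", "fiction", "news", "academic",
        "speech", "fiction", "news", "academic"]
# Hypothetical predictions from a lexical-only classifier.
lexical_pred = ["speech", "fiction", "news", "news",
                "speech", "news", "news", "academic"]
# Hypothetical predictions from a classifier using fused features.
fused_pred = ["speech", "fiction", "news", "academic",
              "speech", "fiction", "news", "news"]

# Gain in F-score from fusing structural and lexical features.
f_gain = macro_f1(gold, fused_pred) - macro_f1(gold, lexical_pred)
print(f"F-score gain from fusion: {f_gain:.3f}")
```

In a real experiment the two prediction lists would come from classifiers trained on the lexical-only and fused feature sets, and the gain would be checked for statistical significance, as the reported sig. value indicates.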




Updated: 2019-09-26