Probabilistic Modeling for Frequency Vectors Using a Flexible Shifted-Scaled Dirichlet Distribution Prior
ACM Transactions on Knowledge Discovery from Data (IF 4.0), Pub Date: 2020-09-29, DOI: 10.1145/3406242
Nuha Zamzami, Nizar Bouguila

The burstiness and overdispersion of count vectors pose significant challenges for modeling such data accurately. While the dependency assumption of the multinomial distribution causes it to fail at modeling frequency vectors in several machine learning and data mining applications, researchers have found that extending the multinomial distribution to the Dirichlet Compound Multinomial (DCM) allows both phenomena to be modeled. However, the Dirichlet distribution is not the best choice of prior, given its negative-correlation and equal-confidence requirements. Thus, we propose to use a flexible generalization of the Dirichlet distribution, namely the shifted-scaled Dirichlet, as a prior to the multinomial, which gives the model the capability to better fit real data; we call the new model the Multinomial Shifted-Scaled Dirichlet (MSSD). Given that the likelihood function plays a key role in statistical inference, e.g., in maximum likelihood estimation and in investigating the Fisher information matrix, we propose to improve the efficiency of computing the MSSD log-likelihood by approximating it with Bernoulli polynomials, with the log-likelihood function computed using the proposed mesh algorithm. Moreover, given the sparse and high-dimensional nature of count vectors, we propose to improve computational efficiency by approximating the novel MSSD as a member of the exponential family of distributions, which we call EMSSD. Clustering is based on mixture models, and, for learning, a model selection approach is seamlessly integrated with the estimation of the parameters. The merits of the proposed approach are validated via challenging real-world applications such as hate speech detection on Twitter, real-time recognition of criminal actions, and anomaly detection in crowded scenes.
The results show that the proposed clustering frameworks offer a good compromise among state-of-the-art techniques and outperform other approaches previously used for modeling frequency vectors. In addition, compared to the MSSD, the EMSSD approximation reduces the computational complexity in high-dimensional feature spaces.
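
For reference, the DCM baseline mentioned in the abstract has a well-known closed-form log-likelihood. The following is a minimal Python sketch of that standard formula only; it is not the MSSD or EMSSD proposed in the paper, and the function name `dcm_log_likelihood` and the example parameter values are illustrative assumptions.

```python
from math import exp, lgamma


def dcm_log_likelihood(counts, alpha):
    """Log-probability of a count vector under the Dirichlet compound multinomial.

    counts: non-negative integer counts, one per category.
    alpha:  positive Dirichlet parameters, one per category.
    (Both argument names are hypothetical, chosen for this sketch.)
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    if any(x < 0 for x in counts) or any(a <= 0 for a in alpha):
        raise ValueError("counts must be >= 0 and alpha must be > 0")

    n = sum(counts)        # total number of draws
    a_sum = sum(alpha)     # sum of the Dirichlet parameters

    # log of the multinomial coefficient n! / prod(x_d!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # log Gamma(A) - log Gamma(A + n)
    log_norm = lgamma(a_sum) - lgamma(a_sum + n)
    # sum over categories of log Gamma(x_d + alpha_d) - log Gamma(alpha_d)
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_coef + log_norm + log_terms


# Illustrative values (assumed): two categories, two draws.
p_uniform = exp(dcm_log_likelihood([1, 1], [1.0, 1.0]))
p_bursty = exp(dcm_log_likelihood([2, 0], [0.1, 0.1]))
```

With small parameters such as (0.1, 0.1), the count vector (2, 0) receives a probability of about 0.458, compared with 0.25 under a multinomial with equal category probabilities. This is the sense in which the DCM accommodates burstiness: once a category has appeared, it is more likely to appear again.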

Updated: 2020-09-29