A machine learning, bias-free approach for predicting business success using Crunchbase data
Information Processing & Management ( IF 7.4 ) Pub Date : 2021-03-06 , DOI: 10.1016/j.ipm.2021.102555
Kamil Żbikowski , Piotr Antosiuk

Predicting the success of a business venture has always been a struggle for both practitioners and researchers. However, thanks to companies that aggregate data about other firms, it has become possible to create and validate predictive models based on an unprecedented amount of real-world examples. In this study, we use data obtained from one of the largest platforms integrating business information – Crunchbase. Our final training set consisted of 213 171 companies.

This work aims to create a machine learning model that forecasts a company's success. Many similar attempts have been made in recent years, and many of those experiments, often conducted on data gathered from several different sources, reported promising results. However, we found that they were very often significantly biased, because they used data containing information that was a direct consequence of a company reaching some level of success (or failure). Such an approach is a classic example of look-ahead bias. It leads to very optimistic test results, but using it in a real-world scenario can have severe consequences. We designed our experiments so that no information unavailable at the moment of decision could leak into the training set.
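The idea of keeping post-decision information out of the training set can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Fact` record, the `features_at` helper, the field names, and the sample data are all hypothetical, and the sketch assumes every fact carries the date on which it became observable.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """One hypothetical piece of information about a company."""
    company: str
    name: str
    value: str
    known_on: date  # assumed: the date the fact became observable


def features_at(facts, company, decision_date):
    """Keep only facts about `company` observable on or before `decision_date`."""
    return {
        f.name: f.value
        for f in facts
        if f.company == company and f.known_on <= decision_date
    }


# Invented sample data, for illustration only.
facts = [
    Fact("acme", "country", "PL", date(2015, 1, 10)),
    Fact("acme", "industry", "fintech", date(2015, 1, 10)),
    # A later funding round is a consequence of success, so it must not be
    # visible to a model that decides at the end of 2015.
    Fact("acme", "series_b_raised", "yes", date(2018, 6, 1)),
]

snapshot = features_at(facts, "acme", date(2015, 12, 31))
print(snapshot)
```

Filtering by observation date, rather than by feature name, is one way to make the cutoff explicit; a feature that is harmless for one decision date can leak the outcome for an earlier one.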

We compared three algorithms: logistic regression, support vector machine, and the gradient boosting classifier. Despite the conscious decision to limit the number of predictors, we obtained promising results in terms of precision, recall, and F1 score, which for the best model were 57%, 34%, and 43%, respectively. The best outcomes were obtained with the gradient boosting classifier. We give detailed information about the importance of different features; the top three are the country and the region in which the company operates, and the company's industry. Our model can be applied directly as a decision support system for different types of venture capital funds.
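As a quick consistency check on the reported figures, the F1 score is the harmonic mean of precision and recall. The sketch below uses the rounded percentages from the abstract as inputs, so it can only confirm agreement to two decimal places; the function name `f1_score` here is a local helper, not a library call.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported in the abstract for the best model.
f1 = f1_score(0.57, 0.34)
print(f"{f1:.2f}")
```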




Updated: 2021-03-07