Abstract
Data mining, data analytics, and data processing are three interrelated processes carried out on large volumes of data. The data can take any form, such as text, numeric, ontology, alphanumeric, image, video, and other multi-dimensional datasets. Data contributed by people is among the most widely used of these. Crowdsourcing relies on people to handle problems that involve large volumes of data. Collecting input from a large group of people and analyzing it is an emerging technology that introduces a new model for big data mining. Data mining is a traditional process that experts in the analytics domain use to characterize the nature of data, but it is expensive and takes a long time to complete. Crowdsourcing has become a very active area in both industry and research. It recruits smartphone users as volunteers who share their annotations for different types of contributions. This paper reviews recent work on big data mining from crowdsourcing. It examines the opportunities and challenges that crowdsourcing presents for data analytics and summarizes the data analytics framework. It then discusses several algorithms, covering applications, cost control, quality control, latency control, and the big data mining framework, all of which must be considered in the field of crowdsourcing. Finally, the paper concludes with the limitations of data mining and offers some suggestions for future research in crowdsourced data analytics.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Cite this article
Dhinakaran, K., Nedunchelian, R. & Balasundaram, A. Crowdsourcing: Descriptive Study on Algorithms and Frameworks for Prediction. Arch Computat Methods Eng 29, 357–374 (2022). https://doi.org/10.1007/s11831-021-09577-8
DOI: https://doi.org/10.1007/s11831-021-09577-8