Crowd versus Experts Forecasting Technologies
Impact of Collective Diversity & Size on Collective Performance
Abstract
For centuries, Homo sapiens have tried to predict the future through supernatural or scientific methods. Since prehistory, this ability has been essential to humans (e.g., anticipating prey and then ambushing it). Even in today’s society, predicting the future remains essential in many industries and research domains. However, we are still far from producing flawless forecasts (e.g., the weather) because future events are uncertain. When decisions involve uncertainty, governments, organizations, and individuals alike tend to be interested in the advice of others.
One such interesting case is predicting the outcome of a standard battle. In such a battle, high-tech firms compete to capture the most customers in a given market with their technological inventions. To predict which technology will become the market standard, experts are independently interviewed to determine the importance (i.e., weights) of the factors that can influence the battle. In a second round of interviews, the experts assign a value to each of these factors for every competing technology, resulting in a performance grade used to make the prediction. This prediction indicates which firm/technology will likely gain the upper hand in the market. The factor weights are derived from the list of factors in combination with the Best-Worst Method (BWM), which frames the prediction as a multiple-criteria decision-making (MCDM) problem (Rezaei, 2015a; van de Kaa, van den Ende, de Vries, & van Heck, 2011).
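To make the weighting step concrete, the following is a minimal Python sketch of the linear variant of the BWM, which derives factor weights from a respondent's best-to-others and others-to-worst comparison vectors by solving a small linear program. The function name, the example factors, and the preference values are illustrative assumptions, not the instrument used in this study.

```python
import numpy as np
from scipy.optimize import linprog

def bwm_weights(a_best, a_worst, best, worst):
    """Factor weights via the linear Best-Worst Method (illustrative sketch).

    a_best[j]  : preference of the best factor over factor j (1-9 scale)
    a_worst[j] : preference of factor j over the worst factor (1-9 scale)
    best/worst : indices of the best and worst factor
    Solves: min xi  s.t.  |w_best - a_best[j]*w_j| <= xi,
                          |w_j - a_worst[j]*w_worst| <= xi,
                          sum(w) = 1, w >= 0.
    Returns (weights, xi); a smaller xi indicates a more consistent respondent.
    """
    n = len(a_best)
    c = np.zeros(n + 1)
    c[-1] = 1.0                                  # objective: minimise xi
    A_ub, b_ub = [], []
    for j in range(n):
        for sign in (1.0, -1.0):                 # two rows per absolute value
            row = np.zeros(n + 1)
            row[best] += sign
            row[j] -= sign * a_best[j]
            row[-1] = -1.0
            A_ub.append(row); b_ub.append(0.0)
            row = np.zeros(n + 1)
            row[j] += sign
            row[worst] -= sign * a_worst[j]
            row[-1] = -1.0
            A_ub.append(row); b_ub.append(0.0)
    A_eq = [np.append(np.ones(n), 0.0)]          # weights must sum to one
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1))
    return res.x[:n], res.x[-1]

# Three hypothetical factors; factor 0 is best, factor 2 is worst
weights, xi = bwm_weights(a_best=[1, 4, 8], a_worst=[8, 2, 1], best=0, worst=2)
print(weights, xi)   # approx. [0.727, 0.182, 0.091], xi near 0 (consistent input)
```

Multiplying each technology's factor grades by such weights then yields the overall performance grade used for the prediction.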
Multiple studies have expressed concern that it is challenging to find and persuade experts willing to participate in such interviews. Instead of seeking a better way to approach experts, this study focused on another solution not previously applied to predicting standard battles. Hence, the objective was to understand, test, and examine how the Collective Intelligence (CI) of a crowd (i.e., a group of random individuals) performs compared to experts. The idea of CI is that it does not reside in any individual but emerges from the group. When people's opinions are combined, the resulting advice should be more accurate than, or at least comparable to, that of a typical expert.
In other words, this quantitative exploratory study investigated whether CI differs from expert judgment when predicting standard battles. A literature review was conducted to provide deeper insight into the factors that influence CI. The study explored the underlying mechanism of CI and established a conceptual model based on the theoretical background, which describes the (moderating) relationships between the ‘Diversity’ (DIV), ‘Group Size’ (GS), and ‘Performance’ (PERF) of the crowd.
The variable DIV was measured from differences in gender, age, degree, job, and nationality and expressed with Simpson’s index, which reflects the number of distinct categories and their distribution (Simpson, 1949). For GS, the only attribute measured was the number of people in a contrived group. Further, the PERF of a collective is defined as the quantifiable difference between its solution and the prediction proposed by the experts (Wagner et al., 2010); this measure is dubbed ‘Relative Performance’ (RP) for the rest of this paper.
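As an illustration, a short Python sketch of how a group's DIV score and RP might be computed. Simpson's index follows the standard 1 − Σ p_i² form; how the per-attribute indices are combined into a single DIV score (a plain average below) and the distance used for RP (sum of absolute weight differences) are assumptions for illustration only.

```python
from collections import Counter

def simpson_diversity(labels):
    """Simpson's index D = 1 - sum(p_i^2), with p_i the share of category i.
    D is 0 when all members are identical and approaches 1 with more diversity."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def relative_performance(group_weights, expert_weights):
    """RP as the total absolute difference between the group's aggregated
    factor weights and the experts' weights (hypothetical operationalisation)."""
    return sum(abs(g - e) for g, e in zip(group_weights, expert_weights))

# Hypothetical group profile: (gender, degree, nationality) per member
group = [("F", "BSc", "NL"), ("M", "MSc", "DE"), ("M", "PhD", "NL")]
per_attribute = list(zip(*group))                 # one tuple of values per attribute
div = sum(simpson_diversity(a) for a in per_attribute) / len(per_attribute)

rp = relative_performance([0.5, 0.3, 0.2], [0.6, 0.25, 0.15])
print(round(div, 3), round(rp, 3))
```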
Prior research on standard battles was selected to test and validate the assertions of this study. The selection was based on several criteria, such as the outcome of the battle having been predicted by experts while the ground truth remains unknown. This made it possible to validate whether the crowd performs differently from experts in prediction tasks. The selected study involved the battle between two wind turbine technologies (WTT), dubbed ‘Gearbox’ (GB) and ‘Direct Drive’ (DD) (van de Kaa et al., 2020). Further, to interview the crowd, the traditional BWM questionnaire was converted into a cross-sectional online survey distributed via MTurk. Two hundred respondents completed the survey, of which 137 remained after pre-qualification.
In this research, ‘groups’ were contrived from the sample that completed the prediction task. Each group member employed their own judgment to carry out the given task; in other words, all members performed the same activity individually and were then randomly pooled into groups of varying sizes. Next, simple random sampling was carried out twice. The first round resulted in five groups with sizes of 5, 10, 20, 30, and 40. A second round was required because the first did not provide a good range of DIV scores. To overcome this limitation, the sample was resampled multiple times, yielding multiple subgroups, each with its own DIV score. This resulted in six group-size classes of 5, 10, 20, 30, 40, and 137 members, with respectively 26, 13, 6, 4, 3, and 1 subgroups for comparison; a sketch of this partitioning step follows below.
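A minimal sketch of the partitioning, assuming subgroups are formed by shuffling the respondent pool and slicing it into disjoint groups of a fixed size (the seed and helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2021)        # fixed seed for reproducibility (assumption)
respondents = np.arange(137)             # indices of the 137 qualified respondents

def partition(pool, size, rng):
    """Shuffle the pool and slice it into disjoint subgroups of `size`,
    dropping any remainder that cannot fill a complete subgroup."""
    shuffled = rng.permutation(pool)
    return [shuffled[i * size:(i + 1) * size] for i in range(len(pool) // size)]

# One resampling pass per size class; repeating passes widens the DIV range
subgroups = {size: partition(respondents, size, rng)
             for size in (5, 10, 20, 30, 40)}
subgroups[137] = [respondents]           # the full sample as the largest class
```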
After sampling the data, the variables were tested for normality and homoscedasticity. The results indicate that the independent variables (i.e., DIV, GS) do not satisfy the normality assumption, so the variables are, strictly speaking, not suited for parametric testing. Nevertheless, both non-parametric and parametric tests were applied, because ANOVA is not very susceptible to modest deviations from normality: various studies using a range of non-normal distributions concluded that the false-positive rate is barely affected when the normality assumption is violated (Glass et al., 1972; Harwell et al., 1992; Lix et al., 1996). In addition, more than two groups had to be compared. Hence, the One-Way ANOVA Test (OWAT) and the Kruskal-Wallis Test (KWT) were selected to investigate the relationships between the variables (DIV, GS, RP), and more specifically the effects that GS or DIV can have on the RP of the crowd. Due to systematic limitations of simple random sampling, Bootstrapping (BOOT) and Monte Carlo (MC) resampling were also performed for the OWAT and KWT, respectively, to investigate the relationship between DIV and RP. Lastly, the moderation effect was tested with the Linear Regression (LR) method; this pipeline is sketched below.
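The analysis pipeline can be sketched as follows with SciPy and statsmodels; the data frame is filled with synthetic placeholder values (53 rows, one per subgroup) purely so the sketch runs, and stands in for the real RP, DIV, and GS scores:

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Placeholder frame: one row per subgroup (26 + 13 + 6 + 4 + 3 + 1 = 53)
df = pd.DataFrame({
    "RP":  rng.normal(0.5, 0.1, 53),
    "DIV": rng.uniform(0.3, 0.9, 53),
    "GS":  rng.choice([5, 10, 20, 30, 40], 53),
})

# 1. Normality check (Shapiro-Wilk); small p-values argue against normality
for col in ("RP", "DIV", "GS"):
    print(col, stats.shapiro(df[col]).pvalue)

# 2. Parametric (one-way ANOVA) vs non-parametric (Kruskal-Wallis):
#    does RP differ across the group-size classes?
by_size = [g["RP"].to_numpy() for _, g in df.groupby("GS")]
print(stats.f_oneway(*by_size))
print(stats.kruskal(*by_size))

# 3. Bootstrap confidence interval for the mean RP of one size class,
#    guarding against artefacts of the simple random sampling
print(stats.bootstrap((by_size[0],), np.mean).confidence_interval)

# 4. Moderation via linear regression: `GS * DIV` adds the interaction term
print(smf.ols("RP ~ GS * DIV", data=df).fit().summary())
```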
The results showed neither significant differences between DIV and RP nor any significant moderation effect. Hence, the initial hypothesis that a more diverse group of individuals would perform better was refuted. In addition, this research rejects the proposition that the relationship between GS and RP is positively moderated by how diverse the crowd is. Consequently, this research could not conclude how these variables affect the PERF of the crowd. Nevertheless, these results weaken the theory of S. Krause et al. (2011), Nguyen et al. (2018), and Surowiecki (2005), who underline the importance of DIV. In contrast, the results give ample support to Reynolds et al. (2017) and their claim that there is no correlation between DIV and PERF. However, their second assertion, that a relationship exists between cognitive DIV and PERF, was not tested; it is recommended that future studies investigate this relationship.
The findings did show a significant difference between groups for the variable GS in the case of the GB WTT. Namely, as group size increases, PERF increases proportionately up to an upper limit, dubbed the ‘Optimal Group Size’ (OGS), beyond which it declines again, yielding a U-shaped relationship between the variables. However, the OWAT and KWT provide contradicting findings. Findings from the OWAT suggest that the OGS consists of 15 people and that there is indeed a U-shaped relationship between GS and RP, supporting the claim of Hashmi (2005) and our hypothesis. In contrast, the KWT indicates a relatively linear relationship, in which the groups of 10 and 20 performed significantly better than the group of 30, and places the OGS at 10 rather than 15 individuals; the OWAT, however, showed the group of 30 performing significantly better than the groups of 10 and 20. Taken together, the parametric and non-parametric results for the location of the OGS indicate that a group of 10 or 15 people outperforms both smaller and larger groups, supporting the claim of Carvalho et al. (2016) and weakening that of S. Krause et al. (2011).
To conclude, how CI operated in this research was assessed on the basis of survey completion time, consistency ratio, selection of the best and worst criteria, PERF grade, and the final prediction. On these grounds, this research concluded that the CI of the crowd did differ from the expert pool in predicting the outcome of a standard battle, primarily because the crowd's PERF score was two-thirds lower than that of the experts. Still, the crowd performed similarly or better in some respects. This research tested only one case, which limits our insight into whether this outcome occurred by chance. In addition, the main goal of this study was to investigate whether the crowd could arrive at the exact prediction, meaning that both the conclusion and the process leading to it (e.g., criteria selection and grading the WTTs) should be similar to those of the experts; this was not the case. Furthermore, the consistency ratio remains a matter of doubt, since most of the results were initially unreliable and required (logical) corrections to become reliable, making any comparison based on the consistency ratio shaky. To reiterate, the crowd performed differently from the experts when predicting the outcome of the WTT battle; despite these differences, individuals and groups arrived at a prediction similar to that of the experts.