Large-scale data generated by crowds provide myriad opportunities for monitoring and modeling people's intentions, preferences, and opinions. A crucial step in analyzing such "big data" is identifying the relevant data items that should be provided as input to the modeling process. Despite its importance, this step has received limited attention in previous research. In this paper, we offer a novel crowd-based method to address this data selection problem. We label the method "crowd-squared," as it leverages crowds to identify the most relevant elements in large-scale crowd-generated data. We empirically tested this data selection method in two domains and found that it yields predictions that are equivalent or superior both to those obtained in previous studies (using alternative data selection methods) and to those obtained using various benchmark data selection methods. These results emphasize the importance of the data selection stage in the prediction process and demonstrate the utility of the crowd-squared approach.
Session 5A, paper #3