Forecasting future events, or whether a particular industry trend will come to dominate a market, is challenging. A research team from UC Berkeley, MIT, the University of Illinois, and the University of Oxford recently presented Autocast, a dataset containing thousands of forecasting questions and an accompanying date-based news corpus for evaluating the automatic forecasting abilities of neural network models. They also curated IntervalQA, a dataset of numerical questions with accompanying calibration metrics. Both datasets are introduced in the paper Forecasting Future World Events with Neural Networks.
According to the researchers, forecasting comes in two types. In statistical forecasting, predictions are made with ML time-series models or more conventional statistical time-series methods such as autoregression. Humans build and tune these models, but individual forecasts are not manually adjusted. This works well when the variable being forecast has many prior observations and little distribution shift. In judgmental forecasting, by contrast, human forecasters produce predictions based on their own judgment. They frequently integrate information from a variety of sources, such as news, common sense, general knowledge, and reasoning, and may also draw on statistical models. This type of forecasting is used when historical data are scarce.
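To make the statistical side concrete, here is a minimal sketch of autoregressive forecasting: an AR(1) model fit by least squares on a synthetic series with many prior observations and a stable distribution, the regime where the article says statistical forecasting works well. The synthetic data and parameter values are illustrative assumptions, not from the paper.

```python
import numpy as np

# AR(1): x_t = c + phi * x_{t-1} + noise.
# Generate a synthetic series, then recover c and phi by least squares.
rng = np.random.default_rng(0)
n = 500
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = 1.0 + 0.8 * x[t - 1] + rng.normal(scale=0.5)

# Regress x_t on [1, x_{t-1}] to estimate the intercept c and coefficient phi.
A = np.column_stack([np.ones(n - 1), x[:-1]])
c, phi = np.linalg.lstsq(A, x[1:], rcond=None)[0]

# One-step-ahead point forecast from the last observed value.
forecast = c + phi * x[-1]
print(f"estimated phi = {phi:.2f}, next-step forecast = {forecast:.2f}")
```

With enough observations, the fitted coefficient recovers the true value (0.8 here) closely; judgmental forecasting is needed precisely when no such history exists.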
Until now, forecasting has been employed in only a few areas, since it depends on scarce human expertise. This motivated the researchers to leverage ML to automate forecasting, for example by automating human reasoning, quantitative modeling, and information retrieval. Compared to human forecasters, ML models could offer certain benefits: they can parse and comprehend data rapidly and find patterns in noisy, high-dimensional data where human intuition and skill may not suffice. Moreover, knowing how certain historical events turned out can bias human reasoning; on historical data, ML models can instead base predictions on data patterns rather than on an inclination toward outcomes already known from the record.
The team enumerates their key contributions as follows:
- Introducing Autocast, a forecasting dataset with a wide range of topics (such as politics, economics, society, and science) and time horizons.
- A standout feature of the dataset is a substantial news corpus arranged by date, enabling exhaustive evaluation of model performance on historical forecasts.
- Showing that current language models struggle with forecasting, with accuracy and calibration well below a reliable human baseline.
The team assembled 6,707 forecasting questions from three open forecasting competitions (Metaculus, Good Judgment Open, and CSET Foretell) to create their Autocast dataset. These questions typically have broad public interest (such as national elections as opposed to municipal polls) and clear resolution criteria. The questions are multiple-choice, true/false, or ask for a numerical or date prediction.
Participants in these forecasting competitions start forecasting a question on a specific day (the “start date”), and then revise it several times until the “close date.” The forecast is resolved at a later time, and participants are graded according to all of their projections.
The resolution date usually, though not invariably, falls immediately after the close date. Resolution can also occur before the scheduled close date, as when predicting the timing of an event. The time series of projections from the start date to the close date, aggregated over participants, constitutes the "crowd" forecast. Autocast includes the question, the start and close dates, the question's resolution, the answer, and the time series of crowd forecasts.
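The fields described above can be sketched as a simple record. This is a hypothetical schema mirroring the text, not the actual Autocast file format, whose field names may differ.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical record for one Autocast-style question; the real dataset's
# schema and field names are assumptions here.
@dataclass
class ForecastQuestion:
    question: str
    qtype: str                                  # "t/f", "mc", or "num"
    start_date: date
    close_date: date
    resolution: str                             # resolved outcome
    crowd: list = field(default_factory=list)   # (date, probability) time series

q = ForecastQuestion(
    question="Will event X occur by 2022-12-31?",
    qtype="t/f",
    start_date=date(2022, 1, 1),
    close_date=date(2022, 6, 30),
    resolution="yes",
    crowd=[
        (date(2022, 1, 1), 0.40),
        (date(2022, 3, 1), 0.55),
        (date(2022, 6, 30), 0.70),
    ],
)

# The crowd forecast is a time series; the last entry is the crowd's
# final aggregated belief before the question closed.
final_prob = q.crowd[-1][1]
print(final_prob)  # 0.7
```

Because participants are graded on all their projections, the whole `crowd` series matters, not just the final value.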
To determine whether retrieval-based methods could enhance performance by selecting appropriate articles from Autocast's news corpus, the researchers first examined the QA model UnifiedQA-v2 (Khashabi et al., 2022) and the text-to-text framework T5 (Raffel et al., 2020) without retrieval. These models are trained on a variety of tasks and generalize well to many unseen language problems. For UnifiedQA, the team reported results on classification questions using zero-shot prompting; since the UnifiedQA models were not trained on numerical questions, they reported random performance on those to allow comparison with the other baselines. T5, meanwhile, was fine-tuned for true/false and multiple-choice questions using its original output head, and the team added a linear output head to T5 to produce numerical responses.
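As a rough illustration of the zero-shot setup, a true/false question can be serialized into a single UnifiedQA-style input string with the answer options appended. The exact template the authors used is an assumption here.

```python
# Sketch of zero-shot prompt construction for a true/false question.
# UnifiedQA-style inputs are typically lowercased, with answer options
# listed on a separate line; the precise template is an assumption.
def format_tf_prompt(question: str) -> str:
    return f"{question.strip().lower()}\n(a) yes (b) no"

prompt = format_tf_prompt("Will event X occur by 2022-12-31?")
print(prompt)
```

In practice the formatted string would be fed to the model's generate method, and the produced option compared against the question's resolution.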
For retrieval, the team used a Fusion-in-Decoder (FiD; Izacard and Grave, 2021) model to encode articles retrieved with the lexical search method BM25 (Robertson et al., 1994; Thakur et al., 2021) and reranked with a cross-encoder. The frozen, fine-tuned FiD model creates an embedding of each day's top news article between a question's open and close dates, then feeds these embeddings to an autoregressive large language model such as GPT-2. The team explains that FiD can be seen as a simple extension of T5 for incorporating retrieval, because it uses T5 to encode the retrieved passages together with the question.
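The BM25 step is purely lexical: documents are scored by term overlap with the question, weighted by inverse document frequency and length-normalized. Below is a minimal self-contained Okapi BM25 scorer for illustration; a real pipeline would use a tuned library implementation and then rerank the top hits with a cross-encoder, as the paper describes.

```python
import math
from collections import Counter

# Minimal Okapi BM25 scorer over whitespace-tokenized documents.
# k1 and b are the standard default-ish hyperparameters.
def bm25_scores(query, docs, k1=1.5, b=0.75):
    toks = [d.lower().split() for d in docs]
    N = len(toks)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()                      # document frequency per term
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)                 # term frequency in this document
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1)
            norm = 1 - b + b * len(t) / avgdl
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * norm)
        scores.append(s)
    return scores

# Toy corpus standing in for one day's news articles (illustrative only).
docs = [
    "central bank raises interest rates again",
    "local sports team wins championship",
    "analysts expect interest rates to rise next year",
]
scores = bm25_scores("interest rates", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # the shorter, on-topic article scores highest
```

Length normalization is why the first article beats the third despite identical term matches: BM25 mildly penalizes longer documents.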
The results reveal that the retrieval-based techniques trained on Autocast significantly outperform UnifiedQA-v2 and T5, and their performance improves as the number of parameters rises. This suggests that larger models learn to extract relevant information from retrieved articles better than smaller ones.
Overall, the study demonstrates that retrieving from a sizable news corpus can effectively train language models on past forecasting questions.
Although the results remain below the human-expert baseline, performance can be improved by scaling up the model and strengthening information retrieval. The team believes that Autocast's approach to letting large language models predict future world events will have major practical advantages in a variety of applications.
The group also pointed out that the numerical answers in the Autocast training set span several orders of magnitude, and that Autocast contains fewer than 1,000 numerical training questions. Prior work on calibration for language models has not addressed calibrating predictions over values spanning several orders of magnitude from text inputs. The team therefore compiled IntervalQA, an additional dataset of numerical estimation problems, and proposed metrics to gauge calibration. The dataset's problems involve making calibrated predictions of fixed numerical quantities rather than forecasting.
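One simple way to gauge interval calibration over such wide-ranging quantities is to have the model emit a confidence interval per question and measure how often the true value falls inside it, comparing in log space so that magnitudes are treated uniformly. This metric is an illustrative sketch, not the paper's exact definition.

```python
import math

# Fraction of true values that fall inside their predicted intervals,
# compared in log10 space (all quantities assumed positive).
# For nominal 80% intervals, a calibrated model should score near 0.8.
def coverage(intervals, truths):
    hits = 0
    for (lo, hi), y in zip(intervals, truths):
        if math.log10(lo) <= math.log10(y) <= math.log10(hi):
            hits += 1
    return hits / len(truths)

# Three hypothetical predicted intervals and the corresponding true values.
intervals = [(1e3, 1e5), (2.0, 8.0), (1e7, 1e8)]
truths = [4.2e4, 11.0, 5e7]
print(coverage(intervals, truths))  # 2 of 3 true values fall inside
```

A model whose nominal 80% intervals cover only two-thirds of true values is overconfident; the calibration metric makes that gap measurable.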
The questions were drawn from the following NLP datasets: SQuAD, 80K Hours Calibration (80k, 2013), GSM8K grade-school arithmetic (Cobbe et al., 2021), TriviaQA (Joshi et al., 2017), Jeopardy, MATH (Hendrycks et al., 2021b), and MMLU (Hendrycks et al., 2021a). After filtering these datasets for questions with numerical answers, the researchers obtained roughly 30,000 questions.
The Autocast dataset and code are available on the project’s GitHub.