Weather forecasting is critical across many fields. Traditionally, weather predictions have relied on solving complex atmospheric equations on supercomputers. The rise of machine learning has introduced a more efficient alternative that uses historical data to forecast weather patterns. Historical weather data, however, is auto-correlated: the value at a given time point correlates with the values at previous time points. Machine learning models typically assume that data points are independent, so splitting the data improperly can lead to overfitting. This happens because the training data may not be independent of the test data, and a model can learn the noise rather than the underlying pattern in the data. Auto-correlation also arises in spatial data, that is, data collected at multiple locations, because nearby locations share similar characteristics.
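As an illustration of the auto-correlation described above, the following minimal sketch estimates the lag-k sample auto-correlation of a series. The function name `lag_autocorrelation` and the two synthetic series are assumptions made for illustration only; they are not data or code from this thesis.

```python
import random


def lag_autocorrelation(series, lag=1):
    """Sample auto-correlation of `series` at the given positive lag."""
    n = len(series)
    if not 1 <= lag < n:
        raise ValueError("lag must satisfy 1 <= lag < len(series)")
    mean = sum(series) / n
    centred = [value - mean for value in series]
    denominator = sum(c * c for c in centred)
    if denominator == 0:
        return 0.0  # constant series: define the auto-correlation as zero
    numerator = sum(centred[t] * centred[t + lag] for t in range(n - lag))
    return numerator / denominator


# Synthetic data (hypothetical): independent noise versus a persistent
# series in which each value depends strongly on the previous one.
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(2000)]
persistent = [0.0]
for t in range(1, 2000):
    persistent.append(0.9 * persistent[-1] + noise[t])

print("lag-1 auto-correlation, independent noise:", lag_autocorrelation(noise))
print("lag-1 auto-correlation, persistent series:", lag_autocorrelation(persistent))
```

A value near zero suggests that neighbouring observations carry little information about each other, whereas a value near one means that a randomly drawn test point is likely to have a very similar training point next to it in time.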
To address this challenge, this thesis investigates temporal and spatial data splitting strategies and determines which of them provide the most reliable performance estimates. Temporal splitting strategies divide the data by time interval, while spatial splitting strategies divide it by the geographical location of the weather stations. The goal is to identify strategies that minimize bias and variance in the error estimates of weather forecasting models, which reduces overfitting and supports robust, generalisable predictions across different climatic conditions and data regimes.
By systematically evaluating various data splitting strategies, this thesis aims to provide insight into the best approaches for preparing meteorological data for machine learning-based weather prediction models, thereby contributing to more accurate and efficient weather forecasting techniques. Four temporal splitting strategies were evaluated: random splitting, testing at the end of the dataset, testing in the middle of the dataset, and splitting the test year into multiple parts. Three spatial splitting strategies were evaluated: random splitting, using one cluster of cities as the test data with all other clusters as training data, and using one cluster as the test data with only the non-neighbouring clusters as training data. Some of these strategies, such as random splitting, do not take auto-correlation into account when splitting the data.
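The four temporal strategies listed above can be sketched as index selections over a time-ordered dataset. This is a minimal sketch under stated assumptions, not the implementation used in this thesis: the function names, the strategy labels, the `n_parts` parameter, and the choice to centre each test block within its segment are all hypothetical.

```python
import random


def temporal_test_indices(n, test_size, strategy, n_parts=4, seed=0):
    """Return sorted test indices for a time-ordered dataset of length n."""
    if not 0 < test_size < n:
        raise ValueError("test_size must satisfy 0 < test_size < n")
    if strategy == "random":
        # Ignores time order, and therefore ignores auto-correlation.
        return sorted(random.Random(seed).sample(range(n), test_size))
    if strategy == "end":
        # One contiguous block at the end of the timeline.
        return list(range(n - test_size, n))
    if strategy == "middle":
        # One contiguous block centred in the timeline.
        start = (n - test_size) // 2
        return list(range(start, start + test_size))
    if strategy == "multi_block":
        # Assumption: n_parts equal blocks, each centred in its own segment.
        # If test_size is not divisible by n_parts, the remainder is dropped.
        part = test_size // n_parts
        if part == 0:
            raise ValueError("test_size must be at least n_parts")
        segment = n // n_parts
        indices = []
        for k in range(n_parts):
            start = k * segment + (segment - part) // 2
            indices.extend(range(start, start + part))
        return indices
    raise ValueError(f"unknown strategy: {strategy}")


def train_indices(n, test_indices):
    """All indices that are not used for testing."""
    held_out = set(test_indices)
    return [i for i in range(n) if i not in held_out]


# Example: 1000 time steps with 200 held out for testing.
for name in ("random", "end", "middle", "multi_block"):
    test_idx = temporal_test_indices(1000, 200, name)
    print(name, test_idx[0], test_idx[-1], len(train_indices(1000, test_idx)))
```

In the block-based strategies, only the observations at the block boundaries have immediate temporal neighbours in the training set, whereas under random splitting nearly every test observation does.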
The results indicate that random splitting yields a relatively modest error in both the temporal and the spatial scenario. Nevertheless, it is advisable to use a strategy that keeps the test data in one contiguous block and places that block in the middle of the dataset's time span. This thesis thus contributes to more refined weather prediction by shedding light on how data splitting strategies influence the outcomes.