Ensemble Machine Learning Approach Produces Greater Spatial Prediction Accuracy of Soil Carbon Stocks
A team of researchers from three National Laboratories investigated the predictive skill of four machine learning methods for estimating organic carbon content in soils across the data-limited northern circumpolar region, and compared the results with the commonly used regression kriging approach. The machine learning techniques selected different sets of predictor variables, and ranked their importance differently, when estimating soil organic carbon (SOC) stocks. The ensemble median prediction of SOC stocks obtained from all four machine learning techniques showed the highest prediction accuracy.
In regions of high heterogeneity and sparse observations, an ensemble machine learning approach produces greater spatial detail and better prediction accuracies for SOC stocks. Across machine learning techniques, temperature, latitude, land cover type, slope, and elevation were found to be the most important variables in estimating the spatial variation of surface SOC stocks. Areas with high uncertainty in SOC stocks were found in small patches of Southern Alaska and Iceland, and in larger areas of the Southern and Western Russian permafrost region.
Various approaches of differing mathematical complexity are applied for spatial prediction of soil properties, with regression kriging being a widely used method that combines soil properties and environmental factors with spatial autocorrelation to estimate soil organic carbon (SOC) stocks. In this study, four machine learning approaches (gradient boosting machine, multivariate adaptive regression splines, random forest, and support vector machine) were compared with regression kriging for predicting the spatial variation of surface (0–30 cm) SOC stocks at 250 m resolution across the northern circumpolar permafrost region. We combined 2,374 soil profile observations with georeferenced datasets of environmental factors and evaluated the prediction accuracy at randomly selected sites across the study area. The multivariate adaptive regression splines and support vector machine methods yielded higher prediction errors than regression kriging, whereas the gradient boosting machine and random forest methods yielded accuracies comparable to regression kriging. Although the techniques selected different numbers of environmental factors, assigned them different relative importance, and estimated SOC stocks with varying accuracy, the ensemble median prediction obtained from all four machine learning techniques showed the highest prediction accuracy. Thus, an ensemble prediction approach is better than using any single prediction technique for estimating the spatial variation of SOC stocks. The uncertainty in surface SOC stocks was less than 20% in about half of the study area. Areas with >50% uncertainty were found in small patches in Southern Alaska and Iceland, and in larger areas of the Southern and Western Russian permafrost region.
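The ensemble median described above can be illustrated with a minimal Python sketch. It takes the per-site median across the four models' predictions and scores each prediction set with root mean square error (RMSE). The function names, the choice of RMSE as the accuracy metric, and all numbers below are assumptions made for illustration; they are not the study's data, code, or results.

```python
from math import sqrt
from statistics import median


def ensemble_median(predictions):
    """Return the per-site median across models.

    predictions maps a model name to a list of predicted values,
    one value per site, with sites in the same order for every model.
    """
    per_site = zip(*predictions.values())
    return [median(site_values) for site_values in per_site]


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical SOC stock predictions at four validation sites.
# These values are invented for illustration only.
predictions = {
    "gbm": [10.0, 22.0, 5.0, 31.0],
    "mars": [14.0, 18.0, 9.0, 25.0],
    "rf": [11.0, 21.0, 6.0, 30.0],
    "svm": [8.0, 26.0, 4.0, 36.0],
}
observed = [10.5, 21.0, 5.5, 30.0]  # hypothetical held-out observations

ensemble = ensemble_median(predictions)

# Score every individual model and the ensemble against the observations.
scores = {name: rmse(values, observed) for name, values in predictions.items()}
scores["ensemble_median"] = rmse(ensemble, observed)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {score:.3f}")
```

With an even number of models, the median at each site is the mean of the two middle predictions, so a single outlying model has limited influence on the ensemble value. In a gridded application the same per-site operation would be applied to every map cell.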