Scoring Methods in the International Land Benchmarking (ILAMB) Package
The International Land Model Benchmarking (ILAMB) project is a model-data intercomparison and integration project designed to improve the performance of the land component of Earth system models. This effort is disseminated in the form of a python package which is openly developed (https://bitbucket.org/ncollier/ilamb). ILAMB is more than a workflow system that automates the generation of common scalars and plot comparisons to observational data. We aim to provide scientists and model developers with a tool to gain insight into model behavior. Thus, a salient feature of the ILAMB package is our synthesis methodology, which provides users with a high-level understanding of model performance.
Within ILAMB, we calculate a non-dimensional score of a model’s performance in a given dimension of the physics, chemistry, or biology with respect to an observational dataset. For example, we compare the Fluxnet-MTE Gross Primary Productivity (GPP) product against model output in the corresponding historical period. We compute common statistics such as the bias, root mean squared error, phase shift, and spatial distribution. We take these measures and find relative errors by normalizing the values, and then use the exponential to map this relative error to the unit interval. This allows for the scores to be combined into an overall score representing multiple aspects of model performance. In this presentation we give details of this process as well as a proposal for tuning the exponential mapping to make scores more cross comparable.
However, as many models are calibrated using these scalar measures with respect to observational datasets, we also score the relationships among relevant variables in the model. For example, in the case of GPP, we also consider its relationship to precipitation, evapotranspiration, and temperature. We do this by creating a mean response curve and a two-dimensional distribution based on the observational data and model results. The response curves are then scored using a relative measure of the root mean squared error and the exponential as before. The distributions are scored using the so-called Hellinger distance, a statistical measure for how well one distribution is represented by another, and included in the model’s overall score.