Exploitation of large materials databases relies on effective ways of comparing molecular and crystalline structures

Figure 1: Exploitation of large materials databases relies on effective ways of comparing molecular and crystalline structures. Mathematically well-founded similarity metrics need to be developed.

Large databases of materials and molecules are increasingly available, owing in part to the maturation of electronic structure software and in part to cheap computing. Our ability to exploit them has also advanced, thanks to modern machine learning tools and similar data-driven approaches. Many of these methods rely on effective ways to compare individual entries in the database, i.e. on a measure of similarity between crystal structures or molecules. Many such measures have been proposed and used for various purposes in the past, but they are typically ad hoc, tailor-made for a particular application. Often such similarity measures are not "complete", in the sense that non-identical data items can be deemed equivalent, and they cannot be made complete systematically.

Around 2010, several researchers realised that using machine learning methods to create interatomic potential models for materials poses a related problem, in which similarity measures of local atom environments (rather than entire structures) need to be constructed. Novel solutions appeared, including several that used spherical harmonic expansions, and soon "Smooth Overlap of Atomic Positions" (SOAP), which uses the rotational power spectrum as a descriptor, emerged as a leading contender. It was shown that the simple scalar product of the power spectra is a similarity measure of local environments that respects all physical symmetries and is stable to deformations. In 2015, De, Bartok, Csanyi and Ceriotti generalised the original SOAP metric so that it can serve as a similarity measure of entire structures. In fact, a whole family was introduced, in which a continuous parameter tunes the metric: at one extreme, it is just the average of the metrics between all environments in the two structures, while at the other extreme every atom environment in one structure must have a unique counterpart in the other structure. The new structure metric can be used as a basis for clustering algorithms, regression, dimensionality reduction, etc. The figure shows the embedding of all published silicon crystal structures into a two-dimensional plane, using a variant of Multidimensional Scaling. The metric is capable of distinguishing liquid and amorphous solid structures, and also highlights areas that are less well explored. It is applicable to any type of material or molecule, without further modification.
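The two limits of the structure metric described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the descriptor vectors are made-up stand-ins for real SOAP power spectra, the best-match limit is found by brute force (so it is only usable for a handful of atoms), and both structures are assumed to contain the same number of environments. The intermediate members of the family, selected by the continuous parameter, are not covered.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Normalised scalar product of two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def environment_kernel(A, B):
    """Similarity between every environment in A and every environment in B."""
    return [[cosine(u, v) for v in B] for u in A]


def average_kernel(A, B):
    """One limit: mean similarity over all pairs of environments."""
    C = environment_kernel(A, B)
    return sum(map(sum, C)) / (len(A) * len(B))


def best_match_kernel(A, B):
    """Other limit: each environment in A is paired with a unique one in B.

    Brute force over all pairings; assumes equal numbers of environments.
    """
    if len(A) != len(B):
        raise ValueError("this sketch assumes equal numbers of environments")
    C = environment_kernel(A, B)
    n = len(A)
    best = max(
        sum(C[i][p[i]] for i in range(n)) for p in permutations(range(n))
    )
    return best / n


# Hypothetical per-atom descriptors (placeholders, not real power spectra).
structure_a = [[1.0, 0.2, 0.0], [0.1, 1.0, 0.3], [0.0, 0.4, 1.0]]
# The same environments listed in a different atom order.
structure_b = [structure_a[2], structure_a[0], structure_a[1]]

print("average kernel:   ", average_kernel(structure_a, structure_b))
print("best-match kernel:", best_match_kernel(structure_a, structure_b))
```

In this toy case the two structures contain identical environments in a different order, so the best-match value comes out close to 1, while the average value is lower because it also includes pairs of dissimilar environments. Neither value depends on the order in which atoms are listed. For structures of realistic size, the brute-force search would normally be replaced by a linear assignment solver.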