Flow cytometry is widely used in the manufacture of cell and gene therapies to measure and characterise cells. However, conventional (manual) flow cytometry data analysis relies heavily on operator judgement; as a result, it has previously been shown to have limited reproducibility and to be liable to bias. This can adversely affect the quality and predictive potential of therapies given to patients.
Computational tools have the capacity to minimise operator variation in flow cytometry data analysis, with some 50 packages in current use [1]. In many cases, however, confidence in these technologies has yet to be fully established, which is mirrored by aspects of regulatory concern. There is therefore a need to investigate the performance of automated cell population identification tools that utilise different classes of clustering algorithms, so that they can be benchmarked across a range of cell analysis conditions. To this end, novel synthetic toolsets for highly controlled numerical evaluation have been developed [2].
Synthetic flow cytometry datasets with controlled population characteristics (separation, normal or skewed distributions, noise and rarity) were designed and generated using the R software environment. The platforms investigated were Flock2, FlowMeans, FlowSOM, PhenoGraph, SPADE1, SPADE3 and SWIFT, which represent different underpinning mathematical operators: density-based, k-means, self-organising map, k-nearest neighbour, hierarchical, deterministic k-means and model-based clustering, respectively. Performance was assessed using measures of absolute accuracy and repeatability of cell count.
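As a rough illustration of what a controlled synthetic dataset of this kind might look like, the sketch below builds a two-population, two-parameter event list in which separation, skew, noise and rarity are each set by one argument. It is not the authors' R toolset: it is written in Python, and every function name, parameter and numeric value here is a hypothetical choice made for the example. Skew is introduced with a standard skew-normal construction, and noise is modelled, as an assumption, as uniformly scattered background events.

```python
import random


def skew_normal(rng, location, scale, shape):
    """Draw one skew-normal value; shape=0 reduces to an ordinary normal draw."""
    delta = shape / (1.0 + shape * shape) ** 0.5
    z0 = abs(rng.gauss(0.0, 1.0))
    z1 = rng.gauss(0.0, 1.0)
    return location + scale * (delta * z0 + (1.0 - delta * delta) ** 0.5 * z1)


def make_dataset(n_events, rare_fraction, centre_gap,
                 shape=0.0, noise_fraction=0.0, seed=0):
    """Return a shuffled list of (x, y, label) events with known ground truth.

    centre_gap     -- distance between population centres on x (separation)
    shape          -- skew-normal shape parameter applied on x (skew)
    noise_fraction -- share of events drawn as uniform background (noise)
    rare_fraction  -- share of events in the second population (rarity)
    """
    rng = random.Random(seed)
    n_noise = round(n_events * noise_fraction)
    n_rare = round(n_events * rare_fraction)
    n_main = n_events - n_noise - n_rare

    events = []
    for _ in range(n_main):
        events.append((skew_normal(rng, 0.0, 1.0, shape),
                       rng.gauss(0.0, 1.0), "main"))
    for _ in range(n_rare):
        events.append((skew_normal(rng, centre_gap, 1.0, shape),
                       rng.gauss(0.0, 1.0), "rare"))
    for _ in range(n_noise):
        events.append((rng.uniform(-4.0, centre_gap + 4.0),
                       rng.uniform(-4.0, 4.0), "noise"))

    rng.shuffle(events)
    return events


# Illustrative settings only: 0.1% rare cells, 1% background noise.
data = make_dataset(n_events=100_000, rare_fraction=0.001,
                    centre_gap=6.0, shape=3.0, noise_fraction=0.01)
true_rare_count = sum(1 for _, _, label in data if label == "rare")
```

Because the label of every event is known by construction, the true cell count of each population is available exactly, which is what allows a clustering tool's reported counts to be scored in absolute terms.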
Outputs from software analysing the same reference synthetic dataset (containing either two or three clusters) were found to vary considerably, with accuracy typically deteriorating as the cluster separation index decreased and fell below 0. In other words, as clusters began to merge, the software platforms struggled to identify the target clusters [3]. When synthetic datasets were generated to present skewed cell populations, SWIFT was more negatively affected than the other platforms. Furthermore, when noise components were simulated within the synthetic datasets, FlowSOM showed the most significant loss of performance. An extension of the work involving rare-cell synthetic datasets found that the limit of detection varied as a function of total cell events. Most software platforms were unable to consistently identify 10^2 rare cells in 10^5 events.
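The abstract does not define its separation index or its scoring formulas, so the following Python sketch rests on stated assumptions. The index is computed in the style of Qiu and Joe for two one-dimensional normal clusters, where a positive value indicates a gap, 0 means the clusters just touch, and a negative value means they overlap. Accuracy is taken as the signed percentage error of a reported cell count against the known true count, and repeatability as the coefficient of variation over repeated runs. All function names and the example counts are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def separation_index(mu1, sd1, mu2, sd2, alpha=0.05):
    """Separation index for two 1-D normal clusters (assumed definition).

    Compares the gap between cluster centres with the combined spread of
    the two clusters, measured at their alpha/2 and 1 - alpha/2 quantiles.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    gap = abs(mu2 - mu1)
    spread = z * (sd1 + sd2)
    return (gap - spread) / (gap + spread)


def count_error_percent(reported, true_count):
    """Signed error of a reported cell count, as a percentage of the truth."""
    return 100.0 * (reported - true_count) / true_count


def repeatability_cv_percent(counts):
    """Coefficient of variation (%) of counts from repeated runs."""
    return 100.0 * stdev(counts) / mean(counts)


# The index falls, and eventually turns negative, as the centres approach.
for gap in (8.0, 4.0, 2.0):
    print(f"centre gap {gap}: index {separation_index(0.0, 1.0, gap, 1.0):.3f}")

# Hypothetical counts reported over four repeat runs for 100 true rare cells.
reported = [96, 104, 99, 101]
errors = [count_error_percent(c, 100) for c in reported]
cv = repeatability_cv_percent(reported)
```

Keeping accuracy and repeatability as separate numbers matters for stochastic tools: a platform can be accurate on average yet vary noticeably between runs on the same data.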
Our research is predicated on the generation of novel synthetic flow cytometry data, and has demonstrated their use in the unique validation of a range of automated cell identification methods. The synthetic datasets overcome limitations of using real biological datasets for such comparability studies: through careful consideration and design, they allow absolute determination of the accuracy and repeatability of cell count, and make it possible to derive a suitable range of metrics for judging software strengths and weaknesses. This work has revealed differing trends in performance between automated flow cytometry software platforms when analysing clusters with simulated distance, probability distribution, rarity and noise properties. This investigation will be pertinent when selecting such platforms for off-line and in-process evaluation of cell metrics, in both biomanufacturing and clinical scenarios, and will help to inform regulatory science.
School
Mechanical, Electrical and Manufacturing Engineering