The Defence Science and Technology Laboratory, Dstl, is an executive agency of the UK Ministry of Defence. Dstl ensures that innovative science and technology contribute to the defence and security of the UK by cooperating with academia and industry.
Data Reply supported Dstl in evaluating options for processing large geospatial datasets by conducting an evidence-based evaluation of six relevant Big Data technologies. This included benchmarking ingestion, indexing and querying, with latency being the primary objective metric under study.
By working with randomly generated data, the project work could be declassified and publicly shared, an important goal of Dstl’s when scoping the engagement.
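As an illustration of what randomly generated geospatial test data can look like, the following is a minimal sketch that produces GeoJSON point features from a seeded random generator. It is not the generator used in the study: the function names, the `timestamp` property and the uniform sampling of longitude and latitude are all assumptions made for this example.

```python
import json
import random


def random_point_feature(rng, feature_id):
    """Build one GeoJSON Feature holding a random point (hypothetical schema)."""
    lon = rng.uniform(-180.0, 180.0)
    lat = rng.uniform(-90.0, 90.0)
    return {
        "type": "Feature",
        "id": feature_id,
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        # "timestamp" is an assumed attribute, not one taken from the study.
        "properties": {"timestamp": rng.randint(0, 2**31 - 1)},
    }


def random_feature_collection(n, seed=0):
    """Return a GeoJSON FeatureCollection with n random point features."""
    rng = random.Random(seed)  # seeding makes the dataset reproducible
    features = [random_point_feature(rng, i) for i in range(n)]
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    print(json.dumps(random_feature_collection(2, seed=42), indent=2))
```

Sampling latitude uniformly concentrates points towards the poles when viewed on the sphere, which is acceptable for a sketch but may matter for a realistic workload. Seeding the generator means the same dataset can be regenerated for each technology under test.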
By working with academia and innovative businesses, Dstl develops battle-winning technologies that support UK defence operations - now and into the future. Dstl also provides the UK Government with specialist science and technology research, advice and analysis – much of which is operationally critical and offers potential for technological breakthroughs.
In many domains, notably intelligence services, the value of information decreases over time, so time-to-insight is a key metric. Geospatial data, which Dstl often works with, has inherent characteristics that make large-scale processing challenging and often computationally expensive.
Efforts have been made to reduce the complexity of working with geospatial data through standardized specifications (e.g. the GeoJSON data format) and a variety of promising technologies that hide superfluous detail from the end user. However, there is insufficient comparative data available to understand the relative performance of many of these technologies. In particular, relative query and ingestion times are not well understood.
Reflecting its desire to make evidence-based decisions in this area, Dstl engaged Data Reply to benchmark six prominent Big Data technologies with geospatial processing capabilities. The aim was to help Dstl select the appropriate technology for a given workload and to provide advice on tuning for performance.
Using Google Cloud technology, Data Reply tested six different Big Data technologies (GeoSpark, GeoMesa, Hive, MongoDB, Elasticsearch and Postgres-XL) and benchmarked their data ingestion and query speeds. All six technologies were set up and configured according to their recommended settings, with subsequent tuning as appropriate to aim for a broadly level playing field. For some of the technologies, Data Reply also developed custom data ingestion utilities to provide the required scalability while preserving the correct structure of the data and ensuring conformity with the GeoJSON standard and Dstl’s specifications.
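To make the idea of benchmarking ingestion and query speed concrete, here is a minimal sketch of a latency harness. It is an assumption about how such measurements might be taken, not the methodology of the study: the function name `measure_latency`, the warm-up run and the choice of summary statistics are all illustrative.

```python
import statistics
import time


def measure_latency(operation, repeats=5, warmup=1):
    """Time a zero-argument callable and summarise wall-clock latency in seconds."""
    # Warm-up runs are discarded so caches and connections are primed first.
    for _ in range(warmup):
        operation()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)

    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
        "samples": samples,
    }


if __name__ == "__main__":
    # Stand-in workload; a real run would wrap an ingestion or query call.
    result = measure_latency(lambda: sum(range(100_000)), repeats=5)
    print(f"median latency: {result['median']:.6f} s")
```

Wrapping each technology's ingestion or query call in the same callable interface keeps the timing logic identical across systems, which is one way to approach a level playing field.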
To execute queries, Data Reply translated SQL query descriptions into each technology’s DSL, in some cases applying a necessary approximation when the query could not be mapped directly onto the DSL syntax. Data Reply also set up the infrastructure needed for intra-cluster replication, providing high availability and fault tolerance. This allowed 20 billion data points to be held on multiple machines at once from a single write request, so that the data remains available if a node in the cluster goes down or becomes unreachable, for example because of a network partition.
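As an illustration of mapping a SQL query description onto a technology's DSL, the sketch below expresses one bounding-box search three ways: as PostGIS-style SQL, as a MongoDB filter document and as an Elasticsearch query body. The queries used in the study are not reproduced here; the table name `points`, the column `geom`, the field `location` and the function names are hypothetical.

```python
def bbox_to_sql(min_lon, min_lat, max_lon, max_lat):
    """PostGIS-style SQL; table 'points' and column 'geom' are hypothetical."""
    return (
        "SELECT id FROM points WHERE ST_Within(geom, "
        f"ST_MakeEnvelope({min_lon}, {min_lat}, {max_lon}, {max_lat}, 4326))"
    )


def bbox_to_mongodb(min_lon, min_lat, max_lon, max_lat):
    """MongoDB filter using $geoWithin with a closed GeoJSON polygon ring."""
    ring = [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],  # first and last positions must match
    ]
    return {
        "location": {
            "$geoWithin": {
                "$geometry": {"type": "Polygon", "coordinates": [ring]}
            }
        }
    }


def bbox_to_elasticsearch(min_lon, min_lat, max_lon, max_lat):
    """Elasticsearch body; 'location' is assumed to be mapped as geo_point."""
    return {
        "query": {
            "geo_bounding_box": {
                "location": {
                    "top_left": {"lat": max_lat, "lon": min_lon},
                    "bottom_right": {"lat": min_lat, "lon": max_lon},
                }
            }
        }
    }


if __name__ == "__main__":
    box = (-1.0, 50.0, 1.0, 52.0)
    print(bbox_to_sql(*box))
    print(bbox_to_mongodb(*box))
    print(bbox_to_elasticsearch(*box))
```

The MongoDB form shows why approximation can be needed: a GeoJSON polygon in `$geoWithin` is evaluated with spherical geometry, so its edges are not lines of constant latitude and the result set can differ slightly from a planar SQL envelope.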
To conclude this work, Data Reply delivered a comprehensive report to Dstl, also available for public use, detailing the study and its results. This has provided Dstl with reliable insight into the capabilities and performance of different Big Data technologies.