Case Study

Benchmarking Technologies with Geospatial Big Data

Scenario

The Defence Science and Technology Laboratory, Dstl, is an executive agency of the UK Ministry of Defence. Dstl ensures that innovative science and technology contribute to the defence and security of the UK by cooperating with academia and industry.

Data Reply supported Dstl in evaluating options for processing large geospatial datasets by conducting an evidence-based evaluation of six relevant Big Data technologies.

This included benchmarking ingestion, indexing and querying, with latency being the primary objective metric under study.

By working with randomly generated data, the project work could be declassified and publicly shared, an important goal of Dstl’s when scoping the engagement.

Dstl

By working with academia and innovative businesses, Dstl develops battle-winning technologies that support UK defence operations - now and into the future. Dstl also provides the UK Government with specialist science and technology research, advice and analysis – much of which is operationally critical and offers potential for technological breakthroughs.

The Goal

Dstl engaged Data Reply to benchmark six prominent Big Data technologies with geospatial processing capabilities, to assist them in selecting the appropriate technology given the workload, along with advice on tuning for performance.

In many domains, notably intelligence services, the value of information decreases over time. Consequently, time-to-insight is a key metric. Features inherent in the geospatial data that Dstl often works with make large-scale data processing challenging and, often, computationally expensive.

Efforts have been made to reduce the complexity of working with geospatial data through standardised specifications (e.g. the GeoJSON data format) and a variety of promising technologies that hide superfluous detail from the end user. However, there is insufficient comparative data available to understand the relative performance of many of these technologies. In particular, relative query and ingestion times are not well understood.

Reflecting their desire to make evidence-based decisions in this area, Dstl engaged Data Reply to benchmark six prominent Big Data technologies with geospatial processing capabilities, to assist them in selecting the appropriate technology given the workload, along with advice on tuning for performance.

The Solution

Data Reply conducted a technology benchmarking study under broadly equivalent hardware topology and configuration constraints.

Following strict specifications set by Dstl, Data Reply generated synthetic geospatial datasets, each composed of up to 10 billion data points and conforming to specific structural requirements to emulate real-world conditions. With the overarching aim of achieving a level playing field, the configuration of each technology was tuned. The datasets were ingested and indexed, and then a series of pre-defined queries was run. This process was repeated, in some cases multiple times, in order to derive and record the benchmark results.
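As an illustration of what generating synthetic GeoJSON data can look like, the sketch below produces random point features as newline-delimited JSON. It is a minimal example only, not the generator used in the study: the function names, the `id` property and the uniform sampling of longitude and latitude are all assumptions made for illustration, and Dstl's actual structural requirements are not reproduced here.

```python
import json
import random


def random_point_feature(rng, feature_id):
    """Return one GeoJSON Feature with a random position.

    Longitude and latitude are sampled independently and uniformly, which is
    simple but not uniform over the surface of the globe (an assumption made
    here for brevity).
    """
    lon = rng.uniform(-180.0, 180.0)
    lat = rng.uniform(-90.0, 90.0)
    return {
        "type": "Feature",
        # GeoJSON positions are ordered [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        # Hypothetical attribute; a real specification would define these.
        "properties": {"id": feature_id},
    }


def generate_features(count, seed=0):
    """Yield `count` features lazily so a large dataset never sits in memory."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    for feature_id in range(count):
        yield random_point_feature(rng, feature_id)


def to_ndjson_lines(features):
    """Serialise each feature as one JSON document per line."""
    for feature in features:
        yield json.dumps(feature)
```

Streaming the features through a generator keeps memory use flat regardless of dataset size, and a fixed seed means the same dataset can be regenerated for each technology under test.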

Using Google Cloud technology, Data Reply tested six different Big Data technologies (GeoSpark, GeoMesa, Hive, MongoDB, Elasticsearch and Postgres-XL) and benchmarked their data ingestion and query speeds. All six technologies were set up and configured according to their recommended settings, with subsequent tuning as appropriate to aim for a broadly level playing field. For some of the technologies, Data Reply also developed custom utilities for data ingestion to provide the required scalability while preserving the correct structure of the data, and to ensure conformity with the GeoJSON standard and Dstl’s specifications.
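Measuring ingestion and query speed comes down to timing an operation repeatedly and summarising the results. The sketch below shows one simple way to do that; it is an assumption-laden illustration rather than the harness used in the study. The names `measure_latency` and `summarise` are hypothetical, and `operation` stands in for whatever call ingests a batch or runs a query against a given technology.

```python
import statistics
import time


def measure_latency(operation, repeats=5):
    """Run `operation` `repeats` times and return wall-clock seconds per run."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        operation()
        timings.append(time.perf_counter() - start)
    return timings


def summarise(timings):
    """Reduce raw timings to a few figures that are easy to compare."""
    return {
        "runs": len(timings),
        "min_s": min(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }
```

For example, `summarise(measure_latency(lambda: run_query(q), repeats=10))` would time a hypothetical `run_query` call ten times. Reporting the median alongside the minimum and maximum makes the comparison less sensitive to a single slow run, such as a cold cache on the first execution.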

To execute queries, Data Reply translated SQL query descriptions into each technology’s DSL, in some cases applying necessary approximations when the query could not be mapped directly onto the DSL syntax. Data Reply also set up the infrastructure needed for intra-cluster replication, providing high availability and fault tolerance. This allowed 10 billion data points to live on multiple machines at once, with a single write request, in case a node in the cluster goes down or becomes unavailable due to issues such as network partitioning.
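To make the translation step concrete, the sketch below expresses one simple query, "find all points inside a bounding box", in two of the DSLs involved: an Elasticsearch `geo_bounding_box` filter and a MongoDB `$geoWithin` filter over a GeoJSON polygon. The function names and the `location` field are hypothetical, and the queries actually used in the study are not reproduced here.

```python
def bbox_to_elasticsearch(field, min_lon, min_lat, max_lon, max_lat):
    """Build an Elasticsearch query body using a geo_bounding_box filter."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        field: {
                            "top_left": {"lat": max_lat, "lon": min_lon},
                            "bottom_right": {"lat": min_lat, "lon": max_lon},
                        }
                    }
                }
            }
        }
    }


def bbox_to_mongodb(field, min_lon, min_lat, max_lon, max_lat):
    """Build a MongoDB filter using $geoWithin with a GeoJSON Polygon."""
    # A GeoJSON linear ring is closed: the first and last positions match.
    # Positions are [longitude, latitude], listed counterclockwise.
    ring = [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],
    ]
    return {
        field: {
            "$geoWithin": {
                "$geometry": {"type": "Polygon", "coordinates": [ring]}
            }
        }
    }
```

The two forms are not guaranteed to be exactly equivalent. MongoDB evaluates GeoJSON polygons with spherical geometry, so the polygon's edges are not simply lines of constant latitude. This is one example of the kind of approximation that can arise when a single query description is mapped onto different DSLs.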

To conclude this work, Data Reply delivered a comprehensive report to Dstl, also available for public use, detailing the study and its results. This has provided Dstl with reliable insight into the capabilities and performance of different Big Data technologies.

Dstl

The Defence Science and Technology Laboratory (Dstl) works to apply cutting-edge science and technology (S&T) to keep UK Armed Forces, and the British people, protected from harm. Dstl is an Executive Agency of the MOD, run along commercial lines. It is one of the principal government organisations dedicated to S&T in the defence and security field, with four sites: Porton Down, near Salisbury; Portsdown West, near Portsmouth; Fort Halstead, near Sevenoaks; and Alverstoke, near Gosport. Dstl works with a wide range of partners and suppliers in industry, in academia and overseas.
For more information contact the Dstl press office on 01980 956845 or 07384 210107.
press@dstl.gov.uk.
Follow us on Twitter: @DefenceHQ and @dstlmod

Data Reply is the Reply Group company specialised in data management using Big Data & Advanced Analytics methodologies. Data Reply supports customers in the design and implementation of data platforms that aim to enhance and capitalise on corporate information assets.