System Requirements | Open Data Cube

ODC System Requirements

DATA CUBE SYSTEM REQUIREMENTS

The focus of this document is to provide potential users with a reference for the type of systems that the Data Cube is targeting and what sort of performance they can expect for various investment levels. Covered topics include basic requirements that are common to all Data Cube systems, methods for estimating hardware requirements, example systems, and benchmarks of our current systems. Using this document, a user should be able to create an estimate of their hardware needs based on their requirements and get an idea of the performance that they will be aiming for.

BASIC SYSTEM REQUIREMENTS

While the Data Cube is flexible regarding basic requirements, all scalable Data Cube systems will have some of the same properties:

Shared storage
Storage capacity for both original datasets and their ingested counterparts
High memory capacity
Large processing capacity
High availability/Large bandwidth internet connection

STORAGE

Storage space scales with amount of data that the system is responsible for managing. For a general estimate on the storage space required, we use the following general formula for Landsat data:

*Estimate = (number of path/rows) * (365/revisit time in days) * (years of storage) * (data_size) * 1.5 (margin_of_safelty_multiplier) *

Using a baseline of 22 acquisitions per year (16 day revisit) and a scene size of 1.2Gb (unzipped, uncompressed) with a multiplier of 1.5 to account for the ingested dataset, this yields roughly 40Gb per path-row per year. That means for 15 years of historical data over a single path row, around 600Gb of storage would be required.

For example, we have 170 Landsat 7 scenes stored along with the ingested data. This amounts to 202Gb for the scenes and 36Gb for the ingested data. This is slightly below the threshold described above, leaving some room for error. This figure can be reduce by storing archived and compressed original datasets rather than uncompressed data. This is a figure to be used for general estimation and should be padded by a factor of 3 - 4 to account for general storage, the storage of analysis output products, and to prevent bumping into storage limits in the future.

MEMORY AND CPU

The ability to run concurrent processes is based on the number of cores, and the size of each concurrent process depends on the amount of available memory. Analyses on large datasets often require splitting a large geospatial and temporal region into smaller ‘chunks’ for more efficient processing. A ‘chunk’ in this context refers to a smaller portion of a larger dataset; large analysis regions are broken into many smaller subsets and processed independently. For example, a region made up of four square degrees may be split into four one square degree regions and processed concurrently to decrease analysis time. The chunk size can be manipulated by area or time slices: For Landsat data, pulling in one square degree for one acquisition is around 400Mb of memory. In a low memory setup, it is possible to chunk analyses into smaller geographic chunks, e.g. splitting a one square degree analysis into ten 0.1 square degree analyses and combining the end result.

The general guidelines for baseline system requirements are below:

At least one core per desired concurrent user
- More per user is desired if execution speed is a concern, for example each analysis run using our UI is split into 5+ parts that are then computed concurrently
At least one gigabyte of memory per core
- Depending on your analysis case, this may need to be higher. Most of our algorithms allow for temporal as well as geospatial chunking, so each process’ memory usage is fairly low.
  - We try to chunk the different analyses so that they use a constant amount of memory during execution. Generally, our goal is around 1Gb memory per running process. This is adjusted by modifying chunk sizes. For a standard mosaic analysis, our chunk size is 0.5 square degrees, loading five chunks at a time, while a median pixel mosaic has a chunk size of 0.01 square degrees and loads all available scenes at once.
Shared storage with capacity for all ingested data, original data, and analysis results
- We use NFS storage on our systems and don’t run into any transfer speed limitations.
- Other big data analytics shared storage solutions would work as well, but we have not experimented with any of them.

SYSTEM SUMMARIES

As summary of CEOS Data Cube systems can be found below:

This includes a local development server, our demonstration server hosted on AWS, and the Colombian IDEAM system. We have benchmarked our own systems and provide some data points below. The estimated number of concurrent users for our home office system is based on the number of active Data Cube developers while the AWS instance capabilities are estimated using the numbers provided by our Colombian colleagues. The number of supported users could be stretched in all of the systems listed above by smaller chunking and longer processing times.

PRIMITIVE BENCHMARKING

We have performed some sample tasks on the systems we have available in order to provide some reference for how various systems perform with various tasks. These are all based on UI task execution times, so they are fully parallelized and chunked into 10-200 chunks for multicore processing. Larger systems are capable of higher concurrency with the same number of active users. While execution times may be similar between systems, multiple concurrent jobs could be executed at once on the higher number of core systems. This difference can be seen with the tasks that require a larger number of chunked tasks like the median pixel custom mosaic; the median pixel mosaic splits into 200 tasks, while the most recent pixel splits into 10. Execution time is referring to a full (2000-2016) analysis over a 1 degree square in Colombia. TSM was done on a 1 degree square on Lake Chad. There are no differing application settings between machines to make use of more cores, so the execution times of our AWS servers could be optimized at the expense of the number of supported concurrent users if desired.

These analysis cases involve the processing of 91 scenes for mosaicking, fractional cover, and water detection and 314 scenes for the TSM analysis.

It can be seen above that having more cores decreases the processing time significantly due to the larger machines (AWS) having a sufficient number of cores to concurrently process all chunks at once. There is a clear drop off when comparing 36 and 72 cores, as most of the analyses were only split into 10-20 chunks. The main distinguishing point between the two AWS benchmarks is the ability to concurrently process more tasks at once. This is seen with the median pixel custom mosaics; the performance gain is still significant between the two AWS benchmarks, signifying the ability for the AWS instances to process more concurrent users.

APPENDIX

Additional Examples:

Included below is a table of raw data-volume for satellite imagery based on varying spatial and temporal requirements.

Africa Regional Data Cube Webinar - November 30, 2018

A webinar hosted by the Africa Open Data Network (AODN)

Sign-up for the Open Data Cube mailing list!

ODC System Requirements