Data Cube Installation Guide

This document provides step-by-step installation instructions for the DataCube. Deviating from them may result in a failed installation.

Introduction

The DataCube is a tool designed to ingest, store, and retrieve Landsat data for use by the UI and Notebooks.

Code Base

The code is broken up into three parts: the DataCube core (agdc-v2), the Notebooks, and the UI.

This document covers the setup and installation of the DataCube and Notebooks only.

For information about installing the UI, see the UI installation document.

Setting up the VM

While production-scale DataCubes are typically hosted on a native Linux OS, experimental work can be conducted on less powerful virtual machines. For the purposes of this document, a VM created using Oracle's VirtualBox will be used.

Create a new VM with the following specifications:

Installing Ubuntu

Once you've created the VM, start it.

Go through the Ubuntu install process and create a user called localuser with the password localuser1234.

Checking Out Code and Creating Needed Directories

In the home directory of the localuser, create a directory called Datacube:

mkdir ~/Datacube  

Run the following commands in the terminal to prepare the VM to check out code:

sudo apt-get update
sudo apt-get install -y git

The code for the DataCube and Notebooks lives in the repositories referenced in the commands below.

Use the following commands to check these projects out:

cd ~/Datacube
git clone https://github.com/ceos-seo/agdc-v2.git -b master
git clone https://github.com/ceos-seo/data_cube_notebooks.git -b master
cd data_cube_notebooks
git submodule init
git submodule update

Setting up the Virtual Environment

The DataCube runs inside a Python virtual environment. Use the following commands to install virtualenv and create one:

sudo apt-get install -y python3-pip
sudo pip3 install virtualenv  
virtualenv ~/Datacube/datacube_env  
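
Both of the following commands are used repeatedly later in this guide: the environment must be activated before any of the pip or datacube commands that follow, and deactivate returns to the system Python:

# activate before the pip installs and datacube commands below
source ~/Datacube/datacube_env/bin/activate

# leave the environment again when needed
deactivate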

Installing Necessary Dependencies

These dependencies are required for the DataCube to operate.

Copy and paste the commands below into a terminal one at a time to install them:

sudo apt-get -y install postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5  
sudo apt-get -y install libhdf5-serial-dev libnetcdf-dev
sudo apt-get -y install libgdal1-dev
sudo apt-get -y install postgresql-doc-9.5 libhdf5-doc netcdf-doc libgdal-doc
sudo apt-get -y install hdf5-tools netcdf-bin gdal-bin pgadmin3  
sudo apt-get -y install libfreetype6-dev
sudo apt-get -y install libblas-dev
sudo apt-get -y install liblapack-dev

The next set of packages should be installed within the Python virtual environment:

source ~/Datacube/datacube_env/bin/activate
pip install numpy  
pip install psycopg2  
pip install sqlalchemy==1.0.13  
pip install rasterio  
pip install netcdf4  
pip install pandas  
pip install shapely 
pip install cachetools==1.1.6  

The next step requires a compatible installation of GDAL.

Run the following command to see the GDAL version currently on the machine:

gdalinfo --version  

Run the following command, where X.X.X is the version from the previous step, or as close to it as possible.
For instance, if 1.11.3 was shown but cannot be installed, try 1.11.2, and so on:

pip install --global-option=build_ext --global-option="-I/usr/include/gdal" gdal==X.X.X
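
A quick check that the Python binding installed and imports correctly (run inside the virtual environment):

python -c "from osgeo import gdal; print(gdal.__version__)"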

Installing the DataCube

The next steps will install the DataCube onto the machine.

NOTE: The python command should be run inside the virtual environment.

cd ~/Datacube/agdc-v2
python setup.py install  
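
Once the install finishes, the datacube command-line tool should be available inside the virtual environment; a quick sanity check:

datacube --help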

Setting up and Initializing the Database

Locate and edit the postgresql.conf file in the directory /etc/postgresql/9.5/main/

cd /etc/postgresql/9.5/main
sudo gedit postgresql.conf  

Find the timezone field in the file and change its value from localtime to UTC.
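
Alternatively, a one-line edit (assuming the file still contains the default timezone entry) followed by a service restart makes the same change:

# assumes the default entry reads: timezone = 'localtime'
sudo sed -i "s/^timezone = 'localtime'/timezone = 'UTC'/" /etc/postgresql/9.5/main/postgresql.conf
sudo service postgresql restart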

Using the terminal, paste the following commands to create a database user and the DataCube database:

sudo -u postgres createuser --superuser dc_user  
sudo -u postgres psql -c "ALTER USER dc_user WITH PASSWORD 'dcuser1';"  
createdb -U dc_user datacube  
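
The datacube tool reads its database connection settings from ~/.datacube.conf. A minimal configuration matching the user and database created above (a sketch; key names follow the Data Cube configuration format) can be written with:

# minimal sketch; values match the user and database created above
cat > ~/.datacube.conf <<'EOF'
[datacube]
db_database: datacube
db_hostname: localhost
db_username: dc_user
db_password: dcuser1
EOF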

Initialize the database by running the commands:

cd ~/Datacube/agdc-v2/
datacube -v system init  
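
If initialization succeeds, the database connection can be verified with the system check subcommand (assuming it is present in this version of the tool):

datacube system check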

Installing and Configuring Notebooks

Jupyter and its supporting libraries must first be installed, and the notebook widgets extension enabled:

pip install jupyter
pip install matplotlib
pip install scipy
cd ~/Datacube/data_cube_notebooks/
jupyter nbextension enable --py --sys-prefix widgetsnbextension

The next step is to install the latest version of basemap.

Download the latest version and place it on the VM.

Run the following commands to set everything up:

mkdir ~/temp
cd ~/temp  
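
If the VM has network access, one way to fetch a release directly into this directory (the version and URL here are illustrative assumptions; check the basemap project for the current release):

# version and URL are illustrative; check the basemap project for the current release
wget https://github.com/matplotlib/basemap/archive/v1.0.7rel.tar.gz -O basemap-1.0.7rel.tar.gz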

Move the zipped basemap from its download location into temp, then extract it and enter the extracted directory:

tar -xvf basemap-*
cd basemap-*

Install the geos library that was packaged in the basemap folder:

cd geos-*
export GEOS_DIR=~/
./configure --prefix=$GEOS_DIR
make
make install  

Navigate back to the basemap directory and run the setup:

cd ..
python setup.py install

The notebook server can then be started with:

cd ~/Datacube/data_cube_notebooks/
jupyter notebook

Ingesting Sample Data

The next section explains how to prepare and ingest some sample data.

Run the following commands OUTSIDE of the Python virtual environment:

deactivate
sudo apt-get -y install python-pip
sudo apt-get -y install python-gdal 
pip install pathlib
pip install pyyaml
pip install python-dateutil
pip install numpy
pip install rasterio==0.35.1
pip install shapely
pip install cachetools==1.1.6

The DataCube uses a specific directory structure to store the original and ingested data.

Start by creating the datacube folder in the root directory:

sudo mkdir /datacube
cd /datacube
sudo mkdir ingested_data
sudo mkdir original_data
sudo mkdir ui_results
sudo mkdir ui_results_temp
cd /datacube/ui_results
sudo mkdir custom_mosaic
sudo mkdir fractional_cover
sudo mkdir tsm
sudo mkdir water_detection
sudo mkdir slip
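
Equivalently, the whole tree can be created in one command with bash brace expansion:

sudo mkdir -p /datacube/{ingested_data,original_data,ui_results_temp,ui_results/{custom_mosaic,fractional_cover,tsm,water_detection,slip}}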

Set the permissions of the data directories using chmod:

sudo chmod 777 -R /datacube  

The directory structure under the original_data folder is left up to the developer, but the standard currently in use is:

-original_data
    -AREA_OF_DATA (ex lake_chad)
        -UTM_ZONE (ex utm33)
            -Scene 1
            -Scene 2
            -Scene 3
        -UTM_ZONE
    -AREA_OF_DATA
    -AREA_OF_DATA
        -UTM_ZONE
        -UTM_ZONE  
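
For example, following that convention for the Lake Chad scenes used later in this guide:

mkdir -p /datacube/original_data/lake_chad/utm33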

Adding a Product for Indexing

Before ingesting any data, a product type must be added to the DataCube.

These commands should be run inside the Python virtual environment:

source ~/Datacube/datacube_env/bin/activate  

The following commands allow the developer to add the product:

cd ~/Datacube/agdc-v2
datacube product add ingest/dataset_types/NAME_OF_FILE.yaml  

The data then needs to be prepared for ingestion. This step must be run outside of the Python virtual environment:

deactivate
cd ingest/prepare_scripts/
python usgslsprepare.py /PATH/TO/ORIGINAL/DATA

Multiple scenes can be passed in at once using a shell glob, for example:

python usgslsprepare.py /datacube/original_data/lake_chad/utm33/LE7*

Match the newly created agdc-metadata.yaml files with the following commands:

cd ~/Datacube/agdc-v2
source ~/Datacube/datacube_env/bin/activate
datacube dataset add /PATH/TO/ORIGINAL/DATA/YAML --auto-match  

The data is now ready to be ingested:

datacube -v ingest -c /PATH/TO/INGESTION/FILE