
UN Open GIS - GeoAI Blog

Building a GeoAI Tool

Part II: DevOps and Scaling

By Tomaz Logar, Big Data Engineer, UN Global Pulse

July 16th, 2020

Why DevOps

The United Nations is not a software development organization. That, however, does not mean we can, or want to, rely entirely on external software providers. Sometimes it makes sense to put something together in-house.



Our tech-savvy teams are usually small. That naturally calls for agile software development (Dev) and close cooperation with IT operations (Ops). PulseSatellite is one such endeavor.


On the Ops side of this project we are striving for deployments with:

  • maximum automation,
  • complete reproducibility,
  • minimal manual intervention.


Putting these points into practice will lead to:

  • increased stability,
  • significant time savings and
  • lower cost of ownership.

Importance of Open Source


The UN Open GIS Initiative has put forward a strategy that we agree with and act in accordance with. Open source offers ways of developing and implementing technical capabilities while mitigating concerns around some important points:

  • Cost of development
  • Cost of ownership
  • Vendor independence


Many bits and pieces of PulseSatellite were chosen from the open source world. And when it came to deciding what to use to enable DevOps, we chose from HashiCorp's suite: Terraform, Packer and Vagrant.

You can start reading about them by scrolling a bit down.

Importance of Virtualization and Cloud

Many of us in the UN are old-school. And being old-school means having memories and experiences from a world that didn't know what hardware virtualization was. A world where Amazon was selling books and hadn't quite thought of Web Services yet. In that world, DevOps would have been fiction at best.


Luckily, quite a bit of time has passed since then, and quite a few minds have come up with solutions that give us options for some amazing development agility today.


We could consider deploying with many providers, most notably:

  • Amazon Web Services
  • Microsoft Azure
  • Google Cloud Platform


There are others as well, and we should keep an eye on OpenStack. That might be the way to go should we ever need to deploy in our own data center.


A somewhat unfortunate (albeit logical) reality is that each cloud provider has its own flavor of provisioning mechanism. AWS has CloudFormation, Azure has Resource Manager, GCP has Deployment Manager, and so on... That would introduce a degree of vendor lock-in into our plans. But...! This is where we thank our lucky stars that Terraform is out there to consolidate them all! Of course we are going to lose some cutting-edge options, but in our case those options are just not important. We'll be fine.


We are going to tailor our deployment plans slightly for AWS; it just has to be done that way. But by using Terraform, it wouldn't take much effort to tweak them to deploy to another provider.

Infrastructure as Code + Orchestration = Terraform

So what is Terraform? Terraform is a tool for building and updating virtualized infrastructure via configuration files, an approach also known as "Infrastructure as Code".


Writing these template files is more time-consuming than interactively creating infrastructure. But that is a price we are more than willing to pay for the excellent scalability we get as a result. There are two types of scalability we benefit from here:

  1. Within one cluster and
  2. Through multiple cluster deployments


It's really easy to scale infrastructure up or down once we have a well-designed set of templates.

A - Within One Cluster

For example: we see our users running two GPU processing services at close to 100%, and their work is probably affected by that bottleneck. We put our heads together and agree that the work being done is important enough to warrant giving AWS some more money to double that service count for a few days. We go to the Terraform configuration file, find the right variable and change it from 2 to 4. Then we issue the command "terraform apply" and Terraform adds the new infrastructure to our cluster. All this within minutes.
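
As a rough sketch, and assuming the GPU service count is driven by a Terraform variable (the name gpu_node_count here is hypothetical, not taken from our real configuration), the whole operation is a one-value edit followed by an apply:

# cluster configuration file (hypothetical variable name)
gpu_node_count = "4"   # was "2"; doubles the GPU processing services

# then, from the deployment directory:
#   terraform apply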


Or another example like that, just in the other direction: we see our two GPU processing services running at 30%. Once again we put our heads together and agree that this is wasteful. We have the option of not paying for what we don't use!


Again, we go to the Terraform configuration file, find the right variable and change it from 2 to 1. That's 50% savings on GPUs with a few keystrokes in a few minutes! And it certainly beats having five $4,000 GPU blades running in our data center for no good reason when the workload is low.


A third example is giving users the ability to manage costs themselves, to a degree. In the previous two examples an admin went in and quickly scaled the infrastructure up or down. However, we can also trust our users to be frugal. Through another level of automation we give our tool's users a limited amount of control over the infrastructure: the admin sets the maximum number of GPU processing services, but users can turn them on or off depending on their needs. We can save a lot of money like that!
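
A minimal sketch of how such a cap could be expressed in an instance template, assuming hypothetical variable names for the user-requested count and the admin-set maximum:

# requested_gpu_node_count is adjusted by users through the tool;
# max_gpu_node_count is fixed by the admin in the configuration file.
resource "aws_instance" "psGpuService" {
  count = min(var.requested_gpu_node_count, var.max_gpu_node_count)

  # ... remaining arguments as in the tiler instance template shown in the Examples section below ...
}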

B - Through Multiple Cluster Deployments

Another great upside of these kinds of deployments is much easier handovers to other parties. We are not only providing the source code of our services; we can also provide automated infrastructure building.


For instance, we have a PulseSatellite cluster running in North America, but colleagues serving Central Africa might want to spin up their own in Cape Town. Can do!



We give them access to our repository, from which they pull all the files; they spend some time learning and configuring, and Terraform stands up the cluster much faster than we were used to in pre-Infrastructure-as-Code times.
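
In rough terms, and with file and cluster names that are purely illustrative, such a handover boils down to a new configuration file plus a couple of Terraform commands run from a checkout of the shared repository:

# capetown.tfvars - hypothetical configuration for the new cluster
cluster_name        = "pscapetown"
cluster_environment = "production"
tiler_node_count    = "2"

# then, inside the repository checkout:
#   terraform init
#   terraform apply -var-file=capetown.tfvars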

Examples

Here is a quick look at what our infrastructure as code looks like...

There are three types of files that end up building our clusters:

  • Infrastructure templates
  • Configurations
  • Secrets


Infrastructure templates are where the build logic is described; they contain all the technical nitty-gritty. These files change only during development phases and include references to the other two file types. They look somewhat like this example, a tiler service instance template:


resource "aws_instance" "psTiler" {
  count         = var.tiler_node_count
  ami           = var.tiler_node_ami
  instance_type = var.tiler_node_instance_type

  subnet_id                   = var.subnets[count.index]
  associate_public_ip_address = true

  vpc_security_group_ids = [var.tiler_node_security_group]

  connection {
    type        = "ssh"
    user        = "ubuntu"
    private_key = file("../keys/xxxxxx.key")
    host        = self.private_ip
  }
  tags = {
    Name        = "PLNY-ps-${var.cluster_name}-tiler-${count.index}"
    Project     = "Pulse Satellite"
    Service     = "tiler"
    Environment = var.cluster_environment
    Version     = var.cluster_version
    LabName     = "New York"
  }
}
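
# The null_resource below re-runs the file and remote-exec provisioners
# whenever the set of tiler instance IDs changes (see its "triggers" block).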

resource "null_resource" "psTiler" {
  count = var.tiler_node_count
  triggers = {
    tiler_instance_ids = join(",", aws_instance.psTiler.*.id)
  }
  connection {
    type        = "ssh"
    user        = "ubuntu"
    private_key = file("../keys/goose_rsa.key")
    host        = element(aws_instance.psTiler.*.private_ip, count.index)
  }
  provisioner "file" {
    source      = "services/pstiler.service"
    destination = "/home/ubuntu/pstiler.service"
  }
  provisioner "remote-exec" {
    inline = [
      "sudo apt-get -y install nfs-common",
      "sudo mkdir /efs",
      "sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-xxxxxxxx.efs.us-east-1.amazonaws.com:/ /efs",
      "echo xxxxxxxx.efs.us-east-1.amazonaws.com:/ /efs nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,_netdev,nofail 0 0 | sudo tee -a /etc/fstab",
      "sudo cp /home/ubuntu/pstiler.service /etc/systemd/system/pstiler.service",
      "sudo systemctl daemon-reload",
      "sudo systemctl enable pstiler",
      "sleep 1",
      "sudo systemctl --no-block restart pstiler",
    ]
  }
}

output "tiler_ids" {
  value = join(",", aws_instance.psTiler.*.id)
}

output "tiler_ips" {
  value = join(",", aws_instance.psTiler.*.private_ip)
}


Configuration files come into play when the deployment phase starts. This is where we - or the UN colleagues who will spin up their own infrastructure - decide how powerful and expensive the cluster elements will be:


cluster_name = "psdev"
cluster_environment = "development"
cluster_version = "2019a"
cluster_dataroot = "/efs/psat-deploy/development/dataroot"
tiler_node_instance_type = "r4.large"
tiler_node_ami = "ami-02ba22000919f0036"
tiler_node_security_group = "sg-0c0000078"
tiler_node_count = "1"
tiler_process_count = "4"
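
Each of these values corresponds to a variable declared in the templates. A minimal sketch of what those declarations could look like (names match the snippet above; the defaults shown here are illustrative, not the real ones):

variable "cluster_name" {}
variable "cluster_environment" {}
variable "cluster_version" {}

variable "tiler_node_count" {
  default = "1"
}

variable "tiler_node_instance_type" {
  default = "r4.large"
}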


Secrets files look the same as configuration files; they just need to be treated with much greater care. This is where we put our access credentials and similar pieces of information.
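
A sketch of what such a secrets file might contain (the names and values here are placeholders, not from the real deployment):

# secrets file - kept out of version control; placeholder values only
aws_access_key = "AKIAXXXXXXXXXXXXXXXX"
aws_secret_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"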


You may have noticed that some pieces of configuration information are hard-coded in the above examples. That is because the snippets come from a running deployment. We needed to decide on a level of abstraction that would hit a sweet spot between time spent on development and getting a useful product in reasonable time. Thus, some elements, such as security groups, subnet topology and the distributed file system, are built outside the scope of these plans. This is the current state of affairs, but it will change as more development time can be invested in project automation.


Perhaps a side note: it is important to keep in mind that infrastructure-as-code deployments are tightly linked to the immutability of infrastructure elements. We never patch a running instance by hand. If an instance is misbehaving or needs upgrading, we destroy it and recreate it from scratch. All the mutable elements (configurations and such) are changed outside of the deployment and injected through an automated process.

Image Building - Packer

I will go into more specifics on this topic in one of the future articles, but for now let me just write down a note or two about the image building tool, Packer.


We want to have clear barriers between distinct parts of our running code. For that reason we're developing and running the services required for PulseSatellite in virtualized containers.


We are building two levels of containers:

  • cloud provider containers and
  • Docker containers


Cloud provider containers are really virtualized instances - this is the level on which we run our user interface web server, relational database, tiling services, AI services, etc. Since we're deploying to Amazon Web Services, we need to build Amazon Machine Images (AMIs) on this level. If we were to deploy to Azure, we'd need to build Azure Virtual Machine Images. If we were to deploy to Google Compute Engine, we'd need to build images for that one, and so on.


Each AI service needs to go one level deeper in our code isolation efforts: we run multiple Docker containers within a running Amazon virtual instance.


Just as with provisioning cloud infrastructure, each image provider - be it a cloud or Docker - has its own flavor of image building. It would be quite an exercise to write image builders for each of them! Luckily, HashiCorp has come to developers' aid once more and consolidated these flavors into one single tool: Packer.


So instead of worrying about how Amazon or Docker want our images described, we just read up on Packer and produce the JSON files that Packer requires. They look something like this:


 {
 "builders": [{
   "type": "amazon-ebs",
   "ami_name": "UNGP-Pulse-Satellite-Brain-Node-{{user `build_version`}}-{{isotime | clean_resource_name | upper}}",
   "access_key": "{{user `aws_access_key`}}",
   "secret_key": "{{user `aws_secret_key`}}",
   "region": "us-east-1",
   "source_ami": "ami-00a208c7cdba991ea",
   "instance_type": "m3.xlarge",
   "ssh_username": "ubuntu",
   "iam_instance_profile": "PLNY-Flock-ECR-Read-Only-Role",
   "launch_block_device_mappings": [{
     "device_name": "/dev/sda1",
     "volume_type": "gp2",
     "volume_size": 96,
     "delete_on_termination": true
   }],
   "run_tags": {"Name": "UNGP-PLNY Packer Build - PS"}
 }],
 "_comment": "To find the right AMI for the amazon-ebs builder go to https://cloud-images.ubuntu.com/locator/ec2/ and search for `18.04 hvm ebs us-east-1 amd64` ",
 "provisioners": [
   {
     "type": "file",
     "source": "../../keys/xxxx.pubkey",
     "destination": "/home/ubuntu/.ssh/authorized_keys"
   },
   {
     "type": "file",
     "source": "scripts/dockerLogin.sh",
     "destination": "/home/ubuntu/dockerLogin.sh"
   },
   {
     "type": "file",
     "source": "scripts/dummyService.sh",
     "destination": "/home/ubuntu/psBrain.sh"
   },
   {
     "type": "shell",
     "script": "scripts/PS-Brain-Launcher-image-init.sh"
   }
 ],
 "variables": {
   "build_version": "0.2.0.dev",
   "ssh_timeout": "60m"
 }
 }


This template is written for Amazon only. If we wanted to use it on Azure or for Docker, we'd just add an Azure or Docker-specific builder configuration in the "builders" branch and Packer would be able to build images there, too.
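
For illustration, a Docker builder entry that could sit alongside the Amazon one in the "builders" array might look roughly like this (the base image and options are assumptions, not taken from our real templates):

 {
   "type": "docker",
   "image": "ubuntu:18.04",
   "commit": true
 }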


Build logic in the "provisioners" branch would stay the same, no matter which provider we were building for.

Container Build Cost Optimization - Vagrant

And to wrap up these image building and running shenanigans, it is also worth mentioning a third tool: Vagrant.


Packer will usually do its work remotely, on cloud compute providers, to build cloud images. There is a lot of trial and error to go through when we're developing our deployments, and that takes time. Time for which AWS will send us a bill every month.


That is where Vagrant comes in. It enables us to run Packer builds locally on our workstations or laptops. The upside of working like this is that it doesn't cost us any extra money. One of the downsides is that our workstations are usually further from public software repositories, so builds take longer to finish.


Nevertheless, it's a good option to have.


Previous: Introduction

Up next: Model Adaptation.


