Maintaining Spatial Data Infrastructures (SDIs) using distributed task queues


2017 FOSS4G Boston Maintaining Spatial Data Infrastructures (SDIs) using distributed task queues Paolo Corti and Ben Lewis Harvard Center for Geographic Analysis

Background
Harvard Center for Geographic Analysis
WorldMap: http://worldmap.harvard.edu
Biggest GeoNode instance on the planet
https://github.com/cga-harvard/cga-worldmap
HHypermap: http://hh.worldmap.harvard.edu
Map service registry
https://github.com/cga-harvard/hhypermap

Note Billion Object Platform (BOP) https://github.com/cga-harvard/hhypermap-bop

Demo of WorldMap / HHypermap

The need for an asynchronous processor
In WorldMap and HHypermap, users trigger operations that are time consuming and cannot be handled within a web request:
Harvest the metadata of a service and its layers
Synchronize the metadata of a new or updated layer to the search engine
Feed a gazetteer when a new layer is uploaded or updated
Upload a spatial dataset to the server
Create a new layer using a table join

HTTP request/response cycle must be fast
In web applications the HTTP request/response cycle can stay synchronous as long as the interactions between the client and the server are very quick.
Unfortunately, there are cases where the cycle becomes slow.
In these situations the best practice for a web application is to process these tasks asynchronously using a task queue.

Task Queues
Asynchronous processing in a web application can be delegated to a task queue, a system for executing tasks in parallel in a non-blocking fashion.
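The non-blocking handoff can be sketched with the Python standard library alone. This is only an illustration of the idea, not the WorldMap or HHypermap code: `harvest_service`, `handle_request` and the example URL are hypothetical names, and a thread pool stands in for a real task queue.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical slow operation, standing in for e.g. harvesting a map service.
def harvest_service(url):
    time.sleep(0.5)  # simulate slow network and parsing work
    return f"harvested {url}"

# A small pool of workers that run tasks outside the request cycle.
executor = ThreadPoolExecutor(max_workers=2)

def handle_request(url):
    """Simulated web view: hand the slow work off and answer immediately."""
    future = executor.submit(harvest_service, url)  # returns without waiting
    return {"status": "accepted"}, future

start = time.monotonic()
response, future = handle_request("http://example.com/wms")
elapsed = time.monotonic() - start  # time spent inside the "request"

result = future.result()  # only this demo waits; a real view would not
executor.shutdown()
```

The "request" returns well before the half-second task finishes. A thread pool lives inside the web process, though, so a dedicated task queue is still needed when work must survive restarts or be spread across machines.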

Asynchronous processing model
The asynchronous processing model is composed of services that produce processing tasks (producers) and services that consume and process these tasks (consumers).
A message queue is a broker that facilitates message passing by providing a protocol or interface that other services can access. Work can be distributed across threads or machines.
In the context of a web application, the producer is the client application that creates messages based on user interaction. The consumer is a daemon process that consumes the messages and runs the required process.
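The producer, broker and consumer roles can be modelled in a few lines of standard-library Python. This is a minimal in-process sketch, assuming a `queue.Queue` as a stand-in for the broker and threads as stand-ins for worker daemons; the message fields and layer names are made up for illustration.

```python
import queue
import threading

broker = queue.Queue()   # stands in for the message broker
results = []             # filled by the consumers, for demonstration only
results_lock = threading.Lock()

def produce(layer_name):
    """Producer: put a message describing the work on the broker."""
    broker.put({"task": "sync_layer", "layer": layer_name})

def consume():
    """Consumer: a long-running worker that processes messages one by one."""
    while True:
        message = broker.get()
        if message is None:          # sentinel: stop this worker
            broker.task_done()
            break
        with results_lock:
            results.append(f"{message['task']}:{message['layer']}")
        broker.task_done()

# Start two workers; adding more increases the consumption rate.
workers = [threading.Thread(target=consume, daemon=True) for _ in range(2)]
for w in workers:
    w.start()

for name in ["roads", "rivers", "parcels"]:
    produce(name)

broker.join()                # wait until every message has been processed
for _ in workers:
    broker.put(None)         # one sentinel per worker
for w in workers:
    w.join()
```

With a real broker such as RabbitMQ the queue lives in a separate process, so producers and consumers can run on different machines and the messages outlive any single web request.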

Glossary
Task Queue: a system for parallel execution of tasks in a non-blocking fashion
Broker or Message Queue: provides a protocol or interface for exchanging messages between different services and applications
Producer: the code that places the tasks to be executed later in the broker
Consumer or Worker: takes tasks from the broker and processes them
Exchange: takes a message from a producer and routes it to zero or more queues (message routing)
Tasks must be consumed faster than they are produced. If they are not, add more workers.
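The "zero or more queues" behaviour of an exchange is easier to see in a toy model. The sketch below imitates a direct exchange, where a message goes to every queue bound with the same routing key; the `DirectExchange` class and the queue and key names are hypothetical and are not part of any broker's API.

```python
from collections import defaultdict, deque

class DirectExchange:
    """Toy exchange: routes a message to every queue bound with a matching key."""

    def __init__(self):
        self.bindings = defaultdict(list)   # routing key -> list of queue names
        self.queues = defaultdict(deque)    # queue name -> pending messages

    def bind(self, queue_name, routing_key):
        self.bindings[routing_key].append(queue_name)

    def publish(self, routing_key, message):
        """Deliver to zero or more queues; return how many received the message."""
        targets = self.bindings.get(routing_key, [])
        for name in targets:
            self.queues[name].append(message)
        return len(targets)

exchange = DirectExchange()
exchange.bind("harvest_queue", "harvest")
exchange.bind("audit_queue", "harvest")
exchange.bind("index_queue", "index")

delivered_two = exchange.publish("harvest", "harvest service 42")
delivered_one = exchange.publish("index", "index layer 7")
delivered_none = exchange.publish("thumbnail", "no queue bound")
```

The third message matches no binding and reaches no queue, which is the "zero" case in the glossary entry.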

Use cases for task queues in web applications
A process takes too much time and must run asynchronously
Heterogeneous applications/services in a given system architecture need an easy way to communicate reliably with each other
Periodic operations (vs crontab)
Parallelizing tasks across multiple processors
Monitoring processes and analyzing failing tasks (and executing them again)
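Re-executing failing tasks, the last use case above, boils down to a retry loop that records each failure. The following is a plain-Python sketch of that idea under simple assumptions (no delay between attempts, a fixed retry limit); `run_with_retries` and `flaky_check` are hypothetical helpers, and task queues such as Celery provide their own retry mechanisms.

```python
def run_with_retries(task, args=(), max_retries=3):
    """Run task(*args); on failure try again, up to max_retries extra attempts."""
    failures = []  # kept so that failing attempts can be analyzed later
    for attempt in range(max_retries + 1):
        try:
            return task(*args), failures
        except Exception as exc:
            failures.append(f"attempt {attempt + 1}: {exc}")
    raise RuntimeError(f"task failed after {max_retries + 1} attempts: {failures}")

# Hypothetical task that fails twice (e.g. a remote service timing out)
# and succeeds on the third call.
calls = {"count": 0}

def flaky_check(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("timeout")
    return f"{url} is up"

result, failures = run_with_retries(flaky_check, ("http://example.com/wms",))
```

A production setup would usually wait between attempts and report the recorded failures to a monitoring tool.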

Typical use cases for a task queue in a web application
Thumbnail generation
Sending bulk email
Fetching large amounts of data from APIs
Performing time-intensive calculations
Expensive queries
Search engine index synchronization
Interaction with another application/service
Replacing cron jobs (backups, maintenance, etc.)

Typical use cases for a task queue in a GIS Portal/SDI
Uploading a shapefile to the server (GeoNode)
Thumbnail generation for layers and maps (GeoNode)
OGC services harvesting (Harvard Hypermap)
Geoprocessing operations
Geospatial data maintenance

Producer, broker and consumer architecture
[Diagram: several producers, brokers and consumers]

Message broker implementations
Most of them are open source!
RabbitMQ (AMQP, STOMP, JMS)
Apache ActiveMQ (STOMP, JMS)
Amazon Simple Queue Service (JMS)
Apache Kafka
Several standard protocols: AMQP, STOMP, JMS, MSMQ (Microsoft .NET)

Task (job) queue implementations
Celery (RabbitMQ, Redis, Amazon SQS, Zookeeper)
Redis Queue (Redis)
Resque (Redis)
Kue (Redis)
And many others!

Celery
Asynchronous task queue based on distributed message passing
Focused on real-time operation, but supports scheduling as well
The execution units, called tasks, are executed concurrently on one or more worker servers
Supports many message brokers (RabbitMQ, Redis, MongoDB, CouchDB, ...)
Written in Python, but it can operate with other languages
Great integration with Django!
Great monitoring tools (Flower, django-celery-results)

RabbitMQ
RabbitMQ is a message broker: it accepts and forwards messages
Most widely deployed open source broker (35k+ deployments)
Supports many message protocols
Supported by many operating systems and languages
Written in Erlang

Architecture of Celery/RabbitMQ https://tests4geeks.com/python-celery-rabbitmq-tutorial/

A real use case: Harvard Hypermap
HHypermap (Harvard Hypermap) Registry is a platform that manages the harvesting and orchestration of OWS, Esri REST, and other types of map services, and maintains uptime statistics for services and layers.
Where possible, layers are cached by MapProxy.
HHypermap provides thousands of remote layers to WorldMap users.

Harvard Hypermap WorldMap Architecture

HHypermap interface

Need for a task queue SLOW!!!

Producer
The code that places the tasks to be executed later in the broker

Celery messages

Consumer
Takes tasks from the broker and processes them in a worker

Replacing cron jobs
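A periodic job that would otherwise live in crontab can be declared in Celery's configuration and triggered by the Celery beat scheduler. The fragment below is a sketch of a configuration module, not the HHypermap settings: the module name `celeryconfig.py`, the broker URL and the task names are assumptions made for illustration.

```python
from datetime import timedelta

# celeryconfig.py (hypothetical name), loaded by the application with
# app.config_from_object("celeryconfig")

# Assumed local RabbitMQ instance with the default guest account.
broker_url = "amqp://guest:guest@localhost:5672//"

# Entries read by the Celery beat scheduler, which sends each task
# to the broker at the given interval.
beat_schedule = {
    "check-all-services-every-hour": {
        "task": "tasks.check_all_services",   # hypothetical task name
        "schedule": timedelta(hours=1),
    },
    "reindex-layers-every-day": {
        "task": "tasks.reindex_layers",       # hypothetical task name
        "schedule": timedelta(days=1),
        "args": (),
    },
}
```

Unlike a crontab entry, each scheduled run goes through the broker and a worker, so it can be monitored and retried like any other task.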

Workers and threads with htop

Monitoring

Monitoring a task

Thanks! Question and Answer