Report on the Infrastructure for Implementing Mobile Technologies for Data Collection in Egypt
Date: 10 Sep 2017
Draft v4.0
Table of Contents

1. Introduction
2. Infrastructure Reference Architecture
3. Current Status of CPI-Related Solutions
4. Targeted Data Management Continuum
5. Current Infrastructure Architecture
6. Targeted Solution Architecture
7. Recommendations for Applications and Data Management
8. Main Recommended Components
9. Estimated High-Level Sizing and Specifications
10. Conclusion and Next Actions
11. References
1. Introduction

Realizing the advantages of using mobile technology for data collection and statistical production, the United Nations Economic Commission for Africa (ECA) is implementing a series of pilot projects on strengthening the capacity of African countries to use mobile technologies to collect data for effective policy and decision making. The pilot projects are designed to be executed by the National Statistical Office (NSO) in collaboration with a Training and Research Institute (TRI) designated by the NSO. The main partner in the project is the NSO in Egypt, the Central Agency for Public Mobilization and Statistics (CAPMAS). CAPMAS has in turn designated Nile University as the TRI.

The main objectives of the pilot project are as follows:

- Strengthen the capacity of the country to collect data with mobile technology;
- Experiment with self-enumeration using mobile devices to collect data and determine the suitability of such data for the production of statistics;
- Strengthen the working relationship between the NSO and the TRI in statistical development.

The focus of this report is to support CAPMAS in installing and/or upgrading technical infrastructure, including computer servers and software, to receive data from the project and integrate it into standard statistical processes in Egypt, as well as to acquire handheld devices. Based on several meetings and assessment events with the CAPMAS team, the current infrastructure and the targeted upgrades are illustrated in this report. At the end, sizing estimates along with recommendations for Big Data components and platform have been made.

The main infrastructure achievement at CAPMAS is the virtualized data center, which is recommended to be upgraded further to a cloud computing platform. The National Institute of Standards and Technology (NIST) Cloud reference architecture is recommended to be used to achieve a private cloud computing platform for this purpose.
2. Infrastructure Reference Architecture

For the sake of standardizing the infrastructure design for the project, a suitable reference architecture needs to be used. As cloud computing provides several benefits and, at the same time, the existing data center provides a solid foundation for such an approach, the National Institute of Standards and Technology (NIST) Cloud reference architecture will be used, as detailed in reference 2. The key points follow.

The architectural components of the NIST reference architecture describe the important aspects of service deployment and service orchestration. The overall service management of the cloud is acknowledged as an important element in the scheme of the architecture. Business support mechanisms are in place to recognize customer management issues like contracts, accounting and pricing, and are vital to cloud computing.

The following figure presents an overview of the NIST cloud computing reference architecture, which identifies the major actors and their activities and functions in cloud computing. The diagram depicts a generic high-level architecture and is intended to facilitate the understanding of the requirements, uses, characteristics and standards of cloud computing.
The NIST cloud computing definition is widely accepted as a valuable contribution toward providing a clear understanding of cloud computing technologies and cloud services. It provides a simple and unambiguous taxonomy of three service models available to cloud consumers: cloud software as a service (SaaS), cloud platform as a service (PaaS), and cloud infrastructure as a service (IaaS). It also summarizes four deployment models describing how the computing infrastructure that delivers these services can be shared: private cloud, community cloud, public cloud, and hybrid cloud. Finally, the NIST definition also provides a unifying view of five essential characteristics that all cloud services exhibit: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.

The NIST cloud computing reference architecture defines five major actors: cloud consumer, cloud provider, cloud carrier, cloud auditor and cloud broker. Each actor is an entity (a person or an organization) that participates in a transaction or process and/or performs tasks in cloud computing. The following table briefly lists the actors defined in the NIST cloud computing reference architecture:

- Cloud Consumer: a person or organization that maintains a business relationship with, and uses services from, Cloud Providers.
- Cloud Provider: a person, organization, or entity responsible for making a service available to interested parties.
- Cloud Auditor: a party that can conduct independent assessment of cloud services, information system operations, performance and security of the cloud implementation.
- Cloud Broker: an entity that manages the use, performance and delivery of cloud services, and negotiates relationships between Cloud Providers and Cloud Consumers.
- Cloud Carrier: an intermediary that provides connectivity and transport of cloud services from Cloud Providers to Cloud Consumers.
Our focus in this solution will be on the private cloud model that needs to be in place at CAPMAS as the infrastructure for the mobile data collection applications as well as the back-end processing technologies. NIST defines a private cloud as giving a single Cloud Consumer's organization exclusive access to and usage of the infrastructure and computational resources. It may be managed either by the Cloud Consumer organization or by a third party, and may be hosted on the organization's premises (i.e. on-site private clouds) or outsourced to a hosting company (i.e. outsourced private clouds).
3. Current Status of CPI-Related Solutions

Currently, there is neither dedicated infrastructure for CPI-related processing at CAPMAS nor back-end processing components, such as database engines or big data platforms, to handle data processing, transformation and modeling. Most work is done either manually or collected into spreadsheets for processing and estimation of the CPI and intermediate statistics and KPIs. The following measures provided by CAPMAS illustrate the workload of the CPI process in terms of the effort needed by the people involved:

- Number of Researchers: field persons assigned to collect data from the different markets.
- Number of Supervisors: field persons assigned to manage the field operation of researchers.
- Number of Researchers per Supervisor: the average number of researchers being supervised by a supervisor (not specified).
- Overall number of governorates: governorates where the field operation takes place.
- Overall number of regions: regions where markets are located for collecting prices.
- Overall number of markets: markets where prices are being collected.
- Number of markets per region: number of markets per region where the operation takes place.
- Number of markets per researcher: number of markets assigned during one month to a single researcher (one region is assigned to one researcher).
- Number of forms per researcher: number of forms to be completed by a researcher in one month.
- Number of products per form: number of products the researcher needs to get prices for on each single form.
- Number of branch reviewers: number of reviewers assigned to review the collected prices for each branch office.
- Number of head office reviewers: number of reviewers at the head office responsible for the final review of prices collected from all field operations.
4. Targeted Data Management Continuum

The effectiveness of a mobile data collection solution for the CPI process requires the existence of an enterprise data management platform that is capable of handling collected data in an integrated, secure and accessible way, so that a collaborative model among researchers, supervisors, CAPMAS branches, central departments and CPI departments can be achieved. The current CPI process at CAPMAS lacks such an enterprise data management platform; most of the process is done manually through paper forms, except for the final analysis, which is conducted using Excel sheets or local desktop software, preventing collaborative data models from delivering value.

The target platform and infrastructure should fulfill the following main requirements, split by each phase of the data management continuum:

- Data Collection: enable automating the data sourcing, review, approval and consolidation through the workflow embedded in the mobile application for the field researchers and their supervisors.
- Data Aggregation: the data sourced from the mobile applications, after review and approval, needs to be aggregated properly into the back-end database through a direct connection and predefined rules defined by the CPI department.
- Data Matching: the ability to extract external data and maintain master data, while providing the ability to query data using predefined queries as well as ad-hoc queries. At the same time, enable augmenting CPI data with other data such as spatial and geolocation data.
- Data Quality: provide means for checking data quality and validation during the collection process and post-collection, during review in the back-office processing and while applying standard CPI statistical analysis.
- Data Persistence: retain and organize data for as long as possible, while providing multi-structured data capabilities to save storage costs.
- Data Consolidation: assemble data entities integrated into the back-end systems with flexible metadata management to ensure accessibility by specific roles.
- Data Distribution: enable analysis tools to access, retrieve and communicate data in a way intuitive to each level of CPI employees, as well as structured for branch access and top management reporting.

The new model proposed for the pilot project will address the above requirements for each area, targeting an integrated data management platform that enables data integration, collaboration and retention using the most recent big data management technologies. Data is transferred directly to secured servers managed internally by CAPMAS, with the following features:

- End-to-end encryption using the existing CAPMAS telecommunication infrastructure.
- Reliable simultaneous connections to the CAPMAS data centre servers.
- Online/offline synchronization.
- GIS integration.
- Multilanguage support.
- The architecture could be used by all surveys and by all statistical processes.
- The architecture could easily support the self-enumeration concept.
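To make the Data Quality requirement concrete, the following is a minimal Python sketch of the kind of validation rule that could run on the tablet during collection and again during back-office review. The field names (product_code, market_id, price, collected_at) and the review threshold are illustrative assumptions, not taken from the CAPMAS design:

```python
# Sketch of collection-time validation rules for a single price record.
# Field names and thresholds are illustrative assumptions, not CAPMAS specifications.

def validate_price_record(record, previous_price=None, max_change_ratio=0.5):
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    # Completeness: every mandatory field must be present and non-empty.
    for field in ("product_code", "market_id", "price", "collected_at"):
        if field not in record or record[field] in (None, ""):
            errors.append(f"missing field: {field}")
    price = record.get("price")
    if isinstance(price, (int, float)):
        if price <= 0:
            errors.append("price must be positive")
        # Plausibility: flag large jumps against the previous observation for review.
        elif previous_price and abs(price - previous_price) / previous_price > max_change_ratio:
            errors.append("price change exceeds review threshold")
    elif "price" in record:
        errors.append("price must be numeric")
    return errors
```

A record failing such checks could be flagged for the supervisor rather than rejected outright, which matches the review-and-approval workflow described above.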
5. Current Infrastructure Architecture

At CAPMAS, a virtualized data center infrastructure is already widely used for other applications, and it can be leveraged for the CPI project with some modifications and upgrades, as described in the next sections. The current infrastructure is based on VMware virtualization technologies, as detailed in reference 3; the main points follow.

VMware Infrastructure includes the following components, as shown in the figure above:

- VMware ESX Server: a production-proven virtualization layer that runs on physical servers and abstracts processor, memory, storage and networking resources to be provisioned to multiple virtual machines.
- VMware Virtual Machine File System (VMFS): a high-performance cluster file system for virtual machines.
- VMware Virtual Symmetric Multi-Processing (SMP): enables a single virtual machine to use multiple physical processors simultaneously.
- VirtualCenter Management Server: the central point for configuring, provisioning and managing virtualized IT infrastructure.
- Virtual Infrastructure Client (VI Client): an interface that allows administrators and users to connect remotely to the VirtualCenter Management Server or individual ESX Server installations from any Windows PC.
- Virtual Infrastructure Web Access: a web interface for virtual machine management and remote console access.
- VMware VMotion: enables the live migration of running virtual machines from one physical server to another with zero downtime, continuous service availability and complete transaction integrity.
- VMware High Availability (HA): provides easy-to-use, cost-effective high availability for applications running in virtual machines. In the event of server failure, affected virtual machines are automatically restarted on other production servers that have spare capacity.
- VMware Distributed Resource Scheduler (DRS): intelligently allocates and balances computing capacity dynamically across collections of hardware resources for virtual machines.
- VMware Consolidated Backup: provides an easy-to-use, centralized facility for agent-free backup of virtual machines. It simplifies backup administration and reduces the load on ESX Server installations.
- VMware Infrastructure SDK: provides a standard interface for VMware and third-party solutions to access VMware Infrastructure.
6. Targeted Solution Architecture

While leveraging the current virtualized infrastructure through a cloud computing model is the designated approach, the target infrastructure has several roles in allowing the mobile data collection solution to work smoothly as planned. As per reference 4, those roles include:

- Supporting the tablet mobile application communications for the field researcher and supervisor applications.
- Enabling hosting and running the REST APIs and associated data services developed for the mobile application data interfacing.
- Providing Big Data capabilities for long-term data retention and high-performance computing.

For supporting the tablet mobile application communications for the field researcher and supervisor applications, the following figure shows the communications topology:

System Communication Diagram
- The tablet devices are connected to a 4G broadband cellular network.
- The end-to-end communication between field devices and the back-end server goes through a Virtual Private Network (VPN) tunnel to ensure data security.
- Due to communication limitations, tablet devices should alternate between Online and Offline modes.
- In Offline mode, the tablet device can still gather data and save it in a local database that resides on the tablet.
- In Online mode, the device can synchronize the local and central databases, send and receive messages, and perform all other functions that require connectivity.

On the other hand, for enabling hosting and running the REST APIs and associated data services developed for the mobile application data interfacing, the following figure shows the main tablet mobile application system components and data flow:

Mobile Tablet Applications System Modules Diagram

Providing Big Data capabilities for long-term data retention and high-performance computing is covered in the next section.
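The Online/Offline alternation above can be sketched in a few lines of Python. This is a toy, in-memory sketch of the synchronization behaviour only; the real application would use the tablet's local database and a VPN-secured connection, and the class and method names here are illustrative assumptions:

```python
# Toy sketch of the tablet's online/offline alternation: observations are
# always stored locally first, and pushed to the central store only when online.

class TabletClient:
    def __init__(self):
        self.local_queue = []   # stands in for the tablet's local database
        self.online = False

    def record(self, observation):
        """Offline or online, observations are always persisted locally first."""
        self.local_queue.append(observation)

    def synchronize(self, central_store):
        """In Online mode, push queued observations to the central database."""
        if not self.online:
            return 0  # nothing is sent while offline; data stays on the device
        sent = len(self.local_queue)
        central_store.extend(self.local_queue)
        self.local_queue.clear()
        return sent

central = []
tablet = TabletClient()
tablet.record({"product": "bread", "price": 5.0})
tablet.synchronize(central)   # offline: sends nothing, central stays empty
tablet.online = True
tablet.synchronize(central)   # online: the queued observation is pushed
```

Storing locally first and treating the central push as a separate step is what allows field work to continue through coverage gaps.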
7. Recommendations for Applications and Data Management

In the previous section on the tablet mobile application system components, the CAPMAS Backend Server is the landing space for data collected by the field researchers and supervisors. To provide Big Data capabilities for long-term data retention and high-performance computing, and to receive additional data such as self-enumeration and external source integration, additional services will be integrated beneath the backend server receiving tablet data. The following features will be attained through the additional services:

1. Distributed Data Management: data will be stored in distributed blocks on several nodes, enabling granular management, scalability and high-performance computing.
2. Distributed Processing: aggregation, transformation, statistical analysis and data modeling will be implemented on a distributed application framework to enable high-performance, scalable, resilient computing.
3. Batch Loading: enables ingestion of accumulated data in batches for infrequent, large loads.
4. Streaming Loading: enables ingesting data in small, frequent streams in the form of pipelines of messages or transactions.
5. In-Memory Processing: running data analysis on a selected set of data in memory for faster processing and manipulation.
6. Data Science Modeling: specialized libraries that implement machine learning, deep learning, statistical modeling, data mining and analysis operations on top of the data platform.
7. Graph Analysis: components that enable big graph implementations and network analysis models.
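The contrast between features 3 and 4 can be illustrated with a short pure-Python sketch. In the recommended stack these roles would be played by tools such as Sqoop (batch) and Flume (streaming), not by this toy code; the function names and batch size are illustrative assumptions:

```python
# Illustrative contrast between batch loading (few large, infrequent loads)
# and streaming loading (many small, frequent loads).

def batch_load(records, sink, batch_size=100):
    """Accumulate records and hand them to the sink in large, infrequent batches."""
    for start in range(0, len(records), batch_size):
        sink.append(records[start:start + batch_size])

def stream_load(records, sink):
    """Hand each record to the sink as soon as it arrives, one small message at a time."""
    for record in records:
        sink.append([record])

batches, stream = [], []
data = list(range(250))
batch_load(data, batches, batch_size=100)   # 3 loads: 100 + 100 + 50 records
stream_load(data, stream)                   # 250 loads of 1 record each
```

The choice between the two is a latency/overhead trade-off: batch loading suits periodic consolidation from branch offices, while streaming suits continuous arrival of reviewed observations.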
8. Main Recommended Components

Based on the previous sections on the current status and targeted requirements, several components need to be installed to achieve the needed upgrades of the existing infrastructure. The following sections describe the recommended components, subject to review during the implementation of the infrastructure upgrades and setup:

- VMware vCloud Suite: leverages the current virtualized infrastructure into cloud management. vCloud Suite is an integrated offering that brings together VMware's industry-leading vSphere hypervisor and the VMware vRealize Suite multi-vendor hybrid cloud management platform. VMware's new portable licensing units allow vCloud Suite to build and manage vSphere-based private clouds. It accelerates application delivery across both traditional and container-based applications by giving developers the freedom to use the tools that make them most productive, while still ensuring that applications can be moved seamlessly from developer laptop to production.

- Apache Hadoop Distributed File System (HDFS): a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project.

- Apache YARN: the fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
- Apache Spark: a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

- Apache Hive: data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command-line tool and JDBC driver are provided to connect users to Hive.

- Apache HBase: provides random, real-time read/write access to Big Data. The project's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable ("A Distributed Storage System for Structured Data", Chang et al.). Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

- Apache Oozie: a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and extensible system.

- Apache Tez: a project building an application framework that allows for a complex directed acyclic graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.
Tez provides expressive dataflow definition APIs, a flexible Input-Processor-Output runtime model, data type agnosticism, simplified deployment, performance gains over MapReduce, optimal resource management, plan reconfiguration at runtime and dynamic physical data flow decisions.
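As a concrete illustration of the kind of statistical aggregation Spark would run for the CPI process, the following is a pure-Python analogue of a grouped average; in the actual platform this would be a short Spark SQL or DataFrame job distributed across the worker nodes, and the column names here are illustrative assumptions:

```python
# Pure-Python analogue of the grouped aggregation that Spark's DataFrame API
# would execute in a distributed fashion, roughly:
#   df.groupBy("product").agg(avg("price"))
# Column names are illustrative, not the CAPMAS schema.
from collections import defaultdict

def average_price_by_product(observations):
    """Compute the mean collected price per product code."""
    totals = defaultdict(lambda: [0.0, 0])  # product -> [sum of prices, count]
    for obs in observations:
        entry = totals[obs["product"]]
        entry[0] += obs["price"]
        entry[1] += 1
    return {product: s / n for product, (s, n) in totals.items()}

observations = [
    {"product": "bread", "price": 5.0},
    {"product": "bread", "price": 7.0},
    {"product": "rice", "price": 12.0},
]
# average_price_by_product(observations) -> {"bread": 6.0, "rice": 12.0}
```

The point of the Spark recommendation is that exactly this logic, expressed declaratively, is partitioned and executed in parallel across the worker nodes sized in section 9.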
- Apache Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.

- Apache Sqoop: a tool designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

- MongoDB: a document database offering scalability and flexibility together with rich querying and indexing. MongoDB stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time. It will be used as a document store for unstructured data.

- PostgreSQL: a powerful SQL-based database engine that will be used for landing the data collected by the mobile tablet applications, working behind the data services of the REST APIs. It provides extensive high-performance processing as well as special capabilities such as GIS data handling.
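To illustrate the PostgreSQL landing role behind the REST data services, the following sketch uses Python's standard-library sqlite3 as a stand-in for PostgreSQL (the SQL shown is portable between the two for this simple case). The table schema and field names are illustrative assumptions, not the CAPMAS design:

```python
# Sketch of the relational landing table behind the REST data services.
# sqlite3 (standard library) stands in for PostgreSQL here; the schema is
# an illustrative assumption, not the CAPMAS design.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collected_prices (
        id INTEGER PRIMARY KEY,
        product_code TEXT NOT NULL,
        market_id TEXT NOT NULL,
        price REAL NOT NULL CHECK (price > 0),
        collected_at TEXT NOT NULL
    )
""")

def land_record(record):
    """Insert one reviewed observation arriving from the REST API layer."""
    conn.execute(
        "INSERT INTO collected_prices (product_code, market_id, price, collected_at) "
        "VALUES (:product_code, :market_id, :price, :collected_at)",
        record,
    )

land_record({"product_code": "P1", "market_id": "M1",
             "price": 5.0, "collected_at": "2017-09-01"})
```

Note the CHECK constraint: enforcing basic validity at the landing layer complements the collection-time validation on the tablet, so bad records cannot silently enter the back-end processing.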
9. Estimated High-Level Sizing and Specifications

The following table lists the estimated sizing for the infrastructure required to deploy and run the aforementioned components. Sizing will be revised during the implementation, taking advantage of the cloud approach deployed on top of the virtualized infrastructure at the CAPMAS data center:

1. 2 x Name Nodes: 4 cores @ 3.0 GHz, 16 GB RAM, 200 GB storage, Linux OS
2. 2 x Resource Scheduling Nodes: 4 cores @ 3.0 GHz, 16 GB RAM, 200 GB storage, Linux OS
3. 8 x Worker Nodes: 2 cores @ 3.0 GHz, 8 GB RAM, 500 GB storage, Linux OS
4. 2 x Document Services Nodes: 4 cores @ 3.0 GHz, 16 GB RAM, 500 GB storage, Linux OS
5. 2 x REST APIs Hosting Nodes: 4 cores @ 3.0 GHz, 16 GB RAM, 100 GB storage, Linux OS
6. 2 x Central Database Nodes: 4 cores @ 3.0 GHz, 16 GB RAM, 500 GB disk space, Linux OS
7. 2 x Back Office Application Nodes: 4 cores @ 3.0 GHz, 8 GB RAM, 200 GB disk space, Windows Server
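The per-node figures above can be rolled up to total capacity, which is useful for cross-checking against the spare capacity available in the existing virtualized data center. The roll-up is simple arithmetic over the sizing table:

```python
# Roll-up of the sizing table: (vm_count, cores, ram_gb, storage_gb) per row.
nodes = [
    (2, 4, 16, 200),  # Name Nodes
    (2, 4, 16, 200),  # Resource Scheduling Nodes
    (8, 2, 8, 500),   # Worker Nodes
    (2, 4, 16, 500),  # Document Services Nodes
    (2, 4, 16, 100),  # REST APIs Hosting Nodes
    (2, 4, 16, 500),  # Central Database Nodes
    (2, 4, 8, 200),   # Back Office Application Nodes
]
total_vms = sum(n for n, *_ in nodes)
total_cores = sum(n * c for n, c, _, _ in nodes)
total_ram_gb = sum(n * r for n, _, r, _ in nodes)
total_storage_gb = sum(n * s for n, _, _, s in nodes)
# 20 VMs, 64 cores, 240 GB RAM, 7400 GB (about 7.4 TB) storage in total
```

These totals are modest for a virtualized data center, which supports the report's position that the upgrade is primarily a platform-layer effort rather than a hardware procurement.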
10. Conclusion and Next Actions

The achievement of a virtualized infrastructure at CAPMAS paves the way for building a solid foundation for the mobile data collection solution, as well as for other potential data solutions and integration with external data sources. To leverage this achievement, two main additional layers need to be built:

- Extending virtualization to a cloud platform
- Deploying a Big Data management platform

Next actions include commencing the implementation plan for the two items above, for which the implementation team needs to be engaged, while ensuring complete know-how transfer to the CAPMAS team, especially on the Big Data management solutions, as well as extending the back-end capabilities to support the mobile data collection solution as the main focus of this pilot project.
11. References

1. UNECA, CAPMAS and Nile University Letter of Agreement (LoA).
2. Cloud Computing Reference Architecture: Recommendations of the National Institute of Standards and Technology.
3. VMware Virtualization Documentation.
4. CAPMAS Pricing Tablet Application Requirements and Design Document.
5. VMware vCloud Suite.
6. Apache Hadoop Main Page.
More informationCLOUD COMPUTING. Lecture 4: Introductory lecture for cloud computing. By: Latifa ALrashed. Networks and Communication Department
1 CLOUD COMPUTING Networks and Communication Department Lecture 4: Introductory lecture for cloud computing By: Latifa ALrashed Outline 2 Introduction to the cloud comupting Define the concept of cloud
More informationVMware vsphere with ESX 6 and vcenter 6
VMware vsphere with ESX 6 and vcenter 6 Course VM-06 5 Days Instructor-led, Hands-on Course Description This class is a 5-day intense introduction to virtualization using VMware s immensely popular vsphere
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationA Glimpse of the Hadoop Echosystem
A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More informationWHITE PAPER SEPTEMBER VMWARE vsphere AND vsphere WITH OPERATIONS MANAGEMENT. Licensing, Pricing and Packaging
WHITE PAPER SEPTEMBER 2017 VMWARE vsphere AND vsphere WITH OPERATIONS MANAGEMENT Licensing, Pricing and Packaging Table of Contents Executive Summary 3 VMware vsphere with Operations Management Overview
More informationLambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More informationThe Future of Virtualization Desktop to the Datacentre. Raghu Raghuram Vice President Product and Solutions VMware
The Future of Virtualization Desktop to the Datacentre Raghu Raghuram Vice President Product and Solutions VMware Virtualization- Desktop to the Datacentre VDC- vcloud vclient With our partners, we are
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationHortonworks Data Platform
Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationHow Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,
How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS
More informationCustomer Case Studies on Accelerating Their Path to Hybrid Cloud
Customer Case Studies on Accelerating Their Path to Hybrid Cloud Hitachi and VMware: Global Strategic Partners Committed to Success Sunny Sahajpal EMEA Strategic Alliances and OEM Mananger VMware Partner
More informationCisco Integration Platform
Data Sheet Cisco Integration Platform The Cisco Integration Platform fuels new business agility and innovation by linking data and services from any application - inside the enterprise and out. Product
More informationCOPYRIGHTED MATERIAL. Introducing VMware Infrastructure 3. Chapter 1
Mccain c01.tex V3-04/16/2008 5:22am Page 1 Chapter 1 Introducing VMware Infrastructure 3 VMware Infrastructure 3 (VI3) is the most widely used virtualization platform available today. The lineup of products
More informationExam Questions
Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure
More informationCloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationCloud + Big Data Putting it all Together
Cloud + Big Data Putting it all Together Even Solberg 2009 VMware Inc. All rights reserved 2 Big, Fast and Flexible Data Big Big Data Processing Fast OLTP workloads Flexible Document Object Big Data Analytics
More informationCloud Services. Introduction
Introduction adi Digital have developed a resilient, secure, flexible, high availability Software as a Service (SaaS) cloud platform. This Platform provides a simple to use, cost effective and convenient
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large
More informationAbstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight
ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More information7 Things ISVs Must Know About Virtualization
7 Things ISVs Must Know About Virtualization July 2010 VIRTUALIZATION BENEFITS REPORT Table of Contents Executive Summary...1 Introduction...1 1. Applications just run!...2 2. Performance is excellent...2
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationOPENSTACK PRIVATE CLOUD WITH GITHUB
OPENSTACK PRIVATE CLOUD WITH GITHUB Kiran Gurbani 1 Abstract Today, with rapid growth of the cloud computing technology, enterprises and organizations need to build their private cloud for their own specific
More informationEBOOK: VMware Cloud on AWS: Optimized for the Next-Generation Hybrid Cloud
EBOOK: VMware Cloud on AWS: Optimized for the Next-Generation Hybrid Cloud Contents Introduction... 3 What is VMware Cloud on AWS?... 5 Customer Benefits of Adopting VMware Cloud on AWS... 6 VMware Cloud
More informationCOMP6511A: Large-Scale Distributed Systems. Windows Azure. Lin Gu. Hong Kong University of Science and Technology Spring, 2014
COMP6511A: Large-Scale Distributed Systems Windows Azure Lin Gu Hong Kong University of Science and Technology Spring, 2014 Cloud Systems Infrastructure as a (IaaS): basic compute and storage resources
More informationModern Data Warehouse The New Approach to Azure BI
Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationHedvig as backup target for Veeam
Hedvig as backup target for Veeam Solution Whitepaper Version 1.0 April 2018 Table of contents Executive overview... 3 Introduction... 3 Solution components... 4 Hedvig... 4 Hedvig Virtual Disk (vdisk)...
More informationHow to Keep UP Through Digital Transformation with Next-Generation App Development
How to Keep UP Through Digital Transformation with Next-Generation App Development Peter Sjoberg Jon Olby A Look Back, A Look Forward Dedicated, data structure dependent, inefficient, virtualized Infrastructure
More informationCloud Computing introduction
Cloud and Datacenter Networking Università degli Studi di Napoli Federico II Dipartimento di Ingegneria Elettrica e delle Tecnologie dell Informazione DIETI Laurea Magistrale in Ingegneria Informatica
More informationTop 40 Cloud Computing Interview Questions
Top 40 Cloud Computing Interview Questions 1) What are the advantages of using cloud computing? The advantages of using cloud computing are a) Data backup and storage of data b) Powerful server capabilities
More informationOracle Big Data Fundamentals Ed 2
Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies
More informationImproving Blade Economics with Virtualization
Improving Blade Economics with Virtualization John Kennedy Senior Systems Engineer VMware, Inc. jkennedy@vmware.com The agenda Description of Virtualization VMware Products Benefits of virtualization Overview
More informationThe Future of Virtualization. Jeff Jennings Global Vice President Products & Solutions VMware
The Future of Virtualization Jeff Jennings Global Vice President Products & Solutions VMware From Virtual Infrastructure to VDC- Windows Linux Future Future Future lication Availability Security Scalability
More informationTurning Relational Database Tables into Spark Data Sources
Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following
More informationVMware vsphere 4. The Best Platform for Building Cloud Infrastructures
Table of Contents Get the efficiency and low cost of cloud computing with uncompromising control over service levels and with the freedom of choice................ 3 Key Benefits........................................................
More information@Pentaho #BigDataWebSeries
Enterprise Data Warehouse Optimization with Hadoop Big Data @Pentaho #BigDataWebSeries Your Hosts Today Dave Henry SVP Enterprise Solutions Davy Nys VP EMEA & APAC 2 Source/copyright: The Human Face of
More informationIBM Cloud for VMware Solutions
Introduction 2 IBM Cloud IBM Cloud for VMware Solutions Zeb Ahmed Senior Offering Manager VMware on IBM Cloud Mehran Hadipour Director Business Development - Zerto Internal Use Only Do not distribute 3
More informationEMC Business Continuity for Microsoft Applications
EMC Business Continuity for Microsoft Applications Enabled by EMC Celerra, EMC MirrorView/A, EMC Celerra Replicator, VMware Site Recovery Manager, and VMware vsphere 4 Copyright 2009 EMC Corporation. All
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More information1V0-621.testking. 1V VMware Certified Associate 6 - Data Center Virtualization Fundamentals Exam
1V0-621.testking Number: 1V0-621 Passing Score: 800 Time Limit: 120 min 1V0-621 VMware Certified Associate 6 - Data Center Virtualization Fundamentals Exam Exam A QUESTION 1 An administrator needs to gracefully
More informationCitrix Workspace Cloud
Citrix Workspace Cloud Roger Bösch Citrix Systems International GmbH Workspace Cloud is a NEW Citrix Management and Delivery Platform Customers Now Have a Spectrum of Workspace Delivery Options Done By
More informationYOUR APPLICATION S JOURNEY TO THE CLOUD. What s the best way to get cloud native capabilities for your existing applications?
YOUR APPLICATION S JOURNEY TO THE CLOUD What s the best way to get cloud native capabilities for your existing applications? Introduction Moving applications to cloud is a priority for many IT organizations.
More informationMigration and Building of Data Centers in IBM SoftLayer
Migration and Building of Data Centers in IBM SoftLayer Advantages of IBM SoftLayer and RackWare Together IBM SoftLayer offers customers the advantage of migrating and building complex environments into
More informationDepartment of Digital Systems. Digital Communications and Networks. Master Thesis
Department of Digital Systems Digital Communications and Networks Master Thesis Study of technologies/research systems for big scientific data analytics Surname/Name: Petsas Konstantinos Registration Number:
More informationBack To The Future - VMware Product Directions. Andre Kemp Sr. Product Marketing Manager Asia - Pacific
Back To The Future - VMware Product Directions Andre Kemp Sr. Product Marketing Manager Asia - Pacific Disclaimer This session contains product features that are currently under development. This session/overview
More informationVMware Join the Virtual Revolution! Brian McNeil VMware National Partner Business Manager
VMware Join the Virtual Revolution! Brian McNeil VMware National Partner Business Manager 1 VMware By the Numbers Year Founded Employees R&D Engineers with Advanced Degrees Technology Partners Channel
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationINFS 214: Introduction to Computing
INFS 214: Introduction to Computing Session 13 Cloud Computing Lecturer: Dr. Ebenezer Ankrah, Dept. of Information Studies Contact Information: eankrah@ug.edu.gh College of Education School of Continuing
More informationDeveloping Enterprise Cloud Solutions with Azure
Developing Enterprise Cloud Solutions with Azure Java Focused 5 Day Course AUDIENCE FORMAT Developers and Software Architects Instructor-led with hands-on labs LEVEL 300 COURSE DESCRIPTION This course
More informationThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,
More informationUnderstanding Cloud Migration. Ruth Wilson, Data Center Services Executive
Understanding Cloud Migration Ruth Wilson, Data Center Services Executive rhwilson@us.ibm.com Migrating to a Cloud is similar to migrating data and applications between data centers with a few key differences
More informationActivator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.
Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. ACTIVATORS Designed to give your team assistance when you need it most without
More informationProcess Orchestrator Releases Hard or Soft Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y.
This document describes the version and compatibility requirements for installing and upgrading Cisco Process Orchestrator. This document also provides information about the hardware platforms and software
More informationWelcome. Jeremy Poon Territory Manager, VMware
Welcome Jeremy Poon Territory Manager, VMware Partner Recognition VMware Infrastructure The New Computing Platform Presented by: Yasser Elgammal Regional Director, VMware VMware: Who We Are World s leading
More informationBuild your own Cloud on Christof Westhues
Build your own Cloud on Christof Westhues chwe@de.ibm.com IBM Big Data & Elastic Storage Tour Software Defined Infrastructure Roadshow December 2 4, 2014 New applications and IT are being built for Cloud
More informationBig Data with Hadoop Ecosystem
Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process
More informationAchieving Horizontal Scalability. Alain Houf Sales Engineer
Achieving Horizontal Scalability Alain Houf Sales Engineer Scale Matters InterSystems IRIS Database Platform lets you: Scale up and scale out Scale users and scale data Mix and match a variety of approaches
More informationAccelerating Digital Transformation with InterSystems IRIS and vsan
HCI2501BU Accelerating Digital Transformation with InterSystems IRIS and vsan Murray Oldfield, InterSystems Andreas Dieckow, InterSystems Christian Rauber, VMware #vmworld #HCI2501BU Disclaimer This presentation
More informationHadoop Online Training
Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the
More information1V Number: 1V0-621 Passing Score: 800 Time Limit: 120 min. 1V0-621
1V0-621 Number: 1V0-621 Passing Score: 800 Time Limit: 120 min 1V0-621 VMware Certified Associate 6 - Data Center Virtualization Fundamentals Exam Exam A QUESTION 1 Which tab in the vsphere Web Client
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationTHE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB
More information