Luigi Build Data Pipelines of batch jobs. - Pramod Toraskar

1 Luigi Build Data Pipelines of batch jobs - Pramod Toraskar

2 I am a Principal Solution Engineer and Pythonista with more than 8 years of work experience. I work for Red Hat India, an open source solutions company that uses a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux, and middleware technologies. I am responsible for the design, architecture, build, and implementation of scalable marketing process automation (Python, MongoDB) deployed to internal PaaS systems, including OpenShift 3 (a Docker/Kubernetes-based container application platform). I am the team/technical lead for Systems Scalability, a data engineering team that supports operational data pipelines. I also actively participate in open source community programs such as Python and Data Science. I hold a Diploma in Computer Engineering from AISSMS Polytechnic and a B.E. in Computer Engineering from MIT Pune.

3 DATA ENGINEERING FOR MARKETING AUTOMATION Modern marketing automation platforms include some built-in functionality for data flow, data manipulation, and business rule implementation. They are not necessarily built with programmers in mind, and pass up some functionality in favor of robustness and user-friendliness. As processes grew more complex, so did our implementation requirements. We needed to become more rigorous in our testing and version control, and expand our available toolkit to match business needs.

4 LEVERAGING OPENSHIFT AND LUIGI FOR DATA ENGINEERING WORK The data engineering team is moving most of its operational data pipelines onto OpenShift, a container-based software deployment and management product. In this process we relied heavily on Luigi for workflow orchestration. We will walk through our reasons for moving marketing data pipelines into scripted workflows, our technology choices, a high-level overview of the solution architecture, and a brief overview of what pipelines we have implemented (and are planning to implement). How much data do we deal with?
- 36 million contact records per year
- 2.6 to 3 million contact records on average per month
- Monthly/daily/hourly reporting
- Business metric dashboards

5 TECHNOLOGY CHOICE Luigi - workflow engine

6 WHY OPENSHIFT? We are moving to OpenShift Container Application Platform v3 (OS3), a modern platform for building and deploying Docker containers using Kubernetes. This is very exciting for us because it removes some of the limitations we previously experienced, especially around memory management, job scheduling, and available frameworks. A further bonus is the built-in Source-to-Image (S2I) tool, which made the migration much easier for us. With S2I, our team only has to worry about writing scripts, not about building containers.

7 WHAT IS LUIGI
- Python module to help build complex pipelines
- Created by Spotify
- Dependency resolution
- Workflow management
- Data flow visualization
- Hadoop support built in
- (Diagram labels: Data Storage; Clean, filter, join and aggregate data; Cassandra; Time Series data)
- Luigi is used internally at Spotify to build complex data pipelines
- Luigi doesn't replace Hadoop, Scalding, Pig, Hive, Redshift. It orchestrates them.

8 LUIGI CONCEPTS
Tasks
- Units of work that produce outputs
- Can depend on one or more other tasks
- Are only run once all of their dependencies are complete
- Are idempotent
Entirely code-based
- Most other tools are GUI-based or declarative and don't offer any abstraction
- With code you can build anything you want
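The two task rules above (run only after dependencies are complete, and never redo finished work) can be illustrated without Luigi at all. The following is a minimal sketch using only the Python standard library; it is not how Luigi's scheduler is implemented, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(dependencies, is_complete, run_task):
    """Run tasks in dependency order, skipping tasks that are already complete.

    dependencies maps each task name to the set of task names it depends on.
    Returns the list of tasks that were actually executed.
    """
    executed = []
    for task in TopologicalSorter(dependencies).static_order():
        if is_complete(task):
            continue  # idempotency: completed work is not repeated
        run_task(task)
        executed.append(task)
    return executed


# Hypothetical three-step flow; "get_contacts" is assumed to be done already.
done = {"get_contacts"}
deps = {
    "get_contacts": set(),
    "process_contacts": {"get_contacts"},
    "post_contacts": {"process_contacts"},
}

first_run = run_pipeline(deps, lambda t: t in done, done.add)
second_run = run_pipeline(deps, lambda t: t in done, done.add)
```

Running the flow a second time executes nothing, which is the behaviour the "idempotent" bullet describes.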

9 DEFINING A DATA FLOW A bunch of data processing tasks with inter-dependencies

10 (Diagram with columns Dependencies, Task, Targets) External API: Contacts -> Get Contacts -> S3://data/contacts.json -> POST to DWM -> S3://data/processed/contacts.json

11 TASK
    import luigi

    class GetContacts(luigi.Task):
        def output(self):
            pass  # where the result is written

        def requires(self):
            pass  # tasks that must complete first

        def run(self):
            pass  # the actual work

    luigi.run(main_task_cls=GetContacts)

12 LUIGI TASK BREAKDOWN Parametrization To use data flows as command line tools

13 TARGETS & PARAMETERS (Diagram with columns Dependencies, Task, Targets)
Get Contacts: parameters date: DateParameter, api_method: Parameter; input() from External API: Contacts; output() to S3://data/contacts.json
POST to API: parameters date: DateParameter, chunk_ct: IntParameter; input() from S3://data/contacts.json; output() to S3://data/processed/contacts.json

14 TARGET A target is simply something that exists or doesn't exist. There are lots of ready-made targets in Luigi, for example:
- local file: a file in a local file system
- HDFS file
- S3 key/value target: a file in an Amazon S3 bucket
- SFTP remote target: a file in a remote file system
- SQL table row target: a database row in a SQL database
- Amazon Redshift table row target
- ElasticSearch target

15 MULTIPLE DEPENDENCIES (Diagram with columns Dependencies, Task, Targets)
Get Contacts: parameters date: DateParameter, api_method: Parameter; input() from External API: ELQ, External API: HubSpot, and External API: Contacts; output() to S3://data/contacts.json
POST to API: parameters date: DateParameter, chunk_ct: IntParameter; input() from S3://data/contacts.json; output() to S3://data/processed/contacts.json

16 DYNAMIC DEPENDENCIES
    import glob
    import luigi

    class LoadAllContact(luigi.Task):
        date = luigi.DateParameter()

        def run(self):
            # Dependencies discovered at run time are yielded from run()
            for file_path in glob.glob("/data/api_contact_data/*.json"):
                yield TransformContactAPIData(file_path)

17 WRAPPER TASK
    class LoadAllContact(luigi.WrapperTask):
        date = luigi.DateParameter()

        def requires(self):
            # A wrapper task does no work itself; it only groups other tasks
            yield GetContact(self.date)
            yield SyncContact(self.date)
            yield LoadContactRules(self.date)

18 RUN WITH MULTIPLE WORKERS $ PYTHONPATH dataflow --workers 3 AggregateArtists --date 2018-W11 (Diagram: seven Streams (date= ) tasks and one AggregateArtists (date= ) task.)

19 LARGE DATA FLOWS (Screenshot from web interface)

20 PROCESS SYNCHRONIZATION Prevents two identical tasks from running simultaneously. luigid: simple task synchronization. (Diagram: Data flow 1 and Data flow 2 share a common dependency task.)

21 TIPS & TRICKS
Save often
- Save the results of each step
- They may be useful later on
- It's super useful for debugging
- But be OK with regenerating when needed: after accidentally deleting a massive output directory, it was easy (though time-consuming) to recreate only what was needed
Aim small, miss small (code small, retry small)
- Shoot for relatively small units of work
- The pipeline will be easier to understand
- A task that takes a long time and might fail is easier to deal with when it is small

22 Idempotency: think it, live it, love it
- Again, keep things small
- Write to somewhere else and don't update the source data
- Tasks should only be changing one thing (if possible)
- Use atomic writes (where possible)
Parallelization can be your friend
- Luigi can parallelize your workflows
- But you need to tell it that you want that
- Default number of workers is 1
- Use --workers to specify more
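The "use atomic writes" tip matters because a task's output is treated as done as soon as it exists, so a half-written file could be mistaken for a finished one. One common way to get an atomic write on a local file system is to write to a temporary file in the same directory and then rename it into place. This is a minimal standard-library sketch of that idea; the helper name `atomic_write` and the paths are hypothetical.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on
    # one file system, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.replace(tmp_path, path)  # final name appears only when complete
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leave no partial output behind
        raise


out_dir = tempfile.mkdtemp()  # hypothetical output location
target = os.path.join(out_dir, "processed", "contacts.json")
atomic_write(target, '{"contacts": []}')
```

If the write fails part-way, the final path is never created, so the task is still seen as incomplete and can simply be retried.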

23 THINGS WE MISSED OUT There are lots of task types that we haven't mentioned: spark, elasticsearch, hive, pig, redis, redshift, salesforce, S3, ecs, mongodb, mysql, pyspark, etc. Check out the luigi.contrib package. Using a persistent task history database, you could train a simple k-NN classifier to predict how long a task will run, then use this with the dependency graph to predict when any task will finish.

24 LUIGI LIMITATIONS It wouldn't be fair not to mention some limitations of the current design:
- Less useful for near real-time pipelines or continuously running processes.
- You can schedule a few thousand jobs, but it is not meant to scale beyond tens of thousands.
- Doesn't support distributed execution.
- Doesn't provide a built-in way to trigger flows.

25 ONWARDS The docs: http://luigi.readthedocs. The mailing list: m/forum/#!forum/luigiuser/ The source: /luigi

26 THANK YOU! We're hiring Python data engineers!
