This presentation was recorded at YOW! 2019. #GOTOcon #YOW
Tommy Hall - Theatre fan, occasional mountaineer, part time runner, thoroughly nice chap, available in fine bookstores everywhere
ABSTRACT
In all businesses, there is some kind of data pipeline, even if it’s powered by humans working off a shared drive somewhere. Lots of places are better than this - they have workflow systems, ETL pipelines, analytics teams, data scientists, etc - but can they say, months later, which version of which code, running on what data, generated a given insight?
Can they be reproduced?
What if the algorithms change, do you go back and re-run everything?
Science itself has a reproducibility problem, but it’s worse in most companies, and mistakes can be expensive.
There is a useful subset of data pipelines, let’s call them “pure”, that only depend on the data flowing through them. For pure pipelines, we can use techniques from distributed build systems to know what code was used for each step, keep every previous result as we improve our algorithms, and avoid repeating work that has already been done.
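As a rough illustration of the build-system idea described above, here is a minimal Python sketch, not taken from the talk. It assumes each step is a pure function and that results are keyed by a hash of the step name, a code version, and the input data. The names `Step`, `ResultStore`, `digest`, and `code_version` are hypothetical; in practice the code version might be something like a git commit hash, and the store would be persistent rather than an in-memory dict.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


def digest(value: Any) -> str:
    """Content hash of any JSON-serializable value."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()


@dataclass(frozen=True)
class Step:
    name: str
    code_version: str            # assumption: e.g. a git commit hash
    func: Callable[[Any], Any]   # must be pure: output depends only on input


class ResultStore:
    """In-memory stand-in for a content-addressed result store."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}  # cache key -> step output
        self.executions = 0                # how many times a step really ran

    def run(self, step: Step, data: Any) -> Tuple[str, Any]:
        # The key records which code ran on which data.
        key = digest([step.name, step.code_version, digest(data)])
        if key not in self.results:
            self.results[key] = step.func(data)
            self.executions += 1
        return key, self.results[key]


clean_v1 = Step("clean", "v1", lambda rows: [r.strip().lower() for r in rows])
clean_v2 = Step("clean", "v2", lambda rows: [r.strip().upper() for r in rows])

store = ResultStore()
raw = [" Alice ", "BOB"]

key_a, out_a = store.run(clean_v1, raw)  # first run: executes the step
key_b, out_b = store.run(clean_v1, raw)  # same code, same data: served from the store
key_c, out_c = store.run(clean_v2, raw)  # changed algorithm: new key, old result kept
```

Because the key includes the code version, changing the algorithm adds a new entry instead of overwriting the old one, and re-running unchanged code on unchanged data does no work.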
This talk contains interesting theory but is resolutely practical, with concrete examples in several languages and distributed computation frameworks. [...]
Data Pipelines à La Mode • Tommy Hall • YOW! 2019