Installing, Deploying, Using Tensorflow in Python
Speaker: Chris Fregley, Pipeline.ai (former Netflix, Databricks, kickstarted Spark.)
https://github.com/PipelineAI/pipeline/
Zero respect for Python until he found Scientific Python.
PySpark not his favorite, rather use Scala for Spark, originally a Java guy
Slides available on meetup: slides
Great GPU notebooks and examples
This guy is such a salesman. He used port 6969 because of port colissions. Clever.
Notes from meetup site
Main talk (~30 mins + Q&A)
Optimizing, Profiling, and Deploying High-Performance Spark ML and TensorFlow AI Models in Production with GPUs
Using the latest advancements from TensorFlow including the Accelerated Linear Algebra (XLA) Framework, JIT/AOT Compiler, and Graph Transform Tool, I’ll demonstrate how to optimize, profile, and deploy TensorFlow Models - and the TensorFlow Runtime - in GPU-based production environment. This talk is 100% demo based on open source tools and completely reproducible through Docker on your own GPU cluster.
Bio Chris Fregly is Founder and Research Engineer at PipelineAI, a Streaming Machine Learning and Artificial Intelligence Startup based in San Francisco. He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, author of the O’Reilly Training and Video Series titled, "High-Performance TensorFlow in Production."
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.
Tensorflow!
Mostly in C.
- Neural networks never want to go to production
- always want to be retrained
- want someone else to do the last 20%
Pipeline.ai
A framework to get your models into solution. Something about data pipelines!
Today - TensorFlow (TF) - Condensed talk: 8 hour into 30 minutes
Optimize and Deploy TF models to production.
Nice to have a working knowledge of TF.
GPU heavy talk.
Content Breakdown
50% training optimizations(GPUs, Training Pipeline, JIT) 50% prediction optimizations (AOT Compile, TF Serving)
Why heavy focus on 'model prediction' vs 'just training'?
Training - boring and batch! Prediction - exciting and real time!
Agenda - see the slides
Biggest thing is telling data scientists that they can deploy from their jupyter notebook in a controlled manner. Ability to push it out & roll it back.
DOCKER - Package model and runtime as one
- Docker for mac - opened up a ton of use cases!
- no surprises in production
- deploy and tune model + runtime together
- same local, dev, production env
Production always a bit different than development before this. A bunch of virtualbox cruft.
Did he consider vagrant? I think docker just blows vagrant away for developers, removes tons of devops.
Tune Model + Runtime together! (see slides!)
Offline models aren't always going to do well online... (see slides)
Online - real-time metrics. Cost per prediction!
The prediction profiling and tuning stuff is about bot identification.
Pipeline AI - Continuous Model Training
Identify and fix borderline predictions. Hotdog - not hotdog.
Use crowdflower to label the data - then relabel along the classification's edges.
Shift traffic to winning model using AI Bandit Algorithms.
"Once we drain this experiment of its value... we can stop this experiment."
Shift traffic to minimum "Cloud Cost"
Google Cloud vs Amazon Web Services
Shift to amazon when the cloud instances are cheaper. Shift back and forth.
He made fun of microsoft azure but plugged their kubernetes support.
NVIDIA GPU Half-Precision Support
- All about Volta V100 (2017) - AMD - tensor cores / Google TPUs
-
Pascal P100 (2016)
-
FP32 = Full Precision
- FP16 = Half Precision
- Half good for approximate deep learning use cases
- Fit two FP16s into FP32 GPU cores for 2x throughput!!
- Thunk to 32 bit
Nvidia: P100, M40, K40 - all tested for whatever reason
AMD? - GPU CUDA Programming
- Barbaric, fun. Must know hardware. Probably hate your life.
- Optimized for same instruction, multiple thread.
- Do not like if-statement. Like half the cores go unused.
- Independent thread scheduling - finally
- CUDA Streams
- Tensorflow uses this heavily
- goal is to saturate cpus
Check out Batch Normalization
- Almost always use batch normalization - except rarely you should never use batch normalization
- Technique from 2015
- gradient descent
- don't want the network to learn the order the data is showing up
- first part of pipeline is shuffle
- normalize per batch
- normalize per layer
- each mini batch may have wildly different distributions
Dropout
- sniff connections in your network
- prevent overfitting
- ensemble different neural architectures
- randomly, 50% sniff these things, purposely cripple it
- figure out ways to distort things they are trying to block out (spaces, numbers)
- more difficult for the network, better the final network was
this guy something or other - his friend
- https://github.com/yaroslavvb/stuff
DON'T USE FEED_DICT
- feed_dict Requires Python<->C++ Serialization
- Dataset API
- what happened to dataframe? it's gone i guess? tensorflow went right to dataset api
Tensorflow Debugger - mouse from a terminal it's crazy, only Google...
ALWAYS START WITH estimator
AND experiment
APIs
These come from Google trying to productionize tensorflow, probably successfully.
See the slide, has the above title!!!!!!!!
- Skip all the early demos you see, they are old and crusty.
- These simplify model building.
- flexible parameter tuning
- enable rapid model experiments
Estimator API
See the slides.
- Train-to-Serve Design
- Create custom or use a canned Estimator
-
Hides session, graph layers,
-
Chief:Worker
- Supervisor:Worker
- getting away from master but not improving their language... just changing it...
Canned Estimators
See the slides, 60% use prebuilt, 40% roll their own estimator
- Commonly used estimators
- Pre tested and pre tuned
Multi-headed inference: Single-Objective Estimator
vs Multi-Objective Estimator
Get two answers (objectives out of one estimator or something).
Hparams - Hyper parameter tuning
Do a big grid search, a parameter as a range.
Layers API exists
Use kubernetes or mesos
Good cluster organizer for working on this stuff...
JIT Compiler - and visualizing
Need to do a Python Context Manager with device as... whatever
to use JIT.
Historically: Spark & Project Tungsten would take your code and create your plan into an optimized one and then to a physical plan.
Built on XLA - accelerated linear algebra framework.
- Goals
- reduce memory Movement
- reduce overhead of multiple function calls
AOT Compiler - Standalone, Ahead-Of-Time (AOT) Compiler
For super tiny devices... pass in your graph.
Build up DAGs of operations in spark and tensorflow.
- tfcompile - point to input for graph and output for graph
- built on XLA framework
- creates functions with feeds (inputs) and fetches (outputs)
- figures out what everything... .so files, what bare minimum is necessary
- creates binary down to 600k - this is how we fit in the apple app store or not.
- shrink, shrink, tons of slides, freeze for production
request batch tuning
At the end of the day the forward prop. is just matrix multiplies.
See slides
- max_batch_size
- batc