Blog Notes 3: [Airbnb] The data science pipeline, an industrial-grade solution

https://medium.com/airbnb-engineering/using-machine-learning-to-predict-value-of-homes-on-airbnb-9272d3d4739d
Author: Robert Chang

1. Customer Lifetime Value (LTV)

A model of customer lifetime value. Use cases:
At e-commerce companies like Spotify or Netflix, LTV is often used to make pricing decisions like setting subscription fees. At marketplace companies like Airbnb, knowing users’ LTVs enable us to allocate budget across different marketing channels more efficiently, calculate more precise bidding prices for online marketing based on keywords, and create better listing segments.

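Across the whole training, testing, and deployment pipeline, Airbnb uses a lot of amazing tools, so their data scientists do not have to spend much attention on data engineering. The pipeline has four main steps: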
Feature Engineering: Define relevant features
Prototyping and Training: Train a model prototype
Model Selection & Validation: Perform model selection and tuning
Productionization: Take the selected model prototype to production

2. Feature Engineering:

Zipline, Airbnb's internal feature repository, provides ready-made features (150+), which saves writing tedious Hive queries by hand.
Some common business features:
Location: country, market, neighborhood and various geography features
Price: nightly rate, cleaning fees, price point relative to similar listings
Availability: Total nights available, % of nights manually blocked
Bookability: Number of bookings or nights booked in the past X days
Quality: Review scores, number of reviews, and amenities
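To make the feature list above concrete, here is a minimal plain-Python sketch of two of them: bookings in the past X days and the percentage of nights manually blocked. It does not use Zipline (whose API is not described in these notes); the function names, the calendar layout, and the sample data are all hypothetical.

```python
from datetime import date, timedelta


def bookings_in_past_days(booking_dates, as_of, window_days):
    """Count bookings made in the window_days days before as_of (as_of excluded)."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for d in booking_dates if start <= d < as_of)


def pct_nights_blocked(calendar):
    """Percentage of calendar nights whose status is 'blocked'."""
    if not calendar:
        return 0.0
    blocked = sum(1 for status in calendar.values() if status == "blocked")
    return 100.0 * blocked / len(calendar)


# Hypothetical data for a single listing
bookings = [date(2017, 6, 1), date(2017, 6, 20), date(2017, 7, 3)]
calendar = {
    date(2017, 7, 10): "available",
    date(2017, 7, 11): "blocked",
    date(2017, 7, 12): "booked",
    date(2017, 7, 13): "blocked",
}

features = {
    "bookings_past_30d": bookings_in_past_days(bookings, date(2017, 7, 10), 30),
    "pct_nights_blocked": pct_nights_blocked(calendar),
}
```

In a real setup these would be computed per listing and per snapshot date, which is exactly the repetitive work a feature repository is meant to take over.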

3. Prototyping and Training

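Model prototypes are built with sklearn and Spark. Haha, they use sklearn too.
- Handling missing data
- Encoding: when a feature has few categories, use one-hot encoding; when it has many, use ordinal encoding
The difference between the two: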
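The two encodings can be contrasted with a small plain-Python sketch (deliberately not sklearn, to keep it dependency-free). One-hot encoding produces one 0/1 column per category, so the width grows with the number of categories; ordinal encoding maps each category to a single integer. Assigning the integers by frequency rank is an assumption made here for illustration, and the `room_types` data is hypothetical.

```python
from collections import Counter


def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def ordinal_encode_by_frequency(values):
    """Return (mapping, codes): one integer per category, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically so the mapping is deterministic
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    mapping = {c: i for i, c in enumerate(ranked)}
    return mapping, [mapping[v] for v in values]


# Hypothetical categorical feature
room_types = ["entire_home", "private_room", "entire_home",
              "shared_room", "entire_home", "private_room"]

onehot_categories, onehot_rows = one_hot_encode(room_types)
mapping, codes = ordinal_encode_by_frequency(room_types)
```

With 3 categories the one-hot matrix has 3 columns, while the ordinal version stays a single column no matter how many categories there are. That is why the latter is preferred for high-cardinality features, at the cost of imposing an artificial order.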

4. Performing Model Selection

  • Many AutoML tools, for example:
    • TPOT
    • Auto-Sklearn
    • Auto-Weka
    • Machine-JS
    • DataRobot
  • Common models such as XGBoost
  • Bias-variance tradeoff: weigh interpretability against complexity, that is, accuracy against overfitting. See the figure below.
    [Figure: interpretability vs. complexity tradeoff]
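For reference, the tradeoff above is usually written as the standard decomposition of expected squared error. This is the textbook form, not something taken from the Airbnb post. It assumes $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, where $\hat{f}$ is the fitted model and the expectation is over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

More complex models tend to lower the bias term and raise the variance term, which is the overfitting side of the tradeoff.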

5. Productionization: deploying the model

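  • Tool used: ML Automator, Airbnb's notebook translation framework
     ML Automator translates a Jupyter notebook into their own Airflow pipeline. See the figure below.
    [Figure: ML Automator translating a Jupyter notebook into an Airflow pipeline]
     - Sometimes a Hive UDF (user-defined function) has to be written in Python so that the model can run in a distributed fashion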
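The notes do not say how the Python UDF is wired into Hive. One common way to run Python per row in Hive is the `TRANSFORM` clause, which streams rows to a script as tab-separated lines on stdin and reads tab-separated lines back from stdout. The sketch below assumes that approach; the column layout (`listing_id`, `nightly_rate`, `cleaning_fee`) and the computed value are hypothetical stand-ins for real model scoring.

```python
import io
import sys


def transform_line(line):
    """Map one tab-separated input row to one tab-separated output row.

    Input columns (hypothetical): listing_id, nightly_rate, cleaning_fee.
    Output columns: listing_id, nightly_rate + cleaning_fee.
    """
    listing_id, nightly_rate, cleaning_fee = line.rstrip("\n").split("\t")
    total = float(nightly_rate) + float(cleaning_fee)
    return "{}\t{:.2f}".format(listing_id, total)


def main(stream_in, stream_out):
    """Process every non-empty row from stream_in and write results to stream_out."""
    for line in stream_in:
        if line.strip():
            stream_out.write(transform_line(line) + "\n")


# Local demo with in-memory streams; a deployed script would instead
# call main(sys.stdin, sys.stdout) when run by Hive.
demo_in = io.StringIO("42\t100\t25.5\n7\t80\t0\n")
demo_out = io.StringIO()
main(demo_in, demo_out)
result = demo_out.getvalue()
```

On the Hive side, such a script would typically be shipped with `ADD FILE` and invoked with `SELECT TRANSFORM(...) USING 'python <script>' AS (...)`. Because Hive runs the script inside each task, the Python logic is applied in parallel across the data.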

6. Key takeaways