class: center, middle, inverse, title-slide

.title[
# dbt Sharpens Data Science
]
.author[
### Alex Abraham
alextabraham.com
]
.date[
### 2023-08-16
]

---

# dbt Resolves Data Chaos, Through Standards

- Data transformation: "a cohesive arc moving data from _source-conformed_ to _business-conformed_" (`dbt` blog)
- Data have idiosyncrasies -> transformations might adopt ad hoc designs
  - Ad hoc designs costly in maintenance, collaboration, ...
  - Worse if that design isn't your payload ("load that pays")
- Standards save cognitive load, for better uses!
- dbt standardizes the data transformation workflow, overcoming challenges
  - [historically](https://www.getdbt.com/blog/it-s-time-for-open-source-analytics/)
  - currently

---

# dbt Standardizes Tooling, and also Philosophy

- dbt Core tool is incredible
  - `dbt build` -- 'nuff said.
- dbt blog shares secret sauce (principles, "philosophy"):
  - How to clearly structure a complex data project
  - How to unlock tool's full potential (_required reading_)
- In my Data Science experience:
  - I will compose data transformations (that's [good](https://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/))
  - Non-standard workflows hard to maintain, extend
    - Are components modular & orthogonal?
    - Time-to-understanding, for new reader or future self?

---
class: inverse, center, middle

# dbt philosophy and tools sharpen data science practice.

---

# Philosophy: Standard Workflow Layers, Details

- Philosophy articulates,
  - Transformation workflow as standard _layers_
  - Overarching conventions for implementation, style

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Gonna start a conference called <a href="https://twitter.com/hashtag/NormIPS?src=hash&ref_src=twsrc%5Etfw">#NormIPS</a> that’s just presentations of middlebrow ML topics. 
“how to structure Python packages 2022”, “how many k-folds is too many”, “how to make the browser pop-up come up when the notebook is done running”, “putting features in Postgres”, etc.</p>&mdash; Vicki (@vboykis) <a href="https://twitter.com/vboykis/status/1552066833582276610?ref_src=twsrc%5Etfw">July 26, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

- Sources: [style guide](https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md), [best practices document](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview)

---

# Standard Layers Structure a Workflow

- **"Stacking our transformations in optimized, modular layers means we can apply each transformation in only one place"**
  - Remarkable -- contrast to decision burden with ad hoc approach
- Staging: atomic building blocks
- Intermediate: transformation steps
- Marts: business-meaningful entities
- Probably not your workflow version 1
  - "With a functioning model flowing in dbt, we can start refactoring and optimizing that mart."
  - Add this complexity as necessary!

---

# A Few Properties Tell a Layer's Story

- Folder structure
  - "Folder structure is also one of the key interfaces for understanding the knowledge graph encoded in our project [...] It should reflect how the data flows, step-by-step, from a wide variety of source-conformed models into fewer, richer business-conformed models."
- File names
  - "By descriptively labeling the transformations happening [within model ...], even a stakeholder who doesn't know SQL would be able to grasp the purpose of this section"
- Typical operations
  - Manage complexity by breaking out small steps

---

# Staging Layer: Atomic Building Blocks

- Folder structure
  - One subdirectory per source system (Stripe, transactions database, etc.)
- File names
  - Layer, source system, entity (plural): `stg_[source]__[entity]s.sql`
- Typical operations
  - Rename
  - Change data types
  - Univariate transforms
    - Converting units
    - Discretizing/bucketizing
  - NOT joins (layer not dedicated to integrations)
  - NOT aggregations

---

# Intermediate Layer: Transformation Steps

- Folder structure
  - One subdirectory per business area (finance, marketing, etc.)
- File names
  - Layer, entity (plural), transform action: `int_[entity]s_[verb]s.sql`
  - Emphasize business-conformed concept over source system
  - "easy for **anybody** to quickly understand what's happening in that model"
- Typical operations
  - Change of data grain
  - Table pivot
  - Aggregations
  - Joins

---

# Mart Layer: Business-Meaningful Entities

- Folder structure
  - One subdirectory per business area (finance, marketing, etc.)
  - Still label calculation by _concept_, not team
    - Good: `tax_revenue`, `revenue`
    - Not good: `finance_revenue`, `marketing_revenue`
- File names
  - Entity: `[entity]s.sql`
  - "for pure marts, there should not be a time dimension [...] typically best captured via metrics"
- Typical operations
  - Joins
  - Wide array of calculations -- though limited count
    - Many calculations in one mart signal an opportunity to move logic upstream

---

# Test-Driven Data Transformation

- At least `unique` and `not_null` tests, always
  - How else to defend that a revision maintains existing functionality?
- Might write tests first, a la Test-Driven Development
- Broadly applicable practice: in dbt project, Jupyter notebook, ...
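
A minimal sketch, outside dbt, of what `unique` and `not_null` assert about a key column (say, in a notebook). The helper names `check_unique` and `check_not_null` and the `stg_payments` records are hypothetical illustrations, not dbt's API:

```python
from collections import Counter


def check_not_null(rows, column):
    """Return the rows where `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def check_unique(rows, column):
    """Return the non-null values of `column` that appear more than once."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical staging table, represented as a list of records.
stg_payments = [
    {"payment_id": 1, "amount": 10.0},
    {"payment_id": 2, "amount": 4.5},
    {"payment_id": 2, "amount": 4.5},     # duplicated key
    {"payment_id": None, "amount": 7.0},  # missing key
]

null_failures = check_not_null(stg_payments, "payment_id")
duplicate_keys = check_unique(stg_payments, "payment_id")

print(f"not_null failures: {len(null_failures)}")
print(f"duplicated keys: {duplicate_keys}")
```

Written first, checks like these fail until the transformation produces one non-null key per row, and they keep guarding that property through later revisions.
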
---

# Code Readability is an Objective to Optimize

- "**DO NOT OPTIMIZE FOR FEWER LINES OF CODE**"
  - Echoes _Art of Readable Code_
- Leverage CTEs early and often
  - Import-type (a la Python package imports)
  - Logical-type (break down complexity)

---

# We Can Bring Order to Naming, Broadly

- Already covered table names
- `snake_case` for schema, table, and column naming
- Full name > domain-specific abbreviation
  - Consider onboarding employee
  - "Code should be written to minimize the time it would take for someone else to understand it" (_Art of Readable Code_)
- Timestamp columns `<event>_at`, date columns `<event>_date`
- True/false columns `is_<condition>`

---
class: inverse, center, middle

# dbt tooling, to this data scientist

---

# Interface Simplicity is Exemplary

- Well-encapsulated API: one command line statement accomplishes goal
  - `dbt build`
    - Specific to machine learning: resist ["pipeline jungle"](https://www.youtube.com/watch?v=sr3_DQ-RhTc)
  - `dbt docs generate`
  - `dbt docs serve`
- Low-visual-noise interfaces for configuration
  - YAML files

---

# dbt Inspires More SWE in Analytics

- dbt _connects_ data analytics, software engineering
- This connection should revolutionize more of analytics!
- Some examples I've been working on follow:

---

# Motivate One Workflow for ML Feature Transforms

- Business stakeholders and data practitioners often unhappy with cost of reaching baseline predictive model
  - Even with analyst-friendly data in hand
- Analyst-friendly data need to be transformed into model-ready features
  - _Feature transform workflow/pipeline_
  - Workflow rewrites are common, between and within projects

---

# model-flow: One Workflow for ML Feature Transforms

- **Solution: de-couple (a) standard operations, (b) project-specific configuration**

.center[![An ML system standardizes by de-coupling.](data:image/png;base64,#images/ml_systems_complexity_modular.png)]

SOURCE: [Hidden Technical Debt in Machine Learning Systems](https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf)

---

# model-flow Usage

- Compose a few fun-to-read configuration files
  - "What data are used in this analysis?"
  - "What transformation steps yield model-ready data?"
- Get model-ready data with:

.center[ ```python -m src.features.main``` ]

- Transformed data could pass to many possible models
- For more,
  - My [blog](https://alextabraham.com/post/2023-08-10-improvingtimetomodelfeaturetransforms/)
  - My [GitHub repo](https://github.com/abrahamalex13/model-flow)

---

# Underutilized Data Abound Outside Formal Tech

- Consider a conventionally anecdotal marketplace
  - Our example: baseball cards
- Expect high friction in asset valuation precedent
  - time-intensive,
  - hard-to-scale,
  - inaccurate
- With the internet, relevant data abound -- they need wrangling!
- With our data and predictive models, we
  - Save sellers time spent during listing process,
  - Inform buyers about fundamental value drivers (performance _and_ card details)

---

# Kyle Tucker's Batting Gloves?

<img src="data:image/png;base64,#images/king_tuck.png" width="450px" height="450px" style="display: block; margin: auto;" />

We yammer as [@SchoolhouseData](https://twitter.com/SchoolhouseData)

---
class: inverse, center, middle

# dbt philosophy and tools sharpen data science practice.

---
class: inverse, center, middle

## Alex Abraham<br>alextabraham.com<br>@SchoolhouseData