From charlesreid1

Outline

The Google Cloud Data Engineering Certification exam guide is pretty hefty. The entire contents are given here: https://cloud.google.com/certification/guides/data-engineer/#sample-case-study

This page will contain some notes on the different sections of the exam guide, based on the Coursera course and my own experience.

Section 1: Designing Data Processing Systems

Main topics:

  • Design of flexible data representations
  • Design of data pipelines
  • Design of data processing architecture

Here are some specific considerations in this category:

  • Future advances in data technology
  • Changes to business requirements
  • Current state, potential future states
  • Potential future migrations
  • Tradeoffs
  • Availability
  • Distributed systems
  • Designing data schema

Section 2: Building and Maintaining Data

Focus: data structures and databases

Main topics:

  • Flexible data representations
  • Data pipelines
  • Data processing infrastructure

Considerations in this category:

  • Data cleaning
  • Batch vs. streaming data processing
  • Transformation of data
  • Acquisition of data
  • Importing data
  • Quality control of data
  • New data sources
  • Resources needed for data processing
  • Monitoring of pipelines
  • Adjustment of pipelines
  • Quality control of pipelines
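
To make the batch vs. streaming distinction above concrete, here is a minimal sketch in plain Python. It is not tied to any Google Cloud service; the function names `batch_mean` and `streaming_mean` are made up for illustration. The batch version needs the whole dataset before it can answer, while the streaming version updates its answer as each record arrives.

```python
def batch_mean(records):
    """Batch style: the full dataset must be available before computing."""
    return sum(records) / len(records)


def streaming_mean(records):
    """Streaming style: emit an updated mean after every incoming record.

    Uses the incremental update mean += (x - mean) / count, so only the
    running count and mean are kept in memory, never the full history.
    """
    count = 0
    mean = 0.0
    for x in records:
        count += 1
        mean += (x - mean) / count
        yield mean


if __name__ == "__main__":
    data = [2, 4, 6]
    print(batch_mean(data))            # one answer, after all data is in
    print(list(streaming_mean(data)))  # one answer per record seen so far
```

Both approaches agree once all records have been seen; the tradeoff is latency (streaming gives early answers) against simplicity (batch is easier to reason about and re-run).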

Section 3: Data Analysis and Machine Learning

Main topics:

  • Data analysis
  • Machine learning
  • Deploying machine learning models

Considerations in this section include:

  • Collecting data
  • Visualizing data
  • Reducing the dimension of data
  • Cleaning and normalizing data
  • Defining what "success" means
  • Defining other metrics
  • Feature selection
  • Algorithm selection
  • Model debugging
  • Cost vs. performance
  • Online learning
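
As a rough illustration of the "cleaning and normalizing data" item above, the following sketch drops missing values and rescales what is left to zero mean and unit variance (z-scores). The function name `clean_and_normalize` and the choice of z-score scaling are assumptions for the example; other cleaning rules and scalings are equally valid.

```python
import math


def clean_and_normalize(values):
    """Drop missing entries (None or NaN), then rescale to z-scores."""
    cleaned = [
        float(v)
        for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not cleaned:
        return []
    mean = sum(cleaned) / len(cleaned)
    # Population variance (divide by n), chosen here for simplicity.
    variance = sum((v - mean) ** 2 for v in cleaned) / len(cleaned)
    std = math.sqrt(variance)
    if std == 0.0:
        # All values identical: there is no spread to scale by.
        return [0.0 for _ in cleaned]
    return [(v - mean) / std for v in cleaned]


if __name__ == "__main__":
    raw = [1.0, None, 2.0, float("nan"), 3.0]
    print(clean_and_normalize(raw))
```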

Section 4: Modeling Business Processes

Main topics:

  • Transforming business requirements into data representations
  • Optimizing data representations, infrastructure, performance, cost

Considerations in this topic are:

  • Working with business people
  • Working with users
  • Getting business requirements
  • Knowing the scale of resources required
  • Knowing what data cleaning to do
  • How to implement high performance algorithms
  • Common sources of error (e.g., selection bias)
  • How to remove error

Section 5: Reliability

Main topics are:

  • Quality control
  • Assessment, troubleshooting and improvement of infrastructure
  • Assessment, troubleshooting and improvement of models
  • Recovering data

This includes knowing and doing the following:

  • Verification of data
  • Test suites
  • Pipeline monitoring
  • Planning for fault-tolerance
  • Planning for execution on failure (retroactive analysis, re-running failed jobs)
  • Stress testing
  • Planning for failure
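
The "re-running failed jobs" item above can be sketched as a small retry loop. This is a generic illustration rather than the API of any particular scheduler; `run_with_retries`, its `max_attempts` parameter, and the job dictionary layout are all hypothetical.

```python
def run_with_retries(jobs, max_attempts=3):
    """Run each named job, re-running failures up to max_attempts times.

    jobs maps a job name to a zero-argument callable.
    Returns (succeeded, failed): results by name, and the last error
    message by name for jobs that never succeeded.
    """
    succeeded = {}
    failed = {}
    for name, job in jobs.items():
        for attempt in range(1, max_attempts + 1):
            try:
                succeeded[name] = job()
                failed.pop(name, None)  # clear errors from earlier attempts
                break
            except Exception as exc:
                # Keep the most recent error for retroactive analysis.
                failed[name] = f"attempt {attempt}: {exc}"
    return succeeded, failed
```

A real pipeline would usually add a delay between attempts and retry only errors known to be transient; those are left out here to keep the sketch short.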

Section 6: Visualizing Data

Main topics are:

  • Building data viz tools
  • Publishing data
  • Reporting on data

Considerations:

  • Automating visualization and report generation
  • Supporting decision-making
  • Summarizing data
  • Reporting on data fidelity, data trackability, data integrity

Section 7: Design for Security and Compliance

This is the section I'm least familiar with.

Main topics:

  • Secure data infrastructure
  • Legal compliance in data handling

Specifically:

  • Identity/access management (IAM)
  • Data security
  • Performing penetration testing
  • Need-to-know/separation of responsibility
  • Implementing proper security controls
  • Knowing relevant legislation
  • Preparation for audits

Relevant legislation includes:

  • HIPAA (Health Insurance Portability and Accountability Act)
  • COPPA (Children's Online Privacy Protection Act)

