GCDE/Outline of Topics
From charlesreid1
Outline
The Google Cloud Data Engineering Certification exam guide is pretty hefty. The entire contents are given here: https://cloud.google.com/certification/guides/data-engineer/#sample-case-study
This page will contain some notes on the different sections of the exam guide, based on the Coursera course and my own experience.
Section 1: Designing Data Processing Systems
Main topics:
- Design of flexible data representations
- Design of data pipelines
- Design of data processing architecture
Here are some specific considerations in this category:
- Future advances in data technology
- Changes to business requirements
- Current state, potential future states
- Potential future migrations
- Tradeoffs
- Availability
- Distributed systems
- Designing data schema
Section 2: Building and Maintaining Data
Focus: data structures and databases
Main topics:
- Flexible data representations
- Data pipelines
- Data processing infrastructure
Considerations in this category:
- Data cleaning
- Batch vs. streaming data processing
- Transformation of data
- Acquisition of data
- Importing data
- Quality control of data
- New data sources
- Resources needed for data processing
- Monitoring of pipelines
- Adjustment of pipelines
- Quality control of pipelines
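To make the "batch vs. streaming" item above concrete, here is a minimal sketch in plain Python. It is not tied to any Google Cloud service, and the function names (`batch_mean`, `streaming_mean`) are made up for illustration. The batch version needs the whole dataset up front; the streaming version updates its answer as each value arrives.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: the full dataset is available before processing starts."""
    if not values:
        raise ValueError("batch_mean needs at least one value")
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated mean after each arriving value."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Incremental update, so earlier values never need to be stored.
        mean += (value - mean) / count
        yield mean


if __name__ == "__main__":
    readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings
    print(batch_mean(readings))
    print(list(streaming_mean(readings)))
```

The last value produced by the streaming version matches the batch result; the difference is that intermediate answers are available before all the data has arrived.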
Section 3: Data Analysis and Machine Learning
Main topics:
- Data analysis
- Machine learning
- Deploying machine learning models
Considerations in this section include:
- Collecting data
- Visualizing data
- Reducing the dimension of data
- Cleaning and normalizing data
- Defining what "success" means
- Defining other metrics
- Feature selection
- Algorithm selection
- Model debugging
- Cost vs. performance
- Online learning
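As a small illustration of the "cleaning and normalizing data" item above, the following sketch drops missing values and applies min-max scaling. This is only one of several possible normalization schemes, and the helper name `clean_and_normalize` is hypothetical.

```python
from typing import List, Optional


def clean_and_normalize(raw: List[Optional[float]]) -> List[float]:
    """Drop missing values, then rescale the rest to the range [0, 1]."""
    # Cleaning step: treat None as a missing measurement and discard it.
    cleaned = [x for x in raw if x is not None]
    if not cleaned:
        return []
    lo, hi = min(cleaned), max(cleaned)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in cleaned]
    # Normalizing step: min-max scaling.
    return [(x - lo) / (hi - lo) for x in cleaned]


if __name__ == "__main__":
    print(clean_and_normalize([2.0, None, 4.0, 6.0]))
```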
Section 4: Modeling Business Processes
Main topics:
- Transforming business requirements into data representations
- Optimizing data representations, infrastructure, performance, cost
Considerations in this topic are:
- Working with business people
- Working with users
- Getting business requirements
- Knowing the scale of resources required
- Knowing what data cleaning to do
- How to implement high-performance algorithms
- Common sources of error (e.g., selection bias)
- How to remove error
Section 5: Reliability
Main topics are:
- Quality control
- Assessment, troubleshooting and improvement of infrastructure
- Assessment, troubleshooting and improvement of models
- Recovering data
This includes knowing and doing the following:
- Verification of data
- Test suites
- Pipeline monitoring
- Planning for fault tolerance
- Planning for execution on failure (retroactive analysis, re-running failed jobs)
- Stress testing
- Plan for failure
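One common way to handle the "re-running failed jobs" item above is a retry wrapper with exponential backoff. The sketch below is a generic pattern rather than anything prescribed by the exam guide; `run_with_retries` and its parameters are hypothetical names, and a real pipeline would usually retry only on specific, known-transient exception types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    job: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run job(), re-running it on failure with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

A job that fails twice and then succeeds would return its result on the third attempt; a job that always fails re-raises its last exception once `max_attempts` is exhausted.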
Section 6: Visualizing Data
Main topics are:
- Building data viz tools
- Publishing data
- Reporting on data
Considerations:
- Automating visualization and report generation
- Supporting decision-making
- Summarizing data
- Reporting on data fidelity, data trackability, data integrity
Section 7: Design for Security and Compliance
This is the section I'm least familiar with.
Main topics:
- Secure data infrastructure
- Legal compliance in data handling
Specifically:
- Identity/access management (IAM)
- Data security
- Performing penetration testing
- Need-to-know/separation of responsibility
- Implementing proper security controls
- Knowing relevant legislation
- Preparation for audits
Relevant legislation includes:
- HIPAA (Health Insurance Portability and Accountability Act)
- COPPA (Children's Online Privacy Protection Act)