From charlesreid1

Revision as of 00:11, 12 September 2017 by Admin (Admin moved page Google Cloud Data Engineer to Google Cloud)

Notes for the Google Cloud Data Engineer (GCDE) certification.

The following list is based on the sample case study for the GCDE certification exam: https://cloud.google.com/certification/guides/data-engineer/casestudy-flowlogistic

The case study focuses on a logistics company tracking orders and shipments via rail, truck, aircraft, and ships.

Goals and Motivation

Goals:

  • Implement a real-time inventory tracking system that tracks shipment locations
  • Perform data analytics on order and shipment logs (structured/unstructured data) to make decisions about deploying resources, targeting customers, and expanding into markets
  • Predict delays in shipments

Requirements:

  • Reliable, reproducible environment that scales
  • Aggregated data in centralized data lake
  • Historical data used to perform predictive analytics on future shipments
  • Accurate tracking of worldwide shipments (proprietary technology)
  • Improvement of business agility and speed of innovation via rapid provisioning of new resources
  • Analysis and optimization for performance in the cloud
  • Migration to the cloud, if all other requirements are met

Deeper reasoning:

  • Inability to upgrade infrastructure hampering growth and efficiency
  • Ineffective at moving data around
  • Need to better understand where/who customers are, what they are shipping
  • IT is too busy managing infrastructure to organize data/build analytics/implement tracking technology
  • Penalties for late shipments and deliveries mean that knowing where shipments are correlates directly with profitability and the bottom line

Technology Stack

Databases:

  • SQL DB storing user data, static data
  • Cassandra DB storing metadata, tracking messages
  • Kafka servers for tracking-message aggregation and batch insert
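
A rough illustration of what "aggregation and batch insert" of tracking messages could look like. This is only a sketch in plain Python, not the case study's actual pipeline and not the Kafka API: the message fields (`shipment_id`, `ts`, `location`), the function names, and the sample data are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def batch_messages(messages: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a stream of tracking messages into fixed-size batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[dict] = []
    for message in messages:
        batch.append(message)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


def latest_locations(batch: List[dict]) -> Dict[str, str]:
    """Aggregate one batch: keep the most recent location per shipment."""
    latest: Dict[str, dict] = {}
    for message in batch:
        current = latest.get(message["shipment_id"])
        if current is None or message["ts"] >= current["ts"]:
            latest[message["shipment_id"]] = message
    return {sid: m["location"] for sid, m in latest.items()}


# Hypothetical tracking messages (field names are assumptions).
messages = [
    {"shipment_id": "S1", "ts": 1, "location": "Oakland"},
    {"shipment_id": "S2", "ts": 2, "location": "Reno"},
    {"shipment_id": "S1", "ts": 3, "location": "Sacramento"},
    {"shipment_id": "S3", "ts": 4, "location": "Denver"},
    {"shipment_id": "S2", "ts": 5, "location": "Salt Lake City"},
]

batches = list(batch_messages(messages, batch_size=3))
for batch in batches:
    # In a real system each aggregated batch would become one bulk insert
    # into the datastore; here it is just printed.
    print(latest_locations(batch))
```

Batching trades a little latency for far fewer writes, which is presumably why the existing stack aggregates messages before inserting them.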

Applications:

  • Customer frontend, middleware for orders and customs
  • Tomcat for Java services
  • Nginx for static content
  • Batch servers (?)

Storage:

  • iSCSI (Internet Small Computer Systems Interface) storage for VM hosts
  • Fibre Channel network for SQL server storage
  • NAS (network-attached storage) for image storage, logs, and backups

Analytics:

  • Hadoop/Spark servers
  • Core data lake
  • Data analysis workloads

Miscellaneous servers:

  • Jenkins
  • Monitoring of servers
  • Bastion hosts
  • Security scanners
  • Billing software