
Big Data and Machine Learning Fundamentals

Module 1

Overview of fundamentals course

Interesting question: why would Google be in the business of cloud computing?

Mission statement: to organize the world's information and make it universally accessible and useful

The reason for being in cloud computing: organizing the world's information and making it accessible requires a massive amount of infrastructure

1 out of every 5 CPUs that is produced in the world is bought by Google


Organizing information:

GFS and Hadoop:

  • GFS (2002) was the original idea for organizing lots of files/information across large clusters, which in turn led to Hadoop HDFS (which is based on GFS)
  • MapReduce came out of Google around 2004
  • But, by 2006, Google was no longer writing any MapReduce programs
  • Why?
  • MapReduce and HDFS require sharding - distributing your data set across a cluster - which means that the size of your data sets and the size of your cluster are intimately linked


Note: link to all papers is here: https://research.google.com/pubs/papers.html


Google Data Technologies:

  • GFS - 2002 (basis for HDFS)
  • MapReduce - 2004 (basis for Hadoop - abandoned)
  • BigTable - 2006
  • Dremel - 2008 (replaced MapReduce, available in GCP as BigQuery)
  • Colossus - 2009 (replacement for GFS)
  • Flume - 2010 (replaced MapReduce, available in GCP as Dataflow)
  • Megastore - 2011
  • Spanner - 2012
  • MillWheel - 2013 (also part of Dataflow)
  • PubSub - 2013 (available in GCP as itself)
  • F1 - 2014
  • TensorFlow - 2015 (available in GCP as CloudML)


Various innovations coming out of Google are being released into Google Cloud

Elastic computing concept - you should be able to "instantaneously" scale out to as many machines as you need

Purpose of switching to the cloud:

  • Uptime, keeping hardware up and running
  • Making teams more efficient and effective
  • Having the entire Google data stack available to leverage the best software available

Big Data products

Spotify uses two products: PubSub and Dataflow

PubSub is a messaging system, Dataflow is a data pipeline tool

Using GCP big data products helps companies:

  • pay less per operation
  • be more efficient (better tooling)
  • be more innovative and powerful (big stack of data tools)

BigQuery: reducing 2.2 BILLION items to 20K items in <1 min (transformational promise of the cloud)

A functional view:

  • Foundation
    • Compute engine, compute storage
  • Databases
    • Datastore, Cloud SQL, Cloud BigTable
  • Analytics and Machine Learning
    • BigQuery, Cloud Datalab, Translate API etc.
  • Data-Handling Frameworks
    • Cloud PubSub, Cloud Dataflow, Cloud Dataproc

Why the forked approach?

Google is trying to solve SEVERAL DIFFERENT problems

Changing where people are computing

  • Keep doing the same things you're doing already, but changing where you're doing them
  • Each tool addresses different things that people are already doing on-premises (and would not require a change in CODE, just a change in LOCATION)
  • Cloud databases - (migrating DBs) Cloud SQL (relational), Cloud Datastore (NoSQL document store), Cloud BigTable (NoSQL wide-column/key-value)
  • Storage platform - (migrating storage) Cloud Storage Standard, Durable Reduced Availability
  • Managed Hadoop/Spark/Pig/Hive - (migrating data processing) Cloud Dataproc

Providing speed, scalability, and reliability:

  • Want to provide scalable and reliable services (like Spotify)
  • Need to be able to justify using hundreds of machines for a few minutes, rather than a smaller number of machines that take much, much longer
  • Messaging - Cloud PubSub
  • Data Processing - Cloud Dataflow, Cloud Dataproc

Changing how computation is done:

  • Utilizing tools provided by Google to do new things, analyze more data, analyze in a different way, build better models
  • Examples: analyze customer behavior, analyze factory floors
  • There are basically three use-cases that typically play out
  • Data exploration and business intelligence - Cloud Datalab, Cloud Data Studio
  • Data Warehouse for large-scale analytics - Google BigQuery
  • Machine learning - Cloud Machine Learning, Vision API, Speech API, Translate API


Summary: three principal use-cases for GCP

  • (Based on what Google sees in their professional services organization)
  • Migrations - changing where they compute
  • Scale up and reliability - making a service more scalable/reliable
  • Transforming business - adding new ways to deal with more data

Usage Scenarios

Google Cloud platform usage scenarios: review

  • Change where you compute (migration to the cloud)
  • Scalability and reliability (flexible platform that can scale)
  • Change how you compute (explore, analyze, extract information differently)

Usage scenarios:

  • Changing where you compute: Movie company using cloud platform for scaling up rendering (can requisition more machines)
  • Scalability and reliability: Finance company performing consolidated audit (data repository of all equities, options, orders, quotes, and events on the stock market) - 6 TB per hour, 100 BILLION market events (a HUMONGOUS amount of data that needs to be processed at scale, and none of it can be lost)
  • Changing how you compute: Rooms to Go (furniture retailer) combined CRM database and website, BigQuery analysis, redesign room packages

Spend less on ops/admin

Incorporate real-time data into apps/architecture

Apply machine learning broadly

Create citizen data scientists (putting tools into hands of domain experts)

This means your company can become a data-driven organization - decision-makers (domain experts) are no longer waiting for the data; they can see and work with the data themselves to make decisions and move forward

Labs

List of code labs: https://codelabs.developers.google.com/cpb100

Signing up for free trial (req. CC): https://console.developers.google.com/freetrial

Note they specifically say, you get $300 in credit over 60 days, and will not be charged.

Need to "activate" compute engine.

Quiz

Goals:

  • Get compute instance fired up, figure out how to use the control panel (the control panel is a bit overwhelming at first, but once you've gone through the process of creating the compute engine, you get the hang of it. Between this and the theoretical coverage of which products do what, the myriad options become a lot more manageable.)
  • Load a computational combustion data set into a graph database
  • Load a Docker image into a Google Cloud compute instance
  • Utilize Apache Giraph to perform a graph analysis


Module 2

Foundations

Three components of computing systems:

  • Computing
  • Storage
  • Networking

The foundations of Google Cloud are the computing and storage:

  • Compute engine
  • Cloud storage

(Network layer is mostly transparent)

GCP can be thought of as an earth-scale computer

CPUs provided by compute engine virtual machines

Hard drive/storage is provided by cloud storage

Network connections are provided by the global private network (invisible layer)

Design is based on the scalable, no-ops idea



Custom machine types: https://cloud.google.com/custom-machine-types/

Compute engine pricing: https://cloud.google.com/compute/pricing

Can use preconfigured machine types (set price), or can use custom machine types (custom cores/memory, variable price)

Think about it abstractly: "I want a virtual machine that has 8 CPUs and 30 GB RAM"

GCP figures out how to requisition the necessary hardware
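For example, a rough sketch of requisitioning that 8-CPU / 30 GB machine from the command line (instance name and zone are made up; the custom machine type flags are as I recall them from the gcloud SDK, so double-check against the docs):

$ gcloud compute instances create my-custom-vm \
    --zone us-west1-b \
    --custom-cpu 8 \
    --custom-memory 30GB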

Using a node for long periods of time leads to steeper discounts (sustained use discounts)



Preemptible virtual machine: https://cloud.google.com/preemptible-vms/

Get an 80% discount if you agree to give it up if someone else pays full price for it

Why do this? Hadoop jobs are fault tolerant (if a machine goes down, the data is redistributed)

Example: Dataproc cluster (Dataproc is the Google Cloud version of Hadoop)

Can use 10 standard VMs as your "backbone", and then use 30 preemptible VMs

If the preemptible VMs go down, no problem - Hadoop is designed to be robust to hardware going down

This makes it 4x faster than 10 VMs alone, and you get 80% discount on 30 VMs
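A sketch of what that mixed cluster might look like from the command line (cluster name and zone are made up; the flag names are my recollection of the gcloud SDK at the time, so verify them):

$ gcloud dataproc clusters create mixed-cluster \
    --zone us-west1-b \
    --num-workers 10 \
    --num-preemptible-workers 30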


Lab: Starting Compute Engine

Clicking "Compute Engine" in the side menu of Google Cloud control panel automatically activates Compute Engine

Lists VM Instances

Create a new instance called my-first-instance

Link to info on free compute nodes: https://cloud.google.com/compute/pricing

Extensive list of options when creating a new node:

  • Machine type - up to 64 cores, custom amount of memory, can even choose CPU architecture (Skylake/Broadwell), GPUs
  • Boot disk and OS - several options, Debian, Ubuntu, CentOS, CoreOS, SUSE, Windows Server; can also ask for several different disk sizes
  • Identity and API access - drop-down to select different service accounts; can turn on/off access for different Cloud APIs (BigQuery, BigTable Admin, BigTable Data, Cloud Datastore, PubSub, Cloud SQL, etc.)
  • Management - can set startup scripts, set labels (arbitrary key-value pairs to help organize instances, e.g., production/staging/development, environments, services), set metadata (also key-value pairs, but these are exposed to the running instance via the metadata server - startup scripts and SSH keys are stored this way), set whether instance is preemptible (24 hrs max)
  • Disks - can set disk encryption, encryption keys, add additional disks
  • Networking - add additional networking devices
  • SSH - can copy and paste an SSH key for passwordless access (you copy a public key from the computer that will be SSHing into the compute instance)

More info on the SSH key thing: https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys?hl=en_US#instance-only

When you add an SSH public key to the metadata of an instance, it allows the person that corresponds to that public key to access the machine. (In other words, any public key you add to the instance's SSH keys goes into the SSH list of authorized keys)

$ cat ~/.ssh/id_rsa.pub

Then copy and paste the output of this into a new SSH key. Note that this will automatically populate a username that corresponds with that SSH key, based on the username/contact details. You need to use this username, not root/other.
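For reference, an entry in the instance's SSH Keys metadata ends up looking roughly like this (placeholders, not a real key); the leading username is the account you must SSH in as:

<username>:ssh-rsa AAAA...base64-public-key... <username>@<hostname>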

Example startup script to install and run an apache server: https://cloud.google.com/compute/docs/startupscript?hl=en_US

I set a simple startup script on this instance to install git and cowsay:

#!/bin/sh
apt-get install -y git cowsay

Initially I tried to SSH into the compute instance using the username root:

$ ssh root@<ip-address-of-instance>
permission denied (publickey)

This failed because the username I was using did not match the username corresponding to the public key I had initialized the compute instance with.


Tried changing the SSH public key while the compute instance was running - this is possible to do and pretty easy:

  • Went into GC control panel for Compute Engine Virtual Machines
  • Found my-first-instance
  • Clicked Edit
  • Looked pretty much exactly like the setup options page
  • Added my RSA public key from Cronus
  • Clicked Save

No dice, still not working. (The issue, as I discovered later, was not using the correct username that corresponded to the SSH key.)

Google Cloud takes care of the SSH keys if you connect using the web panel, or using gcloud command line tool. You have to manage SSH keys manually if you're connecting via SSH manually. I wanted to make sure I could SSH into the compute instance manually.

This page describes the ssh command syntax to specify which private key to use: https://cloud.google.com/compute/docs/instances/connecting-to-instance

$ ssh -i /path/to/private/key <user>@<ip-of-compute-instance>

(This is convenient if you want to create a new, separate public key specific to different compute instances.)

This page describes where to find your public/private key pair: https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys#locatesshkeys

(All fine so far, no new information. Still not working. ssh -vv doesn't reveal any obvious problems.)

Solution: the issue was with the username I was using. When I copied and pasted my public SSH key into the list of SSH keys on the compute instance, in the Google Cloud control panel, it automatically populated the username with "charlesreid1" based on the email address associated with the public key.

All I had to do was SSH with that username:

$ ssh -v -i .ssh/id_rsa charlesreid1@<ip-of-compute-instance>

and voila!

Adding cowsay to the list of software that's initially installed, since I'm not sure if git is automatically installed...

Hit "reset" to reset the instance from scratch...

Everything worked like a charm.

Global Filesystem

We're talking about using data in Cloud SQL, BigQuery, and Dataproc (Google's versions of large scale big data stuff like Hadoop, Hive, Sqoop, Pig, etc etc)

We want to get data from "out there in the world" into the cloud. The problem is, when you allocate a compute engine, you allocate a disk associated with that compute engine - and when the compute engine goes away, the disk goes away too. Plus, persistent disk space is expensive anyway.

Instead, store your data in "Cloud Storage". This stores raw data and stages it for other products. This storage is durable, persistent storage that can be easily replicated across other nodes and utilized in other GCP products (Cloud SQL, BigQuery, Dataproc).

Your first step in the journey of doing big data in the cloud is to get your data into cloud storage. To do that, use gsutil.

Interacting with Cloud Storage

Simplest way is to use a command line utility called gsutil to interact with Cloud Storage. (Install it via the Google Cloud SDK installer; it comes preinstalled in Cloud Shell and on Compute Engine instances.)

Note: you can also use programming language, GCP Console, or REST API.

Now, can use gsutil command line and utilities like cp, rm, mv, ls, mb, rb, rsync, acl...

To copy to Google Storage (GS), run a command like:

$ gsutil cp sales*.csv gs://acme-sales/data/

This copies data into GS buckets. The folder structure is purely a convenience - object names just contain slashes; buckets themselves are flat.

Buckets are like a domain name in your GCP project. The bucket name must be globally unique. It is typically related to your business/company domain name (GCP will require you to prove you own the domain name). Or, you can use a unique "dummy" bucket name.

This will be a recurring pattern: anything you can do from gsutil command line just invokes a REST API. Anything that can be done with a REST API can also be done from any language that speaks HTTP (just about any).
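As a rough illustration, the equivalent of "gsutil ls gs://acme-sales/data/" via curl against the JSON API would look something like this (the endpoint shape is my recollection of the Cloud Storage JSON API, so verify against the docs):

$ curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://www.googleapis.com/storage/v1/b/acme-sales/o?prefix=data/"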

Data Handling

Transfer services: useful for ingestion of data from data center, local system, AWS buckets, other sources. Can be one-time or recurring.

Cloud storage as staging area: useful for importing data into analysis tools and databases. Also useful for staging to disk for fast access.

Bucket access control: project-level (only editors of projects can add/remove files from a bucket), bucket-level, and object-level access control. Can control who is responsible for paying. Can make buckets publicly accessible (take advantage of reliability/caching/speed of Google data centers to create a content distribution network).
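For example, making a single object in a bucket world-readable from the command line might look like this (hypothetical object name; the acl syntax is my recollection of gsutil, so verify before relying on it):

$ gsutil acl ch -u AllUsers:R gs://acme-sales/data/index.html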

Zones and Regions

Can control zones and regions where data is located

If speed important, can choose closest zone and region to increase speed

If reliability important, can choose to distribute data across zones of a particular region in case one center has interruption in service

If global access important, can distribute apps across multiple regions
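A sketch of pinning a bucket to a particular region at creation time (bucket name and region are made up):

$ gsutil mb -l us-west1 gs://my-regional-bucket/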

Lab: Interacting with Cloud Storage

Running the lab "Interact with Cloud Storage"


Multi-step process:

  • Ingesting data into a compute engine instance
  • Transforming data on the compute engine instance
  • Storing data in persistent cloud storage
  • Publishing data to the web via cloud storage


Use git to clone repository with instructions/data/scripts

The behind-the-scenes procedure is as follows:

  • download earthquake data using wget (ingest.sh)
  • install extra Python goodies (install_extras.sh)
  • transform data using a Python script (creates a basemap projection, then plots lat/long locations of earthquakes using dots scaled to magnitude of earthquake, colors indicate "class" of magnitudes 1-3, 3-5, 5+) (transform.py)

This results in an image file. There is already an HTML file that will serve up the image on a web page.

Now back to the GCP console - create a storage bucket by going to storage in LHS menu.

Create bucket, and pick zone/region.

Called it mah-bukkit

Now it takes you to an interface where you can actually upload files directly from the browser (looks almost like Dropbox).

Finally, we can use gsutil to copy the image file and HTML file to the bucket. (gsutil is automatically installed on all Google Cloud compute instances.)

$ gsutil cp earthquakes* gs://mah-bukkit/
Copying file://earthquakes.csv [Content-Type=text/csv]...
AccessDeniedException: 403 Insufficient OAuth2 scope to perform this operation.
Acceptable scopes: https://www.googleapis.com/auth/cloud-platform

Turns out, when you create your virtual machine instance, you have to specify permissions for each API. This was something I left as default originally.

The first option is to set the service account - this is what allows you to control access to different buckets for different people.

The second option relates to API access. I left it as "default access," which does not allow the compute instance to access very many APIs. I changed this to "Allow full access to all Cloud APIs" (can also set access for each API individually).

Note that this can't be changed while running, you have to shut down the instance to change the API access for a compute instance.

Note that this may still fail with the same 403 mentioned above. If so, it's because gsutil is using crusty credentials. Reset them via:

$ rm -rf ~/.gsutil

then try again:

$ gsutil cp earthquakes* gs://mah-bukkit/
Copying file://earthquakes.csv [Content-Type=text/csv]...
Copying file://earthquakes.htm [Content-Type=text/html]...
Copying file://earthquakes.png [Content-Type=image/png]...
- [3 files][660.4 KiB/660.4 KiB]
Operation completed over 3 objects/660.4 KiB.

Now go to the cloud storage console, click the bucket, check "share publicly", and get the link.

https://storage.googleapis.com/mah-bukkit/earthquakes.htm

Bingo!


Cloud Shell

Because what we were doing here was relatively simple, and just involved shuffling some scripts around and copying data to cloud storage, it is overkill to allocate an entire compute instance to do that, and have to wait for it to start up and shut down, etc.

Instead, we could use the Cloud Shell - a small, managed, ephemeral VM that can be used to do minor tasks.

Here's how this works: this is like a head node on a cluster, where you get "free" cycles to do minor tasks. Here's what you get:

  • MicroVM
  • Single 2.2 GHz Intel Xeon CPU
  • 5 GB persistent storage in your home directory (place to save files) - stuff is already present! You get your "cloud home directory" (commonly-used scripts, repos, code, etc.)
  • Access to basic tools like gsutil, cloud/app engine sdks, docker, git, build tools, etc.
  • Access to languages: Python, Java, Go, and Node

The shell works like a Lish shell, opens within a split screen in the browser.

Can use this to launch serverless operations, requisition nodes, perform gsutil tasks, etc.


Module 3

List of upcoming topics:

  • Transforming/migrating to the cloud
  • SQL databases in the cloud
  • Working with Cloud SQL
  • Managed Hadoop in the cloud
  • Lab: Cloud Dataproc for recommendations


Note on migration history: App Engine (you wrote your Java application, uploaded it, and App Engine would scale out the code)

2008 - serverless, fully managed web app framework

Problem: this did not cater to enough users - many were still running their own applications on their own hardware

Now you have multiple options:

  • App Engine Managed Runtimes - this is the "original" product, you give it your Java app code, and it manages running and scaling the code
  • Google App Engine lets you run your web app in non-Java code, e.g., running a Python Flask app or other web frameworks; still autoscaled, but more flexibility
  • Google Container Engine lets you run, e.g., a web app in Tomcat; you can containerize it, put in a Docker image, and have the containerization orchestrated
  • Google Compute Engine lets you run code on bare metal - you can run your workload as-is, but run it in the cloud

Over time, you can migrate from compute engine to container engine to app engine, or whatever your business needs are

Back to the three pillars:

  • Changing where you compute
  • Improving scalability and reliability
  • Changing how you compute (exploration, analytics, business intelligence)

Example: Dataproc can utilize the existing workflows you have, and eventually you can move those workflows to BigQuery, which is a serverless end-to-end data warehouse solution - requires migration, but more powerful

"Machine learning... is the next transformation... the programming paradigm is changing. Instead of programming a computer, you teach a computer to learn something and it does what you want." - Eric Schmidt

Moving away from explicit rules, and letting data dictate the rules.

Recommendation engine - brought machine learning into social consciousness (Pandora, Netflix, e-commerce sites)

Recommendation engines 101:

  • Rating - start here; users rate X explicitly or implicitly
  • Training - machine learning model is created, and used to predict a user's rating of X based on existing data
  • Recommendation - for each user, model is applied to predict top 5 (unrated) X

Approaches:

  • User-based - who is this user like?
  • X-based - what other Xs are popular/highly rated?
  • Combine the cluster of users and the cluster of Xs

We will predict a rating using a Dataproc cluster by running a Python script via Apache Spark

We will store the rating results in a MySQL database in Cloud SQL

(Recall - Hadoop provides distributed storage via HDFS; Spark provides distributed computation and task execution on top of it)


RDD - resilient distributed data sets

Apache Spark ML-Lib documentation of recommendation systems: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

SQL and Data Storage in the Cloud

When do we want to use which products?

Cloud Storage:

  • is for petabytes
  • analogous to filesystem
  • Read/write via copying individual files
  • Useful for storing blobs

Cloud SQL:

  • is for gigabytes
  • analogous to relational database
  • Read/write via SELECT/INSERT
  • Useful for no-ops SQL database on the cloud

Datastore:

  • is for terabytes
  • analogous to persistent hash map
  • Read/write via filtering/putting
  • Useful for structured data from app engine applications

BigTable:

  • is for petabytes
  • analogous to HBase or key-value system
  • Read/write via row scan/row put
  • Useful for no-ops, high throughput, scalable, flattened data

BigQuery:

  • is for petabytes
  • analogous to relational database
  • Read via SELECT, write via batch or stream
  • Useful for interactive SQL queries to a managed data warehouse

For the purpose of building a recommendation system, we'll go with Cloud SQL - we want a relational database to store a relatively small amount of data

Question: why not use Cloud Launcher or a container or a compute instance that has MySQL installed?

Answer:

  • more flexible pricing (e.g., the database only runs between 9 and 5; or database that only runs when a unit test is running)
  • Google manages backups and security
  • Can connect from anywhere - not worrying about firewalls or access
  • Faster connections from google compute engine and google app engine (lives in same regions or data centers, Petabit per second intra-data center bandwidth)

Lab: Cloud SQL

Populating SQL database with house rental information

The lab does the following:

  • Creates Cloud SQL instance
  • Creates database tables from an .sql file
  • Populates tables from .csv files
  • Sets up access rules for Cloud SQL
  • Runs SQL statements from Cloud Shell to check out the data

Data is coming from the GCP github repo (same as before): https://github.com/GoogleCloudPlatform/training-data-analyst

(Hmm... 5 GB home directory seems to have been reset.)

Okay, in the lab3a folder there are a few files:

  • .sql file with SQL to clear out existing tables and create empty new ones with the correct types and schema
  • accomodation.csv file with listing of different houses
  • rating.csv file with (apparently) ratings of houses from two different people? One column is a house ID, two remaining columns are integers from 1 to 10

There are also two scripts:

  • find my ip script - this is a utility script used by the other one; runs a wget command to ipecho.net
  • authorize_cloudshell - runs "gcloud sql instances patch rentals --authorized-networks <find ip script>"
  • At first I wasn't sure what the second script was actually doing - it turns out it whitelists the current Cloud Shell VM's IP address so Cloud Shell can connect to the Cloud SQL instance (see the sketch below)
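A guess at what those two scripts boil down to, reconstructed from the descriptions above (the actual scripts in the repo may differ):

# find_my_ip.sh - ask ipecho.net for the public IP of the current Cloud Shell VM
wget -qO - http://ipecho.net/plain

# authorize_cloudshell.sh - whitelist that IP for the Cloud SQL instance named "rentals"
gcloud sql instances patch rentals \
    --authorized-networks "$(wget -qO - http://ipecho.net/plain)/32"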


Upload SQL script and CSV data to bucket:

  • Stage the sql and csv files to cloud storage (uses the gsutil cp command - apparently Cloud Shell has full API access, another reason why it's a really handy way to perform various development tasks)
  • in GCP Console, verify files are in the bucket

Create Cloud SQL instance:

  • Side menu > SQL
  • Click create instance
  • Pick MySQL, second generation
  • Instance name: "mah-rentals" (note: the authorize_cloudshell script above patches an instance named "rentals", so either use that name or edit the script to match)
  • Need to authorize networks that are allowed to access this - click "Authorize Networks" and click "Add Network". Use the find my ip script in Cloud Shell to get the IP of the MicroVM that Cloud Shell is running on
  • (Note: you only need to authorize the cloud shell VM if you are going to connect to the MySQL database from it, e.g., to connect and examine DB contents)
  • Note that if you get kicked off your Cloud Shell, the authorize cloudshell script is just a convenient way to add the IP address of the current Cloud Shell MicroVM to the list of allowed networks for this Cloud SQL instance
  • Boom, done, create it.

Run the SQL script to initialize all your tables:

  • Go to side menu > SQL
  • Click the Cloud SQL instance
  • This is a dashboard for monitoring the Cloud SQL server...
  • Click Import at the top to import an SQL script. This will allow you to import an SQL script from a bucket
  • The script already specifies a database, but if it does not, you can specify one of your own.

Side note: picked SSL tab on the top, and picked "only allow encrypted connections to this database." To create an encrypted connection, we need a client certificate. There is a button toward the bottom to create a client certificate (I named mine mah-rentals-cert).

Run the SQL script to populate tables with data:

  • Click Import at the top to import CSV data
  • You have to specify database and table names (database recommendation_spark, table Accommodation/Rating)

Now explore the data...

The certificate process went smoothly. Just had to copy-and-paste the .pem certificate files provided, and run the command provided.

$ mysql -uroot -p -h 35.197.45.63 \
    --ssl-ca=server-ca.pem \
    --ssl-cert=client-cert.pem \
    --ssl-key=client-key.pem

Now run SQL commands to explore the database:

Start by picking which database to use, and looking at the available tables:

> use recommendation_spark; 
> show tables;

Next, run a "head" style query to show 10 records and see names of fields:

> select * from Accommodation limit 10;

Now construct your query:

> select * from Accommodation where type='castle' limit 10;
> select * from Accommodation where type='castle' and price > 4000;

Everything's good and working.

Hadoop in the Cloud

Hadoop ecosystem:

  • Hadoop HDFS, MapReduce framework, programs are written in Java
  • Pig, arose from the problem of Hadoop MapReduce programs becoming verbose and unruly, scripts look like ETL operations
  • Hive, arose from observation that Hadoop data is often structured anyway, write queries and statements, but still store the data with HDFS
  • Spark, polyglot platform that is fast and real-time, designed for data analysis, runs on top of Hadoop and HDFS

Dataproc:

  • Google Cloud's Dataproc is Google-managed Hadoop, Pig, Hive, and Spark
  • Big advantage here is cluster and storage lifecycles are separate - becomes possible to store data and results in cloud storage
  • Can store data in cloud storage in a single bucket in a single region, and allocate the Dataproc cluster in the same region
  • Utilizes Google infrastructure, security, networking, &c.
  • Reduced cost and complexity
  • Can create core set of virtual machines, and allocate large number of pre-emptible VMs
  • Can start to think of clusters as being job-specific

Lab: Recommendations Machine Learning Algorithm with Spark

Start up the Cloud SQL instance with rental data

Start up google cloud shell

Click the console menu > Dataproc > Enable API

Click console menu > Dataproc > Create cluster

Change region/zone to match the Cloud SQL region/zone

Change the machine type of both the Master and the Worker nodes to n1-standard-2. Leave other settings as defaults.

Now cluster compute nodes will be requisitioned.

Use the authorize_dataproc.sh script to authorize the cluster to access Cloud SQL instance.

The $CLOUDSQL environment variable in the script should match the name of the Cloud SQL instance (mah-rentals)

$ ./authorize_dataproc.sh recommendation-cluster us-west1-b 2

This authorizes access to the Cloud SQL instance for 2 workers, plus the master (the Dataproc page in the control panel will list how many worker nodes there are)

Now edit the SparkML script (sparkml/train_and_apply.py) to use the appropriate IP address for Cloud SQL (look it up via the Cloud SQL page in the console). Also set the root password for the Cloud SQL database.

The Python script should now be copied to your bucket, so that you can point Dataproc to it.

$ gsutil cp sparkml/train*.py gs://mah-bukkit/

Now start up a new job by clicking Dataproc > Jobs > Submit Job in the console's left hand menu

Pick the cluster you just created from the drop-down list, and change the job type to PySpark

Set the main Python file for the job to be the train_and_apply.py file you just copied to cloud storage:

gs://mah-bukkit/train_and_apply.py

Can also add properties/labels (key-value pairs to help identify/group jobs together). Left these blank for this lab.

Click Submit, and watch the job run.
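(Equivalently, the job could be submitted from Cloud Shell with gcloud - a sketch, assuming the dataproc jobs interface I remember:)

$ gcloud dataproc jobs submit pyspark \
    gs://mah-bukkit/train_and_apply.py \
    --cluster recommendation-cluster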

If the job shows up as "Failed", you can click the job and view the output log to see what went wrong, then submit a new job after you fix whatever went wrong.

17/09/21 06:23:21 INFO org.spark_project.jetty.util.log: Logging initialized @5353ms
17/09/21 06:23:21 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
17/09/21 06:23:21 INFO org.spark_project.jetty.server.Server: Started @5426ms
17/09/21 06:23:21 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@74af54ac{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
17/09/21 06:23:22 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
17/09/21 06:23:23 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at recommendation-cluster-m/10.138.0.5:8032
17/09/21 06:23:27 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1505974425375_0001
17/09/21 06:23:40 WARN org.apache.spark.SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory 'checkpoint/' appears to be on the local filesystem.
Thu Sep 21 06:23:41 UTC 2017 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Traceback (most recent call last):
  File "/tmp/c3ef4c65-1a0c-4d26-832c-53f8983e8dec/train_and_apply.py", line 45, in <module>
    dfRates = sqlContext.read.format('jdbc').options(driver=jdbcDriver, url=jdbcUrl, dbtable='Rating').load()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 165, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: java.sql.SQLException: Access denied for user 'root'@'35.203.177.114' (using password: YES)

Forgot I had SSL access only turned on. Turned that off, and tried again.

To run a PySpark job when unencrypted connections are blocked, you need to initialize each worker (and master) node with a Cloud SQL Proxy startup script so that they can make an encrypted connection.


PySpark to Cloud SQL over SSL

See: https://stackoverflow.com/questions/46337901/pyspark-via-dataproc-ssl-connection-to-cloud-sql
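A very rough sketch of that approach - the init-action path, metadata key, and scope here are assumptions based on the public dataproc-initialization-actions repo, so verify all of them before use:

$ gcloud dataproc clusters create recommendation-cluster \
    --zone us-west1-b \
    --scopes sql-admin \
    --initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
    --metadata "additional-cloud-sql-instances=<project-id>:us-west1:mah-rentals"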

Module 4

Module 5

Resources

Module 1 Resources

Code labs for this course: https://codelabs.developers.google.com/cpb100

About google data centers: https://www.google.com/about/datacenters/

Whitepaper on Google's security practices (i.e., why you can trust Google to handle your cloud stuff): https://cloud.google.com/security/whitepaper

Module 2 Resources

Compute engine: https://cloud.google.com/compute

Storage: https://cloud.google.com/storage

Pricing calculator: https://cloud.google.com/pricing

Cloud launcher: https://cloud.google.com/launcher

YouTube video on Compute Instance vs Container Engine vs App Engine vs Cloud Functions: https://www.youtube.com/watch?v=g0dN8Hkh5H8

Cloud launcher:

  • Shortcut for getting a compute engine VM with preconfigured software ready to go
  • Google click to deploy is a Google-maintained VM image with the software already installed and ready to go


