GCDEC/Unstructured Data/Notes
From charlesreid1
Contents
- 1 Leveraging Unstructured Data with Cloud Dataproc
- 1.1 Module 1: Introduction to Cloud Dataproc
- 1.2 Module 2: Running Dataproc jobs
- 1.3 Module 3: Leveraging GCP
- 1.4 Module 4: Analyzing Unstructured Data
- 1.4.1 Machine Learning with Unstructured Data
- 1.4.2 Machine Learning and Big Data Lab (Dataproc Cluster)
- 1.4.3 Machine Learning and Big Data Lab (Notebook)
- 1.4.4 Vision API (Identifying Words in Photos)
- 1.4.5 Translate API (Translating Alice in Wonderland)
- 1.4.6 Language API (Sentiment Analysis of BigQuery Results)
- 1.4.7 Example Output from Vision API
- 2 References
- 3 Flags
Leveraging Unstructured Data with Cloud Dataproc
Module 1: Introduction to Cloud Dataproc
Overview of Unstructured Data
In industry:
- Unstructured data is often the hardest to deal with
- Four sources of data: data you have and analyze; data you don't have but wish you had; data you have but don't analyze; data you could easily acquire
- Focus on data you have, but don't analyze
- Several reasons why you wouldn't analyze it - volume, quality, velocity, and lack of structure
- Most common difficulty is lack of structure - we don't have the kinds of tools to analyze unstructured data that we have to analyze structured data
- Example: emails from customers. Newsgroups. Photographs taken in the field (inspections).
- If we don't have tools for dealing with images, or with free-form text, we will ignore it ("too hard")
Example:
- Google Street View imagery collected
- Huge collection of unstructured imagery
- Collected, used, and shown to users of Google Maps
- Initially, no technology in place to utilize it any further
- Once deep learning came along, went back and looked at that imagery
- Could analyze the imagery, extract information, use it to enhance the maps
Extrapolating:
- You may have a lot of unstructured data laying around in the company, serving one specific purpose
- Typically, it is given to a human user to analyze, and that's the end - because there is no automated system, no automated tools, for dealing with that unstructured data
- Google Cloud provides machine learning tools to analyze unstructured data (freeform text, images, etc.)
- You don't have to start from scratch - using the ML APIs, can take advantage of pre-built machine learning models to extract information that is useful
Counting Problems
Consider MapReduce tasks: these can usually be boiled down to counting problems
Example: delays in payment processing
- Not just one count, but essentially lots of counting problems
- Counting how long something took, counting the number of occurrences, computing a mean
- MapReduce is full of easy counting problems - large set of data, apply specific operations, count occurrences (see the sketch below)
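As a concrete illustration, the canonical easy counting problem - a word count - looks something like this in PySpark. This is only a sketch (the input path is hypothetical, and sc is a SparkContext like the one used in the PySpark labs later in these notes):

import pyspark
sc = pyspark.SparkContext()

lines = sc.textFile("gs://some-bucket/some-text-file.txt")   # hypothetical input location
counts = (lines
          .flatMap(lambda line: line.split())    # split each line into words
          .map(lambda word: (word, 1))           # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))      # count occurrences of each word
print counts.take(10)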
Harder counting problems:
- How often are programmers checking in low-quality code?
- This is still a counting problem, but now it's hard to actually determine what to count
- How do you determine what "low quality code" means?
- How can you extract information like "low quality" or "high quality" from code?
- Many analytical tools are about counting problems - but some counting problems are easier than others
Why Dataproc
What is a petabyte?
- 27 years to download a petabyte over 4G
- 100 libraries of congress
- Every tweet ever tweeted......... TIMES 50
- 2 micrograms of DNA
- 1 day of videos uploaded to YouTube
Scaling up: bigger machine
Scaling out: more machines
MapReduce approach - split the data, compute nodes process data local to it
Big data open source stack:
- Hadoop HDFS
- On top of that, Pig/Hive/HBase/Spark
Challenge with clusters:
- If you're using it 24/7, you need a bigger cluster
- If you're not using it 24/7, you are using your cluster inefficiently
- Need a shared resource that can be provisioned when you need it, not used when you don't
YARN = Yet Another Resource Negotiator (MapReduce 2.0)
Dataproc cluster allocated through menu:
- We don't have to do anything (like connecting to worker nodes) except specify how many nodes we need
- Bucket allocated for the cluster - this is how the worker nodes interact with the master nodes
Creating Dataproc Clusters
Philosophy: one cluster for one job
Select zone closest to your data (e.g., Cloud SQL)
- Data coming into the data center will not cost you
- Egress data (data leaving the data center) does cost you
- Having Dataproc cluster in a different zone from your data will lead to more costs
Note: HDFS by default creates 3 copies of the data
We want our Hadoop jobs to directly use Cloud Storage, in place of HDFS - don't want to use HDFS to store input/output data
Preemptible worker nodes - we can provision worker nodes that can, potentially, get kicked off our cluster if a higher-priority job comes along
Want a core of nodes that are high-priority, but can also allocate a large number of pre-emptible worker nodes that may potentially "go away" (Hadoop is resilient to machines going down, so no problemo)
If we allocate preemptible compute nodes, these will match the types (CPU and memory) of regular worker nodes
They also won't have any primary disk associated with them, since they can't be part of HDFS (we don't want them to hold any data permanently, since they may disappear)
Versions of Dataproc:
- Spark, Hadoop, Pig, Hive versions
- Google Cloud Storage-Hadoop connector
- BigQuery-Hadoop connector
To create a Dataproc cluster: can use web console, or can use gcloud SDK (command line)
To create these resources from a command line:
gcloud dataproc clusters create my-creative-cluster-name \
    --zone us-central1-a \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 50 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 50
To get some help on this:
$ gcloud dataproc --help
$ gcloud dataproc clusters --help
$ gcloud dataproc clusters create --help
Custom machine types:
- Control the types of machines in the cluster
- Cloud Network for networking
- Cloud IAM for security
- Master node running on compute engine
- Persistent workers running on compute engine
- HDFS and Google Cloud Storage as a shared filesystem between nodes
Machine types:
- Standard-4 means 4 vCPUs; standard machine types come with 3.75 GB of memory per vCPU
- Highmem instances double the amount of memory per core
- Custom type: first dash number specifies number of CPUs, second dash number specifies amount of memory in MB
Example:
gcloud dataproc clusters create test-cluster \
    --worker-machine-type custom-6-30720 \
    --master-machine-type custom-6-23040
Custom instance: 6 CPUs, 5 GB per core:
- That's 30 GB total
- 30 * 1024 = 30720 MB
- Hence, the flag custom-6-30720 (a small helper for this arithmetic is sketched below)
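A tiny helper for that arithmetic (a sketch; the function name is made up, not part of the course materials):

def custom_machine_type(num_cpus, gb_per_cpu):
    """Build a custom-<vCPUs>-<memory in MB> machine type string."""
    total_mb = int(num_cpus * gb_per_cpu * 1024)
    return "custom-{}-{}".format(num_cpus, total_mb)

# custom_machine_type(6, 5)    returns 'custom-6-30720' (the worker type above)
# custom_machine_type(6, 3.75) returns 'custom-6-23040' (the master type above)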
Note that you can also set up a custom cluster, and get the REST command that has the parameters you want, to set it up on the command line.
Creating Dataproc Clusters with Preemptible Nodes
To add preemptible workers to a cluster, add --num-preemptible-workers N and set their disk size using --preemptible-worker-boot-disk-size:
gcloud dataproc clusters create my-creative-cluster-name \
    --zone us-central1-a \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 50 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 50 \
    --num-preemptible-workers 8 \
    --preemptible-worker-boot-disk-size 50
The type of preemptible machine is set by the type of worker machine.
Creating Dataproc Clusters Lab
Lab goals:
- Create a cluster from the console
- SSH to cluster
- Get to the browser management for Hadoop (this requires a firewall rule change to allow your local machine to reach the page)
- Create, manage, delete Dataproc clusters from CLI
Create a Dataproc cluster:
- Standard size: 1 cpu, 2 nodes
- Disk space: 10 GB each
- Once cluster is up, go to Compute Engine and find the master node. Click SSH.
If you want to SSH in, you can just use a browser window.
To enable SSH to the cluster from an arbitrary client:
- Edit the master node in the Compute Engine section of the console
- Scroll down to the SSH Keys section
- Add the public key from the machine that will be connecting
- Click save
- Take note of the username assigned, just to the left of where you copy and paste your key
- Now you can SSH in
Once you're in:
$ python --version
Python 2.7.9
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
$ scala -version
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_131
Branch dataproc-branch-1.2
Compiled by user on 2017-08-09T20:35:29Z
Revision 002ee20ea40d88d051e0a7c5bd4a7c07721dbddc
Url https://bigdataoss-internal.googlesource.com/third_party/apache/bigtop
Type --help for more information.
$ pig --version
Apache Pig version 0.16.0 (r: unknown)
compiled Aug 09 2017, 20:25:20
$ hive --version
Hive 2.1.1
Subversion git://build-dataproc-1-2/mnt/ram/bigtop/bigtop/output/hive/hive-2.1.1 -r 002ee20ea40d88d051e0a7c5bd4a7c07721dbddc
Compiled by bigtop on Wed Aug 9 19:46:33 UTC 2017
From source with checksum a7e70a0bdc3bd77d45c8738c41ac59eb
$ sudo su
# apt-get install inetutils-tools
# ifconfig
eth0      Link encap:Ethernet  HWaddr 42:01:0a:8a:00:03
          inet addr:10.138.0.3  Bcast:10.138.0.3  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:1460  Metric:1
          RX packets:9088 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6308 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2222892 (2.1 MiB)  TX bytes:1654499 (1.5 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:4322 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4322 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:521771 (509.5 KiB)  TX bytes:521771 (509.5 KiB)
To restrict access via SSH to your machine and your machine only:
- Side menu > Networking > VPC Networking > Firewall Rules
- Create a new firewall rule
- Targets: all instances in network
- Source IP ranges: enter your IP address, followed by /32
- Protocols and ports: allow specific protocols: tcp:8088;tcp:50070;tcp:8080
- Create the rule
The purpose of this was to allow us to access ports 8088/8080/50070 via browser
Copy IP address of master node from the Compute Engine console, and go to http://35.197.115.38:8088/
Port 50070: no such luck. Not open.
$ nmap -Pn -p 8088 35.197.115.38/32
Starting Nmap 7.60 ( https://nmap.org ) at 2017-09-24 14:54 PDT
Nmap scan report for 38.115.197.35.bc.googleusercontent.com (35.197.115.38)
Host is up (0.075s latency).

PORT     STATE SERVICE
8088/tcp open  radan-http

Nmap done: 1 IP address (1 host up) scanned in 0.26 seconds

$ nmap -Pn -p 50070 35.197.115.38/32
Starting Nmap 7.60 ( https://nmap.org ) at 2017-09-24 14:54 PDT
Nmap scan report for 38.115.197.35.bc.googleusercontent.com (35.197.115.38)
Host is up (0.070s latency).

PORT      STATE  SERVICE
50070/tcp closed unknown

Nmap done: 1 IP address (1 host up) scanned in 0.16 seconds
Side note: a bit confused about SSH firewall rules. Once I have my public key incorporated into the Compute Engine instance, there was no need for a firewall rule. But the "Connecting to a Linux Instance" documentation states:
"Note: Your Google Cloud Platform VPC network must have one or more firewall rules that allow SSH connections on port 22. The firewall rules must allow SSH connections for the IP ranges or specific IP addresses from which you want to connect."
Link: https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh
Also looks like you can have a VPN network that connects to the Google Cloud Platform VPC. That makes the Google Compute nodes into internal IP addresses on the network.
Bastion instances: these are special nodes that have an internal and an external IP address, so that you can use them as a gateway for connecting to a Google Compute Engine network
Link: https://cloud.google.com/solutions/connecting-securely#bastion
Creating Dataproc Clusters from Command Line Lab
Start by installing gcloud command line tool:
$ brew install caskroom/cask/google-cloud-sdk
Add this to bashrc:
source '/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/path.bash.inc'
source '/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/completion.bash.inc'
Now authorize google cloud. If you try and allocate compute nodes before authenticating, you'll see:
$ gcloud dataproc clusters create my-fancy-cluster \
    --zone us-west1-a \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 20 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 20
ERROR: (gcloud.dataproc.clusters.create) PERMISSION_DENIED: Permission denied on resource project quiet-era-180418 (#0)
- '@type': type.googleapis.com/google.rpc.Help
  links:
  - description: Google developers console
    url: https://console.developers.google.com
To authenticate:
$ gcloud init
Link: https://cloud.google.com/sdk/docs/authorizing
Cloud Dataproc Remote Connections
Secure connections to VM: https://cloud.google.com/solutions/connecting-securely#bastion
- How to protect services on machines with external IPs
- How to connect to machines that do not have external IPs
- Firewall rules
Connecting to Linux instances: https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh
- How to connect to compute engine instances
Module 2: Running Dataproc jobs
Running Pig and Spark Programs
Two objectives in the lab:
- Submit Pig/Hive/Spark/Hadoop jobs using the cloud console's job submission form
- SSH into the cluster master and execute jobs using PySpark's Read-Evaluate-Print Loop (REPL) interpreter
Procedure:
- Create firewall rule to ssh into cluster
- Create dataproc cluster
- Copy scripts being run into a bucket
- Run PySpark REPL
- Run Pig job reading from HDFS
Firewall rule:
- Left menu > Networking > VPC Networking > Firewall rules
- Create new firewall rule (or modify existing)
- Set target to all instances on network
- IP ranges should be based on your IP address
- Protocol should be "tcp:8088;tcp:9870;tcp:8080"
Dataproc cluster creation:
- 2-worker cluster
- Use the command line:
$ gcloud dataproc clusters create my-spark-cluster \
    --zone us-west1-a \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 50 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 50 \
    --network=default
Bucket creation:
- Standard bucket, same region as cluster
- Use the command line
$ gsutil mb -c regional -l us-west1 gs://my-spark-cluster
Put scripts into a bucket:
- Clone the data analyst training github repo in a cloud shell
- Run the script, pass it your bucket name (basically does some fancy stuff to put bucket name into notebook, and performs a gsutil copy of python and png files...)
Run PySpark REPL:
- We'll use a simple PySpark code (mapping an array of input values to a simple mathematical function) to test out running REPL script in PySpark
- Copy the SSH key from the local computer into the SSH key list for the master node
- Run PySpark by running the command "pyspark"
- This may take a minute...
data = range(100)
distData = sc.parallelize(data)
squares = distData.map(lambda x : x*x*x*x*x )
res = squares.reduce(lambda a, b : a + b)
print res
Sums the fifth powers of the integers 0 through 99:
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.
>>> data = range(100)
>>> distData = sc.parallelize(data)
>>> squares = distData.map(lambda x : x*x*x*x*x )
>>> res = squares.reduce(lambda a, b : a + b)
>>> print res
161708332500
>>>
Another example, this one showing that reduce() is not, strictly speaking, necessary for simple operations like sum(). The terms are 8/(2k+1)^2, a series that sums to pi^2, so the square root of the sum approximates pi:
>>> import numpy as np
>>> data = range(1000)
>>> distData = sc.parallelize(data)
>>> terms = distData.map(lambda x : 8.0/((2*x+1)*(2*x+1)))
>>> sum = terms.reduce(lambda a,b: a+b)
>>> print sum
9.86760440126
>>> print terms.sum()
9.86760440126
>>>
>>> print np.sqrt(terms.sum())
3.1412743276027379
>>> print np.sqrt(terms.reduce(lambda a,b : a+b))
3.1412743276027379
Comparing the two approaches shows that the two methods are basically comparable:
>>> import timeit
>>> start_time = timeit.default_timer(); np.sqrt(terms.sum()); stop_time = timeit.default_timer() - start_time; print stop_time
0.174875020981
>>> start_time = timeit.default_timer(); np.sqrt(terms.reduce(lambda a,b : a+b)); stop_time = timeit.default_timer() - start_time; print stop_time
0.174139976501
Prepare to run the Pig job from HDFS:
- SSH onto the master node of the cluster
- First, copy from GCS to local disk (will then copy from local disk to HDFS)
- Bucket is my-spark-cluster
- gsutil -m cp gs://my-spark-cluster/unstructured/pet-details* .
- Note: the -m flag passed to gsutil means multithreaded (copy in parallel)
- This consists of a data file (.txt) and a pig file (.pig)
- Then copy from local disk to HDFS
- hadoop fs -mkdir /pet-details
- hadoop fs -put pet-details.txt /pet-details
- http://35.203.180.167:9870/ opens the web control panel for HDFS
- Pick "Utilities" menu on the right, and "Browse file system" menu option
- Verify the files you copied are in the HDFS system
Run the Pig job:
- Feed the pig file to pig (from the local disk, of course)
- Watch it go!
Once job is finished:
- Go back to HDFS browse file system
- Look for GroupedByType directory
- Look for file named part-r-00000
- This file contains the results of the MapReduce job
Examine results:
- hadoop fs -get /GroupedByType/part* .
- The script grouped pets by type, and dumped out a set of tuples consisting of name, type, color, wt (lb), wt (kg)
Clean up:
- Copy results over to the bucket
- Shut down the cluster
Motivating Serverless Operations
MapReduce - splits big data so each compute node can access a portion of data, locally
GFS (2002) and MapReduce (2004) introduced the early ideas behind distributed file systems and distributed processing
In Hadoop, HDFS is the distributed file system and YARN (Yet Another Resource Negotiator) is the resource manager
Link: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
Locally-built cluster vs. cloud cluster:
- With a locally-built cluster, your Hadoop nodes all read the input files, do their processing, and compile their output together at the end
- At the end of the job, the output files are on the HDFS, so you have to go and collect these
- Input and output data - persistent data (needed before and after job is run)
- Locally-built cluster: no problem, you have the cluster forever, it is always persistent
- Cloud model: you have to treat the input and output data differently, since we're going to spin up and shut down clusters at will
Colossus:
- Google Cloud Storage buckets provide an elastic disk space usage solution
- Colossus is a Google file system product (internal technology)
- Sits on a petabit network (i.e., same network that pushes around the petabytes of data being uploaded to YouTube every day)
- Colossus doesn't show up - because it's an "implicit product"
Storage options:
- Can incorporate buckets with input-output part of Hadoop cluster/HDFS
- Can use any other serverless option:
- App Engine is a serverless platform
- BigQuery (used for data warehousing) also serverless, can be queried like an SQL database
- PubSub is for messaging oriented middleware
- Dataflow is for pure elastic pipelines (Apache Beam)
- Machine Learning APIs are serverless
Hadoop setup:
- Create a cluster specific to a job
- Monitor it, submit jobs, etc.
- Destroy cluster when finished
- Long-term goal: move to a pure serverless platform that allows you to JUST FOCUS ON THE JOB
- Example: writing Java or Python program for Apache Beam, letting Google handle the scaling
- Example: feed a batch of data into Google Cloud Storage, stream it in via PubSub, process it with Dataflow, and store the results back in Cloud Storage, or in BigTable, or in BigQuery
- These approaches eliminate the actual maintenance of clusters, move you away from having to deal with Hadoop, HDFS, Pig, Hive, etc.
- That's moving from the initial use case (transforming where you compute) to the other use cases (scaling/reliability, and changing how you compute)
The heart of this problem: compute and storage are closely tied
Hadoop model: storage and compute nodes are too tied together
HBase database: same problem, database spread across entire filesystem
This is the reason for the push to serverless data solutions...
- Faster
- Cheaper
- Flexible
Processing: Dataflow for processing (create pipelines, stack operations together to process streams)
Storage: Cloud Storage for files, BigQuery Storage for tables, Cloud BigTable for NoSQL (replaces Hadoop HBase)
Note that these sources of data provide either input, or output, or both to Dataflow
Note on BigQuery for both processing and storage:
- BigQuery Analytics: analogous to Dataflow or to distributed compute nodes doing work
- BigQuery Storage: stores lots of unstructured data in tables
Cloud Dataproc allows Spark jobs to separate compute and storage
- Slight change from migrating to the cloud; initial step is using HDFS as-is
- Can use Google Cloud Storage in place of HDFS for the input/output files
- Just need to install (or use, since already installed) libraries on worker nodes to interact with cloud storage
- Should be as simple as replacing hdfs:// with gs://
- Cloud Storage has faster read/write than traditional network hardware
Yet another advantage:
- Pre-emptible nodes are not made a part of HDFS (since they might go away at any time)
- That's why we can make, e.g., 10 core compute nodes, plus 40 pre-emptible nodes, without violating the factor-of-three replication
Serverless Architecture
Architecture for Dataproc cluster:
- Set up networking and accounts to connect to the Dataproc cluster
- Provision the cluster with worker nodes/master node
- Dataproc Agent manages the Dataproc cluster
- Master/worker nodes will read input from a bucket in Cloud Storage
- During the compute phase, master node and worker nodes can share data using HDFS
- Pre-emptible workers are not part of HDFS, but can still do work
- Master/worker nodes will write output to a bucket in Cloud Storage
- This is really fast, so no bottlenecks
Using Dataproc and Cloud Storage:
- Copy data to GCS via manual copy or via connector
- Replace hdfs:// with gs:// (see the sketch after this list)
- Write output to GCS, and delete cluster when done
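In Spark code, that substitution is just a change of paths. A minimal sketch (assuming the GCS connector that comes pre-installed on Dataproc; the output path is a made-up name, while the input path matches the pet-details file from the earlier lab):

import pyspark
sc = pyspark.SparkContext()

# Before (cluster-local HDFS):
#   rdd = sc.textFile("hdfs:///pet-details/pet-details.txt")
# After (Google Cloud Storage):
rdd = sc.textFile("gs://my-spark-cluster/unstructured/pet-details.txt")
rdd.saveAsTextFile("gs://my-spark-cluster/unstructured/pet-details-output")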
Notes on storage formats:
- If using HBase previously, can use BigTable as drop-in replacement (same API)
- If using SQL table previously, can use BigQuery Storage (and later, query it using BigQuery Analytics)
HDFS becomes temporary storage, cluster becomes stateless
Next major step: rewrite your jobs so that they don't use Hadoop, they use Dataflow and Apache Beam
Using Cloud Storage Lab
Lab 3:
- Create cloud storage bucket and deposit input and application files
- Submit job using web console
- Submit job using CLI
- Monitor job progress/view results
- Collect output in cloud storage bucket
Setup for the lab:
- Create a dataproc cluster with 2 workers
- Already copied input files from the training-data-analyst repository (folder unstructured) into bucket
- This contains input files and input data
Submitting job via google cloud console:
- Use a PySpark job, defined in a .py file, along with an input file, defined in a .txt file, all in a cloud storage bucket
- Click Dataproc > Jobs
- Click +Submit Job
- Job type: PySpark
- Main python file: gs://my-spark-cluster/unstructured/lab2.py
- No other information needed, click Submit
Submitting job via command line:
$ gcloud dataproc jobs submit pyspark --cluster my-spark-cluster gs://my-spark-cluster/unstructured/lab2.py
Once done, delete the clusters created.
To list the clusters:
$ gcloud dataproc clusters list
NAME              WORKER_COUNT  STATUS   ZONE
my-spark-cluster  2             RUNNING  us-west1-a
To delete the clusters:
$ gcloud dataproc clusters delete my-spark-cluster
The cluster 'my-spark-cluster' and all attached disks will be deleted.
Do you want to continue (Y/n)? y
Waiting on operation [projects/not-all-broken/regions/global/operations/c74eb958-17d3-48cd-9566-5b2d44b28656].
Waiting for cluster deletion operation...done.
Deleted [https://dataproc.googleapis.com/v1/projects/not-all-broken/regions/global/clusters/my-spark-cluster].
BigQuery Analysis of StackOverflow Dataset
https://codelabs.developers.google.com/codelabs/gcp-aws-bigquery/index.html?index=..%2F..%2Findex#2
#standardSQL
SELECT
  badge_name AS First_Gold_Badge,
  COUNT(1) AS Num_Users,
  ROUND(AVG(tenure_in_days)) AS Avg_Num_Days
FROM (
  SELECT
    badges.user_id AS user_id,
    badges.name AS badge_name,
    TIMESTAMP_DIFF(badges.date, users.creation_date, DAY) AS tenure_in_days,
    ROW_NUMBER() OVER (PARTITION BY badges.user_id ORDER BY badges.date) AS row_number
  FROM `bigquery-public-data.stackoverflow.badges` badges
  JOIN `bigquery-public-data.stackoverflow.users` users
  ON badges.user_id = users.id
  WHERE badges.class = 1
)
WHERE row_number = 1
GROUP BY First_Gold_Badge
ORDER BY Num_Users DESC
LIMIT 10
The query generates the top 10 gold badges, ranked by how many users got them as their first gold badges. The query also determines how many days it took for these gold badges to be obtained on average.
Row  First_Gold_Badge  Num_Users  Avg_Num_Days
1    Famous Question   176982     1169.0
2    Fanatic           15085      618.0
3    Unsung Hero       12875      595.0
4    Great Answer      10641      1363.0
5    Electorate        5769       829.0
6    Populist          5491       1227.0
7    Steward           1005       993.0
8    Great Question    582        717.0
9    Copy Editor       253        595.0
You can also query sets of tables using wildcard table names. For more information, see Querying sets of tables using wildcard tables.
Module 3: Leveraging GCP
Customizing Clusters with Initialization
Advantages to moving to cloud storage:
- Can make clusters ephemeral, no longer competing for single resource
- High throughput (sustained) file system (GFS)
Examining how to combine Spark and Hadoop with other capabilities of GCP
Overview of what we have covered:
- Stateless clusters in < 90 seconds
- Using Hadoop, Pig, Spark
- High-level APIs for submitting jobs
What we'll cover next:
- Using scripts to set up a Dataproc cluster with software packages
- Using software to connect Dataproc to BigTable/BigQuery/Cloud Storage
Dataproc Inputs:
- Cloud Dataproc
- BigQuery
- Cloud Storage
- BigTable
- Compute Engine
Dataproc software:
- Hadoop
- HCatalog
- Hive
- Hue
- Kafka (if not ready to replace with PubSub)
- Mahout
- Oozie
- Pig
- Spark
- Sqoop
- Tez
- Zookeeper
Dataproc outputs:
- Cloud Dataproc
- BigQuery
- Cloud Storage
- Cloud BigTable
- Compute Engine
Dataproc uses Bigtop to handle consistent packaging (not interacting with it directly)
Example software:
- Datalab docker image
- Install and run on master node - utilize cluster for compute, and interact using iPython notebook
- This requires installing some additional libraries on the master node
To install software:
- Write your script (bash, python, etc.)
- Upload script to Cloud Storage
- Specify location in cloud storage when initializing the Dataproc cluster
Example script: install py4sci
- apt-get update || true
- apt-get install -y the py4sci packages: numpy, scipy, matplotlib, pandas
To limit installation script to master only:
- Only one script for both master and worker
- Can do this using the ${ROLE} env variable (metadata)
# the node's role comes from instance metadata (this is the usual pattern in the init-actions repo)
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  apt-get install -y vim
fi
Storing the script:
- Once script is finished, put it in a cloud storage bucket
- To initialize standard stuff - git repository with Dataproc initialization actions
- Link: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
- Initialization scripts for whole bunch of software: cloud-sql-proxy, conda, datalab, drill, flink, ganglia, hive, hue, ipython, jupyter, kafka, oozie, presto, stackdriver, tez, zeppelin, zookeeper
- Note: stackdriver is for monitoring cluster
Note: we can also obtain this through google cloud storage
gsutil ls gs://dataproc-initialization-actions/
This sets up and runs a bash script to obtain the Docker container and spin it up
From command line:
gcloud dataproc clusters create mycluster \
    --initialization-actions gs://mybucket/init-actions/my_init.sh \
    --initialization-action-timeout 3m
From web console:
Specify location of script in "Initialization actions" text box
If you have to change site configuration for the cluster:
- e.g., changing core-site.xml
- Use the gcloud command line (the --properties flag) or the SDK
- Properties are formatted as "file_prefix:property=value", where the file prefix maps to the config file (e.g., core for core-site.xml)
Dataproc + Datalab Initialization Lab
Lab tasks:
- Create Dataproc cluster with initialization action to install Datalab
- Run Jupyter Notebook on Dataproc cluster via Datalab
- Create Python and PySpark jobs utilizing Cloud Storage, BigQuery, and Spark
Initialization script:
- In training-data-analyst repo, unstructured folder, we have an example init script
- This prints out a hello world message from the master node
- Use this, or use the ready-to-go dataproc-initialization-actions repo, to initialize Datalab
Here was the final command:
$ gcloud dataproc clusters create my-spark-cluster \
    --zone us-west1-a \
    --master-machine-type n1-standard-1 --master-boot-disk-size 50 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 --worker-boot-disk-size 50 \
    --network=default \
    --initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh,gs://my-spark-cluster/unstructured/init-script.sh \
    --initialization-action-timeout 8m
(Apparently the operation timed out, and experienced an error.)
$ gcloud dataproc clusters create my-spark-cluster \
    --zone us-west1-a --master-machine-type n1-standard-1 --master-boot-disk-size 50 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 --worker-boot-disk-size 50 \
    --network=default \
    --initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh,gs://my-spark-cluster/unstructured/init-script.sh \
    --initialization-action-timeout 3m
Waiting on operation [projects/not-all-broken/regions/global/operations/0a5fca0a-8978-4250-9c0d-9566c5ba8ba6].
Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/not-all-broken/regions/global/operations/0a5fca0a-8978-4250-9c0d-9566c5ba8ba6] failed: Initialization action timed out. Failed action 'gs://dataproc-initialization-actions/datalab/datalab.sh', see output in: gs://dataproc-46a56292-2a64-4744-b3c4-d030544b5bb9-us/google-cloud-dataproc-metainfo/bbdf0e10-aeba-4768-ba2c-5495a1de0cbd/my-spark-cluster-m/dataproc-initialization-script-0_output.
We wait for the init script to spin up.
- Note: there is a nice tutorial on doing something basically identical, but initializing the Dataproc cluster to run a Jupyter notebook.
- Link: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
Now we need to set up firewall rules so that we can access the Datalab instance
Select VPC Network > Firewall and add a firewall rule
Add your IP address, add target "All instances in network," and for protocols and ports add tcp:8088;tcp:50070;tcp:8080
(Why not 8081?)
Note: if you create a firewall rule, and you add your specific IP address, then the remote instance will only open the Datalab instance to YOUR machine. If you don't add an IP address, or if you open it more broadly, you could allow more machines (attackers) to access Datalab.
Firewall rule:
- Opening ports, as usual:
- "tcp:8088; tcp:50070; tcp:8080; tcp:9870; tcp:8081"
- (It's actually running on port 8080, weirdly.)
Enabling APIs:
- The APIs are not working.
- To enable an API, you have to shut down an entire machine.
- But... this... I don't...
- It goddamn lists BigQuery API as ENABLED
If your browser is on a different machine then exit and re-run this application with the command-line parameter
--noauth_local_webserver
Ooookay, trying this again, but creating the cluster from the web console this time.
Checked the box to allow API access for entire project.
(And of course, the prior notebook disappeared to... who the fuck knows where...)
Mistake from prior not-working attempts: incorrectly specified project ID.
>>> data = gbq.read_gbq(sql, project_id=projectId)
Requesting query... ok.
Query running...
Query done.
Processed: 2.1 Gb

Retrieving results...
Got 40 rows.

Total time taken 3.21 s.
Finished at 2017-09-26 01:56:18.
Spark and BigQuery: Why and Workflow
If BigQuery is serverless, and scales to thousands of nodes, why would you run a BigQuery command from Spark?
Answer: Machine learning.
You may want to feed data from a BigQuery statement into Spark so that you can perform machine learning on the results.
Who does what work depends on what parts of the Jupyter notebook we're looking at:
- BigQuery statements are serverless and automatically scaled out - so when we run the BigQuery statement from Spark, it isn't the Spark nodes running the query, it's BigQuery and the BigQuery infrastructure
- When the results are returned, the data can be distributed among the Spark nodes to do ML training
- BigQuery runs its search, and returns (potentially) LOTS of data
- BigQuery can shard the data and export it to Google Cloud Storage (previously, this would be HDFS)
- Spark (or Hadoop) then pulls the sharded data from GCS (previously, HDFS)
- From there, we can perform computations on the data using Spark
What does this look like:
sc = pyspark.SparkContext()
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    # Specify parameters for BigQuery input
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare'
}
This identifies the bucket/project where the BigQuery results will go
It then creates a connector between BigQuery and Google Cloud storage
(We then run the BigQuery)
Recall RDD = resilient distributed dataset
# Load data from BigQuery
# This exports the BigQuery table as JSON into Google Cloud Storage,
# then reads it
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
Now, we run the actual Spark code: this is what operates on the distributed data set:
# Perform a word count
word_counts = (
    table_data
    .map(lambda (_, record): json.loads(record))
    .map(lambda x : (x['word'].lower(), int( x['word_count'] )))
    .reduceByKey(lambda x, y: x + y)
)

# Now display 10 results
pprint(word_counts.take(10))
Where do the results of the Spark computation go? The results are sharded (distributed among lots of different files) on each of the different worker nodes. Each worker node writes its results to the same bucket.
Here's how we output the sharded results files to GCS:
# Stage data formatted as
# newline-delimited JSON in GCS
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)

# Now use the partitions to index the output files
partitions = range(word_counts.getNumPartitions())
output_files = [output_directory + '/part-{:05}'.format(j) for j in partitions]

(word_counts
 .map(lambda (w, c): json.dumps({'word': w, 'word_count': c}))
 .saveAsTextFile(output_directory))
Reasoning: GCS is immutable blob storage, so if all nodes writing to same file, you're going to have each process overwriting the data from the prior node. Instead, each process writes to its own file on GCS.
Now close the other end of the loop: IMPORT the sharded results from PySpark into BigQuery:
# Output parameters
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_table'

# Use subprocess to make the BigQuery call
# to put the JSON files into BigQuery
subprocess.check_call(
    'bq load --source_format NEWLINE_DELIMITED_JSON '
    '--schema word:STRING,word_count:INTEGER '
    '{dataset}.{table} {files}'.format(
        dataset=output_dataset, table=output_table,
        files=','.join(output_files)
    ).split())
Last step is to clean up the PySpark result files in GCS:
input_path = sc._jvm.org.apache.hadoop.fs.Path(input_directory)
input_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(input_path, True)
output_path = sc._jvm.org.apache.hadoop.fs.Path(output_directory)
output_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(output_path, True)
If you can't do everything you want in BigQuery (if UDF are too limited, for example), you can use the workflow:
(BigQuery data) --> (PySpark) --> (BigQuery table)
This raises another problem: sometimes your BigQuery statement will return TOO MUCH data to fit into a Pandas dataframe (too much data to fit in memory)
Module 4: Analyzing Unstructured Data
Machine Learning with Unstructured Data
Previous video:
- What humans are good at
- What computers are good at (counting problems)
- What neither are good at (hard counting problems, combination of what computers are good at and what humans are good at)
Machine learning APIs:
- harness power of existing neural networks/machine learning models for those hard counting problems
- Lab will utilize pre-trained ML models
Translate API:
- In Console: go to API > Library
- Under Google Cloud Machine Learning, pick Translation API
- Should be enabled
- Can click "Try this API" to try it out live in the browser
- APIs are RESTful APIs (client sends HTTPS requests, receive blocks of data back)
5 functions available:
- Language detection (single request, list of requests)
- Supported languages for translation
- Translate input text (single request, list of requests)
When in production, also need API key so Google knows who to bill.
Google API Explorer: https://developers.google.com/apis-explorer/#p/
Language support: https://cloud.google.com/translate/docs/languages
"Mon aéroglisseur est plein d’anguilles"
Vision API:
- cloud.google.com/vision - can try the vision API live in the browser
- JSON - structured information
Example uses of APIs:
- Vision API - Meeting Nanny
- Speech API - giving commands via speech (asynchronous or synchronous)
- Speech API - positive or negative feedback
- NLP API, ML models to classify sentiment, topic, categorize for correct support personnel
Machine Learning and Big Data Lab (Dataproc Cluster)
Link: https://codelabs.developers.google.com/codelabs/cpb102-machine-learning-to-big-data-processing/#7
Double-check the command given to create the dataproc cluster: missed the unstructured init-script...
gcloud dataproc clusters create my-cluster --zone us-central1-a \
    --master-machine-type n1-standard-1 --master-boot-disk-size 50 \
    --num-workers 2 --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 50 --network=default \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --project $DEVSHELL_PROJECT_ID \
    --initialization-actions 'gs://dataproc-initialization-actions/datalab/datalab.sh','gs://<YOUR-BUCKET-NAME>/unstructured/init-script.sh'
(Submit request to create this cluster)
Firewall rule:
- Default allow dataproc access
- Targets: all instances in the network
- Enable "tcp:8088;tcp:50070;tcp:8080"
(Wait for cluster to be initialized)
...
Something interesting happened:
$ ssh charlesreid1@35.197.84.61
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:ES9/m8CuKg6nnMfJUHZ207DYKCzOKpPh+/jRjRF9vzQ.
Please contact your system administrator.
Add correct host key in /Users/charles/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/charles/.ssh/known_hosts:27
ECDSA host key for 35.197.84.61 has changed and you have requested strict checking.
Host key verification failed.
How do I check ECDSA?
Get a remote shell on the virtual machine.
root@lab4-cluster-m:~# cat /etc/ssh/ssh_host_ecdsa_key.pub
ecdsa-sha2-nistp256 AAAAE2VjZ... root@lab4-cluster-m
$ ssh charlesreid1@35.197.84.61
The authenticity of host '35.197.84.61 (35.197.84.61)' can't be established.
ECDSA key fingerprint is SHA256:ES9/m8CuKg6nnMfJUHZ207DYKCzOKpPh+/jRjRF9vzQ.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '35.197.84.61' (ECDSA) to the list of known hosts.
Permission denied (publickey)
Forgot to add my SSH key. Added it. Worked.
$ ssh charlesreid1@35.197.84.61
The authenticity of host '35.197.84.61 (35.197.84.61)' can't be established.
ECDSA key fingerprint is SHA256:ES9/m8CuKg6nnMfJUHZ207DYKCzOKpPh+/jRjRF9vzQ.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '35.197.84.61' (ECDSA) to the list of known hosts.
I don't understand... "SHA256:ES9/m8CuKg6nnMfJUHZ207DYKCzOKpPh+/jRjRF9vzQ." doesn't match anything in /etc/ssh/*
But known_hosts entry is: AAAAE2VjZ...
and public key on remote machine is: AAAAE2VjZ...
which matches *exactly*
Okay, setting that aside.
...
Visit http://35.197.84.61:8080
Datalab comes up.
Ungit opens. (Asks if I want to create a notebooks repository. Need to revisit this, now that we have a working process, and look into the question we asked on SO earlier.)
Clone from remote repository. Looks good.
(WAIT - WHAT??)
Now I click "notebooks/" and "training-data-analyst" is located in notebooks....????
(Some weird stuff going on here...)
Open notebooks/training-data-analyst/courses/unstructured
Set the project/bucket names
Go to cloud console, click APIs > Credentials
Create a new API key and paste it into the notebook
Clear the cells
Walk through the notebook...
...
Still very confused about the notebook/git configuration.
No default git repository was created, at all, at any point.
When I checked out the repository from ungit, it was checked out into notebooks/ (even though I was not in notebooks/ when I checked it out).
By default, ungit was opening in /content/datalab/notebooks, even though datalab was opening by default in /content/datalab/.
When I checked out the git repository, I was (obviously) not able to save any of my changes. But I could still run "Save and Checkpoint". I guess it just... did nothing?
This whole system needs a lot better explanation.
Oh - here's another problem - when you pick "Stop VM" from Datalab, it actually shuts down the master node of your cluster, and only the master node of your cluster.
Machine Learning and Big Data Lab (Notebook)
Link: https://codelabs.developers.google.com/codelabs/cpb102-machine-learning-to-big-data-processing/#7
Notebook sets BUCKET and PROJECT environment variables
Notebook shows how to list images that are in GCS - using gsutil, from the command line, via subprocess
import subprocess
images = subprocess.check_output(["gsutil", "ls", "gs://{}/unstructured/photos".format(BUCKET)])
images = list(filter(None, images.split('\n')))
print images
Vision API (Identifying Words in Photos)
Next, we run the vision API to search for text in an image. This is using a set of images with some text and numbers. The code creates a Vision API caller (vservice), then passes each image to the API caller.
results = []
for IMAGE in images:
    print IMAGE
    vservice = build('vision', 'v1', developerKey=APIKEY)
    request = vservice.images().annotate(body={
        'requests': [{
            'image': {
                'source': {
                    'gcs_image_uri': IMAGE
                }
            },
            'features': [{
                'type': 'TEXT_DETECTION',
                'maxResults': 100,
            }]
        }],
    })
    outputs = request.execute(num_retries=3)
    results.append(outputs)
Note that the text returned is case sensitive, so if we search for "delay" vs "DELAY", we will get different results (if our search is also case sensitive)
Above bit will print out the name of each image as it is submitted to the API
if 'responses' in outputs and len(outputs['responses']) > 0 \
        and 'textAnnotations' in outputs['responses'][0]:
    # do stuff here...
Wow, this is a huge amount of information:
{u'responses': [{u'textAnnotations': [{u'locale': u'en', u'description': u'DEP DELAYarrival_delay numflights\n0 -27.0\n1 -26.0\n2-25.0\n3-24.0\n4-23.0\n26.793548 465\n24.438375 714\n25.185224 961\n24.090560 1303\n-24.0166301804\n', u'boundingPoly': {u'vertices': [{u'y': 13, u'x': 10}, {u'y': 13, u'x': 267}, {u'y': 172, u'x': 267}, {u'y': 172, u'x': 10}]}}, {u'description': u'DEP', u'boundingPoly': {u'vertices': [{u'y': 13, u'x': 29}, {u'y': 13, u'x': 52}, {u'y': 27, u'x': 52}, {u'y': 27, u'x': 29}]}}, {u'description': u'DELAYarrival_delay', u'boundingPoly': {u'vertices': [{u'y': 13, u'x': 59}, {u'y': 13, u'x': 190}, {u'y': 27, u'x': 190}, {u'y': 27, u'x': 59}]}}, {u'description': u'numflights', u'boundingPoly': {u'vertices': [{u'y': 13, u'x': 200}, {u'y': 13, u'x': 267}, {u'y': 27, u'x': 267}, {u'y': 27, u'x': 200}]}}, {u'description': u'0', u'boundingPoly': {u'vertices': [{u'y': 41, u'x': 10}, {u'y': 41, u'x': 15}, {u'y': 54, u'x': 15}, {u'y': 54, u'x': 10}]}}, {u'description': u'-27.0', u'boundingPoly': {u'vertices': [{u'y': 41, u'x': 24}, {u'y': 42, u'x': 57}, {u'y': 55, u'x': 57}, {u'y': 54, u'x': 24}]}}, {u'description': u'1', ........
Various ways to print this out; here's how they decided to filter this out:
# print outputs
if 'responses' in outputs and len(outputs['responses']) > 0 \
        and 'textAnnotations' in outputs['responses'][0]:
    for output in outputs['responses'][0]['textAnnotations']:
        if SEARCH_TERM in output['description']:
            print "------"
            print u"image={} contains the following text: {}".format(IMAGE, output['description'])
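To make the search case-insensitive (per the note above about "delay" vs "DELAY"), one small tweak to the filter - a sketch, assuming the same outputs structure:

for output in outputs['responses'][0]['textAnnotations']:
    # lower-case both sides so 'delay' also matches 'DELAY'
    if SEARCH_TERM.lower() in output['description'].lower():
        print u"image={} contains the following text: {}".format(IMAGE, output['description'])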
Here's how to dump the entire thing:
import json
print(json.dumps(results[0], sort_keys=True, indent=4))
Translate API (Translating Alice in Wonderland)
Translating Alice in Wonderland to Spanish. This time, the number of items being passed to the API is much higher - translating an entire book.
Define a standalone function to connect to Translate API and return the translated text:
- Input is text to be translated
- Create a Translate API caller
- Pass the inputs as a query, specify source/target language as arguments
- Extracts the translated text from the returned JSON
- Return the translated text
def executeTranslate(inputs):
    from googleapiclient.discovery import build
    service = build('translate', 'v2', developerKey=APIKEY)
    translator = service.translations()
    outputs = translator.list(source='en', target='es', q=inputs).execute()
    return outputs['translations'][0]['translatedText']
Here's what the workflow looks like:
- Split the Alice in Wonderland book text file to strip out new lines. One long string of text, ready to be split into pieces at each period.
- Load the single Alice string (single string)
- Split the single Alice string at "." (list of strings)
- Now map the API-calling function (defined above) to each item in the list
alice = sc.textFile("gs://cpb103-public-files/alice-short-transformed.txt")
alice = alice.map(lambda x: x.split("."))
aliceTranslated = alice.map(executeTranslate)
for eachSentence in aliceTranslated.take(10):
    print u"{0}".format(eachSentence)
Language API (Sentiment Analysis of BigQuery Results)
Similar workflow to before:
- Define a function that takes a quote as an input
- Creates API caller
- Returns the sentiment of the input (a sketch of such a function follows this list)
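A minimal sketch of what executeSentimentAnalysis might look like, assuming the same googleapiclient pattern used for executeTranslate and the Natural Language API's analyzeSentiment method (the exact course implementation may differ):

def executeSentimentAnalysis(quote):
    from googleapiclient.discovery import build
    # Build a caller for the Natural Language API
    lservice = build('language', 'v1', developerKey=APIKEY)
    # Request sentiment analysis on the plain-text quote
    response = lservice.documents().analyzeSentiment(
        body={'document': {'type': 'PLAIN_TEXT', 'content': quote}}).execute()
    # The response contains documentSentiment with score and magnitude,
    # which is what the loop further down reads out
    return response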
Workflow:
- Import Pandas
- Assemble a BigQuery call to get HackerNews articles
""" SELECT title, text FROM [bigquery-public-data:hacker_news.stories] where text > " " and title contains("JavaScript") LIMIT 10 """
The results of the BigQuery call are returned as a dataframe. The dataframe is then converted to an RDD.
df = gbq.read_gbq("""
SELECT title, text
FROM [bigquery-public-data:hacker_news.stories]
where text > " " and title contains("JavaScript")
LIMIT 10
""", project_id=PROJECT_ID)

# Convert Pandas DataFrame to RDD
rdd = sqlContext.createDataFrame(df).rdd
10 articles, 2 fields: title, and text. The RDD thing is turning the data from an in-memory Pandas object into a PySpark type of RDD object:
<class 'pyspark.rdd.PipelinedRDD'>
Next, extract the "text" field from whatever data container is being used to store the results:
# "Title" field is x[0], "Text" field is x[1]
comments = rdd.map(lambda x: x[1])
sentiments = comments.map(executeSentimentAnalysis)
To loop over each item:
for sentiment in sentiments.collect():
    print u"Score:{0} and Magnitude:{1}".format(
        sentiment['documentSentiment']['score'],
        sentiment['documentSentiment']['magnitude'])
One item tags each sentence with a sentiment.
Example Output from Vision API
{ "responses": [ { "fullTextAnnotation": { "pages": [ { "blocks": [ { "blockType": "TEXT", "boundingBox": { "vertices": [ { "x": 10, "y": 13 }, { "x": 267, "y": 13 }, { "x": 267, "y": 172 }, { "x": 10, "y": 172 } ] }, "paragraphs": [ { "boundingBox": { "vertices": [ { "x": 10, "y": 13 }, { "x": 267, "y": 13 }, { "x": 267, "y": 172 }, { "x": 10, "y": 172 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "words": [ { "boundingBox": { "vertices": [ { "x": 29, "y": 13 }, { "x": 52, "y": 13 }, { "x": 52, "y": 27 }, { "x": 29, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 29, "y": 13 }, { "x": 34, "y": 13 }, { "x": 34, "y": 27 }, { "x": 29, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "D" }, { "boundingBox": { "vertices": [ { "x": 36, "y": 13 }, { "x": 42, "y": 13 }, { "x": 42, "y": 27 }, { "x": 36, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "E" }, { "boundingBox": { "vertices": [ { "x": 44, "y": 13 }, { "x": 52, "y": 13 }, { "x": 52, "y": 27 }, { "x": 44, "y": 27 } ] }, "property": { "detectedBreak": { "type": "SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "P" } ] }, { "boundingBox": { "vertices": [ { "x": 59, "y": 13 }, { "x": 190, "y": 13 }, { "x": 190, "y": 27 }, { "x": 59, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 59, "y": 13 }, { "x": 65, "y": 13 }, { "x": 65, "y": 27 }, { "x": 59, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "D" }, { "boundingBox": { "vertices": [ { "x": 67, "y": 13 }, { "x": 75, "y": 13 }, { "x": 75, "y": 27 }, { "x": 67, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "E" }, { "boundingBox": { "vertices": [ { "x": 77, "y": 13 }, { "x": 83, "y": 13 }, { "x": 83, "y": 27 }, { "x": 77, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "L" }, { "boundingBox": { "vertices": [ { "x": 85, "y": 13 }, { "x": 90, "y": 13 }, { "x": 90, "y": 27 }, { "x": 85, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "A" }, { "boundingBox": { "vertices": [ { "x": 92, "y": 13 }, { "x": 101, "y": 13 }, { "x": 101, "y": 27 }, { "x": 92, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "Y" }, { "boundingBox": { "vertices": [ { "x": 103, "y": 13 }, { "x": 119, "y": 13 }, { "x": 119, "y": 27 }, { "x": 103, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "a" }, { "boundingBox": { "vertices": [ { "x": 121, "y": 13 }, { "x": 124, "y": 13 }, { "x": 124, "y": 27 }, { "x": 121, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "r" }, { "boundingBox": { "vertices": [ { "x": 126, "y": 13 }, { "x": 129, "y": 13 }, { "x": 129, "y": 27 }, { "x": 126, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "r" }, { "boundingBox": { "vertices": [ { "x": 131, "y": 13 }, { "x": 134, "y": 13 }, { "x": 134, "y": 27 }, { "x": 131, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "i" }, { "boundingBox": { "vertices": [ { "x": 136, "y": 13 }, { "x": 139, "y": 13 }, { "x": 139, "y": 27 }, { "x": 136, "y": 27 } ] }, "property": { 
"detectedLanguages": [ { "languageCode": "en" } ] }, "text": "v" }, { "boundingBox": { "vertices": [ { "x": 141, "y": 13 }, { "x": 147, "y": 13 }, { "x": 147, "y": 27 }, { "x": 141, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "a" }, { "boundingBox": { "vertices": [ { "x": 149, "y": 13 }, { "x": 152, "y": 13 }, { "x": 152, "y": 27 }, { "x": 149, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "l" }, { "boundingBox": { "vertices": [ { "x": 154, "y": 13 }, { "x": 157, "y": 13 }, { "x": 157, "y": 27 }, { "x": 154, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "_" }, { "boundingBox": { "vertices": [ { "x": 159, "y": 13 }, { "x": 165, "y": 13 }, { "x": 165, "y": 27 }, { "x": 159, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "d" }, { "boundingBox": { "vertices": [ { "x": 167, "y": 13 }, { "x": 172, "y": 13 }, { "x": 172, "y": 27 }, { "x": 167, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "e" }, { "boundingBox": { "vertices": [ { "x": 174, "y": 13 }, { "x": 178, "y": 13 }, { "x": 178, "y": 27 }, { "x": 174, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "l" }, { "boundingBox": { "vertices": [ { "x": 180, "y": 13 }, { "x": 183, "y": 13 }, { "x": 183, "y": 27 }, { "x": 180, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "a" }, { "boundingBox": { "vertices": [ { "x": 185, "y": 13 }, { "x": 190, "y": 13 }, { "x": 190, "y": 27 }, { "x": 185, "y": 27 } ] }, "property": { "detectedBreak": { "type": "SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "y" } ] }, { "boundingBox": { "vertices": [ { "x": 200, "y": 13 }, { "x": 267, "y": 13 }, { "x": 267, "y": 27 }, { "x": 200, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "nl" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 200, "y": 13 }, { "x": 208, "y": 13 }, { "x": 208, "y": 27 }, { "x": 200, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "n" }, { "boundingBox": { "vertices": [ { "x": 210, "y": 13 }, { "x": 216, "y": 13 }, { "x": 216, "y": 27 }, { "x": 210, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "u" }, { "boundingBox": { "vertices": [ { "x": 218, "y": 13 }, { "x": 226, "y": 13 }, { "x": 226, "y": 27 }, { "x": 218, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "m" }, { "boundingBox": { "vertices": [ { "x": 228, "y": 13 }, { "x": 234, "y": 13 }, { "x": 234, "y": 27 }, { "x": 228, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "f" }, { "boundingBox": { "vertices": [ { "x": 236, "y": 13 }, { "x": 237, "y": 13 }, { "x": 237, "y": 27 }, { "x": 236, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "l" }, { "boundingBox": { "vertices": [ { "x": 239, "y": 13 }, { "x": 242, "y": 13 }, { "x": 242, "y": 27 }, { "x": 239, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "i" }, { "boundingBox": { "vertices": [ { "x": 244, "y": 13 }, { "x": 247, "y": 13 }, { "x": 247, "y": 27 }, { "x": 244, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "g" }, { "boundingBox": { "vertices": [ { "x": 249, "y": 13 }, { "x": 255, 
"y": 13 }, { "x": 255, "y": 27 }, { "x": 249, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "h" }, { "boundingBox": { "vertices": [ { "x": 257, "y": 13 }, { "x": 262, "y": 13 }, { "x": 262, "y": 27 }, { "x": 257, "y": 27 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "t" }, { "boundingBox": { "vertices": [ { "x": 264, "y": 13 }, { "x": 267, "y": 13 }, { "x": 267, "y": 27 }, { "x": 264, "y": 27 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "s" } ] }, { "boundingBox": { "vertices": [ { "x": 10, "y": 41 }, { "x": 15, "y": 41 }, { "x": 15, "y": 54 }, { "x": 10, "y": 54 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 10, "y": 41 }, { "x": 15, "y": 41 }, { "x": 15, "y": 54 }, { "x": 10, "y": 54 } ] }, "property": { "detectedBreak": { "type": "SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" } ] }, { "boundingBox": { "vertices": [ { "x": 24, "y": 41 }, { "x": 57, "y": 42 }, { "x": 57, "y": 55 }, { "x": 24, "y": 54 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 24, "y": 41 }, { "x": 30, "y": 41 }, { "x": 30, "y": 54 }, { "x": 24, "y": 54 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "-" }, { "boundingBox": { "vertices": [ { "x": 32, "y": 41 }, { "x": 37, "y": 41 }, { "x": 37, "y": 54 }, { "x": 32, "y": 54 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 39, "y": 41 }, { "x": 44, "y": 41 }, { "x": 44, "y": 54 }, { "x": 39, "y": 54 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "7" }, { "boundingBox": { "vertices": [ { "x": 46, "y": 42 }, { "x": 49, "y": 42 }, { "x": 49, "y": 55 }, { "x": 46, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "." 
}, { "boundingBox": { "vertices": [ { "x": 51, "y": 42 }, { "x": 57, "y": 42 }, { "x": 57, "y": 55 }, { "x": 51, "y": 55 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" } ] }, { "boundingBox": { "vertices": [ { "x": 12, "y": 71 }, { "x": 12, "y": 71 }, { "x": 12, "y": 88 }, { "x": 12, "y": 88 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 12, "y": 71 }, { "x": 12, "y": 71 }, { "x": 12, "y": 88 }, { "x": 12, "y": 88 } ] }, "property": { "detectedBreak": { "type": "SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "1" } ] }, { "boundingBox": { "vertices": [ { "x": 24, "y": 71 }, { "x": 56, "y": 71 }, { "x": 56, "y": 88 }, { "x": 24, "y": 88 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 24, "y": 71 }, { "x": 28, "y": 71 }, { "x": 28, "y": 88 }, { "x": 24, "y": 88 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "-" }, { "boundingBox": { "vertices": [ { "x": 30, "y": 71 }, { "x": 38, "y": 71 }, { "x": 38, "y": 88 }, { "x": 30, "y": 88 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 40, "y": 71 }, { "x": 44, "y": 71 }, { "x": 44, "y": 88 }, { "x": 40, "y": 88 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "6" }, { "boundingBox": { "vertices": [ { "x": 46, "y": 71 }, { "x": 50, "y": 71 }, { "x": 50, "y": 88 }, { "x": 46, "y": 88 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "." }, { "boundingBox": { "vertices": [ { "x": 52, "y": 72 }, { "x": 56, "y": 72 }, { "x": 56, "y": 89 }, { "x": 52, "y": 89 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" } ] }, { "boundingBox": { "vertices": [ { "x": 11, "y": 100 }, { "x": 56, "y": 101 }, { "x": 56, "y": 115 }, { "x": 11, "y": 114 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 11, "y": 100 }, { "x": 13, "y": 100 }, { "x": 13, "y": 114 }, { "x": 11, "y": 114 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 15, "y": 100 }, { "x": 29, "y": 100 }, { "x": 29, "y": 114 }, { "x": 15, "y": 114 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "-" }, { "boundingBox": { "vertices": [ { "x": 31, "y": 100 }, { "x": 36, "y": 100 }, { "x": 36, "y": 114 }, { "x": 31, "y": 114 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 38, "y": 100 }, { "x": 44, "y": 100 }, { "x": 44, "y": 114 }, { "x": 38, "y": 114 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "5" }, { "boundingBox": { "vertices": [ { "x": 46, "y": 100 }, { "x": 49, "y": 100 }, { "x": 49, "y": 114 }, { "x": 46, "y": 114 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "." 
}, { "boundingBox": { "vertices": [ { "x": 51, "y": 101 }, { "x": 56, "y": 101 }, { "x": 56, "y": 115 }, { "x": 51, "y": 115 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" } ] }, { "boundingBox": { "vertices": [ { "x": 11, "y": 129 }, { "x": 56, "y": 130 }, { "x": 56, "y": 143 }, { "x": 11, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 11, "y": 129 }, { "x": 13, "y": 129 }, { "x": 13, "y": 142 }, { "x": 11, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "3" }, { "boundingBox": { "vertices": [ { "x": 15, "y": 129 }, { "x": 30, "y": 129 }, { "x": 30, "y": 142 }, { "x": 15, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "-" }, { "boundingBox": { "vertices": [ { "x": 32, "y": 129 }, { "x": 37, "y": 129 }, { "x": 37, "y": 142 }, { "x": 32, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 39, "y": 129 }, { "x": 44, "y": 129 }, { "x": 44, "y": 142 }, { "x": 39, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" }, { "boundingBox": { "vertices": [ { "x": 46, "y": 130 }, { "x": 49, "y": 130 }, { "x": 49, "y": 143 }, { "x": 46, "y": 143 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "." }, { "boundingBox": { "vertices": [ { "x": 51, "y": 130 }, { "x": 56, "y": 130 }, { "x": 56, "y": 143 }, { "x": 51, "y": 143 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" } ] }, { "boundingBox": { "vertices": [ { "x": 11, "y": 157 }, { "x": 56, "y": 159 }, { "x": 55, "y": 172 }, { "x": 10, "y": 170 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 11, "y": 157 }, { "x": 13, "y": 157 }, { "x": 12, "y": 170 }, { "x": 10, "y": 170 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" }, { "boundingBox": { "vertices": [ { "x": 15, "y": 157 }, { "x": 30, "y": 158 }, { "x": 29, "y": 171 }, { "x": 14, "y": 170 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "-" }, { "boundingBox": { "vertices": [ { "x": 32, "y": 158 }, { "x": 37, "y": 158 }, { "x": 36, "y": 171 }, { "x": 31, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 39, "y": 158 }, { "x": 44, "y": 158 }, { "x": 43, "y": 171 }, { "x": 38, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "3" }, { "boundingBox": { "vertices": [ { "x": 46, "y": 159 }, { "x": 49, "y": 159 }, { "x": 48, "y": 172 }, { "x": 45, "y": 172 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "." 
}, { "boundingBox": { "vertices": [ { "x": 51, "y": 159 }, { "x": 56, "y": 159 }, { "x": 55, "y": 172 }, { "x": 50, "y": 172 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" } ] } ] } ], "property": { "detectedLanguages": [ { "languageCode": "en" } ] } }, { "blockType": "TEXT", "boundingBox": { "vertices": [ { "x": 115, "y": 39 }, { "x": 233, "y": 39 }, { "x": 233, "y": 171 }, { "x": 115, "y": 171 } ] }, "paragraphs": [ { "boundingBox": { "vertices": [ { "x": 115, "y": 39 }, { "x": 233, "y": 39 }, { "x": 233, "y": 171 }, { "x": 115, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "words": [ { "boundingBox": { "vertices": [ { "x": 120, "y": 39 }, { "x": 182, "y": 39 }, { "x": 182, "y": 55 }, { "x": 120, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 120, "y": 39 }, { "x": 123, "y": 39 }, { "x": 123, "y": 55 }, { "x": 120, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 125, "y": 39 }, { "x": 132, "y": 39 }, { "x": 132, "y": 55 }, { "x": 125, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "6" }, { "boundingBox": { "vertices": [ { "x": 134, "y": 39 }, { "x": 135, "y": 39 }, { "x": 135, "y": 55 }, { "x": 134, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "." }, { "boundingBox": { "vertices": [ { "x": 137, "y": 39 }, { "x": 144, "y": 39 }, { "x": 144, "y": 55 }, { "x": 137, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "7" }, { "boundingBox": { "vertices": [ { "x": 146, "y": 39 }, { "x": 149, "y": 39 }, { "x": 149, "y": 55 }, { "x": 146, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "9" }, { "boundingBox": { "vertices": [ { "x": 151, "y": 39 }, { "x": 158, "y": 39 }, { "x": 158, "y": 55 }, { "x": 151, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "3" }, { "boundingBox": { "vertices": [ { "x": 160, "y": 39 }, { "x": 167, "y": 39 }, { "x": 167, "y": 55 }, { "x": 160, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "5" }, { "boundingBox": { "vertices": [ { "x": 169, "y": 39 }, { "x": 173, "y": 39 }, { "x": 173, "y": 55 }, { "x": 169, "y": 55 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" }, { "boundingBox": { "vertices": [ { "x": 175, "y": 39 }, { "x": 182, "y": 39 }, { "x": 182, "y": 55 }, { "x": 175, "y": 55 } ] }, "property": { "detectedBreak": { "type": "SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "8" } ] }, { "boundingBox": { "vertices": [ { "x": 196, "y": 40 }, { "x": 223, "y": 40 }, { "x": 223, "y": 56 }, { "x": 196, "y": 56 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 196, "y": 40 }, { "x": 208, "y": 40 }, { "x": 208, "y": 56 }, { "x": 196, "y": 56 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" }, { "boundingBox": { "vertices": [ { "x": 210, "y": 40 }, { "x": 217, "y": 40 }, { "x": 217, "y": 56 }, { "x": 210, "y": 56 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "6" }, { "boundingBox": { "vertices": [ { "x": 219, "y": 40 }, 
{ "x": 223, "y": 40 }, { "x": 223, "y": 56 }, { "x": 219, "y": 56 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "5" } ] }, { "boundingBox": { "vertices": [ { "x": 118, "y": 72 }, { "x": 181, "y": 72 }, { "x": 181, "y": 84 }, { "x": 118, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 118, "y": 72 }, { "x": 124, "y": 72 }, { "x": 124, "y": 84 }, { "x": 118, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 126, "y": 72 }, { "x": 130, "y": 72 }, { "x": 130, "y": 84 }, { "x": 126, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" }, { "boundingBox": { "vertices": [ { "x": 132, "y": 72 }, { "x": 135, "y": 72 }, { "x": 135, "y": 84 }, { "x": 132, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "." }, { "boundingBox": { "vertices": [ { "x": 137, "y": 72 }, { "x": 141, "y": 72 }, { "x": 141, "y": 84 }, { "x": 137, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" }, { "boundingBox": { "vertices": [ { "x": 143, "y": 72 }, { "x": 150, "y": 72 }, { "x": 150, "y": 84 }, { "x": 143, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "3" }, { "boundingBox": { "vertices": [ { "x": 152, "y": 72 }, { "x": 159, "y": 72 }, { "x": 159, "y": 84 }, { "x": 152, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "8" }, { "boundingBox": { "vertices": [ { "x": 161, "y": 72 }, { "x": 166, "y": 72 }, { "x": 166, "y": 84 }, { "x": 161, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "3" }, { "boundingBox": { "vertices": [ { "x": 168, "y": 72 }, { "x": 175, "y": 72 }, { "x": 175, "y": 84 }, { "x": 168, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "7" }, { "boundingBox": { "vertices": [ { "x": 177, "y": 72 }, { "x": 181, "y": 72 }, { "x": 181, "y": 84 }, { "x": 177, "y": 84 } ] }, "property": { "detectedBreak": { "type": "SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "5" } ] }, { "boundingBox": { "vertices": [ { "x": 194, "y": 72 }, { "x": 223, "y": 72 }, { "x": 223, "y": 84 }, { "x": 194, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 194, "y": 72 }, { "x": 208, "y": 72 }, { "x": 208, "y": 84 }, { "x": 194, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "7" }, { "boundingBox": { "vertices": [ { "x": 210, "y": 72 }, { "x": 214, "y": 72 }, { "x": 214, "y": 84 }, { "x": 210, "y": 84 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "1" }, { "boundingBox": { "vertices": [ { "x": 216, "y": 72 }, { "x": 223, "y": 72 }, { "x": 223, "y": 84 }, { "x": 216, "y": 84 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" } ] }, { "boundingBox": { "vertices": [ { "x": 116, "y": 100 }, { "x": 182, "y": 100 }, { "x": 182, "y": 113 }, { "x": 116, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 116, "y": 100 }, { "x": 122, "y": 100 }, { "x": 122, 
"y": 113 }, { "x": 116, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 124, "y": 100 }, { "x": 131, "y": 100 }, { "x": 131, "y": 113 }, { "x": 124, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "5" }, { "boundingBox": { "vertices": [ { "x": 133, "y": 100 }, { "x": 136, "y": 100 }, { "x": 136, "y": 113 }, { "x": 133, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "." }, { "boundingBox": { "vertices": [ { "x": 138, "y": 100 }, { "x": 141, "y": 100 }, { "x": 141, "y": 113 }, { "x": 138, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "1" }, { "boundingBox": { "vertices": [ { "x": 143, "y": 100 }, { "x": 151, "y": 100 }, { "x": 151, "y": 113 }, { "x": 143, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "8" }, { "boundingBox": { "vertices": [ { "x": 153, "y": 100 }, { "x": 158, "y": 100 }, { "x": 158, "y": 113 }, { "x": 153, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "5" }, { "boundingBox": { "vertices": [ { "x": 160, "y": 100 }, { "x": 167, "y": 100 }, { "x": 167, "y": 113 }, { "x": 160, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 169, "y": 100 }, { "x": 174, "y": 100 }, { "x": 174, "y": 113 }, { "x": 169, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 176, "y": 100 }, { "x": 182, "y": 100 }, { "x": 182, "y": 113 }, { "x": 176, "y": 113 } ] }, "property": { "detectedBreak": { "type": "SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" } ] }, { "boundingBox": { "vertices": [ { "x": 196, "y": 100 }, { "x": 225, "y": 100 }, { "x": 225, "y": 113 }, { "x": 196, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 196, "y": 100 }, { "x": 208, "y": 100 }, { "x": 208, "y": 113 }, { "x": 196, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "9" }, { "boundingBox": { "vertices": [ { "x": 210, "y": 100 }, { "x": 215, "y": 100 }, { "x": 215, "y": 113 }, { "x": 210, "y": 113 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "6" }, { "boundingBox": { "vertices": [ { "x": 217, "y": 100 }, { "x": 225, "y": 100 }, { "x": 225, "y": 113 }, { "x": 217, "y": 113 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "1" } ] }, { "boundingBox": { "vertices": [ { "x": 120, "y": 129 }, { "x": 182, "y": 129 }, { "x": 182, "y": 142 }, { "x": 120, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 120, "y": 129 }, { "x": 122, "y": 129 }, { "x": 122, "y": 142 }, { "x": 120, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 124, "y": 129 }, { "x": 132, "y": 129 }, { "x": 132, "y": 142 }, { "x": 124, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" }, { "boundingBox": { "vertices": [ { "x": 134, "y": 129 }, { "x": 134, "y": 129 }, { "x": 134, "y": 142 }, { "x": 134, 
"y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "." }, { "boundingBox": { "vertices": [ { "x": 136, "y": 129 }, { "x": 141, "y": 129 }, { "x": 141, "y": 142 }, { "x": 136, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" }, { "boundingBox": { "vertices": [ { "x": 143, "y": 129 }, { "x": 151, "y": 129 }, { "x": 151, "y": 142 }, { "x": 143, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "9" }, { "boundingBox": { "vertices": [ { "x": 153, "y": 129 }, { "x": 158, "y": 129 }, { "x": 158, "y": 142 }, { "x": 153, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" }, { "boundingBox": { "vertices": [ { "x": 160, "y": 129 }, { "x": 165, "y": 129 }, { "x": 165, "y": 142 }, { "x": 160, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "5" }, { "boundingBox": { "vertices": [ { "x": 167, "y": 129 }, { "x": 175, "y": 129 }, { "x": 175, "y": 142 }, { "x": 167, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "6" }, { "boundingBox": { "vertices": [ { "x": 177, "y": 129 }, { "x": 182, "y": 129 }, { "x": 182, "y": 142 }, { "x": 177, "y": 142 } ] }, "property": { "detectedBreak": { "type": "SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" } ] }, { "boundingBox": { "vertices": [ { "x": 196, "y": 129 }, { "x": 232, "y": 129 }, { "x": 232, "y": 142 }, { "x": 196, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ { "x": 196, "y": 129 }, { "x": 208, "y": 129 }, { "x": 208, "y": 142 }, { "x": 196, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "1" }, { "boundingBox": { "vertices": [ { "x": 210, "y": 129 }, { "x": 216, "y": 129 }, { "x": 216, "y": 142 }, { "x": 210, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "3" }, { "boundingBox": { "vertices": [ { "x": 218, "y": 129 }, { "x": 225, "y": 129 }, { "x": 225, "y": 142 }, { "x": 218, "y": 142 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" }, { "boundingBox": { "vertices": [ { "x": 227, "y": 129 }, { "x": 232, "y": 129 }, { "x": 232, "y": 142 }, { "x": 227, "y": 142 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "3" } ] }, { "boundingBox": { "vertices": [ { "x": 115, "y": 159 }, { "x": 233, "y": 159 }, { "x": 233, "y": 171 }, { "x": 115, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "symbols": [ { "boundingBox": { "vertices": [ {}, { "x": -1 }, { "x": -1, "y": -1 }, { "y": -1 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "-" }, { "boundingBox": { "vertices": [ { "x": 116, "y": 159 }, { "x": 124, "y": 159 }, { "x": 124, "y": 171 }, { "x": 116, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "2" }, { "boundingBox": { "vertices": [ { "x": 126, "y": 159 }, { "x": 130, "y": 159 }, { "x": 130, "y": 171 }, { "x": 126, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" }, { "boundingBox": { "vertices": [ { "x": 132, "y": 159 }, { "x": 135, "y": 159 }, { "x": 135, "y": 171 }, { "x": 132, "y": 171 } ] }, "property": { "detectedLanguages": [ { 
"languageCode": "en" } ] }, "text": "." }, { "boundingBox": { "vertices": [ { "x": 137, "y": 159 }, { "x": 141, "y": 159 }, { "x": 141, "y": 171 }, { "x": 137, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" }, { "boundingBox": { "vertices": [ { "x": 143, "y": 159 }, { "x": 150, "y": 159 }, { "x": 150, "y": 171 }, { "x": 143, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "1" }, { "boundingBox": { "vertices": [ { "x": 152, "y": 159 }, { "x": 159, "y": 159 }, { "x": 159, "y": 171 }, { "x": 152, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "6" }, { "boundingBox": { "vertices": [ { "x": 161, "y": 159 }, { "x": 166, "y": 159 }, { "x": 166, "y": 171 }, { "x": 161, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "6" }, { "boundingBox": { "vertices": [ { "x": 168, "y": 159 }, { "x": 175, "y": 159 }, { "x": 175, "y": 171 }, { "x": 168, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "3" }, { "boundingBox": { "vertices": [ { "x": 177, "y": 159 }, { "x": 181, "y": 159 }, { "x": 181, "y": 171 }, { "x": 177, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" }, { "boundingBox": { "vertices": [ { "x": 183, "y": 159 }, { "x": 208, "y": 159 }, { "x": 208, "y": 171 }, { "x": 183, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "1" }, { "boundingBox": { "vertices": [ { "x": 210, "y": 159 }, { "x": 217, "y": 159 }, { "x": 217, "y": 171 }, { "x": 210, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "8" }, { "boundingBox": { "vertices": [ { "x": 219, "y": 159 }, { "x": 223, "y": 159 }, { "x": 223, "y": 171 }, { "x": 219, "y": 171 } ] }, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "0" }, { "boundingBox": { "vertices": [ { "x": 225, "y": 159 }, { "x": 233, "y": 159 }, { "x": 233, "y": 171 }, { "x": 225, "y": 171 } ] }, "property": { "detectedBreak": { "type": "EOL_SURE_SPACE" }, "detectedLanguages": [ { "languageCode": "en" } ] }, "text": "4" } ] } ] } ], "property": { "detectedLanguages": [ { "languageCode": "en" } ] } } ], "height": 183, "property": { "detectedLanguages": [ { "languageCode": "en" } ] }, "width": 281 } ], "text": "DEP DELAYarrival_delay numflights\n0 -27.0\n1 -26.0\n2-25.0\n3-24.0\n4-23.0\n26.793548 465\n24.438375 714\n25.185224 961\n24.090560 1303\n-24.0166301804\n" }, "textAnnotations": [ { "boundingPoly": { "vertices": [ { "x": 10, "y": 13 }, { "x": 267, "y": 13 }, { "x": 267, "y": 172 }, { "x": 10, "y": 172 } ] }, "description": "DEP DELAYarrival_delay numflights\n0 -27.0\n1 -26.0\n2-25.0\n3-24.0\n4-23.0\n26.793548 465\n24.438375 714\n25.185224 961\n24.090560 1303\n-24.0166301804\n", "locale": "en" }, { "boundingPoly": { "vertices": [ { "x": 29, "y": 13 }, { "x": 52, "y": 13 }, { "x": 52, "y": 27 }, { "x": 29, "y": 27 } ] }, "description": "DEP" }, { "boundingPoly": { "vertices": [ { "x": 59, "y": 13 }, { "x": 190, "y": 13 }, { "x": 190, "y": 27 }, { "x": 59, "y": 27 } ] }, "description": "DELAYarrival_delay" }, { "boundingPoly": { "vertices": [ { "x": 200, "y": 13 }, { "x": 267, "y": 13 }, { "x": 267, "y": 27 }, { "x": 200, "y": 27 } ] }, "description": "numflights" }, { "boundingPoly": { "vertices": [ { "x": 10, "y": 41 }, { "x": 15, "y": 41 }, { "x": 15, "y": 54 }, { "x": 10, "y": 54 } ] }, "description": "0" 
}, { "boundingPoly": { "vertices": [ { "x": 24, "y": 41 }, { "x": 57, "y": 42 }, { "x": 57, "y": 55 }, { "x": 24, "y": 54 } ] }, "description": "-27.0" }, { "boundingPoly": { "vertices": [ { "x": 12, "y": 71 }, { "x": 12, "y": 71 }, { "x": 12, "y": 88 }, { "x": 12, "y": 88 } ] }, "description": "1" }, { "boundingPoly": { "vertices": [ { "x": 24, "y": 71 }, { "x": 56, "y": 71 }, { "x": 56, "y": 88 }, { "x": 24, "y": 88 } ] }, "description": "-26.0" }, { "boundingPoly": { "vertices": [ { "x": 11, "y": 100 }, { "x": 56, "y": 101 }, { "x": 56, "y": 115 }, { "x": 11, "y": 114 } ] }, "description": "2-25.0" }, { "boundingPoly": { "vertices": [ { "x": 11, "y": 129 }, { "x": 56, "y": 130 }, { "x": 56, "y": 143 }, { "x": 11, "y": 142 } ] }, "description": "3-24.0" }, { "boundingPoly": { "vertices": [ { "x": 11, "y": 157 }, { "x": 56, "y": 159 }, { "x": 55, "y": 172 }, { "x": 10, "y": 170 } ] }, "description": "4-23.0" }, { "boundingPoly": { "vertices": [ { "x": 120, "y": 39 }, { "x": 182, "y": 39 }, { "x": 182, "y": 55 }, { "x": 120, "y": 55 } ] }, "description": "26.793548" }, { "boundingPoly": { "vertices": [ { "x": 196, "y": 40 }, { "x": 223, "y": 40 }, { "x": 223, "y": 56 }, { "x": 196, "y": 56 } ] }, "description": "465" }, { "boundingPoly": { "vertices": [ { "x": 118, "y": 72 }, { "x": 181, "y": 72 }, { "x": 181, "y": 84 }, { "x": 118, "y": 84 } ] }, "description": "24.438375" }, { "boundingPoly": { "vertices": [ { "x": 194, "y": 72 }, { "x": 223, "y": 72 }, { "x": 223, "y": 84 }, { "x": 194, "y": 84 } ] }, "description": "714" }, { "boundingPoly": { "vertices": [ { "x": 116, "y": 100 }, { "x": 182, "y": 100 }, { "x": 182, "y": 113 }, { "x": 116, "y": 113 } ] }, "description": "25.185224" }, { "boundingPoly": { "vertices": [ { "x": 196, "y": 100 }, { "x": 225, "y": 100 }, { "x": 225, "y": 113 }, { "x": 196, "y": 113 } ] }, "description": "961" }, { "boundingPoly": { "vertices": [ { "x": 120, "y": 129 }, { "x": 182, "y": 129 }, { "x": 182, "y": 142 }, { "x": 120, "y": 142 } ] }, "description": "24.090560" }, { "boundingPoly": { "vertices": [ { "x": 196, "y": 129 }, { "x": 232, "y": 129 }, { "x": 232, "y": 142 }, { "x": 196, "y": 142 } ] }, "description": "1303" }, { "boundingPoly": { "vertices": [ { "x": 115, "y": 159 }, { "x": 233, "y": 159 }, { "x": 233, "y": 171 }, { "x": 115, "y": 171 } ] }, "description": "-24.0166301804" } ] } ] }
References
YouTube review of Google Cloud Data Engineer Certification: