The Google Cloud Professional Data Engineer Ultimate Exam is a complete certification preparation resource designed for professionals working with data pipelines, analytics, machine learning, and cloud-based data infrastructure. This exam covers BigQuery, Cloud Storage, Dataflow, Dataproc, ETL workflows, data governance, security, scalability, and performance optimization. Learners develop practical expertise in designing reliable data solutions, processing large datasets, and enabling data-driven decision-making while building confidence for professional certification success.

Question 1. Which architecture combines a batch layer for historical data and a speed layer for real-time data?
A) Kappa
B) Lambda
C) Microservices
D) Event-driven
Answer: B
Explanation: The Lambda architecture uses a batch layer to compute immutable views from all data and a speed layer to provide low-latency updates, merging both results for queries.

Question 2. For lightweight, event-driven orchestration of Cloud Functions and Cloud Run services, which GCP product is most appropriate?
A) Cloud Composer
B) Cloud Workflows
C) Cloud Data Fusion
D) Cloud Build
Answer: B
Explanation: Cloud Workflows is designed for lightweight, serverless coordination of services, whereas Cloud Composer (Airflow) is suited for complex, scheduled pipelines.

Question 3. Which tool provides a visual, code-free ETL experience and can write directly to BigQuery tables?
A) Dataform
B) Cloud Data Fusion
C) Dataproc Serverless
D) Dataflow
Answer: B
Explanation: Cloud Data Fusion offers a drag-and-drop UI for building ETL pipelines and supports native BigQuery sinks.

Question 4. When migrating an on-prem Hadoop workload to GCP with minimal operational overhead, which service should you choose?
A) Cloud Dataproc (cluster mode)
B) Cloud Dataproc Serverless
C) Cloud Dataflow
D) Cloud Composer
Answer: B
Explanation: Dataproc Serverless runs Spark workloads without any cluster to provision or manage, which makes it the lowest-overhead target for Spark jobs coming from a Hadoop environment. Workloads that are not Spark-based (for example MapReduce or Hive) still need a Dataproc cluster.

Question 5. Which Cloud Storage class is optimized for data accessed less than once a year but requires rapid retrieval?
A) Standard
B) Nearline
C) Coldline
D) Archive
Answer: D
Explanation: Archive is intended for data accessed less than once a year and, like every Cloud Storage class, serves objects with millisecond latency. Coldline targets data accessed roughly once a quarter, and Nearline data accessed roughly once a month.

Question 6. For a globally distributed, strongly consistent relational database, which GCP service is the best fit?

Question 9. Which IAM principle helps minimize the risk of privilege escalation?
A) Role inheritance
B) Least privilege
C) Service account impersonation
D) Project-level admin
Answer: B
Explanation: The principle of least privilege grants only the permissions required to perform a task, reducing attack surface.

Question 10. When a regulation requires that data never leave the EU, which GCP resource configuration should you enforce?
A) Multi-regional bucket in “us-central1”
B) Regional bucket in “europe-west1” and Spanner instance in “europe-west1”
C) Global Cloud SQL instance
D) BigQuery dataset with “US” location
Answer: B
Explanation: Using regional resources within an EU region complies with data residency requirements.

Question 11. Which key management option allows you to control the encryption keys stored in Cloud KMS yourself?
A) Google-managed encryption keys (default)
B) Customer-managed encryption keys (CMEK)
C) Customer-supplied encryption keys (CSEK)
D) Transparent data encryption (TDE)
Answer: B
Explanation: CMEK lets customers create and manage keys in Cloud KMS, providing control over encryption lifecycle.

Question 12. Dataplex primarily provides which capability?
A) Real-time stream processing
B) Unified data governance, discovery, and quality across data lakes and warehouses
C) Serverless Spark execution
D) Automated model training
Answer: B
Explanation: Dataplex is a data fabric that centralizes governance, metadata, and quality checks across heterogeneous data assets.

Question 13. Which Pub/Sub feature helps handle messages that repeatedly fail processing?
A) Exactly-once delivery
B) Dead-letter topic
C) Message ordering
D) Pull subscription only
Answer: B
Explanation: A dead-letter topic receives messages that exceed the maximum delivery attempts, allowing separate handling.

Question 14. To capture changes from an on-prem MySQL database into BigQuery with minimal latency, which GCP service should you use?
A) Transfer Appliance
B) Storage Transfer Service
C) Datastream

C) Fixed (tumbling) window
D) Global window
Answer: C
Explanation: Fixed (tumbling) windows partition time into non-overlapping intervals of equal size.

Question 18. To reduce the cost of a streaming Dataflow job that processes high-volume clickstream data, which feature should you enable?
A) Streaming Engine
B) Shuffle mode off
C) Autoscaling disabled
D) Batch mode only
Answer: A
Explanation: Streaming Engine moves window state storage and streaming shuffle off the worker VMs and into the Dataflow service backend. Workers then need less CPU, memory, and disk, and autoscaling becomes more responsive, which typically lowers cost.

Question 19. In Dataflow, a “hot key” problem is most likely caused by:
A) Using too many worker nodes
B) An uneven distribution of keys where a single key receives a disproportionate number of records
C) Incorrect windowing strategy
D) Insufficient memory on workers
Answer: B
Explanation: Hot keys create skew because one key’s processing becomes a bottleneck, leading to performance degradation.

Question 20. Which BigQuery feature allows you to partition a table by a DATE column to improve query performance?
A) Clustering
B) Partitioned tables (time-partitioning)
C) Materialized view
D) Dataflow template
Answer: B
Explanation: Time-partitioned tables store data in separate partitions per date, enabling pruning of irrelevant partitions during queries.

Question 21. When would you choose clustering over partitioning in BigQuery?
A) To reduce storage costs for small tables
B) To improve query performance on columns frequently filtered together, without creating separate partitions
C) To enforce row-level security
D) To enable cross-project data sharing
Answer: B
Explanation: Clustering groups rows with similar values on specified columns, allowing more efficient pruning when those columns are filtered.

Question 22. Which BigQuery feature enables you to share a dataset with external organizations without copying the data?
A) Export to Cloud Storage
B) Analytics Hub (formerly Data Exchange)
C) Data Transfer Service
D) Cloud Composer
Answer: B
Explanation: Analytics Hub lets data providers publish datasets for secure, controlled sharing with other GCP projects or external partners.
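
As a rough illustration of the partition pruning described in Question 20, the Python sketch below models a date-partitioned table in memory. The class `ToyPartitionedTable` and its methods are hypothetical names invented for this example. It is not BigQuery's implementation or API, only a toy that shows why a filter on the partitioning column reduces the number of rows scanned.

```python
from collections import defaultdict
from datetime import date


class ToyPartitionedTable:
    """Hypothetical in-memory stand-in for a date-partitioned table."""

    def __init__(self):
        # One bucket of rows per partition date.
        self._partitions = defaultdict(list)

    def insert(self, event_date, row):
        self._partitions[event_date].append(row)

    def total_rows(self):
        return sum(len(rows) for rows in self._partitions.values())

    def query(self, start, end):
        """Return (matching_rows, rows_scanned) for start <= date <= end.

        Partitions outside the range are skipped without reading their rows,
        which is the pruning step.
        """
        matched, scanned = [], 0
        for day, rows in self._partitions.items():
            if start <= day <= end:
                scanned += len(rows)
                matched.extend(rows)
        return matched, scanned


# Three daily partitions with 100 rows each (made-up sample data).
table = ToyPartitionedTable()
for day in (1, 2, 3):
    for i in range(100):
        table.insert(date(2024, 1, day), {"day": day, "clicks": i})

rows, scanned = table.query(date(2024, 1, 2), date(2024, 1, 2))
print(f"scanned {scanned} of {table.total_rows()} rows")  # scanned 100 of 300 rows
```

In this toy model, clustering (Question 21) would loosely correspond to keeping the rows inside each partition sorted by the clustered columns so that a filter can also skip blocks within a partition.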

Explanation: BigQuery Omni extends BigQuery’s SQL engine to query data in Amazon S3 and Azure Blob Storage without moving it.

Question 26. In Cloud Bigtable, which row key design pattern helps avoid hotspotting?
A) Sequential timestamps at the start of the key
B) Randomly generated UUIDs as the entire key
C) Prefixing with a hashed bucket followed by a logical identifier
D) Using only the user ID as the key
Answer: C
Explanation: Adding a hash or bucket prefix distributes writes across tablets, preventing hotspots caused by sequential keys.

Question 27. When scaling a Bigtable instance, which metric should primarily guide you to add more nodes?
A) Storage size only
B) CPU utilization above 70% for sustained periods
C) Number of tables
D) Number of column families
Answer: B
Explanation: CPU utilization reflects the workload; sustained high CPU indicates the need for additional nodes.

Question 28. Which BigQuery SQL function would you use to extract the value of “price” from a JSON column?
A) JSON_EXTRACT
B) PARSE_JSON
C) TO_JSON_STRING
Answer: A
Explanation: JSON_EXTRACT returns the JSON-encoded value at a given JSONPath expression. JSON_VALUE returns the same field as a plain scalar string and is also a valid way to read “price”. JSON_EXTRACT is the older function; current BigQuery documentation favors JSON_QUERY and JSON_VALUE.

Question 29. To create a reusable custom calculation in BigQuery that can be called from multiple queries, you should use:
A) Stored procedures
B) User-Defined Functions (UDFs)
C) Views only
D) Dataflow templates
Answer: B
Explanation: UDFs allow you to define JavaScript or SQL-based functions that can be invoked across queries.

Question 30. Looker Studio (formerly Data Studio) connects to BigQuery using which method?
A) Direct JDBC connection
B) BigQuery API via OAuth
C) Cloud Storage export
D) Pub/Sub subscription
Answer: B
Explanation: Looker Studio uses the BigQuery REST API with OAuth2 for authentication to query data directly.

Question 31. Which GCP service provides a centralized catalog for metadata across BigQuery, Cloud Storage, and other data assets?

Question 34. Which pre-built Cloud API would you call to extract text from scanned documents in a pipeline?
A) Cloud Vision OCR
B) Cloud Natural Language Sentiment
C) Cloud Translation
D) Cloud Speech-to-Text
Answer: A
Explanation: Cloud Vision’s OCR capability detects and extracts printed or handwritten text from images.

Question 35. What is the primary purpose of exponential backoff when retrying failed Pub/Sub message deliveries?
A) To guarantee exactly-once delivery
B) To reduce the load on the service and avoid thundering herd problems
C) To increase message ordering guarantees
D) To disable dead-letter topics
Answer: B
Explanation: Exponential backoff spaces out retries, preventing overwhelming the system during transient failures.

Question 36. Which Cloud Monitoring metric would you set an alert on to detect a Dataflow job that is falling behind its processing deadline?
A) dataflow.googleapis.com/job/total_rows_processed
B) dataflow.googleapis.com/job/element_count
C) dataflow.googleapis.com/job/processing_time_per_window
D) dataflow.googleapis.com/job/system_lag
Answer: D
Explanation: System lag reports the longest time an element has been waiting to be processed, so a sustained increase indicates a growing backlog. The data watermark age metric (dataflow.googleapis.com/job/data_watermark_age) is a related signal.

Question 37. To enforce a quota limit on the number of BigQuery slots a team can consume, you should configure:
A) BigQuery reservations with a slot commitment
B) Cloud IAM custom role
C) VPC Service Controls
D) Organization policy “bigquery.allowedResources”
Answer: A
Explanation: Reservations allocate a fixed number of slots to a project or group, capping consumption.

Question 38. Which CI/CD tool can automatically build and deploy Dataflow templates stored in Cloud Storage?
A) Cloud Composer
B) Cloud Build
C) Cloud Functions
D) Cloud Scheduler
Answer: B
Explanation: Cloud Build can compile code, create Dataflow templates, and push them to a Cloud Storage bucket as part of a pipeline.

Question 39. When separating environments for a data platform, which GCP construct provides the strongest isolation?
A) Different folders within the same organization
B) Different projects with separate VPCs
C) Different service accounts in one project

B) DataflowRunner (default)
C) FlinkRunner
D) SparkRunner
Answer: B
Explanation: DataflowRunner runs pipelines on the fully managed Dataflow service, handling scaling and resource provisioning.

Question 43. When using Pub/Sub with exactly-once delivery, which component must be enabled?
A) Ordering keys
B) Message deduplication (message ID)
C) Dead-letter topic
D) Pull subscription only
Answer: B
Explanation: Exactly-once delivery is a setting on a pull subscription. When it is enabled, Pub/Sub uses its own unique message ID to avoid redelivering a message that has already been successfully acknowledged.

Question 44. Which Cloud Storage class provides the lowest storage cost in exchange for the highest retrieval cost and the longest minimum storage duration?
A) Standard
B) Nearline
C) Coldline
D) Archive
Answer: D
Explanation: Archive storage is designed for long-term retention at the lowest at-rest price. It carries a 365-day minimum storage duration and the highest retrieval fees, but objects are still returned within milliseconds, not hours.
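
The deduplication idea behind Question 43 can be sketched in a few lines of Python. This is a toy, in-memory model only: `DedupingReceiver` is a hypothetical class invented for this example, it does not use the Pub/Sub client library, and the real service performs this bookkeeping on its own side. The sketch assumes each delivery carries a stable message ID.

```python
class DedupingReceiver:
    """Hypothetical receiver that drops messages whose ID was already handled."""

    def __init__(self, handler):
        self._handler = handler
        self._seen_ids = set()

    def receive(self, message_id, payload):
        """Process payload once per message_id; return True if it was processed."""
        if message_id in self._seen_ids:
            return False  # duplicate delivery, skip it
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failure can be retried.
        self._seen_ids.add(message_id)
        return True


processed = []
receiver = DedupingReceiver(processed.append)
results = [
    receiver.receive("m1", "order-created"),
    receiver.receive("m2", "order-paid"),
    receiver.receive("m1", "order-created"),  # redelivery of m1
]
print(processed)  # ['order-created', 'order-paid']
print(results)    # [True, True, False]
```

Recording the ID after the handler returns means a crash mid-processing leads to a retry instead of a lost message, at the price of requiring the handler itself to tolerate a repeated attempt.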

Question 45. In BigQuery, which clause is used to limit the amount of data scanned by a query for cost control?
A) LIMIT
B) WHERE with partition filter
C) SELECT *
D) WITH clause
Answer: B
Explanation: Filtering on partitioned columns (e.g., date) reduces scanned partitions, directly lowering query cost.

Question 46. Which of the following best describes a “materialized view” in BigQuery?
A) A view that is recomputed on each query execution
B) A view that stores pre-computed results and refreshes automatically based on source changes
C) A static snapshot that never updates
D) A temporary table that expires after 24 hours
Answer: B
Explanation: Materialized views maintain cached results and are incrementally refreshed, offering faster query performance.

Question 47. To enforce row-level security on a BigQuery table, which feature should you configure?
A) IAM policy on the dataset
B) Column-level security
C) Access policies with row-level security predicates
D) Data Catalog tags
Answer: C

A) VPC Service Controls
B) Organization policy “bigquery.allowedExternalDataSources”
C) Data Loss Prevention API
D) IAM deny policies
Answer: A
Explanation: VPC Service Controls create a security perimeter that restricts data movement, including egress from external data sources accessed via Omni.

Question 51. In Cloud Composer, which component is responsible for executing DAG tasks?
A) Scheduler only
B) Worker pods (CeleryExecutor) or KubernetesExecutor pods
C) Cloud Functions
D) Dataflow workers
Answer: B
Explanation: Composer uses Airflow executors; the CeleryExecutor runs tasks on worker pods, while the KubernetesExecutor runs each task in its own pod.

Question 52. Which of the following is a valid reason to choose Cloud SQL over Cloud Spanner?
A) Need for horizontal scaling across continents
B) Requirement for strong global consistency with millions of rows
C) Simple OLTP workload with modest scale and need for familiar MySQL/PostgreSQL engine
D) Need for petabyte-scale analytical queries
Answer: C
Explanation: Cloud SQL provides managed MySQL, PostgreSQL, and SQL Server instances suitable for traditional OLTP workloads.

Question 53. When configuring a Dataproc cluster for Spark jobs that require high shuffle performance, which storage option should you enable?
A) Local SSDs only
B) Cloud Storage as the default filesystem
C) High-performance persistent disks (PD-SSD) for /tmp and shuffle directories
D) Nearline storage for checkpointing
Answer: C
Explanation: Using PD-SSD for temporary shuffle files improves I/O performance during Spark shuffle stages.

Question 54. Which of the following best describes the purpose of a “dead-letter queue” in a streaming pipeline?
A) To store successfully processed messages
B) To hold messages that could not be processed after retries for later analysis
C) To enforce ordering of messages
D) To compress messages before delivery
Answer: B
Explanation: A dead-letter queue captures messages that repeatedly fail, allowing operators to inspect and remediate them.

Question 55. In BigQuery, what does the “slots” metric represent?
A) Number of concurrent queries allowed
B) Compute capacity measured in virtual CPUs allocated to a project