Cloud Data Engineer Java Professional Study and Certification Guide, Exams of Technology

A comprehensive guide focused on cloud data engineering using Java technologies. It explores cloud architecture, distributed systems, data pipelines, and performance optimization while providing exam-style assessments and detailed solution explanations.

Typology: Exams

2025/2026

Available from 02/22/2026

shilpi-jain-3

**Question 1.** Which Google Cloud storage option is best suited for storing large,
immutable binary assets that are accessed infrequently?
A) Cloud SQL
B) Cloud Bigtable
C) Cloud Storage Nearline
D) Cloud Spanner
Answer: C
Explanation: Nearline is a low-cost tier of Cloud Storage designed for data that is
accessed less than once a month, making it ideal for large, immutable assets.
**Question 2.** In Apache Beam (Java SDK), which transform is used to apply a
user-defined function to each element of a PCollection?
A) ParDo
B) GroupByKey
C) Combine.perKey
D) Flatten
Answer: A
Explanation: ParDo applies a DoFn to each element, enabling element-wise
processing.
**Question 3.** When designing a BigQuery table for time-series data, which feature
helps improve query performance by limiting the amount of data scanned?
A) Clustering
B) Partitioning by ingestion time
C) Row-level security
D) Materialized views
Answer: B
Explanation: Partitioning by ingestion time (or a timestamp column) allows queries to scan only relevant partitions, reducing data scanned.
**Question 4.** Which of the following row-key design strategies helps avoid hotspotting in Cloud Bigtable?
A) Use monotonically increasing timestamps as the leading part of the key
B) Prefix the key with a salted hash of the primary identifier
C) Store the entire record in a single column family
D) Keep the row key length under 10 bytes
Answer: B
Explanation: Adding a salted hash distributes writes across multiple tablets, preventing hotspotting.
**Question 5.** In a disaster-recovery plan, what does RPO (Recovery Point Objective) define?
A) Maximum acceptable downtime
B) Maximum data loss measured in time
C) Time required to restore services
D) Frequency of backup verification
Answer: B
Explanation: RPO specifies the maximum age of data that can be recovered after a failure, i.e., acceptable data loss.
**Question 6.** Which Java library provides the most efficient way to read Avro files stored in Cloud Storage within a Dataflow pipeline?
A) Google Cloud Storage client library
B) Apache Avro Java API

**Question 9.** Which IAM role grants a service account permission to read objects from a Cloud Storage bucket and write to a BigQuery dataset?
A) roles/storage.objectAdmin
B) roles/bigquery.dataEditor
C) roles/editor
D) roles/bigquery.jobUser
Answer: C
Explanation: Of the options listed, only the basic (formerly “primitive”) Editor role spans both Cloud Storage and BigQuery. It is much broader than the task requires; a least-privilege setup would instead grant roles/storage.objectViewer on the bucket and roles/bigquery.dataEditor on the dataset.
**Question 10.** In Cloud Composer, which Airflow operator is used to trigger a Dataflow job written in Java?
A) BashOperator
B) DataflowPythonOperator
C) DataflowJavaOperator
D) SparkSubmitOperator
Answer: C
Explanation: DataflowJavaOperator launches a Dataflow job that runs a Java SDK pipeline. Newer Airflow releases use different names: the Google provider ships DataflowCreateJavaJobOperator, and the Apache Beam provider ships BeamRunJavaPipelineOperator.
**Question 11.** Which of the following BigQuery features allows you to enforce row-level security based on the user’s email address?
A) Column-level encryption
B) Authorized views
C) Row-level security policies
D) Data masking policies
Answer: C
Explanation: Row-level security policies can reference the SESSION_USER() function in their filter expression to restrict rows to the email address of the user running the query.
**Question 12.** When serializing Java objects to Parquet for storage in Cloud Storage, which library provides the necessary schema inference?
A) Apache Parquet-MR
B) Google Cloud Storage client
C) Jackson CSV
D) Avro4s
Answer: A
Explanation: Parquet-MR includes tools for converting Java POJOs to Parquet schema and writing files.
**Question 13.** Which Dataflow feature helps you handle malformed records without failing the entire pipeline?
A) Side outputs to a dead-letter PCollection
B) Set the pipeline option “--failOnError” to false
C) Use a Try-Catch block inside a DoFn and rethrow the exception
D) Enable “strict mode” in the pipeline options
Answer: A
Explanation: Emitting malformed records to a side output (dead-letter) isolates errors while allowing the main pipeline to continue.
**Question 14.** In Cloud Spanner, which schema design pattern is recommended for representing many-to-many relationships while minimizing read latency?
A) Storing a JSON array in a column
B) Using interleaved tables

**Question 17.** Which watermark strategy is appropriate when source data may arrive out of order, up to 5 minutes late?
A) MonotonicallyIncreasingWatermarkEstimator
B) BoundedOutOfOrdernessTimestampExtractor with 5-minute bound
C) NoWatermarkPolicy
D) PeriodicWatermarkGenerator with 1-minute interval
Answer: B
Explanation: BoundedOutOfOrdernessTimestampExtractor allows a fixed lateness bound, so it tolerates delays of up to 5 minutes. Note that this class belongs to Apache Flink's API rather than to the Beam SDK.
**Question 18.** In a Java-based Spark job on Dataproc, which API call persists an RDD as a Parquet file to Cloud Storage?
A) rdd.saveAsTextFile("gs://bucket/path")
B) rdd.saveAsObjectFile("gs://bucket/path")
C) rdd.write().parquet("gs://bucket/path")
D) rdd.toDF().write().format("parquet").save("gs://bucket/path")
Answer: D
Explanation: An RDD has no Parquet writer of its own, so it must first be converted to a DataFrame and then written with DataFrameWriter using the “parquet” format. toDF() is a Scala implicit; in the Java API the conversion is done with SparkSession.createDataFrame(rdd, BeanClass.class).
**Question 19.** Which Cloud Monitoring metric would you alert on to detect a sudden increase in Dataflow job latency?
A) dataflow.googleapis.com/job/total_runtime
B) dataflow.googleapis.com/worker/element_count
C) dataflow.googleapis.com/job/latency
D) dataflow.googleapis.com/worker/cpu_utilization
Answer: C
Explanation: The latency metric reflects processing delay; a spike indicates potential bottlenecks. When building a real alert, confirm the metric name against the current metrics list, where the comparable signals are dataflow.googleapis.com/job/system_lag and dataflow.googleapis.com/job/data_watermark_age.
**Question 20.** When configuring Customer-Managed Encryption Keys (CMEK) for BigQuery tables, which IAM role must be granted to the service account that loads data?
A) roles/cloudkms.cryptoKeyEncrypterDecrypter
B) roles/bigquery.dataViewer
C) roles/storage.objectCreator
D) roles/cloudkms.viewer
Answer: A
Explanation: Encrypting and decrypting with the CMEK requires the CryptoKeyEncrypterDecrypter role on the key. For BigQuery, this role is normally granted to the project's BigQuery service agent rather than to the account that submits the load job.
**Question 21.** Which Java annotation is used to define a Beam schema for a POJO that will be automatically converted to a Row in a PCollection?
A) @DefaultCoder
B) @SchemaCreate
C) @JsonProperty
D) @AutoValue
Answer: B
Explanation: @SchemaCreate marks the constructor (or static factory method) that Beam’s schema inference uses to build instances when mapping POJOs to Rows. Schema inference itself is typically enabled with @DefaultSchema on the class.
**Question 22.** In BigQuery, what is the effect of enabling “require partition filter” on a partitioned table?
A) Queries must specify a filter on the partitioning column, otherwise they fail
B) The table automatically drops partitions older than 30 days

A) The subscription’s acknowledgment deadline expires too quickly
B) Processing latency spikes because a single worker handles most messages
C) Messages are delivered out of order across partitions
D) The topic exceeds its quota for retained messages
Answer: B
Explanation: A hot key concentrates load on one worker, causing latency and scaling issues.
**Question 26.** In a Java Beam pipeline, what is the purpose of a “Side Input”?
A) To provide additional data that is broadcast to all workers for each element
B) To split the main PCollection into multiple output streams
C) To store intermediate results in Cloud Storage automatically
D) To enforce ordering of elements within a window
Answer: A
Explanation: Side inputs supply auxiliary data (e.g., a lookup table) that is available to every DoFn instance.
**Question 27.** Which Cloud Storage storage class provides the lowest latency for frequently accessed data?
A) Archive
B) Nearline
C) Coldline
D) Standard
Answer: D
Explanation: Standard storage is optimized for low latency and high frequency access.
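
The hot-key problem described in the explanation above (one key concentrating load on one worker) is commonly eased by splitting the key into several salted sub-keys, aggregating each sub-key separately, and then merging the partial results. The following is a minimal plain-Java sketch of that idea, not Beam code; the class name `HotKeyFanout`, the `#` separator, and the round-robin salt are illustrative assumptions. In Beam itself, `Combine.perKey(...).withHotKeyFanout(n)` is one built-in way to get a similar effect.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of hot-key fanout: salt keys, pre-aggregate per shard, then merge. */
final class HotKeyFanout {
    private HotKeyFanout() {}

    /** Stage 1: spread each key over `shards` sub-keys such as "user#3" and sum per sub-key. */
    static Map<String, Long> partialSums(List<Map.Entry<String, Long>> events, int shards) {
        Map<String, Long> partial = new HashMap<>();
        int i = 0;
        for (Map.Entry<String, Long> e : events) {
            int shard = i++ % shards; // round-robin salt (assumption); a random salt also works
            partial.merge(e.getKey() + "#" + shard, e.getValue(), Long::sum);
        }
        return partial;
    }

    /** Stage 2: strip the salt and merge the per-shard sums into one total per original key. */
    static Map<String, Long> finalSums(Map<String, Long> partial) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String salted = e.getKey();
            String key = salted.substring(0, salted.lastIndexOf('#'));
            totals.merge(key, e.getValue(), Long::sum);
        }
        return totals;
    }
}
```

In a distributed runner, stage 1 is the part that can run on many workers in parallel, because the sub-keys of a single hot key no longer have to land on the same worker. This only works for aggregations that can be merged from partial results, such as sums, counts, or maxima.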

**Question 28.** When using BigQuery ML to train a linear regression model, which SQL clause specifies the target column?
A) SELECT
B) FROM
C) PREDICT
D) LABEL
Answer: D
Explanation: LABEL is the closest option, but BigQuery ML has no standalone LABEL clause. The target is either a column named label in the training SELECT or the column listed in the input_label_cols option of CREATE MODEL.
**Question 29.** Which of the following is a recommended practice for designing a BigQuery schema that will be queried by many microservices with different access patterns?
A) Store all data in a single wide table and rely on column pruning
B) Create separate tables per microservice to avoid contention
C) Use nested and repeated fields to encapsulate related entities
D) Denormalize everything into flat tables only
Answer: C
Explanation: Nested/repeated fields allow hierarchical data modeling while preserving query efficiency.
**Question 30.** In Cloud Dataproc, which configuration parameter controls the number of executors for a Spark job submitted via Java?
A) spark.executor.instances
B) spark.driver.memory
C) dataproc:dataproc_cluster_config
D) spark.sql.shuffle.partitions
Answer: A

D) IllegalArgumentException
Answer: B
Explanation: StorageException includes HTTP status codes; 5xx indicates transient failures suitable for retries.
**Question 34.** In a streaming Dataflow job, which option determines how often the system emits early results for a window?
A) AllowedLateness
B) Triggering frequency via AfterProcessingTime.pastFirstElementInPane()
C) AccumulationMode.DISCARDING_FIRED_PANES
D) Watermark idle timeout
Answer: B
Explanation: AfterProcessingTime triggers early pane emission based on processing time after the first element.
**Question 35.** Which of the following best describes the “cold start” problem in serverless Dataflow pipelines?
A) Workers take longer to spin up when scaling from zero, increasing initial latency
B) Data is stored in a cold storage tier and must be retrieved
C) The pipeline cannot process events older than a certain age
D) The job fails if the first element arrives after a timeout
Answer: A
Explanation: Serverless workers may need time to provision, causing a temporary delay at the start of processing.
**Question 36.** When using BigQuery’s streaming API from Java, which field must be unique per row to guarantee idempotent inserts?
A) row_id
B) insertId
C) streaming_token
D) requestId
Answer: B
Explanation: BigQuery uses insertId to deduplicate streaming inserts. The deduplication is best-effort, so insertId reduces duplicates on retry rather than strictly guaranteeing exactly-once delivery.
**Question 37.** Which Cloud IAM principle is applied when granting a service account the role “roles/bigquery.jobUser” on a project?
A) Least privilege – only job-creation permissions are granted
B) Full administrative control over all datasets
C) Read-only access to all tables
D) Ability to modify IAM policies
Answer: A
Explanation: bigquery.jobUser allows creating and managing jobs but not direct data read/write, adhering to least privilege.
**Question 38.** In a Java MapReduce job on Dataproc, which class represents the output key type for a reducer that emits a composite key?
A) Text
B) LongWritable
C) ImmutableBytesWritable
D) custom WritableComparable implementation
Answer: D
Explanation: Composite keys require a custom WritableComparable to define serialization and sort order.
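
To make the Question 38 explanation concrete, here is a minimal sketch of a composite key that defines both its serialization and its sort order. The `write`, `readFields`, and `compareTo` methods mirror the shape of Hadoop's `WritableComparable` contract, but the class deliberately does not implement the real interface so that it compiles without Hadoop on the classpath. The class name `UserTimeKey`, its two fields, and the `roundTrip` helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

/** Composite (userId, timestamp) key; sorts by user first, then by time. */
final class UserTimeKey implements Comparable<UserTimeKey> {
    private String userId = "";
    private long timestampMillis;

    UserTimeKey() {} // frameworks that create keys reflectively need a no-arg constructor

    UserTimeKey(String userId, long timestampMillis) {
        this.userId = userId;
        this.timestampMillis = timestampMillis;
    }

    /** Serialization: field order here must match readFields exactly. */
    void write(DataOutput out) throws IOException {
        out.writeUTF(userId);
        out.writeLong(timestampMillis);
    }

    void readFields(DataInput in) throws IOException {
        userId = in.readUTF();
        timestampMillis = in.readLong();
    }

    /** Sort order: userId ascending, ties broken by timestamp ascending. */
    @Override
    public int compareTo(UserTimeKey other) {
        int byUser = userId.compareTo(other.userId);
        return byUser != 0 ? byUser : Long.compare(timestampMillis, other.timestampMillis);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof UserTimeKey)) return false;
        UserTimeKey k = (UserTimeKey) o;
        return userId.equals(k.userId) && timestampMillis == k.timestampMillis;
    }

    @Override
    public int hashCode() {
        return Objects.hash(userId, timestampMillis);
    }

    /** Helper for illustration: serialize a key to bytes and read it back. */
    static UserTimeKey roundTrip(UserTimeKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            UserTimeKey copy = new UserTimeKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real MapReduce job the class would declare `implements WritableComparable<UserTimeKey>` and keep `equals` and `hashCode` consistent with `compareTo`, since the default partitioner relies on `hashCode` to route keys to reducers.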

Explanation: Combine.globally with an IterableCombineFn aggregates all elements into a collection.
**Question 42.** Which of the following is a recommended way to avoid “straggler” workers in a Dataflow batch job?
A) Increase the number of workers dramatically
B) Use a higher parallelism factor in the pipeline design
C) Disable autoscaling
D) Set a fixed number of workers equal to the number of input files
Answer: B
Explanation: Designing transforms with higher parallelism distributes work evenly, reducing stragglers.
**Question 43.** When using Cloud Spanner with Java, which API method starts a read-only transaction that can be used for consistent reads across multiple tables?
A) DatabaseClient.singleUse()
B) DatabaseClient.readWriteTransaction()
C) SpannerOptions.getService()
D) DatabaseAdminClient.createDatabase()
Answer: A
Explanation: Of the options listed, singleUse() is the only read-only entry point; it returns a ReadContext intended for a single read. When several reads must share one consistent timestamp, DatabaseClient.readOnlyTransaction() is the better fit.
**Question 44.** Which of the following is NOT a valid Cloud Storage storage class?
A) Multi-Regional
B) Regional
C) Nearline
D) Hotline
Answer: D
Explanation: “Hotline” is not a defined storage class.
**Question 45.** In a Java Dataflow pipeline, which method is used to set a custom coder for a PCollection of a user-defined type?
A) .setCoder() on the PCollection
B) .withCoder() on the PipelineOptions
C) .apply(CoderRegistry.register())
D) .setDefaultCoder() on the DoFn
Answer: A
Explanation: PCollection.setCoder() assigns a specific Coder for serialization.
**Question 46.** Which Cloud service provides a managed, serverless environment for running Apache Beam pipelines without managing workers?
A) Cloud Dataproc
B) Cloud Dataflow
C) Cloud Composer
D) Cloud Run
Answer: B
Explanation: Cloud Dataflow is the fully managed service for Beam pipelines.
**Question 47.** When using the DLP API to inspect data for credit card numbers, which info type should be specified?
A) PHONE_NUMBER
B) EMAIL_ADDRESS
C) CREDIT_CARD_NUMBER

A) request.time < “17:00”
B) resource.name.startsWith(‘projects/…’)
C) iam:resource.name
D) None – IAM conditions cannot restrict by time of day
Answer: D
Explanation: None of the expressions in A to C is a valid time-of-day condition as written, which is why the key gives D. The rationale stated in option D is inaccurate, however: IAM Conditions do support date/time attributes, for example request.time.getHours("Europe/Berlin") < 17.
**Question 51.** Which of the following is a primary benefit of using column-level encryption in BigQuery?
A) Faster query execution
B) Reduced storage cost
C) Ability to hide sensitive columns from unauthorized users
D) Automatic data compression
Answer: C
Explanation: Column-level encryption protects specific columns, allowing fine-grained confidentiality.
**Question 52.** When designing a BigQuery table for event logs, which schema design reduces the need for joins when querying recent activity per user?
A) Store each event as a separate row with a user_id column
B) Use a repeated RECORD field that contains an array of events per user
C) Create a separate table for each day’s events
D) Store events as JSON strings in a single column
Answer: B
Explanation: A repeated RECORD embeds events per user, enabling queries without joins for recent activity.
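
The data shape behind the Question 52 answer can be pictured in plain Java: each user row carries its own list of events, so finding recent activity means filtering that embedded list rather than looking up a second table. This is only an in-memory analogy for a repeated RECORD field, not BigQuery client code; the names `UserActivity`, `Event`, and `eventsSince` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One "row" per user, with events embedded as a repeated nested record. */
final class UserActivity {

    /** The nested record: one element of the repeated field. */
    static final class Event {
        final String type;
        final long timestampMillis;

        Event(String type, long timestampMillis) {
            this.type = type;
            this.timestampMillis = timestampMillis;
        }
    }

    final String userId;
    final List<Event> events;

    UserActivity(String userId, List<Event> events) {
        this.userId = userId;
        this.events = Collections.unmodifiableList(new ArrayList<>(events));
    }

    /** Recent activity is a filter over the embedded list; no other table is consulted. */
    List<Event> eventsSince(long cutoffMillis) {
        List<Event> recent = new ArrayList<>();
        for (Event e : events) {
            if (e.timestampMillis >= cutoffMillis) {
                recent.add(e);
            }
        }
        return recent;
    }
}
```

The trade-off of this layout is that a row grows with every event for that user, so it suits bounded, per-user histories better than unbounded event streams.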

**Question 53.** In Apache Beam, which transform is used to convert a PCollection of KV pairs into a PCollection of Strings containing “key: value” pairs?
A) MapElements.via(new SimpleFunction<...>)
B) ParDo.of(new DoFn<KV<...>, String>())
C) Convert.toStrings()
D) Both A and B are valid
Answer: D
Explanation: Both MapElements with a SimpleFunction and ParDo with a DoFn can perform the conversion.
**Question 54.** Which Cloud service provides a unified data catalog and policy engine for governing data across BigQuery, Cloud Storage, and Dataproc?
A) Cloud Data Fusion
B) Cloud Dataplex
C) Cloud Composer
D) Cloud Asset Inventory
Answer: B
Explanation: Dataplex offers data cataloging, discovery, and governance across multiple services.
**Question 55.** When using the Java client library to list objects in a Cloud Storage bucket, which method returns an iterator that lazily fetches pages?
A) list()
B) listObjects()
C) listBlobs()
D) getObjects()
Answer: A
Explanation: In the Java client, Storage.list(bucketName) returns a Page of Blob objects, and iterateAll() on that page fetches further pages lazily. The original key gave C, but listBlobs() is not a method of the Java Storage client; the similarly named list_blobs() belongs to the Python client.
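
What "an iterator that lazily fetches pages" means in Question 55 can be sketched in plain Java: the iterator requests the next page only once the current one is exhausted. This is an illustration of the pattern, not the Cloud Storage client itself; the names `LazyPages`, `Page`, and the `fetch` function (a stand-in for the remote list call, taking a page token and returning one page) are hypothetical.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Plain-Java sketch of lazy, page-by-page iteration over a paginated listing. */
final class LazyPages {
    private LazyPages() {}

    /** One page of results plus the token for the next page (null when there is none). */
    static final class Page<T> {
        final List<T> items;
        final String nextToken;

        Page(List<T> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    /** Returns an Iterable whose iterator calls `fetch` only when it runs out of items. */
    static <T> Iterable<T> iterateAll(Function<String, Page<T>> fetch) {
        return () -> new Iterator<T>() {
            private Iterator<T> current = Collections.emptyIterator();
            private String token = null;     // null token requests the first page
            private boolean started = false; // distinguishes "not fetched yet" from "no more pages"

            @Override
            public boolean hasNext() {
                while (!current.hasNext()) {
                    if (started && token == null) {
                        return false; // last page consumed
                    }
                    Page<T> page = fetch.apply(token);
                    started = true;
                    token = page.nextToken;
                    current = page.items.iterator();
                }
                return true;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }
}
```

A caller that stops iterating early never triggers the remaining page requests, which is the practical benefit over loading the full listing up front.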