A comprehensive guide focused on cloud data engineering using Java technologies. It explores cloud architecture, distributed systems, data pipelines, and performance optimization while providing exam-style assessments and detailed solution explanations.
Typology: Exams

Question 1. Which Google Cloud storage option is best suited for storing large, immutable binary assets that are accessed infrequently?
A) Cloud SQL
B) Cloud Bigtable
C) Cloud Storage Nearline
D) Cloud Spanner
Answer: C
Explanation: Nearline is a low-cost tier of Cloud Storage designed for data that is accessed less than once a month, making it ideal for large, immutable assets.

Question 2. In Apache Beam (Java SDK), which transform is used to apply a user-defined function to each element of a PCollection?
A) ParDo
B) GroupByKey
C) Combine.perKey
D) Flatten
Answer: A
Explanation: ParDo applies a DoFn to each element, enabling element-wise processing.

Question 3. When designing a BigQuery table for time-series data, which feature helps improve query performance by limiting the amount of data scanned?
A) Clustering
B) Partitioning by ingestion time
C) Row-level security
D) Materialized views
Answer: B
Explanation: Partitioning by ingestion time (or a timestamp column) allows queries to scan only relevant partitions, reducing data scanned.

Question 4. Which of the following row-key design strategies helps avoid hotspotting in Cloud Bigtable?
A) Use monotonically increasing timestamps as the leading part of the key
B) Prefix the key with a salted hash of the primary identifier
C) Store the entire record in a single column family
D) Keep the row key length under 10 bytes
Answer: B
Explanation: Adding a salted hash distributes writes across multiple tablets, preventing hotspotting.

Question 5. In a disaster-recovery plan, what does RPO (Recovery Point Objective) define?
A) Maximum acceptable downtime
B) Maximum data loss measured in time
C) Time required to restore services
D) Frequency of backup verification
Answer: B
Explanation: RPO specifies the maximum age of data that can be recovered after a failure, i.e., acceptable data loss.

Question 6. Which Java library provides the most efficient way to read Avro files stored in Cloud Storage within a Dataflow pipeline?
A) Google Cloud Storage client library
B) Apache Avro Java API
Question 9. Which IAM role grants a service account permission to read objects from a Cloud Storage bucket and write to a BigQuery dataset?
A) roles/storage.objectAdmin
B) roles/bigquery.dataEditor
C) roles/editor
D) roles/bigquery.jobUser
Answer: C
Explanation: Of the options listed, only the basic (primitive) “Editor” role spans both Cloud Storage and BigQuery. It is much broader than the task requires; granting roles/storage.objectViewer together with roles/bigquery.dataEditor follows least privilege more closely.

Question 10. In Cloud Composer, which Airflow operator is used to trigger a Dataflow job written in Java?
A) BashOperator
B) DataflowPythonOperator
C) DataflowJavaOperator
D) SparkSubmitOperator
Answer: C
Explanation: DataflowJavaOperator launches a Dataflow job that runs a Java SDK pipeline (later releases of the Airflow Google provider name it DataflowCreateJavaJobOperator).

Question 11. Which of the following BigQuery features allows you to enforce row-level security based on the user’s email address?
A) Column-level encryption
B) Authorized views
C) Row-level security policies
D) Data masking policies
Answer: C
Explanation: Row-level security policies can reference the SESSION_USER() function in their filter expression to restrict rows by the querying user's email address.

Question 12. When serializing Java objects to Parquet for storage in Cloud Storage, which library provides the necessary schema inference?
A) Apache Parquet-MR
B) Google Cloud Storage client
C) Jackson CSV
D) Avro4s
Answer: A
Explanation: Parquet-MR includes tools for converting Java POJOs to Parquet schema and writing files.

Question 13. Which Dataflow feature helps you handle malformed records without failing the entire pipeline?
A) Side outputs to a dead-letter PCollection
B) Set the pipeline option “--failOnError” to false
C) Use a Try-Catch block inside a DoFn and rethrow the exception
D) Enable “strict mode” in the pipeline options
Answer: A
Explanation: Emitting malformed records to a side output (dead-letter) isolates errors while allowing the main pipeline to continue.

Question 14. In Cloud Spanner, which schema design pattern is recommended for representing many-to-many relationships while minimizing read latency?
A) Storing a JSON array in a column
B) Using interleaved tables
Question 17. Which watermark strategy is appropriate when source data may arrive out of order, up to 5 minutes late?
A) MonotonicallyIncreasingWatermarkEstimator
B) BoundedOutOfOrdernessTimestampExtractor with 5-minute bound
C) NoWatermarkPolicy
D) PeriodicWatermarkGenerator with 1-minute interval
Answer: B
Explanation: BoundedOutOfOrdernessTimestampExtractor allows a fixed lateness bound, so it handles delays of up to 5 minutes.

Question 18. In a Java-based Spark job on Dataproc, which API call persists an RDD as a Parquet file to Cloud Storage?
A) rdd.saveAsTextFile("gs://bucket/path")
B) rdd.saveAsObjectFile("gs://bucket/path")
C) rdd.write().parquet("gs://bucket/path")
D) rdd.toDF().write().format("parquet").save("gs://bucket/path")
Answer: D
Explanation: Converting the RDD to a DataFrame (toDF) and using DataFrameWriter with format “parquet” writes the data correctly. In the Java API the conversion is usually done with SparkSession.createDataFrame(...), because toDF() on an RDD is a Scala implicit.

Question 19. Which Cloud Monitoring metric would you alert on to detect a sudden increase in Dataflow job latency?
A) dataflow.googleapis.com/job/total_runtime
B) dataflow.googleapis.com/worker/element_count
C) dataflow.googleapis.com/job/latency
D) dataflow.googleapis.com/worker/cpu_utilization
Answer: C
Explanation: The latency metric reflects processing delay; a spike indicates potential bottlenecks.

Question 20. When configuring Customer-Managed Encryption Keys (CMEK) for BigQuery tables, which IAM role must be granted to the service account that loads data?
A) roles/cloudkms.cryptoKeyEncrypterDecrypter
B) roles/bigquery.dataViewer
C) roles/storage.objectCreator
D) roles/cloudkms.viewer
Answer: A
Explanation: The service account needs permission to encrypt/decrypt using the CMEK, provided by the CryptoKeyEncrypterDecrypter role.

Question 21. Which Java annotation is used to define a Beam schema for a POJO that will be automatically converted to a Row in a PCollection?
A) @DefaultCoder
B) @SchemaCreate
C) @JsonProperty
D) @AutoValue
Answer: B
Explanation: @SchemaCreate marks a constructor used by Beam’s schema inference to map POJOs to Rows.

Question 22. In BigQuery, what is the effect of enabling “require partition filter” on a partitioned table?
A) Queries must specify a filter on the partitioning column, otherwise they fail
B) The table automatically drops partitions older than 30 days
A) The subscription’s acknowledgment deadline expires too quickly
B) Processing latency spikes because a single worker handles most messages
C) Messages are delivered out of order across partitions
D) The topic exceeds its quota for retained messages
Answer: B
Explanation: A hot key concentrates load on one worker, causing latency and scaling issues.

Question 26. In a Java Beam pipeline, what is the purpose of a “Side Input”?
A) To provide additional data that is broadcast to all workers for each element
B) To split the main PCollection into multiple output streams
C) To store intermediate results in Cloud Storage automatically
D) To enforce ordering of elements within a window
Answer: A
Explanation: Side inputs supply auxiliary data (e.g., a lookup table) that is available to every DoFn instance.

Question 27. Which Cloud Storage storage class provides the lowest latency for frequently accessed data?
A) Archive
B) Nearline
C) Coldline
D) Standard
Answer: D
Explanation: Standard storage is intended for frequently accessed (“hot”) data and carries no retrieval fees or minimum storage duration.
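The hot-key effect described in the explanation above (one worker handling most messages) can be illustrated with a minimal plain-Java sketch. It assumes a simple hash partitioner; the class name HotKeyDemo, the key "user-42", and the "#" salt suffix are hypothetical choices for illustration and do not reproduce how Pub/Sub or Dataflow actually assign work.

```java
import java.util.Arrays;

public class HotKeyDemo {

    /** Picks a worker for a key the way a basic hash partitioner would (illustrative assumption). */
    public static int workerFor(String key, int workers) {
        return Math.floorMod(key.hashCode(), workers);
    }

    /** Counts messages per worker when every message carries the same (hot) key. */
    public static int[] loadWithoutSalt(String hotKey, int messages, int workers) {
        int[] load = new int[workers];
        for (int i = 0; i < messages; i++) {
            load[workerFor(hotKey, workers)]++;
        }
        return load;
    }

    /** Counts messages per worker when a rotating salt suffix is appended to the hot key. */
    public static int[] loadWithSalt(String hotKey, int messages, int workers, int salts) {
        int[] load = new int[workers];
        for (int i = 0; i < messages; i++) {
            String saltedKey = hotKey + "#" + (i % salts); // hypothetical salting scheme
            load[workerFor(saltedKey, workers)]++;
        }
        return load;
    }

    public static void main(String[] args) {
        // Without a salt, all 800 messages land on a single worker.
        System.out.println(Arrays.toString(loadWithoutSalt("user-42", 800, 4)));
        // With 8 salts the same messages spread across the 4 workers.
        System.out.println(Arrays.toString(loadWithSalt("user-42", 800, 4, 8)));
    }
}
```

The trade-off of salting is that a downstream step has to merge the per-salt partial results back into one result per original key.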
Question 28. When using BigQuery ML to train a linear regression model, which SQL clause specifies the target column?
A) SELECT
B) FROM
C) PREDICT
D) LABEL
Answer: D
Explanation: Of the options listed, LABEL refers to the label (target) column. In BigQuery ML itself, the target is specified with the input_label_cols option in the OPTIONS clause of CREATE MODEL, or by naming the column label.

Question 29. Which of the following is a recommended practice for designing a BigQuery schema that will be queried by many microservices with different access patterns?
A) Store all data in a single wide table and rely on column pruning
B) Create separate tables per microservice to avoid contention
C) Use nested and repeated fields to encapsulate related entities
D) Denormalize everything into flat tables only
Answer: C
Explanation: Nested/repeated fields allow hierarchical data modeling while preserving query efficiency.

Question 30. In Cloud Dataproc, which configuration parameter controls the number of executors for a Spark job submitted via Java?
A) spark.executor.instances
B) spark.driver.memory
C) dataproc:dataproc_cluster_config
D) spark.sql.shuffle.partitions
Answer: A
D) IllegalArgumentException
Answer: B
Explanation: StorageException includes HTTP status codes; 5xx indicates transient failures suitable for retries.

Question 34. In a streaming Dataflow job, which option determines how often the system emits early results for a window?
A) AllowedLateness
B) Triggering frequency via AfterProcessingTime.pastFirstElementInPane()
C) AccumulationMode.DISCARDING_FIRED_PANES
D) Watermark idle timeout
Answer: B
Explanation: AfterProcessingTime triggers early pane emission based on processing time after the first element.

Question 35. Which of the following best describes the “cold start” problem in serverless Dataflow pipelines?
A) Workers take longer to spin up when scaling from zero, increasing initial latency
B) Data is stored in a cold storage tier and must be retrieved
C) The pipeline cannot process events older than a certain age
D) The job fails if the first element arrives after a timeout
Answer: A
Explanation: Serverless workers may need time to provision, causing a temporary delay at the start of processing.

Question 36. When using BigQuery’s streaming API from Java, which field must be unique per row to guarantee idempotent inserts?
A) row_id
B) insertId
C) streaming_token
D) requestId
Answer: B
Explanation: insertId is used by BigQuery to deduplicate streaming inserts on a best-effort basis.

Question 37. Which Cloud IAM principle is applied when granting a service account the role “roles/bigquery.jobUser” on a project?
A) Least privilege – only job-creation permissions are granted
B) Full administrative control over all datasets
C) Read-only access to all tables
D) Ability to modify IAM policies
Answer: A
Explanation: bigquery.jobUser allows creating and managing jobs but not direct data read/write, adhering to least privilege.

Question 38. In a Java MapReduce job on Dataproc, which class represents the output key type for a reducer that emits a composite key?
A) Text
B) LongWritable
C) ImmutableBytesWritable
D) custom WritableComparable implementation
Answer: D
Explanation: Composite keys require a custom WritableComparable to define serialization and sort order.
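As a rough illustration of what such a composite key has to define, the sketch below uses only the Java standard library. It mirrors the three methods of Hadoop's WritableComparable contract (write(DataOutput), readFields(DataInput), compareTo) without depending on Hadoop; the class UserDayKey and its two fields are hypothetical, and a real job would declare `implements WritableComparable<UserDayKey>` instead of plain Comparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

public class CompositeKeyDemo {

    /** Hypothetical composite key: user id first, then day number. */
    public static final class UserDayKey implements Comparable<UserDayKey> {
        private String userId = "";
        private long day;

        public UserDayKey() {} // no-arg constructor, needed so the framework can call readFields

        public UserDayKey(String userId, long day) {
            this.userId = userId;
            this.day = day;
        }

        /** Serializes both fields in a fixed order. */
        public void write(DataOutput out) throws IOException {
            out.writeUTF(userId);
            out.writeLong(day);
        }

        /** Reads the fields back in the same order they were written. */
        public void readFields(DataInput in) throws IOException {
            userId = in.readUTF();
            day = in.readLong();
        }

        /** Sort order: by user id, then by day. */
        @Override
        public int compareTo(UserDayKey other) {
            int byUser = userId.compareTo(other.userId);
            return byUser != 0 ? byUser : Long.compare(day, other.day);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof UserDayKey)) return false;
            UserDayKey k = (UserDayKey) o;
            return userId.equals(k.userId) && day == k.day;
        }

        @Override
        public int hashCode() {
            return Objects.hash(userId, day);
        }
    }

    /** Writes a key to bytes and reads it back, as a shuffle would between map and reduce. */
    public static UserDayKey roundTrip(UserDayKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            UserDayKey copy = new UserDayKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        UserDayKey a = new UserDayKey("alice", 19000L);
        UserDayKey b = new UserDayKey("alice", 19001L);
        System.out.println(roundTrip(a).equals(a)); // true
        System.out.println(a.compareTo(b) < 0);     // true
    }
}
```

Keeping hashCode consistent with equals matters in practice because the default partitioner in MapReduce routes keys to reducers by hash.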
Explanation: Combine.globally with an IterableCombineFn aggregates all elements into a collection.

Question 42. Which of the following is a recommended way to avoid “straggler” workers in a Dataflow batch job?
A) Increase the number of workers dramatically
B) Use a higher parallelism factor in the pipeline design
C) Disable autoscaling
D) Set a fixed number of workers equal to the number of input files
Answer: B
Explanation: Designing transforms with higher parallelism distributes work evenly, reducing stragglers.

Question 43. When using Cloud Spanner with Java, which API method starts a read-only transaction that can be used for consistent reads across multiple tables?
A) DatabaseClient.singleUse()
B) DatabaseClient.readWriteTransaction()
C) SpannerOptions.getService()
D) DatabaseAdminClient.createDatabase()
Answer: A
Explanation: Of the options listed, only singleUse() returns a read-only context, but it serves a single read or query. For several reads at one consistent timestamp, the client also offers DatabaseClient.readOnlyTransaction().

Question 44. Which of the following is NOT a valid Cloud Storage storage class?
A) Multi-Regional
B) Regional
C) Nearline
D) Hotline
Answer: D
Explanation: “Hotline” is not a defined storage class.

Question 45. In a Java Dataflow pipeline, which method is used to set a custom coder for a PCollection of a user-defined type?
A) .setCoder() on the PCollection
B) .withCoder() on the PipelineOptions
C) .apply(CoderRegistry.register())
D) .setDefaultCoder() on the DoFn
Answer: A
Explanation: PCollection.setCoder() assigns a specific Coder for serialization.

Question 46. Which Cloud service provides a managed, serverless environment for running Apache Beam pipelines without managing workers?
A) Cloud Dataproc
B) Cloud Dataflow
C) Cloud Composer
D) Cloud Run
Answer: B
Explanation: Cloud Dataflow is the fully managed service for Beam pipelines.

Question 47. When using the DLP API to inspect data for credit card numbers, which info type should be specified?
A) PHONE_NUMBER
B) EMAIL_ADDRESS
C) CREDIT_CARD_NUMBER
A) request.time < “17:00”
B) resource.name.startsWith(‘projects/…’)
C) iam:resource.name
D) None – IAM conditions cannot restrict by time of day
Answer: D
Explanation: None of the listed expressions is a valid time-of-day condition. IAM Conditions can restrict access by date and time, but they do so through functions on the request.time attribute, such as request.time.getHours(), rather than by comparing against a string like “17:00”.

Question 51. Which of the following is a primary benefit of using column-level encryption in BigQuery?
A) Faster query execution
B) Reduced storage cost
C) Ability to hide sensitive columns from unauthorized users
D) Automatic data compression
Answer: C
Explanation: Column-level encryption protects specific columns, allowing fine-grained confidentiality.

Question 52. When designing a BigQuery table for event logs, which schema design reduces the need for joins when querying recent activity per user?
A) Store each event as a separate row with a user_id column
B) Use a repeated RECORD field that contains an array of events per user
C) Create a separate table for each day’s events
D) Store events as JSON strings in a single column
Answer: B
Explanation: A repeated RECORD embeds events per user, enabling queries without joins for recent activity.
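The shape difference behind the repeated-RECORD answer above can be sketched in plain Java as an in-memory analogy. This is not BigQuery code: the Event class, its fields, and the helper names are all hypothetical. Flat rows are regrouped into one entry per user holding an ordered list of events, so "recent activity for user X" becomes a lookup on a single nested entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NestedEventsDemo {

    /** Hypothetical flat event row: one row per event, keyed by user id. */
    public static final class Event {
        final String userId;
        final String type;
        final long timestamp;

        public Event(String userId, String type, long timestamp) {
            this.userId = userId;
            this.type = type;
            this.timestamp = timestamp;
        }
    }

    /** Regroups flat rows into one entry per user with a newest-first event list. */
    public static Map<String, List<Event>> nestByUser(List<Event> flatRows) {
        Map<String, List<Event>> byUser = new TreeMap<>();
        for (Event e : flatRows) {
            byUser.computeIfAbsent(e.userId, k -> new ArrayList<>()).add(e);
        }
        for (List<Event> events : byUser.values()) {
            events.sort(Comparator.comparingLong((Event e) -> e.timestamp).reversed());
        }
        return byUser;
    }

    /** Returns the types of a user's most recent events by reading one nested entry. */
    public static List<String> recentTypes(Map<String, List<Event>> nested, String userId, int limit) {
        List<String> types = new ArrayList<>();
        for (Event e : nested.getOrDefault(userId, Collections.emptyList())) {
            if (types.size() == limit) break;
            types.add(e.type);
        }
        return types;
    }

    public static void main(String[] args) {
        List<Event> flat = Arrays.asList(
                new Event("u1", "login", 100L),
                new Event("u2", "view", 150L),
                new Event("u1", "purchase", 300L),
                new Event("u1", "view", 200L));
        System.out.println(recentTypes(nestByUser(flat), "u1", 2)); // [purchase, view]
    }
}
```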
Question 53. In Apache Beam, which transform is used to convert a PCollection of KV elements into a PCollection of Strings containing “key: value” pairs?
A) MapElements.via(new SimpleFunction<...>)
B) ParDo.of(new DoFn<KV<...>, String>())
C) Convert.toStrings()
D) Both A and B are valid
Answer: D
Explanation: Both MapElements with a SimpleFunction and ParDo with a DoFn can perform the conversion.

Question 54. Which Cloud service provides a unified data catalog and policy engine for governing data across BigQuery, Cloud Storage, and Dataproc?
A) Cloud Data Fusion
B) Cloud Dataplex
C) Cloud Composer
D) Cloud Asset Inventory
Answer: B
Explanation: Dataplex offers data cataloging, discovery, and governance across multiple services.

Question 55. When using the Java client library to list objects in a Cloud Storage bucket, which method returns an iterator that lazily fetches pages?
A) list()
B) listObjects()
C) listBlobs()
D) getObjects()
Answer: C