Cloud Data Engineer Python Complete Exam Preparation and Study Guide, Exams of Technology

Technology

A comprehensive Python-focused cloud data engineering resource covering automation, data pipelines, analytics frameworks, and certification-level exam preparation content.

Typology: Exams

2025/2026

Available from 02/22/2026

shilpi-jain-3 🇮🇳


Cloud Data Engineer Python Complete Exam Preparation and Study Guide

**Question 1.** Which Python standard-library function is most appropriate for reading a JSON Lines file line-by-line without loading the entire file into memory?

A) json.load()

B) json.loads()

C) json.dump()

D) json.load(fp) inside a for-loop iterator

Answer: B

Explanation: Iterating over the file object (e.g., `for line in fp:`) reads one line at a time, and `json.loads(line)` parses each line as its own JSON document, so a large file never has to be held in memory at once. `json.load(fp)` always consumes the whole stream, so calling it inside the loop does not work.
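One common way to process a JSON Lines file one record at a time is to iterate over the file object and parse each line with `json.loads`. The following is a minimal sketch, assuming the file holds one JSON document per line; `iter_json_lines` is a hypothetical helper name, not part of the standard library.

```python
import json
from typing import Any, Iterator


def iter_json_lines(path: str) -> Iterator[Any]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:            # the file object yields one line at a time
            line = line.strip()
            if line:               # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, a caller such as `for record in iter_json_lines("events.jsonl"):` holds only one record in memory at a time (the file name here is made up).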

**Question 2.** In a Pandas DataFrame, which method can return a view that shares the same underlying data with the original frame, thus using minimal additional memory?

A) copy()

B) loc[]

C) iloc[]

D) view()

Answer: B

Explanation: `.loc[]` (and `.iloc[]`) can return a view that shares memory with the original DataFrame when the selection is a simple slice, but this is not guaranteed, and other selections return copies. `.copy()` creates a deep copy by default, and DataFrame has no `view()` method.

**Question 3.** Which of the following statements about Python generators is

FALSE?

A) They are created using a function with the `yield` keyword.

B) They can be iterated only once.

C) They store all generated items in a list internally.


D) They are memory-efficient for large sequences.

Answer: C

Explanation: Generators do not store items; they produce each value on demand, making them memory-efficient.

**Question 4.** In object-oriented Python, which magic method is invoked when the built-in `len()` function is called on an instance of a custom class?

A) `__size__`

B) `__len__`

C) `__length__`

D) `__count__`

Answer: B

Explanation: The `__len__` special method defines the behavior of `len(obj)`.

**Question 5.** Which decorator from the functools module can be used to cache the results of a pure function to improve performance?

A) @lru_cache

B) @cache_result

C) @memoize

D) @cached_property

Answer: A

Explanation: `@functools.lru_cache` caches return values keyed by the call arguments, speeding up repeated calls to pure functions.

**Question 6.** When using the requests library to call a paginated REST endpoint that uses a next URL in the JSON response, which pattern correctly iterates through all pages?
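The `next`-URL pagination that Question 6 describes can be sketched as a loop that keeps requesting pages until the response no longer carries a `next` link. This is an illustration under stated assumptions, not the guide's answer: the `items` key and the `iter_paginated` and `fetch_json` names are hypothetical, and the HTTP call is passed in as a function so the loop can be run without a network.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_paginated(first_url: str,
                   fetch_json: Callable[[str], Dict[str, Any]]) -> Iterator[Any]:
    """Follow `next` links until the API stops returning one."""
    url: Optional[str] = first_url
    while url:
        page = fetch_json(url)
        # "items" is an assumed key for the records on each page
        yield from page.get("items", [])
        # a missing or null "next" value ends the loop
        url = page.get("next")
```

With the requests library, `fetch_json` could be something like `lambda u: requests.get(u, timeout=10).json()`.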


Explanation: Apache Parquet is a columnar format optimized for analytics and is supported by Pandas (`read_parquet`) and Spark.

**Question 9.** In SQLAlchemy, what is the purpose of the `session.flush()` method?

A) Commit the current transaction to the database.

B) Write pending changes to the database without committing.

C) Clear the session’s identity map.

D) Refresh objects from the database.

Answer: B

Explanation: `flush()` sends pending ORM changes to the database but does not commit, allowing further operations before a final commit.

**Question 10.** Which PySpark transformation is narrow and therefore does not trigger a shuffle?

A) groupByKey()

B) join()

C) mapPartitions()

D) repartition()

Answer: C

Explanation: `mapPartitions()` operates on each partition independently; it does not require data movement across partitions.

**Question 11.** In Airflow, which argument of the PythonOperator determines the function that will be executed when the task runs?

A) python_callable

B) task_function

C) callable

D) execute_fn

Answer: A

Explanation: `python_callable` receives a reference to the Python function to be called.

**Question 12.** Which Airflow feature allows you to pass small pieces of data between tasks without using external storage?

A) Variables

B) XComs

C) Connections

D) Hooks

Answer: B

Explanation: XComs (cross-communications) enable tasks to push and pull small data objects.

**Question 13.** In Google Cloud Storage, which IAM role provides read-only access to objects in a bucket?

A) roles/storage.objectAdmin

B) roles/storage.objectCreator

C) roles/storage.objectViewer

D) roles/storage.legacyBucketReader

Answer: C

Explanation: `roles/storage.objectViewer` grants permission to view (download) objects without modification rights.

**Question 14.** When encrypting data at rest in AWS S3 using a KMS key, which header must be set on a PUT request to enable server-side encryption?


**Question 17.** In a PySpark DataFrame, which function would you use to add a column that contains the rolling average of a numeric column over a window of 5 rows?

A) df.withColumn('avg', avg('col').over(Window.rowsBetween(-4, 0)))

B) df.withColumn('avg', F.mean('col').over(Window.orderBy('id').rowsBetween(-4, 0)))

C) df.withColumn('avg', F.avg('col').over(Window.partitionBy().rowsBetween(-4, 0)))

D) df.withColumn('avg', F.avg('col').over(Window.orderBy('id').rangeBetween(-4, 0)))

Answer: B

Explanation: `F.mean` (or `F.avg`) with a window ordered by a column and `rowsBetween(-4, 0)` computes a rolling average of the current and previous four rows.

**Question 18.** Which Airflow parameter controls how many times a task will be retried after failure before being marked as failed?

A) retry_delay

B) max_retry_attempts

C) retries

D) retry_exponential_backoff

Answer: C

Explanation: The `retries` argument specifies the maximum number of retry attempts.

**Question 19.** In a GCP BigQuery ELT workflow, which of the following is the most appropriate place to perform data type casting from string to TIMESTAMP?

A) In the Python ingestion script before loading to BigQuery.

B) Using a Cloud Dataflow pipeline after loading.

C) Inside a BigQuery SELECT statement after the raw table is loaded.

D) During the Cloud Storage upload process.

Answer: C

Explanation: ELT loads raw data first; transformations like casting are best performed in-warehouse using SQL queries for scalability.

**Question 20.** Which Python library provides a built-in way to define data schemas and validate JSON payloads against them?

A) jsonschema

B) marshmallow

C) pydantic

D) schema

Answer: C

Explanation: pydantic uses Python type hints to define models and validates incoming data, including JSON, against them automatically.

**Question 21.** Which Pandas data type consumes the least memory for a column containing a limited set of repeated string values?

A) object

B) string

C) category

D) datetime

Answer: C

Explanation: `category` stores a dictionary of unique values and integer codes, drastically reducing memory for low-cardinality strings.
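The memory claim in Question 21 can be checked directly with `Series.memory_usage(deep=True)`. The sketch below uses made-up data (three region names repeated 10,000 times); exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# 30,000 rows drawn from only three distinct strings (made-up data)
values = ["north", "south", "east"] * 10_000

as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving shrinks as the number of distinct values approaches the number of rows, which is why the question specifies a limited set of repeated values.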


Explanation: `connection.commit()` finalizes the transaction, persisting changes.

**Question 25.** Which of the following is a best practice for handling secrets (API keys, passwords) in Airflow DAGs?

A) Hard-code them in the Python file.

B) Store them in plain-text environment variables on the worker.

C) Use Airflow Connections with encrypted fields and retrieve via BaseHook.get_connection.

D) Include them in the DAG’s default_args dictionary.

Answer: C

Explanation: Airflow Connections store credentials securely and can be accessed programmatically, adhering to least-privilege principles.

**Question 26.** In Spark SQL, which command registers a DataFrame as a temporary view that can be queried with SQL?

A) df.createOrReplaceTempView('view_name')

B) spark.registerTempTable('view_name')

C) df.sql('CREATE VIEW view_name AS ...')

D) spark.sql('REGISTER VIEW view_name')

Answer: A

Explanation: `createOrReplaceTempView` makes the DataFrame accessible via SQL statements within the session.

**Question 27.** Which of the following Python constructs is most suitable for implementing a retry mechanism with exponential backoff when calling an external API?

A) while True: try...except...break

B) for i in range(5): time.sleep(2**i)

C) retrying library’s @retry decorator with wait_exponential_multiplier

D) recursion with a base case

Answer: C

Explanation: The retrying (or tenacity) library provides built-in exponential backoff handling via decorators.

**Question 28.** In the context of data lake design on AWS, which storage class is optimized for infrequently accessed data but still offers rapid retrieval?

A) S3 Standard

B) S3 Intelligent-Tiering

C) S3 Glacier Instant Retrieval

D) S3 One Zone-IA

Answer: C

Explanation: S3 Glacier Instant Retrieval provides low-cost storage for infrequently accessed data with millisecond retrieval time.

**Question 29.** Which Pandas method would you use to reshape a DataFrame from long to wide format, creating separate columns for each unique value of a categorical variable?

A) pivot()

B) melt()

C) stack()

D) unstack()

Answer: A

Explanation: `pivot()` (or `pivot_table`) transforms long-format data into a wide format based on column values.
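Returning to the exponential backoff in Question 27: the libraries named there wrap a simple idea, which is to double the wait after each failed attempt. The sketch below shows that idea with the standard library only. It is not the retrying or tenacity API; `call_with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the delays can be observed without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(func: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on any exception with delays of base_delay * 2**n."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)   # 1x, 2x, 4x, ... the base delay
    raise ValueError("max_attempts must be at least 1")
```

A production version would usually catch only the exceptions that are worth retrying and add random jitter to the delay, which is part of what the dedicated libraries provide.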


Explanation: `csv.reader` yields rows lazily, avoiding loading the entire file into memory.

**Question 33.** Which Airflow feature enables you to define a DAG that runs only when a specific external trigger (e.g., a file arrival) occurs?

A) schedule_interval='@once'

B) TriggerRule=ALL_SUCCESS

C) ExternalTaskSensor

D) DAG.run_on_trigger=True

Answer: C

Explanation: Sensors hold downstream tasks until an external condition is met: ExternalTaskSensor waits for a task in another DAG to finish, and FileSensor waits for a file to appear.

**Question 34.** In a PySpark job, what does the `repartition()` transformation do?

A) Increases the number of partitions by shuffling data across the cluster.

B) Decreases the number of partitions without a shuffle.

C) Persists the DataFrame to disk.

D) Filters rows based on a condition.

Answer: A

Explanation: `repartition()` triggers a full shuffle to create the specified number of partitions, balancing data distribution.

**Question 35.** Which Python testing framework provides a fixture called `tmp_path` that supplies a temporary directory for file-system tests?

A) unittest

B) nose

C) pytest

D) doctest

Answer: C

Explanation: pytest includes the `tmp_path` fixture for creating temporary, automatically cleaned-up directories.

**Question 36.** When using Google Cloud Functions to process streaming data, which authentication method is recommended for the function to access BigQuery securely?

A) Embedding a service-account key JSON in the source code.

B) Using the default service account attached to the Cloud Function with least-privilege IAM roles.

C) Passing the API key as an environment variable.

D) Using OAuth2 client-side flow.

Answer: B

Explanation: The function runs as its attached service account and inherits that account's IAM permissions; assigning only the roles it needs follows the principle of least privilege.

**Question 37.** In Pandas, which method would you use to detect and remove duplicate rows based on all columns?

A) df.drop_duplicates()

B) df.unique()

C) df.remove_duplicates()

D) df.drop_duplicates(inplace=False, keep='first')

Answer: A

Explanation: `drop_duplicates()` removes duplicate rows; by default it considers all columns.


Answer: A

Explanation: `pd.to_datetime` parses strings into pandas datetime objects, handling various formats.

**Question 41.** In PySpark, which function is used to explode an array column into multiple rows?

A) split()

B) flatten()

C) explode()

D) unnest()

Answer: C

Explanation: `explode()` creates a new row for each element in the array column.

**Question 42.** Which Airflow parameter allows you to specify a maximum runtime for a task, after which the task is marked as failed?

A) execution_timeout

B) timeout_seconds

C) max_execution_time

D) run_duration_limit

Answer: A

Explanation: `execution_timeout` (a `datetime.timedelta`) sets the allowed duration for a task.

**Question 43.** When using boto3 to list objects in an S3 bucket with pagination, which helper class simplifies the process?

A) Paginator

B) PageIterator

C) S3Paginator

D) boto3.resource('s3').objects.filter()

Answer: A

Explanation: boto3 provides a Paginator object (e.g., `client.get_paginator('list_objects_v2')`) to handle pagination automatically.

**Question 44.** Which of the following is NOT a valid way to define a composite primary key in SQLAlchemy's declarative model?

A) `__table_args__ = (PrimaryKeyConstraint('col1', 'col2'),)`

B) primary_key=True on both columns individually

C) use UniqueConstraint instead of PrimaryKeyConstraint

D) Define a CompositePrimaryKey mixin class

Answer: C

Explanation: UniqueConstraint enforces uniqueness but does not create a primary key. Options A and B are the standard ways to declare a composite key; a mixin class (option D) works only if it declares the key columns in one of those ways.

**Question 45.** In a GCP Dataflow (Apache Beam) pipeline written in Python, which transform is used to group elements by a key before applying a reduction?

A) GroupByKey()

B) CombinePerKey()

C) CoGroupByKey()

D) ParDo() with key extraction

Answer: A

Explanation: GroupByKey groups values under each key, enabling downstream reductions.

**Question 46.** Which Pandas accessor provides datetime-specific properties (e.g., `.dt.year`) for a Series of timestamps?


**Question 49.** In AWS IAM, which policy element specifies the actions that are allowed or denied?

A) Resource

B) Effect

C) Action

D) Condition

Answer: C

Explanation: The Action field lists the API operations the statement applies to.

**Question 50.** Which Pandas function can be used to compute a cumulative sum over a column while preserving the original index?

A) df.cumsum()

B) df['col'].cumsum()

C) df.accumulate('col')

D) df['col'].cumprod()

Answer: B

Explanation: `Series.cumsum()` returns the cumulative sum for that column, keeping the index unchanged.

**Question 51.** In PySpark, which storage level is most appropriate for persisting a DataFrame that will be reused many times and fits entirely in memory?

A) MEMORY_ONLY

B) DISK_ONLY

C) MEMORY_AND_DISK_SER

D) OFF_HEAP

Answer: A

Explanation: MEMORY_ONLY keeps the DataFrame in RAM with no spill to disk, offering the fastest access when memory is sufficient.

**Question 52.** Which Airflow component is specifically designed to interact with Google Cloud Storage?

A) S3Hook

B) GCSToBigQueryOperator

C) GCSHook

D) GoogleCloudStorageCreateBucketOperator

Answer: C

Explanation: GCSHook is a hook rather than an operator; it provides low-level interactions with GCS, and higher-level operators (e.g., GCSToBigQueryOperator) build on it.

**Question 53.** When using `pandas.read_csv()` on a dataset with a column of mixed integer and string types, which parameter can help avoid type-inference errors?

A) engine='pyarrow'

B) dtype={'col': 'object'}

C) convert_dates=False

D) use_nullable_dtypes=True

Answer: B

Explanation: Explicitly setting `dtype` forces the column to be read as an object, preventing inference failures. `read_parquet()` takes its column types from the file's schema and has no `dtype` parameter.

**Question 54.** Which of the following is a recommended practice for reducing the cost of Spark jobs on a cloud platform?

A) Increase the number of executors regardless of data size.

B) Use spot/preemptible instances for executors when fault tolerance is acceptable.