




























































































A regionally focused cloud data engineering guide emphasizing Python development, data pipelines, and cloud analytics practices. It includes hands-on examples, exam practice questions, and localized learning insights to support certification success.
Typology: Exams

Question 1. Which Python library is most appropriate for making asynchronous HTTP requests to a REST API for high-throughput data ingestion?
A) requests
B) urllib
C) aiohttp
D) http.client
Answer: C
Explanation: aiohttp provides an async API that allows many concurrent requests without blocking, which suits high-throughput ingestion.

Question 2. In boto3, which method is used to upload a large file to Amazon S3 using multipart upload to improve reliability?
A) put_object()
B) upload_fileobj()
C) multipart_upload()
D) upload_file()
Answer: D
Explanation: upload_file() automatically switches to multipart upload for large files and handles part sizing and retries. (upload_fileobj() uses the same managed transfer for file-like objects; upload_file() is the variant that takes a file path.)

Question 3. When using Google Cloud Storage Python client, which parameter specifies the destination bucket and object name in a single call?
A) bucket_name
B) destination_blob_name
C) source_file_name
D) blob_name
Answer: B
Explanation: In Google's upload samples, destination_blob_name is the object's full path (including the object name) within the target bucket; the bucket itself is named separately via bucket_name.

Question 4. Which Azure SDK class is used to interact with Azure Blob Storage in Python?
A) BlobServiceClient
B) AzureBlobClient
C) StorageBlobClient
D) BlobContainerClient
Answer: A
Explanation: BlobServiceClient is the entry point for managing containers and blobs in Azure Storage.

Question 5. In a Kafka producer written in Python, what does the linger_ms configuration control? (kafka-python spells the setting linger_ms; confluent-kafka uses the librdkafka key linger.ms.)
A) Maximum size of a batch before sending
B) Time to wait for additional records before sending a batch
C) Number of retries on failure
D) Compression algorithm
Answer: B
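
The non-blocking concurrency described in Question 1 can be sketched with the standard library alone. This is a minimal illustration, not aiohttp code: fetch_record is a hypothetical stand-in for an HTTP request (it only sleeps), and the example.com URLs and the limit of 5 concurrent calls are assumptions chosen for the demo.

```python
import asyncio


async def fetch_record(url: str) -> dict:
    # Hypothetical stand-in for an aiohttp GET; the sleep simulates network latency.
    await asyncio.sleep(0.05)
    return {"url": url, "status": 200}


async def ingest(urls: list[str], max_concurrency: int = 5) -> list[dict]:
    # The semaphore caps the number of in-flight requests so the API is not flooded.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> dict:
        async with semaphore:
            return await fetch_record(url)

    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


urls = [f"https://api.example.com/items/{i}" for i in range(10)]
results = asyncio.run(ingest(urls))
```

With a real client, the body of fetch_record would be replaced by the library's request call; the semaphore-plus-gather structure would likely stay the same.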
A) spark.sql.shuffle.partitions
B) spark.default.parallelism
C) spark.sql.autoBroadcastJoinThreshold
D) spark.executor.instances
Answer: A
Explanation: spark.sql.shuffle.partitions sets how many partitions are created when data is shuffled for joins and aggregations (the default is 200).

Question 9. In a serverless AWS Lambda function written in Python, which environment variable provides the name of the invoked Lambda function?
A) AWS_LAMBDA_FUNCTION_NAME
B) LAMBDA_FUNCTION_NAME
C) AWS_FUNCTION_NAME
D) FUNCTION_NAME
Answer: A
Explanation: AWS_LAMBDA_FUNCTION_NAME is set automatically by the Lambda runtime.

Question 10. Which of the following is a correct way to register a Python UDF in Google BigQuery using the google-cloud-bigquery library?
A) client.register_udf('my_udf', my_function)
B) client.create_udf('my_udf', my_function)
C) client.query('CREATE TEMP FUNCTION my_udf(x INT64) AS ( ... )')
D) client.create_routine('my_udf', routine_type='SCALAR', language='PYTHON', arguments=..., body=...)
Answer: D
Explanation: BigQuery supports Python UDFs via the Routine API; create_routine defines a SCALAR routine with Python code. Note that in the client library create_routine takes a bigquery.Routine object rather than the keyword arguments shown in option D.

Question 11. Which file format combines columnar storage with ACID transaction support and is native to Delta Lake?
A) Parquet
B) ORC
C) Delta
D) Avro
Answer: C
Explanation: Delta Lake uses the Delta format, built on Parquet files with a transaction log that provides ACID properties.

Question 12. When performing Change Data Capture (CDC) from MySQL to a cloud data lake using Python, which MySQL feature provides a binary log of changes?
A) General Log
B) Slow Query Log
C) Binary Log (binlog)
D) Error Log
Answer: C
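
As a small illustration of Question 9, the sketch below reads AWS_LAMBDA_FUNCTION_NAME from the environment. The handler, the "records" event key, and the "local-run" fallback are hypothetical names introduced for the example; only the environment variable itself comes from the question.

```python
import os


def current_function_name(default: str = "local-run") -> str:
    # Lambda sets AWS_LAMBDA_FUNCTION_NAME; outside Lambda we fall back to a default.
    return os.environ.get("AWS_LAMBDA_FUNCTION_NAME", default)


def handler(event, context):
    # Hypothetical handler that tags its response with the function name.
    return {
        "function": current_function_name(),
        "records": len(event.get("records", [])),
    }
```

The fallback makes the same code usable in local tests, where the variable is normally absent.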
A) BashOperator
B) PythonOperator
C) SparkSubmitOperator
D) HttpSensor
Answer: B
Explanation: PythonOperator runs a Python function as a task.

Question 16. Which Airflow sensor is designed to wait for a file to appear in an S3 bucket?
A) S3KeySensor
B) S3FileSensor
C) S3ObjectSensor
D) S3PathSensor
Answer: A
Explanation: S3KeySensor checks for the existence of a key (file) in S3.

Question 17. In Terraform, which provider block is required to manage AWS resources via Python scripts executed as local-exec?
A) provider "aws" {}
B) provider "python" {}
C) provider "local" {}
D) provider "aws_lambda" {}
Answer: A
Explanation: The aws provider block configures the credentials and region used for AWS resources; a local-exec provisioner can then invoke Python scripts.

Question 18. Which Python package can generate realistic fake personal data for data masking purposes?
A) faker
B) mockaroo
C) data-faker
D) random-person
Answer: A
Explanation: faker creates synthetic personal data (names, addresses, etc.) useful for masking.

Question 19. When encrypting data client-side before uploading to Azure Blob Storage, which Azure service provides the key management?
A) Azure Key Vault
B) Azure Storage Encryption
C) Azure Secrets Manager
D) Azure AD
Answer: A
Explanation: Azure Key Vault stores and manages encryption keys used for client-side encryption.
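
The masking idea behind Question 18 can be illustrated without any third-party package. The sketch below is not faker and does not use its API; it is a dependency-free stand-in that maps each real name to a stable fake one. The name lists, the salt value, and the mask_name helper are all assumptions made for the example.

```python
import hashlib
import random

# Tiny illustrative vocabularies; a real masking job would use far larger ones.
FIRST_NAMES = ["Alex", "Jordan", "Sam", "Taylor"]
LAST_NAMES = ["Rivera", "Chen", "Okafor", "Novak"]


def mask_name(real_name: str, salt: str = "demo-salt") -> str:
    # Seed a private RNG from a salted hash so the same input always
    # maps to the same fake name, which keeps joins across tables intact.
    digest = hashlib.sha256((salt + real_name).encode("utf-8")).digest()
    rng = random.Random(digest)
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


masked = mask_name("Ada Lovelace")
```

Deterministic output is a design choice here: it trades some privacy strength for referential consistency, so the salt should be kept secret in any real use.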
D) Couchbase
Answer: C
Explanation: pymongo is the official MongoDB driver for Python.

Question 23. When using DynamoDB with Python (boto3), which method performs a conditional write that succeeds only if an attribute does not already exist?
A) put_item(ConditionExpression=…)
B) update_item(ConditionExpression=…)
C) insert_item(…)
D) write_item(…)
Answer: A
Explanation: put_item with a ConditionExpression (using the attribute_not_exists function) enforces that an attribute is absent before the item is written.

Question 24. Which Python library enables vector similarity search against a Pinecone index?
A) pinecone-client
B) pinecone-sdk
C) pinecone-py
D) pinecone-vector
Answer: A
Explanation: pinecone-client is the official Python client for interacting with Pinecone vector databases.
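
To make "vector similarity search" in Question 24 concrete, here is a brute-force sketch of what such a query computes conceptually. It does not use the Pinecone client or reflect how Pinecone indexes data; the in-memory index, the document ids, and the top_k helper are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2):
    # Score every stored vector against the query and keep the k best matches.
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


index = {"doc-a": [1.0, 0.0], "doc-b": [0.7, 0.7], "doc-c": [0.0, 1.0]}
matches = top_k([1.0, 0.1], index, k=2)
```

A managed vector database replaces the linear scan with an approximate nearest-neighbour index, but the ranking criterion is the same kind of similarity score.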
Question 25. In a data lake on S3, which naming convention best supports Hive partition pruning?
A) /year=2023/month=07/day=15/…
B) /2023/07/15/…
C) /data_20230715_…
D) /partition/…
Answer: A
Explanation: Using key=value pairs (year=…, month=…) matches Hive’s partitioning scheme, enabling automatic pruning.

Question 26. Which of the following file formats provides built-in schema evolution and is optimized for streaming writes?
A) Parquet
B) Avro
C) ORC
D) JSON
Answer: B
Explanation: Avro supports schema evolution and is designed for efficient writes, especially in streaming pipelines.

Question 27. In CloudWatch Logs, which Python SDK method retrieves log events for a given log stream?
A) get_log_events()
Answer: B
Explanation: MEMORY_AND_DISK stores partitions in memory and writes overflow to disk, reducing recomputation.

Question 30. Which Python decorator is used to cache the result of a function call in memory for the duration of the program?
A) @lru_cache
B) @cache
C) @memoize
D) @cached_property
Answer: A
Explanation: @lru_cache from functools caches function results keyed by the call arguments. functools.cache (Python 3.9+) is the unbounded equivalent of lru_cache(maxsize=None).

Question 31. In Azure Data Factory, which Python activity allows you to run custom Python code on an Azure Batch pool?
A) AzureFunctionActivity
B) DatabricksNotebookActivity
C) CustomActivity
D) PythonScriptActivity
Answer: C
Explanation: CustomActivity can execute arbitrary scripts, including Python, on Azure Batch.
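
The caching behaviour from Question 30 can be seen in a few lines. In this sketch, lookup_region and its bucket-prefix rule are hypothetical; the counter exists only to show how often the function body actually runs.

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=128)
def lookup_region(bucket: str) -> str:
    # Hypothetical expensive lookup; the counter records real executions.
    global call_count
    call_count += 1
    return "eu-west-1" if bucket.startswith("eu-") else "us-east-1"


first = lookup_region("eu-landing")
second = lookup_region("eu-landing")  # same argument, so this is served from the cache
info = lookup_region.cache_info()  # named tuple with hits, misses, maxsize, currsize
```

Because the cache is keyed by the arguments, they must be hashable; passing a list or dict would raise a TypeError.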
Question 32. Which GCP service offers a managed vector database that can be accessed via the google-cloud-aiplatform Python client?
A) Vertex AI Matching Engine
B) BigQuery Vector Search
C) Cloud MemoryStore
D) Cloud Spanner
Answer: A
Explanation: Vertex AI Matching Engine provides vector similarity search with a Python client.

Question 33. When using pandas.read_json() with lines=True, what format is expected?
A) A single JSON object
B) An array of JSON objects
C) NDJSON (newline-delimited JSON)
D) JSON with comments
Answer: C
Explanation: lines=True tells pandas to parse each line as a separate JSON record (NDJSON).

Question 34. Which Python package provides a high-level API for building data pipelines that can run on multiple execution engines (e.g., Spark, Dask, Pandas)?
A) luigi
B) prefect
Explanation: hashlib provides SHA-256 and other hash functions for binary data.

Question 37. In Airflow, what does setting retries=3 and retry_delay=timedelta(minutes=5) on a task accomplish?
A) The task will run three times in parallel with a 5-minute gap
B) After a failure, the task will be retried up to three times, waiting 5 minutes between attempts
C) The task will be skipped after three failures
D) The task will delay its first run by 5 minutes
Answer: B
Explanation: retries defines the maximum retry attempts; retry_delay sets the wait time between retries.

Question 38. Which of the following is a best practice for handling large CSV files in a Lambda function?
A) Load the entire file into memory using pandas
B) Use streaming/iterators (e.g., csv.reader) to process line by line
C) Write the file to /tmp and process it there
D) Increase Lambda memory to 10 GB
Answer: B
Explanation: Streaming processing avoids memory exhaustion, which is critical in Lambda’s limited environment.

Question 39. When using boto3 to assume an IAM role in another AWS account, which method is called?
A) sts.assume_role()
B) iam.assume_role()
C) sts.get_federation_token()
D) sts.get_session_token()
Answer: A
Explanation: sts.assume_role returns temporary credentials for the target role.

Question 40. Which Python library provides a simple way to create Docker images from a Python script for containerized pipelines?
A) docker-py
B) pyspark-docker
C) dockerfile-generator
D) docker
Answer: D
Explanation: The docker (docker-py) library allows programmatic creation and management of Docker images and containers.

Question 41. In GCP, which service stores metadata about data lineage that can be accessed via the google-cloud-datacatalog Python client?
A) Cloud Asset Inventory
B) Data Catalog
C) Cloud Logging
D) Cloud Trace
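
The row-by-row approach recommended in Question 38 might look like the following sketch. The "amount" column and the in-memory sample are assumptions for the demo; in a Lambda function the lines would come from a streamed object body rather than a StringIO.

```python
import csv
import io


def sum_amounts(lines) -> float:
    # csv.DictReader pulls one row at a time from any iterable of text lines,
    # so only the current row is held in memory.
    reader = csv.DictReader(lines)
    total = 0.0
    for row in reader:
        total += float(row["amount"])
    return total


# Small in-memory stand-in for a large CSV file.
sample = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5.0\n")
total = sum_amounts(sample)
```

Because sum_amounts accepts any iterable of lines, the same function works unchanged on an open file handle or a decoded network stream.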
Question 44. Which Python library can be used to interact with Apache Hive Metastore for managing table schemas?
A) pyhive
B) hive-client
C) hms-api
D) impyla
Answer: A
Explanation: pyhive provides a DB-API compatible interface to Hive (through HiveServer2), allowing schema queries.

Question 45. In Azure Synapse, which Python library is recommended for executing T-SQL statements via the serverless SQL pool?
A) pyodbc
B) sqlalchemy-azure
C) azure-synapse-spark
D) synapse-sql-client
Answer: A
Explanation: pyodbc can connect to Synapse’s SQL endpoint using ODBC drivers.

Question 46. Which GCP service provides a managed, auto-scaling Spark environment that can be accessed via the google-cloud-dataproc Python client?
A) Dataflow
B) Dataproc
C) Composer
D) BigQuery
Answer: B
Explanation: Dataproc offers managed Spark clusters; the Python client manages jobs and clusters.

Question 47. What is the primary advantage of using generators in Python for processing very large datasets?
A) They automatically parallelize the code
B) They load the entire dataset into memory
C) They yield items one at a time, reducing memory usage
D) They compress data on the fly
Answer: C
Explanation: Generators produce items lazily, keeping only the current item in memory.

Question 48. In Airflow, which parameter of the DAG object controls the maximum number of active runs for that DAG?
A) max_active_runs
B) concurrency
C) schedule_interval
D) catchup
Answer: A
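
The lazy evaluation described in Question 47 can be sketched with a chunking generator. The read_chunks helper and the chunk size of 4 are assumptions for illustration; the point is that only one chunk exists in memory at a time.

```python
from typing import Iterable, Iterator


def read_chunks(records: Iterable[int], chunk_size: int) -> Iterator[list[int]]:
    # Accumulate records into a small buffer and hand it over once it is full.
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    # Emit whatever is left when the input does not divide evenly.
    if chunk:
        yield chunk


# Each chunk is summed and discarded before the next one is built.
chunk_sums = [sum(chunk) for chunk in read_chunks(range(1, 11), chunk_size=4)]
```

Since nothing is computed until a chunk is requested, the same function can sit in front of an input far larger than memory, such as a huge range or a streamed file.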