











































Focuses on data engineering principles, data pipelines, storage architectures, ETL processes, analytics platforms, and data governance practices.
Typology: Exams












































Question 1. Which of the following best describes the primary responsibility of a data engineer in a modern analytics ecosystem?
A) Designing user interfaces for data dashboards
B) Building and maintaining scalable data pipelines
C) Performing statistical analysis on raw data
D) Managing the corporate network infrastructure
Answer: B
Explanation: Data engineers focus on constructing, automating, and optimizing data pipelines that ingest, transform, and store data for downstream analytics and reporting.

Question 2. In the context of the Hadoop ecosystem, which component is responsible for storing data across a cluster in a fault-tolerant manner?
A) YARN
B) Hive
C) HDFS
D) Pig
Answer: C
Explanation: The Hadoop Distributed File System (HDFS) replicates data blocks across multiple nodes, providing durability and high availability.

Question 3. Which of the following SQL keywords is used to remove duplicate rows from a result set?
A) DISTINCT
B) UNIQUE
C) GROUP BY
D) HAVING
Answer: A
Explanation: The DISTINCT keyword eliminates duplicate rows from the query’s output.

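Worked example for Question 3: a minimal sketch, assuming Python's built-in sqlite3 module and a made-up `orders` table (the table and its rows are illustrative only). It compares a plain SELECT with SELECT DISTINCT on the same data.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", "Lisbon"), ("ana", "Lisbon"), ("bo", "Oslo")],
)

# Plain SELECT returns every stored row, duplicates included.
all_rows = conn.execute("SELECT customer, city FROM orders").fetchall()

# DISTINCT collapses rows that are identical across all selected columns.
distinct_rows = conn.execute(
    "SELECT DISTINCT customer, city FROM orders ORDER BY customer"
).fetchall()

print(len(all_rows), len(distinct_rows))
```

DISTINCT compares the whole selected row, so two rows that differ in any selected column are both kept.
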
Question 4. When designing a star schema, what is the typical relationship between a fact table and its dimension tables?
A) Many-to-many
B) One-to-one
C) One-to-many (dimension to fact)
D) Many-to-one (fact to dimension)
Answer: C
Explanation: Each dimension record can be referenced by many fact rows, establishing a one-to-many relationship from dimensions to the fact table. Read from the fact side, the same relationship is many-to-one, so option D describes it from the opposite direction.

Question 5. In Apache Spark, which abstraction represents a distributed collection of objects that can be processed in parallel?
A) DataFrame
B) RDD
C) Dataset
D) Table
Answer: B
Explanation: Resilient Distributed Datasets (RDDs) are the low-level Spark abstraction for parallel processing; DataFrames and Datasets are higher-level APIs built on RDDs.

Question 6. Which of the following is a key advantage of using columnar storage formats (e.g., Parquet, ORC) for analytical workloads?
A) Faster row-level updates
B) Reduced storage cost for binary data
C) Improved scan performance for selected columns
D) Simplified schema evolution
Answer: C
Explanation: Columnar formats allow the engine to read only the columns required by a query, dramatically reducing I/O for analytics.

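Worked example for Question 4: a minimal sketch, assuming Python's built-in sqlite3 module and two hypothetical tables, `dim_product` and `fact_sales`. Several fact rows point at the same dimension row, which is the one-to-many relationship described in the explanation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical star schema with one dimension and one fact table.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_sales VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# Join facts to their dimension and aggregate per dimension member.
rows_per_product = conn.execute("""
    SELECT d.name, COUNT(*) AS sales, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()

print(rows_per_product)
```
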
**Question 10. Which of the following best defines “schema drift” in a data pipeline?**
A) A gradual increase in data volume over time
B) Changes in the structure of incoming data that are not reflected in downstream schemas
C) A shift from batch to streaming processing
D) The migration of data from on-premises to cloud storage
Answer: B
Explanation: Schema drift occurs when source data evolves (e.g., new columns) while downstream systems expect a static schema, potentially causing failures.

Question 11. In relational databases, what is the effect of creating a clustered index on a table’s primary key?
A) The index is stored separately from the data rows
B) Data rows are physically ordered according to the index key
C) It prevents duplicate values in non-key columns
D) It automatically compresses the table data
Answer: B
Explanation: A clustered index determines the physical order of rows on disk, aligning the data with the indexed column(s).

Question 12. Which of the following NoSQL databases is best suited for graph-based queries such as shortest-path calculations?
A) Cassandra
B) MongoDB
C) Neo4j
D) Redis
Answer: C
Explanation: Neo4j is a native graph database optimized for traversals, pathfinding, and relationship-centric queries.

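Worked example for Question 10: one possible way to detect schema drift is to compare the columns of each incoming record with the columns the downstream table expects. This is only a sketch; `EXPECTED_COLUMNS`, `detect_schema_drift`, and the sample record are hypothetical names and data.

```python
# Columns the downstream table expects (assumed for illustration).
EXPECTED_COLUMNS = frozenset({"id", "email", "created_at"})

def detect_schema_drift(record: dict, expected: frozenset = EXPECTED_COLUMNS) -> dict:
    """Report columns that appeared or disappeared relative to the expected schema."""
    incoming = set(record)
    return {
        "added": sorted(incoming - expected),
        "missing": sorted(expected - incoming),
    }

# The source system started sending "phone" and stopped sending "created_at".
drift = detect_schema_drift({"id": 1, "email": "a@example.com", "phone": "555-0100"})
print(drift)
```

A pipeline could fail fast, quarantine the record, or evolve the target schema depending on what the report contains.
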
Question 13. In the context of data security, what does the principle of “least privilege” dictate?
A) All users must have admin rights to ensure flexibility
B) Users receive only the permissions necessary to perform their job functions
C) Data should be encrypted only at rest, not in transit
D) Access controls are applied only to production environments
Answer: B
Explanation: Least privilege limits each user’s access to the minimum resources required, reducing risk of accidental or malicious misuse.

Question 14. Which Apache Airflow operator would you use to execute a SQL query against a PostgreSQL database?
A) BashOperator
B) PythonOperator
C) PostgresOperator
D) MySQLOperator
Answer: C
Explanation: PostgresOperator is designed to run arbitrary SQL statements on a PostgreSQL connection.

Question 15. What is the primary benefit of using “partition pruning” in a query engine?
A) It reduces the number of columns scanned
B) It eliminates the need for indexes
C) It restricts query execution to relevant data partitions, lowering I/O
D) It automatically compresses data during query time
Answer: C
Explanation: Partition pruning skips whole partitions that do not satisfy the query predicate, improving performance.

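Worked example for Question 15: a toy illustration of partition pruning, assuming a hypothetical table partitioned by ISO date string. Real engines prune using partition metadata; here the dictionary keys play that role and the `scanned` counter shows how many partitions were actually read.

```python
# Hypothetical date-partitioned table: partition key -> rows in that partition.
partitions = {
    "2024-01-01": [{"user": "a", "amount": 10}],
    "2024-01-02": [{"user": "b", "amount": 20}, {"user": "c", "amount": 5}],
    "2024-01-03": [{"user": "a", "amount": 7}],
}

def total_amount(partitions, start, end):
    """Sum amounts for days in [start, end], reading only matching partitions."""
    scanned = 0
    total = 0
    for day, rows in partitions.items():
        if not (start <= day <= end):
            continue  # pruned: the partition key alone rules this partition out
        scanned += 1
        total += sum(row["amount"] for row in rows)
    return total, scanned

total, scanned = total_amount(partitions, "2024-01-02", "2024-01-03")
print(total, scanned)
```
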
Question 19. Which AWS service provides a managed, serverless data lake built on top of S3 with fine-grained security controls?
A) Amazon RDS
B) AWS Glue Data Catalog
C) Amazon Redshift Spectrum
D) AWS Lake Formation
Answer: D
Explanation: Lake Formation simplifies creation, security, and governance of data lakes on S3.

Question 20. What does the term “back-pressure” refer to in a streaming processing system?
A) A security protocol for encrypting data streams
B) A mechanism that slows upstream data production when downstream operators are overloaded
C) The process of archiving old data to cold storage
D) An algorithm for load balancing across compute nodes
Answer: B
Explanation: Back-pressure signals producers to reduce emission rates, preventing buffer overflows and ensuring system stability.

**Question 21. Which of the following is a characteristic of a “cold” data warehouse?**
A) Data is accessed in real-time for operational reporting
B) Data is stored on high-performance SSDs for low latency
C) Data is infrequently accessed and often archived on cheaper storage tiers
D) Data is continuously updated every few seconds
Answer: C
Explanation: Cold warehouses hold historical or rarely accessed data, optimizing cost over speed.

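Worked example for Question 20: a single-threaded sketch of back-pressure using a bounded `queue.Queue` from the Python standard library. The `produce` helper is hypothetical; real streaming systems usually block the producer or use credit-based flow control instead of returning rejected items.

```python
import queue

# Bounded buffer between a producer and a slower consumer.
buffer = queue.Queue(maxsize=2)

def produce(items, buffer):
    """Offer items to the buffer and return those rejected because it was full."""
    deferred = []
    for item in items:
        try:
            buffer.put_nowait(item)
        except queue.Full:
            # Back-pressure signal: the producer must hold the item and retry later.
            deferred.append(item)
    return deferred

deferred = produce([1, 2, 3, 4], buffer)                 # buffer fills after two items
consumed = [buffer.get_nowait(), buffer.get_nowait()]    # consumer catches up
retry_deferred = produce(deferred, buffer)               # retry now succeeds
print(deferred, consumed, retry_deferred)
```
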
Question 22. In data modeling, what is a “slowly changing dimension” (SCD) used for?
A) Tracking changes to dimension attributes over time without overwriting history
B) Improving query performance by denormalizing tables
C) Storing temporary staging data during ETL
D) Enforcing referential integrity between fact and dimension tables
Answer: A
Explanation: SCD techniques manage changes to dimension attributes. Type 1 overwrites the old value, Type 2 adds a new row to preserve full history, and Type 3 keeps the previous value in an additional column.

Question 23. Which of the following best describes the purpose of a “materialized view” in a database?
A) To store query results physically for faster retrieval
B) To enforce foreign key constraints
C) To provide a live, real-time snapshot of source tables without storage overhead
D) To replace indexes on large tables
Answer: A
Explanation: Materialized views pre-compute and store query output, reducing runtime for complex aggregations.

Question 24. In the context of data pipelines, what is the main advantage of using “idempotent” operations?
A) They guarantee faster execution times
B) They allow the same operation to be safely retried without adverse effects
C) They encrypt data automatically
D) They eliminate the need for logging
Answer: B
Explanation: Idempotent steps produce the same result regardless of how many times they run, enabling reliable retries.

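Worked example for Question 24: one common way to make a load step idempotent is to write with a key-based upsert instead of a plain insert. The sketch below assumes Python's built-in sqlite3 module; the `daily_totals` table and `load_day` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total REAL)")

def load_day(conn, day, total):
    """Write one day's total; re-running replaces the row instead of duplicating it."""
    conn.execute(
        "INSERT OR REPLACE INTO daily_totals (day, total) VALUES (?, ?)",
        (day, total),
    )

load_day(conn, "2024-01-01", 42.0)
load_day(conn, "2024-01-01", 42.0)  # simulated retry after a failure

rows = conn.execute("SELECT day, total FROM daily_totals").fetchall()
print(rows)
```

With a plain INSERT and no primary key, the retry would have produced a second, duplicate row.
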
Question 28. In Microsoft SQL Server, what does the table hint “WITH (NOLOCK)” do?
A) Forces a full table scan
B) Allows reading uncommitted data, potentially causing dirty reads
C) Creates a temporary index for the query
D) Locks the entire table for exclusive access
Answer: B
Explanation: NOLOCK is a T-SQL table hint, not part of ANSI SQL. It is equivalent to the READ UNCOMMITTED isolation level: the read takes no shared locks, which reduces contention but permits dirty reads.

Question 29. Which of the following best describes a “data contract” between producers and consumers in a data platform?
A) A legal agreement outlining data ownership
B) A formal schema and quality expectations that producers must meet for downstream consumers
C) A network protocol for transmitting data
D) An encryption key exchange mechanism
Answer: B
Explanation: Data contracts define the structure, semantics, and quality metrics of data, ensuring compatibility across pipelines.

Question 30. When using Amazon Redshift Spectrum, which storage layer does the service query directly?
A) Amazon RDS
B) Amazon S3
C) Amazon DynamoDB
D) Amazon EFS
Answer: B
Explanation: Redshift Spectrum extends Redshift queries to data stored in S3 without loading it into the cluster.

Question 31. In the context of data lake governance, what is the purpose of “data lineage”?
A) To encrypt data at rest
B) To track the origin, transformations, and movement of data over time
C) To compress data for archival
D) To schedule batch jobs automatically
Answer: B
Explanation: Data lineage provides visibility into how data flows and changes, supporting auditing and impact analysis.

Question 32. Which of the following is the most appropriate tool for performing large-scale graph analytics on a distributed cluster?
A) Apache Spark GraphX
B) Apache Hive
C) Apache Flume
D) Apache Sqoop
Answer: A
Explanation: GraphX is Spark’s library for graph processing, enabling parallel graph algorithms on big data.

Question 33. What does the term “cold start” refer to in the context of serverless data processing functions?
A) The initial data load into a warehouse
B) The latency incurred when a function is invoked for the first time after being idle
C) The process of archiving old data
D) The automatic scaling of compute nodes during peak load
Answer: B
Explanation: Cold start latency occurs because the runtime environment must be provisioned before executing the function.

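Worked example for Question 31: lineage is often modeled as a directed graph of datasets. The sketch below assumes a hypothetical, hand-written `lineage` mapping and walks it to find every upstream dataset that a report depends on, which is the basis of impact analysis.

```python
# Hypothetical lineage graph: dataset -> datasets it was derived from.
lineage = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}

def upstream(dataset, lineage):
    """Return every dataset that the given dataset depends on, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(lineage.get(node, []))
    return seen

sources = upstream("revenue_dashboard", lineage)
print(sorted(sources))
```
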
Question 37. In data quality frameworks, what does the “completeness” metric assess?
A) The degree to which data conforms to a predefined format
B) The proportion of expected records or fields that are present
C) The timeliness of data arrival relative to its source
D) The uniqueness of key attributes across records
Answer: B
Explanation: Completeness measures missing values or records against an expected dataset.

Question 38. Which of the following AWS services is specifically designed for real-time change data capture (CDC) from relational databases?
A) Amazon Kinesis Data Streams
B) AWS DMS (Database Migration Service)
C) Amazon Athena
D) AWS Glue
Answer: B
Explanation: AWS DMS can capture and stream ongoing changes from source databases to targets in real time.

Question 39. In the context of data serialization, which format is schema-evolution friendly and widely used with Apache Avro?
A) CSV
B) JSON
C) Parquet
D) Avro binary with embedded schema
Answer: D
Explanation: Avro stores the schema with the data, allowing readers to interpret older or newer versions safely.

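Worked example for Question 37: completeness can be computed as the share of expected field values that are actually present. This is a minimal sketch; the `completeness` function, the required fields, and the sample records are assumptions made for illustration.

```python
def completeness(records, required_fields):
    """Fraction of required field values that are present (not missing or None)."""
    expected = len(records) * len(required_fields)
    if expected == 0:
        return 1.0  # nothing was expected, so nothing is missing
    present = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) is not None
    )
    return present / expected

# Hypothetical batch: one null email and one record missing the field entirely.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3},
    {"id": 4, "email": "d@example.com"},
]
score = completeness(records, ["id", "email"])
print(score)
```
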
Question 40. Which of the following best describes “sharding” in a distributed database?
A) Replicating the same data on multiple nodes for fault tolerance
B) Partitioning data horizontally across multiple nodes based on a sharding key
C) Encrypting each row with a unique key
D) Storing data in a single monolithic server for simplicity
Answer: B
Explanation: Sharding distributes subsets of data (rows) across nodes, improving scalability and performance.

Question 41. In a data warehouse, what is the purpose of a “factless fact table”?
A) To store aggregate metrics without any dimensions
B) To capture many-to-many relationships between dimensions without numeric measures
C) To hold temporary staging data before loading
D) To enforce referential integrity constraints
Answer: B
Explanation: Factless fact tables record events (e.g., student attendance) where the existence of a row itself is the fact.

Question 42. Which of the following Spark APIs provides compile-time type safety for structured data?
A) Spark SQL (DataFrames)
B) Spark Core (RDD)
C) Spark SQL (Datasets)
D) Spark MLlib
Answer: C
Explanation: Datasets, the typed API available in Scala and Java, combine the benefits of RDDs and DataFrames and offer compile-time type checking.

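Worked example for Question 40: a simple hash-based routing sketch. The `shard_for` function and the user IDs are hypothetical; it uses a stable hash from `hashlib` because Python's built-in `hash()` for strings varies between processes. Production systems often prefer consistent hashing or range-based sharding so that adding nodes moves less data.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a sharding key to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Route hypothetical user IDs to four shards.
NUM_SHARDS = 4
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[shard_for(user_id, NUM_SHARDS)].append(user_id)

print(shards)
```
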
Question 46. Which of the following is the primary function of a “reverse ETL” process?
A) Exporting transformed data from a warehouse back into operational systems for actionability
B) Reversing corrupted data back to its original state
C) Converting unstructured logs into structured tables
D) Automating schema migrations across environments
Answer: A
Explanation: Reverse ETL moves analytical data back into SaaS or CRM tools to enable operational use.

Question 47. In Google Cloud Platform, which service offers a fully managed, serverless, SQL-compatible data warehouse?
A) Cloud SQL
B) BigQuery
C) Cloud Spanner
D) Dataproc
Answer: B
Explanation: BigQuery provides a serverless, highly scalable, ANSI-SQL-compatible analytics platform.

**Question 48. Which of the following best describes a “pipeline orchestration” tool?**
A) A system that stores raw data files
B) A platform that schedules, monitors, and manages the execution of data workflow tasks
C) A database engine optimized for time-series data
D) A visualization library for dashboards
Answer: B
Explanation: Orchestration tools like Airflow or Prefect coordinate task dependencies, retries, and monitoring.

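Worked example for Question 48: the core of orchestration is running tasks in dependency order. The sketch below is not Airflow or Prefect code; it assumes Python 3.9+ and uses the standard-library `graphlib.TopologicalSorter` on a hypothetical four-task workflow, leaving out scheduling, retries, and monitoring.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: task -> set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform", "quality_check"},
}

def run(dag, tasks):
    """Execute each task only after all of its dependencies have run."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()  # a real orchestrator would add retries and logging here
        order.append(name)
    return order

# Placeholder callables stand in for real task logic.
tasks = {name: (lambda: None) for name in dag}
order = run(dag, tasks)
print(order)
```
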
Question 49. When using the “COPY” command in PostgreSQL to load data from S3, which option improves performance by parallelizing the load?
A) DISABLE TRIGGER
B) PARALLEL ON
C) FREEZE
D) FORCE NOT NULL
Answer: B
Explanation: B is the intended answer, but core PostgreSQL’s COPY has no PARALLEL option and no built-in S3 source. Of the listed items, only FREEZE and FORCE_NOT_NULL are actual COPY options, and neither parallelizes a load. In practice, parallel loading is achieved by splitting the input into several files and running multiple COPY commands concurrently.

Question 50. Which of the following is a key characteristic of a “time-travel” query in a data warehouse?
A) It allows querying data as it existed at a prior point in time
B) It automatically deletes data older than a retention window
C) It encrypts query results with a timestamped key
D) It speeds up queries by bypassing index scans
Answer: A
Explanation: Time-travel enables point-in-time snapshots, useful for audits and debugging.

Question 51. In Apache Hive, which file format provides built-in column statistics for query optimization?
A) TextFile
B) ORC
C) SequenceFile
D) Avro
Answer: B
Explanation: ORC stores column statistics and compression, enabling predicate push-down and better planning.

Question 55. In the context of data versioning, what is a “snapshot”?
A) A real-time view of streaming data
B) A read-only copy of a dataset at a specific point in time
C) A compressed backup stored on tape
D) An automated schema migration script
Answer: B
Explanation: Snapshots capture the state of a dataset, allowing reproducibility and rollback.

Question 56. Which of the following is a primary advantage of using “vectorized execution” in a query engine like DuckDB?
A) It enables parallel execution across multiple nodes
B) It processes batches of rows as CPU-friendly vectors, improving cache utilization and speed
C) It automatically scales storage based on query load
D) It enforces strict ACID compliance
Answer: B
Explanation: Vectorized processing handles columns in chunks, leveraging SIMD instructions for faster computation.

Question 57. In Azure Data Factory, which component defines the logical flow of activities?
A) Linked Service
B) Dataset
C) Pipeline
D) Trigger
Answer: C
Explanation: Pipelines group activities and define control flow, dependencies, and parameters.

Question 58. Which of the following best describes “late-arrival data” in a streaming pipeline?
A) Data that arrives after the watermark has advanced beyond its event time, potentially being dropped or handled specially
B) Data that is encrypted during transmission
C) Data that is processed in the batch layer only
D) Data that is stored in a cold archive
Answer: A
Explanation: Late-arrival events occur out of order relative to the current processing window and may require side-output handling.

Question 59. When using Apache Cassandra, what is the purpose of a “compaction” process?
A) To merge SSTables, discard tombstones, and reclaim disk space
B) To encrypt data at rest
C) To re-balance data across nodes after a node failure
D) To create secondary indexes automatically
Answer: A
Explanation: Compaction consolidates SSTables, removes deleted data, and optimizes read performance.

Question 60. Which of the following is a common technique to reduce the size of a Parquet file?
A) Enabling row-level encryption
B) Using dictionary encoding and columnar compression
C) Storing data as plain text CSV
D) Increasing the number of columns per file
Answer: B
Explanation: Parquet supports columnar compression, dictionary encoding, and other techniques to shrink file size.
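
Worked example for Question 60: a toy version of dictionary encoding on a single column. This is not Parquet's on-disk format, only an illustration of the idea; `dictionary_encode`, `dictionary_decode`, and the sample column are hypothetical.

```python
def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup dictionary."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in the dictionary
    codes = []        # one integer code per input value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]

# Low-cardinality column: few distinct values repeated many times.
column = ["DE", "US", "US", "DE", "FR", "US"]
dictionary, codes = dictionary_encode(column)
print(dictionary, codes)
```

The saving comes from storing each distinct value once and the small integer codes many times, which works best on low-cardinality columns.
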