























































































This updated analytics-focused practice exam assesses expertise in data preparation, storage, processing, analysis, and visualization. Learners master pipelines using Glue, Kinesis Data Streams/Firehose, Athena, EMR, Redshift Spectrum, QuickSight, and ML-based analytics. Scenario questions evaluate governance, schema evolution, partitioning strategies, cost-efficient design, and data lifecycle management.
























































































Question 1. Which AWS service provides at‑least‑once delivery semantics for streaming data and allows you to retain data for up to 365 days?
A) Amazon Kinesis Data Streams
B) Amazon Kinesis Data Firehose
C) Amazon Managed Streaming for Apache Kafka (MSK)
D) AWS DataSync
Answer: A
Explanation: Kinesis Data Streams stores records for a configurable retention period (default 24 h, up to 365 d) and guarantees at‑least‑once delivery.

Question 2. When you need to ingest large files (≥ 10 TB) from an on‑premises data center into Amazon S3 with minimal network impact, which solution is most appropriate?
A) AWS Direct Connect
B) AWS Snowball Edge
C) AWS DataSync
D) Amazon S3 Transfer Acceleration
Answer: B
Explanation: Snowball Edge is a petabyte‑scale offline data transfer device designed for moving very large datasets without consuming WAN bandwidth.

Question 3. Which data format offers columnar storage, schema evolution, and is optimized for query performance in Amazon Athena and Redshift Spectrum?
A) CSV
B) JSON
C) Parquet
D) XML
Answer: C
Explanation: Parquet stores data column‑wise, supports schema evolution, and enables predicate push‑down, making it ideal for Athena/Redshift Spectrum.

Question 4. An organization requires exactly‑once processing semantics for a streaming pipeline that aggregates click‑stream events. Which combination satisfies this requirement?
A) Kinesis Data Streams + Lambda (at‑least‑once)
B) Kinesis Data Firehose + S3 (at‑least‑once)
C) MSK + Kafka Streams with exactly‑once semantics
D) Kinesis Data Analytics (SQL) with at‑least‑once semantics
Answer: C
Explanation: Apache Kafka (MSK) with Kafka Streams can be configured for exactly‑once processing, unlike Kinesis which only guarantees at‑least‑once.

Question 5. Which S3 storage class provides automatic cost optimization for objects with unknown or changing access patterns?
A) S3 Standard
B) S3 Intelligent‑Tiering
C) S3 Standard‑IA
D) S3 Glacier Deep Archive
Answer: B
Explanation: Intelligent‑Tiering automatically moves objects between frequent and infrequent access tiers based on usage, without performance impact.

Question 6. To enforce row‑level security on a data lake built on Amazon S3, which AWS service should you use?
A) AWS Lake Formation
B) AWS Glue Data Catalog
Question 9. Which AWS service is best suited for interactive ad‑hoc querying of data stored in Amazon S3 without provisioning any infrastructure?
A) Amazon Redshift
B) Amazon Athena
C) Amazon EMR
D) AWS Glue DataBrew
Answer: B
Explanation: Athena is a serverless, pay‑per‑query service that directly queries data in S3 using standard SQL.

Question 10. You need to perform change data capture (CDC) on an on‑premises Oracle database and replicate the changes to Amazon Redshift in near real‑time. Which service should you use?
A) AWS Database Migration Service (DMS)
B) AWS DataSync
C) Amazon Kinesis Data Streams
D) AWS Glue
Answer: A
Explanation: DMS supports CDC from many source databases, including Oracle, and can replicate changes to Redshift continuously.

Question 11. Which of the following is a primary benefit of using Amazon S3 Object Lock in compliance mode?
A) Automatic lifecycle transition to Glacier
B) Prevention of object deletion or modification for a retention period
C) Encryption of objects at rest using KMS
D) Versioning of objects automatically
Answer: B
Explanation: Compliance mode enforces write‑once‑read‑many (WORM) protection, disallowing deletion or alteration for the specified retention period.

Question 12. Which AWS service provides a managed, serverless environment for running Apache Flink applications for stream processing?
A) Amazon Kinesis Data Analytics
B) AWS Lambda
C) Amazon EMR
D) AWS Glue Streaming
Answer: A
Explanation: Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink) runs Apache Flink applications for stateful stream processing without managing servers.

Question 13. In Amazon DynamoDB, which index type allows you to query on an alternate sort key while keeping the same partition key as the base table?
A) Global Secondary Index (GSI)
B) Local Secondary Index (LSI)
C) Composite Index
D) Sparse Index
Answer: B
Explanation: LSIs share the same partition key as the base table but use an alternate sort key, preserving the original table’s partition distribution.

Question 14. Which AWS service can be used to orchestrate a workflow that triggers an AWS Glue job, then an Amazon EMR step, and finally sends a notification via SNS?
A) AWS Step Functions
B) Amazon Managed Workflows for Apache Airflow (MWAA)
Question 17. When designing a data lake, which practice helps reduce the amount of data scanned by Amazon Athena queries?
A) Storing data in CSV format
B) Using S3 versioning
C) Partitioning data by common query predicates (e.g., date)
D) Enabling S3 Transfer Acceleration
Answer: C
Explanation: Partitioning organizes data into separate prefixes, allowing Athena to skip irrelevant partitions, thus scanning less data.

Question 18. Which AWS service provides built‑in visual data preparation capabilities, allowing business analysts to clean and transform data without writing code?
A) AWS Glue DataBrew
B) Amazon QuickSight
C) AWS Lake Formation
D) Amazon EMR Studio
Answer: A
Explanation: DataBrew offers a visual interface for data profiling, cleaning, and transformation, targeting non‑technical users.

Question 19. A company wants to run a Spark job that processes petabytes of data stored in S3, but they need to minimize operational overhead. Which service should they choose?
A) Amazon EMR on EC2 spot instances with auto‑scaling
B) AWS Glue Elastic Views
C) AWS Glue Spark runtime (serverless)
D) Amazon Redshift Serverless
Answer: C
Explanation: AWS Glue’s serverless Spark runtime automatically provisions and scales resources, removing the need to manage EMR clusters.

Question 20. Which Amazon QuickSight pricing tier provides in‑memory SPICE capacity for faster visualizations?
A) Standard edition
B) Enterprise edition
C) Free tier
D) QuickSight Q
Answer: B
Explanation: The Enterprise edition includes SPICE, an in‑memory calculation engine that accelerates dashboard performance.

Question 21. Which AWS service is specifically designed for log analytics, full‑text search, and real‑time operational intelligence?
A) Amazon Athena
B) Amazon OpenSearch Service
C) Amazon Redshift
D) AWS Glue
Answer: B
Explanation: OpenSearch Service (formerly Elasticsearch Service) offers distributed search and analytics capabilities optimized for log data.

Question 22. What is the primary purpose of an Amazon VPC Endpoint for S3 in a data analytics architecture?
A) To provide a public internet gateway for S3 access
B) To enable private, secure connectivity to S3 without traversing the internet
Question 25. Which AWS service can be used to continuously replicate data from an Amazon S3 bucket in one AWS account to a bucket in another account, preserving object metadata?
A) AWS DataSync
B) S3 Cross‑Region Replication (CRR) with bucket policies
C) S3 Batch Operations
D) AWS Transfer Family
Answer: B
Explanation: S3 CRR (or Same‑Region Replication) copies objects, including metadata, across accounts when configured with appropriate IAM policies.

Question 26. In Amazon Kinesis Data Analytics for SQL applications, which statement is true regarding stateful processing?
A) State is automatically persisted to DynamoDB for fault tolerance
B) State is stored in memory only and lost on application restart
C) State is checkpointed to an S3 bucket at user‑defined intervals
D) State cannot be used in SQL applications, only in Flink applications
Answer: C
Explanation: Kinesis Data Analytics checkpoints state to an S3 bucket, enabling recovery after failures.

Question 27. Which of the following best describes the “pull” model for data ingestion?
A) Data producers push records directly to a target service
B) The ingestion service periodically polls the source for new data
C) Data is transferred via AWS Snowball Edge
D) Data is streamed using Kinesis Data Streams
Answer: B
Explanation: In a pull model, the ingestion system initiates requests to retrieve data from the source, unlike push where the source sends data.

Question 28. Which AWS service provides a managed, serverless environment for running Apache Hive queries on data stored in Amazon S3?
A) Amazon EMR Serverless
B) Amazon Athena
C) AWS Glue Spark
D) Amazon Redshift Spectrum
Answer: A
Explanation: EMR Serverless lets you run Hive and Spark applications without managing clusters.

Question 29. To reduce costs for infrequently accessed analytical data stored in S3, which combination of storage class and lifecycle policy is most appropriate?
A) S3 Standard‑IA with transition to Glacier after 30 days
B) S3 Intelligent‑Tiering with no lifecycle policy
C) S3 Standard with transition to Glacier Deep Archive after 90 days
D) S3 One Zone‑IA with transition to S3 Standard after 60 days
Answer: A
Explanation: Standard‑IA is cheaper for infrequent access, and moving older data to Glacier further reduces storage costs.

Question 30. Which AWS service can automatically detect and classify sensitive data (e.g., PII) stored in Amazon S3?
A) Amazon Macie
B) AWS Config
A) Amazon Redshift Spectrum
B) Amazon Athena
C) Amazon RDS for PostgreSQL with external tables
D) Amazon OpenSearch Service
Answer: B
Explanation: Athena supports standard SQL (Presto) and can query S3 data directly without data movement.

Question 34. Which of the following is a key difference between AWS Glue’s Spark runtime and AWS Glue’s Python Shell runtime?
A) Spark runtime supports distributed processing; Python Shell runs on a single node.
B) Python Shell can read Parquet files; Spark runtime cannot.
C) Spark runtime does not support PySpark; Python Shell does.
D) Python Shell provides automatic schema inference; Spark does not.
Answer: A
Explanation: The Spark runtime distributes tasks across a cluster; the Python Shell runs a single‑node script.

Question 35. To ensure that data transferred between an on‑premises Hadoop cluster and Amazon S3 is encrypted in transit, which protocol should be used?
A) HTTP
B) SFTP
C) HTTPS (TLS) with the S3 REST API
D) FTP
Answer: C
Explanation: Using HTTPS (TLS) for the S3 REST API encrypts data in transit between Hadoop and S3.
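One common way to back up the HTTPS requirement from Question 35 on the bucket side is a bucket policy that denies any request arriving without TLS, using the `aws:SecureTransport` condition key. The sketch below only builds the policy document as a Python dictionary; the bucket name `example-analytics-bucket` and the helper name are hypothetical, and attaching the policy to a bucket is left out.

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies every request not sent over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" for plain-HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = deny_insecure_transport_policy("example-analytics-bucket")
print(json.dumps(policy, indent=2))
```

With a policy like this in place, a client that falls back to plain HTTP is rejected rather than silently sending data unencrypted.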
Question 36. Which AWS service offers a fully managed, serverless data catalog that integrates with Amazon Athena, Amazon Redshift Spectrum, and AWS Glue?
A) AWS Lake Formation
B) AWS Glue Data Catalog
C) Amazon S3 Inventory
D) Amazon DynamoDB
Answer: B
Explanation: The Glue Data Catalog is a central metadata repository used by Athena, Redshift Spectrum, and other services.

Question 37. In a Kinesis Data Firehose delivery stream, which option provides automatic data format conversion from JSON to Parquet before storing in S3?
A) Enable Record Transformation with Lambda
B) Enable Data Format Conversion and specify Parquet as the destination format
C) Use Kinesis Data Analytics to convert data before Firehose
D) Firehose cannot perform format conversion
Answer: B
Explanation: Firehose can directly convert incoming JSON to columnar formats like Parquet or ORC before delivery.

Question 38. Which IAM policy element is used to restrict access to a specific S3 bucket prefix (e.g., "logs/2023/") for a particular IAM role?
A) Resource: "arn:aws:s3:::my-bucket/logs/2023/*"
B) Condition: "StringEquals": {"s3:prefix": "logs/2023/"}
C) Action: "s3:ListBucket" only
D) Effect: "Deny" for all actions
B) S3 Bucket Policy with s3:content-length-range condition
C) S3 Lifecycle Rule
D) S3 Versioning
Answer: B
Explanation: The s3:content-length-range condition in a bucket policy can restrict the size of objects that can be uploaded.

Question 42. You need to run a daily batch ETL job that reads from Amazon RDS MySQL, transforms data, and writes to Amazon S3 in Parquet. Which service provides the most cost‑effective, serverless solution?
A) AWS Glue (Spark) job
B) Amazon EMR on spot instances
C) AWS Data Pipeline
D) AWS Batch
Answer: A
Explanation: Glue’s serverless Spark jobs charge only for the compute time used, making it cost‑effective for scheduled batch ETL.

Question 43. Which of the following is a true statement about Amazon S3 Select?
A) It can retrieve only entire objects, not subsets.
B) It works only with CSV files.
C) It enables retrieving a subset of data from an object using SQL expressions, reducing data transfer.
D) It requires a dedicated EC2 instance to run.
Answer: C
Explanation: S3 Select allows you to run SQL queries directly on objects (CSV, JSON, Parquet) to fetch only the needed data.
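To make the S3 Select answer in Question 43 more concrete, here is a minimal sketch of the request parameters one might pass to the boto3 `select_object_content` call for a CSV object with a header row. The bucket, key, column names (`order_id`, `amount`), and the helper function are all assumptions made for illustration; the actual API call is shown only in a comment because it needs boto3 and AWS credentials.

```python
def build_select_request(bucket: str, key: str, min_amount: float) -> dict:
    """Build keyword arguments for an S3 Select query over a CSV object."""
    # float() keeps arbitrary text out of the SQL expression.
    threshold = float(min_amount)
    expression = (
        "SELECT s.order_id, s.amount FROM S3Object s "
        f"WHERE CAST(s.amount AS FLOAT) > {threshold}"
    )
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # FileHeaderInfo=USE lets the query refer to columns by header name.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


# Hypothetical bucket and key, used only for illustration.
request = build_select_request("example-bucket", "orders/2023/orders.csv", 100)
print(request["Expression"])

# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   response = boto3.client("s3").select_object_content(**request)
# Only the matching rows are returned, not the whole object.
```

The point of the pattern is that filtering happens on the S3 side, so the client transfers only the selected columns and rows.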
Question 44. In Amazon Redshift, what is the purpose of a sort key?
A) To define the order of rows on disk for faster range queries and joins.
B) To distribute data across nodes.
C) To encrypt data at rest.
D) To control user access permissions.
Answer: A
Explanation: Sort keys determine how data is physically sorted on disk, improving performance for queries that filter on those columns.

Question 45. Which AWS service can be used to create a private, high‑throughput connection between an on‑premises data center and AWS without traversing the public internet?
A) AWS Direct Connect
B) Amazon CloudFront
C) AWS VPN
D) AWS Transit Gateway
Answer: A
Explanation: Direct Connect provides dedicated network links that bypass the public internet, offering consistent low latency and high bandwidth.

Question 46. Which Amazon QuickSight feature enables you to embed dashboards into a third‑party web application with fine‑grained access control?
A) QuickSight Q
B) SPICE
C) QuickSight Enterprise Edition IAM embedding
D) QuickSight Standard Edition sharing links
A) Kinesis Data Streams → Lambda → S3
B) Kinesis Data Firehose → S3 (no processing)
C) Kinesis Data Analytics (SQL) → S3
D) IoT Core → S3 directly
Answer: C
Explanation: Kinesis Data Analytics can run continuous SQL aggregations on streaming data and write the aggregated results to S3.

Question 50. Which AWS service provides a managed, serverless environment for running Apache Airflow workflows?
A) AWS Step Functions
B) Amazon Managed Workflows for Apache Airflow (MWAA)
C) AWS Glue Workflow
D) Amazon EMR Studio
Answer: B
Explanation: MWAA is a fully managed Airflow service, handling scaling, security, and availability.

Question 51. In a data lake, which practice helps maintain data quality by preventing malformed records from being ingested?
A) Enabling S3 Versioning
B) Using AWS Glue DataBrew validation rules before loading data
C) Setting S3 Lifecycle policies to delete bad data after 7 days
D) Relying on IAM policies to block malformed files
Answer: B
Explanation: DataBrew can apply validation rules (e.g., schema checks) and reject or flag bad records before they enter the lake.
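The idea behind Question 51, checking records against rules and setting aside the bad ones before they reach the lake, can be illustrated locally. The sketch below is plain Python and does not use the DataBrew API; the field names, the rule that `amount` must be non-negative, and the sample records are all invented for the example.

```python
# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"event_id": str, "user_id": str, "amount": float}


def validate(record: dict) -> list:
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Value rule, checked only once the schema itself is satisfied.
    if not problems and record["amount"] < 0:
        problems.append("amount must be non-negative")
    return problems


def split_records(records: list) -> tuple:
    """Separate records into accepted ones and (record, problems) rejects."""
    accepted, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


records = [
    {"event_id": "e1", "user_id": "u1", "amount": 12.5},
    {"event_id": "e2", "amount": 3.0},                     # missing user_id
    {"event_id": "e3", "user_id": "u3", "amount": "n/a"},  # wrong type
    {"event_id": "e4", "user_id": "u4", "amount": -1.0},   # fails value rule
]
accepted, rejected = split_records(records)
print(len(accepted), len(rejected))
```

Keeping the rejected records together with the reasons, rather than dropping them, makes it possible to route them to a quarantine location for later inspection.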
Question 52. Which AWS service can be used to capture and archive VPC flow logs for security analysis?
A) Amazon CloudWatch Logs
B) AWS Config
C) Amazon GuardDuty
D) Amazon S3 Access Analyzer
Answer: A
Explanation: VPC flow logs can be sent directly to CloudWatch Logs (or S3) for storage and analysis.

Question 53. Which of the following is a benefit of using Amazon Redshift Serverless over a provisioned Redshift cluster for unpredictable workloads?
A) Fixed compute capacity with no scaling
B) Automatic scaling of compute resources based on query demand
C) Ability to use PostgreSQL extensions directly
D) No need for a data warehouse schema
Answer: B
Explanation: Redshift Serverless automatically provisions and scales compute capacity to match workload, ideal for variable demand.

Question 54. When configuring an Amazon S3 bucket for use as a data lake, which setting should be enabled to prevent accidental deletion of objects?
A) S3 Versioning
B) S3 Transfer Acceleration
C) S3 Event Notifications
D) S3 Replication