Analytics
Amazon Athena
Serverless SQL Query Service, Query S3 Data, Presto/Trino Engine, Partitioning
Giới thiệu
Amazon Athena là dịch vụ serverless interactive query cho phép phân tích dữ liệu trực tiếp trong Amazon S3 bằng SQL chuẩn.
Đặc điểm chính
| Đặc điểm | Mô tả |
|---|---|
| Serverless | Không cần quản lý infrastructure |
| Pay-per-query | Chỉ trả tiền cho data scanned |
| Standard SQL | Dùng Presto/Trino engine |
| No ETL required | Query trực tiếp trên S3 |
| Fast | Kết quả trong vài giây |
Vấn đề cần giải quyết
Kiến trúc
Components
| Component | Mô tả |
|---|---|
| Athena Engine | Presto/Trino-based SQL engine |
| AWS Glue Data Catalog | Metadata store (databases, tables, schemas) |
| S3 | Data source (nơi chứa data thực tế) |
| Query Results Location | S3 bucket lưu kết quả query |
Cách hoạt động
Step-by-step
EXTERNAL TABLE concept
Data Formats
Athena hỗ trợ nhiều data formats:
| Format | Compression | Use Case | Performance |
|---|---|---|---|
| CSV/TSV | GZIP, BZIP2 | Simple logs | ⭐⭐ |
| JSON | GZIP | Semi-structured data | ⭐⭐ |
| Parquet | Snappy, GZIP | Analytics | ⭐⭐⭐⭐⭐ |
| ORC | Snappy, ZLIB | Analytics | ⭐⭐⭐⭐⭐ |
| Avro | Snappy | Schema evolution | ⭐⭐⭐ |
Columnar vs Row-based
Partitioning
Partitioning chia data thành các phần nhỏ hơn dựa trên column values, giúp giảm data scanned.
Ví dụ Partition by Date
Add Partitions
Tích hợp với các dịch vụ AWS
Common Data Pipeline
Federated Query
Federated Query cho phép Athena query data từ nhiều nguồn khác ngoài S3.
Workgroups
Workgroups giúp tách biệt queries, users, và control costs.
| Feature | Mô tả |
|---|---|
| Query isolation | Tách queries theo team/project |
| Cost control | Set data scan limits per query |
| IAM integration | Control access to workgroups |
| Query history | Track queries per workgroup |
| Settings | Different result locations, encryption |
Pricing
Pricing Model
| Component | Price |
|---|---|
| Data scanned | $5 per TB scanned |
| DDL statements | FREE (CREATE, DROP, ALTER) |
| Failed queries | FREE |
| Cancelled queries | Pay for data scanned before cancel |
Cost Optimization
| Technique | Savings |
|---|---|
| Columnar formats (Parquet/ORC) | 30-90% less data scanned |
| Compression | 30-50% less data |
| Partitioning | Only scan needed partitions |
| LIMIT clause | Limit results (doesn't help much) |
Cost Example
Performance Optimization
1. Use Columnar Formats
2. Partition Data
3. Use Compression
4. Optimize Queries
| Do | Don't |
|---|---|
SELECT col1, col2 | SELECT * |
| Filter on partition columns | Scan entire dataset |
Use LIMIT for testing | Run full query for testing |
Use EXPLAIN to understand | Guess query performance |
Use Cases
1. Log Analysis
2. Ad-hoc Data Exploration
3. BI Dashboards
4. Data Lake Analytics
Best Practices
Data Organization
| Practice | Reason |
|---|---|
| Use Parquet/ORC | Columnar = less data scanned |
| Partition by date/region | Partition pruning |
| Compress data | Smaller files = cheaper |
| Right-size files | 128 MB - 512 MB optimal |
Query Optimization
| Practice | Reason |
|---|---|
| Select specific columns | Don't use SELECT * |
| Filter on partitions | Avoid full table scans |
| Use LIMIT for testing | Save costs during development |
| EXPLAIN queries | Understand query plans |
Cost Control
| Practice | Reason |
|---|---|
| Set workgroup limits | Prevent runaway queries |
| Monitor with CloudWatch | Track spending |
| Use Saved Queries | Reuse optimized queries |
Exam Tips
Key Points
- Serverless - No infrastructure to manage
- Pay-per-query - $5/TB scanned
- Standard SQL - Presto/Trino engine
- S3 as data source - Query data in place
- Glue Data Catalog - Metadata storage
Cost Optimization (important!)
- Columnar formats (Parquet/ORC) reduce costs 30-90%
- Partitioning reduces data scanned
- Compression reduces data size
- Workgroups control costs per team
Common Exam Scenarios
| Scenario | Answer |
|---|---|
| Query S3 data with SQL, serverless | Amazon Athena |
| Analyze CloudTrail/VPC Flow Logs | Athena |
| Ad-hoc queries on data lake | Athena |
| Reduce Athena costs | Parquet + Partitioning |
| BI on S3 data | QuickSight + Athena |
| Query multiple data sources | Federated Query |
So sánh với các dịch vụ khác
| Feature | Athena | Redshift | EMR |
|---|---|---|---|
| Type | Serverless query | Data warehouse | Cluster |
| Pricing | Per query | Per hour | Per hour |
| Best for | Ad-hoc, occasional | BI, dashboards | ETL, ML |
| Setup | Instant | Minutes | Minutes |
| Data location | S3 | Redshift storage | S3/HDFS |