2022-02-20 Chapter 15 - Big Data

service-redshift-001; {{c1::AWS Redshift}} is a fully managed cloud data warehouse which can store up to 16 PB of data.

service-redshift-002; AWS Redshift is a relational database suitable for {{c1::OLAP (BI)}} use cases. It's unsuitable for {{c1::OLTP}} use cases.

service-redshift-003; AWS Redshift only supports single-{{c1::AZ}} deployments and is not highly-available by default.

service-emr-001; {{c1::AWS Elastic Map Reduce (EMR)}} is a managed big data platform that allows you to process data using open-source tools like Spark or Hive.

service-emr-002; AWS Elastic Map Reduce (EMR) can be considered as an {{c1::ETL}} tool running across a fleet of EC2 instances in a standard VPC.

service-emr-003; AWS EMR can be used with {{c1::RIs}} and {{c1::Spot}} Instances to reduce your costs.

service-kinesis-001; {{c1::AWS Kinesis}} allows you to ingest, process and analyse real-time streaming data.

service-kinesis-002; There are two forms of Kinesis: 1) {{c1::Kinesis Data Streams}}, which is real-time and leaves users responsible for creating consumers and scaling, and 2) {{c1::Kinesis Data Firehose}}, which is near real-time (within 60 seconds), is plug and play with other AWS services and handles scaling for you (i.e. it is much simpler to manage than Kinesis Data Streams).

service-kinesis-003; AWS Kinesis Data Streams comprise 1) {{c1::shards}} which have fixed capacity and must be planned and 2) {{c1::consumers}} which are custom applications using the Kinesis SDK running on EC2.
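
Because shard capacity is fixed, shard planning comes down to simple arithmetic. The sketch below is a rough estimate only, assuming the commonly documented per-shard limits for provisioned streams (about 1 MB/s or 1,000 records/s for writes, about 2 MB/s for shared reads). The function name `shards_needed` and its parameters are hypothetical, and real sizing should also allow for headroom and uneven key distribution.

```python
import math

# Assumed per-shard limits for a provisioned Kinesis data stream.
# "MB" is treated as 10**6 bytes here for simplicity.
WRITE_BYTES_PER_SEC = 1_000_000    # ~1 MB/s of writes per shard
WRITE_RECORDS_PER_SEC = 1_000      # 1,000 records/s of writes per shard
READ_BYTES_PER_SEC = 2_000_000     # ~2 MB/s of reads per shard, shared by consumers


def shards_needed(records_per_sec, avg_record_bytes, consumers=1):
    """Estimate the minimum shard count for a steady workload (hypothetical helper)."""
    write_bytes = records_per_sec * avg_record_bytes
    by_write_bytes = math.ceil(write_bytes / WRITE_BYTES_PER_SEC)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    # Every consumer reads the full stream, so read volume scales with consumers.
    by_read_bytes = math.ceil(write_bytes * consumers / READ_BYTES_PER_SEC)
    # A stream always has at least one shard; the tightest limit wins.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


if __name__ == "__main__":
    # Many small records: the records-per-second limit dominates.
    print(shards_needed(records_per_sec=5000, avg_record_bytes=500))
    # Three consumers sharing read throughput: the read limit dominates.
    print(shards_needed(records_per_sec=100, avg_record_bytes=50_000, consumers=3))
```

The result is the largest of the three per-limit estimates, since a stream has to satisfy all of the limits at once.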

service-kinesis-004; {{c1::AWS Kinesis Data Analytics}} allows you to run serverless SQL queries on data flowing through a Kinesis streaming pipeline.

service-kinesis-005; SQS is simpler than Kinesis but doesn't offer real-time processing. Kinesis is more complex to configure but offers {{c1::real-time}} processing and is typically used for big data applications.

service-kinesis-006; AWS Kinesis can store data for up to {{c1::a year}}.

service-athena-001; {{c1::AWS Athena}} is a serverless interactive query service that makes it easy to analyse data in S3 using SQL.

service-glue-001; {{c1::AWS Glue}} is a serverless data integration service that makes it easy to discover, prepare and combine data. It allows you to perform ETL workloads without managing underlying services.

service-glue-002; AWS Athena and AWS Glue can be used together. AWS Glue provides crawlers which infer {{c1::schema}} from unstructured data. AWS Glue also offers the Glue Data Catalog which is basically a {{c1::schema}} catalogue for Glue. Athena can then use the Glue Data Catalog to direct queries at the data in an {{c1::S3 bucket}}.
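
As a rough illustration of how Athena is pointed at a Glue Data Catalog database, the sketch below builds the keyword arguments for boto3's Athena `start_query_execution` call. The helper `build_athena_request`, the database `logs_db`, the table `access_logs` and the results bucket are all hypothetical names used only for illustration. The actual AWS call is left commented out so the snippet runs without credentials.

```python
def build_athena_request(database, table, output_location, limit=10):
    """Build kwargs for boto3's athena.start_query_execution (hypothetical helper)."""
    return {
        # Standard SQL; the table schema comes from the Glue Data Catalog.
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        # The Glue Data Catalog database that holds the table definition.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


if __name__ == "__main__":
    request = build_athena_request(
        database="logs_db",                               # assumed Glue database
        table="access_logs",                              # assumed crawled table
        output_location="s3://example-athena-results/",   # assumed results bucket
    )
    print(request["QueryString"])
    # With AWS credentials configured, the request could be submitted like this:
    # import boto3
    # athena = boto3.client("athena")
    # response = athena.start_query_execution(**request)
```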

service-quicksight-001; {{c1::AWS QuickSight}} is a fully managed business intelligence data visualisation service for creating dashboards.

service-elasticsearch-001; {{c1::AWS Elasticsearch}} Service (renamed Amazon OpenSearch Service in 2021) is a fully managed version of the open-source application Elasticsearch. It allows you to quickly search over stored data and analyse the results. It's commonly used to analyse logs as part of an {{c1::Elasticsearch, Logstash and Kibana (ELK)}} stack.

