2022-02-20 Chapter 15 - Big Data
service-redshift-001; {{c1::AWS Redshift}} is a fully managed cloud data warehouse which can store up to 16 PB of data.
service-redshift-002; AWS Redshift is a relational database suitable for {{c1::OLAP (BI)}} use cases. It's unsuitable for {{c1::OLTP}} use cases.
service-redshift-003; AWS Redshift only supports single-{{c1::AZ}} deployments and is not highly-available by default.
service-emr-001; {{c1::AWS Elastic MapReduce (EMR)}} is a managed big data platform that allows you to process data using open-source tools like Spark or Hive.
service-emr-002; AWS Elastic MapReduce (EMR) can be thought of as an {{c1::ETL}} tool running across a fleet of EC2 instances in a standard VPC.
service-emr-003; AWS EMR can be used with {{c1::RIs}} and {{c1::Spot}} Instances to reduce your costs.
service-kinesis-001; {{c1::AWS Kinesis}} allows you to ingest, process and analyse real-time streaming data.
service-kinesis-002; There are two forms of Kinesis: 1) {{c1::Kinesis Data Streams}}, which is real-time and leaves users responsible for creating consumers and scaling, and 2) {{c1::Kinesis Data Firehose}}, which is near real-time (within 60 seconds), is plug and play with other AWS services and handles scaling itself (i.e. it is much simpler to manage than Kinesis Data Streams).
service-kinesis-003; AWS Kinesis Data Streams comprise 1) {{c1::shards}} which have fixed capacity and must be planned and 2) {{c1::consumers}} which are custom applications using the Kinesis SDK running on EC2.
service-kinesis-004; {{c1::AWS Kinesis Data Analytics}} allows you to run serverless SQL queries on data flowing through a Kinesis streaming pipeline.
service-kinesis-005; SQS is simpler than Kinesis but doesn't offer real-time processing. Kinesis is more complex to configure but offers {{c1::real-time}} processing and is typically used for big data applications.
service-kinesis-006; AWS Kinesis can store data for up to {{c1::a year}}.
service-athena-001; {{c1::AWS Athena}} is a serverless interactive query service that makes it easy to analyse data in S3 using SQL.
service-glue-001; {{c1::AWS Glue}} is a serverless data integration service that makes it easy to discover, prepare and combine data. It allows you to perform ETL workloads without managing underlying services.
service-glue-002; AWS Athena and AWS Glue can be used together. AWS Glue provides crawlers which infer {{c1::schema}} from unstructured data. AWS Glue also offers the Glue Data Catalog which is basically a {{c1::schema}} catalogue for Glue. Athena can then use the Glue Data Catalog to direct queries at the data in an {{c1::S3 bucket}}.
service-quicksight-001; {{c1::AWS QuickSight}} is a fully managed business intelligence data visualisation service for creating dashboards.
service-elasticsearch-001; {{c1::AWS Elasticsearch}} Service is a fully managed version of the open-source application Elasticsearch. It allows you to quickly search over stored data and analyse the data you get back. It's commonly used to analyse logs as part of an {{c1::Elasticsearch, Logstash and Kibana (ELK)}} stack.
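
Worked example for service-kinesis-003: because shards have fixed capacity, the shard count has to be estimated up front. The sketch below is a rough planning aid, not an official AWS formula. It assumes the commonly quoted per-shard write limits for provisioned Kinesis Data Streams (about 1 MB/s and 1,000 records/s); the function name `shards_needed` and both constants are hypothetical names chosen for this illustration, and the limits should be checked against the current AWS quotas before relying on them.

```python
import math

# Assumed per-shard write limits (verify against current AWS quotas).
SHARD_WRITE_BYTES_PER_SEC = 1_000_000   # roughly 1 MB/s per shard
SHARD_WRITE_RECORDS_PER_SEC = 1_000     # 1,000 records/s per shard


def shards_needed(records_per_sec: float, avg_record_bytes: float) -> int:
    """Estimate how many shards a given write workload needs.

    The stream must satisfy both the record-count limit and the
    byte-throughput limit, so the larger of the two estimates wins.
    """
    by_count = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    by_bytes = math.ceil(
        records_per_sec * avg_record_bytes / SHARD_WRITE_BYTES_PER_SEC
    )
    # A stream always has at least one shard.
    return max(1, by_count, by_bytes)


if __name__ == "__main__":
    # Many small records: the record-count limit dominates.
    print(shards_needed(2500, 200))
    # Fewer large records: the byte-throughput limit dominates.
    print(shards_needed(500, 5000))
```

Taking the maximum of the two estimates reflects that a shard is throttled as soon as either limit is hit. Read throughput and consumer fan-out are ignored here, so a real plan would need extra headroom.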