
Every day, we generate mind-boggling amounts of data. Given that this trend is here to stay, our job as data engineers is to constantly innovate new ways to process, store, and analyse data.
Data streaming is undoubtedly part of the solution, and fortunately, sophisticated tools for working with streaming data, such as those provided by Amazon Web Services (AWS), are now available. Continue reading for an overview of data streaming as well as some of the most popular AWS streaming tools.
The process of continuously collecting and processing incoming data is referred to as "data streaming." This incoming data can be any business-relevant event that occurs at a specific time, such as a log entry generated each time someone makes a purchase in an online store. A data stream, then, is a continuous and unbroken flow of such events.
If this sounds vaguely familiar, it's because "data streaming" has been around for a long time, though it wasn't always called that. Many related systems, such as the transaction processing systems used in banks or tools like the Windows Task Manager, which is used for real-time monitoring of computer performance, are based on the concept of taking actions in response to incoming data. What is new is the tooling, and modern data engineering tools take extensive study to master fully.
There are two ways to process data: in batches or continuously on arrival. The former method, known as batch processing, delivers chunks of data according to predefined rules, such as once a night or once a certain number of data points has accumulated. Continuous processing, on the other hand, emphasises immediate delivery: each piece of data is transmitted as soon as it is generated.
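The difference between the two modes can be sketched in a few lines of Python. This is only an illustration under assumed names: the functions `process_in_batches` and `process_continuously` and the sample `purchases` events are hypothetical and do not come from any particular library.

```python
from typing import Callable, Iterable, Iterator, List


def process_in_batches(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Collect events and hand them over only once batch_size is reached."""
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # deliver the leftover partial batch at the end
        yield batch


def process_continuously(events: Iterable[dict], handle: Callable[[dict], None]) -> None:
    """Pass each event to the handler as soon as it arrives."""
    for event in events:
        handle(event)


# Hypothetical purchase events standing in for an incoming stream.
purchases = [{"order_id": i, "amount": 10 * i} for i in range(1, 6)]

batches = list(process_in_batches(purchases, batch_size=2))

seen: List[int] = []
process_continuously(purchases, lambda e: seen.append(e["order_id"]))
```

In the batch version, the first event waits until a second one arrives before anything downstream sees it; in the continuous version, the handler runs once per event with no waiting.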
Batch processing was the dominant approach prior to big data, and it is still used today, particularly for businesses that do not rely on time-sensitive data. However, we are now witnessing a paradigm shift as the business community realises that real-time insights are far more valuable than delayed insights.
When compared to traditional, batch-operated data warehouses, the benefits of real-time data streaming are most apparent. The main advantage of data streaming has already been mentioned: the low latency between updates. Being able to extract learnings and insights from production data on the same day can be far more valuable for a technology-driven business than getting those same learnings even a few days or weeks later. Analysts and product managers can collaborate with developers to create alerts based on metrics and behaviours in their applications.
Another advantage is that data architectures become simpler. When we zoom out and consider data warehouses, we notice that they have characteristics similar to a delayed data stream. Traditionally, businesses saved data from various sources across multiple databases, which were linked together via APIs or other types of point-to-point connections. Data from these databases would then be extracted, transformed, and loaded (ETL) into a data warehouse on a regular basis.
Such disjointed architectures clearly have significant scalability issues — so significant, in fact, that they have earned the moniker "spaghetti architectures," evoking images of labyrinth-like architecture with tangled components. Indeed, synchronising a large number of different parts is extremely difficult. Real-time data streaming architecture proposes a single, logically unified data source as a solution. This has a significant impact on how data flows through organisations.
Businesses are interested in data streaming primarily because of the actions that can be taken based on the data that they receive and process, both in real-time and after the fact. Data streaming, for example, enables businesses to:
The following section will provide an overview of AWS streaming solutions. Before diving into these concrete tools, read this article for a more in-depth look at data streaming.
AWS Kinesis is a fully managed and scalable real-time data streaming service from Amazon Web Services (AWS). That's a lengthy definition, so here's a breakdown:
Kinesis can handle any kind of real-time data, such as audio, video, logs, and clickstreams. It's divided into several services; we'll go over Data Streams, Data Firehose, and Data Analytics below.
Kinesis Data Streams (KDS) is the main AWS streaming service for capturing, processing, and storing data streams. Because it gives you direct control over how the stream is consumed and scaled, it is intended for developers who want to create applications that require custom processing and analysis of streaming data. KDS works with data from single or multiple sources, such as EC2 instances and traditional servers, typically with sub-second latency. We're talking about real-time reading of gigabytes of incoming data!
Data streams in KDS can be read with the AWS command-line interface (AWS CLI), the official AWS software development kit (AWS SDK), or the Kinesis Client Library (KCL), which supports Java and Python. Kinesis can also be used with Apache Storm and Apache Spark.
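As a rough sketch of what SDK access looks like, the helpers below wrap the `put_record`, `get_shard_iterator`, and `get_records` calls of the boto3 Kinesis client. The helper names, the stream name `purchases`, the partition key, and the shard ID are assumptions made for illustration. The client is passed in as an argument, so the functions themselves do not require AWS credentials until they are called against a real stream.

```python
import json
from typing import Any, Dict, List


def put_event(client: Any, stream_name: str, event: Dict[str, Any], key: str) -> Dict[str, Any]:
    """Write one event to a stream; records with the same key go to the same shard."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=key,
    )


def read_events(client: Any, stream_name: str, shard_id: str, limit: int = 100) -> List[Dict[str, Any]]:
    """Read one page of records, starting from the oldest record still in the shard."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    response = client.get_records(ShardIterator=iterator, Limit=limit)
    return [json.loads(record["Data"]) for record in response["Records"]]


# Usage against a real stream (requires boto3 and AWS credentials;
# the stream name, key, and shard ID below are hypothetical):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   put_event(kinesis, "purchases", {"order_id": 1}, key="customer-42")
#   print(read_events(kinesis, "purchases", "shardId-000000000000"))
```

A long-running consumer would keep polling with the `NextShardIterator` value returned by each `get_records` call; this sketch reads a single page only.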
Kinesis Data Firehose is a fully managed AWS streaming service that delivers data streams to other AWS products, such as S3 buckets or Amazon Redshift, where the data can be further assessed and analysed. Firehose allows data to be transformed with AWS Lambda as it travels to its destination.
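A transformation function of this kind might look like the sketch below. It assumes the incoming records are JSON objects; the added `processed` field is a made-up enrichment for illustration. Firehose passes records to the function base64-encoded and expects each one back with the same `recordId`, a `result` status, and the re-encoded `data`.

```python
import base64
import json


def lambda_handler(event, context):
    """Add a field to every JSON record and hand the batch back to Firehose."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # hypothetical enrichment step
        # Newline-delimit records so they stay separable at the destination.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must match the incoming ID
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}
```

A real function would probably also return `"ProcessingFailed"` as the result for records it cannot parse, rather than letting the whole invocation fail.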
It's critical to understand the difference between Data Streams and Data Firehose.
Both Data Streams and Data Firehose are used for streaming data, but while the former retains records (for 24 hours by default, extendable up to 365 days), the latter does not store data itself and only delivers it to a destination. Another significant difference is latency: KDS makes records available to consumers almost immediately, whereas Data Firehose buffers records before delivery, so it has somewhat higher latency and may be better suited to applications that do not require real-time data ingestion.
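One way to picture where Firehose's extra latency comes from is a delivery buffer that flushes when either a size threshold or a time threshold is reached, which is how Firehose's buffer size and buffer interval settings behave conceptually. The class below is a simplified model written for this article, not Firehose's actual implementation; `DeliveryBuffer` and its parameters are hypothetical names.

```python
import time
from typing import Callable, List


class DeliveryBuffer:
    """Hold records until a size or age threshold is reached, then flush them together."""

    def __init__(self, max_bytes: int, max_seconds: float,
                 deliver: Callable[[List[bytes]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.deliver = deliver
        self.clock = clock
        self.records: List[bytes] = []
        self.size = 0
        self.started = clock()

    def add(self, record: bytes) -> None:
        if not self.records:
            self.started = self.clock()  # the interval starts with the first buffered record
        self.records.append(record)
        self.size += len(record)
        # Simplification: age is only checked when a record arrives (no background timer).
        if self.size >= self.max_bytes or self.clock() - self.started >= self.max_seconds:
            self.flush()

    def flush(self) -> None:
        if self.records:
            self.deliver(self.records)
        self.records = []
        self.size = 0


delivered: List[List[bytes]] = []
buffer = DeliveryBuffer(max_bytes=10, max_seconds=60.0, deliver=delivered.append)
for chunk in [b"aaaa", b"bbbb", b"cccc", b"dd"]:
    buffer.add(chunk)
buffer.flush()  # deliver whatever is left at shutdown
```

The first two records sit in the buffer until the third pushes it past the size limit, which is exactly the delay a consumer reading directly from a stream would not see.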
In general, Data Streams provides greater flexibility. However, if your application does not require custom processing and coding, Firehose easily outperforms Data Streams in terms of usability. In addition, Data Firehose scales automatically, whereas Data Streams in provisioned mode requires you to manage shard capacity yourself. Finally, the two tools can be used in tandem. Data from Data Streams, for example, can be fed into Data Firehose for permanent storage after processing.
Kinesis Data Analytics is an AWS streaming analytics tool for real-time data analysis. It is independent of the streaming source and can be connected to both Data Streams and Data Firehose. Data Analytics enables you to write standard SQL and query incoming streaming data in much the same way that you would query a relational database.
It is also possible to write a series of SQL statements in which the previous statement's output stream is fed as an input to the next statement. The output streams can then be routed to a destination for additional analysis, visualisations, or dashboard updates.
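The chaining idea, where each step consumes the output stream of the previous one, can be illustrated with Python generators. This is an analogy rather than Kinesis Data Analytics SQL: the stage names, the event fields (`ts`, `type`, `product`), and the sample data are all hypothetical, and the second stage assumes events arrive ordered by timestamp.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def only_purchases(events: Iterable[Dict]) -> Iterator[Dict]:
    """Stage 1: keep purchase events, much like a filtering SELECT ... WHERE."""
    for event in events:
        if event["type"] == "purchase":
            yield event


def count_per_window(events: Iterable[Dict], window_seconds: int) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Stage 2: count events per product in fixed, non-overlapping time windows.

    Assumes events arrive ordered by timestamp.
    """
    current_window = None
    counts: Counter = Counter()
    for event in events:
        window = event["ts"] // window_seconds * window_seconds
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)  # the previous window is complete
            counts = Counter()
        current_window = window
        counts[event["product"]] += 1
    if current_window is not None:
        yield current_window, dict(counts)  # emit the final, still-open window


clickstream = [
    {"ts": 1, "type": "purchase", "product": "book"},
    {"ts": 5, "type": "view", "product": "pen"},
    {"ts": 9, "type": "purchase", "product": "book"},
    {"ts": 12, "type": "purchase", "product": "pen"},
]

# The output stream of stage 1 is the input stream of stage 2.
results = list(count_per_window(only_purchases(clickstream), window_seconds=10))
```

Each stage only sees what the previous one emits, so further steps (for example, keeping only windows above a threshold) can be appended without changing the earlier ones.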