Apache Flink Series 2 — Stream Processing Fundamentals

  • |
  • 15 February 2020
Post image

In this post, I am going to explain some terms about stream processing and also terms used in Apache Flink

You may see the all my notes about Apache Flink with this link

State

  • At a high level, we can consider state as memory in operators in Flink that remembers information about past input and can be used to influence the processing of future input.
  • State is data information generated during computations that plays a very important role in fault tolerance, failure recovery and checkpoints in Apache Flink.

Operators

  • Operators transform one or more data streams to another data streams (this could be the either same type of data stream of different type of data stream).
  • A few operators form the dataflow graph which represent the program.

Dataflow Graphs

  • A dataflow program describe how data flows between operations
  • Dataflow program are commonly represented as directed graphs where nodes are called operators (represent computations) and edges represent data dependencies
DataFlowGraph

Data Sources

  • Operation without input ports are called data sources.
  • A dataflow graph must have at least one data source
  • In the DataFlow Graph, Tweet source is a data source for that dataflow.

Data Sinks

  • Operation without output ports are called data sinks.
  • A dataflow graph must have at least one data source.
  • In the DataFlow Graph, Trending topics sink is a data sink for that dataflow.

Logical Dataflow(JobGraph) and Physical Dataflow(ExecutionGraph)

  • Logical dataflow represent a high level view of the program. In logical dataflow the nodes are operators the edges indicate input/output-relationships or data streams or data sets.
  • In order to execute a dataflow program, its logical graph is converted into a physical dataflow which specifies in detail how the program is executed.
  • In physical dataflow graph, the nodes represent tasks.
PhysicalDataFlowGraph

Data Parallelism and Task Parallelism

  • We can exploit parallelism in dataflow in different ways.
  • Data parallelism: We divide our input data and we have tasks of the same operation execute on the data subsets in parallel.
    • Using data parallelism we can process large volumes of data and spreading the computation load across several computing nodes
  • Task parallelism: We can have tasks from different operators performing computations same or different data in parallel.
    • Using task parallelism we can better utilize the computing resources of cluster.

Data Exchange Strategies

  • Data exchange strategies define how the data items are assigned to tasks in a physical dataflow graph
  • This strategy can be chosen automatically by the execution engine or explicitly imposed by the programmer.
  • Here is the common data exchange strategies:
    • Forward Strategy sends data from a task to a receiving task. If both tasks are located on the same physical machine(done by task schedulers), this exchange strategy avoids network communication.
    • Broadcast Strategy sends every data item to all parallel tasks of an operator. Because this strategy replicates data and involves network communication, it is fairly expensive.
    • Key-based Strategy divide data by a key attribute and guarantees that data items having the same key will be processed by the same tasks
    • Random Strategy uniformly distributes data items to operator tasks in order to evenly distribute the load across computing tasks
data_exchange_strategies.png

Data ingestion and Data Egress

  • They allow the stream processor to communicate with external systems.
  • Data ingestion is the operation of fetching raw data from external sources and converting it into a format suitable for processing.
    • Operators that implement data ingestion logic are called data sources.
    • A data source can ingest data from a TCP socket, a file, a Kafka topic or a sensor data interface.
  • Data egress is the operation of producing output in a form of suitable for consumption by external systems.
    • Operators that perform data egress are called data sinks.
    • Examples: files, databases, message queues and monitoring interfaces.

There is also “operations on data stream” such as window operations, transformation operations etc.. I am going to explain these operations in the later post(s)…

Note: If you want to see all terms about Apache Flink, you may go to the https://ci.apache.org/projects/flink/flink-docs-release-1.10/concepts/glossary.html#Record

You May Also Like