Batch Processing and Stream Processing (Async & Sync Messaging)

Hanwen Zhang
4 min readJun 25, 2022

An overview of Batch, Stream, Messaging, and WebSocket Connection

Batching Processing (Offline System)

  • Takes a large amount of input data, and uses a background job to process them in a batch, and then sends it out maybe the next day
  • Fault tolerance—failed tasks will be recalled and re-executed
  • Idempotent operations—multiple executions of a job only performed once.
  • Common examples: backup systems, compressing logs

MapReduce —a batch processing algorithm

  • Split petabytes of data into smaller chunks for a specific task, and distribute those chunks in parallel processing among server machines
  • Map, Shuffle, and Reduce

Fault Tolerance in Batch Processing

  • If a job fails simply batch processor restart that job.
  • If a job is getting too big, we can create small parts called micro-batches

Stream

  • Involve with buffer object (raw data, bits, and bytes)
  • The inbound stream of data is an input stream (keyboard).
  • Write out to a file, and use an output stream.
  • A pipe is a destination to write to (flow data from one object to another one). readStream.pipe(writableStream).

Stream Processing (Near Real-Time System)

  • Asynchronous communication
  • Synchronous communication

Batch Processing v.s. Stream Processing

  • Batch Processing: convert the target data file into a set of records, a data can be created by multiple jobs.
  • Stream Processing: convert the target data file into a set of events, events are created by one single producer (publisher).

The publisher produces the data and the Subscriber consumes the data.

Resolve Frequent Polling — by enabling the publisher to notify the subscriber whenever the information is available.

Fault Tolerance in Stream Processing

  • Use checkpoint ID or use a time to label stream jobs and their order
  • If a job fails, reverse back to checkpoints

Messaging

Asynchronous Messaging

queue up a certain amount of responses and then send them in a batch, the publisher will produce data that is put in a message bus, subscribers are not directly getting the message from the publisher, instead, get the message from a message bus

Message Bus — a sort of buffer that we put messages there, and pull the message when they are available to use

RabbitMQ — Software when the app needs to transfer messages between different software components

What happens when a receiver can not keep up with the sender?

  • Drop Message (best way to deal with video data is to drop some packets)
  • Buffer Messages in a Queue
  • Flow Control (when the slow consumer can not cope with the faster producer, simply block, can not receive any more data)

Multiple Receiver Patterns

  • Load balancing — add more receivers to process messages in parallel
  • Fan-out —broadcast to several receivers without affecting each other

Synchronous Messaging

  • Communication between two software components that one component is waiting for another component to respond.
  • Example: email and web messages

Support Protocols

  • Packet-based messaging (UDP — User Datagram Protocol for data transfer)
  • Stream-based messaging (TCP — Transfer Control Protocol for data transfer)

UDP v.s. TCP

  • UDP is much faster than TCP
  • UDP has no guarantee of all message delivery => good for distributing videos

Socket Connection — Server v.s. Client

Server:

  • The first step is to create a server socket Socket socket = serversocket.accept();
  • After creating the socket, the server should start to wait for the connection Socket socket = new Socket(servername, port);

Client:

  • The client should request a server for a connection with the server name and port ServerSocket serversocket = new ServerSocket(9000);

--

--

Hanwen Zhang

Full-Stack Software Engineer at a Healthcare Tech Company | Document My Coding Journey | Improve My Knowledge | Share Coding Concepts in a Simple Way