Batch Processing and Stream Processing (Async & Sync Messaging)
--
An overview of Batch, Stream, Messaging, and WebSocket Connection
Batching Processing (Offline System)
- Takes a large amount of input data, and uses a background job to process them in a batch, and then sends it out maybe the next day
- Fault tolerance—failed tasks will be recalled and re-executed
- Idempotent operations—multiple executions of a job only performed once.
- Common examples: backup systems, compressing logs
MapReduce —a batch processing algorithm
- Split petabytes of data into smaller chunks for a specific task, and distribute those chunks in parallel processing among server machines
- Map, Shuffle, and Reduce
Fault Tolerance in Batch Processing
- If a job fails simply batch processor restart that job.
- If a job is getting too big, we can create small parts called micro-batches
Stream
- Involve with buffer object (raw data, bits, and bytes)
- The inbound stream of data is an input stream (keyboard).
- Write out to a file, and use an output stream.
- A pipe is a destination to write to (flow data from one object to another one). readStream.pipe(writableStream).
Stream Processing (Near Real-Time System)
- Asynchronous communication
- Synchronous communication
Batch Processing v.s. Stream Processing
- Batch Processing: convert the target data file into a set of records, a data can be created by multiple jobs.
- Stream Processing: convert the target data file into a set of events, events are created by one single producer (publisher).
The publisher produces the data and the Subscriber consumes the data.
Resolve Frequent Polling — by enabling the publisher to notify the subscriber whenever the information is available.
Fault Tolerance in Stream Processing
- Use checkpoint ID or use a time to label stream jobs and their order
- If a job fails, reverse back to checkpoints
Messaging
Asynchronous Messaging
queue up a certain amount of responses and then send them in a batch, the publisher will produce data that is put in a message bus, subscribers are not directly getting the message from the publisher, instead, get the message from a message bus
Message Bus — a sort of buffer that we put messages there, and pull the message when they are available to use
RabbitMQ — Software when the app needs to transfer messages between different software components
What happens when a receiver can not keep up with the sender?
- Drop Message (best way to deal with video data is to drop some packets)
- Buffer Messages in a Queue
- Flow Control (when the slow consumer can not cope with the faster producer, simply block, can not receive any more data)
Multiple Receiver Patterns
- Load balancing — add more receivers to process messages in parallel
- Fan-out —broadcast to several receivers without affecting each other
Synchronous Messaging
- Communication between two software components that one component is waiting for another component to respond.
- Example: email and web messages
Support Protocols
- Packet-based messaging (UDP — User Datagram Protocol for data transfer)
- Stream-based messaging (TCP — Transfer Control Protocol for data transfer)
UDP v.s. TCP
- UDP is much faster than TCP
- UDP has no guarantee of all message delivery => good for distributing videos
Socket Connection — Server v.s. Client
Server:
- The first step is to create a server socket
Socket socket = serversocket.accept();
- After creating the socket, the server should start to wait for the connection
Socket socket = new Socket(servername, port);
Client:
- The client should request a server for a connection with the server name and port
ServerSocket serversocket = new ServerSocket(9000);