
The 5-second data delay that broke us (and how we fixed it in milliseconds)

Less than a month ago, our team took on a task that looked deceptively simple: build a dashboard that tracks transactions for a financial platform. In principle, it was straightforward – event data would land in DynamoDB, and OpenSearch would power the real-time analytics.

The expectation was simple: every event written to DynamoDB should be ready for analysis in OpenSearch instantly. No delay. No unnecessary waiting.

And this is where we got it completely wrong.

When “real time” isn’t really real time

I still remember our CTO’s reaction during the first demo. “Why is there such a delay?” he asked, surprised, pointing at the dashboard that was showing metrics from almost 5 seconds earlier.

We had promised latency under 1 second. What we delivered was a system that lagged 3-5 seconds during traffic fluctuations, and sometimes worse. In financial monitoring, a delay like that might as well be hours. Proactive or reactive? Catching a system error as it happens, or learning about it from customer complaints afterwards? That is the real difference.

We needed some hard changes, and we needed them fast.

Our first attempt: the traditional batch approach

Just like the teams that came before us, we leaned on the familiar pattern (a rough sketch of the setup follows the list):

  1. Schedule an AWS Lambda job to run every 5 seconds.
  2. Have it fetch the new records written to DynamoDB since the last run.
  3. Collect those updates and group them into batches.
  4. Push the batches to OpenSearch for indexing.
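For context, the whole thing fit into a single scheduled Lambda. The sketch below is simplified and uses placeholder names – the table and index names are made up, and the scan-by-timestamp filter stands in for whatever query pattern you actually have:

import json
import time

import boto3
import requests
from boto3.dynamodb.conditions import Attr

# Placeholder endpoint and names for illustration only.
OPENSEARCH_URL = "https://your-opensearch-domain.com"
INDEX_NAME = "metrics"

table = boto3.resource("dynamodb").Table("RealTimeMetrics")

def lambda_handler(event, context):
    # Grab everything written during the last polling window (5 seconds).
    cutoff = int(time.time()) - 5
    items = table.scan(FilterExpression=Attr("timestamp").gte(cutoff)).get("Items", [])

    # Build an OpenSearch _bulk body: one action line plus one document line per item.
    lines = []
    for item in items:
        lines.append(json.dumps({"index": {"_index": INDEX_NAME, "_id": str(item["id"])}}))
        lines.append(json.dumps(item, default=str))  # default=str copes with DynamoDB Decimals

    if lines:
        requests.post(
            f"{OPENSEARCH_URL}/_bulk",
            headers={"Content-Type": "application/x-ndjson"},
            data="\n".join(lines) + "\n",
        )

    return {"statusCode": 200}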

It worked. Sort of. The problem is that “works” and “works well” are two entirely different things.

Things fell apart at an impressive rate:

  • Built-in delay: new data sat in DynamoDB for up to 5 seconds, waiting for the next batch run before it could move.

  • Indexing lag: the bulk requests put heavy load on OpenSearch, which slowed everything down further.

  • Reliability problems: a crash mid-job meant every update in that batch was lost.

During one especially frustrating incident, our system missed critical error events because they were stuck in a failed batch. By the time the problem was diagnosed, thousands of transactions had already passed through the system.

“This is not sustainable, for the love of God,” he said after yet another incident report. “We need to fundamentally change the system and stream these updates live.”

He was absolutely right.

The solution: real-time streaming updates

I had been digging through AWS documentation for a while when the solution hit me: DynamoDB Streams.

What if, instead of pulling updates in scheduled batches, we could capture and process every change made to the DynamoDB table as it happened?

This completely changed how the pipeline worked, for the better:

  • Enable DynamoDB Streams to capture every insert, modify, and remove on the table.

  • Attach an AWS Lambda function to process those change events (see the wiring sketch after this list).

  • Push the updates to OpenSearch, doing only light transformation of the data along the way.
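The piece that list glosses over is how the Lambda actually gets invoked: an event source mapping ties the stream to the function. A minimal boto3 sketch of that wiring – the stream ARN below is a placeholder, and “guard” is the listener function described later:

import boto3

lambda_client = boto3.client("lambda")

# Connect the DynamoDB stream to the listener Lambda so every change event invokes it.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:REGION:ACCOUNT_ID:table/RealTimeMetrics/stream/LABEL",
    FunctionName="guard",         # the listener function described below
    StartingPosition="LATEST",    # only process changes from now on
    BatchSize=1,                  # small batches keep per-event latency low
)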

In my first tests, the results were incredible. Latency fell from 3-5 seconds to under 500 ms. I will never forget the message I sent to the team at 3 a.m.: “It’s responding. It’s actually responding.”

Making it work

This wasn’t a homework assignment or a design exercise where a proof-of-concept pipeline would have been enough. During one of many sleepless, coffee-fueled nights, we broke the problem down into three parts. The first was getting notified about changes in DynamoDB.

Getting notified about changes in DynamoDB

How do you even know that something has changed in DynamoDB? After a little googling, I discovered that we needed to enable DynamoDB Streams. It turned out to be a single CLI command, although getting there was painful for me.

aws dynamodb update-table \
    --table-name RealTimeMetrics \
    --stream-specification StreamEnabled=true,StreamViewType=NEW_IMAGE

I still remember how excited I was when I messaged my colleague at 23:00: “It’s working! The table is publishing every change that goes through it!”
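To double-check that the stream really is enabled, you can read the table description back with boto3 – a quick sanity check rather than anything we kept in the codebase:

import boto3

dynamodb = boto3.client("dynamodb")

# Read the stream settings back to confirm the update took effect.
table = dynamodb.describe_table(TableName="RealTimeMetrics")["Table"]
print(table["StreamSpecification"])  # expect: {'StreamEnabled': True, 'StreamViewType': 'NEW_IMAGE'}
print(table["LatestStreamArn"])      # the ARN the Lambda trigger points at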

The change event listener

Now that the stream was active, we needed something to capture those events. So we created a Lambda function that I decided to call “guard”. It waits for DynamoDB to report a change and acts on it as soon as it arrives.

import json
import boto3
import requests

OPENSEARCH_URL = "https://your-opensearch-domain.com"
INDEX_NAME = "metrics"
HEADERS = {"Content-Type": "application/json"}

def lambda_handler(event, context):
    records = event.get("Records", [])

    for record in records:
        if record["eventName"] in ["INSERT", "MODIFY"]:
            new_data = record["dynamodb"]["NewImage"]
            doc_id = new_data["id"]["S"]  # Primary key
              
            # Convert DynamoDB format to JSON
            doc_body = {
                "id": doc_id,
                "timestamp": new_data["timestamp"]["N"],
                "metric": new_data["metric"]["S"],
                "value": float(new_data["value"]["N"]),
            }

            # Send update to OpenSearch
            response = requests.put(f"{OPENSEARCH_URL}/{INDEX_NAME}/_doc/{doc_id}", 
                                   headers=HEADERS, 
                                   json=doc_body)
            print(f"Indexed {doc_id}: {response.status_code}")

    return {"statusCode": 200, "body": json.dumps("Processed Successfully")}

It looks simple now, but writing this code wasn’t that easy – it took three attempts to get it working. Our first version kept timing out because we were ignoring the record format that DynamoDB Streams uses.
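If you hit the same wall: stream records arrive in DynamoDB’s typed JSON ({"S": ...}, {"N": ...}), and boto3 ships a TypeDeserializer that unpacks it for you. A small sketch of that approach, not the exact code we ended up shipping:

from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def from_stream_image(image):
    """Convert a typed stream image into plain Python values (numbers come back as Decimal)."""
    return {key: deserializer.deserialize(value) for key, value in image.items()}

# Example: from_stream_image({"id": {"S": "tx-1"}, "value": {"N": "42.5"}})
# returns {"id": "tx-1", "value": Decimal("42.5")}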

Teaching OpenSearch to keep up

This last piece was the hardest one to solve, and it nearly caught us out. Even though we were pushing updates to OpenSearch immediately, they still weren’t showing up in real time. It turns out OpenSearch makes new documents searchable only after a periodic index refresh, which it batches for efficiency.

“This doesn’t make any sense,” he groaned. “We’re sending real-time data and it’s not showing in real time!”

curl -X PUT "https://your-opensearch-domain.com/metrics/_settings" \
    -H 'Content-Type: application/json' \
    -d '{ "index": { "refresh_interval": "500ms", "number_of_replicas": 1 } }'

After some research and trial and error, we found the setting we needed to change, and it made a big difference. Instead of waiting for OpenSearch’s default refresh cycle, this tells it to make new data searchable within half a second. I was about to jump out of my seat when I watched the first event appear on our dashboard moments after it was created in DynamoDB.

Results: from seconds to milliseconds

The first week of running the new system in production taught us a lot. The dashboard was no longer a delayed approximation of reality; it was genuinely live.

We achieved:

  • Average latency under 500 ms (down from 3-5 seconds)

  • No more batch delays – changes propagated instantly

  • Zero indexing bottlenecks – smaller, more frequent updates turned out to be more efficient

  • Better overall system resilience – no more all-or-nothing batch failures

When we shared the updated dashboard with our leadership, they noticed the change immediately. The difference was obvious; our CTO said, “This is what we needed from the beginning.”

Scaling further: handling traffic spikes

The new approach handled normal traffic well but struggled during extreme usage spikes. At the end of each day, during reconciliation periods, the event rate would jump from hundreds to thousands.

To absorb these spikes, we added Amazon Kinesis Data Firehose as a buffer. Instead of having the Lambda send each update directly to OpenSearch, we routed the data through Firehose:

import json

import boto3

firehose_client = boto3.client("firehose")

# Hand the document off to Firehose instead of writing to OpenSearch directly.
firehose_client.put_record(
    DeliveryStreamName="MetricsStream",
    Record={"Data": json.dumps(doc_body)}
)

Firehose scaled automatically with the throughput and took care of delivery to OpenSearch, without compromising the real-time character of the pipeline.

Lessons learned: the race for speed never ends

Working on real-time data systems taught us that reducing latency is an endless war. Even with our metrics down to 500 ms, we are still pushing further:

  • Moving transformation steps into OpenSearch ingest pipelines instead of Lambda (see the sketch after this list)

  • Using AWS ElastiCache to serve frequent queries faster

  • Edge computing for users spread around the world
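None of these are in place yet, so take this as a sketch of the first idea only: an OpenSearch ingest pipeline that does the type conversion at index time so the Lambda can shrink further. The pipeline name and field here are hypothetical:

import requests

# Hypothetical ingest pipeline that converts the "value" field to a float at ingestion time.
pipeline = {
    "description": "Convert metric values during ingestion",
    "processors": [
        {"convert": {"field": "value", "type": "float"}},
    ],
}

requests.put(
    "https://your-opensearch-domain.com/_ingest/pipeline/metrics-pipeline",
    headers={"Content-Type": "application/json"},
    json=pipeline,
)

Documents indexed with ?pipeline=metrics-pipeline would then get the conversion applied server-side.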

Every microsecond counts in financial monitoring. As one of our senior engineers put it, “Monitoring is not a comfortable place to be; you’re either ahead of a problem or behind it.”

What did you try?

Have you tackled real-time data problems of your own? What did you do? I’m eager to hear about your adventures at the cutting edge with AWS.
