Apache Kafka is a widely recognized open source event store and stream processing platform. It has evolved into the de facto standard for data streaming, as over 80% of Fortune 500 companies use it. All major cloud providers offer managed data streaming services to meet this growing demand.
One key advantage of opting for managed Kafka services is the delegation of responsibility for broker and operational metrics, allowing users to focus solely on metrics specific to their applications. In this article, Product Manager Uche Nwankwo provides guidance on a set of producer and consumer metrics that customers should monitor for optimal performance.
With Kafka, monitoring typically involves various metrics that are related to topics, partitions, brokers and consumer groups. Standard Kafka metrics include information on throughput, latency, replication and disk usage. Refer to the Kafka documentation and relevant monitoring tools to understand the specific metrics available for your version of Kafka and how to interpret them effectively.
Why is it important to monitor Kafka clients?
Monitoring your IBM® Event Streams for IBM Cloud® instance is crucial to ensure optimal functionality and overall health of your data pipeline. Monitoring your Kafka clients helps to identify early signs of application failure, such as high resource usage, lagging consumers and bottlenecks. Identifying these warning signs early enables a proactive response to potential issues, which minimizes downtime and prevents disruption to business operations.
Kafka clients (producers and consumers) have their own set of metrics to monitor their performance and health. In addition, the Event Streams service supports a rich set of metrics produced by the server. For more information, see Monitoring Event Streams metrics by using IBM Cloud Monitoring.
Client metrics to monitor
Producer metrics
Metric | Description |
Record-error-rate | This metric measures the average per-second number of records sent that resulted in errors. A high (or increasing) record-error-rate might indicate a loss of data or data not being processed as expected. Either effect might compromise the integrity of the data you are processing and storing in Kafka. Monitoring this metric helps to ensure that data sent by producers is accurately and reliably recorded in your Kafka topics. |
Request-latency-avg | This is the average latency for each produce request in ms. An increase in latency impacts performance and might signal an issue. Measuring the request-latency-avg metric can help to identify bottlenecks within your instance. For many applications, low latency is crucial to ensure a high-quality user experience, and a spike in request-latency-avg might indicate that you are reaching the limits of your provisioned instance. You can address the issue by changing your producer settings (for example, batching) or by scaling your plan to optimize performance. |
Byte-rate | The average number of bytes sent per second for a topic is a measure of your throughput. If you stream data regularly, a drop in throughput can indicate an anomaly in your Kafka instance. The Event Streams Enterprise plan starts from 150 MB per second split one-to-one between ingress and egress, and it is important to know how much of that you are consuming for effective capacity planning. Do not go above two-thirds of the maximum throughput, to account for the possible impact of operational actions, such as internal updates or failure modes (for example, the loss of an availability zone). |
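The two-thirds guidance for byte-rate can be turned into a simple capacity check. The following is a minimal sketch, not an official tool: the function names are hypothetical, the 150 MB per second figure and the one-to-one ingress/egress split come from the table above, and it assumes decimal megabytes and that you have already collected the byte-rate value (bytes per second) for each topic from your producers.

```python
# Hypothetical helpers for capacity planning; not part of any Kafka or IBM API.
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def recommended_ceiling_mb_per_s(plan_total_mb_per_s: float) -> float:
    """Return two-thirds of the ingress share of the plan, in MB per second."""
    ingress_capacity = plan_total_mb_per_s / 2  # one-to-one ingress/egress split
    return ingress_capacity * 2 / 3             # keep one-third as headroom


def exceeds_ceiling(byte_rates: list[float], plan_total_mb_per_s: float) -> bool:
    """Check whether the summed producer byte-rate values exceed the ceiling.

    byte_rates holds one byte-rate reading (bytes per second) per topic.
    """
    total_mb_per_s = sum(byte_rates) / BYTES_PER_MB
    return total_mb_per_s > recommended_ceiling_mb_per_s(plan_total_mb_per_s)


if __name__ == "__main__":
    # Illustrative readings for three topics, in bytes per second.
    readings = [20_000_000.0, 18_000_000.0, 15_000_000.0]
    print(recommended_ceiling_mb_per_s(150))
    print(exceeds_ceiling(readings, 150))
```

With a 150 MB per second plan, the ingress share is 75 MB per second, so the ceiling works out to 50 MB per second. The sample readings add up to 53 MB per second, which is above that ceiling.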
Consumer metrics
Metric | Description |
Fetch-rate, fetch-size-avg | The number of fetch requests per second (fetch-rate) and the average number of bytes fetched per request (fetch-size-avg) are key indicators of how well your Kafka consumers are performing. A high fetch-rate might signal inefficiency, especially over a small number of messages, as it means insufficient (possibly no) data is being received each time. The fetch-rate and fetch-size-avg are affected by three settings: fetch.min.bytes, fetch.max.bytes and fetch.max.wait.ms. Tune these settings to achieve the desired overall latency, while minimizing the number of fetch requests and potentially the load on the broker CPU. Monitoring and optimizing both metrics ensures that you are processing data efficiently for current and future workloads. |
Commit-latency-avg | This metric measures the average time between a commit request being sent and the commit response being received. Similar to request-latency-avg as a producer metric, a stable commit-latency-avg means that your offset commits happen in a timely manner. A high commit latency might indicate problems within the consumer that prevent it from committing offsets quickly, which directly impacts the reliability of data processing. It might lead to duplicate processing of messages if a consumer must restart and reprocess messages from a previously uncommitted offset. A high commit latency also means spending more time in administrative operations than in actual message processing. This issue might lead to backlogs of messages waiting to be processed, especially in high-volume environments. |
Bytes-consumed-rate | This is a consumer-fetch metric that measures the average number of bytes consumed per second. Similar to byte-rate as a producer metric, this should be a stable and expected metric. A sudden change in the expected trend of the bytes-consumed-rate might represent an issue with your applications. A low rate might be a sign of inefficient data fetches or over-provisioned resources. A higher rate might overwhelm the consumers’ processing capability and thus require scaling, creating more consumers to balance out the load, or changing consumer configurations, such as fetch sizes. |
Rebalance-rate-per-hour | The number of group rebalances that the consumer participated in per hour. Rebalancing occurs every time a consumer joins or leaves the group, and it causes a delay in processing. This happens because partitions are reassigned, which makes Kafka consumers less efficient if there are many rebalances per hour. A higher rebalance rate per hour can be caused by misconfigurations that lead to unstable consumer behavior. Rebalancing can cause an increase in latency and might result in applications crashing. Ensure that your consumer groups are stable by monitoring for a low and stable rebalance-rate-per-hour. |
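The relationship between fetch-rate, fetch-size-avg and the three fetch settings can be sketched as a simple check. This is an illustrative sketch under stated assumptions: the class, function and threshold values are hypothetical and should be replaced with figures that suit your workload. Only the setting names fetch.min.bytes, fetch.max.bytes and fetch.max.wait.ms come from the table above, and the values assigned to them here are examples, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class FetchSample:
    """One reading of the two consumer fetch metrics (hypothetical container)."""
    fetch_rate: float      # fetch requests per second
    fetch_size_avg: float  # average bytes returned per fetch request


def looks_inefficient(sample: FetchSample,
                      max_fetch_rate: float = 10.0,
                      min_fetch_size: float = 16_384.0) -> bool:
    """Flag many small fetches: a high request rate with little data each time.

    Both thresholds are illustrative assumptions, not Kafka defaults.
    """
    return (sample.fetch_rate > max_fetch_rate
            and sample.fetch_size_avg < min_fetch_size)


# Example consumer settings to try when fetches look inefficient.
# The keys are real Kafka consumer settings; the values are illustrative only.
EXAMPLE_FETCH_SETTINGS = {
    "fetch.min.bytes": 65_536,      # wait until at least this much data is ready
    "fetch.max.wait.ms": 500,       # but never wait longer than this
    "fetch.max.bytes": 52_428_800,  # upper bound on data returned per fetch
}

if __name__ == "__main__":
    busy_but_empty = FetchSample(fetch_rate=50.0, fetch_size_avg=200.0)
    healthy = FetchSample(fetch_rate=2.0, fetch_size_avg=500_000.0)
    print(looks_inefficient(busy_but_empty))
    print(looks_inefficient(healthy))
```

Raising fetch.min.bytes trades a little latency (bounded by fetch.max.wait.ms) for fewer, larger fetch requests, which is the balance the fetch-rate and fetch-size-avg row describes.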
These metrics should cover a wide variety of applications and use cases. Event Streams on IBM Cloud provides a rich set of documented metrics that offer further useful insights depending on the domain of your application. Take the next step. Learn more about Event Streams for IBM Cloud.
What’s next?
You now know the essential Kafka client metrics to monitor. You’re invited to put these points into practice and try out the fully managed Kafka offering on IBM Cloud. For any challenges in setup, see the Getting Started Guide and FAQs.
Learn more about Kafka and its use cases
Provision an instance of Event Streams on IBM Cloud