Skip to content
This repository has been archived by the owner on Jul 9, 2024. It is now read-only.

Add additional info to alerts | Batch DB writes #489

Merged
merged 15 commits into from
Oct 31, 2023

Conversation

aquarat
Copy link
Contributor

@aquarat aquarat commented Oct 18, 2023

This PR is currently being tested live.

This PR does two main things:

  • Batches DB writes to improve performance/reduce DB load
  • Includes extra info in OpsGenie alerts related to dead gateways (how many out of how many gateways are dead and whether those gateways are AWS/GCP). This will make it easier to figure out if a dead gateway is a big deal or not.
  • Updates the hard-coded chains :(

Closes https://github.com/api3dao/tasks/issues/219
Closes https://github.com/api3dao/tasks/issues/199

@aquarat aquarat changed the base branch from main to airseeker-monitoring-tuning October 18, 2023 15:07
@aquarat aquarat changed the base branch from airseeker-monitoring-tuning to airseeker-telemetry October 18, 2023 15:07
@aquarat aquarat changed the title Add additional info to alerts Add additional info to alerts | Batch DB writes Oct 18, 2023
@aquarat aquarat requested a review from vponline October 19, 2023 12:11
export let prismaActive = prisma;

export const setPrisma = (newPrisma: any) => {
prismaActive = newPrisma;
};

type DbRecord = {
model: 'deviationValue' | 'rPCFailures' | 'compoundValues' | 'dataFeedApiValue' | 'gatewayFailures';
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if there's a better way of typing this or handling this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of anything that would be more elegant than this either (reminds me of the old telemetry 😄)

@aquarat aquarat marked this pull request as ready for review October 19, 2023 12:13
@aquarat aquarat marked this pull request as draft October 19, 2023 13:13
@aquarat aquarat marked this pull request as ready for review October 20, 2023 10:10

const addRecord = (record: DbRecord) => {
// eslint-disable-next-line functional/immutable-data
recordsToInsert.push(record);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We batch pending DB inserts now, which speeds things up and reduces DB load.

await Promise.allSettled(
Object.entries(bufferedRecords).map(([model, data]) =>
// @ts-ignore
prismaActive[model].createMany({ data: data.map((item) => item.record) })
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this can be done better...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this solution 😄, anything else would be a much longer utility function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the createMany's are quite fast so this could even be made sequential if we find DB load to bee still too much later on 🤷

).filter((result) => result.status === 'rejected');

if (results.length > 0) {
logger.error(`DB Writer error: ${results}`);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we drop a sample it's not a big deal and if we drop everything, well, bigger issue that will be obvious.

if (!nodaryResponse.success) {
await limitedSendToOpsGenieLowLevel(
limitedSendToOpsGenieLowLevel(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes we are limited by OG's rate limiting, in those case we don't want it to slow down the entire app. This was causing samples to be dropped.

@@ -189,6 +231,19 @@ export const recordRpcProviderResponseSuccess = async (contract: Api3ServerV1, s
}
};

export const getBaseUrl = (fullUrl: string) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the base URL for (1) finding gateways and (2) hashing the gateway URL so it can be looked up later by Mertcan as part of his fault tracing.

{
message: `Dead gateway for Airnode Address ${airnodeAddress}`,
priority: 'P3',
alias: `dead-gateway-${airnodeAddress}${generateOpsGenieAlias(gatewayUrl)}`,
alias: `dead-gateway-${airnodeAddress}${generateOpsGenieAlias(baseUrl)}`,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're only interested in the base URL, not the URL+template ID (which is what we used before).

@@ -46,5 +46,5 @@ export const DIRECT_GATEWAY_MIN_TIME_DEFAULT_MS = 20;
export const DIRECT_GATEWAY_MAX_CONCURRENCY_DEFAULT = 10;

// TODO: load these 2 from env var instead
export const DATAFEED_READ_BATCH_SIZE = 100;
export const DATAFEED_UPDATE_BATCH_SIZE = 10;
export const DATAFEED_READ_BATCH_SIZE = 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading more than 1 has been unreliable on Kava - and I'm tired of troubleshooting this app - if you can see why it's failing, LMK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow! This would be a problem for the main airseeker as well, right?

@@ -92,7 +78,7 @@ export const fetchBeaconData = async (beaconId: string) => {
}

const goRes = await go(fetchFn, {
retries: 2,
retries: 0,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need/want it to retry because if it fails we'll try the whole thing again very soon.

@@ -16,6 +16,10 @@ export const handleStopSignal = (signal: string) => {
expireLimiterJobs();
updateState((state) => ({ ...state, stopSignalReceived: true }));
heartbeatReporter(getState().config);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very annoying when the app doesn't quit quickly in response to ctrl+c. This fixes that.

Copy link
Contributor

@vponline vponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I didn't see any issues 🚀

await Promise.allSettled(
Object.entries(bufferedRecords).map(([model, data]) =>
// @ts-ignore
prismaActive[model].createMany({ data: data.map((item) => item.record) })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this solution 😄, anything else would be a much longer utility function

await Promise.allSettled(
Object.entries(bufferedRecords).map(([model, data]) =>
// @ts-ignore
prismaActive[model].createMany({ data: data.map((item) => item.record) })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the createMany's are quite fast so this could even be made sequential if we find DB load to bee still too much later on 🤷

export let prismaActive = prisma;

export const setPrisma = (newPrisma: any) => {
prismaActive = newPrisma;
};

type DbRecord = {
model: 'deviationValue' | 'rPCFailures' | 'compoundValues' | 'dataFeedApiValue' | 'gatewayFailures';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of anything that would be more elegant than this either (reminds me of the old telemetry 😄)

@@ -46,5 +46,5 @@ export const DIRECT_GATEWAY_MIN_TIME_DEFAULT_MS = 20;
export const DIRECT_GATEWAY_MAX_CONCURRENCY_DEFAULT = 10;

// TODO: load these 2 from env var instead
export const DATAFEED_READ_BATCH_SIZE = 100;
export const DATAFEED_UPDATE_BATCH_SIZE = 10;
export const DATAFEED_READ_BATCH_SIZE = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow! This would be a problem for the main airseeker as well, right?

@aquarat aquarat merged commit 20fc849 into airseeker-telemetry Oct 31, 2023
@aquarat aquarat deleted the add-additional-info-to-alerts branch October 31, 2023 13:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants