
Mark Gateway Traffic in real-time analysis #27

Open
mrd0ll4r opened this issue Jul 3, 2022 · 6 comments

Comments

@mrd0ll4r (Member) commented Jul 3, 2022

We need to set up some infra to do something like this:

  1. Regularly run the gateway finder (this is already implemented on the DE1 monitor, I think, but we need to add it to our new infra). The tool works with the new plugin; we just need to set it up.
  2. Collect those results and aggregate them indefinitely. Add to this the list of gateway peer IDs we already have.
  3. In the monitoring client:
    1. Specify where/how we can get that list
    2. Reload the new list every day or so, or maybe use a signal to notify us of a new version
    3. Mark traffic originating from gateway peer IDs separately (see the sketch below). Maybe even tag it by the specific gateway provider it originated from? But that generates a ton of time series, so maybe not. On the other hand, we already have countries...
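A minimal sketch of what the lookup in step 3 could look like, assuming the aggregated list ends up as a plain-text file with one gateway peer ID per line (the GatewayList type and the file format are assumptions for illustration, not the monitoring client's actual API):

```rust
use std::collections::HashSet;
use std::{fs, io};

/// Hypothetical in-memory form of the aggregated gateway list.
struct GatewayList {
    peer_ids: HashSet<String>,
}

impl GatewayList {
    /// Load a plain-text file with one gateway peer ID per line (assumed format).
    fn load(path: &str) -> io::Result<Self> {
        let peer_ids = fs::read_to_string(path)?
            .lines()
            .map(str::trim)
            .filter(|l| !l.is_empty())
            .map(String::from)
            .collect();
        Ok(Self { peer_ids })
    }

    /// Marking step: true if the sending peer is a known gateway.
    fn origin_is_gateway(&self, peer_id: &str) -> bool {
        self.peer_ids.contains(peer_id)
    }
}
```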
@lgehr (Collaborator) commented Jul 12, 2022

I am currently working on aggregating the gateway-finder output.
It is not clear to me which fields need to be aggregated and how.

Let's say we want to merge a.json and b.json into an aggregated file c.json. Both input files are generated by ipfs-gateway-finder.
The JSON files each contain a number of JSON objects, for example:

```json
{
  "gateway": "infura-ipfs.io",
  "gateway_url": "https://infura-ipfs.io/ipfs/:hash",
  "data": [
    243,
    114,
    130,
    149,
    8,
    73,
    230,
    228,
    53,
    5
  ],
  "cid_v1": "bafybeielezns27al7u5luzvlnsqevcyb6oa5wl4kf67hq7iepfegadhtf4",
  "cid_v0": "QmXhqJTtCHakXzt8duir8pYRpGx4k5BPZ4hktQvNsSdujk",
  "http_request_timestamp": "2022-07-12T14:51:11.261492637Z",
  "http_requests_sent": 1,
  "http_request_remote": "54.80.64.45:443",
  "http_success_timestamp": "2022-07-12T14:51:13.339144217Z",
  "http_error_message": null,
  "wantlist_message": {
    "timestamp": "2022-07-12T14:51:12.351461253Z",
    "peer": "12D3KooWKB89k74dHN5vfuFmiGhVXk2GtKBNeabL2YX83xuEgkfd",
    "address": null,
    "received_entries": [
      {
        "priority": 2147483394,
        "cancel": false,
        "send_dont_have": false,
        "cid": {
          "/": "bafybeielezns27al7u5luzvlnsqevcyb6oa5wl4kf67hq7iepfegadhtf4"
        },
        "want_type": 1
      }
    ],
    "full_want_list": false,
    "peer_connected": null,
    "peer_disconnected": null,
    "connect_event_peer_found": null
  }
}
```

The data array is shortened for readability.

The easy case is when a.json has an object for a gateway that is not present in b.json.
The corresponding gateway object from a.json is simply written to c.json.

What should happen when the gateway is present in both a.json and b.json?
If exactly one of the entries has an http_error_message that is not null, we should write this entry to c.json, shouldn't we?

What should happen if both have no http_error_message? Should we just take the newer timestamp?
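One possible merge rule as a sketch, assuming we keep the probe without an error when exactly one of the two errored, and otherwise the newer one (the GatewayProbe struct only mirrors the fields relevant to that decision and is an assumption, not the actual ipfs-gateway-finder types; swap the first two match arms for the opposite rule):

```rust
use chrono::{DateTime, FixedOffset};

/// Only the fields the merge decision looks at (assumed, simplified).
struct GatewayProbe {
    gateway: String,
    /// Parsed from the RFC 3339 string in the JSON.
    http_request_timestamp: DateTime<FixedOffset>,
    http_error_message: Option<String>,
}

/// Decide which of two probes for the *same* gateway to write to c.json.
fn merge(a: GatewayProbe, b: GatewayProbe) -> GatewayProbe {
    match (a.http_error_message.is_none(), b.http_error_message.is_none()) {
        // Exactly one probe succeeded: keep the successful one.
        (true, false) => a,
        (false, true) => b,
        // Both succeeded or both failed: keep the newer probe.
        _ => {
            if a.http_request_timestamp >= b.http_request_timestamp {
                a
            } else {
                b
            }
        }
    }
}
```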

@lgehr (Collaborator) commented Jul 12, 2022

Do we actually need all of the fields data, cid_v1, cid_v0, and http_*? To me they seem more like log messages than actually interesting data we need.

@lgehr (Collaborator) commented Jul 12, 2022

Also: are gateways that did not respond in time (or have other errors in http_error_message) really gateways we found, or are they just noise?

@mrd0ll4r (Member, Author) commented:

> Do we actually need all of the fields data, cid_v1, cid_v0, and http_*? To me they seem more like log messages than actually interesting data we need.

Absolutely correct! This is just helpful stuff to verify results later, but we don't really need it for the aggregation.

In essence:

  1. We generate some unique content and put it on our nodes
  2. We request that content through the HTTP endpoint of a gateway
  3. We wait for a Bitswap message to arrive, asking for that content
  4. The sender of that Bitswap message is the gateway

Some gateway operators run multiple nodes and load-balance among them, which is why we see different peer IDs for the same gateway URL on multiple days.
In general, we want to aggregate the peer IDs we saw over some time for the same gateway URL.

So, for the above case, the gateway URL is infura-ipfs.io and the peer ID is 12D3KooWKB89k74dHN5vfuFmiGhVXk2GtKBNeabL2YX83xuEgkfd.
If we probe the same gateway again tomorrow, we might get a different ID (probably not, but it's possible). We'd then add that to our list of IDs for infura-ipfs.io.
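A sketch of that accumulation, assuming we simply keep every peer ID ever observed per gateway URL (the type and method names are made up for illustration):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Accumulates, per gateway URL, every peer ID we have ever seen answer a probe.
#[derive(Default)]
struct GatewayPeerIds {
    by_gateway: BTreeMap<String, BTreeSet<String>>,
}

impl GatewayPeerIds {
    /// Record one successful probe, e.g.
    /// ("infura-ipfs.io", "12D3KooWKB89k74dHN5vfuFmiGhVXk2GtKBNeabL2YX83xuEgkfd").
    fn record(&mut self, gateway: &str, peer_id: &str) {
        self.by_gateway
            .entry(gateway.to_string())
            .or_default()
            .insert(peer_id.to_string());
    }

    /// Flat list of all known gateway peer IDs, which is what the monitoring client needs.
    fn all_peer_ids(&self) -> BTreeSet<&str> {
        self.by_gateway
            .values()
            .flat_map(|ids| ids.iter().map(String::as_str))
            .collect()
    }
}
```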

There are two caveats:

  • Some public gateways are actually proxies, which forward their requests to other gateways. I'm open to ideas on how to identify these properly.
  • Some gateway providers have multiple HTTP endpoints, for example https://cloudflare-ipfs.com/ipfs/:hash and https://cf-ipfs.com/ipfs/:hash. These are the same gateway(s), operated by Cloudflare. I'm open to ideas about what to do with these :)

@lgehr (Collaborator) commented Aug 23, 2022

TODO: Add a binary flag everywhere in prom.rs: origin_is_gateway
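For illustration, such a flag could look roughly like this with the prometheus crate; the metric name and the exact structure of prom.rs are assumptions, not the existing code:

```rust
use prometheus::register_int_counter_vec;

fn main() {
    // Hypothetical metric; the real metrics live in prom.rs and will differ.
    let messages = register_int_counter_vec!(
        "bitswap_messages_total",
        "Number of Bitswap messages received",
        &["origin_is_gateway"]
    )
    .unwrap();

    // When handling a message, derive the label value from the gateway-list lookup.
    let origin_is_gateway = true; // e.g. gateway_list.origin_is_gateway(&peer_id)
    messages
        .with_label_values(&[if origin_is_gateway { "true" } else { "false" }])
        .inc();
}
```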

@mrd0ll4r (Member, Author) commented:

The client now loads peer IDs of gateways from a file, which can be reloaded via a signal.

What's left is running the gateway finder regularly and aggregating results. This belongs in the infra repository.
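For reference, the reload-via-signal mechanism could look roughly like this, assuming SIGUSR1, the signal_hook crate, and a plain-text file with one peer ID per line (the actual client may use a different signal, crate, or file format):

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::{fs, io, thread, time::Duration};

/// Assumed format: one gateway peer ID per line.
fn load_peer_ids(path: &str) -> io::Result<HashSet<String>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .map(|l| l.trim().to_string())
        .filter(|l| !l.is_empty())
        .collect())
}

fn main() -> io::Result<()> {
    let path = "gateway-peer-ids.txt"; // hypothetical file name
    let mut gateway_ids = load_peer_ids(path)?;

    // Set a flag whenever SIGUSR1 arrives; the main loop picks it up.
    let reload = Arc::new(AtomicBool::new(false));
    signal_hook::flag::register(signal_hook::consts::SIGUSR1, Arc::clone(&reload))
        .expect("failed to register signal handler");

    loop {
        if reload.swap(false, Ordering::SeqCst) {
            gateway_ids = load_peer_ids(path)?;
            println!("reloaded {} gateway peer IDs", gateway_ids.len());
        }
        // ... handle Bitswap messages here, checking gateway_ids.contains(&peer_id) ...
        thread::sleep(Duration::from_millis(100));
    }
}
```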
