
Mark Gateway Traffic in real-time analysis #27

Open
mrd0ll4r opened this issue Jul 3, 2022 · 6 comments

Comments

@mrd0ll4r (Member) commented Jul 3, 2022

We need to set up some infra to do something like this:

  1. Regularly run the gateway finder (this is already implemented on the DE1 monitor, I think, but we need to add it to our new infra). The tool works with the new plugin; we just need to set it up.
  2. Collect those results and aggregate them indefinitely. Add to this the list of gateway peer IDs we already have.
  3. In the monitoring client:
    1. Specify where/how we can get that list
    2. Reload the new list every day or so, or maybe use a signal to notify us of a new version
    3. Mark traffic originating from gateway peer IDs separately (see the sketch below). Maybe even tag it by the specific gateway provider it originated from? But that generates a ton of time series, so maybe not. On the other hand, we already have countries...
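A minimal sketch of what the lookup in step 3 could look like, assuming the aggregated list ends up as a plain-text file with one gateway peer ID per line (the GatewayList type and the file format are assumptions for illustration, not the monitoring client's actual API):

```rust
use std::collections::HashSet;
use std::{fs, io};

/// Hypothetical in-memory form of the aggregated gateway list.
struct GatewayList {
    peer_ids: HashSet<String>,
}

impl GatewayList {
    /// Load a plain-text file with one gateway peer ID per line (assumed format).
    fn load(path: &str) -> io::Result<Self> {
        let peer_ids = fs::read_to_string(path)?
            .lines()
            .map(str::trim)
            .filter(|l| !l.is_empty())
            .map(String::from)
            .collect();
        Ok(Self { peer_ids })
    }

    /// Marking step: true if the sending peer is a known gateway.
    fn origin_is_gateway(&self, peer_id: &str) -> bool {
        self.peer_ids.contains(peer_id)
    }
}
```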
@lgehr (Collaborator) commented Jul 12, 2022

I am currently working on aggregating the gateway-finder output.
It is not clear to me which fields need to be aggregated and how.

Let's say we want to merge a.json and b.json into an aggregated file c.json. Both input files are generated by ipfs-gateway-finder.
The JSON files each contain a number of JSON objects, for example:

```json
{
  "gateway": "infura-ipfs.io",
  "gateway_url": "https://infura-ipfs.io/ipfs/:hash",
  "data": [
    243,
    114,
    130,
    149,
    8,
    73,
    230,
    228,
    53,
    5
  ],
  "cid_v1": "bafybeielezns27al7u5luzvlnsqevcyb6oa5wl4kf67hq7iepfegadhtf4",
  "cid_v0": "QmXhqJTtCHakXzt8duir8pYRpGx4k5BPZ4hktQvNsSdujk",
  "http_request_timestamp": "2022-07-12T14:51:11.261492637Z",
  "http_requests_sent": 1,
  "http_request_remote": "54.80.64.45:443",
  "http_success_timestamp": "2022-07-12T14:51:13.339144217Z",
  "http_error_message": null,
  "wantlist_message": {
    "timestamp": "2022-07-12T14:51:12.351461253Z",
    "peer": "12D3KooWKB89k74dHN5vfuFmiGhVXk2GtKBNeabL2YX83xuEgkfd",
    "address": null,
    "received_entries": [
      {
        "priority": 2147483394,
        "cancel": false,
        "send_dont_have": false,
        "cid": {
          "/": "bafybeielezns27al7u5luzvlnsqevcyb6oa5wl4kf67hq7iepfegadhtf4"
        },
        "want_type": 1
      }
    ],
    "full_want_list": false,
    "peer_connected": null,
    "peer_disconnected": null,
    "connect_event_peer_found": null
  }
}
```

The data array is shortened for readability.

The easy case is when a.json has an object for a gateway that is not present in b.json.
The corresponding gateway object from a.json is simply written to c.json.

What should happen when the gateway is present in both a.json and b.json?
If exactly one of the entries has an http_error_message that is not null, we should write this entry to c.json, shouldn't we?

What should happen if both have no http_error_message? Should we just take the newer timestamp?
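One possible merge rule as a sketch, assuming we keep the probe without an error when exactly one of the two errored, and otherwise the newer one (the GatewayProbe struct only mirrors the fields relevant to that decision and is an assumption, not the actual ipfs-gateway-finder types; swap the first two match arms for the opposite rule):

```rust
use chrono::{DateTime, FixedOffset};

/// Only the fields the merge decision looks at (assumed, simplified).
struct GatewayProbe {
    gateway: String,
    /// Parsed from the RFC 3339 string in the JSON.
    http_request_timestamp: DateTime<FixedOffset>,
    http_error_message: Option<String>,
}

/// Decide which of two probes for the *same* gateway to write to c.json.
fn merge(a: GatewayProbe, b: GatewayProbe) -> GatewayProbe {
    match (a.http_error_message.is_none(), b.http_error_message.is_none()) {
        // Exactly one probe succeeded: keep the successful one.
        (true, false) => a,
        (false, true) => b,
        // Both succeeded or both failed: keep the newer probe.
        _ => {
            if a.http_request_timestamp >= b.http_request_timestamp {
                a
            } else {
                b
            }
        }
    }
}
```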

@lgehr (Collaborator) commented Jul 12, 2022

Do we actually need all of the fields data, cid_v1, cid_v0, and http_*? To me they seem more like log messages than actually interesting data we need.

@lgehr (Collaborator) commented Jul 12, 2022

Also: are gateways that did not respond in time (or have other errors in http_error_message) really gateways we found, or are they just noise?

@mrd0ll4r (Member, Author) commented:

> Do we actually need all of the fields data, cid_v1, cid_v0, and http_*? To me they seem more like log messages than actually interesting data we need.

Absolutely correct! This is just helpful stuff to verify results later, but we don't really need it for the aggregation.

In essence:

  1. We generate some unique content and put it on our nodes
  2. We request that content through the HTTP endpoint of a gateway
  3. We wait for a Bitswap message to arrive, asking for that content
  4. The sender of that Bitswap message is the gateway

Some gateway operators run multiple nodes and load-balance among them, which is why we see different peer IDs for the same gateway URL on multiple days.
In general, we want to aggregate the peer IDs we saw over some time for the same gateway URL.

So, for the above case, the gateway URL is infura-ipfs.io and the peer ID is 12D3KooWKB89k74dHN5vfuFmiGhVXk2GtKBNeabL2YX83xuEgkfd.
If we probe the same gateway again tomorrow, we might get a different ID (probably not, but it's possible). We'd then add that to our list of IDs for infura-ipfs.io.
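A sketch of that accumulation, assuming we simply keep every peer ID ever observed per gateway URL (the type and method names are made up for illustration):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Accumulates, per gateway URL, every peer ID we have ever seen answer a probe.
#[derive(Default)]
struct GatewayPeerIds {
    by_gateway: BTreeMap<String, BTreeSet<String>>,
}

impl GatewayPeerIds {
    /// Record one successful probe, e.g.
    /// ("infura-ipfs.io", "12D3KooWKB89k74dHN5vfuFmiGhVXk2GtKBNeabL2YX83xuEgkfd").
    fn record(&mut self, gateway: &str, peer_id: &str) {
        self.by_gateway
            .entry(gateway.to_string())
            .or_default()
            .insert(peer_id.to_string());
    }

    /// Flat list of all known gateway peer IDs, which is what the monitoring client needs.
    fn all_peer_ids(&self) -> BTreeSet<&str> {
        self.by_gateway
            .values()
            .flat_map(|ids| ids.iter().map(String::as_str))
            .collect()
    }
}
```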

There are two caveats:

  • Some public gateways are actually proxies, which forward their requests to other gateways. I'm open to ideas on how to identify these properly.
  • Some gateway providers have multiple HTTP endpoints, for example https://cloudflare-ipfs.com/ipfs/:hash and https://cf-ipfs.com/ipfs/:hash. These are the same gateway(s), operated by Cloudflare. I'm open to ideas about what to do with these :)

@lgehr (Collaborator) commented Aug 23, 2022

TODO: Add a binary flag everywhere in prom.rs: origin_is_gateway
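For illustration, such a flag could look roughly like this with the prometheus crate; the metric name and the exact structure of prom.rs are assumptions, not the existing code:

```rust
use prometheus::register_int_counter_vec;

fn main() {
    // Hypothetical metric; the real metrics live in prom.rs and will differ.
    let messages = register_int_counter_vec!(
        "bitswap_messages_total",
        "Number of Bitswap messages received",
        &["origin_is_gateway"]
    )
    .unwrap();

    // When handling a message, derive the label value from the gateway-list lookup.
    let origin_is_gateway = true; // e.g. gateway_list.origin_is_gateway(&peer_id)
    messages
        .with_label_values(&[if origin_is_gateway { "true" } else { "false" }])
        .inc();
}
```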

@mrd0ll4r (Member, Author) commented:

The client now loads peer IDs of gateways from a file, which can be reloaded via a signal.

What's left is running the gateway finder regularly and aggregating results. This belongs in the infra repository.
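For reference, the reload-via-signal mechanism could look roughly like this, assuming SIGUSR1, the signal_hook crate, and a plain-text file with one peer ID per line (the actual client may use a different signal, crate, or file format):

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::{fs, io, thread, time::Duration};

/// Assumed format: one gateway peer ID per line.
fn load_peer_ids(path: &str) -> io::Result<HashSet<String>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .map(|l| l.trim().to_string())
        .filter(|l| !l.is_empty())
        .collect())
}

fn main() -> io::Result<()> {
    let path = "gateway-peer-ids.txt"; // hypothetical file name
    let mut gateway_ids = load_peer_ids(path)?;

    // Set a flag whenever SIGUSR1 arrives; the main loop picks it up.
    let reload = Arc::new(AtomicBool::new(false));
    signal_hook::flag::register(signal_hook::consts::SIGUSR1, Arc::clone(&reload))
        .expect("failed to register signal handler");

    loop {
        if reload.swap(false, Ordering::SeqCst) {
            gateway_ids = load_peer_ids(path)?;
            println!("reloaded {} gateway peer IDs", gateway_ids.len());
        }
        // ... handle Bitswap messages here, checking gateway_ids.contains(&peer_id) ...
        thread::sleep(Duration::from_millis(100));
    }
}
```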
