Most consumers of the content payload require the payload to be:
decoded using the provided HTTP Content-Encoding
available as byte[] (e.g. Tika) or even String (e.g. Jsoup)
I've found myself writing similar code when consuming the payload body of WarcResponse records in jwarc's extract tool #41, a sitemap tester, and StormCrawler. To make jwarc more usable, I propose bundling the following functionality in a few utility methods:
return the decoded payload body as a channel, using the HTTP Content-Encoding
with configurable behavior (fail, or return the payload without decoding) when the Content-Encoding isn't understood or isn't reliable (gzip without the gzip magic/header)
possibly allow passing decoders for encodings not supported by jwarc, e.g. brotli (I assume that jwarc is designed to have zero dependencies)
or should the decoding functionality be provided in a class HttpPayload extending WarcPayload?
read the (decoded) payload into byte[] (or ByteBuffer)
optionally limit the maximum size of the byte[] array so that oversized captures do not cause problems
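To make the proposal concrete, here is a minimal sketch of the decoding step using only the JDK. The class name PayloadDecoding, the decode method and the strict flag are hypothetical and not part of jwarc; the sketch works on a plain InputStream rather than a channel, handles a single encoding token only (not lists such as "gzip, deflate"), and assumes "deflate" means zlib-wrapped data.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper, not part of jwarc: decodes a body by its Content-Encoding. */
final class PayloadDecoding {
    private PayloadDecoding() {}

    /**
     * @param body            raw (still encoded) payload body
     * @param contentEncoding value of the Content-Encoding header, may be null
     * @param strict          if true, fail on unknown or unreliable encodings;
     *                        if false, return the body without decoding
     */
    static InputStream decode(InputStream body, String contentEncoding, boolean strict)
            throws IOException {
        if (contentEncoding == null) return body;
        String encoding = contentEncoding.trim().toLowerCase(Locale.ROOT);
        if (encoding.isEmpty() || encoding.equals("identity")) return body;

        // Buffer the stream so the first two bytes can be inspected and replayed.
        BufferedInputStream in = new BufferedInputStream(body);
        switch (encoding) {
            case "gzip":
            case "x-gzip":
                if (hasGzipMagic(in)) return new GZIPInputStream(in);
                if (strict) throw new IOException("Content-Encoding is gzip but the gzip magic is missing");
                return in; // lenient mode: hand back the bytes as captured
            case "deflate":
                // Assumption: zlib-wrapped deflate; raw deflate would need new Inflater(true).
                return new InflaterInputStream(in);
            default:
                if (strict) throw new IOException("Unsupported Content-Encoding: " + encoding);
                return in;
        }
    }

    /** Peeks at the first two bytes (0x1f 0x8b for gzip) without consuming them. */
    private static boolean hasGzipMagic(BufferedInputStream in) throws IOException {
        in.mark(2);
        int b1 = in.read();
        int b2 = in.read();
        in.reset();
        return b1 == 0x1f && b2 == 0x8b;
    }
}
```

With strict set to false, a capture labelled gzip that is actually plain text is passed through unchanged, which matches the "return payload without decoding" option above.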
Having something like a decode() or bodyDecoded() convenience method on both HttpMessage and WarcPayload that decodes the content encoding seems reasonable to me.
I think we could make brotli an optional Maven dependency and, if it's present on the classpath, use it.
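One way the classpath check could look is sketched below, assuming the decoder is Google's Java brotli decoder, whose stream class is org.brotli.dec.BrotliInputStream. The helper class OptionalDecoders and its wrapIfPresent method are hypothetical names for illustration; reflection keeps the dependency optional at compile time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.util.Optional;

/** Hypothetical helper, not part of jwarc: wraps a stream in an optional decoder class. */
final class OptionalDecoders {
    /** Assumed decoder class from the optional brotli dependency. */
    static final String BROTLI_CLASS = "org.brotli.dec.BrotliInputStream";

    private OptionalDecoders() {}

    /**
     * Wraps {@code in} with the named InputStream class if that class is on the
     * classpath and has a public constructor taking an InputStream.
     * Returns Optional.empty() when the class is not available.
     */
    static Optional<InputStream> wrapIfPresent(String className, InputStream in)
            throws IOException {
        try {
            Class<?> cls = Class.forName(className);
            Constructor<?> ctor = cls.getConstructor(InputStream.class);
            return Optional.of((InputStream) ctor.newInstance(in));
        } catch (InvocationTargetException e) {
            // The decoder exists but rejected the data (for example a corrupt header).
            throw new IOException("Decoder " + className + " failed to initialise", e.getCause());
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class missing or unusable: treat the encoding as unsupported.
            return Optional.empty();
        }
    }
}
```

A caller would try `OptionalDecoders.wrapIfPresent(OptionalDecoders.BROTLI_CLASS, body)` for a "br" encoding and fall back to the configured fail-or-passthrough behavior when the result is empty.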
read the (decoded) payload into byte[] (or ByteBuffer)
Note that from Java 9 you can do body().stream().readAllBytes() and body().stream().readNBytes(buf, off, len). I'm not opposed to having our own, though, as there are still quite a few people targeting Java 8.
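For Java 8 targets, a size-limited read could be sketched roughly as follows. The class BoundedRead and the method readUpTo are hypothetical names, not jwarc API, and the sketch silently truncates at the limit; a real helper might prefer to signal truncation to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical Java 8 compatible helper, not part of jwarc. */
final class BoundedRead {
    private BoundedRead() {}

    /**
     * Reads at most maxBytes from the stream into a byte array.
     * Stops early at end of stream; extra input beyond the limit is left unread.
     */
    static byte[] readUpTo(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never request more than the remaining budget.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) break; // end of stream
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

Because the limit is enforced on the decoded stream, an oversized or highly compressed capture cannot expand past maxBytes in memory.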