---
id: spark-layer-rdd
title: Tile Layer RDD
sidebar_label: Tile Layer RDD
---
A layer is an `RDD[(SpatialKey, Tile)] with Metadata[TileLayerMetadata[SpatialKey]]`. Conceptually a layer is a tiled raster; functionally you can treat it as a distributed `Map`.
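These names are type aliases; below is a simplified sketch of how they are defined in the `geotrellis.spark` package object (`ContextRDD` is the concrete class that pairs an RDD with its metadata):

```scala
import org.apache.spark.rdd.RDD
import geotrellis.layer.{Metadata, TileLayerMetadata}
import geotrellis.raster.{MultibandTile, Tile}

// Simplified sketch of the aliases (single- and multiband variants):
type TileLayerRDD[K]          = RDD[(K, Tile)] with Metadata[TileLayerMetadata[K]]
type MultibandTileLayerRDD[K] = RDD[(K, MultibandTile)] with Metadata[TileLayerMetadata[K]]
```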
- `SpatialKey` is from the `LayoutDefinition` defined by the layer metadata
- Every `Tile` in a layer has the same:
  - Dimensions (cols x rows)
  - Band count
  - `CellType`
  - Map projection
  - Spatial resolution (pixels per map X/Y unit)
The design philosophy behind the GeoTrellis Spark API is that it should work as much as possible with native Spark types and should minimize the number of new wrapper classes. Therefore `TileLayerRDD` is first and foremost an `RDD` and has all of the methods available to an `RDD` of its type. The spatial metadata is "extra".
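For instance, ordinary `RDD` methods apply directly (a minimal sketch, assuming the `layer` value built later on this page):

```scala
// Any plain RDD action works on a layer.
val tileCount: Long = layer.count()

// Plain transformations also work, but they return an ordinary RDD,
// dropping the Metadata mix-in unless it is re-attached.
val keys = layer.map { case (key, _) => key }
```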
Note: There is no requirement that each key occurs only once within a layer. Typically that is the case, but ultimately the property can only be inferred from the operations that produced the layer.
Because each record in a layer RDD is a key/value tuple, Spark provides additional methods through the `PairRDDFunctions` implicit extensions:

- `reduceByKey`
- `combineByKey`
- `groupByKey`
- `partitionBy`
- `join`
- `leftOuterJoin`
- `rightOuterJoin`
- `fullOuterJoin`
- `mapValues`
- `flatMapValues`
Once a layer exists, much of the work of transforming it can be done with these built-in Spark functions, as in the sketch below.
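For example, duplicate keys introduced by a union can be collapsed with `reduceByKey`, merging the tiles per key. This is a hedged sketch: `layerA` and `layerB` are hypothetical inputs, and `merge` is the tile merge extension method brought in by `geotrellis.raster._`.

```scala
import geotrellis.layer.SpatialKey
import geotrellis.raster._
import org.apache.spark.rdd.RDD

// Hypothetical inputs: two keyed tile RDDs tiled to the same layout.
def layerA: RDD[(SpatialKey, Tile)] = ???
def layerB: RDD[(SpatialKey, Tile)] = ???

// A union can introduce duplicate keys; merge the tiles per key to
// restore the one-tile-per-key property.
val merged: RDD[(SpatialKey, Tile)] = (layerA union layerB).reduceByKey(_ merge _)
```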
```scala
import org.apache.spark._

// Kryo serialization with the GeoTrellis registrator is needed to
// serialize raster types efficiently across the cluster.
implicit val sparkContext =
  SparkContext.getOrCreate(
    new SparkConf(loadDefaults = true)
      .setMaster("local[*]")
      .setAppName("Demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "geotrellis.spark.store.kryo.KryoRegistrator"))
```
Let's build a layer from a small demo file; we will have more later. First we construct our `RasterSource` instance.
```scala
import geotrellis.layer._
import geotrellis.spark._
import geotrellis.raster.geotiff.GeoTiffRasterSource

// Create a lazy reader over the GeoTiff; no pixel data is read yet.
val uri = "s3://geotrellis-demo/cogs/harrisburg-pa/elevation.tif"
val source = GeoTiffRasterSource(uri)
```
At this point we can read the file, but we have to make a choice about what kind of tiling layout we want to use. If we know the layout ahead of time, we can skip this step; but if the input is more variable, we can use a `RasterSummary`. `RasterSummary` will expand the extent to cover all of the rasters it sees, at the highest resolution among them. `RasterSummary` will throw an exception if it encounters sources in different projections, because their extents are not comparable. If that's your problem, reproject your sources to a common CRS before collecting the summary, as sketched below.
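A minimal sketch of that fix, assuming `WebMercator` as an arbitrary common target CRS:

```scala
import geotrellis.proj4.WebMercator

// Bring all sources into one CRS so their extents are comparable.
val commonCrsSources = List(source).map(_.reproject(WebMercator))
val commonSummary = RasterSummary.fromSeq(commonCrsSources)
```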
```scala
val summary = RasterSummary.fromSeq(List(source))
```
An instance of `RasterSummary` will produce for you a `LayoutDefinition` that covers the input data and matches your layout scheme. In this case let's snap our raster to one of the slippy-map zoom levels. This will up-sample our raster to snap it to the closest zoom level.
```scala
import geotrellis.spark._
import geotrellis.proj4.WebMercator

// Snap to the closest slippy-map zoom level using 256x256 tiles.
val layout = summary.layoutDefinition(ZoomedLayoutScheme(WebMercator, 256))

// Tile the source into the layout, producing a keyed layer with metadata.
val layer = RasterSourceRDD.spatial(source, layout)

layer.metadata
```
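The metadata records the layer-wide properties listed at the top of this page; a few illustrative accessors:

```scala
val md = layer.metadata
md.crs      // the map projection
md.cellType // the CellType shared by every tile
md.extent   // the extent covered by the tiled layer
md.layout   // the LayoutDefinition used to key the tiles
```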
A layer does not restrict how its records are partitioned. Generally a `HashPartitioner` is used, and records are distributed across the partitions based on the modulo of the key hash.

Specifying layer partitioning provides an interesting avenue for optimization. When joining two layers that have different partitioning, a shuffle has to happen in order to co-locate all the records from layer A with all the records from layer B. However, if two layers share the same partitioner, Spark can infer that all keys that could join to a given partition in layer A must come from a single partition in layer B. Such a join does not result in a shuffle and is thus an order of magnitude faster.
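A minimal sketch of such a co-partitioned join, assuming the `layer` built above; the second layer here is just a derived copy for illustration, not part of the original example:

```scala
import org.apache.spark.HashPartitioner

// Co-partition both sides with the same partitioner instance.
val partitioner = new HashPartitioner(layer.getNumPartitions)
val left = layer.partitionBy(partitioner)

// Derive a second layer for illustration; mapValues preserves partitioning.
val right = left.mapValues(_.mapBands((_, band) => band.map(_ * 2)))

// Both sides share `partitioner`, so this join requires no shuffle.
val joined = left.join(right)
```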
```scala
sparkContext.stop()
```