Replies: 6 comments 8 replies
-
Here I put forward an idea: instead of storing many separate fields in the database, as the Flink deployment mode does, we could save the whole configuration as YAML, Base64-encode it, and write it to the database as a single value. This would avoid database field proliferation as features grow. What do you think about this?
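The idea above can be sketched roughly as follows. This is a minimal illustration, not StreamPark code; the configuration keys (`deployMode`, `executorMemory`, etc.) are hypothetical, and the YAML is shown as a plain string so the sketch stays self-contained:

```python
import base64

# Illustrative job configuration written as YAML (field names are assumptions).
job_config_yaml = """\
deployMode: yarn-cluster
mainClass: org.example.SparkWordCount
executorMemory: 2g
executorCores: 2
"""

def encode_config(yaml_text: str) -> str:
    """Base64-encode the YAML so it fits in a single TEXT column."""
    return base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")

def decode_config(stored: str) -> str:
    """Decode the stored column value back to the original YAML."""
    return base64.b64decode(stored).decode("utf-8")

stored_value = encode_config(job_config_yaml)  # value written to the DB
assert decode_config(stored_value) == job_config_yaml  # lossless round trip
```

With this scheme, adding a new job option only changes the YAML payload, not the table schema; the trade-off is that the individual fields are no longer directly queryable in SQL.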
-
Hello, your technical plan is very good. I have used Spark before; may I work on this with you?
-
+1
-
hi: As mentioned in the background section of the proposal, StreamPark already provides support for Flink jobs and has a large user base, but work on Spark support has not yet started. StreamPark has never abandoned support for Spark jobs, and I sincerely appreciate your drive behind this PR. The initial plan for Spark support in StreamPark is as follows: on one hand, we will provide a development framework for Spark jobs (that feature is not part of the current discussion); on the other hand, we will provide deployment and management capabilities for Spark jobs. For the latter, I suggest we divide the work into several steps:
Regarding the deployment of Spark jobs, the implementations on YARN and on Kubernetes differ, so we can begin with the simplest deployment mode. For job status handling, we can directly leverage Spark's API. This ensures the minimal functionality: jobs can be added, started, and their status automatically tracked.
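For the status-tracking step, one common approach for Spark on YARN is to poll the YARN ResourceManager REST API (`GET /ws/v1/cluster/apps/{appId}`) and map YARN application states onto the platform's own job states. A minimal sketch follows; the state mapping and function names are assumptions for illustration, not StreamPark's actual design:

```python
import json
import urllib.request

# Map YARN application states to a simplified job status.
# The keys are standard YARN states; the mapping itself is an assumption.
YARN_TO_JOB_STATUS = {
    "NEW": "STARTING", "SUBMITTED": "STARTING", "ACCEPTED": "STARTING",
    "RUNNING": "RUNNING",
    "FINISHED": "FINISHED", "FAILED": "FAILED", "KILLED": "KILLED",
}

def job_status(yarn_state: str) -> str:
    """Translate a YARN state into the platform-level job status."""
    return YARN_TO_JOB_STATUS.get(yarn_state, "UNKNOWN")

def poll_yarn_app(rm_url: str, app_id: str) -> str:
    """Fetch one application's state from the YARN ResourceManager REST API.

    Endpoint: GET {rm_url}/ws/v1/cluster/apps/{app_id}
    """
    with urllib.request.urlopen(f"{rm_url}/ws/v1/cluster/apps/{app_id}") as resp:
        app = json.load(resp)["app"]
    return job_status(app["state"])
```

A tracker loop could call `poll_yarn_app` periodically for each submitted job and persist the returned status; for Kubernetes deployments the polling source would differ, which is why the proposal treats the two deployment targets separately.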
-
This plan is great, allowing StreamPark to support both Spark and Flink jobs. May I join this effort and contribute a little? Thanks!
-
@tam-lab Of course. The community has already begun adapting Spark on YARN; you can leave a message under the corresponding issue.
-
StreamPark already supports Flink job deployment relatively well, but has not yet done so for Spark. We are now proposing the relevant designs; we welcome your discussion and will continue to collect feedback on the functionality and design suggestions.
Draft: Deploy Spark Job In StreamPark