Apache Spark is a high-performance engine for large-scale computing tasks, such as data processing, machine learning and real-time data streaming. It includes APIs for Java, Python, Scala and R.
[Overview of Apache Spark](https://spark.apache.org/)
Trademarks: This software listing is packaged by Bitnami. The respective trademarks mentioned in the offering are owned by the respective companies, and use of them does not imply any affiliation or endorsement.
This chart bootstraps an [Apache Spark](https://github.com/bitnami/containers/tree/main/bitnami/spark) deployment on a [Kubernetes](https://kubernetes.io) cluster using the [Helm](https://helm.sh) package manager.
Apache Spark includes APIs for Java, Python, Scala and R.
Bitnami charts can be used with [Kubeapps](https://kubeapps.dev/) for deployment and management of Helm Charts in clusters.
## Prerequisites
- Kubernetes 1.19+
- Helm 3.2.0+
## Installing the Chart
To install the chart with the release name `my-release`:
These commands deploy Apache Spark on the Kubernetes cluster in the default configuration. The [Parameters](#parameters) section lists the parameters that can be configured during installation.
> **Tip**: List all releases using `helm list`
## Uninstalling the Chart
To uninstall/delete the `my-release` statefulset:
```console
$ helm delete my-release
```
The command removes all the Kubernetes components associated with the chart and deletes the release. Use the option `--purge` to delete all persistent volumes too.
| `commonLabels` | Labels to add to all deployed objects | `{}` |
| `commonAnnotations` | Annotations to add to all deployed objects | `{}` |
| `clusterDomain` | Kubernetes cluster domain name | `cluster.local` |
| `extraDeploy` | Array of extra objects to deploy with the release | `[]` |
| `initScripts` | Dictionary of init scripts. Evaluated as a template. | `{}` |
| `initScriptsCM` | ConfigMap with the init scripts. Evaluated as a template. | `""` |
| `initScriptsSecret` | Secret containing `/docker-entrypoint-initdb.d` scripts to be executed at initialization time that contain sensitive data. Evaluated as a template. | `""` |
| `diagnosticMode.enabled` | Enable diagnostic mode (all probes will be disabled and the command will be overridden) | `false` |
| `diagnosticMode.command` | Command to override all containers in the deployment | `["sleep"]` |
| `diagnosticMode.args` | Args to override all containers in the deployment | `["infinity"]` |
| `master.topologySpreadConstraints` | Topology Spread Constraints for pod assignment spread across your cluster among failure-domains. Evaluated as a template | `[]` |
| `master.schedulerName` | Name of the k8s scheduler (other than default) for master pods | `""` |
| `master.terminationGracePeriodSeconds` | Seconds Redmine pod needs to terminate gracefully | `""` |
| `master.lifecycleHooks` | for the master container(s) to automate configuration before or after startup | `{}` |
| `master.extraVolumes` | Optionally specify extra list of additional volumes for the master pod(s) | `[]` |
| `master.extraVolumeMounts` | Optionally specify extra list of additional volumeMounts for the master container(s) | `[]` |
| `master.resources.limits` | The resources limits for the container | `{}` |
| `master.resources.requests` | The requested resources for the container | `{}` |
| `worker.topologySpreadConstraints` | Topology Spread Constraints for pod assignment spread across your cluster among failure-domains. Evaluated as a template | `[]` |
| `worker.schedulerName` | Name of the k8s scheduler (other than default) for worker pods | `""` |
| `worker.terminationGracePeriodSeconds` | Seconds Redmine pod needs to terminate gracefully | `""` |
| `worker.lifecycleHooks` | for the worker container(s) to automate configuration before or after startup | `{}` |
| `worker.extraVolumes` | Optionally specify extra list of additional volumes for the worker pod(s) | `[]` |
| `worker.extraVolumeMounts` | Optionally specify extra list of additional volumeMounts for the master container(s) | `[]` |
| `worker.resources.limits` | The resources limits for the container | `{}` |
| `worker.resources.requests` | The requested resources for the container | `{}` |
| `ingress.pathType` | Ingress path type | `ImplementationSpecific` |
| `ingress.apiVersion` | Force Ingress API version (automatically detected if not set) | `""` |
| `ingress.hostname` | Default host for the ingress resource | `spark.local` |
| `ingress.ingressClassName` | IngressClass that will be be used to implement the Ingress (Kubernetes 1.18+) | `""` |
| `ingress.path` | The Path to Spark. You may need to set this to '/*' in order to use this with ALB ingress controllers. | `/` |
| `ingress.annotations` | Additional annotations for the Ingress resource. To enable certificate autogeneration, place here your cert-manager annotations. | `{}` |
| `ingress.tls` | Enable TLS configuration for the hostname defined at ingress.hostname parameter | `false` |
| `ingress.selfSigned` | Create a TLS secret for this ingress record using self-signed certificates generated by Helm | `false` |
| `ingress.extraHosts` | The list of additional hostnames to be covered with this ingress record. | `[]` |
| `ingress.extraPaths` | Any additional arbitrary paths that may need to be added to the ingress under the main host. | `[]` |
| `ingress.extraTls` | The tls configuration for additional hostnames to be covered with this ingress record. | `[]` |
| `ingress.secrets` | If you're providing your own certificates, please use this to add the certificates as secrets | `[]` |
| `ingress.extraRules` | Additional rules to be covered with this ingress record | `[]` |
| `metrics.masterAnnotations` | Annotations for the Prometheus metrics on master nodes | `{}` |
| `metrics.workerAnnotations` | Annotations for the Prometheus metrics on worker nodes | `{}` |
| `metrics.podMonitor.enabled` | If the operator is installed in your cluster, set to true to create a PodMonitor Resource for scraping metrics using PrometheusOperator | `false` |
| `metrics.podMonitor.extraMetricsEndpoints` | Add metrics endpoints for monitoring the jobs running in the worker nodes | `[]` |
| `metrics.podMonitor.namespace` | Specify the namespace in which the podMonitor resource will be created | `""` |
| `metrics.podMonitor.interval` | Specify the interval at which metrics should be scraped | `30s` |
| `metrics.podMonitor.scrapeTimeout` | Specify the timeout after which the scrape is ended | `""` |
| `metrics.podMonitor.additionalLabels` | Additional labels that can be used so PodMonitors will be discovered by Prometheus | `{}` |
| `metrics.prometheusRule.enabled` | Set this to true to create prometheusRules for Prometheus | `false` |
| `metrics.prometheusRule.namespace` | Namespace where the prometheusRules resource should be created | `""` |
| `metrics.prometheusRule.additionalLabels` | Additional labels that can be used so prometheusRules will be discovered by Prometheus | `{}` |
> **Tip**: You can use the default [values.yaml](values.yaml)
## Configuration and installation details
### [Rolling vs Immutable tags](https://docs.bitnami.com/containers/how-to/understand-rolling-tags-containers/)
It is strongly recommended to use immutable tags in a production environment. This ensures your deployment does not change automatically if the same tag is updated with a different image.
Bitnami will release a new chart updating its containers if a new version of the main container, significant changes, or critical vulnerabilities exist.
### Define custom configuration
To use a custom configuration, a ConfigMap should be created with the `spark-env.sh` file inside the ConfigMap. The ConfigMap name must be provided at deployment time.
To set the configuration on the master use `master.configurationConfigMap=configMapName`. To set the configuration on the worker, use `worker.configurationConfigMap=configMapName`.
These values can be set at the same time in a single ConfigMap or using two ConfigMaps. An additional `spark-defaults.conf` file can be provided in the ConfigMap. You can use both files or one without the other.
### Submit an application
To submit an application to the Apache Spark cluster, use the `spark-submit` script, which is available at [https://github.com/apache/spark/tree/master/bin](https://github.com/apache/spark/tree/master/bin).
The command below illustrates the process of deploying one of the sample applications included with Apache Spark. Replace the `k8s-apiserver-host`, `k8s-apiserver-port`, `spark-master-svc`, and `spark-master-port` placeholders with the correct master host/IP address and port for your deployment.
This command example assumes that you have downloaded a Spark binary distribution, which can be found at [Download Apache Spark](https://spark.apache.org/downloads.html).
For a complete walkthrough of the process using a custom application, refer to the [detailed Apache Spark tutorial](https://docs.bitnami.com/tutorials/process-data-spark-kubernetes/) or Spark's guide to [Running Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html).
> Be aware that it is currently not possible to submit an application to a standalone cluster if RPC authentication is configured. [Learn more about the issue](https://issues.apache.org/jira/browse/SPARK-25078).
### Configuring Spark Master as reverse proxy
Spark offers configuration to enable running Spark Master as reverse proxy for worker and application UIs. This can be useful as the Spark Master UI may otherwise use private IPv4 addresses for links to Spark workers and Spark apps.
Coupled with `ingress` configuration, you can set `master.configOptions` and `worker.configOptions` to tell Spark to reverse proxy the worker and application UIs to enable access without requiring direct access to their hosts:
See the [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) docs for detail on the parameters.
### Configure security for Apache Spark
### Configure SSL communication
In order to enable secure transport between workers and master, deploy the Helm chart with the `ssl.enabled=true` chart parameter.
### Create certificate and password secrets
It is necessary to create two secrets for the passwords and certificates. The names of the two secrets should be configured using the `security.passwordsSecretName` and `security.ssl.existingSecret` chart parameters.
The keys for the certificate secret must be named `spark-keystore.jks` and `spark-truststore.jks`, and the content must be text in JKS format. Use [this script to generate certificates](https://raw.githubusercontent.com/confluentinc/confluent-platform-security-tools/master/kafka-generate-ssl.sh) for test purposes if required.
The secret for passwords should have three keys: `rpc-authentication-secret`, `ssl-keystore-password` and `ssl-truststore-password`.
Refer to the [chart documentation for more details on configuring security and an example](https://docs.bitnami.com/kubernetes/infrastructure/spark/administration/configure-security/).
> It is currently not possible to submit an application to a standalone cluster if RPC authentication is configured. [Learn more about this issue](https://issues.apache.org/jira/browse/SPARK-25078).
### Set Pod affinity
This chart allows you to set your custom affinity using the `XXX.affinity` parameter(s). Find more information about Pod affinity in the [Kubernetes documentation](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity).
As an alternative, you can use the preset configurations for pod affinity, pod anti-affinity, and node affinity available at the [bitnami/common](https://github.com/bitnami/charts/tree/main/bitnami/common#affinities) chart. To do so, set the `XXX.podAffinityPreset`, `XXX.podAntiAffinityPreset`, or `XXX.nodeAffinityPreset` parameters.
## Troubleshooting
Find more information about how to deal with common errors related to Bitnami's Helm charts in [this troubleshooting guide](https://docs.bitnami.com/general/how-to/troubleshoot-helm-chart-issues).
## Upgrading
### To 6.0.0
This chart major version standarizes the chart templates and values, modifying some existing parameters names and adding several more. These parameter modifications can be sumarised in the following:
-`worker.autoscaling.CpuTargetPercentage/.replicasMax` parameters are now found by `worker.autoscaling.targetCPU/.maxReplicas`.
-`webport/webPortHttps/cluster` parameters are now found by `containerPorts.http/.https/.cluster`.
-`service.webport/webPortHttps/cluster` parameters are now found by `service.ports.http/.https/.cluster`.
-`service.nodePorts.web/webHttps/cluster` parameters are now found by `service.nodePorts.http/.https/.cluster`.
-`xxxxx.securityContext` parameters are now under `xxxxx.podSecurityContext`
-`xxxxx.configurationConfigMap` parameter has been renamed to `xxxxx.existingConfigmap`
-`extraPodLabels` parameter has been renamed to `podLabels`
Besides the changes detailed above, no issues are expected to appear when upgrading.
### To 5.0.0
This version standardizes the way of defining Ingress rules. When configuring a single hostname for the Ingress rule, set the `ingress.hostname` value. When defining more than one, set the `ingress.extraHosts` array. Apart from this case, no issues are expected to appear when upgrading.
### To 4.0.0
[On November 13, 2020, Helm v2 support formally ended](https://github.com/helm/charts#status-of-the-project). This major version is the result of the required changes applied to the Helm Chart to be able to incorporate the different features added in Helm v3 and to be consistent with the Helm project itself regarding the Helm v2 EOL.
[Learn more about this change and related upgrade considerations](https://docs.bitnami.com/kubernetes/infrastructure/spark/administration/upgrade-helm3/).
### To 3.0.0
- This version introduces `bitnami/common`, a [library chart](https://helm.sh/docs/topics/library_charts/#helm) as a dependency. More documentation about this new utility could be found [here](https://github.com/bitnami/charts/tree/main/bitnami/common#bitnami-common-library-chart). Please, make sure that you have updated the chart dependencies before executing any upgrade.
- Spark container images are updated to use Hadoop `3.2.x`: [Notable Changes: 3.0.0-debian-10-r44](https://github.com/bitnami/containers/tree/main/bitnami/spark#300-debian-10-r44)
> Note: Backwards compatibility is not guaranteed due to the above mentioned changes. Please make sure your workloads are compatible with the new version of Hadoop before upgrading. Backups are always recommended before any upgrade operation.