
Integrate with Spark NLP¶

Comet integrates with Spark NLP.

Spark NLP is a state-of-the-art Natural Language Processing (NLP) library built on top of Apache Spark. It provides simple, performant and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.logging.comet import CometLogger

spark = sparknlp.start()
logger = CometLogger()

# Your training code goes here

Log automatically¶

Spark NLP ships with a dedicated CometLogger that can automatically track the following experiment data:

  • Model Metrics

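Comet can automatically monitor model training metrics from your PySpark pipelines. To use this, enable output logs on the annotator you are training and pass the same log directory to logger.monitor, as in the end-to-end example below.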

Configuring the CometLogger for Spark NLP¶

For more information about the CometLogger and its options, see the Spark NLP documentation.
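As a rough illustration of what configuration can look like, the sketch below collects constructor arguments from the environment. It assumes that CometLogger accepts workspace, project_name, and tags keyword arguments, and that COMET_WORKSPACE and COMET_PROJECT_NAME hold the values you want to use. The helper name comet_logger_kwargs and the "spark-nlp" tag are hypothetical and only serve the example. Check the Spark NLP documentation for the authoritative signature.

```python
import os


def comet_logger_kwargs(env=None):
    """Build keyword arguments for CometLogger from environment variables.

    Hypothetical helper, not part of Spark NLP or Comet.
    """
    env = os.environ if env is None else env

    # Hypothetical tag, used here only to label runs from this integration.
    kwargs = {"tags": ["spark-nlp"]}

    # Pass only the values that are actually set, so that anything left out
    # falls back to whatever defaults the logger itself applies.
    if env.get("COMET_WORKSPACE"):
        kwargs["workspace"] = env["COMET_WORKSPACE"]
    if env.get("COMET_PROJECT_NAME"):
        kwargs["project_name"] = env["COMET_PROJECT_NAME"]
    return kwargs


# Intended usage (assumes spark-nlp and comet_ml are installed and a Comet
# API key is configured):
#
#   from sparknlp.logging.comet import CometLogger
#   logger = CometLogger(**comet_logger_kwargs())
```

Keeping the argument collection in a small function makes it easy to check which settings will be passed before any experiment is created.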

End-to-end Example¶

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.logging.comet import CometLogger

spark = sparknlp.start()

OUTPUT_LOG_PATH = "./run"
logger = CometLogger()

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
embds = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")
multiClassifier = MultiClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("labels") \
    .setBatchSize(128) \
    .setLr(1e-3) \
    .setThreshold(0.5) \
    .setShufflePerEpoch(False) \
    .setEnableOutputLogs(True) \
    .setOutputLogsPath(OUTPUT_LOG_PATH) \
    .setMaxEpochs(1)

# Watch the output-log directory so training metrics are sent to Comet
logger.monitor(logdir=OUTPUT_LOG_PATH, model=multiClassifier)

trainDataset = spark.createDataFrame(
    [("Nice.", ["positive"]), ("That's bad.", ["negative"])],
    schema=["text", "labels"],
)

pipeline = Pipeline(stages=[document, embds, multiClassifier])
pipeline_model = pipeline.fit(trainDataset)

# Log the parameters of each stage in the fitted pipeline
logger.log_pipeline_parameters(pipeline_model)

# End the Comet experiment
logger.end()

Try it out!¶

Try our example for using Comet with Spark NLP.


Nov. 18, 2024