Natural Language Processing with Databricks

Summary

Perform natural language processing at scale on Databricks.

Description

This five-hour course teaches you how to perform natural language processing at scale on Databricks. You will apply libraries such as NLTK and Gensim in a distributed setting, as well as Spark ML/MLlib, to solve classification, sentiment analysis, and text-wrangling tasks. You will learn how to remove stop words, when to lemmatize versus stem your tokens, and how to generate term frequency-inverse document frequency (TF-IDF) vectors for your dataset. You will also use dimensionality-reduction techniques to visualize word embeddings with TensorBoard, and apply and visualize basic vector arithmetic on embeddings.
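To illustrate the TF-IDF weighting the course covers, here is a minimal pure-Python sketch using the textbook formula tf(t, d) × ln(N / df(t)). The token lists are made-up examples; in the course itself you would compute this at scale with Spark ML's feature transformers rather than by hand.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute textbook TF-IDF weights for a list of tokenized documents.

    docs: list of token lists (e.g. after tokenization and stop-word removal).
    Returns one {term: weight} dict per document, where
    weight = raw term frequency * ln(N / document frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus for illustration
docs = [
    ["spark", "runs", "fast"],
    ["spark", "scales", "nlp"],
    ["nltk", "tokenizes", "text"],
]
scores = tf_idf(docs)
# "spark" appears in 2 of 3 documents, so it is weighted lower than
# "fast", which appears in only 1 document.
```

Note that common terms receive low weights and rare terms high ones, which is exactly why TF-IDF vectors work well as features for the classification and sentiment-analysis tasks mentioned above. (Spark ML's IDF estimator uses a smoothed variant of the IDF formula, so its exact values differ slightly from this sketch.)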

Learning objectives

  • Explain the motivation behind using Natural Language Processing to analyze data.

  • Identify distributed Natural Language Processing libraries commonly used when analyzing data.

  • Perform a series of Natural Language Processing workflows in the Databricks Data Science Workspace.

Prerequisites

  • Experience working with PySpark DataFrames

  • Mastery of concepts presented in the Databricks Academy "Apache Spark Programming" course

  • Mastery of concepts presented in the Databricks Academy "Scalable Machine Learning with Apache Spark" course

Learning path

  • This course is part of the Data Scientist learning path.

Proof of completion

  • Upon completing 80% of this course, you will receive a proof of completion.
