Natural Language Processing refers to a set of technologies used in our everyday lives to make it easier for computers to understand human language. Thanks to the increasingly relevant use cases popping up every day, it has quickly grown into one of the most important fields in data science. Extracting accurate information from readable text is essential for applications such as search, customer service and financial engineering.
At the forefront of this battle to understand human language are libraries specifically written to accomplish tasks such as language modeling, disambiguation, paraphrasing and question-answering. These are in abundance. However, one tool, in particular, stands out more than any other – Spark NLP.
According to a study carried out by O’Reilly, Spark NLP is the most popular NLP tool among developers and the seventh-most used tool in Artificial Intelligence applications. Here are a few reasons why it is such a popular tool.
It’s very accurate
On the software age spectrum, Spark NLP is young. It was first released in 2017, so libraries like spaCy, CoreNLP and OpenNLP have had more time to build momentum. Even so, Spark NLP often outperforms them because it approaches the problem with more recent and more advanced methods. For instance, it comes with a production-ready implementation of BERT embeddings and uses transfer learning for data extraction.
In the world of Natural Language Processing, representing words as vectors opens up a world of possibilities. Such representations are called word embeddings. Combined with transformers, a kind of neural network architecture created by Google, they give us BERT (Bidirectional Encoder Representations from Transformers).
It’s a language model that outperforms conventional sequential models, such as GRUs and other RNNs, in terms of accuracy even before convergence. One common application of BERT is entity recognition.
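To make the idea of words as vectors concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the helper names (`EMBEDDINGS`, `cosine_similarity`, `most_similar`) are invented for illustration and are not part of Spark NLP or BERT. Real embeddings have hundreds of dimensions and are learned from text.

```python
import math

# Toy 3-dimensional "embeddings"; the numbers are made up for illustration.
# Real models such as BERT produce vectors with hundreds of dimensions.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word):
    """Return the other vocabulary word whose vector is closest to `word`."""
    target = EMBEDDINGS[word]
    candidates = (w for w in EMBEDDINGS if w != word)
    return max(candidates, key=lambda w: cosine_similarity(target, EMBEDDINGS[w]))
```

With these toy values, `most_similar("king")` picks "queen" rather than "banana", because related words sit close together in the vector space. That closeness is what downstream tasks such as entity recognition build on.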
Reduced training model sizes
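Transfer learning reuses a model that has already been trained on a large body of text and adapts it to a new task. Because the model starts with knowledge of the language, it can be highly effective even with small amounts of task-specific data. As a result, there’s no need to collect large amounts of data to train state-of-the-art models.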
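The following toy sketch shows why pre-trained knowledge reduces the amount of labeled data needed. It is a simplified, feature-based stand-in for transfer learning, not how Spark NLP implements it. The `PRETRAINED` vectors and the `fit` and `predict` helpers are hypothetical and exist only for this example.

```python
# Frozen "pre-trained" word vectors. The values are invented for illustration;
# in practice they would come from a model trained on a large corpus.
PRETRAINED = {
    "great": [0.9, 0.1],
    "excellent": [0.95, 0.05],
    "good": [0.8, 0.2],
    "awful": [0.1, 0.9],
    "terrible": [0.05, 0.95],
    "bad": [0.2, 0.8],
}


def centroid(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def fit(labeled_words):
    """Learn one centroid per label from a handful of (word, label) pairs."""
    by_label = {}
    for word, label in labeled_words:
        by_label.setdefault(label, []).append(PRETRAINED[word])
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(centroids, word):
    """Assign the label whose centroid is nearest to the word's frozen vector."""
    vec = PRETRAINED[word]

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))

    return min(centroids, key=squared_distance)


# Two labeled examples are enough here because the vectors already encode meaning.
model = fit([("great", "positive"), ("awful", "negative")])
```

Calling `predict(model, "excellent")` returns "positive" even though "excellent" never appeared in the labeled data. The frozen vectors do most of the work, and only the small classifier on top is trained.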
It is fast
Apache Spark is a powerful analytics engine for large-scale data processing on distributed clusters. Compared to Hadoop MapReduce, it can run some in-memory workloads up to a hundred times as fast. Spark NLP leverages this performance boost and other optimizations to run orders of magnitude faster than the design limitations of legacy applications allow.
Another reason for this speed is the 2015 introduction of a new execution engine (Tungsten) to Apache Spark. With Tungsten, Spark manages much of its memory itself instead of relying on the JVM’s built-in garbage collection, which makes memory management more performant.
Hardware innovations by GPU manufacturers such as NVIDIA have also given Spark NLP an upper hand. Since Spark NLP uses TensorFlow under the hood for various operations, it can leverage the performance benefits that the more powerful hardware provides. In comparison, legacy libraries will probably require a rewrite to achieve the same.
It is fully supported by Spark
Spark is currently one of the most popular libraries in the machine learning world because of its speed and flexibility. The need for a library that fully supports it should be immediately apparent.
There already exist libraries that are friendly with Spark, such as Spark ML, but these are usually not as feature-rich as Spark NLP. Without it, developers are bound to find themselves importing additional libraries to process data before feeding the intermediate results back to Spark. This approach is inefficient because too much time is spent serializing and deserializing strings.
It is scalable
Another benefit that Spark NLP gets from relying on Spark under the hood is scalability. Spark, primarily used for distributed applications, was designed to be scalable. Spark NLP benefits from this since it can scale on any Spark cluster, whether on-premises or with any cloud provider. This improved scalability is thanks to Spark’s ability to pull cluster-wide data into an in-memory cache.
Caching is advantageous when dealing with sets of data that need to be accessed repeatedly. Iterative algorithms such as random forests, and workloads that repeatedly access small sets of data, are examples of applications that benefit greatly from cluster-wide caching.
Spark’s distributed nature also lends a hand here. Since most large-scale applications require the processing load to be distributed among different servers, Spark NLP is built to handle that task from the start.
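The benefit of caching for iterative work can be sketched in plain Python. This is a single-machine analogy, not Spark code: the `CachedDataset` class and `iterative_mean` function are hypothetical names made up for this example. In PySpark, the comparable step is calling `cache()` on a DataFrame before reusing it.

```python
class CachedDataset:
    """Loads data through an expensive function once, then serves it from memory."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical expensive read, e.g. from disk
        self._cache = None
        self.load_count = 0     # how many times the expensive read actually ran

    def get(self):
        # Only call the loader on the first access; reuse the result afterwards.
        if self._cache is None:
            self._cache = self._loader()
            self.load_count += 1
        return self._cache


def iterative_mean(dataset, iterations=10, step=0.5):
    """Estimate the mean by gradient descent, rereading the data on every pass."""
    estimate = 0.0
    for _ in range(iterations):
        values = dataset.get()
        gradient = sum(estimate - v for v in values) / len(values)
        estimate -= step * gradient
    return estimate


dataset = CachedDataset(lambda: [2.0, 4.0, 6.0])
result = iterative_mean(dataset)
```

The algorithm asks for the data ten times, but `dataset.load_count` ends at 1 because every pass after the first is served from memory. On a cluster the same principle applies, with the cached data spread across the memory of many machines.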
Extensive functionality and support
Spark NLP was originally written in Scala, making it compatible with a variety of JVM languages such as Java, Kotlin and Groovy. Over the years, it has also gained a full Python API. It offers support for architectures and software that other libraries tend to ignore, including:
- Training on GPUs
- Native support for Spark
- Support for Hadoop (YARN and HDFS)
- Support for user-defined deep-learning networks
Other library-specific features present in Spark NLP include:
- Sentence detection
- Pre-trained models
A large community
No matter how large or extensive a library is, it can only be as successful as the community that rallies behind it. A large community is beneficial because developers tend to band together and create resources that every one of them can benefit from. Additionally, anyone finding themselves stuck can quickly get help from people that have had similar problems through Stack Overflow or similar platforms.
Fortunately, Spark NLP supports some of the most popular programming languages in the world. Java and Python are good examples, but this audience is greatly expanded once other JVM languages like Kotlin and Scala are included.
There are many other open-source libraries that have large communities and offer a wealth of features, including spaCy, CoreNLP and NLTK. However, much of the appeal of Spark NLP comes from its compatibility with Spark, considering Spark’s recent massive boom in popularity.
Spark isn’t the easiest library to wrap your head around, but Spark NLP does a good job of providing a simple API that is easy to work with. For developers, this often translates into doing more with fewer lines of code.