Lesson 3: Classify unstructured text data with Sparkling Water

Data scientists are happier when they can combine Apache Spark’s elegant APIs, RDDs, and multi-tenant context with an open source, distributed, parallel predictive engine for machine learning.

H2O.ai’s revolutionary product, Sparkling Water, blends data science workflows into developers’ applications by combining H2O’s machine learning technology with Spark’s fast data processing. Sparkling Water provides API calls that transform an H2O Frame into a Spark DataFrame, giving access to Spark’s SQL engine. Conversely, it transforms Spark DataFrames into H2O Frames, giving access to H2O’s algorithms.

Sparkling Water enables practitioners to use H2O algorithms in conjunction with MLlib algorithms on Apache Spark, run Scala code in Flow, and export pipelines as executable Java code for easy deployment. It equips you with a toolbox to build smarter applications that use both sets of algorithms in harmony with the Spark ecosystem.

In the example below, watch H2O’s community hacker Alex Tellez pull 14,000+ job titles off Craigslist Bay Area listings and build a model that classifies the unstructured text of a job title into a job category. In the video that follows, Michal Malohlava describes how to turn these models into a streaming application that scores job titles on the fly.



Part 1: H2O’s community hacker Alex Tellez builds a model that classifies the unstructured text.



Part 2: Michal Malohlava describes how to turn these models into a streaming application.
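
To get a feel for the two parts above without a Spark cluster, here is a minimal sketch in plain Python. It is not the method used in the videos and it does not use Sparkling Water or H2O at all: it swaps in a simple bag-of-words naive Bayes classifier with add-one smoothing, and a generator that scores titles one at a time to stand in for a stream. The class name, function names, category labels, and the six training titles are all hypothetical and invented for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a job title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", title.lower())


class NaiveBayesTitleClassifier:
    """Toy multinomial naive Bayes over word counts (illustrative only)."""

    def __init__(self):
        self.class_counts = Counter()            # titles seen per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocab = set()                       # every word seen in training

    def fit(self, titles, labels):
        for title, label in zip(titles, labels):
            self.class_counts[label] += 1
            for tok in tokenize(title):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, title):
        tokens = tokenize(title)
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # log prior plus log likelihood with add-one (Laplace) smoothing
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def score_stream(model, titles):
    """Score titles lazily, one at a time, as they arrive from any iterable."""
    for title in titles:
        yield title, model.predict(title)


# Hypothetical toy training data, not taken from the Craigslist data set.
TRAIN = [
    ("Senior Java Software Engineer", "engineering"),
    ("Backend Software Developer", "engineering"),
    ("Registered Nurse Night Shift", "healthcare"),
    ("Home Health Care Nurse", "healthcare"),
    ("Line Cook and Dishwasher", "food"),
    ("Restaurant Server and Cook", "food"),
]

model = NaiveBayesTitleClassifier().fit(*zip(*TRAIN))
incoming = ["Software Engineer Intern", "Pediatric Nurse", "Prep Cook"]
results = list(score_stream(model, incoming))
for title, category in results:
    print(f"{title} -> {category}")
```

The generator only shows the shape of on-the-fly scoring: each title is classified as soon as it is read, so the same loop could consume a socket or a message queue instead of a list. A real deployment along the lines of the videos would train on the full data set and run the scoring inside Spark.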

GET THE DEMO


Questions?

  • We'd love to hear from you! Reach out to us on our Google Stream group with questions, requests, or comments any time.