Spark: An Aptly Named, Exploding Technology

Boasting 1,500 expected participants (up from 900 last year), Spark Summit East has transformed into one of the most exciting data science and engineering conferences on the planet - and this is anything but a surprise. On September 9, 2015, Cloudera announced that Apache Spark would replace Apache MapReduce as the default distributed processing framework for its popular distribution of Hadoop. Earlier that year, IBM announced a $300 million investment in Spark as the engineering centerpiece of its shift into advanced analytics and recruited star Spark talent, including Holden Karau, co-author of Learning Spark and a former Databricks employee. Spark Summit East is the first major 'Big Data' conference of the year, preceding Strata + Hadoop World, Predictive Analytics World, and most platform-specific developer events, and it consistently features keynotes characterizing the direction and power of the platform.

The fiercely competitive, rigorously vetted breakout sessions are a major focal point of Spark Summit, where evangelists describe Spark's groundbreaking accomplishments and most exciting use cases. The concurrent sessions are organized by track. We know that building a personal conference schedule is always challenging, so CapTech has analyzed the session abstracts and speaker profiles and highlights a few significant talks by discipline:

Enterprise Track

Petabyte Scale Anomaly Detection Using R & Spark
  • Speakers Sridhar Alla and Kiran Muglurmath lead the Enterprise Business Intelligence (EBI) group at Comcast; both are experts in the telecommunications industry and experienced developers of data science applications
  • Petabytes of high-velocity customer data spanning multiple sources
  • An anomaly detection problem in speech recognition, addressed with Hidden Markov Models
  • Access to the R language community libraries through SparkR
  • Unstructured data is an increasingly important part of the enterprise
  • Reusing business logic during the transition from serial to parallel computing (for example, R to SparkR) significantly decreases development costs; see the sketch after this list
  • Attend if you want to increase the performance of your R applications or you handle unstructured text or speech
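
The talk itself is built around R and SparkR; as a language-agnostic illustration of the serial-to-parallel reuse pattern mentioned above, here is a minimal PySpark sketch in which an existing single-machine scoring function is applied unchanged across a distributed data set (the function, data, and threshold are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="reuse-serial-logic")

def flag_long_session(record):
    """Existing single-machine business logic: flag unusually long sessions."""
    device_id, duration_sec = record
    return (device_id, duration_sec, duration_sec > 3600)

# Hypothetical telemetry records; in the talk these live in petabyte-scale sources.
telemetry = sc.parallelize([("stb-1", 120), ("stb-2", 5400), ("stb-3", 980)])

# The serial function is reused as-is and simply applied in parallel.
flags = telemetry.map(flag_long_session)
print(flags.collect())
```
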

Developer Track

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
  • Speaker Ewen Cheslack-Postava is an engineer at Confluent; Stanford PhD
  • Practical session for data integration and delivery
  • Apache Kafka should be on your radar if Hadoop is part of your data architecture
  • Kafka Connect promises to shorten development times for Kafka projects
  • Combining Kafka Connect and Spark Streaming is a natural design pattern for realtime data engineering problems; a minimal sketch follows this list
  • Attend if you have data delivery problems, want to reduce your Kafka footprint, or are unfamiliar with Kafka
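
As a rough sketch of that pattern, the snippet below assumes a Kafka Connect source is already writing records into a Kafka topic (the topic name and broker address are placeholders) and uses Spark Streaming's direct Kafka API (available in Spark 1.3 through 2.x) to consume them:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-connect-plus-streaming")
ssc = StreamingContext(sc, batchDuration=5)

# Direct stream over the topic a Kafka Connect source is populating.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["connect-ingest"],                      # hypothetical topic name
    kafkaParams={"metadata.broker.list": "localhost:9092"},
)

# Each record arrives as a (key, value) pair; count records per micro-batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```
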
Inside Apache SystemML
  • Speaker Frederick Reiss is the Chief Architect at IBM Spark Technology Center; UC Berkeley PhD
  • Apache SystemML accelerates data science algorithm development for large-scale machine learning problems
  • SystemML compiles and optimizes the operations of a high-level algorithm into Spark API calls; see the sketch after this list
  • SystemML entered the Apache incubator phase in November 2015; committers include Databricks (Patrick Wendell, Reynold Xin, etc.), IBM (Holden Karau, Luciano Resende, etc.), UC Berkeley faculty, and others
  • Attend if you're interested in the future of machine learning and advanced algorithm development in distributed programming frameworks
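
To make the "high-level algorithm to Spark API calls" idea concrete, here is a minimal sketch using SystemML's Python MLContext API, in which a few lines of DML (SystemML's R-like declarative language) are handed to the optimizer and executed against an existing SparkContext; treat the exact package and import path as assumptions:

```python
from pyspark import SparkContext
from systemml import MLContext, dml  # Apache SystemML's Python API (assumed installed)

sc = SparkContext(appName="systemml-sketch")
ml = MLContext(sc)

# A few lines of DML; SystemML's optimizer decides how these matrix
# operations are translated into Spark execution plans.
script = dml("""
X = rand(rows = 10000, cols = 100)
colMeans = colSums(X) / nrow(X)
print("mean of first column: " + as.scalar(colMeans[1, 1]))
""")

ml.execute(script)
```
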

Data Science Track

Distributed Time Travel for Feature Generation
  • Speakers DB Tsai, Prasanna Padmanabhan, and Mohammad H. Taghavi are in Netflix's Research Division; the group works on Netflix's personalization engine
  • Netflix pursues ambitious predictive analytics models and goals; its personalization engine is complex yet highly effective
  • Feature generation is enhanced by snapshots of data and models taken at specified points in time; see the sketch after this list
  • Interactively prototype model features and experiment using Apache Zeppelin
  • Attend if your model-building process is needlessly iterative, you employ predictive analytics on large volumes of data, you have never seen Apache Zeppelin in action, or your product is personalized based on the collective intelligence of your customers
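
Netflix's actual system is far more sophisticated, but as a hedged illustration of the snapshot idea, the sketch below recomputes the same member-level features "as of" different historical timestamps by filtering an event log (the schema, data, and feature definitions are made up for this example):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("time-travel-features").getOrCreate()

# Hypothetical viewing-event log with an event timestamp per row.
events = spark.createDataFrame(
    [("member-1", "2015-11-01 10:00:00", 42.0),
     ("member-1", "2015-12-15 21:30:00", 17.0),
     ("member-2", "2015-11-20 08:05:00", 65.0)],
    ["member_id", "event_time", "minutes_watched"],
)

def features_as_of(snapshot_ts):
    """Build per-member features using only the data visible at snapshot_ts."""
    visible = events.filter(F.col("event_time") <= F.lit(snapshot_ts))
    return visible.groupBy("member_id").agg(
        F.count("*").alias("play_count"),
        F.sum("minutes_watched").alias("total_minutes"),
    )

# Recompute identical features at two historical points in time.
features_as_of("2015-11-30 23:59:59").show()
features_as_of("2015-12-31 23:59:59").show()
```
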
Distributed TensorFlow on Spark: Scaling Google's Deep Learning Library
  • Speaker Christopher Nguyen is the CEO and co-founder of Arimo; Stanford PhD
  • TensorFlow, the software library at the core of Google's deep-learning engine, was released as open source in November 2015
  • Deep learning is a class of machine learning models built from many layers of nonlinear transformations (deep neural networks)
  • TensorFlow represents computations as a graph: operations are the nodes, and the multidimensional arrays (tensors) flowing along the edges give the library its name; see the sketch after this list
  • Google runs TensorFlow on its own advanced, large-scale hardware infrastructure
  • Arimo's distributed implementation on Spark allows TensorFlow to scale horizontally
  • Attend if you are familiar with machine learning models and interested in horizontal scaling for TensorFlow
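
The sketch below is plain, single-machine TensorFlow (the 1.x graph API), not Arimo's distributed implementation; it only illustrates the graph model described above, where each operation is a node and the values on the edges are tensors:

```python
import tensorflow as tf  # TensorFlow 1.x graph API

# Build a tiny computation graph: each operation (matmul, add) is a node,
# and the multidimensional values flowing along the edges are tensors.
a = tf.constant([[1.0, 2.0]])        # 1x2 tensor
w = tf.constant([[3.0], [4.0]])      # 2x1 tensor
y = tf.add(tf.matmul(a, w), 1.0)     # graph: MatMul -> Add

# Nothing is computed until the graph is executed in a session.
with tf.Session() as sess:
    print(sess.run(y))               # [[12.]]
```
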
Time Series Analysis With Spark
  • Speaker Sandy Ryza is a data scientist at Cloudera; a frequent Spark Summit speaker and co-author of Advanced Analytics with Spark
  • Time series data sets are ubiquitous for data-rich industries but are frequently mismanaged due to their complexity and volume
  • Spark-TS is an open source Spark library developed by the data science team at Cloudera to address modeling problems specific to time series data sets
  • Time series models require inputs that are difficult to produce with traditional data modeling techniques; for example, an autoregressive (AR) model requires a lag operator computed over p adjacent data points (see the sketch after this list)
  • Attend if you are developing applications for munging, manipulating, and modeling time series data
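
The spark-ts library provides time-series-specific abstractions; as a generic illustration of the lag-operator point above, this sketch builds the p = 2 lagged inputs an AR(2) model would need using ordinary DataFrame window functions rather than spark-ts itself (the data and column names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("ar-lag-features").getOrCreate()

# Hypothetical univariate series: one observation per timestamp per series key.
ts = spark.createDataFrame(
    [("sensor-1", 1, 10.0), ("sensor-1", 2, 12.5), ("sensor-1", 3, 11.8),
     ("sensor-1", 4, 13.1), ("sensor-1", 5, 12.9)],
    ["series", "t", "value"],
)

# Build the p = 2 lagged inputs an AR(2) model needs, ordered within each series.
w = Window.partitionBy("series").orderBy("t")
ar_inputs = (ts
             .withColumn("lag_1", F.lag("value", 1).over(w))
             .withColumn("lag_2", F.lag("value", 2).over(w)))

ar_inputs.show()
```
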

Applications Track

Spark Streaming and IoT
  • Speaker Mike Freedman is a professor at Princeton and co-founder/CEO of iobeam
  • Internet of Things (IoT) is a leading area for Apache Spark applications
  • The experience and lessons learned from iobeam, a data analysis platform for sensor-collected data
  • Use cases and architecture of realtime-enabled data analysis pipelines
  • How to use Spark effectively for streaming and batch applications; a minimal streaming sketch follows this list
  • Problems of data collection for sensors and devices and the implications for Spark applications
  • Attend if you are a data architect or a manager handling device and sensor data
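
As a minimal sketch of the kind of realtime sensor pipeline this session covers, the snippet below reads hypothetical "device_id,reading" records from a socket source and maintains a sliding one-minute average per device with Spark Streaming (the source, record format, and window sizes are assumptions, not iobeam's architecture):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="iot-sensor-stream")
ssc = StreamingContext(sc, batchDuration=10)

# Hypothetical source: newline-delimited "device_id,reading" records on a socket.
lines = ssc.socketTextStream("localhost", 9999)
readings = (lines.map(lambda line: line.split(","))
                 .map(lambda fields: (fields[0], float(fields[1]))))

# Sliding one-minute average reading per device, recomputed every 10 seconds.
averages = (readings
            .mapValues(lambda v: (v, 1))
            .reduceByKeyAndWindow(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                                  None, 60, 10)
            .mapValues(lambda s: s[0] / s[1]))

averages.pprint()

ssc.start()
ssc.awaitTermination()
```
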

Research Track

Succinct Spark: Fast Interactive Queries on Compressed RDDs
  • Speaker Rachit Agarwal is a postdoctoral fellow at University of California Berkeley's AMPLab, the birthplace of Spark; University of Illinois PhD
  • Succinct is a distributed data store that supports a wide range of point queries directly on a compressed representation of the input data
  • Succinct distributes compressed input data across the nodes of a cluster and supports distributed search over it, in the spirit of Elasticsearch or Solr
  • Succinct Spark provides an alternative to full scans or indexes for searching RDDs
  • Succinct Spark is orders of magnitude faster than native Spark for document storage and retrieval
  • Attend if RDD scans are limiting the performance of your Spark applications or you are interested in the concept of search using Spark