Big Data is a term that refers to solutions designed for storing and processing large data sets. Initially developed by Google, these Big Data solutions have evolved and inspired other similar projects, many of which are available as open source. R is a popular programming language in the financial industry.
[category_overview] =>
[outline] =>
Introduction to Programming Big Data with R (pbdR)
Setting up your environment to use pbdR
Scope and tools available in pbdR
Packages commonly used with Big Data alongside pbdR
Message Passing Interface (MPI)
Using pbdR MPI
Parallel processing
Point-to-point communication
Send Matrices
Summing Matrices
Collective communication
Summing Matrices with Reduce
Scatter / Gather
Other MPI communications
Distributed Matrices
Creating a distributed diagonal matrix
SVD of a distributed matrix
Building a distributed matrix in parallel
Statistics Applications
Monte Carlo Integration
Reading Datasets
Reading on all processes
Broadcasting from one process
Reading partitioned data
Distributed Regression
Distributed Bootstrap
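The Monte Carlo Integration topic in the outline above is taught in R with pbdR, but the core idea can be sketched in plain Python (an illustrative serial version; in pbdR each MPI rank would draw its own sample and the partial counts would be combined with a reduce):

```python
import random

def monte_carlo_pi(n_samples, seed=0):
    """Estimate pi by sampling points in the unit square and
    counting the fraction that falls inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

estimate = monte_carlo_pi(100_000)
```

To distribute this, each process runs the loop on `n_samples / n_ranks` draws and only the `inside` counters are summed across ranks, which is exactly the reduce pattern listed under "Collective communication" above.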
[language] => en
[duration] => 21
[status] => published
[changed] => 1700037139
[source_title] => Programming with Big Data in R
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
)
[1] => Array
(
[0] => stdClass Object
(
[tid] => 766
[alias] => big-data-training
[name] => Big Data
[english_name] => Big Data
[consulting_option] => available_promoted
)
[1] => stdClass Object
(
[tid] => 877
[alias] => r-language-training
[name] => R Language
[english_name] => R Language
[consulting_option] => available
)
)
[2] => bigdatar
[3] => Array
(
[outlines] => Array
(
[tidyverse] => stdClass Object
(
[course_code] => tidyverse
[hr_nid] => 212656
[title] => Introduction to Data Visualization with Tidyverse and R
[requirements] =>
No programming experience is necessary
[overview] =>
The Tidyverse is a collection of versatile R packages for cleaning, processing, modeling, and visualizing data. Some of the packages included are: ggplot2, dplyr, tidyr, readr, purrr, and tibble.
In this instructor-led, live training, participants will learn how to manipulate and visualize data using the tools included in the Tidyverse.
By the end of this training, participants will be able to:
Perform data analysis and create appealing visualizations
Draw useful conclusions from various sample datasets
Filter, sort and summarize data to answer exploratory questions
Turn processed data into informative line plots, bar plots, histograms
Import and filter data from diverse data sources, including Excel, CSV, and SPSS files
Audience
Beginners to the R language
Beginners to data analysis and data visualization
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Introduction
Tidyverse vs traditional R plotting
Setting up your working environment
Preparing the dataset
Importing and filtering data
Wrangling the data
Visualizing the data (graphs, scatter plots)
Grouping and summarizing the data
Visualizing the data (line plots, bar plots, histograms, boxplots)
Working with non-standard data
Closing remarks
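The course itself works in R with dplyr and ggplot2; as a language-neutral illustration of the filter → group/summarize → sort workflow named in the outline, here is a plain-Python sketch over a made-up dataset (the data and column names are hypothetical):

```python
# Hypothetical rows standing in for an imported CSV.
sales = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 75},
    {"region": "north", "amount": 200},
    {"region": "south", "amount": 50},
]

# filter: keep rows with amount >= 75 (like dplyr's filter())
filtered = [row for row in sales if row["amount"] >= 75]

# group + summarize: total amount per region (like group_by() + summarize())
totals = {}
for row in filtered:
    totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]

# sort regions by total, descending (like arrange(desc(...)))
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The same three steps map one-to-one onto a dplyr pipe; the resulting summary table is then what gets handed to a plotting layer.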
[language] => en
[duration] => 7
[status] => published
[changed] => 1700037359
[source_title] => Introduction to Data Visualization with Tidyverse and R
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => tidyverse
)
[datavault] => stdClass Object
(
[course_code] => datavault
[hr_nid] => 210132
[title] => Data Vault: Building a Scalable Data Warehouse
[requirements] =>
An understanding of data warehousing concepts
An understanding of database and data modeling concepts
Audience
Data modelers
Data warehousing specialists
Business Intelligence specialists
Data engineers
Database administrators
[overview] =>
Data Vault Modeling is a database modeling technique that provides long-term historical storage of data that originates from multiple sources. A data vault stores a single version of the facts, or "all the data, all the time". Its flexible, scalable, consistent and adaptable design encompasses the best aspects of 3rd normal form (3NF) and star schema.
In this instructor-led, live training, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
[outline] =>
Introduction
The shortcomings of existing data warehouse data modeling architectures
Benefits of Data Vault modeling
Overview of Data Vault architecture and design principles
SEI / CMM / Compliance
Data Vault applications
Dynamic Data Warehousing
Exploration Warehousing
In-Database Data Mining
Rapid Linking of External Information
Data Vault components
Hubs, Links, Satellites
Building a Data Vault
Modeling Hubs, Links and Satellites
Data Vault reference rules
How components interact with each other
Modeling and populating a Data Vault
Converting 3NF OLTP to a Data Vault Enterprise Data Warehouse (EDW)
Understanding load dates, end-dates, and join operations
Business keys, relationships, link tables and join techniques
Query techniques
Load processing and query processing
Overview of Matrix Methodology
Getting data into data entities
Loading Hub Entities
Loading Link Entities
Loading Satellites
Using SEI/CMM Level 5 templates to obtain repeatable, reliable, and quantifiable results
Developing a consistent and repeatable ETL (Extract, Transform, Load) process
Building and deploying highly scalable and repeatable warehouses
Closing remarks
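The Hub/Link/Satellite split described in the outline above can be illustrated with a minimal sketch (hypothetical Python dataclasses, not part of the course materials): a hub holds only a business key, a link records a relationship between hubs, and a satellite attaches descriptive, load-dated attributes so history is preserved rather than overwritten.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hub:
    # A hub stores only the business key (plus load metadata in practice).
    business_key: str

@dataclass(frozen=True)
class Link:
    # A link relates two or more hubs.
    hubs: tuple

@dataclass
class Satellite:
    # A satellite holds descriptive attributes stamped with a load date,
    # so each change adds a row instead of updating one.
    parent: Hub
    load_date: str
    attributes: dict

customer = Hub("CUST-001")
order = Hub("ORD-9001")
placed = Link((customer, order))
details = Satellite(customer, "2024-01-01", {"name": "Acme Ltd"})
```

Keeping keys, relationships, and attributes in separate structures is what makes the model auditable and adaptable: new sources add satellites without touching existing hubs or links.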
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349914
[source_title] => Data Vault: Building a Scalable Data Warehouse
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => datavault
)
[sparkstreaming] => stdClass Object
(
[course_code] => sparkstreaming
[hr_nid] => 356863
[title] => Spark Streaming with Python and Kafka
[requirements] =>
Experience with Python and Apache Kafka
Familiarity with stream-processing platforms
Audience
Data engineers
Data scientists
Programmers
[overview] =>
Apache Spark Streaming is a scalable, open source stream processing system that allows users to process real-time data from supported sources. Spark Streaming enables fault-tolerant processing of data streams.
This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
[outline] =>
Introduction
Overview of Spark Streaming Features and Architecture
Confluent KSQL is a stream processing framework built on top of Apache Kafka. It enables real-time data processing using SQL operations.
This instructor-led, live training (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Apache Ignite is an in-memory computing platform that sits between the application and data layer to improve speed, scale, and availability.
This instructor-led, live training (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
[outline] =>
Introduction
Overview of Big Data Tools and Technologies
Installing and Configuring Apache Ignite
Overview of Ignite Architecture
Querying Data in Ignite
Spreading Large Data Sets across a Cluster
Understanding the In-Memory Data Grid
Writing a Service in Ignite
Running Distributed Computing with Ignite
Integrating Ignite with RDBMS, NoSQL, Hadoop and Machine Learning Processors
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
[category_overview] =>
[outline] =>
Introduction
Apache Beam vs MapReduce, Spark Streaming, Kafka Streaming, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
Beam Model, SDKs, Beam Pipeline Runners
Distributed processing back-ends
Understanding the Apache Beam Programming Model
How a pipeline is executed
Running a sample pipeline
Preparing a WordCount pipeline
Executing the Pipeline locally
Designing a Pipeline
Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
Writing the driver program and defining the pipeline
Using Apache Beam classes
Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
Executing the pipeline locally, on remote machines, and on a public cloud
Choosing a runner
Runner-specific configurations
Testing and Debugging Apache Beam
Using type hints to emulate static typing
Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Windowing and Triggers
Making Your Pipelines Reusable and Maintainable
Create New Data Sources and Sinks
Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
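The WordCount pipeline referenced in the outline above can be approximated without Beam as a flat-map followed by a per-key count (a plain-Python sketch of the dataflow, not Beam SDK code; the input lines are made up):

```python
from collections import Counter

lines = ["to be or not to be", "to code is to think"]

# ParDo-style step: split each line into words (a flat map)
words = [word for line in lines for word in line.split()]

# Combine-style step: count occurrences per key
counts = Counter(words)
```

In Beam the same two steps become transforms over a PCollection, and the runner (Flink, Spark, Dataflow, …) decides how to shard the collection and shuffle the per-key counts across workers.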
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
[outline] =>
To request a customized course outline for this training, please contact us.
Apache Storm is a distributed, real-time computation engine used for enabling real-time business intelligence. It does so by enabling applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real-time.
Some of the topics included in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Request a customized course outline for this training!
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Automate dataflows.
Enable streaming analytics.
Apply various approaches for data ingestion.
Transform Big Data into business insights.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc> (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processors.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processors.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
[outline] =>
Introduction
Data at rest vs data in motion
Overview of Big Data Tools and Technologies
Hadoop (HDFS and MapReduce) and Spark
Installing and Configuring NiFi
Overview of NiFi Architecture
Development Approaches
Application development tools and mindset
Extract, Transform, and Load (ETL) tools and mindset
Design Considerations
Components, Events, and Processor Patterns
Exercise: Streaming Data Feeds into HDFS
Error Handling
Controller Services
Exercise: Ingesting Data from IoT Devices using Web-Based APIs
Exercise: Developing a Custom Apache NiFi Processor using JSON
Apache Flink is an open-source framework for scalable stream and batch data processing.
This instructor-led, live training (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
[outline] =>
Introduction
Installing and Configuring Apache Flink
Overview of Flink Architecture
Developing Data Streaming Applications in Flink
Managing Diverse Workloads
Performing Advanced Analytics
Setting up a Multi-Node Flink Cluster
Mastering Flink DataStream API
Understanding Flink Libraries
Integrating Flink with Other Big Data Tools
Testing and Troubleshooting
Summary and Next Steps
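Windowing, a core idea in both the Flink and Spark Streaming material above, can be sketched in plain Python: timestamped events are bucketed into fixed-size (tumbling) windows and aggregated per bucket. This is illustrative only; the toy events are invented, and real engines additionally handle out-of-order data, watermarks, and fault tolerance.

```python
# (timestamp_seconds, value) pairs standing in for a stream
events = [(1, 10), (3, 20), (7, 5), (8, 15), (12, 30)]

WINDOW = 5  # tumbling window size in seconds

# Sum the values that fall into each 5-second window.
windows = {}
for ts, value in events:
    bucket = (ts // WINDOW) * WINDOW  # start time of the window
    windows[bucket] = windows.get(bucket, 0) + value
```

Each event lands in exactly one bucket, which is what distinguishes tumbling windows from sliding windows, where windows overlap and an event can contribute to several aggregates.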
[language] => en
[duration] => 28
[status] => published
[changed] => 1700037319
[source_title] => Apache Flink Fundamentals
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => flink
)
[sparkpython] => stdClass Object
(
[course_code] => sparkpython
[hr_nid] => 279430
[title] => Python and Spark for Big Data (PySpark)
[requirements] =>
General programming skills
Audience
Developers
IT Professionals
Data Scientists
[overview] =>
Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.
In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
[outline] =>
Introduction
Understanding Big Data
Overview of Spark
Overview of Python
Overview of PySpark
Distributing Data Using Resilient Distributed Datasets Framework
Distributing Computation Using Spark API Operators
Setting Up Python with Spark
Setting Up PySpark
Using Amazon Web Services (AWS) EC2 Instances for Spark
Setting Up Databricks
Setting Up the AWS EMR Cluster
Learning the Basics of Python Programming
Getting Started with Python
Using the Jupyter Notebook
Using Variables and Simple Data Types
Working with Lists
Using if Statements
Using User Inputs
Working with while Loops
Implementing Functions
Working with Classes
Working with Files and Exceptions
Working with Projects, Data, and APIs
Learning the Basics of Spark DataFrame
Getting Started with Spark DataFrames
Implementing Basic Operations with Spark
Using Groupby and Aggregate Operations
Working with Timestamps and Dates
Working on a Spark DataFrame Project Exercise
Understanding Machine Learning with MLlib
Working with MLlib, Spark, and Python for Machine Learning
Understanding Regressions
Learning Linear Regression Theory
Implementing a Regression Evaluation Code
Working on a Sample Linear Regression Exercise
Learning Logistic Regression Theory
Implementing a Logistic Regression Code
Working on a Sample Logistic Regression Exercise
Understanding Random Forests and Decision Trees
Learning Tree Methods Theory
Implementing Decision Trees and Random Forest Codes
Working on a Sample Random Forest Classification Exercise
Working with K-means Clustering
Understanding K-means Clustering Theory
Implementing a K-means Clustering Code
Working on a Sample Clustering Exercise
Working with Recommender Systems
Implementing Natural Language Processing
Understanding Natural Language Processing (NLP)
Overview of NLP Tools
Working on a Sample NLP Exercise
Streaming with Spark on Python
Overview of Streaming with Spark
Sample Spark Streaming Exercise
Closing Remarks
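The linear-regression units in the outline above use Spark MLlib; the underlying least-squares fit for a single feature can be shown in a few lines of plain Python (an illustrative sketch with made-up data, not MLlib code):

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.1, 6.2, 8.1]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
```

MLlib computes the same quantities, but over a distributed DataFrame with potentially many features, which turns the closed-form division into a distributed linear-algebra or gradient-based solve.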
[language] => en
[duration] => 21
[status] => published
[changed] => 1715349940
[source_title] => Python and Spark for Big Data (PySpark)
[source_language] => en
[cert_code] =>
[weight] => -998
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => sparkpython
)
[graphcomputing] => stdClass Object
(
[course_code] => graphcomputing
[hr_nid] => 278402
[title] => Introduction to Graph Computing
[requirements] =>
An understanding of Java programming and frameworks
A general understanding of Python is helpful but not required
A general understanding of database concepts
Audience
Developers
[overview] =>
Many real world problems can be described in terms of graphs. For example, the Web graph, the social network graph, the train network graph and the language graph. These graphs tend to be extremely large; processing them requires a specialized set of tools and processes -- these tools and processes can be referred to as Graph Computing (also known as Graph Analytics).
In this instructor-led, live training, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics and Distributed Graph Processing) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
[outline] =>
Introduction
Graph databases and libraries
Understanding Graph Data
The graph as a data structure
Using vertices (dots) and edges (lines) to model real-world scenarios
Using Graph Databases to Model, Persist and Process Graph Data
Local graph algorithms/traversals
neo4j, OrientDB and Titan
Exercise: Modeling Graph Data with neo4j
Whiteboard data modeling
Beyond Graph Databases: Graph Computing
Understanding the property graph
Graph modeling different scenarios (software graph, discussion graph, concept graph)
Solving Real-World Problems with Traversals
Algorithmic/directed walk over the graph
Determining circular dependencies
Case Study: Ranking Discussion Contributors
Ranking by number and depth of contributed discussions
Leveraging Hadoop for storage (HDFS) and processing (MapReduce)
Overview of iterative algorithms
Hama, Giraph, and GraphLab
Graph Computing: Graph-Parallel Computation
Unifying ETL, exploratory analysis, and iterative graph computation within a single system
GraphX
Setup and Installation
Hadoop and Spark
GraphX Operators
Property, structural, join, neighborhood aggregation, caching and uncaching
Iterating with Pregel API
Passing arguments for sending, receiving and computing
Building a Graph
Using vertices and edges in an RDD or on disk
Designing Scalable Algorithms
GraphX Optimization
Accessing Additional Algorithms
PageRank, Connected Components, Triangle Counting
Exercise: PageRank and Top Users
Building and processing graph data using text files as input
Deploying to Production
Closing Remarks
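PageRank, listed among the built-in GraphX algorithms in the outline above, can be sketched as a simple power iteration in plain Python (a hypothetical three-node graph; GraphX runs the same computation distributed across a cluster via the Pregel API):

```python
# adjacency list: node -> list of outgoing neighbors
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

DAMPING = 0.85
ranks = {node: 1.0 / len(graph) for node in graph}

for _ in range(50):  # iterate until the ranks stabilize
    # every node keeps a baseline (1 - d) / N ...
    new_ranks = {node: (1 - DAMPING) / len(graph) for node in graph}
    # ... plus a damped share of the rank of each node linking to it
    for node, neighbors in graph.items():
        share = DAMPING * ranks[node] / len(neighbors)
        for nb in neighbors:
            new_ranks[nb] += share
    ranks = new_ranks
```

The per-edge "send a share of my rank to my neighbors" step is exactly the message-passing pattern that Pregel-style systems parallelize: each vertex sums incoming messages and updates its state each superstep.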
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349940
[source_title] => Introduction to Graph Computing
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => graphcomputing
)
[aitech] => stdClass Object
(
[course_code] => aitech
[hr_nid] => 199320
[title] => Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
[requirements] =>
[overview] =>
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
[category_overview] =>
[outline] =>
Distributing big data
Data mining methods (training single systems + distributed prediction: traditional machine learning algorithms + MapReduce distributed prediction)
Apache Spark MLlib
Recommendations and Advertising:
Natural language
Text clustering, text categorization (labeling), synonyms
User profile restore, labeling system
Recommended algorithms
Ensuring the accuracy of "lift" between and within categories
How to create closed loops for recommendation algorithms
Logistic regression, RankingSVM
Feature recognition (deep learning and automatic feature recognition for graphics)
Natural language
Chinese word segmentation
Theme model (text clustering)
Text classification
Extract keywords
Semantic analysis, semantic parser, word2vec (vector to word)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark
[category_overview] =>
[outline] =>
spark.mllib: data types, algorithms, and utilities
Data types
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
streaming significance testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
bisecting k-means
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
association rules
PrefixSpan
Evaluation metrics
PMML model export
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
spark.ml: high-level APIs for ML pipelines
Overview: estimators, transformers and pipelines
Extracting, transforming and selecting features
Classification and regression
Clustering
Advanced topics
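As an illustration of one of the algorithms listed above, here is a plain-Python sketch of Lloyd's k-means (the algorithm behind spark.mllib's KMeans); the 1-D points and initial centers are made-up examples, not MLlib API usage:

```python
# Plain-Python k-means sketch: alternate assignment and update steps.
# The 1-D data and starting centers are made-up illustration values.
def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            best = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[best].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]
final = sorted(kmeans(data, centers=[0.0, 10.0]))
# the centers converge near the two groups at ~1.0 and ~8.0
```

MLlib performs the same two steps, but distributes the assignment step across RDD partitions and aggregates the cluster sums.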
[language] => en
[duration] => 35
[status] => published
[changed] => 1700037209
[source_title] => Apache Spark MLlib
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => spmllib
)
)
[codes] => Array
(
[0] => tidyverse
[1] => datavault
[2] => sparkstreaming
[3] => ksql
[4] => apacheignite
[5] => beam
[6] => apex
[7] => storm
[8] => nifi
[9] => nifidev
[10] => flink
[11] => sparkpython
[12] => graphcomputing
[13] => aitech
[14] => spmllib
)
)
[4] =>
[5] => Array
(
[0] => 4
[1] => 5
)
[6] => Array
(
[282974] => Array
(
[title] => Programming with Big Data in R
[rating] => 4
[delegate_and_company] => Tim - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
[body] => The subject matter and the pace were perfect.
[mc] => The subject matter and the pace were perfect.
[is_mt] => 0
[nid] => 282974
)
[282922] => Array
(
[title] => Programming with Big Data in R
[rating] => 5
[delegate_and_company] => Xiaoyuan Geng - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
[body] => Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customize the training to meet clients' need. He is also very capable to solve technical and subject matter problems on the go. Fantastic and professional training!
[mc] => Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customize the training meeting clients' need. He is also very capable to solve technical and subject matter problems on the go. Fantastic and professional training!.
[is_mt] => 0
[nid] => 282922
)
)
[7] => 4.5
[8] =>
[9] => 1
[10] =>
)
)
)
NP URI: www.nobleprog.pt/en/cc/bigdatar?id=bigdatar-3516157-20210504 Cannot modify header information - headers already sent by (output started at /apps/nobleprog-website/_index.php:16) /apps/nobleprog-website/modules/course/course.php:119 Array
(
[3] => Array
(
[file] => /apps/nobleprog-website/modules/course/course.php
[line] => 31
[function] => course_render
[args] => Array
(
[0] => Array
(
[course_code] => bigdatar
[hr_nid] => 68924
[title] => Programming with Big Data in R
[requirements] =>
[overview] =>
Big Data is a term that refers to solutions destined for storing and processing large data sets. Developed by Google initially, these Big Data solutions have evolved and inspired other similar projects, many of which are available as open-source. R is a popular programming language in the financial industry.
[category_overview] =>
[outline] =>
Introduction to Programming Big Data with R (pbdR)
Setting up your environment to use pbdR
Scope and tools available in pbdR
Packages commonly used with Big Data alongside pbdR
Message Passing Interface (MPI)
Using pbdR MPI
Parallel processing
Point-to-point communication
Send Matrices
Summing Matrices
Collective communication
Summing Matrices with Reduce
Scatter / Gather
Other MPI communications
Distributed Matrices
Creating a distributed diagonal matrix
SVD of a distributed matrix
Building a distributed matrix in parallel
Statistics Applications
Monte Carlo Integration
Reading Datasets
Reading on all processes
Broadcasting from one process
Reading partitioned data
Distributed Regression
Distributed Bootstrap
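The Monte Carlo Integration topic above is classically introduced with the dartboard estimate of pi; here is a serial plain-Python sketch (assuming that standard example, not the course's exact code):

```python
import random

# Serial Monte Carlo estimate of pi: sample points in the unit square and
# count how many land inside the quarter circle. This is the computation
# pbdR parallelizes across MPI ranks.
def estimate_pi(samples, seed=0):
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

pi_hat = estimate_pi(100_000)  # close to 3.14 for large sample counts
```

In pbdR, each MPI rank would draw its own sample with a rank-specific seed, and the per-rank hit counts would be combined with a reduce operation before computing the final estimate.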
[language] => en
[duration] => 21
[status] => published
[changed] => 1700037139
[source_title] => Programming with Big Data in R
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
)
[1] => Array
(
[0] => stdClass Object
(
[tid] => 766
[alias] => big-data-training
[name] => Big Data
[english_name] => Big Data
[consulting_option] => available_promoted
)
[1] => stdClass Object
(
[tid] => 877
[alias] => r-language-training
[name] => R Language
[english_name] => R Language
[consulting_option] => available
)
)
[2] => bigdatar
[3] => Array
(
[outlines] => Array
(
[tidyverse] => stdClass Object
(
[course_code] => tidyverse
[hr_nid] => 212656
[title] => Introduction to Data Visualization with Tidyverse and R
[requirements] =>
No programming experience is necessary
[overview] =>
The Tidyverse is a collection of versatile R packages for cleaning, processing, modeling, and visualizing data. Some of the packages included are: ggplot2, dplyr, tidyr, readr, purrr, and tibble.
In this instructor-led, live training, participants will learn how to manipulate and visualize data using the tools included in the Tidyverse.
By the end of this training, participants will be able to:
Perform data analysis and create appealing visualizations
Draw useful conclusions from various datasets of sample data
Filter, sort and summarize data to answer exploratory questions
Turn processed data into informative line plots, bar plots, and histograms
Import and filter data from diverse data sources, including Excel, CSV, and SPSS files
Audience
Beginners to the R language
Beginners to data analysis and data visualization
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Introduction
Tidyverse vs traditional R plotting
Setting up your working environment
Preparing the dataset
Importing and filtering data
Wrangling the data
Visualizing the data (graphs, scatter plots)
Grouping and summarizing the data
Visualizing the data (line plots, bar plots, histograms, boxplots)
Working with non-standard data
Closing remarks
[language] => en
[duration] => 7
[status] => published
[changed] => 1700037359
[source_title] => Introduction to Data Visualization with Tidyverse and R
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => tidyverse
)
[datavault] => stdClass Object
(
[course_code] => datavault
[hr_nid] => 210132
[title] => Data Vault: Building a Scalable Data Warehouse
[requirements] =>
An understanding of data warehousing concepts
An understanding of database and data modeling concepts
Audience
Data modelers
Data warehousing specialists
Business Intelligence specialists
Data engineers
Database administrators
[overview] =>
Data Vault Modeling is a database modeling technique that provides long-term historical storage of data that originates from multiple sources. A data vault stores a single version of the facts, or "all the data, all the time". Its flexible, scalable, consistent and adaptable design encompasses the best aspects of 3rd normal form (3NF) and star schema.
In this instructor-led, live training, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
[outline] =>
Introduction
The shortcomings of existing data warehouse data modeling architectures
Benefits of Data Vault modeling
Overview of Data Vault architecture and design principles
SEI / CMM / Compliance
Data Vault applications
Dynamic Data Warehousing
Exploration Warehousing
In-Database Data Mining
Rapid Linking of External Information
Data Vault components
Hubs, Links, Satellites
Building a Data Vault
Modeling Hubs, Links and Satellites
Data Vault reference rules
How components interact with each other
Modeling and populating a Data Vault
Converting 3NF OLTP to a Data Vault Enterprise Data Warehouse (EDW)
Understanding load dates, end-dates, and join operations
Business keys, relationships, link tables and join techniques
Query techniques
Load processing and query processing
Overview of Matrix Methodology
Getting data into data entities
Loading Hub Entities
Loading Link Entities
Loading Satellites
Using SEI/CMM Level 5 templates to obtain repeatable, reliable, and quantifiable results
Developing a consistent and repeatable ETL (Extract, Transform, Load) process
Building and deploying highly scalable and repeatable warehouses
Closing remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349914
[source_title] => Data Vault: Building a Scalable Data Warehouse
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => datavault
)
[sparkstreaming] => stdClass Object
(
[course_code] => sparkstreaming
[hr_nid] => 356863
[title] => Spark Streaming with Python and Kafka
[requirements] =>
Experience with Python and Apache Kafka
Familiarity with stream-processing platforms
Audience
Data engineers
Data scientists
Programmers
[overview] =>
Apache Spark Streaming is a scalable, open source stream processing system that allows users to process real-time data from supported sources. Spark Streaming enables fault-tolerant processing of data streams.
This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
[outline] =>
Introduction
Overview of Spark Streaming Features and Architecture
Confluent KSQL is a stream processing framework built on top of Apache Kafka. It enables real-time data processing using SQL operations.
This instructor-led, live training (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Apache Ignite is an in-memory computing platform that sits between the application and data layer to improve speed, scale, and availability.
This instructor-led, live training (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
[outline] =>
Introduction
Overview of Big Data Tools and Technologies
Installing and Configuring Apache Ignite
Overview of Ignite Architecture
Querying Data in Ignite
Spreading Large Data Sets across a Cluster
Understanding the In-Memory Data Grid
Writing a Service in Ignite
Running Distributed Computing with Ignite
Integrating Ignite with RDBMS, NoSQL, Hadoop and Machine Learning Processors
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution being carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
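The programming model described above chains transforms over immutable collections; this plain-Python sketch mimics the shape of Beam's canonical WordCount (it illustrates the model only and does not use the apache_beam SDK):

```python
# Plain-Python mimic of a Beam-style WordCount pipeline: each stage takes a
# collection and produces a new one, the way Beam chains PTransforms.
def flat_map(fn, pcoll):
    # analogous to Beam's FlatMap: one input element -> many output elements
    return [out for item in pcoll for out in fn(item)]

def count_per_key(pcoll):
    # analogous to a GroupByKey followed by a count combiner
    counts = {}
    for key in pcoll:
        counts[key] = counts.get(key, 0) + 1
    return counts

lines = ["to be or not to be"]
words = flat_map(str.split, lines)   # "FlatMap(split)"
counts = count_per_key(words)        # "Count.PerElement()"
```

In Beam proper, the same stages are declared against a Pipeline object and a runner (Flink, Spark, Dataflow, etc.) decides how to partition and execute them.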
[category_overview] =>
[outline] =>
Introduction
Apache Beam vs MapReduce, Spark Streaming, Kafka Streaming, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
Beam Model, SDKs, Beam Pipeline Runners
Distributed processing back-ends
Understanding the Apache Beam Programming Model
How a pipeline is executed
Running a sample pipeline
Preparing a WordCount pipeline
Executing the Pipeline locally
Designing a Pipeline
Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
Writing the driver program and defining the pipeline
Using Apache Beam classes
Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
Executing the pipeline locally, on remote machines, and on a public cloud
Choosing a runner
Runner-specific configurations
Testing and Debugging Apache Beam
Using type hints to emulate static typing
Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Windowing and Triggers
Making Your Pipelines Reusable and Maintainable
Create New Data Sources and Sinks
Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
[outline] =>
To request a customized course outline for this training, please contact us.
Apache Storm is a distributed, real-time computation engine used for enabling real-time business intelligence. It does so by enabling applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real-time.
Some of the topics included in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Request a customized course outline for this training!
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Automate dataflows.
Enable streaming analytics.
Apply various approaches for data ingestion.
Transform Big Data into business insights.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc> (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
[outline] =>
Introduction
Data at rest vs data in motion
Overview of Big Data Tools and Technologies
Hadoop (HDFS and MapReduce) and Spark
Installing and Configuring NiFi
Overview of NiFi Architecture
Development Approaches
Application development tools and mindset
Extract, Transform, and Load (ETL) tools and mindset
Design Considerations
Components, Events, and Processor Patterns
Exercise: Streaming Data Feeds into HDFS
Error Handling
Controller Services
Exercise: Ingesting Data from IoT Devices using Web-Based APIs
Exercise: Developing a Custom Apache NiFi Processor using JSON
Apache Flink is an open-source framework for scalable stream and batch data processing.
This instructor-led, live training (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
[outline] =>
Introduction
Installing and Configuring Apache Flink
Overview of Flink Architecture
Developing Data Streaming Applications in Flink
Managing Diverse Workloads
Performing Advanced Analytics
Setting up a Multi-Node Flink Cluster
Mastering Flink DataStream API
Understanding Flink Libraries
Integrating Flink with Other Big Data Tools
Testing and Troubleshooting
Summary and Next Steps
[language] => en
[duration] => 28
[status] => published
[changed] => 1700037319
[source_title] => Apache Flink Fundamentals
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => flink
)
[sparkpython] => stdClass Object
(
[course_code] => sparkpython
[hr_nid] => 279430
[title] => Python and Spark for Big Data (PySpark)
[requirements] =>
General programming skills
Audience
Developers
IT Professionals
Data Scientists
[overview] =>
Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.
In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
[outline] =>
Introduction
Understanding Big Data
Overview of Spark
Overview of Python
Overview of PySpark
Distributing Data Using Resilient Distributed Datasets Framework
Distributing Computation Using Spark API Operators
Setting Up Python with Spark
Setting Up PySpark
Using Amazon Web Services (AWS) EC2 Instances for Spark
Setting Up Databricks
Setting Up the AWS EMR Cluster
Learning the Basics of Python Programming
Getting Started with Python
Using the Jupyter Notebook
Using Variables and Simple Data Types
Working with Lists
Using if Statements
Using User Inputs
Working with while Loops
Implementing Functions
Working with Classes
Working with Files and Exceptions
Working with Projects, Data, and APIs
Learning the Basics of Spark DataFrame
Getting Started with Spark DataFrames
Implementing Basic Operations with Spark
Using Groupby and Aggregate Operations
Working with Timestamps and Dates
Working on a Spark DataFrame Project Exercise
Understanding Machine Learning with MLlib
Working with MLlib, Spark, and Python for Machine Learning
Understanding Regressions
Learning Linear Regression Theory
Implementing a Regression Evaluation Code
Working on a Sample Linear Regression Exercise
Learning Logistic Regression Theory
Implementing a Logistic Regression Code
Working on a Sample Logistic Regression Exercise
Understanding Random Forests and Decision Trees
Learning Tree Methods Theory
Implementing Decision Trees and Random Forest Codes
Working on a Sample Random Forest Classification Exercise
Working with K-means Clustering
Understanding K-means Clustering Theory
Implementing a K-means Clustering Code
Working on a Sample Clustering Exercise
Working with Recommender Systems
Implementing Natural Language Processing
Understanding Natural Language Processing (NLP)
Overview of NLP Tools
Working on a Sample NLP Exercise
Streaming with Spark on Python
Overview of Streaming with Spark
Sample Spark Streaming Exercise
Closing Remarks
[language] => en
[duration] => 21
[status] => published
[changed] => 1715349940
[source_title] => Python and Spark for Big Data (PySpark)
[source_language] => en
[cert_code] =>
[weight] => -998
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => sparkpython
)
[graphcomputing] => stdClass Object
(
[course_code] => graphcomputing
[hr_nid] => 278402
[title] => Introduction to Graph Computing
[requirements] =>
An understanding of Java programming and frameworks
A general understanding of Python is helpful but not required
A general understanding of database concepts
Audience
Developers
[overview] =>
Many real world problems can be described in terms of graphs. For example, the Web graph, the social network graph, the train network graph and the language graph. These graphs tend to be extremely large; processing them requires a specialized set of tools and processes -- these tools and processes can be referred to as Graph Computing (also known as Graph Analytics).
In this instructor-led, live training, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics and Distributed Graph Processing) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks.)
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks.)
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
[outline] =>
Introduction
Graph databases and libraries
Understanding Graph Data
The graph as a data structure
Using vertices (dots) and edges (lines) to model real-world scenarios
Using Graph Databases to Model, Persist and Process Graph Data
Local graph algorithms/traversals
neo4j, OrientDB and Titan
Exercise: Modeling Graph Data with neo4j
Whiteboard data modeling
Beyond Graph Databases: Graph Computing
Understanding the property graph
Graph modeling different scenarios (software graph, discussion graph, concept graph)
Solving Real-World Problems with Traversals
Algorithmic/directed walk over the graph
Determining circular dependencies
Case Study: Ranking Discussion Contributors
Ranking by number and depth of contributed discussions
Leveraging Hadoop for storage (HDFS) and processing (MapReduce)
Overview of iterative algorithms
Hama, Giraph, and GraphLab
Graph Computing: Graph-Parallel Computation
Unifying ETL, exploratory analysis, and iterative graph computation within a single system
GraphX
Setup and Installation
Hadoop and Spark
GraphX Operators
Property, structural, join, neighborhood aggregation, caching and uncaching
Iterating with Pregel API
Passing arguments for sending, receiving and computing
Building a Graph
Using vertices and edges in an RDD or on disk
Designing Scalable Algorithms
GraphX Optimization
Accessing Additional Algorithms
PageRank, Connected Components, Triangle Counting
Exercise: PageRank and Top Users
Building and processing graph data using text files as input
Deploying to Production
Closing Remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349940
[source_title] => Introduction to Graph Computing
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => graphcomputing
)
[aitech] => stdClass Object
(
[course_code] => aitech
[hr_nid] => 199320
[title] => Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
[requirements] =>
[overview] =>
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
[category_overview] =>
[outline] =>
Distribution big data
Data mining methods (training single systems + distributed prediction: traditional machine learning algorithms + Mapreduce distributed prediction)
Apache Spark MLlib
Recommendations and Advertising:
Natural language
Text clustering, text categorization (labeling), synonyms
User profile restore, labeling system
Recommended algorithms
Ensuring the accuracy of "lift" between and within categories
How to create closed loops for recommendation algorithms
Logistic regression, RankingSVM
Feature recognition (deep learning and automatic feature recognition for graphics)
Natural language
Chinese word segmentation
Theme model (text clustering)
Text classification
Extract keywords
Semantic analysis, semantic parser, word2vec (word to vector)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark
[category_overview] =>
[outline] =>
spark.mllib: data types, algorithms, and utilities
Data types
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
streaming significance testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
bisecting k-means
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
association rules
PrefixSpan
Evaluation metrics
PMML model export
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
spark.ml: high-level APIs for ML pipelines
Overview: estimators, transformers and pipelines
Extracting, transforming and selecting features
Classification and regression
Clustering
Advanced topics
[language] => en
[duration] => 35
[status] => published
[changed] => 1700037209
[source_title] => Apache Spark MLlib
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => spmllib
)
)
[codes] => Array
(
[0] => tidyverse
[1] => datavault
[2] => sparkstreaming
[3] => ksql
[4] => apacheignite
[5] => beam
[6] => apex
[7] => storm
[8] => nifi
[9] => nifidev
[10] => flink
[11] => sparkpython
[12] => graphcomputing
[13] => aitech
[14] => spmllib
)
)
[4] =>
[5] => Array
(
[0] => 4
[1] => 5
)
[6] => Array
(
[282974] => Array
(
[title] => Programming with Big Data in R
[rating] => 4
[delegate_and_company] => Tim - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
[body] => The subject matter and the pace were perfect.
[mc] => The subject matter and the pace were perfect.
[is_mt] => 0
[nid] => 282974
)
[282922] => Array
(
[title] => Programming with Big Data in R
[rating] => 5
[delegate_and_company] => Xiaoyuan Geng - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
[body] => Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customize the training to meet clients' need. He is also very capable to solve technical and subject matter problems on the go. Fantastic and professional training!
[mc] => Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customize the training meeting clients' need. He is also very capable to solve technical and subject matter problems on the go. Fantastic and professional training!.
[is_mt] => 0
[nid] => 282922
)
)
[7] => 4.5
[8] =>
[9] => 1
[10] =>
)
)
)
Programming with Big Data in R Training Course
Big Data is a term that refers to solutions designed for storing and processing large data sets. Initially developed by Google, these Big Data solutions have evolved and inspired other similar projects, many of which are available as open source. R is a popular programming language in the financial industry.
Course Outline
Introduction to Programming Big Data with R (pbdR)
Setting up your environment to use pbdR
Scope and tools available in pbdR
Packages commonly used with Big Data alongside pbdR
Message Passing Interface (MPI)
Using pbdR MPI
Parallel processing
Point-to-point communication
Send Matrices
Summing Matrices
Collective communication
Summing Matrices with Reduce
Scatter / Gather
Other MPI communications
Distributed Matrices
Creating a distributed diagonal matrix
SVD of a distributed matrix
Building a distributed matrix in parallel
Statistics Applications
Monte Carlo Integration
Reading Datasets
Reading on all processes
Broadcasting from one process
Reading partitioned data
Distributed Regression
Distributed Bootstrap
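The Monte Carlo Integration topic in the outline above follows a classic pattern: each process draws its own random samples, and a reduce operation sums the partial results. The sketch below is a plain-Python, single-process analogue of what pbdR would run as real MPI ranks; the function names are invented for illustration.

```python
import random

def local_hits(rank, n):
    """Work one 'rank' would do: count samples inside the quarter circle."""
    rng = random.Random(rank)      # per-rank seed, as each MPI rank would use
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(n_per_rank=50_000, ranks=4):
    # In pbdR each rank computes its share and an MPI reduce sums them;
    # here the "ranks" simply run one after another.
    total = sum(local_hits(r, n_per_rank) for r in range(ranks))
    return 4.0 * total / (n_per_rank * ranks)

print(round(estimate_pi(), 2))  # close to 3.14
```

With real MPI, only the reduce step changes: each rank calls `local_hits` once and the partial counts are summed across the communicator.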
21 Hours
Testimonials (2)
The subject matter and the pace were perfect.
Tim - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
Course - Programming with Big Data in R
Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customizes the training to meet clients' needs. He is also very capable of solving technical and subject matter problems on the go. Fantastic and professional training!
Xiaoyuan Geng - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
The Tidyverse is a collection of versatile R packages for cleaning, processing, modeling, and visualizing data. Some of the packages included are: ggplot2, dplyr, tidyr, readr, purrr, and tibble.
In this instructor-led, live training, participants will learn how to manipulate and visualize data using the tools included in the Tidyverse.
By the end of this training, participants will be able to:
Perform data analysis and create appealing visualizations
Draw useful conclusions from various sample datasets
Filter, sort and summarize data to answer exploratory questions
Turn processed data into informative line plots, bar plots, histograms
Import and filter data from diverse data sources, including Excel, CSV, and SPSS files
Audience
Beginners to the R language
Beginners to data analysis and data visualization
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
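The filter / sort / summarize workflow listed above is taught with dplyr in the course itself; as a rough analogue, the same verbs (filter(), arrange(), group_by() + summarise()) can be sketched with Python's standard library, using invented sample data.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"city": "Lisbon", "year": 2020, "sales": 120},
    {"city": "Porto",  "year": 2020, "sales": 80},
    {"city": "Lisbon", "year": 2021, "sales": 150},
    {"city": "Porto",  "year": 2021, "sales": 95},
]

# filter(sales > 90)  ->  arrange(city)  ->  group_by(city) + summarise(total)
kept = [r for r in rows if r["sales"] > 90]
kept.sort(key=itemgetter("city"))          # groupby needs sorted input
totals = {city: sum(r["sales"] for r in grp)
          for city, grp in groupby(kept, key=itemgetter("city"))}
print(totals)  # {'Lisbon': 270, 'Porto': 95}
```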
This instructor-led, live training in Portugal (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
This instructor-led, live training in Portugal (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
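To show the flavor of a pipeline built "using only SQL commands", here is the same kind of filter-plus-aggregate statement run once over a static table with Python's sqlite3 module. KSQL would run an equivalent query continuously over a Kafka topic; the table and column names here are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pageviews (user TEXT, page TEXT, ms INTEGER)")
con.executemany("INSERT INTO pageviews VALUES (?, ?, ?)",
                [("a", "/home", 30), ("b", "/home", 120), ("a", "/docs", 200)])

# Filtering and aggregation expressed purely in SQL, as a KSQL query would be
slow = dict(con.execute(
    "SELECT page, COUNT(*) FROM pageviews WHERE ms > 100 GROUP BY page"
).fetchall())
print(slow)
```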
This instructor-led, live training in Portugal (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
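The contrast drawn above between a pure in-memory store and write-through on-disk persistence can be sketched in a few lines. This toy class is not Ignite's API, just the two storage modes side by side.

```python
import json
import os
import tempfile

class MiniCache:
    """Toy contrast of in-memory vs. persistent storage -- not Ignite's API."""
    def __init__(self, persist_path=None):
        self.mem = {}                            # pure in-memory store
        self.persist_path = persist_path
        if persist_path and os.path.exists(persist_path):
            with open(persist_path) as f:        # warm start from disk
                self.mem.update(json.load(f))

    def put(self, key, value):
        self.mem[key] = value
        if self.persist_path:                    # write-through persistence
            with open(self.persist_path, "w") as f:
                json.dump(self.mem, f)

    def get(self, key):
        return self.mem.get(key)

path = os.path.join(tempfile.mkdtemp(), "cache.json")
MiniCache(path).put("answer", 42)
restarted = MiniCache(path)      # a new instance survives the "restart"
print(restarted.get("answer"))   # 42
```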
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution being carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
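Beam's central idea, one pipeline definition executed over either a bounded (batch) or unbounded (streaming) source, can be illustrated without the SDK. This is a conceptual sketch only, not Beam's API.

```python
def pipeline(source):
    """Same transforms regardless of how the source is produced."""
    for record in source:
        word = record.strip().lower()
        if word:                      # drop empties (a 'Filter' transform)
            yield (word, len(word))   # a 'Map' transform

batch = ["Hello", "  ", "Beam"]            # bounded (batch) source
stream = (w for w in ["one", "two"])       # unbounded-style generator

batch_out = list(pipeline(batch))
stream_out = list(pipeline(stream))
print(batch_out)   # [('hello', 5), ('beam', 4)]
print(stream_out)  # [('one', 3), ('two', 3)]
```

In real Beam the same decoupling holds: the pipeline is declared once, and the chosen runner decides how to execute it over the source.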
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
Apache Storm is a distributed, real-time computation engine used for real-time business intelligence. It enables applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real-time.
Topics covered in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
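The spout-and-bolt flow behind a Storm topology can be reduced to chained generators: a spout emits tuples, and each bolt transforms the stream as tuples arrive. This is a conceptual sketch only; real Storm distributes each stage across a cluster and the spout's stream never ends.

```python
from collections import Counter

def sentence_spout(sentences):
    for s in sentences:               # in Storm this stream is unbounded
        yield s

def split_bolt(stream):
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    counts = Counter()
    for word in stream:               # processes each tuple as it arrives
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout(["to be", "or not to be"])))
print(counts)  # to: 2, be: 2, or: 1, not: 1
```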
In this instructor-led, live training in Portugal (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
In this instructor-led, live training in Portugal, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processors.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
This instructor-led, live training in Portugal (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
In this instructor-led, live training in Portugal, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
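The kind of RDD chain such a PySpark exercise builds, roughly flatMap into words, map to pairs, then reduceByKey, can be mimicked in plain Python to make the shape of the computation visible. The sample lines are invented; this is not the PySpark API itself.

```python
from functools import reduce

lines = ["big data with spark", "spark and python"]

words = [w for line in lines for w in line.split()]        # flatMap
pairs = [(w, 1) for w in words]                            # map to (key, 1)
counts = reduce(                                           # reduceByKey
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs, {})
print(counts["spark"])  # 2
```

In PySpark the same three steps run in parallel across partitions, with the reduce step shuffling values that share a key to the same worker.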
In this instructor-led, live training in Portugal, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
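Pregel-style graph computing, mentioned above, proceeds in synchronized supersteps: vertices update their state from incoming messages and message their neighbours, until no messages remain. A minimal single-machine sketch, computing hop distances from a source vertex (the helper name is invented):

```python
def pregel_distances(edges, source):
    graph = {}
    for u, v in edges:                    # build an undirected adjacency list
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, []).append(u)
    dist = {v: float("inf") for v in graph}
    messages = {source: 0}
    while messages:                       # one superstep per iteration
        outgoing = {}
        for vertex, d in messages.items():
            if d < dist[vertex]:          # vertex updates its own state...
                dist[vertex] = d
                for nbr in graph[vertex]: # ...and messages its neighbours
                    outgoing[nbr] = min(outgoing.get(nbr, float("inf")), d + 1)
        messages = outgoing               # halt when no messages are in flight
    return dist

dists = pregel_distances([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")], "A")
print(dists)  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

Frameworks like GraphX's Pregel API run exactly this loop, but with vertices partitioned across machines and messages exchanged over the network between supersteps.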
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark.
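Of the algorithms MLlib provides, k-means is compact enough to sketch whole. The toy one-dimensional version below shows the assign/update iteration that MLlib distributes across a cluster; it is not MLlib's API, and the sample data is invented.

```python
import random

def kmeans_1d(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]   # update step
    return sorted(centers)

centers = kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.2, 9.8], k=2)
print(centers)  # approximately [1.0, 10.0]
```

MLlib parallelizes the assignment step across partitions of the data and aggregates the per-partition sums to update the centers.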