Big Data is a term that refers to solutions designed for storing and processing large data sets. Initially developed by Google, these Big Data solutions have evolved and inspired other similar projects, many of which are available as open source. R is a popular programming language in the financial industry.
[category_overview] =>
[outline] =>
Introduction to Programming Big Data with R (pbdR)
Setting up your environment to use pbdR
Scope and tools available in pbdR
Packages commonly used with Big Data alongside pbdR
Message Passing Interface (MPI)
Using pbdR MPI
Parallel processing
Point-to-point communication
Send Matrices
Summing Matrices
Collective communication
Summing Matrices with Reduce
Scatter / Gather
Other MPI communications
Distributed Matrices
Creating a distributed diagonal matrix
SVD of a distributed matrix
Building a distributed matrix in parallel
Statistics Applications
Monte Carlo Integration
Reading Datasets
Reading on all processes
Broadcasting from one process
Reading partitioned data
Distributed Regression
Distributed Bootstrap
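The Monte Carlo Integration topic in the outline above is taught in R with pbdR, but the core idea can be sketched in plain Python (an illustrative serial version; in pbdR each MPI rank would draw its own sample and the partial counts would be combined with a reduce):

```python
import random

def monte_carlo_pi(n_samples, seed=0):
    """Estimate pi by sampling points in the unit square and
    counting the fraction that falls inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

estimate = monte_carlo_pi(100_000)
```

To distribute this, each process runs the loop on `n_samples / n_ranks` draws and only the `inside` counters are summed across ranks, which is exactly the reduce pattern listed under "Collective communication" above.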
[language] => en
[duration] => 21
[status] => published
[changed] => 1700037139
[source_title] => Programming with Big Data in R
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
)
[1] => Array
(
[0] => stdClass Object
(
[tid] => 766
[alias] => big-data-training
[name] => Big Data
[english_name] => Big Data
[consulting_option] => available_promoted
)
[1] => stdClass Object
(
[tid] => 877
[alias] => r-language-training
[name] => R Language
[english_name] => R Language
[consulting_option] => available
)
)
[2] => bigdatar
[3] => Array
(
[outlines] => Array
(
[tidyverse] => stdClass Object
(
[course_code] => tidyverse
[hr_nid] => 212656
[title] => Introduction to Data Visualization with Tidyverse and R
[requirements] =>
No programming experience is necessary
[overview] =>
The Tidyverse is a collection of versatile R packages for cleaning, processing, modeling, and visualizing data. Some of the packages included are: ggplot2, dplyr, tidyr, readr, purrr, and tibble.
In this instructor-led, live training, participants will learn how to manipulate and visualize data using the tools included in the Tidyverse.
By the end of this training, participants will be able to:
Perform data analysis and create appealing visualizations
Draw useful conclusions from various sample datasets
Filter, sort and summarize data to answer exploratory questions
Turn processed data into informative line plots, bar plots, histograms
Import and filter data from diverse data sources, including Excel, CSV, and SPSS files
Audience
Beginners to the R language
Beginners to data analysis and data visualization
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Introduction
Tidyverse vs traditional R plotting
Setting up your working environment
Preparing the dataset
Importing and filtering data
Wrangling the data
Visualizing the data (graphs, scatter plots)
Grouping and summarizing the data
Visualizing the data (line plots, bar plots, histograms, boxplots)
Working with non-standard data
Closing remarks
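The course itself works in R with dplyr and ggplot2; as a language-neutral illustration of the filter → group/summarize → sort workflow named in the outline, here is a plain-Python sketch over a made-up dataset (the data and column names are hypothetical):

```python
# Hypothetical rows standing in for an imported CSV.
sales = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 75},
    {"region": "north", "amount": 200},
    {"region": "south", "amount": 50},
]

# filter: keep rows with amount >= 75 (like dplyr's filter())
filtered = [row for row in sales if row["amount"] >= 75]

# group + summarize: total amount per region (like group_by() + summarize())
totals = {}
for row in filtered:
    totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]

# sort regions by total, descending (like arrange(desc(...)))
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The same three steps map one-to-one onto a dplyr pipe; the resulting summary table is then what gets handed to a plotting layer.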
[language] => en
[duration] => 7
[status] => published
[changed] => 1700037359
[source_title] => Introduction to Data Visualization with Tidyverse and R
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => tidyverse
)
[datavault] => stdClass Object
(
[course_code] => datavault
[hr_nid] => 210132
[title] => Data Vault: Building a Scalable Data Warehouse
[requirements] =>
An understanding of data warehousing concepts
An understanding of database and data modeling concepts
Audience
Data modelers
Data warehousing specialists
Business Intelligence specialists
Data engineers
Database administrators
[overview] =>
Data Vault Modeling is a database modeling technique that provides long-term historical storage of data that originates from multiple sources. A data vault stores a single version of the facts, or "all the data, all the time". Its flexible, scalable, consistent and adaptable design encompasses the best aspects of 3rd normal form (3NF) and star schema.
In this instructor-led, live training, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
[outline] =>
Introduction
The shortcomings of existing data warehouse data modeling architectures
Benefits of Data Vault modeling
Overview of Data Vault architecture and design principles
SEI / CMM / Compliance
Data Vault applications
Dynamic Data Warehousing
Exploration Warehousing
In-Database Data Mining
Rapid Linking of External Information
Data Vault components
Hubs, Links, Satellites
Building a Data Vault
Modeling Hubs, Links and Satellites
Data Vault reference rules
How components interact with each other
Modeling and populating a Data Vault
Converting 3NF OLTP to a Data Vault Enterprise Data Warehouse (EDW)
Understanding load dates, end-dates, and join operations
Business keys, relationships, link tables and join techniques
Query techniques
Load processing and query processing
Overview of Matrix Methodology
Getting data into data entities
Loading Hub Entities
Loading Link Entities
Loading Satellites
Using SEI/CMM Level 5 templates to obtain repeatable, reliable, and quantifiable results
Developing a consistent and repeatable ETL (Extract, Transform, Load) process
Building and deploying highly scalable and repeatable warehouses
Closing remarks
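The Hub/Link/Satellite split described in the outline above can be illustrated with a minimal sketch (hypothetical Python dataclasses, not part of the course materials): a hub holds only a business key, a link records a relationship between hubs, and a satellite attaches descriptive, load-dated attributes so history is preserved rather than overwritten.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hub:
    # A hub stores only the business key (plus load metadata in practice).
    business_key: str

@dataclass(frozen=True)
class Link:
    # A link relates two or more hubs.
    hubs: tuple

@dataclass
class Satellite:
    # A satellite holds descriptive attributes stamped with a load date,
    # so each change adds a row instead of updating one.
    parent: Hub
    load_date: str
    attributes: dict

customer = Hub("CUST-001")
order = Hub("ORD-9001")
placed = Link((customer, order))
details = Satellite(customer, "2024-01-01", {"name": "Acme Ltd"})
```

Keeping keys, relationships, and attributes in separate structures is what makes the model auditable and adaptable: new sources add satellites without touching existing hubs or links.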
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349914
[source_title] => Data Vault: Building a Scalable Data Warehouse
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => datavault
)
[sparkstreaming] => stdClass Object
(
[course_code] => sparkstreaming
[hr_nid] => 356863
[title] => Spark Streaming with Python and Kafka
[requirements] =>
Experience with Python and Apache Kafka
Familiarity with stream-processing platforms
Audience
Data engineers
Data scientists
Programmers
[overview] =>
Apache Spark Streaming is a scalable, open source stream processing system that allows users to process real-time data from supported sources. Spark Streaming enables fault-tolerant processing of data streams.
This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
[outline] =>
Introduction
Overview of Spark Streaming Features and Architecture
Confluent KSQL is a stream processing framework built on top of Apache Kafka. It enables real-time data processing using SQL operations.
This instructor-led, live training (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Apache Ignite is an in-memory computing platform that sits between the application and data layer to improve speed, scale, and availability.
This instructor-led, live training (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
[outline] =>
Introduction
Overview of Big Data Tools and Technologies
Installing and Configuring Apache Ignite
Overview of Ignite Architecture
Querying Data in Ignite
Spreading Large Data Sets across a Cluster
Understanding the In-Memory Data Grid
Writing a Service in Ignite
Running Distributed Computing with Ignite
Integrating Ignite with RDBMS, NoSQL, Hadoop and Machine Learning Processors
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
[category_overview] =>
[outline] =>
Introduction
Apache Beam vs MapReduce, Spark Streaming, Kafka Streaming, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
Beam Model, SDKs, Beam Pipeline Runners
Distributed processing back-ends
Understanding the Apache Beam Programming Model
How a pipeline is executed
Running a sample pipeline
Preparing a WordCount pipeline
Executing the Pipeline locally
Designing a Pipeline
Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
Writing the driver program and defining the pipeline
Using Apache Beam classes
Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
Executing the pipeline locally, on remote machines, and on a public cloud
Choosing a runner
Runner-specific configurations
Testing and Debugging Apache Beam
Using type hints to emulate static typing
Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Windowing and Triggers
Making Your Pipelines Reusable and Maintainable
Create New Data Sources and Sinks
Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
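The WordCount pipeline referenced in the outline above can be approximated without Beam as a flat-map followed by a per-key count (a plain-Python sketch of the dataflow, not Beam SDK code; the input lines are made up):

```python
from collections import Counter

lines = ["to be or not to be", "to code is to think"]

# ParDo-style step: split each line into words (a flat map)
words = [word for line in lines for word in line.split()]

# Combine-style step: count occurrences per key
counts = Counter(words)
```

In Beam the same two steps become transforms over a PCollection, and the runner (Flink, Spark, Dataflow, …) decides how to shard the collection and shuffle the per-key counts across workers.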
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
[outline] =>
To request a customized course outline for this training, please contact us.
Apache Storm is a distributed, real-time computation engine used for enabling real-time business intelligence. It does so by enabling applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real-time.
Some of the topics included in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Request a customized course outline for this training!
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Automate dataflows.
Enable streaming analytics.
Apply various approaches for data ingestion.
Transform Big Data into business insights.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc> (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processors.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processors.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
[outline] =>
Introduction
Data at rest vs data in motion
Overview of Big Data Tools and Technologies
Hadoop (HDFS and MapReduce) and Spark
Installing and Configuring NiFi
Overview of NiFi Architecture
Development Approaches
Application development tools and mindset
Extract, Transform, and Load (ETL) tools and mindset
Design Considerations
Components, Events, and Processor Patterns
Exercise: Streaming Data Feeds into HDFS
Error Handling
Controller Services
Exercise: Ingesting Data from IoT Devices using Web-Based APIs
Exercise: Developing a Custom Apache NiFi Processor using JSON
Apache Flink is an open-source framework for scalable stream and batch data processing.
This instructor-led, live training (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
[outline] =>
Introduction
Installing and Configuring Apache Flink
Overview of Flink Architecture
Developing Data Streaming Applications in Flink
Managing Diverse Workloads
Performing Advanced Analytics
Setting up a Multi-Node Flink Cluster
Mastering Flink DataStream API
Understanding Flink Libraries
Integrating Flink with Other Big Data Tools
Testing and Troubleshooting
Summary and Next Steps
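Windowing, a core idea in both the Flink and Spark Streaming material above, can be sketched in plain Python: timestamped events are bucketed into fixed-size (tumbling) windows and aggregated per bucket. This is illustrative only; the toy events are invented, and real engines additionally handle out-of-order data, watermarks, and fault tolerance.

```python
# (timestamp_seconds, value) pairs standing in for a stream
events = [(1, 10), (3, 20), (7, 5), (8, 15), (12, 30)]

WINDOW = 5  # tumbling window size in seconds

# Sum the values that fall into each 5-second window.
windows = {}
for ts, value in events:
    bucket = (ts // WINDOW) * WINDOW  # start time of the window
    windows[bucket] = windows.get(bucket, 0) + value
```

Each event lands in exactly one bucket, which is what distinguishes tumbling windows from sliding windows, where windows overlap and an event can contribute to several aggregates.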
[language] => en
[duration] => 28
[status] => published
[changed] => 1700037319
[source_title] => Apache Flink Fundamentals
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => flink
)
[sparkpython] => stdClass Object
(
[course_code] => sparkpython
[hr_nid] => 279430
[title] => Python and Spark for Big Data (PySpark)
[requirements] =>
General programming skills
Audience
Developers
IT Professionals
Data Scientists
[overview] =>
Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.
In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
[outline] =>
Introduction
Understanding Big Data
Overview of Spark
Overview of Python
Overview of PySpark
Distributing Data Using Resilient Distributed Datasets Framework
Distributing Computation Using Spark API Operators
Setting Up Python with Spark
Setting Up PySpark
Using Amazon Web Services (AWS) EC2 Instances for Spark
Setting Up Databricks
Setting Up the AWS EMR Cluster
Learning the Basics of Python Programming
Getting Started with Python
Using the Jupyter Notebook
Using Variables and Simple Data Types
Working with Lists
Using if Statements
Using User Inputs
Working with while Loops
Implementing Functions
Working with Classes
Working with Files and Exceptions
Working with Projects, Data, and APIs
Learning the Basics of Spark DataFrame
Getting Started with Spark DataFrames
Implementing Basic Operations with Spark
Using Groupby and Aggregate Operations
Working with Timestamps and Dates
Working on a Spark DataFrame Project Exercise
Understanding Machine Learning with MLlib
Working with MLlib, Spark, and Python for Machine Learning
Understanding Regressions
Learning Linear Regression Theory
Implementing a Regression Evaluation Code
Working on a Sample Linear Regression Exercise
Learning Logistic Regression Theory
Implementing a Logistic Regression Code
Working on a Sample Logistic Regression Exercise
Understanding Random Forests and Decision Trees
Learning Tree Methods Theory
Implementing Decision Trees and Random Forest Codes
Working on a Sample Random Forest Classification Exercise
Working with K-means Clustering
Understanding K-means Clustering Theory
Implementing a K-means Clustering Code
Working on a Sample Clustering Exercise
Working with Recommender Systems
Implementing Natural Language Processing
Understanding Natural Language Processing (NLP)
Overview of NLP Tools
Working on a Sample NLP Exercise
Streaming with Spark on Python
Overview of Streaming with Spark
Sample Spark Streaming Exercise
Closing Remarks
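The linear-regression units in the outline above use Spark MLlib; the underlying least-squares fit for a single feature can be shown in a few lines of plain Python (an illustrative sketch with made-up data, not MLlib code):

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.1, 6.2, 8.1]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
```

MLlib computes the same quantities, but over a distributed DataFrame with potentially many features, which turns the closed-form division into a distributed linear-algebra or gradient-based solve.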
[language] => en
[duration] => 21
[status] => published
[changed] => 1715349940
[source_title] => Python and Spark for Big Data (PySpark)
[source_language] => en
[cert_code] =>
[weight] => -998
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => sparkpython
)
[graphcomputing] => stdClass Object
(
[course_code] => graphcomputing
[hr_nid] => 278402
[title] => Introduction to Graph Computing
[requirements] =>
An understanding of Java programming and frameworks
A general understanding of Python is helpful but not required
A general understanding of database concepts
Audience
Developers
[overview] =>
Many real world problems can be described in terms of graphs. For example, the Web graph, the social network graph, the train network graph and the language graph. These graphs tend to be extremely large; processing them requires a specialized set of tools and processes -- these tools and processes can be referred to as Graph Computing (also known as Graph Analytics).
In this instructor-led, live training, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics and Distributed Graph Processing) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
[outline] =>
Introduction
Graph databases and libraries
Understanding Graph Data
The graph as a data structure
Using vertices (dots) and edges (lines) to model real-world scenarios
Using Graph Databases to Model, Persist and Process Graph Data
Local graph algorithms/traversals
neo4j, OrientDB and Titan
Exercise: Modeling Graph Data with neo4j
Whiteboard data modeling
Beyond Graph Databases: Graph Computing
Understanding the property graph
Graph modeling different scenarios (software graph, discussion graph, concept graph)
Solving Real-World Problems with Traversals
Algorithmic/directed walk over the graph
Determining circular dependencies
Case Study: Ranking Discussion Contributors
Ranking by number and depth of contributed discussions
Leveraging Hadoop for storage (HDFS) and processing (MapReduce)
Overview of iterative algorithms
Hama, Giraph, and GraphLab
Graph Computing: Graph-Parallel Computation
Unifying ETL, exploratory analysis, and iterative graph computation within a single system
GraphX
Setup and Installation
Hadoop and Spark
GraphX Operators
Property, structural, join, neighborhood aggregation, caching and uncaching
Iterating with Pregel API
Passing arguments for sending, receiving and computing
Building a Graph
Using vertices and edges in an RDD or on disk
Designing Scalable Algorithms
GraphX Optimization
Accessing Additional Algorithms
PageRank, Connected Components, Triangle Counting
Exercise: PageRank and Top Users
Building and processing graph data using text files as input
Deploying to Production
Closing Remarks
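PageRank, listed among the built-in GraphX algorithms in the outline above, can be sketched as a simple power iteration in plain Python (a hypothetical three-node graph; GraphX runs the same computation distributed across a cluster via the Pregel API):

```python
# adjacency list: node -> list of outgoing neighbors
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

DAMPING = 0.85
ranks = {node: 1.0 / len(graph) for node in graph}

for _ in range(50):  # iterate until the ranks stabilize
    # every node keeps a baseline (1 - d) / N ...
    new_ranks = {node: (1 - DAMPING) / len(graph) for node in graph}
    # ... plus a damped share of the rank of each node linking to it
    for node, neighbors in graph.items():
        share = DAMPING * ranks[node] / len(neighbors)
        for nb in neighbors:
            new_ranks[nb] += share
    ranks = new_ranks
```

The per-edge "send a share of my rank to my neighbors" step is exactly the message-passing pattern that Pregel-style systems parallelize: each vertex sums incoming messages and updates its state each superstep.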
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349940
[source_title] => Introduction to Graph Computing
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => graphcomputing
)
[aitech] => stdClass Object
(
[course_code] => aitech
[hr_nid] => 199320
[title] => Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
[requirements] =>
[overview] =>
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
[category_overview] =>
[outline] =>
Distributing big data
Data mining methods (training single systems + distributed prediction: traditional machine learning algorithms + MapReduce distributed prediction)
Apache Spark MLlib
Recommendations and Advertising:
Natural language
Text clustering, text categorization (labeling), synonyms
User profile restore, labeling system
Recommended algorithms
Ensuring the accuracy of "lift" between and within categories
How to create closed loops for recommendation algorithms
Logistic regression, RankingSVM
Feature recognition (deep learning and automatic feature recognition for graphics)
Natural language
Chinese word segmentation
Theme model (text clustering)
Text classification
Extract keywords
Semantic analysis, semantic parser, word2vec (vector to word)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark
[category_overview] =>
[outline] =>
spark.mllib: data types, algorithms, and utilities
Data types
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
streaming significance testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
bisecting k-means
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
association rules
PrefixSpan
Evaluation metrics
PMML model export
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
spark.ml: high-level APIs for ML pipelines
Overview: estimators, transformers and pipelines
Extracting, transforming and selecting features
Classification and regression
Clustering
Advanced topics
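As an illustration of one of the algorithms listed above, here is a plain-Python sketch of Lloyd's k-means (the algorithm behind spark.mllib's KMeans); the 1-D points and initial centers are made-up examples, not MLlib API usage:

```python
# Plain-Python k-means sketch: alternate assignment and update steps.
# The 1-D data and starting centers are made-up illustration values.
def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            best = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[best].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]
final = sorted(kmeans(data, centers=[0.0, 10.0]))
# the centers converge near the two groups at ~1.0 and ~8.0
```

MLlib performs the same two steps, but distributes the assignment step across RDD partitions and aggregates the cluster sums.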
[language] => en
[duration] => 35
[status] => published
[changed] => 1700037209
[source_title] => Apache Spark MLlib
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => spmllib
)
)
[codes] => Array
(
[0] => tidyverse
[1] => datavault
[2] => sparkstreaming
[3] => ksql
[4] => apacheignite
[5] => beam
[6] => apex
[7] => storm
[8] => nifi
[9] => nifidev
[10] => flink
[11] => sparkpython
[12] => graphcomputing
[13] => aitech
[14] => spmllib
)
)
[4] =>
[5] => Array
(
[0] => 4
[1] => 5
)
[6] => Array
(
[282974] => Array
(
[title] => Programming with Big Data in R
[rating] => 4
[delegate_and_company] => Tim - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
[body] => The subject matter and the pace were perfect.
[mc] => The subject matter and the pace were perfect.
[is_mt] => 0
[nid] => 282974
)
[282922] => Array
(
[title] => Programming with Big Data in R
[rating] => 5
[delegate_and_company] => Xiaoyuan Geng - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
[body] => Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customize the training to meet clients' need. He is also very capable to solve technical and subject matter problems on the go. Fantastic and professional training!
[mc] => Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customize the training meeting clients' need. He is also very capable to solve technical and subject matter problems on the go. Fantastic and professional training!.
[is_mt] => 0
[nid] => 282922
)
)
[7] => 4.5
[8] =>
[9] => 1
[10] =>
)
)
)
NP URI: www.nobleprog.pt/en/cc/bigdatar?id=bigdatar-3516157-20210504 Cannot modify header information - headers already sent by (output started at /apps/nobleprog-website/_index.php:16) /apps/nobleprog-website/modules/course/course.php:119 Array
(
[3] => Array
(
[file] => /apps/nobleprog-website/modules/course/course.php
[line] => 31
[function] => course_render
[args] => Array
(
[0] => Array
(
[course_code] => bigdatar
[hr_nid] => 68924
[title] => Programming with Big Data in R
[requirements] =>
[overview] =>
Big Data is a term that refers to solutions destined for storing and processing large data sets. Developed by Google initially, these Big Data solutions have evolved and inspired other similar projects, many of which are available as open-source. R is a popular programming language in the financial industry.
[category_overview] =>
[outline] =>
Introduction to Programming Big Data with R (pbdR)
Setting up your environment to use pbdR
Scope and tools available in pbdR
Packages commonly used with Big Data alongside pbdR
Message Passing Interface (MPI)
Using pbdR MPI
Parallel processing
Point-to-point communication
Send Matrices
Summing Matrices
Collective communication
Summing Matrices with Reduce
Scatter / Gather
Other MPI communications
Distributed Matrices
Creating a distributed diagonal matrix
SVD of a distributed matrix
Building a distributed matrix in parallel
Statistics Applications
Monte Carlo Integration
Reading Datasets
Reading on all processes
Broadcasting from one process
Reading partitioned data
Distributed Regression
Distributed Bootstrap
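The Monte Carlo Integration topic above is classically introduced with the dartboard estimate of pi; here is a serial plain-Python sketch (assuming that standard example, not the course's exact code):

```python
import random

# Serial Monte Carlo estimate of pi: sample points in the unit square and
# count how many land inside the quarter circle. This is the computation
# pbdR parallelizes across MPI ranks.
def estimate_pi(samples, seed=0):
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

pi_hat = estimate_pi(100_000)  # close to 3.14 for large sample counts
```

In pbdR, each MPI rank would draw its own sample with a rank-specific seed, and the per-rank hit counts would be combined with a reduce operation before computing the final estimate.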
[language] => en
[duration] => 21
[status] => published
[changed] => 1700037139
[source_title] => Programming with Big Data in R
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
)
[1] => Array
(
[0] => stdClass Object
(
[tid] => 766
[alias] => big-data-training
[name] => Big Data
[english_name] => Big Data
[consulting_option] => available_promoted
)
[1] => stdClass Object
(
[tid] => 877
[alias] => r-language-training
[name] => R Language
[english_name] => R Language
[consulting_option] => available
)
)
[2] => bigdatar
[3] => Array
(
[outlines] => Array
(
[tidyverse] => stdClass Object
(
[course_code] => tidyverse
[hr_nid] => 212656
[title] => Introduction to Data Visualization with Tidyverse and R
[requirements] =>
No programming experience is necessary
[overview] =>
The Tidyverse is a collection of versatile R packages for cleaning, processing, modeling, and visualizing data. Some of the packages included are: ggplot2, dplyr, tidyr, readr, purrr, and tibble.
In this instructor-led, live training, participants will learn how to manipulate and visualize data using the tools included in the Tidyverse.
By the end of this training, participants will be able to:
Perform data analysis and create appealing visualizations
Draw useful conclusions from various datasets of sample data
Filter, sort and summarize data to answer exploratory questions
Turn processed data into informative line plots, bar plots, and histograms
Import and filter data from diverse data sources, including Excel, CSV, and SPSS files
Audience
Beginners to the R language
Beginners to data analysis and data visualization
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Introduction
Tidyverse vs traditional R plotting
Setting up your working environment
Preparing the dataset
Importing and filtering data
Wrangling the data
Visualizing the data (graphs, scatter plots)
Grouping and summarizing the data
Visualizing the data (line plots, bar plots, histograms, boxplots)
Working with non-standard data
Closing remarks
[language] => en
[duration] => 7
[status] => published
[changed] => 1700037359
[source_title] => Introduction to Data Visualization with Tidyverse and R
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => tidyverse
)
[datavault] => stdClass Object
(
[course_code] => datavault
[hr_nid] => 210132
[title] => Data Vault: Building a Scalable Data Warehouse
[requirements] =>
An understanding of data warehousing concepts
An understanding of database and data modeling concepts
Audience
Data modelers
Data warehousing specialists
Business Intelligence specialists
Data engineers
Database administrators
[overview] =>
Data Vault Modeling is a database modeling technique that provides long-term historical storage of data that originates from multiple sources. A data vault stores a single version of the facts, or "all the data, all the time". Its flexible, scalable, consistent and adaptable design encompasses the best aspects of 3rd normal form (3NF) and star schema.
In this instructor-led, live training, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
[outline] =>
Introduction
The shortcomings of existing data warehouse data modeling architectures
Benefits of Data Vault modeling
Overview of Data Vault architecture and design principles
SEI / CMM / Compliance
Data Vault applications
Dynamic Data Warehousing
Exploration Warehousing
In-Database Data Mining
Rapid Linking of External Information
Data Vault components
Hubs, Links, Satellites
Building a Data Vault
Modeling Hubs, Links and Satellites
Data Vault reference rules
How components interact with each other
Modeling and populating a Data Vault
Converting 3NF OLTP to a Data Vault Enterprise Data Warehouse (EDW)
Understanding load dates, end-dates, and join operations
Business keys, relationships, link tables and join techniques
Query techniques
Load processing and query processing
Overview of Matrix Methodology
Getting data into data entities
Loading Hub Entities
Loading Link Entities
Loading Satellites
Using SEI/CMM Level 5 templates to obtain repeatable, reliable, and quantifiable results
Developing a consistent and repeatable ETL (Extract, Transform, Load) process
Building and deploying highly scalable and repeatable warehouses
Closing remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349914
[source_title] => Data Vault: Building a Scalable Data Warehouse
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => datavault
)
[sparkstreaming] => stdClass Object
(
[course_code] => sparkstreaming
[hr_nid] => 356863
[title] => Spark Streaming with Python and Kafka
[requirements] =>
Experience with Python and Apache Kafka
Familiarity with stream-processing platforms
Audience
Data engineers
Data scientists
Programmers
[overview] =>
Apache Spark Streaming is a scalable, open source stream processing system that allows users to process real-time data from supported sources. Spark Streaming enables fault-tolerant processing of data streams.
This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
[outline] =>
Introduction
Overview of Spark Streaming Features and Architecture
Confluent KSQL is a stream processing framework built on top of Apache Kafka. It enables real-time data processing using SQL operations.
This instructor-led, live training (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Apache Ignite is an in-memory computing platform that sits between the application and data layer to improve speed, scale, and availability.
This instructor-led, live training (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
[outline] =>
Introduction
Overview of Big Data Tools and Technologies
Installing and Configuring Apache Ignite
Overview of Ignite Architecture
Querying Data in Ignite
Spreading Large Data Sets across a Cluster
Understanding the In-Memory Data Grid
Writing a Service in Ignite
Running Distributed Computing with Ignite
Integrating Ignite with RDBMS, NoSQL, Hadoop and Machine Learning Processors
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution being carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
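The programming model described above chains transforms over immutable collections; this plain-Python sketch mimics the shape of Beam's canonical WordCount (it illustrates the model only and does not use the apache_beam SDK):

```python
# Plain-Python mimic of a Beam-style WordCount pipeline: each stage takes a
# collection and produces a new one, the way Beam chains PTransforms.
def flat_map(fn, pcoll):
    # analogous to Beam's FlatMap: one input element -> many output elements
    return [out for item in pcoll for out in fn(item)]

def count_per_key(pcoll):
    # analogous to a GroupByKey followed by a count combiner
    counts = {}
    for key in pcoll:
        counts[key] = counts.get(key, 0) + 1
    return counts

lines = ["to be or not to be"]
words = flat_map(str.split, lines)   # "FlatMap(split)"
counts = count_per_key(words)        # "Count.PerElement()"
```

In Beam proper, the same stages are declared against a Pipeline object and a runner (Flink, Spark, Dataflow, etc.) decides how to partition and execute them.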
[category_overview] =>
[outline] =>
Introduction
Apache Beam vs MapReduce, Spark Streaming, Kafka Streaming, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
Beam Model, SDKs, Beam Pipeline Runners
Distributed processing back-ends
Understanding the Apache Beam Programming Model
How a pipeline is executed
Running a sample pipeline
Preparing a WordCount pipeline
Executing the Pipeline locally
Designing a Pipeline
Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
Writing the driver program and defining the pipeline
Using Apache Beam classes
Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
Executing the pipeline locally, on remote machines, and on a public cloud
Choosing a runner
Runner-specific configurations
Testing and Debugging Apache Beam
Using type hints to emulate static typing
Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Windowing and Triggers
Making Your Pipelines Reusable and Maintainable
Create New Data Sources and Sinks
Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
[outline] =>
To request a customized course outline for this training, please contact us.
Apache Storm is a distributed, real-time computation engine used for enabling real-time business intelligence. It does so by enabling applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real-time.
Some of the topics included in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Request a customized course outline for this training!
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Automate dataflows.
Enable streaming analytics.
Apply various approaches for data ingestion.
Transform Big Data into business insights.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc> (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
[outline] =>
Introduction
Data at rest vs data in motion
Overview of Big Data Tools and Technologies
Hadoop (HDFS and MapReduce) and Spark
Installing and Configuring NiFi
Overview of NiFi Architecture
Development Approaches
Application development tools and mindset
Extract, Transform, and Load (ETL) tools and mindset
Design Considerations
Components, Events, and Processor Patterns
Exercise: Streaming Data Feeds into HDFS
Error Handling
Controller Services
Exercise: Ingesting Data from IoT Devices using Web-Based APIs
Exercise: Developing a Custom Apache NiFi Processor using JSON
Apache Flink is an open-source framework for scalable stream and batch data processing.
This instructor-led, live training (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
[outline] =>
Introduction
Installing and Configuring Apache Flink
Overview of Flink Architecture
Developing Data Streaming Applications in Flink
Managing Diverse Workloads
Performing Advanced Analytics
Setting up a Multi-Node Flink Cluster
Mastering Flink DataStream API
Understanding Flink Libraries
Integrating Flink with Other Big Data Tools
Testing and Troubleshooting
Summary and Next Steps
[language] => en
[duration] => 28
[status] => published
[changed] => 1700037319
[source_title] => Apache Flink Fundamentals
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => flink
)
[sparkpython] => stdClass Object
(
[course_code] => sparkpython
[hr_nid] => 279430
[title] => Python and Spark for Big Data (PySpark)
[requirements] =>
General programming skills
Audience
Developers
IT Professionals
Data Scientists
[overview] =>
Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.
In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
[outline] =>
Introduction
Understanding Big Data
Overview of Spark
Overview of Python
Overview of PySpark
Distributing Data Using Resilient Distributed Datasets Framework
Distributing Computation Using Spark API Operators
Setting Up Python with Spark
Setting Up PySpark
Using Amazon Web Services (AWS) EC2 Instances for Spark
Setting Up Databricks
Setting Up the AWS EMR Cluster
Learning the Basics of Python Programming
Getting Started with Python
Using the Jupyter Notebook
Using Variables and Simple Data Types
Working with Lists
Using if Statements
Using User Inputs
Working with while Loops
Implementing Functions
Working with Classes
Working with Files and Exceptions
Working with Projects, Data, and APIs
Learning the Basics of Spark DataFrame
Getting Started with Spark DataFrames
Implementing Basic Operations with Spark
Using Groupby and Aggregate Operations
Working with Timestamps and Dates
Working on a Spark DataFrame Project Exercise
Understanding Machine Learning with MLlib
Working with MLlib, Spark, and Python for Machine Learning
Understanding Regressions
Learning Linear Regression Theory
Implementing a Regression Evaluation Code
Working on a Sample Linear Regression Exercise
Learning Logistic Regression Theory
Implementing a Logistic Regression Code
Working on a Sample Logistic Regression Exercise
Understanding Random Forests and Decision Trees
Learning Tree Methods Theory
Implementing Decision Trees and Random Forest Codes
Working on a Sample Random Forest Classification Exercise
Working with K-means Clustering
Understanding K-means Clustering Theory
Implementing a K-means Clustering Code
Working on a Sample Clustering Exercise
Working with Recommender Systems
Implementing Natural Language Processing
Understanding Natural Language Processing (NLP)
Overview of NLP Tools
Working on a Sample NLP Exercise
Streaming with Spark on Python
Overview of Streaming with Spark
Sample Spark Streaming Exercise
Closing Remarks
[language] => en
[duration] => 21
[status] => published
[changed] => 1715349940
[source_title] => Python and Spark for Big Data (PySpark)
[source_language] => en
[cert_code] =>
[weight] => -998
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => sparkpython
)
[graphcomputing] => stdClass Object
(
[course_code] => graphcomputing
[hr_nid] => 278402
[title] => Introduction to Graph Computing
[requirements] =>
An understanding of Java programming and frameworks
A general understanding of Python is helpful but not required
A general understanding of database concepts
Audience
Developers
[overview] =>
Many real world problems can be described in terms of graphs. For example, the Web graph, the social network graph, the train network graph and the language graph. These graphs tend to be extremely large; processing them requires a specialized set of tools and processes -- these tools and processes can be referred to as Graph Computing (also known as Graph Analytics).
In this instructor-led, live training, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics and Distributed Graph Processing) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks.)
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks.)
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
[outline] =>
Introduction
Graph databases and libraries
Understanding Graph Data
The graph as a data structure
Using vertices (dots) and edges (lines) to model real-world scenarios
Using Graph Databases to Model, Persist and Process Graph Data
Local graph algorithms/traversals
neo4j, OrientDB and Titan
Exercise: Modeling Graph Data with neo4j
Whiteboard data modeling
Beyond Graph Databases: Graph Computing
Understanding the property graph
Graph modeling different scenarios (software graph, discussion graph, concept graph)
Solving Real-World Problems with Traversals
Algorithmic/directed walk over the graph
Determining circular dependencies
Case Study: Ranking Discussion Contributors
Ranking by number and depth of contributed discussions
Leveraging Hadoop for storage (HDFS) and processing (MapReduce)
Overview of iterative algorithms
Hama, Giraph, and GraphLab
Graph Computing: Graph-Parallel Computation
Unifying ETL, exploratory analysis, and iterative graph computation within a single system
GraphX
Setup and Installation
Hadoop and Spark
GraphX Operators
Property, structural, join, neighborhood aggregation, caching and uncaching
Iterating with Pregel API
Passing arguments for sending, receiving and computing
Building a Graph
Using vertices and edges in an RDD or on disk
Designing Scalable Algorithms
GraphX Optimization
Accessing Additional Algorithms
PageRank, Connected Components, Triangle Counting
Exercise: PageRank and Top Users
Building and processing graph data using text files as input
Deploying to Production
Closing Remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349940
[source_title] => Introduction to Graph Computing
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => graphcomputing
)
[aitech] => stdClass Object
(
[course_code] => aitech
[hr_nid] => 199320
[title] => Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
[requirements] =>
[overview] =>
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
[category_overview] =>
[outline] =>
Distribution big data
Data mining methods (training single systems + distributed prediction: traditional machine learning algorithms + Mapreduce distributed prediction)
Apache Spark MLlib
Recommendations and Advertising:
Natural language
Text clustering, text categorization (labeling), synonyms
User profile restore, labeling system
Recommended algorithms
Ensuring the accuracy of "lift" between and within categories
How to create closed loops for recommendation algorithms
Logistic regression, RankingSVM
Feature recognition (deep learning and automatic feature recognition for graphics)
Natural language
Chinese word segmentation
Theme model (text clustering)
Text classification
Extract keywords
Semantic analysis, semantic parser, word2vec (word to vector)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark
[category_overview] =>
[outline] =>
spark.mllib: data types, algorithms, and utilities
Data types
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
streaming significance testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
bisecting k-means
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
association rules
PrefixSpan
Evaluation metrics
PMML model export
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
spark.ml: high-level APIs for ML pipelines
Overview: estimators, transformers and pipelines
Extracting, transforming and selecting features
Classification and regression
Clustering
Advanced topics
[language] => en
[duration] => 35
[status] => published
[changed] => 1700037209
[source_title] => Apache Spark MLlib
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => spmllib
)
)
[codes] => Array
(
[0] => tidyverse
[1] => datavault
[2] => sparkstreaming
[3] => ksql
[4] => apacheignite
[5] => beam
[6] => apex
[7] => storm
[8] => nifi
[9] => nifidev
[10] => flink
[11] => sparkpython
[12] => graphcomputing
[13] => aitech
[14] => spmllib
)
)
[4] =>
[5] => Array
(
[0] => 4
[1] => 5
)
[6] => Array
(
[282974] => Array
(
[title] => Programming with Big Data in R
[rating] => 4
[delegate_and_company] => Tim - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
[body] => The subject matter and the pace were perfect.
[mc] => The subject matter and the pace were perfect.
[is_mt] => 0
[nid] => 282974
)
[282922] => Array
(
[title] => Programming with Big Data in R
[rating] => 5
[delegate_and_company] => Xiaoyuan Geng - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
[body] => Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customize the training to meet clients' need. He is also very capable to solve technical and subject matter problems on the go. Fantastic and professional training!
[mc] => Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customize the training meeting clients' need. He is also very capable to solve technical and subject matter problems on the go. Fantastic and professional training!.
[is_mt] => 0
[nid] => 282922
)
)
[7] => 4.5
[8] =>
[9] => 1
[10] =>
)
)
)
Programming with Big Data in R Training Course
Big Data is a term that refers to solutions designed for storing and processing large data sets. Initially developed by Google, these Big Data solutions have evolved and inspired other similar projects, many of which are available as open source. R is a popular programming language in the financial industry.
Course Outline
Introduction to Programming Big Data with R (pbdR)
Setting up your environment to use pbdR
Scope and tools available in pbdR
Packages commonly used with Big Data alongside pbdR
Message Passing Interface (MPI)
Using pbdR MPI
Parallel processing
Point-to-point communication
Send Matrices
Summing Matrices
Collective communication
Summing Matrices with Reduce
Scatter / Gather
Other MPI communications
Distributed Matrices
Creating a distributed diagonal matrix
SVD of a distributed matrix
Building a distributed matrix in parallel
Statistics Applications
Monte Carlo Integration
Reading Datasets
Reading on all processes
Broadcasting from one process
Reading partitioned data
Distributed Regression
Distributed Bootstrap
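The Monte Carlo Integration topic in the outline above follows a classic pattern: each process draws its own random samples, and a reduce operation sums the partial results. The sketch below is a plain-Python, single-process analogue of what pbdR would run as real MPI ranks; the function names are invented for illustration.

```python
import random

def local_hits(rank, n):
    """Work one 'rank' would do: count samples inside the quarter circle."""
    rng = random.Random(rank)      # per-rank seed, as each MPI rank would use
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(n_per_rank=50_000, ranks=4):
    # In pbdR each rank computes its share and an MPI reduce sums them;
    # here the "ranks" simply run one after another.
    total = sum(local_hits(r, n_per_rank) for r in range(ranks))
    return 4.0 * total / (n_per_rank * ranks)

print(round(estimate_pi(), 2))  # close to 3.14
```

With real MPI, only the reduce step changes: each rank calls `local_hits` once and the partial counts are summed across the communicator.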
21 Hours
Testimonials (2)
The subject matter and the pace were perfect.
Tim - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
Course - Programming with Big Data in R
Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customizes the training to meet clients' needs. He is also very capable of solving technical and subject matter problems on the go. Fantastic and professional training!
Xiaoyuan Geng - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada
The Tidyverse is a collection of versatile R packages for cleaning, processing, modeling, and visualizing data. Some of the packages included are: ggplot2, dplyr, tidyr, readr, purrr, and tibble.
In this instructor-led, live training, participants will learn how to manipulate and visualize data using the tools included in the Tidyverse.
By the end of this training, participants will be able to:
Perform data analysis and create appealing visualizations
Draw useful conclusions from various sample datasets
Filter, sort and summarize data to answer exploratory questions
Turn processed data into informative line plots, bar plots, histograms
Import and filter data from diverse data sources, including Excel, CSV, and SPSS files
Audience
Beginners to the R language
Beginners to data analysis and data visualization
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
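The filter / sort / summarize workflow listed above is taught with dplyr in the course itself; as a rough analogue, the same verbs (filter(), arrange(), group_by() + summarise()) can be sketched with Python's standard library, using invented sample data.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"city": "Lisbon", "year": 2020, "sales": 120},
    {"city": "Porto",  "year": 2020, "sales": 80},
    {"city": "Lisbon", "year": 2021, "sales": 150},
    {"city": "Porto",  "year": 2021, "sales": 95},
]

# filter(sales > 90)  ->  arrange(city)  ->  group_by(city) + summarise(total)
kept = [r for r in rows if r["sales"] > 90]
kept.sort(key=itemgetter("city"))          # groupby needs sorted input
totals = {city: sum(r["sales"] for r in grp)
          for city, grp in groupby(kept, key=itemgetter("city"))}
print(totals)  # {'Lisbon': 270, 'Porto': 95}
```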
This instructor-led, live training in Portugal (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
This instructor-led, live training in Portugal (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
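To show the flavor of a pipeline built "using only SQL commands", here is the same kind of filter-plus-aggregate statement run once over a static table with Python's sqlite3 module. KSQL would run an equivalent query continuously over a Kafka topic; the table and column names here are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pageviews (user TEXT, page TEXT, ms INTEGER)")
con.executemany("INSERT INTO pageviews VALUES (?, ?, ?)",
                [("a", "/home", 30), ("b", "/home", 120), ("a", "/docs", 200)])

# Filtering and aggregation expressed purely in SQL, as a KSQL query would be
slow = dict(con.execute(
    "SELECT page, COUNT(*) FROM pageviews WHERE ms > 100 GROUP BY page"
).fetchall())
print(slow)
```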
This instructor-led, live training in Portugal (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
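The contrast drawn above between a pure in-memory store and write-through on-disk persistence can be sketched in a few lines. This toy class is not Ignite's API, just the two storage modes side by side.

```python
import json
import os
import tempfile

class MiniCache:
    """Toy contrast of in-memory vs. persistent storage -- not Ignite's API."""
    def __init__(self, persist_path=None):
        self.mem = {}                            # pure in-memory store
        self.persist_path = persist_path
        if persist_path and os.path.exists(persist_path):
            with open(persist_path) as f:        # warm start from disk
                self.mem.update(json.load(f))

    def put(self, key, value):
        self.mem[key] = value
        if self.persist_path:                    # write-through persistence
            with open(self.persist_path, "w") as f:
                json.dump(self.mem, f)

    def get(self, key):
        return self.mem.get(key)

path = os.path.join(tempfile.mkdtemp(), "cache.json")
MiniCache(path).put("answer", 42)
restarted = MiniCache(path)      # a new instance survives the "restart"
print(restarted.get("answer"))   # 42
```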
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution being carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
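Beam's central idea, one pipeline definition executed over either a bounded (batch) or unbounded (streaming) source, can be illustrated without the SDK. This is a conceptual sketch only, not Beam's API.

```python
def pipeline(source):
    """Same transforms regardless of how the source is produced."""
    for record in source:
        word = record.strip().lower()
        if word:                      # drop empties (a 'Filter' transform)
            yield (word, len(word))   # a 'Map' transform

batch = ["Hello", "  ", "Beam"]            # bounded (batch) source
stream = (w for w in ["one", "two"])       # unbounded-style generator

batch_out = list(pipeline(batch))
stream_out = list(pipeline(stream))
print(batch_out)   # [('hello', 5), ('beam', 4)]
print(stream_out)  # [('one', 3), ('two', 3)]
```

In real Beam the same decoupling holds: the pipeline is declared once, and the chosen runner decides how to execute it over the source.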
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
Apache Storm is a distributed, real-time computation engine used for real-time business intelligence. It enables applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real-time.
Topics covered in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
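The spout-and-bolt flow behind a Storm topology can be reduced to chained generators: a spout emits tuples, and each bolt transforms the stream as tuples arrive. This is a conceptual sketch only; real Storm distributes each stage across a cluster and the spout's stream never ends.

```python
from collections import Counter

def sentence_spout(sentences):
    for s in sentences:               # in Storm this stream is unbounded
        yield s

def split_bolt(stream):
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    counts = Counter()
    for word in stream:               # processes each tuple as it arrives
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout(["to be", "or not to be"])))
print(counts)  # to: 2, be: 2, or: 1, not: 1
```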
In this instructor-led, live training in Portugal (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
In this instructor-led, live training in Portugal, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processors.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
This instructor-led, live training in Portugal (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
In this instructor-led, live training in Portugal, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
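The kind of RDD chain such a PySpark exercise builds, roughly flatMap into words, map to pairs, then reduceByKey, can be mimicked in plain Python to make the shape of the computation visible. The sample lines are invented; this is not the PySpark API itself.

```python
from functools import reduce

lines = ["big data with spark", "spark and python"]

words = [w for line in lines for w in line.split()]        # flatMap
pairs = [(w, 1) for w in words]                            # map to (key, 1)
counts = reduce(                                           # reduceByKey
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs, {})
print(counts["spark"])  # 2
```

In PySpark the same three steps run in parallel across partitions, with the reduce step shuffling values that share a key to the same worker.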
In this instructor-led, live training in Portugal, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
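Pregel-style graph computing, mentioned above, proceeds in synchronized supersteps: vertices update their state from incoming messages and message their neighbours, until no messages remain. A minimal single-machine sketch, computing hop distances from a source vertex (the helper name is invented):

```python
def pregel_distances(edges, source):
    graph = {}
    for u, v in edges:                    # build an undirected adjacency list
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, []).append(u)
    dist = {v: float("inf") for v in graph}
    messages = {source: 0}
    while messages:                       # one superstep per iteration
        outgoing = {}
        for vertex, d in messages.items():
            if d < dist[vertex]:          # vertex updates its own state...
                dist[vertex] = d
                for nbr in graph[vertex]: # ...and messages its neighbours
                    outgoing[nbr] = min(outgoing.get(nbr, float("inf")), d + 1)
        messages = outgoing               # halt when no messages are in flight
    return dist

dists = pregel_distances([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")], "A")
print(dists)  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

Frameworks like GraphX's Pregel API run exactly this loop, but with vertices partitioned across machines and messages exchanged over the network between supersteps.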
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark.
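Of the algorithms MLlib provides, k-means is compact enough to sketch whole. The toy one-dimensional version below shows the assign/update iteration that MLlib distributes across a cluster; it is not MLlib's API, and the sample data is invented.

```python
import random

def kmeans_1d(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]   # update step
    return sorted(centers)

centers = kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.2, 9.8], k=2)
print(centers)  # approximately [1.0, 10.0]
```

MLlib parallelizes the assignment step across partitions of the data and aggregates the per-partition sums to update the centers.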