Programming with Big Data in R Training Course

Big Data refers to solutions for storing and processing very large data sets. Initially developed by Google, these Big Data solutions have evolved and inspired similar projects, many of which are available as open source. R is a popular programming language in the financial industry.

Course Outline

Introduction to Programming Big Data with R (bpdR)

  • Setting up your environment to use pbdR
  • Scope and tools available in pbdR
  • Packages commonly used with Big Data alongside pbdR
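
The environment-setup topics above can be previewed with a minimal sketch. It assumes the pbdR packages (pbdMPI, and later pbdDMAT) are installed from CRAN and that an MPI runtime such as Open MPI is available; the file name and mpirun invocation are illustrative, not prescribed by the course.

    ## hello_pbdr.R -- minimal pbdR "hello world" (illustrative sketch only)
    ## assumes: install.packages(c("pbdMPI", "pbdDMAT")) and a working MPI runtime
    library(pbdMPI)
    init()                                         # start MPI in SPMD mode
    comm.cat("Hello from rank", comm.rank(),
             "of", comm.size(), "\n", all.rank = TRUE)
    finalize()                                     # shut MPI down cleanly

    ## launched from a shell, e.g.:  mpirun -np 4 Rscript hello_pbdr.R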

Message Passing Interface (MPI)

  • Using pbdR MPI 5
  • Parallel processing
  • Point-to-point communication
  • Send Matrices
  • Summing Matrices
  • Collective communication
  • Summing Matrices with Reduce
  • Scatter / Gather
  • Other MPI communications
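
As a flavour of the collective-communication topics listed above (summing matrices with Reduce, for example), here is a short SPMD sketch. It is an illustrative assumption about how such an exercise might look, not the official lab code.

    ## sum_matrices.R -- illustrative sketch, assuming pbdMPI
    library(pbdMPI)
    init()

    ## every rank owns a small local matrix
    x <- matrix(comm.rank() + 1, nrow = 2, ncol = 2)

    ## collective communication: element-wise sum across all ranks;
    ## allreduce() returns the result on every rank (reduce() targets rank 0)
    total <- allreduce(x, op = "sum")
    dim(total) <- dim(x)          # restore the matrix shape if it was dropped

    comm.print(total)             # printed by rank 0 only
    finalize()

    ## e.g.:  mpirun -np 4 Rscript sum_matrices.R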

Distributed Matrices

  • Creating a distributed diagonal matrix
  • SVD of a distributed matrix
  • Building a distributed matrix in parallel
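
The distributed-matrix topics can be sketched with pbdDMAT. The use of ddmatrix() with a generator name and the svd() call follow the package's documented examples, but the dimensions and grid setup here are assumptions made purely for illustration.

    ## ddmatrix_svd.R -- illustrative sketch, assuming pbdDMAT (and pbdMPI)
    library(pbdMPI)
    library(pbdDMAT)
    init.grid()                                  # set up the 2-d process grid

    ## build a distributed matrix in parallel: each block generates its own
    ## random entries locally rather than scattering from a single process
    x <- ddmatrix("rnorm", nrow = 1000, ncol = 100)

    ## SVD of the distributed matrix; only the singular values are requested
    s <- svd(x, nu = 0, nv = 0)
    comm.print(head(s$d))

    finalize()

    ## e.g.:  mpirun -np 4 Rscript ddmatrix_svd.R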

Statistics Applications

  • Monte Carlo Integration
  • Reading Datasets
  • Reading on all processes
  • Broadcasting from one process
  • Reading partitioned data
  • Distributed Regression
  • Distributed Bootstrap
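
As a preview of the statistics applications (Monte Carlo integration in particular), the sketch below estimates pi in SPMD style. The per-rank seeding and sample size are illustrative assumptions; pbdMPI's comm.set.seed() would be the more careful choice for parallel random numbers.

    ## mc_pi.R -- Monte Carlo integration sketch, assuming pbdMPI
    library(pbdMPI)
    init()

    ## each rank draws its own sample; simple per-rank seed offset for illustration
    set.seed(1234 + comm.rank())
    n  <- 100000L
    xy <- matrix(runif(2 * n), ncol = 2)
    hits <- sum(rowSums(xy ^ 2) <= 1)            # points inside the unit circle

    ## combine the per-rank counts; every rank ends up with the global totals
    total.hits <- allreduce(hits, op = "sum")
    total.n    <- allreduce(n, op = "sum")

    comm.print(4 * total.hits / total.n)         # estimate of pi, printed by rank 0
    finalize()

    ## e.g.:  mpirun -np 4 Rscript mc_pi.R
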
Duration: 21 Hours

Testimonials (2)

Tim - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada (4/5)
"The subject matter and the pace were perfect."

Xiaoyuan Geng - Ottawa Research and Development Center, Science Technology Branch, Agriculture and Agri-Food Canada (5/5)
"Michael the trainer is very knowledgeable and skillful about the subject of Big Data and R. He is very flexible and quickly customizes the training to meet clients' needs. He is also very capable of solving technical and subject-matter problems on the go. Fantastic and professional training!"

Related Courses

Introduction to Data Visualization with Tidyverse and R

7 Hours

Data Vault: Building a Scalable Data Warehouse

28 Hours

Spark Streaming with Python and Kafka

7 Hours

Confluent KSQL

7 Hours

Apache Ignite for Developers

14 Hours

Unified Batch and Stream Processing with Apache Beam

14 Hours

Apache Apex: Processing Big Data-in-Motion

21 Hours

Apache Storm

28 Hours

Apache NiFi for Administrators

21 Hours

Apache NiFi for Developers

7 Hours

Apache Flink Fundamentals

28 Hours

Python and Spark for Big Data (PySpark)

21 Hours

Introduction to Graph Computing

28 Hours

Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP

21 Hours

Apache Spark MLlib

35 Hours

Related Categories

Big Data

R Language