Course Outline
Day 01
Overview of Big Data Business Intelligence for Criminal Intelligence Analysis
- Case Studies from Law Enforcement - Predictive Policing
- Big Data adoption rate in Law Enforcement Agencies and how they are aligning their future operations around Big Data Predictive Analytics
- Emerging technology solutions such as gunshot sensors, surveillance video, and social media
- Using Big Data technology to mitigate information overload
- Integrating Big Data with Legacy data
- Basic understanding of enabling technologies in predictive analytics
- Data Integration & Dashboard visualization
- Fraud management
- Business Rules and Fraud detection
- Threat detection and profiling
- Cost-benefit analysis for Big Data implementation
Introduction to Big Data
- Main characteristics of Big Data -- Volume, Variety, Velocity, and Veracity.
- MPP (Massively Parallel Processing) architecture
- Data Warehouses – static schema, slowly evolving dataset
- MPP Databases: Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
- Hadoop-Based Solutions – no constraints on dataset structure.
- Typical pattern: HDFS, MapReduce (crunch), retrieve from HDFS
- Apache Spark for stream processing
- Batch processing – suited for analytical/non-interactive tasks
- Velocity: CEP (Complex Event Processing) streaming data
- Typical choices – CEP products (e.g., InfoSphere Streams, Apama, MarkLogic, etc.)
- Less production-ready – Storm/S4
- NoSQL Databases – (columnar and key-value): Best suited as an analytical adjunct to data warehouses/databases
NoSQL solutions
- KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
- KV Store - Dynamo, Voldemort, Dynomite, SubRecord, MongoDB, DovetailDB
- KV Store (Hierarchical) - GT.M, Caché
- KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
- KV Cache - Memcached, Repcached, Coherence, Infinispan, eXtreme Scale, JBossCache, Velocity, Terracoqua
- Tuple Store - Gigaspaces, Coord, Apache River
- Object Database - ZopeDB, db4o, Shoal
- Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
- Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
Varieties of Data: Introduction to Data Cleaning issues in Big Data
- RDBMS – static structure/schema, does not promote an agile, exploratory environment.
- NoSQL – semi-structured; enough structure to store data without defining an exact schema beforehand
- Data cleaning issues
Hadoop
- When to select Hadoop?
- STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not ideal for active exploration)
- SEMI-STRUCTURED data – difficult to process using traditional solutions (DW/DB)
- Warehousing data = HUGE effort and static even after implementation
- For variety & volume of data, processed on commodity hardware – HADOOP
- Commodity H/W needed to create a Hadoop Cluster
Introduction to MapReduce/HDFS
- MapReduce – distribute computing over multiple servers
- HDFS – make data available locally for the computing process (with redundancy)
- Data – can be unstructured/schema-less (unlike RDBMS)
- Developer responsibility to make sense of data
- Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
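The map, shuffle, and reduce steps above can be sketched in a few lines. This is a single-process illustration only, written in Python rather than the Java the course refers to; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and are not part of the Hadoop API. On a real cluster the framework distributes the map and reduce calls across servers and reads its input from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the values collected for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can in principle run in parallel, which is the property MapReduce relies on.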
Day 02
Big Data Ecosystem -- Building Big Data ETL (Extract, Transform, Load) -- Which Big Data Tools to use and when?
- Hadoop vs. Other NoSQL solutions
- For interactive, random access to data
- HBase (column-oriented database) on top of Hadoop
- Random access to data but restrictions imposed (max 1 PB)
- Not ideal for ad-hoc analytics, good for logging, counting, time-series
- Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
- Flume – Stream data (e.g., log data) into HDFS
Big Data Management System
- Moving parts, compute nodes start/fail: ZooKeeper - For configuration/coordination/naming services
- Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
- Deployment, configuration, cluster management, upgrades, etc. (sys admin): Ambari
- In Cloud: Whirr
Predictive Analytics -- Fundamental Techniques and Machine Learning based Business Intelligence
- Introduction to Machine Learning
- Learning classification techniques
- Bayesian Prediction -- preparing a training file
- Support Vector Machine
- KNN p-Tree Algebra & vertical mining
- Neural Networks
- Big Data large variable problem -- Random Forest (RF)
- Big Data Automation problem – Multi-model ensemble RF
- Automation through Soft10-M
- Text analytics tool: Treeminer
- Agile learning
- Agent-based learning
- Distributed learning
- Introduction to Open Source Tools for predictive analytics: R, Python, RapidMiner, Mahout
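As an illustration of the "Bayesian Prediction -- preparing a training file" topic above, the following is a minimal sketch of a categorical Naive Bayes classifier with add-one (Laplace) smoothing. The training data, column names, and labels are invented for the example and are not course material; the training file is assumed to be a CSV whose last column, named label, holds the class.

```python
import csv
import io
import math
from collections import Counter, defaultdict

# Hypothetical training file: feature columns followed by a "label" column.
TRAINING_CSV = """time_of_day,location_type,label
night,street,burglary
night,residence,burglary
day,shop,theft
day,street,theft
night,shop,burglary
day,residence,theft
"""


def load_training(text):
    """Parse the CSV text into a list of feature dicts and a list of labels."""
    rows = list(csv.DictReader(io.StringIO(text)))
    features = [{k: v for k, v in row.items() if k != "label"} for row in rows]
    labels = [row["label"] for row in rows]
    return features, labels


def train(features, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    vocab = defaultdict(set)             # feature name -> distinct values seen
    for feats, label in zip(features, labels):
        for name, value in feats.items():
            value_counts[(label, name)][value] += 1
            vocab[name].add(value)
    return class_counts, value_counts, vocab


def predict(model, feats):
    """Return the label with the highest smoothed log-probability."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in feats.items():
            seen = value_counts[(label, name)][value]
            # Add-one smoothing so unseen values do not zero the product.
            score += math.log((seen + 1) / (count + len(vocab[name])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(*load_training(TRAINING_CSV))
    print(predict(model, {"time_of_day": "night", "location_type": "street"}))
```

The sketch treats every feature as categorical and independent given the class, which is the simplifying assumption that gives Naive Bayes its name.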
Predictive Analytics Ecosystem and its application in Criminal Intelligence Analysis
- Technology and the investigative process
- Insight analytics
- Visualization analytics
- Structured predictive analytics
- Unstructured predictive analytics
- Threat/fraudster/vendor profiling
- Recommendation Engine
- Pattern detection
- Rule/Scenario discovery – failure, fraud, optimization
- Root cause discovery
- Sentiment analysis
- CRM analytics
- Network analytics
- Text analytics for obtaining insights from transcripts, witness statements, internet chatter, etc.
- Technology-assisted review
- Fraud analytics
- Real-Time Analytics
Day 03
Real-Time and Scalable Analytics Over Hadoop
- Why common analytic algorithms fail in Hadoop/HDFS
- Apache Hama - for Bulk Synchronous distributed computing
- Apache Spark - for cluster computing and real-time analytics
- CMU GraphLab 2 - Graph-based asynchronous approach to distributed computing
- KNN p-Tree -- Algebra-based approach from Treeminer for reduced hardware cost of operation
Tools for eDiscovery and Forensics
- eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
- Predictive coding and Technology-Assisted Review (TAR)
- Live demo of vMiner for understanding how TAR enables faster discovery
- Faster indexing through HDFS – Velocity of data
- NLP (Natural Language Processing) – open source products and techniques
- eDiscovery in foreign languages -- technology for foreign language processing
Big Data BI for Cyber Security – Getting a 360-degree view, speedy data collection, and threat identification
- Understanding the basics of security analytics -- attack surface, security misconfiguration, host defenses
- Network infrastructure / Large data pipe / Response ETL for real-time analytics
- Prescriptive vs predictive – Fixed rule-based vs auto-discovery of threat rules from metadata
Gathering disparate data for Criminal Intelligence Analysis
- Using IoT (Internet of Things) as sensors for capturing data
- Using Satellite Imagery for Domestic Surveillance
- Using surveillance and image data for criminal identification
- Other data gathering technologies -- drones, body cameras, GPS tagging systems, and thermal imaging technology
- Combining automated data retrieval with data obtained from informants, interrogation, and research
- Forecasting criminal activity
Day 04
Fraud Prevention BI from Big Data in Fraud Analytics
- Basic classification of Fraud Analytics -- rules-based vs predictive analytics
- Supervised vs unsupervised Machine learning for Fraud pattern detection
- Business-to-business fraud, medical claims fraud, insurance fraud, tax evasion, and money laundering
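The contrast between rules-based and unsupervised detection listed above can be sketched as follows. A fixed-threshold rule stands in for the rules-based approach, and a simple z-score outlier test stands in for unsupervised methods; real systems use far richer rules and models. The transaction amounts, the 10,000 limit, and the threshold of 2.0 are hypothetical values chosen for the example.

```python
from statistics import mean, pstdev


def rule_based_flags(amounts, limit=10_000):
    """Rules-based: flag any amount above a fixed, analyst-defined limit."""
    return [amount > limit for amount in amounts]


def zscore_flags(amounts, threshold=2.0):
    """Unsupervised stand-in: flag amounts far from the mean of the data itself."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:
        # All amounts are identical, so nothing stands out.
        return [False] * len(amounts)
    return [abs(amount - mu) / sigma > threshold for amount in amounts]


if __name__ == "__main__":
    # Hypothetical claim amounts; the last one is unusual for this account
    # but still below the fixed limit.
    amounts = [120, 95, 130, 110, 105, 98, 115, 4000]
    print(rule_based_flags(amounts))
    print(zscore_flags(amounts))
```

In this data the fixed rule flags nothing, while the statistical test flags the final amount, which illustrates why the two approaches are usually combined rather than treated as alternatives.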
Social Media Analytics -- Intelligence gathering and analysis
- How Social Media is used by criminals to organize, recruit, and plan
- Big Data ETL API for extracting social media data
- Text, image, metadata, and video
- Sentiment analysis from social media feed
- Contextual and non-contextual filtering of social media feed
- Social Media Dashboard to integrate diverse social media
- Automated profiling of social media profiles
- Live demo of each analytic using the Treeminer tool
Big Data Analytics in image processing and video feeds
- Image Storage techniques in Big Data -- Storage solution for data exceeding petabytes
- LTFS (Linear Tape File System) and LTO (Linear Tape-Open)
- GPFS-LTFS (General Parallel File System - Linear Tape File System) -- layered storage solution for Big image data
- Fundamentals of image analytics
- Object recognition
- Image segmentation
- Motion tracking
- 3-D image reconstruction
Biometrics, DNA, and Next Generation Identification Programs
- Beyond fingerprinting and facial recognition
- Speech recognition, keystroke dynamics (analyzing a user's typing pattern), and CODIS (Combined DNA Index System)
- Beyond DNA matching: using forensic DNA phenotyping to construct a face from DNA samples
Big Data Dashboard for quick accessibility of diverse data and display
- Integration of existing application platform with Big Data Dashboard
- Big Data management
- Case Study of Big Data Dashboard: Tableau and Pentaho
- Using Big Data apps to push location-based services in government
- Tracking system and management
Day 05
How to justify Big Data BI implementation within an organization:
- Defining the ROI (Return on Investment) for implementing Big Data
- Case studies for saving Analyst Time in collection and preparation of Data – increasing productivity
- Cost savings from lower database licensing costs
- Revenue gain from location-based services
- Cost savings from fraud prevention
- An integrated spreadsheet approach for calculating approximate expenses vs. revenue gain/savings from a Big Data implementation
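The expenses-versus-gains comparison described above reduces to the standard formula ROI = (total gains - total costs) / total costs. The sketch below applies it to itemised figures; every category name and amount is a hypothetical placeholder, not data from the course.

```python
def roi(gains, costs):
    """Return ROI as a fraction: (total gains - total costs) / total costs."""
    total_gain = sum(gains.values())
    total_cost = sum(costs.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (total_gain - total_cost) / total_cost


if __name__ == "__main__":
    # Hypothetical annual figures, in the same currency.
    costs = {"hardware": 150_000, "software": 50_000, "staff": 200_000}
    gains = {
        "analyst_time_saved": 180_000,
        "licensing_savings": 120_000,
        "fraud_prevented": 200_000,
    }
    print(f"ROI: {roi(gains, costs):.0%}")
```

Keeping gains and costs as named line items mirrors the spreadsheet layout, so individual assumptions can be changed without touching the formula.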
Step-by-step procedure for replacing a legacy data system with a Big Data System
- Big Data Migration Roadmap
- What critical information is needed before architecting a Big Data system?
- Different ways of calculating the Volume, Velocity, Variety, and Veracity of data
- How to estimate data growth
- Case studies
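One simple way to approach the data-growth estimate mentioned above is a compound-growth projection. This is a sketch under the assumption of a constant annual growth rate, which real workloads rarely follow exactly; the function names and the figures used are hypothetical.

```python
def project_volume(current_tb, annual_growth_rate, years):
    """Project data volume after a number of years of compound growth."""
    return current_tb * (1 + annual_growth_rate) ** years


def years_until_full(current_tb, annual_growth_rate, capacity_tb):
    """Return the first whole year in which volume exceeds the capacity."""
    if current_tb <= 0 or annual_growth_rate <= 0:
        raise ValueError("volume and growth rate must be positive")
    years = 0
    volume = current_tb
    while volume <= capacity_tb:
        volume *= 1 + annual_growth_rate
        years += 1
    return years


if __name__ == "__main__":
    # Hypothetical: 100 TB today, growing 50% per year, 300 TB of capacity.
    print(project_volume(100, 0.5, 2))
    print(years_until_full(100, 0.5, 300))
```

The same projection can be run per data source and summed when different sources grow at different rates.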
Review of Big Data vendors and their products
- Accenture
- APTEAN (Formerly CDC Software)
- Cisco Systems
- Cloudera
- Dell
- EMC
- GoodData Corporation
- Guavus
- Hitachi Data Systems
- Hortonworks
- HP
- IBM
- Informatica
- Intel
- Jaspersoft
- Microsoft
- MongoDB (Formerly 10Gen)
- Mu Sigma
- NetApp
- Opera Solutions
- Oracle
- Pentaho
- Platfora
- QlikTech
- Quantum
- Rackspace
- Revolution Analytics
- Salesforce
- SAP
- SAS Institute
- Sisense
- Software AG/Terracotta
- Soft10 Automation
- Splunk
- Sqrrl
- Supermicro
- Tableau Software
- Teradata
- Think Big Analytics
- Tidemark Systems
- Treeminer
- VMware (Part of EMC)
Q/A session
Requirements
- Knowledge of law enforcement processes and data systems
- Basic understanding of SQL/Oracle or relational databases
- Basic understanding of statistics (at the spreadsheet level)
Audience
- Law enforcement specialists with a technical background