Data Pipeline Processing

Big Data - Tech Stack:
  • Analytical Tool 
    • R Tool 
  • Machine Learning 
    • Apache Mahout 
  • Query System 
    • Hive 
    • Pig
  • Cache System  
    • Memcached
    • Redis 
  • Data Serialization
    • Apache Avro 
  • Processing System (Batch)
    • Apache Hadoop
      • MapReduce / HDFS
  • Processing System (Real Time)
    • Apache Spark 
    • Apache Storm
  • Message System 
    • Apache Kafka 
  • NoSQL
    • Apache Cassandra
    • Apache HBase 
Storm:
  • Nugget - Distributed, reliable, real-time data processing system
  • Meant for real-time streaming data vs. the batch processing of Hadoop
  • Diff with Hadoop - Tasks run continuously vs. tasks that run to completion
  • Reads data from messaging queues
  • Failover - On a failed execution, it restarts the task on another node
  • Reliability - Based on the spout - Ability to replay a tuple to the bolts when it was not fully processed
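  • Tuple - Named list of values (the schema); a stream is an unbounded sequence of tuples
  • Spout - Source of streams in a topology; consumes a data stream from an external source
  • Bolt - Processing unit; consumes tuples, processes them and may emit new tuples
  • Topology - Graph of spouts and bolts connected by streams
  • Topology (spouts + bolts) > Worker > Executor > Task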
  • A worker can execute tasks of both bolts and spouts.
  • Parallelism is defined by the number of executors running for each bolt and spout.
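
The spout / bolt / replay idea above can be sketched in plain Python. This is a toy, single-process simulation and not the Storm API: the class names, the `run_topology` driver and the `fail_once` parameter are all hypothetical, and the `next_tuple` / `ack` / `fail` methods are only loosely modeled on the callbacks a Storm spout exposes.

```python
from collections import deque


class SentenceSpout:
    """Toy spout: emits tuples from an in-memory source and replays failed ones."""

    def __init__(self, sentences):
        self.pending = deque(enumerate(sentences))  # (message id, sentence)
        self.in_flight = {}  # emitted but not yet acknowledged

    def next_tuple(self):
        if not self.pending:
            return None
        msg_id, sentence = self.pending.popleft()
        self.in_flight[msg_id] = sentence
        return msg_id, sentence

    def ack(self, msg_id):
        # Tuple fully processed: forget it.
        self.in_flight.pop(msg_id, None)

    def fail(self, msg_id):
        # Tuple not processed: queue it again so it is replayed.
        self.pending.append((msg_id, self.in_flight.pop(msg_id)))


class WordCountBolt:
    """Toy bolt: consumes sentence tuples and keeps running word counts."""

    def __init__(self):
        self.counts = {}

    def execute(self, sentence):
        for word in sentence.split():
            self.counts[word] = self.counts.get(word, 0) + 1


def run_topology(spout, bolt, fail_once=()):
    """Drive spout -> bolt until nothing is pending.

    fail_once holds message ids whose first delivery is simulated as failed.
    """
    already_failed = set()
    while True:
        item = spout.next_tuple()
        if item is None:
            break
        msg_id, sentence = item
        if msg_id in fail_once and msg_id not in already_failed:
            already_failed.add(msg_id)
            spout.fail(msg_id)  # simulated failure before the bolt ran
            continue
        bolt.execute(sentence)
        spout.ack(msg_id)
    return bolt.counts


spout = SentenceSpout(["storm reads streams", "storm replays tuples"])
counts = run_topology(spout, WordCountBolt(), fail_once={0})
```

Message 0 fails on its first delivery and is replayed later, so every word is still counted once. In a real cluster the spout and bolt would run as tasks inside executors on different workers.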
Kafka:
  • Nugget - Distributed, reliable, scalable messaging system
  • Producer - Publishes messages to topics
  • Consumer - Subscribes to topics and reads the messages published to them
  • Topics - Named channels through which producers and consumers exchange messages in pub/sub fashion.
  • Partition - Queue (ordered log) under a topic. Each topic has at least one partition.
  • Broker - Server in the cluster that serves messages through topics. Each partition has a leader and followers to provide resiliency.
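
The topic / partition / consumer relationship above can be illustrated with a small in-memory model. This is a sketch under assumptions and not a Kafka client: `Topic`, `Consumer`, `publish` and `poll` are hypothetical names, and the CRC32 hash only stands in for whatever partitioner a real producer uses.

```python
import zlib


class Topic:
    """Toy topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, name, num_partitions=1):
        if num_partitions < 1:
            raise ValueError("a topic needs at least one partition")
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Same key -> same partition, so per-key ordering is preserved.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)
        offset = len(self.partitions[index]) - 1
        return index, offset


class Consumer:
    """Toy consumer: remembers the next offset to read in each partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        records = []
        for p, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[p]:])
            self.offsets[p] = len(log)  # advance past what was just read
        return records


topic = Topic("clicks", num_partitions=2)
first = topic.publish("user-1", "a")
second = topic.publish("user-1", "b")
topic.publish("user-2", "c")

consumer = Consumer(topic)
batch = consumer.poll()
```

Because the consumer tracks its own offsets, a second `poll()` returns only messages published after the first one. Replication across brokers (leader and followers) is left out of this sketch.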
RHadoop:
  • R works with all data in RAM, which restricts its scalability.
  • RHadoop
    • Offering from Revolution Analytics that lets R programs scale by running on Hadoop.
  • Has 3 components
    • rmr2 - MapReduce > Hadoop Streaming > R functions
    • rhbase - HBase Thrift gateway > HBase
    • rhdfs - HDFS
      • Gives R access to HDFS and moves R data frames to and from HDFS
  • Hadoop Streaming
    • Utility that lets MapReduce jobs be written in any programming language.
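
As an illustration of the Hadoop Streaming model, here is a minimal word-count sketch in Python rather than R. Streaming mappers and reducers exchange lines of text, by convention `key<TAB>value`, and the reducer receives its input sorted by key. The function names and the `run_local` helper are hypothetical; `run_local` only imitates the sort/shuffle step in one process so the logic can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["big data", "big pipeline"])
```

In an actual streaming job the mapper and reducer would be separate scripts reading `sys.stdin` and printing to stdout, with Hadoop doing the sorting between them.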
