Friday, December 11, 2015

ML - Course Notes

2 - Linear Regression:
  • Problem 
    • Given a data set plotted on graph (x,y). 
    • Find a hypothesis that represents the behavior of the data
  • Hypothesis 
    • f(x) : represents the behavior of the data
    • f(x) = a + b Xi 
      • a and b control the prediction
      • b is the slope of the line that fits the data; a is its intercept
  • Cost Function 
    • Minimize the Cost Function
      • J(T) = 1/2m * { Sum (1-m) (f(Xi) - Yi) ^2 } : m is the total number of training examples
      • J(T) = 1/2m * { Sum (1-m) ((a + b Xi) - Yi) ^2 }
  • Gradient Descent 
    • Mechanism to find the values of a and b where the cost function is lowest 
    • Incrementally minimizes the multivariate cost function 
    • Can converge to a global or a local minimum 
    • Repeat until convergence
      • Pj := Pj - Alpha * d/dPj (J(P0, P1)) , for j = 0 and j = 1
      • Pj : the parameters a and b  
      • Alpha : step size (learning rate); the steps automatically become smaller near a minimum, so Alpha need not be reduced manually
      • Update P0 and P1 simultaneously 
  • Formula 
    • Hypothesis
      • f(x) = T0 X0 + T1 X1 + T2 X2 + ...  (X0 = 1, the intercept term)
    • Cost Function 
      • J(T) = 1/2m * {Sum (1-m) (f(x) - Y) ^2}
        • Minimize the Cost Function
    • Gradient Descent: 
      • Tj := Tj - Alpha * d/dTj (J(T)) ; 
        • Repeat until convergence
        • Calculate all values of T before assigning them (simultaneous update) 
        • Iterate until the update term is (near) zero; a NumPy sketch follows the terminology list below 
  • Terminology 
    • X - Feature (e.g. number of rooms) 
    • Y - Output / target value (price of the house)
    • T - Parameters over the feature set  
    • f(x) - Hypothesis 
    • J(T) - Cost function 
    • Alpha - Gradient descent step size (learning rate)
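
A minimal NumPy sketch of the pieces above (hypothesis f(x) = a + b*x, cost J(T), batch gradient descent); the function name gradient_descent_linear and the toy data are illustrative, not from the notes:

import numpy as np

def gradient_descent_linear(X, y, alpha=0.05, iters=5000):
    # Batch gradient descent for linear regression.
    # X: (m, n) feature matrix, y: (m,) targets, alpha: learning rate.
    m = len(y)
    Xb = np.c_[np.ones(m), X]            # prepend a column of ones for the intercept a
    theta = np.zeros(Xb.shape[1])        # parameters [a, b, ...]
    for _ in range(iters):
        error = Xb @ theta - y           # f(x) - y for every example
        theta -= alpha * (Xb.T @ error) / m   # simultaneous update of all parameters
    cost = ((Xb @ theta - y) ** 2).sum() / (2 * m)   # J(T) = 1/(2m) * sum((f(x) - y)^2)
    return theta, cost

# Toy usage: data generated by y = 2 + 3x, so theta should approach [2, 3].
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.0, 8.0, 11.0, 14.0])
print(gradient_descent_linear(X, y))
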
6 - Logistic Regression 
  • Definition 
    • Classification problem, also known as stochastic regression 
    • The hypothesis gives the probability, within a range, of a value belonging to a class 
    • Draws a decision boundary to classify a set of data of any shape: line, circle, oval, or any asymmetric shape 
  • Problem 
    • A data set with a class label for each data point 
  • Formula 
    • Hypothesis
      • f(x) = 1 / (1 + e^ - (T ' x) ) 
      • Cost(f(x), y) = - log (f(x)) for y = 1
      • Cost(f(x), y) = - log (1 - f(x)) for y = 0
    • Cost Function 
      • J(T) = - (y log (f(x)) + (1 - y) log (1 - f(x))) for a single example
      • J(T) = - 1/m * Sum (1-m) {Y log (f(x)) + (1 - Y) log (1 - f(x))}  
        • Minimize the Cost
    • Gradient Descent: 
      • T := T - Alpha * d/dT (J(T)) ; 
        • Repeat until convergence
        • Calculate all values before assigning them (simultaneous update) 
        • Iterate until the update term is (near) zero; see the sketch below
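
A small NumPy sketch of the logistic-regression formulas above (sigmoid hypothesis, cost J(T), gradient descent); the helper names and toy data are illustrative:

import numpy as np

def sigmoid(z):
    # f(x) = 1 / (1 + e^(-T'x))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_and_grad(theta, X, y):
    # Cost J(T) and gradient for logistic regression (no regularization).
    m = len(y)
    h = sigmoid(X @ theta)                                   # hypothesis values
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m    # J(T)
    grad = X.T @ (h - y) / m                                 # d/dT J(T)
    return cost, grad

def train_logistic(X, y, alpha=0.1, iters=2000):
    Xb = np.c_[np.ones(len(y)), X]     # intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        _, grad = logistic_cost_and_grad(theta, Xb, y)
        theta -= alpha * grad          # repeat until convergence
    return theta

# Toy usage: 1-D points, class 1 for x > 2.5.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
theta = train_logistic(X, y)
print(sigmoid(np.c_[np.ones(4), X] @ theta))   # predicted probabilities rise with x
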
7 - Regularization:
  • Type of Irregularity 
    • Under-fitting / High Bias: the learned hypothesis is too general and does not fit the given training set well
    • Over-fitting / High Variance: the learned hypothesis fits the training set very well but fails to generalize to new data points.  
  • Mechanics:
    • Reduce the number of features 
      • Manually reduce the number of features
      • Model selection algorithm 
    • Regularize 
      • Keep all features but minimize the magnitude of their parameter values 
      • Works well when there are many features, each contributing a little to y 
    • Regularization Factor 
      • Increase Lambda to decrease the influence of higher-order features / polynomial terms
      • Decrease Lambda to increase their influence
  • Formula 
    • Cost Function: 
      • J(T) = 1/2m * { Sum (1-m) (f(x) - Y) ^2 + Lambda * Sum (1-n) Tj^2 }  (linear regression)
      • J(T) = - 1/m * Sum (1-m) {Y log (f(x)) + (1 - Y) log (1 - f(x))} + Lambda/2m * Sum (1-n) Tj^2  (logistic regression)
        • Minimize cost function
    • Gradient Descent: 
      • Tj := Tj - Alpha * { 1/m * Sum (1-m) (f(x) - Y) Xj + Lambda/m * Tj }
        • Repeat until convergence; see the sketch below 
  • Terminology 
    • Lambda - Regularization Factor 
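
A sketch of the regularized linear-regression cost and gradient from the formulas above, assuming the usual convention that the intercept parameter T0 is not regularized; the function name is illustrative:

import numpy as np

def regularized_cost_and_grad(theta, X, y, lam):
    # Regularized linear-regression cost J(T) and its gradient.
    # X already contains a leading column of ones; theta[0] is the intercept.
    m = len(y)
    error = X @ theta - y
    reg = theta.copy()
    reg[0] = 0.0                                     # do not regularize the intercept
    cost = (error @ error) / (2 * m) + lam * (reg @ reg) / (2 * m)
    grad = X.T @ error / m + lam * reg / m           # used in T := T - Alpha * grad
    return cost, grad

The same Lambda/m * T term is added to the logistic-regression gradient when that cost is regularized.
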
8 - Neural Networks: 
Representation 

  • Non-Linear Classification: 
    • Feature sets can be very large - e.g. when every pixel of an image is a feature, how do we compute a hypothesis?
    • Neuron Model: processing unit with logic 
    • Uses different activations/weights, combinations of features and polynomial degrees
  • Mechanics:
    • Each Algorithm is designed for a certain purpose
    • Objective - to get the right combination of values and their influence (Theta) 
    • Mechanism : Define Feature functions
      • Evaluate (Function values)
      • Analyze (Lowest - Cost Function)
      • Decide (Optimize - Gradient Descent)
  • Neural Network for CV :
    • NN Algorithm Objective
      • To identify a certain type of object in an image – e.g. a Coke bottle, based on one feature such as color intensity
    • Approach 
      • Input Images with Outputs of their Class (X,Y) 
      • Select a Feature > Get its Feature Vector for each Pixel (X)
        • Intermediate levels - break the function down into progressively finer aspects 
      • Develop a function that computes the output (Y) from the feature vector and the parameters
    • Input Data Set (X)
      • X – Feature Vector – Feature Value in Vector for Each Pixel
        • Image with Coke bottles and Non-Coke bottles
        • Feature vector: colors, dimensions, positions, sizes (relative & absolute)
        • Here it is done for one feature vector; the same is to be done for the other features in the set
    • Output Data Set (Y)
      • Y – No of Output Classes
        • Binary Class (Coke bottles - Yes / No)
        • Multi Class (Bottles - Coke, Pepsi, ..)
    • Algorithm
      • Functions
        • f(X) = Theta Transpose X
        • Z2 = Theta1 * a1 ; a2 = g(Z2), computed from the previous level's activations
        • Each level has a weight matrix Theta (activation factors) to control the participation of each unit
      • At each level, the function calculates the feature-value contribution of the previous level in order to reduce the error (improve the accuracy) of the function
        • Hidden units break the function down further so it becomes finer on the features
      • Level Propagation
        • L1 – works on the larger image, identifying lines, shades, etc.
        • L2 – breaks down the output of the previous layer, working with smaller parts of the image to identify finer features
    • Strategy
      • Collapse & Construct:
        • Collapse the whole image at each level to find finer features
        • Reconstruct the image using identified features
      • Deep Learning
        • For deeper features (pixel-level feature deduction), deeper algorithms (more layers) are designed 
  • Terminology
    • Activation Function – function at a hidden unit, with a factor controlling the participation of each node of a layer
    • Error Calculation – output value of the function minus Y
Learning: 
  • Formula 
    • Hypothesis
      • f(x) = T1 X1 + T2 X2 + T3 X3
    • Cost Function 
      • J(T) = - 1/m * Sum (1-m) Sum (1-K) {Y log (f(x)) + (1 - Y) log (1 - f(x))} + Lambda/2m * Sum Theta^2
        • Minimize the Cost
    • Gradient Descend: 
      • T := T - Alpha * d/dT (J(T)) ; 
        • Calculate all values before assigning them (simultaneous update) 
        • Iterate until the update term is (near) zero 
  • Algorithms 
    • Propagation - evaluating the cost function and gradient-based optimization along each path taken
    • Forward Propagation – calculating the function values forward, from left to right (see the sketch below)
    • Backward Propagation – calculating the error terms backward, from right to left
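
A small NumPy sketch of forward propagation through one hidden layer, matching the a2 = g(Z2) relation above; the layer sizes and random weights are purely illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, Theta1, Theta2):
    # Theta1: (hidden, n+1) weights from input to hidden layer.
    # Theta2: (K, hidden+1) weights from hidden layer to the K outputs.
    a1 = np.r_[1.0, x]                # input layer plus bias unit
    z2 = Theta1 @ a1
    a2 = np.r_[1.0, sigmoid(z2)]      # hidden activations a2 = g(z2), plus bias
    z3 = Theta2 @ a2
    return sigmoid(z3)                # output layer f(x): one value per class

# Toy usage: 2 inputs, 3 hidden units, 2 output classes with small random weights.
rng = np.random.default_rng(0)
Theta1 = rng.normal(scale=0.1, size=(3, 3))
Theta2 = rng.normal(scale=0.1, size=(2, 4))
print(forward_propagation(np.array([0.5, -1.2]), Theta1, Theta2))

Backward propagation then computes the error terms layer by layer from right to left and uses them to form the gradient of the cost J(T).
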
10 - Tuning Algorithm: 
  • Measures (the problem each remedy addresses) 
    • More Training Set 
      • Over-fitting 
    • Decreasing Feature Set 
      • Over-fitting  
    • Increasing Feature Set 
      • Under-fitting 
    • Decreasing Polynomial 
      • Over-fitting 
    • Increasing Polynomial  
      • Under-fitting 
    • Decreasing Lambda (Regularization) 
      • Under-fitting 
    • Increasing Lambda  (Regularization) 
      • Over-fitting 
  • Learning Algorithm Type 
    • Over-fitting (High Variance)
    • Under-fitting (High Bias)
  • Approach for Training Data Set Division
    • Training and Test Data : 70/30
    • Training, Cross Validation and Test Data: 60/20/20  
  • Algorithm
    • Calculate Cost Functions 
    • Plotting the error (cost function) against: 
      • Polynomial (x + x^2 + x^3) - Increasing / Decreasing 
      • Training data Set - Increasing / Decreasing 
      • Lambda - Increasing / Decreasing 
    • Learning Curve 
      • Error vs training set size (see the sketch below) 
  • Neural network 
    • Smaller network - prone to under-fitting - computationally cheaper
    • Deeper network - prone to over-fitting - computationally more expensive 
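
A sketch of how a learning curve can be computed, reusing whatever training and cost routines are being diagnosed; train_fn and cost_fn are placeholder names for those routines (e.g. the linear-regression functions sketched earlier):

import numpy as np

def learning_curve(X, y, Xval, yval, train_fn, cost_fn):
    # train_fn(X, y) -> theta ; cost_fn(theta, X, y) -> error.
    train_err, val_err = [], []
    for i in range(1, len(y) + 1):
        theta = train_fn(X[:i], y[:i])                  # fit on the first i examples
        train_err.append(cost_fn(theta, X[:i], y[:i]))  # training error
        val_err.append(cost_fn(theta, Xval, yval))      # cross-validation error
    return np.array(train_err), np.array(val_err)

# High bias: both curves flatten out at a high error, close together.
# High variance: a persistent gap between low training error and high validation error.
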
11 - System Design: 
  • Approach 
    • Define Feature Set 
    • Create Feature Vector 
    • Train with Data Set  
  • Error Metrics
    • Precision = True Positive / Predicted Positive  
    • Recall = True Positive / Actual Positive  
    • Threshold for qualifying a prediction as positive 
    • To predict with high confidence > raise the threshold > high precision and low recall 
    • To predict with high coverage > lower the threshold > low precision and high recall (see the sketch after this section) 
  • Amount Of Data 
    • Use many parameters > Jtrain is small > low bias
    • Use a very large data set > Jtrain ≈ Jtest > low variance
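
A sketch of the precision/recall calculation for a given threshold; the function name and toy arrays are illustrative:

import numpy as np

def precision_recall(probabilities, y_true, threshold=0.5):
    pred = (probabilities >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))        # true positives
    predicted_pos = np.sum(pred == 1)
    actual_pos = np.sum(y_true == 1)
    precision = tp / predicted_pos if predicted_pos else 0.0   # TP / predicted positive
    recall = tp / actual_pos if actual_pos else 0.0            # TP / actual positive
    return precision, recall

# Raising the threshold makes the classifier more conservative:
# precision tends to rise while recall falls, and vice versa.
probs = np.array([0.9, 0.8, 0.6, 0.4, 0.2])
y = np.array([1, 1, 0, 1, 0])
print(precision_recall(probs, y, threshold=0.5))
print(precision_recall(probs, y, threshold=0.75))
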
12 - Support Vector Machine 

13 - Unsupervised Learning: 
  • Definition 
    • Ability of an algorithm to find clear patterns (clusters) based on distance and density   
    • Clustering Algorithm Approach 
  • K-Means Algorithm 
    • Optimization Objective 
      • Minimize the mean squared distance of assigned data points from their centroid
    • Algorithm
      • Select K random centroids in the space 
      • Assign each point X to its closest centroid 
      • Calculate the mean of the points assigned to each centroid 
      • This mean becomes the new centroid
      • Repeat
    • Terms 
      • K - number of clusters; the randomly selected cluster centroid points in space 
      • Mu - cluster centroid (the mean of its assigned data points) 
      • Ci - index of the centroid closest to Xi
      • Mu k - average/mean of the points assigned to cluster k 
    • Cost Function (distortion) 
      • J = 1/m * Sum (1-m) || Xi - Mu(Ci) ||^2  (see the sketch below)
    • Selecting K Value 
      • Elbow Method - K vs Cost Function
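
A plain NumPy sketch of the K-Means loop above (random centroids, assignment, mean update, distortion cost); the function name and the fixed iteration count are illustrative:

import numpy as np

def k_means(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]         # K random initial centroids
    for _ in range(iters):
        # Ci = index of the centroid closest to Xi
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Mu k = mean of the points assigned to cluster k (keep the old centroid if empty)
        mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                       for j in range(k)])
    cost = np.mean(np.sum((X - mu[c]) ** 2, axis=1))          # distortion J
    return c, mu, cost

# Elbow method: run k_means for several values of k and plot k against the returned cost.
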
14 - Dimensionality Reduction:
  • Definition 
    • To reduce Features Set (not vector)
  • Approach  
    • Project the data onto the remaining dimensions with the objective of minimizing the variance (projection error) lost by removing a dimension 
    • Continue until the objective is met
  • Algorithm - PCA (Principal Component Analysis) 
  • Note 
    • PCA minimizes the perpendicular (orthogonal) distance to the projection direction, whereas linear regression minimizes the vertical distance along the output axis (see the sketch below)
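
A minimal PCA sketch (mean normalization, covariance matrix, SVD, projection); the helper name is illustrative:

import numpy as np

def pca_reduce(X, k):
    mean = X.mean(axis=0)
    X_norm = X - mean                               # mean normalization
    sigma = X_norm.T @ X_norm / len(X)              # covariance matrix
    U, S, _ = np.linalg.svd(sigma)                  # principal directions
    Z = X_norm @ U[:, :k]                           # data expressed in k dimensions
    X_approx = Z @ U[:, :k].T + mean                # reconstruction back in the original space
    retained = S[:k].sum() / S.sum()                # fraction of variance retained
    return Z, X_approx, retained

Choosing k so that the retained fraction stays close to 1 keeps most of the variance while reducing the feature set.
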
15 - Anomaly Detection:  
  • Definition 
    • To find outliers in a data set
    • Use Case: 
      • Malfunctioning of Critical Systems (Aircraft) 
      • Fraud detection in public systems (banks, etc.) 
      • Error Condition Detection (Computer Hardware, Data center) 
  • Algorithm 
    • Choose a Feature Set which could be indicative of Anomaly 
    • Calculate the Gaussian density - describes the central tendency of each feature 
      • Mean = 1/m * Sum (1-m) X 
      • Variance = 1/m * Sum (1-m) (X - Mean) ^ 2
      • Standard Deviation = sqrt(Variance) 
      • Gaussian density P(X; Mean, Variance)
      • = Prod (1-n) P(Xj; Mean j, Variance j) 
    • If P(X) < epsilon (a chosen threshold), flag X as an anomaly (see the sketch below) 
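
A sketch of the Gaussian anomaly-detection steps above (fit a mean and variance per feature, take the product of the per-feature densities, compare to epsilon); the function names are illustrative:

import numpy as np

def fit_gaussian(X):
    mu = X.mean(axis=0)        # Mean = 1/m * Sum(X), per feature
    var = X.var(axis=0)        # Variance = 1/m * Sum((X - Mean)^2), per feature
    return mu, var

def gaussian_density(X, mu, var):
    # P(X) = product over features of the univariate Gaussian densities.
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)

def flag_anomalies(X, mu, var, epsilon):
    return gaussian_density(X, mu, var) < epsilon    # P(X) < epsilon -> anomaly
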

Appendix:

NLP vs NN CV:
  • ML Algorithm Objective
    • The same objective applies to NLP, but the feature variations are defined differently
    • The NLP case here used an entropy-based decision tree algorithm
  • Input
    • Labeled Data
  • Feature Function for NLP
    • Semantic position of different words within a sentence
    • Sentiment algorithms define negative word positions with reference to other constructs of the sentence 
Naive Bayes Algorithm:
  • Classifier Algorithm - 
    • Fast Convergence 
    • Assumes features are independent
    • Document Classification
  • Based on Posterior Probability Formulation
    • Based on prior probability 
    • Ability to classify a new data point by the probability of it falling in each class's territory 
  • Process
    • Calculate the frequency of each term in the document 
    • Calculate the probability of each term belonging to a class 
    • Calculate the Naive Bayes equation for each class 
  • NLP Apply
    • P(Class | Words in Document) = P(Words | Class) * P(Class) / P(Words) - the likelihood of the words given the class, times the prior probability of the class, divided by the prior probability of the words (see the sketch below)
  • Ref 
    • http://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
    • http://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification
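
A small sketch of the Naive Bayes steps above for document classification (term counts per class, class priors, argmax over classes), with Laplace smoothing added so unseen words do not zero out the product; the function names and toy documents are illustrative:

import math
from collections import Counter

def train_naive_bayes(docs, labels):
    # Term frequencies per class and class priors.
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for words, c in zip(docs, labels):
        counts[c].update(words)
    vocab = {w for words in docs for w in words}
    return priors, counts, vocab

def classify(words, priors, counts, vocab):
    # argmax over classes of log P(class) + sum log P(word | class);
    # P(words) is the same for every class, so it can be dropped.
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(counts[c].values())
        score = math.log(prior)
        for w in words:
            score += math.log((counts[c][w] + 1) / (total + len(vocab)))  # Laplace smoothing
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["free", "offer", "click"], ["meeting", "at", "noon"]]
labels = ["spam", "ham"]
priors, counts, vocab = train_naive_bayes(docs, labels)
print(classify(["free", "click"], priors, counts, vocab))   # -> "spam"
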
Decision Tree:
  • Entropy 
    • In a decision tree, the significance of a feature is evaluated by a probability-based entropy calculation at that feature's node.
    • To build decision logic at each node of a tree 
  • Information Gain : 
    • Decision tree > ability to decide which feature to split on so that the decision leads to a class (see the sketch below) 
  • Random Forest 
    • Randomly sample the dataset to build multiple trees and then ensemble them. It uses the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
    • http://www.slideshare.net/DerekKane/data-science-v-decision-tree-random-forests  
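
A sketch of the entropy and information-gain calculations a decision tree uses to pick the feature to split on; the function names and toy labels are illustrative:

import math
from collections import Counter

def entropy(labels):
    # H = -sum p * log2(p) over the class proportions at a node.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, feature_values):
    # Parent entropy minus the weighted entropy of the children after the split.
    total = len(labels)
    children = 0.0
    for v in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == v]
        children += len(subset) / total * entropy(subset)
    return entropy(labels) - children

# A feature that perfectly separates the classes gives the maximum gain (1.0 bit here).
print(information_gain(["yes", "yes", "no", "no"], ["a", "a", "b", "b"]))
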
Ensemble: 
  • Boosting vs Bagging - Ensemble
    • Both are ensemble techniques that combine weak learners into a stronger combined learner, reducing variance.
    • Bagging - using bootstrapped data sets (different combinations drawn with replacement) to train multiple models, and a voting method to select the output (see the sketch below). 
    • Boosting - using the whole/original data set for every model, trained sequentially with extra weight given to the examples earlier models got wrong, then combining them with a weighted average. 
    • Stacking - (like boosting) applying another model to learn the weights of each model  
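
A sketch of bagging as described above: each model is trained on a bootstrap sample and the outputs are combined by majority vote. The base_learner factory is a placeholder for whatever classifier is being bagged, and integer class labels are assumed:

import numpy as np

def bagging_predict(X_train, y_train, X_test, base_learner, n_models=10, seed=0):
    # base_learner(X, y) must return an object with a .predict(X) method.
    rng = np.random.default_rng(seed)
    m = len(y_train)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, m, size=m)              # bootstrap sample, drawn with replacement
        model = base_learner(X_train[idx], y_train[idx])
        votes.append(model.predict(X_test))
    votes = np.array(votes)                           # shape (n_models, n_test)
    # Majority vote across the models for every test example (integer labels assumed).
    return np.array([np.bincount(col).argmax() for col in votes.T])
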
Terminology: 
  • Terminology 
    • Decision Factor / Attribute 
    • Features Set - set of all the features (X with a subscript indexes the feature)
    • Feature Vector - all feature values for one example as a vector (X with a superscript indexes the example)
    • Feature Scaling - Mean Normalization - Normalize Value around Mean of Data Set
  • Gaussian Model 
    • Weight Initialization in Neural Network -
