### ML - Course Notes

**2 - Linear Regression:**

- Problem
  - Given a data set of points (x, y) plotted on a graph.
  - Find a hypothesis that represents the behavior of the data.
- Hypothesis
  - f(x): represents the data's behavior
  - f(x) = a + b * x_i
  - a and b control the prediction: the intercept and the slope of the line that fits the data
- Cost Function
  - Minimize the cost function
  - J(T) = 1/(2m) * Sum_{i=1..m} (y_i - f(x_i))^2, where m is the total number of data points
  - J(T) = 1/(2m) * Sum_{i=1..m} (y_i - (a + b * x_i))^2
- Gradient Descent
  - Mechanism to find the values of a and b for which the cost function is lowest
  - Incrementally minimizes the multivariate cost function
  - Can converge to a global or a local minimum
  - Repeat until convergence:
    - T_j := T_j - Alpha * d/dT_j J(T_0, T_1)
  - T_j: the parameters (a and b)
  - Alpha: the learning rate (step size); the steps shrink automatically near the minimum because the gradient shrinks, so Alpha need not be decreased manually
  - Update T_0 and T_1 simultaneously
- Formula
  - Hypothesis
    - f(x) = T_1 * x_1 + T_2 * x_2
  - Cost Function
    - J(T) = 1/(2m) * Sum_{i=1..m} (f(x_i) - y_i)^2
    - Minimize the cost function
  - Gradient Descent:
    - T_j := T_j - Alpha * d/dT_j J(T)
    - Repeat until convergence
    - Calculate all values of T before assigning them (simultaneous update)
    - Iterate until the update term reaches zero (convergence)
- Terminology
  - X: feature set (e.g., number of rooms)
  - Y: output/target value (e.g., price of the house)
  - T: parameters over the feature set
  - f(x): hypothesis
  - J(T): cost function
  - Alpha: learning rate (gradient descent step size)
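
As a concrete illustration of the update rule above, here is a minimal NumPy sketch of batch gradient descent for linear regression. The function name and the `alpha`/`n_iters` values are illustrative choices, not part of the course material.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=5000):
    """Fit f(x) = T_0 + T_1*x by minimizing J(T) = 1/(2m) * Sum (f(x_i) - y_i)^2."""
    m = len(y)
    Xb = np.c_[np.ones(m), X]              # prepend a column of 1s for the intercept T_0
    T = np.zeros(Xb.shape[1])              # initialize parameters at zero
    for _ in range(n_iters):
        grad = (Xb.T @ (Xb @ T - y)) / m   # d/dT J(T), for all parameters at once
        T = T - alpha * grad               # simultaneous update
    return T

# Usage: noisy points around y = 2 + 3x should recover T approximately [2, 3]
X = np.linspace(0, 10, 50)
y = 2 + 3 * X + np.random.randn(50) * 0.5
print(gradient_descent(X, y))
```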

**6 - Logistic Regression:**

- Definition
  - A classification algorithm (despite the name, it is used for classification, not regression)
  - Gives the probability, between 0 and 1, that a value belongs to a class
  - Draws a decision boundary to classify a set of data of any shape: a line, circle, oval, or any asymmetric shape
- Problem
  - A data set with a class label for each data point
- Formula
  - Hypothesis
    - f(x) = 1 / (1 + e^(-T'x))  (the sigmoid function)
  - Cost Function
    - Cost = -log(f(x)) for y = 1
    - Cost = -log(1 - f(x)) for y = 0
    - J(T) = -1/m * Sum_{i=1..m} [y_i * log(f(x_i)) + (1 - y_i) * log(1 - f(x_i))]
    - Minimize the cost
  - Gradient Descent:
    - T := T - Alpha * d/dT J(T)
    - Repeat until convergence
    - Calculate all parameter values before assigning them (simultaneous update)
    - Iterate until the update term reaches zero (convergence)
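
A minimal sketch of the same idea for logistic regression, assuming NumPy; `sigmoid` and `fit_logistic` are illustrative names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=2000):
    """Minimize J(T) = -1/m * Sum [y*log(f) + (1-y)*log(1-f)]."""
    m = len(y)
    Xb = np.c_[np.ones(m), X]                 # add intercept term
    T = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        f = sigmoid(Xb @ T)                   # hypothesis f(x) = 1/(1 + e^(-T'x))
        T -= alpha * (Xb.T @ (f - y)) / m     # gradient of the log-loss
    return T
```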

**7 - Regularization:**

- Types of Fitting Problems
  - Under-fitting / High Bias: the learned hypothesis is too general and does not fit the training set well
  - Over-fitting / High Variance: the learned hypothesis fits the training set very well but fails to generalize to new data points
- Mechanics:
  - Reduce the number of features
    - Manually reduce the number of features
    - Use a model selection algorithm
  - Regularize
    - Keep all features but minimize the magnitude of their parameter values
    - Works well when there are many features, each contributing a little to y
  - Regularization Factor (Lambda)
    - Increase it to decrease the influence of the higher-order features/polynomial terms
    - Decrease it to increase their influence
- Formula
  - Cost Function (regularized linear regression):
    - J(T) = 1/(2m) * [Sum_{i=1..m} (f(x_i) - y_i)^2 + Lambda * Sum_{j=1..n} T_j^2]
  - Cost Function (regularized logistic regression):
    - J(T) = -1/m * Sum_{i=1..m} [y_i * log(f(x_i)) + (1 - y_i) * log(1 - f(x_i))] + Lambda/(2m) * Sum_{j=1..n} T_j^2
  - Minimize the cost function
  - Gradient Descent:
    - T_j := T_j - Alpha * [1/m * Sum_{i=1..m} (f(x_i) - y_i) * x_ij + Lambda/m * T_j]  (the bias term T_0 is not regularized)
    - Repeat until convergence
- Terminology
  - Lambda: regularization factor
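
To make the regularized cost concrete, here is a minimal sketch for linear regression, assuming NumPy and a pre-built design matrix `Xb` with an intercept column; note the bias term is left out of the penalty.

```python
import numpy as np

def regularized_cost(T, Xb, y, lam):
    """J(T) = 1/(2m) * [Sum (Xb@T - y)^2 + lam * Sum T_j^2 for j >= 1]."""
    m = len(y)
    errors = Xb @ T - y
    penalty = lam * np.sum(T[1:] ** 2)   # do not penalize the bias term T[0]
    return (errors @ errors + penalty) / (2 * m)
```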

**8 - Neural Networks:**

__Representation:__

- Non-Linear Classification:
  - The feature set can become very large; e.g., how to compute when every pixel of an image is a feature of some type
- Neuron Model: a processing unit with logic
  - Using different activations/weights, it combines the features with different degrees of influence
- Mechanics:
  - Each algorithm is designed for a certain purpose
  - Objective: find the right combination of values and their influence (Theta)
  - Mechanism: define the feature functions, then
    - Evaluate (compute the function values)
    - Analyze (measure the error via the cost function)
    - Decide (optimize via gradient descent)
- Neural Network for CV:
  - NN Algorithm Objective
    - To identify a certain type of object in an image (e.g., a Coke bottle) based on one feature, such as color intensity
  - Approach
    - Input images with the output class of each (X, Y)
    - Select a feature and get its feature vector for each pixel (X)
    - Intermediate layers break the function down into finer aspects
    - Develop a function that computes the output (Y) from the feature vector and the parameters
  - Input Data Set (X)
    - X: feature vector, i.e., the feature value for each pixel, arranged in a vector
    - Images with Coke bottles and non-Coke bottles
    - Feature vectors: colors, dimensions, positions, sizes (relative and absolute)
    - Here this is done for one feature vector; the same is to be done for the other features in the set
  - Output Data Set (Y)
    - Y: the set of output classes
    - Binary class (Coke bottle: yes/no)
    - Multi-class (bottles: Coke, Pepsi, ...)
  - Algorithm
    - Functions
      - f(X) = Theta' * X (Theta transpose times X)
      - z_2 = Theta_1 * a_1 and a_2 = g(z_2): the next layer's activation computed from the previous layer's output
      - Each layer has parameters (Theta) that control how much each unit participates
      - At each layer, the function computes the feature-value contribution of the previous layer in order to reduce the error (improve the accuracy) of the function
      - Hidden units break the function down further so it becomes finer-grained on the features
    - Layer Propagation
      - L1 works with the full-size image, identifying lines, shades, etc.
      - L2 breaks the previously fed image down, working with smaller parts of it to identify finer features
  - Strategy
    - Collapse & Construct:
      - Collapse the whole image at each layer to find finer features
      - Reconstruct the image using the identified features
  - Deep Learning
    - For deeper features (pixel-level feature deduction), deeper architectures are designed
- Terminology
  - Activation Function: the function at a hidden unit, with a participation factor at each node of a layer
  - Error Calculation: the output value of the function minus Y

__Learning:__

- Formula
  - Hypothesis
    - f(x) = T_1 * x_1 + T_2 * x_2 + T_3 * x_3
  - Cost Function
    - J(T) = -1/m * Sum_{i=1..m} Sum_{k=1..K} [y_k * log(f(x)_k) + (1 - y_k) * log(1 - f(x)_k)] + Lambda/(2m) * Sum T^2
  - Minimize the cost
  - Gradient Descent:
    - T := T - Alpha * d/dT J(T)
    - Calculate all values before assigning them (simultaneous update)
    - Iterate until the update term reaches zero (convergence)
- Algorithms
  - Propagation: evaluating the cost function and its gradient for each path taken through the network
  - Forward Propagation: calculating the function values forward, from left to right
  - Backward Propagation: calculating the error function backward, from right to left
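
A minimal sketch of forward propagation through one hidden layer, assuming NumPy; the weight matrices `Theta1`/`Theta2` and the layer sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Compute the network output from left to right."""
    a1 = np.append(1.0, x)          # input layer, with bias unit
    a2 = sigmoid(Theta1 @ a1)       # hidden layer: a2 = g(Theta1 * a1)
    a2 = np.append(1.0, a2)         # add bias unit to the hidden layer
    a3 = sigmoid(Theta2 @ a2)       # output layer: the hypothesis f(x)
    return a3
```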

**10 - Tuning Algorithm:**

- Measures (each measure and the problem it addresses)
  - More training data > addresses over-fitting
  - Decreasing the feature set > addresses over-fitting
  - Increasing the feature set > addresses under-fitting
  - Decreasing the polynomial degree > addresses over-fitting
  - Increasing the polynomial degree > addresses under-fitting
  - Decreasing Lambda (regularization) > addresses under-fitting
  - Increasing Lambda (regularization) > addresses over-fitting
- Learning Problem Type
  - Over-fitting (High Variance)
  - Under-fitting (High Bias)
- Approach for Dividing the Training Data Set
  - Training and test data: 70/30
  - Training, cross-validation, and test data: 60/20/20
- Algorithm
  - Calculate the cost functions
  - Plot the error (cost function) against:
    - Polynomial degree (x + x^2 + x^3), increasing/decreasing
    - Training set size, increasing/decreasing
    - Lambda, increasing/decreasing
  - Learning Curve
    - Error vs. training set size
- Neural Networks
  - Smaller network: prone to under-fitting; not compute-expensive
  - Deeper network: prone to over-fitting; compute-expensive
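
A minimal sketch of the 60/20/20 split described above, assuming NumPy arrays; the function name and seed are illustrative.

```python
import numpy as np

def split_data(X, y, seed=0):
    """Shuffle, then split 60% train / 20% cross-validation / 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    a, b = int(0.6 * len(y)), int(0.8 * len(y))
    train, cv, test = idx[:a], idx[a:b], idx[b:]
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])
```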

**11 - System Design:**

- Approach
  - Define the feature set
  - Create the feature vectors
  - Train with the data set
- Error Metrics
  - Precision = True Positives / Predicted Positives
  - Recall = True Positives / Actual Positives
  - Threshold for qualifying as a positive result:
    - To predict with high confidence > raise the threshold > high precision and low recall
    - To predict with high coverage > lower the threshold > low precision and high recall
- Amount of Data
  - Use many parameters > J_train is small > low bias
  - Use a very large data set > J_train approaches J_test > low variance
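
A minimal sketch of the two error metrics, assuming NumPy arrays of 0/1 labels; the guard against empty denominators is a simplification.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
    precision = tp / max(np.sum(y_pred == 1), 1)  # TP / predicted positives
    recall = tp / max(np.sum(y_true == 1), 1)     # TP / actual positives
    return precision, recall
```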

**12 - Support Vector Machine:**

- Find the hyperplane with the maximum margin from the closest data points (the support vectors)

**13 - Unsupervised Learning:**

- Definition
  - The ability of a function to find clear patterns in unlabeled data, based on distance and density
- Clustering Algorithm Approach
  - K-Means Algorithm
    - Optimization Objective
      - Minimize the mean squared distance of the assigned data points from their centroids
    - Algorithm
      - Select K random centroids in the space
      - Assign each point x to its closest centroid
      - Calculate the mean of the points assigned to each centroid
      - That mean becomes the new centroid for the cluster
      - Repeat
    - Terms
      - K: the number of randomly initialized cluster centroids in the space
      - c_i: the index of the centroid closest to x_i
      - Mu_k: the average (mean) of the points assigned to cluster k
    - Cost Function
      - J = 1/m * Sum_{i=1..m} ||x_i - Mu_{c_i}||^2
    - Selecting the K Value
      - Elbow Method: plot K vs. the cost function and pick the "elbow"
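
A minimal sketch of the K-means loop described above, assuming NumPy; initialization and the empty-cluster case are simplified for illustration.

```python
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # K random data points
    for _ in range(n_iters):
        # assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # move each centroid to the mean of its assigned points
        # (assumes no cluster goes empty, a simplification)
        centroids = np.array([X[c == k].mean(axis=0) for k in range(K)])
    return centroids, c
```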

**14 - Dimensionality Reduction:**

- Definition
  - To reduce the feature set, i.e., the number of dimensions (not the number of vectors)
- Approach
  - Project the dimension being reduced onto the remaining dimensions, with the objective of minimizing the variance lost in the projection
  - Continue until the objective is met
- Algorithm: PCA (Principal Component Analysis)
- Note
  - PCA minimizes the error perpendicular to the projection direction, whereas linear regression minimizes the error along the y-axis
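
A minimal sketch of PCA via the singular value decomposition, assuming NumPy; `k` is the number of dimensions to keep.

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)             # mean-normalize each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                # project onto the top-k principal components
```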

**15 - Anomaly Detection:**

- Definition
  - To find outliers in a data set
- Use Cases:
  - Malfunction detection in critical systems (aircraft)
  - Fraud detection in public systems (banks)
  - Error-condition detection (computer hardware, data centers)
- Algorithm
  - Choose a feature set that could be indicative of an anomaly
  - Calculate the Gaussian function, which captures the central tendency:
    - Mean: Mu = 1/m * Sum_{i=1..m} x_i
    - Variance: Sigma^2 = 1/m * Sum_{i=1..m} (x_i - Mu)^2
    - Standard deviation: Sigma = sqrt(Variance)
  - Gaussian probability: p(x; Mu, Sigma^2) = Prod_{j=1..n} p(x_j; Mu_j, Sigma_j^2)
  - If p(x) < epsilon (the threshold), flag x as an anomaly
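
A minimal sketch of the Gaussian anomaly detector, assuming NumPy; the `epsilon` value is an illustrative placeholder that would normally be chosen on a cross-validation set.

```python
import numpy as np

def fit_gaussian(X):
    mu = X.mean(axis=0)                 # per-feature mean
    var = X.var(axis=0)                 # per-feature variance
    return mu, var

def is_anomaly(x, mu, var, epsilon=1e-3):
    # p(x) = product over features of the univariate Gaussian density
    p = np.prod(np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var))
    return p < epsilon
```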

__Appendix:__

__NLP vs NN CV:__

- ML Algorithm Objective
  - The same as with NN-based computer vision, but the feature variations are defined for language
  - Uses an entropy-based decision tree algorithm
- Input
  - Labeled data
  - Feature functions for NLP
    - The semantic position of different words within a sentence
    - A sentiment algorithm defines negative word positions with reference to the other constructs of the sentence

__Naive Bayes Algorithm:__

- Classifier algorithm
  - Fast convergence
  - Assumes the features are independent
  - Well suited to document classification
  - Based on the posterior probability formulation
    - Built on prior probabilities
    - Gives the ability to classify a new data point as belonging to a class's territory
- Process
  - Calculate the frequency of each term in the document
  - Calculate the probability of each term belonging to each class
  - Evaluate the Naive Bayes equation for each class
- NLP Application
  - P(document belongs to class | words in document) = P(words in document | class) * P(class) / P(words in document)
- Ref
  - http://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
  - http://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification
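
A minimal sketch of the process above as a multinomial Naive Bayes classifier, using only the Python standard library; tokenization and smoothing are simplified for illustration.

```python
from collections import Counter, defaultdict
import math

def train(docs):
    """docs: list of (list_of_words, class_label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)      # term frequencies per class
    return class_counts, word_counts

def predict(words, class_counts, word_counts):
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        vocab = len(word_counts[label])
        # log P(class) + sum of log P(word | class), with add-one smoothing
        score = math.log(n_docs / total_docs)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total_words + vocab))
        if score > best_score:
            best, best_score = label, score
    return best
```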

__Decision Tree:__

- Entropy
  - In a decision tree, a probability calculation is done at each feature node to evaluate the significance of that feature
  - Used to build the decision logic at each node of the tree
- Information Gain
  - The decision tree's ability to decide which feature to split on in order to lead toward a class
- Random Forest
  - Randomly divide the data set across multiple trees and then ensemble them; the forest uses the mode of the classes (classification) or the mean prediction (regression) of the individual trees
  - http://www.slideshare.net/DerekKane/data-science-v-decision-tree-random-forests
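
A minimal sketch of entropy and information gain for a single split, using only the Python standard library; `labels`, `left`, and `right` are lists of class labels.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
```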

__Ensemble:__

- Boosting vs. Bagging
  - Both are ensemble techniques that combine weak learners into one strong learner, in order to reduce variance
  - Bagging: use bootstrapped data sets (different combinations drawn with replacement) to train multiple models, and use a voting method to select the output
  - Boosting: use the whole/original data set for every model, give more weight to the examples the weaker models got wrong, and combine the models by a weighted average
  - Stacking: (like boosting) apply another model to learn the weight of each base model
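
A minimal sketch of bagging, assuming NumPy; `fit_model` and the callable models it returns are hypothetical stand-ins for any base learner.

```python
import numpy as np

def bag(X, y, fit_model, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(y), size=len(y), replace=True)  # bootstrap sample
        models.append(fit_model(X[idx], y[idx]))             # one model per sample
    return models

def vote(models, x):
    preds = [m(x) for m in models]
    return max(set(preds), key=preds.count)   # majority vote
```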

__Terminology:__

- Decision Factor / Attribute
- Feature Set: the set of all features (X with a subscript index)
- Feature Vector: all the values of a feature in a vector (X with a superscript index)
- Feature Scaling / Mean Normalization: normalize the values around the mean of the data set
- Gaussian Model
- Weight Initialization in Neural Networks
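
A minimal sketch of mean normalization (feature scaling), assuming NumPy; dividing by the standard deviation (rather than the range) is one common choice.

```python
import numpy as np

def mean_normalize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance per feature
```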
