-
Definition of Data:
- Data is information collected, stored, or processed. It is ubiquitous and can be measured or categorized.
-
Data Basics:
- Population: Entire group of interest (e.g., all students).
- Sample: Subset of the population (e.g., students in a lecture).
- Statistical Unit: Individual data point (e.g., one student).
- Variable: Characteristics of units (e.g., name, population size).
- Value: Specific value of a variable (e.g., "123456" for "MatrNr").
-
Data Categories:
- Structured vs. Unstructured:
- Structured: Organized data with a predefined format (e.g., tables).
- Unstructured: No traditional format (e.g., text, images).
- Discrete vs. Continuous:
- Discrete: Countable values (e.g., grades).
- Continuous: Any value within a range (e.g., temperature).
- Levels of Measurement:
- Nominal: Labels without order (e.g., colors).
- Ordinal: Ordered labels (e.g., school grades).
- Interval: Ordered with equal intervals (e.g., Celsius).
- Ratio: Interval with a true zero (e.g., weight).
- Qualitative vs. Quantitative:
- Qualitative: Categorical (e.g., gender).
- Quantitative: Numerical (e.g., height).
-
Primary vs. Secondary Data:
- Primary Data: Collected directly for a specific purpose (e.g., surveys, experiments).
- Secondary Data: Existing data from other sources (e.g., books, journals).
-
Ways to Obtain Data:
- Capturing Data: Collecting through sensors, observations, or experiments.
- Retrieving Data: Accessing from databases, APIs, or open data sources.
- Collecting Data: Scraping from websites or logs when direct access isn't available.
-
Databases:
- Relational Databases: Use SQL for structured data but have limitations with big data.
- NoSQL Databases: Handle unstructured or semi-structured data, offering flexibility and scalability.
- Document-Oriented Databases: Store data in formats like JSON, ideal for e-commerce and IoT.
-
APIs:
- REST APIs enable communication between systems using HTTP methods (GET, POST, PUT, DELETE).
- Often require authentication (e.g., API keys) and provide data in JSON/XML formats.
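A minimal sketch of such a call with Python's requests library; the endpoint URL, API key, and `limit` parameter below are hypothetical placeholders, not a real service:

```python
import requests

# Hypothetical endpoint and key -- substitute a real service before running.
URL = "https://api.example.com/v1/measurements"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# GET retrieves a resource; many APIs answer with JSON.
response = requests.get(URL, headers=HEADERS, params={"limit": 10}, timeout=10)
response.raise_for_status()   # raise on HTTP errors (4xx/5xx)
data = response.json()        # parse the JSON body into Python objects
print(data)
```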
-
Data Scraping:
- Extracting data from websites or logs when APIs aren't available.
- Legal and ethical considerations must be addressed.
-
Data Protection and Anonymization:
- GDPR Compliance: Personal data must be protected; processing requires a legal basis (typically consent).
- Anonymization: Removing personal identifiers to prevent individual identification.
- Pseudonymization: Replacing identifiers with pseudonyms; re-identification is possible only with separately kept additional information.
- Hashing: Converting data to fixed-size values (e.g., SHA-256) for privacy.
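A small sketch of pseudonymization via hashing with Python's hashlib; the salt value and helper name are illustrative:

```python
import hashlib

def pseudonymize(identifier: str, salt: str) -> str:
    """Map a personal identifier to a fixed-size SHA-256 digest (64 hex chars)."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()

# Salting makes dictionary attacks on short identifiers (e.g. matriculation
# numbers) harder; a plain hash of "123456" could simply be looked up.
print(pseudonymize("123456", salt="lecture-demo"))
```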
-
Statistical Basics:
- Descriptive Statistics: Summarizes data (e.g., mean, median).
- Exploratory Data Analysis: Identifies patterns and outliers.
- Inferential Statistics: Draws conclusions about populations from samples.
-
Frequencies and Histograms:
- Frequencies: Count of occurrences of each value.
- Absolute vs. Relative Frequencies: Raw counts vs. proportions.
- Histograms: Visual representation of data distribution across classes.
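A short sketch computing absolute and relative frequencies with collections.Counter; the grade sample is made up:

```python
from collections import Counter

grades = [1, 2, 2, 3, 3, 3, 4, 1, 2, 3]          # made-up school grades

absolute = Counter(grades)                        # raw counts per value
relative = {v: c / len(grades) for v, c in absolute.items()}  # proportions

print(absolute)   # Counter({3: 4, 2: 3, 1: 2, 4: 1})
print(relative)   # relative frequencies sum to 1.0
```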
-
Empirical Distribution Function (EDF):
- Plots the cumulative relative frequencies, showing for each value x the proportion of observations less than or equal to x.
-
Data Visualization:
- Pie Charts: Effective for showing proportions of categorical data.
- Bar Charts: Compare frequencies across categories.
- Histograms: Display distribution of continuous data.
-
Central Tendencies:
- Mode: The most frequently occurring value in a dataset.
- Median: The middle value when data is ordered, dividing the dataset into two equal halves.
- Mean: The average value, calculated by summing all observations and dividing by the number of observations.
-
Statistical Dispersion:
- Range: The difference between the maximum and minimum values.
- Interquartile Range (IQR): The difference between the third quartile (Q3) and first quartile (Q1), representing the middle 50% of the data.
- Variance and Standard Deviation: Measures of spread, with variance being the average squared deviation from the mean and standard deviation the square root of variance.
-
Data Visualization:
- Histograms: Display the distribution of continuous data across classes.
- Box Plots: Show the five-number summary (minimum, Q1, median, Q3, maximum) and identify outliers.
-
Outliers:
- Defined as data points falling outside the range [Q1 − 1.5·IQR, Q3 + 1.5·IQR].
- Can indicate errors, unusual observations, or novel data points.
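A minimal sketch of the 1.5·IQR rule using the standard library; statistics.quantiles uses one of several common quartile conventions, so the boundaries can differ slightly from hand calculations. The 25.0 is an artificial outlier appended to the temperature data used further below:

```python
import statistics

def iqr_outliers(data):
    """Return points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9, 25.0]))  # [25.0]
```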
-
Empirical Variance Calculation:
- Data: Daily temperatures (°C): 11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9
- Mean (x̄): 12.24
- Sum of squared deviations: 14.16
- Empirical Variance (s̃²): ( \tilde{s}^2 = \frac{14.16}{7} \approx 2.02 )
- Empirical Standard Deviation (s̃): ( \tilde{s} = \sqrt{2.02} \approx 1.42 )
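A sketch reproducing this worked example; note that the empirical variance s̃² divides by n (matching statistics.pvariance), not by n−1 as the sample variance does:

```python
temps = [11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9]   # daily temperatures (°C)
n = len(temps)

mean = sum(temps) / n                               # x̄ ≈ 12.24
ssd = sum((x - mean) ** 2 for x in temps)           # squared deviations ≈ 14.16
emp_var = ssd / n                                   # s̃² ≈ 2.02
emp_std = emp_var ** 0.5                            # s̃ ≈ 1.42

print(round(mean, 2), round(ssd, 2), round(emp_var, 2), round(emp_std, 2))
```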
-
Contingency Table Analysis:
- Variables: Growth (shrinking, growing, growing strongly) and Location (North, South)
- Absolute Frequencies:
- North: 29 (growing), 13 (growing strongly), 7 (shrinking)
- South: 13 (growing), 19 (growing strongly), 2 (shrinking)
- Chi-squared (χ²) Test:
- Calculated χ²: 7.53
- Corrected Pearson Contingency Coefficient (K*_P):
- ( K^*_P = \sqrt{\frac{7.53}{7.53 + 83}} \times \sqrt{\frac{2}{1}} \approx 0.41 )
- Interpretation: Weak to medium correlation between location and growth.
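A plain-Python sketch reproducing the χ² statistic and the corrected contingency coefficient from the table above (no libraries needed beyond math):

```python
from math import sqrt

# Observed absolute frequencies from the contingency table above.
table = {
    "North": {"growing": 29, "growing strongly": 13, "shrinking": 7},
    "South": {"growing": 13, "growing strongly": 19, "shrinking": 2},
}
rows, cols = list(table), list(table["North"])
row_sum = {r: sum(table[r].values()) for r in rows}
col_sum = {c: sum(table[r][c] for r in rows) for c in cols}
n = sum(row_sum.values())                           # 83

# chi² = sum over cells of (observed - expected)² / expected,
# with expected = row total * column total / n.
chi2 = sum(
    (table[r][c] - row_sum[r] * col_sum[c] / n) ** 2
    / (row_sum[r] * col_sum[c] / n)
    for r in rows for c in cols
)

k = sqrt(chi2 / (chi2 + n))                         # Pearson coefficient K
m = min(len(rows), len(cols))                       # smaller dimension, here 2
k_corr = k * sqrt(m / (m - 1))                      # corrected K*_P

print(round(chi2, 2), round(k_corr, 2))             # 7.53 0.41
```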
-
Correlation Coefficient:
- Pearson Correlation Coefficient ( r_{XY} ) for population and area:
- Calculated ( r_{XY} ): 0.70
- Interpretation: Strong positive correlation.
-
Key Concepts:
- Empirical Variance: Measures data spread around the mean.
- Contingency Tables: Used for nominal/ordinal data to assess associations.
- Pearson Correlation: Measures linear correlation between metric variables, ranging from -1 to 1.
-
Probability Theory Basics:
- Sample Space (Ω): Set of all possible outcomes.
- Event: Subset of the sample space.
- Probability Axioms (Kolmogorov):
- ( P(A) \geq 0 )
- ( P(\Omega) = 1 )
- Additivity for disjoint events: ( P(A \cup B) = P(A) + P(B) ) if ( A \cap B = \emptyset ).
-
Conditional Probability:
- Definition: ( P(A|B) = \frac{P(A \cap B)}{P(B)} )
- Independence: Events A and B are independent if ( P(A \cap B) = P(A)P(B) ).
-
Bayes' Theorem:
- Formula: ( P(B|A) = \frac{P(A|B)P(B)}{P(A)} )
- Example: ( P(B|A) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.02 \times 0.99} \approx 0.31 )
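A tiny sketch of the computation; reading the numbers as a test with a 90% hit rate, a 2% false-positive rate, and a 1% base rate is an assumption about what the example models:

```python
def bayes(p_a_given_b, p_b, p_a_given_not_b):
    """P(B|A), expanding P(A) via the law of total probability."""
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
    return p_a_given_b * p_b / p_a

print(round(bayes(0.9, 0.01, 0.02), 2))   # 0.31
```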
-
Combinatorics:
- Permutations: ( n! ) for unique items.
- Combinations: ( \binom{n}{k} = \frac{n!}{k!(n-k)!} )
- Multiplication Rule: ( n_1 \times n_2 \times \dots \times n_r )
-
Key Calculations:
- Password Example: ( 7^2 \times P(10,4) = 49 \times 5040 = 246960 )
- Dice Probability: ( P(\text{at least one 6 in 4 throws}) = 1 - (5/6)^4 \approx 0.518 )
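Both calculations checked with the math module (perm and comb require Python 3.8+):

```python
from math import comb, factorial, perm

print(factorial(4))            # permutations of 4 unique items: 24
print(comb(10, 4))             # "10 choose 4": 210

# Password example: 7² symbol choices times P(10, 4) ordered digit picks.
print(7 ** 2 * perm(10, 4))    # 49 * 5040 = 246960

# At least one six in four throws, via the complement rule.
print(round(1 - (5 / 6) ** 4, 3))   # 0.518
```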
-
Random Variables:
- Definition: A function that assigns numerical values to outcomes in a sample space.
- Discrete vs. Continuous:
- Discrete: Countable outcomes (e.g., dice roll).
- Continuous: Uncountable outcomes (e.g., measurement of height).
-
Probability Distributions:
- Discrete:
- Bernoulli: ( f(x) = p^x(1-p)^{1-x} ) for ( x \in \{0,1\} ).
- Binomial: ( f(x) = \binom{n}{x}p^x(1-p)^{n-x} ) for ( x \in \{0,1,\dots,n\} ).
- Uniform: ( f(x) = \frac{1}{m} ) for ( x \in \{1,2,\dots,m\} ).
- Continuous:
- Uniform: ( f(x) = \frac{1}{b-a} ) for ( x \in [a,b] ).
- Normal (Gaussian): ( f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)} ).
-
Expected Value and Variance:
- Discrete:
- Expected Value: ( E(X) = \sum x \cdot f(x) ).
- Variance: ( Var(X) = E((X-E(X))^2) ).
- Continuous:
- Expected Value: ( E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx ).
- Variance: ( Var(X) = E((X-E(X))^2) ).
-
Key Calculations:
- Bernoulli Distribution: ( E(X) = p ), ( Var(X) = p(1-p) ).
- Binomial Distribution: ( E(X) = np ), ( Var(X) = np(1-p) ).
- Uniform Distribution (Discrete): ( E(X) = \frac{m+1}{2} ), ( Var(X) = \frac{m^2-1}{12} ).
- Uniform Distribution (Continuous): ( E(X) = \frac{a+b}{2} ), ( Var(X) = \frac{(b-a)^2}{12} ).
- Normal Distribution: ( E(X) = \mu ), ( Var(X) = \sigma^2 ).
-
Normal Distribution:
- Standard Normal (Z-Score): ( Z = \frac{X-\mu}{\sigma} \sim N(0,1) ).
- 68-95-99.7 Rule: 68% of the data lie within ( \mu \pm \sigma ), 95% within ( \mu \pm 2\sigma ), and 99.7% within ( \mu \pm 3\sigma ).
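A quick check of the rule with scipy.stats (the exact shares are ≈ 68.3%, 95.4%, 99.7%); the observation x = 75 with μ = 70, σ = 4 is a made-up z-score example:

```python
from scipy.stats import norm

# Share of probability mass within mu ± k·sigma for the standard normal.
for k in (1, 2, 3):
    print(k, round(norm.cdf(k) - norm.cdf(-k), 3))   # 0.683, 0.954, 0.997

x, mu, sigma = 75.0, 70.0, 4.0
print((x - mu) / sigma)   # z-score: 1.25 standard deviations above the mean
```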
-
Simple Linear Regression:
- Model: ( Y = \beta_0 + \beta_1 X + \epsilon ), where ( \epsilon \sim N(0, \sigma^2) ).
- Estimation: Parameters ( \beta_0 ) and ( \beta_1 ) are estimated using the least squares method.
-
Least Squares Estimators:
- Slope (β̂₁): [$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} $$]
- Intercept (β̂₀): [$$ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} $$]
-
Residual Analysis:
- Residuals: ( e_i = Y_i - \hat{Y}_i ).
- Residual Plot: Used to check model assumptions (linearity, constant variance, normality).
-
Coefficient of Determination (R²):
- Formula: [$$ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} $$], with RSS the residual sum of squares (defined below) and TSS the total sum of squares.
- Interpretation: Measures goodness of fit (0 ≤ R² ≤ 1).
-
Variance Estimation:
- Residual Sum of Squares (RSS): [$$ RSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 $$]
- Variance Estimator: [$$ \hat{\sigma}^2 = \frac{RSS}{n-2} $$]
-
Key Calculations:
- Example Calculations:
- For given data, compute ( \hat{\beta}_0 ), ( \hat{\beta}_1 ), and ( \hat{\sigma}^2 ).
- Calculate R² to assess model fit (see the sketch below).
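A self-contained sketch of these calculations on a small dataset; the x, y values are invented for illustration:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares estimators for slope and intercept.
b1 = (sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]                       # fitted values
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # residual sum of squares
tss = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares

sigma2_hat = rss / (n - 2)      # variance estimator
r2 = 1 - rss / tss              # coefficient of determination

print(round(b0, 2), round(b1, 2), round(sigma2_hat, 4), round(r2, 3))
# 1.05 0.99 0.0357 0.989
```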
-
Confidence Intervals:
- Point Estimator: Provides a single estimate of a population parameter.
- Interval Estimator: Provides a range of values within which the parameter is expected to lie.
- Formula for Mean (σ known): [$$ \left[ \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right] $$]
- Formula for Mean (σ unknown): [$$ \left[ \bar{X} - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}}, \bar{X} + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}} \right] $$]
-
Statistical Tests:
- Hypothesis Testing: Involves setting up a null hypothesis (H₀) and an alternative hypothesis (H₁), then determining whether to reject H₀ based on sample data.
- Z-Test: Used when the population variance is known.
- Test Statistic: ( Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} )
- T-Test: Used when the population variance is unknown.
- Test Statistic: ( T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} )
- Two-Sample T-Test: Compares the means of two independent groups.
- Test Statistic: ( T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}} )
-
Key Calculations:
- Example 1: 95% confidence interval for bonbon package weights.
- Result: [63.84, 65.08]
- Example 2: Testing machine adjustment with a t-test.
- Result: Null hypothesis not rejected; the machine does not need adjustment.
- Example 3: Two-sample t-test for bonbon weights.
- Result: Reject H₀; the second company's bonbons are heavier.
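A sketch of Examples 1 and 2 in structure only; the weights below are invented, so the printed interval will not reproduce [63.84, 65.08]:

```python
import statistics
from math import sqrt
from scipy import stats

weights = [64.1, 65.3, 63.8, 64.9, 64.4, 65.0, 63.9, 64.6]   # made-up sample
n = len(weights)
x_bar = sum(weights) / n
s = statistics.stdev(weights)           # sample standard deviation (n-1)

# 95% confidence interval for the mean, sigma unknown (t quantile).
t_q = stats.t.ppf(0.975, df=n - 1)
half = t_q * s / sqrt(n)
print(round(x_bar - half, 2), round(x_bar + half, 2))

# One-sample t-test of H0: mu = 64.5 at alpha = 0.05.
t_stat, p_value = stats.ttest_1samp(weights, popmean=64.5)
print(round(t_stat, 3), round(p_value, 3))   # reject H0 iff p_value < alpha
```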
-
Point Estimation:
- A point estimator (e.g., sample mean) estimates the parameter θ from a random sample.
- Example: For a normal distribution, the arithmetic mean estimates the expected value.
-
Confidence Intervals:
- A range ( [g_l, g_u] ) that covers the parameter θ with probability 1−α.
- Types: One-sided (e.g., ( [g_l, \infty) )) and two-sided (finite interval).
-
Statistical Hypothesis Testing:
- Null Hypothesis (H0): Statement to be tested (e.g., μ = 15).
- Alternative Hypothesis (H1): Opposing statement (e.g., μ ≠ 15).
- Test Statistic: A function of the sample data to assess H0 vs. H1.
- Rejection Region: Values of the test statistic leading to rejection of H0.
- Errors:
- Type I Error: Rejecting a true H0 (controlled by significance level α).
- Type II Error: Failing to reject a false H0 (related to test power).
-
Z-Test and T-Test:
- Z-Test: Used when population variance is known or sample size is large.
- T-Test: Used when population variance is unknown; relies on sample standard deviation and t-distribution.
-
Two-Sample T-Test:
- Compares means of two independent groups.
- Assumptions: Normality and independence; the pooled version additionally assumes equal variances (homogeneity).
- Test Statistic: Accounts for sample means, standard deviations, and sizes.
- Degrees of Freedom: Calculated using Welch-Satterthwaite equation for unequal variances.
-
Examples:
- One-Sample Test: Chocolate box weights using z-test (known variance).
- Two-Sample Test: Comparing bonbon weights using t-test with Welch-Satterthwaite adjustment.
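A sketch of the two-sample case with invented weights; equal_var=False selects Welch's t-test, which applies the Welch-Satterthwaite degrees of freedom automatically:

```python
from scipy import stats

sample_x = [10.2, 10.5, 10.1, 10.4, 10.3, 10.6]   # made-up producer A weights
sample_y = [10.8, 11.0, 10.7, 11.1, 10.9, 11.2]   # made-up producer B weights

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(sample_x, sample_y, equal_var=False)
print(round(t_stat, 2), round(p_value, 5))   # small p -> reject H0: mu_X = mu_Y
```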
-
Key Concepts:
- p-value: Probability of observing test statistic under H0; compared to α.
- Confidence Interval and Hypothesis Test Relationship: H₀ is rejected at level α exactly when the hypothesized parameter value lies outside the (1−α) confidence interval.
-
Future Topics: Data preparation and decision trees.
-
Machine Learning Overview:
- Involves data preparation, model building, and evaluation.
- Follows the CRISP-DM process: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
-
Data Preparation:
- Handling Missing Data: Strategies include filtering, marking, or imputing missing values (e.g., mean, median, or model-based imputation).
- Handling False Data: Identify and correct errors through analysis or expert consultation.
- Feature Engineering: Create new features from existing data (e.g., deriving age from birthdate) or combine attributes for better model performance.
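A pandas sketch of these strategies on an invented frame; the reference year 2024 for the derived age is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, None, 31, 28, None],
    "birth_year": [2001, 1998, 1993, 1996, 1990],
})

# Imputing: fill missing ages with the column mean (median works the same way).
df["age_imputed"] = df["age"].fillna(df["age"].mean())

# Marking: keep a flag so a model can learn that the value was missing.
df["age_missing"] = df["age"].isna()

# Feature engineering: derive an approximate age from the birth year.
df["age_from_birth"] = 2024 - df["birth_year"]
print(df)
```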
-
Decision Trees:
- Structure: A tree with nodes representing tests on attributes, leading to leaf nodes with class predictions.
- Construction: Built by recursively splitting data to minimize entropy (disorder). Information gain determines the best split.
- Example: Medicine recommendation based on blood pressure and age.
- Pros and Cons: Easy to interpret but may overfit or require discretization for numerical attributes.
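A minimal sketch of the splitting criterion; the label list and the perfect split are constructed so both printed values come out to 1.0:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (disorder) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction from splitting `labels` into `groups`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

labels = ["A", "A", "B", "B", "B", "A"]        # e.g. recommended medicine
split = [["A", "A", "A"], ["B", "B", "B"]]     # split on some attribute test
print(entropy(labels))                          # 1.0: maximal disorder, 2 classes
print(information_gain(labels, split))          # 1.0: the split removes it all
```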
-
Model Evaluation:
- Train-Test Split: Evaluate models on independent test data to avoid overfitting.
- Metrics for Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE).
- Metrics for Classification: Confusion matrix, accuracy, precision, recall, F1 score.
- Accuracy: Overall correctness.
- Precision: Correct predictions among positive predictions.
- Recall: Correct predictions among actual positive instances.
- F1 Score: Harmonic mean of precision and recall.
-
Advanced Topics:
- Ensemble Methods: Random Forests and Gradient Boosting improve decision trees by reducing variance and bias.
- Model Comparison: Use validation sets to compare models and avoid overfitting.
-
Summary:
- Skills Acquired: Data preparation, decision tree classification, and model evaluation.
- Future Topics: Dashboards and summaries for data visualization and reporting.
-
Model Evaluation Basics:
- Overfitting: When a model fits the training data too well, leading to poor performance on new data.
- Underfitting: When a model is too simple to capture the data's structure, resulting in poor performance on both training and test data.
-
Data Splitting:
- Train-Test Split: Commonly used to evaluate model performance. Typical splits are 80% for training and 20% for testing.
- Train-Validation-Test Split: Used to compare different models by having separate training, validation, and test sets.
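An 80/20 split with scikit-learn on toy data; the fixed random_state just makes the split reproducible:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]      # toy feature matrix, one feature
y = [i % 2 for i in range(100)]    # toy binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))   # 80 20
```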
-
Evaluation Metrics:
- Regression Metrics:
- Mean Squared Error (MSE): Average of squared differences between predicted and actual values.
- Mean Absolute Error (MAE): Average of absolute differences.
- Mean Absolute Percentage Error (MAPE): Average of percentage differences.
- Classification Metrics:
- Confusion Matrix: Tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
- Accuracy: (TP + TN) / (TP + TN + FP + FN).
- Precision: TP / (TP + FP).
- Recall: TP / (TP + FN).
- F1 Score: Harmonic mean of precision and recall, given by ( F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} ).
-
Example Calculations:
- Confusion Matrix Example (TP = 4, FP = 2, FN = 3, TN = 1):
- Accuracy: ( \frac{4 + 1}{10} = 0.5 )
- Precision: ( \frac{4}{6} \approx 0.67 )
- Recall: ( \frac{4}{7} \approx 0.57 )
- F1 Score: ( \frac{2 \times 0.67 \times 0.57}{0.67 + 0.57} \approx 0.62 )
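The same metrics recomputed from the counts implied by the example (TP = 4, FP = 2, FN = 3, TN = 1):

```python
tp, fp, fn, tn = 4, 2, 3, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 0.5
precision = tp / (tp + fp)                           # ≈ 0.67
recall = tp / (tp + fn)                              # ≈ 0.57
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.62

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```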
-
Interpreting Results:
- Baseline Comparison: Always compare model performance against a naive reference (e.g., predicting the majority class).
- Contextual Relevance: Accuracy alone may not indicate practical usefulness; consider domain context and baseline performance.
-
Summary:
- Skills Acquired: Understanding of evaluation metrics and their application in assessing machine learning models.
- Future Topics: Advanced evaluation techniques and model optimization strategies.