![[lecture_02.pdf]]

1. **Definition of Data:**
   - Data is information that is collected, stored, or processed. It is ubiquitous and can be measured or categorized.
2. **Data Basics:**
   - **Population:** The entire group of interest (e.g., all students).
   - **Sample:** A subset of the population (e.g., the students in one lecture).
   - **Statistical Unit:** An individual member of the population (e.g., one student).
   - **Variable:** A characteristic of the units (e.g., name, population size).
   - **Value:** The observed realization of a variable for one unit (e.g., "123456" for "MatrNr").
3. **Data Categories:**
   - **Structured vs. Unstructured:**
     - **Structured:** Organized data with a predefined format (e.g., tables).
     - **Unstructured:** Data without a predefined format (e.g., text, images).
   - **Discrete vs. Continuous:**
     - **Discrete:** Countably many values (e.g., grades).
     - **Continuous:** Any value within a range (e.g., temperature).
   - **Levels of Measurement:**
     - **Nominal:** Labels without order (e.g., colors).
     - **Ordinal:** Ordered labels (e.g., school grades).
     - **Interval:** Ordered with equal intervals but no true zero (e.g., Celsius).
     - **Ratio:** Interval scale with a true zero (e.g., weight).
   - **Qualitative vs. Quantitative:**
     - **Qualitative:** Categorical (e.g., gender).
     - **Quantitative:** Numerical (e.g., height).

![[lecture_03.pdf]]

1. **Primary vs. Secondary Data:**
   - **Primary Data:** Collected directly for a specific purpose (e.g., surveys, experiments).
   - **Secondary Data:** Existing data from other sources (e.g., books, journals).
2. **Ways to Obtain Data:**
   - **Capturing Data:** Collecting through sensors, observations, or experiments.
   - **Retrieving Data:** Accessing databases, APIs, or open data sources.
   - **Collecting Data:** Scraping websites or logs when direct access isn't available.
3. **Databases:**
   - **Relational Databases:** Use SQL for structured data but have limitations with big data.
   - **NoSQL Databases:** Handle unstructured or semi-structured data, offering flexibility and scalability.
   - **Document-Oriented Databases:** Store data in formats like JSON; well suited to e-commerce and IoT.
4. **APIs:**
   - REST APIs enable communication between systems using HTTP methods (GET, POST, PUT, DELETE).
   - They often require authentication (e.g., API keys) and deliver data as JSON or XML (a request sketch follows the lecture 04 notes below).
5. **Data Scraping:**
   - Extracting data from websites or logs when no API is available.
   - Legal and ethical considerations must be addressed.

![[lecture_04.pdf]]

1. **Data Protection and Anonymization:**
   - **GDPR Compliance:** Personal data must be protected, and its use requires consent.
   - **Anonymization:** Removing personal identifiers so that individuals can no longer be identified.
   - **Pseudonymization:** Replacing identifiers with non-identifying stand-ins; re-identification requires additional information.
   - **Hashing:** Converting data to fixed-size values (e.g., SHA-256) for privacy (see the sketch below).
2. **Statistical Basics:**
   - **Descriptive Statistics:** Summarizes data (e.g., mean, median).
   - **Exploratory Data Analysis:** Identifies patterns and outliers.
   - **Inferential Statistics:** Draws conclusions about populations from samples.
3. **Frequencies and Histograms:**
   - **Frequencies:** Counts of occurrences of each value.
   - **Absolute vs. Relative Frequencies:** Raw counts vs. proportions.
   - **Histograms:** Visual representation of a distribution across classes.
4. **Empirical Distribution Function (EDF):**
   - Plots cumulative relative frequencies to show how the data are distributed over their range (see the sketch below).
5. **Data Visualization:**
   - **Pie Charts:** Effective for showing proportions of categorical data.
   - **Bar Charts:** Compare frequencies across categories.
   - **Histograms:** Display the distribution of continuous data.
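A minimal sketch of the REST-API pattern from lecture 03, using Python's `requests` library. The endpoint, key, and query parameters are invented for illustration; only the general GET-with-authentication shape is taken from the notes.

```python
import requests

# Hypothetical endpoint and key, for illustration only -- not a real service.
BASE_URL = "https://api.example.com/v1/measurements"
API_KEY = "your-api-key"

# Typical REST pattern: GET with query parameters, authentication via
# header, JSON response body.
response = requests.get(
    BASE_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"city": "Berlin", "limit": 10},
    timeout=10,
)
response.raise_for_status()  # fail loudly on HTTP errors (4xx/5xx)
data = response.json()       # parse the JSON body into Python objects
```

The other HTTP methods follow the same shape via `requests.post`, `requests.put`, and `requests.delete`.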
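A small sketch of the lecture 04 hashing idea using the standard-library `hashlib`. The salt is an added assumption (the notes only mention SHA-256); it matters because hashing a low-entropy identifier such as a matriculation number is pseudonymization, not anonymization.

```python
import hashlib

# Pseudonymize a matriculation number with SHA-256. Identical inputs
# always map to the same digest, and small input spaces can be
# brute-forced, so an unsalted hash is NOT anonymization.
matrnr = "123456"
salt = "some-secret-salt"  # illustrative; in practice a random, protected value

digest = hashlib.sha256((salt + matrnr).encode("utf-8")).hexdigest()
print(digest)  # fixed-size 64-character hex string
```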
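A short sketch, on made-up grades, of absolute and relative frequencies and the empirical distribution function from lecture 04:

```python
from collections import Counter

# Illustrative sample of grades (invented data).
grades = [1, 2, 2, 3, 3, 3, 4, 5, 2, 3]
n = len(grades)

absolute = Counter(grades)                          # value -> count
relative = {v: c / n for v, c in absolute.items()}  # value -> proportion

def edf(x, sample=grades):
    """EDF at x: share of observations <= x (cumulative relative frequency)."""
    return sum(1 for value in sample if value <= x) / len(sample)

for x in sorted(absolute):
    print(x, absolute[x], relative[x], edf(x))  # EDF rises to 1.0 at the maximum
```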
![[lecture_05.pdf]]

1. **Central Tendencies:**
   - **Mode:** The most frequently occurring value in a dataset.
   - **Median:** The middle value of the ordered data, dividing the dataset into two equal halves.
   - **Mean:** The average, calculated by summing all observations and dividing by their number.
2. **Statistical Dispersion:**
   - **Range:** The difference between the maximum and minimum values.
   - **Interquartile Range (IQR):** The difference between the third quartile (Q3) and the first quartile (Q1), covering the middle 50% of the data.
   - **Variance and Standard Deviation:** Measures of spread; the variance is the average squared deviation from the mean, and the standard deviation is its square root.
3. **Data Visualization:**
   - **Histograms:** Display the distribution of continuous data across classes.
   - **Box Plots:** Show the five-number summary (minimum, Q1, median, Q3, maximum) and flag outliers.
4. **Outliers:**
   - Defined as data points outside $[Q_1 - 1.5\,\text{IQR},\ Q_3 + 1.5\,\text{IQR}]$.
   - Can indicate errors, unusual observations, or novel data points.

![[lecture_06.pdf]]

1. **Empirical Variance Calculation** (verified in the first sketch below):
   - **Data:** Daily temperatures (°C): 11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9
   - **Mean:** $\bar{x} \approx 12.24$
   - **Sum of squared deviations:** $\approx 14.16$
   - **Empirical Variance:** $\tilde{s}^2 = \frac{14.16}{7} \approx 2.02$
   - **Empirical Standard Deviation:** $\tilde{s} = \sqrt{2.02} \approx 1.42$
2. **Contingency Table Analysis:**
   - **Variables:** Growth (shrinking, growing, growing strongly) and Location (North, South)
   - **Absolute Frequencies:**
     - North: 29 (growing), 13 (growing strongly), 7 (shrinking)
     - South: 13 (growing), 19 (growing strongly), 2 (shrinking)
   - **Chi-squared Test:** calculated $\chi^2 = 7.53$, with $n = 83$ observations in total.
   - **Corrected Pearson Contingency Coefficient:**
     - $K^*_P = \sqrt{\frac{7.53}{7.53 + 83}} \cdot \sqrt{\frac{2}{2-1}} \approx 0.41$, where $\sqrt{\frac{\min(r,c)}{\min(r,c)-1}}$ is the correction factor for a table with $r$ rows and $c$ columns (here $\min(r,c) = 2$).
     - Interpretation: weak to moderate association between location and growth.
3. **Correlation Coefficient:**
   - **Pearson Correlation Coefficient** $r_{XY}$ for population and area: calculated $r_{XY} = 0.70$.
   - Interpretation: strong positive correlation.
4. **Key Concepts:**
   - **Empirical Variance:** Measures the spread of the data around the mean.
   - **Contingency Tables:** Used for nominal/ordinal data to assess associations.
   - **Pearson Correlation:** Measures linear correlation between metric variables, ranging from -1 to 1.

![[lecture_07.pdf]]

1. **Probability Theory Basics:**
   - **Sample Space (Ω):** Set of all possible outcomes.
   - **Event:** Subset of the sample space.
   - **Probability Axioms** (Kolmogorov):
     1. $P(A) \geq 0$
     2. $P(\Omega) = 1$
     3. Additivity for disjoint events: $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.
2. **Conditional Probability:**
   - **Definition:** $P(A|B) = \frac{P(A \cap B)}{P(B)}$
   - **Independence:** Events A and B are independent if $P(A \cap B) = P(A)P(B)$.
3. **Bayes' Theorem:**
   - **Formula:** $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$
   - **Example:** $P(B|A) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.02 \times 0.99} \approx 0.31$ (checked in a sketch below).
4. **Combinatorics:**
   - **Permutations:** $n!$ orderings of $n$ distinct items; $P(n,k) = \frac{n!}{(n-k)!}$ ordered selections of $k$ out of $n$.
   - **Combinations:** $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
   - **Multiplication Rule:** $n_1 \times n_2 \times \dots \times n_r$
5. **Key Calculations** (checked in a sketch below):
   - **Password Example:** $7^2 \times P(10,4) = 49 \times 5040 = 246960$
   - **Dice Probability:** $P(\text{at least one 6 in 4 throws}) = 1 - (5/6)^4 \approx 0.518$
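The lecture 06 variance example can be checked in a few lines of Python; the temperatures are the ones from the notes, and the division is by $n$ (empirical variance), not $n-1$:

```python
import math

# Reproduce the lecture 06 worked example: empirical variance and
# standard deviation of seven daily temperatures (deg C).
temps = [11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9]
n = len(temps)

mean = sum(temps) / n                      # x_bar ~ 12.24
ssd = sum((t - mean) ** 2 for t in temps)  # sum of squared deviations ~ 14.16
variance = ssd / n                         # s~^2 = 14.16 / 7 ~ 2.02
std_dev = math.sqrt(variance)              # s~ ~ 1.42

print(round(mean, 2), round(ssd, 2), round(variance, 2), round(std_dev, 2))
```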
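A check of the Bayes example. The notes give only the four numbers, so reading them as a diagnostic-test setting (sensitivity, prior, false-positive rate) is an assumption:

```python
# Bayes' theorem with the lecture 07 numbers.
p_a_given_b = 0.9       # e.g., P(positive test | disease)
p_b = 0.01              # prior P(disease)
p_a_given_not_b = 0.02  # e.g., false-positive rate
p_not_b = 1 - p_b

# Law of total probability gives the denominator P(A).
p_a = p_a_given_b * p_b + p_a_given_not_b * p_not_b

p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 4))  # 0.3125, i.e. ~0.31
```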
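Both key calculations, checked with the standard library. Reading $7^2$ as two positions with 7 choices each is an assumption about the original task; the arithmetic itself matches the notes.

```python
import math

# Password example: assumed reading -- 2 slots with 7 allowed symbols each,
# followed by 4 distinct digits in order: 7^2 * P(10, 4).
passwords = 7 ** 2 * math.perm(10, 4)
print(passwords)  # 49 * 5040 = 246960

# Dice example: P(at least one six in 4 throws) = 1 - (5/6)^4.
p_at_least_one_six = 1 - (5 / 6) ** 4
print(round(p_at_least_one_six, 3))  # 0.518
```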
![[lecture_08_neu.pdf]]

1. **Random Variables:**
   - **Definition:** A function that assigns numerical values to the outcomes in a sample space.
   - **Discrete vs. Continuous:**
     - Discrete: countably many outcomes (e.g., a dice roll).
     - Continuous: uncountably many outcomes (e.g., a height measurement).
2. **Probability Distributions:**
   - **Discrete:**
     - **Bernoulli:** $f(x) = p^x(1-p)^{1-x}$ for $x \in \{0,1\}$.
     - **Binomial:** $f(x) = \binom{n}{x}p^x(1-p)^{n-x}$ for $x \in \{0,1,\dots,n\}$.
     - **Uniform:** $f(x) = \frac{1}{m}$ for $x \in \{1,2,\dots,m\}$.
   - **Continuous:**
     - **Uniform:** $f(x) = \frac{1}{b-a}$ for $x \in [a,b]$.
     - **Normal (Gaussian):** $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)}$.
3. **Expected Value and Variance:**
   - **Discrete:** $E(X) = \sum_x x \cdot f(x)$, $Var(X) = E\big((X-E(X))^2\big)$.
   - **Continuous:** $E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$, $Var(X) = E\big((X-E(X))^2\big)$.
4. **Key Calculations:**
   - **Bernoulli:** $E(X) = p$, $Var(X) = p(1-p)$.
   - **Binomial:** $E(X) = np$, $Var(X) = np(1-p)$.
   - **Uniform (discrete):** $E(X) = \frac{m+1}{2}$, $Var(X) = \frac{m^2-1}{12}$.
   - **Uniform (continuous):** $E(X) = \frac{a+b}{2}$, $Var(X) = \frac{(b-a)^2}{12}$.
   - **Normal:** $E(X) = \mu$, $Var(X) = \sigma^2$.
5. **Normal Distribution:**
   - **Standardization (z-score):** $Z = \frac{X-\mu}{\sigma} \sim N(0,1)$.
   - **68-95-99.7 Rule:** about 68% of the data lie within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$ (checked in a sketch below).

![[lecture_09.pdf]]

1. **Simple Linear Regression:**
   - **Model:** $Y = \beta_0 + \beta_1 X + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.
   - **Estimation:** The parameters $\beta_0$ and $\beta_1$ are estimated with the least squares method.
2. **Least Squares Estimators:**
   - **Slope:**
     $$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
   - **Intercept:**
     $$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
3. **Residual Analysis:**
   - **Residuals:** $e_i = Y_i - \hat{Y}_i$.
   - **Residual Plot:** Used to check the model assumptions (linearity, constant variance, normality).
4. **Coefficient of Determination (R²):**
   - **Formula:**
     $$R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$$
     with ESS the explained, RSS the residual, and TSS the total sum of squares.
   - **Interpretation:** Measures goodness of fit ($0 \leq R^2 \leq 1$).
5. **Variance Estimation:**
   - **Residual Sum of Squares:**
     $$RSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$
   - **Variance Estimator:**
     $$\hat{\sigma}^2 = \frac{RSS}{n-2}$$
6. **Key Calculations:**
   - For given data, compute $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$, and calculate R² to assess the model fit (a worked sketch follows below).
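A quick check of the 68-95-99.7 rule with `scipy.stats.norm`; μ and σ are arbitrary illustrative values, since standardization makes the probabilities parameter-free:

```python
from scipy.stats import norm

# Standardization (z-score) with illustrative parameters.
mu, sigma = 100.0, 15.0
x = 130.0
z = (x - mu) / sigma
print(z)  # 2.0

# P(mu - k*sigma <= X <= mu + k*sigma) via the standard normal CDF.
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(k, round(prob, 4))  # ~0.6827, ~0.9545, ~0.9973
```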
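A sketch of the lecture 09 estimators in NumPy on invented data (the lecture's own dataset is not reproduced in these notes), computing $\hat{\beta}_0$, $\hat{\beta}_1$, $R^2$, and $\hat{\sigma}^2$ exactly as the formulas above prescribe:

```python
import numpy as np

# Invented (x, y) pairs, roughly linear, for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# Least squares estimators.
beta1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Fitted values, residuals, and goodness of fit.
y_hat = beta0 + beta1 * x
residuals = y - y_hat
rss = np.sum(residuals ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - rss / tss
sigma2_hat = rss / (n - 2)         # variance estimator from the notes

print(beta0, beta1, r_squared, sigma2_hat)
```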
![[lecture_10.pdf]]

1. **Confidence Intervals:**
   - **Point Estimator:** Provides a single estimate of a population parameter.
   - **Interval Estimator:** Provides a range of values within which the parameter is expected to lie.
   - **Formula for the Mean (σ known):**
     $$\left[ \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right]$$
   - **Formula for the Mean (σ unknown):**
     $$\left[ \bar{X} - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}} \right]$$
2. **Statistical Tests:**
   - **Hypothesis Testing:** Set up a null hypothesis (H₀) and an alternative hypothesis (H₁), then decide from the sample data whether to reject H₀.
   - **Z-Test:** Used when the population variance is known.
     - **Test Statistic:** $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
   - **T-Test:** Used when the population variance is unknown.
     - **Test Statistic:** $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
   - **Two-Sample T-Test:** Compares the means of two independent groups.
     - **Test Statistic:** $T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}$
3. **Key Calculations:**
   - **Example 1:** 95% confidence interval for bonbon package weights. Result: [63.84, 65.08].
   - **Example 2:** Testing a machine's adjustment with a t-test. Result: H₀ not rejected; the machine does not need adjustment.
   - **Example 3:** Two-sample t-test for bonbon weights. Result: H₀ rejected; the second company's bonbons are heavier.

![[lecture_11.pdf]]

1. **Point Estimation:**
   - A point estimator (e.g., the sample mean) estimates a parameter θ from a random sample.
   - Example: for a normal distribution, the arithmetic mean estimates the expected value.
2. **Confidence Intervals:**
   - A range $[g_l, g_u]$ that covers θ with probability 1−α.
   - Types: one-sided (e.g., $[g_l, \infty)$) and two-sided (finite interval).
3. **Statistical Hypothesis Testing:**
   - **Null Hypothesis (H₀):** The statement to be tested (e.g., μ = 15).
   - **Alternative Hypothesis (H₁):** The opposing statement (e.g., μ ≠ 15).
   - **Test Statistic:** A function of the sample data used to assess H₀ vs. H₁.
   - **Rejection Region:** Values of the test statistic that lead to rejection of H₀.
   - **Errors:**
     - **Type I Error:** Rejecting a true H₀ (controlled by the significance level α).
     - **Type II Error:** Failing to reject a false H₀ (related to the test's power).
4. **Z-Test and T-Test:**
   - **Z-Test:** Used when the population variance is known or the sample size is large.
   - **T-Test:** Used when the population variance is unknown; relies on the sample standard deviation and the t-distribution.
5. **Two-Sample T-Test:**
   - Compares the means of two independent groups.
   - **Assumptions:** Normality, independence, and equal variances (if homogeneity is assumed).
   - **Test Statistic:** Accounts for the sample means, standard deviations, and sizes.
   - **Degrees of Freedom:** Calculated with the Welch-Satterthwaite equation when variances are unequal.
6. **Examples** (sketched in code below):
   - **One-Sample Test:** Chocolate box weights using a z-test (known variance).
   - **Two-Sample Test:** Comparing bonbon weights using a t-test with the Welch-Satterthwaite adjustment.
7. **Key Concepts:**
   - **p-value:** The probability, under H₀, of observing a test statistic at least as extreme as the one obtained; it is compared to α.
   - **Relationship between Confidence Intervals and Tests:** H₀ is rejected at level α exactly when the hypothesized parameter value lies outside the (1−α) confidence interval.
8. **Future Topics:** Data preparation and decision trees.
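A sketch of the t-based interval and the one-sample test from lectures 10/11. The weights and the hypothesized mean $\mu_0 = 64$ are invented, so the numbers will not match the slide results:

```python
import numpy as np
from scipy import stats

# Illustrative package weights (the lecture data is not in the notes).
weights = np.array([64.2, 65.0, 63.8, 64.9, 64.4, 65.1, 63.9, 64.6])
n = len(weights)
alpha = 0.05

# 95% confidence interval for the mean, sigma unknown (t quantile).
mean = weights.mean()
s = weights.std(ddof=1)                        # sample standard deviation S
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t_{n-1, 1-alpha/2}
half_width = t_crit * s / np.sqrt(n)
print((mean - half_width, mean + half_width))

# Matching one-sample t-test of H0: mu = 64 against H1: mu != 64.
t_stat, p_value = stats.ttest_1samp(weights, popmean=64.0)
print(t_stat, p_value)  # reject H0 iff p_value < alpha
```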
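A sketch of the Welch two-sample test, in the spirit of the bonbon comparison; the samples are invented. `equal_var=False` is SciPy's switch for Welch's test with Welch-Satterthwaite degrees of freedom.

```python
import numpy as np
from scipy import stats

# Invented bonbon weights for two companies.
company_a = np.array([64.1, 64.8, 63.9, 64.5, 64.2, 64.7])
company_b = np.array([65.2, 65.9, 66.1, 65.4, 65.8, 66.3])

# One-sided Welch test: H1 is that company A's mean is smaller,
# i.e. company B's bonbons are heavier.
t_stat, p_value = stats.ttest_ind(company_a, company_b,
                                  equal_var=False, alternative="less")
print(t_stat, p_value)  # small p-value -> reject H0
```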
![[lecture_12.pdf]]

1. **Machine Learning Overview:**
   - Involves data preparation, model building, and evaluation.
   - Follows the CRISP-DM process: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
2. **Data Preparation:**
   - **Handling Missing Data:** Strategies include filtering, marking, or imputing missing values (e.g., mean, median, or model-based imputation).
   - **Handling False Data:** Identify and correct errors through analysis or expert consultation.
   - **Feature Engineering:** Create new features from existing data (e.g., deriving age from a birthdate) or combine attributes for better model performance.
3. **Decision Trees:**
   - **Structure:** A tree whose inner nodes test attributes and whose leaf nodes carry class predictions.
   - **Construction:** Built by recursively splitting the data to minimize entropy (disorder); the information gain determines the best split (see the sketch below).
   - **Example:** Medicine recommendation based on blood pressure and age.
   - **Pros and Cons:** Easy to interpret, but may overfit and may require discretization of numerical attributes.
4. **Model Evaluation:**
   - **Train-Test Split:** Evaluate models on independent test data to avoid overfitting.
   - **Metrics for Regression:** Mean Squared Error (MSE), Mean Absolute Error (MAE).
   - **Metrics for Classification:** Confusion matrix, accuracy, precision, recall, F1 score.
     - **Accuracy:** Overall correctness.
     - **Precision:** Share of correct predictions among positive predictions.
     - **Recall:** Share of correctly predicted positives among actual positives.
     - **F1 Score:** Harmonic mean of precision and recall.
5. **Advanced Topics:**
   - **Ensemble Methods:** Random Forests and Gradient Boosting improve on single decision trees by reducing variance and bias.
   - **Model Comparison:** Use validation sets to compare models and avoid overfitting.
6. **Summary:**
   - **Skills Acquired:** Data preparation, decision tree classification, and model evaluation.
   - **Future Topics:** Dashboards and summaries for data visualization and reporting.
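A minimal entropy/information-gain computation for one candidate split. The labels are invented; the comments borrow the lecture's medicine example only as a mnemonic.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Invented labels, e.g. which medicine worked for a patient.
parent = ["A", "A", "A", "A", "B", "B", "B", "B"]
left   = ["A", "A", "A", "B"]   # e.g. blood pressure = low
right  = ["A", "B", "B", "B"]   # e.g. blood pressure = high

# Information gain = parent entropy minus the size-weighted child entropy;
# tree construction picks the attribute with the highest gain.
n = len(parent)
weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
information_gain = entropy(parent) - weighted
print(round(information_gain, 4))
```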
![[lecture_13.pdf]]

1. **Model Evaluation Basics:**
   - **Overfitting:** The model fits the training data too closely and performs poorly on new data.
   - **Underfitting:** The model is too simple to capture the data's structure and performs poorly on both training and test data.
2. **Data Splitting:**
   - **Train-Test Split:** Commonly used to evaluate model performance; a typical split is 80% training and 20% testing.
   - **Train-Validation-Test Split:** Used to compare different models, with separate training, validation, and test sets.
3. **Evaluation Metrics:**
   - **Regression Metrics:**
     - **Mean Squared Error (MSE):** Average of the squared differences between predicted and actual values.
     - **Mean Absolute Error (MAE):** Average of the absolute differences.
     - **Mean Absolute Percentage Error (MAPE):** Average of the percentage differences.
   - **Classification Metrics:**
     - **Confusion Matrix:** Tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
     - **Accuracy:** (TP + TN) / (TP + TN + FP + FN).
     - **Precision:** TP / (TP + FP).
     - **Recall:** TP / (TP + FN).
     - **F1 Score:** Harmonic mean of precision and recall: $F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.
4. **Example Calculations** (reproduced in the first sketch below):
   - **Confusion Matrix Example:**
     - Accuracy: $\frac{4 + 1}{10} = 0.5$
     - Precision: $\frac{4}{6} \approx 0.67$
     - Recall: $\frac{4}{7} \approx 0.57$
     - F1 Score: $\frac{2 \times 0.67 \times 0.57}{0.67 + 0.57} \approx 0.62$
5. **Interpreting Results:**
   - **Baseline Comparison:** Always compare model performance against a naive reference (e.g., predicting the majority class; see the baseline sketch below).
   - **Contextual Relevance:** Accuracy alone may not indicate practical usefulness; consider the domain context and the baseline performance.
6. **Summary:**
   - **Skills Acquired:** Understanding of evaluation metrics and their application in assessing machine learning models.
   - **Future Topics:** Advanced evaluation techniques and model optimization strategies.
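The example metrics follow from the counts TP = 4, FP = 2, FN = 3, TN = 1, which are inferred from the stated ratios $(4+1)/10$, $4/6$, and $4/7$; treating these counts as the slide's actual matrix is an assumption.

```python
# Confusion-matrix counts inferred from the lecture 13 example.
tp, fp, fn, tn = 4, 2, 3, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.5
precision = tp / (tp + fp)                          # 4/6 ~ 0.67
recall = tp / (tp + fn)                             # 4/7 ~ 0.57
f1 = 2 * precision * recall / (precision + recall)  # ~ 0.62

print(accuracy, round(precision, 2), round(recall, 2), round(f1, 2))
```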
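One way to get the naive reference mentioned under "Interpreting Results" is scikit-learn's `DummyClassifier`; the labels below are invented, and the choice of this helper (rather than whatever the lecture used) is an assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels: 70% belong to class 0, so the majority-class
# baseline reaches 0.7 accuracy without looking at any features.
y_train = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
X_train = np.zeros((len(y_train), 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(baseline.score(X_train, y_train))  # 0.7 -- a model should beat this
```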