obsidian/WS2425/Data Science/Cheat Sheet Mockup.md

![[lecture_02.pdf]]


1. **Definition of Data:**
   - Data is information collected, stored, or processed. It is ubiquitous and can be measured or categorized.

2. **Data Basics:**
   - **Basic Population:** Entire group of interest (e.g., all students).
   - **Sample:** Subset of the population (e.g., students in a lecture).
   - **Statistical Unit:** Individual data point (e.g., one student).
   - **Variable:** Characteristics of units (e.g., name, population size).
   - **Value:** Specific value of a variable (e.g., "123456" for "MatrNr").

3. **Data Categories:**
   - **Structured vs. Unstructured:**
     - **Structured:** Organized data with a predefined format (e.g., tables).
     - **Unstructured:** No traditional format (e.g., text, images).
   - **Discrete vs. Continuous:**
     - **Discrete:** Countable values (e.g., grades).
     - **Continuous:** Any value within a range (e.g., temperature).
   - **Levels of Measurement:**
     - **Nominal:** Labels without order (e.g., colors).
     - **Ordinal:** Ordered labels (e.g., school grades).
     - **Interval:** Ordered with equal intervals (e.g., Celsius).
     - **Ratio:** Interval with a true zero (e.g., weight).
   - **Qualitative vs. Quantitative:**
     - **Qualitative:** Categorical (e.g., gender).
     - **Quantitative:** Numerical (e.g., height).

![[lecture_03.pdf]]


1. **Primary vs. Secondary Data:**
   - **Primary Data:** Collected directly for a specific purpose (e.g., surveys, experiments).
   - **Secondary Data:** Existing data from other sources (e.g., books, journals).

2. **Ways to Obtain Data:**
   - **Capturing Data:** Collecting through sensors, observations, or experiments.
   - **Retrieving Data:** Accessing from databases, APIs, or open data sources.
   - **Collecting Data:** Scraping from websites or logs when direct access isn't available.

3. **Databases:**
   - **Relational Databases:** Use SQL for structured data but have limitations with big data.
   - **NoSQL Databases:** Handle unstructured or semi-structured data, offering flexibility and scalability.
   - **Document-Oriented Databases:** Store data in formats like JSON, ideal for e-commerce and IoT.

4. **APIs:**
   - REST-APIs enable communication between systems, using HTTP methods (GET, POST, PUT, DELETE).
   - Often require authentication (e.g., API keys) and provide data in JSON/XML formats.

5. **Data Scraping:**
   - Extracting data from websites or logs when APIs aren't available.
   - Legal and ethical considerations must be addressed.
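
The REST points above can be sketched offline with the standard library. The endpoint URL, the API key, and the response body below are all made up for illustration; the request is only prepared, never sent, and real APIs may expect the key in a different header.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and key, used only for illustration.
API_URL = "https://api.example.com/v1/students"
API_KEY = "my-secret-key"


def build_get_request(url: str, api_key: str) -> Request:
    """Prepare (but do not send) a GET request that carries an API key."""
    return Request(url, headers={"Authorization": f"Bearer {api_key}"}, method="GET")


def parse_students(body: str) -> list[str]:
    """Extract the student names from a JSON response body."""
    payload = json.loads(body)
    return [student["name"] for student in payload["students"]]


request = build_get_request(API_URL, API_KEY)

# A made-up response body standing in for what a server might return.
sample_body = (
    '{"students": [{"name": "Ada", "MatrNr": "123456"},'
    ' {"name": "Linus", "MatrNr": "654321"}]}'
)
names = parse_students(sample_body)
print(request.get_method(), request.full_url, names)
```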


![[lecture_04.pdf]]


1. **Data Protection and Anonymization:**
   - **GDPR Compliance:** Personal data must be protected; processing it requires a legal basis such as the data subject's consent.
   - **Anonymization:** Removing personal identifiers to prevent individual identification.
   - **Pseudonymization:** Replacing identifying fields with pseudonyms (e.g., an artificial ID); re-identification requires additional information that is stored separately.
   - **Hashing:** Converting data to fixed-size values (e.g., SHA-256) for privacy.

2. **Statistical Basics:**
   - **Descriptive Statistics:** Summarizes data (e.g., mean, median).
   - **Exploratory Data Analysis:** Identifies patterns and outliers.
   - **Inferential Statistics:** Draws conclusions about populations from samples.

3. **Frequencies and Histograms:**
   - **Frequencies:** Count of occurrences of each value.
   - **Absolute vs. Relative Frequencies:** Raw counts vs. proportions.
   - **Histograms:** Visual representation of data distribution across classes.

4. **Empirical Distribution Function (EDF):**
   - Plots cumulative frequencies to show data distribution over a range.

5. **Data Visualization:**
   - **Pie Charts:** Effective for showing proportions of categorical data.
   - **Bar Charts:** Compare frequencies across categories.
   - **Histograms:** Display distribution of continuous data.
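
A minimal sketch of absolute frequencies, relative frequencies, and the empirical distribution function, assuming a small made-up sample of grades (the values are not from the lecture):

```python
from collections import Counter

# Hypothetical sample of exam grades (illustrative values only).
grades = [1, 2, 2, 3, 3, 3, 4, 5]
n = len(grades)

# Absolute frequencies: value -> count.
absolute = Counter(grades)

# Relative frequencies: value -> proportion of the sample.
relative = {value: count / n for value, count in sorted(absolute.items())}


def edf(x: float) -> float:
    """Empirical distribution function: share of observations <= x."""
    return sum(1 for g in grades if g <= x) / n


print(dict(sorted(absolute.items())))
print(relative)
print(edf(3))  # share of grades with value at most 3
```

The EDF is a step function: it jumps at each observed value by that value's relative frequency.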

![[lecture_05.pdf]]


1. **Central Tendencies:**
   - **Mode:** The most frequently occurring value in a dataset.
   - **Median:** The middle value when data is ordered, dividing the dataset into two equal halves.
   - **Mean:** The average value, calculated by summing all observations and dividing by the number of observations.

2. **Statistical Dispersion:**
   - **Range:** The difference between the maximum and minimum values.
   - **Interquartile Range (IQR):** The difference between the third quartile (Q3) and first quartile (Q1), representing the middle 50% of the data.
   - **Variance and Standard Deviation:** Measures of spread, with variance being the average squared deviation from the mean and standard deviation the square root of variance.

3. **Data Visualization:**
   - **Histograms:** Display the distribution of continuous data across classes.
   - **Box Plots:** Show the five-number summary (minimum, Q1, median, Q3, maximum) and identify outliers.

4. **Outliers:**
   - Defined as data points falling outside the range of [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
   - Can indicate errors, unusual observations, or novel data points.
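
The measures above can be tried out with Python's `statistics` module. The data are made up, and the quartile convention (`method="inclusive"`) is an assumption: textbooks and libraries define quartiles slightly differently, so the lecture's values may differ on other data.

```python
import statistics

# Hypothetical measurements; 30 is deliberately far from the rest.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Central tendencies.
mode = statistics.mode(data)
median = statistics.median(data)
mean = statistics.mean(data)

# Dispersion: range and interquartile range.
value_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Outlier fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(mode, median, round(mean, 2), value_range, iqr, outliers)
```

The single extreme value pulls the mean well above the median, while the median and IQR barely react to it.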

![[lecture_06.pdf]]


1. **Empirical Variance Calculation**:
   - **Data**: Daily temperatures (°C): 11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9
   - **Mean (x̄)**: 12.24
   - **Sum of squared deviations**: 14.16
   - **Empirical Variance (s̃²)**: $\frac{14.16}{7} \approx 2.02$
   - **Empirical Standard Deviation (s̃)**: $\sqrt{2.02} \approx 1.42$

2. **Contingency Table Analysis**:
   - **Variables**: Growth (shrinking, growing, growing strongly) and Location (North, South)
   - **Absolute Frequencies**:
     - North: 29 (growing), 13 (growing strongly), 7 (shrinking)
     - South: 13 (growing), 19 (growing strongly), 2 (shrinking)
   - **Chi-squared (χ²) Test**:
     - Calculated χ²: 7.53
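   - **Corrected Pearson Contingency Coefficient ($K_P^*$)**:
     - $K_P^* = \sqrt{\frac{\chi^2}{\chi^2 + n}} \cdot \sqrt{\frac{M}{M-1}} = \sqrt{\frac{7.53}{7.53 + 83}} \times \sqrt{\frac{2}{2-1}} \approx 0.41$, with $n = 83$ observations and $M = \min(\text{rows}, \text{columns}) = 2$.
     - Interpretation: Weak to medium association between location and growth.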

3. **Correlation Coefficient**:
   - **Pearson Correlation Coefficient (rXY)** for population and area:
     - Calculated rXY: 0.70
     - Interpretation: Strong positive correlation.

4. **Key Concepts**:
   - **Empirical Variance**: Measures data spread around the mean.
   - **Contingency Tables**: Used for nominal/ordinal data to assess associations.
   - **Pearson Correlation**: Measures linear correlation between metric variables, ranging from -1 to 1.
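
The temperature and contingency-table numbers can be recomputed with a short script. It uses the data given in this section; the variable names are illustrative, and the corrected coefficient is taken as $K \cdot \sqrt{M/(M-1)}$ with $M$ the smaller table dimension.

```python
import math

# Daily temperatures (°C) from the variance example.
temps = [11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9]
n = len(temps)
mean = sum(temps) / n
ssd = sum((t - mean) ** 2 for t in temps)  # sum of squared deviations
variance = ssd / n                         # empirical variance (divides by n)
std = math.sqrt(variance)

# Contingency table: rows = location,
# columns = (growing, growing strongly, shrinking).
table = {"North": [29, 13, 7], "South": [13, 19, 2]}
rows = list(table.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
total = sum(row_totals)

# Chi-squared: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(rows):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (observed - expected) ** 2 / expected

m = min(len(rows), len(col_totals))        # smaller table dimension
k = math.sqrt(chi2 / (chi2 + total))       # contingency coefficient
k_corrected = k * math.sqrt(m / (m - 1))   # rescaled to the range [0, 1]

print(round(mean, 2), round(variance, 2), round(std, 2))
print(round(chi2, 2), round(k_corrected, 2))
```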

![[lecture_07.pdf]]


1. **Probability Theory Basics**:
   - **Sample Space (Ω)**: Set of all possible outcomes.
   - **Event**: Subset of the sample space.
   - **Probability Axioms** (Kolmogorov):
     1. $P(A) \geq 0$
     2. $P(\Omega) = 1$
     3. Additivity for disjoint events.

2. **Conditional Probability**:
   - **Definition**: $P(A|B) = \frac{P(A \cap B)}{P(B)}$
   - **Independence**: Events A and B are independent if $P(A \cap B) = P(A)P(B)$.

3. **Bayes' Theorem**:
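   - **Formula**: $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$
   - **Example**: With $P(A|B) = 0.9$, $P(B) = 0.01$, and $P(A|\bar{B}) = 0.02$, the denominator expands $P(A)$ by the law of total probability:
     - $P(B|A) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.02 \times 0.99} \approx 0.31$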

4. **Combinatorics**:
   - **Permutations**: $n!$ for unique items.
   - **Combinations**: $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
   - **Multiplication Rule**: $n_1 \times n_2 \times \dots \times n_r$

5. **Key Calculations**:
   - **Password Example**: $7^2 \times P(10,4) = 49 \times 5040 = 246960$
   - **Dice Probability**: $P(\text{at least one 6 in 4 throws}) = 1 - (5/6)^4 \approx 0.518$
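
The probability calculations in this section can be checked with Python's `math` module. The function name `bayes` and its parameter names are illustrative; the numbers are the ones used in the examples above.

```python
import math


def bayes(p_a_given_b: float, p_b: float, p_a_given_not_b: float) -> float:
    """P(B|A) via Bayes' theorem; P(A) comes from the law of total probability."""
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
    return p_a_given_b * p_b / p_a


# Bayes example: P(A|B) = 0.9, P(B) = 0.01, P(A|not B) = 0.02.
posterior = bayes(0.9, 0.01, 0.02)

# Password example: 7^2 choices times the ordered selections of 4 out of 10.
passwords = 7 ** 2 * math.perm(10, 4)

# Dice example: complement of "no 6 in 4 throws".
at_least_one_six = 1 - (5 / 6) ** 4

# Combinations: number of ways to choose 2 items out of 5, ignoring order.
pairs = math.comb(5, 2)

print(round(posterior, 4), passwords, round(at_least_one_six, 3), pairs)
```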

![[lecture_08_neu.pdf]]


1. **Random Variables**:
   - **Definition**: A function that assigns numerical values to outcomes in a sample space.
   - **Discrete vs. Continuous**:
     - Discrete: Countable outcomes (e.g., dice roll).
     - Continuous: Uncountable outcomes (e.g., measurement of height).

2. **Probability Distributions**:
   - **Discrete**:
     - **Bernoulli**: $f(x) = p^x(1-p)^{1-x}$ for $x \in \{0,1\}$.
     - **Binomial**: $f(x) = \binom{n}{x}p^x(1-p)^{n-x}$ for $x \in \{0,1,\dots,n\}$.
     - **Uniform**: $f(x) = \frac{1}{m}$ for $x \in \{1,2,\dots,m\}$.
   - **Continuous**:
     - **Uniform**: $f(x) = \frac{1}{b-a}$ for $x \in [a,b]$.
     - **Normal (Gaussian)**: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)}$.

3. **Expected Value and Variance**:
   - **Discrete**:
     - **Expected Value**: $E(X) = \sum x \cdot f(x)$.
     - **Variance**: $Var(X) = E((X-E(X))^2)$.
   - **Continuous**:
     - **Expected Value**: $E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$.
     - **Variance**: $Var(X) = E((X-E(X))^2)$.

4. **Key Calculations**:
   - **Bernoulli Distribution**:
     - $E(X) = p$, $Var(X) = p(1-p)$.
   - **Binomial Distribution**:
     - $E(X) = np$, $Var(X) = np(1-p)$.
   - **Uniform Distribution (Discrete)**:
     - $E(X) = \frac{m+1}{2}$, $Var(X) = \frac{m^2-1}{12}$.
   - **Uniform Distribution (Continuous)**:
     - $E(X) = \frac{a+b}{2}$, $Var(X) = \frac{(b-a)^2}{12}$.
   - **Normal Distribution**:
     - $E(X) = \mu$, $Var(X) = \sigma^2$.

5. **Normal Distribution**:
   - **Standard Normal (Z-Score)**: $Z = \frac{X-\mu}{\sigma} \sim N(0,1)$.
   - **68-95-99.7 Rule**: 68% of the data lie within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$.
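
As a numerical check of the formulas above, the following sketch evaluates the binomial probability function for hypothetical parameters ($n = 10$, $p = 0.3$, chosen only for illustration) and confirms the 68-95-99.7 rule with `statistics.NormalDist`:

```python
import math
from statistics import NormalDist


def binomial_pmf(x: int, n: int, p: float) -> float:
    """f(x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.3  # hypothetical parameters
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Expected value and variance straight from the definitions.
expected = sum(x * f for x, f in enumerate(pmf))
variance = sum((x - expected) ** 2 * f for x, f in enumerate(pmf))

# Probability mass within k standard deviations of the mean for N(0, 1).
standard_normal = NormalDist()
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

print(round(expected, 6), round(variance, 6))  # compare with np and np(1-p)
print({k: round(v, 4) for k, v in within.items()})
```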

![[lecture_09.pdf]]


1. **Simple Linear Regression**:
   - **Model**: $Y = \beta_0 + \beta_1 X + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.
   - **Estimation**: The parameters $\beta_0$ and $\beta_1$ are estimated using the least squares method.

2. **Least Squares Estimators**:
   - **Slope (β̂₁)**:
     $$
     \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}
     $$
   - **Intercept (β̂₀)**:
     $$
     \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
     $$

3. **Residual Analysis**:
   - **Residuals**: $e_i = Y_i - \hat{Y}_i$.
   - **Residual Plot**: Used to check model assumptions (linearity, constant variance, normality).

4. **Coefficient of Determination (R²)**:
   - **Formula**:
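     $$
     R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}
     $$
     where TSS $= \sum_{i=1}^{n}(Y_i - \bar{Y})^2$ is the total sum of squares, ESS $= \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$ is the explained sum of squares, and RSS is the residual sum of squares.
   - **Interpretation**: Measures goodness of fit (0 ≤ R² ≤ 1); values close to 1 mean the model explains most of the variation in Y.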

5. **Variance Estimation**:
   - **Residual Sum of Squares (RSS)**:
     $$
     RSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2
     $$
   - **Variance Estimator**:
     $$
     \hat{\sigma}^2 = \frac{RSS}{n-2}
     $$

6. **Key Calculations**:
   - **Example Calculations**:
     - For given data, compute $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$.
     - Calculate R² to assess model fit.
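
A minimal sketch of these calculations in plain Python, assuming five made-up data points (not the lecture's data). R² is computed as $1 - \text{RSS}/\text{TSS}$, with RSS the residual sum of squares and TSS the total sum of squares around $\bar{Y}$.

```python
# Hypothetical data points (x_i, y_i), chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares estimators for slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

# Fitted values and residuals e_i = y_i - y_hat_i.
fitted = [beta0 + beta1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Residual and total sums of squares, R^2, and the variance estimator.
rss = sum(e ** 2 for e in residuals)
tss = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - rss / tss
sigma2_hat = rss / (n - 2)

print(round(beta0, 3), round(beta1, 3), round(r_squared, 3), round(sigma2_hat, 3))
```

With an intercept in the model, the residuals sum to zero, which makes a quick sanity check on any hand calculation.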


![[lecture_10.pdf]]


1. **Confidence Intervals**:
   - **Point Estimator**: Provides a single estimate of a population parameter.
   - **Interval Estimator**: Provides a range of values within which the parameter is expected to lie.
   - **Formula for Mean (σ known)**:
     $$
     \left[ \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right]
     $$
   - **Formula for Mean (σ unknown)**:
     $$
     \left[ \bar{X} - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}}, \bar{X} + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}} \right]
     $$

2. **Statistical Tests**:
   - **Hypothesis Testing**: Involves setting up a null hypothesis (H₀) and an alternative hypothesis (H₁), then determining whether to reject H₀ based on sample data.
   - **Z-Test**: Used when the population variance is known.
     - **Test Statistic**: $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
   - **T-Test**: Used when the population variance is unknown.
     - **Test Statistic**: $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
   - **Two-Sample T-Test**: Compares the means of two independent groups.
     - **Test Statistic**: $T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}$

3. **Key Calculations**:
   - **Example 1**: 95% confidence interval for bonbon package weights.
     - **Result**: [63.84, 65.08]
   - **Example 2**: Testing machine adjustment with t-test.
     - **Result**: Null hypothesis not rejected, machine does not need adjustment.
   - **Example 3**: Two-sample t-test for bonbon weights.
     - **Result**: Reject H₀, second company’s bonbons are heavier.
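
The σ-known formulas can be sketched with `statistics.NormalDist`, which supplies the quantile $z_{1-\alpha/2}$. The sample values below are hypothetical and are not the bonbon data from the examples. The standard library has no t-distribution, so the σ-unknown case would need a t-table or an external library.

```python
import math
from statistics import NormalDist


def z_confidence_interval(x_bar: float, sigma: float, n: int, alpha: float = 0.05):
    """Two-sided (1 - alpha) confidence interval for the mean, sigma known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


def z_statistic(x_bar: float, mu0: float, sigma: float, n: int) -> float:
    """Test statistic Z = (x_bar - mu0) / (sigma / sqrt(n))."""
    return (x_bar - mu0) / (sigma / math.sqrt(n))


# Hypothetical sample summary: mean 100.4 from n = 25, known sigma = 1.5.
x_bar, sigma, n, mu0 = 100.4, 1.5, 25, 100.0

low, high = z_confidence_interval(x_bar, sigma, n)
z = z_statistic(x_bar, mu0, sigma, n)

# Two-sided test at alpha = 0.05: reject H0 if |Z| exceeds z_{0.975}.
reject = abs(z) > NormalDist().inv_cdf(0.975)

print(round(low, 3), round(high, 3), round(z, 3), reject)
```

Here `mu0` lies inside the interval and the test does not reject H₀; the two statements are equivalent for a two-sided test at the same α.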

![[lecture_11.pdf]]


1. **Point Estimation**:
   - A point estimator (e.g., sample mean) estimates the parameter θ from a random sample.
   - Example: For a normal distribution, the arithmetic mean estimates the expected value.

2. **Confidence Intervals**:
   - An interval $[g_l, g_u]$, computed from the sample, that covers θ with probability 1−α.
   - Types: One-sided (e.g., [gl, ∞)) and two-sided (finite interval).

3. **Statistical Hypothesis Testing**:
   - **Null Hypothesis (H0)**: Statement to be tested (e.g., μ = 15).
   - **Alternative Hypothesis (H1)**: Opposing statement (e.g., μ ≠ 15).
   - **Test Statistic**: A function of the sample data to assess H0 vs. H1.
   - **Rejection Region**: Values of the test statistic leading to rejection of H0.
   - **Errors**:
     - **Type I Error**: Rejecting a true H0 (controlled by significance level α).
     - **Type II Error**: Failing to reject a false H0 (related to test power).

4. **Z-Test and T-Test**:
   - **Z-Test**: Used when population variance is known or sample size is large.
   - **T-Test**: Used when population variance is unknown; relies on sample standard deviation and t-distribution.

5. **Two-Sample T-Test**:
   - Compares means of two independent groups.
   - **Assumptions**: Normality and independence of the samples; homogeneity of variances is required only for the pooled (equal-variance) version of the test.
   - **Test Statistic**: Accounts for sample means, standard deviations, and sizes.
   - **Degrees of Freedom**: Calculated using Welch-Satterthwaite equation for unequal variances.

6. **Examples**:
   - **One-Sample Test**: Chocolate box weights using z-test (known variance).
   - **Two-Sample Test**: Comparing bonbon weights using t-test with Welch-Satterthwaite adjustment.

7. **Key Concepts**:
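   - **p-value**: Probability, assuming H0 is true, of observing a test statistic at least as extreme as the one actually observed; H0 is rejected if the p-value is below α.
   - **Confidence Interval and Hypothesis Test Relationship**: A two-sided test at level α rejects H0: θ = θ₀ exactly when θ₀ lies outside the (1−α) confidence interval.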

8. **Future Topics**: Data preparation and decision trees.
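
A sketch of the two-sample statistic with the Welch-Satterthwaite degrees of freedom, plus a two-sided p-value for a z statistic. The two samples are made up for illustration. Since the standard library has no t-distribution, the critical value or p-value for the Welch test would have to come from a t-table or an external library.

```python
import math
from statistics import NormalDist, mean, variance


def welch(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Return the Welch t statistic and the Welch-Satterthwaite degrees of freedom."""
    n, m = len(xs), len(ys)
    vx = variance(xs) / n  # S_X^2 / n (sample variance, divisor n - 1)
    vy = variance(ys) / m  # S_Y^2 / m
    t = (mean(xs) - mean(ys)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df


def two_sided_p_value_z(z: float) -> float:
    """P(|Z| >= |z|) under H0 for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical samples from two independent groups.
group_x = [1.0, 2.0, 3.0, 4.0, 5.0]
group_y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

t_stat, df = welch(group_x, group_y)
p_z = two_sided_p_value_z(1.96)

print(round(t_stat, 3), round(df, 2), round(p_z, 3))
```

The degrees of freedom are generally not an integer; they always fall between the smaller of $n-1$ and $m-1$ and the pooled value $n+m-2$.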

![[lecture_12.pdf]]


1. **Machine Learning Overview**:
   - Involves data preparation, model building, and evaluation.
   - Follows the CRISP-DM process: Business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

2. **Data Preparation**:
   - **Handling Missing Data**: Strategies include filtering, marking, or imputing missing values (e.g., mean, median, or model-based imputation).
   - **Handling False Data**: Identify and correct errors through analysis or expert consultation.
   - **Feature Engineering**: Create new features from existing data (e.g., deriving age from birthdate) or combine attributes for better model performance.

3. **Decision Trees**:
   - **Structure**: A tree with nodes representing tests on attributes, leading to leaf nodes with class predictions.
   - **Construction**: Built by recursively splitting data to minimize entropy (disorder). Information gain determines the best split.
   - **Example**: Medicine recommendation based on blood pressure and age.
   - **Pros and Cons**: Easy to interpret but may overfit or require discretization for numerical attributes.

4. **Model Evaluation**:
   - **Train-Test Split**: Evaluate models on independent test data to avoid overfitting.
   - **Metrics for Regression**: Mean Squared Error (MSE), Mean Absolute Error (MAE).
   - **Metrics for Classification**: Confusion matrix, accuracy, precision, recall, F1 score.
     - **Accuracy**: Overall correctness.
     - **Precision**: Correct predictions among positive predictions.
     - **Recall**: Correct predictions among actual positive instances.
     - **F1 Score**: Harmonic mean of precision and recall.

5. **Advanced Topics**:
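   - **Ensemble Methods**: Random Forests and Gradient Boosting combine many decision trees; Random Forests mainly reduce variance by averaging, while Gradient Boosting mainly reduces bias by fitting trees sequentially to the remaining errors.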
   - **Model Comparison**: Use validation sets to compare models and avoid overfitting.

6. **Summary**:
   - **Skills Acquired**: Data preparation, decision tree classification, and model evaluation.
   - **Future Topics**: Dashboards and summaries for data visualization and reporting.
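
Entropy and information gain, which drive the splits in decision tree construction, can be sketched as follows. The four patient records and the attribute names (`bp`, `age`, `drug`) are invented for illustration and only loosely follow the medicine example.

```python
import math
from collections import Counter


def entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows: list[dict], attribute: str, target: str) -> float:
    """Entropy of the target before the split minus the weighted entropy after it."""
    parent = entropy([row[target] for row in rows])
    groups: dict[str, list[str]] = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


# Hypothetical training data: the drug depends only on blood pressure.
patients = [
    {"bp": "high", "age": "old", "drug": "A"},
    {"bp": "high", "age": "young", "drug": "A"},
    {"bp": "low", "age": "old", "drug": "B"},
    {"bp": "low", "age": "young", "drug": "B"},
]

gain_bp = information_gain(patients, "bp", "drug")
gain_age = information_gain(patients, "age", "drug")
best_split = max(["bp", "age"], key=lambda a: information_gain(patients, a, "drug"))

print(gain_bp, gain_age, best_split)
```

Splitting on `bp` yields pure subsets, so it removes all uncertainty, while splitting on `age` leaves the class mix unchanged; a tree builder would therefore test `bp` first.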


![[lecture_13.pdf]]

1. **Model Evaluation Basics**:
   - **Overfitting**: When a model fits the training data too well, leading to poor performance on new data.
   - **Underfitting**: When a model is too simple to capture the data's structure, resulting in poor performance on both training and test data.

2. **Data Splitting**:
   - **Train-Test Split**: Commonly used to evaluate model performance. Typical splits are 80% for training and 20% for testing.
   - **Train-Validation-Test Split**: Used to compare different models by having separate training, validation, and test sets.

3. **Evaluation Metrics**:
   - **Regression Metrics**:
     - **Mean Squared Error (MSE)**: Average of squared differences between predicted and actual values.
     - **Mean Absolute Error (MAE)**: Average of absolute differences.
     - **Mean Absolute Percentage Error (MAPE)**: Average of percentage differences.
   - **Classification Metrics**:
     - **Confusion Matrix**: Tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
     - **Accuracy**: (TP + TN) / (TP + TN + FP + FN).
     - **Precision**: TP / (TP + FP).
     - **Recall**: TP / (TP + FN).
     - **F1 Score**: Harmonic mean of precision and recall, given by $F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.

4. **Example Calculations**:
   - **Confusion Matrix Example**:
     - Accuracy: $\frac{4 + 1}{10} = 0.5$
     - Precision: $\frac{4}{6} \approx 0.67$
     - Recall: $\frac{4}{7} \approx 0.57$
     - F1 Score: $\frac{2 \times 0.67 \times 0.57}{0.67 + 0.57} \approx 0.62$

5. **Interpreting Results**:
   - **Baseline Comparison**: Always compare model performance against a naive reference (e.g., predicting the majority class).
   - **Contextual Relevance**: Accuracy alone may not indicate practical usefulness; consider domain context and baseline performance.

6. **Summary**:
   - **Skills Acquired**: Understanding of evaluation metrics and their application in assessing machine learning models.
   - **Future Topics**: Advanced evaluation techniques and model optimization strategies.
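
The classification metrics can be computed directly from the four confusion-matrix counts. The counts below (TP = 4, FP = 2, TN = 1, FN = 3) are inferred from the fractions in the confusion matrix example, not stated explicitly in the notes; the function name is illustrative.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts inferred from the example: 10 cases, 6 predicted positive, 7 actually positive.
tp, fp, tn, fn = 4, 2, 1, 3
metrics = classification_metrics(tp, fp, tn, fn)

# Naive baseline: always predict the majority class.
total = tp + fp + tn + fn
baseline_accuracy = max(tp + fn, tn + fp) / total

print({name: round(value, 2) for name, value in metrics.items()})
print(baseline_accuracy)
```

In this example the majority-class baseline is more accurate than the model, which illustrates why accuracy should always be read against a naive reference.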