obsidian/WS2425/Data Science/Cheat Sheet Mockup.md
2025-02-20 15:52:52 +01:00

![[lecture_02.pdf]]
1. **Definition of Data:**
- Data is information collected, stored, or processed. It is ubiquitous and can be measured or categorized.
2. **Data Basics:**
- **Basic Population:** Entire group of interest (e.g., all students).
- **Sample:** Subset of the population (e.g., students in a lecture).
- **Statistical Unit:** Individual data point (e.g., one student).
- **Variable:** Characteristics of units (e.g., name, population size).
- **Value:** Specific value of a variable (e.g., "123456" for "MatrNr").
3. **Data Categories:**
- **Structured vs. Unstructured:**
- **Structured:** Organized data with a predefined format (e.g., tables).
- **Unstructured:** No traditional format (e.g., text, images).
- **Discrete vs. Continuous:**
- **Discrete:** Countable values (e.g., grades).
- **Continuous:** Any value within a range (e.g., temperature).
- **Levels of Measurement:**
- **Nominal:** Labels without order (e.g., colors).
- **Ordinal:** Ordered labels (e.g., school grades).
- **Interval:** Ordered with equal intervals (e.g., Celsius).
- **Ratio:** Interval with a true zero (e.g., weight).
- **Qualitative vs. Quantitative:**
- **Qualitative:** Categorical (e.g., gender).
- **Quantitative:** Numerical (e.g., height).
![[lecture_03.pdf]]
1. **Primary vs. Secondary Data:**
- **Primary Data:** Collected directly for a specific purpose (e.g., surveys, experiments).
- **Secondary Data:** Existing data from other sources (e.g., books, journals).
2. **Ways to Obtain Data:**
- **Capturing Data:** Collecting through sensors, observations, or experiments.
- **Retrieving Data:** Accessing from databases, APIs, or open data sources.
- **Collecting Data:** Scraping from websites or logs when direct access isn't available.
3. **Databases:**
- **Relational Databases:** Use SQL for structured data but have limitations with big data.
- **NoSQL Databases:** Handle unstructured or semi-structured data, offering flexibility and scalability.
- **Document-Oriented Databases:** Store data in formats like JSON, ideal for e-commerce and IoT.
4. **APIs:**
- REST-APIs enable communication between systems, using HTTP methods (GET, POST, PUT, DELETE).
- Often require authentication (e.g., API keys) and provide data in JSON/XML formats.
5. **Data Scraping:**
- Extracting data from websites or logs when APIs aren't available.
- Legal and ethical considerations must be addressed.
![[lecture_04.pdf]]
1. **Data Protection and Anonymization:**
- **GDPR Compliance:** Personal data must be protected, and usage requires consent.
- **Anonymization:** Removing personal identifiers to prevent individual identification.
- **Pseudonymization:** Replacing direct identifiers with pseudonyms, so that re-identification requires additional, separately stored information.
- **Hashing:** Converting data to fixed-size values (e.g., SHA-256) for privacy.
2. **Statistical Basics:**
- **Descriptive Statistics:** Summarizes data (e.g., mean, median).
- **Exploratory Data Analysis:** Identifies patterns and outliers.
- **Inferential Statistics:** Draws conclusions about populations from samples.
3. **Frequencies and Histograms:**
- **Frequencies:** Count of occurrences of each value.
- **Absolute vs. Relative Frequencies:** Raw counts vs. proportions.
- **Histograms:** Visual representation of data distribution across classes.
4. **Empirical Distribution Function (EDF):**
- Plots cumulative frequencies to show data distribution over a range.
5. **Data Visualization:**
- **Pie Charts:** Effective for showing proportions of categorical data.
- **Bar Charts:** Compare frequencies across categories.
- **Histograms:** Display distribution of continuous data.
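The frequency and EDF definitions above can be sketched in a few lines of Python. The grade list is hypothetical data chosen for illustration, and `edf` is a helper name introduced here, not something from the lecture.

```python
from collections import Counter

# Hypothetical sample of exam grades (illustrative data, not from the lecture).
grades = [1, 2, 2, 3, 3, 3, 4, 5, 2, 3]

n = len(grades)
absolute = Counter(grades)                                  # value -> raw count
relative = {v: c / n for v, c in sorted(absolute.items())}  # value -> proportion


def edf(x, data):
    """Empirical distribution function: share of observations <= x."""
    return sum(1 for d in data if d <= x) / len(data)


for value, share in relative.items():
    print(f"grade {value}: abs={absolute[value]}, rel={share:.2f}, F={edf(value, grades):.2f}")
```

The relative frequencies sum to 1, and the EDF is the running total of those relative frequencies, which is why it reaches 1 at the largest observed value.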
![[lecture_05.pdf]]
1. **Central Tendencies:**
- **Mode:** The most frequently occurring value in a dataset.
- **Median:** The middle value when data is ordered, dividing the dataset into two equal halves.
- **Mean:** The average value, calculated by summing all observations and dividing by the number of observations.
2. **Statistical Dispersion:**
- **Range:** The difference between the maximum and minimum values.
- **Interquartile Range (IQR):** The difference between the third quartile (Q3) and first quartile (Q1), representing the middle 50% of the data.
- **Variance and Standard Deviation:** Measures of spread, with variance being the average squared deviation from the mean and standard deviation the square root of variance.
3. **Data Visualization:**
- **Histograms:** Display the distribution of continuous data across classes.
- **Box Plots:** Show the five-number summary (minimum, Q1, median, Q3, maximum) and identify outliers.
4. **Outliers:**
- Defined as data points falling outside the range of [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
- Can indicate errors, unusual observations, or novel data points.
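As a minimal sketch, the measures above can be computed with Python's standard `statistics` module. The sample is hypothetical. Quartile conventions differ between textbooks and tools, so the values of Q1 and Q3 (and hence the outlier fences) may deviate slightly from the lecture's method.

```python
import statistics

# Hypothetical sample (illustrative values, not lecture data).
data = [2, 3, 3, 4, 5, 6, 7, 8, 30]

mode = statistics.mode(data)
median = statistics.median(data)
mean = statistics.mean(data)

value_range = max(data) - min(data)
variance = statistics.pvariance(data)   # average squared deviation (divides by n)
std_dev = statistics.pstdev(data)       # square root of that variance

# Quartiles using the module's default ("exclusive") method.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Outlier rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(mode, median, round(mean, 2), value_range, round(std_dev, 2))
print(q1, q3, iqr, outliers)
```

The single large value pulls the mean well above the median, which is the usual argument for preferring the median and the IQR when outliers are present.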
![[lecture_06.pdf]]
1. **Empirical Variance Calculation**:
- **Data**: Daily temperatures (°C): 11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9
- **Mean (x̄)**: 12.24
- **Sum of squared deviations**: 14.16
- **Empirical Variance (s̃²)**: $\tilde{s}^2 = \frac{14.16}{7} \approx 2.02$
- **Empirical Standard Deviation (s̃)**: $\tilde{s} = \sqrt{2.02} \approx 1.42$
2. **Contingency Table Analysis**:
- **Variables**: Growth (shrinking, growing, growing strongly) and Location (North, South)
- **Absolute Frequencies**:
- North: 29 (growing), 13 (growing strongly), 7 (shrinking)
- South: 13 (growing), 19 (growing strongly), 2 (shrinking)
- **Chi-squared (χ²) Test**:
- Calculated χ²: 7.53
- **Corrected Pearson Contingency Coefficient ($K_P^*$)**:
- $K_P^* = \sqrt{\frac{7.53}{7.53 + 83}} \times \sqrt{\frac{2}{2 - 1}} \approx 0.41$, where $n = 83$ is the total count and $2 = \min(\text{rows}, \text{columns})$.
- Interpretation: Weak to medium association between location and growth.
3. **Correlation Coefficient**:
- **Pearson Correlation Coefficient ($r_{XY}$)** for population and area:
- Calculated $r_{XY}$: 0.70
- Interpretation: Strong positive correlation.
4. **Key Concepts**:
- **Empirical Variance**: Measures data spread around the mean.
- **Contingency Tables**: Used for nominal/ordinal data to assess associations.
- **Pearson Correlation**: Measures linear correlation between metric variables, ranging from -1 to 1.
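The two worked examples above can be re-computed with a short Python sketch. It uses the temperature values and the contingency counts exactly as listed in this note. The variable names are chosen here for illustration, and the correction factor assumes the usual definition $K^* = K / K_{max}$ with $K_{max} = \sqrt{(M-1)/M}$ and $M = \min(\text{rows}, \text{columns})$.

```python
import math

# Empirical variance of the daily temperatures listed above.
temps = [11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9]
n = len(temps)
mean = sum(temps) / n
sum_sq_dev = sum((t - mean) ** 2 for t in temps)
emp_var = sum_sq_dev / n            # empirical variance divides by n
emp_sd = math.sqrt(emp_var)

# Contingency table: rows = North, South;
# columns = growing, growing strongly, shrinking.
observed = [[29, 13, 7],
            [13, 19, 2]]
row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
total = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_tot[i] * col_tot[j] / total   # count expected under independence
        chi2 += (obs - expected) ** 2 / expected

m = min(len(observed), len(observed[0]))             # smaller table dimension
k = math.sqrt(chi2 / (chi2 + total))                 # Pearson contingency coefficient
k_corr = k * math.sqrt(m / (m - 1))                  # corrected coefficient

print(round(mean, 2), round(sum_sq_dev, 2), round(emp_var, 2), round(emp_sd, 2))
print(round(chi2, 2), round(k_corr, 2))
```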
![[lecture_07.pdf]]
1. **Probability Theory Basics**:
- **Sample Space (Ω)**: Set of all possible outcomes.
- **Event**: Subset of the sample space.
- **Probability Axioms** (Kolmogorov):
1. $P(A) \geq 0$
2. $P(\Omega) = 1$
3. Additivity for disjoint events.
2. **Conditional Probability**:
- **Definition**: $P(A|B) = \frac{P(A \cap B)}{P(B)}$, for $P(B) > 0$
- **Independence**: Events A and B are independent if $P(A \cap B) = P(A)P(B)$.
3. **Bayes' Theorem**:
- **Formula**: $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$, with $P(A) = P(A|B)P(B) + P(A|\bar{B})P(\bar{B})$ by the law of total probability.
- **Example**:
- $P(B|A) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.02 \times 0.99} \approx 0.31$
4. **Combinatorics**:
- **Permutations**: $n!$ orderings of $n$ distinct items.
- **Combinations**: $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
- **Multiplication Rule**: $n_1 \times n_2 \times \dots \times n_r$
5. **Key Calculations**:
- **Password Example**: $7^2 \times P(10,4) = 49 \times 5040 = 246960$
- **Dice Probability**: $P(\text{at least one 6 in 4 throws}) = 1 - (5/6)^4 \approx 0.518$
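The calculations in this section can be checked with Python's `math` module. This is a sketch: the numbers are the ones appearing in the formulas above, while the variable names and the reading of 0.9, 0.01 and 0.02 as $P(A|B)$, $P(B)$ and $P(A|\bar{B})$ are inferred from the structure of the Bayes example.

```python
import math

# Bayes example: posterior P(B|A) from P(A|B), P(B) and P(A|not B).
p_a_given_b = 0.9
p_b = 0.01
p_a_given_not_b = 0.02

# Law of total probability gives the denominator P(A).
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
p_b_given_a = p_a_given_b * p_b / p_a

# Combinatorics helpers from the standard library.
ordered_picks = math.perm(10, 4)       # P(10,4): ordered selection without repetition
passwords = 7 ** 2 * ordered_picks     # multiplication rule
n_choose_k = math.comb(5, 2)           # binomial coefficient, here 5 choose 2

# At least one six in four throws, via the complement "no six at all".
p_at_least_one_six = 1 - (5 / 6) ** 4

print(round(p_b_given_a, 4), ordered_picks, passwords, n_choose_k,
      round(p_at_least_one_six, 3))
```

Even with a 90% hit rate, the posterior stays near 31% because the prior $P(B)$ is only 1%.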
![[lecture_08_neu.pdf]]
1. **Random Variables**:
- **Definition**: A function that assigns numerical values to outcomes in a sample space.
- **Discrete vs. Continuous**:
- Discrete: Countable outcomes (e.g., dice roll).
- Continuous: Uncountable outcomes (e.g., measurement of height).
2. **Probability Distributions**:
- **Discrete**:
- **Bernoulli**: $f(x) = p^x(1-p)^{1-x}$ for $x \in \{0,1\}$.
- **Binomial**: $f(x) = \binom{n}{x}p^x(1-p)^{n-x}$ for $x \in \{0,1,\dots,n\}$.
- **Uniform**: $f(x) = \frac{1}{m}$ for $x \in \{1,2,\dots,m\}$.
- **Continuous**:
- **Uniform**: $f(x) = \frac{1}{b-a}$ for $x \in [a,b]$.
- **Normal (Gaussian)**: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)}$.
3. **Expected Value and Variance**:
- **Discrete**:
- **Expected Value**: $E(X) = \sum_x x \cdot f(x)$.
- **Variance**: $Var(X) = E((X-E(X))^2)$.
- **Continuous**:
- **Expected Value**: $E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$.
- **Variance**: $Var(X) = E((X-E(X))^2)$.
4. **Key Calculations**:
- **Bernoulli Distribution**:
- $E(X) = p$, $Var(X) = p(1-p)$.
- **Binomial Distribution**:
- $E(X) = np$, $Var(X) = np(1-p)$.
- **Uniform Distribution (Discrete)**:
- $E(X) = \frac{m+1}{2}$, $Var(X) = \frac{m^2-1}{12}$.
- **Uniform Distribution (Continuous)**:
- $E(X) = \frac{a+b}{2}$, $Var(X) = \frac{(b-a)^2}{12}$.
- **Normal Distribution**:
- $E(X) = \mu$, $Var(X) = \sigma^2$.
5. **Normal Distribution**:
- **Standard Normal (Z-Score)**: $Z = \frac{X-\mu}{\sigma} \sim N(0,1)$.
- **68-95-99.7 Rule**: About 68% of the data lie within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$.
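A small Python sketch can confirm two of the statements above numerically: the closed forms $E(X) = np$ and $Var(X) = np(1-p)$ for the binomial distribution, and the 68-95-99.7 rule. The parameters `n = 10` and `p = 0.3` are arbitrary illustrative choices, and `binomial_pmf` is a helper defined here.

```python
import math
from statistics import NormalDist


def binomial_pmf(x, n, p):
    """Binomial probability f(x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.3   # hypothetical parameters
support = range(n + 1)

# Expected value and variance straight from the definitions.
expected = sum(x * binomial_pmf(x, n, p) for x in support)
variance = sum((x - expected) ** 2 * binomial_pmf(x, n, p) for x in support)

# Probability mass of a normal variable within k standard deviations of the mean.
std_normal = NormalDist(mu=0, sigma=1)
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

print(round(expected, 4), round(variance, 4))
print({k: round(v, 4) for k, v in within.items()})
```

Because of the z-score transformation, the same three probabilities hold for any normal distribution, not only the standard one.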
![[lecture_09.pdf]]
1. **Simple Linear Regression**:
- **Model**: $Y = \beta_0 + \beta_1 X + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.
- **Estimation**: Parameters $\beta_0$ and $\beta_1$ are estimated using the least squares method.
2. **Least Squares Estimators**:
- **Slope (β̂₁)**:
$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}
$$
- **Intercept (β̂₀)**:
$$
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
$$
3. **Residual Analysis**:
- **Residuals**: $e_i = Y_i - \hat{Y}_i$.
- **Residual Plot**: Used to check model assumptions (linearity, constant variance, normality).
4. **Coefficient of Determination (R²)**:
- **Formula**:
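$$
R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}
$$
where $\text{TSS} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$ is the total, $\text{ESS} = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$ the explained, and RSS the residual sum of squares.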
- **Interpretation**: Measures goodness of fit (0 ≤ R² ≤ 1).
5. **Variance Estimation**:
- **Residual Sum of Squares (RSS)**:
$$
RSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2
$$
- **Variance Estimator**:
$$
\hat{\sigma}^2 = \frac{RSS}{n-2}
$$
6. **Key Calculations**:
- **Example Calculations**:
- For given data, compute $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$.
- Calculate R² to assess model fit.
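The estimators in this section can be written out directly in Python. The sketch below uses a small hypothetical data set (the lecture's data is not reproduced in this note) and treats RSS as the residual sum of squares $\sum (Y_i - \hat{Y}_i)^2$, so that $R^2 = 1 - RSS/TSS$.

```python
# Hypothetical observations (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares estimators for slope and intercept.
s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

# Fitted values and residuals e_i = Y_i - Y_hat_i.
fitted = [beta0 + beta1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

rss = sum(e ** 2 for e in residuals)        # residual sum of squares
tss = sum((y - y_bar) ** 2 for y in ys)     # total sum of squares
r_squared = 1 - rss / tss
sigma2_hat = rss / (n - 2)                  # two estimated parameters

print(round(beta0, 3), round(beta1, 3), round(r_squared, 4), round(sigma2_hat, 4))
```

Plotting `residuals` against `xs` gives the residual plot mentioned above; a visible pattern or funnel shape would argue against the linearity or constant-variance assumption.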
![[lecture_10.pdf]]
1. **Confidence Intervals**:
- **Point Estimator**: Provides a single estimate of a population parameter.
- **Interval Estimator**: Provides a range of values within which the parameter is expected to lie.
- **Formula for Mean (σ known)**:
$$
\left[ \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right]
$$
- **Formula for Mean (σ unknown)**:
$$
\left[ \bar{X} - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}}, \bar{X} + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}} \right]
$$
2. **Statistical Tests**:
- **Hypothesis Testing**: Involves setting up a null hypothesis (H₀) and an alternative hypothesis (H₁), then determining whether to reject H₀ based on sample data.
- **Z-Test**: Used when the population variance is known.
- **Test Statistic**: $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
- **T-Test**: Used when the population variance is unknown.
- **Test Statistic**: $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
- **Two-Sample T-Test**: Compares the means of two independent groups.
- **Test Statistic**: $T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}$
3. **Key Calculations**:
- **Example 1**: 95% confidence interval for bonbon package weights.
- **Result**: [63.84, 65.08]
- **Example 2**: Testing machine adjustment with t-test.
- **Result**: Null hypothesis not rejected, machine does not need adjustment.
- **Example 3**: Two-sample t-test for bonbon weights.
- **Result**: Reject H₀; the second company's bonbons are heavier.
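For the known-σ case, the confidence interval and the z-test can be sketched with the standard library alone. The weights, `sigma`, and `mu0` below are hypothetical values, not the lecture's bonbon data. The t-based variants additionally need a t quantile, which the standard library does not provide (for example, `scipy.stats.t.ppf` could supply it).

```python
import math
from statistics import NormalDist, mean

# Hypothetical package weights in grams and an assumed known sigma.
weights = [64.1, 65.3, 63.8, 64.9, 64.4, 65.0, 63.9, 64.6]
sigma = 0.9      # assumed known population standard deviation
alpha = 0.05     # 95% confidence level
mu0 = 65.0       # hypothesized mean under H0

n = len(weights)
x_bar = mean(weights)

# Two-sided confidence interval: x_bar +/- z_{1-alpha/2} * sigma / sqrt(n).
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
half_width = z_crit * sigma / math.sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)

# Z-test of H0: mu = mu0 against H1: mu != mu0.
z = (x_bar - mu0) / (sigma / math.sqrt(n))
reject_h0 = abs(z) > z_crit

print(f"CI = [{ci[0]:.2f}, {ci[1]:.2f}], z = {z:.2f}, reject H0: {reject_h0}")
```

In this sketch the test does not reject H0, and `mu0` falls inside the interval; the two statements are always equivalent for a two-sided test at the same α.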
![[lecture_11.pdf]]
1. **Point Estimation**:
- A point estimator (e.g., sample mean) estimates the parameter θ from a random sample.
- Example: For a normal distribution, the arithmetic mean estimates the expected value.
2. **Confidence Intervals**:
- A range $[g_l, g_u]$ that covers θ with probability $1 - \alpha$.
- Types: One-sided (e.g., $[g_l, \infty)$) and two-sided (finite interval).
3. **Statistical Hypothesis Testing**:
- **Null Hypothesis (H0)**: Statement to be tested (e.g., μ = 15).
- **Alternative Hypothesis (H1)**: Opposing statement (e.g., μ ≠ 15).
- **Test Statistic**: A function of the sample data to assess H0 vs. H1.
- **Rejection Region**: Values of the test statistic leading to rejection of H0.
- **Errors**:
- **Type I Error**: Rejecting a true H0 (controlled by significance level α).
- **Type II Error**: Failing to reject a false H0 (related to test power).
4. **Z-Test and T-Test**:
- **Z-Test**: Used when population variance is known or sample size is large.
- **T-Test**: Used when population variance is unknown; relies on sample standard deviation and t-distribution.
5. **Two-Sample T-Test**:
- Compares means of two independent groups.
- **Assumptions**: Normality, independence, and, for the pooled variant, equal variances (homogeneity).
- **Test Statistic**: Accounts for sample means, standard deviations, and sizes.
- **Degrees of Freedom**: Calculated using Welch-Satterthwaite equation for unequal variances.
6. **Examples**:
- **One-Sample Test**: Chocolate box weights using z-test (known variance).
- **Two-Sample Test**: Comparing bonbon weights using t-test with Welch-Satterthwaite adjustment.
7. **Key Concepts**:
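- **p-value**: Probability, assuming H0 is true, of observing a test statistic at least as extreme as the one actually observed; H0 is rejected if the p-value is below α.
- **Confidence Interval and Hypothesis Test Relationship**: A two-sided test at level α rejects H0 exactly when the hypothesized value (e.g., μ0) lies outside the (1 − α) confidence interval.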
8. **Future Topics**: Data preparation and decision trees.
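The two-sample test with unequal variances can be sketched as follows. The function name `welch_t` and both samples are hypothetical. The degrees of freedom follow the Welch-Satterthwaite approximation, and turning the statistic into a p-value would additionally require the t-distribution, which is outside the standard library.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n, m = len(x), len(y)
    vx = variance(x) / n     # S_X^2 / n, using the sample variance (n - 1 divisor)
    vy = variance(y) / m     # S_Y^2 / m
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df


# Hypothetical bonbon weights from two companies (grams).
company_a = [64.2, 64.8, 63.9, 65.1, 64.5, 64.3]
company_b = [65.4, 66.1, 65.0, 66.3, 65.8]

t_stat, df = welch_t(company_a, company_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The resulting degrees of freedom are generally not an integer and always lie between the smaller of `n - 1` and `m - 1` and the pooled value `n + m - 2`.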
![[lecture_12.pdf]]
1. **Machine Learning Overview**:
- Involves data preparation, model building, and evaluation.
- Follows the CRISP-DM process: Business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
2. **Data Preparation**:
- **Handling Missing Data**: Strategies include filtering, marking, or imputing missing values (e.g., mean, median, or model-based imputation).
- **Handling False Data**: Identify and correct errors through analysis or expert consultation.
- **Feature Engineering**: Create new features from existing data (e.g., deriving age from birthdate) or combine attributes for better model performance.
3. **Decision Trees**:
- **Structure**: A tree with nodes representing tests on attributes, leading to leaf nodes with class predictions.
- **Construction**: Built by recursively splitting data to minimize entropy (disorder). Information gain determines the best split.
- **Example**: Medicine recommendation based on blood pressure and age.
- **Pros and Cons**: Easy to interpret but may overfit or require discretization for numerical attributes.
4. **Model Evaluation**:
- **Train-Test Split**: Evaluate models on independent test data to avoid overfitting.
- **Metrics for Regression**: Mean Squared Error (MSE), Mean Absolute Error (MAE).
- **Metrics for Classification**: Confusion matrix, accuracy, precision, recall, F1 score.
- **Accuracy**: Overall correctness.
- **Precision**: Correct predictions among positive predictions.
- **Recall**: Correct predictions among actual positive instances.
- **F1 Score**: Harmonic mean of precision and recall.
5. **Advanced Topics**:
- **Ensemble Methods**: Random Forests and Gradient Boosting combine many decision trees; Random Forests mainly reduce variance, while Gradient Boosting mainly reduces bias.
- **Model Comparison**: Use validation sets to compare models and avoid overfitting.
6. **Summary**:
- **Skills Acquired**: Data preparation, decision tree classification, and model evaluation.
- **Future Topics**: Dashboards and summaries for data visualization and reporting.
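Entropy and information gain, which drive the split selection described under Decision Trees, can be sketched in Python. The tiny data set below loosely follows the medicine example (blood pressure and age) but is invented for illustration; the attribute names and function names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction achieved by splitting on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical patients and the medicine recommended for each.
rows = [
    {"blood_pressure": "high", "age": "young"},
    {"blood_pressure": "high", "age": "old"},
    {"blood_pressure": "normal", "age": "young"},
    {"blood_pressure": "normal", "age": "old"},
    {"blood_pressure": "low", "age": "young"},
    {"blood_pressure": "low", "age": "old"},
]
labels = ["A", "A", "A", "B", "B", "B"]

gains = {attr: information_gain(rows, labels, attr) for attr in ("blood_pressure", "age")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

A tree builder would split on `best_attribute` first and then repeat the same computation inside each resulting subset until the leaves are pure or a stopping rule applies.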
![[lecture_13.pdf]]
1. **Model Evaluation Basics**:
- **Overfitting**: When a model fits the training data too well, leading to poor performance on new data.
- **Underfitting**: When a model is too simple to capture the data's structure, resulting in poor performance on both training and test data.
2. **Data Splitting**:
- **Train-Test Split**: Commonly used to evaluate model performance. Typical splits are 80% for training and 20% for testing.
- **Train-Validation-Test Split**: Used to compare different models by having separate training, validation, and test sets.
3. **Evaluation Metrics**:
- **Regression Metrics**:
- **Mean Squared Error (MSE)**: Average of squared differences between predicted and actual values.
- **Mean Absolute Error (MAE)**: Average of absolute differences.
- **Mean Absolute Percentage Error (MAPE)**: Average of percentage differences.
- **Classification Metrics**:
- **Confusion Matrix**: Tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
- **Accuracy**: (TP + TN) / (TP + TN + FP + FN).
- **Precision**: TP / (TP + FP).
- **Recall**: TP / (TP + FN).
- **F1 Score**: Harmonic mean of precision and recall, given by $F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.
4. **Example Calculations**:
- **Confusion Matrix Example**:
- Accuracy: $\frac{4 + 1}{10} = 0.5$
- Precision: $\frac{4}{6} \approx 0.67$
- Recall: $\frac{4}{7} \approx 0.57$
- F1 Score: $\frac{2 \times 0.67 \times 0.57}{0.67 + 0.57} \approx 0.62$
5. **Interpreting Results**:
- **Baseline Comparison**: Always compare model performance against a naive reference (e.g., predicting the majority class).
- **Contextual Relevance**: Accuracy alone may not indicate practical usefulness; consider domain context and baseline performance.
6. **Summary**:
- **Skills Acquired**: Understanding of evaluation metrics and their application in assessing machine learning models.
- **Future Topics**: Advanced evaluation techniques and model optimization strategies.
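The classification metrics can be recomputed from the four confusion-matrix counts. The counts below (TP = 4, FP = 2, FN = 3, TN = 1) are inferred from the fractions in the example calculations, and the majority-class baseline is added here to illustrate the baseline comparison.

```python
# Confusion-matrix counts inferred from the example fractions above.
tp, fp, fn, tn = 4, 2, 3, 1
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Naive baseline: always predict the majority class.
actual_positives = tp + fn
actual_negatives = tn + fp
baseline_accuracy = max(actual_positives, actual_negatives) / total

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} baseline={baseline_accuracy:.2f}")
```

Here the naive baseline reaches a higher accuracy than the model, which shows why accuracy should not be read without a reference point.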