![[lecture_02.pdf]]

1. **Definition of Data:**
    - Data is information that is collected, stored, or processed. It is ubiquitous and can be measured or categorized.

2. **Data Basics:**
    - **Basic Population:** Entire group of interest (e.g., all students).
    - **Sample:** Subset of the population (e.g., students in a lecture).
    - **Statistical Unit:** Individual data point (e.g., one student).
    - **Variable:** Characteristic of a unit (e.g., name, population size).
    - **Value:** Specific value of a variable (e.g., "123456" for "MatrNr").

3. **Data Categories:** (see the sketch after this list)
    - **Structured vs. Unstructured:**
        - **Structured:** Organized data with a predefined format (e.g., tables).
        - **Unstructured:** No predefined format (e.g., text, images).
    - **Discrete vs. Continuous:**
        - **Discrete:** Countable values (e.g., grades).
        - **Continuous:** Any value within a range (e.g., temperature).
    - **Levels of Measurement:**
        - **Nominal:** Labels without order (e.g., colors).
        - **Ordinal:** Ordered labels (e.g., school grades).
        - **Interval:** Ordered with equal intervals (e.g., Celsius).
        - **Ratio:** Interval with a true zero (e.g., weight).
    - **Qualitative vs. Quantitative:**
        - **Qualitative:** Categorical (e.g., gender).
        - **Quantitative:** Numerical (e.g., height).

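A quick way to make the levels of measurement concrete is to encode them as pandas dtypes. A minimal sketch (column names and values are invented for illustration): nominal and ordinal variables become unordered and ordered categoricals, discrete counts become integers, continuous measurements become floats.

```python
import pandas as pd

# Invented example data illustrating the levels of measurement
df = pd.DataFrame({
    "eye_color": ["blue", "brown", "green"],   # nominal
    "school_grade": ["B", "A", "C"],           # ordinal
    "temperature_c": [36.5, 37.1, 36.8],       # continuous (interval)
    "exam_count": [3, 5, 2],                   # discrete (ratio)
})

# Nominal: categories without order
df["eye_color"] = pd.Categorical(df["eye_color"])

# Ordinal: categories with a defined order, so min/max/sorting are meaningful
df["school_grade"] = pd.Categorical(
    df["school_grade"], categories=["A", "B", "C"], ordered=True
)

print(df.dtypes)
print(df["school_grade"].min())  # "A", because A < B < C was declared
```
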
![[lecture_03.pdf]]

1. **Primary vs. Secondary Data:**
    - **Primary Data:** Collected directly for a specific purpose (e.g., surveys, experiments).
    - **Secondary Data:** Existing data from other sources (e.g., books, journals).

2. **Ways to Obtain Data:**
    - **Capturing Data:** Collecting through sensors, observations, or experiments.
    - **Retrieving Data:** Accessing databases, APIs, or open data sources.
    - **Collecting Data:** Scraping websites or logs when direct access isn't available.

3. **Databases:**
    - **Relational Databases:** Use SQL for structured data but have limitations with big data.
    - **NoSQL Databases:** Handle unstructured or semi-structured data, offering flexibility and scalability.
    - **Document-Oriented Databases:** Store data in formats like JSON, ideal for e-commerce and IoT.

4. **APIs:** (see the sketch after this list)
    - REST APIs enable communication between systems using HTTP methods (GET, POST, PUT, DELETE).
    - They often require authentication (e.g., API keys) and return data in JSON or XML format.

5. **Data Scraping:**
    - Extracting data from websites or logs when APIs aren't available.
    - Legal and ethical considerations must be addressed.

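A minimal sketch of retrieving JSON from a REST API with `requests`. The URL, the `X-API-Key` header name, and the query parameters are placeholders — real providers document their own conventions.

```python
import requests

# Placeholder endpoint and key header; substitute the provider's actual scheme
url = "https://api.example.com/v1/measurements"
headers = {"X-API-Key": "YOUR_API_KEY"}
params = {"city": "Berlin", "limit": 10}

response = requests.get(url, headers=headers, params=params, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors (4xx/5xx)

data = response.json()        # parse the JSON body into Python objects
print(type(data))
```
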
![[lecture_04.pdf]]

1. **Data Protection and Anonymization:** (see the hashing sketch after this list)
    - **GDPR Compliance:** Personal data must be protected, and its use requires consent.
    - **Anonymization:** Removing personal identifiers so individuals cannot be identified.
    - **Pseudonymization:** Replacing identifiers with non-unique ones; re-identification requires additional information.
    - **Hashing:** Converting data to fixed-size values (e.g., SHA-256) for privacy.

2. **Statistical Basics:**
    - **Descriptive Statistics:** Summarizes data (e.g., mean, median).
    - **Exploratory Data Analysis:** Identifies patterns and outliers.
    - **Inferential Statistics:** Draws conclusions about populations from samples.

3. **Frequencies and Histograms:**
    - **Frequencies:** Count of occurrences of each value.
    - **Absolute vs. Relative Frequencies:** Raw counts vs. proportions.
    - **Histograms:** Visual representation of a data distribution across classes.

4. **Empirical Distribution Function (EDF):**
    - Plots cumulative relative frequencies to show the data distribution over a range.

5. **Data Visualization:**
    - **Pie Charts:** Effective for showing proportions of categorical data.
    - **Bar Charts:** Compare frequencies across categories.
    - **Histograms:** Display the distribution of continuous data.

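Hashing an identifier with SHA-256 takes one line in the standard library. A minimal sketch (the matriculation number is invented); note that hashing alone is only pseudonymization when the input space is small enough to brute-force.

```python
import hashlib

matr_nr = "123456"  # invented example identifier

# SHA-256 maps the identifier to a fixed-size (32-byte) digest
digest = hashlib.sha256(matr_nr.encode("utf-8")).hexdigest()
print(digest)  # 64 hex characters; same input always yields the same output

# Because MatrNr has few digits, an attacker can hash every candidate and
# invert the mapping; a secret salt/pepper mitigates this.
salted = hashlib.sha256(("secret-pepper" + matr_nr).encode("utf-8")).hexdigest()
print(salted)
```
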
![[lecture_05.pdf]]

1. **Central Tendencies:**
    - **Mode:** The most frequently occurring value in a dataset.
    - **Median:** The middle value of the ordered data, dividing the dataset into two equal halves.
    - **Mean:** The average value, calculated by summing all observations and dividing by the number of observations.

2. **Statistical Dispersion:**
    - **Range:** The difference between the maximum and minimum values.
    - **Interquartile Range (IQR):** The difference between the third quartile (Q3) and the first quartile (Q1), covering the middle 50% of the data.
    - **Variance and Standard Deviation:** Measures of spread; the variance is the average squared deviation from the mean, and the standard deviation is its square root.

3. **Data Visualization:**
    - **Histograms:** Display the distribution of continuous data across classes.
    - **Box Plots:** Show the five-number summary (minimum, Q1, median, Q3, maximum) and identify outliers.

4. **Outliers:** (see the sketch after this list)
    - Defined as data points falling outside the range $[Q_1 - 1.5 \cdot \text{IQR},\ Q_3 + 1.5 \cdot \text{IQR}]$.
    - Can indicate errors, unusual observations, or novel data points.

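A minimal numpy sketch of the five-number summary and the 1.5·IQR outlier fences (the data vector is invented):

```python
import numpy as np

x = np.array([4.1, 4.5, 4.8, 5.0, 5.2, 5.3, 5.6, 9.9])  # invented data

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("five-number summary:", x.min(), q1, median, q3, x.max())
print("outliers:", x[(x < lower) | (x > upper)])  # 9.9 falls above the fence
```
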
![[lecture_06.pdf]]

1. **Empirical Variance Calculation:** (verified in the sketch after this list)
    - **Data:** Daily temperatures (°C): 11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9
    - **Mean (x̄):** 12.24
    - **Sum of squared deviations:** 14.16
    - **Empirical Variance (s̃²):** $\frac{14.16}{7} = 2.02$
    - **Empirical Standard Deviation (s̃):** $\sqrt{2.02} \approx 1.42$

2. **Contingency Table Analysis:**
    - **Variables:** Growth (shrinking, growing, growing strongly) and Location (North, South)
    - **Absolute Frequencies:**
        - North: 29 (growing), 13 (growing strongly), 7 (shrinking)
        - South: 13 (growing), 19 (growing strongly), 2 (shrinking)
    - **Chi-squared (χ²) Test:**
        - Calculated χ²: 7.53
    - **Corrected Pearson Contingency Coefficient ($K^*$):**
        - $K^* = \sqrt{\frac{7.53}{7.53 + 83}} \cdot \sqrt{\frac{2}{2-1}} \approx 0.41$ (correction factor $\sqrt{m/(m-1)}$ with $m = \min(\text{rows}, \text{cols}) = 2$)
    - Interpretation: Weak to medium correlation between location and growth.

3. **Correlation Coefficient:**
    - **Pearson Correlation Coefficient ($r_{XY}$)** for population and area:
        - Calculated $r_{XY}$: 0.70
        - Interpretation: Strong positive correlation.

4. **Key Concepts:**
    - **Empirical Variance:** Measures data spread around the mean.
    - **Contingency Tables:** Used for nominal/ordinal data to assess associations.
    - **Pearson Correlation:** Measures linear correlation between metric variables, ranging from -1 to 1.

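A short check of the variance example with numpy and of the χ² statistic with scipy; the values should match the bullets above up to rounding.

```python
import numpy as np
from scipy.stats import chi2_contingency

temps = np.array([11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9])
mean = temps.mean()                                   # ≈ 12.24
var_emp = ((temps - mean) ** 2).sum() / len(temps)    # divide by n, not n-1
print(mean, var_emp, np.sqrt(var_emp))                # ≈ 12.24, 2.02, 1.42

# Contingency table: rows = North/South; columns = growing, strongly, shrinking
table = np.array([[29, 13, 7],
                  [13, 19, 2]])
chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()                                       # 83
k = np.sqrt(chi2 / (chi2 + n))                        # Pearson coefficient
k_corrected = k * np.sqrt(2 / (2 - 1))                # m = min(#rows, #cols) = 2
print(round(chi2, 2), round(k_corrected, 2))          # ≈ 7.53, 0.41
```
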
![[lecture_07.pdf]]

1. **Probability Theory Basics:**
    - **Sample Space (Ω):** Set of all possible outcomes.
    - **Event:** Subset of the sample space.
    - **Probability Axioms** (Kolmogorov):
        1. $P(A) \geq 0$
        2. $P(\Omega) = 1$
        3. Additivity for disjoint events.

2. **Conditional Probability:**
    - **Definition:** $P(A|B) = \frac{P(A \cap B)}{P(B)}$
    - **Independence:** Events A and B are independent if $P(A \cap B) = P(A)P(B)$.

3. **Bayes' Theorem:**
    - **Formula:** $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$
    - **Example:**
        - $P(B|A) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.02 \times 0.99} \approx 0.31$

4. **Combinatorics:**
    - **Permutations:** $n!$ for unique items.
    - **Combinations:** $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
    - **Multiplication Rule:** $n_1 \times n_2 \times \dots \times n_r$

5. **Key Calculations:** (checked in the sketch after this list)
    - **Password Example:** $7^2 \times P(10,4) = 49 \times 5040 = 246960$
    - **Dice Probability:** $P(\text{at least one 6 in 4 throws}) = 1 - (5/6)^4 \approx 0.518$

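The Bayes and combinatorics examples can be re-derived in a few lines; the numbers are taken from the bullets above, reading 0.02 as $P(A|\bar{B})$ from the denominator.

```python
from math import comb, perm

# Bayes: P(A|B) = 0.9, P(B) = 0.01, P(A|not B) = 0.02
p_a_given_b, p_b = 0.9, 0.01
p_a = p_a_given_b * p_b + 0.02 * (1 - p_b)   # law of total probability
print(round(p_a_given_b * p_b / p_a, 2))      # 0.31

# Password: two symbols from 7 choices, then 4 distinct ordered digits
print(7**2 * perm(10, 4))                     # 49 * 5040 = 246960

# Dice: at least one six in four throws, via the complement rule
print(round(1 - (5 / 6) ** 4, 3))             # 0.518

print(comb(5, 2))                             # binomial coefficient: C(5,2) = 10
```
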
![[lecture_08_neu.pdf]]

1. **Random Variables:**
    - **Definition:** A function that assigns numerical values to outcomes in a sample space.
    - **Discrete vs. Continuous:**
        - Discrete: Countable outcomes (e.g., a dice roll).
        - Continuous: Uncountable outcomes (e.g., a height measurement).

2. **Probability Distributions:**
    - **Discrete:**
        - **Bernoulli:** $f(x) = p^x(1-p)^{1-x}$ for $x \in \{0,1\}$.
        - **Binomial:** $f(x) = \binom{n}{x}p^x(1-p)^{n-x}$ for $x \in \{0,1,...,n\}$.
        - **Uniform:** $f(x) = \frac{1}{m}$ for $x \in \{1,2,...,m\}$.
    - **Continuous:**
        - **Uniform:** $f(x) = \frac{1}{b-a}$ for $x \in [a,b]$.
        - **Normal (Gaussian):** $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)}$.

3. **Expected Value and Variance:**
    - **Discrete:**
        - **Expected Value:** $E(X) = \sum x \cdot f(x)$.
        - **Variance:** $Var(X) = E((X-E(X))^2)$.
    - **Continuous:**
        - **Expected Value:** $E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$.
        - **Variance:** $Var(X) = E((X-E(X))^2)$.

4. **Key Calculations:**
    - **Bernoulli Distribution:** $E(X) = p$, $Var(X) = p(1-p)$.
    - **Binomial Distribution:** $E(X) = np$, $Var(X) = np(1-p)$.
    - **Uniform Distribution (Discrete):** $E(X) = \frac{m+1}{2}$, $Var(X) = \frac{m^2-1}{12}$.
    - **Uniform Distribution (Continuous):** $E(X) = \frac{a+b}{2}$, $Var(X) = \frac{(b-a)^2}{12}$.
    - **Normal Distribution:** $E(X) = \mu$, $Var(X) = \sigma^2$.

5. **Normal Distribution:** (see the sketch after this list)
    - **Standard Normal (Z-Score):** $Z = \frac{X-\mu}{\sigma} \sim N(0,1)$.
    - **68-95-99.7 Rule:** 68% of the data lie within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, 99.7% within $\mu \pm 3\sigma$.

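The moment formulas and the 68-95-99.7 rule can be sanity-checked with `scipy.stats`; the parameter values here are invented.

```python
from scipy.stats import binom, norm, randint, uniform

n, p = 10, 0.3
print(binom.mean(n, p), binom.var(n, p))   # np = 3.0, np(1-p) = 2.1

m = 6  # discrete uniform on {1, ..., 6}; randint's upper bound is exclusive
print(randint.mean(1, m + 1), randint.var(1, m + 1))  # (m+1)/2 = 3.5, (m²-1)/12

a, b = 2.0, 8.0  # continuous uniform on [a, b]
print(uniform.mean(loc=a, scale=b - a),    # (a+b)/2 = 5.0
      uniform.var(loc=a, scale=b - a))     # (b-a)²/12 = 3.0

# 68-95-99.7 rule for N(mu, sigma²)
mu, sigma = 0, 1
for k in (1, 2, 3):
    prob = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(k, round(prob, 4))               # ≈ 0.6827, 0.9545, 0.9973
```
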
![[lecture_09.pdf]]

1. **Simple Linear Regression:**
    - **Model:** $Y = \beta_0 + \beta_1 X + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.
    - **Estimation:** The parameters $\beta_0$ and $\beta_1$ are estimated by the least squares method.

2. **Least Squares Estimators:** (implemented in the sketch after this list)
    - **Slope (β̂₁):**
      $$
      \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}
      $$
    - **Intercept (β̂₀):**
      $$
      \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
      $$

3. **Residual Analysis:**
    - **Residuals:** $e_i = Y_i - \hat{Y}_i$.
    - **Residual Plot:** Used to check model assumptions (linearity, constant variance, normality).

4. **Coefficient of Determination (R²):**
    - **Formula:**
      $$
      R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}
      $$
    - **Interpretation:** Measures goodness of fit ($0 \leq R^2 \leq 1$), with ESS the explained and RSS the residual sum of squares.

5. **Variance Estimation:**
    - **Residual Sum of Squares (RSS):**
      $$
      RSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2
      $$
    - **Variance Estimator:**
      $$
      \hat{\sigma}^2 = \frac{RSS}{n-2}
      $$

6. **Key Calculations:**
    - **Example Calculations:**
        - For given data, compute $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$.
        - Calculate R² to assess model fit.

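A minimal numpy implementation of the estimators above, run on invented data:

```python
import numpy as np

# Invented data: x = predictor, y = response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# Least squares estimators (formulas from item 2)
beta1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x
rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares

r_squared = 1 - rss / tss
sigma2_hat = rss / (n - 2)            # variance estimator
print(beta0, beta1, r_squared, sigma2_hat)
```
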
![[lecture_10.pdf]]

1. **Confidence Intervals:**
    - **Point Estimator:** Provides a single estimate of a population parameter.
    - **Interval Estimator:** Provides a range of values within which the parameter is expected to lie.
    - **Formula for the Mean (σ known):**
      $$
      \left[ \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right]
      $$
    - **Formula for the Mean (σ unknown):**
      $$
      \left[ \bar{X} - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}} \right]
      $$

2. **Statistical Tests:** (see the sketch after this list)
    - **Hypothesis Testing:** Set up a null hypothesis (H₀) and an alternative hypothesis (H₁), then decide whether to reject H₀ based on the sample data.
    - **Z-Test:** Used when the population variance is known.
        - **Test Statistic:** $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
    - **T-Test:** Used when the population variance is unknown.
        - **Test Statistic:** $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
    - **Two-Sample T-Test:** Compares the means of two independent groups.
        - **Test Statistic:** $T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}$

3. **Key Calculations:**
    - **Example 1:** 95% confidence interval for bonbon package weights.
        - **Result:** [63.84, 65.08]
    - **Example 2:** Testing machine adjustment with a t-test.
        - **Result:** Null hypothesis not rejected; the machine does not need adjustment.
    - **Example 3:** Two-sample t-test for bonbon weights.
        - **Result:** Reject H₀; the second company's bonbons are heavier.

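A sketch of the t-based confidence interval and a one-sample t-test with scipy. The sample values are invented, so the numbers differ from the lecture's bonbon example.

```python
import numpy as np
from scipy import stats

x = np.array([64.2, 65.1, 63.8, 64.9, 64.4, 65.3, 63.9, 64.7])  # invented weights

# 95% t confidence interval for the mean (sigma unknown)
mean, sem = x.mean(), stats.sem(x)   # sem = S / sqrt(n)
lo, hi = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")

# One-sample t-test of H0: mu = 64 against H1: mu != 64
t_stat, p_value = stats.ttest_1samp(x, popmean=64.0)
print(t_stat, p_value)               # reject H0 if p_value < alpha
```
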
![[lecture_11.pdf]]

1. **Point Estimation:**
    - A point estimator (e.g., the sample mean) estimates the parameter θ from a random sample.
    - Example: For a normal distribution, the arithmetic mean estimates the expected value.

2. **Confidence Intervals:**
    - A range [gl, gu] in which θ lies with probability 1−α.
    - Types: One-sided (e.g., [gl, ∞)) and two-sided (finite interval).

3. **Statistical Hypothesis Testing:**
    - **Null Hypothesis (H₀):** Statement to be tested (e.g., μ = 15).
    - **Alternative Hypothesis (H₁):** Opposing statement (e.g., μ ≠ 15).
    - **Test Statistic:** A function of the sample data used to assess H₀ vs. H₁.
    - **Rejection Region:** Values of the test statistic that lead to rejection of H₀.
    - **Errors:**
        - **Type I Error:** Rejecting a true H₀ (controlled by the significance level α).
        - **Type II Error:** Failing to reject a false H₀ (related to test power).

4. **Z-Test and T-Test:**
    - **Z-Test:** Used when the population variance is known or the sample size is large.
    - **T-Test:** Used when the population variance is unknown; relies on the sample standard deviation and the t-distribution.

5. **Two-Sample T-Test:** (see the sketch after this list)
    - Compares the means of two independent groups.
    - **Assumptions:** Normality, independence, homogeneity of variances (if variances are assumed equal).
    - **Test Statistic:** Accounts for the sample means, standard deviations, and sizes.
    - **Degrees of Freedom:** Calculated with the Welch-Satterthwaite equation for unequal variances.

6. **Examples:**
    - **One-Sample Test:** Chocolate box weights using a z-test (known variance).
    - **Two-Sample Test:** Comparing bonbon weights using a t-test with the Welch-Satterthwaite adjustment.

7. **Key Concepts:**
    - **p-value:** Probability, under H₀, of a test statistic at least as extreme as the one observed; compared against α.
    - **Confidence Interval and Hypothesis Test Relationship:** H₀ is rejected exactly when the hypothesized parameter value lies outside the confidence interval.

8. **Future Topics:** Data preparation and decision trees.

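A sketch of the Welch two-sample test: the statistic and the Welch-Satterthwaite degrees of freedom computed by hand, then compared with `scipy.stats.ttest_ind` (the samples are invented).

```python
import numpy as np
from scipy import stats

x = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2])  # invented sample 1
y = np.array([10.6, 10.9, 10.5, 11.0, 10.7])      # invented sample 2
n, m = len(x), len(y)
vx, vy = x.var(ddof=1), y.var(ddof=1)             # sample variances S²

# Welch test statistic
t = (x.mean() - y.mean()) / np.sqrt(vx / n + vy / m)

# Welch-Satterthwaite degrees of freedom
df = (vx / n + vy / m) ** 2 / (
    (vx / n) ** 2 / (n - 1) + (vy / m) ** 2 / (m - 1)
)
p = 2 * stats.t.sf(abs(t), df)                    # two-sided p-value

print(t, df, p)
print(stats.ttest_ind(x, y, equal_var=False))     # should agree
```
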
![[lecture_12.pdf]]

1. **Machine Learning Overview:**
    - Involves data preparation, model building, and evaluation.
    - Follows the CRISP-DM process: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

2. **Data Preparation:**
    - **Handling Missing Data:** Strategies include filtering, marking, or imputing missing values (e.g., mean, median, or model-based imputation).
    - **Handling False Data:** Identify and correct errors through analysis or expert consultation.
    - **Feature Engineering:** Create new features from existing data (e.g., deriving age from birthdate) or combine attributes for better model performance.

3. **Decision Trees:** (see the sketch after this list)
    - **Structure:** A tree whose inner nodes test attributes, leading to leaf nodes with class predictions.
    - **Construction:** Built by recursively splitting the data to minimize entropy (disorder). Information gain determines the best split.
    - **Example:** Medicine recommendation based on blood pressure and age.
    - **Pros and Cons:** Easy to interpret but may overfit or require discretization of numerical attributes.

4. **Model Evaluation:**
    - **Train-Test Split:** Evaluate models on independent test data to avoid overfitting.
    - **Metrics for Regression:** Mean Squared Error (MSE), Mean Absolute Error (MAE).
    - **Metrics for Classification:** Confusion matrix, accuracy, precision, recall, F1 score.
        - **Accuracy:** Overall correctness.
        - **Precision:** Correct predictions among positive predictions.
        - **Recall:** Correct predictions among actual positive instances.
        - **F1 Score:** Harmonic mean of precision and recall.

5. **Advanced Topics:**
    - **Ensemble Methods:** Random Forests and Gradient Boosting improve on decision trees by reducing variance and bias.
    - **Model Comparison:** Use validation sets to compare models and avoid overfitting.

6. **Summary:**
    - **Skills Acquired:** Data preparation, decision tree classification, and model evaluation.
    - **Future Topics:** Dashboards and summaries for data visualization and reporting.

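A minimal scikit-learn sketch of a decision tree with a train-test split. The tiny age/blood-pressure dataset is invented to echo the lecture's medicine example; `criterion="entropy"` selects splits by information gain.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented data: [age, systolic blood pressure] -> medicine A (0) or B (1)
X = np.array([[25, 110], [40, 130], [62, 150], [35, 120],
              [70, 160], [50, 140], [28, 115], [66, 155]])
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Entropy-based splits correspond to choosing by information gain
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X_train, y_train)

print(export_text(clf, feature_names=["age", "blood_pressure"]))
print("test accuracy:", clf.score(X_test, y_test))
```
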
![[lecture_13.pdf]]

1. **Model Evaluation Basics:**
    - **Overfitting:** The model fits the training data too closely, leading to poor performance on new data.
    - **Underfitting:** The model is too simple to capture the data's structure, resulting in poor performance on both training and test data.

2. **Data Splitting:**
    - **Train-Test Split:** Commonly used to evaluate model performance. A typical split is 80% for training and 20% for testing.
    - **Train-Validation-Test Split:** Used to compare different models by keeping separate training, validation, and test sets.

3. **Evaluation Metrics:**
    - **Regression Metrics:**
        - **Mean Squared Error (MSE):** Average of the squared differences between predicted and actual values.
        - **Mean Absolute Error (MAE):** Average of the absolute differences.
        - **Mean Absolute Percentage Error (MAPE):** Average of the percentage differences.
    - **Classification Metrics:**
        - **Confusion Matrix:** Tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
        - **Accuracy:** (TP + TN) / (TP + TN + FP + FN).
        - **Precision:** TP / (TP + FP).
        - **Recall:** TP / (TP + FN).
        - **F1 Score:** Harmonic mean of precision and recall: $F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.

4. **Example Calculations:** (reproduced in the sketch after this list)
    - **Confusion Matrix Example** (TP = 4, TN = 1, FP = 2, FN = 3):
        - Accuracy: $\frac{4 + 1}{10} = 0.5$
        - Precision: $\frac{4}{6} \approx 0.66$
        - Recall: $\frac{4}{7} \approx 0.57$
        - F1 Score: $\frac{2 \times 0.66 \times 0.57}{0.66 + 0.57} \approx 0.61$

5. **Interpreting Results:**
    - **Baseline Comparison:** Always compare model performance against a naive reference (e.g., predicting the majority class).
    - **Contextual Relevance:** Accuracy alone may not indicate practical usefulness; consider the domain context and baseline performance.

6. **Summary:**
    - **Skills Acquired:** Understanding of evaluation metrics and their application in assessing machine learning models.
    - **Future Topics:** Advanced evaluation techniques and model optimization strategies.

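The example metrics can be recomputed from the counts implied by the fractions above (TP = 4, TN = 1, FP = 2, FN = 3):

```python
tp, tn, fp, fn = 4, 1, 2, 3  # counts implied by the worked example

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.5, ≈0.667, ≈0.571, ≈0.615
```
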