![[lecture_02.pdf]]

1. **Definition of Data:**
    - Data is information that is collected, stored, or processed. It is ubiquitous and can be measured or categorized.

2. **Data Basics:**
    - **Basic Population:** Entire group of interest (e.g., all students).
    - **Sample:** Subset of the population (e.g., students in a lecture).
    - **Statistical Unit:** Individual data point (e.g., one student).
    - **Variable:** Characteristic of a unit (e.g., name, population size).
    - **Value:** Specific value of a variable (e.g., "123456" for "MatrNr").

3. **Data Categories:** (see the sketch after this list)
    - **Structured vs. Unstructured:**
        - **Structured:** Organized data with a predefined format (e.g., tables).
        - **Unstructured:** No predefined format (e.g., text, images).
    - **Discrete vs. Continuous:**
        - **Discrete:** Countable values (e.g., grades).
        - **Continuous:** Any value within a range (e.g., temperature).
    - **Levels of Measurement:**
        - **Nominal:** Labels without order (e.g., colors).
        - **Ordinal:** Ordered labels (e.g., school grades).
        - **Interval:** Ordered with equal intervals (e.g., Celsius).
        - **Ratio:** Interval with a true zero (e.g., weight).
    - **Qualitative vs. Quantitative:**
        - **Qualitative:** Categorical (e.g., gender).
        - **Quantitative:** Numerical (e.g., height).

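A quick way to make the levels of measurement concrete is to encode them as pandas dtypes. A minimal sketch (column names and values are invented for illustration): nominal and ordinal variables become unordered and ordered categoricals, discrete counts become integers, continuous measurements become floats.

```python
import pandas as pd

# Invented example data illustrating the levels of measurement
df = pd.DataFrame({
    "eye_color": ["blue", "brown", "green"],   # nominal
    "school_grade": ["B", "A", "C"],           # ordinal
    "temperature_c": [36.5, 37.1, 36.8],       # continuous (interval)
    "exam_count": [3, 5, 2],                   # discrete (ratio)
})

# Nominal: categories without order
df["eye_color"] = pd.Categorical(df["eye_color"])

# Ordinal: categories with a defined order, so min/max/sorting are meaningful
df["school_grade"] = pd.Categorical(
    df["school_grade"], categories=["A", "B", "C"], ordered=True
)

print(df.dtypes)
print(df["school_grade"].min())  # "A", because A < B < C was declared
```
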
![[lecture_03.pdf]]

1. **Primary vs. Secondary Data:**
    - **Primary Data:** Collected directly for a specific purpose (e.g., surveys, experiments).
    - **Secondary Data:** Existing data from other sources (e.g., books, journals).

2. **Ways to Obtain Data:**
    - **Capturing Data:** Collecting through sensors, observations, or experiments.
    - **Retrieving Data:** Accessing databases, APIs, or open data sources.
    - **Collecting Data:** Scraping websites or logs when direct access isn't available.

3. **Databases:**
    - **Relational Databases:** Use SQL for structured data but have limitations with big data.
    - **NoSQL Databases:** Handle unstructured or semi-structured data, offering flexibility and scalability.
    - **Document-Oriented Databases:** Store data in formats like JSON, ideal for e-commerce and IoT.

4. **APIs:** (see the sketch after this list)
    - REST APIs enable communication between systems using HTTP methods (GET, POST, PUT, DELETE).
    - They often require authentication (e.g., API keys) and return data in JSON or XML format.

5. **Data Scraping:**
    - Extracting data from websites or logs when APIs aren't available.
    - Legal and ethical considerations must be addressed.

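A minimal sketch of retrieving JSON from a REST API with `requests`. The URL, the `X-API-Key` header name, and the query parameters are placeholders — real providers document their own conventions.

```python
import requests

# Placeholder endpoint and key header; substitute the provider's actual scheme
url = "https://api.example.com/v1/measurements"
headers = {"X-API-Key": "YOUR_API_KEY"}
params = {"city": "Berlin", "limit": 10}

response = requests.get(url, headers=headers, params=params, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors (4xx/5xx)

data = response.json()        # parse the JSON body into Python objects
print(type(data))
```
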
![[lecture_04.pdf]]

1. **Data Protection and Anonymization:** (see the hashing sketch after this list)
    - **GDPR Compliance:** Personal data must be protected, and its use requires consent.
    - **Anonymization:** Removing personal identifiers so individuals cannot be identified.
    - **Pseudonymization:** Replacing identifiers with non-unique ones; re-identification requires additional information.
    - **Hashing:** Converting data to fixed-size values (e.g., SHA-256) for privacy.

2. **Statistical Basics:**
    - **Descriptive Statistics:** Summarizes data (e.g., mean, median).
    - **Exploratory Data Analysis:** Identifies patterns and outliers.
    - **Inferential Statistics:** Draws conclusions about populations from samples.

3. **Frequencies and Histograms:**
    - **Frequencies:** Count of occurrences of each value.
    - **Absolute vs. Relative Frequencies:** Raw counts vs. proportions.
    - **Histograms:** Visual representation of a data distribution across classes.

4. **Empirical Distribution Function (EDF):**
    - Plots cumulative relative frequencies to show the data distribution over a range.

5. **Data Visualization:**
    - **Pie Charts:** Effective for showing proportions of categorical data.
    - **Bar Charts:** Compare frequencies across categories.
    - **Histograms:** Display the distribution of continuous data.

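Hashing an identifier with SHA-256 takes one line in the standard library. A minimal sketch (the matriculation number is invented); note that hashing alone is only pseudonymization when the input space is small enough to brute-force.

```python
import hashlib

matr_nr = "123456"  # invented example identifier

# SHA-256 maps the identifier to a fixed-size (32-byte) digest
digest = hashlib.sha256(matr_nr.encode("utf-8")).hexdigest()
print(digest)  # 64 hex characters; same input always yields the same output

# Because MatrNr has few digits, an attacker can hash every candidate and
# invert the mapping; a secret salt/pepper mitigates this.
salted = hashlib.sha256(("secret-pepper" + matr_nr).encode("utf-8")).hexdigest()
print(salted)
```
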
![[lecture_05.pdf]]

1. **Central Tendencies:**
    - **Mode:** The most frequently occurring value in a dataset.
    - **Median:** The middle value of the ordered data, dividing the dataset into two equal halves.
    - **Mean:** The average value, calculated by summing all observations and dividing by the number of observations.

2. **Statistical Dispersion:**
    - **Range:** The difference between the maximum and minimum values.
    - **Interquartile Range (IQR):** The difference between the third quartile (Q3) and the first quartile (Q1), covering the middle 50% of the data.
    - **Variance and Standard Deviation:** Measures of spread; the variance is the average squared deviation from the mean, and the standard deviation is its square root.

3. **Data Visualization:**
    - **Histograms:** Display the distribution of continuous data across classes.
    - **Box Plots:** Show the five-number summary (minimum, Q1, median, Q3, maximum) and identify outliers.

4. **Outliers:** (see the sketch after this list)
    - Defined as data points falling outside the range $[Q_1 - 1.5 \cdot \text{IQR},\ Q_3 + 1.5 \cdot \text{IQR}]$.
    - Can indicate errors, unusual observations, or novel data points.

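A minimal numpy sketch of the five-number summary and the 1.5·IQR outlier fences (the data vector is invented):

```python
import numpy as np

x = np.array([4.1, 4.5, 4.8, 5.0, 5.2, 5.3, 5.6, 9.9])  # invented data

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("five-number summary:", x.min(), q1, median, q3, x.max())
print("outliers:", x[(x < lower) | (x > upper)])  # 9.9 falls above the fence
```
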
![[lecture_06.pdf]]

1. **Empirical Variance Calculation:** (verified in the sketch after this list)
    - **Data:** Daily temperatures (°C): 11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9
    - **Mean (x̄):** 12.24
    - **Sum of squared deviations:** 14.16
    - **Empirical Variance (s̃²):** $\frac{14.16}{7} = 2.02$
    - **Empirical Standard Deviation (s̃):** $\sqrt{2.02} \approx 1.42$

2. **Contingency Table Analysis:**
    - **Variables:** Growth (shrinking, growing, growing strongly) and Location (North, South)
    - **Absolute Frequencies:**
        - North: 29 (growing), 13 (growing strongly), 7 (shrinking)
        - South: 13 (growing), 19 (growing strongly), 2 (shrinking)
    - **Chi-squared (χ²) Test:**
        - Calculated χ²: 7.53
    - **Corrected Pearson Contingency Coefficient ($K^*$):**
        - $K^* = \sqrt{\frac{7.53}{7.53 + 83}} \cdot \sqrt{\frac{2}{2-1}} \approx 0.41$ (correction factor $\sqrt{m/(m-1)}$ with $m = \min(\text{rows}, \text{cols}) = 2$)
    - Interpretation: Weak to medium correlation between location and growth.

3. **Correlation Coefficient:**
    - **Pearson Correlation Coefficient ($r_{XY}$)** for population and area:
        - Calculated $r_{XY}$: 0.70
        - Interpretation: Strong positive correlation.

4. **Key Concepts:**
    - **Empirical Variance:** Measures data spread around the mean.
    - **Contingency Tables:** Used for nominal/ordinal data to assess associations.
    - **Pearson Correlation:** Measures linear correlation between metric variables, ranging from -1 to 1.

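A short check of the variance example with numpy and of the χ² statistic with scipy; the values should match the bullets above up to rounding.

```python
import numpy as np
from scipy.stats import chi2_contingency

temps = np.array([11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9])
mean = temps.mean()                                   # ≈ 12.24
var_emp = ((temps - mean) ** 2).sum() / len(temps)    # divide by n, not n-1
print(mean, var_emp, np.sqrt(var_emp))                # ≈ 12.24, 2.02, 1.42

# Contingency table: rows = North/South; columns = growing, strongly, shrinking
table = np.array([[29, 13, 7],
                  [13, 19, 2]])
chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()                                       # 83
k = np.sqrt(chi2 / (chi2 + n))                        # Pearson coefficient
k_corrected = k * np.sqrt(2 / (2 - 1))                # m = min(#rows, #cols) = 2
print(round(chi2, 2), round(k_corrected, 2))          # ≈ 7.53, 0.41
```
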
![[lecture_07.pdf]]

1. **Probability Theory Basics:**
    - **Sample Space (Ω):** Set of all possible outcomes.
    - **Event:** Subset of the sample space.
    - **Probability Axioms** (Kolmogorov):
        1. $P(A) \geq 0$
        2. $P(\Omega) = 1$
        3. Additivity for disjoint events.

2. **Conditional Probability:**
    - **Definition:** $P(A|B) = \frac{P(A \cap B)}{P(B)}$
    - **Independence:** Events A and B are independent if $P(A \cap B) = P(A)P(B)$.

3. **Bayes' Theorem:**
    - **Formula:** $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$
    - **Example:**
        - $P(B|A) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.02 \times 0.99} \approx 0.31$

4. **Combinatorics:**
    - **Permutations:** $n!$ for unique items.
    - **Combinations:** $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
    - **Multiplication Rule:** $n_1 \times n_2 \times \dots \times n_r$

5. **Key Calculations:** (checked in the sketch after this list)
    - **Password Example:** $7^2 \times P(10,4) = 49 \times 5040 = 246960$
    - **Dice Probability:** $P(\text{at least one 6 in 4 throws}) = 1 - (5/6)^4 \approx 0.518$

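The Bayes and combinatorics examples can be re-derived in a few lines; the numbers are taken from the bullets above, reading 0.02 as $P(A|\bar{B})$ from the denominator.

```python
from math import comb, perm

# Bayes: P(A|B) = 0.9, P(B) = 0.01, P(A|not B) = 0.02
p_a_given_b, p_b = 0.9, 0.01
p_a = p_a_given_b * p_b + 0.02 * (1 - p_b)   # law of total probability
print(round(p_a_given_b * p_b / p_a, 2))      # 0.31

# Password: two symbols from 7 choices, then 4 distinct ordered digits
print(7**2 * perm(10, 4))                     # 49 * 5040 = 246960

# Dice: at least one six in four throws, via the complement rule
print(round(1 - (5 / 6) ** 4, 3))             # 0.518

print(comb(5, 2))                             # binomial coefficient: C(5,2) = 10
```
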
![[lecture_08_neu.pdf]]

1. **Random Variables:**
    - **Definition:** A function that assigns numerical values to outcomes in a sample space.
    - **Discrete vs. Continuous:**
        - Discrete: Countable outcomes (e.g., a dice roll).
        - Continuous: Uncountable outcomes (e.g., a height measurement).

2. **Probability Distributions:**
    - **Discrete:**
        - **Bernoulli:** $f(x) = p^x(1-p)^{1-x}$ for $x \in \{0,1\}$.
        - **Binomial:** $f(x) = \binom{n}{x}p^x(1-p)^{n-x}$ for $x \in \{0,1,...,n\}$.
        - **Uniform:** $f(x) = \frac{1}{m}$ for $x \in \{1,2,...,m\}$.
    - **Continuous:**
        - **Uniform:** $f(x) = \frac{1}{b-a}$ for $x \in [a,b]$.
        - **Normal (Gaussian):** $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)}$.

3. **Expected Value and Variance:**
    - **Discrete:**
        - **Expected Value:** $E(X) = \sum x \cdot f(x)$.
        - **Variance:** $Var(X) = E((X-E(X))^2)$.
    - **Continuous:**
        - **Expected Value:** $E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$.
        - **Variance:** $Var(X) = E((X-E(X))^2)$.

4. **Key Calculations:**
    - **Bernoulli Distribution:** $E(X) = p$, $Var(X) = p(1-p)$.
    - **Binomial Distribution:** $E(X) = np$, $Var(X) = np(1-p)$.
    - **Uniform Distribution (Discrete):** $E(X) = \frac{m+1}{2}$, $Var(X) = \frac{m^2-1}{12}$.
    - **Uniform Distribution (Continuous):** $E(X) = \frac{a+b}{2}$, $Var(X) = \frac{(b-a)^2}{12}$.
    - **Normal Distribution:** $E(X) = \mu$, $Var(X) = \sigma^2$.

5. **Normal Distribution:** (see the sketch after this list)
    - **Standard Normal (Z-Score):** $Z = \frac{X-\mu}{\sigma} \sim N(0,1)$.
    - **68-95-99.7 Rule:** 68% of the data lie within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, 99.7% within $\mu \pm 3\sigma$.

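The moment formulas and the 68-95-99.7 rule can be sanity-checked with `scipy.stats`; the parameter values here are invented.

```python
from scipy.stats import binom, norm, randint, uniform

n, p = 10, 0.3
print(binom.mean(n, p), binom.var(n, p))   # np = 3.0, np(1-p) = 2.1

m = 6  # discrete uniform on {1, ..., 6}; randint's upper bound is exclusive
print(randint.mean(1, m + 1), randint.var(1, m + 1))  # (m+1)/2 = 3.5, (m²-1)/12

a, b = 2.0, 8.0  # continuous uniform on [a, b]
print(uniform.mean(loc=a, scale=b - a),    # (a+b)/2 = 5.0
      uniform.var(loc=a, scale=b - a))     # (b-a)²/12 = 3.0

# 68-95-99.7 rule for N(mu, sigma²)
mu, sigma = 0, 1
for k in (1, 2, 3):
    prob = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(k, round(prob, 4))               # ≈ 0.6827, 0.9545, 0.9973
```
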
![[lecture_09.pdf]]

1. **Simple Linear Regression:**
    - **Model:** $Y = \beta_0 + \beta_1 X + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.
    - **Estimation:** The parameters $\beta_0$ and $\beta_1$ are estimated by the least squares method.

2. **Least Squares Estimators:** (implemented in the sketch after this list)
    - **Slope (β̂₁):**
      $$
      \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}
      $$
    - **Intercept (β̂₀):**
      $$
      \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
      $$

3. **Residual Analysis:**
    - **Residuals:** $e_i = Y_i - \hat{Y}_i$.
    - **Residual Plot:** Used to check model assumptions (linearity, constant variance, normality).

4. **Coefficient of Determination (R²):**
    - **Formula:**
      $$
      R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}
      $$
    - **Interpretation:** Measures goodness of fit ($0 \leq R^2 \leq 1$), with ESS the explained and RSS the residual sum of squares.

5. **Variance Estimation:**
    - **Residual Sum of Squares (RSS):**
      $$
      RSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2
      $$
    - **Variance Estimator:**
      $$
      \hat{\sigma}^2 = \frac{RSS}{n-2}
      $$

6. **Key Calculations:**
    - **Example Calculations:**
        - For given data, compute $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$.
        - Calculate R² to assess model fit.

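A minimal numpy implementation of the estimators above, run on invented data:

```python
import numpy as np

# Invented data: x = predictor, y = response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# Least squares estimators (formulas from item 2)
beta1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x
rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares

r_squared = 1 - rss / tss
sigma2_hat = rss / (n - 2)            # variance estimator
print(beta0, beta1, r_squared, sigma2_hat)
```
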
![[lecture_10.pdf]]

1. **Confidence Intervals:**
    - **Point Estimator:** Provides a single estimate of a population parameter.
    - **Interval Estimator:** Provides a range of values within which the parameter is expected to lie.
    - **Formula for the Mean (σ known):**
      $$
      \left[ \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right]
      $$
    - **Formula for the Mean (σ unknown):**
      $$
      \left[ \bar{X} - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}} \right]
      $$

2. **Statistical Tests:** (see the sketch after this list)
    - **Hypothesis Testing:** Set up a null hypothesis (H₀) and an alternative hypothesis (H₁), then decide whether to reject H₀ based on the sample data.
    - **Z-Test:** Used when the population variance is known.
        - **Test Statistic:** $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
    - **T-Test:** Used when the population variance is unknown.
        - **Test Statistic:** $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
    - **Two-Sample T-Test:** Compares the means of two independent groups.
        - **Test Statistic:** $T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}$

3. **Key Calculations:**
    - **Example 1:** 95% confidence interval for bonbon package weights.
        - **Result:** [63.84, 65.08]
    - **Example 2:** Testing machine adjustment with a t-test.
        - **Result:** Null hypothesis not rejected; the machine does not need adjustment.
    - **Example 3:** Two-sample t-test for bonbon weights.
        - **Result:** Reject H₀; the second company's bonbons are heavier.

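A sketch of the t-based confidence interval and a one-sample t-test with scipy. The sample values are invented, so the numbers differ from the lecture's bonbon example.

```python
import numpy as np
from scipy import stats

x = np.array([64.2, 65.1, 63.8, 64.9, 64.4, 65.3, 63.9, 64.7])  # invented weights

# 95% t confidence interval for the mean (sigma unknown)
mean, sem = x.mean(), stats.sem(x)   # sem = S / sqrt(n)
lo, hi = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")

# One-sample t-test of H0: mu = 64 against H1: mu != 64
t_stat, p_value = stats.ttest_1samp(x, popmean=64.0)
print(t_stat, p_value)               # reject H0 if p_value < alpha
```
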
![[lecture_11.pdf]]

1. **Point Estimation:**
    - A point estimator (e.g., the sample mean) estimates the parameter θ from a random sample.
    - Example: For a normal distribution, the arithmetic mean estimates the expected value.

2. **Confidence Intervals:**
    - A range [gl, gu] in which θ lies with probability 1−α.
    - Types: One-sided (e.g., [gl, ∞)) and two-sided (finite interval).

3. **Statistical Hypothesis Testing:**
    - **Null Hypothesis (H₀):** Statement to be tested (e.g., μ = 15).
    - **Alternative Hypothesis (H₁):** Opposing statement (e.g., μ ≠ 15).
    - **Test Statistic:** A function of the sample data used to assess H₀ vs. H₁.
    - **Rejection Region:** Values of the test statistic that lead to rejection of H₀.
    - **Errors:**
        - **Type I Error:** Rejecting a true H₀ (controlled by the significance level α).
        - **Type II Error:** Failing to reject a false H₀ (related to test power).

4. **Z-Test and T-Test:**
    - **Z-Test:** Used when the population variance is known or the sample size is large.
    - **T-Test:** Used when the population variance is unknown; relies on the sample standard deviation and the t-distribution.

5. **Two-Sample T-Test:** (see the sketch after this list)
    - Compares the means of two independent groups.
    - **Assumptions:** Normality, independence, homogeneity of variances (if variances are assumed equal).
    - **Test Statistic:** Accounts for the sample means, standard deviations, and sizes.
    - **Degrees of Freedom:** Calculated with the Welch-Satterthwaite equation for unequal variances.

6. **Examples:**
    - **One-Sample Test:** Chocolate box weights using a z-test (known variance).
    - **Two-Sample Test:** Comparing bonbon weights using a t-test with the Welch-Satterthwaite adjustment.

7. **Key Concepts:**
    - **p-value:** Probability, under H₀, of a test statistic at least as extreme as the one observed; compared against α.
    - **Confidence Interval and Hypothesis Test Relationship:** H₀ is rejected exactly when the hypothesized parameter value lies outside the confidence interval.

8. **Future Topics:** Data preparation and decision trees.

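A sketch of the Welch two-sample test: the statistic and the Welch-Satterthwaite degrees of freedom computed by hand, then compared with `scipy.stats.ttest_ind` (the samples are invented).

```python
import numpy as np
from scipy import stats

x = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2])  # invented sample 1
y = np.array([10.6, 10.9, 10.5, 11.0, 10.7])      # invented sample 2
n, m = len(x), len(y)
vx, vy = x.var(ddof=1), y.var(ddof=1)             # sample variances S²

# Welch test statistic
t = (x.mean() - y.mean()) / np.sqrt(vx / n + vy / m)

# Welch-Satterthwaite degrees of freedom
df = (vx / n + vy / m) ** 2 / (
    (vx / n) ** 2 / (n - 1) + (vy / m) ** 2 / (m - 1)
)
p = 2 * stats.t.sf(abs(t), df)                    # two-sided p-value

print(t, df, p)
print(stats.ttest_ind(x, y, equal_var=False))     # should agree
```
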
![[lecture_12.pdf]]

1. **Machine Learning Overview:**
    - Involves data preparation, model building, and evaluation.
    - Follows the CRISP-DM process: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

2. **Data Preparation:**
    - **Handling Missing Data:** Strategies include filtering, marking, or imputing missing values (e.g., mean, median, or model-based imputation).
    - **Handling False Data:** Identify and correct errors through analysis or expert consultation.
    - **Feature Engineering:** Create new features from existing data (e.g., deriving age from birthdate) or combine attributes for better model performance.

3. **Decision Trees:** (see the sketch after this list)
    - **Structure:** A tree whose inner nodes test attributes, leading to leaf nodes with class predictions.
    - **Construction:** Built by recursively splitting the data to minimize entropy (disorder). Information gain determines the best split.
    - **Example:** Medicine recommendation based on blood pressure and age.
    - **Pros and Cons:** Easy to interpret but may overfit or require discretization of numerical attributes.

4. **Model Evaluation:**
    - **Train-Test Split:** Evaluate models on independent test data to avoid overfitting.
    - **Metrics for Regression:** Mean Squared Error (MSE), Mean Absolute Error (MAE).
    - **Metrics for Classification:** Confusion matrix, accuracy, precision, recall, F1 score.
        - **Accuracy:** Overall correctness.
        - **Precision:** Correct predictions among positive predictions.
        - **Recall:** Correct predictions among actual positive instances.
        - **F1 Score:** Harmonic mean of precision and recall.

5. **Advanced Topics:**
    - **Ensemble Methods:** Random Forests and Gradient Boosting improve on decision trees by reducing variance and bias.
    - **Model Comparison:** Use validation sets to compare models and avoid overfitting.

6. **Summary:**
    - **Skills Acquired:** Data preparation, decision tree classification, and model evaluation.
    - **Future Topics:** Dashboards and summaries for data visualization and reporting.

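A minimal scikit-learn sketch of a decision tree with a train-test split. The tiny age/blood-pressure dataset is invented to echo the lecture's medicine example; `criterion="entropy"` selects splits by information gain.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented data: [age, systolic blood pressure] -> medicine A (0) or B (1)
X = np.array([[25, 110], [40, 130], [62, 150], [35, 120],
              [70, 160], [50, 140], [28, 115], [66, 155]])
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Entropy-based splits correspond to choosing by information gain
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X_train, y_train)

print(export_text(clf, feature_names=["age", "blood_pressure"]))
print("test accuracy:", clf.score(X_test, y_test))
```
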
![[lecture_13.pdf]]

1. **Model Evaluation Basics:**
    - **Overfitting:** The model fits the training data too closely, leading to poor performance on new data.
    - **Underfitting:** The model is too simple to capture the data's structure, resulting in poor performance on both training and test data.

2. **Data Splitting:**
    - **Train-Test Split:** Commonly used to evaluate model performance. A typical split is 80% for training and 20% for testing.
    - **Train-Validation-Test Split:** Used to compare different models by keeping separate training, validation, and test sets.

3. **Evaluation Metrics:**
    - **Regression Metrics:**
        - **Mean Squared Error (MSE):** Average of the squared differences between predicted and actual values.
        - **Mean Absolute Error (MAE):** Average of the absolute differences.
        - **Mean Absolute Percentage Error (MAPE):** Average of the percentage differences.
    - **Classification Metrics:**
        - **Confusion Matrix:** Tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
        - **Accuracy:** (TP + TN) / (TP + TN + FP + FN).
        - **Precision:** TP / (TP + FP).
        - **Recall:** TP / (TP + FN).
        - **F1 Score:** Harmonic mean of precision and recall: $F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.

4. **Example Calculations:** (reproduced in the sketch after this list)
    - **Confusion Matrix Example** (TP = 4, TN = 1, FP = 2, FN = 3):
        - Accuracy: $\frac{4 + 1}{10} = 0.5$
        - Precision: $\frac{4}{6} \approx 0.66$
        - Recall: $\frac{4}{7} \approx 0.57$
        - F1 Score: $\frac{2 \times 0.66 \times 0.57}{0.66 + 0.57} \approx 0.61$

5. **Interpreting Results:**
    - **Baseline Comparison:** Always compare model performance against a naive reference (e.g., predicting the majority class).
    - **Contextual Relevance:** Accuracy alone may not indicate practical usefulness; consider the domain context and baseline performance.

6. **Summary:**
    - **Skills Acquired:** Understanding of evaluation metrics and their application in assessing machine learning models.
    - **Future Topics:** Advanced evaluation techniques and model optimization strategies.

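The example metrics can be recomputed from the counts implied by the fractions above (TP = 4, TN = 1, FP = 2, FN = 3):

```python
tp, tn, fp, fn = 4, 1, 2, 3  # counts implied by the worked example

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.5, ≈0.667, ≈0.571, ≈0.615
```
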