!lecture_02.pdf

  1. Definition of Data:

    • Data is information collected, stored, or processed. It is ubiquitous and can be measured or categorized.
  2. Data Basics:

    • Population: Entire group of interest (e.g., all students).
    • Sample: Subset of the population (e.g., students in a lecture).
    • Statistical Unit: Individual data point (e.g., one student).
    • Variable: Characteristics of units (e.g., name, population size).
    • Value: Specific value of a variable (e.g., "123456" for "MatrNr").
  3. Data Categories:

    • Structured vs. Unstructured:
      • Structured: Organized data with a predefined format (e.g., tables).
      • Unstructured: No traditional format (e.g., text, images).
    • Discrete vs. Continuous:
      • Discrete: Countable values (e.g., grades).
      • Continuous: Any value within a range (e.g., temperature).
    • Levels of Measurement:
      • Nominal: Labels without order (e.g., colors).
      • Ordinal: Ordered labels (e.g., school grades).
      • Interval: Ordered with equal intervals (e.g., Celsius).
      • Ratio: Interval with a true zero (e.g., weight).
    • Qualitative vs. Quantitative:
      • Qualitative: Categorical (e.g., gender).
      • Quantitative: Numerical (e.g., height).
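
As a small illustration of nominal vs. ordinal data from point 3, here is a minimal pandas sketch (the colour and grade values are invented):

```python
import pandas as pd

# Nominal: labels without a meaningful order (e.g., colors).
colors = pd.Categorical(["red", "blue", "red"], ordered=False)

# Ordinal: ordered labels (e.g., school grades, 1 = best).
grades = pd.Categorical([2, 1, 3, 2], categories=[1, 2, 3, 4, 5], ordered=True)

print(colors.categories)  # the order of these categories carries no meaning
print(grades.min())       # 1 -- comparisons are meaningful for ordinal data
```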

!lecture_03.pdf

  1. Primary vs. Secondary Data:

    • Primary Data: Collected directly for a specific purpose (e.g., surveys, experiments).
    • Secondary Data: Existing data from other sources (e.g., books, journals).
  2. Ways to Obtain Data:

    • Capturing Data: Collecting through sensors, observations, or experiments.
    • Retrieving Data: Accessing from databases, APIs, or open data sources.
    • Collecting Data: Scraping from websites or logs when direct access isn't available.
  3. Databases:

    • Relational Databases: Use SQL for structured data but have limitations with big data.
    • NoSQL Databases: Handle unstructured or semi-structured data, offering flexibility and scalability.
    • Document-Oriented Databases: Store data in formats like JSON, ideal for e-commerce and IoT.
  4. APIs:

    • REST APIs enable communication between systems via HTTP methods (GET, POST, PUT, DELETE).
    • They often require authentication (e.g., API keys) and return data in JSON or XML formats (see the sketch after this list).
  5. Data Scraping:

    • Extracting data from websites or logs when APIs aren't available.
    • Legal and ethical considerations must be addressed.
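
A minimal sketch of retrieving data via a REST API with the `requests` package; the endpoint URL, API key, and `limit` parameter are hypothetical placeholders:

```python
import requests

# Hypothetical open-data endpoint and key -- placeholders, not a real API.
URL = "https://api.example.com/v1/stations"
API_KEY = "your-api-key"          # many REST APIs require authentication

# GET retrieves data; POST/PUT/DELETE would create, update, or remove resources.
response = requests.get(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"limit": 10},
    timeout=10,
)
response.raise_for_status()        # raise on HTTP errors (4xx / 5xx)
records = response.json()          # parse the JSON body (here assumed to be a list)
print(len(records), "records retrieved")
```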

!lecture_04.pdf

  1. Data Protection and Anonymization:

    • GDPR Compliance: Personal data must be protected, and usage requires consent.
    • Anonymization: Removing personal identifiers to prevent individual identification.
    • Pseudonymization: Replacing identifiers with pseudonyms; re-identification requires additional information.
    • Hashing: Converting data to fixed-size digests (e.g., SHA-256) for privacy (see the sketch after this list).
  2. Statistical Basics:

    • Descriptive Statistics: Summarizes data (e.g., mean, median).
    • Exploratory Data Analysis: Identifies patterns and outliers.
    • Inferential Statistics: Draws conclusions about populations from samples.
  3. Frequencies and Histograms:

    • Frequencies: Count of occurrences of each value.
    • Absolute vs. Relative Frequencies: Raw counts vs. proportions.
    • Histograms: Visual representation of data distribution across classes.
  4. Empirical Distribution Function (EDF):

    • Plots the cumulative relative frequencies: for each value x, the share of observations ≤ x.
  5. Data Visualization:

    • Pie Charts: Effective for showing proportions of categorical data.
    • Bar Charts: Compare frequencies across categories.
    • Histograms: Display distribution of continuous data.
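
A minimal sketch of pseudonymization via hashing (point 1 above), using Python's standard `hashlib`; the salt value and the matriculation number are made-up examples:

```python
import hashlib

def pseudonymize(matr_nr: str, salt: str = "course-secret") -> str:
    """Map a matriculation number to a fixed-size SHA-256 digest.

    The salt is a made-up extra secret; without one, common values could be
    re-identified by hashing all plausible inputs (a lookup attack).
    """
    return hashlib.sha256((salt + matr_nr).encode("utf-8")).hexdigest()

print(pseudonymize("123456"))   # 64 hex characters; same input -> same pseudonym
```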

!lecture_05.pdf

  1. Central Tendencies:

    • Mode: The most frequently occurring value in a dataset.
    • Median: The middle value when data is ordered, dividing the dataset into two equal halves.
    • Mean: The average value, calculated by summing all observations and dividing by the number of observations.
  2. Statistical Dispersion:

    • Range: The difference between the maximum and minimum values.
    • Interquartile Range (IQR): The difference between the third quartile (Q3) and first quartile (Q1), representing the middle 50% of the data.
    • Variance and Standard Deviation: Measures of spread, with variance being the average squared deviation from the mean and standard deviation the square root of variance.
  3. Data Visualization:

    • Histograms: Display the distribution of continuous data across classes.
    • Box Plots: Show the five-number summary (minimum, Q1, median, Q3, maximum) and identify outliers.
  4. Outliers:

    • Defined as data points falling outside the range [Q1 − 1.5·IQR, Q3 + 1.5·IQR].
    • Can indicate errors, unusual observations, or novel data points.
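
A short numpy sketch of the quartiles and the 1.5·IQR outlier rule described above; the data values are invented:

```python
import numpy as np

# Invented sample; one value (9.8) is deliberately far from the rest.
data = np.array([4.1, 4.4, 4.6, 4.9, 5.0, 5.2, 5.3, 5.6, 9.8])

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # outlier fences

outliers = data[(data < lower) | (data > upper)]
print(f"Q1={q1:.2f}, median={med:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print("fences:", (round(lower, 2), round(upper, 2)), "outliers:", outliers)
```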

!lecture_06.pdf

  1. Empirical Variance Calculation:

    • Data: Daily temperatures (°C): 11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9
    • Mean (x̄): 12.24
    • Sum of squared deviations: 14.16
    • Empirical Variance (s̃²): ( \frac{14.16}{7} \approx 2.02 )
    • Empirical Standard Deviation (s̃): ( \sqrt{2.02} \approx 1.42 )
  2. Contingency Table Analysis:

    • Variables: Growth (shrinking, growing, growing strongly) and Location (North, South)
    • Absolute Frequencies:
      • North: 29 (growing), 13 (growing strongly), 7 (shrinking)
      • South: 13 (growing), 19 (growing strongly), 2 (shrinking)
    • Chi-squared (χ²) Test:
      • Calculated χ²: 7.53
    • Corrected Pearson Contingency Coefficient (K*_P):
      • ( K^*_P = \sqrt{\frac{7.53}{7.53 + 83}} \times \sqrt{\frac{2}{1}} \approx 0.41 )
      • Interpretation: Weak to medium correlation between location and growth.
  3. Correlation Coefficient:

    • Pearson Correlation Coefficient (r_XY) for population and area:
      • Calculated r_XY: 0.70
      • Interpretation: Strong positive correlation.
  4. Key Concepts:

    • Empirical Variance: Measures data spread around the mean.
    • Contingency Tables: Used for nominal/ordinal data to assess associations.
    • Pearson Correlation: Measures linear correlation between metric variables, ranging from -1 to 1.
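
The empirical variance from point 1 can be reproduced in a few lines of numpy; the x and y arrays for the Pearson correlation are invented, since the lecture's population/area data is not repeated here:

```python
import numpy as np

# Daily temperatures from point 1 above.
temps = np.array([11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9])

mean = temps.mean()                                  # x̄ ≈ 12.24
var_emp = ((temps - mean) ** 2).sum() / len(temps)   # s̃² ≈ 14.16 / 7 ≈ 2.02 (divides by n)
std_emp = np.sqrt(var_emp)                           # s̃ ≈ 1.42
print(round(mean, 2), round(var_emp, 2), round(std_emp, 2))

# Pearson correlation between two metric variables (invented values for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.4, 3.8, 5.1])
r_xy = np.corrcoef(x, y)[0, 1]                       # always in [-1, 1]
print(round(r_xy, 2))
```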

!lecture_07.pdf

  1. Probability Theory Basics:

    • Sample Space (Ω): Set of all possible outcomes.
    • Event: Subset of the sample space.
    • Probability Axioms (Kolmogorov):
      1. ( P(A) \geq 0 )
      2. ( P(Ω) = 1 )
      3. Additivity for disjoint events.
  2. Conditional Probability:

    • Definition: ( P(A|B) = \frac{P(A \cap B)}{P(B)} )
    • Independence: Events A and B are independent if ( P(A \cap B) = P(A)P(B) ).
  3. Bayes' Theorem:

    • Formula: ( P(B|A) = \frac{P(A|B)P(B)}{P(A)} )
    • Example:
      • ( P(B|A) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.02 \times 0.99} = 0.31 )
  4. Combinatorics:

    • Permutations: ( n! ) for unique items.
    • Combinations: ( \binom{n}{k} = \frac{n!}{k!(n-k)!} )
    • Multiplication Rule: ( n_1 \times n_2 \times \dots \times n_r )
  5. Key Calculations:

    • Password Example: ( 7^2 \times P(10,4) = 49 \times 5040 = 246960 )
    • Dice Probability: ( P(\text{at least one 6 in 4 throws}) = 1 - (5/6)^4 = 0.518 )
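
A small sketch reproducing the key calculations in point 5 with Python's `math` module; the 5-card-hand count is an extra illustration of combinations, not from the lecture:

```python
from math import comb, factorial, perm

arrangements = factorial(5)         # permutations of 5 unique items: 5! = 120
n_passwords = 7 ** 2 * perm(10, 4)  # password example: 49 * 5040 = 246960
n_hands = comb(52, 5)               # combinations: 5-card hands from 52 cards = 2598960
p_six = 1 - (5 / 6) ** 4            # P(at least one six in 4 throws) ≈ 0.518

print(arrangements, n_passwords, n_hands, round(p_six, 3))
```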

!lecture_08_neu.pdf

  1. Random Variables:

    • Definition: A function that assigns numerical values to outcomes in a sample space.
    • Discrete vs. Continuous:
      • Discrete: Countable outcomes (e.g., dice roll).
      • Continuous: Uncountable outcomes (e.g., measurement of height).
  2. Probability Distributions:

    • Discrete:
      • Bernoulli: ( f(x) = p^x(1-p)^{1-x} ) for ( x \in \{0,1\} ).
      • Binomial: ( f(x) = \binom{n}{x}p^x(1-p)^{n-x} ) for ( x \in \{0,1,...,n\} ).
      • Uniform: ( f(x) = \frac{1}{m} ) for ( x \in \{1,2,...,m\} ).
    • Continuous:
      • Uniform: ( f(x) = \frac{1}{b-a} ) for ( x \in [a,b] ).
      • Normal (Gaussian): ( f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)} ).
  3. Expected Value and Variance:

    • Discrete:
      • Expected Value: ( E(X) = \sum x \cdot f(x) ).
      • Variance: ( Var(X) = E((X-E(X))^2) ).
    • Continuous:
      • Expected Value: ( E(X) = \int_{-\infty}^{\infty} x \cdot f(x)dx ).
      • Variance: ( Var(X) = E((X-E(X))^2) ).
  4. Key Calculations:

    • Bernoulli Distribution:
      • ( E(X) = p ), ( Var(X) = p(1-p) ).
    • Binomial Distribution:
      • ( E(X) = np ), ( Var(X) = np(1-p) ).
    • Uniform Distribution (Discrete):
      • ( E(X) = \frac{m+1}{2} ), ( Var(X) = \frac{m^2-1}{12} ).
    • Uniform Distribution (Continuous):
      • ( E(X) = \frac{a+b}{2} ), ( Var(X) = \frac{(b-a)^2}{12} ).
    • Normal Distribution:
      • ( E(X) = \mu ), ( Var(X) = \sigma^2 ).
  5. Normal Distribution:

    • Standard Normal (Z-Score): ( Z = \frac{X-\mu}{\sigma} \sim N(0,1) ).
    • 68-95-99.7 Rule: About 68% of the data lies within ( \mu \pm \sigma ), 95% within ( \mu \pm 2\sigma ), and 99.7% within ( \mu \pm 3\sigma ).
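
A small simulation sketch (assuming numpy) that checks E(X) = np and Var(X) = np(1−p) for the binomial distribution and the 68-95-99.7 rule for the normal distribution; the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Binomial(n=10, p=0.3): theory gives E(X) = np = 3.0 and Var(X) = np(1-p) = 2.1.
n, p = 10, 0.3
sample = rng.binomial(n, p, size=100_000)
print(round(sample.mean(), 2), round(sample.var(), 2))   # ≈ 3.0 and ≈ 2.1

# Normal(mu=5, sigma=2): empirical check of the 68-95-99.7 rule.
mu, sigma = 5.0, 2.0
x = rng.normal(mu, sigma, size=100_000)
for k in (1, 2, 3):
    share = np.mean(np.abs(x - mu) <= k * sigma)
    print(f"within mu ± {k}·sigma: {share:.3f}")          # ≈ 0.683, 0.954, 0.997
```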

!lecture_09.pdf

  1. Simple Linear Regression:

    • Model: ( Y = \beta_0 + \beta_1X + \epsilon ), where ( \epsilon \sim N(0, \sigma^2) ).
    • Estimation: Parameters ( \beta_0 ) and ( \beta_1 ) are estimated using least squares method.
  2. Least Squares Estimators:

    • Slope (β̂₁): [$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} $$]
    • Intercept (β̂₀): [$$ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} $$]
  3. Residual Analysis:

    • Residuals: ( e_i = Y_i - \hat{Y}_i ).
    • Residual Plot: Used to check model assumptions (linearity, constant variance, normality).
  4. Coefficient of Determination (R²):

    • Formula: [$$ R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} $$], where ESS is the explained and RSS the residual sum of squares.
    • Interpretation: Measures goodness of fit (0 ≤ R² ≤ 1).
  5. Variance Estimation:

    • Residual Sum of Squares (RSS): [$$ RSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 $$]
    • Variance Estimator: [$$ \hat{\sigma}^2 = \frac{RSS}{n-2} $$]
  6. Key Calculations:

    • Example Calculations:
      • For given data, compute ( \hat{\beta}_0 ), ( \hat{\beta}_1 ), and ( \hat{\sigma}^2 ).
      • Calculate R² to assess model fit.
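
A minimal numpy sketch of the computations named in point 6, using invented data so the least-squares formulas above can be followed line by line:

```python
import numpy as np

# Invented data; the point is to follow the least-squares formulas step by step.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)  # slope estimate
beta0 = y_bar - beta1 * x_bar                                         # intercept estimate

y_hat = beta0 + beta1 * x
residuals = y - y_hat                       # e_i = Y_i - Ŷ_i
rss = np.sum(residuals ** 2)                # residual sum of squares
tss = np.sum((y - y_bar) ** 2)              # total sum of squares
r2 = 1 - rss / tss                          # coefficient of determination
sigma2_hat = rss / (len(x) - 2)             # variance estimator σ̂² = RSS / (n - 2)

print(round(beta0, 3), round(beta1, 3), round(r2, 3), round(sigma2_hat, 4))
```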

!lecture_10.pdf

  1. Confidence Intervals:

    • Point Estimator: Provides a single estimate of a population parameter.
    • Interval Estimator: Provides a range of values within which the parameter is expected to lie.
    • Formula for Mean (σ known): [$$ \left[ \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right] $$]
    • Formula for Mean (σ unknown): [$$ \left[ \bar{X} - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}}, \bar{X} + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}} \right] $$]
  2. Statistical Tests:

    • Hypothesis Testing: Involves setting up a null hypothesis (H₀) and an alternative hypothesis (H₁), then determining whether to reject H₀ based on sample data.
    • Z-Test: Used when the population variance is known.
      • Test Statistic: ( Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} )
    • T-Test: Used when the population variance is unknown.
      • Test Statistic: ( T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} )
    • Two-Sample T-Test: Compares the means of two independent groups.
      • Test Statistic: ( T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}} )
  3. Key Calculations:

    • Example 1: 95% confidence interval for bonbon package weights.
      • Result: [63.84, 65.08]
    • Example 2: Testing machine adjustment with t-test.
      • Result: Null hypothesis not rejected, machine does not need adjustment.
    • Example 3: Two-sample t-test for bonbon weights.
      • Result: Reject H₀, the second company's bonbons are heavier.
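
A sketch of a 95% confidence interval and a one-sample t-test with numpy/scipy; the weights and the hypothesized mean are invented, not the lecture's numbers:

```python
import numpy as np
from scipy import stats

# Invented package weights in grams; mu0 is an assumed target weight under H0.
weights = np.array([64.1, 65.3, 63.8, 64.9, 64.4, 65.0, 63.9, 64.6])
mu0 = 64.0

n = len(weights)
x_bar, s = weights.mean(), weights.std(ddof=1)

# 95% confidence interval for the mean with sigma unknown (t quantile, n-1 df).
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (x_bar - t_crit * s / np.sqrt(n), x_bar + t_crit * s / np.sqrt(n))

# One-sample t-test of H0: mu = mu0 against H1: mu != mu0.
t_stat, p_value = stats.ttest_1samp(weights, mu0)
print(ci, round(t_stat, 2), round(p_value, 3))   # reject H0 if p_value < alpha
```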

!lecture_11.pdf

  1. Point Estimation:

    • A point estimator (e.g., sample mean) estimates the parameter θ from a random sample.
    • Example: For a normal distribution, the arithmetic mean estimates the expected value.
  2. Confidence Intervals:

    • A range [g_l, g_u] where θ is likely to lie with probability 1 − α.
    • Types: One-sided (e.g., [g_l, ∞)) and two-sided (finite interval).
  3. Statistical Hypothesis Testing:

    • Null Hypothesis (H0): Statement to be tested (e.g., μ = 15).
    • Alternative Hypothesis (H1): Opposing statement (e.g., μ ≠ 15).
    • Test Statistic: A function of the sample data to assess H0 vs. H1.
    • Rejection Region: Values of the test statistic leading to rejection of H0.
    • Errors:
      • Type I Error: Rejecting a true H0 (controlled by significance level α).
      • Type II Error: Failing to reject a false H0 (related to test power).
  4. Z-Test and T-Test:

    • Z-Test: Used when population variance is known or sample size is large.
    • T-Test: Used when population variance is unknown; relies on sample standard deviation and t-distribution.
  5. Two-Sample T-Test:

    • Compares means of two independent groups.
    • Assumptions: Normality, independence, homogeneity (if variances are equal).
    • Test Statistic: Accounts for sample means, standard deviations, and sizes.
    • Degrees of Freedom: Calculated using Welch-Satterthwaite equation for unequal variances.
  6. Examples:

    • One-Sample Test: Chocolate box weights using z-test (known variance).
    • Two-Sample Test: Comparing bonbon weights using a t-test with the Welch-Satterthwaite adjustment (see the sketch after this list).
  7. Key Concepts:

    • p-value: Probability, under H0, of observing a test statistic at least as extreme as the one computed; H0 is rejected if the p-value is below α.
    • Confidence Interval and Hypothesis Test Relationship: Rejecting H0 if the parameter lies outside the confidence interval.
  8. Future Topics: Data preparation and decision trees.
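
A minimal sketch of the two-sample Welch t-test from points 5 and 6, assuming scipy; the samples are simulated rather than taken from the lecture:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated bonbon weights from two producers (not the lecture's data).
x = rng.normal(10.0, 0.5, size=30)   # producer A
y = rng.normal(10.3, 0.7, size=25)   # producer B

# Welch's two-sample t-test: unequal variances,
# degrees of freedom via the Welch-Satterthwaite equation.
t_stat, p_value = stats.ttest_ind(x, y, equal_var=False)

alpha = 0.05
print(round(t_stat, 2), round(p_value, 4),
      "reject H0" if p_value < alpha else "do not reject H0")
```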

!lecture_12.pdf

  1. Machine Learning Overview:

    • Involves data preparation, model building, and evaluation.
    • Follows the CRISP-DM process: Business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
  2. Data Preparation:

    • Handling Missing Data: Strategies include filtering, marking, or imputing missing values (e.g., mean, median, or model-based imputation).
    • Handling False Data: Identify and correct errors through analysis or expert consultation.
    • Feature Engineering: Create new features from existing data (e.g., deriving age from birthdate) or combine attributes for better model performance.
  3. Decision Trees:

    • Structure: A tree with nodes representing tests on attributes, leading to leaf nodes with class predictions.
    • Construction: Built by recursively splitting data to minimize entropy (disorder). Information gain determines the best split.
    • Example: Medicine recommendation based on blood pressure and age.
    • Pros and Cons: Easy to interpret but may overfit or require discretization for numerical attributes.
  4. Model Evaluation:

    • Train-Test Split: Evaluate models on independent test data to avoid overfitting.
    • Metrics for Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE).
    • Metrics for Classification: Confusion matrix, accuracy, precision, recall, F1 score.
      • Accuracy: Overall correctness.
      • Precision: Share of predicted positives that are actually positive.
      • Recall: Share of actual positives that are correctly identified.
      • F1 Score: Harmonic mean of precision and recall.
  5. Advanced Topics:

    • Ensemble Methods: Random Forests and Gradient Boosting improve decision trees by reducing variance and bias.
    • Model Comparison: Use validation sets to compare models and avoid overfitting.
  6. Summary:

    • Skills Acquired: Data preparation, decision tree classification, and model evaluation.
    • Future Topics: Dashboards and summaries for data visualization and reporting.
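
A compact scikit-learn sketch tying the three skills together (mean imputation, a decision tree with the entropy criterion, and a train-test evaluation); the toy medicine data is invented:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Invented toy data: age, blood pressure (with a missing value), recommended medicine.
df = pd.DataFrame({
    "age":            [25, 47, 52, 33, 61, 40, 58, 29],
    "blood_pressure": [120, 140, np.nan, 118, 150, 135, 145, 122],
    "medicine":       ["A", "B", "B", "A", "B", "B", "B", "A"],
})

# Data preparation: impute the missing blood pressure with the column mean.
X = df[["age", "blood_pressure"]]
X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)
y = df["medicine"]

# Train-test split, then fit a small decision tree (entropy = information-gain splits).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```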

!lecture_13.pdf

  1. Model Evaluation Basics:

    • Overfitting: When a model fits the training data too well, leading to poor performance on new data.
    • Underfitting: When a model is too simple to capture the data's structure, resulting in poor performance on both training and test data.
  2. Data Splitting:

    • Train-Test Split: Commonly used to evaluate model performance. Typical splits are 80% for training and 20% for testing.
    • Train-Validation-Test Split: Used to compare different models by having separate training, validation, and test sets.
  3. Evaluation Metrics:

    • Regression Metrics:
      • Mean Squared Error (MSE): Average of squared differences between predicted and actual values.
      • Mean Absolute Error (MAE): Average of absolute differences.
      • Mean Absolute Percentage Error (MAPE): Average of percentage differences.
    • Classification Metrics:
      • Confusion Matrix: Tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
      • Accuracy: (TP + TN) / (TP + TN + FP + FN).
      • Precision: TP / (TP + FP).
      • Recall: TP / (TP + FN).
      • F1 Score: Harmonic mean of precision and recall, given by ( F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} ).
  4. Example Calculations:

    • Confusion Matrix Example:
      • Accuracy: ( \frac{4 + 1}{10} = 0.5 )
      • Precision: ( \frac{4}{6} \approx 0.67 )
      • Recall: ( \frac{4}{7} \approx 0.57 )
      • F1 Score: ( \frac{2 \times 0.67 \times 0.57}{0.67 + 0.57} \approx 0.62 )
  5. Interpreting Results:

    • Baseline Comparison: Always compare model performance against a naive reference (e.g., predicting the majority class).
    • Contextual Relevance: Accuracy alone may not indicate practical usefulness; consider domain context and baseline performance.
  6. Summary:

    • Skills Acquired: Understanding of evaluation metrics and their application in assessing machine learning models.
    • Future Topics: Advanced evaluation techniques and model optimization strategies.
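
The metrics in point 4 can be recomputed from the underlying confusion-matrix counts (TP = 4, FP = 2, FN = 3, TN = 1, as implied by the fractions above):

```python
# Confusion-matrix counts implied by the fractions in point 4.
TP, FP, FN, TN = 4, 2, 3, 1            # 10 observations in total

accuracy  = (TP + TN) / (TP + TN + FP + FN)                # 0.5
precision = TP / (TP + FP)                                 # ≈ 0.67
recall    = TP / (TP + FN)                                 # ≈ 0.57
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.62

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```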