-
Definition of Data:
- Data is information collected, stored, or processed. It is ubiquitous and can be measured or categorized.
-
Data Basics:
- Population: Entire group of interest (e.g., all students).
- Sample: Subset of the population (e.g., students in a lecture).
- Statistical Unit: Individual data point (e.g., one student).
- Variable: Characteristics of units (e.g., name, population size).
- Value: Specific value of a variable (e.g., "123456" for "MatrNr").
-
Data Categories:
- Structured vs. Unstructured:
- Structured: Organized data with a predefined format (e.g., tables).
- Unstructured: No traditional format (e.g., text, images).
- Discrete vs. Continuous:
- Discrete: Countable values (e.g., grades).
- Continuous: Any value within a range (e.g., temperature).
- Levels of Measurement:
- Nominal: Labels without order (e.g., colors).
- Ordinal: Ordered labels (e.g., school grades).
- Interval: Ordered with equal intervals (e.g., Celsius).
- Ratio: Interval with a true zero (e.g., weight).
- Qualitative vs. Quantitative:
- Qualitative: Categorical (e.g., gender).
- Quantitative: Numerical (e.g., height).
-
Primary vs. Secondary Data:
- Primary Data: Collected directly for a specific purpose (e.g., surveys, experiments).
- Secondary Data: Existing data from other sources (e.g., books, journals).
-
Ways to Obtain Data:
- Capturing Data: Collecting through sensors, observations, or experiments.
- Retrieving Data: Accessing from databases, APIs, or open data sources.
- Collecting Data: Scraping from websites or logs when direct access isn't available.
-
Databases:
- Relational Databases: Use SQL for structured data but have limitations with big data.
- NoSQL Databases: Handle unstructured or semi-structured data, offering flexibility and scalability.
- Document-Oriented Databases: Store data in formats like JSON, ideal for e-commerce and IoT.
-
APIs:
- REST APIs enable communication between systems using HTTP methods (GET, POST, PUT, DELETE).
- Often require authentication (e.g., API keys) and provide data in JSON/XML formats.
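A minimal sketch of such a call with Python's requests library; the endpoint URL, API key, and `limit` parameter below are hypothetical placeholders, not a real service:

```python
import requests

# Hypothetical endpoint and key -- substitute a real service before running.
URL = "https://api.example.com/v1/measurements"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# GET retrieves a resource; many APIs answer with JSON.
response = requests.get(URL, headers=HEADERS, params={"limit": 10}, timeout=10)
response.raise_for_status()   # raise on HTTP errors (4xx/5xx)
data = response.json()        # parse the JSON body into Python objects
print(data)
```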
-
Data Scraping:
- Extracting data from websites or logs when APIs aren't available.
- Legal and ethical considerations must be addressed.
-
Data Protection and Anonymization:
- GDPR Compliance: Personal data must be protected; processing requires a legal basis (typically consent).
- Anonymization: Removing personal identifiers to prevent individual identification.
- Pseudonymization: Replacing identifiers with pseudonyms; re-identification is possible only with separately kept additional information.
- Hashing: Converting data to fixed-size values (e.g., SHA-256) for privacy.
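A small sketch of pseudonymization via hashing with Python's hashlib; the salt value and helper name are illustrative:

```python
import hashlib

def pseudonymize(identifier: str, salt: str) -> str:
    """Map a personal identifier to a fixed-size SHA-256 digest (64 hex chars)."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()

# Salting makes dictionary attacks on short identifiers (e.g. matriculation
# numbers) harder; a plain hash of "123456" could simply be looked up.
print(pseudonymize("123456", salt="lecture-demo"))
```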
-
Statistical Basics:
- Descriptive Statistics: Summarizes data (e.g., mean, median).
- Exploratory Data Analysis: Identifies patterns and outliers.
- Inferential Statistics: Draws conclusions about populations from samples.
-
Frequencies and Histograms:
- Frequencies: Count of occurrences of each value.
- Absolute vs. Relative Frequencies: Raw counts vs. proportions.
- Histograms: Visual representation of data distribution across classes.
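A short sketch computing absolute and relative frequencies with collections.Counter; the grade sample is made up:

```python
from collections import Counter

grades = [1, 2, 2, 3, 3, 3, 4, 1, 2, 3]          # made-up school grades

absolute = Counter(grades)                        # raw counts per value
relative = {v: c / len(grades) for v, c in absolute.items()}  # proportions

print(absolute)   # Counter({3: 4, 2: 3, 1: 2, 4: 1})
print(relative)   # relative frequencies sum to 1.0
```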
-
Empirical Distribution Function (EDF):
- Plots the cumulative relative frequencies, showing for each value x the proportion of observations less than or equal to x.
-
Data Visualization:
- Pie Charts: Effective for showing proportions of categorical data.
- Bar Charts: Compare frequencies across categories.
- Histograms: Display distribution of continuous data.
-
Central Tendencies:
- Mode: The most frequently occurring value in a dataset.
- Median: The middle value when data is ordered, dividing the dataset into two equal halves.
- Mean: The average value, calculated by summing all observations and dividing by the number of observations.
-
Statistical Dispersion:
- Range: The difference between the maximum and minimum values.
- Interquartile Range (IQR): The difference between the third quartile (Q3) and first quartile (Q1), representing the middle 50% of the data.
- Variance and Standard Deviation: Measures of spread, with variance being the average squared deviation from the mean and standard deviation the square root of variance.
-
Data Visualization:
- Histograms: Display the distribution of continuous data across classes.
- Box Plots: Show the five-number summary (minimum, Q1, median, Q3, maximum) and identify outliers.
-
Outliers:
- Defined as data points falling outside the range [Q1 − 1.5·IQR, Q3 + 1.5·IQR].
- Can indicate errors, unusual observations, or novel data points.
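A minimal sketch of the 1.5·IQR rule using the standard library; statistics.quantiles uses one of several common quartile conventions, so the boundaries can differ slightly from hand calculations. The 25.0 is an artificial outlier appended to the temperature data used further below:

```python
import statistics

def iqr_outliers(data):
    """Return points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9, 25.0]))  # [25.0]
```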
-
Empirical Variance Calculation:
- Data: Daily temperatures (°C): 11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9
- Mean (x̄): 12.24
- Sum of squared deviations: 14.16
- Empirical Variance (s̃²): ( \tilde{s}^2 = \frac{14.16}{7} \approx 2.02 )
- Empirical Standard Deviation (s̃): ( \tilde{s} = \sqrt{2.02} \approx 1.42 )
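A sketch reproducing this worked example; note that the empirical variance s̃² divides by n (matching statistics.pvariance), not by n−1 as the sample variance does:

```python
temps = [11.2, 13.3, 14.1, 13.7, 12.2, 11.3, 9.9]   # daily temperatures (°C)
n = len(temps)

mean = sum(temps) / n                               # x̄ ≈ 12.24
ssd = sum((x - mean) ** 2 for x in temps)           # squared deviations ≈ 14.16
emp_var = ssd / n                                   # s̃² ≈ 2.02
emp_std = emp_var ** 0.5                            # s̃ ≈ 1.42

print(round(mean, 2), round(ssd, 2), round(emp_var, 2), round(emp_std, 2))
```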
-
Contingency Table Analysis:
- Variables: Growth (shrinking, growing, growing strongly) and Location (North, South)
- Absolute Frequencies:
- North: 29 (growing), 13 (growing strongly), 7 (shrinking)
- South: 13 (growing), 19 (growing strongly), 2 (shrinking)
- Chi-squared (χ²) Test:
- Calculated χ²: 7.53
- Corrected Pearson Contingency Coefficient (K*_P):
- ( K^*_P = \sqrt{\frac{7.53}{7.53 + 83}} \times \sqrt{\frac{2}{1}} \approx 0.41 )
- Interpretation: Weak to medium correlation between location and growth.
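A plain-Python sketch reproducing the χ² statistic and the corrected contingency coefficient from the table above (no libraries needed beyond math):

```python
from math import sqrt

# Observed absolute frequencies from the contingency table above.
table = {
    "North": {"growing": 29, "growing strongly": 13, "shrinking": 7},
    "South": {"growing": 13, "growing strongly": 19, "shrinking": 2},
}
rows, cols = list(table), list(table["North"])
row_sum = {r: sum(table[r].values()) for r in rows}
col_sum = {c: sum(table[r][c] for r in rows) for c in cols}
n = sum(row_sum.values())                           # 83

# chi² = sum over cells of (observed - expected)² / expected,
# with expected = row total * column total / n.
chi2 = sum(
    (table[r][c] - row_sum[r] * col_sum[c] / n) ** 2
    / (row_sum[r] * col_sum[c] / n)
    for r in rows for c in cols
)

k = sqrt(chi2 / (chi2 + n))                         # Pearson coefficient K
m = min(len(rows), len(cols))                       # smaller dimension, here 2
k_corr = k * sqrt(m / (m - 1))                      # corrected K*_P

print(round(chi2, 2), round(k_corr, 2))             # 7.53 0.41
```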
-
Correlation Coefficient:
- Pearson Correlation Coefficient ( r_{XY} ) for population and area:
- Calculated ( r_{XY} ): 0.70
- Interpretation: Strong positive correlation.
-
Key Concepts:
- Empirical Variance: Measures data spread around the mean.
- Contingency Tables: Used for nominal/ordinal data to assess associations.
- Pearson Correlation: Measures linear correlation between metric variables, ranging from -1 to 1.
-
Probability Theory Basics:
- Sample Space (Ω): Set of all possible outcomes.
- Event: Subset of the sample space.
- Probability Axioms (Kolmogorov):
- ( P(A) \geq 0 )
- ( P(\Omega) = 1 )
- Additivity for disjoint events: ( P(A \cup B) = P(A) + P(B) ) if ( A \cap B = \emptyset ).
-
Conditional Probability:
- Definition: ( P(A|B) = \frac{P(A \cap B)}{P(B)} )
- Independence: Events A and B are independent if ( P(A \cap B) = P(A)P(B) ).
-
Bayes' Theorem:
- Formula: ( P(B|A) = \frac{P(A|B)P(B)}{P(A)} )
- Example: ( P(B|A) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.02 \times 0.99} \approx 0.31 )
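A tiny sketch of the computation; reading the numbers as a test with a 90% hit rate, a 2% false-positive rate, and a 1% base rate is an assumption about what the example models:

```python
def bayes(p_a_given_b, p_b, p_a_given_not_b):
    """P(B|A), expanding P(A) via the law of total probability."""
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
    return p_a_given_b * p_b / p_a

print(round(bayes(0.9, 0.01, 0.02), 2))   # 0.31
```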
-
Combinatorics:
- Permutations: ( n! ) for unique items.
- Combinations: ( \binom{n}{k} = \frac{n!}{k!(n-k)!} )
- Multiplication Rule: ( n_1 \times n_2 \times \dots \times n_r )
-
Key Calculations:
- Password Example: ( 7^2 \times P(10,4) = 49 \times 5040 = 246960 )
- Dice Probability: ( P(\text{at least one 6 in 4 throws}) = 1 - (5/6)^4 \approx 0.518 )
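Both calculations checked with the math module (perm and comb require Python 3.8+):

```python
from math import comb, factorial, perm

print(factorial(4))            # permutations of 4 unique items: 24
print(comb(10, 4))             # "10 choose 4": 210

# Password example: 7² symbol choices times P(10, 4) ordered digit picks.
print(7 ** 2 * perm(10, 4))    # 49 * 5040 = 246960

# At least one six in four throws, via the complement rule.
print(round(1 - (5 / 6) ** 4, 3))   # 0.518
```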
-
Random Variables:
- Definition: A function that assigns numerical values to outcomes in a sample space.
- Discrete vs. Continuous:
- Discrete: Countable outcomes (e.g., dice roll).
- Continuous: Uncountable outcomes (e.g., measurement of height).
-
Probability Distributions:
- Discrete:
- Bernoulli: ( f(x) = p^x(1-p)^{1-x} ) for ( x \in \{0,1\} ).
- Binomial: ( f(x) = \binom{n}{x}p^x(1-p)^{n-x} ) for ( x \in \{0,1,\dots,n\} ).
- Uniform: ( f(x) = \frac{1}{m} ) for ( x \in \{1,2,\dots,m\} ).
- Continuous:
- Uniform: ( f(x) = \frac{1}{b-a} ) for ( x \in [a,b] ).
- Normal (Gaussian): ( f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)} ).
-
Expected Value and Variance:
- Discrete:
- Expected Value: ( E(X) = \sum x \cdot f(x) ).
- Variance: ( Var(X) = E((X-E(X))^2) ).
- Continuous:
- Expected Value: ( E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx ).
- Variance: ( Var(X) = E((X-E(X))^2) ).
-
Key Calculations:
- Bernoulli Distribution: ( E(X) = p ), ( Var(X) = p(1-p) ).
- Binomial Distribution: ( E(X) = np ), ( Var(X) = np(1-p) ).
- Uniform Distribution (Discrete): ( E(X) = \frac{m+1}{2} ), ( Var(X) = \frac{m^2-1}{12} ).
- Uniform Distribution (Continuous): ( E(X) = \frac{a+b}{2} ), ( Var(X) = \frac{(b-a)^2}{12} ).
- Normal Distribution: ( E(X) = \mu ), ( Var(X) = \sigma^2 ).
-
Normal Distribution:
- Standard Normal (Z-Score): ( Z = \frac{X-\mu}{\sigma} \sim N(0,1) ).
- 68-95-99.7 Rule: 68% of the data lie within ( \mu \pm \sigma ), 95% within ( \mu \pm 2\sigma ), and 99.7% within ( \mu \pm 3\sigma ).
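A quick check of the rule with scipy.stats (the exact shares are ≈ 68.3%, 95.4%, 99.7%); the observation x = 75 with μ = 70, σ = 4 is a made-up z-score example:

```python
from scipy.stats import norm

# Share of probability mass within mu ± k·sigma for the standard normal.
for k in (1, 2, 3):
    print(k, round(norm.cdf(k) - norm.cdf(-k), 3))   # 0.683, 0.954, 0.997

x, mu, sigma = 75.0, 70.0, 4.0
print((x - mu) / sigma)   # z-score: 1.25 standard deviations above the mean
```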
-
Simple Linear Regression:
- Model: ( Y = \beta_0 + \beta_1 X + \epsilon ), where ( \epsilon \sim N(0, \sigma^2) ).
- Estimation: Parameters ( \beta_0 ) and ( \beta_1 ) are estimated using the least squares method.
-
Least Squares Estimators:
- Slope (β̂₁): [$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} $$]
- Intercept (β̂₀): [$$ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} $$]
-
Residual Analysis:
- Residuals: ( e_i = Y_i - \hat{Y}_i ).
- Residual Plot: Used to check model assumptions (linearity, constant variance, normality).
-
Coefficient of Determination (R²):
- Formula: [$$ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} $$], with RSS the residual sum of squares (defined below) and TSS the total sum of squares.
- Interpretation: Measures goodness of fit (0 ≤ R² ≤ 1).
-
Variance Estimation:
- Residual Sum of Squares (RSS): [$$ RSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 $$]
- Variance Estimator: [$$ \hat{\sigma}^2 = \frac{RSS}{n-2} $$]
-
Key Calculations:
- Example Calculations:
- For given data, compute ( \hat{\beta}_0 ), ( \hat{\beta}_1 ), and ( \hat{\sigma}^2 ).
- Calculate R² to assess model fit (see the sketch below).
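A self-contained sketch of these calculations on a small dataset; the x, y values are invented for illustration:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares estimators for slope and intercept.
b1 = (sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]                       # fitted values
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # residual sum of squares
tss = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares

sigma2_hat = rss / (n - 2)      # variance estimator
r2 = 1 - rss / tss              # coefficient of determination

print(round(b0, 2), round(b1, 2), round(sigma2_hat, 4), round(r2, 3))
# 1.05 0.99 0.0357 0.989
```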
-
Confidence Intervals:
- Point Estimator: Provides a single estimate of a population parameter.
- Interval Estimator: Provides a range of values within which the parameter is expected to lie.
- Formula for Mean (σ known): [$$ \left[ \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right] $$]
- Formula for Mean (σ unknown): [$$ \left[ \bar{X} - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}}, \bar{X} + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}} \right] $$]
-
Statistical Tests:
- Hypothesis Testing: Involves setting up a null hypothesis (H₀) and an alternative hypothesis (H₁), then determining whether to reject H₀ based on sample data.
- Z-Test: Used when the population variance is known.
- Test Statistic: ( Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} )
- T-Test: Used when the population variance is unknown.
- Test Statistic: ( T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} )
- Two-Sample T-Test: Compares the means of two independent groups.
- Test Statistic: ( T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}} )
-
Key Calculations:
- Example 1: 95% confidence interval for bonbon package weights.
- Result: [63.84, 65.08]
- Example 2: Testing machine adjustment with a t-test.
- Result: Null hypothesis not rejected; the machine does not need adjustment.
- Example 3: Two-sample t-test for bonbon weights.
- Result: Reject H₀; the second company's bonbons are heavier.
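A sketch of Examples 1 and 2 in structure only; the weights below are invented, so the printed interval will not reproduce [63.84, 65.08]:

```python
import statistics
from math import sqrt
from scipy import stats

weights = [64.1, 65.3, 63.8, 64.9, 64.4, 65.0, 63.9, 64.6]   # made-up sample
n = len(weights)
x_bar = sum(weights) / n
s = statistics.stdev(weights)           # sample standard deviation (n-1)

# 95% confidence interval for the mean, sigma unknown (t quantile).
t_q = stats.t.ppf(0.975, df=n - 1)
half = t_q * s / sqrt(n)
print(round(x_bar - half, 2), round(x_bar + half, 2))

# One-sample t-test of H0: mu = 64.5 at alpha = 0.05.
t_stat, p_value = stats.ttest_1samp(weights, popmean=64.5)
print(round(t_stat, 3), round(p_value, 3))   # reject H0 iff p_value < alpha
```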
-
Point Estimation:
- A point estimator (e.g., sample mean) estimates the parameter θ from a random sample.
- Example: For a normal distribution, the arithmetic mean estimates the expected value.
-
Confidence Intervals:
- A range ( [g_l, g_u] ) that covers the parameter θ with probability 1−α.
- Types: One-sided (e.g., ( [g_l, \infty) )) and two-sided (finite interval).
-
Statistical Hypothesis Testing:
- Null Hypothesis (H0): Statement to be tested (e.g., μ = 15).
- Alternative Hypothesis (H1): Opposing statement (e.g., μ ≠ 15).
- Test Statistic: A function of the sample data to assess H0 vs. H1.
- Rejection Region: Values of the test statistic leading to rejection of H0.
- Errors:
- Type I Error: Rejecting a true H0 (controlled by significance level α).
- Type II Error: Failing to reject a false H0 (related to test power).
-
Z-Test and T-Test:
- Z-Test: Used when population variance is known or sample size is large.
- T-Test: Used when population variance is unknown; relies on sample standard deviation and t-distribution.
-
Two-Sample T-Test:
- Compares means of two independent groups.
- Assumptions: Normality and independence; the pooled version additionally assumes equal variances (homogeneity).
- Test Statistic: Accounts for sample means, standard deviations, and sizes.
- Degrees of Freedom: Calculated using Welch-Satterthwaite equation for unequal variances.
-
Examples:
- One-Sample Test: Chocolate box weights using z-test (known variance).
- Two-Sample Test: Comparing bonbon weights using t-test with Welch-Satterthwaite adjustment.
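A sketch of the two-sample case with invented weights; equal_var=False selects Welch's t-test, which applies the Welch-Satterthwaite degrees of freedom automatically:

```python
from scipy import stats

sample_x = [10.2, 10.5, 10.1, 10.4, 10.3, 10.6]   # made-up producer A weights
sample_y = [10.8, 11.0, 10.7, 11.1, 10.9, 11.2]   # made-up producer B weights

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(sample_x, sample_y, equal_var=False)
print(round(t_stat, 2), round(p_value, 5))   # small p -> reject H0: mu_X = mu_Y
```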
-
Key Concepts:
- p-value: Probability of observing test statistic under H0; compared to α.
- Confidence Interval and Hypothesis Test Relationship: H₀ is rejected at level α exactly when the hypothesized parameter value lies outside the (1−α) confidence interval.
-
Future Topics: Data preparation and decision trees.
-
Machine Learning Overview:
- Involves data preparation, model building, and evaluation.
- Follows the CRISP-DM process: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
-
Data Preparation:
- Handling Missing Data: Strategies include filtering, marking, or imputing missing values (e.g., mean, median, or model-based imputation).
- Handling False Data: Identify and correct errors through analysis or expert consultation.
- Feature Engineering: Create new features from existing data (e.g., deriving age from birthdate) or combine attributes for better model performance.
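A pandas sketch of these strategies on an invented frame; the reference year 2024 for the derived age is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, None, 31, 28, None],
    "birth_year": [2001, 1998, 1993, 1996, 1990],
})

# Imputing: fill missing ages with the column mean (median works the same way).
df["age_imputed"] = df["age"].fillna(df["age"].mean())

# Marking: keep a flag so a model can learn that the value was missing.
df["age_missing"] = df["age"].isna()

# Feature engineering: derive an approximate age from the birth year.
df["age_from_birth"] = 2024 - df["birth_year"]
print(df)
```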
-
Decision Trees:
- Structure: A tree with nodes representing tests on attributes, leading to leaf nodes with class predictions.
- Construction: Built by recursively splitting data to minimize entropy (disorder). Information gain determines the best split.
- Example: Medicine recommendation based on blood pressure and age.
- Pros and Cons: Easy to interpret but may overfit or require discretization for numerical attributes.
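A minimal sketch of the splitting criterion; the label list and the perfect split are constructed so both printed values come out to 1.0:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (disorder) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction from splitting `labels` into `groups`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

labels = ["A", "A", "B", "B", "B", "A"]        # e.g. recommended medicine
split = [["A", "A", "A"], ["B", "B", "B"]]     # split on some attribute test
print(entropy(labels))                          # 1.0: maximal disorder, 2 classes
print(information_gain(labels, split))          # 1.0: the split removes it all
```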
-
Model Evaluation:
- Train-Test Split: Evaluate models on independent test data to avoid overfitting.
- Metrics for Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE).
- Metrics for Classification: Confusion matrix, accuracy, precision, recall, F1 score.
- Accuracy: Overall correctness.
- Precision: Correct predictions among positive predictions.
- Recall: Correct predictions among actual positive instances.
- F1 Score: Harmonic mean of precision and recall.
-
Advanced Topics:
- Ensemble Methods: Random Forests and Gradient Boosting improve decision trees by reducing variance and bias.
- Model Comparison: Use validation sets to compare models and avoid overfitting.
-
Summary:
- Skills Acquired: Data preparation, decision tree classification, and model evaluation.
- Future Topics: Dashboards and summaries for data visualization and reporting.
-
Model Evaluation Basics:
- Overfitting: When a model fits the training data too well, leading to poor performance on new data.
- Underfitting: When a model is too simple to capture the data's structure, resulting in poor performance on both training and test data.
-
Data Splitting:
- Train-Test Split: Commonly used to evaluate model performance. Typical splits are 80% for training and 20% for testing.
- Train-Validation-Test Split: Used to compare different models by having separate training, validation, and test sets.
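An 80/20 split with scikit-learn on toy data; the fixed random_state just makes the split reproducible:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]      # toy feature matrix, one feature
y = [i % 2 for i in range(100)]    # toy binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))   # 80 20
```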
-
Evaluation Metrics:
- Regression Metrics:
- Mean Squared Error (MSE): Average of squared differences between predicted and actual values.
- Mean Absolute Error (MAE): Average of absolute differences.
- Mean Absolute Percentage Error (MAPE): Average of percentage differences.
- Classification Metrics:
- Confusion Matrix: Tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
- Accuracy: (TP + TN) / (TP + TN + FP + FN).
- Precision: TP / (TP + FP).
- Recall: TP / (TP + FN).
- F1 Score: Harmonic mean of precision and recall, given by ( F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} ).
-
Example Calculations:
- Confusion Matrix Example (TP = 4, FP = 2, FN = 3, TN = 1):
- Accuracy: ( \frac{4 + 1}{10} = 0.5 )
- Precision: ( \frac{4}{6} \approx 0.67 )
- Recall: ( \frac{4}{7} \approx 0.57 )
- F1 Score: ( \frac{2 \times 0.67 \times 0.57}{0.67 + 0.57} \approx 0.62 )
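The same metrics recomputed from the counts implied by the example (TP = 4, FP = 2, FN = 3, TN = 1):

```python
tp, fp, fn, tn = 4, 2, 3, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 0.5
precision = tp / (tp + fp)                           # ≈ 0.67
recall = tp / (tp + fn)                              # ≈ 0.57
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.62

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```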
-
Interpreting Results:
- Baseline Comparison: Always compare model performance against a naive reference (e.g., predicting the majority class).
- Contextual Relevance: Accuracy alone may not indicate practical usefulness; consider domain context and baseline performance.
-
Summary:
- Skills Acquired: Understanding of evaluation metrics and their application in assessing machine learning models.
- Future Topics: Advanced evaluation techniques and model optimization strategies.