# 1

1. **Organizational Information**
   - **Page:** 2
   - **Notes:**
     - Contact: [klaus.kaiser@fh-dortmund.de](mailto:klaus.kaiser@fh-dortmund.de)
     - Room: B.2.04
     - Professor Klaus Kaiser has a background in data science across various industries.
2. **Introduction to Data Science**
   - **Page:** 4-7
   - **Notes:**
     - Definition: Data science is about turning raw data into meaningful insights.
     - Interdisciplinary field combining statistics, computing, and domain knowledge.
     - Historical context of the term “data science” from 1962 to 2001.
3. **What is Data Science?**
   - **Page:** 8-10
   - **Notes:**
     - Data science uses methods and systems to extract knowledge from data.
     - It sits at the intersection of math/statistics, computer science, and domain knowledge.
4. **Practical Example of Data Science Project: Monkey Detection**
   - **Page:** 16-24
   - **Notes:**
     - Steps include understanding the problem, data collection, labeling, model training, and deployment.
5. **Related Fields in Data Science**
   - **Page:** 14-15
   - **Notes:**
     - Data Engineering: Building systems for data collection and processing.
     - Data Analysis: Inspecting and transforming data to inform decisions.
6. **Tasks in Data Science**
   - **Page:** 16
   - **Notes:**
     - Overview of different tasks within classical machine learning.
7. **Real-World Examples of Data Science Applications**
   - **Page:** 25-32
   - **Notes:**
     - Applications include autonomous driving, face recognition, predictive maintenance, fraud detection, recommendation systems, and cancer detection.
8. **Overview of Lecture Content**
   - **Page:** 34-35
   - **Notes:**
     - Basic topics include data basics, statistics, presentation techniques, and machine learning.
9. **Organizational: Schedule and Exam Information**
   - **Page:** 38-40
   - **Notes:**
     - Lecture and exercise schedules, language of instruction, and exam details (written exam with bonus points for data analytics).
10. **Expectations from Students**
    - **Page:** 42-43
    - **Notes:**
      - Emphasis on respect, professionalism, and willingness to participate.
11. **How to Continue in Data Science**
    - **Page:** 46-48
    - **Notes:**
      - Suggested literature for further reading and related courses available in the curriculum.
12. **Summary & References**
    - **Page:** 51-55
    - **Notes:**
      - Key takeaways: ability to explain data science and recognize its applications.
      - Important references for further study are provided.

# 2

- **Data Science Definition**: Creating knowledge from data using math, statistics, and computer science.
- **Data Types**:
  - **Structured**: Follows a predefined model (e.g., tables).
  - **Unstructured**: Lacks explicit structure (e.g., text, images).
- **Data Categories**:
  - Discrete vs. Continuous
  - Nominal, Ordinal, Interval, Ratio
  - Qualitative vs. Quantitative
- **Data Interchange Formats**: Common formats include CSV and JSON.
- **Data Trust**: Rests on the data quality dimensions accuracy, completeness, consistency, timeliness, uniqueness, and validity.

# 3

- **Data Categories**: Discrete, continuous, nominal, ordinal, interval, ratio, qualitative, and quantitative.
- **Data Interchange Formats**: Common formats include CSV and JSON.
- **Data Quality Dimensions**: Accuracy, completeness, consistency, timeliness, uniqueness, validity.
- **Data Types**: Primary (collected in real time for a specific purpose) vs. secondary (already existing data from the past, more economical to obtain).
- **Data Acquisition Methods**: Capturing (sensors, surveys), retrieving (databases, APIs), collecting (web scraping).
- **FAIR and Open Data**: Principles for sustainable data usage and their importance for scientific reproducibility.

# 4

- **Primary vs. Secondary Data**: Primary data is collected for a specific purpose, while secondary data is sourced from existing datasets.
- **Data Collection Techniques**: Include scraping, which extracts data from websites and raises questions of legality and data protection.
- **Data Protection**: Emphasizes GDPR compliance, anonymization, and pseudonymization of personal data.
- **Statistics Basics**: Introduces descriptive and inductive statistics, frequency distributions, and graphical representations such as histograms and bar charts.
- **FAIR Principles**: Focus on data findability, accessibility, interoperability, and reusability.

# 5

- **Data Scraping**: Extracts data from program outputs; should be a last resort.
- **Anonymization**: Removes personal information so that individuals can no longer be identified; pseudonymization replaces identifiers so that identification is possible only with additional information.
- **Statistics Types**: Descriptive, exploratory, and inductive statistics.
- **Frequencies**: Absolute and relative frequencies; visualized through histograms, pie charts, and bar charts.
- **Central Tendencies**: Mode, median, and mean; box plots visualize the distribution of the data.
- **Statistical Dispersion**: Measures the spread of the data; includes range, interquartile range, and empirical variance.

# 6

- **Histograms**: Visual representation of frequencies for continuous data.
- **Cumulative Frequency**: Total frequency of all observations up to and including a given value.
- **Statistical Dispersion**: Includes empirical variance and standard deviation.
- **Bivariate Analysis**: Examines relationships between two variables.
- **Correlation Coefficients**: Quantify the strength and direction of relationships.
- **Contingency Tables**: Display the joint frequencies of two categorical variables.
- **Pearson Coefficient**: Measures linear correlation between metric variables.
- **Ordinal Data**: Can be analyzed using rank correlation methods.

# 7

- **Correlation**: Describes the relationship between two variables using a correlation coefficient chosen according to the variable types (nominal, ordinal, metric).
- **Contingency Tables**: Used for two-dimensional frequency distributions; include conditional frequencies and measures of association.
- **Probability Theory**: Introduces random experiments, events, and the Kolmogorov axioms; covers Laplace experiments and combinatorics.
- **Bayes’ Theorem**: Relates the conditional probabilities P(A|B) and P(B|A); applied to real-world scenarios such as medical testing.
- **Learning Outcome**: Understanding of probability basics, combinatorial calculations, and the application of Bayes’ theorem.

# 8

- **Random Experiment**: An experiment carried out under well-defined conditions whose outcome cannot be predicted (e.g., throwing a die).
- **Kolmogorov Axioms**: Fundamental properties of probability measures (non-negativity, normalization, and additivity for disjoint events).
- **Random Variables**: Assign numbers to outcomes; can be discrete (countable values) or continuous (any value in an interval).
- **Distributions**: Include discrete (e.g., binomial, uniform) and continuous (e.g., normal) distributions.
- **Expected Value & Variance**: Key metrics describing the location and spread of a random variable.
- **Applications**: Used in statistical tests and linear regression.

# 9

- **Random Variables**: Defined as functions mapping outcomes to real numbers.
- **Discrete vs. Continuous Distributions**: Discrete distributions assign probabilities to countably many values; continuous distributions are described by probability density functions.
- **Simple Linear Regression**: Models a linear relationship between an independent variable (X) and a dependent variable (Y).
- **Key Concepts**:
  - **Residual Analysis**: Evaluates the fit of the regression line.
  - **Coefficient of Determination (R²)**: Indicates model fit; ranges from 0 to 1, with values closer to 1 indicating a better fit.
  - **Estimation**: Parameters (β0, β1) are estimated using the least squares method.
- **Applications**: Used in various fields to predict outcomes from the fitted relationship.
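
The least squares estimation, residuals, and R² summarized in lecture 9 can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's own implementation: the function names `fit_simple_linear_regression` and `r_squared` and the small data set are hypothetical, and the code assumes the standard closed-form estimates β1 = S_xy / S_xx and β0 = ȳ − β1·x̄.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return least-squares estimates (beta0, beta1) for y ≈ beta0 + beta1 * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2 points)")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must contain at least two distinct values")
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def r_squared(x, y, beta0, beta1):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y)
    ss_res = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("y must contain at least two distinct values")
    return 1 - ss_res / ss_tot


# Made-up illustrative data (assumption, not from the lecture)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

beta0, beta1 = fit_simple_linear_regression(x, y)
# Residuals: observed value minus value predicted by the regression line
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
r2 = r_squared(x, y, beta0, beta1)

print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}, R^2 = {r2:.2f}")
# prints: beta0 = 2.20, beta1 = 0.60, R^2 = 0.60
```

With an intercept in the model, the residuals of a least-squares fit sum to zero (up to rounding), which is a quick sanity check before inspecting them for patterns in a residual analysis.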