
To pseudonymize the "name" values in the given dataset using the specified hash function, let's follow these steps:

  1. Convert each character in the name to its alphabetic position using the alph function, where:

    • A = 1, B = 2, C = 3, and so on, up to Z = 26.

  2. Compute the hash value using the formula

    $$h(s) = \sum_{i} \text{alph}(s_i) \mod 13$$

    where $s_i$ is the $i$-th character in the string.

  3. Add a salt to the hash value. The salt is a random value that is combined with the hash to make it more secure.

Let's apply these steps to each name in the dataset.

Step-by-Step Calculation

For each name, we'll calculate the hash value before adding a salt.

1. Name: Franz

  • F = 6, R = 18, A = 1, N = 14, Z = 26
  • Sum of positions: $6 + 18 + 1 + 14 + 26 = 65$
  • Hash: $65 \mod 13 = 0$

2. Name: Antje

  • A = 1, N = 14, T = 20, J = 10, E = 5
  • Sum of positions: $1 + 14 + 20 + 10 + 5 = 50$
  • Hash: $50 \mod 13 = 11$

3. Name: Alex

  • A = 1, L = 12, E = 5, X = 24
  • Sum of positions: $1 + 12 + 5 + 24 = 42$
  • Hash: $42 \mod 13 = 3$

Pseudonymized Dataset with Hash Values

We can now add the calculated hash values to the dataset. In a real implementation, a salt would additionally be combined with these hash values for increased security:

| Original Name | Height (cm) | Shoe Size | Hash Value (no salt) |
| --- | --- | --- | --- |
| Franz | 165 | 40 | 0 |
| Antje | 170 | 39 | 11 |
| Alex | 174 | 42 | 3 |

To finalize the pseudonymization process, a random salt should be combined with each hash value, so that even records with identical names do not end up with the same pseudonymized value and the mapping cannot be reversed with a simple lookup table.
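As a cross-check, here is a minimal Python sketch of this scheme. Everything beyond the hash definition itself — the function names and mixing in the salt by appending it to the name before hashing — is an illustrative assumption, not part of the exercise statement.

```python
import secrets
import string

def alph(c: str) -> int:
    """Alphabetic position: A = 1, B = 2, ..., Z = 26."""
    return ord(c.upper()) - ord("A") + 1

def h(s: str) -> int:
    """Hash from the exercise: sum of alphabetic positions, mod 13."""
    return sum(alph(c) for c in s) % 13

def pseudonymize(name: str, salt: str) -> int:
    """Salted variant (one possible construction): hash name + salt."""
    return h(name + salt)

for name in ["Franz", "Antje", "Alex"]:
    salt = "".join(secrets.choice(string.ascii_uppercase) for _ in range(4))
    print(name, h(name), pseudonymize(name, salt))
# The unsalted hashes reproduce the table above: 0, 11, 3
```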

Now let's go through the steps for the next exercise.

Given Data

The weights of the luggage are:

22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17

Categories:

  • Light: less than 15 kg
  • Normal: between 15 and 20 kg
  • Overweight: more than 20 kg

Part (a): Absolute and Relative Frequencies

Step 1: Categorize the weights

We will count how many weights fall into each category.

| Category | Weight Values (kg) | Absolute Frequency |
| --- | --- | --- |
| Light | 11, 11, 11 | 3 |
| Normal | 17, 17, 17, 19, 19 | 5 |
| Overweight | 21, 22, 22, 44 | 4 |

Step 2: Calculate Relative Frequencies

Relative frequency is calculated as:


$$\text{Relative Frequency} = \frac{\text{Absolute Frequency}}{\text{Total Number of Weights}}$$

Total number of weights = 12

| Category | Absolute Frequency | Relative Frequency |
| --- | --- | --- |
| Light | 3 | $\frac{3}{12} = 0.25$ (25%) |
| Normal | 5 | $\frac{5}{12} \approx 0.42$ (42%) |
| Overweight | 4 | $\frac{4}{12} \approx 0.33$ (33%) |
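These counts are easy to verify programmatically; below is a minimal sketch using only the Python standard library, with the category boundaries taken from the definitions above:

```python
from collections import Counter

weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]

def category(w: float) -> str:
    # Boundaries as defined above: < 15 kg light, 15-20 kg normal, > 20 kg overweight
    if w < 15:
        return "light"
    if w <= 20:
        return "normal"
    return "overweight"

absolute = Counter(category(w) for w in weights)
relative = {cat: count / len(weights) for cat, count in absolute.items()}
print(absolute)  # Counter({'normal': 5, 'overweight': 4, 'light': 3})
print(relative)  # light 0.25, normal ~0.42, overweight ~0.33
```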

Part (b): Empirical Distribution Function and Question

The empirical distribution function (EDF) represents the cumulative frequency of the dataset.

Let's arrange the weights in increasing order:

11, 11, 11, 17, 17, 17, 19, 19, 21, 22, 22, 44

The EDF for these weights can be expressed as:

  • Less than or equal to 11 kg: 3/12 = 0.25
  • Less than or equal to 17 kg: 6/12 = 0.5
  • Less than or equal to 19 kg: 8/12 ≈ 0.67
  • Less than or equal to 21 kg: 9/12 = 0.75
  • Less than or equal to 22 kg: 11/12 ≈ 0.92
  • Less than or equal to 44 kg: 12/12 = 1
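Written out as a step function, the EDF $F(x)$ (the proportion of weights less than or equal to $x$) is:

$$F(x) = \begin{cases} 0 & x < 11 \\ \frac{3}{12} & 11 \le x < 17 \\ \frac{6}{12} & 17 \le x < 19 \\ \frac{8}{12} & 19 \le x < 21 \\ \frac{9}{12} & 21 \le x < 22 \\ \frac{11}{12} & 22 \le x < 44 \\ 1 & x \ge 44 \end{cases}$$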

Question: What is the proportion of weights that are less than 18 kg or more than 23 kg?

  • Weights less than 18 kg (11, 11, 11, 17, 17, 17): $\frac{6}{12} = 0.5$ or 50%
  • Weights more than 23 kg (only 44): $\frac{1}{12} \approx 0.08$ or 8%

Proportion of weights that are less than 18 kg or more than 23 kg: since no weight can fall into both ranges at once, the two proportions simply add up:

$$\frac{6}{12} + \frac{1}{12} = \frac{7}{12} \approx 0.58 \text{ or } 58\%$$
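This can be double-checked numerically; a quick sketch (note the strict inequalities, matching "less than 18" and "more than 23"):

```python
weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]

below_18 = sum(w < 18 for w in weights)  # 6 weights below 18 kg
above_23 = sum(w > 23 for w in weights)  # 1 weight above 23 kg
print((below_18 + above_23) / len(weights))  # 0.5833... ≈ 58%
```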

Visualization: Bar Chart and Histogram

Now, let's create a bar chart and histogram for the weight categories. I will generate these charts using the data provided.

[Figure: bar chart of absolute category frequencies; histogram of luggage weights]

The bar chart displays the absolute frequencies of each weight category, while the histogram shows the distribution of luggage weights according to the defined categories (light, normal, and overweight).

These visualizations help in understanding how the weights are distributed across the categories.
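For reproducibility, here is a minimal matplotlib sketch of both charts; placing the histogram bin edges exactly at the category boundaries (15 kg and 20 kg) is an assumption consistent with the definitions above:

```python
import matplotlib.pyplot as plt

weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: absolute frequency of each category
ax1.bar(["light", "normal", "overweight"], [3, 5, 4])
ax1.set_title("Absolute frequencies per category")
ax1.set_ylabel("Frequency")

# Histogram: bin edges at the category boundaries (15 kg and 20 kg)
ax2.hist(weights, bins=[0, 15, 20, 45], edgecolor="black")
ax2.set_title("Luggage weights")
ax2.set_xlabel("Weight (kg)")
ax2.set_ylabel("Frequency")

plt.tight_layout()
plt.show()
```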