
To pseudonymize the "name" values in the given dataset using the specified hash function, let's follow these steps:

  1. Convert each character in the name to its alphabetic position using the alph function, where:

    • A = 1, B = 2, C = 3, and so on, up to Z = 26.

  2. Compute the hash value using the formula

    $$h(s) = \sum_{i} \text{alph}(s_i) \mod 13$$

    where $s_i$ is the $i$-th character in the string.

  3. Add a salt to the hash value. The salt is a random value that is combined with the hash to make it more secure.

Let's apply these steps to each name in the dataset.

Step-by-Step Calculation

For each name, we'll calculate the hash value before adding a salt.

1. Name: Franz

  • F = 6, R = 18, A = 1, N = 14, Z = 26
  • Sum of positions: $6 + 18 + 1 + 14 + 26 = 65$
  • Hash: $65 \mod 13 = 0$

2. Name: Antje

  • A = 1, N = 14, T = 20, J = 10, E = 5
  • Sum of positions: $1 + 14 + 20 + 10 + 5 = 50$
  • Hash: $50 \mod 13 = 11$

3. Name: Alex

  • A = 1, L = 12, E = 5, X = 24
  • Sum of positions: $1 + 12 + 5 + 24 = 42$
  • Hash: $42 \mod 13 = 3$

Pseudonymized Dataset with Hash Values

We can now add the calculated hash values to the dataset. In a real implementation, a salt would additionally be combined with these hash values for increased security:

| Original Name | Height (cm) | Shoe Size | Hash Value (no salt) |
| --- | --- | --- | --- |
| Franz | 165 | 40 | 0 |
| Antje | 170 | 39 | 11 |
| Alex | 174 | 42 | 3 |

To finalize the pseudonymization process, a random salt should be combined with each hash value, so that even records with identical names do not end up with the same pseudonymized value and the mapping cannot be reversed with a simple lookup table.
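As a cross-check, here is a minimal Python sketch of this scheme. Everything beyond the hash definition itself — the function names and mixing in the salt by appending it to the name before hashing — is an illustrative assumption, not part of the exercise statement.

```python
import secrets
import string

def alph(c: str) -> int:
    """Alphabetic position: A = 1, B = 2, ..., Z = 26."""
    return ord(c.upper()) - ord("A") + 1

def h(s: str) -> int:
    """Hash from the exercise: sum of alphabetic positions, mod 13."""
    return sum(alph(c) for c in s) % 13

def pseudonymize(name: str, salt: str) -> int:
    """Salted variant (one possible construction): hash name + salt."""
    return h(name + salt)

for name in ["Franz", "Antje", "Alex"]:
    salt = "".join(secrets.choice(string.ascii_uppercase) for _ in range(4))
    print(name, h(name), pseudonymize(name, salt))
# The unsalted hashes reproduce the table above: 0, 11, 3
```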

Now let's go through the steps for the next exercise.

Given Data

The weights of the luggage are:

22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17

Categories:

  • Light: less than 15 kg
  • Normal: between 15 and 20 kg
  • Overweight: more than 20 kg

Part (a): Absolute and Relative Frequencies

Step 1: Categorize the weights

We will count how many weights fall into each category.

| Category | Weight Values (kg) | Absolute Frequency |
| --- | --- | --- |
| Light | 11, 11, 11 | 3 |
| Normal | 17, 17, 17, 19, 19 | 5 |
| Overweight | 21, 22, 22, 44 | 4 |

Step 2: Calculate Relative Frequencies

Relative frequency is calculated as:


$$\text{Relative Frequency} = \frac{\text{Absolute Frequency}}{\text{Total Number of Weights}}$$

Total number of weights = 12

| Category | Absolute Frequency | Relative Frequency |
| --- | --- | --- |
| Light | 3 | $\frac{3}{12} = 0.25$ (25%) |
| Normal | 5 | $\frac{5}{12} \approx 0.42$ (42%) |
| Overweight | 4 | $\frac{4}{12} \approx 0.33$ (33%) |
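These counts are easy to verify programmatically; below is a minimal sketch using only the Python standard library, with the category boundaries taken from the definitions above:

```python
from collections import Counter

weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]

def category(w: float) -> str:
    # Boundaries as defined above: < 15 kg light, 15-20 kg normal, > 20 kg overweight
    if w < 15:
        return "light"
    if w <= 20:
        return "normal"
    return "overweight"

absolute = Counter(category(w) for w in weights)
relative = {cat: count / len(weights) for cat, count in absolute.items()}
print(absolute)  # Counter({'normal': 5, 'overweight': 4, 'light': 3})
print(relative)  # light 0.25, normal ~0.42, overweight ~0.33
```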

Part (b): Empirical Distribution Function and Question

The empirical distribution function (EDF) represents the cumulative frequency of the dataset.

Let's arrange the weights in increasing order:

11, 11, 11, 17, 17, 17, 19, 19, 21, 22, 22, 44

The EDF for these weights can be expressed as:

  • Less than or equal to 11 kg: 3/12 = 0.25
  • Less than or equal to 17 kg: 6/12 = 0.5
  • Less than or equal to 19 kg: 8/12 ≈ 0.67
  • Less than or equal to 21 kg: 9/12 = 0.75
  • Less than or equal to 22 kg: 11/12 ≈ 0.92
  • Less than or equal to 44 kg: 12/12 = 1
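Written out as a step function, the EDF $F(x)$ (the proportion of weights less than or equal to $x$) is:

$$F(x) = \begin{cases} 0 & x < 11 \\ \frac{3}{12} & 11 \le x < 17 \\ \frac{6}{12} & 17 \le x < 19 \\ \frac{8}{12} & 19 \le x < 21 \\ \frac{9}{12} & 21 \le x < 22 \\ \frac{11}{12} & 22 \le x < 44 \\ 1 & x \ge 44 \end{cases}$$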

Question: What is the proportion of weights that are less than 18 kg or more than 23 kg?

  • Weights less than 18 kg (11, 11, 11, 17, 17, 17): $\frac{6}{12} = 0.5$ or 50%
  • Weights more than 23 kg (only 44): $\frac{1}{12} \approx 0.08$ or 8%

Proportion of weights that are less than 18 kg or more than 23 kg: since no weight can fall into both ranges at once, the two proportions simply add up:

$$\frac{6}{12} + \frac{1}{12} = \frac{7}{12} \approx 0.58 \text{ or } 58\%$$
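This can be double-checked numerically; a quick sketch (note the strict inequalities, matching "less than 18" and "more than 23"):

```python
weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]

below_18 = sum(w < 18 for w in weights)  # 6 weights below 18 kg
above_23 = sum(w > 23 for w in weights)  # 1 weight above 23 kg
print((below_18 + above_23) / len(weights))  # 0.5833... ≈ 58%
```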

Visualization: Bar Chart and Histogram

Now, let's create a bar chart and histogram for the weight categories. I will generate these charts using the data provided.

[Figure: bar chart of absolute category frequencies; histogram of luggage weights]

The bar chart displays the absolute frequencies of each weight category, while the histogram shows the distribution of luggage weights according to the defined categories (light, normal, and overweight).

These visualizations help in understanding how the weights are distributed across the categories.
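For reproducibility, here is a minimal matplotlib sketch of both charts; placing the histogram bin edges exactly at the category boundaries (15 kg and 20 kg) is an assumption consistent with the definitions above:

```python
import matplotlib.pyplot as plt

weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: absolute frequency of each category
ax1.bar(["light", "normal", "overweight"], [3, 5, 4])
ax1.set_title("Absolute frequencies per category")
ax1.set_ylabel("Frequency")

# Histogram: bin edges at the category boundaries (15 kg and 20 kg)
ax2.hist(weights, bins=[0, 15, 20, 45], edgecolor="black")
ax2.set_title("Luggage weights")
ax2.set_xlabel("Weight (kg)")
ax2.set_ylabel("Frequency")

plt.tight_layout()
plt.show()
```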