To pseudonymize the "name" values in the given dataset using the specified hash function, let's follow these steps:

1. **Convert each character in the name to its alphabetic position using the `alph` function**, where:
   - `A = 1`, `B = 2`, `C = 3`, and so on, up to `Z = 26`.

2. **Compute the hash value** using the formula:
   $$
   h(s) = \left( \sum_{i} \text{alph}(s_i) \right) \bmod 13
   $$
   where \( s_i \) is the \( i \)-th character of the string.

3. **Add a salt** to the hash value. The salt is a random value combined with the hash so that identical inputs do not produce identical pseudonyms (see the sketch below).

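As a minimal Python sketch of the first two steps (the names `alph` and `h` simply mirror the exercise's notation):

```python
import string

def alph(c: str) -> int:
    """Alphabetic position of a letter: A = 1, B = 2, ..., Z = 26."""
    return string.ascii_uppercase.index(c.upper()) + 1

def h(s: str) -> int:
    """Hash: sum of alphabetic positions, reduced mod 13."""
    return sum(alph(c) for c in s) % 13
```
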
Let's apply these steps to each name in the dataset.

### Step-by-Step Calculation

For each name, we'll calculate the hash value before adding a salt.

#### 1. Name: Franz
- `F = 6`, `R = 18`, `A = 1`, `N = 14`, `Z = 26`
- Sum of positions: \( 6 + 18 + 1 + 14 + 26 = 65 \)
- Hash: \( 65 \bmod 13 = 0 \)

#### 2. Name: Antje
- `A = 1`, `N = 14`, `T = 20`, `J = 10`, `E = 5`
- Sum of positions: \( 1 + 14 + 20 + 10 + 5 = 50 \)
- Hash: \( 50 \bmod 13 = 11 \)

#### 3. Name: Alex
- `A = 1`, `L = 12`, `E = 5`, `X = 24`
- Sum of positions: \( 1 + 12 + 5 + 24 = 42 \)
- Hash: \( 42 \bmod 13 = 3 \)

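As a quick cross-check, the sketch above reproduces the manual arithmetic:

```python
for name in ["Franz", "Antje", "Alex"]:
    print(name, h(name))
# Franz 0
# Antje 11
# Alex 3
```
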
### Pseudonymized Dataset with Hash Values

We can now add the calculated hash values to the dataset. Note that a salt should still be added to these hash values for increased security in a real implementation:

| Original Name | Height | Shoe Size | Hash Value (no salt) |
|---------------|--------|-----------|----------------------|
| Franz         | 165    | 40        | 0                    |
| Antje         | 170    | 39        | 11                   |
| Alex          | 174    | 42        | 3                    |

To finalize the pseudonymization process, a random salt should be combined with these hash values, ensuring that even if two identical names are processed, they won't result in the same pseudonymized value.

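One possible salted variant, reusing `alph` from the sketch above. The exercise does not specify how the salt is combined, so adding it to the sum before the modulo, and drawing it with `secrets`, are assumptions:

```python
import secrets

def h_salted(s: str, salt: int) -> int:
    # Assumption: the salt is added to the character sum before reducing mod 13.
    return (sum(alph(c) for c in s) + salt) % 13

salt = secrets.randbelow(13)  # fresh random salt
print(h_salted("Franz", salt))
```
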
Let's go through the steps for the next exercise.

### Given Data
The weights of the luggage (in kg) are:
```
22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17
```

### Categories
- **Light**: less than 15 kg
- **Normal**: between 15 and 20 kg
- **Overweight**: more than 20 kg

### Part (a): Absolute and Relative Frequencies

#### Step 1: Categorize the weights
We will count how many weights fall into each category.

| Category   | Weight Values      | Absolute Frequency |
|------------|--------------------|--------------------|
| Light      | 11, 11, 11         | 3                  |
| Normal     | 17, 17, 17, 19, 19 | 5                  |
| Overweight | 21, 22, 22, 44     | 4                  |

#### Step 2: Calculate Relative Frequencies
Relative frequency is calculated as:
$$
\text{Relative Frequency} = \frac{\text{Absolute Frequency}}{\text{Total Number of Weights}}
$$

Total number of weights = 12

| Category   | Absolute Frequency | Relative Frequency                     |
|------------|--------------------|----------------------------------------|
| Light      | 3                  | \( \frac{3}{12} = 0.25 \) or 25%       |
| Normal     | 5                  | \( \frac{5}{12} \approx 0.42 \) or 42% |
| Overweight | 4                  | \( \frac{4}{12} \approx 0.33 \) or 33% |

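A short sketch that reproduces both tables. The category boundaries follow the definitions above; since weights of exactly 15 or 20 kg do not occur in the data, the boundary handling in `category` is an assumption:

```python
from collections import Counter

weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]

def category(w: float) -> str:
    if w < 15:
        return "Light"
    elif w <= 20:
        return "Normal"
    return "Overweight"

abs_freq = Counter(category(w) for w in weights)
for cat in ["Light", "Normal", "Overweight"]:
    n = abs_freq[cat]
    print(f"{cat}: {n} ({n / len(weights):.0%})")
# Light: 3 (25%)
# Normal: 5 (42%)
# Overweight: 4 (33%)
```
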
### Part (b): Empirical Distribution Function and Question

The empirical distribution function (EDF) gives, for each value \( x \), the proportion of weights less than or equal to \( x \).

Let's arrange the weights in increasing order:
```
11, 11, 11, 17, 17, 17, 19, 19, 21, 22, 22, 44
```

The EDF takes the following values at the observed weights:
- Less than or equal to 11 kg: 3/12 = 0.25
- Less than or equal to 17 kg: 6/12 = 0.5
- Less than or equal to 19 kg: 8/12 ≈ 0.67
- Less than or equal to 21 kg: 9/12 = 0.75
- Less than or equal to 22 kg: 11/12 ≈ 0.92
- Less than or equal to 44 kg: 12/12 = 1

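A minimal sketch of the EDF as a function:

```python
weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]

def edf(x: float) -> float:
    """Empirical distribution function: fraction of weights <= x."""
    return sum(w <= x for w in weights) / len(weights)

for x in sorted(set(weights)):
    print(f"F({x}) = {edf(x):.2f}")
# F(11) = 0.25, F(17) = 0.50, F(19) = 0.67,
# F(21) = 0.75, F(22) = 0.92, F(44) = 1.00
```
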
#### Question: What is the proportion of weights that are less than 18 kg or more than 23 kg?
- **Weights less than 18 kg:** 6 out of 12 = \( \frac{6}{12} = 0.5 \) or 50%
- **Weights more than 23 kg:** 1 out of 12 = \( \frac{1}{12} \approx 0.08 \) or 8%

Since the two events are disjoint, their proportions simply add up:
$$
\frac{6}{12} + \frac{1}{12} = \frac{7}{12} \approx 0.58 \text{, or about } 58\%
$$

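This can be verified directly against the raw data:

```python
weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]

p = sum(w < 18 or w > 23 for w in weights) / len(weights)
print(f"{p:.2f}")  # 0.58 (exactly 7/12)
```
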
### Visualization: Bar Chart and Histogram

Now, let's create a bar chart and a histogram for the weight categories.

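A sketch using matplotlib; the histogram's bin edges are chosen to match the category boundaries, and the exact styling is illustrative:

```python
import matplotlib.pyplot as plt

weights = [22, 44, 11, 19, 21, 17, 17, 11, 11, 19, 22, 17]
categories = ["Light", "Normal", "Overweight"]
abs_freq = [3, 5, 4]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: absolute frequency per category
ax1.bar(categories, abs_freq)
ax1.set_title("Absolute Frequencies per Category")
ax1.set_ylabel("Frequency")

# Histogram: bins follow the category boundaries (<15, 15-20, >20)
ax2.hist(weights, bins=[0, 15, 20, 45], edgecolor="black")
ax2.set_title("Histogram of Luggage Weights")
ax2.set_xlabel("Weight (kg)")
ax2.set_ylabel("Frequency")

plt.tight_layout()
plt.show()
```
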
The bar chart displays the absolute frequencies of each weight category, while the histogram shows the distribution of luggage weights according to the defined categories (light, normal, and overweight).

These visualizations help in understanding how the weights are distributed across the categories.