@@ -16,18 +17,26 @@ An open-source Python library for data cleaning tasks. It includes functions for
> [!NOTE]
> ValX will automatically install a version of `scikit-learn` that is compatible with your device if you don't have one already.

## Changes in 0.2.6

ValX v0.2.6 fixes a major bug in which profanity lists for multiple languages were missing, filed under the wrong languages, or incomplete (see the issue: https://github.com/Infinitode/ValX/issues/4).

Version 0.2.6 fixes this by using the original language list data, and adds new handling for language selection, including:

- Case insensitivity for language selection: "English", "EN", "en", or variants like "enGliSh" all work for language selection in the `detect_profanity` and `remove_profanity` functions.
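
One way such case-insensitive selection could work is to lowercase the input before looking it up. The sketch below is illustrative only: `normalize_language` and the `_LANGUAGES` table are hypothetical names, not ValX internals, and the table covers just three of the supported languages.

```python
# Hypothetical lookup table (not ValX's internal data): lowercase names and
# codes map to a canonical language name.
_LANGUAGES = {
    "english": "English", "en": "English",
    "french": "French", "fr": "French",
    "german": "German", "de": "German",
}


def normalize_language(language):
    """Return the canonical name for a language name or code, ignoring case."""
    key = language.strip().lower()
    if key not in _LANGUAGES:
        raise ValueError(f"Unsupported language: {language!r}")
    return _LANGUAGES[key]


# Every variant from the release note resolves to the same canonical name.
for variant in ("English", "EN", "en", "enGliSh"):
    print(variant, "->", normalize_language(variant))
```

Normalizing once at the entry point keeps the rest of the lookup logic free of casing concerns.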
## Changes in 0.2.5
ValX v0.2.5 introduces enhanced flexibility for profanity filtering by adding support for custom profanity lists:

- **Custom Profanity Word Lists**: Users can now provide their own lists of profane words directly as Python lists to the `detect_profanity` and `remove_profanity` functions via the new `custom_words_list` parameter.
- **Standalone Custom Lists**: Utilize your custom profanity list exclusively by setting the `language` parameter to `None`. ValX will then only use the words provided in `custom_words_list`.
- **Combined Lists**: Use a custom list in conjunction with ValX's built-in language-specific wordlists. Simply provide both a `language` (e.g., "English") and your `custom_words_list`. ValX will use the combined set of words.
- **Loading Custom Lists from File**: A new helper function, `load_custom_profanity_from_file(filepath)`, allows you to easily load custom profanity words from a text file.
  - **File Format**: The file should contain one profanity word per line.
    - Lines starting with a hash symbol (`#`) are treated as comments and ignored.
    - Empty lines or lines containing only whitespace are also ignored.
- **Updated Detection Reporting**: The `detect_profanity` function's output now specifies the source of detected profanity more clearly (e.g., "Custom", "Custom + English").

These features give users greater control over the profanity filtering process, allowing for more tailored and specific use cases.
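
The file format rules and the combined-list behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: `parse_profanity_lines` and `combine_words` are hypothetical helpers, the built-in word set is a placeholder, and treating indented `#` lines as comments is a guess.

```python
def parse_profanity_lines(lines):
    """Collect one word per line, skipping comments and blank lines."""
    words = []
    for raw in lines:
        line = raw.strip()
        # Skip empty or whitespace-only lines and lines starting with '#'.
        # Assumption: leading whitespace before '#' is ignored.
        if not line or line.startswith("#"):
            continue
        words.append(line)
    return words


def combine_words(builtin_words, custom_words_list=None):
    """Merge a built-in word set with an optional custom list (hypothetical)."""
    return set(builtin_words) | set(custom_words_list or [])


# Lines as they might be read from a custom word file.
sample_file = ["# team-specific terms", "", "darn", "   ", "heck"]
custom = parse_profanity_lines(sample_file)
print(custom)  # ['darn', 'heck']
print(sorted(combine_words({"badword"}, custom)))  # ['badword', 'darn', 'heck']
```

With the real library, the list returned by `load_custom_profanity_from_file(filepath)` would play the role of `custom` here and be passed as `custom_words_list`.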
@@ -42,6 +51,7 @@ We've also removed `scikit-learn==1.2.2` as a dependency, as most versions of `s
We have introduced a new optional `info_type` parameter to the `detect_sensitive_information` and `remove_sensitive_information` functions, giving you fine-grained control over which sensitive information you want to detect or remove.

We have also introduced detection patterns for more types of sensitive information, including:

- `"iban"`: International Bank Account Number.
- `"mrn"`: Medical Record Number (may not work correctly, depending on provider and country).
- `"icd10"`: International Classification of Diseases, Tenth Revision.
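
To illustrate how a per-type filter of this kind might be organized, here is a minimal sketch. The `find_sensitive` function and both regular expressions are assumptions made for the example. They are deliberately simplified and are not the patterns ValX ships.

```python
import re

# Simplified, illustrative patterns; ValX's actual patterns may differ.
PATTERNS = {
    # Two-letter country code, two check digits, 11 to 30 alphanumerics.
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    # A letter, two digits, and an optional dotted extension (e.g. "E11.9").
    "icd10": re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,4})?\b"),
}


def find_sensitive(text, info_type=None):
    """Return {type: [matches]} for one type, or for every type if None."""
    if info_type is None:
        selected = PATTERNS
    else:
        selected = {info_type: PATTERNS[info_type]}
    return {name: pattern.findall(text) for name, pattern in selected.items()}


text = "IBAN GB82WEST12345698765432, diagnosis code E11.9."
print(find_sensitive(text, info_type="iban"))
print(find_sensitive(text))
```

Keying the patterns by type name means that restricting detection to one kind of information is a dictionary lookup, with no separate code path per type.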
@@ -54,6 +64,7 @@ Also introduced more detection patterns for other types of sensitive information
## Changes in 0.2.2

We have refactored and changed the `detect_profanity` function:

- Removed unnecessary printing
- Now returns more information about each found profanity, including `Line`, `Column`, `Word`, and `Language`.
@@ -95,36 +106,65 @@ Please ensure that you have one of these Python versions installed before using
- **Remove Hate Speech**: Remove hate speech or offensive speech in text, using AI.

### List of supported languages for profanity detection and removal

Below is the complete list of languages supported by ValX's profanity detection and removal functions. Each entry is a valid value for `language`:

- All
- Arabic
  - AR
- Czech
  - CS
- Danish
  - DA
- German
  - DE
- English
  - EN
- Esperanto
  - EO
- Persian
- Finnish
  - FI
- Filipino
  - FIL
- French
  - FR
- French (CA)
  - FR-CA-U-SD-CAQC
- Hindi
  - HI
- Hungarian
  - HU
- Italian
  - IT
- Japanese
  - JA
- Kabyle
  - KAB
- Korean
  - KO
- Dutch
  - NL
- Norwegian
  - NO
- Polish
  - PL
- Portuguese
  - PT
- Russian
  - RU
- Spanish
  - ES
- Swedish
  - SV
- Thai
  - TH
- Klingon
  - TLH
- Turkish
  - TR
- Chinese
  - ZH

## Usage
@@ -214,15 +254,15 @@ print(results_file_only)
**Output Format for `detect_profanity`**

The `detect_profanity` function returns a list of dictionaries. Each dictionary includes:

- `"Line"`: The line number (1-indexed).
- `"Column"`: The column number (1-indexed) where the profanity starts.
- `"Word"`: The detected profanity word.
- `"Language"`: Indicates the source of the word list:
  - `<LanguageName>` (e.g., "English"): If only a built-in language list was used.
  - `"Custom"`: If `language=None` and only a `custom_words_list` was used.
  - `"Custom + <LanguageName>"` (e.g., "Custom + English"): If both a built-in list and `custom_words_list` were used.
  - `"Custom + All"`: If `language='All'` and `custom_words_list` were used.
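
To make the record shape concrete, the sketch below builds results in the same format. It is a simplified stand-in and not ValX's detector: `source_label` and `find_words` are hypothetical names, matching is plain whole-word and case-insensitive, and the built-in word set is passed in by hand. For the sample input it reports `darn` at line 1, column 7 and `heck` at line 2, column 4, both labelled "Custom + English".

```python
import re


def source_label(language, custom_words_list):
    """Build the "Language" value following the rules listed above."""
    if custom_words_list and language is None:
        return "Custom"
    if custom_words_list:
        return f"Custom + {language}"
    return language


def find_words(lines, words, language=None, custom_words_list=None):
    """Report each listed word with 1-indexed line and column positions."""
    label = source_label(language, custom_words_list)
    all_words = set(words) | set(custom_words_list or [])
    if not all_words:
        return []
    # Whole-word, case-insensitive alternation over every known word.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in sorted(all_words)) + r")\b",
        re.IGNORECASE,
    )
    results = []
    for line_no, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            results.append({
                "Line": line_no,
                "Column": match.start() + 1,  # 0-based offset to 1-indexed
                "Word": match.group(0),
                "Language": label,
            })
    return results


print(find_words(["Well, darn it.", "Oh heck!"], {"darn"}, "English", ["heck"]))
```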

**4. Removing Profanity**
@@ -283,6 +323,7 @@ outcome_of_detection = detect_hate_speech("You are stupid.")
> [!IMPORTANT]
> The model's possible outputs are:
>
> - `['Hate Speech']`: The text was flagged and contained hate speech.
> - `['Offensive Speech']`: The text was flagged and contained offensive speech.
> - `['No Hate and Offensive Speech']`: The text was not flagged for any hate speech or offensive speech.
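
Because each outcome is a one-element list of labels, downstream code can branch on it with a small helper. The sketch below assumes the three documented labels are the only possible values. `is_flagged` is a hypothetical helper, not part of ValX, and the sample outcomes are written by hand, not produced by a model call.

```python
# Labels as documented above; each outcome is a one-element list.
FLAGGED_LABELS = {"Hate Speech", "Offensive Speech"}


def is_flagged(outcome):
    """Return True if the outcome list contains a flagged label."""
    return any(label in FLAGGED_LABELS for label in outcome)


# Hand-written outcomes in the documented shape.
samples = (
    ["Hate Speech"],
    ["Offensive Speech"],
    ["No Hate and Offensive Speech"],
)
for outcome in samples:
    print(outcome, "->", "flagged" if is_flagged(outcome) else "clean")
```

In real use, the value returned by `detect_hate_speech(...)` would be passed to the helper in place of the sample lists.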
@@ -299,7 +340,9 @@ Contributions are welcome! If you encounter any issues, have suggestions, or wan
ValX is released under the terms of the **MIT License (Modified)**. Please see the [LICENSE](https://github.com/infinitode/valx/blob/main/LICENSE) file for the full text.