[Backport 5.0.x] Sanitize metadata input #14343

Merged: giohappy merged 1 commit into 5.0.x from p9f9-fj9v-50x on Jun 17, 2026
@etj commented Jun 16, 2026

Sanitize metadata input

Checklist

Reviewing is done by project maintainers, mostly on a volunteer basis. We try to keep the overhead as small as possible and would appreciate your help in doing so by completing the following items. Feel free to ask in a comment if you have trouble with any of them.

For all pull requests:

  • Confirm you have read the contribution guidelines
  • You have sent a Contribution Licence Agreement (CLA) as necessary (not required for small changes, e.g., fixing typos in the documentation)
  • Make sure the first PR targets the master branch; any backports will be managed later. This can be ignored if the PR fixes an issue that only happens in a specific branch, but not in newer ones.

The following are required only for core and extension modules (they are welcomed, but not required, for contrib modules):

  • There is a ticket in https://github.com/GeoNode/geonode/issues describing the issue/improvement/feature (a notable exception is changes not visible to end users)
  • The issue connected to the PR must have Labels and Milestone assigned
  • PRs for bug fixes and small new features are presented as a single commit
  • PR title must be in the form "[Fixes #<issue_number>] Title of the PR"
  • New unit tests have been added covering the changes, unless there is an explanation on why the tests are not necessary/implemented

Submitting the PR does not require you to check all items, but by the time it gets merged, they should be either satisfied or inapplicable.

@etj self-assigned this Jun 16, 2026
Copilot AI review requested due to automatic review settings June 16, 2026 14:12
@cla-bot (Bot) added the cla-signed (CLA Bot: community license agreement signed) label Jun 16, 2026
@etj changed the title from "Sanitize metadata input" to "[Backport 5.0.x] Sanitize metadata input" Jun 16, 2026
@etj requested a review from giohappy June 16, 2026 14:13
@etj added this to the 5.0.3 milestone Jun 16, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new CleanupHandler to sanitize metadata fields from potentially unsafe HTML tags (XSS protection) during deserialization, along with corresponding unit tests and localization updates. The review feedback highlights a critical security vulnerability where unclosed HTML tags can bypass the sanitization regex, suggesting a simplified pattern to catch them. Additionally, the feedback recommends removing redundant list conversions during dictionary and list iterations to improve performance, and adding a unit test to verify the sanitization of unclosed tags.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.



class CleanupHandler(MetadataHandler):
    _HTML_LIKE_PATTERN = re.compile(r"<\s*/?\s*[a-zA-Z][^>]*>")

security-critical critical

The regular expression _HTML_LIKE_PATTERN requires a closing > to match ([^>]*>). This creates a critical security vulnerability (XSS bypass). An attacker can submit an unclosed dangerous tag, such as <script src="http://evil.com/xss.js", which will not be matched by the pattern and thus will bypass sanitization completely. When rendered in the browser, the browser's lenient HTML parser will use subsequent tags in the page to close the unclosed tag and execute the malicious script.

To fix this, simplify the pattern to detect any potential tag start (closed or unclosed), allowing BeautifulSoup to safely parse and decompose it.

Suggested change
-    _HTML_LIKE_PATTERN = re.compile(r"<\s*/?\s*[a-zA-Z][^>]*>")
+    _HTML_LIKE_PATTERN = re.compile(r"<\s*/?\s*[a-zA-Z]")
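
The difference between the two patterns can be checked with the standard `re` module alone. The snippet below is a minimal sketch, not code from the PR: the names `STRICT`, `LOOSE`, and the sample strings are illustrative, and only the two pattern strings are taken from the suggestion above.

```python
import re

# Pattern from the original diff: it needs a closing ">" to match.
STRICT = re.compile(r"<\s*/?\s*[a-zA-Z][^>]*>")
# Pattern from the suggested change: it matches as soon as a tag starts.
LOOSE = re.compile(r"<\s*/?\s*[a-zA-Z]")

# Illustrative inputs (assumptions, not test data from the PR).
samples = {
    "closed": "<script>alert(1)</script>",
    "unclosed": '<script src="http://evil.com/xss.js"',
    "plain": "area < 5 km and depth > 2 m",
}

# For each sample, record whether each pattern finds a tag-like substring.
results = {
    name: (bool(STRICT.search(text)), bool(LOOSE.search(text)))
    for name, text in samples.items()
}

for name, (strict_hit, loose_hit) in results.items():
    print(f"{name}: strict={strict_hit} loose={loose_hit}")
```

Only the unclosed sample separates the two patterns. The looser pattern also flags ordinary text such as `a<b`, so more values reach the parsing step; whether that trade-off is acceptable depends on how the handler treats values that turn out to contain no real markup.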

Comment on lines +58 to +64
    for key, nested_value in list(value.items()):
        nested_path = path + [str(key)]
        value[key] = self._sanitize_instance(nested_value, context, errors, nested_path)
    return value

if isinstance(value, list):
    for idx, nested_value in enumerate(list(value)):

medium

Creating list copies of dictionary items (list(value.items())) and list elements (list(value)) is redundant and inefficient.

Since we are only modifying the values of existing keys in the dictionary and elements in the list without changing their sizes/lengths, we can safely iterate over value.items() and value directly. This avoids unnecessary memory allocation and improves performance.

Suggested change
-    for key, nested_value in list(value.items()):
-        nested_path = path + [str(key)]
-        value[key] = self._sanitize_instance(nested_value, context, errors, nested_path)
-    return value
-if isinstance(value, list):
-    for idx, nested_value in enumerate(list(value)):
+    for key, nested_value in value.items():
+        nested_path = path + [str(key)]
+        value[key] = self._sanitize_instance(nested_value, context, errors, nested_path)
+    return value
+if isinstance(value, list):
+    for idx, nested_value in enumerate(value):
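
The claim that no copy is needed can be illustrated in isolation. The following is a rough sketch with a hypothetical helper, `upper_in_place`, standing in for `_sanitize_instance`; it rewrites values of existing keys and existing list slots while iterating over the containers directly.

```python
def upper_in_place(value):
    """Upper-case every string in a nested dict/list without copying containers."""
    if isinstance(value, dict):
        # Rebinding existing keys does not resize the dict, so iterating
        # over value.items() directly is safe.
        for key, nested in value.items():
            value[key] = upper_in_place(nested)
        return value
    if isinstance(value, list):
        # Replacing elements by index keeps the list length unchanged.
        for idx, nested in enumerate(value):
            value[idx] = upper_in_place(nested)
        return value
    return value.upper() if isinstance(value, str) else value


# Illustrative payload (an assumption, not data from the PR).
data = {"title": "a", "items": ["b", {"c": "d"}], "count": 3}
result = upper_in_place(data)
print(result)
```

This holds only while the loop body leaves the container size alone. Adding or deleting keys inside the loop would raise `RuntimeError` for a dict, which is the case the defensive `list(...)` copies guard against.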

Comment thread geonode/metadata/tests/tests.py Outdated
Comment on lines +1015 to +1017
    self.assertIn("title", context["errors"])
    self.assertIn("__errors", context["errors"]["title"])
    self.assertIn("metadata_error_sanitized", context["errors"]["title"]["__errors"])

medium

Add a unit test to verify that unclosed dangerous HTML tags (e.g., <script or <iframe without a closing >) are successfully detected and sanitized, preventing future regressions of the XSS bypass vulnerability.

Suggested change
      self.assertIn("title", context["errors"])
      self.assertIn("__errors", context["errors"]["title"])
      self.assertIn("metadata_error_sanitized", context["errors"]["title"]["__errors"])
+
+ @override_settings(LANGUAGE_CODE="en")
+ def test_pre_deserialization_sanitizes_unclosed_tags(self):
+     instance = {
+         "title": "<script src=http://evil.com/xss.js",
+         "body": "<iframe src=http://evil.com",
+     }
+     context = {"errors": {}}
+     self.handler.pre_deserialization(self.resource, {}, instance, partial=set(), context=context)
+     self.assertEqual(instance["title"], "")
+     self.assertEqual(instance["body"], "")

Copilot AI left a comment

Pull request overview

This PR introduces server-side sanitization of incoming metadata payloads to mitigate HTML/script injection, and adds an i18n thesaurus entry to surface a user-facing warning when sanitization occurs.

Changes:

  • Add a CleanupHandler that recursively strips HTML-like content from incoming metadata values (pre-deserialization).
  • Wire the sanitization step into MetadataManager.update_schema_instance() and propagate an errors dict via context.
  • Add i18n RDF labels for new/updated metadata error messages and add unit tests covering sanitization behavior.
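
The recursive step described in the list above can be sketched roughly as follows. This is not the PR's `CleanupHandler`: the function name `sanitize`, the regex-based stripping (a stand-in for whatever parser the handler uses), and the flat dotted keys in the `errors` dict are all assumptions made for illustration. Only the `__errors` key, the `metadata_error_sanitized` label, and the `items.[1]` path style are taken from the tests quoted on this page.

```python
import re

# Hypothetical stand-in for the real cleanup step: removes anything that
# looks like a tag, closed or not. Not the pattern or parser used in the PR.
_TAG = re.compile(r"<\s*/?\s*[a-zA-Z][^>]*>?")


def sanitize(value, errors, path=()):
    """Recursively clean strings and record the paths that were changed."""
    if isinstance(value, dict):
        for key, nested in value.items():
            value[key] = sanitize(nested, errors, path + (str(key),))
        return value
    if isinstance(value, list):
        for idx, nested in enumerate(value):
            value[idx] = sanitize(nested, errors, path + (f"[{idx}]",))
        return value
    if isinstance(value, str) and _TAG.search(value):
        # Assumed layout: one entry per dotted path, e.g. "items.[1]".
        field = errors.setdefault(".".join(path), {})
        field.setdefault("__errors", []).append("metadata_error_sanitized")
        return _TAG.sub("", value)
    # Non-string leaves (numbers, None, ...) pass through untouched.
    return value


# Illustrative payload (an assumption, not the fixture from the PR's tests).
instance = {
    "title": "<b>xss</b>",
    "details": {"body": "safe"},
    "items": ["ok", "<i>bad"],
    "count": 3,
}
context = {"errors": {}}
sanitize(instance, context["errors"])
print(instance)
print(context["errors"])
```

Passing the `errors` dict down the recursion keeps the function free of global state and lets the caller decide how to surface the warnings, which is in the spirit of the `context`-based propagation the overview mentions.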

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| geonode/metadata/thesauri/labels-i18n.metadata.rdf | Adds i18n thesaurus concepts/labels including metadata_error_sanitized. |
| geonode/metadata/tests/tests.py | Updates manager-context expectations and adds tests for sanitization/logging/error reporting. |
| geonode/metadata/manager.py | Injects errors into context and calls CleanupHandler.pre_deserialization() before handler updates. |
| geonode/metadata/handlers/meta.py | Implements CleanupHandler to sanitize nested strings and record warnings/errors. |

Comment thread geonode/metadata/tests/tests.py Outdated
Comment on lines +1001 to +1017
with self.assertLogs("geonode.metadata.handlers.meta", level="WARNING") as cm:
    context = {"errors": {}}
    self.handler.pre_deserialization(self.resource, {}, instance, partial=set(), context=context)

self.assertEqual(instance["title"], "xss")
self.assertEqual(instance["details"]["body"], "safe")
self.assertEqual(instance["items"][1], "bad")
self.assertEqual(instance["count"], 3)

logs = "\n".join(cm.output)
self.assertIn("Sanitized potentially unsafe metadata field 'title'", logs)
self.assertIn("Sanitized potentially unsafe metadata field 'details.body'", logs)
self.assertIn("Sanitized potentially unsafe metadata field 'items.[1]'", logs)

self.assertIn("title", context["errors"])
self.assertIn("__errors", context["errors"]["title"])
self.assertIn("metadata_error_sanitized", context["errors"]["title"]["__errors"])

Comment on lines +1 to +3
<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns="http://www.w3.org/2004/02/skos/core#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
<ConceptScheme rdf:about="https://i18n.geonode.org">

codecov Bot commented Jun 16, 2026

Codecov Report

❌ Patch coverage is 96.47059% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (5.0.x@31b61a6). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff            @@
##             5.0.x   #14343   +/-   ##
========================================
  Coverage         ?   74.51%           
========================================
  Files            ?      945           
  Lines            ?    56863           
  Branches         ?     7707           
========================================
  Hits             ?    42372           
  Misses           ?    12809           
  Partials         ?     1682           
@etj force-pushed the p9f9-fj9v-50x branch from ad128fe to 8be7396 on June 16, 2026 15:48
@etj force-pushed the p9f9-fj9v-50x branch from 8be7396 to bbf66cc on June 16, 2026 16:26
@giohappy merged commit 730d8fa into 5.0.x Jun 17, 2026
13 checks passed
@giohappy deleted the p9f9-fj9v-50x branch June 17, 2026 08:30