Module rspamd_html

This module provides methods to access HTML tags. To get the HTML context from an HTML part, use the part:get_html() method.

Example:

rspamd_config.R_EMPTY_IMAGE = function(task)
  local tp = task:get_text_parts() -- get text parts in a message

  for _,p in ipairs(tp) do -- iterate over text parts array using `ipairs`
    if p:is_html() then -- if the current part is html part
      local hc = p:get_html() -- we get HTML context
      local len = p:get_length() -- and part's length

      if len < 50 then -- if we have a part that has less than 50 bytes of text
        local images = hc:get_images() -- then we check for HTML images

        if images then -- if there are images
          for _,i in ipairs(images) do -- then iterate over images in the part
            if i['height'] + i['width'] >= 400 then -- if we have a large image
              return true -- add symbol
            end
          end
        end
      end
    end
  end
end

Brief content:

Methods:

Method                               Description
html:has_tag(name)                   Checks if the specified tag name is present in a part.
html:check_property(name)            Checks if the HTML has a specific property.
html:get_images()                    Returns a table of images found in html.
html:foreach_tag(tagname, callback)  Processes HTML tree calling the specified callback for each tag of the specified type.
html:get_invisible()                 Returns invisible content of the HTML data.
html_tag:get_type()                  Returns string representation of HTML type for a tag.
html_tag:get_extra()                 Returns extra data associated with the tag.
html_tag:get_parent()                Returns parent node for a specified tag.
html_tag:get_flags()                 Returns flags of a specified tag.
html_tag:get_content()               Returns content of tag (approximate for some cases).
html_tag:get_content_length()        Returns length of a tag’s content.
html_tag:get_style()                 Returns style calculated for the element.
html_tag:get_attribute(name)         Returns value of attribute for the element.

Methods

The module rspamd_html defines the following methods.

Method html:has_tag(name)

Checks if the specified tag name is present in a part.

Parameters:

  • name {string}: name of tag to check

Returns:

  • {boolean}: true if the tag exists in HTML tree
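
A minimal sketch of how has_tag might be used: the helper below reports whether an HTML context contains any tag from a list. The helper name has_any_tag and the tag names in the usage comment are illustrative assumptions, not part of rspamd_html; only has_tag itself comes from the module.

```lua
-- Hypothetical helper (not part of rspamd_html): returns true if the HTML
-- context `hc` contains at least one of the tag names listed in `names`.
local function has_any_tag(hc, names)
  for _, name in ipairs(names) do
    if hc:has_tag(name) then
      return true
    end
  end
  return false
end

-- Possible use inside a rule, where `p` is an HTML text part:
--   local hc = p:get_html()
--   if hc and has_any_tag(hc, {'form', 'iframe'}) then return true end
```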

Method html:check_property(name)

Checks if the HTML has a specific property. Here is the list of available properties:

  • no_html - no html tag is present
  • bad_element - part has some broken elements
  • xml - part is xhtml
  • unknown_element - part has some unknown elements
  • duplicate_element - part has some duplicate elements that should be unique (namely, title tag)
  • unbalanced - part has unbalanced tags

Parameters:

  • name {string}: name of property

Returns:

  • {boolean}: true if the part has the specified property
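
As a rough illustration, the property names listed above can be checked in a loop to collect everything that applies to a part. The helper name collect_properties is a hypothetical choice for this sketch; the property names are copied from the list above.

```lua
-- Property names copied from the list in this section.
local known_properties = {
  'no_html', 'bad_element', 'xml',
  'unknown_element', 'duplicate_element', 'unbalanced',
}

-- Hypothetical helper (not part of rspamd_html): returns an array with the
-- names of all properties that check_property reports as true for `hc`.
local function collect_properties(hc)
  local found = {}
  for _, name in ipairs(known_properties) do
    if hc:check_property(name) then
      found[#found + 1] = name
    end
  end
  return found
end
```

A rule could then, for instance, treat a part as suspicious when the returned array contains both unbalanced and bad_element; that policy is an example only.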

Method html:get_images()

Returns a table of images found in html. Each image is, in turn, a table with the following fields:

  • src - link to the source
  • height - height in pixels
  • width - width in pixels
  • embedded - true if an image is embedded in a message

Parameters:

No parameters

Returns:

  • {table}: table of images in html part
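
The sketch below shows one way the image fields might be consumed: it counts images that are not embedded in the message. The helper name count_external_images is hypothetical. The nil check mirrors the module example, which tests the result of get_images before iterating.

```lua
-- Hypothetical helper (not part of rspamd_html): counts images whose
-- `embedded` field is not set, i.e. images referenced by an external link.
local function count_external_images(hc)
  local images = hc:get_images()
  local count = 0
  if images then -- guard against a missing table, as the module example does
    for _, img in ipairs(images) do
      if not img['embedded'] then
        count = count + 1
      end
    end
  end
  return count
end
```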

Method html:foreach_tag(tagname, callback)

Processes HTML tree calling the specified callback for each tag of the specified type.

Callback is called with the following attributes:

  • tag: html tag structure
  • content_length: length of content within a tag

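Callback function should return true to stop processing and false to continue.

Parameters:

  • tagname {string}: name of the tag type to process
  • callback {function}: function called for each matching tag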

Returns:

  • nothing
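
A minimal sketch of the callback contract described above: the helper walks all a tags and stops at the first one whose content length is zero. The helper name has_empty_link and the choice of the a tag are assumptions made for illustration.

```lua
-- Hypothetical helper (not part of rspamd_html): returns true if the HTML
-- context `hc` contains at least one <a> tag with no content.
local function has_empty_link(hc)
  local found = false
  hc:foreach_tag('a', function(tag, content_length)
    if content_length == 0 then
      found = true
      return true  -- returning true stops processing
    end
    return false   -- returning false continues with the next tag
  end)
  return found
end
```

Because foreach_tag returns nothing, the result is carried out of the callback through the captured local variable found.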

Method html:get_invisible()

Returns invisible content of the HTML data

Parameters:

No parameters

Returns:

  • no description

Method html_tag:get_type()

Returns string representation of HTML type for a tag

Parameters:

No parameters

Returns:

  • {string}: type of tag

Method html_tag:get_extra()

Returns extra data associated with the tag

Parameters:

No parameters

Returns:

  • {url|image|nil}: extra data associated with the tag

Method html_tag:get_parent()

Returns parent node for a specified tag

Parameters:

No parameters

Returns:

  • {html_tag}: parent object for a specified tag
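
As a sketch, get_parent can be combined with get_type to describe where a tag sits in the tree. The helper name tag_path is hypothetical. The code assumes get_parent returns nil once the root is reached, which the text above does not state, so a depth cap guards the loop in case that assumption does not hold.

```lua
-- Hypothetical helper (not part of rspamd_html): builds a string such as
-- "html > body > a" by following get_parent() from `tag` upwards.
-- Assumption: get_parent() returns nil for a tag without a parent.
local function tag_path(tag)
  local names = {}
  local cur = tag
  local depth = 0
  while cur and depth < 64 do -- depth cap in case the assumption is wrong
    table.insert(names, 1, cur:get_type())
    cur = cur:get_parent()
    depth = depth + 1
  end
  return table.concat(names, ' > ')
end
```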

Method html_tag:get_flags()

Returns flags of a specified tag:

  • closed: tag is properly closed
  • closing: tag is a closing tag
  • broken: tag is somehow broken
  • unbalanced: tag is unbalanced
  • xml: tag is xml tag

Parameters:

No parameters

Returns:

  • {table}: table of flags
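
The helper below sketches how a single flag might be tested. The text above does not say whether the returned table is a list of flag names or a map keyed by flag name, so this sketch accepts either layout; the helper name tag_has_flag is hypothetical.

```lua
-- Hypothetical helper (not part of rspamd_html): returns true if `flag`
-- appears in the table returned by tag:get_flags().
-- Assumption: the table is either a list of names ({'closed', 'xml'})
-- or a map ({closed = true}); both layouts are handled.
local function tag_has_flag(tag, flag)
  local flags = tag:get_flags()
  if type(flags) ~= 'table' then
    return false
  end
  for k, v in pairs(flags) do
    if v == flag or (k == flag and v) then
      return true
    end
  end
  return false
end
```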

Method html_tag:get_content()

Returns content of tag (approximate for some cases)

Parameters:

No parameters

Returns:

  • {rspamd_text}: rspamd text with tag’s content

Method html_tag:get_content_length()

Returns length of a tag’s content

Parameters:

No parameters

Returns:

  • {number}: size of content enclosed within a tag

Method html_tag:get_style()

Returns style calculated for the element

Parameters:

No parameters

Returns:

  • {table}: table associated with the style

Method html_tag:get_attribute(name)

Returns value of attribute for the element. Refer to html_components_map in src/libserver/html/html.cxx for recognised names.

Parameters:

  • name {string}: name of the attribute

Returns:

  • {string|nil}: value of the attribute
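
Since get_attribute may return nil, callers usually need a guard before comparing values. The sketch below wraps that in a case-insensitive comparison; the helper name attribute_equals is hypothetical, and the attribute name in the usage comment is only an example that should be checked against html_components_map.

```lua
-- Hypothetical helper (not part of rspamd_html): returns true if the
-- attribute `name` is present on `tag` and equals `expected`, ignoring case.
local function attribute_equals(tag, name, expected)
  local value = tag:get_attribute(name)
  if value == nil then
    return false -- attribute is absent or not recognised
  end
  return tostring(value):lower() == expected:lower()
end

-- Possible use inside a foreach_tag callback:
--   if attribute_equals(tag, 'rel', 'nofollow') then ... end
```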
