Module rspamd_html

This module provides methods for accessing HTML tags. To get the HTML context of an HTML part, use the method part:get_html().

Example:

rspamd_config.R_EMPTY_IMAGE = function(task)
  local tp = task:get_text_parts() -- get text parts in a message

  for _,p in ipairs(tp) do -- iterate over text parts array using `ipairs`
    if p:is_html() then -- if the current part is html part
      local hc = p:get_html() -- we get HTML context
      local len = p:get_length() -- and part's length

      if len < 50 then -- if we have a part that has less than 50 bytes of text
        local images = hc:get_images() -- then we check for HTML images

        if images then -- if there are images
          for _,i in ipairs(images) do -- then iterate over images in the part
            if i['height'] + i['width'] >= 400 then -- if we have a large image
              return true -- add symbol
            end
          end
        end
      end
    end
  end
end

Brief content:

Methods:

html:has_tag(name)

html:check_property(name)

html:get_images()

html:get_blocks()

html:foreach_tag(tagname, callback)

html_tag:get_type()

html_tag:get_extra()

html_tag:get_parent()

html_tag:get_flags()

html_tag:get_content()

html_tag:get_content_length()

Methods

The module rspamd_html defines the following methods.

Method html:has_tag(name)

Checks whether the specified tag name is present in a part.

Parameters:

  • name {string}: name of tag to check

Returns:

  • {boolean}: true if the tag exists in HTML tree
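
A minimal sketch of how has_tag might be used to test for several tags at once. The helper name has_any_tag is hypothetical and not part of rspamd_html; it only assumes that hc is an HTML context obtained from part:get_html().

```lua
-- Hypothetical helper (not part of rspamd_html): returns true if the
-- HTML context `hc` contains at least one of the tag names in `names`.
local function has_any_tag(hc, names)
  for _, name in ipairs(names) do
    if hc:has_tag(name) then
      return true
    end
  end
  return false
end
```

Inside a rule, this could be called as has_any_tag(hc, {'iframe', 'form'}).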

Method html:check_property(name)

Checks if the HTML has a specific property. Here is the list of available properties:

  • no_html - no html tag present
  • bad_element - part has some broken elements
  • xml - part is xhtml
  • unknown_element - part has some unknown elements
  • duplicate_element - part has some duplicate elements that should be unique (namely, title tag)
  • unbalanced - part has unbalanced tags

Parameters:

  • name {string}: name of property

Returns:

  • {boolean}: true if the part has the specified property
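
The property list above can be probed in one pass. The following is a sketch only: list_properties and KNOWN_PROPERTIES are hypothetical names, and the property strings are copied from the list in this section.

```lua
-- Property names as listed in this section.
local KNOWN_PROPERTIES = {
  'no_html', 'bad_element', 'xml',
  'unknown_element', 'duplicate_element', 'unbalanced',
}

-- Hypothetical helper: returns an array with the names of all listed
-- properties that the HTML context `hc` reports as set.
local function list_properties(hc)
  local found = {}
  for _, prop in ipairs(KNOWN_PROPERTIES) do
    if hc:check_property(prop) then
      found[#found + 1] = prop
    end
  end
  return found
end
```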

Method html:get_images()

Returns a table of images found in html. Each image is, in turn, a table with the following fields:

  • src - link to the source
  • height - height in pixels
  • width - width in pixels
  • embedded - true if an image is embedded in a message

Parameters:

No parameters

Returns:

  • {table}: table of images in html part
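
As an illustration of the image fields, the sketch below counts images that are not embedded in the message. The helper name count_external_images is hypothetical; the nil check mirrors the module example, which guards against get_images() returning no table.

```lua
-- Hypothetical helper: counts images whose `embedded` field is not set,
-- i.e. images that are referenced rather than attached to the message.
local function count_external_images(hc)
  local images = hc:get_images()
  if not images then
    return 0
  end
  local n = 0
  for _, img in ipairs(images) do
    if not img.embedded then
      n = n + 1
    end
  end
  return n
end
```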

Method html:get_blocks()

Returns a table of html blocks. Each block provides the following data:

  • tag - corresponding tag
  • color - a triplet (r g b) for font color
  • bgcolor - a triplet (r g b) for background color
  • style - rspamd{text} with the full style description
  • font_size - font size

Parameters:

No parameters

Returns:

  • {table}: table of blocks in html part
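
A sketch of iterating over blocks, here counting blocks with a small font. It assumes that font_size is a Lua number when present, which this page does not state explicitly; count_small_font_blocks and max_size are hypothetical names.

```lua
-- Hypothetical helper: counts blocks whose font size is at most `max_size`.
-- Assumption: `font_size` is a number when set; other values are skipped.
local function count_small_font_blocks(hc, max_size)
  local blocks = hc:get_blocks()
  if not blocks then
    return 0
  end
  local n = 0
  for _, block in ipairs(blocks) do
    local size = block.font_size
    if type(size) == 'number' and size <= max_size then
      n = n + 1
    end
  end
  return n
end
```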

Method html:foreach_tag(tagname, callback)

Processes HTML tree calling the specified callback for each tag of the specified type.

The callback is called with the following arguments:

  • tag: html tag structure
  • content_length: length of content within a tag

The callback function should return true to stop processing and false to continue.

Parameters:

  • tagname {string}: name of the tag type to process
  • callback {function}: function called for each matching tag

Returns:

  • nothing
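
The callback contract described above (return true to stop, false to continue) can be sketched as follows. The helper has_short_tag and its min_len parameter are hypothetical; only foreach_tag and the two callback arguments come from this page.

```lua
-- Hypothetical helper: returns true if any tag named `tagname` has
-- content shorter than `min_len`. Iteration stops at the first match.
local function has_short_tag(hc, tagname, min_len)
  local found = false
  hc:foreach_tag(tagname, function(tag, content_length)
    if content_length < min_len then
      found = true
      return true -- stop processing
    end
    return false -- continue with the next tag
  end)
  return found
end
```

Because foreach_tag itself returns nothing, the result is carried out of the callback through the captured local variable found.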

Method html_tag:get_type()

Returns the string representation of a tag's HTML type

Parameters:

No parameters

Returns:

  • {string}: type of tag

Method html_tag:get_extra()

Returns extra data associated with the tag

Parameters:

No parameters

Returns:

  • {url|image|nil}: extra data associated with the tag
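
Since get_extra() may return nil, callers will usually want to check the result before using it. A minimal sketch, with the hypothetical helper count_tags_with_extra, that counts tags of a given name carrying any extra data:

```lua
-- Hypothetical helper: counts tags named `tagname` for which
-- get_extra() returns something other than nil.
local function count_tags_with_extra(hc, tagname)
  local n = 0
  hc:foreach_tag(tagname, function(tag)
    if tag:get_extra() ~= nil then
      n = n + 1
    end
    return false -- keep iterating over all matching tags
  end)
  return n
end
```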

Method html_tag:get_parent()

Returns parent node for a specified tag

Parameters:

No parameters

Returns:

  • {html_tag}: parent object for a specified tag
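
get_parent() can be combined with get_type() to walk up the tree. The sketch below assumes that get_parent() returns nil for a tag without a parent, which this page does not state; ancestor_types is a hypothetical name.

```lua
-- Hypothetical helper: returns the types of all ancestors of `tag`,
-- nearest parent first.
-- Assumption: get_parent() returns nil once the root is reached.
local function ancestor_types(tag)
  local path = {}
  local parent = tag:get_parent()
  while parent do
    path[#path + 1] = parent:get_type()
    parent = parent:get_parent()
  end
  return path
end
```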

Method html_tag:get_flags()

Returns the flags of a specified tag:

  • closed: tag is properly closed
  • closing: tag is a closing tag
  • broken: tag is somehow broken
  • unbalanced: tag is unbalanced
  • xml: tag is xml tag

Parameters:

No parameters

Returns:

  • {table}: table of flags
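
This page does not say whether the returned table is an array of flag names or a map keyed by flag name. The sketch below, using the hypothetical helper flag_set, accepts either shape and normalises it into a set for simple lookups.

```lua
-- Hypothetical helper: converts the result of get_flags() into a set,
-- so flags can be tested as set.closed, set.broken, and so on.
local function flag_set(tag)
  local set = {}
  for k, v in pairs(tag:get_flags() or {}) do
    if type(k) == 'number' then
      set[v] = true -- array form, e.g. { 'closed', 'xml' }
    elseif v then
      set[k] = true -- keyed form, e.g. { closed = true }
    end
  end
  return set
end
```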

Method html_tag:get_content()

Returns content of tag (approximate for some cases)

Parameters:

No parameters

Returns:

  • {rspamd_text}: rspamd text with tag’s content

Method html_tag:get_content_length()

Returns length of a tag’s content

Parameters:

No parameters

Returns:

  • {number}: size of content enclosed within a tag
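
get_content_length() can serve as a cheap filter before get_content() is called. In this sketch the helper short_contents and its max_len parameter are hypothetical, and it is an assumption that tostring() turns the returned rspamd_text into a Lua string.

```lua
-- Hypothetical helper: collects the content of every tag named `tagname`
-- whose content is at most `max_len` long.
-- Assumption: tostring() converts an rspamd_text into a Lua string.
local function short_contents(hc, tagname, max_len)
  local out = {}
  hc:foreach_tag(tagname, function(tag)
    if tag:get_content_length() <= max_len then
      local content = tag:get_content()
      if content then
        out[#out + 1] = tostring(content)
      end
    end
    return false -- visit all matching tags
  end)
  return out
end
```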
