Module rspamd_url

This module provides routines to handle URLs and to extract URLs from text. Objects of this class are returned, for example, by task:get_urls() or task:get_emails(). You can also create an rspamd_url object from any text.

Example:

local url = require "rspamd_url"
local mpool = require "rspamd_mempool"

url.init("/usr/share/rspamd/effective_tld_names.dat")
local pool = mpool.create()
local res = url.create(pool, 'Look at: http://user@test.example.com/test?query')
local t = res:to_table()
-- Content of t:
-- url = ['http://test.example.com/test?query']
-- host = ['test.example.com']
-- user = ['user']
-- path = ['test']
-- tld = ['example.com']

pool:destroy() -- res is destroyed here, so you should not use it afterwards

local mistake = res:to_table() -- INVALID! as pool is destroyed

Brief content:

Functions:

Function Description
url.create([mempool,] str, [{flags_table}]) Create a new url object from a string that contains a URL.
url.init(tld_file) Initialize url library if not initialized yet by Rspamd.

Methods:

Method Description
url:get_length() Get length of the url.
url:get_host() Get domain part of the url.
url:get_port() Get port of the url.
url:get_user() Get user part of the url (e.g. username in email).
url:get_path() Get path of the url.
url:get_query() Get query of the url.
url:get_fragment() Get fragment of the url.
url:get_text() Get full content of the url.
url:tostring() Get full content of the url or user@domain in case of email.
url:to_http() Get URL suitable for HTTP request (e.g. by trimming fragment and user parts).
url:get_raw() Get full content of the url as it was parsed (e.g. with urldecode).
url:is_phished() Check whether URL is treated as phished.
url:is_redirected() Check whether URL was redirected.
url:is_obscured() Check whether URL is treated as obscured or obfuscated (e.g. numbers in IP address or other hacks).
url:is_html_displayed() Check whether URL is just displayed in HTML (e.g. NOT a real href).
url:is_subject() Check whether URL is found in subject.
url:get_phished() Get another URL that pretends to be this URL (e.g. used in phishing).
url:set_redirected(url, pool) Set url as redirected to another url.
url:get_tld() Get effective second level domain part (eSLD) of the url host.
url:get_protocol() Get protocol name.
url:get_count() Return number of occurrences for this particular URL.
url:get_visible() Get visible part of the url with html tags stripped.
url:to_table() Return url as a table with the following fields.
url:get_flags() Return flags for a specified URL as a map ‘flag’->true for all flags set.

Functions

The module rspamd_url defines the following functions.

Function url.create([mempool,] str, [{flags_table}])

Parameters:

  • mempool {rspamd_mempool}: memory pool for the URL, e.g. task:get_mempool() (optional, as the brackets in the signature indicate)
  • str {string}: text that contains a URL (it can also contain other content)

Returns:

  • {url}: new url object that exists as long as the corresponding mempool exists
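
The following is a minimal sketch of guarded creation, not part of the module itself. It assumes that url.create returns nil when no URL can be parsed from the text, which the description above does not state. The helper name safe_create and its url_mod parameter (the module returned by require "rspamd_url") are hypothetical.

```lua
-- Hypothetical helper: wrap url.create so that callers receive an error
-- message instead of silently getting nil.
-- `url_mod` is the loaded rspamd_url module, `pool` is an rspamd_mempool.
local function safe_create(url_mod, pool, text)
  local u = url_mod.create(pool, text)
  if not u then
    -- Assumption: create yields nil when the text holds no parsable URL.
    return nil, "no URL found in: " .. tostring(text)
  end
  return u
end
```

Inside a rule this could be called as safe_create(require "rspamd_url", task:get_mempool(), text). As stated above, the returned object is only valid while the corresponding mempool exists.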

Function url.init(tld_file)

Initialize url library if not initialized yet by Rspamd

Parameters:

  • tld_file {string}: path to effective_tld_names.dat file (public suffix list)

Returns:

  • nothing

Methods

The module rspamd_url defines the following methods.

Method url:get_length()

Get length of the url

Parameters:

No parameters

Returns:

  • {number}: length of url in bytes

Method url:get_host()

Get domain part of the url

Parameters:

No parameters

Returns:

  • {string}: domain part of URL

Method url:get_port()

Get port of the url

Parameters:

No parameters

Returns:

  • {number}: url port

Method url:get_user()

Get user part of the url (e.g. username in email)

Parameters:

No parameters

Returns:

  • {string}: user part of URL

Method url:get_path()

Get path of the url

Parameters:

No parameters

Returns:

  • {string}: path part of URL

Method url:get_query()

Get query of the url

Parameters:

No parameters

Returns:

  • {string}: query part of URL

Method url:get_fragment()

Get fragment of the url

Parameters:

No parameters

Returns:

  • {string}: fragment part of URL
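
The component getters documented so far (get_host, get_port, get_user, get_path, get_query, get_fragment) can be gathered in one place. This is only a sketch: the helper name url_components is hypothetical, and it assumes that a getter returns nil when the corresponding part is absent.

```lua
-- Hypothetical helper: collect the individual parts of a url object
-- into a plain Lua table. Parts that are nil are simply left out.
local function url_components(u)
  return {
    host = u:get_host(),
    port = u:get_port(),
    user = u:get_user(),
    path = u:get_path(),
    query = u:get_query(),
    fragment = u:get_fragment(),
  }
end
```

Unlike the url object, the resulting table holds ordinary Lua values.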

Method url:get_text()

Get full content of the url

Parameters:

No parameters

Returns:

  • {string}: url string

Method url:tostring()

Get full content of the url or user@domain in case of email

Parameters:

No parameters

Returns:

  • {string}: url as a string

Method url:to_http()

Get URL suitable for HTTP request (e.g. by trimming fragment and user parts)

Parameters:

No parameters

Returns:

  • {string}: url as a string

Method url:get_raw()

Get full content of the url as it was parsed (e.g. with urldecode)

Parameters:

No parameters

Returns:

  • {string}: url string
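
The four string accessors (get_text, tostring, to_http, get_raw) are easy to confuse, so a side-by-side dump can help when debugging. The sketch below uses hypothetical helper names (url_forms, format_forms) and makes no claim about how the four results differ for a given URL.

```lua
-- Hypothetical helper: gather every string form of a url object.
local function url_forms(u)
  return {
    text = u:get_text(), -- full content of the url
    str = u:tostring(),  -- full content, or user@domain for an email
    http = u:to_http(),  -- form suitable for an HTTP request
    raw = u:get_raw(),   -- content as it was parsed
  }
end

-- Render the forms as a single line; tostring() guards against nil values.
local function format_forms(u)
  local f = url_forms(u)
  return string.format("text=%s str=%s http=%s raw=%s",
    tostring(f.text), tostring(f.str), tostring(f.http), tostring(f.raw))
end
```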

Method url:is_phished()

Check whether URL is treated as phished

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is phished

Method url:is_redirected()

Check whether URL was redirected

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is redirected

Method url:is_obscured()

Check whether URL is treated as obscured or obfuscated (e.g. numbers in IP address or other hacks)

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is obscured

Method url:is_html_displayed()

Check whether URL is just displayed in HTML (e.g. NOT a real href)

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is displayed only

Method url:is_subject()

Check whether URL is found in subject

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is found in subject
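
The boolean checks documented so far (is_phished, is_redirected, is_obscured, is_html_displayed, is_subject) can be combined into a short list of labels. This is a sketch under assumptions: the helper name url_warnings and the label strings are hypothetical.

```lua
-- Hypothetical helper: return the names of all checks that are true
-- for the given url object, in a fixed order.
local function url_warnings(u)
  local checks = {
    { "phished", u:is_phished() },
    { "redirected", u:is_redirected() },
    { "obscured", u:is_obscured() },
    { "html_displayed", u:is_html_displayed() },
    { "subject", u:is_subject() },
  }
  local hits = {}
  for _, check in ipairs(checks) do
    if check[2] then
      hits[#hits + 1] = check[1]
    end
  end
  return hits
end
```

For example, table.concat(url_warnings(u), ",") gives a compact string for log messages.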

Method url:get_phished()

Get another URL that pretends to be this URL (e.g. used in phishing)

Parameters:

No parameters

Returns:

  • {url}: phished URL
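
A sketch of how is_phished and get_phished might be used together to report both hosts involved. The helper name phishing_hosts is hypothetical, and the nil check assumes that get_phished may return nil when no counterpart URL is recorded.

```lua
-- Hypothetical helper: for a phished url, return its own host and the
-- host of the url returned by get_phished(); otherwise return nil.
local function phishing_hosts(u)
  if not u:is_phished() then
    return nil
  end
  local other = u:get_phished()
  if not other then
    -- Assumption: no counterpart url is available.
    return nil
  end
  return u:get_host(), other:get_host()
end
```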

Method url:set_redirected(url, pool)

Set url as redirected to another url

Parameters:

  • url {string|url}: new url that is redirecting an old one
  • pool {pool}: memory pool to allocate memory if needed

Returns:

  • {url}: parsed redirected url (if needed)

Method url:get_tld()

Get effective second level domain part (eSLD) of the url host

Parameters:

No parameters

Returns:

  • {string}: effective second level domain part (eSLD) of the url host
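
Because get_tld returns the eSLD, it is a convenient grouping key. The sketch below counts URLs per eSLD in a list such as the one returned by task:get_urls(). The helper name count_by_esld is hypothetical, and URLs for which get_tld returns nil are skipped as an assumption.

```lua
-- Hypothetical helper: map each eSLD to the number of url objects
-- in `urls` (an array of url objects) that share it.
local function count_by_esld(urls)
  local counts = {}
  for _, u in ipairs(urls) do
    local esld = u:get_tld()
    if esld then
      counts[esld] = (counts[esld] or 0) + 1
    end
  end
  return counts
end
```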

Method url:get_protocol()

Get protocol name

Parameters:

No parameters

Returns:

  • {string}: protocol as a string

Method url:get_count()

Return number of occurrences for this particular URL

Parameters:

No parameters

Returns:

  • {number}: number of occurrences
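
One possible use of get_count is to find the URL that occurs most often in a list of url objects. This is a sketch only; the helper name most_repeated is hypothetical, and a nil count is treated as zero.

```lua
-- Hypothetical helper: return the url object with the highest
-- occurrence count and that count, or nil and 0 for an empty list.
local function most_repeated(urls)
  local best, best_count = nil, 0
  for _, u in ipairs(urls) do
    local n = u:get_count() or 0
    if n > best_count then
      best, best_count = u, n
    end
  end
  return best, best_count
end
```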

Method url:get_visible()

Get visible part of the url with html tags stripped

Parameters:

No parameters

Returns:

  • {string}: url string

Method url:to_table()

Return url as a table with the following fields:

  • url: full content
  • host: hostname part
  • user: user part
  • path: path part
  • tld: effective second level domain (eSLD), e.g. example.com for the host test.example.com
  • protocol: url protocol

Parameters:

No parameters

Returns:

  • {table}: URL as a table
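
The table returned by to_table can be printed in a stable field order, which plain pairs() iteration does not guarantee. The sketch below relies only on the field names listed above; the helper name format_url_table is hypothetical.

```lua
-- Field names as documented for url:to_table(), in display order.
local FIELDS = { "url", "host", "user", "path", "tld", "protocol" }

-- Hypothetical helper: render the fields of a url object as
-- "name = value" lines, skipping fields that are absent.
local function format_url_table(u)
  local t = u:to_table()
  local lines = {}
  for _, name in ipairs(FIELDS) do
    if t[name] ~= nil then
      lines[#lines + 1] = name .. " = " .. tostring(t[name])
    end
  end
  return table.concat(lines, "\n")
end
```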

Method url:get_flags()

Return flags for a specified URL as a map ‘flag’->true for all flags that are set. Possible flags are:

  • phished: URL is likely phished
  • numeric: URL is numeric (e.g. IP address)
  • obscured: URL was obscured
  • redirected: URL comes from redirector
  • html_displayed: URL is used just for displaying purposes
  • text: URL comes from the text
  • subject: URL comes from the subject
  • host_encoded: URL host part is encoded
  • schema_encoded: URL schema part is encoded
  • query_encoded: URL query part is encoded
  • missing_slashes: URL has some slashes missing
  • idn: URL has international characters
  • has_port: URL has port
  • has_user: URL has user part
  • schemaless: URL has no schema
  • unnormalised: URL contains Unicode that is not normalised
  • zw_spaces: URL has some zero width spaces
  • url_displayed: URL has some other url-like string in visible part
  • image: URL is from src attribute of img HTML tag

Parameters:

No parameters

Returns:

  • {table}: URL flags
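
Since get_flags returns a map of flag name to true, testing for flags is a plain table lookup. The sketch below shows two hypothetical helpers: flag_names returns the set flags as a sorted list, and has_any_flag checks a url object against a list of flag names taken from the list above.

```lua
-- Hypothetical helper: sorted array of the names of all flags that are set.
local function flag_names(u)
  local names = {}
  for name, is_set in pairs(u:get_flags()) do
    if is_set then
      names[#names + 1] = name
    end
  end
  table.sort(names)
  return names
end

-- Hypothetical helper: true if at least one flag in `wanted` is set.
local function has_any_flag(u, wanted)
  local flags = u:get_flags()
  for _, name in ipairs(wanted) do
    if flags[name] then
      return true
    end
  end
  return false
end
```

For example, has_any_flag(u, { "phished", "obscured", "zw_spaces" }) selects URLs that carry any of those three flags.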
