Module rspamd_url

This module provides routines to handle URLs and to extract URLs from text. Objects of this class are returned, for example, by task:get_urls() or task:get_emails(). You can also create an rspamd_url object from any text.

Example:

local url = require "rspamd_url"
local mpool = require "rspamd_mempool"
local pool = mpool.create()
local res = url.create(pool, 'Look at: http://user@test.example.com/test?query')
local t = res:to_table()
-- Content of t:
-- url = ['http://test.example.com/test?query']
-- host = ['test.example.com']
-- user = ['user']
-- path = ['test']
-- tld = ['example.com']

pool:destroy() -- res is destroyed here, so you should not use it afterwards

local mistake = res:to_table() -- INVALID! as pool is destroyed
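
One way to avoid the mistake shown above is to copy the needed parts into plain Lua strings before the pool is destroyed. The following is a minimal sketch, not part of the module: extract_url_info is a hypothetical helper, the two modules are passed in as arguments (inside Rspamd you would pass require "rspamd_url" and require "rspamd_mempool"), and it assumes that url.create returns nil when the text contains no URL.

```lua
-- Hypothetical helper (not part of rspamd_url): parse a URL from `text`
-- and return plain Lua strings that stay valid after the pool is destroyed.
-- `url_mod` and `mempool_mod` stand for the rspamd_url and rspamd_mempool
-- modules; they are passed in so the sketch does not depend on how they load.
local function extract_url_info(url_mod, mempool_mod, text)
  local pool = mempool_mod.create()
  local res = url_mod.create(pool, text)
  local info = nil
  if res then -- assumption: create() yields nil when no URL is found
    info = {
      host = res:get_host(),
      user = res:get_user(),
      path = res:get_path(),
      tld = res:get_tld(),
    }
  end
  pool:destroy() -- res must not be used past this point
  return info
end
```

The returned table holds only Lua strings, so it can be used freely after the helper returns.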

Brief content:

Functions:

url.create([mempool,] str)

url.init(tld_file)

Methods:

url:get_length()

url:get_host()

url:get_port()

url:get_user()

url:get_path()

url:get_query()

url:get_fragment()

url:get_text()

url:tostring()

url:get_raw()

url:is_phished()

url:is_redirected()

url:is_obscured()

url:is_html_displayed()

url:is_subject()

url:get_tag(tag)

url:get_tags()

url:add_tag(tag, mempool)

url:get_phished()

url:get_tld()

url:get_count()

url:to_table()

url:get_flags()

Functions

The module rspamd_url defines the following functions.

Function url.create([mempool,] str)

Parameters:

  • mempool {rspamd_mempool}: memory pool for the URL, e.g. task:get_mempool() (optional)
  • str {string}: text that contains a URL (it may also contain other content)

Returns:

  • {url}: new url object that exists as long as the corresponding mempool exists

Function url.init(tld_file)

Initialize the url library if Rspamd has not initialized it yet

Parameters:

  • tld_file {string}: TLD file for the url library

Returns:

  • nothing

Methods

The module rspamd_url defines the following methods.

Method url:get_length()

Get length of the url

Parameters:

No parameters

Returns:

  • {number}: length of url in bytes

Method url:get_host()

Get domain part of the url

Parameters:

No parameters

Returns:

  • {string}: domain part of URL

Method url:get_port()

Get port of the url

Parameters:

No parameters

Returns:

  • {number}: url port
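
As an illustration of combining get_host() and get_port(), the sketch below builds a "host:port" string. The helper name host_with_port and its default_port argument are hypothetical, and the code assumes that get_port() may return nil when the URL carries no explicit port.

```lua
-- Hypothetical helper: format "host:port" for a url object.
-- Assumption: get_port() may return nil when no port is set, in which
-- case `default_port` (optional) is used instead.
local function host_with_port(u, default_port)
  local host = u:get_host()
  local port = u:get_port() or default_port
  if port then
    return string.format("%s:%d", host, port)
  end
  return host
end
```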

Method url:get_user()

Get user part of the url (e.g. username in email)

Parameters:

No parameters

Returns:

  • {string}: user part of URL

Method url:get_path()

Get path of the url

Parameters:

No parameters

Returns:

  • {string}: path part of URL

Method url:get_query()

Get query of the url

Parameters:

No parameters

Returns:

  • {string}: query part of URL
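
The query string returned by get_query() can be split into key/value pairs with plain Lua patterns. This is a rough sketch only: query_params is a hypothetical helper, it performs no percent-decoding, later duplicates of a key overwrite earlier ones, and it assumes get_query() returns nil when the URL has no query part.

```lua
-- Hypothetical helper: split the query part of a url object into a table.
-- No URL decoding is performed; keys without "=" map to an empty string.
local function query_params(u)
  local params = {}
  local query = u:get_query()
  if not query then -- assumption: nil when the URL has no query part
    return params
  end
  for pair in query:gmatch("[^&]+") do
    local key, value = pair:match("^([^=]*)=?(.*)$")
    params[key] = value
  end
  return params
end
```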

Method url:get_fragment()

Get fragment of the url

Parameters:

No parameters

Returns:

  • {string}: fragment part of URL

Method url:get_text()

Get full content of the url

Parameters:

No parameters

Returns:

  • {string}: url string

Method url:tostring()

Get full content of the url or user@domain in case of email

Parameters:

No parameters

Returns:

  • {string}: url as a string

Method url:get_raw()

Get full content of the url as it was parsed (e.g. with urldecode)

Parameters:

No parameters

Returns:

  • {string}: url string

Method url:is_phished()

Check whether URL is treated as phished

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is phished

Method url:is_redirected()

Check whether URL was redirected

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is redirected

Method url:is_obscured()

Check whether URL is treated as obscured or obfuscated (e.g. numbers in IP address or other hacks)

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is obscured

Method url:is_html_displayed()

Check whether URL is just displayed in HTML (e.g. NOT a real href)

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is displayed only

Method url:is_subject()

Check whether URL is found in subject

Parameters:

No parameters

Returns:

  • {boolean}: true if URL is found in subject
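
The boolean predicates (is_phished, is_redirected, is_obscured, is_html_displayed, is_subject) can be combined to summarize a list of URLs, such as the one returned by task:get_urls(). The sketch below is illustrative; summarize_urls is a hypothetical helper and expects an array of url objects.

```lua
-- Hypothetical helper: count how many url objects in `urls` satisfy each
-- of the boolean predicates documented for rspamd_url.
local function summarize_urls(urls)
  local checks = {
    phished = function(u) return u:is_phished() end,
    redirected = function(u) return u:is_redirected() end,
    obscured = function(u) return u:is_obscured() end,
    html_displayed = function(u) return u:is_html_displayed() end,
    subject = function(u) return u:is_subject() end,
  }
  local summary = {}
  for name in pairs(checks) do
    summary[name] = 0
  end
  for _, u in ipairs(urls) do
    for name, check in pairs(checks) do
      if check(u) then
        summary[name] = summary[name] + 1
      end
    end
  end
  return summary
end
```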

Method url:get_tag(tag)

Returns the list of strings stored under a specific tag name for a URL

Parameters:

  • tag {string}: name of the tag

Returns:

  • {table/strings}: list of tags for a URL

Method url:get_tags()

Returns the list of string tags for a URL

Parameters:

No parameters

Returns:

  • {table/strings}: list of tags for a URL

Method url:add_tag(tag, mempool)

Adds a new tag for url

Parameters:

  • tag {string}: new tag to add
  • mempool {mempool}: memory pool (e.g. task:get_mempool())

Returns:

No return

Method url:get_phished()

Get another URL that pretends to be this URL (e.g. used in phishing)

Parameters:

No parameters

Returns:

  • {url}: phished URL
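
A typical use of is_phished() together with get_phished() is to report both hosts involved. The sketch below is an assumption-laden illustration: phishing_pairs is a hypothetical helper, `urls` is an array of url objects, and get_phished() is assumed to return nil when no related URL is attached.

```lua
-- Hypothetical helper: for every phished url object, record its host and
-- the host of the related URL returned by get_phished().
local function phishing_pairs(urls)
  local found = {}
  for _, u in ipairs(urls) do
    if u:is_phished() then
      local other = u:get_phished()
      if other then -- assumption: nil when no related URL exists
        table.insert(found, { host = u:get_host(), phished_host = other:get_host() })
      end
    end
  end
  return found
end
```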

Method url:get_tld()

Get effective second level domain part (eSLD) of the url host

Parameters:

No parameters

Returns:

  • {string}: effective second level domain part (eSLD) of the url host
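
Because get_tld() returns the eSLD rather than the full host, it is convenient for grouping URLs that belong to the same registered domain. A minimal sketch, where hosts_by_tld is a hypothetical helper and `urls` is an array of url objects:

```lua
-- Hypothetical helper: group the hosts of url objects by their eSLD.
-- URLs for which get_tld() returns nil are skipped.
local function hosts_by_tld(urls)
  local groups = {}
  for _, u in ipairs(urls) do
    local tld = u:get_tld()
    if tld then
      groups[tld] = groups[tld] or {}
      table.insert(groups[tld], u:get_host())
    end
  end
  return groups
end
```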

Method url:get_count()

Return number of occurrences for this particular URL

Parameters:

No parameters

Returns:

  • {number}: number of occurrences
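
The count can be used, for instance, to find the URL repeated most often in a message. The following sketch assumes `urls` is an array of url objects; most_repeated is a hypothetical helper, and on ties the first URL encountered wins.

```lua
-- Hypothetical helper: return the url object with the highest get_count()
-- value and that count, or nil and 0 for an empty list.
local function most_repeated(urls)
  local best, best_count = nil, 0
  for _, u in ipairs(urls) do
    local count = u:get_count()
    if count > best_count then
      best, best_count = u, count
    end
  end
  return best, best_count
end
```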

Method url:to_table()

Return url as a table with the following fields:

  • url: full content
  • host: hostname part
  • user: user part
  • path: path part
  • tld: effective second level domain (eSLD), as returned by url:get_tld()
  • protocol: url protocol

Parameters:

No parameters

Returns:

  • {table}: URL as a table
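
Since to_table() returns an ordinary Lua table, it can be serialized for logging or debugging. The sketch below renders the fields in a stable, sorted order; format_url_table is a hypothetical helper, and the output format is only an illustration.

```lua
-- Hypothetical helper: render the table returned by to_table() as a
-- single "key=value" line with keys in alphabetical order.
local function format_url_table(u)
  local t = u:to_table()
  local keys = {}
  for k in pairs(t) do
    keys[#keys + 1] = k
  end
  table.sort(keys)
  local parts = {}
  for _, k in ipairs(keys) do
    parts[#parts + 1] = k .. "=" .. tostring(t[k])
  end
  return table.concat(parts, " ")
end
```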

Method url:get_flags()

Return the flags of a URL as a map in which every flag that is set has the value true. Possible flags are:

  • phished: URL is likely phished
  • numeric: URL is numeric (e.g. IP address)
  • obscured: URL was obscured
  • redirected: URL comes from redirector
  • html_displayed: URL is used just for displaying purposes
  • text: URL comes from the text
  • subject: URL comes from the subject
  • host_encoded: URL host part is encoded
  • schema_encoded: URL schema part is encoded
  • query_encoded: URL query part is encoded
  • missing_slahes: URL has some slashes missing
  • idn: URL has international characters
  • has_port: URL has port
  • has_user: URL has user part
  • schemaless: URL has no schema
  • unnormalised: URL contains unnormalised Unicode

Parameters:

No parameters

Returns:

  • {table}: URL flags
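
Because the result is a map of flag names to true, testing for a flag is a plain table lookup. A small sketch, where has_any_flag is a hypothetical helper and the flag names are those listed above:

```lua
-- Hypothetical helper: return true if the url object has at least one of
-- the flags named in the `names` array set.
local function has_any_flag(u, names)
  local flags = u:get_flags()
  for _, name in ipairs(names) do
    if flags[name] then
      return true
    end
  end
  return false
end
```

For example, has_any_flag(u, { "phished", "obscured" }) checks two suspicious conditions with a single get_flags() call.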
