Filter, Rewrite and Scraper Rules
Feed Filtering Rules
Noflux has a basic filtering system that allows you to ignore or keep articles.
Block Rules
Block rules ignore articles with a title, an entry URL, a tag or an author that matches the regex (RE2 syntax).
For example, the regex (?i)noflux will ignore all articles with a title that contains the word Noflux (case insensitive).
Ignored articles won’t be saved into the database.
Keep Rules
Keep rules keep only articles that match the regex (RE2 syntax).
For example, the regex (?i)noflux will keep only the articles with a title that contains the word Noflux (case insensitive).
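The effect of a feed-level block or keep regex can be sketched in Python. This is an illustration only, not Noflux's code: the function names and the dictionary keys are hypothetical, Python's `re` module stands in for RE2 (the two agree on a simple pattern such as (?i)noflux), and it is an assumption that keep rules look at the same fields as block rules.

```python
import re

# Hypothetical entry fields; block rules are documented as looking at the
# title, entry URL, tags and author.
TEXT_FIELDS = ("title", "url", "author")

def matches_any_field(pattern, entry):
    """Return True if the regex matches the title, URL, author or any tag."""
    regex = re.compile(pattern)
    values = [entry.get(name, "") for name in TEXT_FIELDS]
    values += list(entry.get("tags", []))
    return any(regex.search(value) for value in values)

def is_entry_accepted(entry, block_rule="", keep_rule=""):
    """Apply an optional block regex, then an optional keep regex."""
    if block_rule and matches_any_field(block_rule, entry):
        return False  # ignored: the entry would not be saved
    if keep_rule and not matches_any_field(keep_rule, entry):
        return False  # a keep rule is set and the entry does not match it
    return True

entries = [
    {"title": "Noflux 2.0 released", "url": "https://example.org/a"},
    {"title": "Weekly news", "url": "https://example.org/b"},
]
blocked = [e["title"] for e in entries
           if not is_entry_accepted(e, block_rule="(?i)noflux")]
kept = [e["title"] for e in entries
        if is_entry_accepted(e, keep_rule="(?i)noflux")]
```

With the same regex, the block rule drops the first entry while the keep rule retains only that entry.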
Global Filtering Rules
Global filters are defined on the Settings page and are automatically applied to all articles from all feeds.
- Each rule must be on a separate line.
- Duplicate rules are allowed. For example, having multiple EntryTitle rules is possible.
- The provided RegEx should use the RE2 syntax.
- The order of the rules matters as the processor stops on the first match for both Block and Keep rules.
Rule Format:
FieldName=RegEx
FieldName=RegEx
FieldName=RegEx
Available Fields:
EntryTitle
EntryURL
EntryCommentsURL
EntryContent
EntryAuthor
EntryTag
EntryDate
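As a rough sketch of the rule format, the following Python snippet parses one FieldName=RegEx rule per line while preserving order and duplicates. The function name `parse_rules` is hypothetical, and skipping malformed or unknown rules is an assumption; the documentation does not say how Noflux treats them.

```python
# Field names listed in the documentation.
AVAILABLE_FIELDS = {
    "EntryTitle", "EntryURL", "EntryCommentsURL", "EntryContent",
    "EntryAuthor", "EntryTag", "EntryDate",
}

def parse_rules(text):
    """Parse one FieldName=RegEx rule per line, keeping order and duplicates."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first "=" only, since the regex may itself contain "=".
        field, separator, pattern = line.partition("=")
        if not separator or field not in AVAILABLE_FIELDS:
            continue  # assumption: malformed or unknown rules are skipped
        rules.append((field, pattern))
    return rules

rules = parse_rules(
    "EntryTitle=(?i)noflux\n"
    "EntryTitle=^Sponsored\n"
    "EntryURL=utm_source=rss\n"
    "Bogus=x"
)
```

Keeping the result as an ordered list matters because the processor stops on the first matching rule.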
Date Patterns
The EntryDate field supports the following date patterns:
- future - Match entries with future publication dates
- before:YYYY-MM-DD - Match entries published before a specific date
- after:YYYY-MM-DD - Match entries published after a specific date
- between:YYYY-MM-DD,YYYY-MM-DD - Match entries published between two dates
Date format must be YYYY-MM-DD, for example: 2024-01-01
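The date patterns could be evaluated along these lines. This is a minimal sketch under stated assumptions, not the actual implementation: the function name `match_date_pattern` and the `today` parameter are hypothetical, comparisons are made on calendar dates, and the boundary behavior (before and after exclusive, between inclusive) is a guess that the documentation does not confirm.

```python
from datetime import date, datetime

def _parse_day(text):
    """Parse a YYYY-MM-DD string into a date."""
    return datetime.strptime(text, "%Y-%m-%d").date()

def match_date_pattern(pattern, published, today):
    """Return True if the publication date satisfies an EntryDate pattern."""
    if pattern == "future":
        return published > today
    kind, _, argument = pattern.partition(":")
    if kind == "before":
        return published < _parse_day(argument)  # assumed exclusive
    if kind == "after":
        return published > _parse_day(argument)  # assumed exclusive
    if kind == "between":
        start, end = (_parse_day(part) for part in argument.split(","))
        return start <= published <= end  # assumed inclusive
    raise ValueError(f"unsupported date pattern: {pattern!r}")

in_2024 = match_date_pattern(
    "between:2024-01-01,2024-12-31", date(2024, 6, 15), today=date(2024, 7, 1)
)
```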
Block Rules
Block rules ignore articles that match any single rule.
For example, the rule EntryTitle=(?i)noflux will ignore all articles with a title that contains the word Noflux (case insensitive).
Date-based examples:
- EntryDate=future will ignore articles with future publication dates
- EntryDate=before:2024-01-01 will ignore articles published before January 1st, 2024
Keep Rules
Keep rules keep only the articles that match any single rule.
For example, the rule EntryTitle=(?i)noflux will keep only the articles with a title that contains the word Noflux (case insensitive).
Date-based examples:
- EntryDate=between:2024-01-01,2024-12-31 will keep only articles published in 2024
- EntryDate=after:2024-03-01 will keep only articles published after March 1st, 2024
Global Rules & Feed Rules Ordering
Rules are processed in this order:
- Global Block Rules
- Feed Block Rule
- Global Keep Rules
- Feed Keep Rule
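One possible reading of this ordering is sketched below. It is a simplification built on assumptions: every stage is modeled as a list of hypothetical (field, pattern) pairs (feed rules included), the first matching rule decides the outcome, and an entry that matches nothing survives only when no keep rule is defined at all. Noflux's actual handling of the no-match case may differ.

```python
import re

def first_match(rules, entry):
    """Return the first (field, pattern) rule matching the entry, or None."""
    for field, pattern in rules:
        if re.search(pattern, entry.get(field, "")):
            return (field, pattern)
    return None

def decide(entry, global_block, feed_block, global_keep, feed_keep):
    """Walk the four stages in the documented order; the first match decides."""
    stages = [
        ("global block", global_block, False),
        ("feed block", feed_block, False),
        ("global keep", global_keep, True),
        ("feed keep", feed_keep, True),
    ]
    for name, rules, keep in stages:
        if first_match(rules, entry):
            return keep, name
    # Assumption: with no match, the entry survives only if no keep rule exists.
    has_keep_rules = bool(global_keep or feed_keep)
    return (not has_keep_rules), "no match"

entry = {"EntryTitle": "Noflux sponsored post"}
verdict = decide(
    entry,
    global_block=[("EntryTitle", "(?i)sponsored")],
    feed_block=[],
    global_keep=[("EntryTitle", "(?i)noflux")],
    feed_keep=[],
)
```

Here the entry matches both a block and a keep rule, and the block stage wins because it runs first.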
Rewrite Rules
To improve the reading experience, it’s possible to alter the content of feed items.
For example, if you are reading a popular comic website like XKCD, it’s nice to have the image title (the title attribute) added under the image, especially on mobile devices where there is no hover event.
add_dynamic_image
- Tries to add the highest quality images from sites that use JavaScript to load images (e.g. either lazily when scrolling or based on screen size).
add_dynamic_iframe
- Tries to add embedded videos from sites that use JavaScript to load iframes (e.g. either lazily when scrolling or after the rest of the page is loaded).
add_image_title
- Add each image's title as a caption under the image.
add_youtube_video
- Insert the YouTube video into the article (automatic for YouTube.com).
add_youtube_video_from_id
- Insert the YouTube video into the article based on the video ID.
add_invidious_video
- Insert the Invidious player into the article (automatic for https://invidio.us).
add_youtube_video_using_invidious_player
- Insert the Invidious player into the article for YouTube feeds.
add_castopod_episode
- Insert Castopod episode player.
add_mailto_subject
- Insert the subject of mailto links into the article.
base64_decode
- Decode base64 content. It can be used with a selector, base64_decode(".base64"), or without an argument, base64_decode. In the latter case it will try to convert all text nodes and always fall back to the original text if it cannot decode them.
nl2br
- Convert new lines \n to <br> (useful for non-HTML contents).
convert_text_links
- Convert text links to HTML links (useful for non-HTML contents).
fix_medium_images
- Attempt to fix Medium's images rendered in JavaScript.
use_noscript_figure_images
- Use <noscript> content for images rendered with JavaScript.
replace("search term"|"replace term")
- Search and replace text.
remove(".selector, #another_selector")
- Remove DOM elements.
parse_markdown (removed in v2.2.4)
- Convert Markdown to HTML. This rule has been removed in version 2.2.4.
remove_tables
- Remove any tables while keeping the content inside (useful for email newsletters).
remove_clickbait
- Remove clickbait titles (Convert uppercase titles).
replace_title("search-term"|"replace-term")
- Rewrite rule to adjust entry titles.
add_hn_links_using_hack
- Open HN comments with Hack.
add_hn_links_using_opener
- Open HN comments with Opener.
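To make the behavior of two of these rules more concrete, here is a plain-string approximation of nl2br and of the argument-less base64_decode fallback. Both function names are hypothetical and the sketch works on strings rather than on DOM text nodes, so it only illustrates the idea, not how Noflux implements the rules.

```python
import base64
import binascii

def nl2br(text):
    """Replace newline characters with <br> tags."""
    return text.replace("\n", "<br>")

def base64_decode_or_original(text):
    """Decode base64 text, returning the input unchanged when decoding fails."""
    try:
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return text  # fall back to the original text

decoded = base64_decode_or_original("SGVsbG8gd29ybGQ=")
untouched = base64_decode_or_original("Hello world")
with_breaks = nl2br("line one\nline two")
```

Falling back to the original text is what makes it reasonably safe to run the rule over content that is only partly base64-encoded.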
Noflux includes a set of predefined rules for some websites, but you can also define your own rules.
On the feed edit page, enter your custom rules in the field “Rewrite Rules” like this:
rule1,rule2
Separate each rule by a comma.
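Because some rules take quoted arguments that may themselves contain commas, such as remove(".selector, #another_selector"), a naive split on every comma would cut such a rule in half. The sketch below shows one way to split a rule list while respecting double quotes. How Noflux itself tokenizes the field is not described here; the function name `split_rules` is hypothetical and escaped quotes are not handled.

```python
def split_rules(text):
    """Split a comma-separated rule list, ignoring commas inside double quotes."""
    rules, current, in_quotes = [], [], False
    for char in text:
        if char == '"':
            in_quotes = not in_quotes
        if char == "," and not in_quotes:
            # A top-level comma ends the current rule.
            rules.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    rules.append("".join(current).strip())
    return [rule for rule in rules if rule]

parsed = split_rules('add_image_title,remove(".selector, #another_selector"),nl2br')
```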
Scraper Rules
When an article contains only an extract of the content, you can fetch the original web page and apply a set of rules to extract the relevant content.
Noflux uses CSS selectors for custom rules. These custom rules can be saved in the feed properties (Select a feed and click on edit).
| CSS Selector | Description |
|---|---|
| div#articleBody | Fetch a div element with the ID articleBody |
| div.content | Fetch all div elements with the class content |
| article, div.article | Use a comma to define multiple rules |
Noflux includes a list of predefined rules for popular websites. You could contribute to the project to keep them up to date.
Under the hood, Noflux uses the library Goquery.
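To illustrate what the selectors in the table pick out, here is a deliberately tiny stand-in for a CSS selector engine, written with Python's standard library only. It is not Goquery and not Noflux's scraper: it supports just tag, tag#id and tag.class (plus comma-separated lists), assumes well-formed HTML, and collects text rather than markup. All names are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

def parse_selector(selector):
    """Turn 'article, div.article' into a list of (tag, id, class) triples."""
    parsed = []
    for part in selector.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, element_id = part.partition("#")
        tag, _, css_class = tag.partition(".")
        parsed.append((tag, element_id, css_class))
    return parsed

class Extractor(HTMLParser):
    def __init__(self, selector):
        super().__init__()
        self.rules = parse_selector(selector)
        self.depth = 0    # nesting depth inside a matched element
        self.chunks = []  # text collected from matched elements

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        return any(
            (not t or t == tag)
            and (not i or attrs.get("id") == i)
            and (not c or c in classes)
            for t, i, c in self.rules
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag to balance
        if self.depth or self._matches(tag, attrs):
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def scrape(html, selector):
    """Return the text of every element matched by the selector."""
    parser = Extractor(selector)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

PAGE = (
    '<html><body><div class="nav">Menu</div>'
    '<div id="articleBody"><p>Full <b>text</b></p></div>'
    '<div class="content teaser">Extract</div></body></html>'
)
article = scrape(PAGE, "div#articleBody")
```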
URL Rewrite Rules
Sometimes it might be necessary to rewrite a URL in a feed to fetch better-suited content.
For example, for some users the URL https://www.npr.org/sections/money/2021/05/18/997501946/the-case-for-universal-pre-k-just-got-stronger displays a cookie consent dialog instead of the actual content and it would be preferred to fetch the URL https://text.npr.org/997501946 instead.
The following rule does this:
rewrite("^https:\/\/www\.npr\.org\/\d{4}\/\d{2}\/\d{2}\/(\d+)\/.*$"|"https://text.npr.org/$1")
This will rewrite all URLs from the original feed to URLs pointing to text.npr.org when the article content is fetched. A custom scraper rule is also needed in this case, because the default rule will try to fetch #storytext.
Another example is the German page
https://www.heise.de/news/Industrie-ruestet-sich-fuer-Gasstopp-Forscher-vorsichtig-optimistisch-7167721.html
which splits the article into multiple pages. The full text can be read on
https://www.heise.de/news/Industrie-ruestet-sich-fuer-Gasstopp-Forscher-vorsichtig-optimistisch-7167721.html?seite=all
The URL rewrite rule for that would be:
rewrite("(.*?\.html)"|"$1?seite=all")
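The two rules above can be tried out with a small Python sketch. It is an approximation under assumptions: the helper `apply_url_rewrite` is hypothetical, Python's `re` module stands in for RE2, and the Go-style `$1` reference is translated to Python's `\g<1>`. The second NPR URL, with the date directly after the host, is an assumed shape used only for illustration.

```python
import re

RULE_RE = re.compile(r'^rewrite\("(.*)"\|"(.*)"\)$')

def apply_url_rewrite(rule, url):
    """Apply a rewrite("pattern"|"replacement") rule to a URL."""
    parsed = RULE_RE.match(rule)
    if parsed is None:
        return url  # assumption: an unparsable rule leaves the URL untouched
    pattern, replacement = parsed.groups()
    # The rules use $1-style references; Python's re expects \g<1>.
    replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, replacement, url)

HEISE_RULE = r'rewrite("(.*?\.html)"|"$1?seite=all")'
NPR_RULE = (
    r'rewrite("^https:\/\/www\.npr\.org\/\d{4}\/\d{2}\/\d{2}\/(\d+)\/.*$"'
    r'|"https://text.npr.org/$1")'
)

heise = apply_url_rewrite(
    HEISE_RULE,
    "https://www.heise.de/news/Industrie-ruestet-sich-fuer-Gasstopp-"
    "Forscher-vorsichtig-optimistisch-7167721.html",
)
# Assumed URL shape: the date segments follow the host directly.
npr = apply_url_rewrite(
    NPR_RULE,
    "https://www.npr.org/2021/05/18/997501946/"
    "the-case-for-universal-pre-k-just-got-stronger",
)
# The sample URL from the text carries a /sections/money/ prefix.
npr_sections = apply_url_rewrite(
    NPR_RULE,
    "https://www.npr.org/sections/money/2021/05/18/997501946/"
    "the-case-for-universal-pre-k-just-got-stronger",
)
```

In this sketch the heise.de URL gains ?seite=all and the assumed NPR URL becomes https://text.npr.org/997501946. The NPR pattern, as written, anchors the date segments directly after the host, so the sample URL with the /sections/money/ prefix is left unchanged here; it may be worth checking the pattern against the URLs your feed actually delivers.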