ReTidy page cleaner

Introduction
The functions: pre_tidy_regex, remove_nodes, my_strip_tags, strip_lang, strip_br_dupes, trim_br_tags, replace_tags, dom_regenerate_tables, dom_fix_text_tags, dom_fix_headings, dom_strip_child_tags, dom_strip_attrs, dom_strip_only_child, dom_strip_parent_only_child, dom_merge_parent_attr, dom_strip_no_attr, strip_empty_tags, combine_inline, reorder_tags, combine_br_tags, fix_img_pos, extend_quotes, combine_broken_tags, hruler, dom_parse_lists, dom_toc_add, final_regex
Requirements
Usage
Download ReTidy version 1.11 build 20070702
Known issues
Changes log

Introduction

ReTidy is a project I've been working on since August 2006.

I use ReTidy inside Agnezar (my CMS), coupled together with Awebitor (my WYSIWYG editor).

The need: automatically clean tens, even hundreds, of pages as fast as possible. Every time I make a web site I get Word documents to put on the web. I even personally recommend the use of Word documents, because the "client" never really uses anything other than Word. Saving as HTML never results in good, semantic code. Using HTML Tidy alone is not enough because it tries too hard to keep the original code and styling intact, leaving me with too much manual work to do.

With the current word processors it's almost impossible to create a document with proper markup. Average users make tons of mistakes in editing their documents.

The solution: the ReTidy script provides a way to automatically clean various mistakes in HTML documents. It is built with PHP 5.2, DOM + XPath functionality, regular expressions and, most importantly, HTML Tidy.

The way it works is similar to a batch process: the configuration profile just defines which cleanup functions need to be run. The configuration file also contains the options for all the possible cleanup functions.

The configuration profile is a PHP script which returns only an array. I used this for convenience - I don't need special configuration formats. There's an execution macro which lists the methods you want to call; each and every one of these "cleans" something.
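To make the idea concrete, here is a minimal sketch of a profile plus a toy runner. The key names (macro, options), the runner sketch_run_macro and the two toy steps are made up for illustration; they are not ReTidy's real configuration keys or code. The closures assume PHP 5.3 or newer.

```php
<?php
// Hypothetical profile: in a real setup this array would be the return value
// of the profile script. The key names are illustrative, not ReTidy's own.
$profile = array(
    // the "execution macro": which cleanup steps run, and in which order
    'macro' => array('replace_tags', 'hruler'),
    // per-step options
    'options' => array(
        'replace_tags' => array('u' => 'em'),
        'hruler' => array('marker' => '<p>* * *</p>'),
    ),
);

// Toy stand-ins for two cleanup steps, each taking the code and its options.
$steps = array(
    'replace_tags' => function ($code, array $options) {
        foreach ($options as $from => $to) {
            $code = str_replace(
                array('<' . $from . '>', '</' . $from . '>'),
                array('<' . $to . '>', '</' . $to . '>'),
                $code
            );
        }
        return $code;
    },
    'hruler' => function ($code, array $options) {
        return str_replace($options['marker'], '<hr />', $code);
    },
);

// Run every step named in the macro, feeding each one the previous output.
function sketch_run_macro(array $profile, array $steps, $code)
{
    foreach ($profile['macro'] as $name) {
        if (!isset($steps[$name])) {
            continue; // unknown step name: skip it
        }
        $options = isset($profile['options'][$name]) ? $profile['options'][$name] : array();
        $code = call_user_func($steps[$name], $code, $options);
    }
    return $code;
}

$clean = sketch_run_macro($profile, $steps, '<p><u>Hi</u></p><p>* * *</p>');
```

Keeping the order in one list and the options in another means a custom profile can reorder or drop steps without touching the steps themselves.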

The script is far from being scientific or whatever. It's just something I made for convenience, for daily usage. The script has grown quite a lot and the output documents are almost of copy/paste quality for me. Given the nature of WYSIWYG documents, expect to fine-tune the provided configurations in order to achieve the best results. A single configuration simply cannot cover all possibilities.

I did think this kind of functionality should be implemented in HTML Tidy. Now I doubt it: the big advantage of using PHP/Perl/Python/Ruby or any similar scripting language is constant evolution, by means of fixing and configuring the script.

I was inspired by the awesome search and replace functionality in Dreamweaver which allows developers to search for tags and rename them, search for specific attributes and rename them, or strip them, and lots more. You'll see that ReTidy provides that functionality and more.

I have chosen to use XHTML for the output, because the parsing is more reliable in PHP DOM and the output code can be used with XML tools. This is certainly a matter of taste; you can easily configure HTML Tidy to output the code as HTML. The entire configuration used for HTML Tidy is included in the ReTidy configuration profile.

The functions

The method names containing the DOM initials use DOM-based processing. All other functions use regular expressions and such. The provided details apply mostly to the "maximum" profile - which cleans documents as aggressively as possible.

pre_tidy_regex

An array to search and replace whatever you want in the code, before HTML Tidy is run for the first time. This is useful in cases where you need a "dirty" hack for anything you notice in the code, anything that's repeated: for example a pattern for heading titles. Most presentational information gets lost after running HTML Tidy, because the configuration is set to maximum. The entire point of ReTidy is to be a lot more "aggressive" when cleaning documents.

HTML Tidy is generally very good, for several reasons:

It outputs XHTML code which can be used for further DOM-based processing;
Good enough tag soup parsing;
The output is also much cleaner than the initial code, so it makes life easier.

remove_nodes

You can configure the tags/nodes you want completely removed, including the content. Currently, I remove all <style> and <script> tags.
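A minimal sketch of this kind of removal with PHP DOM and XPath, assuming the input is already well-formed XHTML with a single root element (as HTML Tidy would produce). The function name sketch_remove_nodes is made up for illustration and is not ReTidy's code.

```php
<?php
// Remove every element with one of the given tag names, content included.
function sketch_remove_nodes($xhtml, array $tags)
{
    $doc = new DOMDocument();
    $doc->loadXML($xhtml);
    $xpath = new DOMXPath($doc);
    foreach ($tags as $tag) {
        // Copy the matches into an array before changing the tree.
        $nodes = iterator_to_array($xpath->query('//' . $tag));
        foreach ($nodes as $node) {
            if ($node->parentNode) {
                $node->parentNode->removeChild($node);
            }
        }
    }
    // Serialize the root element only, without the XML declaration.
    return $doc->saveXML($doc->documentElement);
}
```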

my_strip_tags

This selectively strips the tags you want. For now, only <font>, <span>, <col> and <colgroup> tags are stripped.

strip_lang

As the name implies, the function removes all the lang attributes. It can be configured to also remove xml:lang attributes - this is needed if you want to keep only the former attribute, but not the latter (which is automatically generated by HTML Tidy).

strip_br_dupes

All the documents I've cleaned suffer from this problem: too many <br> tags. This must be the single most abused tag in HTML. The function mercilessly removes all the duplicate line breaks.
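The idea fits in one regular expression. This is only a sketch, under the assumption that a "duplicate" means two or more <br> tags separated by nothing but white space; sketch_strip_br_dupes is a hypothetical name, not the real method.

```php
<?php
// Collapse any run of two or more <br> tags into a single one.
function sketch_strip_br_dupes($code)
{
    return preg_replace('~<br\s*/?>(?:\s*<br\s*/?>)+~i', '<br />', $code);
}
```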

trim_br_tags

The revenge of the line break tag. :) When cleaning up Word documents you are bound to find something like:

Input:

<br><p><br>Sample code<br></p><br>

Output:

<p>Sample code</p>

The trim_br_tags method removes all the <br> tags found immediately before or after paragraphs. It also removes the line breaks found at the start, or the end, of such tags. The list of tags can be changed. My configuration "trims" the line breaks for many tags, not only paragraphs.

Firefox (Gecko) likes trailing line breaks in paragraphs and list items. Normal usage of Awebitor almost always ends up with code containing a line break before the end of paragraphs.
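The trimming described in this section can be approximated with four regular expressions, one for each position where a <br> may sit next to a tag boundary. This is a sketch only; the function name and its $tags parameter are assumptions, not ReTidy's real interface.

```php
<?php
// Trim <br> tags found right before/after the given tags, and at the
// start/end of their content.
function sketch_trim_br_tags($code, array $tags)
{
    $names = implode('|', array_map('preg_quote', $tags));
    $open  = '<(?:' . $names . ')\b[^>]*>'; // opening tag, attributes allowed
    $close = '</(?:' . $names . ')>';       // closing tag
    $br    = '(?:<br\s*/?>\s*)+';           // one or more line breaks

    $code = preg_replace('~' . $br . '(' . $open . ')~i', '$1', $code);      // before the opening tag
    $code = preg_replace('~(' . $open . ')\s*' . $br . '~i', '$1', $code);   // at the start of the content
    $code = preg_replace('~' . $br . '(' . $close . ')~i', '$1', $code);     // at the end of the content
    $code = preg_replace('~(' . $close . ')\s*' . $br . '~i', '$1', $code);  // after the closing tag
    return $code;
}
```

Line breaks in the middle of the text are left alone, since none of the four patterns can match there.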

replace_tags

This is something simple: replace the tags you want with others. For example: replace all the underline tags with the emphasis tag.

dom_regenerate_tables

Input:

<table> 
<tr> 
<td> 
<p>test 1.1</p> 
test 2.1<br> 
<h4>test 3.1</h4> 
test 4.1</td> 
<td> 
<p>test 1.2</p> 
<h2>test 2.2</h2> 
test 3.2<br> 
test 4.2</td> 
</tr> 
</table>

Output:

<table> 
<tr> 
<td>test 1.1</td> 
<td>test 1.2</td> 
</tr> 
<tr> 
<td>test 2.1</td> 
<td>test 2.2</td> 
</tr> 
<tr> 
<td>test 3.1</td> 
<td>test 3.2</td> 
</tr> 
<tr> 
<td>test 4.1</td> 
<td>test 4.2</td> 
</tr> 
</table>

The list of tags that add a new table row is configurable. This method is not enabled by default for obvious reasons. The need for this feature came from a document I received - all the tables had this issue and manual cleanup was not my cup of tea.

dom_fix_text_tags

This addresses the problems some people have with consistent punctuation.

Input:

This is an example text with wrong usage of punctuation .I do not like this ,for several reasons :it 's not easy to read , and it does not have any consistency . ,,Some people do not use quotes properly ''( yes ,really !).

Output:

This is an example text with proper usage of punctuation. I do like this, for several reasons: it's easy to read, and it does have consistency. "Some people do use quotes properly" (yes, really!).

That's obviously better. Again, this is based on a real world document.

You can configure the list of tags you want corrected.
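A reduced sketch of the spacing part only: no white space before a punctuation mark, one space after it. It deliberately ignores quotes and apostrophes, and it would mangle things like "e.g." or URLs, so it is an illustration of the idea rather than the rules the real method applies. The name sketch_fix_punctuation is hypothetical.

```php
<?php
// Normalize spacing around . , : ; ! ? in a piece of plain text.
function sketch_fix_punctuation($text)
{
    // Drop white space found before a punctuation mark.
    $text = preg_replace('~\s+([.,:;!?])~u', '$1', $text);
    // Add one space after a punctuation mark when a letter follows directly.
    $text = preg_replace('~([.,:;!?])(?=\p{L})~u', '$1 ', $text);
    return $text;
}
```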

dom_fix_headings

Most of the time people do not properly use headings. Thus, documents end up having <h1>, then <h3>, <h6>, and back to <h2>. Everything gets mixed up. This method finds the headings which skip levels and renames them, decreasing the heading number as needed (based on the previous heading level).

Input:

<h1>Heading</h1>
<h3>Heading</h3>
<h2>Heading</h2>
<h4>Heading</h4>

Output:

<h1>Heading</h1>
<h2>Heading</h2>
<h2>Heading</h2>
<h3>Heading</h3>

Of course, there's no precise way to identify which are the logical headings. At least, this fixes the markup.
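The renumbering rule can be sketched as "a heading may be at most one level deeper than the corrected level of the heading before it". The sketch below applies that rule with a regular expression instead of DOM, purely to keep it short; sketch_fix_headings is a made-up name and the closure assumes PHP 5.3 or newer.

```php
<?php
// Rename headings that skip levels, based on the previous (corrected) level.
function sketch_fix_headings($code)
{
    $prev = 0; // corrected level of the previous heading, 0 before the first
    return preg_replace_callback(
        '~<h([1-6])\b([^>]*)>(.*?)</h\1>~is',
        function ($m) use (&$prev) {
            // Never go more than one level deeper than the previous heading.
            $level = min((int) $m[1], $prev + 1);
            $prev = $level;
            return '<h' . $level . $m[2] . '>' . $m[3] . '</h' . $level . '>';
        },
        $code
    );
}
```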

dom_strip_child_tags

This function allows you to strip tags which are direct child nodes of given parent nodes. I use this to remove paragraphs in table cells, or in list items.

dom_strip_attrs

Selectively remove the attributes you want. You can specify to remove certain attributes only for specific tags, or for all tags.

dom_strip_only_child

This works like dom_strip_child_tags, however there's a rule: the child node must be the only child of the parent. If you have a heading which has the entire text emphasized, then you can remove the emphasis.
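A sketch of the "only child" rule with DOM and XPath, assuming well-formed XHTML input with a single root element. The function name and its parameters are hypothetical. Note that in this sketch even a white space text node next to the child counts as a second child.

```php
<?php
// Unwrap $childTag elements that are the only node inside a $parentTag.
function sketch_strip_only_child($xhtml, $parentTag, $childTag)
{
    $doc = new DOMDocument();
    $doc->loadXML($xhtml);
    $xpath = new DOMXPath($doc);
    // Parents with exactly one child node (element or text), which is the child tag.
    $query = '//' . $parentTag . '[count(node()) = 1]/' . $childTag;
    foreach (iterator_to_array($xpath->query($query)) as $child) {
        $parent = $child->parentNode;
        // Move the child's content up, then drop the now empty child.
        while ($child->firstChild) {
            $parent->insertBefore($child->firstChild, $child);
        }
        $parent->removeChild($child);
    }
    return $doc->saveXML($doc->documentElement);
}
```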

dom_strip_parent_only_child

Like the function above, but this removes the parent node. I use this to remove the heading tags when they only contain images.

dom_merge_parent_attr

Input:

<font face="DejaVu"><font size="7">Sample code.</font></font>

Output:

<font face="DejaVu" size="7">Sample code.</font>

This merges tags and keeps the attributes from both tags. I use this for the Awebitor profile, which must allow font tags.
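A sketch of the merge with DOM and XPath, assuming well-formed XHTML input. The function name is made up, and so is the conflict rule: when both tags carry the same attribute, this sketch keeps the parent's value, which may or may not be what the real method does.

```php
<?php
// Merge a tag into its parent when the parent has the same name and
// the tag is the parent's only child.
function sketch_merge_parent_attr($xhtml, $tag)
{
    $doc = new DOMDocument();
    $doc->loadXML($xhtml);
    $xpath = new DOMXPath($doc);
    $query = '//' . $tag . '[count(node()) = 1]/' . $tag;
    foreach (iterator_to_array($xpath->query($query)) as $child) {
        $parent = $child->parentNode;
        // Copy the attributes the parent does not have yet (assumption: parent wins).
        foreach ($child->attributes as $attr) {
            if (!$parent->hasAttribute($attr->name)) {
                $parent->setAttribute($attr->name, $attr->value);
            }
        }
        // Move the content up and remove the inner tag.
        while ($child->firstChild) {
            $parent->insertBefore($child->firstChild, $child);
        }
        $parent->removeChild($child);
    }
    return $doc->saveXML($doc->documentElement);
}
```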

dom_strip_no_attr

Remove the configured tags which have no attributes. Currently unused.

strip_empty_tags

As the name implies: the function will remove the empty tags you want.

combine_inline

Input:

<em>This</em> <em>is an example sentence</em>, <em>only for you.</em>

Output:

<em>This is an example sentence, only for you.</em>

Inline tags are combined into one if only white space characters, or insignificant punctuation, separate the two tags. I noticed this problem very often in Word documents. It's mostly caused by the fact that people change their mind about which words they want to emphasize.

The list of "inline" tags can be configured, as can the regular expression used for matching the "white space" between the tags.
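A sketch of how the combining could look. The function name, the $gap parameter and its default value are assumptions chosen for the example, not the real configuration; the sketch also only handles opening tags without attributes.

```php
<?php
// Join neighbouring inline tags of the same name when only white space
// or minor punctuation sits between them.
function sketch_combine_inline($code, array $tags, $gap = '[\s,;]*')
{
    foreach ($tags as $tag) {
        $t = preg_quote($tag, '~');
        // Drop "</tag>" + gap + "<tag>", keeping the gap text itself.
        $code = preg_replace('~</' . $t . '>(' . $gap . ')<' . $t . '>~i', '$1', $code);
    }
    return $code;
}
```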

reorder_tags

Input:

<strong><em>This is an</em></strong> <em>example</em>

Output:

<em><strong>This is an</strong> example</em>

People change their mind like the weather and WYSIWYG editors can't keep up. As usual, the list of tags can be configured.

combine_br_tags

Input:

<ul> 
<li>item 1</li> 
... 
<li>item n</li> 
</ul> 
<br> 
<ul> 
<li>item n+1</li> 
... 
<li>item n+x</li> 
</ul>

Output:

<ul> 
<li>item 1</li> 
... 
<li>item n</li> 
<li>item n+1</li> 
... 
<li>item n+x</li> 
</ul>

At first, this looks like the cleanup is too aggressive, since there are legitimate use cases for having two lists separated by white space (or by a line break). However, this is a common error in Word documents and I treat it as such.

This is obviously only applicable to unordered/ordered/definition lists. Yet, if you choose to use the function for other tags, you can easily change the configuration.
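As a sketch, the merge comes down to deleting a closing tag, the line breaks after it, and the matching opening tag. The function name and its $tags parameter are hypothetical, and opening tags with attributes are not handled.

```php
<?php
// Merge two neighbouring lists of the same type that are separated only
// by white space and/or <br> tags.
function sketch_combine_br_tags($code, array $tags)
{
    $names = implode('|', array_map('preg_quote', $tags));
    return preg_replace('~</(' . $names . ')>\s*(?:<br\s*/?>\s*)*<\1>~i', '', $code);
}
```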

fix_img_pos

Images are never really properly positioned in the text flow of the document.

Input:

<p>I love <img src="you.png"> this image.</p>

Output:

<p><img src="you.png">I love this image.</p>

This works with any tag, not just paragraphs.

extend_quotes

Input:

&quot;<em>Sample quote&quot;</em>

Output:

<em>&quot;Sample quote&quot;</em>

This is another function which tries to make the source code consistent.

combine_broken_tags

Input:

<p>This is an</p> <p>example sentence.</p>

Output:

<p>This is an example sentence.</p>

Most of the time this error appears because definition lists are used to indent the text in Word documents. For this reason, the configuration for replace_tags changes all definition lists into paragraphs. Due to the parsing model of paragraphs (automatic tag closing), multiple paragraphs containing the same sentence will appear in the code.

Another reason for this error is that some users do not allow automatic word wrapping in their documents, and they manually insert the new lines when "needed" (at the right side of their screen).

This function combines the tags you allow, if it finds a tag ending with a lower case alphabetic character and the next sibling starts with a lower case alphabetic character as well. This is risky, but it works in most cases, and it really depends on your documents.
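The lower case rule above can be sketched with a case-sensitive regular expression. The function name is made up, only ASCII letters are checked, and opening tags with attributes are ignored, so treat it as an illustration of the heuristic.

```php
<?php
// Join two sibling tags when the first ends with a lower case letter and
// the second starts with one. No "i" flag: the case check is the whole point.
function sketch_combine_broken_tags($code, array $tags)
{
    foreach ($tags as $tag) {
        $t = preg_quote($tag, '~');
        $code = preg_replace('~([a-z])</' . $t . '>\s*<' . $t . '>([a-z])~', '$1 $2', $code);
    }
    return $code;
}
```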

hruler

Input:

<p>* * *</p>

Output:

<hr />

This function replaces the tags you want with an <hr> (horizontal rule/separator), if they contain only white space characters, Unicode symbols and punctuation.
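A sketch using Unicode character classes. The function name is hypothetical, and the sketch adds one assumption of its own: the tag must contain at least one symbol or punctuation character, so a tag holding only white space is left alone.

```php
<?php
// Replace tags that hold nothing but white space, punctuation and symbols.
function sketch_hruler($code, array $tags)
{
    $names = implode('|', array_map('preg_quote', $tags));
    // \p{P} = punctuation, \p{S} = symbols; "<" and ">" are excluded so the
    // match cannot run into neighbouring markup.
    $pattern = '~<(' . $names . ')>\s*(?:(?![<>])[\p{P}\p{S}]\s*)+</\1>~u';
    return preg_replace($pattern, '<hr />', $code);
}
```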

dom_parse_lists

This is probably the biggest function in the entire script. This parses unordered/ordered lists from text nodes.

Input:

<p>- item 1</p> 
<p>- item 2</p> 
<p>- item 3</p> 
<p>1. item 1</p> 
<p>2. item 2</p> 
<p>3. item 3</p>

Output:

<ul>
<li>item 1</li> 
<li>item 2</li> 
<li>item 3</li> 
</ul>
<ol>
<li>item 1</li> 
<li>item 2</li> 
<li>item 3</li>
</ol>

First, we have an unordered list. Instead of the "-" character the text node can contain any punctuation, or any Unicode symbol character.

Second, we have an ordered list, which can be parsed even if it uses ) after each number, or another symbol/punctuation character. The function also parses ordered lists of alphabetic type (a, b, c, ...).

As seen in the two examples: the text nodes must be in separate element nodes. You can configure which element nodes are checked.
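A much reduced sketch of the bullet and number detection, done with regular expressions on paragraphs only rather than with DOM. The function name is made up, the markers accepted are an assumption (one punctuation/symbol character for bullets; digits or one lower case letter followed by . or ) for ordered items), and the closure assumes PHP 5.3 or newer.

```php
<?php
// Turn runs of "bulleted" or "numbered" paragraphs into real lists.
function sketch_parse_lists($code)
{
    $kinds = array(
        'ol' => '(?:\d+|[a-z])[.)]',    // "1."  "2)"  "a."
        'ul' => '(?![<>])[\p{P}\p{S}]', // "-"  "*"  and other symbols
    );
    foreach ($kinds as $list => $marker) {
        // One list item: a paragraph starting with the marker and white space.
        $item = '<p>\s*' . $marker . '\s+(.*?)</p>\s*';
        $code = preg_replace_callback(
            '~(?:' . $item . ')+~us',
            function ($run) use ($item, $list) {
                // Rewrite each paragraph of the run as a list item.
                $items = preg_replace('~' . $item . '~us', '<li>$1</li>' . "\n", $run[0]);
                return '<' . $list . ">\n" . $items . '</' . $list . ">\n";
            },
            $code
        );
    }
    return $code;
}
```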

Another important example is what I call "fuzzy list" parsing.

Input:

<p>What follows is a fuzzy list:</p> 
<p>fuzzy item 1;</p> 
<p>fuzzy item 2;</p> 
<p>the last fuzzy item.</p>
<p>You like it?</p>

Output:

<p>What follows is a fuzzy list:</p> 
<ul>
<li>fuzzy item 1;</li> 
<li>fuzzy item 2;</li> 
<li>the last fuzzy item.</li>
</ul>
<p>You like it?</p>

Fuzzy lists are those which have no "bullets" (special characters like punctuation) and no numbering. Fuzzy lists are detected only if there's a node which has a colon at the end of its textContent. All the list items in the fuzzy list must end with a semi-colon or a dot. All the list items must have the first alphabetic character in the same case; a change of case can end the list. A fuzzy list ends when a node ending with a dot is found and its next sibling does not end with a semi-colon. That next sibling is not included in the list.
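The fuzzy list rules can be sketched on a plain array of paragraph texts instead of DOM nodes. The function sketch_fuzzy_lists and its input format are made up for illustration, and only ASCII letters are checked for case.

```php
<?php
// Turn an array of paragraph texts into markup, wrapping detected
// "fuzzy" runs in <ul>.
function sketch_fuzzy_lists(array $paragraphs)
{
    $out = array();
    $count = count($paragraphs);
    $i = 0;
    while ($i < $count) {
        $text = rtrim($paragraphs[$i]);
        $out[] = '<p>' . $text . '</p>';
        $i++;
        if (substr($text, -1) !== ':') {
            continue; // only a node ending with a colon can introduce a list
        }
        $items = array();
        $upper = null; // case of the first letter of the first item
        while ($i < $count) {
            $item = rtrim($paragraphs[$i]);
            $last = substr($item, -1);
            if ($last !== ';' && $last !== '.') {
                break; // items must end with a semi-colon or a dot
            }
            if (!preg_match('~[A-Za-z]~', $item, $m)) {
                break; // no alphabetic character to compare
            }
            $isUpper = ctype_upper($m[0]);
            if ($upper === null) {
                $upper = $isUpper;
            } elseif ($upper !== $isUpper) {
                break; // a change of case ends the list
            }
            $items[] = $item;
            $i++;
            if ($last === '.') {
                $next = $i < $count ? rtrim($paragraphs[$i]) : '';
                if (substr($next, -1) !== ';') {
                    break; // a dot, and the next node is not another item
                }
            }
        }
        if ($items) {
            $out[] = '<ul>';
            foreach ($items as $item) {
                $out[] = '<li>' . $item . '</li>';
            }
            $out[] = '</ul>';
        }
    }
    return implode("\n", $out);
}
```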

There are several configuration options for this function.

dom_toc_add

The purpose of this function is simple: generate an unordered list which contains the text nodes of all the headings in the parsed document. Optionally, the list items contain quick links to the headings - IDs are automatically generated.

This is useful when making pages for clients: sometimes it's nice to have a list of quick links inside a big document (like this page has).
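A sketch of a table of contents generator with DOM and XPath, assuming well-formed XHTML input with a single root element. The function name, the "toc-N" ID scheme and the choice to put the list at the very top of the root element are assumptions made for this example.

```php
<?php
// Build a <ul> of links to all the headings and put it at the top.
function sketch_toc_add($xhtml)
{
    $doc = new DOMDocument();
    $doc->loadXML($xhtml);
    $xpath = new DOMXPath($doc);
    // The union returns the headings in document order.
    $headings = $xpath->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6');
    $list = $doc->createElement('ul');
    $n = 0;
    foreach ($headings as $heading) {
        $n++;
        // Generate an ID only when the heading does not have one already.
        if (!$heading->hasAttribute('id')) {
            $heading->setAttribute('id', 'toc-' . $n);
        }
        $link = $doc->createElement('a');
        $link->setAttribute('href', '#' . $heading->getAttribute('id'));
        $link->appendChild($doc->createTextNode($heading->textContent));
        $item = $doc->createElement('li');
        $item->appendChild($link);
        $list->appendChild($item);
    }
    $root = $doc->documentElement;
    $root->insertBefore($list, $root->firstChild);
    return $doc->saveXML($root);
}
```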

final_regex

Like pre_tidy_regex: any regular expression hacks you want at the end. This is good for custom profiles.

Requirements

PHP 5 or newer. The script will not work with PHP 4 or older: it uses the new PHP 5 OOP features, which cause syntax errors there.
(Optional) The mbstring extension.
If you want to use the external HTML Tidy binary, you must have the binary itself (doh).
- Edit retidy.php and change the value of htmltidy_config_dir. You must provide the path to a folder where ReTidy has permission to save the file containing the HTML Tidy configuration (generated from the loaded profile).
- You must configure PHP to allow the usage of proc_open.
If you do not want to use the external HTML Tidy binary, then you must install the php5-tidy extension.

Usage

Example:

include("retidy.php"); 
$cleaner = new ReTidy('profile-name'); 
$cleaner->setCode($your_code); 
$cleaner->cleanCode(); 
$the_clean_code = $cleaner->getCode(); 
echo $cleaner->getMessages(); // if you want to see the verbose output 
unset($cleaner);

If you prefer the command line, you can use retidy-stdio.php. You only need to send via standard input the code you want cleaned (cleaning starts only after EOF is sent - it will not clean the code while data is "streaming"). The standard output will be the clean code. Verbose output (the messages) will be sent to STDERR while processing the code.

ReTidy was tested on about 400-500 A4-sized pages, spanning about 10 documents. I will probably provide examples when I have time.

Download

License: GPL v2.

Latest version: 1.11 (build 20070702)

Download the script (zip archive).

Known issues

Cleanup is too aggressive.
It doesn't work the way you want the first time. You must know what needs to be cleaned up; the script only provides you with often-used cleanup features.
New lines are duplicated in <pre> tags.

Changes log

2007-07-02, version 1.11:

The script no longer requires the mbstring extension.
Added several error messages which should make it easier for users to debug anything that might go wrong.
Added the htmltidy_config_dir property to the ReTidy class. This allows changing the folder used for saving the temporary configuration files used when calling the external HTML Tidy binary.
Better packaging: now the archive includes the documentation (this page), an example script and a sample input HTML file.

2007-06-19, version 1.1:

Added an option which allows the user to quickly switch between using the HTML Tidy module for PHP, or the standalone HTML Tidy binary. The latter is sometimes needed if the PHP installation does not have the htmltidy module installed. See the htmltidy_app option in any of the profiles.
Minor fixes, including updated profiles.

2007-05-07, version 1.0: Initial release. For this release I almost completely rewrote the script to use OOP. After the rewrite, the script runs about 5 times faster, because I switched to XPath in all the places I could, instead of slow DOM work. I also integrated the script into Agnezar (together with Awebitor).

August 2006: project started.

ROBO Design
