The script available at http://corpus1.leeds.ac.uk/cleaneval/cleanset.pl was used to pre-process the data. You are free to use it, or something based on it, but doing so is optional.
Prepared by Francis Chantree.
There are around 60 items for each language, mirrored at Leeds and Trento:
| English original | http://corpus1.leeds.ac.uk/cleaneval/devel/en-original.tgz |
| Chinese original | http://corpus1.leeds.ac.uk/cleaneval/devel/zh-original.tgz |
| English stripped | http://corpus1.leeds.ac.uk/cleaneval/devel/en-stripped.tgz |
| Chinese stripped | http://corpus1.leeds.ac.uk/cleaneval/devel/zh-stripped.tgz |
| English cleaned | http://corpus1.leeds.ac.uk/cleaneval/devel/en-cleaned.tgz |
| Chinese cleaned | http://corpus1.leeds.ac.uk/cleaneval/devel/zh-cleaned.tgz |
| English original | http://polorovereto.unitn.it/~baroni/cleaneval/devel/en-original.tgz |
| Chinese original | http://polorovereto.unitn.it/~baroni/cleaneval/devel/zh-original.tgz |
| English stripped | http://polorovereto.unitn.it/~baroni/cleaneval/devel/en-stripped.tgz |
| Chinese stripped | http://polorovereto.unitn.it/~baroni/cleaneval/devel/zh-stripped.tgz |
| English cleaned | http://polorovereto.unitn.it/~baroni/cleaneval/devel/en-cleaned.tgz |
| Chinese cleaned | http://polorovereto.unitn.it/~baroni/cleaneval/devel/zh-cleaned.tgz |
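The archive names above follow a regular pattern (language code, then variant, then `.tgz`), so fetching them can be scripted. The following is a minimal sketch in Python, not part of the original distribution: the names `MIRRORS`, `archive_url`, and `fetch_and_extract`, and the default directory `cleaneval-devel`, are assumptions made for illustration, and the script assumes the mirrors are still reachable.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URLs copied from the table above; the dictionary keys are illustrative labels.
MIRRORS = {
    "leeds": "http://corpus1.leeds.ac.uk/cleaneval/devel",
    "trento": "http://polorovereto.unitn.it/~baroni/cleaneval/devel",
}
LANGUAGES = ("en", "zh")
VARIANTS = ("original", "stripped", "cleaned")


def archive_url(mirror, lang, variant):
    """Build the URL of one archive, e.g. .../devel/en-original.tgz."""
    if mirror not in MIRRORS:
        raise ValueError(f"unknown mirror: {mirror!r}")
    if lang not in LANGUAGES or variant not in VARIANTS:
        raise ValueError(f"unknown archive: {lang}-{variant}")
    return f"{MIRRORS[mirror]}/{lang}-{variant}.tgz"


def fetch_and_extract(mirror, lang, variant, dest="cleaneval-devel"):
    """Download one archive into dest and unpack it there (hypothetical helper)."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / f"{lang}-{variant}.tgz"
    urllib.request.urlretrieve(archive_url(mirror, lang, variant), str(archive_path))
    # Only unpack archives from a source you trust; extractall does not
    # sandbox member paths on older Python versions.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

For example, `fetch_and_extract("leeds", "en", "cleaned")` would download the English cleaned set from the Leeds mirror; switching the first argument to `"trento"` uses the other mirror if the first is unavailable.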