jTokeniser - v.2.0 - README

Andrew Roberts (16-Jul-2006)
dev [at] andy-roberts [dot] net

http://www.andy-roberts.net/software/jTokeniser/

Overview
========

Tokenising a string into its constituent words/tokens can prove tricky for
non-trivial examples. In particular, when dealing with natural language, you
must take into consideration punctuation too in order to isolate the words. The
jTokeniser package was designed to combine a set of tokenisers that range from
basic whitespace tokenisers to more complex ones that deal intuitively with
natural language.

Each of the tokenisers adopts a similar structure to java.util.StringTokenizer in
terms of how to instantiate the classes and extract the tokens. This means they
are simple to use.
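
As a reminder of the pattern referred to above, here is a minimal sketch that
uses only the JDK's own java.util.StringTokenizer. It does not call any
jTokeniser class; the class name StringTokenizerPattern and the helper method
tokens() are made up for this illustration. See the JavaDoc for the actual
jTokeniser method names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class: shows the JDK instantiate-then-iterate pattern only.
public class StringTokenizerPattern {

    // Collects tokens with the standard hasMoreTokens()/nextToken() loop.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        // With no delimiter argument, the JDK class splits on whitespace.
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("The quick brown fox"));
        // prints [The, quick, brown, fox]
    }
}
```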

What's new in 2.0?
====================

 * A new GUI front-end to the jTokeniser library. You can type in, copy and 
   paste, or even load a text file into the application. You must select your
   tokeniser of choice (and any options of interest) and then hit the Tokenise
   button. Your results will be displayed as soon as they are processed and you
   have the option to save the results to file, if you choose.
   
   The GUI is particularly useful for experimenting with tokenisation methods
   in a teaching environment (such as an NLP course). It will also be of
   interest to those who wish to use the jTokeniser library but don't have the
   Java programming experience to utilise the code directly.

NB There have been no changes to the core tokeniser libraries and the API
remains fully compatible with prior versions.
   
Features
========

jTokeniser comprises six tokenisers that all extend an abstract
Tokeniser class:

 * WhiteSpaceTokeniser - this splits a string on all occurrences of whitespace,
   which includes spaces, newlines, tabs and linefeeds.

 * StringTokeniser - this is basically the same as java.util.StringTokenizer
   with some extra methods (and extends from Tokeniser). Its default behaviour
   is to act as a WhiteSpaceTokeniser; however, you can specify a set of
   characters that are to be used to indicate word delimiters.

 * RegexTokeniser - this tokeniser is much more flexible as you can use regular
   expressions to define what a token is. So, "\\w+" means that whenever it
   matches one or more word characters (letters, digits or underscores), it
   will consider that a token. By default, it uses a regular expression
   equivalent to a whitespace tokeniser.

 * RegexSeparatorTokeniser - this can be thought of as an advanced 
   StringTokeniser. Whereas StringTokeniser is limited to defining delimiters 
   as a set of individual characters, RegexSeparatorTokeniser can utilise 
   regular expressions for a richer and more flexible approach.

 * BreakIteratorTokeniser - the most sophisticated word tokeniser, although it
   should only be used on natural language strings to isolate words. It also
   comes with
   built-in rules about how to find words, knowing how to disregard punctuation,
   etc.

 * SentenceTokeniser - this also uses a BreakIterator like the above, but tuned 
   towards finding sentence boundaries. The "tokens" in this tokeniser are in 
   fact individual sentences.
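
The difference between these approaches can be sketched with the standard
library alone. The example below does not use the jTokeniser classes; it
relies only on java.util.regex and java.text.BreakIterator from the JDK. The
class name TokenisingApproaches and its helper methods are hypothetical, and
filtering boundary spans by "contains a letter or digit" is an assumption made
for this illustration, not necessarily how jTokeniser does it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class built on JDK classes only.
public class TokenisingApproaches {

    // The regex describes a TOKEN: every match is returned.
    static List<String> byTokenPattern(String text, String regex) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // The regex describes a SEPARATOR: the text between matches is returned.
    static List<String> bySeparatorPattern(String text, String regex) {
        List<String> out = new ArrayList<String>();
        for (String piece : text.split(regex)) {
            if (piece.length() > 0) { // split() can yield a leading empty string
                out.add(piece);
            }
        }
        return out;
    }

    // Walks the boundaries reported by a BreakIterator and keeps only the
    // spans that contain at least one letter or digit.
    static List<String> byBreakIterator(String text, BreakIterator it) {
        List<String> out = new ArrayList<String>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String span = text.substring(start, end).trim();
            if (hasLetterOrDigit(span)) {
                out.add(span);
            }
        }
        return out;
    }

    private static boolean hasLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetterOrDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Hello, world! It works.";
        System.out.println(byTokenPattern(text, "\\w+"));
        System.out.println(bySeparatorPattern(text, "[\\s,.!]+"));
        System.out.println(byBreakIterator(text, BreakIterator.getWordInstance(Locale.ENGLISH)));
        System.out.println(byBreakIterator(text, BreakIterator.getSentenceInstance(Locale.ENGLISH)));
    }
}
```

With this input, the first two calls both print [Hello, world, It, works]. One
pattern states what a word looks like and the other states what lies between
words. The two BreakIterator calls need no pattern at all, since the word and
sentence rules are built into the JDK.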

Installation
============

The jTokeniser package doesn't need installing as such. You simply have to
download it to your computer, and then make sure that the Java compiler and
virtual machine can "find" it.

Unlike previous versions, jTokeniser is now distributed as a Zip file. 

Download from:

http://www.andy-roberts.net/software/jTokeniser/releases/2.0/jTokeniser-2.0.zip

Many utilities can uncompress the file. On Windows, a popular utility
is WinZip. On most platforms, there are command-line tools, such as 'unzip'
that can also be used.

The Zip file contains the following:

./README.txt
./jTokeniser-2.0.jar
./lib/swing-layout-1.0.jar  (additional library required for the GUI)


Important note:
In order to use jTokeniser, you need to have the Java Runtime Environment
installed. It requires Java 5.0 or above.

To obtain Java (or update to the latest version) go to http://www.java.com and
it will automatically detect the version that you need to download and install.

Running the jTokeniser GUI
==========================

On Windows:
  When you install the Java Runtime, it normally associates .jar files
  with a jar-runner program. Therefore, just double-click the
  jTokeniser-2.0.jar file and the GUI should load promptly.

On all platforms:
  At the command line, change to the directory with the jar file and type:
  java -jar jTokeniser-2.0.jar

Using the jTokeniser library in your programs
=============================================

The package is bundled together as a JAR file, which is a Java archive
containing all the classes. A JAR is actually compressed using the well-known
zip algorithms. The advantage of using JARs is that you can keep lots of
related classes together in a single file, rather than having to uncompress
them.

All Java needs to know is where the JAR file is, and there are a couple of
ways of achieving this. Imagine you have a class that uses a tokeniser from
this package called ClassThatTokenises.java. To compile and run:

1. Specifying at the command-line

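   javac -classpath /path/to/jTokeniser-2.0.jar ClassThatTokenises.java
   java -classpath .:/path/to/jTokeniser-2.0.jar ClassThatTokenises

   NB the "." entry adds the current directory (which holds the compiled
   ClassThatTokenises.class) to the classpath so that Java can find it. In
   Windows, the entries are separated by ";" rather than ":" and the path
   would be more like c:\path\to\jTokeniser-2.0.jar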

2. Setting the CLASSPATH environment variable.

   In Linux:

   export CLASSPATH=$CLASSPATH:/path/to/jTokeniser-2.0.jar  (for bash)
   setenv CLASSPATH $CLASSPATH:/path/to/jTokeniser-2.0.jar  (for csh)
   
   javac ClassThatTokenises.java
   java ClassThatTokenises

   In Windows:

   set CLASSPATH=%CLASSPATH%;c:\path\to\jTokeniser-2.0.jar

   NB you can set the CLASSPATH via Control Panel/System/Advanced/Environment
   Variables
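
If Java still cannot find the tokeniser classes, it can help to check what the
virtual machine actually sees. The following is a small diagnostic sketch, not
part of the jTokeniser distribution. The class name ClasspathCheck and the
method entriesNamed() are hypothetical, and it only inspects the standard
java.class.path system property.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: reports whether a given JAR is on the classpath.
public class ClasspathCheck {

    // Splits a classpath string on the platform separator (":" or ";") and
    // returns the entries whose file name equals fileName.
    static List<String> entriesNamed(String classpath, String fileName) {
        List<String> hits = new ArrayList<String>();
        String separator = Pattern.quote(File.pathSeparator);
        for (String entry : classpath.split(separator)) {
            if (new File(entry).getName().equals(fileName)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // java.class.path holds the classpath the JVM was started with.
        String classpath = System.getProperty("java.class.path");
        List<String> hits = entriesNamed(classpath, "jTokeniser-2.0.jar");
        if (hits.isEmpty()) {
            System.out.println("jTokeniser-2.0.jar is NOT on the classpath: " + classpath);
        } else {
            System.out.println("Found: " + hits);
        }
    }
}
```

Compile and run it in the same shell session (and with the same -classpath
option or CLASSPATH variable) that you use for your own program.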

Support
=======

The package API documentation (as generated by JavaDoc) can be found at:
http://www.andy-roberts.net/software/jTokeniser/releases/2.0/doc/index.html

An example Java file which illustrates common usage of each of the tokenisers
exists at:
http://www.andy-roberts.net/software/jTokeniser/releases/2.0/JTokeniserExample.java


Contact
=======

If you wish to contact the developer about jTokeniser to suggest future
features, report bugs or raise anything else, please email me at:

dev [at] andy-roberts [dot] net

NB The address is in anti-spam format. Please remove all spaces, and replace
'[at]' with the '@' symbol and '[dot]' with '.' (no quotes).
