Tokenization: A library for tokenizing English

This is a library for ad hoc tokenization of English text. Extract tokens from a body of text for use with NLP tools or statistical analysis.

The library currently handles most simple cases well. I plan to extend it somewhat for flexibility, but not at significant cost to performance. There are other tokenizers out there with more bells and whistles and better multilingual support.

Usage

To use, pass your input string to the TextTokenizer constructor, then call the Tokenize() method. This returns an IEnumerable<string>, which you can convert to a string[] or List<string> if you need a concrete collection (but you don't have to).

var myTokenizer = new TextTokenizer("This is 1 string to ... process.");

foreach (var token in myTokenizer.Tokenize())
{
	Console.WriteLine(token);
}

// Should print:
//   This
//   is
//   string
//   to
//   process

By default, only Tokens.Word tokens are emitted. This can be changed by setting the EmitTypes property (a Tokens flags value) on the tokenizer instance.

For example:

// Emits ALL tokens, regardless of type.
myTokenizer.EmitTypes = Tokens.All;

// Emits Number and Symbol tokens.
myTokenizer.EmitTypes = Tokens.Number | Tokens.Symbol;
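Putting this together with the sample input from above, here is a sketch of filtering to a single token type. Assuming Tokens.Number matches numeric tokens, only the "1" from the sample sentence should be emitted:

```csharp
using System;

var numberTokenizer = new TextTokenizer("This is 1 string to ... process.");

// Restrict output to Number tokens; Word and Symbol
// tokens are skipped during enumeration.
numberTokenizer.EmitTypes = Tokens.Number;

foreach (var token in numberTokenizer.Tokenize())
{
    Console.WriteLine(token);
}
```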

Warning: API subject to change!

I wrote this for a small, experimental project. As I try to make it more generally useful, the API might have to change. But I hope to keep it as simple as calling Tokenize() in most cases.

View the Tokenization library on GitHub

