Tokenizer

  • namespace: Rindow\NeuralNetworks\Data\Sequence
  • classname: Tokenizer

Assign numbers to words to convert texts into sequences of integers.

Methods

constructor

public function __construct(
    object $mo,
    callable $analyzer=null,
    int $num_words=null,
    string $filters="!\"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
    string $specials=null,
    bool $lower=true,
    string $split=" ",
    bool $char_level=false,
    string $oov_token=null,
    int $document_count=0,
)

Options

  • num_words: Maximum number of words to use in the corpus. Words that occur more often have higher priority, and low-priority words are skipped when the corpus is generated.
  • filters: Characters removed from input text.
  • specials: A special character that is treated as a single word, independent of the word delimiter.
  • lower: Whether to convert the texts to lowercase.
  • split: Word delimiter.
  • oov_token: A substitute token that is included in the corpus and used in place of low-priority words instead of skipping them.
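
The combined effect of lower, filters, and split on a single text can be pictured with plain PHP. This is only an illustrative sketch, not the library's implementation: the helper name sketchTextToWords is hypothetical, it assumes filter characters are simply deleted as the option description says, and it ignores specials, char_level, and any custom analyzer.

```php
<?php
// Hypothetical helper, NOT part of Rindow: approximates the preprocessing
// implied by the lower, filters, and split options.
function sketchTextToWords(
    string $text,
    string $filters = "!\"'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
    bool $lower = true,
    string $split = " "
): array {
    if ($lower) {
        // ASCII-only lowercasing; enough for this illustration.
        $text = strtolower($text);
    }
    // Delete every filter character (assumption: plain removal).
    $text = str_replace(str_split($filters), '', $text);
    // Split on the delimiter and drop empty pieces left by repeated delimiters.
    $words = explode($split, $text);
    return array_values(array_filter($words, fn($w) => $w !== ''));
}

print_r(sketchTextToWords("Good night Tom.\n")); // ["good", "night", "tom"]
```

Each resulting word is what the tokenizer would then look up or register in its vocabulary table.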

fitOnTexts

public function fitOnTexts(iterable $texts) : void

Initialize the vocabulary table, or append to the existing vocabulary.

Arguments

  • texts: List of text strings from which the vocabulary is built.
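
A minimal sketch of calling fitOnTexts more than once, assuming the Rindow packages are installed through Composer and that the matrix operator is the Rindow\Math\Matrix\MatrixOperator class (neither detail is stated in this section). It only compares the vocabulary size before and after the second call; the exact counts are not asserted here.

```php
<?php
// Assumption: Rindow is installed via Composer in ./vendor.
use Rindow\Math\Matrix\MatrixOperator;
use Rindow\NeuralNetworks\Data\Sequence\Tokenizer;

require __DIR__ . '/vendor/autoload.php';

$mo = new MatrixOperator();
$tokenizer = new Tokenizer($mo);

// First call: builds the vocabulary from scratch.
$tokenizer->fitOnTexts(["Hello Tom!\n"]);
$before = $tokenizer->numWords();

// Second call: expected to add the new words to the same vocabulary.
$tokenizer->fitOnTexts(["Good morning.\n", "Good night Tom.\n"]);
$after = $tokenizer->numWords();

echo "words before: {$before}, after: {$after}\n";
```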

textsToSequences

public function textsToSequences(iterable $texts) : iterable

Convert texts to sequences of integers based on the vocabulary table.

Arguments

  • texts: List of text strings.

sequencesToTexts

public function sequencesToTexts(iterable $sequences) : iterable

Convert sequences of integers back to texts based on the vocabulary table.

Arguments

  • sequences: List of integer sequences.

numWords

public function numWords(bool $internal=null) : int

Number of valid words in the vocabulary table.

Arguments

  • internal: If true, returns the number of internally held words instead of the number of valid words.

wordToIndex

public function wordToIndex(string $word) : int

Convert the word to the word number.

Arguments

  • word: Word string.

indexToWord

public function indexToWord(int $index) : string

Convert the word number to the word.

Arguments

  • index: Word number.
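
wordToIndex and indexToWord are inverse lookups on the same vocabulary table, so a round trip should return the original word. The sketch below assumes a Composer installation and the Rindow\Math\Matrix\MatrixOperator class for $mo; it does not assume any particular index value.

```php
<?php
// Assumption: Rindow is installed via Composer in ./vendor.
use Rindow\Math\Matrix\MatrixOperator;
use Rindow\NeuralNetworks\Data\Sequence\Tokenizer;

require __DIR__ . '/vendor/autoload.php';

$mo = new MatrixOperator();
$tokenizer = new Tokenizer($mo);
$tokenizer->fitOnTexts(["Hello Tom!\n", "Good night Tom.\n"]);

// Words are lowercased by default (lower=true), so look up "tom".
$index = $tokenizer->wordToIndex('tom');
$word  = $tokenizer->indexToWord($index);

var_dump($word === 'tom');
```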

Examples

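use Rindow\Math\Matrix\MatrixOperator;
use Rindow\NeuralNetworks\Data\Sequence\Tokenizer;

# $mo is the matrix operator passed to the constructor
$mo = new MatrixOperator();
$texts = [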
    "Hello Tom!\n",
    "Good morning.\n",
    "Good night Tom.\n",
];
$tokenizer = new Tokenizer($mo);
$tokenizer->fitOnTexts($texts);
$sequences = $tokenizer->textsToSequences($texts);
# $sequences:
# [[1,2],[3,4],[3,5,2]]
$texts = $tokenizer->sequencesToTexts($sequences);
# $texts:
# [["hello tom"],["good morning"],["good night tom"]]
$numWords = $tokenizer->numWords();