Module Stemmer :: Class Stemmer

Class Stemmer

object --+
         |
        Stemmer

An instance of a stemming algorithm.

The algorithm has internal state, so an instance must not be called
concurrently; i.e., only a single thread should access the instance at
any given time.
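
One way to honour this restriction in threaded code is to serialize access with a lock. The following is only a sketch: `LockedStemmer` is a hypothetical wrapper, not part of PyStemmer, and it works with any object that exposes `stemWord` and `stemWords`.

```python
import threading


class LockedStemmer:
    """Hypothetical wrapper that serializes access to a stemmer instance."""

    def __init__(self, stemmer):
        # `stemmer` is any object with stemWord/stemWords methods,
        # for example a Stemmer.Stemmer instance.
        self._stemmer = stemmer
        self._lock = threading.Lock()

    def stemWord(self, word):
        # Only one thread at a time may enter the underlying stemmer.
        with self._lock:
            return self._stemmer.stemWord(word)

    def stemWords(self, words):
        # Materialize the input first so that a lazy iterator is not
        # being consumed while the lock is held.
        words = list(words)
        with self._lock:
            return self._stemmer.stemWords(words)
```

An alternative is to create a separate stemmer per thread (for instance with `threading.local`), which avoids lock contention at the cost of one cache per thread.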

When creating a `Stemmer` object, there is one required argument: the
name of the algorithm to use in the new stemmer.  A list of the valid
algorithm names may be obtained by calling the `algorithms()` function
in this module.  In addition, the stemming algorithm for a given
language may be selected by passing its 2 or 3 letter ISO 639 language
code in place of the algorithm name.
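
As a brief illustration of the constructor, the sketch below wraps the two calls described above. `available_algorithms` and `build_stemmer` are hypothetical helper names; the code assumes that PyStemmer is installed and importable as `Stemmer`.

```python
def available_algorithms():
    """Return the algorithm names reported by the Stemmer module."""
    # Imported lazily so the helpers can be defined even where
    # PyStemmer is not installed (assumption: module name `Stemmer`).
    import Stemmer
    return list(Stemmer.algorithms())


def build_stemmer(name="english"):
    """Create a stemmer from an algorithm name or ISO 639 language code."""
    import Stemmer
    return Stemmer.Stemmer(name)


# Example use (requires PyStemmer):
#   print(available_algorithms())
#   stemmer = build_stemmer("en")   # ISO 639 code instead of a full name
#   print(stemmer.stemWord("cycling"))
```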

A second optional argument to the constructor for `Stemmer` is the size
of cache to use.  The cache implemented in this module is not terribly
efficient, but benchmarks show that it approximately doubles
performance for typical text processing operations, without too much
memory overhead.  The cache may be disabled by passing a size of 0.
The default size (10000 words) is probably appropriate in most
situations.  In pathological cases (for example, when no word is
presented to the stemming algorithm more than once, so the cache is
useless), the cache can severely damage performance.

The "benchmark.py" script supplied with the PyStemmer distribution can
be used to test the performance of the stemming algorithms with various
cache sizes.
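
For a quick comparison without the bundled script, a rough timing loop along the following lines may help. This is a sketch, not a reproduction of benchmark.py: `time_cache_sizes` and its `make_stemmer` argument are hypothetical. `make_stemmer(size)` is expected to return a fresh stemmer with that cache size, for example `lambda size: Stemmer.Stemmer("english", size)`.

```python
import time


def time_cache_sizes(make_stemmer, words, cache_sizes=(0, 1000, 10000)):
    """Return {cache_size: seconds} for stemming `words` once per size."""
    words = list(words)  # reuse exactly the same input for every run
    timings = {}
    for size in cache_sizes:
        stemmer = make_stemmer(size)  # fresh instance, so the cache starts empty
        start = time.perf_counter()
        stemmer.stemWords(words)
        timings[size] = time.perf_counter() - start
    return timings
```

Results depend heavily on how often words repeat in the input, so a representative sample of real text is a better test than a word list with no repetition.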

Instance Methods
 
__init__(...)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature

__new__(T, S, ...)
a new object with type S, a subtype of T
 
__purgeCache(...)
 
stemWord(...)
Stem a word.
 
stemWords(...)
Stem a list of words.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Properties
  maxCacheSize
Maximum number of entries to allow in the cache.

Inherited from object: __class__

Method Details

__init__(...)
(Constructor)

 

x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Overrides: object.__init__

__new__(T, S, ...)

 


Returns:
a new object with type S, a subtype of T

Overrides: object.__new__

stemWord(...)

 
Stem a word.

This takes a single argument, ``word``, which should be either a UTF-8
encoded string or a unicode object.

The result is the stemmed form of the word.  If the word supplied
was a unicode object, the result will be a unicode object; if the
word supplied was a string, the result will be a UTF-8 encoded
string.
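
Because the result type follows the input type, mixed inputs can produce mixed outputs. The hypothetical helper below normalizes everything to text first. It assumes Python 3, where a "unicode object" corresponds to `str` and a UTF-8 encoded string to `bytes`; `stem_as_text` is not part of PyStemmer.

```python
def stem_as_text(stemmer, word):
    """Stem `word` and always return text, whatever type was passed in."""
    if isinstance(word, bytes):
        # Decode UTF-8 input up front so the stemmer receives text
        # and therefore returns text.
        word = word.decode("utf-8")
    return stemmer.stemWord(word)
```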

stemWords(...)

 
Stem a list of words.

This takes a single argument, ``words``, which must be a sequence,
iterator, generator or similar.

The entries in ``words`` should be either UTF-8 encoded strings or
unicode objects.

The result is a list of the stemmed forms of the words.  If a
word supplied was a unicode object, its stemmed form will be a
unicode object; if a word supplied was a string, its stemmed form
will be a UTF-8 encoded string.
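
Since ``words`` may be a generator, a generator expression can be passed directly. A minimal sketch, in which `stem_text` and its whitespace tokenization are illustrative assumptions rather than anything PyStemmer provides:

```python
def stem_text(stemmer, text):
    """Lowercase `text`, split it on whitespace, and stem every token."""
    tokens = (token.lower() for token in text.split())  # lazy generator
    # stemWords returns a list, one stemmed form per input token.
    return stemmer.stemWords(tokens)
```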


Property Details

maxCacheSize

Maximum number of entries to allow in the cache.

This may be set to zero to disable the cache entirely.

The maximum cache size may be set at any point.  Setting the
maximum size will purge entries from the cache if the new maximum
size is smaller than the current size.
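
Because the property can be changed at any time, the cache can be resized or switched off for just one part of a program. The context manager below is a hypothetical convenience, not part of PyStemmer; it assumes only that `maxCacheSize` can be read and assigned as described above.

```python
from contextlib import contextmanager


@contextmanager
def cache_size(stemmer, size):
    """Temporarily set stemmer.maxCacheSize, restoring it afterwards."""
    previous = stemmer.maxCacheSize
    stemmer.maxCacheSize = size  # 0 disables the cache entirely
    try:
        yield stemmer
    finally:
        # Restores the limit only; entries purged meanwhile are not recovered.
        stemmer.maxCacheSize = previous
```

A typical use would be `with cache_size(stemmer, 0): ...` around input that is known to contain few repeated words.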