weka.datagenerators
Class BIRCHCluster

java.lang.Object
  |
  +--weka.datagenerators.ClusterGenerator
        |
        +--weka.datagenerators.BIRCHCluster
All Implemented Interfaces:
OptionHandler, java.io.Serializable

public class BIRCHCluster
extends ClusterGenerator
implements OptionHandler, java.io.Serializable

Cluster data generator designed for the BIRCH system.

The dataset is generated with instances in K clusters. Instances are 2-d data points. Each cluster is characterized by the number of data points in it, its radius and its center. The location of the cluster centers is determined by the pattern parameter. Three patterns are currently supported: grid, sine and random.

Reference: T. Zhang, R. Ramakrishnan, M. Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM, 1996.
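
The cluster description above (a number of points, a radius and a center) can be illustrated with a small standalone sketch. It is not the code of this class: ClusterSketch and sample are hypothetical names, and the spread model (each coordinate Gaussian with standard deviation radius / sqrt(2)) is an assumption chosen for illustration.

```java
import java.util.Random;

/** Hypothetical helper, not part of weka.datagenerators. */
final class ClusterSketch {

    /**
     * Draws n 2-d points around the center (cx, cy).
     * Assumption: each coordinate is Gaussian with standard
     * deviation radius / sqrt(2); the real generator may differ.
     */
    static double[][] sample(int n, double cx, double cy,
                             double radius, Random rnd) {
        double sd = radius / Math.sqrt(2.0);
        double[][] points = new double[n][2];
        for (int i = 0; i < n; i++) {
            points[i][0] = cx + rnd.nextGaussian() * sd;
            points[i][1] = cy + rnd.nextGaussian() * sd;
        }
        return points;
    }
}
```

Passing a java.util.Random created with a fixed seed makes the sampled points reproducible from run to run.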

Valid options are:

-G
The pattern for instance generation is grid.
This flag cannot be used at the same time as flag I. The pattern is random if neither flag G nor flag I is set.

-I
The pattern for instance generation is sine.
This flag cannot be used at the same time as flag G. The pattern is random if neither flag G nor flag I is set.

-N num .. num
The range of the number of instances in each cluster (default 1..50).
Lower number must be between 0 and 2500, upper number must be between 50 and 2500.

-R num .. num
The range of the radius of the clusters (default 0.1 .. SQRT(2)).
Lower number must be between 0 and SQRT(2), upper number must be between SQRT(2) and SQRT(32).

-M num
Distance multiplier, only used if pattern is grid (default 4).

-C num
Number of cycles, only used if pattern is sine (default 4).

-O
Sets the input order to ordered. If the flag is not set, the input order is randomized.

-P num
Noise rate in percent. Can be between 0% and 30% (default 0%).
(Remark: The original algorithm only allows noise up to 10%.)

-S seed
Random number seed for random function used (default 1).
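
The constraints stated in the option list (flags G and I are mutually exclusive, the pattern defaults to random, the noise rate lies between 0% and 30%) can be sketched as plain checks. This is an illustration only: OptionCheckSketch, its enum and its methods are hypothetical and do not mirror how setOptions is implemented. The real class exposes the patterns as int constants rather than an enum.

```java
/** Hypothetical helper, not part of weka.datagenerators. */
final class OptionCheckSketch {

    /** Illustrative stand-in for the GRID, SINE and RANDOM constants. */
    enum Pattern { GRID, SINE, RANDOM }

    /** Resolves the pattern from the two flags described for -G and -I. */
    static Pattern resolvePattern(boolean gridFlag, boolean sineFlag) {
        if (gridFlag && sineFlag) {
            throw new IllegalArgumentException(
                "Flags -G and -I cannot be used at the same time.");
        }
        if (gridFlag) {
            return Pattern.GRID;
        }
        if (sineFlag) {
            return Pattern.SINE;
        }
        return Pattern.RANDOM; // neither flag set
    }

    /** Accepts a noise rate in percent within the documented 0..30 range. */
    static double checkNoiseRate(double percent) {
        if (percent < 0.0 || percent > 30.0) {
            throw new IllegalArgumentException(
                "Noise rate must be between 0 and 30 percent: " + percent);
        }
        return percent;
    }
}
```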

Version:
$Revision: 1.2 $
Author:
Gabi Schmidberger (gabi@cs.waikato.ac.nz)
See Also:
Serialized Form

Field Summary
static int GRID
           
static int ORDERED
           
static int RANDOM
           
static int RANDOMIZED
           
static int SINE
           
 
Constructor Summary
BIRCHCluster()
           
 
Method Summary
 Instances defineDataFormat()
          Initializes the format for the dataset produced.
 Instance generateExample()
          Generate an example of the dataset.
 Instances generateExamples()
          Generate all examples of the dataset.
 Instances generateExamples(java.util.Random random, Instances format)
          Generate all examples of the dataset.
 java.lang.String generateFinished()
          Compiles documentation about the data generation after the generation process.
 java.lang.String generateStart()
          Compiles documentation about the data generation before the generation process.
 Instances getDatasetFormat()
          Gets the dataset format.
 double getDistMult()
          Gets the distance multiplier.
 boolean getGridFlag()
          Gets the grid flag (option G).
 int getInputOrder()
          Gets the input order.
 java.lang.String getInstNums()
          Gets the upper and lower boundary for instances per cluster.
 int getMaxInstNum()
          Gets the upper boundary for instances per cluster.
 double getMaxRadius()
          Gets the upper boundary for the radiuses of the clusters.
 int getMinInstNum()
          Gets the lower boundary for instances per cluster.
 double getMinRadius()
          Gets the lower boundary for the radiuses of the clusters.
 double getNoiseRate()
          Gets the percentage of noise set.
 int getNumCycles()
          Gets the number of cycles.
 java.lang.String[] getOptions()
          Gets the current settings of the datagenerator BIRCHCluster.
 boolean getOrderedFlag()
          Gets the ordered flag (option O).
 int getPattern()
          Gets the pattern type.
 java.lang.String getRadiuses()
          Gets the upper and lower boundary for the radius of the clusters.
 java.util.Random getRandom()
          Gets the random generator.
 int getSeed()
          Gets the random number seed.
 boolean getSineFlag()
          Gets the sine flag (option I).
 boolean getSingleModeFlag()
          Gets the single mode flag.
 java.lang.String globalInfo()
          Returns a string describing this data generator.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 void setDatasetFormat(Instances newDatasetFormat)
          Sets the dataset format.
 void setDefaultOptions()
          Sets all options to their default values.
 void setDistMult(double newDistMult)
          Sets the distance multiplier.
 void setInputOrder(int newInputOrder)
          Sets the input order.
 void setInstNums(java.lang.String fromTo)
          Sets the upper and lower boundary for instances per cluster.
 void setMaxInstNum(int newMaxInstNum)
          Sets the upper boundary for instances per cluster.
 void setMaxRadius(double newMaxRadius)
          Sets the upper boundary for the radiuses of the clusters.
 void setMinInstNum(int newMinInstNum)
          Sets the lower boundary for instances per cluster.
 void setMinRadius(double newMinRadius)
          Sets the lower boundary for the radiuses of the clusters.
 void setNoiseRate(double newNoiseRate)
          Sets the percentage of noise.
 void setNumCycles(int newNumCycles)
          Sets the number of cycles.
 void setOptions(java.lang.String[] options)
          Parses a list of options for this object.
 void setPattern(int newPattern)
          Sets the pattern type.
 void setRadiuses(java.lang.String fromTo)
          Sets the upper and lower boundary for the radius of the clusters.
 void setRandom(java.util.Random newRandom)
          Sets the random generator.
 void setSeed(int newSeed)
          Sets the random number seed.
 
Methods inherited from class weka.datagenerators.ClusterGenerator
getClassFlag, getDebug, getNumAttributes, getNumClusters, getNumExamplesAct, getOutput, getRelationName, makeData, setClassFlag, setDebug, setNumAttributes, setNumClusters, setNumExamplesAct, setOutput, setRelationName
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

GRID

public static final int GRID
See Also:
Constant Field Values

SINE

public static final int SINE
See Also:
Constant Field Values

RANDOM

public static final int RANDOM
See Also:
Constant Field Values

ORDERED

public static final int ORDERED
See Also:
Constant Field Values

RANDOMIZED

public static final int RANDOMIZED
See Also:
Constant Field Values
Constructor Detail

BIRCHCluster

public BIRCHCluster()
Method Detail

globalInfo

public java.lang.String globalInfo()
Returns a string describing this data generator.

Returns:
a description of the data generator suitable for displaying in the explorer/experimenter gui

setInstNums

public void setInstNums(java.lang.String fromTo)
Sets the upper and lower boundary for instances per cluster.
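
The fromTo argument uses the same "lower..upper" form as option -N. A minimal parsing sketch follows; RangeSketch and parseIntRange are hypothetical names, and the actual parsing and error handling inside setInstNums are not documented here, so this only illustrates the string format.

```java
/** Hypothetical helper, not part of weka.datagenerators. */
final class RangeSketch {

    /**
     * Parses a string such as "1..50" or "1 .. 50" into {lower, upper}.
     * Rejects input without the ".." separator or with lower > upper.
     */
    static int[] parseIntRange(String fromTo) {
        int sep = fromTo.indexOf("..");
        if (sep < 0) {
            throw new IllegalArgumentException(
                "Expected 'lower..upper' but got: " + fromTo);
        }
        int lower = Integer.parseInt(fromTo.substring(0, sep).trim());
        int upper = Integer.parseInt(fromTo.substring(sep + 2).trim());
        if (lower > upper) {
            throw new IllegalArgumentException(
                "Lower boundary exceeds upper boundary: " + fromTo);
        }
        return new int[] { lower, upper };
    }
}
```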


getInstNums

public java.lang.String getInstNums()
Gets the upper and lower boundary for instances per cluster.

Returns:
the string containing the upper and lower boundary for instances per cluster separated by ..

getMinInstNum

public int getMinInstNum()
Gets the lower boundary for instances per cluster.

Returns:
the lower boundary for instances per cluster

setMinInstNum

public void setMinInstNum(int newMinInstNum)
Sets the lower boundary for instances per cluster.

Parameters:
newMinInstNum - new lower boundary for instances per cluster

getMaxInstNum

public int getMaxInstNum()
Gets the upper boundary for instances per cluster.

Returns:
the upper boundary for instances per cluster

setMaxInstNum

public void setMaxInstNum(int newMaxInstNum)
Sets the upper boundary for instances per cluster.

Parameters:
newMaxInstNum - new upper boundary for instances per cluster

setRadiuses

public void setRadiuses(java.lang.String fromTo)
Sets the upper and lower boundary for the radius of the clusters.


getRadiuses

public java.lang.String getRadiuses()
Gets the upper and lower boundary for the radius of the clusters.

Returns:
the string containing the upper and lower boundary for the radius of the clusters, separated by ..

getMinRadius

public double getMinRadius()
Gets the lower boundary for the radiuses of the clusters.

Returns:
the lower boundary for the radiuses of the clusters

setMinRadius

public void setMinRadius(double newMinRadius)
Sets the lower boundary for the radiuses of the clusters.

Parameters:
newMinRadius - new lower boundary for the radiuses of the clusters

getMaxRadius

public double getMaxRadius()
Gets the upper boundary for the radiuses of the clusters.

Returns:
the upper boundary for the radiuses of the clusters

setMaxRadius

public void setMaxRadius(double newMaxRadius)
Sets the upper boundary for the radiuses of the clusters.

Parameters:
newMaxRadius - new upper boundary for the radiuses of the clusters

getGridFlag

public boolean getGridFlag()
Gets the grid flag (option G).

Returns:
true if grid flag is set

getSineFlag

public boolean getSineFlag()
Gets the sine flag (option I).

Returns:
true if sine flag is set

getPattern

public int getPattern()
Gets the pattern type.

Returns:
the current pattern type

setPattern

public void setPattern(int newPattern)
Sets the pattern type.

Parameters:
newPattern - new pattern type

getDistMult

public double getDistMult()
Gets the distance multiplier.

Returns:
the distance multiplier

setDistMult

public void setDistMult(double newDistMult)
Sets the distance multiplier.

Parameters:
newDistMult - new distance multiplier
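
To show what a distance multiplier can mean for the grid pattern, the sketch below places cluster centers on a square grid. It rests on assumptions: the spacing rule (multiplier times the mean of the minimum and maximum radius) follows the description in the BIRCH paper, and whether this class uses exactly that rule is not stated in this documentation. GridSketch and gridCenters are hypothetical names.

```java
/** Hypothetical helper, not part of weka.datagenerators. */
final class GridSketch {

    /**
     * Places numClusters centers row by row on a square grid.
     * Assumption: neighboring centers are distMult * (minRadius + maxRadius) / 2
     * apart; the real generator may use a different rule.
     */
    static double[][] gridCenters(int numClusters, double minRadius,
                                  double maxRadius, double distMult) {
        int side = (int) Math.ceil(Math.sqrt(numClusters));
        double spacing = distMult * (minRadius + maxRadius) / 2.0;
        double[][] centers = new double[numClusters][2];
        for (int i = 0; i < numClusters; i++) {
            centers[i][0] = (i % side) * spacing; // column position
            centers[i][1] = (i / side) * spacing; // row position
        }
        return centers;
    }
}
```

With a multiplier well above 2, neighboring clusters of average radius rarely overlap; smaller values push the clusters into each other.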

getNumCycles

public int getNumCycles()
Gets the number of cycles.

Returns:
the number of cycles

setNumCycles

public void setNumCycles(int newNumCycles)
Sets the number of cycles.

Parameters:
newNumCycles - new number of cycles
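
The role of the number of cycles in the sine pattern can be sketched as follows. The formula is an assumption taken from the description in the BIRCH paper (center i at x = 2*pi*i and y = (K/nc) * sin(2*pi*i / (K/nc)), with K clusters and nc cycles); this documentation does not confirm that the class uses it unchanged. SineSketch and sineCenters are hypothetical names.

```java
/** Hypothetical helper, not part of weka.datagenerators. */
final class SineSketch {

    /**
     * Places numClusters centers along a sine curve with numCycles cycles.
     * Assumption: x_i = 2*pi*i and y_i = p * sin(2*pi*i / p),
     * where p = numClusters / numCycles is the number of clusters per cycle.
     */
    static double[][] sineCenters(int numClusters, int numCycles) {
        if (numClusters <= 0 || numCycles <= 0) {
            throw new IllegalArgumentException("Counts must be positive.");
        }
        double perCycle = (double) numClusters / numCycles;
        double[][] centers = new double[numClusters][2];
        for (int i = 0; i < numClusters; i++) {
            centers[i][0] = 2.0 * Math.PI * i;
            centers[i][1] = perCycle * Math.sin(2.0 * Math.PI * i / perCycle);
        }
        return centers;
    }
}
```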

getInputOrder

public int getInputOrder()
Gets the input order.

Returns:
the current input order

setInputOrder

public void setInputOrder(int newInputOrder)
Sets the input order.

Parameters:
newInputOrder - new input order
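
The difference between the two input orders can be illustrated with a seeded shuffle. This is a sketch under assumptions: it treats "ordered" as leaving the instances in the order they were produced and "randomized" as a shuffle driven by the random number seed. InputOrderSketch and arrange are hypothetical names, and the list of cluster labels stands in for the generated instances.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not part of weka.datagenerators. */
final class InputOrderSketch {

    /**
     * Returns a copy of the labels, either unchanged (ordered) or
     * shuffled with a generator created from the given seed (randomized).
     */
    static List<Integer> arrange(List<Integer> clusterLabels,
                                 boolean ordered, long seed) {
        List<Integer> result = new ArrayList<>(clusterLabels);
        if (!ordered) {
            Collections.shuffle(result, new Random(seed));
        }
        return result;
    }
}
```

Because the shuffle uses a seeded generator, the same seed yields the same randomized order on every run.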

getOrderedFlag

public boolean getOrderedFlag()
Gets the ordered flag (option O).

Returns:
true if ordered flag is set

getNoiseRate

public double getNoiseRate()
Gets the percentage of noise set.

Returns:
the percentage of noise set

setNoiseRate

public void setNoiseRate(double newNoiseRate)
Sets the percentage of noise.

Parameters:
newNoiseRate - new percentage of noise

getRandom

public java.util.Random getRandom()
Gets the random generator.

Returns:
the random generator

setRandom

public void setRandom(java.util.Random newRandom)
Sets the random generator.

Parameters:
newRandom - is the random generator.

getSeed

public int getSeed()
Gets the random number seed.

Returns:
the random number seed.

setSeed

public void setSeed(int newSeed)
Sets the random number seed.

Parameters:
newSeed - the new random number seed.

getDatasetFormat

public Instances getDatasetFormat()
Gets the dataset format.

Returns:
the dataset format.

setDatasetFormat

public void setDatasetFormat(Instances newDatasetFormat)
Sets the dataset format.

Parameters:
newDatasetFormat - the new dataset format.

getSingleModeFlag

public boolean getSingleModeFlag()
Gets the single mode flag.

Overrides:
getSingleModeFlag in class ClusterGenerator
Returns:
true if the method generateExample can be used.

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setDefaultOptions

public void setDefaultOptions()
Sets all options to their default values.


setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a list of options for this object.

For list of valid options see class description.

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the datagenerator BIRCHCluster.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions

defineDataFormat

public Instances defineDataFormat()
                           throws java.lang.Exception
Initializes the format for the dataset produced.

Overrides:
defineDataFormat in class ClusterGenerator
Returns:
the output data format
Throws:
java.lang.Exception - data format could not be defined

generateExample

public Instance generateExample()
                         throws java.lang.Exception
Generate an example of the dataset.

Overrides:
generateExample in class ClusterGenerator
Returns:
the instance generated
Throws:
java.lang.Exception - if the format is not defined or examples cannot be generated one by one

generateExamples

public Instances generateExamples()
                           throws java.lang.Exception
Generate all examples of the dataset.

Overrides:
generateExamples in class ClusterGenerator
Returns:
the instances generated
Throws:
java.lang.Exception - if format not defined

generateExamples

public Instances generateExamples(java.util.Random random,
                                  Instances format)
                           throws java.lang.Exception
Generate all examples of the dataset.

Returns:
the instances generated
Throws:
java.lang.Exception - if format not defined

generateFinished

public java.lang.String generateFinished()
                                  throws java.lang.Exception
Compiles documentation about the data generation after the generation process.

Overrides:
generateFinished in class ClusterGenerator
Returns:
string with additional information about generated dataset
Throws:
java.lang.Exception - no input structure has been defined

generateStart

public java.lang.String generateStart()
Compiles documentation about the data generation before the generation process.

Overrides:
generateStart in class ClusterGenerator
Returns:
string with additional information

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain arguments for the data producer:


Copyright (c) 2003 David Lindsay, Computer Learning Research Centre, Dept. Computer Science, Royal Holloway, University of London