[ Overview ] [ How to cite PSEMiner ] [ License ] [ Download ] [ Instructions ] [ Contact ]

Overview

PSEMiner is a very efficient tool for detecting and mining periodically recurring behavior in dynamic network and event datasets. It is based on robust theoretical analysis and can be used with many different kinds of dynamic data. Some examples of uses are:

For more information, please see the accompanying conference paper published at ICDM '08 and extended journal paper (Knowledge and Information Systems, to appear).


How to cite PSEMiner

If you use PSEMiner in academic research, please cite the following papers:

License

PSEMiner is licensed under the terms of GNU General Public License (GPL), and download and use of code or binaries constitutes your agreement to the terms of the license.

Download and Install

Optimized, pre-compiled binaries are available for the following platforms.
The source is also available, and should compile on most systems with any ANSI compliant C/C++ compiler with GNU autotools and a sane build environment. To compile, untar the file and run
./configure
followed by
make

PSEMiner-2.0.tar.gz (source)

Prerequisites: Google's sparsehash library needs to be installed only if you want to compile the source yourself. If you just want to run the executable, then you don't need to have sparsehash installed.

Input file format

The input file format consists of one timestep per line. On each line, a set of integers specifies the edges and/or vertices present at that timestep, in time increasing order of timesteps. This requires mapping each unique edge to an integer, so for example, "a-b" would map to 0, "b-c" would map to 1. Each line can have an optional date label if it is prefixed with an asterisk. The following is an example:
*Jan-20-1998 0 1 9 123
*Jan-21-1998
*Jan-22-1999 0 1 9123
In the example above, the second timestep is blank. Note that time labels are not parsed -- they are just a convenience for humans examining the file. The edges 0 and 1 appear in two timesteps. An equivalent format for the same dataset without time labels is as follows:
0 1 9 123

0 1 9123
Note the blank line to signify an empty timestep.

Running the program

The following command line parameters are required:
-i <filename>    Specifies the input file
-o <filename>    Specifies the output file for mined patterns
The following command line parameters are optional:
-s <minsup>         Specifies the minimum support value >= 2 (default: 3)
-Pmax <max period>  Maximum period for mined patterns -- will greatly speed up processing if set to reasonable value (default: unrestricted)
-smooth <s>         Number of timesteps to smooth, for handling noise (default: 1, or no smoothing)
-q                  Quiet mode -- do not display diagnostic information
-X     		    Optimize hashmap for speed over memory usage

-h                  Display additional options

Enron Dataset

We are making the following version of the Enron dataset available for use with the mining program. The following file is in a format appropriate for use as input to the mining program:

Enron (dated)

The following file contains the mapping from integers to actual e-mail addresses from the Enron database.

Enron integer-edge mapping


Contact

Bugs, comments, suggestions, requests for assistance can be sent to mlahir2 at the mail server uic.edu.
© 2009
Mayank Lahiri