Overview
PSEMiner is a very efficient tool for detecting and mining periodically recurring behavior in dynamic network and event datasets.
It is based on robust theoretical analysis and can be used with many different kinds of dynamic data. Some examples of uses are:
- Detecting periodically recurring communication patterns from e-mail, phone call, text message and IM logs.
- Detecting periodically recurring webserver accesses by web robots, for example, or as a result of brute-force password attacks.
- Characterizing the natural periodicities present in a dataset, even if periodic behavior only persists for a short period of time.
For more information, please see the accompanying conference paper published at ICDM '08 and extended journal paper (Knowledge and Information Systems, to appear).
How to cite PSEMiner
If you use PSEMiner in academic research, please cite the following papers:
- [ pdf ] [ bibtex ] M. Lahiri and T.Y. Berger-Wolf. Periodic Subgraph Mining in Dynamic Networks. Knowledge and Information Systems, to appear.
- [ pdf ] [ bibtex ] M. Lahiri and T.Y. Berger-Wolf. Mining Periodic Behavior in Dynamic Social Networks. In Proc. of the 8th IEEE International Conference on Data Mining (ICDM 2008), Pisa, Italy. December 2008.
License
PSEMiner is licensed under the terms of
GNU General Public License (GPL), and download and use of code or binaries constitutes your agreement to the terms of the license.
Download and Install
Optimized, pre-compiled binaries are available for the following platforms.
The source is also available, and should compile on most systems with any ANSI compliant
C/C++ compiler with GNU autotools and a sane build environment. To compile, untar the file and run ./configure
followed by make
PSEMiner-2.0.tar.gz (source)
Prerequisites: Google's
sparsehash library needs to be installed
only if you want to compile the source yourself. If you just want to run the executable, then you don't need to have sparsehash installed.
Input file format
The input file format consists of one timestep per line. On each line, a set of integers specifies the edges and/or vertices present at that timestep, in time increasing order of timesteps. This requires mapping each unique edge to an integer, so for example, "a-b" would map to 0, "b-c" would map to 1. Each line can have an optional date label if it is prefixed with an asterisk. The following is an example:
*Jan-20-1998 0 1 9 123
*Jan-21-1998
*Jan-22-1999 0 1 9123
In the example above, the second timestep is blank. Note that time labels are not parsed -- they are just a convenience for humans examining the file. The edges 0 and 1 appear in two timesteps. An equivalent format for the same dataset without time labels is as follows:
0 1 9 123
0 1 9123
Note the blank line to signify an empty timestep.
Running the program
The following command line parameters are
required:
-i <filename> Specifies the input file
-o <filename> Specifies the output file for mined patterns
The following command line parameters are
optional:
-s <minsup> Specifies the minimum support value >= 2 (default: 3)
-Pmax <max period> Maximum period for mined patterns -- will greatly speed up processing if set to reasonable value (default: unrestricted)
-smooth <s> Number of timesteps to smooth, for handling noise (default: 1, or no smoothing)
-q Quiet mode -- do not display diagnostic information
-X Optimize hashmap for speed over memory usage
-h Display additional options
Enron Dataset
We are making the following version of the Enron dataset available for use with the
mining program. The following file is in a format appropriate for use as input
to the mining program:
Enron (dated)
The following file contains the mapping from integers to actual e-mail addresses from
the Enron database.
Enron integer-edge mapping
Contact
Bugs, comments, suggestions, requests for assistance can be sent to mlahiri at the mail server cs.uic.edu.
© 2009
Mayank Lahiri