RE-SPaM
Regular Expression for Sequential Pattern Mining
Using RE-SPaM for Mining Semantic Trajectories

Introduction

Many application domains produce information as temporal ordered sequences. Discovery of hidden patterns over these sequences have been extensively studied. Particulary, when these sequences are organized in transactions, the Generalized Sequential Pattern algorithm (Srikant & Agrawal, 1996) and its  variations are suitable for this problem.

 

Generalized Sequential Pattern algorithm (GSP) produces all patterns that appear at least as many times as a user-specified threshold, named minimum support. But since the user may obtain lots of patterns, not necessarily interesting from a user's point of view, the SPIRIT algorithm (Garofalakis, Rastogi, & Shim, 2001) was proposed in order to prune the candidate sequences obtained during the mining process using a user-specified constraint expressed as a regular expression over the items to be mined.

 

Motivation

Our motivation is focused on mining trajectories of moving objects, although our proposal is general enough to be applied in any domain. Moving objects report their locations as a time-ordered sequence obtained by means of  electronic devices (e.g GPS, RFID). The trajectory of a moving object is given by samples composed of a finite number of tuples of the form (OID, t, x, y), stating that at a certain point in time, namely t, the object OID was located at coordinates (x, y).

Traditional sequential pattern algorithms where designed to be applied to items, not real numbers which involve precision problems.

 

For mining trajectories we use the idea of considering a map composed of disjoint geometries, named Places of Interest of the Application (PoIs). For example, for a tourist application domain PoIs can be hotels, restaurants and tourist attractions. When a moving object spends a sufficient amount of time within a PoI, the PoI is considered a stop of the trajectory and all the locations that belong to the PoI are replaced by a data object representing the stop (Spaccapietra, C. Parent, M. L. Damiani, J. A. Fernandes de Macedo, F. Porto, and C. Vangenot.) Thus, we rewrite each trajectory as a sequence of stops. As each stop is an object (it has a geometry and additional information), coincidence of stops can be detected and GSP can be applied. See http://piet.exp.dc.uba.ar/moving/

for details on stops and moves.

 

But following this approach, we were able to detect only similar trajectories if moving objects passed by the same PoIs. Instead, we want to detect similarities between sequences using semantic information of higher level. Imagine that tourist OID1 visits the following sequences of PoIs (Zoo, Restaurant1, Coffee2, Hotel1) and tourist OID2 visits the following sequences of PoIs (Zoo, Restaurant2, Coffee2). Have both tourists followed the same trajectory? If we consider PoIs as atomic items, we will infer that both trajectories are different. But if Restaurant1 and Restaurant2 have semantics associated with them, for example they are characterized by a "typeOfFood" and both of them were "French", we would infer that both tourists have followed the same trajectory, i.e. they begin in a Zoo, then go to a "French" restaurant and finish in Coffee2.

 

Using semantics we can infer useful patterns that cannot be detected otherwise. We are introducing the novel idea of detecting semantic similarities in mining process that also include traditional approaches.

The RE-SPaM Approach

We propose an algorithm which uses semantic information during the mining process. Moreover, we designed a query language, based on regular expressions, to let users to express which kind of patterns they are interested in finding, which differs from Spirit proposal.

 

We consider all items composed by attributes, which provide  semantic information about the item. Thus, items are decomposable in the following way: they are organized in categories such that  each category is associated to a set of attributes. Intuitively, this allows to talk about schema and instances of categories. In our example, we identify four categories, with different granularities: hotels, restaurants, the Eiffel Tower and the zoo, each one with different attributes. The following Table shows these categories and their schema.

Category Schema
hotels [ID, categoryName, geom, star, parking, pets]
restaurants [ID, categoryName, geom, typeOfFood, price, parking, speciality]
Eiffel Tower [ID, categoryName, geom]
zoos [ID, categoryName, geom, price]

 

The following Table shows Category Instances (i.e., specific hotels or restaurants).

Category Instance
Hotels (5 occurrences)
(ID, 'H1') (categoryName, 'Hotel') (geom, 'pol1') (star, '3') (parking, 'false') (pets, 'false')
(ID, 'H2') (categoryName, 'Hotel') (geom, 'pol2') (star, '5') (parking, 'false') (pets, 'true')
(ID, 'H3') (categoryName, 'Hotel') (geom, 'pol8') (star, '5') (parking, 'true') (pets, 'true')
(ID, 'H4') (categoryName, 'Hotel') (geom, 'pol9') (star, '3') (parking, 'false') (pets, 'false')
(ID, 'H5') (categoryName, 'Hotel') (geom, 'pol10') (star, '1') (parking, 'false') (pets, 'true')
Restaurants (10 occurrences)
(ID, 'R1') (categoryName, 'Restaurant') (geom, 'pol3') (typeOfFood, 'French') (price, 'cheap') (parking, 'false') (speciality, 'fondeau')
(ID, 'R2') (categoryName, 'Restaurant') (geom, 'pol4') (typeOfFood, 'French') (price, 'cheap') (parking, 'true') (speciality, 'fondeau')
(ID, 'R3') (categoryName, 'Restaurant') (geom, 'pol5') (typeOfFood, 'Italian') (price, 'cheap') (parking, 'false') (speciality, 'pasta')
(ID, 'R4') (categoryName, 'Restaurant') (geom, 'pol11') (typeOfFood, 'Italian') (price, 'cheap') (parking, 'false') (speciality, 'pasta')
(ID, 'R5') (categoryName, 'Restaurant') (geom, 'pol12') (typeOfFood, 'Italian') (price, 'cheap') (parking, 'false') (speciality, 'pasta')
(ID, 'R6') (categoryName, 'Restaurant') (geom, 'pol13') (typeOfFood, 'Japenesse') (price, 'expensive') (parking, 'true') (speciality, 'sushi')
(ID, 'R7') (categoryName, 'Restaurant') (geom, 'pol14') (typeOfFood, 'Japenesse') (price, 'expensive') (parking, 'true') (speciality, 'rice')
(ID, 'R8') (categoryName, 'Restaurant') (geom, 'pol15') (typeOfFood, 'Chinesse') (price, 'cheap') (parking, 'false') (speciality, 'rice')
(ID, 'R9') (categoryName, 'Restaurant') (geom, 'pol16') (typeOfFood, 'Italian') (price, 'cheap') (parking, 'true') (speciality, 'risotto')
(ID, 'R10') (categoryName, 'Restaurant') (geom, 'pol17') (typeOfFood, 'French') (price, 'expensive') (parking, 'true') (speciality, 'fish')
Eiffel Tower (1 occurrence) (ID, 'E') (categoryName, 'EiffelTower') (geom, 'pol6')
Zoos (1 occurrence) (ID, 'Z') (categoryName, 'Zoo') (geom, 'pol7') (price, 'cheap')

 

Each item in the database of sequences is decomposed by attributes: those ones of its category instance (non-temporal attributes) and time interval attributes (which represent the instant of time when the item occurred, namely ts_date, ts_time, tf_date and tf_time are temporal attributes corresponding to the beginning and ending of the time interval of the occurrence).

 

Our regular expression language is  built on constraints, expressed as conditions over attributes (temporal and non-temporal ones). Constraints are built using the following grammar:

R1 CONSTRAINT := [ CONDITION ]
R2
CONDITION := lambda
CONDITION := EQ
CONDITION := EQ ^ CONDITION
R3
EQ := attribute= 'constant'
EQ := attribute= @vble
EQ := functionName(attribute, ...) = 'constant'
EQ := functionName(attribute, ...) = @vle

 

So, conditions can use constants (literals enclosed by simple quotes), attributes (temporal or non-temporal ones), functions over attributes, and variables (begin with the @ symbol). The incorporation of variables can be used either for metadata validation or for matching. Variables can be compared against any attributes (temporal or non-temporal ones) and functions over attributes.

 

Implementation Details

The table of sequences, ToI, stores occurrences of items. As non-temporal attributes of items belong to its category instance, this table can be normalized. For example, an instance of ToI could be:

OID ts_date ts_time tf_date tf_time ID
OID1 10/10/2007 10:08:00 10/10/2007 12:22:05 R1
OID1 10/10/2007 23:08:00 11/10/2007 01:22:05 C1
... ... ... ... ... ...

 

where the first tuple shows that the moving object identified by OID1, visited the PoI identified by R1 the date 10/10/2007 10:08:00 and left it at 10/10/2007 12:22:05. Notice that ToI is the only place where temporal attributes are stored, but non-temporal attributes are inferred via category instances. In this case, we can conclude that 'OID1' during the interval of time [10/10/2007 10:08:00, 10/10/2007 12:22:05] visited a 'Restaurant' (categoryName) identified by 'R1' (ID) characterized also by having 'pol3' shape (geom), 'French' type of food, 'cheap' price, non parking and 'Fondeau' speciality.

 

Then, we can involve all these attributes in queries.

RE-SPaM by Example

Here we provide some examples.

 

 

Q1) Find trajectories about tourists who visit a cheap place, then passed through zoo or Eiffel Tower, and finished in some cheap place.

 

[price='cheap'].([ID='E'] | [ID='Z']).[price='cheap']

 

 

Q2) Find trajectories about tourists who visit two places with the same price. In between the zoo or Eiffel Tower must be visited. Notice we use the same variable @x for expressing coincidence of prices. In our running example, both of them could be expensive or both of them could be cheap. These values do not must be expressed a priori.

 

[price=@x].([ID='Z'] | [ID='E']).[price=@x]

 

 

 

Q3) Find trajectories about tourists who visit a cheap place during the morning of some date, and then go to a place that serves French food during the morning of the same date. Notice that the coincidence of ts_date is expressed with the same variable, and instead of specifying the exact instant of time of both occurrences we prefer to apply a rollup function.

 

[rollup(ts_time, 'range', 'time dimension')='Morning' ^ price='cheap' ^ ts_date=@z].[rollup(ts_time, 'range', 'time dimension')='Morning' ^ typeOfFood='French' ^ ts_date=@z]

 

Analyzing Results

Using our database of sequences, if we run Q2 and ask for 0.4 of minimum support, we will obtain the following sequences: [[r2, z, r1], [r10, z, r10], [r1, z, r2], [r2, z, r2], [r2, z, r3], [r2, z, r4], [r2, z, z], [r1, e, r1], [r2, e, r1], [r9, e, r1], [z, e, r1], [r1, e, r2], [r2, e, r2], [r2, e, r9], [r1, e, z], [z, e, z]].

Notice that all sequences has Z or E in the middle, but the extremes are PoIs with some prices. For example, R2 and R9 are different places, but both of them has cheap prices. Moreover, R2 and Z are so different places (the former is a restaurant, the latter a zoo) but they have same prices. Using variables over attributes gives to users a very powerful query language for express constraints, and by pruning not useful candidate sequences during the mining process help to solve complex queries.

 

 

Do you want to see a demo?

Follow this link

 

Do you need more information about this project?

This project was partially fund from the Agencia Nacional de Promocion Cientifica y Tecnologica, Argentina. PICT 2004 project # 11-21350.

Responsible: Dr. Alejandro Vaisman.

For additional information send an email to avaisman@dc.uba.ar