
Scalable Clustering using PatchWork

Clustering is a fundamental task in Knowledge Discovery and Data mining.
It aims to discover the unknown nature of data by grouping together data
objects that are more similar.

While hundreds of clustering algorithms have been proposed, many are
complex and do not scale well as more data become available, making them
inadequate for analyzing very large datasets. In addition, many clustering
algorithms are sequential, and thus inherently difficult to parallelize.

We propose PatchWork, a novel clustering algorithm that addresses these
issues. PatchWork is a distributed density-based clustering algorithm with
linear computational complexity and linear horizontal scalability. It
presents several characteristics desirable in knowledge discovery: in
particular, it does not require the number of clusters to be specified
a priori, and it offers natural protection against outliers and noise. In
addition, PatchWork makes it possible to discover spatially large
clusters rather than dense clusters only.

PatchWork relies on the map/reduce paradigm to parallelize computations
and was implemented using Apache Spark, the distributed computation
framework. As a result, PatchWork can cluster a billion points in only a
few minutes, a 1000x improvement over a Spark implementation of the
popular DBSCAN and a 40x improvement over the distributed implementation
of k-means in Spark MLlib.
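As a rough illustration of the grid-based, map/reduce-style density counting that the abstract describes, the sketch below maps points to grid cells, counts points per cell, and merges adjacent dense cells into clusters. This is a simplified single-machine sketch, not the authors' implementation; the cell size and density threshold are hypothetical parameters.

```python
from collections import Counter

def cluster_cells(points, cell_size=1.0, min_density=2):
    """Grid-based density clustering sketch: assign each point to a
    grid cell (map), count points per cell (reduce), keep cells that
    meet the density threshold, and merge adjacent dense cells."""
    # Map/reduce step: cell id per point, then a count per cell.
    counts = Counter((int(x // cell_size), int(y // cell_size))
                     for x, y in points)
    dense = {c for c, n in counts.items() if n >= min_density}

    # Merge adjacent dense cells into clusters via flood fill.
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        stack, cluster = [cell], set()
        while stack:
            cx, cy = stack.pop()
            if (cx, cy) in seen or (cx, cy) not in dense:
                continue
            seen.add((cx, cy))
            cluster.add((cx, cy))
            stack.extend((cx + dx, cy + dy)
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1))
        clusters.append(cluster)
    return clusters

pts = [(0.1, 0.2), (0.3, 0.4), (1.2, 0.5), (5.0, 5.0)]
print(cluster_cells(pts))  # → [{(0, 0)}]
```

With these illustrative parameters, the two nearby points form one dense cell, while the isolated points never reach the density threshold and are discarded, showing how a density threshold provides the protection against outliers and noise mentioned above.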


Bio:
Thomas joined the CRIM Development and Internet Technologies team in
2014 as a data science researcher working on data integration, data
mining, knowledge representation, and predictive analytics for
spatio-temporal big data.

He earned his Ph.D. in Computer Science from the University of Nebraska,
USA (2009) at the age of 24. He obtained a French engineering degree and
an M.Sc. in computer science and engineering from the French Grande École
ENSICAEN with distinctions. Thomas was a post-doctoral fellow with the
Centre for Structural and Functional Genomics in Montreal from 2009 to
2013, where he addressed some of the challenges of omics big data to
optimize the production of cellulosic biofuels. In 2013-2014, Thomas
improved the social search engine of Wajam, where he built a semantic
knowledge graph by processing and analyzing unstructured and structured
big data, and developed statistical models to help interpret users'
intent in real time. Thomas has published and presented his work in
several well-established peer-reviewed journals and international
conferences, and his work has been recognized with many awards and prizes.

His areas of interest include the mining, analysis, and visualization of
big data and artificial intelligence, with applications to the IoT,
smart cities, and personalized medicine.
 

All are welcome!

 

Date

Thursday February 11, 2016
From 11:30 to 12:20

Contact

1-514-340-4711 #4233

Place

Polytechnique Montréal - Pavillon principal
2500, chemin de Polytechnique
Montréal
QC
Canada
H3T 1J4
L-4812
