Research A – Autumn Semester 2013/14
Recent news
- Please note updated information on progress meetings and the next lecture!
- The proposal presentations on Wednesday, October 2 will take place in HG00.108 (morning session) and in ELAAN9 01.10 (afternoon session).
- Please note updated information on the first progress meetings!
- Proposal presentations will take place on Wednesday, October 2 in two timeslots: 10:00–12:30 and 13:30–16:30. We will schedule a break in the second timeslot from 14:45 to 15:00.
- Make sure to submit slides for your 3-minute presentation by Tuesday, September 24 at 20:00 (CET, Amsterdam time)!
- The first lecture will be on September 4, 2013 from 8:30 until 10:30 in room GN 6.
Basic information
- Lecturers: Lejla Batina <lejla@cs.ru.nl> and Peter Schwabe <peter@cryptojedi.org>
- Credit: 6 EC
- Studiegids information: http://www.studiegids.science.ru.nl/2013/science/prospectus/computing_science_master/courses/course/30869/
Deadlines and Meetings
Deadlines
All deadlines are at 15:00 CET (Amsterdam time).
- Choosing a topic (send e-mail with topic, supervisor, and team): September 17, 2013
- Submission of research proposal: September 27, 2013
- Submission of slides for the presentation of the research proposal: October 1, 2013
- Presentation of the research proposal: October 2, 2013
- Submission of the first version of the final paper: December 10, 2013
- Submission of slides for the final presentation of the research project: December 17, 2013
- Final presentation of the research project: December 18, 2013
- Submission of the final version of the paper: January 17, 2014
Lectures
- September 4, 8:30
- September 11, 8:30
- November 13, 8:30
Progress Meetings
- January 8. Meetings will take 20 minutes per group.
- November 27; two timeslots: 10:30–12:30, and 13:40–15:00. Meetings will take 10 minutes per group.
- November 13; two timeslots: 11:00–12:30, and 13:30–15:10. Meetings will take 10 minutes per group.
- October 9; two timeslots: 11:00–12:30, and 13:30–15:10. Meetings will take 10 minutes per group.
- September 25; two timeslots: 11:00–12:30, and 13:30–15:10. Meetings will take 10 minutes per group.
Topics
Please also consider the topics proposed on http://www.ru.nl/is/ml/education/student_projects/. Topics marked with a red frame are already taken.
Title: | Fencing with security: using iBeacons for things other than advertising |
---|---|
Chosen by: | s4346998, s4354559 |
Supervisor: | Roland van Rijswijk-Deij <rijswijk@cs.ru.nl> |
Description: |
When they announced iOS 7, Apple rather quietly introduced an interesting new technology they call "iBeacon".
The idea behind this technology is to provide location information in areas where, e.g., GPS reception is problematic (read: indoors).
The technology relies on Bluetooth Low Energy devices called "beacons" that can be placed in an area
and are automatically detected and read by the CoreLocation service on an iOS device.
Use cases that pop up on the web show how this technology could be used for things like advertising,
or to provide rich content at sites like museums.
The goal for this research project is to investigate how this new technology can be leveraged in security applications.
In order to do this, students are encouraged to:
|
Literature: |
|
Type of work: |
40% researching the state of the art in iBeacon and Bluetooth technology, 30% experiments, 30% documenting results |
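As a starting point for the experiments, one common building block (not iBeacon-specific, and not part of the project description itself) is estimating the distance to a beacon from received signal strength using a log-distance path-loss model. The calibration constants below are illustrative assumptions, not Apple-specified values:

```python
def estimate_distance(rssi, tx_power=-59, n=2.0):
    """Estimate the distance (in metres) to a beacon from its RSSI.

    tx_power: calibrated RSSI at 1 m (an assumed typical value here);
    n: path-loss exponent (2.0 for free space, larger indoors).
    """
    return 10 ** ((tx_power - rssi) / (10 * n))

# At the calibrated 1 m power the estimate is exactly 1 m:
assert estimate_distance(-59) == 1.0
# 20 dB weaker corresponds to 10 m under free-space assumptions:
assert abs(estimate_distance(-79) - 10.0) < 1e-9
```

In practice RSSI is noisy, so real geofencing applications smooth readings over time before acting on them.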
Title: | "Balance" in Real-Time Strategy (RTS) games: a causal perspective. |
---|---|
Chosen by: | s4024443, s4068459 |
Supervisor: | Tom Claassen <tomc@cs.ru.nl> |
Description: |
In games like Starcraft II and League of Legends players battle online
against each other as one of multiple factions/heroes. One of the most
hotly debated topics is that of balance: did a player win because of
skill or because his race/hero is too strong ("overpowered")? Perhaps
surprisingly, this type of question belongs to the field of causal
discovery. Despite the fact that nearly every player on the ladder has an outspoken opinion on the subject, 'imbalance' is surprisingly hard to define and to establish objectively from (even) a huge amount of game results, not least because the respective match-making systems always try to find opponents such that both teams have a roughly equal chance of winning. In this project we look at how a large-scale ("Big Data") Bayesian ranking system works and how player skill can be modeled. We try to find an objective measure for balance, and when/how it can be inferred from matches. We implement and test our methods against results from (simulated) ladder seasons. |
Literature: |
|
Type of work: | Depends on focus, but will always include a fair amount of theory, experiments, and programming. |
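As a much simpler stand-in for the large-scale Bayesian ranking systems the topic refers to, a minimal Elo update illustrates how skill modeling and match-making interact with measuring balance. The constants are the standard Elo defaults, not values taken from any particular game:

```python
def expected_score(r_a, r_b):
    # Probability that player A beats player B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, a_won, k=32.0):
    # Move both ratings towards the observed outcome.
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b - k * (s_a - e_a)

# Equally rated players have a 50% expected win chance. Match-making
# pairs such players, so any systematic faction advantage should show
# up as a deviation from 50% in aggregate results at equal rating.
assert expected_score(1500, 1500) == 0.5
```

The project's harder question, whether such a deviation reflects faction imbalance rather than, say, rating-model misfit, is exactly where the causal-discovery perspective comes in.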
Title: | Predictive modeling of airfare data. |
---|---|
Supervisor: | Syed Saiden Abbas <saiden.abbas@cs.ru.nl> |
Description: | The price of airline tickets fluctuates over time, and the change can range from a few dollars to a few hundred dollars. The fluctuations are due to a number of factors, e.g., the number of seats sold, the time left until departure, the timing of the flight, etc. Data mining and machine learning techniques can be used to model this fluctuating behavior of ticket prices based on historic price data. The purpose of this project is to develop a predictive tool in the R programming language that consumers can use in decision making when purchasing a ticket. The tool would predict whether to buy a ticket now or to wait for some days. |
Literature: |
|
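The project calls for a tool in R; purely to illustrate the idea (and in Python, like the other sketches on this page), here is a toy buy-or-wait rule based on a least-squares trend of recent prices. A real model would use the richer features mentioned above, such as seats sold and days to departure:

```python
def buy_or_wait(prices):
    """Toy decision rule: fit a least-squares trend line to the
    observed price history and advise 'wait' if prices are falling,
    'buy' otherwise."""
    n = len(prices)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(prices) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, prices))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var  # positive slope: prices trending up
    return "wait" if slope < 0 else "buy"

assert buy_or_wait([320, 310, 305, 290]) == "wait"   # falling trend
assert buy_or_wait([250, 260, 270, 295]) == "buy"    # rising trend
```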
Title: | Locales vs. encapsulates: A comparison of Isabelle/HOL and ACL2 |
---|---|
Chosen by: | s4070968, s4156889 |
Supervisor: | Julien Schmaltz <Julien.Schmaltz@ou.nl> |
Description: | Isabelle/HOL and ACL2 are two popular theorem proving systems. They are both used for the verification of software and hardware designs, as well as for formalizations of mathematics. These two systems provide a way to define "constrained functions", that is, functions which are defined by their properties rather than by an explicit definition. Isabelle/HOL uses "locales" and ACL2 uses "encapsulates". The question you will answer is: how do these two systems compare? What are their common aspects? What are their differences? In which cases is one more appropriate than the other? |
Literature: | TBD |
Type of work: | In our research about formalizing communication networks and separation kernels, we use both ACL2 and Isabelle/HOL. Your task will be, using examples, to illustrate the differences and commonalities in the support provided for abstract functions. You will start by formalizing some parts of our formal theory of communication networks in Isabelle/HOL using locales. You will then compare with our own formalisation done in ACL2 using encapsulates. |
Title: | Data Mining in Formal Mathematics. |
---|---|
Chosen by: | s4290755, s0814938 |
Supervisor: | Josef Urban <josef.urban@gmail.com> and Herman Geuvers <herman@cs.ru.nl> |
Description: |
Some large mathematical theories are today fully understandable to
computers. An example is the formal proof of Kepler Conjecture
(Flyspeck project) by Thomas Hales and his team, written in the HOL
Light proof assistant. Flyspeck consists of about 20 thousand named
theorems and 100 million unnamed lemmas (atomic proof steps). Just
writing out all the lemmas takes about 100GB. The lemmas form an
inference graph, which is directed and acyclic. Some of the lemmas are
used very often, and some of the lemmas are proved over and over. It
would be good if we could detect the most useful and most interesting
lemmas automatically. This would help the mathematicians who develop
Flyspeck in organizing such large libraries of computer
mathematics. And a better-structured library would be also more useful
when proving new theorems fully automatically over such library,
without human assistance. Such automated proving provides a rigorous
metric for evaluating the usefulness of the selected lemmas as a whole. There are several methods for "lemma-mining", summarized in the paper http://mws.cs.ru.nl/~urban/lm/lmhol.pdf. These methods are mostly still too slow to process the whole Flyspeck corpus in reasonable time. So far only PageRank has reasonable speed, but it needs to be combined with additional clustering ideas, as proposed for example in this paper: http://cseweb.ucsd.edu/~kspham/pub/sigir2008.pdf. Some of the more interesting methods have not been tried even on much smaller corpora which contain only about one million lemmas. The goal of the project would be to develop good and fast lemma-selecting methods for big corpora such as Flyspeck. The three initial approaches that can be explored are (i) combining fast distance-based clustering with PageRank, (ii) smarter (possibly imperfect, but fast) update of the lemma-quality data structures after the selection of the best lemma, and (iii) applying the lemma-quality ideas implemented in the AGInTRater system (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.6840&rep=rep1&type=pdf) on smaller corpora consisting of 1-10M lemmas. The students are welcome to come up with their own research ideas and to evaluate them. |
Literature: | See links in the description text. |
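Since PageRank is currently the only method fast enough for the full corpus, here is a minimal power-iteration sketch over a toy inference graph. The graph maps each statement to the lemmas it references, so rank flows from theorems to the lemmas they depend on, and frequently-used lemmas accumulate a high score (the node names are made up):

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank. graph: {node: [nodes it
    references]}. Lemmas that many proofs depend on rank highly."""
    nodes = list(graph)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        new = {v: (1 - damping) / len(nodes) for v in nodes}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:  # dangling node: spread its rank uniformly
                for w in nodes:
                    new[w] += damping * rank[v] / len(nodes)
        rank = new
    return rank

# Both theorems cite the shared lemma, so it ranks highest:
r = pagerank({"thm1": ["lemma"], "thm2": ["lemma"], "lemma": []})
assert max(r, key=r.get) == "lemma"
```

On a 100-million-node graph the inner loop above is exactly what must be made fast (and combined with clustering), which is the point of the project.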
Title: | Trusted interfaces for secure devices. |
---|---|
Chosen by: | s4064453, s4058038 |
Supervisor: | Wouter Lueks <lueks@cs.ru.nl> and Roland van Rijswijk <rijswijk@cs.ru.nl> |
Description: |
Over the last decade there has been a trend to better protect online transactions of all types. To use your online bank account you typically use a secure token, your bank card, secure SMS services, or a combination thereof to authenticate yourself and any transactions you make. In these scenarios it is important to see precisely which transactions you authenticate. Preferably, it should be difficult for an attacker to change this information.
In the Digital Security group we have been working on the IRMA card.
This smart card implements attribute based credentials and allows you to reveal
certain information about yourself in a privacy-friendly way. The IRMA card, like any other smart card, doesn't have a trusted interface. This means that you have to trust the reader of your card not to read too much information. We think this can be improved by adding a trusted interface.
There are many different approaches to adding trusted interfaces, ranging from hardware only to software only. Banks typically use dedicated hardware that only works with their cards, but more general dedicated hardware interfaces also exist. Some mobile platforms are equipped with a trusted execution environment. This environment protects the software running the interface from interference and unauthorized access [1]. Finally, pure software solutions are possible as well. In this project we would like you to do two things:
|
Literature: |
|
Title: | Privacy-friendly solutions for data aggregation and filtering in SmartGrids. |
---|---|
Supervisor: | Bárbara Vieira<b.vieira@cs.ru.nl> |
Description: |
C-DAX is an ongoing European Project ( http://cdax.eu)
which proposes a cyber-secure data and
control cloud for future power distribution networks based on an information-centric networking
(ICN) solution as an overlay of IP. The C-DAX solution will provide a distributed data-cloud
tailored to the specific needs of SmartGrids. In particular, it is intended to efficiently support
the massive integration of renewables and to be able to cope with a heterogeneous set of co-existing
SmartGrid applications, running on devices and communicating over networks with widely varying
capabilities when it comes to communication and computation speeds. Precursors to the C-DAX
solution are overlay networking solutions developed at Bell-Labs.
In a nutshell, the C-DAX platform consists of two major components: the C-DAX middleware
that provides publisher-subscriber interfaces to clients hosting SmartGrid applications and the
C-DAX cloud which consists of logically interconnected C-DAX nodes which are responsible for the
resolution and delivery of messages exchanged between publishers and subscribers in a resilient,
self-configurable, and scalable manner. The main idea of C-DAX is that, instead of applying host-centric and point-to-point communication, it supports group communication that is data-centric (i.e., its concepts are developed around the data being communicated) and topic-based (as the routing of data is based on topic identifiers). C-DAX addresses several use cases, and in some of them each node (in the C-DAX cloud) can be abstracted as a message broker that receives (encrypted) topic-data from the publishers (aggregator, DNO, prosumers, etc.) and forwards that (encrypted) data to the intended subscribers. Several tasks can be carried out at each node to enhance the information awareness in the platform, namely reducing the number of topics, unnecessary network traffic, etc. Ideally, primitives such as data aggregation and filtering should be part of the functionalities of the C-DAX cloud. However, especially in the retail market, customers' privacy is an issue; therefore all the topic-data has to be encrypted (the C-DAX cloud just treats it as a black box; decryption and subsequent re-encryption are not possible). Performing computations on encrypted data has been a research challenge for several years. Solutions based on homomorphic encryption, searchable encryption, and functional encryption are proposed in the literature. The goal of this research project is to investigate existing solutions to perform computations (namely aggregation and filtering) on encrypted data and how they can be adapted to the specific context of C-DAX, where a massive generation of data is expected from different measuring devices. |
Literature: |
|
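One candidate technique in this space (alongside the homomorphic, searchable, and functional encryption mentioned above, and not part of the actual C-DAX design) is additive secret sharing, which lets intermediate nodes combine meter readings without seeing any individual reading. A toy sketch:

```python
import random

M = 2 ** 32  # arithmetic modulus; readings are assumed to be below M

def split_reading(value, n_shares):
    """Split a meter reading into n additive shares mod M. Any subset
    of fewer than n shares looks uniformly random, so no single broker
    node learns the reading."""
    shares = [random.randrange(M) for _ in range(n_shares - 1)]
    shares.append((value - sum(shares)) % M)
    return shares

def aggregate(per_meter_shares):
    """Each node sums the share it received from every meter; combining
    the partial sums reveals only the grand total."""
    partials = [sum(col) % M for col in zip(*per_meter_shares)]
    return sum(partials) % M

readings = [17, 42, 5]
shares = [split_reading(r, 3) for r in readings]
assert aggregate(shares) == sum(readings)  # total recovered, inputs hidden
```

Whether such a scheme fits C-DAX's topic-based routing and threat model is precisely the kind of question the project asks.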
Title: | Predicting the effects of gene knockouts from observational data. |
---|---|
Chosen by: | s4241746, s4244931 |
Supervisor: | Joris Mooij <j.mooij@cs.ru.nl> |
Description: | Genes play a vital role in living organisms. In this project, the student will analyze gene expression data measured in yeast cells. The task is to predict which genes will have a strong effect on which other genes in knockout experiments. This will be done with a combination of state-of-the-art causal discovery methods. This project fits well into the "Big Data" theme: even though the dataset itself is not very big (64 samples and about 5000 genes), the causal discovery methods need to iterate through (subsets of) pairs, triples and even higher tuples of genes. Doing this naïvely is not tractable, and smart optimization and/or parallelization tricks will be needed. If successful, this work may very well lead to a publication. |
Literature: |
|
Type of work: | The work to be performed is a combination of theory (30%), programming (40%) and experiments (30%). The student first familiarizes him-/herself with the current state-of-the-art causal discovery methods [1, 2] for this type of data, implements a suitable combination of these two methods in a suitable language, runs it on the biological data and analyzes the results. |
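A quick back-of-envelope calculation shows why naïve enumeration over tuples of roughly 5000 genes is intractable, and why the optimization and parallelization tricks matter:

```python
from math import comb

genes = 5000
pairs = comb(genes, 2)    # unordered gene pairs
triples = comb(genes, 3)  # unordered gene triples

print(pairs)    # 12497500 pairs already
print(triples)  # 20820835000 triples: naive enumeration is hopeless
```

Even at a million independence tests per second, the triples alone would take several hours per pass, before considering higher-order tuples.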
Title: | Can Google Trends predict outbreaks of influenza and trading behavior in financial markets (or something else)? |
---|---|
Chosen by: | s4241754, s4191552 |
Supervisor: | Tom Heskes <t.heskes@science.ru.nl> |
Description: | Google Trends reveals how often a particular search term is entered over time and across various regions of the world. In a seminal Nature paper [1], the authors showed that outbreaks of influenza were correlated to the number of Google searches for terms related to flu prevention and cure. Another, very recent paper claims that a trading strategy based on the search volume of "debt" would have increased the value of a financial portfolio by more than 300% [2]. Both studies indicate that such correlations between search volume and (social) behavior exist, but do this "after the fact" and hence may be susceptible to overfitting (if you search for correlations long enough, you will always find some that are "significant"). One research topic could be to check whether the suggested correlations have real predictive power, e.g., by checking whether they also apply to novel data collected after the authors published their study. Another challenge could be to come up with another cool example of the predictive power of Google Trends. |
Literature: |
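The proposed out-of-sample check can be sketched in a few lines: compute the correlation on the period in which the relationship was "discovered", then again on held-out data. A relationship that only shows up in-sample is a warning sign of overfitting. The weekly series below are made up for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical weekly data: search volume vs. reported flu cases.
search = [10, 12, 15, 30, 42, 40, 22, 14, 11, 9]
cases  = [ 5,  6,  9, 20, 35, 33, 15,  8,  6,  4]

in_sample  = pearson(search[:5], cases[:5])   # "discovery" period
out_sample = pearson(search[5:], cases[5:])   # held-out period
assert in_sample > 0.9 and out_sample > 0.9   # here it generalizes
```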
Title: | Blazingly fast Random Forests |
---|---|
Chosen by: | s4031253, s3030547 |
Supervisor: | Tom Heskes <t.heskes@science.ru.nl> |
Description: |
Random Forest (RF) is one of the most popular machine learning
algorithms these days [1,2]. RF is a so-called ensemble method which
aggregates the predictions of randomized decision trees to obtain a
predictive model which is both robust and accurate. RF has a number of
attractive properties, making it easy-to-use across many different
domains. This usability and adaptability has made RF the
benchmark-of-choice for Kaggle competitions, world-wide and also
locally, by students participating in the Machine Learning in Practice
course... There are various implementations of RF in Weka, R, and Python. The company wise.io now claims to have a version which is orders of magnitude faster than the existing ones, because of [a] "data set design and special implementation strategies that no one else had (or has yet) figured out" [3]. They appear to make quite some profit by selling the software to companies. Can you take up the challenge and figure out how to speed up RF? Since they make their software (but not code) available, you may also be able to reverse engineer some of their tricks. |
Literature: |
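This is emphatically not the fast implementation the topic asks about, but as a reminder of what any optimized version must compute, here is a minimal pure-Python illustration of the random forest recipe (bootstrap sampling, random feature subsets, majority vote) using decision stumps instead of full trees. All data and parameters are made up:

```python
import random

def train_stump(data, features):
    """Best threshold split over a feature subset.
    data: list of (x, y) pairs with labels y in {0, 1}."""
    best_err, best = len(data) + 1, None
    for f in features:
        for t in sorted({x[f] for x, _ in data}):
            for above in (False, True):
                err = sum(int(((x[f] > t) == above) != bool(y))
                          for x, y in data)
                if err < best_err:
                    best_err, best = err, (f, t, above)
    return best

def stump_predict(stump, x):
    f, t, above = stump
    return int((x[f] > t) == above)

def train_forest(data, n_trees=25, seed=1):
    """Bagging plus random feature subsets: the two sources of
    randomness that define a random forest."""
    rng = random.Random(seed)
    n, n_feat = len(data), len(data[0][0])
    forest = []
    for _ in range(n_trees):
        boot = [data[rng.randrange(n)] for _ in range(n)]       # bootstrap
        feats = rng.sample(range(n_feat), max(1, n_feat // 2))  # subset
        forest.append(train_stump(boot, feats))
    return forest

def forest_predict(forest, x):
    # Majority vote over all stumps.
    return int(sum(stump_predict(s, x) for s in forest) * 2 >= len(forest))

# Toy data: label is 1 when the (correlated) features are large.
data = [([i, 2 * i], int(i >= 5)) for i in range(10)]
forest = train_forest(data)
assert forest_predict(forest, [2, 4]) == 0
assert forest_predict(forest, [8, 16]) == 1
```

Speeding this up is largely about data layout: sorting feature columns once, working on contiguous arrays, and parallelizing over trees, which is presumably the territory of the wise.io claims.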
Title: | Side channel analysis in the frequency domain. |
---|---|
Chosen by: | s4335880, s3052869 |
Supervisor: | Baris Ege <B.Ege@cs.ru.nl> |
Description: | Side channel analysis exploits unintended information leakage of devices (timing, power consumption, etc.) and is applicable to almost all platforms used in daily life. Most side channel attack papers in the literature apply their analysis in the time domain; however, it has long been known that leakages are present in the frequency domain as well. There are quite a few papers available online exploring the topic, and it is a promising area of research attracting increasing attention from researchers all over the world. |
Literature: | TBD |
Type of work: | You are first expected to do a literature search on the subject. You should then implement and demonstrate the attack either in mathematical software (Sage, Matlab, and the like) or in a programming language of your choosing. |
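To make the time-domain versus frequency-domain point concrete, here is a naive discrete Fourier transform applied to a synthetic "trace" containing a periodic component. Real analyses would run an FFT over measured power traces; the sine wave below merely stands in for a periodic leakage that is easy to miss in the time domain:

```python
import cmath
import math

def dft_magnitudes(trace):
    """Naive DFT, O(n^2); real analyses use an FFT.
    Returns |X_k| for each frequency bin k."""
    n = len(trace)
    return [abs(sum(trace[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# Toy "power trace": a sine at 8 cycles per window.
n = 64
trace = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
mags = dft_magnitudes(trace)

# The spectrum peaks exactly at bin 8 (and its mirror image, bin 56):
assert max(range(1, n // 2), key=lambda k: mags[k]) == 8
```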
Title: | Power measurement acquisition from an FPGA board. |
---|---|
Supervisor: | Baris Ege <B.Ege@cs.ru.nl> |
Description: | Side channel analysis exploits unintended information leakage of devices (timing, power consumption, etc.) and is applicable to almost all platforms used in daily life. One way of doing research in side channel analysis is by using dedicated hardware designed for research. SASEBO is such a board, which employs two FPGAs: one for communicating with the host computer, and the other to run the design under evaluation. The evaluation board present in the lab (a SASEBO prototype) is well documented, but working with it may require knowledge of basic physics and FPGA design. |
Literature: | TBD |
Type of work: | This topic is suitable for a student who is looking for a more hands-on (practical) project. The aim of the project is to first establish communication between the two FPGAs present on the evaluation board. Then, the students are expected to collect power measurements from the FPGA while it runs AES. |
Title: | Tor vs. the NSA |
---|---|
Chosen by: | s4062833, s3052273 |
Supervisor: | Anna Krasnova <anna@mechanical-mind.org> |
Description: | With the recent events proving the fact of global surveillance performed by the NSA, the burning question is: how well can existing anonymization tools actually protect privacy against such a powerful attacker? This project considers one of the most established tools for providing communication anonymity – Tor. Tor appeared in 2004 and has since been heavily studied, attacked, and developed. Without initially being designed as an anti-censorship tool, Tor turned out to be often used for this purpose. This motivated many changes to Tor aimed at protecting users against powerful censors, like China. This experience of an arms race with censoring governments has shown that a powerful censor can go far beyond what is expected to be achievable. The purpose of this project is to analyse how well Tor can protect against a powerful attacker like the NSA, whose goal is to spy on individuals. |
Literature: |
|
Title: | Causal discovery for ADHD data set |
---|---|
Supervisor: | Elena Sokolova <e.sokolova@cs.ru.nl> |
Description: | Attention deficit-hyperactivity disorder (ADHD) is a frequent and highly heritable disorder affecting both children and adults. In order to better understand the correlation between personal characteristics (age, gender, IQ, etc.), brain functioning, and the disease, the global ADHD-200 competition was established. This competition collected a large data set describing 285 patients and 491 healthy controls. The results obtained in the competition with standard classification techniques have been described in the literature. We propose an alternative method for analyzing the data, involving causal modeling. The Bayesian Constraint-based Causal Discovery (BCCD) algorithm is able to find direct and indirect dependencies between variables, determine the strength of the dependencies, and provide a graphical visualization to interpret the results. The idea of the project is to understand which factors influence ADHD by learning a causal model of the data. |
Literature: |
|
Title: | Learning state machines from big data |
---|---|
Chosen by: | s4048911, s4053060 |
Supervisor: | Sicco Verwer <siccoverwer@gmail.com> |
Description: | A central challenge in detecting and analyzing malicious behavior on computer networks is how to make sense of the vast amounts of data generated in monitoring such systems. The LEMMA project will develop automated tools to tackle this challenge. In our research, we investigate methods for learning state machines from data streams, such as network traffic. In this project, you will help with our investigation by reviewing the literature on special data-structures for data stream mining, implementing a prototype stream learning algorithm for state machines, and reporting on experiments performed with this prototype. |
Literature: |
|
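As background for the implementation part: many state-machine learning algorithms start from a prefix tree built over the observed traces and then merge compatible states; the data-structure challenge in the streaming setting is maintaining such a structure over unbounded input. A minimal prefix-tree construction (the trace symbols below are made up):

```python
def build_prefix_tree(traces):
    """Build a prefix tree acceptor from a list of symbol sequences.
    States are numbered from 0 (the root); the result maps
    (state, symbol) -> next state."""
    transitions, next_state = {}, 1
    for trace in traces:
        state = 0
        for symbol in trace:
            if (state, symbol) not in transitions:
                transitions[(state, symbol)] = next_state
                next_state += 1
            state = transitions[(state, symbol)]
    return transitions

traces = [["SYN", "ACK", "DATA"], ["SYN", "ACK", "FIN"], ["SYN", "RST"]]
t = build_prefix_tree(traces)
# Shared prefixes share states: "SYN" and "SYN ACK" each appear once.
assert t[(0, "SYN")] == 1 and t[(1, "ACK")] == 2
assert len(t) == 5  # SYN, ACK, DATA, FIN, RST transitions
```

A state-merging pass over this tree would then collapse states with compatible futures into a compact automaton.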
Title: | Big data analysis of control models |
---|---|
Supervisor: | Sicco Verwer <siccoverwer@gmail.com> |
Description: | In the ITALIA project, we investigate and build tools for learning state machine descriptions for communication protocols of real-world software systems. In a recent study, we applied such tools to embedded control software in Oce industrial printers. These models are learned actively, by providing the software with inputs, reading the resulting outputs, and formulating the observed behavior into a hypothesis model. In this way, we obtained a large state machine description of a part of the communication to and from the controller in an Oce printer, which takes place on a central bus. We know this model is incomplete, but have been unable to prove its incompleteness using model-based testing. You will investigate the uses of Big Data analysis for proving this incompleteness. An industrial printer continuously logs enormous amounts of data, including all traffic taking place on the central communication bus. This data contains statistical information that cannot be obtained by actively providing inputs. For instance, it can indicate which states are used more frequently, and by focussing our active learning algorithms on these frequent parts of the model, we hope to find errors more quickly. From the previous study, there already is an implementation connecting the active learner to the control software, and a learned model. Large data logs will be provided by Oce. You will combine the two in order to find errors in the model and/or to direct the active learning software. |
Literature: |
|
Title: | More data and simple models. |
---|---|
Chosen by: | s3016439, s4386035 |
Supervisor: | Sicco Verwer <siccoverwer@gmail.com> |
Description: | Why does more data make simple models more powerful? Intuitively, more data should make larger models and complex algorithms increasingly powerful as these theoretically require more data to converge. However, an important paper reporting on experiments done by Google claims the opposite is true. Given the huge amount of data and processing power available at Google, it is very hard to repeat these experiments. We wonder whether the scientific literature contains any experiments that confirm or contradict this result. You will perform this study of the recent Big Data literature. |
Literature: | A google query on (for instance): big data simple models, gives many references. |
Title: | Sudoconnect |
---|---|
Chosen by: | s0721964, s4250559 |
Supervisor: | Hans Zantema <h.zantema@tue.nl> |
Description: | Recently, a new variant of the well-known Sudoku puzzle has been invented in which, apart from the usual conditions, it is also required that the solution has an underlying graph that is connected. Instances of this new kind of puzzle can be generated based on SAT solving. In the background, lots of questions arise, for instance about the possible structure of the underlying graph, not only for standard 9x9 sudokus, but also for variants and general Latin squares. Both logical reasoning and experimenting with implementations will help to get answers to these questions. |
Literature: | TBD |
Type of work: | Programming and experimenting with these programs will be a substantial part of the project. |
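The description leaves the exact definition of the underlying graph open, but whatever it turns out to be, the connectivity condition itself is a plain breadth-first search, which could for instance serve as a filter on candidate solutions produced by the SAT solver:

```python
from collections import deque

def is_connected(vertices, edges):
    """BFS connectivity check. vertices: iterable of nodes;
    edges: iterable of undirected (u, v) pairs."""
    vertices = list(vertices)
    if not vertices:
        return True
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, queue = {vertices[0]}, deque([vertices[0]])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(vertices)

assert is_connected([1, 2, 3], [(1, 2), (2, 3)])       # a path
assert not is_connected([1, 2, 3], [(1, 2)])           # 3 is isolated
```

Encoding connectivity directly into the SAT formula (rather than filtering afterwards) is one of the interesting design questions this project raises.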
Title: | Social login 4.0 & Using the privacy-friendly IRMA technology online with OpenID Connect |
---|---|
Supervisor: | Gergely Alpar <gergely@cs.ru.nl> and Roland van Rijswijk-Deij <rijswijk@cs.ru.nl> |
Description: | It is becoming more and more common to see web sites where you can log in using your social identity (e.g. your Facebook, Google or Twitter account). Most of these login scenarios are based on OAuth, OAuth2 and - in the near future - OpenID Connect. The problem with many of these logins is that relying parties (the site you log in to) often request a lot of personal data. From a privacy perspective that is undesirable. The IRMA project on the other hand is "privacy-by-design". We differentiate between identifying and non-identifying information about a user (attributes) and put the user at the centre of all interactions. No data is revealed without the user's consent, and the system is built to facilitate selective and minimal disclosure of personal information. The goal of this student assignment is to investigate how we can couple IRMA's privacy-friendly approach with OpenID Connect. Students are challenged to analyse how IRMA fits into the OpenID Connect architecture and to build a prototype that demonstrates the use of IRMA credentials in an OpenID Connect identity provider. |
Literature: | TBD |
Type of work: | Knowledge of OAuth2 and federated identity management helps, as well as good programming skills. We have OAuth2 software available in several programming languages (e.g. PHP and Java) that can be used as a starting point. |
Slides and Course Material
- Slides of the introductory lecture on September 4, 2013
- Slides of the lecture "Elements of a research project", September 11, 2013
- Slides of the lecture "How to write a paper", November 13, 2013