Bio. I am a Ph.D. candidate at Oregon State University, where I do research on data management and machine learning. I have worked on software development and data science projects. See my portfolio and publications below.
My research has mainly focused on relational machine learning — machine learning algorithms that learn over structured data, such as relational databases. I study the impact that data heterogeneities have on the learning algorithms and how to overcome the associated challenges. I am advised by Arash Termehchy.
During my Ph.D., I did three internships, one at Microsoft (2017) and two at Intel (2015 and 2014). I was also a teaching assistant for several courses: Introduction to Databases, Database Management Systems, Introduction to Artificial Intelligence, Machine Learning and Data Mining, Data Structures, and Web Development.
Before starting my Ph.D., I lived in North Carolina for two years, where I got my Masters degree in Computer Science from Wake Forest University, advised by Sriraam Natarajan. My research was on statistical relational learning, specifically with applications to information extraction.
I'm originally from Costa Rica, where I studied Computer Science at the Costa Rica Institute of Technology and then worked as a Software Engineer at Avantica Technologies.
I enjoy playing tennis, football (the kind you play with your feet), running, music, and traveling.
NOTE. I am looking for a data scientist/research scientist/software engineer full-time position.
Public cloud database providers observe all sorts of different usage patterns and behaviors while operating their services. Service providers try to understand and characterize these behaviors in order to improve the quality of their service, provide new features for customers, and/or increase the efficiency of the operations. This project aimed at determining how long public cloud databases survive before being dropped. This project involved doing a large-scale survivability study of the Microsoft Azure SQL Databases, developing a machine learning classifier that classified cloud databases into short-lived and long-lived, and identifying some factors that can help predict the lifespan of cloud databases.Paper: SIGMOD 2018 (slides).
Most machine learning algorithms assume that data can be stored in matrix form: rows represent observations and columns represent features. However, real-world data is rarely in this form. Instead, application domains usually contain information about different types of entities. One way of storing information about different types of entities is through relational databases. Given a relational database and training examples for a new concept, relational machine learning algorithms learn a definition of the concept in terms of existing relations in the database. We developed a relational learning system called Castor. We used Castor in different domains, such as learning a model to predict whether a chemical compound has anti-HIV activity, learning a model to predict whether a movie will be high grossing, and reverse-engineering SQL queries from training examples.Papers: SIGMOD 2017 (slides), VLDBJ 2018, VLDB 2016.
The information in a domain is usually spread across several databases, which often represent the same entities under different names. Therefore, learning over multiple databases may result in inaccurate definitions. Currently, users have to spend a great deal of time and effort to resolve these heterogeneities and create a unified and clean database instance to be used for learning. We developed CastorX, an extension of Castor, that learns directly over heterogeneous databases without any pre-processing step. The user specifies the attributes across relations that contain values that may refer to the same real-world entity through a set of declarative constraints. CastorX leverages these dependencies to learn accurate definitions over the heterogeneous data.Papers: VLDB 2018.
We developed a novel approach for extracting adverse drug events from text. Given a drug-effect pair, our method searches publicly available medical literature to find documents related to the drug-effect pair. It then converts these documents to standard natural language processing (NLP) features. These features are then used in a probabilistic classifier based on Markov logic networks to determine whether the drug-effect pair is indeed an adverse drug event.Paper: KAIS 2016.
We used a deep learning approach to predict the tags of Stack Overflow questions, given their title and content. In particular, we used word vectors to represent each word and, given the sequence of word vectors corresponding to a question, we used a Long Short-Term Memory (LSTM) network to predict the tags of the question.Report: PDF (slides).
Measuring the amount of rainfall on a specific field is an important issue in agriculture. We explored the use of Bayesian networks to estimate the distribution of rainfall given measurements from Polarimetric radars. We employed two different approaches for constructing the structure of the Bayesian network. First, we manually designed the structure based on domain knowledge. Second, we applied a structure learning algorithm to learn the structure automatically from data. Results showed that the Bayes network with the structure learned from data performed best.Report: PDF (slides).