Jose Picado

jpicado [AT] gmail [DOT] com

LinkedIn     GitHub     Email me     Resume

Bio. I am a Ph.D. candidate at Oregon State University, where I do research on data management and machine learning. I have worked on software development and data science projects. See my portfolio and publications below.

My research has mainly focused on relational machine learning — machine learning algorithms that learn over structured data, such as relational databases. I study the impact that data heterogeneities have on the learning algorithms and how to overcome the associated challenges. I am advised by Arash Termehchy.

During my Ph.D., I did three internships, one at Microsoft (2017) and two at Intel (2015 and 2014). I was also a teaching assistant for several courses: Introduction to Databases, Database Management Systems, Introduction to Artificial Intelligence, Machine Learning and Data Mining, Data Structures, and Web Development.

Before starting my Ph.D., I lived in North Carolina for two years, where I got my Master's degree in Computer Science from Wake Forest University, advised by Sriraam Natarajan. My research was on statistical relational learning, specifically with applications to information extraction.

I'm originally from Costa Rica, where I studied Computer Science at the Costa Rica Institute of Technology and then worked as a Software Engineer at Avantica Technologies.

I enjoy tennis, football (the kind you play with your feet), running, music, and traveling.

NOTE. I am looking for a data scientist/research scientist/software engineer full-time position.

Portfolio

Predicting the lifespan of cloud databases

Public cloud database providers observe a wide range of usage patterns and behaviors while operating their services. Service providers try to understand and characterize these behaviors in order to improve the quality of their service, provide new features for customers, and increase the efficiency of their operations. This project aimed to determine how long public cloud databases survive before being dropped. It involved conducting a large-scale survivability study of Microsoft Azure SQL Database, developing a machine learning classifier that labels cloud databases as short-lived or long-lived, and identifying factors that help predict the lifespan of cloud databases.

Paper: SIGMOD 2018 (slides).
Skills: Python, pandas, scikit-learn, lifelines, matplotlib, SQL, Microsoft Cosmos (Microsoft's internal Big Data system) and Scope (Cosmos's query language).
Project type: Internship project (Microsoft).

Machine learning over structured data

Most machine learning algorithms assume that data can be stored in matrix form: rows represent observations and columns represent features. However, real-world data is rarely in this form. Instead, application domains usually contain information about different types of entities. One way of storing information about different types of entities is through relational databases. Given a relational database and training examples for a new concept, relational machine learning algorithms learn a definition of the concept in terms of existing relations in the database. We developed a relational learning system called Castor. We used Castor in different domains, such as learning a model to predict whether a chemical compound has anti-HIV activity, learning a model to predict whether a movie will be high-grossing, and reverse-engineering SQL queries from training examples.

Papers: SIGMOD 2017 (slides), VLDBJ 2018, VLDB 2016.
Code: GitHub.
Skills: Java, Python, pandas, SQL, VoltDB.
Project type: Research project.

Machine learning over heterogeneous databases

The information in a domain is usually spread across several databases, which often represent the same entities under different names. Therefore, learning over multiple databases may result in inaccurate definitions. Currently, users have to spend a great deal of time and effort to resolve these heterogeneities and create a unified and clean database instance to be used for learning. We developed CastorX, an extension of Castor, that learns directly over heterogeneous databases without any pre-processing step. Through a set of declarative constraints, the user specifies the attributes across relations whose values may refer to the same real-world entity. CastorX leverages these constraints to learn accurate definitions over the heterogeneous data.

Paper: VLDB 2018.
Skills: Java, Python, pandas, SQL, VoltDB.
Project type: Research project.

Extracting adverse drug events from text

We developed a novel approach for extracting adverse drug events from text. Given a drug-effect pair, our method searches publicly available medical literature to find documents related to the drug-effect pair. It then converts these documents to standard natural language processing (NLP) features. These features are then used in a probabilistic classifier based on Markov logic networks to determine whether the drug-effect pair is indeed an adverse drug event.

Paper: KAIS 2016.
Skills: Java, Stanford CoreNLP.
Project type: Research project.

Learning to label Stack Overflow questions

We used a deep learning approach to predict the tags of Stack Overflow questions, given their title and content. In particular, we used word vectors to represent each word and, given the sequence of word vectors corresponding to a question, we used a Long Short-Term Memory (LSTM) network to predict the tags of the question.

Report: PDF (slides).
Skills: Python, Keras, scikit-learn, deep learning.
Project type: Class project; Kaggle competition.

Using Bayesian networks to estimate rainfall distribution given polarimetric radar data

Measuring the amount of rainfall on a specific field is an important problem in agriculture. We explored the use of Bayesian networks to estimate the distribution of rainfall given measurements from polarimetric radars. We employed two different approaches for constructing the structure of the Bayesian network. First, we manually designed the structure based on domain knowledge. Second, we applied a structure learning algorithm to learn the structure automatically from data. Results showed that the Bayesian network with the structure learned from data performed best.

Report: PDF (slides).
Skills: Weka.
Project type: Class project; Kaggle competition.

Publications

  • Logical Scalability and Efficiency of Relational Learning Algorithms     [PDF]
    Jose Picado, Arash Termehchy, Alan Fern, Parisa Ataei
    The VLDB Journal (VLDBJ), 2018
  • Learning Efficiently Over Heterogeneous Databases     [PDF]
    Jose Picado, Arash Termehchy, Sudhanshu Pathak
    Proceedings of the VLDB Endowment (PVLDB), 2018
  • Survivability of Cloud Databases - Factors and Prediction     [PDF]
    Jose Picado, Willis Lang, Edward C. Thayer
    Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2018
  • Learning Efficiently Over Heterogeneous Databases: Sampling and Constraints to the Rescue    [PDF]
    Jose Picado, Arash Termehchy, Sudhanshu Pathak
    Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning (DEEM), 2018
  • AutoMode: Relational Learning With Less Black Magic    [PDF]
    Jose Picado, Sudhanshu Pathak, Arash Termehchy, Alan Fern
    Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2018
  • Schema Independent Relational Learning    [PDF]    [Technical Report]
    Jose Picado, Arash Termehchy, Alan Fern, Parisa Ataei
    Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2017
  • Schema Independent and Scalable Relational Learning By Castor    [PDF]
    Jose Picado, Parisa Ataei, Arash Termehchy, Alan Fern
    Proceedings of the VLDB Endowment (PVLDB), 2016
  • Markov Logic Networks for Adverse Drug Event Extraction from Text    [PDF]
    Sriraam Natarajan, Vishal Bangera, Tushar Khot, Jose Picado, Anurag Wazalwar, Vitor Santos Costa, David Page, Michael Caldwell
    Knowledge and Information Systems Journal (KAIS), 2016
  • Representation Independent Analytics Over Structured Data    [Technical Report]
    Jose Picado, Yodsawalai Chodpathumwan, Arash Termehchy, Alan Fern, Yizhou Sun
    Technical Report, 2014
  • Effectively Creating Weakly Labeled Training Examples Via Approximate Domain Knowledge    [PDF]
    Sriraam Natarajan, Jose Picado, Tushar Khot, Kristian Kersting, Christopher Re, Jude Shavlik
    International Conference on Inductive Logic Programming (ILP), 2014
  • Efficient Information Extraction Using Statistical Relational Learning    [Thesis]
    Jose Picado
    Master's Thesis, 2013
  • Using Commonsense Knowledge to Automatically Create (Noisy) Training Examples from Text    [PDF]
    Sriraam Natarajan, Jose Picado, Tushar Khot, Kristian Kersting, Christopher Re, Jude Shavlik
    International Workshop on Statistical Relational AI (StarAI), 2013

Education

Oregon State University

Ph.D. in Computer Science
Machine Learning and Databases
Advisor: Arash Termehchy

Wake Forest University

M.Sc. in Computer Science
Machine Learning
Advisor: Sriraam Natarajan
Thesis

Costa Rica Institute of Technology

B.S. in Computer Science

Updated April 2019.