A Visual Question Answering (VQA) system combines techniques from several fields, including Deep Learning, Natural Language Processing, and Knowledge Representation. As VQA is adopted by more public platforms and devices, it is likely to change the way we find and interact with data. In this project, we implemented a Hierarchical Co-Attention model that attends to both the image and the question in order to reason about them jointly. The method encodes the question hierarchically at three levels: the word level, the phrase level, and the question level. The parallel co-attention mechanism attends to the question and the image simultaneously, so that the relevance of words in the question and of specific image regions are determined by each other. The final answer is predicted by recursively combining the co-attended features from all three levels of the hierarchy.
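As a rough illustration of the parallel co-attention step described above, the sketch below computes an affinity matrix between question and image features and derives attention weights for each modality from the other. This is a minimal NumPy mock-up: the weight matrices (`W_b`, `W_v`, `W_q`, `w_hv`, `w_hq`), the hidden size `k`, and the feature dimensions are placeholder assumptions standing in for learned parameters, not the project's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_coattention(V, Q, k=32):
    """Sketch of parallel co-attention between image features V (d, N)
    and question features Q (d, T). Random matrices stand in for the
    learned weights of the real model."""
    d, N = V.shape
    _, T = Q.shape
    W_b = rng.standard_normal((d, d)) * 0.1
    W_v = rng.standard_normal((k, d)) * 0.1
    W_q = rng.standard_normal((k, d)) * 0.1
    w_hv = rng.standard_normal(k) * 0.1
    w_hq = rng.standard_normal(k) * 0.1

    C = np.tanh(Q.T @ W_b @ V)                # (T, N) affinity matrix
    H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)    # image hidden states (k, N)
    H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)  # question hidden states (k, T)
    a_v = softmax(w_hv @ H_v)                 # attention over N image regions
    a_q = softmax(w_hq @ H_q)                 # attention over T question tokens
    v_hat = V @ a_v                           # attended image feature (d,)
    q_hat = Q @ a_q                           # attended question feature (d,)
    return v_hat, q_hat, a_v, a_q

# Hypothetical sizes: a 7x7 grid of image regions and a 10-token question.
V = rng.standard_normal((64, 49))
Q = rng.standard_normal((64, 10))
v_hat, q_hat, a_v, a_q = parallel_coattention(V, Q)
```

In the full hierarchical model, this step would be applied at the word, phrase, and question levels, and the resulting attended feature pairs combined recursively to predict the answer.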