TOWARDS EFFECTIVE AND EFFICIENT VIDEO UNDERSTANDING