Hermann Ney is a professor of computer science at RWTH Aachen University, Germany. His main research interests lie in the area of statistical classification, machine learning and neural networks with specific applications to speech recognition, handwriting recognition, machine translation and other tasks in natural language processing.
He and his team have participated in many large-scale joint projects, such as the German project VERBMOBIL; the European projects TC-STAR, QUAERO, TRANSLECTURES and EU-BRIDGE; and the US projects GALE, BOLT and BABEL. His work has resulted in more than 700 conference and journal papers, with an h-index of more than 100 and more than 60,000 citations (based on Google Scholar). More than 50 of his former PhD students work for IT companies on speech and language technology.
The results of his research have contributed to various operational research prototypes and commercial systems. In 1993, Philips Dictation Systems Vienna introduced a product for large-vocabulary continuous-speech recognition. In 1997, Philips Dialogue Systems Aachen introduced a spoken dialogue system for train timetable information via the telephone. In VERBMOBIL, his team introduced the phrase-based approach to data-driven machine translation, which in 2008 was used by his former PhD students at Google as the starting point for the service Google Translate. In TC-STAR, his team built the first research prototype system for spoken language translation in real-life domains.
Awards: 2005 Technical Achievement Award of the IEEE Signal Processing Society; 2013 Award of Honour of the International Association for Machine Translation; 2019 IEEE James L. Flanagan Speech and Audio Processing Award; 2021 ISCA Medal for Scientific Achievements.
The last 40 years have seen dramatic progress in machine learning and statistical methods for speech and language processing tasks such as speech recognition, handwriting recognition and machine translation. Many of the key statistical concepts were originally developed for speech recognition. Examples of such key concepts are the Bayes decision rule for minimum error rate and sequence-to-sequence processing using approaches like the alignment mechanism based on hidden Markov models and the attention mechanism based on neural networks.
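In standard notation (not taken from this abstract), the Bayes decision rule for minimum error rate in speech recognition selects, for an acoustic observation sequence $x_1^T$, the word sequence $w_1^N$ with maximum posterior probability, which by Bayes' theorem decomposes into a language model and an acoustic model:

```latex
\hat{w}_1^N
  = \operatorname*{argmax}_{w_1^N} \Pr(w_1^N \mid x_1^T)
  = \operatorname*{argmax}_{w_1^N} \bigl\{ \Pr(w_1^N) \cdot \Pr(x_1^T \mid w_1^N) \bigr\}
```

Here $\Pr(w_1^N)$ is the language model and $\Pr(x_1^T \mid w_1^N)$ the acoustic model, whose alignment between observations and words is classically handled by hidden Markov models.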
Recently, the accuracy of speech recognition, handwriting recognition and machine translation has been improved significantly by the use of artificial neural networks and specific architectures, such as deep feedforward multi-layer perceptrons, recurrent neural networks, and attention and transformer architectures. We will discuss these approaches in detail and show how they form part of the probabilistic approach.
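As a minimal illustration of the attention mechanism underlying transformer architectures (a generic NumPy sketch, not code from the tutorial), scaled dot-product attention computes a softmax-weighted combination of value vectors, where the weights measure the similarity between queries and keys:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention as used in transformer architectures.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the attended values (n_queries, d_v) and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query/key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V, weights

# toy example: 2 queries attending over 3 key/value pairs
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
out, w = scaled_dot_product_attention(Q, K, V)
```

The attention weights play a role analogous to the HMM alignment in classical sequence-to-sequence models: each output position forms a soft correspondence with the input positions.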
Gregory Rogez is a Senior Scientist at NAVER LABS Europe in Grenoble, France, where he leads the Computer Vision group. He received an M.Eng. in Physics from ENSPM/Centrale Marseille, France, in 2002, and M.Sc. and Ph.D. degrees in computer vision from the University of Zaragoza, Spain, in 2005 and 2012, respectively. During his Ph.D., he was a regular visiting student (2007-2008) and research fellow (2009-2010) in Oxford, UK.
His work on monocular human body pose analysis received the best Ph.D. thesis award from the Spanish Association on Pattern Recognition (AERFAI) for the period 2011-2013. He was then a Marie Curie Fellow at UC Irvine, USA (2013-2015), a Research Scientist at Inria, France (2015-2018) and joined NAVER LABS Europe in 2019. His main research interests include computer vision and deep learning, with a special focus on sensing and understanding people from visual data and the application of such technology to AR/VR and robotics. He has published more than 30 papers on the topic and is a regular reviewer for top-tier journals and conferences in the field.
In this tutorial, I will explain how the complex and severely ill-posed problem of 3D human pose estimation from monocular images can be tackled as a detection problem using standard object classifiers. I will review the classification-based techniques proposed over the past 15 years to handle different levels of the human body, including full body, upper body, face and hands. I will discuss the advantages and drawbacks of classification approaches and present in detail some solutions involving training data synthesis, CNN architectures, distillation and transformers.
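To make the classification view concrete, here is an illustrative sketch (my own toy construction, not the tutorial's exact method): the continuous 3D pose space is discretized into K "pose classes" by clustering training poses, a standard classifier maps image features to one of these classes, and the anchor pose of the predicted class is returned as the 3D estimate. All names, the feature model and the nearest-class-mean classifier are assumptions for illustration.

```python
import numpy as np

def build_pose_classes(train_poses, K, iters=20, seed=0):
    """Cluster training poses (n, d) into K anchor poses with plain k-means."""
    rng = np.random.default_rng(seed)
    centers = train_poses[rng.choice(len(train_poses), K, replace=False)]
    for _ in range(iters):
        # assign each pose to its nearest anchor, then recompute anchors
        labels = np.argmin(((train_poses[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = train_poses[labels == k].mean(axis=0)
    return centers, labels

def predict_pose(feature, class_features, anchor_poses):
    """Nearest-class-mean classifier over image features (a stand-in for any
    standard classifier); returns the anchor pose of the predicted class."""
    k = np.argmin(((class_features - feature) ** 2).sum(-1))
    return anchor_poses[k]

# synthetic data standing in for (pose, image feature) training pairs
rng = np.random.default_rng(1)
train_poses = rng.normal(size=(200, 6))                 # 6-D toy pose vectors
train_feats = train_poses @ rng.normal(size=(6, 8))      # features correlated with pose
K = 5
anchors, labels = build_pose_classes(train_poses, K)
class_feats = np.stack([
    train_feats[labels == k].mean(0) if np.any(labels == k) else train_feats.mean(0)
    for k in range(K)
])
estimate = predict_pose(train_feats[0], class_feats, anchors)
```

The key design choice is the granularity of the discretization: more pose classes reduce quantization error but give each class fewer training examples, which is part of what motivates the data-synthesis and distillation techniques mentioned in the abstract.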