Data Scientists for the 21st Century – Are we in for a drought?
Recently, a TechCrunch article, citing a McKinsey study, noted that by 2018 the number of data science jobs in the US alone would exceed 490,000, whereas there would be fewer than 200,000 data scientists to fill those positions, http://techcrunch.com/2015/12/31/how-to-stem-the-global-shortage-of-data-scientists/. Globally, the demand for data scientists is projected to exceed supply by more than 50% by 2018. Yet one has to be impressed by the number of data science courses and degrees offered by US universities. For example, the following link lists programs offered by California universities, http://www.mastersindatascience.org/schools/california/.
Other states have been equally proactive; please see the above link for more information. Add to that the online courses offered by Coursera, Udacity, edX, etc., and the practicing engineer has a wide range of choices for specializing in the data science field. So, is it going to be gloom and doom when it comes to filling the data science positions of the future? I, for one, do not think so.
On February 11, I attended UC Berkeley’s EECS annual BEARS meeting, http://www.eecs.berkeley.edu/bears/. Berkeley has been at the forefront of technology development for data science with significant contributions, such as the Berkeley Data Analytics Stack (BDAS), https://amplab.cs.berkeley.edu/software/, the main component of which is the Spark (now Apache Spark) software. Berkeley has been equally innovative in developing courses for budding data scientists, http://databears.berkeley.edu/. Prof. David Culler talked about the “Foundations of Data Science” class, which is taken by about 500 undergraduate students; please see the attached picture of the class in session. 80% of those students noted that the class was outside their major field of study. This suggests that data science is becoming a fundamental tool, much like mathematics. And if Berkeley alone trains that many students with basic data science skills, how could we possibly be headed for a shortage?
To address the question, one has to examine what the new field of data science encompasses. At the infrastructure layer, it includes data storage and retrieval systems, such as the HDFS file system, NoSQL databases, etc. Above that layer come tools for parsing, auditing, and cleansing the data. Then come machine learning models to extract insights from the data, and finally visualization tools for humans to extract and consume further insights from the output of the models. Do we really need an all-in-one data scientist with all of the above skills across the stack? If one looks at the software engineering field, one might be tempted to argue that the ideal software engineer should be adept at developing software across the system and application stacks.
Yet, in reality, software engineers specialize in specific areas, such as storage systems, databases and middleware, networking software, application software, etc. The same is going to be true for data science as well. Teams of software engineers, statisticians, and data analysts are likely to fill the needs of the data science stack effectively. Some understanding of the whole stack would certainly be useful, but having specialists focus on their areas of expertise would be the best approach. This is already happening in the industry. Moreover, a number of open source data science and machine learning toolsets, such as Apache Spark, H2O, Google TensorFlow, etc., have made it easier to build complex predictive analytic models. So, I conclude that engineers and data analysts working together in teams will indeed fill the needs of the industry when it comes to data science for 2018 and well beyond. Comments and feedback from readers are most welcome.
Dr. K. Moidin Mohiuddin
March 11, 2016