Walmart interview question

What are the data structures in Python? What are the data structures in the Pandas package in Python? (Hint: DataFrame, Index, etc.) What are generators, iterators, and list comprehensions in Python? Define parametric and non-parametric methods and give some examples. What are the different types of sampling methods that you have used? Give some problems or scenarios where the map-reduce concept works well and where it doesn't.

Interview Answer

Anonymous

6 Jun 2018

Python has several data structures. The built-in ones are list, tuple, dictionary and set; the standard library adds arrays (the array module) and heaps and priority queues (heapq and queue.PriorityQueue). The primary data structures in Pandas are the DataFrame (a two-dimensional labelled table) and the Series (a single labelled column), both of which carry an Index of row labels. The DataFrame is very powerful and makes data wrangling and analysis very convenient.

A generator is a function that returns an iterator we can loop over. To create a generator in Python, we simply define a function with a yield statement instead of a return statement. An iterator is an object that lets you traverse all the elements of a collection, regardless of its specific implementation. More precisely, an iterator is an object that implements the iterator protocol, that is, the __iter__() and __next__() methods. Every time you ask for the next value, the iterator knows how to compute it, because it keeps track of its current position in the iterable it is working on. A list comprehension is a concise way to construct a list, e.g. list_variable = [x for x in iterable]

Parametric and non-parametric methods: a machine learning method with a fixed number of parameters, independent of the amount of training data, is called parametric. Examples are logistic regression and expectation-maximization based clustering. Methods whose number of parameters grows with the data, and is potentially infinite, are called non-parametric. For example, a Gaussian process is a non-parametric method.

Sampling methods: broadly speaking, there are two types of sampling approaches, probability sampling and non-probability sampling. The former includes simple random sampling, stratified sampling and cluster sampling, among others, while the latter includes volunteer sampling and haphazard (convenience) sampling, among others.

Map-reduce works well for methods that can be parallelized. That is, if we can break a problem down into smaller tasks and then combine (reduce) the results of those tasks to obtain the final output, we can use map-reduce.
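To make the split-then-combine idea concrete, here is a minimal single-process sketch of map-reduce using a word count. The function names (map_chunk, merge_counts, word_count) and the sample data are made up for illustration; a real framework would run the map step on separate workers rather than with the built-in map().

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, independently of all other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(chunks):
    # Each chunk is processed on its own, so this step could run in parallel.
    partial_counts = map(map_chunk, chunks)
    # The partial results are then merged pairwise into the final answer.
    return reduce(merge_counts, partial_counts, Counter())


# Hypothetical input: two chunks of text lines.
chunks = [
    ["the quick brown fox", "the lazy dog"],
    ["the dog barks"],
]
totals = word_count(chunks)
print(totals.most_common(2))
```

The approach works here because the merge is associative: the partial counts can be combined in any order without changing the result.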
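For example, map-reduce works really well for k-means clustering, because within each iteration the assignment of points to centroids can be computed independently on each partition of the data. It does not work well for inherently sequential approaches, where each step depends on the result of the previous one.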
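The iterator, generator and list comprehension definitions given earlier in this answer can be sketched in a few lines of Python. The names Countdown and countdown are invented for this example; the sketch only shows that a hand-written iterator class and a generator function produce the same sequence.

```python
class Countdown:
    """Hand-written iterator: implements __iter__() and __next__()."""

    def __init__(self, start):
        self.current = start  # the state the iterator keeps between calls

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals that the sequence is exhausted
        value = self.current
        self.current -= 1
        return value


def countdown(start):
    """Generator: the same sequence, written with yield instead of return."""
    while start > 0:
        yield start
        start -= 1


from_class = list(Countdown(3))
from_generator = list(countdown(3))

# A generator is itself an iterator, so next() works on it directly.
gen = countdown(2)
first = next(gen)

# List comprehension: build a list from an iterable, with an optional filter.
squares_of_evens = [x * x for x in range(6) if x % 2 == 0]
```

The generator version is shorter because Python creates the __iter__() and __next__() methods and saves the state between calls automatically.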
