# Data scientist Interview Questions

In a data scientist interview, expect employers to ask questions that assess your data modelling, problem-solving and programming skills. Be prepared to answer general questions that test your knowledge of statistics and data science. You should also be ready to answer open-ended questions that test your creativity, communication skills and formal education in data modelling and programming.

## Top Data Scientist Interview Questions & How to Answer

Here are three data scientist interview questions and tips on how to answer them:

### Question No. 1: Which data modelling techniques do you prefer and why?

How to answer: Turning data into understandable and actionable information is a critical part of the data scientist's job. This question allows employers to understand your data modelling skills and background. List and discuss your preferred data modelling techniques, including benefits such as ease of use, flexibility, etc.

### Question No. 2: How would you detect bogus Instagram accounts used for scamming consumers?

How to answer: Questions like this one allow an employer to test your problem-solving skills. When answering open-ended questions such as these, feel free to ask clarifying questions and use whiteboards to demonstrate your coding and diagramming skills. Share your thought process as you work through the problem.

### Question No. 3: Describe circumstances that require a list, tuple or set in Python.

How to answer: Interviewers will use questions such as this one to test your Python programming skills. Review Python basics such as lists, tuples and sets before your interview. You should be able to explain when and how each tool is used by data scientists.

### Describe yourself

57 Answers↳

So if you get any call from a company sating that you are selected then first visit their office get your hard copy of offer letter do some research on google ask people then think. And if anyone is saying you first you need to do certification then it will be final then dont choose that company. They are fooling you by taking your money and wasting your time. Less

↳

Ya we will report it to police and cybercrime their all numbers their all Email ids and all from so called company people to specialist learning institute we will report everyones number Less

↳

its a serious issue please report to cybercrime police benguluru.

### Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: a two-column table of users and their friends, and a two-column table of users and the pages they liked. It should not recommend pages you already like.

40 Answers↳

CREATE temporary table likes ( userid int not null, pageid int not null ) CREATE temporary table friends ( userid int not null, friendid int not null ) insert into likes VALUES (1, 101), (1, 201), (2, 201), (2, 301); insert into friends VALUES (1, 2); select f.userid, l.pageid from friends f join likes l ON l.userid = f.friendid LEFT JOIN likes r ON (r.userid = f.userid AND r.pageid = l.pageid) where r.pageid IS NULL; Less

↳

select w.userid, w.pageid from ( select f.userid, l.pageid from rollups_new.friends f join rollups_new.likes l ON l.userid = f.friendid) w left join rollups_new.likes l on w.userid=l.userid and w.pageid=l.pageid where l.pageid is null Less

↳

Use Except select f.user_id, l.page_id from friends f inner join likes l on f.fd_id = l.user_id group by f.user_id, l.page_id -- for each user, the unique pages that liked by their friends Except select user_id, page_id from likes Less

### You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?

34 Answers↳

Bayesian stats: you should estimate the prior probability that it's raining on any given day in Seattle. If you mention this or ask the interviewer will tell you to use 25%. Then it's straight-forward: P(raining | Yes,Yes,Yes) = Prior(raining) * P(Yes,Yes,Yes | raining) / P(Yes, Yes, Yes) P(Yes,Yes,Yes) = P(raining) * P(Yes,Yes,Yes | raining) + P(not-raining) * P(Yes,Yes,Yes | not-raining) = 0.25*(2/3)^3 + 0.75*(1/3)^3 = 0.25*(8/27) + 0.75*(1/27) P(raining | Yes,Yes,Yes) = 0.25*(8/27) / ( 0.25*8/27 + 0.75*1/27 ) **Bonus points if you notice that you don't need a calculator since all the 27's cancel out and you can multiply top and bottom by 4. P(training | Yes,Yes,Yes) = 8 / ( 8 + 3 ) = 8/11 But honestly, you're going to Seattle, so the answer should always be: "YES, I'm bringing an umbrella!" (yeah yeah, unless your friends mess with you ALL the time ;) Less

↳

Answer from a frequentist perspective: Suppose there was one person. P(YES|raining) is twice (2/3 / 1/3) as likely as P(LIE|notraining), so the P(raining) is 2/3. If instead n people all say YES, then they are either all telling the truth, or all lying. The outcome that they are all telling the truth is (2/3)^n / (1/3)^n = 2^n as likely as the outcome that they are not. Thus P(ALL YES | raining) = 2^n / (2^n + 1) = 8/9 for n=3 Notice that this corresponds exactly the bayesian answer when prior(raining) = 1/2. Less

↳

26/27 is incorrect. That is the number of times that at least one friend would tell you the truth (i.e., 1 - probability that would all lie: 1/27). What you have to figure out is the odds it raining | (i.e., given) all 3 friends told you the same thing. Because they all say the same thing, they must all either be lying or they must all be telling the truth. What are the odds that would all lie and all tell the truth? In 1/27 times, they would the all lie and and in 8/27 times they would all tell the truth. So there are 9 ways in which all your friends would tell you the same thing. And in 8 of them (8 out of 9) they would be telling you the truth. Less

### want you to write me a simple spell checking engine. The query language is a very simple regular expression-like language, with one special character: . (the dot character), which means EXACTLY ONE character (it can be any character). So, for example, 'c.t' would match 'cat' as the dot matches any character. There may be any number of dot characters in the query (or none). Your spell checker will have to be optimized for speed, so you will have to write it in the required way. There would be a one-time setUp() function that does any pre-processing you require, and then there will be an isMatch() function that should run as fast as possible, utilizing that pre-processing. There are some examples below, feel free to ask for clarification. Word List: [cat, bat, rat, drat, dart, drab] Queries: cat -> true c.t -> true .at -> true ..t -> true d..t -> true dr.. -> true ... -> true .... -> true ..... -> false h.t -> false c. -> false */ // write a function // Struct setup(List<String> list_of_words) // Do whatever processing you want here // with reasonable efficiency. // Return whatever data structures you want. // This function will only run once // write a function // bool isMatch(Struct struct, String query) // Returns whether the query is a match in the // dictionary (True/False) // Should be optimized for speed

25 Answers↳

Here is the Python Code (inspired by someone's code on this page): def setUp(word, input_list): word = word.strip() temp_list = [] Ismatch = False if word in input_list: Ismatch = True elif word is None or len(word) == 0: Ismatch = False else: for w in input_list: if len(w) == len(word): temp_list.append(w) for j in range(len(temp_list)): count=0 for i in range(len(word)): if word[i] == temp_list[j][i] or word[i] == '.': count += 1 else: break if count == len(word): Ismatch = True print(Ismatch) def isMatch(word, input_list): return setUp(word, input_list) isMatch('c.t', ['cat', 'bte', 'art', 'drat', 'dart', 'drab']) Less

↳

bear in mind for your solution, checking the lengths of words in the dictionary is very fast. That's what you can use your setup for. There's no need to iterate through the whole loop of checks if the word fails the length already. See my solution above Less

↳

This was the fastest I could do without regex: def func(wrd,lst): if len(wrd) not in [len(x) for x in lst]: return False elif wrd in lst: return True else: lst1 = [x for x in lst if len(x)==len(wrd)] for z in lst1: c=0 for i in range(len(wrd)): if wrd[i] != '.' and wrd[i] == z[i]: c=c+1 if len(wrd)-wrd.count('.') == c: return True return False Less

### Write a SQL query to compute a frequency table of a certain attribute involving two joins. What if you want to GROUP or ORDER BY some attribute? What changes would you need to make? How would you account for NULLs?

25 Answers↳

If you group by parent_id, you'll be leaving out all posts with zero comments.

↳

select number_comments, count(submission_id) as number_posts from ( # more than zero comments select submission_id, count(post_id) as number_comments from ( select submission_id, case when parent_id is null 1 else 0 end as post, case when parent_id is not null parent_id else null end as post_id, body from Submissions )k where post =0 group by submission_id ) k1 group by number_comments union select number_comments, count(submission_id) as number_posts from ( # comments= 0 select submission_id, 0 as number_comments from ( select submission_id, case when parent_id is null 1 else 0 end as post, case when parent_id is not null parent_id else null end as post_id, body from Submissions )k where post =1 group by submission_id ) k1 group by number_comments Less

↳

@ RLeung shouldn't you use left join? You are effectively losing all posts with zero comment. Less

### Given an array of integers, we would like to determine whether the array is monotonic (non-decreasing/non-increasing) or not. Examples: // 1 2 5 5 8 // true // 9 4 4 2 2 // true // 1 4 6 3 // false //1 1 1 1 1 1 // true

19 Answers↳

if l == sorted(l) or l == sorted(l,reverse=True): print(True) else: print(False) Less

↳

python solution. inc = dec = True for i in range(len(nums)-1): if nums[i] > nums[i+1]: inc = false if nums[i] < nums[i+1]: dec = false return inc or dec Less

↳

An array is monotonic if and only if it is monotone increasing, or monotone decreasing. Since p = A[i+1] for all i indexing from 0 to len(A)-2. Note: Array with single element can be considered to be both monotonic increasing or decreasing, hence returns “True“. --------------------Python 3 code---------------------- # Check if given array is Monotonic def isMonotonic(A): return (all(A[i] = A[i + 1] for i in range(len(A) - 1))) # Test with an Array A = [6, 5, 4, 4] # Print required result print(isMonotonic(A)) Less

### Given an list A of objects and another list B which is identical to A except that one element is removed, find that removed element.

19 Answers↳

All these supposed answers are missing the point, and this question isn't even worded correctly. It should be lists of NUMBERS, not "objects". Anyway, the question is asking how you figure out the number that is missing from list B, which is identical to list A except one number is missing. Before getting into the coding, think about it logically - how would you find this? The answer of course is to sum all the numbers in A, sum all the numbers in B, subtract the sum of B from the sum of A, and that gives you the number. Less

↳

select b.element from b left join a on b.element = a.element where a.element is null Less

↳

In Python: (just numbers) def rem_elem_num(listA,listB): sumA = 0 sumB = 0 for i in listA: sumA += i for j in listB: sumB += j return sumA-sumB (general) def rem_elem(listA, listB): dictB = {} for j in listB: dictB[j] = None for i in listA: if i not in dictB: return i Less

### Data challenge was very similar to the ads analysis challenge on the book the collection of data science takehome challenge, so that was easy (if you have done your homework). SQL was: you have a table where you have date, user_id, song_id and count. It shows at the end of each day how many times in her history a user has listened to a given song. So count is cumulative sum. You have to update this on a daily basis based on a second table that records in real time when a user listens to a given song. Basically, at the end of each day, you go to this second table and pull a count of each user/song combination and then add this count to the first table that has the lifetime count. If it is the first time a user has listened to a given song, you won't have this pair in the lifetime table, so you have to create the pair there and then add the count of the last day. Onsite: lots of ads related and machine learning questions. How to build an ad model, how to test it, describe a model. I didn't do well in some of these.

18 Answers↳

Can't tell you the solution of the ads analysis challenge. I would recommend getting in touch with the book author though. It was really useful to prep for all these interviews. SQL is a full outer join between life time count and last day count and then sum the two. Less

↳

Can you post here your solution for the ads analysis from the takehome challenge book. I also bought the book and was interested in comparing the solutions. Also can you post here how you solved the SQL question? Less

↳

for the SQL, I think both should work. Outer join between lifetime count and new day count and then sum columns replacing NULLs with 0, or union all between those two, group by and then sum. Less

### What was the last gift you gave someone?

17 Answers↳

I gave a kidney to a sick little girl who was on the verge of death, Her parents could not afford the operation so I performed the operation myself, I am a pizza delivery man with just a pair of scissors. Less

↳

Patience, though it is a virtue, it can be a gift when you try to understand a co-worker rather than gossip, yell or ostracize them. Less

↳

Time. I spent an afternoon in an assisted-living home and helped with the Bingo tournament. It was all the rage and my ounce of time is immeasurable to those who need it the most. Less

### We have two options for serving ads within Newsfeed: 1 - out of every 25 stories, one will be an ad 2 - every story has a 4% chance of being an ad For each option, what is the expected number of ads shown in 100 news stories? If we go with option 2, what is the chance a user will be shown only a single ad in 100 stories? What about no ads at all?

16 Answers↳

For the questions 1: I think both options have the same expected value of 4 For the question 2: Use binomial distribution function. So basically, for one case to happen, you will use this function p(one case) = (0.96)^99*(0.04)^1 In total, there are 100 positions for the ad. 100 * p(one case) = 7.03% Less

↳

For "MockInterview dot co": The binomial part is correct but you argue that the expected value for option 2 is not 4 but this is false. In both cases E(x) = np = 100*(4/100) = 4 and E(x) = np=100*(1/25) = 4 again. Less

↳

Chance of getting exactly one add is ~7% As the formula is (NK) (0,04)^K * (0,96)^(N−K) where the first (NK) is the combination number N over K Less