# 8K

Associate Research Scientist interview questions shared by candidates

## Top Interview Questions

Sort: Relevance|Popular|Date
Data Scientist was asked...1 March 2016

### Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: a two-column table of users and their friends, and a two-column table of users and the pages they liked. It should not recommend pages you already like.

CREATE temporary table likes ( userid int not null, pageid int not null ) CREATE temporary table friends ( userid int not null, friendid int not null ) insert into likes VALUES (1, 101), (1, 201), (2, 201), (2, 301); insert into friends VALUES (1, 2); select f.userid, l.pageid from friends f join likes l ON l.userid = f.friendid LEFT JOIN likes r ON (r.userid = f.userid AND r.pageid = l.pageid) where r.pageid IS NULL; Less

SELECT f.userid, l.pageid FROM friends f LEFT JOIN likes l ON f.friendid = l.userid WHERE l.pageid NOT IN (SELECT r.pageid FROM likes r WHERE f.userid = r.userid) Can someone tell me if this works? Less

Use Except select f.user_id, l.page_id from friends f inner join likes l on f.fd_id = l.user_id group by f.user_id, l.page_id -- for each user, the unique pages that liked by their friends Except select user_id, page_id from likes Less

### Data challenge was very similar to the ads analysis challenge on the book the collection of data science takehome challenge, so that was easy (if you have done your homework). SQL was: you have a table where you have date, user_id, song_id and count. It shows at the end of each day how many times in her history a user has listened to a given song. So count is cumulative sum. You have to update this on a daily basis based on a second table that records in real time when a user listens to a given song. Basically, at the end of each day, you go to this second table and pull a count of each user/song combination and then add this count to the first table that has the lifetime count. If it is the first time a user has listened to a given song, you won't have this pair in the lifetime table, so you have to create the pair there and then add the count of the last day. Onsite: lots of ads related and machine learning questions. How to build an ad model, how to test it, describe a model. I didn't do well in some of these.

Can't tell you the solution of the ads analysis challenge. I would recommend getting in touch with the book author though. It was really useful to prep for all these interviews. SQL is a full outer join between life time count and last day count and then sum the two. Less

Can you post here your solution for the ads analysis from the takehome challenge book. I also bought the book and was interested in comparing the solutions. Also can you post here how you solved the SQL question? Less

for the SQL, I think both should work. Outer join between lifetime count and new day count and then sum columns replacing NULLs with 0, or union all between those two, group by and then sum. Less

### Consider a game with 2 players, A and B. Player A has 8 stones, player B has 6. Game proceeds as follows. First, A rolls a fair 6-sided die, and the number on the die determines how many stones A takes over from B. Next, B rolls the same die, and the exact same thing happens in reverse. This concludes the round. Whoever has more stones at the end of the round wins and the game is over. If players end up with equal # of stones at the end of the round, it is a tie and another round ensues. What is the probability that B wins in 1, 2, ..., n rounds?

Because at the beginning time, A has 8 and B has 6, so let A:x and B:y, then A:8+x-y and B:6-x+y; so there are 10/36 prob of B wins. And A wins prob is 21/36 and the equal prob for next round is 5/36. So for B wins at round prob is 10/36. And if they are equal and to have another round, the number has changed to 7 and 7. So A:7+x-y and B:7-x+y, so this time B wins has prob 15/36 and A wins has prob 15/36. And the equal to have another round is 6/36=1/6. So overall B wins in 2 rounds has prob 5/36*15/36. And for round 3,4,...etc, since after each equal round, the number will go back to 7 and 7 so the prob will not change. So B wins in round 3,4,...n has prob 5/36*(6/36)^(r-2)*15/36. r means the number of the total rounds. Less

So many answers...Here's my version: For round1, B win only if it gets 3 or more stones than A, which is (A,B) = (1,4) (1,5) (1, 6) (2, 5) (2,6) (3,6) which is 6 cases out of all 36 probabilities. So B has 1/6 chance to win. To draw, B needs to get exactly 2 stones more than A, which is (A, B) = (1,3) (2,4) (3,5) (4,6) or 1/9. Entering the second round, all stones should be equal, so the chance to draw become 1/6, and the chance for either to win is 5/12. So the final answer is (1/6, 1/9*5/12, (1/9)^2*5/12, .....(1/9)^(n-1)*5/12) ) Less

I don't get it. Shouldn't prob of B winning given it's tie at 1st round be 15/36? given it's tie at 1st round, at the 2nd round Nb &gt; Na can happen if (B,A) is (2,1), (3,1/2),(4,1/2/3), (5,1/2/3/4),(6,1/2/3/4/5), which totals 15 out of 36. Less

### Find the second largest element in a Binary Search Tree

The above answer is also wrong; Node findSceondLargest(Node root) { // If tree is null or is single node only, return null (no second largest) if (root==null || (root.left==null &amp;&amp; root.right==null)) return null; Node parent = null, child = root; // find the right most child while (child.right!=null) { parent = child; child = child.right; } // if the right most child has no left child, then it's parent is second largest if (child.left==null) return parent; // otherwise, return left child's rightmost child as second largest child = child.left; while (child.right!=null) child = child.right; return child; } Less

find the right most element. If this is a right node with no children, return its parent. if this is not, return the largest element of its left child. Less

One addition is the situation where the tree has no right branch (root is largest). In this special case, it does not have a parent. So it's better to keep track of parent and current pointers, if different, the original method by the candidate works well, if the same (which means the root situation), find the largest of its left branch. Less

### data question: dialoglog (userid int appid int type char , a flag either "imp" or "click" ds timestamp ) How would you access the quality of app? How to compute click-through rate (in mySQL)?

Assuming ctr is defined as total number clicks / total number of impressions (not counting each unique user's action) select appid, total_click_ct / total_imp_ct as ctr from ( select appid, count(distinct case when flag = 'imp' then 1 else 0 end) as total_imp_ct, count(distinct case when flag = 'click' then 1 else 0 end) as total_click_ct, from table where ts &gt; x and ts &lt; y group by appid) table2 order by ctr desc; Less

SELECT appid, sum(click) / count(*) AS crt FROM dialoglog GROUP BY appid;

### Write a sorting algorithm for a numerical dataset in Python.

def mergeSort(lst): split = len(lst) / 2 left = lst[:split] right = lst[split:] if len(left) &gt; 1 or len(right) &gt; 1: left = mergeSort(left) right = mergeSort(right) return merge(left, right) def merge(A, B): i=0 j=0 sorted_list = [] while i &lt; len(A) and j &lt; len(B): if A[i] &lt;= B[j]: sorted_list.append(A[i]) i += 1 else: sorted_list.append(B[j]) j += 1 if i &lt; len(A): sorted_list.extend(A[i:]) elif j &lt; len(B): sorted_list.extend(B[j:]) return sorted_list Less

list = ["1", "4", "2", "10", "5"] list = [ int(i) for i in list ] list.sort() print (list) Less

def mysort(arr): if len(arr) pivot] return mysort(left) + middle + mysort(right) Less

### Given a table with columns country (with two-letter country abbreviation), count of requests sent, percentage of requests sent failed, condense down to this data grouped by country: country (one row per country), total count of requests sent, total count of requests sent failed.

SELECT country, SUM(count_of_request_sent) AS total_count_of_request_sent, SUM((count_of_request_sent * percent_of_request_failed) / 100) AS total_count_of_request_failed FROM table GROUP BY country; Less

SELECT country, SUM(requests_sent), SUM((percent_failed/100) * requests_sent) FROM requests GROUP BY country; Less

My assumption is that the requests_sent columns is not an integer. It might just have status as per 'failed", "successful" etc. Best to verify from the interviewer. That means we may not be able to take the sum of teh request_sent but have to use count instead. WITH CTE AS ( SELECT country, request_sent, (perc_failed * count(request_sent)) AS total_failed FROM T1 ) SELECT country, COUNT(request_sent), SUM(total_failed) GROUP BY 1 Less

### SQL question: given a table of interaction between users (user_a | user_b | day), find number of users who had more than 5 interactions yesterday (assume there is only one unique interaction between a pair of users per day). Product Question: A user satisfaction survey was conducted for two groups of facebook users (each with 50 k sample size). Group1: who had enabled certain login security features Group 2: who had not enabled these security features. It was found that user satisfaction with group1 was 30% lower than with group 2. Why do you think so? Comment on how the survey was conducted?

For the product question: This is an example of selection bias. The subset of users who opted in for the security features are not representative of the underlying population of FB users. They are probably more concerned about how their data is being used. Further, because the survey would have been completed on a voluntary basis, it represents another source of selection bias. These individuals are more likely to be engaged with the FB product than the general population, and care enough to complete the survey. Once again, these users are not a representative (or for that matter random) sample from the underlying population of FB users. This would explain why Group 1 had a much lower satisfaction rate as compared to Group 2, as they represent a specific subset of FB's users who care deeply about this issue, and by the virtue that they have the security feature on is indicative of the fact that they take data privacy seriously. For the SQL question: This is probably not the most efficient, but is the only thing that came to my mind: with all_users as ( select userA as userID, count(1) as interactions from table where date = current_date - interval 'day' group by userA union all select userB as userID, count(1) as interactions from table where date = current_date - interval 'day' group by userB ), engaged as ( select userID, sum(interactions) as totalInteractions from all_users group by userID having totalInteractions &gt;= 5 ) select count(1) as greaterThanFive from engaged Less

select c.user_id , sum(c.cnt) as sum_cnt from ( SELECT USER_A as user_id,count(user_b) cnt FROM interactions group by user_a union all SELECT USER_B as user_id,count(user_a) cnt FROM interactions group by user_B) c group by c.user_id having sum(c.cnt)&gt;5 ; Less

WITH CTE AS ( SELECT date, User_a AS user FROM users UNION ALL SELECT date, User_b AS user FROM users) SELECT date, user, count(DISTINCT user) AS interactions FROM CTE where date = DATE_SUB(date INTERVAL 1 day) GROUP BY 1,2 HAVING interactions &gt; 5 Less

### Given a table of friend requests sent and friend requests received, find the user with the most friends.

request_sent (sender, target, date) s request_received (approver, source, date) r r.date &gt; max(s.date) to get the most updated view in case of "unfriend" ** we'd still have a blind spot of unfriends who occurred and new requests werent sent. every friend is a result of a request sent and approved and adds a friend to 2 people (sender and approver). SELECT user, sum(cf) as num_friends FROM ( SELECT approver as user, count(distinct sender) as cf FROM s inner join r on s.sender=r.source and s.target=r.approver GROUP BY approver HAVING r.date&gt;max(s.date) UNION ALL SELECT sender as user, count(distinct approver) as cf FROM s inner join r on s.sender=r.source and s.target=r.approver GROUP BY sender HAVING r.date&gt;max(s.date) ) GROUP BY user Less

Why do you need to use both the tables? Cant, we achieve this with only the second table? Less

select userid, sum(friends) from (select approver as userid, count(distinct source) friends from request_received group by approver union select source, count(distinct approver) from request_received group BY source) f group by userid Less