# Associate research scientist Interview Questions

# 8K

Associate Research Scientist interview questions shared by candidates### Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: a two-column table of users and their friends, and a two-column table of users and the pages they liked. It should not recommend pages you already like.

40 Answers↳

CREATE temporary table likes ( userid int not null, pageid int not null ) CREATE temporary table friends ( userid int not null, friendid int not null ) insert into likes VALUES (1, 101), (1, 201), (2, 201), (2, 301); insert into friends VALUES (1, 2); select f.userid, l.pageid from friends f join likes l ON l.userid = f.friendid LEFT JOIN likes r ON (r.userid = f.userid AND r.pageid = l.pageid) where r.pageid IS NULL; Less

↳

SELECT f.userid, l.pageid FROM friends f LEFT JOIN likes l ON f.friendid = l.userid WHERE l.pageid NOT IN (SELECT r.pageid FROM likes r WHERE f.userid = r.userid) Can someone tell me if this works? Less

↳

Use Except select f.user_id, l.page_id from friends f inner join likes l on f.fd_id = l.user_id group by f.user_id, l.page_id -- for each user, the unique pages that liked by their friends Except select user_id, page_id from likes Less

### Data challenge was very similar to the ads analysis challenge on the book the collection of data science takehome challenge, so that was easy (if you have done your homework). SQL was: you have a table where you have date, user_id, song_id and count. It shows at the end of each day how many times in her history a user has listened to a given song. So count is cumulative sum. You have to update this on a daily basis based on a second table that records in real time when a user listens to a given song. Basically, at the end of each day, you go to this second table and pull a count of each user/song combination and then add this count to the first table that has the lifetime count. If it is the first time a user has listened to a given song, you won't have this pair in the lifetime table, so you have to create the pair there and then add the count of the last day. Onsite: lots of ads related and machine learning questions. How to build an ad model, how to test it, describe a model. I didn't do well in some of these.

18 Answers↳

Can't tell you the solution of the ads analysis challenge. I would recommend getting in touch with the book author though. It was really useful to prep for all these interviews. SQL is a full outer join between life time count and last day count and then sum the two. Less

↳

Can you post here your solution for the ads analysis from the takehome challenge book. I also bought the book and was interested in comparing the solutions. Also can you post here how you solved the SQL question? Less

↳

for the SQL, I think both should work. Outer join between lifetime count and new day count and then sum columns replacing NULLs with 0, or union all between those two, group by and then sum. Less

### Consider a game with 2 players, A and B. Player A has 8 stones, player B has 6. Game proceeds as follows. First, A rolls a fair 6-sided die, and the number on the die determines how many stones A takes over from B. Next, B rolls the same die, and the exact same thing happens in reverse. This concludes the round. Whoever has more stones at the end of the round wins and the game is over. If players end up with equal # of stones at the end of the round, it is a tie and another round ensues. What is the probability that B wins in 1, 2, ..., n rounds?

16 Answers↳

Because at the beginning time, A has 8 and B has 6, so let A:x and B:y, then A:8+x-y and B:6-x+y; so there are 10/36 prob of B wins. And A wins prob is 21/36 and the equal prob for next round is 5/36. So for B wins at round prob is 10/36. And if they are equal and to have another round, the number has changed to 7 and 7. So A:7+x-y and B:7-x+y, so this time B wins has prob 15/36 and A wins has prob 15/36. And the equal to have another round is 6/36=1/6. So overall B wins in 2 rounds has prob 5/36*15/36. And for round 3,4,...etc, since after each equal round, the number will go back to 7 and 7 so the prob will not change. So B wins in round 3,4,...n has prob 5/36*(6/36)^(r-2)*15/36. r means the number of the total rounds. Less

↳

So many answers...Here's my version: For round1, B win only if it gets 3 or more stones than A, which is (A,B) = (1,4) (1,5) (1, 6) (2, 5) (2,6) (3,6) which is 6 cases out of all 36 probabilities. So B has 1/6 chance to win. To draw, B needs to get exactly 2 stones more than A, which is (A, B) = (1,3) (2,4) (3,5) (4,6) or 1/9. Entering the second round, all stones should be equal, so the chance to draw become 1/6, and the chance for either to win is 5/12. So the final answer is (1/6, 1/9*5/12, (1/9)^2*5/12, .....(1/9)^(n-1)*5/12) ) Less

↳

I don't get it. Shouldn't prob of B winning given it's tie at 1st round be 15/36? given it's tie at 1st round, at the 2nd round Nb > Na can happen if (B,A) is (2,1), (3,1/2),(4,1/2/3), (5,1/2/3/4),(6,1/2/3/4/5), which totals 15 out of 36. Less

### Find the second largest element in a Binary Search Tree

15 Answers↳

The above answer is also wrong; Node findSceondLargest(Node root) { // If tree is null or is single node only, return null (no second largest) if (root==null || (root.left==null && root.right==null)) return null; Node parent = null, child = root; // find the right most child while (child.right!=null) { parent = child; child = child.right; } // if the right most child has no left child, then it's parent is second largest if (child.left==null) return parent; // otherwise, return left child's rightmost child as second largest child = child.left; while (child.right!=null) child = child.right; return child; } Less

↳

find the right most element. If this is a right node with no children, return its parent. if this is not, return the largest element of its left child. Less

↳

One addition is the situation where the tree has no right branch (root is largest). In this special case, it does not have a parent. So it's better to keep track of parent and current pointers, if different, the original method by the candidate works well, if the same (which means the root situation), find the largest of its left branch. Less

### data question: dialoglog (userid int appid int type char , a flag either "imp" or "click" ds timestamp ) How would you access the quality of app? How to compute click-through rate (in mySQL)?

8 Answers↳

Assuming ctr is defined as total number clicks / total number of impressions (not counting each unique user's action) select appid, total_click_ct / total_imp_ct as ctr from ( select appid, count(distinct case when flag = 'imp' then 1 else 0 end) as total_imp_ct, count(distinct case when flag = 'click' then 1 else 0 end) as total_click_ct, from table where ts > x and ts < y group by appid) table2 order by ctr desc; Less

↳

re: assessing the quality of an app, some ideas: if it's your company's app, you can look at numbers of downloads and server traffic. You can survey your users, measure profitability, and track how often your users refer others to get the app. if it's not your app, you can look at ratings in the app store, cost, read reviews, and look at how well it does against your competitors' cost and ratings. you can look at specific features: does it do what's needed? does it look good? does it have a bunch of unnecessary features that nobody uses or wants? Does it clutter up your storage or run very slowly, i.e. is it optimized? you can look at the company's reputation, if you don't know anything about them, look at their website, does it look professional? does it have demos? can you test the app with a free trial before you buy? Less

↳

SELECT appid, sum(click) / count(*) AS crt FROM dialoglog GROUP BY appid;

### Write a sorting algorithm for a numerical dataset in Python.

8 Answers↳

def mergeSort(lst): split = len(lst) / 2 left = lst[:split] right = lst[split:] if len(left) > 1 or len(right) > 1: left = mergeSort(left) right = mergeSort(right) return merge(left, right) def merge(A, B): i=0 j=0 sorted_list = [] while i < len(A) and j < len(B): if A[i] <= B[j]: sorted_list.append(A[i]) i += 1 else: sorted_list.append(B[j]) j += 1 if i < len(A): sorted_list.extend(A[i:]) elif j < len(B): sorted_list.extend(B[j:]) return sorted_list Less

↳

list = ["1", "4", "2", "10", "5"] list = [ int(i) for i in list ] list.sort() print (list) Less

↳

def mysort(arr): if len(arr) pivot] return mysort(left) + middle + mysort(right) Less

### Given a table with columns country (with two-letter country abbreviation), count of requests sent, percentage of requests sent failed, condense down to this data grouped by country: country (one row per country), total count of requests sent, total count of requests sent failed.

7 Answers↳

SELECT country, SUM(count_of_request_sent) AS total_count_of_request_sent, SUM((count_of_request_sent * percent_of_request_failed) / 100) AS total_count_of_request_failed FROM table GROUP BY country; Less

↳

SELECT country, SUM(requests_sent), SUM((percent_failed/100) * requests_sent) FROM requests GROUP BY country; Less

↳

My assumption is that the requests_sent columns is not an integer. It might just have status as per 'failed", "successful" etc. Best to verify from the interviewer. That means we may not be able to take the sum of teh request_sent but have to use count instead. WITH CTE AS ( SELECT country, request_sent, (perc_failed * count(request_sent)) AS total_failed FROM T1 ) SELECT country, COUNT(request_sent), SUM(total_failed) GROUP BY 1 Less

### SQL question: given a table of interaction between users (user_a | user_b | day), find number of users who had more than 5 interactions yesterday (assume there is only one unique interaction between a pair of users per day). Product Question: A user satisfaction survey was conducted for two groups of facebook users (each with 50 k sample size). Group1: who had enabled certain login security features Group 2: who had not enabled these security features. It was found that user satisfaction with group1 was 30% lower than with group 2. Why do you think so? Comment on how the survey was conducted?

7 Answers↳

For the product question: This is an example of selection bias. The subset of users who opted in for the security features are not representative of the underlying population of FB users. They are probably more concerned about how their data is being used. Further, because the survey would have been completed on a voluntary basis, it represents another source of selection bias. These individuals are more likely to be engaged with the FB product than the general population, and care enough to complete the survey. Once again, these users are not a representative (or for that matter random) sample from the underlying population of FB users. This would explain why Group 1 had a much lower satisfaction rate as compared to Group 2, as they represent a specific subset of FB's users who care deeply about this issue, and by the virtue that they have the security feature on is indicative of the fact that they take data privacy seriously. For the SQL question: This is probably not the most efficient, but is the only thing that came to my mind: with all_users as ( select userA as userID, count(1) as interactions from table where date = current_date - interval 'day' group by userA union all select userB as userID, count(1) as interactions from table where date = current_date - interval 'day' group by userB ), engaged as ( select userID, sum(interactions) as totalInteractions from all_users group by userID having totalInteractions >= 5 ) select count(1) as greaterThanFive from engaged Less

↳

select c.user_id , sum(c.cnt) as sum_cnt from ( SELECT USER_A as user_id,count(user_b) cnt FROM interactions group by user_a union all SELECT USER_B as user_id,count(user_a) cnt FROM interactions group by user_B) c group by c.user_id having sum(c.cnt)>5 ; Less

↳

WITH CTE AS ( SELECT date, User_a AS user FROM users UNION ALL SELECT date, User_b AS user FROM users) SELECT date, user, count(DISTINCT user) AS interactions FROM CTE where date = DATE_SUB(date INTERVAL 1 day) GROUP BY 1,2 HAVING interactions > 5 Less

### Given a table of friend requests sent and friend requests received, find the user with the most friends.

7 Answers↳

request_sent (sender, target, date) s request_received (approver, source, date) r r.date > max(s.date) to get the most updated view in case of "unfriend" ** we'd still have a blind spot of unfriends who occurred and new requests werent sent. every friend is a result of a request sent and approved and adds a friend to 2 people (sender and approver). SELECT user, sum(cf) as num_friends FROM ( SELECT approver as user, count(distinct sender) as cf FROM s inner join r on s.sender=r.source and s.target=r.approver GROUP BY approver HAVING r.date>max(s.date) UNION ALL SELECT sender as user, count(distinct approver) as cf FROM s inner join r on s.sender=r.source and s.target=r.approver GROUP BY sender HAVING r.date>max(s.date) ) GROUP BY user Less

↳

Why do you need to use both the tables? Cant, we achieve this with only the second table? Less

↳

select userid, sum(friends) from (select approver as userid, count(distinct source) friends from request_received group by approver union select source, count(distinct approver) from request_received group BY source) f group by userid Less

### Given 25 swimmers and a pool with five lanes, what is the minimum number of heats needed to determine the three fastest swimmers in the group?

6 Answers↳

Assuming, "fastest time" means the use of a timing device. Sticking to the exact wording of the question, 5 heats are required. Pick the 3 fastest times of the 5 heats. Less

↳

Assuming no timing devices, and additionally, that a swimmer will swim at the same speed in each heat, the answer is 9. Taking the top 3 finishers from the first 5 heats, that leaves 15 swimmers. Because the 3rd-place finisher in any of these heats could be faster than all swimmers in other heats, additional heats must be run. The 2nd round of 3 heats would include the 5 first-place finishers, the 5 second-place finishers, and the 5 third-place finishers. Taking the top 3 from each of those heats, that leaves 9 swimmers. You would eliminate, however, the swimmers who finished 2nd and 3rd in the two heats that included the 2nd and 3rd place finishers. They necessarily would be slower than at least 3 other swimmers in the competition. You would also conclude that the winner of the Round 2 heat that included the 5 1st-place finishers is necessarily the fastest swimmer in the competition. That leaves 4 swimmers on the bubble after a total of 8 heats. The swimmer who won the Round 2 heat of 1st-place finishers, plus the top 2 finishers in the 9th and final heat of 4 swimmers, would be the fastest 3. Less

↳

Case 1: With timer - 5 heats - get the times, choose the top 3. Case 2: No timer - the answer should be 7. Process: Step 1: Divide everyone into 5 groups of 5 each. Run 5 heats - Lets name them Group 1 - 5 Step 2: Take all the first rankers of each heat (run above) and run another heat Now in this heat whoever is 4th and 5th are eliminated (with their groups) We also know that the one who is first in this heat is definitely first in the 25 people. Lets assume for simplicity that group 1 has the first ranker of this heat, group 2 has the second ranker and group 3 has the third ranker. Step 3: Take the 2nd and 3rd player from the group 1 (group with 1st ranker) and 1st and 2nd ranker of group 2 and first ranker of group 3 Basically the 3rd ranker of group 2 is not in the race because there are 3 people who are faster than this player (G1 P1, G2 P1, G2 P2) Similarly 2nd and 3rd of group 3 will also not qualify since there are three people faster than them already (G1 P1, G2 P1, G3 P1) In this heat the top two will be the 2nd and 3rd of the 25 people. Less