Data engineers are IT professionals who are needed in almost every industry. Data engineers monitor data trends to determine the best next steps for companies. A critical part of a data engineer job is to process raw data into usable data by creating data pipelines and building data systems.

38,502 Data Engineer interview questions shared by candidates

Here are three top data engineer interview questions and tips on how to answer them:

How to answer: Before the interview, review your CV and/or portfolio and make a list of the programs you are most proficient with. If you find that you are lacking the expertise in a program that the company predominately uses, describe yourself as a highly motivated self-starter who will work tirelessly to learn the program(s).

How to answer: Highlight your role in relation to the larger organisation and other roles such as data scientists to clearly define your contribution to the overall system of business. Clarify the difference between a database-centric engineer and a pipeline-centric engineer.

How to answer: Research the company's software, data cloud products and use of Apache Hadoop to be prepared for this inquiry. Data engineers must be fluent in programming languages and data management systems used throughout the industry such as Apache Hadoop.

Data Scientist was asked...1 March 2016

↳

CREATE temporary table likes ( userid int not null, pageid int not null ) CREATE temporary table friends ( userid int not null, friendid int not null ) insert into likes VALUES (1, 101), (1, 201), (2, 201), (2, 301); insert into friends VALUES (1, 2); select f.userid, l.pageid from friends f join likes l ON l.userid = f.friendid LEFT JOIN likes r ON (r.userid = f.userid AND r.pageid = l.pageid) where r.pageid IS NULL; Less

↳

select w.userid, w.pageid from ( select f.userid, l.pageid from rollups_new.friends f join rollups_new.likes l ON l.userid = f.friendid) w left join rollups_new.likes l on w.userid=l.userid and w.pageid=l.pageid where l.pageid is null Less

↳

Use Except select f.user_id, l.page_id from friends f inner join likes l on f.fd_id = l.user_id group by f.user_id, l.page_id -- for each user, the unique pages that liked by their friends Except select user_id, page_id from likes Less

Data Scientist was asked...12 September 2013

↳

Bayesian stats: you should estimate the prior probability that it's raining on any given day in Seattle. If you mention this or ask the interviewer will tell you to use 25%. Then it's straight-forward: P(raining | Yes,Yes,Yes) = Prior(raining) * P(Yes,Yes,Yes | raining) / P(Yes, Yes, Yes) P(Yes,Yes,Yes) = P(raining) * P(Yes,Yes,Yes | raining) + P(not-raining) * P(Yes,Yes,Yes | not-raining) = 0.25*(2/3)^3 + 0.75*(1/3)^3 = 0.25*(8/27) + 0.75*(1/27) P(raining | Yes,Yes,Yes) = 0.25*(8/27) / ( 0.25*8/27 + 0.75*1/27 ) **Bonus points if you notice that you don't need a calculator since all the 27's cancel out and you can multiply top and bottom by 4. P(training | Yes,Yes,Yes) = 8 / ( 8 + 3 ) = 8/11 But honestly, you're going to Seattle, so the answer should always be: "YES, I'm bringing an umbrella!" (yeah yeah, unless your friends mess with you ALL the time ;) Less

↳

Answer from a frequentist perspective: Suppose there was one person. P(YES|raining) is twice (2/3 / 1/3) as likely as P(LIE|notraining), so the P(raining) is 2/3. If instead n people all say YES, then they are either all telling the truth, or all lying. The outcome that they are all telling the truth is (2/3)^n / (1/3)^n = 2^n as likely as the outcome that they are not. Thus P(ALL YES | raining) = 2^n / (2^n + 1) = 8/9 for n=3 Notice that this corresponds exactly the bayesian answer when prior(raining) = 1/2. Less

↳

26/27 is incorrect. That is the number of times that at least one friend would tell you the truth (i.e., 1 - probability that would all lie: 1/27). What you have to figure out is the odds it raining | (i.e., given) all 3 friends told you the same thing. Because they all say the same thing, they must all either be lying or they must all be telling the truth. What are the odds that would all lie and all tell the truth? In 1/27 times, they would the all lie and and in 8/27 times they would all tell the truth. So there are 9 ways in which all your friends would tell you the same thing. And in 8 of them (8 out of 9) they would be telling you the truth. Less

Data Engineer was asked...22 May 2020

↳

Here is the Python Code (inspired by someone's code on this page): def setUp(word, input_list): word = word.strip() temp_list = [] Ismatch = False if word in input_list: Ismatch = True elif word is None or len(word) == 0: Ismatch = False else: for w in input_list: if len(w) == len(word): temp_list.append(w) for j in range(len(temp_list)): count=0 for i in range(len(word)): if word[i] == temp_list[j][i] or word[i] == '.': count += 1 else: break if count == len(word): Ismatch = True print(Ismatch) def isMatch(word, input_list): return setUp(word, input_list) isMatch('c.t', ['cat', 'bte', 'art', 'drat', 'dart', 'drab']) Less

↳

bear in mind for your solution, checking the lengths of words in the dictionary is very fast. That's what you can use your setup for. There's no need to iterate through the whole loop of checks if the word fails the length already. See my solution above Less

↳

This was the fastest I could do without regex: def func(wrd,lst): if len(wrd) not in [len(x) for x in lst]: return False elif wrd in lst: return True else: lst1 = [x for x in lst if len(x)==len(wrd)] for z in lst1: c=0 for i in range(len(wrd)): if wrd[i] != '.' and wrd[i] == z[i]: c=c+1 if len(wrd)-wrd.count('.') == c: return True return False Less

Data Scientist, Analytics was asked...6 March 2015

↳

If you group by parent_id, you'll be leaving out all posts with zero comments.

↳

@ RLeung shouldn't you use left join? You are effectively losing all posts with zero comment. Less

↳

Here is the solution. You need a left self join that accounts for posts with zero comments. Select children , count(submission_id) from ( Select a.submission_id, count(b.submission_id) as children from Submissions a Left Join submissions b on On a.submission_id=b.parent_id Where a.parent_id is null Group by a.submission_id ) a Group by children Less

Data Scientist was asked...23 March 2017

↳

All these supposed answers are missing the point, and this question isn't even worded correctly. It should be lists of NUMBERS, not "objects". Anyway, the question is asking how you figure out the number that is missing from list B, which is identical to list A except one number is missing. Before getting into the coding, think about it logically - how would you find this? The answer of course is to sum all the numbers in A, sum all the numbers in B, subtract the sum of B from the sum of A, and that gives you the number. Less

↳

select b.element from b left join a on b.element = a.element where a.element is null Less

↳

In Python: (just numbers) def rem_elem_num(listA,listB): sumA = 0 sumB = 0 for i in listA: sumA += i for j in listB: sumB += j return sumA-sumB (general) def rem_elem(listA, listB): dictB = {} for j in listB: dictB[j] = None for i in listA: if i not in dictB: return i Less

Data Engineer was asked...22 May 2020

↳

if l == sorted(l) or l == sorted(l,reverse=True): print(True) else: print(False) Less

↳

python solution. inc = dec = True for i in range(len(nums)-1): if nums[i] > nums[i+1]: inc = false if nums[i] < nums[i+1]: dec = false return inc or dec Less

↳

An array is monotonic if and only if it is monotone increasing, or monotone decreasing. Since p = A[i+1] for all i indexing from 0 to len(A)-2. Note: Array with single element can be considered to be both monotonic increasing or decreasing, hence returns “True“. --------------------Python 3 code---------------------- # Check if given array is Monotonic def isMonotonic(A): return (all(A[i] = A[i + 1] for i in range(len(A) - 1))) # Test with an Array A = [6, 5, 4, 4] # Print required result print(isMonotonic(A)) Less

Data Scientist was asked...9 May 2016

↳

Can't tell you the solution of the ads analysis challenge. I would recommend getting in touch with the book author though. It was really useful to prep for all these interviews. SQL is a full outer join between life time count and last day count and then sum the two. Less

↳

Can you post here your solution for the ads analysis from the takehome challenge book. I also bought the book and was interested in comparing the solutions. Also can you post here how you solved the SQL question? Less

↳

for the SQL, I think both should work. Outer join between lifetime count and new day count and then sum columns replacing NULLs with 0, or union all between those two, group by and then sum. Less

Data Scientist was asked...29 March 2017

↳

For the questions 1: I think both options have the same expected value of 4 For the question 2: Use binomial distribution function. So basically, for one case to happen, you will use this function p(one case) = (0.96)^99*(0.04)^1 In total, there are 100 positions for the ad. 100 * p(one case) = 7.03% Less

↳

For "MockInterview dot co": The binomial part is correct but you argue that the expected value for option 2 is not 4 but this is false. In both cases E(x) = np = 100*(4/100) = 4 and E(x) = np=100*(1/25) = 4 again. Less

↳

Chance of getting exactly one add is ~7% As the formula is (NK) (0,04)^K * (0,96)^(N−K) where the first (NK) is the combination number N over K Less

Data Scientist was asked...29 March 2015

↳

Because at the beginning time, A has 8 and B has 6, so let A:x and B:y, then A:8+x-y and B:6-x+y; so there are 10/36 prob of B wins. And A wins prob is 21/36 and the equal prob for next round is 5/36. So for B wins at round prob is 10/36. And if they are equal and to have another round, the number has changed to 7 and 7. So A:7+x-y and B:7-x+y, so this time B wins has prob 15/36 and A wins has prob 15/36. And the equal to have another round is 6/36=1/6. So overall B wins in 2 rounds has prob 5/36*15/36. And for round 3,4,...etc, since after each equal round, the number will go back to 7 and 7 so the prob will not change. So B wins in round 3,4,...n has prob 5/36*(6/36)^(r-2)*15/36. r means the number of the total rounds. Less

↳

So many answers...Here's my version: For round1, B win only if it gets 3 or more stones than A, which is (A,B) = (1,4) (1,5) (1, 6) (2, 5) (2,6) (3,6) which is 6 cases out of all 36 probabilities. So B has 1/6 chance to win. To draw, B needs to get exactly 2 stones more than A, which is (A, B) = (1,3) (2,4) (3,5) (4,6) or 1/9. Entering the second round, all stones should be equal, so the chance to draw become 1/6, and the chance for either to win is 5/12. So the final answer is (1/6, 1/9*5/12, (1/9)^2*5/12, .....(1/9)^(n-1)*5/12) ) Less

↳

I don't get it. Shouldn't prob of B winning given it's tie at 1st round be 15/36? given it's tie at 1st round, at the 2nd round Nb > Na can happen if (B,A) is (2,1), (3,1/2),(4,1/2/3), (5,1/2/3/4),(6,1/2/3/4/5), which totals 15 out of 36. Less

Data Scientist Intern was asked...25 February 2012

↳

The above answer is also wrong; Node findSceondLargest(Node root) { // If tree is null or is single node only, return null (no second largest) if (root==null || (root.left==null && root.right==null)) return null; Node parent = null, child = root; // find the right most child while (child.right!=null) { parent = child; child = child.right; } // if the right most child has no left child, then it's parent is second largest if (child.left==null) return parent; // otherwise, return left child's rightmost child as second largest child = child.left; while (child.right!=null) child = child.right; return child; } Less

↳

find the right most element. If this is a right node with no children, return its parent. if this is not, return the largest element of its left child. Less

↳

One addition is the situation where the tree has no right branch (root is largest). In this special case, it does not have a parent. So it's better to keep track of parent and current pointers, if different, the original method by the candidate works well, if the same (which means the root situation), find the largest of its left branch. Less

perl developeranalytics engineermodeling engineerbusiness intelligence specialistetl developerdata warehouse architectdata scientistdata warehouse developeroracle data integratordata analystmachine learning scientistdatawarehouse developerbusiness objects developeretl testerdata modelerdata warehouse managerdata minerdatabase administratordatastage developer