Monday 16 June 2014

Hadoop interview questions

1)Can you explain big data analytics?

Big data is such a huge volume of complex data that it becomes very tedious to capture, store, process, retrieve, report on, and analyze it using traditional database management tools or hand-coded transaction-processing techniques.

What is Hadoop?

• Apache Hadoop is a software framework that supports the development of distributed, data-intensive applications.

• The Hadoop platform consists of the Hadoop kernel, the MapReduce component, and HDFS (the Hadoop Distributed File System).

• Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors.

• Hadoop is the best-known technology used for big data.

• Two languages are identified as original Hadoop languages: Pig and Hive.

• In a Hadoop system, data is distributed to thousands of nodes and processed in parallel.

• Hadoop addresses the complexities of big data's high volume, velocity, and variety.

• Hadoop is heavily focused on batch processing.

• Hadoop can reliably store petabytes of data.

• Data remains accessible even if a machine fails or is removed from the network.

• One can use MapReduce programs to access and manipulate the data. The developer does not have to worry about where the data is stored; it can be referenced through a single view provided by the master node, which stores the metadata for all files on the cluster.
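The MapReduce idea behind that last bullet can be shown with a minimal in-memory sketch (plain Python, not the actual Hadoop API): a map phase emits key/value pairs, and a reduce phase combines the values for each key.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for each word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts for each key, as a reducer would.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(pairs))  # {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

In real Hadoop, the framework handles the shuffle between the two phases and runs many mappers and reducers on different nodes; the logic above is only the word-count pattern itself.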



2)Explain the JobTracker in Hadoop. How many instances of the JobTracker run on a Hadoop cluster?

The JobTracker is the daemon that accepts submitted MapReduce jobs and monitors them in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs in its own JVM; in a typical production cluster it runs on a separate machine. Each slave node is configured with a TaskTracker. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs the following actions:

Client applications submit their jobs to the JobTracker.

The JobTracker talks to the NameNode to determine the location of the data.

The JobTracker locates TaskTracker nodes with available slots at or near the data.

The JobTracker submits the work to the chosen TaskTracker nodes.

The TaskTracker nodes are monitored. If they do not send heartbeat signals often enough, they are deemed to have failed and the work is scheduled on another TaskTracker.

A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do next: it may resubmit the job elsewhere, it may mark that specific record as something to be avoided, or it may even blacklist the TaskTracker as unreliable.

When the work is completed, the JobTracker updates its status.
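The heartbeat-based failure detection in the steps above can be sketched in a few lines. This is a simplified model, not JobTracker code; the timeout value and tracker names are hypothetical.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; hypothetical value for illustration

def find_failed_trackers(last_heartbeat, now):
    """Return the TaskTrackers whose last heartbeat is older than the timeout."""
    return [tracker for tracker, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT]

heartbeats = {"tracker-1": 100.0, "tracker-2": 85.0}
# At time 105, tracker-2 last reported 20 s ago, so it is considered failed
# and its tasks would be rescheduled on another TaskTracker.
print(find_failed_trackers(heartbeats, now=105.0))  # ['tracker-2']
```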

3)What is the main difference between HDFS and NAS?

HDFS is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences from them are significant. Some important differences between HDFS and NAS:

1)In HDFS, data blocks are distributed across the local disks of all the machines in a cluster, while in NAS the data is stored on dedicated hardware.

2)HDFS is designed to work with the MapReduce system, since the computation is moved to the data. NAS is not suitable for MapReduce, because its data is stored separately from the computation.


3)HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is served by a single machine and cannot provide data redundancy.

What is the difference between the MapReduce engine and the HDFS cluster?

The HDFS cluster is the name given to the whole configuration of master and slaves where the data is stored. The MapReduce engine is the programming module that is used to retrieve and analyze the data.

4)Does a map resemble a pointer?

No, a map is not like a pointer.

6)Do we need different servers for the NameNode and the DataNodes?

Yes, we need different servers for the NameNode and the DataNodes. This is because the NameNode requires a highly configured system, since it stores information about the location of all the files held on the different DataNodes, whereas the DataNodes only require systems with a low configuration.
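The NameNode's role can be pictured with a toy metadata table (the paths, block IDs, and node names below are made up for illustration): a client asks the NameNode only where the blocks live, then reads the bytes directly from those DataNodes.

```python
# Hypothetical NameNode-style metadata:
# file path -> list of (block id, [DataNodes holding a replica]).
namespace = {
    "/logs/2014-06-16.log": [
        ("blk_001", ["datanode-1", "datanode-3", "datanode-4"]),
        ("blk_002", ["datanode-2", "datanode-3", "datanode-5"]),
    ],
}

def locate_blocks(path):
    # The NameNode serves only metadata; actual data transfer
    # happens between the client and the DataNodes listed here.
    return namespace[path]

print(locate_blocks("/logs/2014-06-16.log")[0][1])  # ['datanode-1', 'datanode-3', 'datanode-4']
```

Because this lookup table for the entire cluster lives in the NameNode's memory, the NameNode machine needs far more RAM and reliability than any single DataNode, which matches the answer above.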

8)Why is the number of splits equal to the number of maps?

The number of maps is equal to the number of input splits because each split is processed by exactly one map task, so every split's key/value pairs are handled independently.

9)Which clients are using Hadoop tools? Give a few examples.
• Oracle
• A9.com
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• Facebook
• NSF-Google
• IBM
• LinkedIn
• Ning
• PARC
• Rackspace
• StumbleUpon
• Twitter
• Yahoo!



11)Is a job divided between maps?

No, a job is divided into maps. A split is created for the file, and the file is placed on the DataNodes in blocks. For each split, one map task is needed.
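The split-to-map relationship above reduces to simple arithmetic when, as in the default case, one input split corresponds to one HDFS block. A small sketch (the 64 MB figure is the classic Hadoop 1.x default block size):

```python
import math

def num_map_tasks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """One map task per input split; by default a split is one HDFS block."""
    return math.ceil(file_size_bytes / block_size_bytes)

# A 200 MB file on 64 MB blocks is stored as 4 blocks, hence 4 splits
# and 4 map tasks (3 full blocks plus one partial block).
print(num_map_tasks(200 * 1024 * 1024))  # 4
```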

15)What are the two types of 'writes' in HDFS?

There are two types of writes in HDFS: posted and non-posted. A posted write is when we write and forget about it, without bothering about the acknowledgement; it is similar to our traditional Indian post. In a non-posted write, we wait for the acknowledgement; it is similar to today's courier services. Naturally, a non-posted write is more expensive than a posted write, though both kinds of write are asynchronous.

17)Why is 'reading' done in parallel but 'writing' not, in HDFS?

Reading is done in parallel because it lets us access the data quickly. Writing, however, is not done in parallel, because parallel writes could result in data inconsistency. For example, if two nodes try to write data to the same file in parallel, the first node does not know what the second node has written, and vice versa, so it becomes ambiguous which data should be stored and served.
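The "many readers, one writer" rule described above can be modelled in miniature with a lock around the write path (a toy illustration, not HDFS internals):

```python
import threading

class SingleWriterFile:
    """Toy model of the rule: concurrent readers, but only one writer at a time."""

    def __init__(self):
        self._lock = threading.Lock()  # serializes writers
        self._data = []

    def read(self):
        # Reads need no coordination in this sketch.
        return list(self._data)

    def append(self, record):
        # Only one writer may append at a time, so the contents
        # can never be interleaved inconsistently.
        with self._lock:
            self._data.append(record)

f = SingleWriterFile()
for i in range(3):
    f.append(i)
print(f.read())  # [0, 1, 2]
```

Serializing the writers is exactly what removes the ambiguity described in the answer: each append sees the file state left by the previous one.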

19)Is Hadoop akin to the NoSQL database Cassandra?

Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database: it is a filesystem (HDFS) plus a distributed programming framework (MapReduce).


Why should we use Hadoop?

Daily, a large amount of unstructured data is being dumped into our machines. The major challenge is not just to store large data sets in our systems but to collect and analyze big data across an organization, including data that sits on different machines at different locations. This is where Hadoop comes in. Hadoop can analyze the data on different machines at different locations very quickly and very cost-effectively. It uses the concept of MapReduce, which allows it to divide a query into smaller pieces and process them in parallel. This is also known as parallel computing.
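The divide-and-process-in-parallel idea in that paragraph can be sketched with the standard library (this stands in for what Hadoop does across machines; the chunk size and predicate are arbitrary examples):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # The "smaller piece" of the query: count records matching a predicate.
    return sum(1 for record in chunk if record % 2 == 0)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Each chunk is processed independently, then the partial results
# are combined - the same split/compute/merge shape as MapReduce.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_chunk, chunks))
print(sum(partials))  # 50
```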
