This article is Part 1 in a series that will take a closer look at the
architecture and methods of a Hadoop cluster, and how they relate to the
network and server infrastructure. The content presented here is
largely based on academic work and conversations I’ve had with customers
running real production clusters. If you run production Hadoop
clusters in your data center, I’m hoping you’ll provide your valuable
insight in the comments below. Subsequent articles to this will cover
the server and network architecture options in closer detail. Before we
do that though, let's start by learning some of the basics about how a
Hadoop cluster works. OK, let’s get started!
The three major categories of machine roles in a Hadoop deployment
are Client machines, Master nodes, and Slave nodes. The Master nodes
oversee the two key functional pieces that make up Hadoop: storing lots
of data (HDFS), and running parallel computations on all that data (Map
Reduce). The Name Node oversees and coordinates the data storage
function (HDFS), while the Job Tracker oversees and coordinates the
parallel processing of data using Map Reduce. Slave Nodes make up the
vast majority of machines and do all the dirty work of storing the data
and running the computations. Each slave runs both a Data Node and Task
Tracker daemon that communicate with and receive instructions from
their master nodes. The Task Tracker daemon is a slave to the Job
Tracker, and the Data Node daemon is a slave to the Name Node.
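A quick way to see which daemons ended up on which box is the JDK's jps command, which lists the running Java processes. The process IDs and layout below are only an illustrative sketch of a classic Hadoop 1.x cluster, not output from any specific machine.

# on a master node you would typically see something like:
$ jps
2817 NameNode
2993 JobTracker

# on a slave node, the two worker daemons:
$ jps
1984 DataNode
2115 TaskTracker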
Client machines have Hadoop installed with all the cluster settings,
but are neither a Master nor a Slave. Instead, the role of the Client
machine is to load data into the cluster, submit Map Reduce jobs
describing how that data should be processed, and then retrieve or view
the results of the job when it's finished. In smaller clusters (~40
nodes) you may have a single physical server playing multiple roles,
such as both Job Tracker and Name Node. In medium to large clusters, each
role will often run on its own dedicated server machine.
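To make the client role concrete, a rough sketch of that workflow on the classic Hadoop command line looks like the following; the file names, paths, jar, and class name are made-up placeholders, not taken from any particular cluster.

# copy a local file into HDFS
hadoop fs -put weblogs.txt /user/demo/input/

# submit a Map Reduce job packaged in a jar (jar and class names are placeholders)
hadoop jar log-analysis.jar com.example.LogAnalyzer /user/demo/input /user/demo/output

# read the job's output files once it has finished
hadoop fs -cat /user/demo/output/part-*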
In real production clusters there is no server virtualization, no
hypervisor layer. That would only amount to unnecessary overhead
impeding performance. Hadoop runs best on Linux machines, working
directly with the underlying hardware. That said, Hadoop does work in a
virtual machine. That’s a great way to learn and get Hadoop up and
running fast and cheap. I have a 6-node cluster up and running in
VMware Workstation on my Windows 7 laptop.
How to show devices on Linux
Two lspci variants show the same PCI device in different formats: -nnk prints the normal listing with numeric IDs plus the kernel driver and modules in use, while -vmmnn prints a machine-readable key/value record.
lspci -nnk
02:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] [1000:0079] (rev 05)
	Subsystem: Intel Corporation RAID Controller RS2BL040 [8086:9260]
	Kernel driver in use: megaraid_sas
	Kernel modules: megaraid_sas

lspci -vmmnn
Slot:	02:00.0
Class:	RAID bus controller [0104]
Vendor:	LSI Logic / Symbios Logic [1000]
Device:	MegaRAID SAS 2108 [Liberator] [0079]
SVendor:	Intel Corporation [8086]
SDevice:	RAID Controller RS2BL040 [9260]
Rev:	05
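If the machine has many PCI devices, the listing can be narrowed to a single entry. The slot address below is the one from the output above, and the grep pattern is only an assumption to adjust for the device you are after.

# show only the device in slot 02:00.0, with driver details
lspci -s 02:00.0 -nnk

# or filter the full listing by class name (pattern is an assumption)
lspci -nnk | grep -i -A 3 'raid'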