Hadoop Fundamentals: HDFS, MapReduce, Pig, and Hive
Hadoop has two core components:
- HDFS
- MapReduce
And an ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop.
HDFS
HDFS is a distributed file system based on Google's GFS, layered on top of a native file system such as ext3. It provides redundant storage. It performs best with files of 100 MB or more and is optimized for large streaming reads. HDFS does not allow random writes, but appends are supported from version 0.21. Files are split into blocks of 64 MB or 128 MB, and these blocks are distributed and replicated across nodes. A master node keeps track of the file-to-block mapping, while the data nodes hold the actual blocks.
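As a minimal sketch of this model, here is the standard HDFS Java API, which only offers streaming writes and reads, never seek-and-overwrite (the /user/demo path below is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml etc. from the classpath (e.g. /etc/hadoop/conf).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS splits it into blocks and replicates them across data nodes.
    FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"));
    out.writeBytes("large streaming files work best\n");
    out.close();

    // Read it back as a stream.
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/user/demo/sample.txt"))));
    System.out.println(in.readLine());
    in.close();
  }
}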
MapReduce
MapReduce is a programming model for distributing tasks across multiple nodes. Processing is automatically parallelized and fault tolerant, and it comes with a monitoring tool.
Abstractions are provided for Java, and for scripting languages via Hadoop Streaming, with the following semantics:
- job: full program
- task: execution of map or reduce
- task attempt: a single try at executing a task (a failed task can be attempted again)
- job tracker: runs on the master node, manages jobs and tasks, and monitors jobs
MapReduce divides work into two phases: map and reduce. Between them sits the shuffle and sort phase. Before any reduce can execute, every map has to be finished, and the data it produces are stored on the same node as the one the input data were read from. There can be one or several reducers, and their output is written to HDFS. One possible bottleneck is a slow mapper, since no reducer can be launched while even a single mapper is unfinished; another is a huge amount of intermediate data produced by the mappers.
A combiner acts as a local mini-reducer: it pre-aggregates intermediate map output on the mapper's node before it is sent to the reducers, cutting down the data shuffled across the network.
A job is defined by a driver class, usually in a main method, which may read its configuration from /etc/hadoop/conf and where the input and output directories must be specified. The Mapper, Reducer, and Combiner implementations are set on the configuration, and the driver runs the job against that configuration (JobConf). Mapper implementations extend MapReduceBase and implement Mapper (parameterized by the key/value types read and produced). The map method takes an OutputCollector and a Reporter to, respectively, collect the produced key/value pairs and aggregate data via Counters (visible in the GUI).
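A minimal sketch of this layout, the classic word count written against the old org.apache.hadoop.mapred API described above (class names, the counter group, and paths are illustrative):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: extends MapReduceBase, parameterized by the key/value types
  // read (LongWritable offset, Text line) and produced (Text word, IntWritable count).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);                     // emit (word, 1)
        reporter.incrCounter("wordcount", "words", 1); // Counter, visible in the GUI
      }
    }
  }

  // Reducer: sums the counts for each word; also reusable as a combiner.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Driver: builds the JobConf, sets mapper/combiner/reducer and I/O dirs, runs the job.
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // local pre-aggregation before the shuffle
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}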
The distributed cache pushes read-only files to all slave nodes. To use it, the driver should implement Tool and be invoked through the ToolRunner, which parses the standard Hadoop command-line options.
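A hedged sketch of a driver combining Tool, ToolRunner, and the distributed cache (the CachedJob class and the /shared/lookup.txt path are assumptions for illustration):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CachedJob extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), CachedJob.class);
    // Ship a lookup file (path is illustrative) to every slave node;
    // tasks can then read it from their local working area.
    DistributedCache.addCacheFile(new Path("/shared/lookup.txt").toUri(), conf);

    // Mapper/reducer setup as in the word-count driver; without it the old
    // API falls back to the identity mapper and reducer.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic Hadoop options (-D, -files, ...) before run().
    int exitCode = ToolRunner.run(new CachedJob(), args);
    System.exit(exitCode);
  }
}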
Hive
Hive is built on top of MapReduce and provides a SQL-like interface:
- a subset of SQL-92 plus Hive-specific extensions
- no transactions
- no indexes
- no updates or deletes
Hive provides a metastore that holds the structure of each table and the location of its data in HDFS. Files can be copied from the local file system into a table, but no check is done at load time; failures are only found at query time.
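A sketch of this load-then-fail-late behaviour through the Hive JDBC driver of that generation (the host, port, table, and file names are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveLoadExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    stmt.executeQuery("CREATE TABLE logs (ts STRING, msg STRING) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

    // The file is copied as-is: no schema check happens at this point.
    stmt.executeQuery("LOAD DATA LOCAL INPATH '/tmp/logs.txt' INTO TABLE logs");

    // Malformed rows only surface now, at query time (e.g. as NULL columns).
    ResultSet rs = stmt.executeQuery("SELECT ts, msg FROM logs LIMIT 10");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getString(2));
    }
    con.close();
  }
}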
Pig
Pig is a data-flow language located on the client side. In local mode it is executed by the LocalJobRunner against the local file system.
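A small sketch of driving Pig from Java in local mode, assuming the PigServer embedding API; the script and file names are illustrative:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLocalExample {
  public static void main(String[] args) throws Exception {
    // LOCAL mode: LocalJobRunner, local file system, no cluster needed.
    PigServer pig = new PigServer(ExecType.LOCAL);
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "output"); // writes to ./output on the local file system
  }
}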
The latest innovations of the Adobe Flash Platform for Java developers
Flash Player 10.1
Now, Flash Player supports:
- multitouch and gestures
- accelerometer
- screen orientation
- hardware acceleration
For optimization, Flash Player brings:
- a sleep mode
- memory usage reduced by 50%
- CPU usage for video reduced by using hardware acceleration
- the stack-trace popup can be suppressed programmatically in production mode
AIR 2
New AIR platform version provides:
- native process API (profile: extendedDesktop)
- native installer generation (Windows, Android, iOS), but a shared library needs to be installed on the device
- cross-compilation using LLVM (but some APIs cannot be implemented)
Flex 4
Flex 4 comes with many new features:
- FXG framework for graphics
- skins are separated from components
- 3D API
- new layout framework (don't forget to override updateDisplayList on layout classes)
- asynchronous list and paging components
- globalization API
Mobile
- Needs Flex 4.5 (Hero)
- provides a debugger, a packager, a web view, geolocation, and easy deployment
Flash Catalyst
- Flash Catalyst reduces the gap between developers and designers
- from a vector image (e.g. from Adobe Illustrator), we can specify which graphical elements should be interactive and generate a Flex project
LCDS
- provides bridges to other technologies such as .NET, PHP, etc.
LiveCycle Collaboration Service
- from $15/month
- hosted on Adobe's clustered infrastructure
- components for dashboards, chat, and webcam management
Spring Developer Tools to Push Your Productivity
Spring focuses on providing tools for frameworks and languages to speed up application development. Spring Tool Suite comes with tc Server, Maven, and Spring Roo, and has auto-configuration capabilities: at installation time it detects tc Server and Tomcat.
While developing a web application, part of the time lost goes into stopping and restarting the server. tc Server comes with three refresh approaches:
- standard: reload on change
- JMX-based: reload only if dynamic content has changed
- agent-based
Intelligent data analysis - Apache Mahout
Mahout is a tool mixing data mining, to extract patterns from data, and machine learning, to build models from data. As it is built on top of Hadoop, it can manage huge amounts of data. Its goal is to provide scalable data-mining algorithms: for example, analysing news from across the internet to group articles by subject and eliminate duplicates, or searching a photo collection for faces that look like a given one.
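To make the grouping idea concrete, here is a minimal, library-free sketch of cosine similarity between term-frequency vectors, the kind of measure such clustering and deduplication rely on (Mahout supplies scalable, Hadoop-based versions of this):

import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

  // Build a term-frequency vector from raw text.
  static Map<String, Integer> termFrequencies(String text) {
    Map<String, Integer> tf = new HashMap<String, Integer>();
    for (String token : text.toLowerCase().split("\\W+")) {
      if (token.isEmpty()) continue;
      Integer count = tf.get(token);
      tf.put(token, count == null ? 1 : count + 1);
    }
    return tf;
  }

  // cos(a, b) = a.b / (|a| * |b|); values close to 1.0 flag near-duplicate articles.
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      Integer other = b.get(e.getKey());
      if (other != null) dot += e.getValue() * other;
      normA += e.getValue() * e.getValue();
    }
    for (int v : b.values()) normB += v * v;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double sim = cosine(termFrequencies("hadoop scales data processing"),
                        termFrequencies("hadoop scales processing of big data"));
    System.out.println("similarity = " + sim);
  }
}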