Tuesday, November 16, 2010

Devoxx 2010 daily notes: day one

Hadoop Fundamentals: HDFS, MapReduce, Pig, and Hive

Hadoop has two core components:
  • HDFS
  • MapReduce

And an ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop.

HDFS

HDFS is a distributed file system modeled on GFS and layered on top of a native file system such as ext3. It provides redundant storage. It performs best with files of 100 MB or more and is optimized for large streaming reads. HDFS does not allow random writes, but appends are allowed from version 0.21. Files are split into blocks of 64 MB or 128 MB, and these blocks are distributed and replicated across nodes. A master node keeps track of the file-to-block mapping, and data nodes hold the actual blocks.
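As a concrete illustration of the write-once, streaming-read model, here is a minimal sketch using Hadoop's Java FileSystem API (the paths are illustrative, and the configuration is assumed to be picked up from the classpath):

```java
// Copy a local file into HDFS and stream it back; random writes are not
// supported, and the file is split into 64/128 MB blocks behind the scenes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);

    // The file is split into blocks and replicated across data nodes;
    // the master node records the file-to-block mapping.
    fs.copyFromLocalFile(new Path("/tmp/access.log"),
                         new Path("/logs/access.log"));

    // Reads are streaming: open and scan, no in-place updates.
    FSDataInputStream in = fs.open(new Path("/logs/access.log"));
    System.out.println("first byte: " + in.read());
    in.close();
    fs.close();
  }
}
```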

MapReduce

MapReduce is a programming model for distributing tasks across multiple nodes. Processing is automatically parallelized and fault tolerant, and it comes with a monitoring tool.

Abstractions are provided for Java and, through Hadoop Streaming, for scripting languages, with the following terminology:

  • job: full program
  • task: the execution of a single map or reduce over a slice of the data
  • task attempt: one try at executing a task (failed tasks are retried)
  • job tracker on the master node (the job and task manager) monitors jobs

MapReduce divides into two phases: map and reduce. Between them sits the sort-and-shuffle phase. Before a reduce can execute, every map must be finished, and each map's output is stored on the same node that read the input data. There can be a single reducer or multiple reducers, and their output is written to HDFS. One possible bottleneck is a slow mapper, since no reducer can be launched while any mapper is still running; another is a huge amount of data produced by the mappers.

A combiner performs a local, reducer-like aggregation of map output before the intermediate data is sent to the reducers, shrinking what goes over the network.

A job is defined by a driver class, usually in a main method, which can read its configuration from /etc/hadoop/conf and where the input and output directories must be specified. The Mapper, Reducer, and Combiner implementations are registered in the configuration, and the driver runs the job against it (JobConf). Mapper implementations extend MapReduceBase and implement Mapper (parameterized by the key/value types read and produced). The map method takes an OutputCollector and a Reporter to, respectively, collect the produced key/value pairs and aggregate data via Counters (visible in the GUI).
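To make that vocabulary concrete, here is a minimal word-count sketch against the old mapred API described above (input/output paths come from the command line; class names are illustrative):

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: extends MapReduceBase, parameterized by key/value read and produced.
  public static class TokenMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);                      // emit (word, 1)
        reporter.incrCounter("wordcount", "words", 1);  // Counter, visible in the GUI
      }
    }
  }

  // Reducer (also usable as a combiner): sums the counts for each word.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Driver: builds the JobConf, registers mapper/combiner/reducer and I/O dirs.
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(TokenMapper.class);
    conf.setCombinerClass(SumReducer.class); // local aggregation before the shuffle
    conf.setReducerClass(SumReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

Setting SumReducer as the combiner is exactly the local aggregation described above: each map node pre-sums its own (word, 1) pairs before anything crosses the network.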

The distributed cache pushes data to all slave nodes. To drive it from the command line, the driver should implement Tool and be invoked by the ToolRunner.
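A hedged sketch of that pattern, reusing the word-count classes above (the -files option shown in the comment is the generic option ToolRunner understands):

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already contains the generic options parsed by ToolRunner.
    JobConf conf = new JobConf(getConf(), WordCountTool.class);
    conf.setJobName("wordcount-tool");
    // mapper/combiner/reducer registration as in the WordCount driver above
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips generic options before calling run(), e.g.:
    //   hadoop jar wc.jar WordCountTool -files lookup.txt in/ out/
    // -files ships lookup.txt to every slave node via the distributed cache.
    System.exit(ToolRunner.run(new WordCountTool(), args));
  }
}
```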

Hive

Hive is built on top of MapReduce and provides a SQL-like interface:

  • a subset of SQL-92 plus Hive-specific extensions
  • no transactions
  • no indexes
  • no updates or deletes

Hive provides a metastore that holds the structure of each table and metadata about where its data lives in HDFS. It can copy a file from the local file system into a table, but performs no check at load time; failures only show up at query time.
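That load-time/query-time point can be seen through Hive's JDBC interface; a hedged sketch, assuming a HiveServer listening on localhost:10000 and the Hive 0.x driver class (table name and file path are illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // LOAD DATA only moves the file into the table's HDFS directory;
    // nothing validates the rows against the schema at this point.
    stmt.execute("LOAD DATA LOCAL INPATH '/tmp/pages.tsv' INTO TABLE pages");

    // A malformed file only fails (or yields NULLs) when a query reads it.
    ResultSet rs = stmt.executeQuery(
        "SELECT url, COUNT(1) FROM pages GROUP BY url");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    conn.close();
  }
}
```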

Pig

Pig is a data-flow language whose interpreter lives on the client side. In local mode it is executed by the LocalJobRunner against the local file system.
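A hedged local-mode sketch using Pig's embedded Java API (PigServer); the script and paths are illustrative:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLocalDemo {
  public static void main(String[] args) throws Exception {
    // ExecType.LOCAL runs the plan with the LocalJobRunner on the local FS.
    PigServer pig = new PigServer(ExecType.LOCAL);
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "output"); // writes part files under ./output
  }
}
```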

The latest innovations of Adobe Flash Platform for Java developers

Flash Player 10.1

Now, Flash Player supports:

  • multitouch and gestures
  • accelerometer
  • screen orientation
  • hardware acceleration

For optimization, Flash Player brings:

  • a sleep mode
  • memory usage reduced by 50%
  • reduced CPU usage for video through hardware acceleration
  • programmatic suppression of the stack-trace popup in production mode

AIR 2

New AIR platform version provides:

  • native process API (profile: extendedDesktop)
  • native installer generation (Windows, Android, iOS), but a shared library needs to be installed on the device
  • cross-compilation using LLVM (but some APIs cannot be implemented)

Flex 4

Flex 4 comes with many new features:

  • FXG framework for graphics
  • skins are separated from components
  • 3D API
  • new layout framework (don't forget to override updateDisplayList on layout classes)
  • asynchronous list and paging components
  • globalization API

Mobile

  • Needs Flex 4.5 (Hero)
  • provides a debugger, a packager, a web view, geolocation, and easy deployment

Flash Catalyst

  • Flash Catalyst narrows the gap between developers and designers
  • from a vector image (e.g. from Adobe Illustrator), we can specify which graphical elements should be interactive and generate a Flex project

LCDS

  • provides bridges to other technologies such as .NET, PHP, etc.

Live Cycle Collaboration Service

  • from $15/month
  • hosted and clustered by Adobe
  • components for dashboards, chat, and webcam management

Spring Developer Tools to push your Productivity

Spring focuses on providing tools for frameworks and languages to speed up application development. Spring Tool Suite comes with tc Server, Maven, and Spring Roo, and has auto-configuration capabilities: it detects tc Server and Tomcat at installation time.

While developing a web application, a good part of the lost time goes into stopping and restarting the server. tc Server comes with three refresh approaches:

  • standard: reload on change
  • JMX based: reload only if dynamic content has changed
  • agent based

Intelligent data analysis - Apache Mahout

Mahout is a tool mixing data mining, to extract patterns from data, and machine learning, to build models from data. As it is built on top of Hadoop, it can manage huge amounts of data. Its goal is to provide scalable data-mining algorithms, for example to analyse news from across the internet, group articles by subject, and eliminate duplicates. Another example is searching a photo collection for faces that look like a given one.
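The talk's examples are clustering and similarity search; as a small self-contained taste of the library, here is a hedged sketch of another Mahout family, a user-based recommender from the Taste package (file name, user ID, and parameters are illustrative):

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteDemo {
  public static void main(String[] args) throws Exception {
    // ratings.csv holds one "userID,itemID,preference" triple per line.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 1.
    List<RecommendedItem> items = recommender.recommend(1L, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}
```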
