Hive provides a SQL-like query language on HDFS (Hadoop Distributed File System)

By Chun Kang - Last updated: Wednesday, March 14, 2012

Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).


Hive Query Language provides the following features:

- Basic SQL
- Extensibility

Below is an example of the Hive query language. The striking thing is how closely it follows standard SQL.

SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid

It is almost the same as the SQL used with a relational database (RDB). This is a great feature of Hive: programmers with RDB experience can start writing queries easily.
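As a rough illustration of what the query above computes, here is a minimal Python sketch. It is not how Hive executes the query; the function name and the sample rows are hypothetical, and the table is assumed to be a list of (pageid, userid) pairs.

```python
from collections import defaultdict


def distinct_users_per_page(page_views):
    """Count distinct userids per pageid, like
    SELECT pageid, COUNT(DISTINCT userid) FROM page_view GROUP BY pageid."""
    users_by_page = defaultdict(set)
    for pageid, userid in page_views:
        users = users_by_page[pageid]  # make sure every pageid gets a group
        if userid is not None:         # COUNT(DISTINCT ...) ignores NULLs
            users.add(userid)
    return {pageid: len(users) for pageid, users in users_by_page.items()}


# Hypothetical sample rows standing in for the page_view table.
sample_page_views = [
    ("home", "u1"),
    ("home", "u2"),
    ("home", "u1"),   # repeat visit, counted once
    ("about", "u3"),
]

print(distinct_users_per_page(sample_page_views))
```

The set per pageid is what makes the count "distinct": a user who views the same page twice is stored only once.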

Hive does not mandate read or written data be in the "Hive format"—there is no such thing. Hive works equally well on Thrift, control delimited, or your specialized data formats. Please see File Format and SerDe in the Developer Guide for details.

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.


The following is the data model for Hive.

<Picture: Hive data model>

References

https://cwiki.apache.org/confluence/display/Hive/Home

Hive ApacheCon 2008, New Orleans, LA (Ashish Thusoo, Facebook)

Filed in Computers & Internet

Apache HBase is a storage system, with roots in Hadoop, and uses HDFS for underlying storage.

By Chun Kang - Last updated: Thursday, March 1, 2012

Apache HBase is a storage system, with roots in Hadoop, from which it gets its "H". Though HBase uses HDFS for underlying storage, HBase is designed much more for fast and frequent access to blobs of binary data.

It is an example of what most would call a NoSQL column-oriented store; it holds semi-structured values for keys.

Below is the reference architecture based on HDFS, MapReduce, and HBase.

<Picture: reference architecture based on HDFS, MapReduce, and HBase>

In this architecture, MapReduce can be used for parallel computation over the stored data. I will look into HBase in more detail in the future to understand it better.

Reference

http://www.acunu.com/blogs/sean-owen/hadoop-universe/

http://hortonworks.com/technology/hortonworksdataplatform/

Filed in Computers & Internet

HDFS (Hadoop Distributed File System) is designed to run on commodity (low-cost) hardware

By Chun Kang - Last updated: Tuesday, February 28, 2012

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.


HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. The project URL is http://hadoop.apache.org/hdfs/.


The goal of HDFS

 

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster.
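HDFS stores each file as a sequence of blocks and replicates those blocks across machines for fault tolerance. The toy sketch below shows the idea only. The round-robin placement, the function name, the node names, and the sizes are all assumptions made for illustration; real HDFS uses its own placement policy.

```python
def place_replicas(file_size, block_size, replication, datanodes):
    """Split a file into fixed-size blocks and assign each block to
    `replication` different nodes (toy round-robin, not HDFS's policy)."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the replication factor")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# Hypothetical cluster: a 250-unit file, 100-unit blocks, 3 replicas, 4 nodes.
layout = place_replicas(250, 100, 3, ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in layout.items():
    print(block, nodes)
```

Because every block lives on three different nodes in this layout, losing any single node still leaves two copies of each block.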


MapReduce Software Framework

It offers a clean abstraction between data analysis tasks and the underlying systems challenges involved in ensuring reliable large-scale computation.


- Processes large jobs in parallel across many nodes and combines results.
- Eliminates the bottlenecks imposed by monolithic storage systems.
- Results are collated and digested into a single output after each piece has been analyzed.

 

References

http://hadoop.apache.org/common/docs/current/hdfs_design.html

http://www.cloudera.com/what-is-hadoop/hadoop-overview/

http://www.infoq.com/articles/data-mine-cloud-hadoop

Filed in Computers & Internet

Hadoop MapReduce is a software framework for processing vast amounts of data in-parallel on large clusters

By Chun Kang - Last updated: Tuesday, February 28, 2012

Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.

In other words, Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.


A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
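The flow described above (split, map, sort, reduce) can be sketched in a single Python process. This is only a simulation of the programming model using a word count; it does not use Hadoop, and the function names and sample chunks are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_words(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.split():
        yield (word.lower(), 1)


def reduce_counts(word, counts):
    """Reduce task: sum all counts emitted for one word."""
    return (word, sum(counts))


def run_job(chunks):
    # Map phase: each chunk is processed independently of the others.
    mapped = [pair for chunk in chunks for pair in map_words(chunk)]
    # The framework sorts map output by key before the reduce phase.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key, fed with that key's values.
    return [
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    ]


# Hypothetical input, already split into independent chunks.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
result = run_job(chunks)
print(result)
```

In a real cluster the map calls would run on different nodes and the sort would happen across the network, but the contract between the phases is the same.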


Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see below HDFS Architecture Diagram) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.


The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs’ component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in Java.

- Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI based).
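With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. The sketch below follows that convention for a word count. The function names and the local `simulate` helper are hypothetical; in a real job the mapper and reducer would be two separate scripts reading `sys.stdin` and writing `sys.stdout`.

```python
import io


def mapper(lines, out):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")


def reducer(sorted_lines, out):
    """Sum counts per word; input lines must already be sorted by key."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                out.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        out.write(f"{current}\t{total}\n")


def simulate(text):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    mapped = io.StringIO()
    mapper(io.StringIO(text), mapped)
    shuffled = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    reducer(shuffled, reduced)
    return reduced.getvalue()


print(simulate("b a\na b a\n"), end="")
```

The reducer only needs to remember the current key and a running total, because the sorted input guarantees that all lines for one key arrive together.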

 

References

- http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Inputs+and+Outputs

Filed in Computers & Internet

Apache Hadoop is designed to scale up from single servers to thousands of machines

By Chun Kang - Last updated: Tuesday, February 28, 2012

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.


It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Apache Hadoop is designed to scale up from single servers to thousands of machines

The above yellow elephant is the mascot for Hadoop.

Filed in Computers & Internet

Table of International Country Codes, Time Zones, and Dialing Prefix Lookup

By Chun Kang - Last updated: Wednesday, November 24, 2010

Here’s a lookup table of international country codes (dialing prefixes) and time zones. Countries that span several time zones list two GMT offsets, a start and an end.

Country | International dial code | Start GMT | End GMT
Albania | 355 | GMT+01:00 |
Algeria | 213 | GMT |
Andorra | 376 | GMT+01:00 |
Angola | 244 | GMT+01:00 |
Anguilla | 264 | GMT-04:00 |
Antigua and Barbuda | 268 | GMT-04:00 |
Argentina | 54 | GMT-03:00 |
Armenia | 374 | GMT+04:00 |
Aruba | 297 | GMT-04:00 |
Ascension Island | 247 | GMT |
Australia | 61 | GMT+10:00 | GMT+07:00
Austria | 43 | GMT+01:00 |
Azerbaijan | 994 | GMT+04:00 |
Bahamas | 242 | GMT-05:00 |
Bahrain | 973 | GMT+03:00 |
Bangladesh | 880 | GMT+06:00 |
Barbados | 246 | GMT-04:00 |
Belarus | 375 | GMT+03:00 |
Belgium | 32 | GMT+01:00 |
Belize | 501 | GMT-06:00 |
Benin | 229 | GMT+01:00 |
Bermuda | 441 | GMT-04:00 |
Bhutan | 975 | GMT+05:30 |
Bolivia | 591 | GMT-04:00 |
Bosnia | 387 | GMT+01:00 |
Botswana | 267 | GMT+02:00 |
Brazil | 55 | GMT-03:00 | GMT-05:00
Brunei | 673 | GMT+08:00 |
Bulgaria | 359 | GMT+02:00 |
Burkina Faso | 226 | GMT |
Burundi | 257 | GMT+02:00 |
Cambodia | 855 | GMT+07:00 |
Cameroon | 237 | GMT+01:00 |
Canada | 1 | GMT-04:00 | GMT-08:00
Cape Verde Islands | 238 | GMT-01:00 |
Cayman Islands | 345 | GMT-05:00 |
Central African Republic | 236 | GMT+01:00 |
Chad | 235 | GMT+01:00 |
Chile | 56 | GMT-04:00 |
China | 86 | GMT+08:00 |
Colombia | 57 | GMT-05:00 |
Comoros Island | 269 | GMT+03:00 |
Congo | 242 | GMT+01:00 |
Cook Islands | 682 | GMT-10:00 |
Costa Rica | 506 | GMT-06:00 |
Croatia | 385 | GMT+01:00 |
Cuba | 53 | GMT-03:00 |
Cyprus | 357 | GMT+02:00 |
Czech Republic | 420 | GMT+01:00 |
Democratic Republic of Congo (Zaire) | 243 | GMT+02:00 | GMT+01:00
Denmark | 45 | GMT+01:00 |
Diego Garcia | 246 | GMT+05:00 |
Djibouti | 253 | GMT+03:00 |
Dominica Islands | 767 | GMT-04:00 |
Dominican Republic | 809 | GMT-04:00 |
Ecuador | 593 | GMT-05:00 |
Egypt | 20 | GMT+02:00 |
El Salvador | 503 | GMT-06:00 |
Equatorial Guinea | 240 | GMT+01:00 |
Eritrea | 291 | GMT+03:00 |
Estonia | 372 | GMT+03:00 |
Ethiopia | 251 | GMT+03:00 |
Faeroe Islands | 298 | GMT |
Falkland Islands | 500 | GMT-04:00 |
Fiji Islands | 679 | GMT+12:00 |
Finland | 358 | GMT+02:00 |
France | 33 | GMT+01:00 |
French Guiana | 594 | GMT-04:00 |
French Polynesia | 689 | GMT-10:00 |
Gabon | 241 | GMT+01:00 |
Georgia | 995 | GMT+04:00 |
Germany | 49 | GMT+01:00 |
Ghana | 233 | GMT |
Gibraltar | 350 | GMT+01:00 |
Greece | 30 | GMT+02:00 |
Greenland | 299 | GMT-03:00 |
Grenada | 473 | GMT-04:00 |
Guadeloupe | 590 | GMT-04:00 |
Guam | 671 | GMT+10:00 |
Guatemala | 502 | GMT-06:00 |
Guinea Bissau | 245 | GMT-01:00 |
Guinea Republic | 224 | GMT |
Guyana | 592 | GMT-03:00 |
Haiti | 509 | GMT-05:00 |
Honduras | 504 | GMT-06:00 |
Hong Kong | 852 | GMT+08:00 |
Hungary | 36 | GMT+01:00 |
Iceland | 354 | GMT |
India | 91 | GMT+05:30 |
Indonesia | 62 | GMT+09:00 | GMT+07:00
Iran | 98 | GMT+03:30 |
Iraq | 964 | GMT+03:00 |
Ireland | 353 | GMT |
Israel | 972 | GMT+02:00 |
Italy | 39 | GMT+01:00 |
Ivory Coast | 225 | GMT |
Jamaica | 876 | GMT-05:00 |
Japan | 81 | GMT+09:00 |
Jordan | 962 | GMT+02:00 |
Kazakhstan | 7 | GMT+06:00 |
Kenya | 254 | GMT+03:00 |
Kiribati | 686 | GMT+12:00 |
Korea, North | 850 | GMT+09:00 |
Korea, South | 82 | GMT+09:00 |
Kuwait | 965 | GMT+03:00 |
Kyrgyzstan | 996 | GMT+06:00 |
Laos | 856 | GMT+07:00 |
Latvia | 371 | GMT+03:00 |
Lebanon | 961 | GMT+02:00 |
Lesotho | 266 | GMT+02:00 |
Liberia | 231 | GMT |
Libya | 218 | GMT+02:00 |
Liechtenstein | 423 | GMT+01:00 |
Lithuania | 370 | GMT+02:00 |
Luxembourg | 352 | GMT+01:00 |
Macau | 853 | GMT+08:00 |
Macedonia (FYROM) | 389 | GMT+01:00 |
Madagascar | 261 | GMT+03:00 |
Malawi | 265 | GMT+02:00 |
Malaysia | 60 | GMT+08:00 |
Maldives Republic | 960 | GMT+05:00 |
Mali | 223 | GMT |
Malta | 356 | GMT+01:00 |
Mariana Islands | 670 | GMT+10:00 |
Marshall Islands | 692 | GMT+10:00 |
Martinique | 596 | GMT-04:00 |
Mauritius | 230 | GMT+04:00 |
Mayotte Islands | 269 | GMT+03:00 |
Mexico | 52 | GMT-06:00 | GMT-08:00
Micronesia | 691 | GMT+10:00 |
Moldova | 373 | GMT+03:00 |
Monaco | 377 | GMT+01:00 |
Mongolia | 976 | GMT+08:00 |
Montserrat | 664 | GMT-04:00 |
Morocco | 212 | GMT |
Mozambique | 258 | GMT+02:00 |
Myanmar (Burma) | 95 | GMT+06:30 |
Namibia | 264 | GMT+02:00 |
Nauru | 674 | GMT+12:00 |
Nepal | 977 | GMT+05:30 |
Netherlands | 31 | GMT+01:00 |
Netherlands Antilles | 599 | GMT-04:00 |
New Caledonia | 687 | GMT+11:00 |
New Zealand | 64 | GMT+12:00 |
Nicaragua | 505 | GMT-06:00 |
Niger | 227 | GMT+01:00 |
Nigeria | 234 | GMT+01:00 |
Niue Island | 683 | GMT-11:00 |
Norfolk Island | 672 | GMT+11:30 |
Norway | 47 | GMT+01:00 |
Oman | 968 | GMT+04:00 |
Pakistan | 92 | GMT+05:00 |
Palau | 680 | GMT+09:00 |
Palestine | 970 | GMT+02:00 |
Panama | 507 | GMT-05:00 |
Papua New Guinea | 675 | GMT+10:00 |
Paraguay | 595 | GMT-04:00 |
Peru | 51 | GMT-05:00 |
Philippines | 63 | GMT+08:00 |
Poland | 48 | GMT+01:00 |
Portugal | 351 | GMT+01:00 |
Puerto Rico | 787 | GMT-04:00 |
Qatar | 974 | GMT+03:00 |
Reunion Island | 262 | GMT+04:00 |
Romania | 40 | GMT+02:00 |
Russia | 7 | GMT+03:00 |
Rwanda | 250 | GMT+02:00 |
Samoa (American) | 684 | GMT-11:00 |
Samoa (Western) | 685 | GMT-11:00 |
San Marino | 378 | GMT+01:00 |
Sao Tome & Principe | 239 | GMT |
Saudi Arabia | 966 | GMT+03:00 |
Senegal | 221 | GMT |
Serbia | 381 | GMT+01:00 |
Seychelles | 248 | GMT+04:00 |
Sierra Leone | 232 | GMT |
Singapore | 65 | GMT+08:00 |
Slovak Republic | 421 | GMT+01:00 |
Slovenia | 386 | GMT+01:00 |
Solomon Islands | 677 | GMT+11:00 |
Somalia | 252 | GMT+03:00 |
South Africa | 27 | GMT+02:00 |
Spain | 34 | GMT+01:00 |
Sri Lanka | 94 | GMT+05:30 |
St Helena | 290 | GMT |
St Kitts & Nevis | 869 | GMT-04:00 |
St Lucia | 758 | GMT-04:00 |
Sudan | 249 | GMT+02:00 |
Surinam | 597 | GMT-03:30 |
Swaziland | 268 | GMT+02:00 |
Sweden | 46 | GMT+01:00 |
Switzerland | 41 | GMT+01:00 |
Syria | 963 | GMT+02:00 |
Taiwan | 886 | GMT+08:00 |
Tajikistan | 992 | GMT+06:00 |
Tanzania | 255 | GMT+03:00 |
Thailand | 66 | GMT+07:00 |
The Gambia | 220 | GMT |
Togo | 228 | GMT |
Tonga | 676 | GMT+13:00 |
Trinidad & Tobago | 868 | GMT-04:00 |
Tunisia | 216 | GMT+01:00 |
Turkey | 90 | GMT+02:00 |
Turkmenistan | 993 | GMT+05:00 |
Turks & Caicos Islands | 649 | GMT-05:00 |
Tuvalu | 688 | GMT+12:00 |
Uganda | 256 | GMT+03:00 |
Ukraine | 380 | GMT+03:00 |
United Arab Emirates | 971 | GMT+04:00 |
United Kingdom | 44 | GMT |
Uruguay | 598 | GMT-03:00 |
USA | 1 | GMT-05:00 | GMT-11:00
Uzbekistan | 998 | GMT+06:00 |
Vanuatu | 678 | GMT+11:00 |
Venezuela | 58 | GMT-04:00 |
Vietnam | 84 | GMT+07:00 |
Wallis & Futuna Islands | 681 | GMT+12:00 |
Yemen Arab Republic | 967 | GMT+03:00 |
Zambia | 260 | GMT+02:00 |
Zimbabwe | 263 | GMT+02:00 |
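If you want to use the table programmatically, a small lookup could look like the sketch below. It holds only a handful of rows copied from the table above; the tuple layout and the function names are assumptions made for this example.

```python
# A few rows from the table: (country, dial code, start GMT, end GMT or None).
ROWS = [
    ("Canada", "1", "GMT-04:00", "GMT-08:00"),
    ("Japan", "81", "GMT+09:00", None),
    ("Kazakhstan", "7", "GMT+06:00", None),
    ("Russia", "7", "GMT+03:00", None),
    ("USA", "1", "GMT-05:00", "GMT-11:00"),
]


def countries_for_dial_code(rows, dial_code):
    """Return every country sharing a dial code (a code is not unique)."""
    return [country for country, code, _start, _end in rows if code == dial_code]


def row_for_country(rows, name):
    """Return the row for a country, matched case-insensitively, or None."""
    for row in rows:
        if row[0].lower() == name.lower():
            return row
    return None


print(countries_for_dial_code(ROWS, "1"))
print(row_for_country(ROWS, "japan"))
```

The dial code lookup returns a list because several countries can share one code, as Canada and the USA do with 1.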

 

Filed in Computers & Internet

Web Cache function in Network Gateway could cause internet service trouble

By Chun Kang - Last updated: Friday, October 15, 2010

Have you ever had trouble with an internet-connected device or internet-based software? Check whether your case matches the following:

1) The device works well at other people’s homes or on another ISP, but it does not work on your home network.

2) The public IP address shown by your router is not the same as the one reported by http://ip.kurapa.com.

If your case matches the points above, you need to contact your network administrator.

<Picture: Network Gateway Specifications>

 

A Web Cache feature has been added to network gateways since 2009 to improve network performance. It is a good feature in terms of QoS (Quality of Service), so some ISPs are adopting Web Cache-enabled network gateways. However, some Web Cache implementations have a bug that breaks OpenAPI-based internet applications such as Google Maps, Twitter, and Facebook.

The simplest way to fix the problems above is to turn off the Web Cache option. If your setup matches the description above, please contact your ISP’s network administrator right away.

Filed in Computers & Internet

Drinking diet shakes during pregnancy

By Chun Kang - Last updated: Tuesday, April 13, 2010

Diet shakes are intended to replace all or some portion of meals, with the goal of reducing calories. In and of themselves, they may make for a good “snack” during pregnancy, but they should not replace a well-balanced diet.

In general, dieting for weight loss is discouraged during pregnancy. The fetus needs a full supply of calories and nutrients for normal development. A balanced diet that allows for a total weight gain of about 30 to 35 pounds is usually sufficient for this.

The supplements added to many diet shakes present another safety issue. For example, the additional vitamin A in some shakes – on top of the amount in prenatal vitamins – may exceed the daily amount considered safe in pregnancy.

Filed in Babycare

Using my microwave oven during pregnancy

By Chun Kang - Last updated: Tuesday, April 13, 2010

The dangers of microwave radiation, much like the dangers of cell phone and other non-ionizing forms of radiation, have absolutely no basis in scientific fact. Microwave radiation can’t change the molecular structure of anything, because it simply doesn’t have enough energy to break apart chemical bonds.

To put it in perspective, plain old visible blue light has many, many times the energy of a microwave, and can break apart weak chemical bonds (this is what causes photochemical smog).

Microwaves actually have less energy than the infrared radiation (i.e. heat) that is given off by our bodies and the earth.

If you want to worry about radiation, worry about the small amount of UVB light that manages to reach the earth.

That radiation has enough energy to break apart the chemical bonds that make up DNA, causing cancer.

Microwaves, at 1/100,000 of the energy necessary to break apart chemical bonds, are closer to radio waves. All they do is heat up your food.
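The energy comparison can be checked with the photon energy formula E = h × f (or E = h × c / wavelength). The sketch below is a rough calculation under stated assumptions: an oven frequency of 2.45 GHz and a blue wavelength of 450 nm are typical values chosen for illustration, not figures from this post. Chemical bond energies are on the order of one to a few electronvolts (eV).

```python
PLANCK_J_S = 6.62607015e-34      # Planck constant, joule-seconds
LIGHT_SPEED_M_S = 299_792_458    # speed of light, metres per second
JOULES_PER_EV = 1.602176634e-19  # joules in one electronvolt


def photon_energy_ev_from_frequency(frequency_hz):
    """Photon energy in eV from frequency: E = h * f."""
    return PLANCK_J_S * frequency_hz / JOULES_PER_EV


def photon_energy_ev_from_wavelength(wavelength_m):
    """Photon energy in eV from wavelength: E = h * c / wavelength."""
    return PLANCK_J_S * LIGHT_SPEED_M_S / wavelength_m / JOULES_PER_EV


microwave_ev = photon_energy_ev_from_frequency(2.45e9)  # assumed oven frequency
blue_ev = photon_energy_ev_from_wavelength(450e-9)      # assumed blue wavelength

print(f"microwave photon: {microwave_ev:.2e} eV")
print(f"blue photon:      {blue_ev:.2f} eV")
print(f"blue / microwave: {blue_ev / microwave_ev:,.0f}")
```

Under these assumptions a microwave photon carries roughly a hundred-thousandth of an electronvolt, which is in line with the 1/100,000 figure above, while a blue photon carries a few electronvolts.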

Filed in Babycare

Taking vitamin C during pregnancy

By Chun Kang - Last updated: Tuesday, April 13, 2010

Too much Vitamin C can cause cell damage in the fetus.

You should consume a normal amount of vitamin C when you’re pregnant. The recommended daily amount is 85 mg for pregnant women age 19 and older. The maximum is 2,000 mg per day.

If you’re taking prenatal vitamins, you’ll be getting vitamin C in that supplement. You’ll also get some from the food you eat. If you decide to take more, remember to keep the total under 2,000 mg.

Filed in Babycare