Apache Sqoop – Overview

Sqoop Connectors

Using specialized connectors, Sqoop can connect with external systems
that have optimized import and export facilities, or do not support
native JDBC. Connectors are plugin components based on Sqoop’s extension
framework and can be added to any existing Sqoop installation. Once a
connector is installed, Sqoop can use it to efficiently transfer data
between Hadoop and the external store supported by the connector.

By default Sqoop includes connectors for various popular databases such
as MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes
fast-path connectors for MySQL and PostgreSQL databases. Fast-path
connectors are specialized connectors that use database-specific batch
tools to transfer data with high throughput. Sqoop also includes a
generic JDBC connector that can be used to connect to any database that
is accessible via JDBC.
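
As a rough illustration (not part of the original post), a fast-path
transfer can be requested with the --direct option, and the generic JDBC
connector can be pointed at an arbitrary database by naming its driver
class explicitly. The second connection URL and driver class below are
placeholders:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** --direct

$ sqoop import --connect jdbc:exampledb://dbhost/acmedb \
  --driver com.example.jdbc.ExampleDriver \
  --table ORDERS --username test --password ****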

Apart from the built-in connectors, many companies have developed their
own connectors that can be plugged into Sqoop. These range from
specialized connectors for enterprise data warehouse systems to NoSQL
datastores.


Importing Data into Hive

In most cases, importing data into Hive is the same as running the
import task and then using Hive to create and load a certain table or
partition. Doing this manually requires that you know the correct type
mapping between the data and other details like the serialization format
and delimiters. Sqoop takes care of populating the Hive metastore with
the appropriate metadata for the table and also invokes the necessary
commands to load the table or partition as the case may be. All of this
is done by simply specifying the option --hive-import with the import
command.


$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** --hive-import


When you run a Hive import, Sqoop converts the data from the native
datatypes within the external datastore into the corresponding types
within Hive. Sqoop automatically chooses the native delimiter set used
by Hive. If the data being imported has new line or other Hive delimiter
characters in it, Sqoop allows you to remove such characters and get the
data correctly populated for consumption in Hive.
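
One way to do that, sketched here as a variation of the command above,
is the --hive-drop-import-delims option, which strips newline and other
Hive delimiter characters from string fields during the import:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hive-import --hive-drop-import-delims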

Once the import is complete, you can see and operate on the table just
like any other table in Hive.
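
For example, a quick sanity check from the Hive command line might look
like this (purely illustrative):

$ hive -e "SELECT COUNT(*) FROM ORDERS"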

Importing Data into HBase

You can use Sqoop to populate data in a particular column family within
the HBase table. Much like the Hive import, this can be done by
specifying the additional options that relate to the HBase table and
column family being populated. All data imported into HBase is converted
to its string representation and inserted as UTF-8 bytes.


$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hbase-create-table --hbase-table ORDERS --column-family mysql


In this command the various options specified are as follows:

--hbase-create-table: This option instructs Sqoop to create the HBase
table.

--hbase-table: This option specifies the table name to use.

--column-family: This option specifies the column family name to use.

The rest of the options are the same as those for the regular import
operation.
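
If you also need control over the row key, Sqoop accepts an
--hbase-row-key option that names the input column to use as the HBase
row key. In this hypothetical variation, ORDER_ID is assumed to be a
column of the source table:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hbase-create-table --hbase-table ORDERS --column-family mysql \
  --hbase-row-key ORDER_ID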


Exporting Data

In some cases data processed by Hadoop pipelines may be needed in
production systems to help run additional critical business functions.
Sqoop can be used to export such data into external datastores as
necessary. Continuing our example from above – if data generated by the
pipeline on Hadoop corresponded to the ORDERS table in a database
somewhere, you could populate it using the following command:


$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS


In this command the various options specified are as follows:

export: This is the sub-command that instructs Sqoop to initiate an
export.

--connect, --username, --password: These are connection parameters
that are used to connect with the database. This is no different from
the connection parameters that you use when connecting to the database
via a JDBC connection.

--table: This parameter specifies the table which will be populated.

--export-dir: This is the directory from which data will be exported.

Export is done in two steps as depicted in Figure 2. The first step is
to introspect the database for metadata, followed by the second step of
transferring the data. Sqoop divides the input dataset into splits and
then uses individual map tasks to push the splits to the database. Each
map task performs this transfer over many transactions in order to
ensure optimal throughput and minimal resource utilization.

Figure 2: Sqoop Export Overview
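
The number of map tasks, and therefore the number of concurrent
connections opened against the database, can be capped with the -m
(--num-mappers) option. The value below is a sketch, not a
recommendation:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS -m 4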

Some connectors support staging tables that help isolate production
tables from possible corruption in case of job failures. Staging
tables are first populated by the map tasks and then merged into the
target table once all of the data has been delivered.
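
Where the connector supports it, the staging table is chosen on the
command line; ORDERS_STAGING below is a placeholder name for a table
shaped like the target, and --clear-staging-table empties it before the
export starts:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS \
  --staging-table ORDERS_STAGING --clear-staging-table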


Apache Sqoop – Overview

Using Hadoop for analytics and data processing requires loading data
into clusters and processing it in conjunction with other data that
often resides in production databases across the enterprise. Loading
bulk data into Hadoop from production systems or accessing it from map
reduce applications running on large clusters can be a challenging task.
Users must consider details like ensuring consistency of data, the
consumption of production system resources, and preparing data for
provisioning downstream pipelines. Transferring data using scripts is
inefficient and time consuming. Directly accessing data residing on
external systems from within the map reduce applications complicates
applications and exposes the production system to the risk of excessive
load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing
incubation at the Apache Software Foundation. More information on this
project can be found at http://incubator.apache.org/sqoop.

Sqoop allows easy import and export of data from structured data stores
such as relational databases, enterprise data warehouses, and NoSQL
systems. Using Sqoop, you can provision data from external systems
onto HDFS, and populate tables in Hive and HBase. Sqoop integrates with
Oozie, allowing you to schedule and automate import and export tasks.
Sqoop uses a connector based architecture which supports plugins that
provide connectivity to new external systems.

What happens underneath the covers when you run Sqoop is very
straightforward. The dataset being transferred is sliced up into
different partitions and a map-only job is launched with individual
mappers responsible for transferring a slice of this dataset. Each
record of the data is handled in a type safe manner since Sqoop uses the
database metadata to infer the data types.
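
As a rough sketch of how that slicing is controlled, the --split-by
option names the column used to partition the dataset and -m sets the
number of mapper slices; ORDER_ID here is a hypothetical column of the
ORDERS table used in the examples in this post:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --split-by ORDER_ID -m 8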

In the rest of this post we will walk through an example that shows the
various ways you can use Sqoop. The goal of this post is to give an
overview of Sqoop operation without going into much detail or advanced
functionality.


Importing Data

The following command is used to import all data from a table called
ORDERS from a MySQL database:


$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password ****


In this command the various options specified are as follows:

import: This is the sub-command that instructs Sqoop to initiate an
import.

--connect, --username, --password: These are connection parameters
that are used to connect with the database. This is no different from
the connection parameters that you use when connecting to the database
via a JDBC connection.

--table: This parameter specifies the table which will be imported.

The import is done in two steps as depicted in Figure 1 below. In the
first step, Sqoop introspects the database to gather the necessary
metadata for the data being imported. The second step is a map-only
Hadoop job that Sqoop submits to the cluster. It is this job that does
the actual data transfer using the metadata captured in the previous
step.

Figure 1: Sqoop Import Overview

The imported data is saved in a directory on HDFS based on the table
being imported. As is the case with most aspects of Sqoop operation, the
user can specify any alternative directory where the files should be
populated.
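
For example, a hypothetical destination directory can be supplied with
the --target-dir option:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --target-dir /user/example/orders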

By default these files contain comma delimited fields, with new lines
separating different records. You can easily override the format in
which data is copied over by explicitly specifying the field separator
and record terminator characters.
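
For instance, tab-separated fields with newline-terminated records
could be requested as follows (a sketch reusing the placeholder
connection details from above):

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --fields-terminated-by '\t' --lines-terminated-by '\n'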

Sqoop also supports different data formats for importing data. For
example, you can easily import data in Avro data format by simply
specifying the option --as-avrodatafile with the import command.
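
Using the same placeholder connection details, that looks like:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --as-avrodatafile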

There are many other options that Sqoop provides which can be used to
further tune the import operation to suit your specific requirements.
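
A few of them, shown purely as an illustration: --where pushes a row
filter down to the source database, -m limits the number of parallel
map tasks, and -z compresses the output (the order_date column is
hypothetical):

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --where "order_date >= '2011-01-01'" -m 4 -z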


Wrapping Up

In this post you saw how easy it is to transfer large datasets between
Hadoop and external datastores such as relational databases. Beyond
this, Sqoop offers many advanced features such as different data formats,
compression, working with queries instead of tables, etc. We encourage
you to try out Sqoop and give us your feedback.

More information regarding Sqoop can be found at:

Project Website: http://incubator.apache.org/sqoop

Wiki: https://cwiki.apache.org/confluence/display/SQOOP

Project Status:  http://incubator.apache.org/projects/sqoop.html

Mailing Lists:
https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists
