Spark - Spark SQL and DataFrame

Spark SQL provides a programming abstraction called the DataFrame and can also act as a distributed SQL query engine.

DataFrame

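A distributed collection of data organized into named columns, equivalent to a table in a relational database or a data frame in R/Python.
The API is available in four languages: Scala, Java, Python, and R.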

SQLContext and Its Subclasses

The entry point into Spark SQL.

HiveContext

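A subclass of SQLContext. It adds the HiveQL parser, access to Hive UDFs, and the ability to read Hive tables.
An existing Hive installation is not required. (It may be merged into SQLContext in the future.)

The SQL parser can be selected: SQLContext has only one option, while HiveContext has two…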

Example

// `sc` is an existing SparkContext, e.g. the one spark-shell provides.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Read a JSON file into a DataFrame.
val df = sqlContext.read.json("examples/src/main/resources/people.json")

A closer look at sqlContext.read: it returns a DataFrameReader (@Experimental)

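Introduction to the Java Language Specification - Execution

JVM Startup

The JVM starts execution by invoking the main method of a specified class, passing it a single argument, which is an array of strings. For the precise semantics, see Chapter 5 of the JVM Specification.
The examples below all use a class named Test.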

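Loading the Class

The first attempt to execute the main method of class Test discovers that the class is not loaded, that is, the JVM does not currently contain a binary representation of it. The JVM then uses a class loader to try to find such a binary representation. If the search fails, an error is thrown.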

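Linking: Verification, Preparation, (Optional) Resolution

After the class is loaded, it must be initialized before main can be invoked. Every class or interface type must be linked before it is initialized. Linking involves verification, preparation, and (optionally) resolution.
Verification checks that the loaded binary representation is well-formed, with a proper symbol table. It also checks that the code obeys the semantic requirements of the language and the JVM.
Preparation involves allocating static storage and any data structures used internally by the JVM, such as method tables.
Resolution checks the symbolic references from the class to other classes and interfaces, by loading those classes and interfaces and checking that the references are correct.
The resolution step is optional at the time of initial linkage.
One implementation strategy is to resolve symbolic references early and recursively, much like static linking (where the compiled program file contains a fully linked version of the program).
Another strategy is to resolve a symbolic reference only when it is actually used, which is a lazy form of resolution. Resolution then happens as execution proceeds, and references that are never executed may never be resolved.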

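Initialization

The main method is executed only after the class has been initialized.
Initialization consists of executing the class variable initializers and static initializers in textual order. Before a class is initialized, its direct superclass must be initialized, and so on recursively up to the root superclass, Object.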
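
The ordering rules above can be observed with a small sketch. The class names (InitOrderDemo, Parent, Child) and the logging helper are made up for illustration; the sketch only assumes that reading a non-constant static field counts as a use that triggers initialization.

```java
import java.util.ArrayList;
import java.util.List;

/** Records the order in which static initializers run (all names are illustrative). */
final class InitOrderDemo {
    static final List<String> LOG = new ArrayList<>();

    static class Parent {
        static { LOG.add("Parent static block"); }
    }

    static class Child extends Parent {
        // Initializers and static blocks run in the order they appear in the source.
        static int first = record("Child.first initializer");
        static { LOG.add("Child static block"); }
        static int second = record("Child.second initializer");

        static int record(String what) {
            LOG.add(what);
            return LOG.size();
        }
    }

    /** Triggers initialization of Child (and therefore of Parent) and returns the log. */
    static List<String> run() {
        int ignored = Child.second; // first use of Child; later calls do not re-initialize
        return LOG;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Running main prints the Parent entry first, followed by the three Child entries in source order.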

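Invocation

The main method must be declared public, static, and void, and it must accept a single argument that is an array of strings. The parameter can be written as String[] args or as String... args.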
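
As a quick check of these requirements, the sketch below declares main in the varargs form and inspects it reflectively. The class name MainSignatureDemo and the helper method are hypothetical; only java.lang.reflect calls from the standard library are used.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

final class MainSignatureDemo {
    // Varargs form of the entry point; String... is compiled as String[].
    public static void main(String... args) {
        System.out.println("conventional entry point: " + isConventionalEntryPoint());
    }

    /** Reflectively checks that main is public, static, void, and takes a String[]. */
    static boolean isConventionalEntryPoint() {
        try {
            Method main = MainSignatureDemo.class.getMethod("main", String[].class);
            int mods = main.getModifiers();
            return Modifier.isPublic(mods)
                    && Modifier.isStatic(mods)
                    && main.getReturnType() == void.class;
        } catch (NoSuchMethodException e) {
            return false; // no public main(String[]) was found
        }
    }
}
```

The lookup with String[].class succeeds for the String... declaration, which shows that the two spellings describe the same parameter type.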

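Loading of Classes and Interfaces

Loading is the process of finding the binary form of a class or interface with a particular name. The binary form may be computed on the fly, but more typically it is a binary representation that a compiler previously produced from source code. A Class object is then constructed from that binary form.
Formally, the binary format of a class or interface is the class file format described in the JVM Specification, but other formats may be supported as long as they meet the requirements described in the next chapter. The defineClass method of class ClassLoader can construct Class objects from binary representations in the class file format.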

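A well-behaved class loader maintains the following properties:

  • Given the same name, it always returns the same Class object.
  • If class loader L1 delegates the loading of class C to class loader L2, then for any type T that is the direct superclass or a direct superinterface of C, the type of a field of C, the type of a formal parameter of a method or constructor of C, or the return type of a method of C, L1 and L2 should return the same Class object.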
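
The first property, and the common delegation case, can be checked with a minimal sketch. The class name SameClassObjectDemo and its methods are hypothetical, and the sketch assumes the class is loaded by an ordinary application class loader that delegates core classes to a parent loader.

```java
final class SameClassObjectDemo {
    /** Loads this class by name twice and compares the resulting Class objects. */
    static boolean sameNameSameObject() {
        ClassLoader loader = SameClassObjectDemo.class.getClassLoader();
        String name = SameClassObjectDemo.class.getName();
        try {
            Class<?> first = loader.loadClass(name);
            Class<?> second = loader.loadClass(name);
            return first == second && first == SameClassObjectDemo.class;
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Requests a core class through this class's loader; the request is delegated upward. */
    static boolean delegationAgrees() {
        ClassLoader loader = SameClassObjectDemo.class.getClassLoader();
        try {
            return loader.loadClass("java.lang.String") == String.class;
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("same name, same object: " + sameNameSameObject());
        System.out.println("delegation agrees: " + delegationAgrees());
    }
}
```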

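The Loading Process

Loading is implemented by class ClassLoader and its subclasses. Different subclasses may implement different loading policies. A class loader may also cache binary representations of classes and interfaces, prefetch them based on expected usage, or load a group of related classes together. These activities are not always completely transparent to a running application; for example, a newly compiled version of a class may not be found because the class loader has cached an older version. It is the class loader's responsibility, however, to report loading errors only at points in the program where they could have arisen without prefetching or group loading.
If an error occurs during class loading, an instance of one of the following subclasses of LinkageError is thrown:

  • ClassCircularityError: the class (interface) could not be loaded because it would be its own superclass (superinterface)
  • ClassFormatError: the binary data for the class (interface) is malformed
  • NoClassDefFoundError: the relevant class loader could find no definition of the class (interface)

Because loading involves allocating memory for new data structures, it may also fail with an OutOfMemoryError.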
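
One of these errors is easy to provoke deliberately. The sketch below hands defineClass four bytes that are not a class file and reports which LinkageError subclass results. GarbageLoader and the class name "Garbage" are hypothetical names used only for this illustration.

```java
final class LoadErrorDemo {
    /** A loader that tries to define a class from bytes that are not a class file. */
    static final class GarbageLoader extends ClassLoader {
        Class<?> defineGarbage() {
            byte[] notAClassFile = {1, 2, 3, 4}; // wrong magic number, far too short
            return defineClass("Garbage", notAClassFile, 0, notAClassFile.length);
        }
    }

    /** Returns the simple name of the LinkageError subclass thrown, or "defined". */
    static String tryDefineGarbage() {
        try {
            new GarbageLoader().defineGarbage();
            return "defined";
        } catch (LinkageError e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryDefineGarbage());
    }
}
```

Catching LinkageError rather than the specific subclass keeps the sketch honest about what was actually thrown.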

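Linking of Classes and Interfaces

Linking is the process of taking the binary form of a class (interface) and combining it into the run-time state of the JVM so that it can be executed. A class or interface is always loaded before it is linked.
Linking involves verification, preparation, and resolution of symbolic references.
The precise semantics of linking are given in Chapter 5 of the JVM Specification.
Verification and preparation must be complete before initialization. Errors detected during linking are thrown at a point in the program where some action is taken that might require linkage to the class or interface involved.
For example, an implementation may choose to resolve each symbolic reference only when it is used (lazy resolution), or to resolve them all at once while the class is being verified (static resolution). This means that in some implementations resolution may continue after a class (interface) has been initialized.
Because linking involves allocating memory for new data structures, it may also fail with an OutOfMemoryError.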

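Verification of the Binary Representation

Verification ensures that the binary representation is structurally correct. For example, it checks that every instruction has a valid operation code, that every branch instruction branches to the start of another instruction rather than into the middle of one, that every method has a structurally correct signature, and that every instruction obeys the type discipline of the JVM language.
If verification fails, an instance of the following subclass of LinkageError is thrown:

  • VerifyError: verification failed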

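Preparation

Preparation creates the static fields (class variables and constants) and initializes them to their default values. This does not require executing any source code. Explicit initializers for static fields are executed as part of initialization, not preparation.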
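
The split between preparation and initialization becomes visible when one static initializer reads a field whose own initializer has not run yet. This is a minimal sketch with hypothetical names (PreparationDemo, Holder, peek); the read goes through a method because a direct forward reference would be rejected by the compiler.

```java
final class PreparationDemo {
    static class Holder {
        // Runs first: 'later' still holds the default value assigned during preparation.
        static int early = peek();
        // Runs second, as part of initialization.
        static int later = 42;

        static int peek() {
            return later;
        }
    }

    /** Returns {early, later} as observed after Holder has been initialized. */
    static int[] observe() {
        return new int[] { Holder.early, Holder.later };
    }

    public static void main(String[] args) {
        int[] values = observe();
        System.out.println("early=" + values[0] + ", later=" + values[1]);
    }
}
```

Here early ends up as 0, the default value for int, while later ends up as 42.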

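Note: some terms may be translated imprecisely.

Introduction to the Java Language Specification - Grammars

Context-Free Grammars

A context-free grammar consists of a number of productions. Each production has a nonterminal as its left-hand side and a sequence of one or more nonterminal and terminal symbols as its right-hand side. For each grammar, the terminal symbols are drawn from a specified alphabet.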
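
To make the terms concrete, here is a sketch of a recognizer for a tiny grammar. The grammar is invented for this note and is not taken from the specification: Expr and Term are the nonterminals, and the alphabet of terminals is x, +, ( and ). Each method corresponds to the productions of one nonterminal.

```java
/**
 * Recognizer for a made-up grammar (illustrative only):
 *   Expr: Term | Term + Expr
 *   Term: x | ( Expr )
 */
final class TinyGrammarDemo {
    private final String input;
    private int pos;

    private TinyGrammarDemo(String input) {
        this.input = input;
    }

    /** Returns true if the whole input can be derived from Expr. */
    static boolean matches(String input) {
        TinyGrammarDemo parser = new TinyGrammarDemo(input);
        return parser.expr() && parser.pos == input.length();
    }

    // Expr: Term | Term + Expr
    private boolean expr() {
        if (!term()) return false;
        if (peek() == '+') {
            pos++;
            return expr();
        }
        return true;
    }

    // Term: x | ( Expr )
    private boolean term() {
        if (peek() == 'x') {
            pos++;
            return true;
        }
        if (peek() == '(') {
            pos++;
            if (!expr() || peek() != ')') return false;
            pos++;
            return true;
        }
        return false;
    }

    /** Next terminal, or '\0' at end of input. */
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    public static void main(String[] args) {
        System.out.println(matches("x+(x+x)"));
        System.out.println(matches("x+"));
    }
}
```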

Note: some terms may be translated imprecisely.

Linux Commands - lsof

lsof [ -?abChlnNOPRtUvVX ] [ -A A ] [ -c c ] [ +c c ] [ +|-d d ] [ +|-D D ] [ +|-e s ] 
[ +|-f [cfgGn] ] [ -F [f] ] [ -g [s] ] [ -i [i] ] [ -k k ] [ +|-L [l] ]
[ +|-m m ] [ +|-M ] [ -o [o] ] [ -p s ] [ +|-r [t[m<fmt>]] ] [ -s [p:s] ]
[ -S [t] ] [ -T [t] ] [ -u s ] [ +|-w ] [ -x [fl] ] [ -z [z] ] [ -Z [Z] ]
[ -- ] [names]

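Description

Lists information about files opened by processes.

Options

With no options, lsof lists all open files belonging to all active processes.
-U selects UNIX domain socket files, -N selects NFS files, and -u selects files by owning user.

-? -h: Displays help.

-a: ANDs the selection options together.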

-A A:

This option is available on systems configured for AFS whose AFS kernel code is implemented via dynamic modules. It allows the lsof user to specify A as an alternate name list file where the kernel addresses of the dynamic modules might be found. See the lsof FAQ (the FAQ section gives its location) for more information about dynamic modules, their symbols, and how they affect lsof.

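-b: Avoids kernel functions that might block: lstat(2), readlink(2), and stat(2).
See the BLOCKS AND TIMEOUTS and AVOIDING KERNEL BLOCKS sections.

-c c: Selects the files of processes executing a command whose name begins with the string c. Multiple -c options may be specified; they are ORed together into a single set before participating in AND option selection.
If c begins with '^', the selection is negated: processes whose command names begin with the following characters are excluded.
If c begins and ends with a slash ('/'), the characters between the slashes are interpreted as a regular expression. Shell meta-characters in the regular expression must be escaped. The closing slash may be followed by these modifiers: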

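  • b: basic regular expression
  • i: ignore case
  • x: extended regular expression (default)

Simple command specifications are tested first, and the regular expressions are applied only if that test fails. As a result, "no command found for regex:" messages may appear when the -V option is specified.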

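+c w: Sets the maximum number of initial characters of the command name, as supplied by the UNIX dialect, to print in the COMMAND column. The default is 9. Many dialects cap the number of characters they supply; on Linux the limit is 15. If w is 0, all characters supplied by the dialect are printed; if w is less than the width of the column title "COMMAND" (7), it is raised to that width.

-C: Disables the reporting of path name components from the kernel's name cache. See the KERNEL NAME CACHE section.

+d s: Searches for all open instances of directory s and of the files and directories at its top level, without descending the directory tree. The +D D option requests a full-descent search. By default +d does not follow symbolic links within s; specify -x or -x l to enable that. Nor does it search mount points on subdirectories of s by default; specify -x or -x f to enable that.
Note: the search is limited to files that the user has permission to examine with the system stat(2) function.

-d s: Specifies a list of file descriptors to exclude from or include in the output. The descriptors are given as a comma-separated set, e.g. "cwd,1,3" or "^6,^2", with no embedded spaces.
If every entry in the set begins with '^', the list is an exclusion list. If no entry begins with '^', it is an inclusion list. Mixed lists are not permitted.
A range of file descriptor numbers may be given, e.g. "0-7" or "3-10". A range prefixed with '^', e.g. "^0-7", is an exclusion range: all descriptors 0 through 7 are excluded.
The file descriptor numbers are ORed together into a single set before participating in AND option selection.

+D D: Searches for all open instances of directory D and of all the files and directories it contains, descending to its full depth. Symbolic links and mount points are not followed by default; see -x, -x l, and -x f.
Note: the search is limited to files that the user has permission to examine with the system stat(2) function.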

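Note: this option can be slow and can require a large amount of dynamic memory, because it must descend the entire directory tree rooted at D, call stat(2) for each file and directory, build a list of all the files, and search that list for a match with every open file. Use it with care!

-D D: Directs lsof's use of the device cache file. See the DEVICE CACHE FILE section.
The option must be followed by a function letter, which may in turn be followed by a path name. The following function letters are supported:

  • ? - report the device cache file paths
  • b - build the device cache file
  • i - ignore the device cache file
  • r - read the device cache file
  • u - read and update the device cache file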

The b, r, and u functions, accompanied by a path name, are sometimes restricted. When these functions are restricted, they will not appear in the description of the -D option that accompanies -h or -? option output. See the DEVICE CACHE FILE section and the sections that follow it for more information on these functions and when they're restricted.

The ? function reports the read-only and write paths that lsof can use for the device cache file, the names of any environment variables whose values lsof will examine when forming the device cache file path, and the format for the personal device cache file path. (Escape the '?' character as your shell requires.)

When available, the b, r, and u functions may be followed by the device cache file's path. The standard default is .lsof_hostname in the home directory of the real user ID that executes lsof, but this could have been changed when lsof was configured and compiled. (The output of the -h and -? options show the current default prefix, e.g. ".lsof".) The suffix, hostname, is the first component of the host's name returned by gethostname(2).

When available, the b function directs lsof to build a new device cache file at the default or specified path.

The i function directs lsof to ignore the default device cache file and obtain its information about devices via direct calls to the kernel.

The r function directs lsof to read the device cache at the default or specified path, but prevents it from creating a new device cache file when none exists or the existing one is improperly structured. The r function, when specified without a path name, prevents lsof from updating an incorrect or outdated device cache file, or creating a new one in its place. The r function is always available when it is specified without a path name argument; it may be restricted by the permissions of the lsof process.

When available, the u function directs lsof to read the device cache file at the default or specified path, if possible, and to rebuild it, if necessary. This is the default device cache file function when no -D option has been specified.

+|-e s: Exempts the file system whose path name is s from kernel function calls that might block. +e exempts stat(2), lstat(2), and most readlink(2) calls; -e exempts only stat(2) and lstat(2). Multiple file systems may be specified with separate +|-e options, and each may have readlink(2) calls exempted or not.
This option is currently implemented only for Linux.

CAUTION: this option can easily be mis-applied to other than the file system of interest, because it uses path name rather than the more reliable device and inode numbers. (Device and inode numbers are acquired via the potentially blocking stat(2) kernel call and are thus not available, but see the +|-m m option as a possible alternative way to supply device numbers.)

Use this option with great care and fully specify the path name of the file system to be exempted. Consider, for example, that specifying "-e /" would exempt all file systems, since all their paths begin with a '/'.

When open files on exempted file systems are reported, it is not possible to obtain all their information. Therefore, some information columns will be blank, the characters "UNKN" preface the values in the TYPE column, and the applicable exemption option is added in parentheses to the end of the NAME column. Some device number information might be made available via the +|-m m option.

+|-f [cfgGn]:

f by itself clarifies how path name arguments are to be interpreted. When followed by c, f, g, G, or n in any combination it specifies that the listing of kernel file structure information is to be enabled ('+') or inhibited ('-').

Normally a path name argument is taken to be a file system name if it matches a mounted-on directory name reported by mount(8), or if it represents a block device, named in the mount output and associated with a mounted directory name. When +f is specified, all path name arguments will be taken to be file system names, and lsof will complain if any are not. This can be useful, for example, when the file system name (mounted-on device) isn't a block device. This happens for some CD-ROM file systems.

When -f is specified by itself, all path name arguments will be taken to be simple files. Thus, for example, the "-f -- /" arguments direct lsof to search for open files with a '/' path name, not all open files in the '/' (root) file system.

Be careful to make sure +f and -f are properly terminated and aren't followed by a character (e.g., of the file or file system name) that might be taken as a parameter. For example, use "--" after +f and -f as in these examples.

     $ lsof +f -- /file/system/name
     $ lsof -f -- /file/name

The listing of information from kernel file structures,

Linux Commands - tcpdump

tcpdump [ -AdDefIKlLnNOpqRStuUvxX ] [ -B buffer_size ] [ -c count ]
[ -C file_size ] [ -G rotate_seconds ] [ -F file ]
[ -i interface ] [ -m module ] [ -M secret ]
[ -r file ] [ -s snaplen ] [ -T type ] [ -w file ]
[ -W filecount ]
[ -E spi@ipaddr algo:secret,... ]
[ -y datalinktype ] [ -z postrotate-command ] [ -Z user ]
[ expression ]

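Description

The -w flag saves the packet data to the given file, and the -r flag reads packet data from the given file.
When an expression is given, only packets that match it are processed.

With -c, tcpdump stops after receiving the given number of packets. Without -c, it keeps processing packets until it receives a SIGINT or SIGTERM signal.

When it finishes, tcpdump reports counts of:

  • packets captured
  • packets received by the filter
  • packets dropped by the kernel (because of a lack of buffer space)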

SIGINFO…

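Reading packets from a network interface may require special privileges; see the pcap (3PCAP) man page.

Options

-A: Prints each packet (minus its link-level header) in ASCII. Handy for capturing web pages.

-B: Sets the operating system capture buffer size to buffer_size.

-c: Exits after receiving the given number of packets.

-C: Before writing a raw packet to a savefile, checks whether the file is larger than file_size and, if so, closes the current file and opens a new one. The new file names consist of the name given with -w followed by a number, starting at 1. The unit of file_size is 1,000,000 bytes (not 1,048,576 bytes).
Note: when used with the -Z option (enabled by default), privileges are dropped before the first savefile is opened.

-d: Dumps the compiled packet-matching code in a human-readable form to standard output and stops.

-dd: Dumps the packet-matching code as a C program fragment.

-ddd: Dump packet-matching code as decimal numbers (preceded with a count).

-D: Prints the list of network interfaces on which tcpdump can capture packets. For each interface, a number and an interface name are printed, possibly followed by a description of the interface. The interface name or the number can be passed to the -i flag to select the capture interface.
This flag may not be supported if tcpdump was built against an older version of libpcap that lacks the pcap_findalldevs() function.

-e: Prints the link-level header.

-E: Uses spi@ipaddr algo:secret to decrypt IPsec ESP packets that are addressed to addr and contain the Security Parameter Index value spi.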

This combination may be repeated with comma or newline separation.
Note that setting the secret for IPv4 ESP packets is supported at this time.
Algorithms may be des-cbc, 3des-cbc, blowfish-cbc, rc3-cbc, cast128-cbc, or none. The default is des-cbc. The ability to decrypt packets is only present if tcpdump was compiled with cryptography enabled.
secret is the ASCII text for the ESP secret key. If preceded by 0x, then a hex value will be read.
The option assumes RFC2406 ESP, not RFC1827 ESP. The option is only for debugging purposes, and the use of this option with a true 'secret' key is discouraged. By presenting the IPsec secret key on the command line you make it visible to others, via ps(1) and other occasions.
In addition to the above syntax, the syntax file name may be used to have tcpdump read the provided file in. The file is opened upon receiving the first ESP packet, so any special permissions that tcpdump may have been given should already have been given up.

-f:

Print 'foreign' IPv4 addresses numerically rather than symbolically (this option is intended to get around serious brain damage in Sun's NIS server, which usually hangs forever translating non-local internet numbers).

The test for 'foreign' IPv4 addresses is done using the IPv4 address and netmask of the interface on which capture is being done. If that address or netmask are not available, either because the interface on which capture is being done has no address or netmask or because the capture is being done on the Linux "any" interface, which can capture on more than one interface, this option will not work correctly.

-F: Uses file as the input for the filter expression. Any expression given on the command line is ignored.

-G:

If specified, rotates the dump file specified with the -w option every rotate_seconds seconds. Savefiles will have the name specified by -w, which should include a time format as defined by strftime(3). If no time format is specified, each new file will overwrite the previous.

If used in conjunction with the -C option, filenames will take the form of "file<count>".

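-i: Listens on the given interface. If unspecified, tcpdump searches for the lowest-numbered interface (excluding loopback).

Ties are broken by choosing the earliest match.

On Linux systems with 2.2 or later kernels, an interface argument of "any" can be used to capture packets from all interfaces. Note that captures on the "any" device are not done in promiscuous mode.
If the -D flag is supported, an interface number as printed by that flag can be used as the interface argument.

-I: Puts the interface in monitor mode. This is supported only on IEEE 802.11 Wi-Fi interfaces, and only on some operating systems.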

Note that in monitor mode the adapter might disassociate from the network with which it's associated, so that you will not be able to use any wireless networks with that adapter. This could prevent accessing files on a network server, or resolving host names or network addresses, if you are capturing in monitor mode and are not connected to another network with another adapter.

This flag will affect the output of the -L flag. If -I isn't specified, only those link-layer types available when not in monitor mode will be shown; if -I is specified, only those link-layer types available when in monitor mode will be shown.

-K: Does not attempt to verify IP, TCP, or UDP checksums. This is useful for interfaces that compute checksums in hardware (verification would otherwise fail).

-l: Makes stdout line buffered. Useful if you want to see the data while capturing it, e.g. "tcpdump -l | tee dat" or "tcpdump -l > dat & tail -f dat".

-L: List the known data link types for the interface, in the specified mode, and exit. The list of known data link types may be dependent on the specified mode; for example, on some platforms, a Wi-Fi interface might support one set of data link types when not in monitor mode (for example, it might support only fake Ethernet headers, or might support 802.11 headers but not support 802.11 headers with radio information) and another set of data link types when in monitor mode (for example, it might support 802.11 headers, or 802.11 headers with radio information, only in monitor mode).

-m: Load SMI MIB module definitions from file module. This option can be used several times to load several MIB modules into tcpdump.

-M: Use secret as a shared secret for validating the digests found in TCP segments with the TCP-MD5 option (RFC 2385), if present.

-n: Don't convert host addresses to names. This can be used to avoid DNS lookups.

-nn: Don't convert protocol and port numbers etc. to names either.

-N: Don't print domain name qualification of host names. E.g., if you give this flag then tcpdump will print "nic" instead of "nic.ddn.mil".

-O: Do not run the packet-matching code optimizer. This is useful only if you suspect a bug in the optimizer.

-p: Don't put the interface into promiscuous mode. Note that the interface might be in promiscuous mode for some other reason; hence, '-p' cannot be used as an abbreviation for 'ether host {local-hw-addr} or ether broadcast'.