Installing and Using Bitextor on CentOS from Scratch

Here I use the lowest-spec server offered by Tencent Cloud. It is a fresh CentOS 7.3 64-bit system with nothing installed beyond a Python environment.

According to the official documentation, Bitextor's requirements are as follows:

Autotools are necessary for building and installing the project. Tools javac and jar are needed for building Java dependencies, and the virtual machine of Java is needed for running them. In addition, a C++ compiler is required for compiling. Most of the scripts in bitextor are written in Python. Because of this, it is necessary to also install Python 2. All these tools are available in most Unix-based operating systems repositories.

Install the development tools

yum groupinstall "Development Tools"

Set up the Java environment

Install Java

yum install java-1.8.0-openjdk

Install javac (the JDK development package, which also provides jar)

yum install java-devel

Install the Python dependencies

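First install pip, the Python package manager. On CentOS 7 the python-pip package comes from the EPEL repository, so enable EPEL first if yum cannot find it.

yum install python-pip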

The Python packages that need to be installed are:

  • python-Levenshtein: Python library for computing the Levenshtein edit-distance.
  • langid: Python library for plain text language detection.
  • regex: Python package for regular expressions in Python.
  • NLTK: Python package with natural language processing utilities.
  • numpy: Python package for scientific computing with Python.
  • keras and tensorflow: Python packages for implementing neural networks for deep learning.
  • h5py: Pythonic interface to the HDF5 binary data format.
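  • python-magic: Python interface for the magic library, used to detect files’ format (install from apt or source code in https://github.com/threatstack/libmagic/tree/master/python, not from pip: it has a different interface).

Install them with pip. Note that the command below also installs python-magic from pip, which the list above advises against; if Bitextor later fails in its calls to magic, install the libmagic bindings from the linked repository instead.

pip install python-Levenshtein keras tensorflow h5py langid nltk regex python-magic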

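This may fail with an error saying that the Python.h file is missing. The solution is to install the Python development headers and then rerun the pip command:

yum install python-devel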
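
To confirm that the dependencies actually import once pip has finished, a small check along these lines can help. This is only a sketch: the helper name missing_modules and the PYTHON variable are my own and not part of Bitextor, and the arguments are the import names of the packages listed above (python-Levenshtein imports as Levenshtein, python-magic as magic).

```shell
# Print every Python module from the argument list that fails to import.
# PYTHON selects the interpreter (hypothetical knob, defaults to "python").
missing_modules() {
    for mod in "$@"; do
        "${PYTHON:-python}" -c "import $mod" >/dev/null 2>&1 || printf '%s\n' "$mod"
    done
}

# Import names of the packages from the requirements list.
missing_modules Levenshtein langid regex nltk numpy keras tensorflow h5py magic
```

If nothing is printed, every module imported; any name that is printed still needs to be installed.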

Install httrack

yum install httrack
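
Before moving on to the build, it can be worth checking that the tools named in the requirements are on PATH. The sketch below uses a hypothetical helper, missing_tools, and the list of command names is my reading of the requirements above rather than an official checklist.

```shell
# Print the name of every given command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Java tools, C++ compiler, autotools, and the crawler installed so far.
missing_tools java javac jar g++ make autoconf automake httrack
```

An empty result means everything was found; otherwise install the packages that provide the printed commands before running configure.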

Install Bitextor

Download Bitextor

wget https://sourceforge.net/projects/bitextor/files/bitextor/bitextor-4.1/bitextor-4.1.3.tar.gz/download -O bitextor-4.1.3.tar.gz

Extract the archive

tar zxvf bitextor-4.1.3.tar.gz

Run configure from inside the extracted bitextor-4.1.3 directory

[root@localhost bitextor-4.1.3]# ./configure

Even with all of the prerequisites above installed, configure still fails at this point:

configure: error: You don't have apertium-destxt installed: try to install it or run this script with option --without-apertium.

Download and run the Apertium nightly installer script

wget https://apertium.projectjj.com/rpm/install-nightly.sh -O - | sudo bash

Install the Apertium development packages

yum install apertium-all-devel

Compile and install (rerun ./configure first so that it picks up Apertium)

[root@localhost bitextor-4.1.3]# make
[root@localhost bitextor-4.1.3]# make install

Run Bitextor

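At this point Bitextor is installed, but to harvest parallel text with it you still need a bilingual dictionary for the language pair of the parallel corpus you want to build. Some ready-made dictionaries can be downloaded from https://sourceforge.net/projects/bitextor/files/lexicons/.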

Here I use the English-French dictionary.

wget https://sourceforge.net/projects/bitextor/files/lexicons/en-fr.dic/download -O en-fr.dic

Crawl the given website and extract aligned bilingual text

bitextor -u "https://www.apple.com/" -v en-fr.dic -d ./web -O res.tmx -x -b 1 en fr

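Bitextor can be invoked in several ways; the usage form used here is

bitextor [OPTIONS] -u URL -d DIRECTORY -v VOCABULARY LANG1 LANG2

  • -u URL: the website to crawl and process.
  • -v VOCABULARY: path to the dictionary used by the scripts.
  • -d DIRECTORY: path to the folder where the crawled website is saved.
  • -O FILE: if this option is enabled, Bitextor's output is written to FILE; otherwise it goes to standard output.
  • -x: if this option is enabled, Bitextor's output is formatted as a standard TMX translation memory (this option appends the bitextor-buildTMX script to the end of the pipeline).
  • -b NUM: when this option is enabled, only the first NUM candidates in the RINDEX candidate list are considered when computing the bidirectional document alignment.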
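
If you plan to crawl more than one site, the command above can be wrapped in a small function. The following is only a sketch: crawl_site, the BITEXTOR override, and the way the download folder and output file are named after the host are my own conventions, not anything Bitextor prescribes. The options passed are exactly the ones used above.

```shell
# Crawl one site with the same options as the command above.
# BITEXTOR is a hypothetical override: set BITEXTOR=echo to print the
# argument list instead of starting a crawl.
crawl_site() {
    url=$1; dict=$2; lang1=$3; lang2=$4
    # Keep only the host name: https://www.apple.com/ -> www.apple.com
    host=${url#*://}
    host=${host%%/*}
    "${BITEXTOR:-bitextor}" -u "$url" -v "$dict" -d "./web-$host" \
        -O "$host.tmx" -x -b 1 "$lang1" "$lang2"
}

# Example call (commented out so that sourcing this file starts no crawl):
# crawl_site "https://www.apple.com/" en-fr.dic en fr
```

Naming the folder and the TMX file after the host keeps the results of different crawls from overwriting each other.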

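While the crawl is running, you can run the following command in another terminal, from the same directory, to keep an eye on how ./web grows.

watch -n 5 "du -h -d 1 | sort -shr"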
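
Once the crawl has finished, a quick sanity check on res.tmx is to count how many translation units it contains. The sketch below assumes the output follows the usual TMX layout, where each aligned segment pair sits in a tu element; count_tus is a hypothetical helper name.

```shell
# Count <tu> (translation unit) elements in a TMX file.
# The pattern matches "<tu>" or "<tu ", so "<tuv" elements are not counted.
count_tus() {
    grep -o '<tu[ >]' "$1" | wc -l | tr -d ' '
}

# Report on the file produced by the crawl above, if it exists.
if [ -f res.tmx ]; then
    count_tus res.tmx
fi
```

A count of zero usually means that no document pairs were aligned, in which case the dictionary and the language codes are the first things to check.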