A pandas newbie asking for help again

df1 looks like this:

	id	regions	isp	answers
1	1	广东	电信	xxx.xxx.com. xxx.xxx.xxx.com. 1.1.1.1 中国深圳电信 2.2.2.2 中国深圳电信
2	2	上海	电信	xxx.xxx.com. xxx.xxx.xxx.com. 3.3.3.3 中国上海电信 4.4.4.4 中国上海电信

df2 looks like this:

Content-Type	Content-Length	Connection	Accept-Ranges	Age	ip	status_code
text/plain	15310871	keep-alive	bytes	13	1.1.1.1	200
text/plain	15310871	keep-alive	bytes	0	2.2.2.2	403
text/plain	4668490	keep-alive	bytes	20	3.3.3.3	200
text/plain	15310871	keep-alive	bytes	25	4.4.4.4	200

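I want to merge them into this (the table was too wide to read comfortably, so I dropped some of the middle columns when editing this V2EX post):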

answers	ip	Content-Length	Age	status_code
xxx.xxx.com. xxx.xxx.xxx.com. 1.1.1.1 中国深圳电信 2.2.2.2 中国深圳电信	1.1.1.1 2.2.2.2	15310871 15310871	13 0	200 403
xxx.xxx.com. xxx.xxx.xxx.com. 3.3.3.3 中国上海电信 4.4.4.4 中国上海电信	3.3.3.3 4.4.4.4	4668490 15310871	20 25	200 200

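The merge rule: if a value in df1's answers column contains a value from df2's ip column, that df2 row should be merged into the same output row.

Each value in df1's answers column is currently a string joined with \n newlines, and I'd like the merged columns to be joined with \n as well, e.g. 1.1.1.1\n2.2.2.2, so that when I export to a spreadsheet it looks like what is shown here on V2EX.

I'm not sure whether I've explained the requirement clearly. It feels like a rather perverse requirement; I tried merge for a long time and couldn't get it to work. Begging the experts here to take a look.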

16 replies • 2023-07-12 16:24:52 +08:00

Codelike

2023-06-28 20:42:06 +08:00

Use a regex to extract the IPs from answers, look up all the matching IPs in df2, join the values by hand, and build one row per answers value.
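The suggestion above could be sketched roughly as follows. This is only an illustration under a few assumptions: the frames are trimmed copies of the example data from the question, each ip appears at most once in df2, and the names `IP_PATTERN`, `records_by_ip`, `build_row`, and `manual_result` are made up for this sketch.

```python
import re

import pandas as pd

# Trimmed copies of the frames from the question (assumed shape).
df1 = pd.DataFrame({
    "id": [1, 2],
    "answers": [
        "xxx.xxx.com.\n1.1.1.1 中国深圳电信\n2.2.2.2 中国深圳电信\n",
        "xxx.xxx.com.\n3.3.3.3 中国上海电信\n4.4.4.4 中国上海电信\n",
    ],
})
df2 = pd.DataFrame({
    "ip": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
    "Age": [13, 0, 20, 25],
    "status_code": [200, 403, 200, 200],
})

# Loose IPv4 pattern: four dot-separated groups of 1 to 3 digits.
IP_PATTERN = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

# Map each ip to its df2 row so a lookup is a plain dict access.
# Assumes ip values are unique in df2.
records_by_ip = df2.set_index("ip").to_dict("index")


def build_row(answers: str) -> dict:
    """Collect the df2 values for every known ip found in one answers string."""
    ips = [ip for ip in IP_PATTERN.findall(answers) if ip in records_by_ip]
    row = {"answers": answers, "ip": "\n".join(ips)}
    for column in ("Age", "status_code"):
        row[column] = "\n".join(str(records_by_ip[ip][column]) for ip in ips)
    return row


manual_result = pd.DataFrame([build_row(a) for a in df1["answers"]])
print(manual_result)
```

This walks df1 row by row in plain Python, so it is easy to follow but is not a vectorized solution.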

Qusic

2023-06-28 20:54:41 +08:00

It would be better to split the rows apart with explode first, do the join, and then merge them back together as needed.
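A minimal sketch of the explode-then-join idea, assuming trimmed copies of the frames from the question. The regex, the helper `join_lines`, and the variable names are illustrative choices, not something given in the thread.

```python
import pandas as pd

# Trimmed copies of the frames from the question (assumed shape).
df1 = pd.DataFrame({
    "id": [1, 2],
    "answers": [
        "xxx.xxx.com.\n1.1.1.1 中国深圳电信\n2.2.2.2 中国深圳电信\n",
        "xxx.xxx.com.\n3.3.3.3 中国上海电信\n4.4.4.4 中国上海电信\n",
    ],
})
df2 = pd.DataFrame({
    "ip": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
    "Age": [13, 0, 20, 25],
    "status_code": [200, 403, 200, 200],
})

# 1. Put the list of IPs found in each answers value into a new column,
#    then explode it so every IP gets its own row.
exploded = (
    df1.assign(ip=df1["answers"].str.findall(r"\d{1,3}(?:\.\d{1,3}){3}"))
    .explode("ip")
)

# 2. With one IP per row, an ordinary equality join works.
joined = exploded.merge(df2, on="ip", how="inner")


def join_lines(values: pd.Series) -> str:
    """Collapse a group's values back into one newline-joined string."""
    return "\n".join(values.astype(str))


# 3. Merge the rows back together per original df1 row.
explode_result = (
    joined.groupby("id", sort=False)[["ip", "Age", "status_code"]]
    .agg(join_lines)
    .reset_index()
    .merge(df1, on="id")
)
print(explode_result)
```

Grouping by `id` rather than by the long `answers` string keeps the group key small; `answers` is attached again by the final merge.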

512357301

2023-06-28 22:30:42 +08:00 via Android

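I haven't used pandas, but a horizontal merge is much like a SQL join: it also relates rows through one or more key columns.
For your case I think you need a derived ("virtual") column: convert the contents of the answers column into the same format as the ip column. You should be able to find out how to do this step with GPT. After that it's a pandas horizontal merge, which GPT can handle too.
Don't expect to get both steps done in one go. You can also find these answers through Baidu or Google; you don't need to rely on GPT.
The first step should use Python's regex library, and the second step is pandas' merge.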

Rommy

2023-06-29 00:25:02 +08:00

import re
import pandas as pd

# Print full frames without truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 100)

df1 = pd.DataFrame({'id': [1, 2],
                    'regions': ['广东', '上海'],
                    'isp': ['电信', '电信'],
                    'answers': ['xxx.xxx.com.\nxxx.xxx.xxx.com.\n1.1.1.1 中国深圳电信\n2.2.2.2 中国深圳电信\n',
                                'xxx.xxx.com.\nxxx.xxx.xxx.com.\n3.3.3.3 中国上海电信\n4.4.4.4 中国上海电信\n']})

df2 = pd.DataFrame({'Age': [13, 0, 20, 25],
                    'ip': ['1.1.1.1',
                           '2.2.2.2',
                           '3.3.3.3',
                           '4.4.4.4'],
                    'status_code': [200, 403, 200, 200]})
# Convert every df2 column to str so the values can be joined with '\n' later
for column in df2.columns:
    df2[column] = df2[column].apply(str)

def ip_extract(input_string):
    # Return all IPv4-looking substrings, joined with '\n'
    ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    ip_addresses = re.findall(ip_pattern, input_string)
    return '\n'.join(ip_addresses)

df1_new = df1.copy()
df1_new['ip'] = df1_new['answers'].apply(ip_extract)

# One row per IP, still indexed by the original df1 row
df_ip = df1_new['ip'].str.split('\n', expand=True).stack().reset_index(level=1, drop=True).to_frame(name='ip')

# Attach the per-IP rows to df1, then pull in the df2 columns by ip
df_merge = df1_new.drop(['ip'], axis=1).join(df_ip).merge(df2, on=['ip'])

def concat_func(x):
    # Join each df2 column of one answers group back into a '\n'-separated string
    return pd.Series({column: '\n'.join(x[column]) for column in df2.columns})

df_group = df_merge.groupby(['answers']).apply(concat_func).reset_index()

df = df1.merge(df_group, on=['answers'])
print(df)

luozic

2023-06-29 02:53:37 +08:00

https://stackoverflow.com/questions/39291499/how-to-concatenate-multiple-column-values-into-a-single-column-in-pandas-datafra
Do it in two steps: first, attach the answers value to each row of the second DataFrame using a regex match; second, group by that answers value.
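One way to read these two steps is sketched below. It is an interpretation, not necessarily what the linked Stack Overflow answer does: it pairs every df2 row with every answers value via a cross join (`how="cross"`, which needs pandas 1.2 or later), keeps the pairs where the regex matches, and then groups by answers. The helper `contains_ip` and the trimmed frames are assumptions made for this sketch.

```python
import re

import pandas as pd

# Trimmed copies of the frames from the question (assumed shape).
df1 = pd.DataFrame({
    "id": [1, 2],
    "answers": [
        "xxx.xxx.com.\n1.1.1.1 中国深圳电信\n2.2.2.2 中国深圳电信\n",
        "xxx.xxx.com.\n3.3.3.3 中国上海电信\n4.4.4.4 中国上海电信\n",
    ],
})
df2 = pd.DataFrame({
    "ip": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
    "Age": [13, 0, 20, 25],
    "status_code": [200, 403, 200, 200],
})


def contains_ip(answers: str, ip: str) -> bool:
    """True if ip occurs in answers as a whole address.

    The lookarounds stop 1.1.1.1 from matching inside 11.1.1.10.
    """
    pattern = rf"(?<![\d.]){re.escape(ip)}(?![\d.])"
    return re.search(pattern, answers) is not None


# Step 1: attach the matching answers value to every df2 row.
pairs = df2.merge(df1[["answers"]], how="cross")
mask = [contains_ip(a, ip) for a, ip in zip(pairs["answers"], pairs["ip"])]
matched = pairs[mask]

# Step 2: group by answers and join each column with newlines.
grouped_result = (
    matched.groupby("answers", sort=False)[["ip", "Age", "status_code"]]
    .agg(lambda values: "\n".join(values.astype(str)))
    .reset_index()
)
print(grouped_result)
```

The cross join builds len(df1) × len(df2) candidate pairs, so this literal "contains" check is only reasonable for small inputs.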

ohayoo

2023-06-29 07:54:58 +08:00

@Rommy Say no more, I'm kowtowing to you from this end of the network cable. Thank you so much!

xkoma001

2023-06-29 12:29:16 +08:00

@ohayoo Hahaha

cy1027

2023-06-29 14:54:24 +08:00

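@ohayoo If you had said you could solve this yourself but felt your solution wasn't elegant enough, and came here to ask for the most general or a more convenient approach, I could understand. But not having even the worst possible solution to a problem like this is just strange. And why would anyone enjoy doing other people's homework? Is he GPT?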

ohayoo

2023-06-29 14:59:47 +08:00 via Android

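@cy1027 Yep, I'm exactly the kind of idiot who likes being spoon-fed, who doesn't think or try anything when a requirement comes up and just likes to play messenger and ask for answers. Is that OK with you?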

Rommy

2023-06-29 20:07:34 +08:00

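@cy1027 Having to second-guess yourself and worry about elegance before you even ask a question: is that the right attitude for learning? Also, I simply hadn't used pandas in a long time, found this interesting, tried writing it, and shared the result. If you're saying my code is poorly written or too rough, I accept that, but I don't think there is anything to criticize in what I did. Finally, GPT is built on the text of conversations between people, and behind that text are countless very basic questions and answers. What is there to be arrogant about?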

wxf666

2023-06-30 00:00:15 +08:00

I tried writing a version that is a bit easier to follow:

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2],
    'isp': ['电信', '电信'],
    'regions': ['广东', '上海'],
    'answers': [
        'xxx.xxx.com.\nxxx.xxx.xxx.com.\n1.1.1.1 中国深圳电信\n2.2.2.2 中国深圳电信\n',
        'xxx.xxx.com.\nxxx.xxx.xxx.com.\n3.3.3.3 中国上海电信\n4.4.4.4 中国上海电信\n',
    ],
})

df2 = pd.DataFrame({
    'Age': [13, 0, 20, 25],
    'ip': [
        '1.1.1.1',
        '2.2.2.2',
        '3.3.3.3',
        '4.4.4.4',
    ],
    'status_code': [200, 403, 200, 200],
})

# Take the first whitespace-delimited token of every line in answers
# (host names are captured too; they drop out in the merge below),
# keeping the df1 id each token came from.
df_ip = (
    df1
    .set_index('id')['answers'].str
    .extractall(r'^(?P<ip>[^\s]+)', flags=re.M)
    .reset_index(level='id')
    .set_index('ip')
)

df_result = (
    df2
    # Look up the df1 id for each df2 ip
    .merge(df_ip, how='left', on='ip')
    # Collapse the df2 rows of each id into newline-joined strings
    .groupby('id')
    .agg({
        'ip': '\n'.join,
        'Age': lambda s: '\n'.join(s.astype('string')),
        'status_code': lambda s: '\n'.join(s.astype('string')),
    })
    # Bring back answers and keep only the wanted columns
    .merge(df1, how='left', on='id')[[
        'answers',
        'ip',
        'Age',
        'status_code',
    ]]
)
```

cy1027

2023-06-30 13:13:33 +08:00

@Rommy I can't understand you, and I don't expect you to understand me.

Drahcir

2023-06-30 21:21:18 +08:00

There doesn't seem to be a solution using native pandas methods; you still have to iterate row by row, and that won't be efficient.

ohayoo

2023-07-03 15:50:55 +08:00

@wxf666 Thank you very much, I'll study this approach.

wxf666

2023-07-03 15:58:28 +08:00

@Drahcir #13 I don't use pandas much. Could you evaluate the efficiency of the solution in reply #11?
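For anyone who wants to measure this themselves, a rough timing harness might look like the sketch below. Everything in it is an assumption made for illustration: the synthetic data generator `make_frames`, the choice of 500 rows, and the two functions, where `vectorized` follows the extractall-and-merge idea from reply #11 and `row_by_row` is one naive way of iterating over rows. Timings depend on the machine and on the data, so no figures are given here.

```python
import re
import time

import pandas as pd


def make_frames(n_rows: int):
    """Build synthetic df1/df2 shaped like the example (hypothetical data)."""
    ips = [f"10.0.{i // 256}.{i % 256}" for i in range(2 * n_rows)]
    df1 = pd.DataFrame({
        "id": range(n_rows),
        "answers": [
            f"xxx.xxx.com.\n{ips[2 * i]} a\n{ips[2 * i + 1]} b\n"
            for i in range(n_rows)
        ],
    })
    df2 = pd.DataFrame({"ip": ips, "Age": range(2 * n_rows), "status_code": 200})
    return df1, df2


def vectorized(df1, df2):
    """extractall + merge + groupby, in the spirit of reply #11."""
    df_ip = (
        df1.set_index("id")["answers"].str
        .extractall(r"^(?P<ip>\S+)", flags=re.M)
        .reset_index(level="id")
    )
    merged = df2.merge(df_ip, on="ip")
    return merged.groupby("id")[["ip", "Age", "status_code"]].agg(
        lambda s: "\n".join(s.astype(str))
    )


def row_by_row(df1, df2):
    """Naive loop: filter df2 once for every df1 row."""
    rows = {}
    for id_, answers in zip(df1["id"], df1["answers"]):
        tokens = [line.split()[0] for line in answers.splitlines() if line.strip()]
        hits = df2[df2["ip"].isin(tokens)]
        rows[id_] = {
            c: "\n".join(hits[c].astype(str)) for c in ("ip", "Age", "status_code")
        }
    return pd.DataFrame.from_dict(rows, orient="index")


def time_call(func, *args):
    """Return the function result and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


df1_big, df2_big = make_frames(500)
fast, fast_seconds = time_call(vectorized, df1_big, df2_big)
slow, slow_seconds = time_call(row_by_row, df1_big, df2_big)
print(f"vectorized: {fast_seconds:.4f}s, row by row: {slow_seconds:.4f}s")
```

A single `perf_counter` measurement is noisy; repeating the calls with `timeit` and trying several sizes would give a fairer picture.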

ohayoo

2023-07-12 16:24:52 +08:00

@wxf666 Sorry, my bad, I forgot to reply. I tested it afterwards and your approach is somewhat more efficient.